Java Interface for HDFS File I/O:

This post describes the Java interface for the Hadoop Distributed File System (HDFS). It is recommended to go through this post after gaining basic knowledge of the Java Basic Input and Output, Java Binary Input and Output and Java File Input and Output concepts.

To explore the Hadoop Distributed File System through the Java interface, we need to know a few important classes that provide I/O operations on Hadoop files.

FileSystem – org.apache.hadoop.fs – an abstract file system API.

IOUtils – org.apache.hadoop.io – generic I/O code for reading and writing data to HDFS.

IOUtils:

It is a utility class (a handy tool) for I/O-related functionality on HDFS. It is present in the org.apache.hadoop.io package.

Below are some of its important methods, which we use very frequently in HDFS file I/O operations. All of these are static methods.

copyBytes:

IOUtils.copyBytes(InputStream in, OutputStream out, int buffSize, boolean close);

This method copies data from one stream to another. The last two arguments specify the buffer size used for copying and whether to close the streams when the copy is complete.
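As a minimal sketch, the snippet below copies standard input to standard output with a 4 KB buffer; the final argument asks copyBytes to close both streams once the copy completes.

import org.apache.hadoop.io.IOUtils;

public class CopyBytesExample {
    public static void main(String[] args) throws Exception {
        // copy stdin to stdout; true = close both streams when done
        IOUtils.copyBytes(System.in, System.out, 4096, true);
    }
}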

readFully:

IOUtils.readFully(InputStream in, byte[] buf, int off, int len);

This method reads len bytes from the stream into the byte array buf, starting at offset off in the buffer.

skipFully:

IOUtils.skipFully(InputStream in, long len);

Similar to readFully, but it skips len bytes in the stream instead of reading them.
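A small sketch of both methods: skip a hypothetical 8-byte header in a local file, then read the next 16 bytes into a buffer. The file name sample.bin is an assumption for illustration.

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.io.IOUtils;

public class ReadSkipExample {
    public static void main(String[] args) throws Exception {
        byte[] buf = new byte[16];
        try (InputStream in = new FileInputStream("sample.bin")) { // hypothetical file
            IOUtils.skipFully(in, 8);          // skip an 8-byte header
            IOUtils.readFully(in, buf, 0, 16); // fill buf[0..15] with the next 16 bytes
        }
    }
}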

writeFully: 

IOUtils.writeFully(FileChannel fc, ByteBuffer buf, long offset);

This method writes a ByteBuffer to a FileChannel at a given offset, handling short writes.

IOUtils.writeFully(WritableByteChannel bc, ByteBuffer buf);

This method writes a ByteBuffer to a WritableByteChannel, handling short writes.

closeStream:

IOUtils.closeStream(Closeable stream);

This method closes an input or output stream, ignoring any IOException raised while closing. It is generally placed in the finally clause of a try-catch block.
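The typical pattern looks like the sketch below: copy one stream to another with copyBytes, then close both streams in the finally clause with closeStream. The file names are hypothetical.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.io.IOUtils;

public class CloseStreamExample {
    public static void main(String[] args) throws Exception {
        InputStream in = null;
        OutputStream out = null;
        try {
            in = new FileInputStream("input.txt");    // hypothetical input file
            out = new FileOutputStream("output.txt"); // hypothetical output file
            IOUtils.copyBytes(in, out, 4096, false);  // false: we close explicitly below
        } finally {
            // closed irrespective of any IOException above
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}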

Configuration:

This class provides access to configuration parameters on a client or server machine. It is present in the org.apache.hadoop.conf package.

Configurations are specified by resources. A resource contains a set of name/value pairs (properties) as XML data. Each resource is named by a String.

By default, Hadoop loads configuration parameters from two files:

1. core-default.xml – default configuration properties
2. conf/core-site.xml – site-specific configuration properties

For example, a property is defined in these two files as shown below.

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>

Applications can add additional properties to a Configuration object.

Below are some of the useful methods on a Configuration object; a short usage sketch follows the list.

Configuration conf = new Configuration() — creates a Configuration object loaded with the default configuration parameters.

Addition of Resources:
conf.addResource(String name)  — adds a resource called ‘name’

Getting Property Values:
conf.get(String name) — gets the value of the property ‘name’.
conf.getBoolean(String name, boolean defaultValue) — gets the value of property ‘name’ as a boolean.
conf.getClass(String name, Class<?> defaultValue) — gets the value of property ‘name’ as a Class.
conf.getDouble(String name, double defaultValue)
conf.getFloat(String name, float defaultValue)
conf.getInt(String name, int defaultValue)
conf.getStrings(String name) — gets the ‘,’-delimited values of the ‘name’ property as an array of strings.

Setting Property Values:
conf.set(String name, String value) — Set the value of the name property.
conf.setBoolean(String name, boolean value) — set the ‘name’ property to a boolean ‘value’.
conf.setClass(String name, Class<?> theClass, Class<?> intface) — sets the value of the ‘name’ property to the name of theClass, which must implement the given interface ‘intface’.
conf.setDouble(String name, double value)
conf.setEnum(String name, T value)
conf.setFloat(String name, float value)
conf.setInt(String name, int value)
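
A rough usage sketch of these methods (all property names other than fs.defaultFS are hypothetical examples):

import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();       // loads core-default.xml and core-site.xml
        conf.addResource("my-site.xml");                // hypothetical extra resource on the classpath

        String fsUri = conf.get("fs.defaultFS");        // read a property
        int retries = conf.getInt("my.app.retries", 3); // read with a default value

        conf.set("my.app.owner", "siva");               // set properties programmatically
        conf.setBoolean("my.app.debug", true);

        System.out.println(fsUri + ", retries = " + retries);
    }
}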

Path:

The name of a file or directory on a file system is represented by a Path object. This class is present in the org.apache.hadoop.fs package.

Path strings use slash (‘/’) as the directory separator. A path string is absolute if it begins with a slash.

Below are some of the important methods, which we use very frequently in coding. All of these apply to a Path object.

Path p1 = new Path("hdfs://localhost/usr/sample.txt");
p1.getFileSystem(Configuration conf) — returns the FileSystem that owns Path p1.
String filename = p1.getName() — returns the final element in the path.
p1.toString() — returns the path as a string.
p1.toUri() — converts the path to a URI.

We can think of a Path as a Hadoop file system URI, such as hdfs://localhost/user/sample.txt.
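A short sketch of these methods together (the path is the same illustrative example used above):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathExample {
    public static void main(String[] args) throws Exception {
        Path p1 = new Path("hdfs://localhost/usr/sample.txt");

        String filename = p1.getName();                         // "sample.txt"
        URI uri = p1.toUri();                                   // the path as a URI
        FileSystem fs = p1.getFileSystem(new Configuration()); // the owning FileSystem

        System.out.println(filename + " " + uri + " " + fs.getUri());
    }
}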

FileSystem:

FileSystem is an abstract base class for a generic file system. It may be implemented as a distributed file system or a local one. The local implementation is LocalFileSystem and the distributed implementation is DistributedFileSystem.

All these classes are present in org.apache.hadoop.fs package.

All user code that may use HDFS should use a FileSystem object.

Similar to DataInputStream and DataOutputStream in Java file I/O for reading and writing primitive data types, Hadoop has the corresponding stream classes FSDataInputStream and FSDataOutputStream.

FSDataInputStream:
  • FSDataInputStream is a specialization of java.io.DataInputStream with support for random access, so we can read from any part of the stream. It is a utility that wraps an FSInputStream in a DataInputStream and buffers input through a BufferedInputStream.
  • FSDataInputStream implements the Seekable and PositionedReadable interfaces, so we can have random access into the stream with the help of the methods below; a short sketch follows the list.

int read(long position, byte[] buffer, int offset, int length) – Read bytes from the given position in the stream to the given buffer. The return value is the number of bytes actually read.

void readFully(long position, byte[] buffer, int offset, int length) – Read bytes from the given position in the stream to the given buffer. Continues to read until length bytes have been read. If the end of the stream is reached while reading, an EOFException is thrown.

void readFully(long position, byte[] buffer) – Reads buffer.length bytes from the given position in the stream into the buffer.

void seek(long pos) – Seek to the given offset.

long getPos() – Get the current position in the input stream.
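
A sketch of random access with these methods, assuming a hypothetical HDFS file at /user/sample.txt that is at least 116 bytes long. (FileSystem.get() and open(), used here, are described in the following sections.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RandomAccessExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path("/user/sample.txt")); // hypothetical file
            byte[] buf = new byte[16];
            in.readFully(100, buf);          // positioned read: stream position is unchanged
            in.seek(100);                    // explicitly move to offset 100
            System.out.println(in.getPos()); // prints 100
        } finally {
            IOUtils.closeStream(in);
        }
    }
}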

FSDataOutputStream:
  • FSDataOutputStream is the counterpart of FSDataInputStream, used to open a stream for output. It is a utility that wraps an OutputStream in a DataOutputStream, buffers output through a BufferedOutputStream and creates a checksum file.
  • Similar to FSDataInputStream, FSDataOutputStream also supports the getPos() method to find the current position in the output stream, but the seek() method is not supported.

This is because HDFS allows only sequential writes to an open file, or appends to an existing file. In other words, there is no support for writing anywhere other than the end of the file.

  • We can invoke the write() method on an instance of FSDataOutputStream to write to the output stream.

public void write(byte[] b, int off, int len) throws IOException;

Writes len bytes from the specified byte array, starting at offset off, to the underlying output stream. If no exception is thrown, the counter ‘written’ is incremented by len.

Below are some of the important methods from FileSystem class.

Getting FileSystem Instance:

For any file I/O operation in HDFS through the Java API, the first thing we need is a FileSystem instance. To get one, the FileSystem class provides three static factory methods, listed below and sketched after the list.

      • static FileSystem get(Configuration conf) — Returns the configured file system implementation.
      • static FileSystem get(URI uri, Configuration conf) — Returns the FileSystem for this URI.
      • static FileSystem get(URI uri, Configuration conf, String user) — Get a file system instance based on the uri, the passed configuration and the user.
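
A brief sketch of the three variants (the URI and user name are hypothetical examples):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class GetFileSystemExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        FileSystem fs1 = FileSystem.get(conf); // uses fs.defaultFS from the configuration
        FileSystem fs2 = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        FileSystem fs3 = FileSystem.get(URI.create("hdfs://localhost:9000"), conf, "siva");

        System.out.println(fs1.getUri() + " " + fs2.getUri() + " " + fs3.getUri());
    }
}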
Opening Existing File:

In order to read a file from HDFS, we need to open an input stream for it. We can do so by invoking the open() method on a FileSystem instance.

      • public FSDataInputStream open(Path f)
      • public abstract FSDataInputStream open(Path f, int bufferSize)

The first method uses a default buffer size of 4 KB.
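
Putting open() together with IOUtils, the sketch below prints a hypothetical HDFS file to standard output:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path("/user/sample.txt")); // opened with the default 4 KB buffer
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}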

Creating a new File:

There are several ways to create a file in HDFS through the FileSystem class, but one of the simplest is to invoke the create() method, which takes a Path object for the file to be created and returns an output stream to write to.

public FSDataOutputStream create(Path f)

There are overloaded versions of this method that allow you to specify whether to forcibly overwrite existing files, the replication factor of the file, the buffer size to use when writing the file, the block size for the file, and file permissions.

The create() method creates any parent directories of the file to be written that don’t already exist.
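
A sketch of creating a file and writing to it through the returned FSDataOutputStream (the path is a hypothetical example; its parent directories are created if absent):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCreate {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = null;
        try {
            out = fs.create(new Path("/user/output/sample.txt")); // hypothetical path
            byte[] data = "hello hdfs\n".getBytes(StandardCharsets.UTF_8);
            out.write(data, 0, data.length); // write len bytes starting at offset 0
        } finally {
            IOUtils.closeStream(out);
        }
    }
}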

