Apache Hadoop is an open source framework for distributed storage and processing. The distributed storage part of framework is commonly known as Hadoop Distributed Filesystem (HDFS) [1] which is the flagship filesystem for Hadoop. The processing part is known as MapReduce. Hadoop runs on commodity hardware and has the capabilities to store and process very large amounts of data. Being distributed, it abstracts all the inherent network programming problems while providing a clean programming interface.

Architecture

An HDFS cluster consists of two types of nodes which work in a master-slave model. The master node is known as NameNode and slaves which act as workers are known as DataNodes. Basic unit of storage in HDFS is a block with a default value of 128MB. A file in HDFS is broken down to multiple blocks depending on the size and each block is replicated to avoid data loss in case a cluster node fails. The blocks and their replica’s are stored in datanodes while the location information of each block and meta data is stored in namenode. Datanode also responds to namenode for all the filesystem operations and sends information regarding blocks to namenode whenever required. Since namenode stores all the filesystem information, it becomes a single point of failure i.e. if it goes down then the filesystem cannot be recovered. To avoid this suituation, a secondary namenode is used which can be used to recover the filesytem.

Hadoop distributed File system Architecture

Hadoop distributed File system Architecture

Figure above shows the architecture of HDFS. Namenode directs the datanodes to perform different block operations. Secondary namenode, at regular intervals, reads all the filesystem and meta data information from the RAM of namenode and writes it to the harddrive. This snapshot is copied back to namenode and it helps the primary namenode to recover from failure. Datanodes communicate with each other to perform block replications [2, pp. 46-47]. In order to read a file from HDFS, client contacts namenode to get the block information where the file is stored. Client then directly communicates with datanodes (using HDFS API) to read the file [2, pp. 69-71]. For writing a file to filesystem, client communicates with namenode. Namenode selects appropriate datanodes to store the blocks and then client starts writing to first datanode. Once writing process is complete in first datanode, depending on the replication factor, the datanode communicates with other datanodes to replicate the blocks [2, pp. 72-74].

[1] – https://wiki.apache.org/hadoop/HDFS
[2] – T. White. Hadoop: The definitive guide. ” O’Reilly Media, Inc.”, 2012.