Thursday 10 April 2014

A fundamental analysis of Hadoop

What is Hadoop?

In simple terms, Hadoop is a tool for storing and processing large-scale data sets on clusters of commodity hardware. It was originally created by Doug Cutting and Mike Cafarella, with much of the early development done at Yahoo, and it is now an open source project of the Apache Software Foundation.

This post is intended to give you, at best, a very basic overview of the key components of Hadoop. It is by no means an in-depth analysis. I myself have only just opened the door into the world of big data and I'm simply sharing what I have discovered so far.

Components of Hadoop

Hadoop consists of two key components: HDFS (the Hadoop Distributed File System) and MapReduce, both of which run as services on a Hadoop cluster. A Hadoop cluster is a set of machines networked together. Each machine in the cluster is a node; each node runs one or more Hadoop services and is classified as either a master or a slave.

Let's take a closer look at HDFS and MapReduce.

HDFS

The job of HDFS is to split files into blocks and spread those blocks across multiple hosts. The default block size is 64MB, although Cloudera (whose distribution and Cloudera Manager tool are widely used to run Hadoop clusters) recommends raising it to 128MB. Blocks are replicated to hosts throughout the cluster; the default replication factor is 3. Replication increases both reliability and performance: a block survives the loss of all but one of its replicas, and having copies on several hosts offers more opportunity for data locality.
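
If you want to see these settings for yourself, here is a minimal sketch using the Hadoop FileSystem Java API to print the block size and replication factor of an existing file. The path /data/example.txt is just a placeholder for illustration, and the cluster settings are picked up from the usual configuration files on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Placeholder path; substitute a file that exists in your cluster.
            FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
            System.out.println("Block size:  " + status.getBlockSize() + " bytes");
            System.out.println("Replication: " + status.getReplication());
        }
    }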

MapReduce

MapReduce is a programming model, or framework, that is neither platform nor language specific. Hadoop's implementation is written in Java, and its sole purpose is record-oriented data processing that uses key-value pairs to move data back and forth. This facilitates task distribution across multiple hosts, and where possible each host processes data that it has stored locally. MapReduce has two phases: the Map phase, in which input records are read and turned into intermediate key-value pairs, and the Reduce phase, in which those pairs are aggregated and processed into the final output. In between sits a shuffle and sort phase, which simply moves data from the mappers to the reducers. The full internals of MapReduce are beyond the scope of this post; I intend to write a more in-depth post after I have had time to study them.
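
To make the key-value idea concrete, here is a minimal sketch of the classic word-count example written against Hadoop's mapreduce Java API. The class names are my own, and the two classes would normally live in their own files (or be nested inside the driver class). The mapper turns each input line into (word, 1) pairs; after the shuffle and sort, the reducer receives all the counts for a given word and sums them.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit an intermediate (word, 1) key-value pair for every word.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the shuffle and sort delivers all values for one word together.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }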

The HDFS Service Instances

HDFS requires three types of service instance to run: the NameNode (a master), a Secondary NameNode (also a master) and the DataNodes (slaves). Let's take a closer look at each of these components.

Master NameNode

The NameNode stores metadata about the files in HDFS: their ownership and permissions, the names of their individual blocks and the locations of those blocks. The file and block metadata is persisted in a file called "fsimage", which is read from disk when the NameNode daemon starts; this daemon runs on the master host. Block locations, however, are not stored on disk; they are reported by the DataNodes when they start up. Changes to the metadata are made in RAM and are also written to a log file on disk called "edits".

Secondary NameNode

Firstly, this is not a failover node. It performs memory-intensive administrative functions on behalf of the master NameNode. It acts as a housekeeper: it periodically combines the prior file system snapshot (fsimage) and the edit log into a new snapshot, which is then transmitted back to the master, while ongoing metadata changes continue to be written to the edit log. In a large installation it should run on a separate host and have as much RAM as the master.

Data Nodes /Slave Hosts

These store the actual contents of the files, held as the blocks of the original file. Each block is stored under a name of the form blk_xxxx. A slave doesn't need to know which underlying file a block belongs to; that mapping is handled by the NameNode. For greater redundancy the blocks are replicated across multiple slave hosts (the default is three replicas), and the DataNodes regularly report back to the NameNode.
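
As a rough illustration of how a file's blocks are spread over the slave hosts, this sketch (again with /data/example.txt as a placeholder path) asks HDFS which DataNodes hold a replica of each block.

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));

            // One entry per block, listing the DataNodes that hold a replica of it.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + ", " + block.getLength() + " bytes, hosts "
                        + Arrays.toString(block.getHosts()));
            }
        }
    }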

MapReduce service instances

MapReduce has a single master service instance called the JobTracker, whose job is to parcel out work to the slave hosts. A TaskTracker runs on each slave host, and its job is to take care of the map and reduce tasks it has been assigned.
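
Pulling the earlier word-count sketch together, a small driver like the one below packages the mapper and reducer into a job and submits it to the cluster; under classic MapReduce the JobTracker then parcels the map and reduce tasks out to the TaskTrackers. It assumes a Hadoop version that provides Job.getInstance (older releases use the deprecated new Job(conf, name) constructor instead) and the hypothetical WordCountMapper/WordCountReducer classes from the sketch above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            // Wire up the hypothetical mapper and reducer from the earlier sketch.
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input and output paths in HDFS, taken from the command line.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submit to the cluster and block until the job finishes.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would typically package these classes into a jar and launch it with the hadoop jar command, passing the input and output directories as arguments.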

Conclusions

To summarise:

Hadoop is a framework for storing and processing large-scale data on clusters of commodity hardware
Cloudera provides a distribution and a cluster manager (Cloudera Manager) for Hadoop services
Hadoop is made up of HDFS, the Java-based distributed file system with its three service instances (master NameNode, Secondary NameNode and DataNodes), and MapReduce, the framework used for processing the data as key-value pairs.

As I said earlier, this is just a basic overview of Hadoop and I will expand on these components in the near future. For now though, I hope you have gained a basic idea of what Hadoop is. Thanks for reading and feel free to comment or ask me questions in the comment box below.
