Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running in clustered environments.
Hadoop Service Architecture
HDFS (Hadoop Distributed File System) Overview
Hadoop is normally deployed on a group of machines (a cluster)
- Each machine in the cluster is a node
- One of the nodes acts as the master node, called the namenode. This node manages the overall file system
The namenode stores:
- The directory structure
- Metadata for all the files
Other nodes are called datanodes
- The data is physically stored on these nodes
Let’s see how a file is stored in HDFS
First, the file is broken into blocks of 128 MB each (the last block may be smaller)
- This large block size keeps the time spent seeking to the start of a block small compared with the time spent transferring its data
- The blocks are then stored across the datanodes
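To make the block arithmetic concrete, here is a minimal Python sketch. It is not HDFS code: the function names are made up for illustration, and the disk figures used in `seek_overhead` (10 ms seek time, 100 MB/s transfer rate) are assumed round numbers, not measurements.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size quoted above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block needed for a file of file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


def seek_overhead(block_size, seek_seconds=0.010,
                  transfer_bytes_per_second=100 * 1024 * 1024):
    """Fraction of one block read spent seeking, under the assumed disk figures."""
    transfer_seconds = block_size / transfer_bytes_per_second
    return seek_seconds / (seek_seconds + transfer_seconds)


if __name__ == "__main__":
    sizes = split_into_blocks(300 * 1024 * 1024)  # a hypothetical 300 MB file
    print([size // (1024 * 1024) for size in sizes])
    print(seek_overhead(BLOCK_SIZE))  # large block: seeking is a small share
    print(seek_overhead(4 * 1024))    # 4 KB block: seeking dominates
```

A 300 MB file ends up as two full 128 MB blocks plus one 44 MB block. With the assumed disk figures, seeking accounts for under 1% of the time needed to read a 128 MB block, while for a 4 KB block it would account for almost all of it.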
Overall storage picture
Block locations for each file are stored in the namenode
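A file is read using:
- The metadata in the namenode, which says where each block lives.
- The blocks in the datanodes, which hold the actual data.
Each block is replicated on several datanodes, so a read can still succeed if one datanode fails.
- The default replication factor is 3.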
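The relationship between namenode metadata, datanode blocks and replication can be sketched with a toy model. Everything here is hypothetical (`ToyNameNode`, the `dn1`..`dn4` node names, the block id format), and the round-robin placement is a deliberate simplification rather than the placement policy HDFS actually uses.

```python
class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where each replica lives."""

    def __init__(self, datanodes, replication=3):
        if replication > len(datanodes):
            raise ValueError("need at least as many datanodes as replicas")
        self.datanodes = list(datanodes)
        self.replication = replication
        self.block_map = {}  # filename -> list of (block_id, [datanode names])
        self._next = 0       # index of the datanode that gets the next first replica

    def add_file(self, filename, num_blocks):
        """Record where the replicas of each block of a new file are placed."""
        placements = []
        for index in range(num_blocks):
            # Round-robin placement: a simplification, not the real HDFS policy.
            nodes = [self.datanodes[(self._next + r) % len(self.datanodes)]
                     for r in range(self.replication)]
            self._next += 1
            placements.append((f"{filename}#blk{index}", nodes))
        self.block_map[filename] = placements
        return placements

    def locate(self, filename, failed=()):
        """Return one live datanode per block, skipping nodes listed in failed."""
        result = []
        for block_id, nodes in self.block_map[filename]:
            live = [node for node in nodes if node not in failed]
            if not live:
                raise IOError(f"all replicas of {block_id} are unavailable")
            result.append((block_id, live[0]))
        return result


if __name__ == "__main__":
    namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
    namenode.add_file("logs.txt", num_blocks=3)
    print(namenode.locate("logs.txt"))
    print(namenode.locate("logs.txt", failed={"dn1"}))  # dn1 is down
```

The namenode in this sketch never touches file data. A reader asks it where the blocks are and then fetches each block from a datanode, falling back to another replica when a node is marked as failed.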
Features of HDFS
- High Availability
- Fault tolerance
- Data Reliability
- Replication
- Distributed storage
- Scalability
YARN Overview
YARN (Yet Another Resource Negotiator) is used for the management of resources on the Hadoop cluster.
- YARN coordinates all the different MapReduce tasks running on the cluster.
- YARN also monitors for failures and assigns new nodes when others fail
Sample MapReduce workflow
- User defines map and reduce tasks using the MapReduce API
- A job will be triggered on the cluster
- YARN figures out where and how to run the job, and stores the result in HDFS
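The map and reduce steps can be illustrated with a word count run entirely in local Python. This sketch does not use the Hadoop MapReduce API, which is a Java API. The function names are hypothetical, and the `shuffle` step stands in for the grouping by key that the framework performs between the map and reduce phases.

```python
from collections import defaultdict


def map_task(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on a list of input lines."""
    mapped = [pair for line in lines for pair in map_task(line)]
    return dict(reduce_task(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(["the quick fox", "The lazy dog"]))
```

On a real cluster the map tasks would run in parallel on the nodes that hold the input blocks and the result would be written to HDFS. Here everything runs in one process so that the data flow is easy to follow.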
YARN does this using two services:
the ResourceManager and the NodeManager
There is one ResourceManager per Hadoop cluster. The ResourceManager service runs on a single node, usually the same node as the HDFS namenode
- The ResourceManager accepts the jobs that are submitted to YARN and decides which nodes their tasks run on.
- It optimizes for cluster utilization based on constraints such as capacity guarantees and fairness.
A NodeManager service runs on each worker node in the cluster, i.e. on all the datanodes
- The NodeManager launches and monitors all tasks running on that node.
- It coordinates with the ResourceManager in order to perform its tasks.
- It monitors resource usage, manages logs and tracks the health of the node, i.e. everything related to the one node it is in charge of.
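The division of labour between the two services can be sketched as a toy model. The class names, the slot-based notion of capacity and the "pick the node with the most free slots" rule are all assumptions made for illustration. They are not YARN's actual scheduler or API.

```python
class ToyNodeManager:
    """Tracks the tasks and free capacity of a single node."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = slots      # hypothetical capacity: how many tasks fit on the node
        self.tasks = []
        self.healthy = True

    def free_slots(self):
        return self.slots - len(self.tasks)

    def launch(self, task):
        self.tasks.append(task)

    def heartbeat(self):
        """Report this node's status to the resource manager."""
        return {"node": self.name, "healthy": self.healthy, "free": self.free_slots()}


class ToyResourceManager:
    """Decides which node runs each task, using the nodes' heartbeat reports."""

    def __init__(self, node_managers):
        self.nodes = {nm.name: nm for nm in node_managers}

    def schedule(self, task):
        reports = [nm.heartbeat() for nm in self.nodes.values()]
        usable = [r for r in reports if r["healthy"] and r["free"] > 0]
        if not usable:
            raise RuntimeError("no healthy node has free capacity")
        # Pick the node with the most free slots: a stand-in for a real policy.
        best = max(usable, key=lambda report: report["free"])
        self.nodes[best["node"]].launch(task)
        return best["node"]

    def handle_failure(self, node_name):
        """Mark a node unhealthy and move its tasks to other nodes."""
        failed = self.nodes[node_name]
        failed.healthy = False
        orphaned, failed.tasks = failed.tasks, []
        return {task: self.schedule(task) for task in orphaned}


if __name__ == "__main__":
    rm = ToyResourceManager([ToyNodeManager("n1", 2), ToyNodeManager("n2", 3)])
    for task in ["t1", "t2", "t3"]:
        print(task, "->", rm.schedule(task))
    print(rm.handle_failure("n1"))  # tasks from n1 are reassigned
```

Each node manager only knows about its own node, while the resource manager has the cluster-wide view and makes placement decisions, including moving work away from a node that has failed.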