MapReduce is a distributed computing programming model suited to processing very large data sets. Hadoop can run MapReduce programs written in various languages, including Java, Ruby, and Python.
MapReduce programs are parallel by nature, which makes them well suited to large-scale data analysis across multiple machines in a cluster.
MapReduce is defined by two functions, map() and reduce(). The rest is taken care of by Hadoop.
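To make the shape of the two functions concrete, here is a minimal sketch in plain Python. It does not use any Hadoop API; the names `Pair`, `map_fn` and `reduce_fn` are hypothetical and chosen only for illustration. The point is the contract: map() receives one input record and emits key-value pairs, and reduce() receives one key together with all the values emitted for it.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical alias for illustration only, not part of any Hadoop API.
Pair = Tuple[str, int]


def map_fn(key: str, value: str) -> Iterator[Pair]:
    """Called once per input record; emits zero or more (key, value) pairs."""
    # Assumption: each record's value is used as the output key, with a count of 1.
    yield value, 1


def reduce_fn(key: str, values: Iterable[int]) -> Iterator[Pair]:
    """Called once per distinct key, with every value emitted for that key."""
    yield key, sum(values)


mapped = list(map_fn("0", "a"))
reduced = list(reduce_fn("a", [1, 1, 1]))
```

Everything between these two calls (distributing the input, moving the pairs around, grouping them by key) is the part that the framework handles.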
Let's take an example:
Objective: Create a frequency distribution of the words in a file.
We have a large text file.
The text file has been divided into blocks and stored in HDFS.
Each block represents one part of the text file.
The map step generates a list of key-value pairs on each node.
From this point on, all inputs and outputs are formatted as <key,value> pairs.
These pairs are all copied over to a single node, where an operation called Sort/Merge takes place.
- The map function runs once for each line of the text file.
- Both the input and output need to be formatted as <key,value> pairs.
- This operation can run in parallel on each data node, because the map calls do not depend on one another's inputs or outputs.
- At the end of the map phase, we have a set of key-value pairs from each data node.
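- All of these results are first copied over to a single node. (A larger job can spread the keys across several reduce nodes, but one node keeps this example simple.)
- The data is sorted by key, and key-value pairs with the same key are merged.
- reduce() then runs once for each merged key produced by the sort/merge step.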
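The steps above can be simulated on one machine. The following is a minimal sketch in plain Python, not a Hadoop program: the function names (`map_line`, `sort_merge`, `reduce_word`, `word_frequencies`) and the sample blocks are made up for illustration, the line index stands in for the input key, and the "blocks" are just strings in a list rather than HDFS blocks.

```python
from itertools import groupby
from operator import itemgetter


def map_line(line_no, line):
    """Map: emit a (word, 1) pair for every word on one line."""
    for word in line.lower().split():
        yield word, 1


def sort_merge(pairs):
    """Sort pairs by key, then merge pairs sharing a key into (key, [values])."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_word(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_frequencies(blocks):
    # Map phase: each block is handled independently. On a real cluster
    # this loop would run in parallel, one task per block.
    mapped = []
    for block in blocks:
        for line_no, line in enumerate(block.splitlines()):
            mapped.extend(map_line(line_no, line))
    # Sort/merge phase, then one reduce call per merged key.
    return dict(reduce_word(word, counts) for word, counts in sort_merge(mapped))


# Hypothetical input: two "blocks" of one text file.
blocks = ["the quick brown fox\nthe lazy dog", "the dog barks"]
result = word_frequencies(blocks)
print(result)
```

Note that `map_line` never looks at any line other than its own, which is what allows the map phase to be spread over many nodes, while `reduce_word` only sees the values for a single word after the sort/merge step has grouped them.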
Diagram: the entire MapReduce flow.