Introduction to Map Reduce

MapReduce is a Distributed computing programming model suitable for processing of huge data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby,  Python.

MapReduce programs are parallel in nature, thus are very useful for performing large-scale data analysis using multiple machines in the cluster.

MapReduce is defined of 2 functions

map() and  reduce() and The rest is taken care by Hadoop

Let’s take an example :

Objective : Create Frequency Distribution of words in the  file

we have a large text file

The text file has been divided into blocks and stored in HDFS

Each block here would represent a part of the text file

The map step will generate a list of key  value pairs on each node

From now on all the inputs and outputs are formatted as <key,value> pairs

These are all copied over to one single node, On that node an operation called Sort/Merge occurs

  • Map function will run once fore each  line of the  text file.
  • Both the  input and output need to be formatted as <key,value> pairs.
  • This operation can run in parallel on each data node there is no interdependency in the inputs and outputs.
  • At the end of the  map phase, we have a set of key-value pairs from each data node
  • All of these results are first copied over to a single node.
  • The data is  sorted based on key,key-value pairs with the same key are merged.
  • reduce() will run on each  pair generated by the sort/merge step

Map Reduce Entire Flow Diagram.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top