Introduction to MapReduce

Tavish Srivastava Last Updated : 17 Apr, 2015

4 min read

MapReduce is a programming model for processing large data sets with a parallel , distributed algorithm on a cluster (source: Wikipedia). Map Reduce when coupled with HDFS can be used to handle big data. The fundamentals of this HDFS-MapReduce system, which is commonly referred to as Hadoop was discussed in our previous article.

The basic unit of information, used in MapReduce is a (Key,value) pair. All types of structured and unstructured data need to be translated to this basic unit, before feeding the data to MapReduce model. As the name suggests, MapReduce model consist of two separate routines, namely Map-function and Reduce-function. This article will help you understand the step by step functionality of Map-Reduce model.The computation on an input (i.e. on a set of pairs) in MapReduce model occurs in three stages:

Step 1 : The map stage

Step 2 : The shuffle stage

Step 3 : The reduce stage.

Semantically, the map and shuffle phases distribute the data, and the reduce phase performs the computation. In this article we will discuss about each of these stages in detail.

[stextbox id=”section”] The Map stage [/stextbox]

MapReduce logic, unlike other data frameworks, is not restricted to just structured datasets. It has an extensive capability to handle unstructured data as well. Map stage is the critical step which makes this possible. Mapper brings a structure to unstructured data. For instance, if I want to count the number of photographs on my laptop by the location (city), where the photo was taken, I need to analyze unstructured data. The mapper makes (key, value) pairs from this data set. In this case, key will be the location and value will be the photograph. After mapper is done with its task, we have a structure to the entire data-set.

In the map stage, the mapper takes a single (key, value) pair as input and produces any number of (key, value) pairs as output . It is important to think of the map operation as stateless, that is, its logic operates on a single pair at a time (even if in practice several input pairs are delivered to the same mapper). To summarize, for the map phase, the user simply designs a map function that maps an input (key, value) pair to any number (even none) of output pairs. Most of the time, the map phase is simply used to specify the desired location of the input value by changing its key.

[stextbox id=”section”] The shuffle stage [/stextbox]

The shuffle stage is automatically handled by the MapReduce framework, i.e. the engineer has nothing to do for this stage. The underlying system implementing MapReduce routes all of the values that are associated with an individual key to the same reducer.

[stextbox id=”section”] The Reduce stage [/stextbox]

In the reduce stage, the reducer takes all of the values associated with a single key k and outputs any number of (key, value) pairs. This highlights one of the sequential aspects of MapReduce computation: all of the maps need to finish before the reduce stage can begin. Since the reducer has access to all the values with the same key, it can perform sequential computations on these values. In the reduce step, the parallelism is exploited by observing that reducers operating on different keys can be executed simultaneously. To summarize, for the reduce phase, the user designs a function that takes in input a list of values associated with a single key and outputs any number of pairs. Often the output keys of a reducer equal the input key (in fact, in the original MapReduce paper the output key must equal to the input key, but Hadoop relaxed this constraint).

Overall, a program in the MapReduce paradigm can consist of many rounds (usually called jobs) of different map and reduce functions, performed sequentially one after another.

[stextbox id=”section”] An example [/stextbox]

Let’s consider an example to understand Map-Reduce in depth. We have the following 3 sentences :

1. The quick brown fox

2. The fox ate the mouse

3. How now brown cow

Our objective is to count the frequency of each word in all the sentences. Imagine that each of these sentences acquire huge memory and hence are allotted to different data nodes. Mapper takes over this unstructured data and creates key value pairs. In this case key is the word and value is the count of this word in the text available at this data node. For instance, the 1st Map node generates 4 key-value pairs : (the,1), (brown,1),(fox,1), (quick,1). The first 3 key-value pairs go to the first Reducer and the last key-value go to the second Reducer.

Similarly, the 2nd and 3rd map functions do the mapping for the other two sentences. Through shuffling, all the similar words come to the same end. Once, the key value pairs are sorted, the reducer function operates on this structured data to come up with a summary.

[stextbox id=”section”] End Notes : [/stextbox]

Let’s take some example of Map-Reduce function usage in the industry :

• At Google:

– Index building for Google Search
– Article clustering for Google News
– Statistical machine translation

•  At Yahoo!:

– Index building for Yahoo! Search
– Spam detection for Yahoo! Mail

•  At Facebook:

– Data mining
– Ad optimization
– Spam detection Example

•  At Amazon:

– Product clustering
– Statistical machine translation

The constraint of using Map-reduce function is that user has to follow a logic format. This logic is to generate key-value pairs using Map function and then summarize using Reduce function. But luckily most of the data manipulation operations can be tricked into this format. In the next article we will take some example like how to do data-set merging, matrix multiplication, matrix transpose, etc. using Map-Reduce.

Did you find the article useful? Share with us any other practical examples of Map-Reduce function. Do let us know your thoughts about this article in the box below.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Tavish Srivastava

Tavish Srivastava, co-founder and Chief Strategy Officer of Analytics Vidhya, is an IIT Madras graduate and a passionate data-science professional with 8+ years of diverse experience in markets including the US, India and Singapore, domains including Digital Acquisitions, Customer Servicing and Customer Management, and industry including Retail Banking, Credit Cards and Insurance. He is fascinated by the idea of artificial intelligence inspired by human intelligence and enjoys every discussion, theory or even movie related to this idea.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Responses From Readers

Vikram Chinmulgund

Hello Tavish, My question is re "the Quick brown fox" example near the end of the article. Why isn't the word "Quick" included in the Suffle & Sort stage ? Thanks, Vikram Chinmulgund

Show 1 reply

Tavish Srivastava

Hi Vikram, "Quick" word is included in Shuffle & Sort stage. It is just that it goes to Reducer 2 in this case. Hope this clarifies your doubt. Tavish

Vikram Chinmulgund

Thanks, Tavish. Are you saying that a)since there is only one instance of "Quick" it doesn't need to be shuffled and b)hence "quick" can go to any available reducer depending on load distribution. regards

Show 1 reply

Tavish Srivastava

Vikram, The main objective of shuffling is to make sure all the occurrence of same map id are available on the same reducer. As "Quick:" has only one occurrence, it can go to either of the two reducers. But to your point, shuffler still needs to work "Quick". Notice that shuffler cannot directly swap "the" and "quick" from the first mapper. If that happens, all "the" from other mappers also need to be shifted to Reducer 2. Hope this clarifies your doubt. Tavish

Sankhe

Hi Tavish, I have a question on number of reducers. As I have understood from browsing on internet about map and reduce, is that after mapper has finish its job, output of mapper(combiner optional) will go to reducer such that each key will go to different reducer. so, reducer will have just one key and its related result, for eg. Mapper 1 output : (abc,1 ) (def,1) Mapper 2 output : (abc,1 ) So input to reducer will be Reducer 1: (abc,1) (abc,1) & Reducer 2 :( def, 1) so final ouput : (abc,2) (def,1) Is this correct ? or Am I missing something ? Reference : https://www.google.com/search?q=key+value+word+count+reducer&tbm=isch&ei=5__NU67eC6TziQLLtoDQAQ#facrc=_&imgdii=_&imgrc=RDl_OVp1pIjE_M%253A%3BKAo2vtr32xkeWM%3Bhttp%253A%252F%252Fwww.cs.uml.edu%252F~jlu1%252Fdoc%252Fsource%252Freport%252Fimg%252FMapReduceExample.png%3Bhttp%253A%252F%252Fwww.cs.uml.edu%252F~jlu1%252Fdoc%252Fsource%252Freport%252FMapReduce.html%3B1344%3B624

Show 1 reply

Tavish Srivastava

Sankhe, The number of mappers and reducers can be independently chosen. And you are right, the work of reducer starts after all mappers have done the job.

Write for us

Write, captivate, and earn accolades and rewards for your work

Reach a Global Audience
Get Expert Feedback
Build Your Brand & Audience

Cash In on Your Knowledge
Join a Thriving Community
Level Up Your Data Science Game

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Introduction to MapReduce

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)