//
you're reading...
Cloud

Pairs and Stripes

When reading Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, I became quite confused by their description of the “pairs” and “stripes”. I understand it now, and would like to provide a clearer explanation and reasons behind each usage.

Sample Data

Assuming the following is the (absurdly small) data set we have in HDFS:

the dog, the cat, and the fox slept in the living room quietly.

And assume the goal of our MapReduce job is to identify all combinations of words which are next to one another. For example, ‘the’ and ‘dog’, and ‘slept’ and ‘in’.

Pairs

After a Mapper completes, the intermediate data consists of keys and values, <k,v>.  Let’s assume we have two TaskTrackers executing, and provide the following output:
Mapper1: {
["the", "dog"],
["dog", "the"],
["the", "cat"],
["cat", "and"],
["and", "the"],
["the", "fox"],
["fox", "slept"]
}
Mapper2: {
["slept", "in"],
["in", "the"],
["the", "living"],
["living", "room"],
["room", "quietly"]
}

Note: For the sake of the example, I’m keeping the sentence in order. The actual mappers may not have the blocks provided in order, but let’s not get hung up on the example. :)

This represents the pair concept. Each v represents a singular value. These are passed to the Reducers, which aggregate the list of succeeding words for each key.

Here is an example in pseudo-code:

Stripes

The concept of stripes is to aggregate data prior to the Reducers by using a Combiner. There are several benefits to this, discussed below. When a Mapper completes, its intermediate data sits idle when pairing until all Mappers are complete. With striping, the intermediate data is passed to the Combiner, which can start processing through the data like a Reducer. So, instead of Mappers sitting idle, they can execute the Combiner, until the slowest Mapper finishes.

Here is an example of the output:
Mapper1: {
["the", {"dog", "cat", "fox"}],
["dog", "the"],
["cat", "and"],
["and", "the"],
["fox", "slept"]
}
Mapper2: {
["slept", "in"],
["in", "the"],
["the", "living"],
["living", "room"],
["room", "quietly"]
}

In this example, because of the limited data set, you  may only see a glimpse of the efficiency gained. Combiners begin the process which Reducers will execute, saving some of the type spent combining later.

This is only an option when the effects of the Reducer are mathematically associative. For example, summing values. Or in this case, creating a collection of words which appear next to the given key. In massive data sets, this technique can save a lot of time and resources.

Here is an example in pseudo-code:

Benefits

  • Writing jobs to utilize Stripes maximizes the efficiency of a job. Mappers may sit idle, and the Reducers cannot start until all Mappers are finished. A particularly slow Mapper will hold up all progress, and the same amount of processing must following no matter how long the Reducers wait.
  • Overall execution time is lowered
  • Bandwidth used to transmit data is also lowered

Note: Combiners are not guaranteed to execute. Do not include anything in your Combiner which you expect to execute on the intermediate key-values.

As you can see, from their book, the efficiency gained can be exponential!

Problems

  • It is more difficult to write additional logic to allow for Combiners. (not always the case, but can be)
  • Objects which may be used to hold the “striped” data may be more memory-costly
About these ads

About Ben

Software developer who enjoys creative design and art.

Discussion

One Response to “Pairs and Stripes”

  1. Some really great articles on this web site , thanks for contribution.

    Posted by Louis Zywiec | March 22, 2012, 7:39 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: