Zhenya Ciurana - Official Author Blog

Hadoop: Java vs. Scripting Mappers/Reducers?

Hadoop mappers/reducers may be written in Java or, through the streaming API, in any language that can read from stdin and write to stdout. The streaming API is intended for users with very limited Java knowledge, according to the documentation. Java is considered the best choice for "heavy duty" jobs. Balancing the computational network's performance against development speed could be a reason for choosing streaming over Java mappers/reducers. Lastly, some scripting languages may work as well as or better than Java in specific problem domains.

Are there cases where a scripting language is more advisable than using Java for both speed of development and performance? How can you identify them?

This is an exercise based on a training course I took recently at Cloudera. Assume a data set with all of Shakespeare's works and the need to identify the line offsets at which a given word appears in each of them. Each work appears in a separate file. Users have the option of implementing this in either Java or in their favourite scripting language. The input to the computational network is of the form:
hamlet@11141\tKING CLAUDIUS\tWe doubt it nothing: heartily farewell.

The mapper intermediate output should be:
KING\thamlet@11141
CLAUDIUS\thamlet@11141
We\thamlet@11141
doubt\thamlet@11141
it\thamlet@11141
nothing\thamlet@11141
heartily\thamlet@11141
farewell\thamlet@11141

The reducer would then take all those intermediate results and generate a list of each word's occurrence across all works:
doubt\thamlet@11141,romeoandjuliet@23445,henryv@426917
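
To make the data flow concrete, the map, sort, and reduce stages described above can be simulated locally, without Hadoop. This is only a minimal sketch under stated assumptions: the function names (map_record, reduce_sorted, run_local) are hypothetical, a plain sort stands in for Hadoop's shuffle, punctuation is stripped only from the ends of each token, and the second sample record is invented for illustration.

```python
import string
from itertools import groupby


def map_record(record):
    """Yield (word, location) pairs for one tab-separated input record."""
    # The first field is the work@offset key; the rest is speaker and text.
    location, _, text = record.rstrip("\n").partition("\t")
    for token in text.split():
        # Assumption: trimming leading/trailing punctuation is enough here.
        word = token.strip(string.punctuation)
        if word:
            yield word, location


def reduce_sorted(pairs):
    """Collapse key-sorted (word, location) pairs into one line per word."""
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word + "\t" + ",".join(location for _, location in group)


def run_local(records):
    """Map every record, sort by key (standing in for the shuffle), reduce."""
    pairs = sorted(pair for record in records for pair in map_record(record))
    return list(reduce_sorted(pairs))


if __name__ == "__main__":
    sample = [
        "hamlet@11141\tKING CLAUDIUS\tWe doubt it nothing: heartily farewell.",
        # Hypothetical second record, used only to show grouping across works.
        "romeoandjuliet@23445\tROMEO\tI doubt it not.",
    ]
    for line in run_local(sample):
        print(line)
```

Because the reducer only ever looks at consecutive pairs with the same key, it does not need to hold the whole word list in memory, provided its input arrives sorted by key.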

Since the mapper and reducer operate on text, a scripting language optimized for text processing could be a good choice. The mapper and reducer were initially coded in awk. The mapper strips common punctuation characters, leaving only the words:
  #!/usr/bin/gawk -f
  # shakesmapper.awk - map all the words in Shakespeare's works.
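  {
    # Strip punctuation from the whole record once; awk then re-splits the fields.
    gsub("[,:;)(|!\\[\\]\\.\\?]|--", "");
    # $1 is the work@offset key; every remaining field is a word to emit.
    for (n = 2; n <= NF; n++) {
      if (length($n) > 0) printf("%s\t%s\n", $n, $1);
    }
  }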

The reducer is also quite simple:
  #!/usr/bin/gawk -f
  # shakesreducer.awk - reduce the output from the mapper.
  { wordsList[$1] = ($1 in wordsList) ? sprintf("%s,%s", wordsList[$1], $2) : $2; }

  END {
    for (key in wordsList)
      printf("%s\t%s\n", key, wordsList[key]);
  }

awk's main advantage over other scripting languages and Java is conciseness. By comparison, the functionally equivalent Python mapper included in the training materials is:
#!/usr/bin/python

import re
import sys

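# Raw string so that \W reaches the regex engine unchanged.
NONALPHA = re.compile(r"\W")

for record in sys.stdin.readlines():
	# Split off the work@offset key from the rest of the line.
	keyline = record.split("\t", 1)
	if len(keyline) == 2:
		(key, line) = keyline
		for w in NONALPHA.split(line):
			if w:
				print(w + "\t" + key)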

The Python code is much more verbose. The Java mapper is even more complex and requires knowledge and understanding of the Hadoop API plus a compilation cycle; the Java mapper and reducer are organized as:
public class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable k, Text v, OutputCollector<Text, Text> o,
      Reporter r) throws IOException { /* implementation here */ }
  .
  .
}

and
public class LineIndexReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text k, Iterator<Text> v,
      OutputCollector<Text, Text> o, Reporter r) throws IOException { /* implementation */ }
  .
  .
}

While the Java code isn't hard for a proficient Java coder to figure out, it may be a daunting exercise for a domain expert with little or no Java knowledge. Forcing a Java solution on the domain experts would require either training them in Java or adding a Java coder to the team, with the additional time invested in requirements gathering and coding. This may not be worthwhile for applications where the mappers/reducers must change quickly.

Assuming the same inputs and infrastructure, are scripting languages a good choice? Performance tests for the awk, Python, and Java mappers/reducers were run against the same data set on the same Hadoop computational network, with no other processes running alongside them. Java professionals predicted that the Java code would outperform the scripted programs by a significant margin. Here are the results (shorter times == better performance):

[Chart: run times of the awk, Python, and Java mappers/reducers]

The Java code was simplified so that it didn't even attempt to remove punctuation from the data, as the awk and Python versions do. Java has a clear performance advantage over Python, but its performance gain over awk for text processing is less than 5%. For a very large data set, is a 5% performance gain significant when weighed against the time and cost of writing the code in Java vs. a scripting language? Is there a cost/time advantage in leveraging the domain experts' command of a domain-specific language, or is it easier to teach them a fast scripting language instead of Java?

Conclusions

  • For text inputs, development will be faster in a text processing language like awk or Perl
  • Perform a quick performance analysis with a subset of the data to compare performance between languages; in this case, awk's performance is arguably on par with Java's, even though the awk code does more work (it strips punctuation), with a much shorter development cycle
  • Some mathematical or binary data may benefit from calling code written in FORTRAN or Mathematica
  • For binary or XML inputs, Java offers a clear performance advantage over other languages at a cost of higher specialization required on the part of the implementer
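
The quick performance analysis suggested above could be sketched as a small local timing harness. This is an illustration only, not the method used for the tests in this post: the helper names (time_command, compare) are hypothetical, the harness times a command on one machine rather than on a Hadoop network, and the inline Python command is a stand-in so that the sketch runs anywhere.

```python
import subprocess
import sys
import time


def time_command(command, input_bytes, repeats=3):
    """Return the best wall-clock time, in seconds, of feeding input_bytes to command."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard the command's output; only the elapsed time matters here.
        subprocess.run(command, input=input_bytes,
                       stdout=subprocess.DEVNULL, check=True)
        best = min(best, time.perf_counter() - start)
    return best


def compare(candidates, input_bytes):
    """Time each named command and return (name, seconds) pairs, fastest first."""
    timings = {name: time_command(command, input_bytes)
               for name, command in candidates.items()}
    return sorted(timings.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Synthetic subset built from the sample record shown earlier.
    sample = (b"hamlet@11141\tKING CLAUDIUS\t"
              b"We doubt it nothing: heartily farewell.\n") * 10000
    candidates = {
        # Stand-in command. In practice this would be something like
        # ["gawk", "-f", "shakesmapper.awk"] or a Python mapper script.
        "python-inline": [sys.executable, "-c",
                          "import sys\nfor line in sys.stdin: line.split()"],
    }
    for name, seconds in compare(candidates, sample):
        print(f"{name}: {seconds:.3f} s")
```

Local timings like these only rank the mappers against each other; job startup and data transfer on the real network may change the absolute numbers.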

Since separating problem domain business logic from distributed execution is a main advantage of map/reduce networks, it may make sense to use the languages that problem domain experts have already mastered, or can learn faster than Java, to create almost-as-fast but quicker-to-code-and-test solutions.