Hadoop mappers/reducers may be written in Java or in any language that supports streaming via stdin and stdout. According to the documentation, the streaming API is intended for users with very limited Java knowledge. Java is considered the best choice for "heavy duty" jobs. The trade-off between the computational network's performance and development speed could be a reason for choosing streaming over Java mappers/reducers. Lastly, some scripting languages may work as well as or better than Java in specific problem domains.
Are there cases where a scripting language is more advisable than using Java for both speed of development and performance? How can you identify them?
This exercise is based on a training course I recently took at Cloudera. Assume a data set with all of Shakespeare's works and the need to identify the line offset at which a given word appears in each of them. Each work appears in a separate file. Users have the option of implementing this in either Java or their favourite scripting language. The input to the computational network is of the form:
hamlet@11141\tKING CLAUDIUS\tWe doubt it nothing: heartily farewell.
The mapper intermediate output should be:
KING\thamlet@11141
CLAUDIUS\thamlet@11141
We\thamlet@11141
doubt\thamlet@11141
it\thamlet@11141
nothing\thamlet@11141
heartily\thamlet@11141
farewell\thamlet@11141
The reducer would then take all those intermediate results and generate a list of each word's occurrence across all works:
doubt\thamlet@11141,romeoandjuliet@23445,henryv@426917
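The map, sort, and reduce steps described above can be simulated in memory to check the expected formats before touching a cluster. This is only a sketch in Python 3, not code from the course: the function names map_line and reduce_pairs are hypothetical, the punctuation pattern is a simplifying assumption, and Hadoop's shuffle is approximated with a plain sort.

```python
import re
from itertools import groupby

# Assumption: anything that is neither a word character nor whitespace
# counts as punctuation. The real mappers use their own patterns.
PUNCTUATION = re.compile(r"[^\w\s]")


def map_line(record):
    """Turn 'work@offset<TAB>speaker<TAB>text' into (word, 'work@offset') pairs."""
    key, _, rest = record.partition("\t")
    words = PUNCTUATION.sub("", rest).split()
    return [(word, key) for word in words]


def reduce_pairs(pairs):
    """Group (word, location) pairs into 'word<TAB>loc1,loc2,...' lines."""
    lines = []
    # sorted() stands in for the shuffle/sort phase between map and reduce.
    for word, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        locations = ",".join(location for _, location in group)
        lines.append(word + "\t" + locations)
    return lines


if __name__ == "__main__":
    record = "hamlet@11141\tKING CLAUDIUS\tWe doubt it nothing: heartily farewell."
    for line in reduce_pairs(map_line(record)):
        print(line)
```

With a single input record every word maps to one location; feeding pairs from several works produces the comma-separated lists shown in the reducer output.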
Since mapper and reducer operate on text, a scripting language optimized for text processing could be a good idea. The mapper and reducer were initially coded in awk. The mapper strips all control and punctuation characters, leaving only the words:
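#!/usr/bin/gawk -f
# shakesmapper.awk - map all the words in Shakespeare's works.
{
    # Strip punctuation from the whole record once; awk then re-splits the fields.
    gsub("[,:;)(|!\\[\\]\\.\\?]|--", "");
    # $1 is the work@offset key; every later field is a word to emit.
    for (n = 2; n <= NF; n++) {
        if (length($n) > 0) printf("%s\t%s\n", $n, $1);
    }
}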
The reducer is also quite simple:
#!/usr/bin/gawk -f
# shakesreducer.awk - reduce the output from the mapper.
{ wordsList[$1] = ($1 in wordsList) ? sprintf("%s,%s", wordsList[$1], $2) : $2; }
END {
    for (key in wordsList)
        printf("%s\t%s\n", key, wordsList[key]);
}
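For comparison with the awk reducer, a streaming reducer in Python might look like the following. This is a sketch, not the reducer from the training materials: it is written for Python 3 and relies on the fact that Hadoop delivers reducer input sorted by key, so each word can be emitted as soon as its group ends instead of holding the whole index in memory. The function name reduce_stream is hypothetical, and skipping malformed lines is an assumption about the desired behaviour.

```python
#!/usr/bin/env python3
import sys
from itertools import groupby


def reduce_stream(lines):
    """Yield 'word<TAB>loc1,loc2,...' for input lines already sorted by word.

    Each input line is expected to be 'word<TAB>location'.
    Lines without a tab are skipped (an assumption, not a requirement).
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    valid = (pair for pair in pairs if len(pair) == 2)
    # groupby only merges adjacent equal keys, hence the sorted-input requirement.
    for word, group in groupby(valid, key=lambda pair: pair[0]):
        yield word + "\t" + ",".join(location for _, location in group)


if __name__ == "__main__":
    for output_line in reduce_stream(sys.stdin):
        print(output_line)
```

The trade-off against the awk version is that this one depends on sorted input, while the awk associative array works on input in any order at the cost of buffering every key until the end.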
awk's main advantage over other scripting languages and Java is conciseness. By comparison, the functionally equivalent Python mapper included in the training materials is:
#!/usr/bin/python
import re
import sys

NONALPHA = re.compile("\W")

for input in sys.stdin.readlines():
    keyline = input.split("\t", 1)
    if (len(keyline) == 2):
        (key, line) = keyline
        for w in NONALPHA.split(line):
            if w:
                print w + "\t" + key
The Python code is much more verbose. The Java mapper is even more complex and requires knowledge and understanding of the Hadoop API plus a compilation cycle; the Java mapper and reducer are organized as:
public class LineIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text v, OutputCollector<Text, Text> o,
            Reporter r) throws IOException { /* implementation here */ }
    .
    .
}
and
public class LineIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text k, Iterator<Text> v,
            OutputCollector<Text, Text> o, Reporter r) throws IOException { /* implementation */ }
    .
    .
}
While the Java code isn't hard for a proficient Java coder to figure out, it may be a daunting exercise for a domain expert with little or no Java knowledge. Forcing a Java solution on the domain experts would require either training them in Java or adding a Java coder to the team, with additional time invested in requirements gathering and coding. This may not be worthwhile for applications where the mappers/reducers must change quickly.
Assuming the same inputs and infrastructure, are scripting languages a good choice? Performance tests for the awk, Python, and Java mappers/reducers were carried out against the same data set on the same Hadoop computational network with no other processes running alongside them. Java professionals predicted that the Java code would outperform the scripted programs by a significant margin. Here are the results (shorter times == better performance):
The Java code was simplified so that it didn't even attempt to remove punctuation from the data, as the awk and Python versions do. Java has a clear performance advantage over Python, but its performance gain over awk for text processing is less than 5%. For a very large data set, is a 5% performance gain significant when compared against the time and cost of writing the code in Java vs. a scripting language? Is there a cost/time advantage in leveraging the domain experts' command of a domain-specific language, or is it easier to teach them a fast scripting language instead of Java?
Conclusions
- For text inputs, development will be faster in a text processing language like awk or Perl
- Perform a quick performance analysis with a subset of the data to compare performance between languages; in this case, awk's performance is arguably on par with Java's, even for more complex operations, with a much shorter development cycle
- Some mathematical or binary data may benefit from calling code written in FORTRAN or Mathematica
- For binary or XML inputs, Java offers a clear performance advantage over other languages at a cost of higher specialization required on the part of the implementer
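
The second point, a quick comparison on a subset of the data, can be approximated locally before running anything on the cluster. The sketch below (Python 3) times a command that reads a sample on stdin. The helper name time_command is hypothetical, and local timings exclude Hadoop's job startup and shuffle costs, so they are a rough guide only.

```python
import subprocess
import time


def time_command(command, sample, repeats=3):
    """Return the best wall-clock time in seconds for `command` fed `sample` on stdin.

    `command` is an argument list and `sample` is bytes; the command's
    standard output is discarded. A non-zero exit status raises an error.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            command,
            input=sample,
            stdout=subprocess.DEVNULL,
            check=True,
        )
        best = min(best, time.perf_counter() - start)
    return best
```

For example, with sample bytes read from a subset of the input, time_command(["gawk", "-f", "shakesmapper.awk"], sample) could be compared against the same call for a Python mapper script (the script file name would be whatever the mapper was saved as).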
What if All runs in JVM?
Now, Jython and AWK can run the code in the JVM, meaning they compile into bytecode. Of course, the same codebase is then used, and it is the actual language implementation that dictates the performance.
What do you think? What if you run the same three scripts inside the JVM, compiling into bytecode (JSR223 or the compiler with each java version language)? It would be interesting to see.
William Martinez.
Re: What if All runs in JVM?
Interesting idea. This may be a great way of leveraging both the scripting knowledge and the JVM optimizations. I'll try testing that later. Someone mentioned using jawk as well -- now *that* would be an interesting mano-a-mano.
Thanks for your comment -- cheers!