
2013-11-10

Review: Storm - real-time processing cookbook


Quinton Anderson has written a book on the analysis platform Storm, published by Packt. I have worked a little with Hadoop over the last couple of years, so it was only natural to take a look at the other big data processing platform. Hadoop is a batch processing platform, while Storm is for real-time processing and analysis of data. That means the two projects are not direct competitors; rather, they can complement each other.

When reviewing a technical book for the general public, it is important to review the book, not the technology. You can easily write a crappy book on an excellent technology, while the opposite is very difficult. This review should be read in that context.

The author starts out by explaining how to set up Storm. Storm seems to be quite a complex beast, but Mr. Anderson gets through it nicely. I would have preferred an introduction to some of the concepts before jumping into the many tasks. But that is my personal preference, and this is a cookbook, not a textbook for a university course.

The first couple of chapters are about processing real-time data. Twitter and log files are the canonical examples in this area, and the book uses them as well.

One chapter is about how to use C++ and Qt in your real-time data processing. If you think that Qt is only about graphical user interfaces, the book will show you that Qt is a lot more. The author uses Qt's non-GUI parts in his processing.

Another chapter is about machine learning. As part of the big data revolution, machine learning has become popular again. Machine learning is a topic that Mr. Anderson is passionate about, and he analyzes the problem in great detail before presenting the recipe.

One of the major disadvantages of the book is that the assumptions about the reader are pretty demanding. The reader is assumed to know at least:
  • Java development using Maven and Eclipse
  • Some functional programming
  • The Ubuntu or Debian (or any UNIX) command line
  • Web development (HTML, CSS, JavaScript, JSON)
  • Data modelling
  • A little about NoSQL
This problem does not really stem from the author but from Storm itself, although the author might have chosen other examples. The book is not a university textbook, but I would still have liked many more references to textbooks and papers.

There is a lot of source code in the book. You should really download the examples, as some of them are longer than a page. It would be great if Packt added some kind of syntax highlighting; most programmers find highlighted code easier to read. In particular, electronic books (I read the book as a PDF) can easily be colorized!

One of the things I really like about the book is that the author has taken the time to craft understandable diagrams. A well-composed diagram is worth many words, and he often sums up the key points in a diagram. In general, the author writes in a rather dry, fact-based style. But on page 111, you find that the author cannot suppress his humor: "... for automating Clojure projects without setting your hair on fire."

I am not going to pass judgment on Storm as a technology, but Quinton Anderson has done his job well by writing a good cookbook.

If you are serious about getting into data science and data processing, I wouldn't hesitate to recommend the book. You can find it at http://www.packtpub.com/storm-realtime-processing-cookbook/book.

2012-10-11

Map/Reduce and GNU Parallel


This week I attended a meeting organized by DKUUG. The topic was GNU Parallel, and the speaker was Ole Tange, the developer behind GNU Parallel.

To be honest, I have not used GNU Parallel before. Of course, I have heard about it, as Ole always talks about the program when I meet him. His introduction to the program was great; enjoy it when DKUUG releases the video.

Simply put, GNU Parallel runs tasks in parallel, either locally or on remote machines. In other words, GNU Parallel can turn your command line into a parallel computation engine.

Lately, I have been studying Apache Hadoop. Currently, Hadoop is probably the most popular implementation of the programming paradigm Map/Reduce. GNU Parallel offers a way of specifying the Map component; it is activated with the --pipe option. On my way home I was thinking about how to implement a simple Map/Reduce-based analysis using GNU Parallel.

I have used the On Time Performance data set more than once. It is a good data set, as it is highly regular and large (500,000-600,000 rows every month). The data set records every flight within the USA, and for each flight you can find the destination airport, delays, and about 120 other data points. Six months of data result in a 1.3 GB comma-separated value (CSV) file.

A simple analysis of the data set is to generate a table of the number of times each airport occurs as a flight destination. The three-letter airport codes are unique, e.g., LAX is Los Angeles International Airport. It is possible to do the analysis in parallel by breaking the data file into smaller parts; this is the map task. Each map task produces a table, and the reduce task combines the outputs of the map tasks into the final table.

In order to use GNU Parallel as the driver for Map/Reduce, I have implemented the mapper and the reducer in Perl. The mapper is:

#!/usr/bin/perl -w

use strict;

# Count the rows per destination airport (column 15 of the CSV) in this chunk.
# Note: ++ autovivifies missing hash entries, so no explicit init is needed.
my %dests;
while (<>) {
    my @data = split /,/;
    my $airport = $data[14];
    $dests{$airport}++;
}

# Emit one "<airport> <count>" line per airport seen in the chunk.
foreach my $airport (keys %dests) {
    print "$airport $dests{$airport}\n";
}


The reducer is also simple:

#!/usr/bin/perl -w

use strict;

# Sum the partial counts emitted by the mappers.
my %dests;
while (<>) {
    chomp;
    my ($airport, $count) = split / /;
    $dests{$airport} = 0 if (not exists $dests{$airport});
    $dests{$airport} += $count;
}

# Print the final table, sorted by airport code, plus a grand total.
my $total = 0;
foreach my $airport (sort keys %dests) {
    $total += $dests{$airport};
    print "$airport $dests{$airport}\n";
}
print "Total: $total\n";


It is possible to run the Map/Reduce analysis with the following command line:

cat On_Time_Performance_1H2012.csv | parallel --pipe --blocksize 64M ./map.pl | ./reduce.pl

The input file is broken into 64 MB chunks. GNU Parallel is line-oriented, so a chunk will not be exactly 64 MB, but close. My laptop has four cores, and they are fully utilized.

It seems to me that GNU Parallel offers a simple approach to Map/Reduce for people who live much of their lives on the command line.