Tuesday, December 20, 2011

Simple things in PIG

A trivial task in PIG, which comes up a lot when doing a sanity check on MapReduce program output, is counting the number of rows in the inputs and outputs. For those used to SQL's COUNT, the way it is done in PIG is not intuitive. The reason is that COUNT in PIG counts not the number of rows in a relation but the number of tuples in a bag, so an extra grouping step is needed.

DATASET = LOAD .....
G = GROUP DATASET ALL;
C = FOREACH G GENERATE COUNT(DATASET);
DUMP C;

Monday, November 14, 2011

Hadoop Speculative Execution

One thing that seems easy to grasp in Hadoop is the concept of speculative execution. Sounds trivial: if I have a task and am not so sure of the environment, please go ahead, dear yellow elephant, schedule a few more copies of the task, and see which one finishes first. One task here or there does not matter, and dammit, I want my results NOW.

But how does the little yellow fellow know what is slow and when to schedule? When running large jobs it looks like Hadoop runs some speculative tasks anyway, and many get the idea that there is some magic number of speculative tasks launched for each job, but that would be way too expensive if the number of mappers/reducers is large. Therefore Hadoop goes the semi-smart way:

The current algorithm works the following way:
A speculative task is kicked off for mappers and reducers whose completion rate is below a certain percentage of the completion rate of the majority of running tasks. For example, if you have 100 mappers, 90 of which are at 80% completion and 10 are at 20%, Hadoop will start 10 additional tasks for the slow ones.

In versions of Hadoop newer than 0.20.2 there are 3 new fields in the jobconf:
  • mapreduce.job.speculative.speculativecap
  • mapreduce.job.speculative.slowtaskthreshold
  • mapreduce.job.speculative.slownodethreshold
Hadoop launches a speculative task for a regular task if its completion < slowtaskthreshold * mean(completion of all other tasks), and only while the number of speculative tasks already launched < speculativecap.
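A minimal sketch of setting these from the command line, assuming your driver goes through ToolRunner so that -D generic options are picked up; the jar name, class name, paths, and values here are just placeholders:

# my-job.jar, MyJob, the paths and the numbers are made-up examples
hadoop jar my-job.jar MyJob \
  -D mapreduce.job.speculative.speculativecap=0.1 \
  -D mapreduce.job.speculative.slowtaskthreshold=1.0 \
  -D mapreduce.job.speculative.slownodethreshold=1.0 \
  /input /output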

In older versions of Hadoop these threshold values are fixed and cannot be modified.


Wednesday, April 6, 2011

bash-fu: seq

So on my local cluster I have 9 nodes, and every now and then I need to clean up the mess that my SGE jobs create, or collect the logs, or whatever. So I want to ssh in and execute the same command on each machine. The node names are 'compute-0-[0-9]+'.

Enter seq. seq generates a sequence of numbers in the given range. Combining it with xargs, I write
seq 1 9 | xargs -I{} sh -c "ssh compute-0-{} 'do something awesome'"
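For instance, two concrete uses (the remote paths are made up for illustration; only the compute-0-N naming comes from above):

seq 1 9 | xargs -I{} ssh -n compute-0-{} 'rm -rf /tmp/sge_scratch/*'   # -n keeps ssh off the local stdin
seq 1 9 | xargs -I{} scp compute-0-{}:/var/log/my_job.log ./my_job.compute-0-{}.log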

magic is easy :)

Saturday, April 2, 2011

bash as a functional language.

Following the previous post on the similarity between the functional staple map() and xargs, I was looking more deeply into whether we can consider bash programming a functional language. In the xargs example I used before, I used the sh -c trick, which can be said to be analogous to the eval function in a functional language. Exactly the same can be achieved using the back-tick syntax: give a string as input, and get back the result of its evaluation.

So now we have eval, and therefore closures and 'sorta' anonymous functions which can be passed around as strings and evaluated with eval. Pipes tie this all together, allowing the results of one function to be piped into another.
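A toy sketch of the idea (the names square and apply are made up for illustration): store a command string as a "function value", pass it around, and apply it later with eval.

square='expr $n \* $n'    # the "anonymous function", kept as a string
apply() {                 # usage: apply FUNC_STRING ARG
    local n=$2            # bind the variable the string expects
    eval "$1"             # evaluate the string in the current shell
}
apply "$square" 7         # prints 49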

What else do we need to call a language functional? thinking...

Tuesday, March 29, 2011

bash-fu:xargs

The basic techniques of bash-fu are famous by now: find, grep, sed, awk, some for loops, and everything is at your fingertips. The second level of the bash dao is using xargs: the gem of Unix tools.

xargs is to bash what map is to functional programming, and it does to its input exactly what map does: apply a function to each element of the input. However, the problem with xargs is that it maps over only one list of arguments at a time, so what if we want to emulate a double loop? sh comes to the rescue:

xargs -L1 -I{} sh -c "foobar | xargs -L1 -Ix echo {} x"

or something like that :D
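Here is a concrete toy version of that double loop, using nothing but seq and echo, which prints the cross product of {1,2,3} and {1,2}:

seq 1 3 | xargs -I{} sh -c "seq 1 2 | xargs -Ix echo {} x"
# 1 1
# 1 2
# 2 1
# ... and so on: the outer placeholder {} is substituted before sh -c runs,
# the inner placeholder x by the second xargs.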