But how does the little yellow fellow know what is slow and when to schedule a backup? When you run a large job it looks like some speculative tasks get launched anyway, and many people get the idea that there is some magic number of extra tasks launched for every job; but that would be far too expensive if the number of mappers/reducers is large. So Hadoop takes a semi-smart approach.
The current algorithm works as follows:
A speculative task is kicked off for mappers and reducers whose completion rate is below a certain percentage of the completion rate of the majority of running tasks. For example, if you have 100 mappers, 90 of which are at 80% completion and 10 are at 20%, Hadoop will start 10 additional attempts for the slower ones.
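As a rough sketch of that rule (a simplification for illustration only, not Hadoop's actual scheduler code; the 50% cut-off is a made-up value):

```java
import java.util.ArrayList;
import java.util.List;

public class SpeculationSketch {

    // Hypothetical cut-off: a task counts as slow when its progress is below
    // this fraction of the mean progress of all running tasks.
    static final double SLOW_FRACTION = 0.5;

    static double mean(List<Double> progress) {
        double sum = 0.0;
        for (double p : progress) sum += p;
        return progress.isEmpty() ? 0.0 : sum / progress.size();
    }

    // True if a speculative attempt should be considered for this task.
    static boolean isSlow(double taskProgress, List<Double> allProgress) {
        return taskProgress < SLOW_FRACTION * mean(allProgress);
    }

    public static void main(String[] args) {
        // 90 mappers at 80% and 10 mappers at 20%, as in the example above.
        List<Double> progress = new ArrayList<>();
        for (int i = 0; i < 90; i++) progress.add(0.8);
        for (int i = 0; i < 10; i++) progress.add(0.2);

        System.out.println("mean progress: " + mean(progress));        // 0.74
        System.out.println("20% task slow? " + isSlow(0.2, progress)); // true  -> speculate
        System.out.println("80% task slow? " + isSlow(0.8, progress)); // false
    }
}
```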
In versions of Hadoop newer than 0.20.2 there are 3 new fields in the jobconf:

- mapreduce.job.speculative.speculativecap
- mapreduce.job.speculative.slowtaskthreshold
- mapreduce.job.speculative.slownodethreshold

A speculative attempt is launched only while a task's progress is below slowtaskthreshold (a percentage of the mean completion of all other tasks) and the number of speculative tasks already launched is < speculativecap.
In older versions of Hadoop these threshold values are fixed and cannot be modified.
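On a release that has these fields you can tune them per job through the JobConf. A minimal sketch, assuming the old org.apache.hadoop.mapred API; the numeric values are illustrative, not authoritative defaults:

```java
import org.apache.hadoop.mapred.JobConf;

public class SpeculativeTuning {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SpeculativeTuning.class);

        // Cap on how many speculative attempts may be launched
        // (0.1 here is an illustrative value).
        conf.setFloat("mapreduce.job.speculative.speculativecap", 0.1f);

        // How far behind the other running tasks a task must fall
        // before it is considered slow enough to speculate.
        conf.setFloat("mapreduce.job.speculative.slowtaskthreshold", 1.0f);

        // Analogous threshold applied per node rather than per task.
        conf.setFloat("mapreduce.job.speculative.slownodethreshold", 1.0f);

        // ... set mapper, reducer, input/output paths as usual,
        // then submit with JobClient.runJob(conf).
    }
}
```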