Hi peeps, these days I am working on Apache Hama, enjoying learning it and playing with Big Data problems.


Recently, I had a task where I had to mimic a master-slave communication style. The master task was responsible for data aggregation and for checking the convergence criteria of the algorithm, whereas the slaves did all the processing on the data they held. So the master was supposed to hold no data; the data was only supposed to be divided among the slaves.

But, according to Eddie Yoon (one of the authors of Hama):

Like MapReduce, Hama launches the number of tasks by the number of DFS blocks or input files. Hashing is a default for only graph job.

In other words, Hama launches tasks based on the number of input splits (chunks of files) or the number of input files.

Let’s take some examples:

1) If I have one file of size 20 MB, Hama will open only 1 task.

2) If we have 2 files of sizes 10 MB and 20 MB, Hama will open 2 tasks; each task will handle one file.

3) If we have one file of 250 MB and another of 200 MB, Hama will open ceil(250 MB / 64 MB) + ceil(200 MB / 64 MB) = 4 + 4 = 8 tasks. Each task will get one chunk of data, since files larger than the 64 MB block size are split.
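The arithmetic above can be sketched in a few lines of Java. The class and method names here (`TaskCount`, `tasksFor`) are hypothetical, just illustrating the per-file ceiling division, assuming a 64 MB split size:

```java
// Hypothetical sketch: each input file contributes ceil(size / splitSize) tasks.
public class TaskCount {
  static final long SPLIT_SIZE_MB = 64;

  // Sum the per-file task counts using integer ceiling division.
  static long tasksFor(long... fileSizesMb) {
    long tasks = 0;
    for (long size : fileSizesMb) {
      tasks += (size + SPLIT_SIZE_MB - 1) / SPLIT_SIZE_MB; // ceil(size / 64)
    }
    return tasks;
  }

  public static void main(String[] args) {
    System.out.println(tasksFor(20));       // example 1: one 20 MB file -> 1 task
    System.out.println(tasksFor(10, 20));   // example 2: two small files -> 2 tasks
    System.out.println(tasksFor(250, 200)); // example 3: 4 + 4 -> 8 tasks
  }
}
```

Note that the ceiling is taken per file, not over the total size: (250 + 200) / 64 would round to 7, but Hama splits each file separately.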

For more details, check out my data-partitioning article: Apache Hama data partition

So, what if you want to open a task without any data (in my example, the master) to do the data aggregation?



I found the following hack that works. If you know a more elegant solution, kindly drop a comment.

1- Create an empty file and add it to the input path along with your data file(s).

2- Hama will create task 0 without any data, and you can use it as your master task. Your data will be divided among task 1 onwards, depending on its size.
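With the hack above, the master/slave split in the BSP code boils down to checking the peer index. The following is a minimal sketch, not a full implementation: the class name `MasterSlaveBSP`, the `MASTER` constant, and the aggregation logic are my assumptions, while the Hama calls (`getPeerIndex`, `send`, `sync`, `getCurrentMessage`) are real BSPPeer API methods:

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// Hypothetical sketch: task 0 (the one fed the empty file) acts as master,
// every other task acts as a slave that processes its own chunk of input.
public class MasterSlaveBSP extends
    BSP<LongWritable, Text, NullWritable, DoubleWritable, DoubleWritable> {

  private static final int MASTER = 0;

  @Override
  public void bsp(
      BSPPeer<LongWritable, Text, NullWritable, DoubleWritable, DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {
    if (peer.getPeerIndex() == MASTER) {
      // Master has no input: wait for the slaves, then aggregate their messages.
      peer.sync();
      double sum = 0;
      DoubleWritable msg;
      while ((msg = peer.getCurrentMessage()) != null) {
        sum += msg.get();
      }
      // ... check the convergence criteria here, broadcast the decision, etc.
    } else {
      // Slave: process the local chunk and send a partial result to the master.
      double partial = 0;
      // ... read local records via peer.readNext(...) and update `partial` ...
      peer.send(peer.getPeerName(MASTER), new DoubleWritable(partial));
      peer.sync();
    }
  }
}
```

The key point is that only the peer-index check distinguishes the roles; the empty-file trick just guarantees that task 0 receives no records to process.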

P.S. I am new to Hama, so drop a message if you find something that needs improvement.