Recently, I was testing Apache Hama on my local machine and wanted to see how it partitions input data. As far as I know, Hama should partition data just like Hadoop, since Hama is written on top of Hadoop. So I wrote a small job and tried to print its output.

Main.java class
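Something like this (a minimal sketch of my driver; the class and path names are illustrative, not the exact code):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hama.HamaConfiguration;
import org.apache.hama.bsp.BSPJob;
import org.apache.hama.bsp.TextInputFormat;
import org.apache.hama.bsp.TextOutputFormat;

public class Main {
  public static void main(String[] args) throws Exception {
    HamaConfiguration conf = new HamaConfiguration();
    BSPJob job = new BSPJob(conf, Main.class);
    job.setJobName("Data partition test");
    job.setBspClass(DataPartitionBSP.class);

    job.setInputFormat(TextInputFormat.class);
    job.setInputPath(new Path(args[0]));

    job.setOutputFormat(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputPath(new Path(args[1]));

    // Ask Hama for 4 BSP tasks -- the line this post is about.
    job.setNumBspTask(4);

    job.waitForCompletion(true);
  }
}
```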

BSP class
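Again a sketch; it just echoes every input record along with the name of the peer (task) that read it, so the output shows how the data was split:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class DataPartitionBSP extends
    BSP<LongWritable, Text, Text, Text, NullWritable> {

  @Override
  public void bsp(
      BSPPeer<LongWritable, Text, Text, Text, NullWritable> peer)
      throws IOException, SyncException, InterruptedException {
    LongWritable key = new LongWritable();
    Text value = new Text();
    // Each task sees only its own split of the input.
    while (peer.readNext(key, value)) {
      peer.write(new Text(peer.getPeerName()), value);
    }
    peer.sync();
  }
}
```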

As per the above code, Hama should open up four separate tasks and divide the input data into four chunks, because I wrote job.setNumBspTask(4); in the main class. But guess what: it only opened up one task. Why, you ask?


After messing with the code for a long time, I realized that my input file was quite small (5 MB). Hadoop splits input into chunks of one HDFS block each (64 MB by default on Hadoop 1.x, 128 MB on 2.x), and Hama does the same. Once I increased the size of my file, Hama automatically created more tasks.
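Back-of-the-envelope, the task count follows the number of input splits, one per HDFS block. The snippet below is just my mental model of that arithmetic, not Hama code:

```java
public class SplitMath {
  public static void main(String[] args) {
    // One BSP task per input split, one split per HDFS block.
    // dfs.block.size defaults to 64 MB on Hadoop 1.x, 128 MB on 2.x.
    long blockSize = 64L * 1024 * 1024;
    long fileSize  = 5L * 1024 * 1024;  // my 5 MB test file
    long numSplits = Math.max(1, (fileSize + blockSize - 1) / blockSize); // ceil
    System.out.println(numSplits);      // 1 -- one task, despite setNumBspTask(4)
  }
}
```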

Summary:

So, Apache Hama will automatically divide the input data based on the size of the input file(s) and the number of files that make up the job's input. The data is then distributed across tasks by the default partitioner, which is hash-based.

You can write your own partitioner to divide data differently.
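For example, here is a sketch against Hama's org.apache.hama.bsp.Partitioner interface (FirstCharPartitioner is a made-up name; it routes each line by its first character instead of the default hash):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.Partitioner;

// Illustrative only: sends each record to a task chosen by the
// first character of the line, rather than by hashing the key.
public class FirstCharPartitioner
    implements Partitioner<LongWritable, Text> {

  @Override
  public int getPartition(LongWritable key, Text value, int numTasks) {
    if (value.getLength() == 0) {
      return 0;
    }
    return (value.charAt(0) & Integer.MAX_VALUE) % numTasks;
  }
}
```

You would then register it on the BSPJob (in the Hama versions I have seen, via job.setPartitioner(FirstCharPartitioner.class)).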


P.S. I am very new to Hama/Hadoop. If you find an issue or my understanding is wrong, do drop a comment. Thanks.


EDIT #1: See the comments for more details.

Question:

So, if I have 2 files of sizes 10 MB and 20 MB, the BSP engine will only open 2 tasks, with one task handling each file, right? And if I have one file of 250 MB and another of 200 MB, then BSP will open (250 + 200) / 64 ≈ 8 tasks. Is my understanding correct?

Answer: 

Yes, you’re right. That’s the current implementation. However, we plan to allow users to force-set the number of tasks for a graph job in the future, because the unpartitioned input data of a graph job needs to be reshuffled in the first stage anyway, so data locality is somewhat meaningless.
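To make the arithmetic above concrete (assuming, as in Hadoop, splits are computed per file and then summed across the job's input):

```java
public class MultiFileSplitMath {
  public static void main(String[] args) {
    // Per-file split count, summed over the job's input files.
    long blockSize = 64L * 1024 * 1024;                          // 64 MB blocks
    long[] fileSizes = {250L * 1024 * 1024, 200L * 1024 * 1024}; // 250 MB + 200 MB
    long tasks = 0;
    for (long size : fileSizes) {
      tasks += (size + blockSize - 1) / blockSize;               // ceil(size / block)
    }
    System.out.println(tasks); // ceil(250/64) + ceil(200/64) = 4 + 4 = 8 tasks
  }
}
```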