Friday, 4 October 2013

Join 2 HBase tables in Pig

When you are working with HBase, there would come a time when we need to join or link 2 different table content.

Does HBase allow joins?

Simple answer : it doesn't, at not least in the way that RDBMS support them. It means that you have to do it yourself.

There are many solutions to this problem. 
1. Map/Reduce technique.
2. Pig.
3. Hive.

I plan to cover the first 2 ones.

Let's suppose we have some data in 2 tables which we are interested to link. We have a json in a HBase row value which has a 'id' field. This 'id' field is also present in the 2nd table. How would be go about solving this?

This would work if your data is stored in this format:

hbase(main):002:0> scan 'table1', {LIMIT=>1}
ROW       COLUMN+CELL                                                                                      
row1    column=cf:a, timestamp=1378473207660, value={"G":"M","Id":"12","name":"somename"}                                                      
1 row(s) in 0.0130 seconds

hbase(main):003:0> scan 'table2', {LIMIT=>1} 
ROW         COLUMN+CELL                                                                                      
 row1       column=cf:b, timestamp=1378473207660, value={"v":"v1","Id":"12","age":"22"}                                                     
1 row(s) in 0.0170 seconds

As you can see, 'id' field is the same. We need to get the age field and the name field off the tables.

Here's a pig script to do a simple table join:


set hbase.zookeeper.quorum 'xx.xx.xxx.xxx' 

-- zookeeper quorum address of your hbase cluster. If standalone, remove this line.

set default_parallel 24; 

-- Number of parallel reduce task for the operation. If standalone, remove this line.

table1 = LOAD 'hbase://table1' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('cf:a') as (a:chararray);

-- for every row, match the regular exp to get the value of Id field.

data_parsed = FOREACH table1 GENERATE REGEX_EXTRACT(a,'\\"Id\\":(.*?),',1) as Id, a as action;


--get id field of table2

content_table = LOAD 'hbase://table2' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('cf:b') as (Id:chararray); 

-- if doing a join by extra fields, should be listed here


-- do the join of two tables using Id

joined = JOIN data_parsed BY Id, content_table BY Id;

-- store the result into HDFS.

STORE joined INTO '/tmp/pig-join' USING PigStorage(); 


If you want to store the result back into HBase and not into HDFS, then comment out the last line and put this instead. Make sure the table is already created.

STORE joined INTO 'hbase://Join-Table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorgae('cf:a-1');

Happy Coding! Leave a comment if you have any questions.



No comments:

Post a Comment