When you are working with HBase, there would come a time when we need to join or link 2 different table content.
Does HBase allow joins?
Simple answer : it doesn't, at not least in the way that RDBMS support them. It means that you have to do it yourself.
There are many solutions to this problem.
1. Map/Reduce technique.
2. Pig.
3. Hive.
I plan to cover the first 2 ones.
Let's suppose we have some data in 2 tables which we are interested to link. We have a json in a HBase row value which has a 'id' field. This 'id' field is also present in the 2nd table. How would be go about solving this?
This would work if your data is stored in this format:
As you can see, 'id' field is the same. We need to get the age field and the name field off the tables.
Here's a pig script to do a simple table join:
If you want to store the result back into HBase and not into HDFS, then comment out the last line and put this instead. Make sure the table is already created.
Does HBase allow joins?
Simple answer : it doesn't, at not least in the way that RDBMS support them. It means that you have to do it yourself.
There are many solutions to this problem.
1. Map/Reduce technique.
2. Pig.
3. Hive.
I plan to cover the first 2 ones.
Let's suppose we have some data in 2 tables which we are interested to link. We have a json in a HBase row value which has a 'id' field. This 'id' field is also present in the 2nd table. How would be go about solving this?
This would work if your data is stored in this format:
hbase(main):002:0> scan 'table1', {LIMIT=>1}
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1378473207660, value={"G":"M","Id":"12","name":"somename"}
1 row(s) in 0.0130 seconds
hbase(main):003:0> scan 'table2', {LIMIT=>1}
ROW COLUMN+CELL
row1 column=cf:b, timestamp=1378473207660, value={"v":"v1","Id":"12","age":"22"}
1 row(s) in 0.0170 seconds
Here's a pig script to do a simple table join:
set hbase.zookeeper.quorum 'xx.xx.xxx.xxx'
-- zookeeper quorum address of your hbase cluster. If standalone, remove this line.
set default_parallel 24;
-- Number of parallel reduce task for the operation. If standalone, remove this line.
table1 = LOAD 'hbase://table1' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('cf:a') as (a:chararray);
-- for every row, match the regular exp to get the value of Id field.
data_parsed = FOREACH table1 GENERATE REGEX_EXTRACT(a,'\\"Id\\":(.*?),',1) as Id, a as action;
--get id field of table2
content_table = LOAD 'hbase://table2' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('cf:b') as (Id:chararray);
-- if doing a join by extra fields, should be listed here
-- do the join of two tables using Id
joined = JOIN data_parsed BY Id, content_table BY Id;
-- store the result into HDFS.
STORE joined INTO '/tmp/pig-join' USING PigStorage();
STORE joined INTO 'hbase://Join-Table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorgae('cf:a-1');