Friday, 4 October 2013

Join 2 HBase tables in Pig

When you work with HBase, there will come a time when you need to join or link the contents of two different tables.

Does HBase allow joins?

Simple answer: it doesn't, at least not in the way that an RDBMS supports them. That means you have to do it yourself.

There are several solutions to this problem:
1. A MapReduce job.
2. Pig.
3. Hive.

I plan to cover the first two.

Let's suppose we have data in two tables that we want to link. In the first table, each row value holds a JSON string containing an 'Id' field, and the same 'Id' field is also present in the second table. How would we go about joining them?

The approach below assumes your data is stored in this format:

hbase(main):002:0> scan 'table1', {LIMIT=>1}
ROW       COLUMN+CELL                                                                                      
row1    column=cf:a, timestamp=1378473207660, value={"G":"M","Id":"12","name":"somename"}                                                      
1 row(s) in 0.0130 seconds

hbase(main):003:0> scan 'table2', {LIMIT=>1} 
ROW         COLUMN+CELL                                                                                      
 row1       column=cf:b, timestamp=1378473207660, value={"v":"v1","Id":"12","age":"22"}                                                     
1 row(s) in 0.0170 seconds

As you can see, the 'Id' field is the same in both tables. We want to pull the 'name' field and the 'age' field out of the two tables into one joined record.
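Since both row values are JSON strings, the heart of the problem is pulling the Id out of the value. The extraction that the Pig script will do with REGEX_EXTRACT can be sketched in plain Java, with no HBase needed (class and method names here are just for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdExtract {
    // Same pattern the Pig script uses: grab everything between
    // "Id": and the next comma. Note the captured group keeps the quotes.
    static final Pattern ID = Pattern.compile("\"Id\":(.*?),");

    static String extractId(String json) {
        Matcher m = ID.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String row1 = "{\"G\":\"M\",\"Id\":\"12\",\"name\":\"somename\"}";
        String row2 = "{\"v\":\"v1\",\"Id\":\"12\",\"age\":\"22\"}";
        System.out.println(extractId(row1)); // prints "12" (quotes included)
        System.out.println(extractId(row2)); // prints "12"
    }
}
```

Because the quotes are kept on both sides, the extracted keys still compare equal, so the join works; just be aware the captured value is `"12"`, not `12`.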

Here's a Pig script to do a simple join of the two tables:


set hbase.zookeeper.quorum 'xx.xx.xxx.xxx';

-- ZooKeeper quorum address of your HBase cluster. If running standalone, remove this line.

set default_parallel 24; 

-- Number of parallel reduce tasks for the operation. If running standalone, remove this line.

table1 = LOAD 'hbase://table1' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('cf:a') as (a:chararray);

-- For every row, apply a regular expression to extract the value of the Id field.

data_parsed = FOREACH table1 GENERATE REGEX_EXTRACT(a,'\\"Id\\":(.*?),',1) as Id, a as action;


-- Load the Id field of table2.

content_table = LOAD 'hbase://table2' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('cf:b') as (Id:chararray); 

-- If joining on extra fields, they should be listed here as well.


-- do the join of two tables using Id

joined = JOIN data_parsed BY Id, content_table BY Id;

-- Store the result into HDFS.

STORE joined INTO '/tmp/pig-join' USING PigStorage(); 


If you want to store the result back into HBase rather than HDFS, replace the last STORE line with the one below. Make sure the target table has already been created.

STORE joined INTO 'hbase://Join-Table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:a-1');
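Conceptually, the JOIN in the script above is just a hash join on Id. Pig actually executes it as a reduce-side join over MapReduce, but a minimal in-memory sketch in plain Java shows the idea (illustrative only; the maps stand in for the parsed name/age fields from the two tables):

```java
import java.util.HashMap;
import java.util.Map;

public class HashJoinSketch {
    // Inner join of two maps keyed by Id: build a lookup from one side,
    // probe it with the other, keep only the Ids present in both.
    static Map<String, String[]> join(Map<String, String> names,
                                      Map<String, String> ages) {
        Map<String, String[]> out = new HashMap<>();
        for (Map.Entry<String, String> e : names.entrySet()) {
            String age = ages.get(e.getKey()); // probe the other side by Id
            if (age != null) { // inner join: unmatched Ids are dropped
                out.put(e.getKey(), new String[] { e.getValue(), age });
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> names = new HashMap<>();
        names.put("12", "somename");
        Map<String, String> ages = new HashMap<>();
        ages.put("12", "22");
        ages.put("99", "40"); // no match in names, dropped by the join

        Map<String, String[]> joined = join(names, ages);
        System.out.println(joined.get("12")[0] + " is " + joined.get("12")[1]);
        // prints: somename is 22
    }
}
```

The real join can't be done in memory like this once the tables are large, which is exactly why Pig ships the work out to reducers keyed on Id.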

Happy Coding! Leave a comment if you have any questions.



HBase Tutorial - Read/Write records using Get/Put


For the past few weeks I've been working on Hadoop/HBase, and I wanted to share how simple it is to use HBase as a non-relational, distributed database.


Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and they can be accessed through the vast Java API as well as through the Thrift gateway APIs (Python, etc.).


Note: This assumes you already know how to create a Java project in Eclipse and have one set up.

Go over to the quickstart guide provided here. Make sure HBase is running locally (standalone mode). Once you know how it works and how it fits your use case, you can move on to pseudo-distributed mode.


Put Operation using Java API

Make sure the Java project you created in Eclipse has the HBase and Hadoop jars on its build path:

Right-click on your Java project -> Build Path -> Configure Build Path -> Libraries -> Add External JARs -> add the relevant HBase and Hadoop jars.

Now you are good to go.

Go to your HBase shell and create a table called 'testtable'.


hbase(main):000:0> create 'testtable', 'cf'
0 row(s) in 1.0970 seconds

Copy/Paste the code:


 import org.apache.hadoop.conf.Configuration;  
 import org.apache.hadoop.hbase.HBaseConfiguration;  
 import org.apache.hadoop.hbase.client.HTable;  
 import org.apache.hadoop.hbase.client.Put;  
 import org.apache.hadoop.hbase.util.Bytes;  
 import java.io.IOException;  

 public class PutExample {  

  public static void main(String[] args) throws IOException {  

   Configuration conf = HBaseConfiguration.create(); //create a configuration object.  

   HTable table = new HTable(conf, "testtable"); // testtable is name of the table  

   Put put = new Put(Bytes.toBytes("row1")); // insert a row with row1 as the key  

   put.add(Bytes.toBytes("cf"), Bytes.toBytes("a"),  
    Bytes.toBytes("val1")); // column family -> "cf" & column qualifier "a". Value is val1.

   put.add(Bytes.toBytes("cf"), Bytes.toBytes("b"),  
    Bytes.toBytes("val2")); // column family -> "cf" & column qualifier "b". Value is val2. 

   table.put(put); // insert the put object into HBase

   table.close(); // release the table's resources

   System.out.println("Success!");
  }
 }

You should see "Success!" printed if everything goes right!
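A note on all those Bytes.toBytes calls: HBase stores everything, keys and values alike, as raw byte arrays, and for strings Bytes.toBytes/Bytes.toString are essentially UTF-8 encode/decode. You can see the round trip in plain Java without any HBase jars:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BytesRoundTrip {
    public static void main(String[] args) {
        // What Bytes.toBytes("val1") produces: the UTF-8 encoding of the string.
        byte[] raw = "val1".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(raw)); // [118, 97, 108, 49]

        // What Bytes.toString(raw) gives back: the bytes decoded as UTF-8.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.println(decoded); // val1
    }
}
```

This is why the Get example below has to convert the result back with Bytes.toString before printing it.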


Get Example using Java API

Now we will retrieve the row we just put into HBase.

Copy/Paste the following code:
 import org.apache.hadoop.conf.Configuration;  
 import org.apache.hadoop.hbase.HBaseConfiguration;  
 import org.apache.hadoop.hbase.client.Get;  
 import org.apache.hadoop.hbase.client.HTable;  
 import org.apache.hadoop.hbase.client.Result;  
 import org.apache.hadoop.hbase.util.Bytes;  
 import java.io.IOException;  
 public class GetExample {  
  public static void main(String[] args) throws IOException {  
   Configuration conf = HBaseConfiguration.create(); // create a configuration object  
   HTable table = new HTable(conf, "testtable"); // testtable is the name of the table  
   Get get = new Get(Bytes.toBytes("row1")); // get object initialized with row1 as the key  

   get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a"));   // add column family and column qualifier to the get object
   Result result = table.get(get);   

   byte[] val = result.getValue(Bytes.toBytes("cf"),  
    Bytes.toBytes("a")); // getValue returns the raw byte array  

   System.out.println("Value: " + Bytes.toString(val)); // convert the byte array into a String  

   table.close();
  }  
 }  



Happy Coding!! 

HBase is very easy to play around with. Comment below if you have any questions. I'm just getting started with the tutorials. Watch out for more :)