An example in Scala of reading data saved in hbase by Spark and an example of converter for python

GenTang/spark_hbase

spark_hbase

Spark ships its own examples of integrating HBase and Spark: HBaseTest.scala in Scala, and the Python converters in HBaseConverters.scala.

However, the Python converter HBaseResultToStringConverter in HBaseConverters.scala returns only the value of the first column in the result, and HBaseTest.scala stops at obtaining org.apache.hadoop.hbase.client.Result objects and calling .count() on them.

Here we provide a new Scala example that uses Spark to convert data stored in HBase into strings, and a new example of a Python converter.

The Scala example HBaseInput.scala converts the data stored in HBase into an RDD[String] in which each string contains the columnFamily, qualifier, timestamp, type, and value of a cell.

The Python converter example pythonConverters.scala converts the data stored in HBase into strings that contain the same information as the example above. The ast package can then be used to turn each string into a dictionary.
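As a minimal sketch of that last step, the snippet below parses one converter output string with `ast.literal_eval`. The sample string is an assumption modeled on the example results shown in this README (including the `columnFamliy` key spelling that appears there), and `parse_cell` is a hypothetical helper name; the exact string the converter emits may differ.

```python
import ast

# Hypothetical converter output for a single HBase cell (an assumption,
# modeled on the example results in this README).
cell_string = (
    "{'columnFamliy': 'c1', 'qualifier': 'a', "
    "'timestamp': '1420329575846', 'type': 'Put', 'value': 'a1'}"
)


def parse_cell(text):
    """Turn a dict-literal string into a Python dict without calling eval()."""
    cell = ast.literal_eval(text)
    if not isinstance(cell, dict):
        raise ValueError("expected a dict literal, got %s" % type(cell).__name__)
    return cell


cell = parse_cell(cell_string)
print(cell["qualifier"], cell["value"])
```

`ast.literal_eval` accepts only Python literals, which makes it a safer choice than `eval` for strings read from external storage.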

How to run

  1. Make sure that git is installed and configured
  2. Download this application by
  $ git clone https://github.com/GenTang/spark_hbase.git
  3. Build the assembly by using SBT assembly
  $ <the path to spark_hbase>/sbt/sbt clean assembly
  • Run the example Python script hbase_input.py, which uses the Python converters ImmutableBytesWritableToStringConverter and HBaseResultToStringConverter to convert the data in HBase to dictionaries

    • If you are using SPARK_CLASSPATH:

      1. Add export SPARK_CLASSPATH=$SPARK_CLASSPATH":<the path to hbase>/lib/*:<the path to spark_hbase>/target/scala-2.10/spark_hbase-assembly-1.0.jar" to ./conf/spark-env.sh.

      2. Launch the script by

      $ ./bin/spark-submit <the path to hbase_input.py> \
         <host> <table> <column>
    • You can also use spark.executor.extraClassPath and --driver-class-path (recommended):

      1. Add spark.executor.extraClassPath <the path to hbase>/lib/* to spark-defaults.conf.

      2. Launch the script by

       $ ./bin/spark-submit \
          --driver-class-path <the path to spark_hbase>/target/scala-2.10/spark_hbase-assembly-1.0.jar \
          <the path to hbase_input.py> \
          <host> <table> <column>
  • Run the example Scala script HBaseInput.scala

    • If you are using SPARK_CLASSPATH:

      1. Add export SPARK_CLASSPATH=$SPARK_CLASSPATH":<the path to hbase>/lib/*" to ./conf/spark-env.sh.

      2. Launch the script by

      $ ./bin/spark-submit \
         --class examples.HBaseInput \
         <the path to spark_hbase>/target/scala-2.10/spark_hbase-assembly-1.0.jar \
         <host> <table> 
    • You can also use spark.executor.extraClassPath and --driver-class-path (recommended):

      1. Use the same spark.executor.extraClassPath setting in spark-defaults.conf as above.

      2. Launch the script by

      $ ./bin/spark-submit \
         --driver-class-path <the path to hbase>/lib/* \
         --class examples.HBaseInput \
         <the path to spark_hbase>/target/scala-2.10/spark_hbase-assembly-1.0.jar \
         <host> <table> 
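The Python path above comes down to a `newAPIHadoopRDD` call that names the two converters. The following is a rough sketch under stated assumptions, not the contents of hbase_input.py: the helper names `build_hbase_conf` and `read_table` are hypothetical, the `examples.pythonConverters` package prefix is a guess based on the `examples.HBaseInput` class name used above, and treating `<column>` as a column family is also an assumption. The configuration keys and Hadoop class names are the ones read by HBase's `TableInputFormat`.

```python
# Assumed package of the converters built into the assembly jar.
CONVERTER_PACKAGE = "examples.pythonConverters"


def build_hbase_conf(host, table, column_family=None):
    """Build the Hadoop configuration dictionary that TableInputFormat reads."""
    conf = {
        "hbase.zookeeper.quorum": host,
        "hbase.mapreduce.inputtable": table,
    }
    if column_family is not None:
        # Restrict the scan to one column family (assumed meaning of <column>).
        conf["hbase.mapreduce.scan.column.family"] = column_family
    return conf


def read_table(sc, host, table, column_family=None):
    """Return an RDD of (row key, converted cell string) pairs.

    `sc` is an existing pyspark SparkContext; the converter classes must be
    on the driver and executor classpaths as described above.
    """
    return sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=CONVERTER_PACKAGE + ".ImmutableBytesWritableToStringConverter",
        valueConverter=CONVERTER_PACKAGE + ".HBaseResultToStringConverter",
        conf=build_hbase_conf(host, table, column_family),
    )
```

Inside a script submitted with spark-submit, this would be called as `read_table(sc, "localhost", "test", "c1")` once a SparkContext has been created.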

Example of results

Assume that you already have some data in HBase as follows:

hbase(main):028:0> scan "test"
ROW                          COLUMN+CELL
 r1                          column=c1:a, timestamp=1420329575846, value=a1
 r1                          column=c1:b, timestamp=1420329640962, value=b1
 r2                          column=c1:a, timestamp=1420329683843, value=a2
 r3                          column=c1:,  timestamp=1420329810504, value=3

By launching $ ./bin/spark-submit --driver-class-path <the path to spark_hbase>/target/scala-2.10/spark_hbase-assembly-1.0.jar <the path to hbase_input.py> localhost test c1, you will get

 (u'r1', {'columnFamliy': 'c1', 'timestamp': '1420329575846', 'type': 'Put', 'qualifier': 'a', 'value': 'a1'}) 
 (u'r1', {'columnFamliy': 'c1', 'timestamp': '1420329640962', 'type': 'Put', 'qualifier': 'b', 'value': 'b1'}) 
 (u'r2', {'columnFamliy': 'c1', 'timestamp': '1420329683843', 'type': 'Put', 'qualifier': 'a', 'value': 'a2'}) 
 (u'r3', {'columnFamliy': 'c1', 'timestamp': '1420329810504', 'type': 'Put', 'qualifier': '', 'value': '3'})
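Because each record is a (row key, cell dictionary) pair, the cells that belong to the same row can be regrouped on the Python side. The sketch below does this with the records above copied in as plain Python data; the helper name `group_by_row` is hypothetical, and the `columnFamliy` key is spelled as it appears in the output.

```python
from collections import defaultdict

# The example results above, written out as plain Python data.
records = [
    ("r1", {"columnFamliy": "c1", "timestamp": "1420329575846", "type": "Put", "qualifier": "a", "value": "a1"}),
    ("r1", {"columnFamliy": "c1", "timestamp": "1420329640962", "type": "Put", "qualifier": "b", "value": "b1"}),
    ("r2", {"columnFamliy": "c1", "timestamp": "1420329683843", "type": "Put", "qualifier": "a", "value": "a2"}),
    ("r3", {"columnFamliy": "c1", "timestamp": "1420329810504", "type": "Put", "qualifier": "", "value": "3"}),
]


def group_by_row(pairs):
    """Collect cells into {row key: {"family:qualifier": value}}."""
    rows = defaultdict(dict)
    for row_key, cell in pairs:
        column = cell["columnFamliy"] + ":" + cell["qualifier"]
        rows[row_key][column] = cell["value"]
    return dict(rows)


rows = group_by_row(records)
print(rows["r1"])
```

The column names produced this way match the `column=` field of the HBase shell scan, for example `c1:a` for row `r1` and `c1:` for the empty qualifier in row `r3`.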
