Tuesday, January 6, 2015

How to access HBase from spark-shell using YARN as the master on CDH 5.3 and Spark 1.2


From terminal:

# export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/etc/hbase/conf/hbase-site.xml

# spark-shell --master yarn-client


Now you can access HBase from the Spark shell prompt:

import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Name of the HBase table to read
val tableName = "My_HBase_Table_Name"

// Create an HBase configuration; it picks up hbase-site.xml from the classpath
val hconf = HBaseConfiguration.create()
hconf.set(TableInputFormat.INPUT_TABLE, tableName)

// Create the table if it doesn't already exist
val admin = new HBaseAdmin(hconf)
if (!admin.isTableAvailable(tableName)) {
  val tableDesc = new HTableDescriptor(tableName)
  admin.createTable(tableDesc)
}

// Load the table as an RDD of (row key, row result) pairs
val hBaseRDD = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

// Count the rows to verify the connection works
val result = hBaseRDD.count()
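Once the count succeeds, you can inspect individual rows from the same shell session. Here's a rough sketch of decoding a column with the standard HBase Bytes helper; the column family "cf" and qualifier "col" are placeholders, so substitute names that actually exist in your table:

```scala
import org.apache.hadoop.hbase.util.Bytes

// Pull a handful of rows and decode one column per row.
// "cf" and "col" are hypothetical; replace them with a real
// column family and qualifier from your table.
hBaseRDD.take(5).foreach { case (rowKey, row) =>
  val key = Bytes.toString(rowKey.get())
  // getValue returns null when the row has no such cell
  val value = Option(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
    .map(Bytes.toString)
    .getOrElse("<missing>")
  println(s"$key -> $value")
}
```

Note that take(5) pulls rows to the driver, so this is fine for a quick sanity check but not for scanning a large table.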


Thanks to these refs for pointers:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/44744
http://apache-spark-user-list.1001560.n3.nabble.com/HBase-and-non-existent-TableInputFormat-td14370.html

Comments:

  1. Hi Dylan,
    Thanks for the guide, really helpful. A note for future developers: since CDH 5.4.0 you need to put the incubating version of htrace-core (at least 3.1.0) on the classpath instead of the symlinked 3.0.4, because HTrace moved to Apache (http://htrace.incubator.apache.org/) and the class org.htrace.Trace is now org.apache.htrace.Trace (you get a NoClassDefFoundError otherwise). So instead of /opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar, take /opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar. A future version of CDH may fix this so that everything works with the "standard" symlinked jar again; in the meantime, use the explicit one.

    Bye,
    Michele

  2. Hi Dylan,
    I used the spark-shell --jars option and it still failed. Following your suggestion, export SPARK_CLASSPATH works. Do you know the difference between the spark-shell --jars option and SPARK_CLASSPATH?

    Thank you!

  3. Thanks for the post. I used it as a skeleton to do the same on Hortonworks.