Tuesday, January 6, 2015

How to access HBase from spark-shell using YARN as the master on CDH 5.3 and Spark 1.2


From terminal:

# export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/etc/hbase/conf/hbase-site.xml

# spark-shell --master yarn-client


Now you can access HBase from the Spark shell prompt:

import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Name of the HBase table to read
val tableName = "My_HBase_Table_Name"

// Create an HBase configuration; it picks up hbase-site.xml from the classpath
val hconf = HBaseConfiguration.create()
hconf.set(TableInputFormat.INPUT_TABLE, tableName)

// Create the table if it doesn't already exist
val admin = new HBaseAdmin(hconf)
if (!admin.isTableAvailable(tableName)) {
  val tableDesc = new HTableDescriptor(tableName)
  admin.createTable(tableDesc)
}

// Load the table as an RDD of (row key, row result) pairs
val hBaseRDD = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

// Count the rows to verify the connection works
val result = hBaseRDD.count()
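Once the count succeeds, you can inspect individual rows from the same shell session. Here's a rough sketch of decoding a column with the standard HBase Bytes helper; the column family "cf" and qualifier "col" are placeholders, so substitute names that actually exist in your table:

```scala
import org.apache.hadoop.hbase.util.Bytes

// Pull a handful of rows and decode one column per row.
// "cf" and "col" are hypothetical; replace them with a real
// column family and qualifier from your table.
hBaseRDD.take(5).foreach { case (rowKey, row) =>
  val key = Bytes.toString(rowKey.get())
  // getValue returns null when the row has no such cell
  val value = Option(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
    .map(Bytes.toString)
    .getOrElse("<missing>")
  println(s"$key -> $value")
}
```

Note that take(5) pulls rows to the driver, so this is fine for a quick sanity check but not for scanning a large table.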


Thanks to these refs for pointers:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/44744
http://apache-spark-user-list.1001560.n3.nabble.com/HBase-and-non-existent-TableInputFormat-td14370.html

Comments:

  1. Hi Dylan,
    Thanks for the guide, really helpful. A note for future developers: since CDH 5.4.0 you need to put the incubating version of htrace-core (at least 3.1.0) on the classpath instead of the symlinked 3.0.4, because HTrace moved to Apache (http://htrace.incubator.apache.org/) and the class org.htrace.Trace is now org.apache.htrace.Trace (you get a NoClassDefFoundError otherwise). So instead of /opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar, take /opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar. A future version of CDH may fix this so that everything works with the "standard" symlinked jar again; in the meantime, use the explicit one.

    Bye,
    Michele

  2. Hi Dylan,
    I used the spark-shell --jars option and it still failed. Following your suggestion, export SPARK_CLASSPATH works. Do you know the difference between the spark-shell --jars option and SPARK_CLASSPATH?

    Thank you!

  3. Thanks for the post. I used it as a skeleton to do the same on Hortonworks.