Tuesday, January 27, 2015

Reading a mongodump BSON file from Spark in Scala using mongo-hadoop

I couldn't find a complete Scala example that uses mongo-hadoop v1.3.1 to read a mongodump BSON file, so here's one I prepared earlier:

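// sc is the SparkContext (already defined in spark-shell)
val bsonData = sc.newAPIHadoopFile(
  "file:///your/file.bson",
  // cast so the key/value type parameters line up (see the note below)
  classOf[com.mongodb.hadoop.BSONFileInputFormat].asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, org.bson.BSONObject]]),
  classOf[Object],               // key type
  classOf[org.bson.BSONObject])  // value type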
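
The result is an RDD of (key, BSONObject) pairs, one per document in the dump. As a rough sketch of what you might do next, the helper below pulls one field out of every document as a string. The name fieldValues and the field "name" in the usage line are made up for illustration; substitute a field from your own data.

```scala
import org.apache.spark.rdd.RDD
import org.bson.BSONObject

// Hypothetical helper: returns the given field of each document as a String,
// skipping documents where the field is missing or null.
def fieldValues(docs: RDD[(Object, BSONObject)], field: String): RDD[String] =
  docs.flatMap { case (_, doc) =>
    Option(doc.get(field)).map(_.toString).toList
  }
```

For example, fieldValues(bsonData, "name").take(5).foreach(println) would print the first few values, assuming your documents have such a field.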


Note that (for v1.3.1) we need to cast the com.mongodb.hadoop.BSONFileInputFormat class with asSubclass, as in the snippet above, to avoid this compilation error: "inferred type arguments do not conform to method newAPIHadoopFile's type parameter bounds". The cast isn't required when reading from MongoDB directly using com.mongodb.hadoop.MongoInputFormat.
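
For comparison, here is a minimal sketch of the direct read with MongoInputFormat, which needs no cast. The function name readFromMongo is my own, and the URI you pass in is whatever points at your database and collection; nothing here has been tested against a particular server version.

```scala
import com.mongodb.hadoop.MongoInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.bson.BSONObject

// Hypothetical helper: reads a collection straight from MongoDB.
// uri looks like "mongodb://host:port/database.collection" (placeholder values).
def readFromMongo(sc: SparkContext, uri: String): RDD[(Object, BSONObject)] = {
  val conf = new Configuration()
  conf.set("mongo.input.uri", uri) // tells mongo-hadoop where to read from
  sc.newAPIHadoopRDD(
    conf,
    classOf[MongoInputFormat], // no asSubclass needed here
    classOf[Object],
    classOf[BSONObject])
}
```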

Also, you can pass a Hadoop Configuration object as the final parameter to newAPIHadoopFile if you need to set any specific conf values.
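
A sketch of how that might look, with the same cast as before. The helpers buildConf and readBson are names I made up, and the settings are passed in as a plain Map so that nothing here assumes which keys your version of mongo-hadoop actually honours.

```scala
import com.mongodb.hadoop.BSONFileInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.bson.BSONObject

// Hypothetical helper: copies arbitrary key/value settings into a Configuration.
def buildConf(settings: Map[String, String]): Configuration = {
  val conf = new Configuration()
  settings.foreach { case (key, value) => conf.set(key, value) }
  conf
}

// Hypothetical helper: same read as before, with the Configuration passed last.
def readBson(sc: SparkContext, path: String,
             settings: Map[String, String]): RDD[(Object, BSONObject)] =
  sc.newAPIHadoopFile(
    path,
    classOf[BSONFileInputFormat]
      .asSubclass(classOf[FileInputFormat[Object, BSONObject]]),
    classOf[Object],
    classOf[BSONObject],
    buildConf(settings))
```

Usage would be something like readBson(sc, "file:///your/file.bson", Map("some.key" -> "some value")), where the key is a placeholder for whichever setting you need.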

For more BSON examples see here: https://github.com/mongodb/mongo-hadoop/blob/master/BSON_README.md

For Java examples see here: http://crcsmnky.github.io/2014/07/13/mongodb-spark-input/
