If you have a binary model generated by Google's awesome and super fast word2vec word-embedding tool, you can easily convert it to a text representation of the word vectors using Python and gensim.
Input: binary word embedding model from Google's word2vec tool
Output: text vectors for word embeddings
Python conversion code:
from gensim.models import word2vec
model = word2vec.Word2Vec.load_word2vec_format('path/to/mymodel.bin', binary=True)
model.save_word2vec_format('path/to/mymodel.txt', binary=False)
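If you are on a newer gensim release (1.0 or later), the loading code has moved to KeyedVectors; a roughly equivalent sketch, assuming the same file paths as above:
from gensim.models import KeyedVectors
# Load the binary vectors and write them back out as plain text (gensim >= 1.0 API)
wv = KeyedVectors.load_word2vec_format('path/to/mymodel.bin', binary=True)
wv.save_word2vec_format('path/to/mymodel.txt', binary=False)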
I recommend using Anaconda from Continuum Analytics as a bundled Python distribution. To install gensim in Anaconda, just type: conda install gensim :)
Original ref: https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/13828/how-to-convert-bin-file-of-word2vec-model-into-txt-r/91564
Sunday, November 1, 2015
Thursday, October 22, 2015
Analyse consecutive timeseries pairs in Spark
With time series data it is often useful to analyse consecutive elements as pairs.
Here is one way to construct a Spark RDD from a time series so that each element ends up paired with the next one in the final RDD:
val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E"))
val rdd = sc.parallelize(arr)
val sorted = rdd.sortByKey(true) // sort by the time key
val zipped = sorted.zipWithIndex.map(x => (x._2, x._1)) // key each element by its position in the sorted order
val pairs = zipped.join(zipped.map(x => (x._1 - 1, x._2))).sortBy(_._1) // join position i with position i + 1
This produces the consecutive elements as pairs in the RDD, ready for further processing:
(0,((1,A),(3,B)))
(1,((3,B),(7,C)))
(2,((7,C),(8,D)))
(3,((8,D),(9,E)))
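As an example of that further processing (just a sketch, treating the integer keys as timestamps), you can map over the pairs to get the gap between each element and the next:
// gap between consecutive keys, indexed by pair position
val gaps = pairs.map { case (idx, ((t1, _), (t2, _))) => (idx, t2 - t1) }
gaps.collect().foreach(println) // (0,2), (1,4), (2,1), (3,1)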
Ref: https://www.mail-archive.com/user@spark.apache.org/msg39353.html
Tuesday, August 11, 2015
Data Science Blogs around the Interwebs
So many great data science/machine learning blogs, so little time!
Company Blogs
http://googleresearch.blogspot.com.au/
https://research.facebook.com/blog/ai/
https://research.facebook.com/blog/datascience/
https://blog.twitter.com/data
http://data.quora.com/
https://engineering.linkedin.com/big-data
http://blog.linkedin.com/topic/economic-graph/
http://www.ebaytechblog.com/category/machine-learning/
http://blogs.aws.amazon.com/bigdata/
http://nerds.airbnb.com/data/
https://codeascraft.com/category/data/ (Etsy)
http://blog.kaggle.com/
Personal Blogs
http://blog.echen.me/ (Edward Chen)
http://fastml.com/ (Zygmunt Zając)
http://flowingdata.com/ (Nathan Yau)
http://simplystatistics.org/ (Jeff Leek, Roger Peng, and Rafa Irizarry)
http://hunch.net/ (John Langford et al.)
http://www.walkingrandomly.com/ (Mike Croucher)
Aggregators and other links
http://www.r-bloggers.com/
https://www.reddit.com/r/machinelearning
http://www.datatau.com/
http://stats.stackexchange.com/
Thursday, July 30, 2015
Ubuntu 14 Classic Desktop set up
Whenever I set up an Ubuntu VM I want to fall back to a lighter desktop with fewer effects. Here's how to install the classic GNOME desktop and make the newer Unity desktop optional at login.
Ref: http://www.howtogeek.com/189912/how-to-install-the-gnome-classic-desktop-in-ubuntu-14.04/
1) sudo apt-get update; sudo apt-get install gnome-session-fallback
2) Logout
3) Login with Metacity icon option
Happy days.
Tuesday, June 30, 2015
Scripted backup of all SQL Server DBs
This is a handy script to back up all (or only specified) DBs to disk, with a date stamp and the environment name in each filename.
Thanks Greg Robidoux!
Original ref: http://www.mssqltips.com/sqlservertip/1070/simple-script-to-backup-all-sql-server-databases/
--Simple script to backup all SQL Server databases
--http://www.mssqltips.com/sqlservertip/1070/simple-script-to-backup-all-sql-server-databases/
DECLARE @name VARCHAR(50) -- database name
DECLARE @path VARCHAR(256) -- path for backup files
DECLARE @fileName VARCHAR(256) -- filename for backup
DECLARE @fileDate VARCHAR(20) -- used for file name
DECLARE @env VARCHAR(20) = 'Local'
-- specify database backup directory
SET @path = 'C:\Backup\'
-- specify filename format
SELECT @fileDate = CONVERT(VARCHAR(20),GETDATE(),112)
print @fileDate
DECLARE db_cursor CURSOR FOR
SELECT name FROM master.dbo.sysdatabases WHERE name NOT IN ('master','model','msdb','tempdb','ReportServer','ReportServerTempDB') -- Exclude these databases
--SELECT name FROM master.dbo.sysdatabases WHERE name IN ('MyDB1','MyDB2','MyDB3') -- Or, only include these databases
OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @name
WHILE @@FETCH_STATUS = 0
BEGIN
SET @fileName = @path + @name + '_' + @env + '_' + @fileDate + '.bak'
print 'Backing up ' + @name + ' to ' + @fileName
BACKUP DATABASE @name TO DISK = @fileName
print 'Finished ' + @name
FETCH NEXT FROM db_cursor INTO @name
END
CLOSE db_cursor
DEALLOCATE db_cursor
Thursday, May 14, 2015
VirtualBox Host to Guest & Guest to Host networking
The simplest explanation of how to enable Host to Guest & Guest to Host networking with VirtualBox that I've seen:
"Give the guest two network adapters, one NAT and the other Host-only.
The NAT one will allow the guest to see the Internet, and the Host-only one will allow the host to see the guest."
Thanks Matthew!
Source:
http://stackoverflow.com/questions/61156/virtualbox-host-guest-network-setup
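If you would rather script this than click through the GUI, here is a rough VBoxManage sketch (run with the VM powered off; the VM name and host-only interface name are placeholders - check yours with VBoxManage list hostonlyifs):
VBoxManage modifyvm "MyGuestVM" --nic1 nat
VBoxManage modifyvm "MyGuestVM" --nic2 hostonly --hostonlyadapter2 vboxnet0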
Bonus points - mounting a Windows host shared folder in a Linux guest:
sudo mkdir ~/_shared/
sudo mount -t vboxsf <Insert Exact Virtualbox Shared Folder Name> ~/_shared/
(or to mount as non-root user:)
sudo mount -t vboxsf -o rw,uid=1000,gid=1000 <Insert Exact Virtualbox Shared Folder Name> ~/_shared/
Thursday, April 23, 2015
SQL Server script to Bulk Insert from a CSV file into a table
The following SQL Server script uses BULK INSERT to load a CSV file into a target table via a staging table.
Here are the steps:
- Create a raw staging table with all varchar columns
- Import from the CSV into the raw staging table
- Create the target table with typed columns
- Copy the raw staging table into the target table
-- ================================================
PRINT '1 CREATE RAW STAGING TABLE WITH ALL VARCHAR COLUMNS'
CREATE TABLE [dbo].[MyTableName_BULK_INSERT](
[SomeStringColumn] [nvarchar](255) NULL,
[SomeIntColumn] [nvarchar](255) NULL
) ON [PRIMARY]
-- ================================================
PRINT '2 IMPORTING FROM CSV TO RAW STAGING TABLE'
BULK INSERT MyTableName_BULK_INSERT
FROM 'C:\Temp\MySourceFile.csv'
WITH
(
FIRSTROW = 2, -- If has a header row
FIELDTERMINATOR = '\t', -- tab-delimited source here; change to ',' for a true comma-separated file
ROWTERMINATOR = '\n',
MAXERRORS = 0,
ERRORFILE = 'C:\Temp\Bulk_Insert_Errors.log',
CODEPAGE = 'ACP',
DATAFILETYPE = 'widechar'
)
-- select count(*) from MyTableName_BULK_INSERT
-- select top 10 * from MyTableName_BULK_INSERT
-- ================================================
PRINT '3 CREATE TARGET TABLE WITH TYPED COLUMNS'
CREATE TABLE [dbo].[MyTableName](
[SomeStringColumn] [varchar](255) NOT NULL,
[SomeIntColumn] [int] NOT NULL
CONSTRAINT [PK_MyTableName] PRIMARY KEY CLUSTERED
([SomeStringColumn] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
)
-- ================================================
PRINT '4 COPY RAW STAGING TABLE TO TARGET TABLE'
INSERT INTO [MyTableName]
([SomeStringColumn]
,[SomeIntColumn])
(SELECT
[SomeStringColumn]
,[SomeIntColumn]
FROM [MyTableName_BULK_INSERT])
-- select count(*) from [MyTableName]
-- select top 10 * from [MyTableName]
-- ================================================
PRINT 'FINISHED'
Wednesday, April 22, 2015
Using bcp command to export and import a SQL Server table between databases
Here's how to use the bcp command-line tool to export a SQL Server table to disk and then import it into another table, which may be in a different database or on a different server.
1) Script the table definition and create in the target database (Right click table > Script table as)
CREATE TABLE [dbo].[MyExampleTable](
[Id] [uniqueidentifier] NOT NULL,
[DateCreated] [datetime] NULL,
[DateUpdated] [datetime] NULL,
[OtherExampleColumn] [varchar](100) NULL
)
2) Export source table to disk using bcp:
bcp MySourceDatabaseName.dbo.MyExampleTable out C:\SomeFolder\MyExampleTable.dat -c -t, -S localhost -T
3) Import file into target SQL table using bcp:
bcp MyTargetDatabaseName.dbo.MyExampleTable in C:\SomeFolder\MyExampleTable.dat -c -t, -S localhost -T
For further options and details, see this link:
https://www.simple-talk.com/sql/database-administration/working-with-the-bcp-command-line-utility/
Tuesday, January 27, 2015
Reading mongodump bson file from Spark in scala using mongo-hadoop
I couldn't find a complete Scala version using mongo-hadoop v1.3.1 to read a mongodump bson file, so here's one I prepared earlier:
val bsonData = sc.newAPIHadoopFile(
"file:///your/file.bson",
classOf[com.mongodb.hadoop.BSONFileInputFormat].asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, org.bson.BSONObject]]),
classOf[Object],
classOf[org.bson.BSONObject])
Note that (for v1.3.1) we need to subclass com.mongodb.hadoop.BSONFileInputFormat to avoid this compilation error: "inferred type arguments do not conform to method newAPIHadoopFile's type parameter bounds". This isn't required if reading from Mongo directly using com.mongodb.hadoop.MongoInputFormat.
Also, you can pass a Configuration object as a final parameter if you need to set any specific conf values.
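For example, here is a minimal sketch of passing a conf through (the split-size key below is just a standard Hadoop setting used for illustration, not something mongo-hadoop requires):
import org.apache.hadoop.conf.Configuration
val conf = new Configuration()
// illustrative only: cap input splits at 64 MB when splitting the BSON file
conf.set("mapreduce.input.fileinputformat.split.maxsize", "67108864")
val bsonData = sc.newAPIHadoopFile(
"file:///your/file.bson",
classOf[com.mongodb.hadoop.BSONFileInputFormat].asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, org.bson.BSONObject]]),
classOf[Object],
classOf[org.bson.BSONObject],
conf)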
For more bson examples see here: https://github.com/mongodb/mongo-hadoop/blob/master/BSON_README.md
For Java examples see here: http://crcsmnky.github.io/2014/07/13/mongodb-spark-input/
Tuesday, January 6, 2015
How to access HBase from spark-shell using YARN as the master on CDH 5.3 and Spark 1.2
From a terminal:
# export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/etc/hbase/conf/hbase-site.xml
# spark-shell --master yarn-client
Now you can access HBase from the Spark shell prompt:
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
val tableName = "My_HBase_Table_Name"
val hconf = HBaseConfiguration.create()
hconf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(hconf)
if (!admin.isTableAvailable(tableName)) {
val tableDesc = new HTableDescriptor(tableName)
admin.createTable(tableDesc)
}
val hBaseRDD = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val result = hBaseRDD.count()
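As a small follow-on sketch, you can also pull data back out of the Result objects; the column family "cf" and qualifier "col1" below are placeholders, so substitute your own:
import org.apache.hadoop.hbase.util.Bytes
// row key plus one cell value per row ("cf" and "col1" are placeholder names)
val rows = hBaseRDD.map { case (_, result) =>
  val rowKey = Bytes.toString(result.getRow)
  val value = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1")))
  (rowKey, value)
}
rows.take(10).foreach(println)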
Thanks to these refs for pointers:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/44744
http://apache-spark-user-list.1001560.n3.nabble.com/HBase-and-non-existent-TableInputFormat-td14370.html