Sunday, November 1, 2015

Convert binary word2vec model to text vectors

If you have a binary model generated by Google's awesome and super fast word2vec word embedding tool, you can easily use Python with gensim to convert it to a text representation of the word vectors.

Input: binary word embedding model from Google's word2vec tool

Output: text vectors for word embeddings

Python conversion code:
from gensim.models import word2vec
model = word2vec.Word2Vec.load_word2vec_format('path/to/mymodel.bin', binary=True)
model.save_word2vec_format('path/to/mymodel.txt', binary=False)
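
(Side note: on newer gensim releases the same loader lives on KeyedVectors rather than Word2Vec - a minimal equivalent sketch, assuming gensim 1.0+:)
from gensim.models import KeyedVectors
kv = KeyedVectors.load_word2vec_format('path/to/mymodel.bin', binary=True)  # load the binary model
kv.save_word2vec_format('path/to/mymodel.txt', binary=False)                # write it back out as text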

I recommend using Anaconda from Continuum Analytics for a bundled Python distribution.  To install gensim in Anaconda, just type: conda install gensim :)


Original ref: https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/13828/how-to-convert-bin-file-of-word2vec-model-into-txt-r/91564

Thursday, October 22, 2015

Analyse consecutive time series pairs in Spark

With time series data it is often useful to construct pairs of consecutive elements for analysis.

Here is a way to construct a Spark RDD from a time series that has the pairs together in the final RDD:


val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E"))
val rdd = sc.parallelize(arr)
val sorted = rdd.sortByKey(true)  // order by the time-like key
val zipped = sorted.zipWithIndex.map(x => (x._2, x._1))  // re-key each element by its position: (index, element)
val pairs = zipped.join(zipped.map(x => (x._1 - 1, x._2))).sortBy(_._1)  // shift indices by one and join, pairing element i with element i+1


This produces the consecutive elements as pairs in the RDD, ready for further processing:
(0,((1,A),(3,B)))
(1,((3,B),(7,C)))
(2,((7,C),(8,D)))
(3,((8,D),(9,E)))
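
As a quick example of that further processing (a hypothetical sketch - the "delta" here is just the difference between consecutive keys):

val deltas = pairs.map { case (_, ((k1, _), (k2, _))) => k2 - k1 }  // gap between neighbouring keys
deltas.collect().foreach(println)  // prints 2, 4, 1, 1 for the data above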


Ref: https://www.mail-archive.com/user@spark.apache.org/msg39353.html

Thursday, July 30, 2015

Ubuntu 14 Classic Desktop set up

Whenever I set up a VM with Ubuntu I want to fall back to a lighter desktop with fewer effects.  Here's how to install the classic GNOME desktop and make it selectable alongside Unity at login.

Ref: http://www.howtogeek.com/189912/how-to-install-the-gnome-classic-desktop-in-ubuntu-14.04/

1) sudo apt-get update; sudo apt-get install gnome-session-fallback

2) Log out

3) Log back in, selecting the GNOME Flashback (Metacity) session from the session menu

Happy days.

Tuesday, June 30, 2015

Scripted backup of all SQL Server DBs

This is a handy script to back up all (or specified) databases to disk, with a date stamp and environment name in each filename.

Thanks Greg Robidoux!
Original ref: http://www.mssqltips.com/sqlservertip/1070/simple-script-to-backup-all-sql-server-databases/


--Simple script to backup all SQL Server databases
--http://www.mssqltips.com/sqlservertip/1070/simple-script-to-backup-all-sql-server-databases/

DECLARE @name VARCHAR(50) -- database name  
DECLARE @path VARCHAR(256) -- path for backup files  
DECLARE @fileName VARCHAR(256) -- filename for backup  
DECLARE @fileDate VARCHAR(20) -- used for file name

DECLARE @env VARCHAR(20) = 'Local' -- environment name used in the backup filename

-- specify database backup directory
SET @path = 'C:\Backup\'

-- specify filename format
SELECT @fileDate = CONVERT(VARCHAR(20),GETDATE(),112) 
print @fileDate

DECLARE db_cursor CURSOR FOR  
SELECT name FROM master.dbo.sysdatabases  WHERE name NOT IN ('master','model','msdb','tempdb','ReportServer','ReportServerTempDB')  -- Exclude these databases
--SELECT name FROM master.dbo.sysdatabases  WHERE name IN ('MyDB1','MyDB2','MyDB3')  -- Or, only include these databases

OPEN db_cursor   
FETCH NEXT FROM db_cursor INTO @name   

WHILE @@FETCH_STATUS = 0   
BEGIN   
       SET @fileName = @path + @name + '_' + @env + '_' + @fileDate + '.bak'  
       print 'Backing up ' + @name + ' to ' + @fileName  
       
       BACKUP DATABASE @name TO DISK = @fileName
       
       print 'Finished ' + @name
       FETCH NEXT FROM db_cursor INTO @name   
END   

CLOSE db_cursor   
DEALLOCATE db_cursor
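
For example, with the defaults above a backup of a database called MyDB1 taken on 30 June 2015 would be written to C:\Backup\MyDB1_Local_20150630.bak (CONVERT style 112 gives the yyyymmdd date stamp).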

Thursday, May 14, 2015

VirtualBox Host to Guest & Guest to Host networking

The simplest explanation of how to enable Host to Guest & Guest to Host networking with VirtualBox that I've seen:

"Give the guest two network adapters, one NAT and the other Host-only.

The NAT one will allow the guest to see the Internet, and the Host-only one will allow the host to see the guest."

Thanks Matthew!

Source:
http://stackoverflow.com/questions/61156/virtualbox-host-guest-network-setup


Bonus points - mounting a Windows host shared folder in a Linux guest:
sudo mkdir ~/_shared/
sudo mount -t vboxsf <Insert Exact Virtualbox Shared Folder Name> ~/_shared/

(or, to make the mounted files readable and writable by your regular non-root user (uid/gid 1000):)
sudo mount -t vboxsf -o rw,uid=1000,gid=1000 <Insert Exact Virtualbox Shared Folder Name> ~/_shared/
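
If you want the share mounted automatically on every boot, a hypothetical /etc/fstab entry (assuming the Guest Additions are installed, the share is named "shared" and your user is uid/gid 1000):
shared  /home/youruser/_shared  vboxsf  rw,uid=1000,gid=1000  0  0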

Thursday, April 23, 2015

SQL Server script to Bulk Insert from a CSV file into a table

The following SQL Server script uses BULK INSERT to load a CSV file into a target table via a staging table.

Here are the steps:

  1. Create Raw Staging Table With All Varchar Columns
  2. Importing From CSV To Raw Staging Table
  3. Create Target Table With Typed Columns
  4. Copy Raw Staging Table To Target Table


-- ================================================
PRINT '1 CREATE RAW STAGING TABLE WITH ALL VARCHAR COLUMNS'

CREATE TABLE [dbo].[MyTableName_BULK_INSERT](
[SomeStringColumn] [nvarchar](255) NULL,
[SomeIntColumn] [nvarchar](255) NULL
) ON [PRIMARY]


-- ================================================
PRINT '2 IMPORTING FROM CSV TO RAW STAGING TABLE'

BULK INSERT MyTableName_BULK_INSERT
FROM 'C:\Temp\MySourceFile.csv'
WITH
(
FIRSTROW = 2,  -- Skip the first line if the file has a header row
FIELDTERMINATOR = '\t',  -- Tab-delimited source; use ',' for a comma-separated file
ROWTERMINATOR = '\n',
MAXERRORS = 0,
ERRORFILE = 'C:\Temp\Bulk_Insert_Errors.log',
CODEPAGE = 'ACP',
DATAFILETYPE = 'widechar'
)

-- select count(*) from MyTableName_BULK_INSERT
-- select top 10 * from MyTableName_BULK_INSERT


-- ================================================
PRINT '3 CREATE TARGET TABLE WITH TYPED COLUMNS'

CREATE TABLE [dbo].[MyTableName](
[SomeStringColumn] [varchar](255) NOT NULL,
[SomeIntColumn] [int] NOT NULL,
 CONSTRAINT [PK_MyTableName] PRIMARY KEY CLUSTERED 
 ([SomeStringColumn] ASC)
  WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
)


-- ================================================
PRINT '4 COPY RAW STAGING TABLE TO TARGET TABLE'

INSERT INTO [MyTableName]
([SomeStringColumn]
,[SomeIntColumn])
(SELECT 
[SomeStringColumn]
,[SomeIntColumn]
FROM [MyTableName_BULK_INSERT])

-- select count(*) from [MyTableName]
-- select top 10 * from [MyTableName]
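
-- Optional sanity check (hypothetical): the row counts in staging and target should match
SELECT (SELECT COUNT(*) FROM [MyTableName_BULK_INSERT]) AS StagingRows,
       (SELECT COUNT(*) FROM [MyTableName]) AS TargetRows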


-- ================================================
PRINT 'FINISHED'



Wednesday, April 22, 2015

Using bcp command to export and import a SQL Server table between databases

Here's how to use the bcp command-line tool to export a SQL Server table to disk and then import it into another table, which may be in a different database or on a different server.


1) Script the table definition and create it in the target database (right-click the table > Script Table as)
CREATE TABLE [dbo].[MyExampleTable](
[Id] [uniqueidentifier] NOT NULL,
[DateCreated] [datetime] NULL,
[DateUpdated] [datetime] NULL,
[OtherExampleColumn] [varchar](100) NULL
)


2) Export source table to disk using bcp:
bcp MySourceDatabaseName.dbo.MyExampleTable out C:\SomeFolder\MyExampleTable.dat -c -t, -S localhost -T


3) Import file into target SQL table using bcp:
bcp MyTargetDatabaseName.dbo.MyExampleTable in C:\SomeFolder\MyExampleTable.dat -c -t, -S localhost -T
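
Both commands use -T for a trusted (Windows) connection; if you need SQL authentication against a remote server instead, the import looks something like this (server name and credentials are placeholders):
bcp MyTargetDatabaseName.dbo.MyExampleTable in C:\SomeFolder\MyExampleTable.dat -c -t, -S SomeRemoteServer -U SomeUser -P SomePassword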


For further options and details, see this link:
https://www.simple-talk.com/sql/database-administration/working-with-the-bcp-command-line-utility/

Tuesday, January 27, 2015

Reading mongodump bson file from Spark in scala using mongo-hadoop

I couldn't find a complete Scala version using mongo-hadoop v1.3.1 to read a mongodump bson file, so here's one I prepared earlier:

val bsonData = sc.newAPIHadoopFile(
"file:///your/file.bson",
classOf[com.mongodb.hadoop.BSONFileInputFormat].asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, org.bson.BSONObject]]),
classOf[Object],
classOf[org.bson.BSONObject])
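
Each record's value is an org.bson.BSONObject, so pulling a field out of every document is straightforward (hypothetical sketch - "name" is a placeholder field, substitute one from your collection):

val names = bsonData.map { case (_, doc) => doc.get("name") }  // extract one field per document
names.take(5).foreach(println)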


Note that (for v1.3.1) we need to subclass com.mongodb.hadoop.BSONFileInputFormat to avoid this compilation error: "inferred type arguments do not conform to method newAPIHadoopFile's type parameter bounds".  This isn't required if reading from Mongo directly using com.mongodb.hadoop.MongoInputFormat.

Also, you can pass a Configuration object as a final parameter if you need to set any specific conf values.

For more bson examples see here: https://github.com/mongodb/mongo-hadoop/blob/master/BSON_README.md

For Java examples see here: http://crcsmnky.github.io/2014/07/13/mongodb-spark-input/

Tuesday, January 6, 2015

How to access HBase from spark-shell using YARN as the master on CDH 5.3 and Spark 1.2


From terminal:

# export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/etc/hbase/conf/hbase-site.xml

# spark-shell --master yarn-client


Now you can access HBase from the Spark shell prompt:

import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val tableName = "My_HBase_Table_Name"

val hconf = HBaseConfiguration.create()

hconf.set(TableInputFormat.INPUT_TABLE, tableName)

val admin = new HBaseAdmin(hconf)
if (!admin.isTableAvailable(tableName)) {
  val tableDesc = new HTableDescriptor(tableName)
  admin.createTable(tableDesc)
}

val hBaseRDD = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])

val result = hBaseRDD.count()
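
As a small follow-up sketch (hypothetical - just to peek at the data), you can pull the row keys out of the scan results:

import org.apache.hadoop.hbase.util.Bytes
val rowKeys = hBaseRDD.map { case (key, _) => Bytes.toString(key.get()) }  // row key bytes -> String
rowKeys.take(10).foreach(println)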


Thanks to these refs for pointers:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/44744
http://apache-spark-user-list.1001560.n3.nabble.com/HBase-and-non-existent-TableInputFormat-td14370.html