If you have a binary model generated by Google's awesome and super fast word2vec word-embedding tool, you can easily convert it to a text representation of the word vectors using Python and gensim.
Input: binary word embedding model from Google's word2vec tool
Output: text vectors for word embeddings
Python conversion code:
from gensim.models import word2vec
model = word2vec.Word2Vec.load_word2vec_format('path/to/mymodel.bin', binary=True)
model.save_word2vec_format('path/to/mymodel.txt', binary=False)
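If you are on a newer gensim release (1.0 or later), the loading code has moved to KeyedVectors; a roughly equivalent sketch, assuming the same file paths as above:
from gensim.models import KeyedVectors
# Load the binary vectors and write them back out as plain text (gensim >= 1.0 API)
wv = KeyedVectors.load_word2vec_format('path/to/mymodel.bin', binary=True)
wv.save_word2vec_format('path/to/mymodel.txt', binary=False)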
I recommend using Anaconda from Continuum Analytics as a bundled Python distribution. To install gensim in Anaconda, just type: conda install gensim :)
Original ref: https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/13828/how-to-convert-bin-file-of-word2vec-model-into-txt-r/91564
Sunday, November 1, 2015
Thursday, October 22, 2015
Analyse consecutive timeseries pairs in Spark
With time series data it is often useful to analyse consecutive elements as pairs.
Here is one way to construct a Spark RDD from a time series so that each element ends up paired with the next one in the final RDD:
val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E"))
val rdd = sc.parallelize(arr)
val sorted = rdd.sortByKey(true) // sort by the time key
val zipped = sorted.zipWithIndex.map(x => (x._2, x._1)) // key each element by its position in the sorted order
val pairs = zipped.join(zipped.map(x => (x._1 - 1, x._2))).sortBy(_._1) // join position i with position i + 1
This produces the consecutive elements as pairs in the RDD, ready for further processing:
(0,((1,A),(3,B)))
(1,((3,B),(7,C)))
(2,((7,C),(8,D)))
(3,((8,D),(9,E)))
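As an example of that further processing (just a sketch, treating the integer keys as timestamps), you can map over the pairs to get the gap between each element and the next:
// gap between consecutive keys, indexed by pair position
val gaps = pairs.map { case (idx, ((t1, _), (t2, _))) => (idx, t2 - t1) }
gaps.collect().foreach(println) // (0,2), (1,4), (2,1), (3,1)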
Ref: https://www.mail-archive.com/user@spark.apache.org/msg39353.html
Tuesday, August 11, 2015
Data Science Blogs around the Interwebs
So many great data science/machine learning blogs, so little time!
Company Blogs
http://googleresearch.blogspot.com.au/
https://research.facebook.com/blog/ai/
https://research.facebook.com/blog/datascience/
https://blog.twitter.com/data
http://data.quora.com/
https://engineering.linkedin.com/big-data
http://blog.linkedin.com/topic/economic-graph/
http://www.ebaytechblog.com/category/machine-learning/
http://blogs.aws.amazon.com/bigdata/
http://nerds.airbnb.com/data/
https://codeascraft.com/category/data/ (Etsy)
http://blog.kaggle.com/
Personal Blogs
http://blog.echen.me/ (Edward Chen)
http://fastml.com/ (Zygmunt Zając)
http://flowingdata.com/ (Nathan Yau)
http://simplystatistics.org/ (Jeff Leek, Roger Peng, and Rafa Irizarry)
http://hunch.net/ (John Langford et al.)
http://www.walkingrandomly.com/ (Mike Croucher)
Aggregators and other links
http://www.r-bloggers.com/
https://www.reddit.com/r/machinelearning
http://www.datatau.com/
http://stats.stackexchange.com/
Thursday, July 30, 2015
Ubuntu 14 Classic Desktop set up
Whenever I set up an Ubuntu VM I want to fall back to a lighter desktop with fewer effects. Here's how to install the classic GNOME desktop and make the newer Unity desktop optional at login.
Ref: http://www.howtogeek.com/189912/how-to-install-the-gnome-classic-desktop-in-ubuntu-14.04/
1) sudo apt-get update; sudo apt-get install gnome-session-fallback
2) Logout
3) Login with Metacity icon option
Happy days.
Tuesday, June 30, 2015
Scripted backup of all SQL Server DBs
This is a handy script to back up all (or only specified) DBs to disk, with a date stamp and the environment name in each filename.
Thanks Greg Robidoux!
Original ref: http://www.mssqltips.com/sqlservertip/1070/simple-script-to-backup-all-sql-server-databases/
--Simple script to backup all SQL Server databases
--http://www.mssqltips.com/sqlservertip/1070/simple-script-to-backup-all-sql-server-databases/
DECLARE @name VARCHAR(50) -- database name
DECLARE @path VARCHAR(256) -- path for backup files
DECLARE @fileName VARCHAR(256) -- filename for backup
DECLARE @fileDate VARCHAR(20) -- used for file name
DECLARE @env VARCHAR(20) = 'Local'
-- specify database backup directory
SET @path = 'C:\Backup\'
-- specify filename format
SELECT @fileDate = CONVERT(VARCHAR(20),GETDATE(),112)
print @fileDate
DECLARE db_cursor CURSOR FOR
SELECT name FROM master.dbo.sysdatabases WHERE name NOT IN ('master','model','msdb','tempdb','ReportServer','ReportServerTempDB') -- Exclude these databases
--SELECT name FROM master.dbo.sysdatabases WHERE name IN ('MyDB1','MyDB2','MyDB3') -- Or, only include these databases
OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @name
WHILE @@FETCH_STATUS = 0
BEGIN
SET @fileName = @path + @name + '_' + @env + '_' + @fileDate + '.bak'
print 'Backing up ' + @name + ' to ' + @fileName
BACKUP DATABASE @name TO DISK = @fileName
print 'Finished ' + @name
FETCH NEXT FROM db_cursor INTO @name
END
CLOSE db_cursor
DEALLOCATE db_cursor
Thursday, May 14, 2015
VirtualBox Host to Guest & Guest to Host networking
The simplest explanation of how to enable Host to Guest & Guest to Host networking with VirtualBox that I've seen:
"Give the guest two network adapters, one NAT and the other Host-only.
The NAT one will allow the guest to see the Internet, and the Host-only one will allow the host to see the guest."
Thanks Matthew!
Source:
http://stackoverflow.com/questions/61156/virtualbox-host-guest-network-setup
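If you would rather script this than click through the GUI, here is a rough VBoxManage sketch (run with the VM powered off; the VM name and host-only interface name are placeholders - check yours with VBoxManage list hostonlyifs):
VBoxManage modifyvm "MyGuestVM" --nic1 nat
VBoxManage modifyvm "MyGuestVM" --nic2 hostonly --hostonlyadapter2 vboxnet0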
Bonus points - mounting a Windows host shared folder in a Linux guest:
sudo mkdir ~/_shared/
sudo mount -t vboxsf <Insert Exact Virtualbox Shared Folder Name> ~/_shared/
(or to mount as non-root user:)
sudo mount -t vboxsf -o rw,uid=1000,gid=1000 <Insert Exact Virtualbox Shared Folder Name> ~/_shared/
Thursday, April 23, 2015
SQL Server script to Bulk Insert from a CSV file into a table
The following SQL Server script uses BULK INSERT to load a CSV file into a target table via a staging table.
Here are the steps:
- Create a raw staging table with all varchar columns
- Import from the CSV into the raw staging table
- Create the target table with typed columns
- Copy the raw staging table into the target table
-- ================================================
PRINT '1 CREATE RAW STAGING TABLE WITH ALL VARCHAR COLUMNS'
CREATE TABLE [dbo].[MyTableName_BULK_INSERT](
[SomeStringColumn] [nvarchar](255) NULL,
[SomeIntColumn] [nvarchar](255) NULL
) ON [PRIMARY]
-- ================================================
PRINT '2 IMPORTING FROM CSV TO RAW STAGING TABLE'
BULK INSERT MyTableName_BULK_INSERT
FROM 'C:\Temp\MySourceFile.csv'
WITH
(
FIRSTROW = 2, -- If has a header row
FIELDTERMINATOR = '\t', -- tab-delimited source here; change to ',' for a true comma-separated file
ROWTERMINATOR = '\n',
MAXERRORS = 0,
ERRORFILE = 'C:\Temp\Bulk_Insert_Errors.log',
CODEPAGE = 'ACP',
DATAFILETYPE = 'widechar'
)
-- select count(*) from MyTableName_BULK_INSERT
-- select top 10 * from MyTableName_BULK_INSERT
-- ================================================
PRINT '3 CREATE TARGET TABLE WITH TYPED COLUMNS'
CREATE TABLE [dbo].[MyTableName](
[SomeStringColumn] [varchar](255) NOT NULL,
[SomeIntColumn] [int] NOT NULL
CONSTRAINT [PK_MyTableName] PRIMARY KEY CLUSTERED
([SomeStringColumn] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
)
-- ================================================
PRINT '4 COPY RAW STAGING TABLE TO TARGET TABLE'
INSERT INTO [MyTableName]
([SomeStringColumn]
,[SomeIntColumn])
(SELECT
[SomeStringColumn]
,[SomeIntColumn]
FROM [MyTableName_BULK_INSERT])
-- select count(*) from [MyTableName]
-- select top 10 * from [MyTableName]
-- ================================================
PRINT 'FINISHED'
Wednesday, April 22, 2015
Using bcp command to export and import a SQL Server table between databases
Here's how to use the bcp command-line tool to export a SQL Server table to disk and then import it into another table, which may be in a different database or on a different server.
1) Script the table definition and create in the target database (Right click table > Script table as)
CREATE TABLE [dbo].[MyExampleTable](
[Id] [uniqueidentifier] NOT NULL,
[DateCreated] [datetime] NULL,
[DateUpdated] [datetime] NULL,
[OtherExampleColumn] [varchar](100) NULL
)
2) Export source table to disk using bcp:
bcp MySourceDatabaseName.dbo.MyExampleTable out C:\SomeFolder\MyExampleTable.dat -c -t, -S localhost -T
3) Import file into target SQL table using bcp:
bcp MyTargetDatabaseName.dbo.MyExampleTable in C:\SomeFolder\MyExampleTable.dat -c -t, -S localhost -T
For further options and details, see this link:
https://www.simple-talk.com/sql/database-administration/working-with-the-bcp-command-line-utility/
Tuesday, January 27, 2015
Reading mongodump bson file from Spark in scala using mongo-hadoop
I couldn't find a complete Scala version using mongo-hadoop v1.3.1 to read a mongodump bson file, so here's one I prepared earlier:
val bsonData = sc.newAPIHadoopFile(
"file:///your/file.bson",
classOf[com.mongodb.hadoop.BSONFileInputFormat].asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, org.bson.BSONObject]]),
classOf[Object],
classOf[org.bson.BSONObject])
Note that (for v1.3.1) we need to subclass com.mongodb.hadoop.BSONFileInputFormat to avoid this compilation error: "inferred type arguments do not conform to method newAPIHadoopFile's type parameter bounds". This isn't required if reading from Mongo directly using com.mongodb.hadoop.MongoInputFormat.
Also, you can pass a Configuration object as a final parameter if you need to set any specific conf values.
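For example, here is a minimal sketch of passing a conf through (the split-size key below is just a standard Hadoop setting used for illustration, not something mongo-hadoop requires):
import org.apache.hadoop.conf.Configuration
val conf = new Configuration()
// illustrative only: cap input splits at 64 MB when splitting the BSON file
conf.set("mapreduce.input.fileinputformat.split.maxsize", "67108864")
val bsonData = sc.newAPIHadoopFile(
"file:///your/file.bson",
classOf[com.mongodb.hadoop.BSONFileInputFormat].asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, org.bson.BSONObject]]),
classOf[Object],
classOf[org.bson.BSONObject],
conf)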
For more bson examples see here: https://github.com/mongodb/mongo-hadoop/blob/master/BSON_README.md
For Java examples see here: http://crcsmnky.github.io/2014/07/13/mongodb-spark-input/
Tuesday, January 6, 2015
How to access HBase from spark-shell using YARN as the master on CDH 5.3 and Spark 1.2
From a terminal:
# export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/etc/hbase/conf/hbase-site.xml
# spark-shell --master yarn-client
Now you can access HBase from the Spark shell prompt:
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
val tableName = "My_HBase_Table_Name"
val hconf = HBaseConfiguration.create()
hconf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(hconf)
if (!admin.isTableAvailable(tableName)) {
val tableDesc = new HTableDescriptor(tableName)
admin.createTable(tableDesc)
}
val hBaseRDD = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val result = hBaseRDD.count()
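As a small follow-on sketch, you can also pull data back out of the Result objects; the column family "cf" and qualifier "col1" below are placeholders, so substitute your own:
import org.apache.hadoop.hbase.util.Bytes
// row key plus one cell value per row ("cf" and "col1" are placeholder names)
val rows = hBaseRDD.map { case (_, result) =>
  val rowKey = Bytes.toString(result.getRow)
  val value = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1")))
  (rowKey, value)
}
rows.take(10).foreach(println)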
Thanks to these refs for pointers:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/44744
http://apache-spark-user-list.1001560.n3.nabble.com/HBase-and-non-existent-TableInputFormat-td14370.html