Monday, September 5, 2016

Simple git workflow


Git commands

git <command> --help
git log
git status
git add <files/folders>
(undo add: git reset HEAD <files/folders>)
git commit -a --message=<msg>
git pull
git push

Useful Software

What | Link | Description
Recommended and official Git Bash console client | https://git-scm.com/downloads | Windows, OSX and Linux. Good to have this in addition to another client.
Recommended git GUI | https://www.gitkraken.com/download | Windows, OSX and Linux
GitHub Desktop | https://desktop.github.com/ | Windows and OSX
Windows shell interface (like TortoiseSVN) | https://tortoisegit.org/ | Windows only
 

Tuesday, July 12, 2016

Installing Java on RHEL with alternatives

Solid post on installing Java 8 on CentOS/RHEL:
http://tecadmin.net/install-java-8-on-centos-rhel-and-fedora/#


Find java download link
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Downloading Latest Java Archive
cd /usr/java/ (or your target directory)
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u91-b14/jdk-8u91-linux-x64.tar.gz"
tar xzf jdk-8u91-linux-x64.tar.gz

Install Java with Alternatives
cd /usr/java/jdk1.8.0_91/
alternatives --install /usr/bin/java java /usr/java/jdk1.8.0_91/bin/java 2
alternatives --config java

Optionally include javac and jar:
alternatives --install /usr/bin/jar jar /usr/java/jdk1.8.0_91/bin/jar 2
alternatives --install /usr/bin/javac javac /usr/java/jdk1.8.0_91/bin/javac 2
alternatives --set jar /usr/java/jdk1.8.0_91/bin/jar
alternatives --set javac /usr/java/jdk1.8.0_91/bin/javac
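
The jdk1.8.0_91 path is repeated in every command above, so moving to a newer JDK means editing each line. As a rough sketch, the command lines could be generated from the JDK directory with a few lines of Python. The helper name alternatives_commands and its defaults are illustrative assumptions, not part of the alternatives tool, and the helper only formats strings without running anything.

```python
def alternatives_commands(jdk_home, tools=("java", "jar", "javac"), priority=2):
    """Build the 'alternatives' command lines for each JDK tool.

    Hypothetical helper: it only formats strings, it does not execute them.
    """
    jdk_home = jdk_home.rstrip("/")
    commands = []
    for tool in tools:
        target = f"{jdk_home}/bin/{tool}"
        # Register the binary under /usr/bin/<tool> with the given priority.
        commands.append(
            f"alternatives --install /usr/bin/{tool} {tool} {target} {priority}"
        )
        # Select it explicitly instead of using the interactive --config prompt.
        commands.append(f"alternatives --set {tool} {target}")
    return commands


if __name__ == "__main__":
    for line in alternatives_commands("/usr/java/jdk1.8.0_91"):
        print(line)
```

Printing the commands first and pasting them into a root shell keeps the step reviewable before anything changes on the system.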

Check Installed Java Version
java -version

Configuring Environment Variables
export JAVA_HOME=/usr/java/jdk1.8.0_91
export JRE_HOME=/usr/java/jdk1.8.0_91/jre
export PATH=$PATH:/usr/java/jdk1.8.0_91/bin:/usr/java/jdk1.8.0_91/jre/bin




See the blog post linked above for further environment configuration steps.

Tuesday, June 21, 2016

How to list word occurrences using CountVectorizer from scikit-learn

A simple and efficient way to get the total number of occurrences of each word in a corpus is to use CountVectorizer from scikit-learn.

Mapping a column index back to its word is not immediately obvious. Here's how to do it:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = <load your docs as an iterable>

count_vect = CountVectorizer()
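doc_counts = count_vect.fit_transform(docs)  # scipy sparse CSR matrix of shape (n_docs, n_words)

# Summing a sparse matrix returns a 1 x n_words numpy.matrix, so flatten it with np.asarray(...).ravel().
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names().
word_counts = zip(count_vect.get_feature_names_out(), np.asarray(doc_counts.sum(axis=0)).ravel())
word_counts = sorted(word_counts, key=lambda pair: pair[1], reverse=True)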
 

# Display top 100 words by frequency
word_counts[:100]
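
To sanity-check what the summed matrix contains, here is a small standard-library sketch that mimics CountVectorizer's default tokenization (lowercasing, tokens of two or more word characters). The sample corpus and the function names are made up for illustration. It also shows the difference between total occurrences, which is what the column sum gives, and document frequency, which is the number of documents containing a word.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_RE.findall(doc.lower())


def total_counts(docs):
    """Total occurrences of each word across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def document_frequency(docs):
    """Number of documents in which each word appears at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts


if __name__ == "__main__":
    sample_docs = ["the cat sat on the mat", "the dog sat", "a cat"]  # made-up corpus
    print(total_counts(sample_docs).most_common(3))
    print(document_frequency(sample_docs)["the"])
```

If document frequency is what you actually need from scikit-learn, CountVectorizer(binary=True) records each word at most once per document, so the same column sum then counts documents rather than occurrences.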

Thursday, June 9, 2016

How to 7zip each file separately in a directory (Windows)

Let's zip up all those log files sitting in that directory into separate 7z files. Run this from a Command Prompt in that directory (inside a .bat file, use %%i instead of %i):

FOR %i IN (*.*) DO "C:\Program Files\7-Zip\7z.exe" a -mx=9 "%i.7z" "%i"
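
For anyone who prefers a script, the same loop can be sketched in Python. This assumes 7-Zip is installed at the path used above; the function names are hypothetical. One deliberate difference from the one-liner: files already ending in .7z are skipped so that a second run does not re-compress earlier archives.

```python
import subprocess
from pathlib import Path

SEVEN_ZIP = r"C:\Program Files\7-Zip\7z.exe"  # same install path as the command above


def build_commands(directory, exe=SEVEN_ZIP):
    """Return one 7z 'add' command per regular file in the directory."""
    commands = []
    for path in sorted(Path(directory).iterdir()):
        # Skip sub-directories and archives produced by an earlier run.
        if not path.is_file() or path.suffix.lower() == ".7z":
            continue
        archive = path.with_name(path.name + ".7z")
        # 'a' adds to an archive, -mx=9 selects the highest compression level.
        commands.append([exe, "a", "-mx=9", str(archive), str(path)])
    return commands


def compress_each(directory, exe=SEVEN_ZIP):
    """Run the commands one by one, stopping on the first failure."""
    for command in build_commands(directory, exe):
        subprocess.run(command, check=True)
```

Building the command list separately from running it makes it easy to print and review the commands before compressing anything.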

Refs:
http://superuser.com/questions/312652/how-do-i-create-seperate-7z-files-from-each-selected-directory-with-7zip-command
http://askubuntu.com/questions/491223/7z-ultra-settings-for-zip-format

Monday, April 18, 2016

Managing Java versions and Eclipse

Some little bits on managing Java versions and Eclipse:

Managing JRE installations in Eclipse
http://www.codejava.net/ides/eclipse/managing-jre-installations-in-eclipse

How to switch JDK version on Mac OS X
https://www.jayway.com/2014/01/15/how-to-switch-jdk-version-on-mac-os-x-maverick/
Edit your ~/.bash_profile and add the following:
function setjdk() {
  if [ $# -ne 0 ]; then
    removeFromPath '/System/Library/Frameworks/JavaVM.framework/Home/bin'
    if [ -n "${JAVA_HOME:-}" ]; then
      # Remove the previous JDK's bin directory, which is the entry that was added to PATH
      removeFromPath "$JAVA_HOME/bin"
    fi
    export JAVA_HOME=$(/usr/libexec/java_home -v "$@")
    export PATH="$JAVA_HOME/bin:$PATH"
  fi
}
function removeFromPath() {
  export PATH=$(echo "$PATH" | sed -E -e "s;:$1;;" -e "s;$1:?;;")
}
# Default JDK for new shells
setjdk 1.7
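
The sed expression in removeFromPath is dense. As an illustration only, the same idea of dropping one entry from a colon-separated PATH and putting the new JDK first can be written out in Python. The function names are hypothetical, and unlike the sed version this sketch matches whole entries rather than substrings.

```python
def remove_from_path(path, entry):
    """Return the colon-separated path without any entry equal to 'entry'."""
    return ":".join(part for part in path.split(":") if part != entry)


def switch_jdk(path, old_java_home, new_java_home):
    """Drop the old JDK's bin directory (if any) and put the new one first."""
    if old_java_home:
        path = remove_from_path(path, old_java_home + "/bin")
    return new_java_home + "/bin:" + path


if __name__ == "__main__":
    # Made-up directories, purely for demonstration.
    print(switch_jdk("/jdk7/bin:/usr/bin:/bin", "/jdk7", "/jdk8"))
```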


Monday, March 21, 2016

Installing Java 8 on Ubuntu

Quick, easy point of reference for installing Java 8 on Ubuntu:

http://stackoverflow.com/questions/25549492/install-jdk8-in-ubuntu-14-04

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer


http://askubuntu.com/questions/315646/update-java-alternatives-vs-update-alternatives-config-java
sudo update-alternatives --config java


Tuesday, February 23, 2016

Debugging a Spark job that loads the wrong version of a JVM jar file

I recently had an issue where an incorrect version of a JVM jar file was being loaded in a Spark job on Hadoop.

This Scala code snippet helped debug the issue and determine the loaded jar file path:

val jarPath = classOf[MyObject].getProtectionDomain().getCodeSource().getLocation().getPath()

In my case this pointed to an old version of guava:
/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/guava-11.0.jar

Setting the spark.driver.extraClassPath and spark.executor.extraClassPath options for spark-submit then made the correct version of the jar file load:

spark-submit --class com.MyClass <other_spark_args> --conf "spark.driver.extraClassPath=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/guava-15.0.jar" --conf "spark.executor.extraClassPath=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/guava-15.0.jar" /home/mypath/myjarfile.jar <my_job_params>

For more info on extraClassPath, see: http://spark.apache.org/docs/latest/configuration.html
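
When a stale jar like this is suspected, it can help to list which artifacts are present in more than one version in a jars directory. The following is a minimal Python sketch. The function names and the simple name-version.jar parsing rule are assumptions, and the rule will not handle every naming scheme.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumes file names of the form <artifact>-<version>.jar,
# where the version starts with a digit (for example guava-11.0.jar).
JAR_RE = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_version_conflicts(jar_names):
    """Map artifact name to its versions, for artifacts seen in several versions.

    Versions are sorted as plain strings, which is enough for display purposes.
    """
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_RE.match(name)
        if match:
            versions[match.group("artifact")].add(match.group("version"))
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}


def scan_directory(jars_dir):
    """Apply the check to every .jar file directly inside jars_dir."""
    return find_version_conflicts(p.name for p in Path(jars_dir).glob("*.jar"))
```

Pointing scan_directory at the parcel's jars directory would flag an artifact such as guava if both the 11.0 and 15.0 jars are present, which narrows down which classpath entries to override.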