Wednesday, November 28, 2018

Making git on Windows behave with CRLF line endings

Git on Windows is, by default, a bit too clever with line endings: it typically ships with core.autocrlf=true, which converts LF to CRLF on checkout.

When I check out a Linux/OSX repo containing shell scripts that end up inside a built Docker image, I want the line endings left as LF, exactly as they are in the repo, not converted to CRLF.

To achieve this for the entire repo, git clone like so:
git clone --config core.autocrlf=input <repo>
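
Alternatively, a .gitattributes file committed to the repo pins the line endings for everyone, regardless of each developer's local autocrlf setting. A minimal sketch (adjust the patterns for your repo):

# .gitattributes at the repo root
* text=auto
*.sh text eol=lf
Dockerfile text eol=lf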

More info: https://git-scm.com/docs/git-config#git-config-coreautocrlf

Thursday, September 27, 2018

Serialising a RandomForestClassificationModel from PySpark to a SequenceFile on hdfs

Prior to Spark 2.0, org.apache.spark.ml.classification.RandomForestClassificationModel has no save() method in its Scala API because it doesn't implement the MLWritable interface.

If it did, we could call it from PySpark just as we can for LogisticRegressionModel (which does implement MLWritable in 1.6, see the August post below), like so:
lrModel._java_obj.save(model_path)
loaded_lrModel_JVM = sqlContext._jvm.org.apache.spark.ml.classification.LogisticRegressionModel.load(model_path)
loaded_lrModel = LogisticRegressionModel(loaded_lrModel_JVM)

(Note that this issue doesn't apply to the older deprecated org.apache.spark.mllib.tree.model.RandomForestModel from Spark MLlib which does have a save() method in v1.6)

This is a problem, as my current client is constrained to Spark v1.6.

rdd.saveAsObjectFile() offers an alternative: serialise the model out to a SequenceFile on HDFS via the Hadoop API, and deserialise it again with objectFile().


Here is the relatively simple Scala approach:
// Save
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///some/path/rfModel")

// Load
import org.apache.spark.ml.classification.RandomForestClassificationModel
val rfModel = sc.objectFile[RandomForestClassificationModel]("hdfs:///some/path/rfModel").first()


Due to serialisation issues with Py4J the PySpark approach is more complex:
# Save
# Wrap the underlying Java model in a Java ArrayList so the JVM-side
# parallelize() can handle it without Python pickling getting in the way
gateway = sc._gateway
java_list = gateway.jvm.java.util.ArrayList()
java_list.add(rfModel._java_obj)
modelRdd = sc._jsc.parallelize(java_list)
modelRdd.saveAsObjectFile("hdfs:///some/path/rfModel")

# Load
from pyspark.ml.classification import RandomForestClassificationModel

rfObjectFileLoaded = sc._jsc.objectFile("hdfs:///some/path/rfModel")
rfModelLoaded_JavaObject = rfObjectFileLoaded.first()
# Re-wrap the raw Java object in the PySpark model class
rfModelLoaded = RandomForestClassificationModel(rfModelLoaded_JavaObject)
predictions = rfModelLoaded.transform(test_input_df)



Reference source of RandomForestClassifier v1.6 vs. v2.2:
https://github.com/apache/spark/blob/v1.6.2/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala
https://github.com/apache/spark/blob/v2.2.0/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala

Reference for MLWritable:
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/ml/util/MLWritable.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/ml/util/MLWritable.html

Thursday, August 16, 2018

Calling Scala API methods from PySpark when using the spark.ml library

While implementing Logistic Regression on an older Spark 1.6 cluster, I was surprised by how many methods were missing from the Python API; in particular, saving and loading a serialised model simply wasn't possible.

However, using Py4J calls we can reach directly into the Spark Scala API.

Say we have a `pyspark.ml.classification.LogisticRegressionModel` object, we can call save like so:

lrModel._java_obj.save(model_path)

And load, which looks a little different because it's a static method:

loaded_lrModel_JVM = sqlContext._jvm.org.apache.spark.ml.classification.LogisticRegressionModel.load(model_path)

loaded_lrModel = LogisticRegressionModel(loaded_lrModel_JVM)
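
Putting it together, here's a minimal sketch of the whole round trip (the DataFrames, column names and model_path are hypothetical; it assumes an existing sc/sqlContext):

from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

lr = LogisticRegression(featuresCol="features", labelCol="label")
lrModel = lr.fit(training_df)

# Save via the underlying Java object
lrModel._java_obj.save(model_path)

# Load via the JVM static method, then re-wrap as a PySpark model
loaded_lrModel_JVM = sqlContext._jvm.org.apache.spark.ml.classification.LogisticRegressionModel.load(model_path)
loaded_lrModel = LogisticRegressionModel(loaded_lrModel_JVM)

predictions = loaded_lrModel.transform(test_df)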

This helps future-proof Spark ML development: `spark.mllib` is effectively deprecated, and on any Spark 2.x upgrade there should be minimal breaking changes to this API.

Monday, August 6, 2018

Find maximum row per group in Spark DataFrame

This is a great Spark resource on getting the max row, thanks zero323!
https://stackoverflow.com/questions/35218882/find-maximum-row-per-group-in-spark-dataframe

Using join (it will result in more than one row in group in case of ties):
import pyspark.sql.functions as F
from pyspark.sql.functions import count, col 

cnts = df.groupBy("id_sa", "id_sb").agg(count("*").alias("cnt")).alias("cnts")
maxs = cnts.groupBy("id_sa").agg(F.max("cnt").alias("mx")).alias("maxs")

cnts.join(maxs, 
  (col("cnt") == col("mx")) & (col("cnts.id_sa") == col("maxs.id_sa"))
).select(col("cnts.id_sa"), col("cnts.id_sb"))

Using window functions (will drop ties):
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

w = Window().partitionBy("id_sa").orderBy(col("cnt").desc())

(cnts
  .withColumn("rn", row_number().over(w))
  .where(col("rn") == 1)
  .select("id_sa", "id_sb"))

Using struct ordering:
from pyspark.sql.functions import struct

(cnts
  .groupBy("id_sa")
  .agg(F.max(struct(col("cnt"), col("id_sb"))).alias("max"))
  .select(col("id_sa"), col("max.id_sb")))


Friday, June 22, 2018

Adding a custom Python library path in a Jupyter Notebook

This code adds a library path (a lib folder relative to your working directory) so that you can import custom library functions from a Jupyter Notebook:

import sys, os
extra_path = os.path.join(os.getcwd(), "lib")
if extra_path not in sys.path:
    sys.path.append(extra_path)
    print('Added extra_path:', extra_path)


Then import like so:
import <my_file_name_in_lib_folder> as funcs

If you're in a Jupyter Notebook with Python 3.4+, the following will reload the library after you've edited it, without restarting the kernel:
import importlib
importlib.reload(funcs)
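
Alternatively, IPython's autoreload extension (assuming you're happy to have it on for the whole session) re-imports changed modules automatically before each cell runs:

%load_ext autoreload
%autoreload 2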

Sunday, April 8, 2018

Preparing a hashed password with Jupyter Notebook


http://jupyter-notebook.readthedocs.io/en/latest/public_server.html#preparing-a-hashed-password

You can prepare a hashed password manually, using the function notebook.auth.security.passwd():
In [1]: from notebook.auth import passwd
In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed' 
You can then add the hashed password to your jupyter_notebook_config.py. By default this file lives in the Jupyter folder in your home directory, ~/.jupyter, e.g.:

c.NotebookApp.password = u'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'
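
If you'd prefer to script it, passwd() also accepts the passphrase as an argument; a minimal sketch (just take care not to leave real passwords sitting in notebooks or shell history):

from notebook.auth import passwd
hashed = passwd('correct horse battery staple')  # example passphrase only
print(hashed)  # paste the output into jupyter_notebook_config.py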

Wednesday, April 4, 2018

Resolving conflicts on a git branch


You'll need to update your branch with new commits from master, resolve those conflicts and push the updated/resolved branch to GitHub.

Resolve conflicts

I found these instructions much better than the instructions from Bitbucket involving a detached head! Thanks jaw6!

git checkout master
git pull
git checkout <branch>
git merge master
[ ... resolve any conflicts ... ]
git add [files that were conflicted]
git commit
git push


Direct Reference: https://github.com/githubteacher/github-for-developers-sept-2015/issues/648


Merge master

git checkout somebranch # gets you on branch somebranch
git fetch origin        # gets you up to date with origin
git merge origin/master


Reference: https://stackoverflow.com/questions/20101994/git-pull-from-master-into-the-development-branch


If it all goes wrong somehow, reset with the following (but make sure you've committed or stashed anything you want to keep, as you can lose work with this command!!)

git reset --hard
git reset --hard <SOME-COMMIT>


Reference: https://stackoverflow.com/questions/9529078/how-do-i-use-git-reset-hard-head-to-revert-to-a-previous-commit


Bonus: reverting local changes

Revert changes made to your working copy:
git checkout .

Revert changes that you have staged (added to the index); this unstages them but leaves your working copy files untouched:
git reset

Revert a change that you have committed:
git revert <commit 1> <commit 2>

Remove untracked files (e.g., new files, generated files):
git clean -f

Remove untracked directories as well as untracked files (e.g., new or automatically generated directories):
git clean -fd

Reference: https://stackoverflow.com/questions/1146973/how-do-i-revert-all-local-changes-in-git-managed-project-to-previous-state