Git on Windows is by default a bit too clever with line endings, typically shipping with the config autocrlf=true.
When I check out a Linux/OSX repo containing shell scripts that are used in a built Docker image, I want the line endings left as LF as per the repo, not converted to CRLF.
To achieve this for the entire repo, git clone like so:
git clone --config core.autocrlf=input <repo>
More info: https://git-scm.com/docs/git-config#git-config-coreautocrlf
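If the repo has already been cloned with CRLF endings, a rough equivalent is to change the setting and force Git to rewrite the working tree. This is my own sketch rather than part of the original tip, and the hard reset discards uncommitted work, so commit or stash first:
# Switch this clone to LF-preserving behaviour, then re-checkout everything
git config core.autocrlf input
git rm --cached -r .
git reset --hard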
Wednesday, November 28, 2018
Thursday, September 27, 2018
Serialising a RandomForestClassificationModel from PySpark to a SequenceFile on hdfs
Prior to Spark 2.0, org.apache.spark.ml.classification.RandomForestClassificationModel doesn't have a save() method in its Scala API, as it doesn't implement the MLWritable interface.
If it did, then from PySpark we could easily call it like so (as we can for LogisticRegressionModel, which does have save() in v1.6):
lrModel._java_obj.save(model_path)
loaded_lrModel_JVM = sqlContext._jvm.org.apache.spark.ml.classification.LogisticRegressionModel.load(model_path)
loaded_lrModel = LogisticRegressionModel(loaded_lrModel_JVM)
(Note that this issue doesn't apply to the older, deprecated org.apache.spark.mllib.tree.model.RandomForestModel from Spark MLlib, which does have a save() method in v1.6.)
This is a problem, as my current client is constrained to Spark v1.6.
rdd.saveAsObjectFile() is an alternative way to serialise/deserialise a model to a SequenceFile on HDFS using the Hadoop API.
Here is the relatively simple Scala approach:
// Save
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///some/path/rfModel")
// Load
val rfModel = sc.objectFile[RandomForestClassificationModel]("hdfs:///some/path/rfModel").first()
Due to serialisation issues with Py4J, the PySpark approach is more complex:
# Save
# Wrap the underlying Java model in a single-element java.util.ArrayList so the
# JavaSparkContext can parallelize it into a one-element RDD
gateway = sc._gateway
java_list = gateway.jvm.java.util.ArrayList()
java_list.add(rfModel._java_obj)
modelRdd = sc._jsc.parallelize(java_list)
modelRdd.saveAsObjectFile("hdfs:///some/path/rfModel")
# Load
from pyspark.ml.classification import RandomForestClassificationModel
rfObjectFileLoaded = sc._jsc.objectFile("hdfs:///some/path/rfModel")
rfModelLoaded_JavaObject = rfObjectFileLoaded.first()
# Wrap the deserialised Java object back into the PySpark model class
rfModelLoaded = RandomForestClassificationModel(rfModelLoaded_JavaObject)
predictions = rfModelLoaded.transform(test_input_df)
Reference source of RandomForestClassifier v1.6 vs. v2.2:
https://github.com/apache/spark/blob/v1.6.2/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala
https://github.com/apache/spark/blob/v2.2.0/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala
Reference for MLWritable:
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/ml/util/MLWritable.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/ml/util/MLWritable.html
Thursday, August 16, 2018
Calling Scala API methods from PySpark when using the spark.ml library
While implementing Logistic Regression on an older Spark 1.6 cluster, I was surprised by how many Python API methods were missing; in particular, saving and loading a serialised model was unavailable.
However, using Py4J calls we can reach directly into the Spark Scala API.
Say we have a `pyspark.ml.classification.LogisticRegressionModel` object; we can call save() like so:
lrModel._java_obj.save(model_path)
And load(), which is slightly different since it's a static method:
loaded_lrModel_JVM = sqlContext._jvm.org.apache.spark.ml.classification.LogisticRegressionModel.load(model_path)
loaded_lrModel = LogisticRegressionModel(loaded_lrModel_JVM)
This helps future-proof Spark ML development, since `spark.mllib` is effectively deprecated and any Spark 2.x upgrade should involve minimal breaking changes to the API.
Monday, August 6, 2018
Find maximum row per group in Spark DataFrame
This is a great Spark resource on getting the max row, thanks zero323!
https://stackoverflow.com/questions/35218882/find-maximum-row-per-group-in-spark-dataframe
Using join (it will result in more than one row per group in case of ties):
import pyspark.sql.functions as F
from pyspark.sql.functions import count, col
cnts = df.groupBy("id_sa", "id_sb").agg(count("*").alias("cnt")).alias("cnts")
maxs = cnts.groupBy("id_sa").agg(F.max("cnt").alias("mx")).alias("maxs")
cnts.join(maxs,
(col("cnt") == col("mx")) & (col("cnts.id_sa") == col("maxs.id_sa"))
).select(col("cnts.id_sa"), col("cnts.id_sb"))
Using window functions (will drop ties):
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
w = Window().partitionBy("id_sa").orderBy(col("cnt").desc())
(cnts
.withColumn("rn", row_number().over(w))
.where(col("rn") == 1)
.select("id_sa", "id_sb"))
Using struct ordering:
from pyspark.sql.functions import struct
(cnts
.groupBy("id_sa")
.agg(F.max(struct(col("cnt"), col("id_sb"))).alias("max"))
.select(col("id_sa"), col("max.id_sb")))
Friday, June 22, 2018
Adding a custom Python library path in a Jupyter Notebook
This code adds a Python library path relative to your working folder to enable you to load custom library functions from a Jupyter Notebook:
import sys, os
extra_path = os.path.join(os.getcwd(), "lib")
if extra_path not in sys.path:
    sys.path.append(extra_path)
    print('Added extra_path:', extra_path)
Then import like so:
import <my_file_name_in_lib_folder> as funcs
If you're in a Jupyter Notebook with Python 3.4+, the following will reload the library so that changes to it are picked up without restarting the kernel:
import importlib
importlib.reload(funcs)
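Alternatively, if you'd rather not call reload() by hand after every change, IPython's autoreload extension can reload modified modules automatically before each cell runs. This is just a suggestion on top of the original snippet:
# Reload all modified modules automatically before executing each cell
%load_ext autoreload
%autoreload 2
import <my_file_name_in_lib_folder> as funcs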
Sunday, April 8, 2018
Preparing a hashed password with Jupyter Notebook
http://jupyter-notebook.readthedocs.io/en/latest/public_server.html#preparing-a-hashed-password
You can prepare a hashed password manually, using the function notebook.auth.security.passwd():
In [1]: from notebook.auth import passwd
In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'
You can then add the hashed password to your jupyter_notebook_config.py. The default location for jupyter_notebook_config.py is in your Jupyter folder in your home directory, ~/.jupyter, e.g.:
c.NotebookApp.password = u'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'
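If you'd rather generate the hash non-interactively (e.g. from a script), passwd() also accepts the password as an argument; a minimal sketch, with a placeholder password:
from notebook.auth import passwd
# Hash the given string directly instead of prompting for input
hashed = passwd('my-placeholder-password')
print(hashed)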
Wednesday, April 4, 2018
Resolving conflicts on a git branch
You'll need to update your branch with new commits from master, resolve those conflicts and push the updated/resolved branch to GitHub.
Resolve conflicts
I found these instructions much better than the instructions from Bitbucket involving a detached head! Thanks jaw6!
git checkout master
git pull
git checkout <branch>
git merge master
[ ... resolve any conflicts ... ]
git add [files that were conflicted]
git commit
git push
Direct Reference: https://github.com/githubteacher/github-for-developers-sept-2015/issues/648
Merge master
git checkout somebranch # gets you on branch somebranch
git fetch origin # gets you up to date with origin
git merge origin/master
Reference: https://stackoverflow.com/questions/20101994/git-pull-from-master-into-the-development-branch
If it all goes wrong somehow, reset with the following (but make sure you've committed your work beforehand, as you can lose changes with this command!)
git reset --hard
git reset --hard <SOME-COMMIT>
Reference: https://stackoverflow.com/questions/9529078/how-do-i-use-git-reset-hard-head-to-revert-to-a-previous-commit
Bonus: reverting local changes
Revert changes made to your working copy:
git checkout .
Revert changes made to the index (i.e., files that you have added). Warning: this will also reset all of your unpushed commits to master!:
git reset
Revert a change that you have committed:
git revert <commit 1> <commit 2>
Remove untracked files (e.g., new files, generated files):
git clean -f
Remove untracked files and directories (e.g., new or automatically generated directories):
git clean -fd
Reference: https://stackoverflow.com/questions/1146973/how-do-i-revert-all-local-changes-in-git-managed-project-to-previous-state