Thursday, October 22, 2015

Analyse consecutive timeseries pairs in Spark

With time series data it is often useful to construct pairs of consecutive elements for analysis, for example to compute the gap or change between neighbouring observations.

Here is a way to construct a Spark RDD from a time series so that consecutive elements end up together as pairs in the final RDD:


val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E"))
val rdd = sc.parallelize(arr)
// Sort by key (timestamp) so elements are ordered in time
val sorted = rdd.sortByKey(true)
// Attach each element's position, then key the RDD by that index
val zipped = sorted.zipWithIndex.map(x => (x._2, x._1))
// Shift the indexed RDD by one and self-join: index i meets index i+1
val pairs = zipped.join(zipped.map(x => (x._1 - 1, x._2))).sortBy(_._1)


This produces the consecutive elements as pairs in the RDD, ready for further processing:
(0,((1,A),(3,B)))
(1,((3,B),(7,C)))
(2,((7,C),(8,D)))
(3,((8,D),(9,E)))
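As one example of further processing, the pairs can be mapped to the gap between consecutive keys (a minimal sketch, assuming the pairs RDD built above and treating the integer key as a timestamp):

val gaps = pairs.map { case (_, ((t1, v1), (t2, v2))) => ((v1, v2), t2 - t1) }
gaps.collect().foreach(println)
// ((A,B),2), ((B,C),4), ((C,D),1), ((D,E),1)

The same pattern works for computing deltas, rates of change, or flagging gaps above a threshold.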


Ref: https://www.mail-archive.com/user@spark.apache.org/msg39353.html
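As an alternative worth noting, Spark's MLlib includes a sliding-window helper on RDDs that yields the same consecutive windows without the explicit self-join. A sketch, assuming spark-mllib is on the classpath (the implicit conversion comes from RDDFunctions):

import org.apache.spark.mllib.rdd.RDDFunctions._

// sliding(2) produces an RDD of consecutive windows, each an Array of 2 elements
val windows = sorted.sliding(2).map { case Array(a, b) => (a, b) }
windows.collect().foreach(println)

Like the join approach, this requires the RDD to be sorted first; it avoids the shuffle of the join, at the cost of copying a few elements across partition boundaries.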