@kjsingh
Created August 6, 2019 22:52
Split RDDs based on index
val rdd = sc.parallelize(1 to 100)
//rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
val max = 3
//max: Int = 3
var multiRdds:List[org.apache.spark.rdd.RDD[Int]] = Nil
//multiRdds: List[org.apache.spark.rdd.RDD[Int]] = List()
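// zipWithIndex pairs each element with its position; keep the elements whose index % max == i.
// Each new RDD is prepended, so multiRdds ends up in reverse order (i = 2, 1, 0).
for (i <- 0 until max) { multiRdds = rdd.zipWithIndex.filter(_._2 % max == i).map(_._1) :: multiRdds }
// Print each split on its own line.
for (r <- multiRdds) println(r.collect.mkString(" "))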
//3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99
//2 5 8 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53 56 59 62 65 68 71 74 77 80 83 86 89 92 95 98
//1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 67 70 73 76 79 82 85 88 91 94 97 100
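The loop above calls zipWithIndex once per split, so the source RDD is indexed `max` times. A minimal sketch of one possible alternative follows, assuming the same spark-shell session (so `sc` is in scope). The helper name `splitByIndex` and the `parts` value are hypothetical and not part of the original gist. It indexes once, caches the indexed RDD, and returns the splits in ascending order of `i` instead of reversed.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper (not from the gist): split `rdd` into `n` RDDs,
// where split i holds the elements whose position satisfies index % n == i.
def splitByIndex[T: ClassTag](rdd: RDD[T], n: Int): Seq[RDD[T]] = {
  require(n > 0, "n must be positive")
  // Index once and cache, so all n filters reuse the same indexed data.
  val indexed = rdd.zipWithIndex.cache()
  (0 until n).map { i =>
    indexed
      .filter { case (_, idx) => idx % n == i } // keep positions in bucket i
      .map { case (value, _) => value }         // drop the index again
  }
}

// Assumes `sc` is the SparkContext provided by spark-shell.
val parts = splitByIndex(sc.parallelize(1 to 100), 3)
parts.foreach(p => println(p.collect.mkString(" ")))
```

With this sketch the first printed line starts with 1 4 7, since the splits are kept in bucket order. Whether caching pays off depends on the size of the data and the memory available, so treat it as a design choice to measure, not a guaranteed improvement.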