@SlavikBaranov
Created April 6, 2016 08:25
/*
1. Download spark-csv & apache commons-csv
http://mvnrepository.com/artifact/com.databricks/spark-csv_2.10/1.3.0
http://mvnrepository.com/artifact/org.apache.commons/commons-csv/1.2
2. Run spark-shell with command:
spark-shell --jars /<path to>/spark-csv_2.10-1.3.0.jar,/<path to>/commons-csv-1.2.jar
*/
// Import SaveMode (used below to overwrite any existing output)
import org.apache.spark.sql.SaveMode
// Read parquet directory & register a table
val df = sqlContext.read.parquet("<path to parquet>")
df.registerTempTable("df")
// Print schema
df.printSchema()
// Run SQL queries & output result to console
sqlContext.sql("SELECT COUNT(DISTINCT userId) FROM df").show()
// Create a data frame to output result
val res = sqlContext.sql("SELECT userId, numFriends FROM df WHERE numFriends < 10")
// Make sure the result is not too big: it is collapsed into a single partition before saving
res.count
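// Optional guard, a minimal sketch: fail early instead of checking the count by eye.
// maxRows is a hypothetical threshold (not part of the original script); pick a value
// that suits your cluster and the size of a row.
val maxRows = 1000000L
val rowCount = res.count
require(rowCount <= maxRows, s"Result has $rowCount rows (limit $maxRows); too big for a single partition")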
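// Save result as CSV; repartition(1) produces a single part file inside the output directory.
// The fully qualified source name "com.databricks.spark.csv" is used here because it resolves
// with any spark-csv release, whether or not the short name "csv" is registered.
res.repartition(1).write.format("com.databricks.spark.csv").mode(SaveMode.Overwrite).save("<path to a directory with CSV>")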
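// Variant, as a sketch: the save above writes no header row. spark-csv accepts a
// "header" write option that emits the column names as the first line.
// The output path below is a separate placeholder, introduced only for illustration.
res.repartition(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true") // write column names (userId, numFriends) as the first line
  .mode(SaveMode.Overwrite)
  .save("<path to a directory with CSV and header>")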