@Yyukan
Created April 25, 2018 12:35
JSON and CSV file diff using Spark
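import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object Diff extends App {
  val jsonInput = "file.json"
  val csvInput = "file.csv"

  // Run locally on 8 threads; the app name is taken from this config.
  val config: SparkConf = new SparkConf()
    .setAppName("Step diff")
    .setMaster("local[8]")

  val spark = SparkSession
    .builder()
    .config(config)
    .getOrCreate()

  val csvIds: DataFrame = spark.read.csv(csvInput)
  val jsonIds: DataFrame = spark.read.json(jsonInput)

  // Rows present in the CSV but missing from the JSON. `except` matches
  // columns by position, so both inputs need the same number of columns.
  val diff: Dataset[Row] = csvIds except jsonIds

  // saveAsTextFile creates a directory named "diff.txt";
  // coalesce(1) keeps the result in a single part file inside it.
  diff.rdd.coalesce(1).saveAsTextFile("diff.txt")

  spark.stop()
}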
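Since `except` compares whole rows by column position, it can help to project both inputs down to a single, identically typed id column before diffing. The following is only a sketch under assumptions: the column name `id`, the helper `missingIds`, and the in-memory stand-ins for the CSV and JSON files are all hypothetical and not part of the original gist.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ExceptSketch {
  // Hypothetical helper: ids found in `left` but not in `right`.
  // Assumes both frames carry the id in a column named `idColumn`.
  def missingIds(left: DataFrame, right: DataFrame, idColumn: String = "id"): Seq[String] = {
    // Cast both sides to string so the comparison does not depend on
    // how each reader inferred the column type.
    val l = left.select(col(idColumn).cast("string").as("id"))
    val r = right.select(col(idColumn).cast("string").as("id"))
    // `except` returns distinct rows of `l` that do not appear in `r`.
    l.except(r).collect().map(_.getString(0)).sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("except sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val csvLike = Seq("1", "2", "3", "3").toDF("id") // stand-in for the CSV ids (strings)
    val jsonLike = Seq(2L, 4L).toDF("id")            // stand-in for the JSON ids (longs)

    println(missingIds(csvLike, jsonLike).mkString(", "))
    spark.stop()
  }
}
```

With these sample values the sketch should print `1, 3`: the duplicate `3` collapses because `except` has distinct semantics, and `4` is ignored because it only exists on the right-hand side.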
Yyukan commented Apr 25, 2018

scalaVersion := "2.11.8"

val sparkVersion = "2.1.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion
)
