@oleg-agapov
Last active March 10, 2023 11:01

Problem

Infer the schema of a set of JSON files so that it can be used during table creation.
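The input is newline-delimited JSON (one event object per line). As a toy illustration of what inference must handle (fields and values here are hypothetical, not real GH Archive records), a naive pass can collect the union of top-level keys; real inference (DuckDB's json_group_structure, Spark's spark.read.json) also assigns types and recurses into nested objects:

```python
import json

# Two toy NDJSON lines in the spirit of GH Archive events (hypothetical fields)
lines = [
    '{"id": "1", "type": "PushEvent", "actor": {"login": "alice"}}',
    '{"id": "2", "type": "ForkEvent", "actor": {"login": "bob"}, "org": {"login": "acme"}}',
]

# Union of top-level keys: every key the inferred schema must cover,
# even ones (like "org") that appear in only some records
keys = set()
for line in lines:
    keys |= json.loads(line).keys()
print(sorted(keys))
```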

0. Create a VM in Google Cloud

gcloud compute instances create duckdb-vs-spark \
  --project=PROJECT_ID \
  --zone=europe-west4-a \
  --machine-type=c3-highcpu-8 \
  --network-interface=network-tier=PREMIUM,nic-type=GVNIC,subnet=default \
  --maintenance-policy=MIGRATE \
  --provisioning-model=STANDARD \
  --service-account=SERVICE_ACCOUNT \
  --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
  --tags=http-server,https-server \
  --create-disk=auto-delete=yes,boot=yes,device-name=duckdb-vs-spark,image=projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20230302,mode=rw,size=30,type=projects/de-zoomcamp-2023-project/zones/europe-west4-a/diskTypes/pd-balanced \
  --no-shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring \
  --reservation-affinity=any

1. Check Python version

$ python3
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

2. Download sample data

mkdir data
cd data
wget https://data.gharchive.org/2023-03-08-{0..23}.json.gz

...
Total wall clock time: 34s
Downloaded: 24 files, 2.2G in 27s (81.3 MB/s)
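The row counts quoted in the results below can be reproduced with a short line count over the archives (each line is one event). A minimal sketch; it synthesizes a tiny .json.gz file so it is self-contained, but pointing `pattern` at `data/*.json.gz` counts the real files:

```python
import glob
import gzip
import os
import tempfile

# Synthesize one small archive so the example runs anywhere
tmpdir = tempfile.mkdtemp()
sample = os.path.join(tmpdir, "2023-03-08-0.json.gz")
with gzip.open(sample, "wt") as f:
    f.write('{"type": "PushEvent"}\n{"type": "ForkEvent"}\n')

# Count lines (= events) across every matching archive
pattern = os.path.join(tmpdir, "*.json.gz")  # use "data/*.json.gz" for real data
total = sum(1 for path in glob.glob(pattern)
            for _ in gzip.open(path, "rt"))
print(total)
```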

3. Create venv for DuckDB

# first install venv package for python
sudo apt-get update
sudo apt install python3.10-venv

# now create the environment
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install duckdb==0.7.1 

4. Testing DuckDB

duck-test.py:

import duckdb

# Infer the JSON structure from a 100K-row sample of the archives
sql = """
SELECT json_group_structure(json)
FROM (
  SELECT *
  FROM read_ndjson_objects('data/*.json.gz')
  LIMIT 100000
)
"""

result_schema = duckdb.sql(sql).fetchall()

# The query returns a single row holding the schema as a JSON string
with open('duck.out', 'w') as f:
    f.write(result_schema[0][0])

Run:

$ time python3 duck-test.py 

real    0m8.039s
user    0m14.400s
sys     0m4.117s
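The JSON string in duck.out maps field names to type names (with nested objects and arrays), so it can be turned into a CREATE TABLE statement mechanically. A sketch, using a small hypothetical structure rather than the real (much larger) GH Archive output:

```python
import json

# Hypothetical sample of the kind of JSON json_group_structure emits
structure = json.loads(
    '{"id": "VARCHAR", "public": "BOOLEAN",'
    ' "actor": {"id": "BIGINT", "login": "VARCHAR"},'
    ' "payload": {"sizes": ["BIGINT"]}}'
)

def to_duckdb_type(node):
    """Map a structure node to a DuckDB column type expression."""
    if isinstance(node, dict):   # nested object -> STRUCT(...)
        inner = ", ".join(f"{k} {to_duckdb_type(v)}" for k, v in node.items())
        return f"STRUCT({inner})"
    if isinstance(node, list):   # array -> element type + []
        return f"{to_duckdb_type(node[0])}[]"
    return node                  # leaf: already a type name like VARCHAR

columns = ",\n  ".join(f"{k} {to_duckdb_type(v)}" for k, v in structure.items())
ddl = f"CREATE TABLE gh_events (\n  {columns}\n);"
print(ddl)
```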

5. Installing Spark

First install Java:

wget https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz
tar xzfv openjdk-11.0.2_linux-x64_bin.tar.gz
export JAVA_HOME="${HOME}/jdk-11.0.2"
export PATH="${JAVA_HOME}/bin:${PATH}"

Checking:

$ java --version
openjdk 11.0.2 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)

Now install Spark:

wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
tar xzfv spark-3.3.2-bin-hadoop3.tgz

export SPARK_HOME="${HOME}/spark-3.3.2-bin-hadoop3"
export PATH="${SPARK_HOME}/bin:${PATH}"

export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip:$PYTHONPATH"

Checking:

$ pyspark
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/10 09:08:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/

Using Python version 3.10.6 (main, Nov 14 2022 16:10:14)
Spark context Web UI available at http://duckdb-vs-spark.europe-west4-a.c.de-zoomcamp-2023-project.internal:4040
Spark context available as 'sc' (master = local[*], app id = local-1678439315153).
SparkSession available as 'spark'.
>>> 

6. Testing Spark

spark-test.py:

from pyspark.sql import SparkSession

# Create a local Spark session using all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('Testing Spark') \
    .getOrCreate()

# With no explicit schema, Spark infers one by scanning the data
df = spark.read.json('data/*.json.gz')

df.printSchema()

Checking:

$ time spark-submit spark-test.py > spark.out

real    0m36.675s
user    3m57.297s
sys     0m1.910s

(Bonus) 7. Trying out DuckDB without Python

Download DuckDB:

wget https://github.com/duckdb/duckdb/releases/download/v0.7.1/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip

Script:

SELECT json_group_structure(json)
FROM (
  SELECT *
  FROM read_ndjson_objects('data/*.json.gz')
  LIMIT 100000
);

Checking:

time ./duckdb -json -s "$(cat script.sql)"

real    0m7.462s
user    0m15.735s
sys     0m4.032s

Results

DuckDB processed only a ~2% sample of the data (100K out of ~4.7M rows), taking about 7 seconds.

Local Spark processed 100% of the data in about 36 seconds.
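A back-of-envelope throughput comparison from the timings above (row counts and seconds are approximate, taken from the output shown earlier) makes the asymmetry of the two runs explicit:

```python
# Approximate figures from the runs above
duckdb_rows, duckdb_secs = 100_000, 7.5      # ~2% sample (LIMIT 100000)
spark_rows, spark_secs = 4_700_000, 36.7     # full dataset

duckdb_rate = duckdb_rows / duckdb_secs      # rows per second
spark_rate = spark_rows / spark_secs

print(f"DuckDB sample:   {duckdb_rate:,.0f} rows/s")
print(f"Spark full scan: {spark_rate:,.0f} rows/s")
```

Note the runs measure different things: DuckDB stopped after a sample, while Spark scanned everything, so the per-row rates are not directly comparable as engine benchmarks.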
