Determine the schema of the JSON files so it can be used later during table creation.
gcloud compute instances create duckdb-vs-spark \
--project=PROJECT_ID \
--zone=europe-west4-a \
--machine-type=c3-highcpu-8 \
--network-interface=network-tier=PREMIUM,nic-type=GVNIC,subnet=default \
--maintenance-policy=MIGRATE \
--provisioning-model=STANDARD \
--service-account=SERVICE_ACCOUNT \
--scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
--tags=http-server,https-server \
--create-disk=auto-delete=yes,boot=yes,device-name=duckdb-vs-spark,image=projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20230302,mode=rw,size=30,type=projects/de-zoomcamp-2023-project/zones/europe-west4-a/diskTypes/pd-balanced \
--no-shielded-secure-boot \
--shielded-vtpm \
--shielded-integrity-monitoring \
--reservation-affinity=any
$ python3
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
mkdir data
cd data
wget https://data.gharchive.org/2023-03-08-{0..23}.json.gz
...
Total wall clock time: 34s
Downloaded: 24 files, 2.2G in 27s (81.3 MB/s)
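The brace expansion `{0..23}` makes wget fetch one file per hour of the day. For reference, the same URL list can be generated in plain Python:

```python
# One GH Archive file per hour of 2023-03-08 (24 files total).
urls = [f"https://data.gharchive.org/2023-03-08-{h}.json.gz" for h in range(24)]
print(len(urls))  # 24
```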
# first install the venv package for Python
sudo apt-get update
sudo apt install python3.10-venv
# now create the environment
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install duckdb==0.7.1
duck-test.py:
import duckdb
sql = """
SELECT json_group_structure(json)
FROM (
SELECT *
FROM read_ndjson_objects('data/*.json.gz')
LIMIT 100000
)
"""
result_schema = duckdb.sql(sql).fetchall()
with open('duck.out', 'w') as f:
    f.write(result_schema[0][0])
Run:
$ time python3 duck-test.py
real 0m8.039s
user 0m14.400s
sys 0m4.117s
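Before reaching for an engine at all, the top-level keys of an NDJSON file can be eyeballed with the standard library alone. A minimal sketch (the gzipped sample below is synthetic, standing in for one of the gharchive files):

```python
import gzip
import io
import json

# Synthetic stand-in for one gharchive file: two NDJSON records.
raw = b'{"id": "1", "type": "PushEvent"}\n{"id": "2", "type": "ForkEvent"}\n'
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(raw)
buf.seek(0)

# Collect the union of top-level keys across records.
keys = set()
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    for line in gz:
        keys |= json.loads(line).keys()
print(sorted(keys))  # ['id', 'type']
```

This only sees top-level keys, of course; `json_group_structure` recurses into nested objects, which is why it is the tool of choice here.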
First install Java:
wget https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz
tar xzfv openjdk-11.0.2_linux-x64_bin.tar.gz
export JAVA_HOME="${HOME}/jdk-11.0.2"
export PATH="${JAVA_HOME}/bin:${PATH}"
Checking:
$ java --version
openjdk 11.0.2 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
Now install Spark:
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
tar xzfv spark-3.3.2-bin-hadoop3.tgz
export SPARK_HOME="${HOME}/spark-3.3.2-bin-hadoop3"
export PATH="${SPARK_HOME}/bin:${PATH}"
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip:$PYTHONPATH"
Checking:
$ pyspark
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/10 09:08:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.3.2
/_/
Using Python version 3.10.6 (main, Nov 14 2022 16:10:14)
Spark context Web UI available at http://duckdb-vs-spark.europe-west4-a.c.de-zoomcamp-2023-project.internal:4040
Spark context available as 'sc' (master = local[*], app id = local-1678439315153).
SparkSession available as 'spark'.
>>>
spark-test.py:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[*]") \
.appName('Testing Spark') \
.getOrCreate()
df = spark.read.json('data/*.json.gz')
df.printSchema()
Checking:
$ time spark-submit spark-test.py > spark.out
real 0m36.675s
user 3m57.297s
sys 0m1.910s
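Note the gap between user and real time: Spark's schema inference scans the data in parallel across the machine's vCPUs. A quick sanity check on the numbers above:

```python
# Rough parallelism estimate from the spark-submit timings (pure arithmetic).
real = 36.675                # wall-clock seconds
user = 3 * 60 + 57.297      # total CPU seconds across all cores
print(round(user / real, 1))  # ~6.5 cores busy on average on the c3-highcpu-8
```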
Download DuckDB:
wget https://github.com/duckdb/duckdb/releases/download/v0.7.1/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
Script (script.sql):
SELECT json_group_structure(json)
FROM (
SELECT *
FROM read_ndjson_objects('data/*.json.gz')
LIMIT 100000
);
Checking:
time ./duckdb -json -s "$(cat script.sql)"
real 0m7.462s
user 0m15.735s
sys 0m4.032s
DuckDB was only able to process about 2% of the data (100K out of ~4.7M rows) in roughly 7 seconds.
Local Spark processed 100% of the data in about 36 seconds.
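A back-of-the-envelope throughput comparison, assuming the ~4.7M-row total stated above (the two runs are not strictly apples-to-apples, since DuckDB stopped at the LIMIT):

```python
# Rows per second implied by the measured wall-clock times.
duck_rows, duck_s = 100_000, 7.462      # DuckDB CLI, LIMIT 100000
spark_rows, spark_s = 4_700_000, 36.675  # Spark, full scan
print(round(duck_rows / duck_s))    # ~13K rows/s
print(round(spark_rows / spark_s))  # ~128K rows/s
```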