@ganeshan
Forked from mneedham/parquet-cli.sh
Created August 12, 2024 14:26
An intro to Apache Parquet
# The NYC Taxis Dataset - https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
pip install parquet-cli
parq data/yellow_tripdata_2022-01.parquet
parq data/yellow_tripdata_2022-01.parquet --schema
parq data/yellow_tripdata_2022-01.parquet --head 10
parq data/yellow_tripdata_2022-01.parquet --tail 10
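# Inspect the same file from Python with PyArrow (pip install pyarrow pandas)
import pyarrow.parquet as pq
file = pq.ParquetFile("data/yellow_tripdata_2022-01.parquet")
file.metadata  # file-level metadata: row count, row groups, column count, writer
file.schema    # column names and types
# Load the whole file into a pandas DataFrame
df = file.read().to_pandas()
# Export to row-oriented text formats for a size comparison
df.to_csv("trips.csv")
df.to_json("trips.json", orient="records", lines=True)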
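# Compare file sizes. `stat -f %z` is the BSD/macOS form; with GNU coreutils (Linux) use `stat -c %s`.
# `numfmt` is part of GNU coreutils; on macOS it is usually installed via Homebrew as `gnumfmt`.
stat -f %z data/yellow_tripdata_2022-01.parquet | numfmt --to=iec
stat -f %z trips.csv | numfmt --to=iec
stat -f %z trips.json | numfmt --to=iec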
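# The stat/numfmt size check above depends on the platform's flavour of `stat`.
# A portable alternative is sketched below using only the Python standard library.
# `human_size` and `report_sizes` are hypothetical helper names, not part of the
# original gist, and the formatting only approximates `numfmt --to=iec`
# (rounding may differ slightly).

```python
import os


def human_size(num_bytes: int) -> str:
    """Format a byte count with binary (IEC) suffixes, e.g. 1536 -> '1.5K'."""
    size = float(num_bytes)
    for suffix in ("", "K", "M", "G", "T"):
        if size < 1024 or suffix == "T":
            # Plain bytes are shown as an integer, larger units with one decimal.
            return f"{int(size)}" if suffix == "" else f"{size:.1f}{suffix}"
        size /= 1024


def report_sizes(paths):
    """Return {path: human-readable size} for every path that exists."""
    return {p: human_size(os.path.getsize(p)) for p in paths if os.path.exists(p)}


if __name__ == "__main__":
    # Same three files as in the shell commands above; missing files are skipped.
    files = ["data/yellow_tripdata_2022-01.parquet", "trips.csv", "trips.json"]
    for path, size in report_sizes(files).items():
        print(f"{path}: {size}")
```

# Run it from the same directory as the commands above, after the CSV and JSON
# exports have been written.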