Accessing Cloud-Based Hubverse Data

Cloud-Based Hubs

The Hubverse is in the process of making hub data available via publicly accessible AWS S3 buckets.

Cloud-based hubs "mirror" the data stored in a hub's GitHub repository and provide a few advantages for data consumers:

  • No need to clone a repository to access data
  • Cloud-based model-output files are in parquet format, which is more compact and faster to query than csv

The examples here use the CDC's FluSight Forecast Hub, which is available in the following S3 bucket:

s3://cdcepi-flusight-forecast-hub

Other files in the gist provide examples of accessing cloud-based hub data via DuckDB, Python, and R.

Alternatively, you can use the AWS CLI to copy the files from S3 to your local machine for use with your favorite data analysis software.

List the available directories in the hub's S3 bucket:

aws s3 ls s3://cdcepi-flusight-forecast-hub/ --no-sign-request
                           PRE auxiliary-data/
                           PRE hub-config/
                           PRE model-metadata/
                           PRE model-output/
                           PRE raw/
                           PRE target-data/

Copy files from a specific directory (in this case, model-output) to a local machine:

aws s3 cp s3://cdcepi-flusight-forecast-hub/model-output/ . --recursive --no-sign-request

Using SQL to access S3-based model-output files

Setup

Install the DuckDB CLI: https://duckdb.org/docs/installation/?version=stable&environment=cli

Query the model-output parquet files directly from S3

  1. Open the DuckDB CLI:

    duckdb
  2. Install the httpfs extension that allows cloud-based data access:

    INSTALL httpfs;
  3. Override the http_keep_alive setting to prevent errors when accessing many S3 files simultaneously:

    SET http_keep_alive=false;

Ad-hoc query

Run a regular SQL query, specifying the S3 location in the FROM clause (see DuckDB's read_parquet documentation for more information):

SELECT reference_date, target, count(*) as num_submissions
FROM read_parquet('s3://cdcepi-flusight-forecast-hub/model-output/*/*.parquet')
WHERE location = 'US' and output_type = 'quantile'
GROUP BY ALL;

Note: to query data for a specific model, it is faster to add the model_id to the S3 path in the FROM clause than to add the model_id to the WHERE clause (because the database engine reads far less data from S3).

For example:

SELECT reference_date, target, count(*) as num_submissions
FROM read_parquet('s3://cdcepi-flusight-forecast-hub/model-output/Umass-flusion/*.parquet')
WHERE location = 'US' and output_type = 'quantile'
GROUP BY ALL;

RECOMMENDED: Creating an in-memory table for faster interactive exploration

Pulling large chunks of data from S3 can take a while. If you plan to run repeated queries, you can create an in-memory table so you only need to pull the data from S3 once.

CREATE TABLE model_output AS
SELECT *
FROM read_parquet('s3://cdcepi-flusight-forecast-hub/model-output/*/*.parquet'); 

Then you can query the in-memory model_output table directly:

SELECT reference_date, target, count(*) as num_submissions
FROM model_output
WHERE location = 'US' and output_type = 'quantile'
GROUP BY ALL;

Saving query output

Detailed information: https://duckdb.org/docs/api/cli/dot_commands#output-writing-results-to-a-file

For example, saving query results to a .csv:

.mode csv
.once model_output_data.csv

SELECT reference_date, target, count(*) as num_submissions
-- this example uses a previously-created in-memory table
FROM model_output
WHERE location = 'US' and output_type = 'quantile'
GROUP BY ALL;

Additional information about using the DuckDB CLI: https://duckdb.org/docs/api/cli/overview.html

Using Python to access S3-based model-output files

These are high-level examples of using pyarrow and pandas to read a hub's model-output files from S3. Both libraries support efficiency-boosting operations like reading specific partitions and filtering data, so it's worth consulting their documentation for details.

pyarrow

from pyarrow import fs
import pyarrow.parquet as pq

hub_bucket_name = 'cdcepi-flusight-forecast-hub'
model_output_path = 'model-output'
bucket_uri = f'{hub_bucket_name}/{model_output_path}'

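# note: if you don't have AWS credentials configured, pass anonymous=True here
# for unsigned access to the public bucket (the equivalent of --no-sign-request)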
s3 = fs.S3FileSystem(region="us-east-1")

table = pq.read_table(bucket_uri, filesystem=s3)
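
pyarrow can also push row filters down to the parquet reader so that less data is transferred from S3. A minimal sketch, reusing the bucket_uri and s3 objects from above and the same location/output_type filters used in the DuckDB examples:

# read only the rows matching the filters; pyarrow pushes them down
# to the parquet reader, so less data is scanned and transferred
table = pq.read_table(
    bucket_uri,
    filesystem=s3,
    filters=[("location", "=", "US"), ("output_type", "=", "quantile")],
)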

pandas

import pandas as pd

hub_bucket_name = 'cdcepi-flusight-forecast-hub'
model_output_path = 'model-output'
bucket_uri = f's3://{hub_bucket_name}/{model_output_path}'

table = pd.read_parquet(bucket_uri)
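
pandas accepts similar options (reading s3:// paths requires the s3fs package). A minimal sketch, assuming the standard hubverse model-output columns: columns and filters are forwarded to the pyarrow engine, and storage_options={"anon": True} (an s3fs option) requests unsigned access if you don't have AWS credentials configured:

# selecting columns and filtering rows reduces how much data is pulled from S3;
# "anon": True makes unsigned requests to the public bucket
table = pd.read_parquet(
    bucket_uri,
    columns=["reference_date", "target", "location", "output_type", "value"],
    filters=[("location", "=", "US"), ("output_type", "=", "quantile")],
    storage_options={"anon": True},
)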

Using R to access S3-based model-output files

hubData package

hubData is a hubverse-maintained R package that presents model-output data as a tibble.

To access the model-output data:

library(dplyr)
library(hubData)

hub_path_cloud <- s3_bucket('cdcepi-flusight-forecast-hub/')
data_cloud <- connect_hub(hub_path_cloud, file_format="parquet") %>%
  filter(output_type == "quantile", location == "US") %>%
  collect_hub()

To access target data:

library(aws.s3)
target_data <- aws.s3::s3read_using(read.csv, object="s3://cdcepi-flusight-forecast-hub/target-data/target-hospital-admissions.csv")