In this article, we look at accessing Snowflake through the Trino project. We'll walk through a working Trino-to-Snowflake integration and discuss what it buys you.
Previous articles on Snowflake and Trino:
- SaaS Galore: Integrating CockroachDB with Confluent Kafka, FiveTran and Snowflake
- Data federation with CockroachDB and Presto
A common question from prospective Snowflake adopters is whether it can be used alongside on-premise data and cloud platforms like Azure. In this article we address that question directly, accessing Snowflake together with on-premise data through the Trino project.
- Deploy Trino in Docker
- Get a trial Snowflake account
- Connect the dots
- Conclusion
Data integration can be daunting, particularly when you need Snowflake to coexist with on-premise environments. In this tutorial we use a Docker environment to simulate on-premise conditions, keeping things simple with standard Snowflake settings and a basic Trino Docker setup. Consult the documentation for your specific scenario, but let's start with the fundamentals.
I have a compose file called `compose-trino.yaml` with the following contents:
```yaml
services:
  trino:
    container_name: trino
    hostname: trino
    build: trino/.
    ports:
      - "8080:8080"
    environment:
      - _JAVA_OPTIONS=-Dfile.encoding=UTF-8
    volumes:
      - ./trino/catalog:/etc/trino/catalog
      - ./trino/etc:/etc/trino
```
In the current directory, I have a folder called `trino`. Within the folder, I have a `Dockerfile` with the following contents:
```dockerfile
FROM trinodb/trino:442
LABEL version="1.0"
LABEL description="trino container"
ENV REFRESHED_AT 2024_03_15
```
I also have two more folders called `etc` and `catalog`.

Within the `catalog` directory, I've set up a `snowflake.properties` file with the following contents:
```properties
connector.name=snowflake
connection-url=jdbc:snowflake://<account>.snowflakecomputing.com
connection-user=root
connection-password=secret
snowflake.account=account
snowflake.database=database
snowflake.role=role
snowflake.warehouse=warehouse
```
Once you set up the Snowflake environment, replace these placeholder values with your own. If you hit any hurdles along the way, refer to the Trino documentation for the Snowflake connector. Let's dive in!
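If you manage multiple environments, it can help to render this catalog file from a small script rather than editing it by hand. A minimal sketch (the helper name and this approach are illustrative, not part of Trino):

```python
# Render a Trino catalog .properties file from a dict of settings.
# The keys mirror the snowflake.properties entries shown above.

def render_properties(props: dict) -> str:
    """Serialize settings as key=value lines, one per property."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"

snowflake_catalog = {
    "connector.name": "snowflake",
    "connection-url": "jdbc:snowflake://<account>.snowflakecomputing.com",
    "connection-user": "root",
    "connection-password": "secret",
    "snowflake.account": "account",
    "snowflake.database": "database",
    "snowflake.role": "role",
    "snowflake.warehouse": "warehouse",
}

# Write the rendered file into the catalog directory mounted by Compose:
# with open("trino/catalog/snowflake.properties", "w") as f:
#     f.write(render_properties(snowflake_catalog))
```

This keeps credentials and account values in one place, so swapping trial accounts only means changing the dict.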
Within the `etc` directory, I have a `jvm.config` with the following contents:
```
--add-opens=java.base/java.nio=ALL-UNNAMED
-Djdk.module.illegalAccess=permit
```
These particular JDK flags are Snowflake-specific.
I also have `config.properties` with the following contents:
```properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://example.net:8080
```
and finally, `node.properties` with the following contents:
```properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/tmp/trino/data
```
With everything in place, you're ready to start the Compose environment:

```shell
docker compose -f compose-trino.yaml up -d
```
Upon successful startup, you should see a running container named `trino`. You can confirm this with `docker ps`:
```
f426506aa443   snowflake-docker-trino   "/usr/lib/trino/bin/…"   53 minutes ago   Up 47 minutes (healthy)   0.0.0.0:8080->8080/tcp   trino
```
If you encounter any issues, you can troubleshoot by examining the Trino logs with `docker logs trino`.
You can access the Trino shell inside the container with the following command:

```shell
docker exec -it trino trino
```
Once logged in, you can verify the Snowflake catalog is configured correctly:

```
trino> show catalogs;
  Catalog
-----------
 snowflake
 system
```
For the next phase of this tutorial, sign up for a Snowflake Trial Account. Opt for the Standard edition, since we won't be using enterprise features. During signup, I selected the Azure eastus2 region for my Snowflake deployment.
Upon completing the signup, you'll receive a verification email. Once verified, you'll have access to your Snowflake environment. Retrieve the credentials from the email Snowflake sends and populate the `snowflake.properties` file located in the `trino/catalog` directory.
Snowflake provides a variety of demo tutorials, including the Tasty Bytes series. For this tutorial, we'll focus on the "Load sample data with SQL from S3 bucket" worksheet. Alternatively, feel free to select a dataset of your preference.
```sql
---> set the Role
USE ROLE accountadmin;

---> set the Warehouse
USE WAREHOUSE compute_wh;

---> create the Tasty Bytes Database
CREATE OR REPLACE DATABASE tasty_bytes_sample_data;

---> create the Raw POS (Point-of-Sale) Schema
CREATE OR REPLACE SCHEMA tasty_bytes_sample_data.raw_pos;

---> create the Raw Menu Table
CREATE OR REPLACE TABLE tasty_bytes_sample_data.raw_pos.menu
(
    menu_id NUMBER(19,0),
    menu_type_id NUMBER(38,0),
    menu_type VARCHAR(16777216),
    truck_brand_name VARCHAR(16777216),
    menu_item_id NUMBER(38,0),
    menu_item_name VARCHAR(16777216),
    item_category VARCHAR(16777216),
    item_subcategory VARCHAR(16777216),
    cost_of_goods_usd NUMBER(38,4),
    sale_price_usd NUMBER(38,4),
    menu_item_health_metrics_obj VARIANT
);

---> confirm the empty Menu table exists
SELECT * FROM tasty_bytes_sample_data.raw_pos.menu;

---> create the Stage referencing the Blob location and CSV File Format
CREATE OR REPLACE STAGE tasty_bytes_sample_data.public.blob_stage
url = 's3://sfquickstarts/tastybytes/'
file_format = (type = csv);

---> query the Stage to find the Menu CSV file
LIST @tasty_bytes_sample_data.public.blob_stage/raw_pos/menu/;

---> copy the Menu file into the Menu table
COPY INTO tasty_bytes_sample_data.raw_pos.menu
FROM @tasty_bytes_sample_data.public.blob_stage/raw_pos/menu/;

---> how many rows are in the table?
SELECT COUNT(*) AS row_count FROM tasty_bytes_sample_data.raw_pos.menu;

---> what do the top 10 rows look like?
SELECT TOP 10 * FROM tasty_bytes_sample_data.raw_pos.menu;

---> what menu items does the Freezing Point brand sell?
SELECT
    menu_item_name
FROM tasty_bytes_sample_data.raw_pos.menu
WHERE truck_brand_name = 'Freezing Point';

---> what is the profit on Mango Sticky Rice?
SELECT
    menu_item_name,
    (sale_price_usd - cost_of_goods_usd) AS profit_usd
FROM tasty_bytes_sample_data.raw_pos.menu
WHERE 1=1
    AND truck_brand_name = 'Freezing Point'
    AND menu_item_name = 'Mango Sticky Rice';

---> to finish, let's extract the Mango Sticky Rice ingredients from the semi-structured column
SELECT
    m.menu_item_name,
    obj.value:"ingredients"::ARRAY AS ingredients
FROM tasty_bytes_sample_data.raw_pos.menu m,
    LATERAL FLATTEN (input => m.menu_item_health_metrics_obj:menu_item_health_metrics) obj
WHERE 1=1
    AND truck_brand_name = 'Freezing Point'
    AND menu_item_name = 'Mango Sticky Rice';
```
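If the LATERAL FLATTEN syntax is unfamiliar: it unnests the elements of a VARIANT array into rows, much like iterating over a parsed JSON document. A Python sketch with made-up sample data (the structure below only approximates the real Tasty Bytes VARIANT column):

```python
import json

# A simplified stand-in for one menu_item_health_metrics_obj VARIANT value.
raw = """
{
  "menu_item_health_metrics": [
    {"ingredients": ["glutinous rice", "mango", "coconut milk"]}
  ]
}
"""

doc = json.loads(raw)

# LATERAL FLATTEN(input => ...:menu_item_health_metrics) yields one row per
# array element; obj.value:"ingredients" then reaches into each element.
ingredients = [
    metric["ingredients"]
    for metric in doc["menu_item_health_metrics"]
]
print(ingredients)  # [['glutinous rice', 'mango', 'coconut milk']]
```

Snowflake does this unnesting inside the engine, so the array never has to leave the warehouse.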
Now that we have a dataset in Snowflake, let's pivot back to Trino and access the Snowflake data from there.
If your Compose environment is running but the catalog lacks essential Snowflake properties such as `snowflake.database` or `snowflake.warehouse`, stop the environment, fix the properties, and then restart it:
```shell
docker compose -f compose-trino.yaml down
```
Back in the `snowflake.properties` file, change the properties to:
```properties
connection-user=snowflakeuser
connection-password=snowflakepassword
snowflake.database=tasty_bytes_sample_data
snowflake.role=accountadmin
snowflake.warehouse=compute_wh
```
Restart the environment and access the Trino shell. Within the shell, type:

```sql
use snowflake.raw_pos;
```
Because the Snowflake database is fixed in the catalog properties, the database name drops out of the fully qualified name: `tasty_bytes_sample_data.raw_pos.menu` in Snowflake becomes `snowflake.raw_pos.menu` in Trino (or just `menu` after the `USE` statement above). Pick any of the queries from the Snowflake worksheet and try running them in the Trino container.
```
trino:raw_pos> SELECT COUNT(*) AS row_count FROM raw_pos.menu;
 row_count
-----------
       100
(1 row)

Query 20240315_185131_00013_45g27, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.84 [1 rows, 0B] [1 rows/s, 0B/s]
```
```
trino:raw_pos> SELECT
            ->     menu_item_name
            -> FROM raw_pos.menu
            -> WHERE truck_brand_name = 'Freezing Point';
   menu_item_name
--------------------
 Lemonade
 Sugar Cone
 Waffle Cone
 Two Scoop Bowl
 Bottled Water
 Bottled Soda
 Ice Tea
 Ice Cream Sandwich
 Mango Sticky Rice
 Popsicle
(10 rows)

Query 20240315_185212_00015_45g27, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
1.23 [10 rows, 0B] [8 rows/s, 0B/s]
```
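Beyond the CLI, the same federated queries can be issued from Python with the `trino` client package (`pip install trino`). A sketch, assuming the Compose environment above is running on localhost:8080 with the default `admin` user:

```python
def menu_count_query(catalog: str = "snowflake", schema: str = "raw_pos") -> str:
    """Build the row-count query against the federated Snowflake table."""
    return f"SELECT COUNT(*) AS row_count FROM {catalog}.{schema}.menu"

def fetch_row_count(host: str = "localhost", port: int = 8080) -> int:
    """Run the query through the Trino coordinator and return the count."""
    import trino  # imported here so the module loads without the package

    conn = trino.dbapi.connect(host=host, port=port, user="admin")
    cur = conn.cursor()
    cur.execute(menu_count_query())
    return cur.fetchone()[0]

# With the Tasty Bytes data loaded, fetch_row_count() should return 100.
```

Notice the code never mentions Snowflake credentials: Trino holds those in the catalog, so client applications only need to know the coordinator.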
Accessing Snowflake datasets from our local environment with Trino demonstrates the flexibility and interoperability of these tools: one query engine working with data across platforms, without moving the data first.
Additionally, you can access the Trino UI at http://localhost:8080. With the default configuration, no password is required and the username is `admin`. The "finished queries" section lets you review the queries you've executed, which is handy for debugging.
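The query history behind the UI is also reachable over Trino's REST interface. A hedged sketch using only the standard library (the `/v1/query` endpoint and `X-Trino-User` header come from Trino's REST API; verify both against your version's documentation):

```python
import json
import urllib.request

def queries_url(host: str = "localhost", port: int = 8080) -> str:
    """Endpoint that lists recent queries, the data behind the UI's query list."""
    return f"http://{host}:{port}/v1/query"

def list_queries(host: str = "localhost", port: int = 8080) -> list:
    """Fetch recent query metadata from a running coordinator."""
    req = urllib.request.Request(
        queries_url(host, port),
        headers={"X-Trino-User": "admin"},  # Trino identifies callers by this header
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Each entry includes the query id, its state (e.g. FINISHED), and the SQL text.
```

This is useful for scripting checks, such as alerting on failed queries, without scraping the UI.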
Trino and its commercial distribution, Starburst, are powerful tools for federating data across different sources. This article showed how easily Snowflake can be reached from local tooling with Trino, pairing cloud data warehousing with distributed query processing for a robust data management and analytics setup.