In this article, we look at accessing Snowflake through the Trino project. We'll walk through a working Trino-to-Snowflake integration and discuss what it buys you.
Previous articles on Snowflake and Trino:
- SaaS Galore: Integrating CockroachDB with Confluent Kafka, FiveTran and Snowflake
- Data federation with CockroachDB and Presto
A common question from prospective Snowflake adopters is whether it can be used alongside on-premise data and cloud platforms like Azure. In this article we address that question directly, accessing Snowflake together with on-premise data through the Trino project.
- Deploy Trino in Docker
- Get a trial Snowflake account
- Connect the dots
- Conclusion
Data integration can be daunting, particularly when you need Snowflake to coexist with on-premise environments. In this tutorial we use a Docker environment to simulate on-premise conditions, keeping things simple with standard Snowflake settings and a basic Trino Docker setup. Consult the documentation for your specific scenario, but let's start with the fundamentals.
I have a compose file called `compose-trino.yaml` with the following contents:
```yaml
services:
  trino:
    container_name: trino
    hostname: trino
    build: trino/.
    ports:
      - "8080:8080"
    environment:
      - _JAVA_OPTIONS=-Dfile.encoding=UTF-8
    volumes:
      - ./trino/catalog:/etc/trino/catalog
      - ./trino/etc:/etc/trino
```
In the current directory, I have a folder called `trino`. Within the folder, I have a `Dockerfile` with the following contents:
```dockerfile
FROM trinodb/trino:442
LABEL version="1.0"
LABEL description="trino container"
ENV REFRESHED_AT 2024_03_15
```
I also have two more folders called `etc` and `catalog`.

Within the `catalog` directory, I've set up a `snowflake.properties` file with the following contents:
```properties
connector.name=snowflake
connection-url=jdbc:snowflake://<account>.snowflakecomputing.com
connection-user=root
connection-password=secret
snowflake.account=account
snowflake.database=database
snowflake.role=role
snowflake.warehouse=warehouse
```
Once you set up the Snowflake environment, replace these placeholder values with your own. If you hit any hurdles along the way, refer to the Trino documentation for the Snowflake connector. Let's dive in!
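If you manage multiple environments, it can help to render this catalog file from a small script rather than editing it by hand. A minimal sketch (the helper name and this approach are illustrative, not part of Trino):

```python
# Render a Trino catalog .properties file from a dict of settings.
# The keys mirror the snowflake.properties entries shown above.

def render_properties(props: dict) -> str:
    """Serialize settings as key=value lines, one per property."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"

snowflake_catalog = {
    "connector.name": "snowflake",
    "connection-url": "jdbc:snowflake://<account>.snowflakecomputing.com",
    "connection-user": "root",
    "connection-password": "secret",
    "snowflake.account": "account",
    "snowflake.database": "database",
    "snowflake.role": "role",
    "snowflake.warehouse": "warehouse",
}

# Write the rendered file into the catalog directory mounted by Compose:
# with open("trino/catalog/snowflake.properties", "w") as f:
#     f.write(render_properties(snowflake_catalog))
```

This keeps credentials and account values in one place, so swapping trial accounts only means changing the dict.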
Within the `etc` directory, I have a `jvm.config` with the following contents:
```
--add-opens=java.base/java.nio=ALL-UNNAMED
-Djdk.module.illegalAccess=permit
```
These particular JDK flags are Snowflake-specific.
I also have `config.properties` with the following contents:
```properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://example.net:8080
```
and finally, `node.properties` with the following contents:
```properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/tmp/trino/data
```
With everything in place, you're ready to start the Compose environment:

```shell
docker compose -f compose-trino.yaml up -d
```
Upon successful startup, you should see a running container named `trino`. You can confirm this with `docker ps`:
```
f426506aa443   snowflake-docker-trino   "/usr/lib/trino/bin/…"   53 minutes ago   Up 47 minutes (healthy)   0.0.0.0:8080->8080/tcp   trino
```
If you encounter any issues, you can troubleshoot by examining the Trino logs with `docker logs trino`.
You can access the Trino shell inside the container with the following command:

```shell
docker exec -it trino trino
```
Once logged in, you can verify the Snowflake catalog is configured correctly:

```
trino> show catalogs;
  Catalog
-----------
 snowflake
 system
```
For the next phase of this tutorial, sign up for a Snowflake Trial Account. Opt for the Standard edition, since we won't be using enterprise features. During signup, I selected the Azure eastus2 region for my Snowflake deployment.
Upon completing the signup, you'll receive a verification email. Once verified, you'll have access to your Snowflake environment. Retrieve the credentials from the email Snowflake sends and populate the `snowflake.properties` file located in the `trino/catalog` directory.
Snowflake provides a variety of demo tutorials, including the Tasty Bytes series. For this tutorial, we'll focus on the "Load sample data with SQL from S3 bucket" worksheet. Alternatively, feel free to select a dataset of your preference.
```sql
---> set the Role
USE ROLE accountadmin;

---> set the Warehouse
USE WAREHOUSE compute_wh;

---> create the Tasty Bytes Database
CREATE OR REPLACE DATABASE tasty_bytes_sample_data;

---> create the Raw POS (Point-of-Sale) Schema
CREATE OR REPLACE SCHEMA tasty_bytes_sample_data.raw_pos;

---> create the Raw Menu Table
CREATE OR REPLACE TABLE tasty_bytes_sample_data.raw_pos.menu
(
    menu_id NUMBER(19,0),
    menu_type_id NUMBER(38,0),
    menu_type VARCHAR(16777216),
    truck_brand_name VARCHAR(16777216),
    menu_item_id NUMBER(38,0),
    menu_item_name VARCHAR(16777216),
    item_category VARCHAR(16777216),
    item_subcategory VARCHAR(16777216),
    cost_of_goods_usd NUMBER(38,4),
    sale_price_usd NUMBER(38,4),
    menu_item_health_metrics_obj VARIANT
);

---> confirm the empty Menu table exists
SELECT * FROM tasty_bytes_sample_data.raw_pos.menu;

---> create the Stage referencing the Blob location and CSV File Format
CREATE OR REPLACE STAGE tasty_bytes_sample_data.public.blob_stage
url = 's3://sfquickstarts/tastybytes/'
file_format = (type = csv);

---> query the Stage to find the Menu CSV file
LIST @tasty_bytes_sample_data.public.blob_stage/raw_pos/menu/;

---> copy the Menu file into the Menu table
COPY INTO tasty_bytes_sample_data.raw_pos.menu
FROM @tasty_bytes_sample_data.public.blob_stage/raw_pos/menu/;

---> how many rows are in the table?
SELECT COUNT(*) AS row_count FROM tasty_bytes_sample_data.raw_pos.menu;

---> what do the top 10 rows look like?
SELECT TOP 10 * FROM tasty_bytes_sample_data.raw_pos.menu;

---> what menu items does the Freezing Point brand sell?
SELECT
    menu_item_name
FROM tasty_bytes_sample_data.raw_pos.menu
WHERE truck_brand_name = 'Freezing Point';

---> what is the profit on Mango Sticky Rice?
SELECT
    menu_item_name,
    (sale_price_usd - cost_of_goods_usd) AS profit_usd
FROM tasty_bytes_sample_data.raw_pos.menu
WHERE 1=1
    AND truck_brand_name = 'Freezing Point'
    AND menu_item_name = 'Mango Sticky Rice';

---> to finish, let's extract the Mango Sticky Rice ingredients from the semi-structured column
SELECT
    m.menu_item_name,
    obj.value:"ingredients"::ARRAY AS ingredients
FROM tasty_bytes_sample_data.raw_pos.menu m,
    LATERAL FLATTEN (input => m.menu_item_health_metrics_obj:menu_item_health_metrics) obj
WHERE 1=1
    AND truck_brand_name = 'Freezing Point'
    AND menu_item_name = 'Mango Sticky Rice';
```
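If the LATERAL FLATTEN syntax is unfamiliar: it unnests the elements of a VARIANT array into rows, much like iterating over a parsed JSON document. A Python sketch with made-up sample data (the structure below only approximates the real Tasty Bytes VARIANT column):

```python
import json

# A simplified stand-in for one menu_item_health_metrics_obj VARIANT value.
raw = """
{
  "menu_item_health_metrics": [
    {"ingredients": ["glutinous rice", "mango", "coconut milk"]}
  ]
}
"""

doc = json.loads(raw)

# LATERAL FLATTEN(input => ...:menu_item_health_metrics) yields one row per
# array element; obj.value:"ingredients" then reaches into each element.
ingredients = [
    metric["ingredients"]
    for metric in doc["menu_item_health_metrics"]
]
print(ingredients)  # [['glutinous rice', 'mango', 'coconut milk']]
```

Snowflake does this unnesting inside the engine, so the array never has to leave the warehouse.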
Now that we have a dataset in Snowflake, let's pivot back to Trino and access the Snowflake data from there.
If your Compose environment is running but the catalog lacks essential Snowflake properties such as `snowflake.database` or `snowflake.warehouse`, stop the environment, fix the properties, and then restart it:
```shell
docker compose -f compose-trino.yaml down
```
Back in the `snowflake.properties` file, change the properties to:
```properties
connection-user=snowflakeuser
connection-password=snowflakepassword
snowflake.database=tasty_bytes_sample_data
snowflake.role=accountadmin
snowflake.warehouse=compute_wh
```
Restart the environment and access the Trino shell. Within the shell, type:

```sql
use snowflake.raw_pos;
```
Because the Snowflake database is fixed in the catalog properties, the database name drops out of the fully qualified name: `tasty_bytes_sample_data.raw_pos.menu` in Snowflake becomes `snowflake.raw_pos.menu` in Trino (or just `menu` after the `USE` statement above). Pick any of the queries from the Snowflake worksheet and try running them in the Trino container.
```
trino:raw_pos> SELECT COUNT(*) AS row_count FROM raw_pos.menu;
 row_count
-----------
       100
(1 row)

Query 20240315_185131_00013_45g27, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.84 [1 rows, 0B] [1 rows/s, 0B/s]
```
```
trino:raw_pos> SELECT
            ->     menu_item_name
            -> FROM raw_pos.menu
            -> WHERE truck_brand_name = 'Freezing Point';
   menu_item_name
--------------------
 Lemonade
 Sugar Cone
 Waffle Cone
 Two Scoop Bowl
 Bottled Water
 Bottled Soda
 Ice Tea
 Ice Cream Sandwich
 Mango Sticky Rice
 Popsicle
(10 rows)

Query 20240315_185212_00015_45g27, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
1.23 [10 rows, 0B] [8 rows/s, 0B/s]
```
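Beyond the CLI, the same federated queries can be issued from Python with the `trino` client package (`pip install trino`). A sketch, assuming the Compose environment above is running on localhost:8080 with the default `admin` user:

```python
def menu_count_query(catalog: str = "snowflake", schema: str = "raw_pos") -> str:
    """Build the row-count query against the federated Snowflake table."""
    return f"SELECT COUNT(*) AS row_count FROM {catalog}.{schema}.menu"

def fetch_row_count(host: str = "localhost", port: int = 8080) -> int:
    """Run the query through the Trino coordinator and return the count."""
    import trino  # imported here so the module loads without the package

    conn = trino.dbapi.connect(host=host, port=port, user="admin")
    cur = conn.cursor()
    cur.execute(menu_count_query())
    return cur.fetchone()[0]

# With the Tasty Bytes data loaded, fetch_row_count() should return 100.
```

Notice the code never mentions Snowflake credentials: Trino holds those in the catalog, so client applications only need to know the coordinator.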
Accessing Snowflake datasets from our local environment with Trino demonstrates the flexibility and interoperability of these tools: one query engine working with data across platforms, without moving the data first.
Additionally, you can access the Trino UI at http://localhost:8080. With the default configuration, no password is required and the username is `admin`. The "finished queries" section lets you review the queries you've executed, which is handy for debugging.
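The query history behind the UI is also reachable over Trino's REST interface. A hedged sketch using only the standard library (the `/v1/query` endpoint and `X-Trino-User` header come from Trino's REST API; verify both against your version's documentation):

```python
import json
import urllib.request

def queries_url(host: str = "localhost", port: int = 8080) -> str:
    """Endpoint that lists recent queries, the data behind the UI's query list."""
    return f"http://{host}:{port}/v1/query"

def list_queries(host: str = "localhost", port: int = 8080) -> list:
    """Fetch recent query metadata from a running coordinator."""
    req = urllib.request.Request(
        queries_url(host, port),
        headers={"X-Trino-User": "admin"},  # Trino identifies callers by this header
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Each entry includes the query id, its state (e.g. FINISHED), and the SQL text.
```

This is useful for scripting checks, such as alerting on failed queries, without scraping the UI.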
Trino and its commercial distribution, Starburst, are powerful tools for federating data across different sources. This article showed how easily Snowflake can be reached from local tooling with Trino, pairing cloud data warehousing with distributed query processing for a robust data management and analytics setup.