Presto lets us trade APIs for queryable data structures. Many APIs are already in Presto, including Jira, TIPS, and DAG data. MARS uses Presto to run batch-processing jobs against these data structures. SQL is the (unfortunate) lingua franca in this land.
This is a new and foreign situation for some of us, but it's not scary. If you know some MQL or even basic Python, you can pick up Presto's SQL quickly and explore data within dev-prod and the rest of the company. And MARS is both simpler and more powerful than you might guess.
In a few days everyone in dev-prod and STAR will automatically have access to Presto. Let's hit the ground running.
I'll do a quick end-to-end demo of exploring the Jira, TIPS, and DAG data catalogs. Then we'll do a whirlwind tour of SQL (including what makes Presto's SQL not so terrible). Then I'll put that SQL into a multi-step MARS job. Then I'll use the resulting data in JupyterHub. Then I'll give a quick description of how we'll keep this new kingdom tidy.
If you are keen on following along, check out the DBeaver onboarding page: https://github.com/10gen/mongo-presto/blob/master/docs/configuration/dbeaver/connection.md
- ensure you're on the VPN
- ensure you can access MongoDB Cloud here
ryan.timmons@
2021-12-03
Recording here. Good luck following that guy talking.
Cedar thinks it has sent some data to TIPS, but TIPS thinks otherwise. They can't both be right. Let's see.
- SQL and Presto
- Marsing
- "Policies" and getting started
- Using this in Jupyterhub
🦫🦫 install & configure 🦫🦫
Yup. No excuses. Our offerings aren't ready yet. BUT: Presto does give us a backdoor to using MQL. Stay tuned.
SELECT
*
FROM
evergreen_cedar_atlas.cedar.perf_results AS cedars
WHERE
cedars.created_at < DATE_ADD('day', -1, CURRENT_TIMESTAMP)
AND
CARDINALITY(cedars.rollups.stats) > 2;
Note the "data" column. Cmd+Click on the object. Tab to switch view.
data[1].cedar_perf_result_id
How the heck? And how do we use it?
Let's first talk about what's going on behind the scenes. Where did this data come from, and how does Presto know what it looks like?
- look at performance evgdw and compare with results view in DBeaver
- https://prestodb.io/docs/current/connector.html
- https://prestodb.io/docs/current/connector/hive.html
- defines a dag of jobs
- jobs take data from one place and put it another with some config in between
- sources are many including atlas
- destinations are many including atlas (but usually it's presto)
- other destinations too? hold on we'll find out.
These use the MongoDB connector: Presto reads from Atlas analytics nodes, with dedicated user accounts and roles facilitating the connections.
- dev_prod_build_baron_atlas (DAG)
- dev_prod_performance_atlas (TIPS)
- evergreen_cedar_atlas (EVG) (but test results are wip)
- patch_build_optimizer_atlas (STM)
- tpm_build_failures_atlas (TPMs)
- evergreen_metrics_atlas (ignore!)
- Missing: evergreen app (for now)
Note: Security is a thing (Mana)
In awsdatacatalog. These use the Hive connector.
- dev_prod_staging
- evergreen_base (private!)
- dev_prod_live
How do you know what's what? WIP!
Looking back at the results of the select query from before.
SELECT
*
FROM
dev_prod_performance_atlas.expanded_metrics.time_series
Note the "data" column. Cmd+Click on the object. Tab to switch view.
data[1].cedar_perf_result_id
SELECT
data
FROM
dev_prod_performance_atlas.expanded_metrics.time_series
select series.*, datas.*
from
dev_prod_performance_atlas.expanded_metrics.time_series as series,
unnest(data) as datas
with x (variant, tasks) as (
values
('variant1', array['task-1a', 'task-1b']),
('variant2', array['task-2a', 'task-2b'])
)
select *
from x;
select variant,
unnested.newname
from
x,
unnest(tasks) as unnested(newname)
;
select series.*, datas.*
from
dev_prod_performance_atlas.expanded_metrics.time_series as series,
unnest(data) as datas
WITH series_points AS (
select series.*, datas.*
from
dev_prod_performance_atlas.expanded_metrics.time_series as series,
unnest(data) as datas
)
SELECT *
FROM series_points
with
tips_series as (
select series.*, datas.*
from
dev_prod_performance_atlas.expanded_metrics.time_series as series,
unnest(data) as datas
),
cedar_points as (
select cedars._id, stats.*
from evergreen_cedar_atlas.cedar.perf_results as cedars,
UNNEST(cedars.rollups."stats") as stats
where cedars.created_at < date_add('day', -1, current_timestamp)
)
select *
from tips_series ts,
cedar_points cs
where ts.cedar_perf_result_id = cs._id;
select *
from tips_series ts,
cedar_points cs
where ts.cedar_perf_result_id = cs._id;
-- vs
SELECT *
FROM
tips_series AS ts
LEFT JOIN
cedar_points AS cs
ON ts.cedar_perf_result_id = cs._id;
select *
from a, b
where a_id = b_id
     .---------.---------.
    /         / \         \
   |    a    | ? |    b    |
    \         \ /         /
     '---------'---------'
              |
       +--------------+
       |  a_id = b_id |
       +--------------+
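A quick way to feel the difference between the comma-join above (an inner join) and a LEFT JOIN. This is sqlite3 rather than Presto, and tables `a`/`b` are made up, but the join semantics are the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER);
    CREATE TABLE b (b_id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
""")

# Comma join + WHERE is an inner join: only the overlap survives.
inner = conn.execute(
    "SELECT a_id, b_id FROM a, b WHERE a_id = b_id ORDER BY a_id"
).fetchall()

# LEFT JOIN keeps every row of a; unmatched rows get NULL for b's columns.
left = conn.execute(
    "SELECT a_id, b_id FROM a LEFT JOIN b ON a_id = b_id ORDER BY a_id"
).fetchall()

print(inner)  # [(2, 2), (3, 3)]
print(left)   # [(1, None), (2, 2), (3, 3)]
```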
In a but not b?
select *
from
a left outer join b
on a_id = b_id
where b_id is null
with
tips_series as (
select series.*, datas.*
from
dev_prod_performance_atlas.expanded_metrics.time_series as series,
unnest(data) as datas
),
cedar_points as (
select cedars._id, stats.*
from evergreen_cedar_atlas.cedar.perf_results as cedars,
UNNEST(cedars.rollups."stats") as stats
where cedars.created_at < date_add('day', -1, current_timestamp)
)
select *
from cedar_points cs
left outer join tips_series ts on ts.cedar_perf_result_id = cs._id
where ts.cedar_perf_result_id is null
select *
from a
where not exists (
select 1
from b
where b_id = a_id
)
Useful when CBO isn't doing what you want. Or when calc is just clearer.
In a but not b?
select *
from
a left outer join b
on a_id = b_id
where b_id is null
select *
from a
where not exists (
select 1
from b
where b_id = a_id
);
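Both anti-join patterns above return the same rows ("in a but not b"). A sqlite3 sketch (toy tables, not Presto) demonstrating the equivalence:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER);
    CREATE TABLE b (b_id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (2), (3);
""")

# Pattern 1: LEFT JOIN, then keep the rows where the join found nothing.
left_anti = conn.execute("""
    SELECT a_id FROM a LEFT JOIN b ON a_id = b_id
    WHERE b_id IS NULL ORDER BY a_id
""").fetchall()

# Pattern 2: NOT EXISTS with a correlated subquery.
not_exists = conn.execute("""
    SELECT a_id FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b_id = a_id)
    ORDER BY a_id
""").fetchall()

print(left_anti)   # [(1,)]
print(not_exists)  # [(1,)]
```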
- Joining across Atlas clusters is often very slow.
- So let's not run the query "hot" against the cluster.
MOVING ON: Marsing
- SQL and Presto
- Marsing
- "Policies" and getting started
- Using this in Jupyterhub
- defines a dag of jobs
- jobs take data from one place and put it another with some config in between
- sources are many including atlas
- destinations are many including atlas (but usually it's presto)
- ...and ADL
- AND THEN
- runs them based on triggers like cron or webhooks etc
- allows for alerting and retry management, etc
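"A DAG of jobs run in dependency order" can be sketched in a few lines of Python. This is a toy illustration, not MARS, and the job names are invented:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical job graph: each job maps to the set of jobs it depends on.
jobs = {
    "extract_atlas": set(),
    "stage_to_presto": {"extract_atlas"},
    "build_views": {"stage_to_presto"},
    "publish_to_adl": {"stage_to_presto"},
}

# A trigger (cron, webhook, ...) would kick this off; a real scheduler
# also layers on retry management and alerting on failure.
order = list(TopologicalSorter(jobs).static_order())
print(order)  # every job appears after all of its dependencies
```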
https://mars.dataplatform.prod.corp.mongodb.com/app/job/c22837b4-45b0-43e2-a7d2-124079f9cef9
- Complex DAGs and output destinations
- Airflow?
- Python?
rtimmons/mongo-etls (STILL VERY WIP and will create a 10gen repo when it's for reals)
- SQL and Presto
- Marsing
- "Policies" and getting started
- Using this in Jupyterhub
Notable Policies:
- All Data is Exposed (even if ugly)
- Analytics Nodes are Private
- Views Data is Stable; Changes to Views Follow a Process
- Services are the APIs for OLTP
- Presto is the API for OLAP
Example notebook -> Or on docs
- Everyone in {dev-prod, Kelsey's org} has access to Presto now. If you are not in these groups and want access, please Slack me (ryan.timmons).
- Kick the tires and try things out.
- If you want to create MARS jobs to materialize data from Atlas (or elsewhere), please let me know.
ANYWAY
- "Policies" are going out for leads review on Monday
- Once approved, we'll set the repo and documentation up and start rolling it out for official use.
- TLDR: Hopefully production-ready by January.
My Reference: NWA2H57UT3ZZ117PTJPY1
@quick(Lunch & Learn SQL and ETLs for Presto and MARS) @ft