Presto lets us trade APIs for queryable data structures. Many APIs are already in Presto, including Jira, TIPS, and DAG data. MARS uses Presto to run batch-processing jobs against these data structures. SQL is the (unfortunate) lingua franca in this land.
This is a new and foreign situation for some of us, but it's not scary. If you know some MQL or even basic Python, you can pick up Presto's SQL quickly and explore data within dev-prod and the rest of the company. And MARS is both simpler and more powerful than you might guess.
In a few days everyone in dev-prod and STAR will automatically have access to Presto. Let's hit the ground running.
I'll do a quick end-to-end demo of exploring the Jira, TIPS, and DAG data catalogs. Then we'll do a whirlwind tour of SQL (including what makes Presto's SQL not so terrible). Then I'll put that SQL into a multi-step MARS job. Then I'll use the resulting data in JupyterHub. Then I'll give a quick description of how we'll keep this new kingdom tidy.
If you are keen on following along, check out the DBeaver onboarding page: https://github.com/10gen/mongo-presto/blob/master/docs/configuration/dbeaver/connection.md
- ensure you're on the VPN
- ensure you can access MongoDB Cloud here
ryan.timmons@
2021-12-03
Recording here. Good luck following that guy talking.
Cedar thinks it has sent some data to TIPS, but TIPS thinks otherwise. They can't both be right. Let's see.
- SQL and Presto
- Marsing
- "Policies" and getting started
- Using this in Jupyterhub
🦫🦫 install & configure 🦫🦫
Yup. No excuses. Our offerings aren't ready yet. BUT: Presto does give us a backdoor to using MQL. Stay tuned.
SELECT
*
FROM
evergreen_cedar_atlas.cedar.perf_results AS cedars
WHERE
cedars.created_at < DATE_ADD('day', -1, CURRENT_TIMESTAMP)
AND
CARDINALITY(cedars.rollups.stats) > 2;
Note the "data" column. Cmd+Click on the object. Tab to switch view.
data[1].cedar_perf_result_id
How the heck? And how do we use it?
Let's first talk about what's going on behind the scenes. Where did this data come from, and how does Presto know what it looks like?
- look at performance evgdw and compare with results view in DBeaver
- https://prestodb.io/docs/current/connector.html
- https://prestodb.io/docs/current/connector/hive.html
- defines a dag of jobs
- jobs take data from one place and put it another with some config in between
- sources are many including atlas
- destinations are many including atlas (but usually it's presto)
- other destinations too? hold on we'll find out.
These use the MongoDB connector: Presto reads from Atlas analytics nodes, with dedicated user accounts and roles facilitating the connections.
- dev_prod_build_baron_atlas (DAG)
- dev_prod_performance_atlas (TIPS)
- evergreen_cedar_atlas (EVG) (but test results are wip)
- patch_build_optimizer_atlas (STM)
- tpm_build_failures_atlas (TPMs)
- evergreen_metrics_atlas (ignore!)
- Missing: evergreen app (for now)
Note: Security is a thing (Mana)
In awsdatacatalog. These use the Hive connector.
- dev_prod_staging
- evergreen_base (private!)
- dev_prod_live
How do you know what's what? WIP!
Looking back at the results of the select query from before.
SELECT
*
FROM
dev_prod_performance_atlas.expanded_metrics.time_series
Note the "data" column. Cmd+Click on the object. Tab to switch view.
data[1].cedar_perf_result_id
SELECT
data
FROM
dev_prod_performance_atlas.expanded_metrics.time_series
select series.*, datas.*
from
dev_prod_performance_atlas.expanded_metrics.time_series as series,
unnest(data) as datas
with x (variant, tasks) as (
values
('variant1', array['task-1a', 'task-1b']),
('variant2', array['task-2a', 'task-2b'])
)
select *
from x;
select variant,
unnested.newname
from
x,
unnest(tasks) as unnested(newname)
;
select series.*, datas.*
from
dev_prod_performance_atlas.expanded_metrics.time_series as series,
unnest(data) as datas
WITH series_points AS (
select series.*, datas.*
from
dev_prod_performance_atlas.expanded_metrics.time_series as series,
unnest(data) as datas
)
SELECT *
FROM series_points
with
tips_series as (
select series.*, datas.*
from
dev_prod_performance_atlas.expanded_metrics.time_series as series,
unnest(data) as datas
),
cedar_points as (
select cedars._id, stats.*
from evergreen_cedar_atlas.cedar.perf_results as cedars,
UNNEST(cedars.rollups."stats") as stats
where cedars.created_at < date_add('day', -1, current_timestamp)
)
select *
from tips_series ts,
cedar_points cs
where ts.cedar_perf_result_id = cs._id;
select *
from tips_series ts,
cedar_points cs
where ts.cedar_perf_result_id = cs._id;
-- vs
SELECT *
FROM
tips_series AS ts
LEFT JOIN
cedar_points AS cs
ON ts.cedar_perf_result_id = cs._id;
select *
from a, b
where a_id = b_id
     .---------.---------.
    /         / \         \
   |    a    | ? |    b    |
    \         \ /         /
     '---------'---------'
              |
       +--------------+
       |  a_id = b_id |
       +--------------+
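A quick way to feel the difference between the comma-join above (an inner join) and a LEFT JOIN. This is sqlite3 rather than Presto, and tables `a`/`b` are made up, but the join semantics are the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER);
    CREATE TABLE b (b_id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
""")

# Comma join + WHERE is an inner join: only the overlap survives.
inner = conn.execute(
    "SELECT a_id, b_id FROM a, b WHERE a_id = b_id ORDER BY a_id"
).fetchall()

# LEFT JOIN keeps every row of a; unmatched rows get NULL for b's columns.
left = conn.execute(
    "SELECT a_id, b_id FROM a LEFT JOIN b ON a_id = b_id ORDER BY a_id"
).fetchall()

print(inner)  # [(2, 2), (3, 3)]
print(left)   # [(1, None), (2, 2), (3, 3)]
```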
In a but not b?
select *
from
a left outer join b
on a_id = b_id
where b_id is null
with
tips_series as (
select series.*, datas.*
from
dev_prod_performance_atlas.expanded_metrics.time_series as series,
unnest(data) as datas
),
cedar_points as (
select cedars._id, stats.*
from evergreen_cedar_atlas.cedar.perf_results as cedars,
UNNEST(cedars.rollups."stats") as stats
where cedars.created_at < date_add('day', -1, current_timestamp)
)
select *
from cedar_points cs
left outer join tips_series ts on ts.cedar_perf_result_id = cs._id
where ts.cedar_perf_result_id is null
select *
from a
where not exists (
select 1
from b
where b_id = a_id
)
Useful when CBO isn't doing what you want. Or when calc is just clearer.
In a but not b?
select *
from
a left outer join b
on a_id = b_id
where b_id is null
select *
from a
where not exists (
select 1
from b
where b_id = a_id
);
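Both anti-join patterns above return the same rows ("in a but not b"). A sqlite3 sketch (toy tables, not Presto) demonstrating the equivalence:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER);
    CREATE TABLE b (b_id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (2), (3);
""")

# Pattern 1: LEFT JOIN, then keep the rows where the join found nothing.
left_anti = conn.execute("""
    SELECT a_id FROM a LEFT JOIN b ON a_id = b_id
    WHERE b_id IS NULL ORDER BY a_id
""").fetchall()

# Pattern 2: NOT EXISTS with a correlated subquery.
not_exists = conn.execute("""
    SELECT a_id FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b_id = a_id)
    ORDER BY a_id
""").fetchall()

print(left_anti)   # [(1,)]
print(not_exists)  # [(1,)]
```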
- Joining across Atlas clusters is often very slow.
- So let's not run the query "hot" against the cluster.
MOVING ON: Marsing
- SQL and Presto
- Marsing
- "Policies" and getting started
- Using this in Jupyterhub
- defines a dag of jobs
- jobs take data from one place and put it another with some config in between
- sources are many including atlas
- destinations are many including atlas (but usually it's presto)
- ...and ADL
- AND THEN
- runs them based on triggers like cron or webhooks etc
- allows for alerting and retry management, etc
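"A DAG of jobs run in dependency order" can be sketched in a few lines of Python. This is a toy illustration, not MARS, and the job names are invented:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical job graph: each job maps to the set of jobs it depends on.
jobs = {
    "extract_atlas": set(),
    "stage_to_presto": {"extract_atlas"},
    "build_views": {"stage_to_presto"},
    "publish_to_adl": {"stage_to_presto"},
}

# A trigger (cron, webhook, ...) would kick this off; a real scheduler
# also layers on retry management and alerting on failure.
order = list(TopologicalSorter(jobs).static_order())
print(order)  # every job appears after all of its dependencies
```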
https://mars.dataplatform.prod.corp.mongodb.com/app/job/c22837b4-45b0-43e2-a7d2-124079f9cef9
- Complex DAGs and output destinations
- Airflow?
- Python?
rtimmons/mongo-etls (STILL VERY WIP and will create a 10gen repo when it's for reals)
- SQL and Presto
- Marsing
- "Policies" and getting started
- Using this in Jupyterhub
Notable Policies:
- All Data is Exposed (even if ugly)
- Analytics Nodes are Private
- Views Data is Stable; Changes to Views Follow a Process
- Services are the APIs for OLTP
- Presto is the API for OLAP
Example notebook -> Or on docs
- Everyone in {dev-prod, Kelsey's org} has access to Presto now. If you are not in these groups and want access, please Slack me (ryan.timmons).
- Kick the tires and try things out.
- If you want to create MARS jobs to materialize data from Atlas (or elsewhere), please let me know.
ANYWAY
- "Policies" are going out for leads review on Monday
- Once approved, we'll set the repo and documentation up and start rolling it out for official use.
- TLDR: Hopefully production-ready by January.
My Reference: NWA2H57UT3ZZ117PTJPY1
@quick(Lunch & Learn SQL and ETLs for Presto and MARS) @ft