Gist pingzh/e849084934588a20ce3c440598669fd8: test_ti_creation.py
import logging
import time

from airflow.models import TaskInstance
from airflow.models.serialized_dag import SerializedDagModel
from airflow.utils import timezone
from airflow.utils.db import create_session

logger = logging.getLogger(__name__)
out_hdlr = logging.FileHandler('./log.txt')
out_hdlr.setFormatter(logging.Formatter('%(asctime)s %(message)s'))
out_hdlr.setLevel(logging.INFO)
logger.addHandler(out_hdlr)
logger.setLevel(logging.INFO)


def create_tis_in_new_dag_run(dag, execution_date, number_of_tis):
    tasks = list(dag.task_dict.values())[0:number_of_tis]
    t1 = time.time()
    success = True
    try:
        with create_session() as session:
            for task in tasks:
                ti = TaskInstance(task, execution_date)
                session.add(ti)
            session.flush()
    except Exception:
        # Log the failure instead of swallowing it silently; a silent
        # failure makes the timing numbers look misleadingly fast.
        logger.exception('Failed to create %s task instances', number_of_tis)
        success = False
    t2 = time.time()
    logger.info('Created %s tis. success?: %s, perf: %s', number_of_tis, success, t2 - t1)
    return t2 - t1, success


def perf_tis_creation(dag):
    perf = {}
    for number_of_tis in [1000, 3000, 5000, 10000, 15000, 20000, 25000]:
        next_run_date = timezone.utcnow()
        # Use a separate name for the duration so the perf dict is not clobbered.
        duration, success = create_tis_in_new_dag_run(dag, next_run_date, number_of_tis)
        perf[number_of_tis] = (duration, success)
        time.sleep(5)
    return perf


if __name__ == '__main__':
    dag_id = 'a-very-large-dag'
    dm = SerializedDagModel.get(dag_id)
    dag = dm.dag
    perf_tis_creation(dag)
KevinYang21 commented Dec 21, 2021
MySQL is shocking. Postgres has near-linear performance:
[2022-01-05 10:57:25,672] 1106663 MainProcess {{__main__ test_ti_creation.py:32}} INFO - Created 1000 tis. success?: False, perf: 0.0025680065155029297
[2022-01-05 10:57:30,678] 1106663 MainProcess {{__main__ test_ti_creation.py:32}} INFO - Created 3000 tis. success?: False, perf: 0.0017096996307373047
[2022-01-05 10:57:35,686] 1106663 MainProcess {{__main__ test_ti_creation.py:32}} INFO - Created 5000 tis. success?: False, perf: 0.0025453567504882812
[2022-01-05 10:57:40,698] 1106663 MainProcess {{__main__ test_ti_creation.py:32}} INFO - Created 10000 tis. success?: False, perf: 0.0053005218505859375
[2022-01-05 10:57:45,706] 1106663 MainProcess {{__main__ test_ti_creation.py:32}} INFO - Created 15000 tis. success?: False, perf: 0.0023920536041259766
[2022-01-05 10:57:50,711] 1106663 MainProcess {{__main__ test_ti_creation.py:32}} INFO - Created 20000 tis. success?: False, perf: 0.0014641284942626953
[2022-01-05 10:57:55,722] 1106663 MainProcess {{__main__ test_ti_creation.py:32}} INFO - Created 25000 tis. success?: False, perf: 0.0048677921295166016
Pay more attention to the output, Ash; it's so quick because it failed.
Elephant in the room indeed :)
BTW @ashb, is that an accident that you had success: False? (Maybe that's the reason for the numbers?)
Oh hmmm, no, I didn't notice that.
Better numbers (had to change it a bit for latest main):
[2022-01-05 12:05:13,647] 1262808 MainProcess {{__main__ test_ti_creation.py:35}} INFO - Created 1000 tis. success?: True, perf: 0.2504695159950643
[2022-01-05 12:05:19,537] 1262808 MainProcess {{__main__ test_ti_creation.py:35}} INFO - Created 3000 tis. success?: True, perf: 0.8673499860014999
[2022-01-05 12:05:25,884] 1262808 MainProcess {{__main__ test_ti_creation.py:35}} INFO - Created 5000 tis. success?: True, perf: 1.3316335039999103
[2022-01-05 12:05:33,485] 1262808 MainProcess {{__main__ test_ti_creation.py:35}} INFO - Created 10000 tis. success?: True, perf: 2.5858788749974337
[2022-01-05 12:05:42,555] 1262808 MainProcess {{__main__ test_ti_creation.py:35}} INFO - Created 15000 tis. success?: True, perf: 4.056032521002635
[2022-01-05 12:05:52,678] 1262808 MainProcess {{__main__ test_ti_creation.py:35}} INFO - Created 20000 tis. success?: True, perf: 5.108857400002307
[2022-01-05 12:06:03,551] 1262808 MainProcess {{__main__ test_ti_creation.py:35}} INFO - Created 21407 tis. success?: True, perf: 5.853602839000814
Mysql 5.7 numbers on my laptop:
[2022-01-05 13:00:36,973] {test_ti_creation.py:55} INFO - Created 1000 tis. success?: True, perf: 0.3342053010055679
[2022-01-05 13:00:42,990] {test_ti_creation.py:55} INFO - Created 3000 tis. success?: True, perf: 1.0060032569963369
[2022-01-05 13:00:49,775] {test_ti_creation.py:55} INFO - Created 5000 tis. success?: True, perf: 1.7641245650011115
[2022-01-05 13:00:58,649] {test_ti_creation.py:55} INFO - Created 10000 tis. success?: True, perf: 3.860055482000462
[2022-01-05 13:01:09,464] {test_ti_creation.py:55} INFO - Created 15000 tis. success?: True, perf: 5.801471699000103
[2022-01-05 13:01:22,447] {test_ti_creation.py:55} INFO - Created 20000 tis. success?: True, perf: 7.970231813000282
[2022-01-05 13:01:36,190] {test_ti_creation.py:55} INFO - Created 21407 tis. success?: True, perf: 8.734915605004062
So smaller in absolute terms, but still following the same pattern (my large DAG has "only" 21407 tasks).
My version is here: https://gist.github.com/ashb/f43ab6300f0f84ed91a599923de3a673 (updated to run against main, which needs a DagRun for a TaskInstance to be created).
I also tried the bulk performance modes mentioned in https://docs.sqlalchemy.org/en/13/faq/performance.html?highlight=bulk%20create#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow, which give about a 15-20% speed-up for bulk_save_objects on MySQL and (unfortunately for you) a 40% speed-up on PostgreSQL: apache/airflow#20722
Edit: 140% on Postgres now!
@ashb nice, we have been using bulk_save_objects for a while; it saved us around ~28 seconds in production when creating a large dag run (from ~45 seconds to ~17 seconds).
Wow, looking at those numbers, PostgreSQL is so much better.
Yeah I was surprised by just how much it sped up!
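The gap between per-row ORM inserts and a batched write is easy to reproduce outside Airflow. Below is a minimal, hedged sketch using only the stdlib (sqlite3 in-memory, standing in for the real database and for SQLAlchemy's `bulk_save_objects`): it times one INSERT per row against a single `executemany` batch of the same rows.

```python
import sqlite3
import time

def time_inserts(rows):
    """Compare per-row INSERTs against one batched executemany call."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task_instance (task_id TEXT, execution_date TEXT)")

    # Per-row: one driver round-trip per task instance, like session.add in a loop.
    t1 = time.perf_counter()
    for row in rows:
        conn.execute("INSERT INTO task_instance VALUES (?, ?)", row)
    conn.commit()
    per_row = time.perf_counter() - t1

    # Batched: hand the driver the whole list at once, like bulk_save_objects.
    t2 = time.perf_counter()
    conn.executemany("INSERT INTO task_instance VALUES (?, ?)", rows)
    conn.commit()
    batched = time.perf_counter() - t2

    conn.close()
    return per_row, batched

rows = [(f"task_{i}", "2022-01-05") for i in range(20000)]
per_row, batched = time_inserts(rows)
print(f"per-row: {per_row:.3f}s, batched: {batched:.3f}s")
```

The absolute numbers here say nothing about MySQL or Postgres, of course; the point is only that the batched path skips most of the per-row Python and driver overhead, which is the same effect the thread measured.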
Good stuff @ashb! I remember times when I got even better results just by batching DB writes (a telecom billing process). I think there are only two reasons why you should NOT do batching:
- When you care about transactional integrity "per object"
- When you have a risk of deadlocks
I think in this case we have neither the 1) need nor the 2) risk.
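The batching pattern described above can be sketched roughly as follows (sqlite3 again stands in for the real backend, and `chunk_size` is a hypothetical tuning knob): commit one fixed-size batch per transaction, so a failure rolls back only its own chunk. This is exactly the trade against "per object" transactional integrity mentioned in the first bullet.

```python
import sqlite3

def insert_in_batches(conn, rows, chunk_size=1000):
    """Insert rows in fixed-size batches, one transaction per batch.

    A failed batch rolls back only its own chunk -- the trade-off
    against "per object" transactional integrity.
    """
    inserted = 0
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        try:
            conn.executemany("INSERT INTO task_instance VALUES (?, ?)", chunk)
            conn.commit()
            inserted += len(chunk)
        except sqlite3.DatabaseError:
            conn.rollback()  # only this chunk is lost
            raise
    return inserted

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (task_id TEXT, execution_date TEXT)")
rows = [(f"task_{i}", "2022-01-05") for i in range(2500)]
print(insert_in_batches(conn, rows))  # 2500
```

Smaller chunks also shorten lock hold times, which is one common way to reduce the deadlock risk from the second bullet.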