Matar mataralhawiti (Riyadh, Saudi Arabia)
@iht
iht / relay_options.py
Created January 25, 2021 17:57
Relay your custom runtime options to Dataflow, assuming you use Flex Templates
```python
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

def run_pipeline(argv):
    opts: PipelineOptions = PipelineOptions(argv)
    gcloud_opts: GoogleCloudOptions = opts.view_as(GoogleCloudOptions)
    if opts.i_want_streaming_engine:
        gcloud_opts.enable_streaming_engine = True
    else:
        gcloud_opts.enable_streaming_engine = False
    ...
```
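Beam builds `PipelineOptions` on top of `argparse`, so a custom flag like `--i_want_streaming_engine` surfaces as an attribute after parsing. A dependency-free sketch of that same pattern (the flag name mirrors the gist; the function name is illustrative, not Beam's API):

```python
import argparse

def parse_runtime_options(argv):
    # Mimics how PipelineOptions turns unknown --flags into attributes.
    parser = argparse.ArgumentParser()
    parser.add_argument("--i_want_streaming_engine", action="store_true",
                        help="Custom flag relayed to the Dataflow job.")
    opts, _unknown = parser.parse_known_args(argv)
    return opts

opts = parse_runtime_options(["--i_want_streaming_engine"])
print(opts.i_want_streaming_engine)  # True
```

`parse_known_args` (rather than `parse_args`) matters here: it lets the standard Dataflow options pass through untouched, which is what makes the relay pattern work with Flex Templates.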
@iht
iht / child_master_dag.py
Created June 22, 2020 19:12
The child master DAG, that executes tasks only when both parent DAGs are completed successfully
```python
"""Trigger Dags #1 and #2 and do something if they succeed."""
from airflow import DAG
from airflow.operators.sensors import ExternalTaskSensor
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

with DAG(
        'master_dag',
        schedule_interval='*/1 * * * *',  # Every 1 minute
```
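The sensor pattern this DAG relies on boils down to "proceed only when every parent DAG's run succeeded". A library-free sketch of that gating logic (function and state names are illustrative, not Airflow's API):

```python
def parents_succeeded(parent_states):
    # ExternalTaskSensor-style gate: every upstream DAG run must be 'success'.
    return all(state == "success" for state in parent_states.values())

states = {"parent_dag_1": "success", "parent_dag_2": "running"}
print(parents_succeeded(states))  # False: one parent is still running
```

In Airflow itself, one `ExternalTaskSensor` per parent DAG plays this role, and the downstream tasks are set to run only after all sensors succeed.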
```python
from kafka import KafkaConsumer, KafkaProducer, TopicPartition
from datetime import datetime
import boto3
import os
import sys

# Settings
client = ["node01:9092", "node02:9092", "node03:9092"]  # bootstrap servers
topic = 'test-auto-kafka'
nbrrecords = 50  # number of records to produce
```
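With those settings, a producer loop would serialize and send `nbrrecords` messages to the topic. Since actually sending requires a live broker, here is just the message-building half as a standalone sketch (the topic name comes from the snippet; the helper name and payload shape are illustrative):

```python
import json
from datetime import datetime

def build_records(n, topic="test-auto-kafka"):
    # Payloads a producer would send; the timestamp lets a consumer order them.
    return [
        {"topic": topic, "key": str(i),
         "value": json.dumps({"id": i, "ts": datetime(2020, 1, 1).isoformat()})}
        for i in range(n)
    ]

records = build_records(50)
print(len(records))  # 50
```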
@mostafa-asg
mostafa-asg / gist:68f5f29c7b73c419610ceafcf726379d
Created March 4, 2020 13:19
Spark: Example of command which launches multiple workers on each slave node:
```
SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./sbin/start-slaves.sh
```
This launches three worker instances on each node, each using two cores.
It is also possible to start workers manually and connect them to Spark's master node with:
```
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
```
Recall that cluster write throughput is directly proportional to the number of nodes N
and inversely proportional to the replication factor RF. If a single node writes 15,000 rows per second,
then a 5-node cluster writing 3 replicas should sustain roughly 15,000 * N / RF = 25,000 rows/s.
[Source](https://www.datastax.com/blog/2011/05/understanding-hinted-handoff-cassandra-08)
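The rule of thumb above is just `single_node_rate * N / RF`; as a quick sanity check with the numbers from the text:

```python
def cluster_write_throughput(single_node_rate, nodes, replication_factor):
    # Writes scale up with node count and down with the number of replicas,
    # since each logical row must be written RF times across the cluster.
    return single_node_rate * nodes / replication_factor

print(cluster_write_throughput(15_000, 5, 3))  # 25000.0 rows/s
```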
@KonoMaxi
KonoMaxi / 1_job_manager.py
Last active February 5, 2020 02:48
The JobManager currently handles pipelines of jobs for me in azure functions.
```python
import json
import random
import logging
import re

from azure.cosmosdb.table.tableservice import TableService
from azure.storage.queue import QueueService, QueueMessageFormat


class JobManager(object):
    def __init__(self, account_name: str, account_key: str, job_group: str, job_id: str = None):
```
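The pipeline pattern the gist implements on top of Azure Table/Queue storage can be sketched with nothing but the standard library: completing one stage enqueues the next. The class and method names below are illustrative, not the gist's API, and `queue.Queue` stands in for the Azure storage queue:

```python
import json
import queue
import uuid

class LocalJobManager:
    """Chains jobs: finishing one stage enqueues the next stage in the pipeline."""

    def __init__(self, pipeline):
        self.pipeline = pipeline          # ordered list of stage names
        self.q = queue.Queue()            # stands in for the Azure queue
        self.job_id = str(uuid.uuid4())   # generated when not supplied, as in the gist

    def start(self):
        self.q.put(json.dumps({"job_id": self.job_id, "stage": self.pipeline[0]}))

    def complete(self, stage):
        # Enqueue the next stage, if any, once the current one succeeds.
        idx = self.pipeline.index(stage)
        if idx + 1 < len(self.pipeline):
            self.q.put(json.dumps({"job_id": self.job_id,
                                   "stage": self.pipeline[idx + 1]}))

mgr = LocalJobManager(["extract", "transform", "load"])
mgr.start()
mgr.complete("extract")
print(mgr.q.qsize())  # 2 messages queued: "extract" and "transform"
```

In the Azure Functions version, a queue-triggered function would consume each message, run the stage, and call back into the manager to enqueue the successor.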
@aymericbeaumet
aymericbeaumet / delete-likes-from-twitter.md
Last active September 21, 2024 00:39
[Recipe] Delete all your likes/favorites from Twitter

Ever wanted to delete all your likes/favorites from Twitter but only found broken/expensive tools? You are in the right place.

  1. Go to: https://twitter.com/{username}/likes
  2. Open the console and run the following JavaScript code:
```javascript
setInterval(() => {
  for (const d of document.querySelectorAll('div[data-testid="unlike"]')) {
    d.click()
  }
}, 1000) // run once per second until the likes page is empty
```
@rmoff
rmoff / List all available Kafka Connect plugins.md
Created May 18, 2018 14:29
List all available Kafka Connect plugins
```
$ curl -s -XGET http://localhost:8083/connector-plugins | jq '.[].class'
"io.confluent.connect.activemq.ActiveMQSourceConnector"
"io.confluent.connect.elasticsearch.ElasticsearchSinkConnector"
"io.confluent.connect.hdfs.HdfsSinkConnector"
"io.confluent.connect.hdfs.tools.SchemaSourceConnector"
"io.confluent.connect.ibm.mq.IbmMQSourceConnector"
"io.confluent.connect.jdbc.JdbcSinkConnector"
"io.confluent.connect.jdbc.JdbcSourceConnector"
"io.confluent.connect.jms.JmsSourceConnector"
```

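The same extraction works without `jq`, since `/connector-plugins` returns a JSON array of objects with a `class` field. A sketch that parses it in Python (the response body below is a trimmed sample for illustration, not live output):

```python
import json

# Trimmed sample of what GET /connector-plugins returns.
body = '''[
  {"class": "io.confluent.connect.jdbc.JdbcSinkConnector", "type": "sink", "version": "5.0.0"},
  {"class": "io.confluent.connect.jdbc.JdbcSourceConnector", "type": "source", "version": "5.0.0"}
]'''

classes = [plugin["class"] for plugin in json.loads(body)]  # jq equivalent: .[].class
print(classes)
```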
```dockerfile
# ---- Base python ----
FROM python:3.6 AS base

# Create app directory
WORKDIR /app

# ---- Dependencies ----
FROM base AS dependencies
COPY gunicorn_app/requirements.txt ./

# install app dependencies
RUN pip install -r requirements.txt
```
@SerCeMan
SerCeMan / intensivedata.txt
Last active March 17, 2021 08:26
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
# Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
1. **Chapter 1. Reliable, Scalable and Maintainable Applications**
    1. R: faults != failures; faults cause failures. Systems should be fault-tolerant, i.e. resistant to some types of faults.
    2. S: Amazon cares about the 99.9th percentile because the customers with the highest latencies are usually those with the most data, and therefore the most valuable customers.
    3. S: tail latency amplification - when serving one page requires multiple requests on the critical path, the slowest request dominates the page's latency.
2. **Chapter 2. Data Models and Query Languages**
    1. Hierarchical model - imperative querying, no way to change the schema, children are ordered, no many-to-many.
    2. CODASYL (network model) vs SQL.
    3. NoSQL - often no schema (more precisely: schema-on-read vs schema-on-write).
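The percentile point from Chapter 1 is easy to see numerically: a handful of slow outliers barely move the mean but dominate the high percentiles. A small sketch using the standard library (the latency data is invented for illustration):

```python
import statistics

# 1000 requests: almost all fast, three slow outliers in the tail.
latencies_ms = [10] * 997 + [500, 800, 1200]

mean = statistics.mean(latencies_ms)
# quantiles(n=1000) yields 999 cut points; the last one approximates p99.9.
p999 = statistics.quantiles(latencies_ms, n=1000)[-1]
print(round(mean, 2), round(p999, 1))  # mean stays ~12 ms while p99.9 is ~1200 ms
```

This is exactly why a dashboard showing only average latency hides the experience of the heaviest (and most valuable) users.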