Headline

Checklist

1. Understand the problem and establish design scope

User/Customer

who will use this system
how this system will be used
features they will use

Scale (read / write) [SCALE == PARTITIONING]

how many read queries per second?
how much data is stored in a day?
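A back-of-envelope sketch for the two scale questions above. The workload numbers (10M daily users, 20 reads and 2 writes per user, 1 KB per write) are hypothetical and only illustrate the arithmetic.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(requests_per_day: int) -> float:
    """Average queries per second for a given daily request volume."""
    return requests_per_day / SECONDS_PER_DAY


def daily_storage_bytes(writes_per_day: int, bytes_per_write: int) -> int:
    """Raw bytes written per day, before replication or compression."""
    return writes_per_day * bytes_per_write


# Hypothetical workload, chosen only for illustration.
daily_users = 10_000_000
read_qps = average_qps(daily_users * 20)    # roughly 2,315 reads/s
write_qps = average_qps(daily_users * 2)    # roughly 231 writes/s
storage_per_day = daily_storage_bytes(daily_users * 2, 1_000)  # 20 GB/day
```

These are averages; peak traffic is usually estimated as some multiple of the average, and that multiple is itself an assumption to state out loud.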

Performance

[RELIABLE == REPLICATION AND CHECKPOINTING]
[SPEED == IN-MEMORY]
What is the expected p99 latency for read/write?
Can we do things asynchronously (delay between write and read)?

Cost

cheap Load Balancer
open source tech to reduce the cost
data retention and DB clearing

Non-functional Requirement

Use CAP + Scalable (CAPS)
- Consistency (everyone sees the same data)
- Availability (service/hardware failure recovery, no single point of failure)
- Performance (handle lots of traffic)
- Scalability

2. api / data model / HL design

basic APIs

input/output/generic

HL Design

[Add abstraction layer to make it simpler] All problems in computer science can be solved by another level of indirection
single responsibility (separation of concerns)

Data Model (what we are storing)

3. detailed design and deep-

generic

Load Balancing
- 3 Types (Layer 4 vs Layer 7 vs Elastic LB)
- SSL Termination
- LB Algorithm:
  - Static (Round-robin, simple hashing)
  - Dynamic (Least Response Time, Least Connection, Agent-based)
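
A minimal sketch contrasting one static and one dynamic algorithm from the list above. The class names and the in-memory connection counter are assumptions for illustration; a real load balancer tracks connections at the network layer.

```python
import itertools


class RoundRobinBalancer:
    """Static: cycles through servers in a fixed order, ignoring load."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Dynamic: picks the server with the fewest open connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties are broken by insertion order of the servers.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1
```

Round-robin needs no state beyond a position, which is why it counts as static; least-connections has to be told when a connection closes.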
API design
- pagination algorithms
  - offset pagination
  - cursor pagination
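
A small sketch of the two pagination styles over an in-memory list. It assumes rows are dicts sorted by a unique, increasing `id`; the function names are hypothetical.

```python
def offset_page(rows, offset, limit):
    """Offset pagination: skip `offset` rows, return the next `limit`."""
    return rows[offset:offset + limit]


def cursor_page(rows, after_id, limit):
    """Cursor pagination: return up to `limit` rows with id > after_id,
    plus the cursor for the next call (None when no rows remain)."""
    page = [r for r in rows if after_id is None or r["id"] > after_id][:limit]
    has_more = bool(page) and page[-1]["id"] < rows[-1]["id"]
    next_cursor = page[-1]["id"] if has_more else None
    return page, next_cursor
```

In SQL terms the cursor variant corresponds roughly to `WHERE id > :after_id ORDER BY id LIMIT :limit`. It stays stable when rows are inserted or deleted between calls, whereas offset pages can shift.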
Session Management
- traditional vs JWT (JSON Web Token)
Data Model
- DB Types:
  - SQL
  - NoSQL
    - key-value
    - document
    - wide-column
    - graph
Cache
- Cache Architecture (cache for everything)
Sharding
- Horizontal Sharding (range based sharding)
  - Dedicated Server + Cluster Proxy similar to MongoDB
  - Co-located Cluster Proxy similar to Cassandra
- Vertical Sharding
- Hash Sharding (Simple and consistent hashing)
- Cost optimization for sharding:
  - per-node TPS estimation and leaky bucket optimization
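
A minimal consistent-hashing sketch for the hash sharding item above. The virtual-node count and the use of SHA-256 are assumptions for illustration, not a description of how any particular database does it.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to nodes so that adding a node only moves keys onto it."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        ring = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

With simple hashing (`hash(key) % n`), changing `n` remaps most keys; on the ring, only the keys that fall just before the new node's points change owner.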
Fault Tolerance
- Replication
  - Primary with backup replication
  - Primary with a coordinator with separate read/write
- Leader election algorithm
  - Raft
- Checkpointing
- WAL (Write Ahead Log)
- task breaking using Step Functions
- Circuit Breaking in API Gateway
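
A minimal circuit breaker sketch for the last item above. The thresholds, the `CircuitOpenError` name, and the injectable `clock` are assumptions for illustration; API gateways expose this as configuration, not as code like this.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point of failing fast is to stop sending traffic to a dependency that is already struggling, giving it time to recover.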
Strong Consistency:
- Raft strong replication
- Distributed Transactions:
  - Two-phase Commit (2PC)
  - Linear Order Execution (Saga)
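
A sketch of the saga idea: steps run in a linear order, and a failure triggers the compensations of the steps that already completed, in reverse. The `run_saga` helper and the step names in the usage note are hypothetical.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the completed steps in
    reverse order and return False. Return True if every action succeeds.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

For example, with steps "reserve stock", "create shipment", "charge card", a failed charge would cancel the shipment and then release the stock. Unlike 2PC, there is no global lock; intermediate states are visible until compensation finishes.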
End talks
- What metrics to emit:
  - Latency
  - Cache misses
  - CPU and Memory Utilization
  - Network I/O
  - CloudWatch (cheaper) / Datadog
- Logging (small but useful):
  - Details of every request to the Cache
  - Who accessed the cache and when
  - Return status code
- Auditing:
  - simple canary flow (canary to make sure all is good)
  - Immutable DB like Amazon QLDB (Quantum Ledger Database)

specific

Deduplication
- db lookup
- in-memory hash
- bloom filter
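
A minimal Bloom filter sketch for the last option above. The bit-array size, the number of hashes, and the seeded SHA-256 scheme are assumptions for illustration; production filters size these from the expected item count and target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, some false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive `num_hashes` bit positions by seeding one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

The trade-off against the other two options: a DB lookup is exact but slow, an in-memory hash is exact but memory-hungry, and a Bloom filter is compact but can report a new item as a duplicate.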
Rate Limiter Algo:
- Fixed Window
- Sliding Window
- Leaky Bucket
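
A sketch of two of the algorithms listed above: a sliding window (log variant) and a leaky bucket used as a meter. Passing `now` explicitly is an assumption that keeps the example deterministic; a real limiter would read a clock and usually keep its state in a shared store.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any trailing `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.timestamps = deque()

    def allow(self, now):
        # Drop timestamps that have slid out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False


class LeakyBucketLimiter:
    """Bucket holds up to `capacity` requests and drains at `leak_rate` per second."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.level = 0.0
        self.last = 0.0

    def allow(self, now):
        # Drain the bucket for the time elapsed since the last call.
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False
```

A fixed window is the same idea as the sliding window but with a counter that resets at window boundaries, which is cheaper but allows bursts of up to twice the limit around a boundary.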
Queue
- Distributed Queue
- Dedup
- Sharding

4. bottlenecks

scale, redundancy, SPOFs, metrics, logs, alerts, dashboards, pagerduty, deployment, failure scenarios

Timeline

[0 - 15 min] initial E2E Diagram

0xhmn/gist:1c468d18fe7f290d4e49543c42ff9e2a

0xhmn commented Jan 31, 2023
