- User/Customer
- who will use this system
- how this system will be used
- features they will use
- Scale (read / write)
[SCALE == PARTITIONING]
- how many read queries per second?
- how much data is stored in a day?
- Performance
[RELIABLE == REPLICATION AND CHECKPOINTING]
[SPEED == IN-MEMORY]
- What is the expected p99 latency for read/write?
- Can we do things asynchronously (delay between write and read)?
- Cost
- cheap Load Balancer
- open source tech to reduce the cost
- data retention and DB clearing
- Non-functional Requirement
- Use CAP + Scalable (CAPS)
- Consistency (everyone sees the same data)
- Availability (recovering from service/hardware failure - no single point of failure)
- Performant (handle lots of traffic)
- Scalability
- basic APIs
- input/output/generic
- HL Design
- [Add abstraction layer to make it simpler] All problems in computer science can be solved by another level of indirection
- single responsibility (separation of concerns)
- Data Model (what we are storing)
generic
-
Load Balancing
- 3 Types (Layer 4 vs Layer 7 vs Elastic LB)
- SSL Termination
- LB Algorithm:
- Static (Round-robin - simple hashing)
- Dynamic (Least Response Time, Least Connection, Agent-based)
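A minimal sketch contrasting one static and one dynamic algorithm from the list above. The class names and the in-process connection counter are hypothetical; a real load balancer tracks connections at the network layer.

```python
import itertools


class RoundRobinBalancer:
    """Static: cycles through backends in a fixed order, ignoring load."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConnectionBalancer:
    """Dynamic: picks the backend with the fewest open connections."""

    def __init__(self, backends):
        # Assumption: the caller reports connection open/close events to us.
        self.active = {b: 0 for b in backends}

    def pick(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1
```

Round-robin needs no state about the backends, which is why it is cheap; least-connection needs feedback from each request, which is the price of adapting to uneven load.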
-
API design
- pagination algorithms
- offset pagination
- cursor pagination
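A rough sketch of the two pagination styles over an in-memory list. `ROWS`, `offset_page`, and `cursor_page` are illustrative names; in a real service the same logic would be an `OFFSET`/`LIMIT` query versus a `WHERE id > cursor` query.

```python
# Assumed data set: rows already sorted by a unique, increasing id.
ROWS = [{"id": i, "name": f"item-{i}"} for i in range(1, 11)]


def offset_page(rows, offset, limit):
    """Offset pagination: skip `offset` rows, return the next `limit`."""
    return rows[offset:offset + limit]


def cursor_page(rows, after_id, limit):
    """Cursor pagination: return rows whose id is greater than the cursor."""
    page = [r for r in rows if after_id is None or r["id"] > after_id][:limit]
    next_cursor = page[-1]["id"] if page else None
    return page, next_cursor
```

Offset pagination is simple but can skip or repeat rows when data is inserted between requests; a cursor stays stable because it is anchored to the last row the client saw.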
-
Session Management
- traditional (server-side session) vs JWT (JSON Web Token)
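To make the JWT side concrete, here is a hedged sketch of HS256 signing and verification using only the standard library. `sign_jwt` and `verify_jwt` are hypothetical helper names, and the sketch deliberately omits expiry (`exp`) checks and other claims validation that a real deployment would need.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_jwt(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = (
        _b64url(json.dumps(header).encode())
        + "."
        + _b64url(json.dumps(payload).encode())
    )
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + _b64url(sig)


def verify_jwt(token: str, secret: bytes):
    """Return the payload if the signature matches, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        return None
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(_b64url(expected).encode(), sig_b64.encode()):
        return None
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

The server keeps no session table: any node holding the secret can verify the token, which is the main trade-off against traditional sessions (easy to scale, harder to revoke).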
-
Data Model
- DB Types:
- SQL
- NoSQL
- key-value
- document
- wide-column
- graph
-
Cache
- Cache Architecture (cache for everything)
-
Sharding
- Horizontal Sharding (range-based sharding)
- Dedicated Server + Cluster Proxy similar to MongoDB
- Co-located Cluster Proxy similar to Cassandra
- Vertical Sharding
- Hash Sharding (simple and consistent hashing)
- Cost optimization for sharding:
- per-node TPS estimation and leaky bucket optimization
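A small sketch of consistent hashing as mentioned under hash sharding. The `ConsistentHashRing` class, the use of MD5, and the default of 100 virtual nodes per server are assumptions chosen for illustration, not a recommendation.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual nodes per server smooth the distribution
        self._hashes = []      # sorted positions on the ring
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            h = _hash(f"{node}#{i}")
            bisect.insort(self._hashes, h)
            self._owner[h] = node

    def remove_node(self, node):
        gone = {h for h, n in self._owner.items() if n == node}
        self._hashes = [h for h in self._hashes if h not in gone]
        for h in gone:
            del self._owner[h]

    def get_node(self, key):
        if not self._hashes:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._hashes)
        return self._owner[self._hashes[idx]]
```

Unlike simple `hash(key) % N` sharding, removing a node here only remaps the keys that node owned; every other key keeps its shard.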
-
Fault Tolerance
- Replication
- Primary with backup replication
- Primary with a coordinator and separate read/write paths
- leader election algorithm?
- Checkpointing
- WAL (Write Ahead Log)
- task breaking using Step Functions
- Circuit breaking in API Gateway
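A hedged sketch of the circuit breaking idea from the list above, written as an in-process wrapper rather than an API Gateway feature. `CircuitBreaker`, `CircuitOpenError`, and the thresholds are hypothetical; the `clock` parameter exists only so the behaviour can be tested without sleeping.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are rejected without reaching the downstream."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast, downstream marked unhealthy")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point of failing fast is to stop a struggling dependency from tying up threads and connections in its callers while it recovers.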
-
Strong Consistency:
- Raft strong replication
- Distributed Transactions:
- Two-phase Commit (2PC)
- Linear Order Execution (Saga)
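A toy sketch of the two-phase commit flow only (the Saga alternative is not shown). `Participant` and `two_phase_commit` are hypothetical names, and the sketch ignores coordinator crashes and timeouts, which are the hard part of 2PC in practice.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumption: stands in for a real local check
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit/abort): commit only on a unanimous yes.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

A single "no" vote aborts everyone, which gives atomicity at the cost of blocking all participants while the coordinator decides.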
-
End talks
- What metrics to emit:
- Latency
- Cache misses
- CPU and Memory Utilization
- Network I/O
- CloudWatch (cheaper) / Datadog
- Logging (small but useful):
- Details of every request to the Cache
- Who accessed the cache and when
- Return status code
- Auditing:
- simple canary flow (canary to make sure all is good)
- Immutable DB like Amazon Quantum Ledger Database (QLDB)
specific
-
Deduplication
- db lookup
- in-memory hash
- bloom filter
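A minimal Bloom filter sketch for the last option above. The `BloomFilter` class, the bit-array size, and the choice of SHA-256 with a per-hash prefix are illustrative assumptions; real sizing depends on the expected item count and acceptable false-positive rate.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions by salting the hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely new; True means "probably seen before".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

The three options trade memory for certainty: a DB lookup is exact but slow, an in-memory hash is exact but large, and a Bloom filter is tiny but can report false positives (never false negatives), so a positive hit may still need a confirming lookup.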
-
Rate Limiter Algo:
- Fixed Window
- Sliding Window
- Leaky Bucket
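As one example from the list above, a sliding window limiter can be sketched with a log of request timestamps. `SlidingWindowLimiter` is a hypothetical name, and passing `now` explicitly is an assumption made to keep the sketch deterministic; a real limiter would read a clock and usually keep one log per client key.

```python
from collections import deque


class SlidingWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now):
        # Drop timestamps that have slid out of the window.
        while self.hits and now - self.hits[0] >= self.window:
            self.hits.popleft()
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False
```

Compared with a fixed window, this avoids the burst of up to twice the limit that can occur around a window boundary, at the cost of storing one timestamp per accepted request.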
-
Queue
- Distributed Queue
- Dedup
- Sharding
scale, redundancy, SPOFs, metrics, logs, alerts, dashboards, PagerDuty, deployment, failure scenarios
- [0 - 15 min] initial E2E Diagram