In general, this pipeline is a set of immutable append-only logs, processors, and stores. Like any level 1 system context diagram, it is technology agnostic and could be implemented with a number of technologies, though I admit to having had Apache Kafka and Apache Spark Streaming in mind as I designed it. To understand the value of this approach, I recommend Martin Kleppmann's "Turning the database inside-out". I strongly urge you to read that article before evaluating the diagram or continuing with these comments.
The purpose of the pipeline is to end up with always-up-to-date stores of data that can be queried performantly at scale. Source files stream through the pipeline and drive streaming updates to what can be thought of as "materialized views," whose implementation and technology can be chosen to match the query characteristics. For example, an ela
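The core idea above can be sketched in a few lines. This is a minimal illustration, not the pipeline itself: the log is a plain Python list standing in for a Kafka topic, and the "materialized view" is an in-memory dict standing in for a query-optimized store; the event shape (`user`/`amount`) is an invented example.

```python
from collections import defaultdict

def apply_event(view, event):
    """Processor step: fold one log record into the materialized view."""
    view[event["user"]] += event["amount"]
    return view

def materialize(log):
    """Replay the whole append-only log to (re)build the view from scratch."""
    view = defaultdict(int)
    for event in log:
        apply_event(view, event)
    return dict(view)

# The immutable log: records are only ever appended, never mutated.
log = [
    {"user": "alice", "amount": 5},
    {"user": "bob", "amount": 3},
    {"user": "alice", "amount": 2},
]

totals = materialize(log)  # {'alice': 7, 'bob': 3}

# New source data appends to the log and streams an incremental
# update into the view, rather than forcing a full recomputation.
log.append({"user": "bob", "amount": 4})
apply_event(totals, log[-1])  # totals is now {'alice': 7, 'bob': 7}
```

Because the view is derived purely by folding over the log, it can be rebuilt at any time, and different stores can be derived from the same log to suit different query patterns.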