Ideas on how to consume Kafka Data

Consumption options

There are a few ways we can consume the data generated within our Kafka architecture. I want our API-side consumers to do minimal work to consume topics. With that in mind, there are a few ways we can get data out of topics and into the APIs that feed our apps. API in this case refers to the Backend For Frontend (BFF). There are 3 ways I think these patterns can work.

  1. Direct from Kafka Topics
  2. From a shared consolidated store built off the topics
  3. Consumers that work via webhooks -- off of topics

In all cases

I think in all cases where the API Team consumes from Kafka we should use [Kafkajs](https://kafka.js.org/docs/consuming), much like the existing producers in [atlas-stream-consumers](https://github.com/C2FO/atlas-stream-consumers/blob/master/src/utils/kafka-publisher.js).
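For reference, a minimal Kafkajs consumer looks roughly like this (the broker addresses, client id, group id, and topic name are placeholders, not real values from our setup):

```js
const { Kafka } = require('kafkajs');

// Placeholders -- swap in our real brokers, client id, group id, and topic.
const kafka = new Kafka({
  clientId: 'example-api-consumer',
  brokers: ['kafka-broker-1:9092', 'kafka-broker-2:9092'],
});

const consumer = kafka.consumer({ groupId: 'example-api-group' });

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'example-topic', fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      // message.value is a Buffer; assuming our producers publish JSON.
      const event = JSON.parse(message.value.toString());
      console.log({ topic, partition, offset: message.offset, event });
    },
  });
}

run().catch(console.error);
```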

There are a few nuances to why the 3 approaches are different; they are explained below for each scenario along these dimensions:

  • Data Access Reads (internal network versus external)
  • Data Access Write (where writes occur within the code)
  • Data Access Decoupled (how well it supports true decoupling)

Note: we'll never be 100% decoupled with our data. However, each app should be able to run on its own without other app data. In practice our products communicate with each other and therefore some data will be shared -- however the shape of the data should not be forced from one app to another as it is today.

In the 3 scenarios there are 3 pieces that need to be built. Each approach shifts the responsibility of who writes these pieces and where the logic lives. Below are the pieces of code that need to be written.

  • Topic Consumer -- connect to Kafka
  • Topic Data Reader -- read the data that was produced in Kafka
  • Topic Data Writer -- write the read data to a place

Each of these pieces has a possible owner, which will be either the API Team or the Shared Team.

  • API Team -- the team owning their API
  • Shared Team -- a team that specializes in managing and owning the Kafka Infrastructure (can have people from many teams)
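As a rough sketch of how those three pieces could stay separate regardless of who owns them (the names below are illustrative, not an agreed interface):

```js
// Illustrative only -- the point is that each piece can be owned and deployed separately.

// Topic Data Reader: turn a raw Kafka message into the shape the app cares about.
function readTopicData(message) {
  return JSON.parse(message.value.toString());
}

// Topic Data Writer: persist the read data somewhere (a DB, an internal API, etc.).
async function writeTopicData(data) {
  // e.g. insert into the app's own datastore; intentionally left abstract here
}

// Topic Consumer: connect to Kafka (as in the Kafkajs example above) and
// pipe each message through the reader and writer.
async function handleMessage({ message }) {
  const data = readTopicData(message);
  await writeTopicData(data);
}

module.exports = { readTopicData, writeTopicData, handleMessage };
```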

It's very possible we could simplify this significantly by using AWS Lambda as an event source for the topic consumers:

https://aws.amazon.com/about-aws/whats-new/2020/08/aws-lambda-now-supports-amazon-managed-streaming-for-apache-kafka-as-an-event-source/

We could use Lambdas for all of it -- this would almost completely remove the need for the Topic Consumer, because that piece would be handled automatically for us by the Kafka-to-Lambda integration.
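For illustration, a Lambda handler for a Kafka/MSK event source receives batches of records grouped by topic-partition, with base64-encoded values -- roughly like the sketch below (treating the payloads as JSON is an assumption about our events):

```js
// Sketch of a Lambda handler wired to a Kafka/MSK event source mapping.
// The Topic Consumer piece is handled for us by the event source; only the
// reader/writer logic needs to live in the function body.
exports.handler = async (event) => {
  for (const records of Object.values(event.records)) {
    for (const record of records) {
      // record.value is base64-encoded; assuming JSON payloads here.
      const payload = JSON.parse(Buffer.from(record.value, 'base64').toString('utf8'));
      console.log(record.topic, record.partition, record.offset, payload);
      // Topic Data Reader + Writer logic would go here.
    }
  }
};
```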

Direct From Kafka Topics

If we go the route of apps consuming directly, teams will need to know a little more about Kafka specifically. We will also need to monitor the Kafka producers more closely over time to be sure we are not overloading the system.

  • Topic Consumer (API Team) -- this will be written by the API owning team
  • Topic Data Reader (API Team) -- would be part of the consumer in this case, transforming the data before writing
  • Topic Data Writer (API Team) -- also part of the consumer, likely writing directly to the DB or API endpoint

This is a pretty straightforward approach: deploy a consumer to the topic that reads the data and then writes it, likely all in one piece of code.
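A sketch of that all-in-one consumer, assuming JSON payloads and a hypothetical saveToAppDatabase helper owned by the API team:

```js
const { Kafka } = require('kafkajs');
const { saveToAppDatabase } = require('./db'); // hypothetical API-team helper

const kafka = new Kafka({ clientId: 'bff-direct-consumer', brokers: ['kafka-broker-1:9092'] });
const consumer = kafka.consumer({ groupId: 'bff-direct-consumer-group' });

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'example-topic' });

  await consumer.run({
    eachMessage: async ({ message }) => {
      // Reader: transform the event into the shape this app wants.
      const event = JSON.parse(message.value.toString());
      const row = { id: event.id, status: event.status, updatedAt: event.updated_at };

      // Writer: persist to this app's own datastore (or call its API).
      await saveToAppDatabase(row);
    },
  });
}

run().catch(console.error);
```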

The major cons of this approach are:

  • Teams need to know more about Kafka
  • We create more load for the producers
  • Data Reader and writer must be inside our network

From a shared consolidated store built off topics

A shared consolidated store implies that we push all data we may care about into a shared data store like Postgres. Teams then consume data from this shared store. This does not lead to apps being decoupled from the data as easily.

  • Topic Consumer (Shared Team) -- consume topics and write them to a DB as generic model events (see the sketch after this list)
  • Topic Data Reader (API Team) -- read out of the shared consolidated store and transform the data
  • Topic Data Writer (API Team) -- the API team then writes to their own datastore as needed
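As a sketch of the Shared Team side of this, assuming a generic model_events table in a shared Postgres instance (the table name, columns, and connection details are assumptions, not an agreed schema):

```js
const { Kafka } = require('kafkajs');
const { Pool } = require('pg');

// Broker list, connection string, and the model_events schema are assumptions.
const kafka = new Kafka({ clientId: 'shared-store-consumer', brokers: ['kafka-broker-1:9092'] });
const consumer = kafka.consumer({ groupId: 'shared-store-consumer-group' });
const pool = new Pool({ connectionString: process.env.SHARED_STORE_URL });

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'example-topic' });

  await consumer.run({
    eachMessage: async ({ topic, message }) => {
      // Append every event to a generic model_events table; API teams then
      // read and transform from this table into their own datastores.
      await pool.query(
        'INSERT INTO model_events (topic, key, payload, produced_at) VALUES ($1, $2, $3, $4)',
        [
          topic,
          message.key ? message.key.toString() : null,
          message.value.toString(),
          new Date(Number(message.timestamp)),
        ]
      );
    },
  });
}

run().catch(console.error);
```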

This approach is still straightforward, but has different cons.

  • Data reader and writer must be inside our network
  • Updates are less real-time because each API has to read from the "shared" data store
  • Will likely lead to bad coupling inside the shared data store

Overall this approach may be good in the short term for getting the overall pieces in place.

Consumers via Webhooks

This approach tries to balance knowledge of Kafka, decoupling data, and allowing realtime data updates. It is also the most complex in terms of what needs to be built out. Since there will be webhooks, we need the system sending them to have some fault tolerance. There is a great write-up here on how Hootsuite has done this: https://medium.com/hootsuite-engineering/a-scalable-reliable-webhook-dispatcher-powered-by-kafka-2dc3d677f16b

  • Topic Consumer (Shared Team) -- consume topics as normal
  • Topic Data Reader (Shared Team) -- as the data is read in it will be published out to the configured webhooks (see the dispatcher sketch after this list)
  • Topic Data Writer (API Team) -- create API endpoints that take in the event via an HTTP call (API teams choose how to write)
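A minimal sketch of the Shared Team dispatcher, assuming webhook subscriptions live in a simple config; the retry and failure handling that the Hootsuite write-up covers is deliberately left out here:

```js
const { Kafka } = require('kafkajs');
const fetch = require('node-fetch');

// The subscription config, topic names, and URLs are assumptions for illustration.
const subscriptions = {
  'example-topic': ['https://api.example-app.com/webhooks/example-topic'],
};

const kafka = new Kafka({ clientId: 'webhook-dispatcher', brokers: ['kafka-broker-1:9092'] });
const consumer = kafka.consumer({ groupId: 'webhook-dispatcher-group' });

async function run() {
  await consumer.connect();
  for (const topic of Object.keys(subscriptions)) {
    await consumer.subscribe({ topic });
  }

  await consumer.run({
    eachMessage: async ({ topic, message }) => {
      const body = message.value.toString();
      // POST each event to every configured endpoint. A real dispatcher needs
      // retries, request signing, and dead-lettering when an endpoint keeps failing.
      for (const url of subscriptions[topic] || []) {
        await fetch(url, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body,
        });
      }
    },
  });
}

run().catch(console.error);
```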

This approach puts a lot of burden on the Shared Team, with the benefit that API teams do not really have to do anything with Kafka directly. It also allows APIs to be hosted anywhere in the world, because the POST can go outbound from our data center.
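On the API team side, receiving an event is just another HTTP endpoint. A sketch with Express (the route and what happens to the event are assumptions; each team decides how to write the data):

```js
const express = require('express');
const app = express();
app.use(express.json());

// Hypothetical endpoint the webhook dispatcher POSTs to. The API team decides
// how (and whether) to write the event into its own datastore.
app.post('/webhooks/example-topic', async (req, res) => {
  const event = req.body;
  // e.g. upsert into this app's own table, enqueue a job, etc.
  console.log('received event', event);
  res.sendStatus(204);
});

app.listen(3000);
```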

As part of the implementation we could follow a recommendation from Square: https://developer.squareup.com/blog/reliable-webhooks-using-serverless-architecture/

There are a few cons here as well:

  • Requires creating a WebHook infrastructure -- mostly managing retries to APIs and knowing when they fail
  • Need to have a Shared Team that is responsible for the base infrastructure (though we do need this anyway)
  • Our developers may not fully understand how their data is being sourced