@kcmannem
Created August 29, 2019 02:59
# Cost of Concourse
If we're able to dissect and expose the costs of running Concourse, we can better answer customer questions about why they're spending so much on this tool. Costs caused by running a customer's workload may get grouped in with the cost of running Concourse itself. A framework that decomposes the fixed, variable, and marginal costs lets us better navigate and control this conversation.
# Types of Cost
By design, Concourse has a set of cluster management components that drive costs up compared to a similarly sized worker pool on other build systems. We can classify this as the __Fixed Cost__ of running Concourse. Regardless of deployment size, this cost has to be paid at a minimum. The supported deployment method of Concourse for our customers is BOSH, so we mustn't forget its costs as well.
This is what I see as the minimum well-running deployment scheme:
Fixed_Cost = BOSH Director + 1 LB + 2 Web + 1 DB
* 2 Web nodes are required for HA
* BOSH creates an LB, which is also required for a multi-web environment
* The Director is needed for day 1 and day 2 operations
* Concourse has never used more than 1 DB
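As a rough illustration, the formula above can be sketched in Python. The per-component monthly prices and the names `HYPOTHETICAL_MONTHLY_PRICE`, `MINIMUM_DEPLOYMENT`, and `fixed_cost` are made-up placeholders, not measured figures; only the component counts come from the formula.

```python
# Hypothetical monthly prices per component, in USD. These are placeholders
# for illustration only; substitute real IaaS pricing.
HYPOTHETICAL_MONTHLY_PRICE = {
    "bosh_director": 50.0,
    "load_balancer": 20.0,
    "web": 50.0,
    "db": 50.0,
}

# Component counts taken from the minimum deployment formula.
MINIMUM_DEPLOYMENT = {
    "bosh_director": 1,
    "load_balancer": 1,
    "web": 2,
    "db": 1,
}


def fixed_cost(counts, prices):
    """Sum count * price over every component in the baseline deployment."""
    return sum(n * prices[name] for name, n in counts.items())


print(fixed_cost(MINIMUM_DEPLOYMENT, HYPOTHETICAL_MONTHLY_PRICE))
```

Keeping counts and prices in separate tables makes it easy to swap in real pricing later without touching the deployment shape.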
It's important to note that the size of the VMs created for these components also affects the cost of operation. Looking at our internal deployments, I'd argue that we can disregard this variable, as we tend to scale horizontally (`n1-standard`, 2 CPUs, 8 GB) rather than vertically.
You could argue that some deployments have more than 2 web nodes, but I'd file this under marginal cost, as it has more to do with how much it costs to run an additional build on top of the baseline the fixed cost provides.
The __Variable Cost__ is determined by the size of the worker pool being used. Bigger or more workers give customers larger build throughput. Concourse affects this throughput slightly (possibly more) because it runs administrative workloads alongside customer workloads and places load in non-optimal ways. These come in the form of check containers (and possibly other things I'm forgetting) and placement strategies. Our best-case scenario is that workloads run on Concourse have the same throughput and cost as if the allocated worker pool had been running these scripts without Concourse middleware.
Variable_Cost = N(Workers)
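A minimal sketch of this relationship, assuming cost grows linearly with worker count and that administrative work (such as check containers) eats a fixed share of worker capacity. The `overhead_fraction` parameter and all numbers below are hypothetical; nothing here is a measured Concourse figure.

```python
def variable_cost(n_workers, price_per_worker):
    """Variable cost grows linearly with the number of workers."""
    return n_workers * price_per_worker


def effective_throughput(n_workers, builds_per_worker, overhead_fraction):
    """Builds per month once a share of worker capacity goes to
    administrative work such as check containers.

    overhead_fraction is an assumed number in [0, 1); 0 is the best case
    where Concourse adds no overhead to the worker pool.
    """
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return n_workers * builds_per_worker * (1 - overhead_fraction)


# Hypothetical inputs: 10 workers at $100 each, 1000 builds per worker.
cost = variable_cost(10, 100.0)
best_case = effective_throughput(10, 1000, 0.0)
with_overhead = effective_throughput(10, 1000, 0.2)

# Cost per build in the best case versus with assumed 20% overhead.
print(cost / best_case, cost / with_overhead)
```

The gap between the two printed values is the part of the variable cost that could be attributed to Concourse rather than to the customer's workload.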
The balance we want to find here when comparing against other tooling could be:
build throughput ~ avg build time ~ variable_cost
We can make an argument for sacrificing one of the three, but we're probably worse at all of them currently.
One of the trickiest unanswered questions is "How much does each additional build cost?" Its answer lies in how many builds Concourse can even run. Our fixed cost per build is just `fixed_cost/n_builds`, but this cost curve is not flat. Each web node manages a limited number of builds well before additional nodes need to be added. Therefore the __Marginal Cost__ is dictated by how well Concourse scales (`builds`/`web`).
The way build events are triggered throughout the pipeline is dictated by heavily loaded DB queries. At some point the DB will have to be scaled vertically. We have no insight into when exactly to do this. We do not have any benchmarks for how many builds a single or additional Web node can manage either (Clara's work with the algorithm gave us some numbers which we can use in the future).
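The step-shaped cost curve can be sketched as follows. This assumes each web node handles a fixed number of builds (`builds_per_web`), which is exactly the benchmark we do not have yet, so every number below is hypothetical. DB scaling is left out of the sketch.

```python
import math


def web_nodes_needed(n_builds, builds_per_web, min_web=2):
    """Web nodes required for n_builds, never below the HA minimum of 2."""
    return max(min_web, math.ceil(n_builds / builds_per_web))


def overhead_cost(n_builds, builds_per_web, web_price, other_fixed):
    """Non-worker cost: Director, LB and DB plus however many web nodes."""
    return other_fixed + web_nodes_needed(n_builds, builds_per_web) * web_price


def marginal_cost(n_builds, builds_per_web, web_price, other_fixed):
    """Extra non-worker cost of going from n_builds to n_builds + 1."""
    before = overhead_cost(n_builds, builds_per_web, web_price, other_fixed)
    after = overhead_cost(n_builds + 1, builds_per_web, web_price, other_fixed)
    return after - before


# Hypothetical numbers: 100 builds per web node, $50 per web node,
# $120 for Director + LB + DB.
for n in (50, 200, 201):
    total = overhead_cost(n, 100, 50.0, 120.0)
    print(n, total / n, marginal_cost(n, 100, 50.0, 120.0))
```

Under these assumptions the marginal cost is zero for most builds and jumps by the price of one web node whenever a capacity boundary is crossed, which is why the per-build cost curve is not flat.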
# Thoughts
By going through this exercise we shouldn't get discouraged by the inefficiencies, and we shouldn't be ignorant of the costs. Customers should know the value they're paying for. I came across this:
*"It is generally better to optimize your price for your own value provided to customers, especially if you are offering more value than the competition."*

@jama22

jama22 commented Aug 29, 2019

I'd add context to the following sections:

Cost of Concourse

Since Concourse is not a hosted service, a Concourse user must pay the price in hardware in order to run the software. While Cloud computing is still generally considered "cheap" compared to bare-metal pricing, it is still worthwhile to take cost into account when attempting to grow the Concourse user-base.

For Early Adopters & Small Teams

  • it's important that we keep a low-footprint option for users who are trying the software out and are just running it for their team
  • prior beefs: BOSH, for example, had low adoption because it required (at a minimum) 8 GB of working RAM to run

For on-premise users

  • it's important to consider that the total cost of operating a tool is closely scrutinized
  • operational budgets are set for the quarter/year and exceeding them can cause resentment towards the tool
  • depending on the deployment topology, it may not be possible for users to freely scale out infrastructure

Focusing on Costs

  • While it is the more complicated of the use cases, I propose focusing on the BOSH deployment due to its potential business impact with Pivotal Concourse
  • insert your paragraph: "If we're able to dissect and expose the costs of running Concourse, we can better answer customer questions about why they're spending so much on this tool. Costs caused by running a customer's workload may get grouped in with the cost of running Concourse itself. A framework that decomposes the fixed, variable, and marginal costs lets us better navigate and control this conversation."

Types of Costs

  • I kinda see where you're going with the Fixed Costs definition. I'd also just add a note to say something along the lines of "these are obviously assumptions based on the fact that customers typically NEVER scale out these base components unless directly instructed. Generally, customers who are savvy about their deployments only know to scale workers, so we shall consider those variable"

@xtreme-sameer-vohra

"By design Concourse has a set of cluster management components that drives costs up when compared to similar sized worker pool on other build systems."

Jenkins has a similar architecture to Concourse, with masters and slaves.

@xtreme-sameer-vohra

Minor correction: "BOSH creates an LB which is also required for multi-web environment" -> "BOSH assigns an LB to the Web VMs"

@kcmannem
Author

👍 Thanks for the Jenkins and BOSH clarifications, I didn't know.

@xtreme-sameer-vohra

xtreme-sameer-vohra commented Aug 30, 2019

It seems there are 2 different areas that could be explored independently. It would be helpful to identify which path(s) we intend to explore and their priority.

A cost analysis of Concourse vs. Some-Other CI system

This is in service of a prospective Concourse user who is very price sensitive and isn't willing to switch to Concourse unless they understand the costs thoroughly.
If this is correct, do we have a sense of how many folks are feeling this pain and would switch over if it were resolved?

A cost breakdown of operating Concourse

This is in service of a current Concourse user who is a heavy user of Concourse. It is likely they offer it as a service internally and would like to be able to attribute costs to their end users based on the user's usage.
If this is interesting, what are the sets of data we can identify to iteratively get more granular data?


Data driven approach to improve Concourse

This is a scenario that was brought up during the meeting but isn't captured in the above proposal. It is about having a way to quantifiably determine how a new version of Concourse compares to an older version.
This would help us understand how new work is going to affect our end users. Furthermore, it helps users in the "We are using Concourse, but it's too expensive" camp, as it would allow us to give them quantified improvements as a reason to upgrade.

@ddadlani

Can we back this up with information from current heavy users/customers of Concourse? I like what you have outlined, but without concrete data we run the risk of assuming certain things, e.g. that web is a fixed cost. Have we seen users scale up web due to high load before? How often does that happen? I agree that workers are much more likely to be scaled up than web, but it'd be nice to have that data stored somewhere.

@kcmannem
Author

@ddadlani I'd put that under the marginal cost. I had a short chat with James, who mentioned that customers don't scale unless we tell them to. And they'll use whatever we tell them in the beginning or what the PA set up. But you're right, part of the next steps will be to gather this data.
