aka Service Lifecycle Contexts
Request-scoped contexts are unambiguously good. Other than a brief mention of main(), they're the only use case covered by the official context announcement and documentation. Every single feature of contexts makes sense in a request/response scenario:
- Cancellation provides a unified API for canceling work whose result is no longer needed.
- Deadlines and timeouts provide a unified API for preventing requests from blocking indefinitely.
- Values provide a unified API for tracing and other request-scoped data without expecting all libraries and frameworks to be aware of their types.
Pipeline cancellation is another documented use of contexts, but it can be viewed as a form of chained/continuous request/responses, so all of the same arguments and principles for request-scoped context apply to pipelines as well.
The question I'm considering is: should you have a context that represents the lifecycle of a long-lived service, so that cancellation signals shutdown? Long-lived services such as Nomad agents have complex shutdown semantics:
- On one hand they must be "crash safe": an agent should be able to die at any point and recover on startup. The worst case scenario is that the agent must refuse to restart due to corrupted state, although this should be considered a bug. At no point should forcibly restarting a process cause incorrect behavior: either recover or refuse to run.
- On the other hand they must make a best effort at a graceful shutdown.
There are numerous places where making a best effort to shut down gracefully is a critical Nomad feature:
- Consul TTL health checks are heartbeated on shutdown to make a best effort at preventing TTL expirations during agent restarts.
  - Uses a 2-channel shutdown-signal + shutdown-complete with timeout approach. The shutdown signal could be replaced with a context, but that context must not be used when making Consul API calls.
- When run in -dev mode, the Nomad agent cleans up all running tasks before exiting.
  - Uses a 2-channel shutdown-signal + shutdown-complete approach. The shutdown signal could be replaced with a context, but that context must not be used when communicating with drivers (e.g. Docker API, executor RPCs, exec'ing rkt commands).
- TODO local and/or remote state sync'ing?
As noted above, the first two uses could use a Context.Done() chan for receiving the shutdown signal, but this has 2 gotchas:
- The shutdown context must not be used for communicating with Consul or drivers. Doing so would cancel these operations and defeat the purpose of attempting to shutdown gracefully.
- The parent that cancels the context must know it needs to wait for its children to exit.
The second gotcha might not seem like a big deal, but it means every ancestor of a goroutine that requires a coordinated shutdown must itself implement a coordinated shutdown. For example, even if Agent could just cancel Client's context and exit because Agent doesn't care about any "results" from Client, Client cares about waiting for drivers to exit in dev mode. So Client knows to wait on a graceful shutdown of drivers, but Agent also needs to wait on Client.
In practice this means contexts only complicate shutdown for non-leaf (or close to leaf) goroutines. As soon as some descendant goroutine requires a coordinated shutdown, it infects every parent and defeats much of the simplicity of using a context for shutting down.
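Here is a small sketch of that infection. The agent, client, and driver functions are hypothetical and only loosely modeled on the Nomad components named above; the 20ms sleep stands in for whatever a driver needs to do to stop cleanly.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// driver is a hypothetical leaf that needs some time to stop cleanly
// after the context is cancelled.
func driver(ctx context.Context, stopped *atomic.Bool) {
	<-ctx.Done()
	time.Sleep(20 * time.Millisecond) // simulated graceful stop
	stopped.Store(true)
}

// client knows it must wait for its driver before returning.
func client(ctx context.Context, stopped *atomic.Bool) {
	driverDone := make(chan struct{})
	go func() {
		defer close(driverDone)
		driver(ctx, stopped)
	}()
	<-ctx.Done()
	<-driverDone
}

// agent cancels the shared context and optionally waits for client.
// It reports whether the driver had finished stopping by the time
// agent returned.
func agent(waitForClient bool) bool {
	ctx, cancel := context.WithCancel(context.Background())
	var stopped atomic.Bool

	clientDone := make(chan struct{})
	go func() {
		defer close(clientDone)
		client(ctx, &stopped)
	}()

	cancel()
	if waitForClient {
		<-clientDone
	}
	return stopped.Load()
}

func main() {
	fmt.Println("cancel and return:", agent(false))
	fmt.Println("cancel and wait:  ", agent(true))
}
```

Without the wait, agent returns while the driver is still mid-cleanup, so in a real process main would exit and the cleanup would be cut short. Cancelling the context is not enough; agent needs the same done-channel plumbing that client has.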
The open question in my mind is:
Does the benefit of having a global context tree representing the lifecycle of a service outweigh the cognitive overhead of knowing when to use a simple context cancellation vs a coordinated shutdown mechanism?
I believe the only way to answer it is to look at APIs of possible implementations.
Let's see an example of using a global context tree with a struct that requires a graceful (blocking) shutdown:
type T struct {
	// ctx is cancelled to signal a shutdown
	ctx context.Context

	// cancel cancels T's context to signal a shutdown
	cancel context.CancelFunc

	// doneCh is closed when graceful shutdown is complete
	doneCh chan struct{}
}

// NewT creates a T that exits when pctx is cancelled or Shutdown is called.
func NewT(pctx context.Context) *T {
	t := &T{
		doneCh: make(chan struct{}),
	}
	t.ctx, t.cancel = context.WithCancel(pctx)
	return t
}

// Run is called in a goroutine by T's parent.
func (t *T) Run() {
	defer close(t.doneCh)

	work := make(chan int)
	go someAncillaryProcess(t.ctx)

	for {
		select {
		case <-work:
			// do work
		case <-t.ctx.Done():
			// cancelled; exit
			return
		}
	}
}

// Shutdown gracefully: run pre-shutdown hooks, signal Run to exit,
// wait for it to finish, then run post-shutdown hooks.
func (t *T) Shutdown() {
	t.preShutdown()
	t.cancel()
	<-t.doneCh
	t.postShutdown()
}
The first question is: should the parent context be canceled before or after calling Shutdown? There's no way for the canceler to know. If t.preShutdown() requires someAncillaryProcess(...) to be running, the parent must call Shutdown first. However, since Shutdown cancels the local context, there's no point in passing in a parent context: the child context is always canceled before the parent is.
Obviously a developer would want to document such dependencies and a common pattern could be established to prevent errors, but I am left wondering if the parent context is ever useful?
Other than cancellation, none of the features of contexts make sense for service lifecycles. This leads me to believe service lifecycle context trees are not idiomatic.
Timeouts and deadlines make no sense for "service" goroutines. The only case I can imagine is as a failsafe when testing, but the Go test tool already provides a timeout mechanism that is much more robust.
Don't do it for services. Explicitly pass dependencies.
Resources by core developers:
- Cancelation, Context, and Plumbing https://talks.golang.org/2014/gotham-context.slide#16
- context package docs https://golang.org/pkg/context/
- context package blog post https://blog.golang.org/context
Resources by community members:
- Context is for Cancelation https://dave.cheney.net/2017/01/26/context-is-for-cancelation
  - tl;dr - don't use Values as a bag of dependencies
- Context isn't for Cancellation https://dave.cheney.net/2017/08/20/context-isnt-for-cancellation
  - tl;dr - cancellation doesn't work for coordinated shutdown
- How to correctly use context.Context in Go 1.7 https://medium.com/@cep21/how-to-correctly-use-context-context-in-go-1-7-8f2c0fafdf39
  - tl;dr - basically an expanded version of the official blog post
- Context should go away for Go 2 https://faiface.github.io/post/context-should-go-away-go2/
  - I mostly disagree with this post but included it for the sake of being comprehensive
- Prometheus uses service contexts but has no single root and also uses every other imaginable shutdown signalling mechanism: https://github.com/prometheus/prometheus/blob/v2.3.1/cmd/prometheus/main.go