Skip to content

Instantly share code, notes, and snippets.

@rrbutani
Last active September 11, 2024 07:58
Show Gist options
  • Save rrbutani/a27a5417bd71967d3870713df1b9ca12 to your computer and use it in GitHub Desktop.
Save rrbutani/a27a5417bd71967d3870713df1b9ca12 to your computer and use it in GitHub Desktop.

buck2 is very explicit re: the interactions dep files have w/caching:

  • dep files can (after execution) narrow the set of inputs for an action
  • the input set of the action as known to your buck2 daemon will have this narrowed set but...
    • the actual action cache key (as is given to disk/remote caches, etc.) will still have the full set of inputs
  • ultimately this means that as far as remote caching for cold builds is concerned, the action at hand is still keyed on all of its inputs — including the ones that are pruned
    • this means that — for cold builds — if there have been any changes to any of the "pruned out" files since the last run of the action (+ cache population), the action will not hit in the cache
  • additionally, in the event that there are analysis-level changes that result in the starlark action being reconstructed (i.e. BUILD file is edited), we also lose the narrowed set of inputs
    • for changes that materially affect the produced action (i.e. new .cc file added to a target or even to a dependent of a target) this makes sense..
    • but in cases where the produced Action is actually "identical" to the original (prior to pruning) action this is unnecessary and unfortunate

If we think about what's happening at the action/remote cache level, the behavior w.r.t. to cold builds makes sense.

As an example:

  1. Analysis produces action A
    • A has inputs { 1, 2, 3, 4 }
    • A produces U (a list of unused inputs) and O (an output file)
    • At this point A's cache key is hash({ 1 ..= 4 }, ...)
  2. A executes, yielding a narrowed list of inputs: { 1 }
    • i.e. U contains { 2, 3, 4 }
    • At this point A (as it exists in-memory in the daemon) uses the narrowed list of inputs — we'll call this modified version of A, B
    • However, the remote cache is given { 1 ..= 4 } => { U, O } instead of { 1 } => { U, O }
    • ...
    • So long as there aren't changes invalidating B (i.e. changes to a BUILD/BUCK file, repo rule restart, etc.), B will continue to remain in memory
      • However, if analysis runs again for any reason, (assuming there are no material changes to the analysis time constructs producing A), B will be replaced with a freshly computed A
    • On future builds involving B, when Skyframe/DICE ask B whether it is up-to-date, B consults its narrowed inputs list and thus isn't sensitive to changes to { 2, 3, 4 }

NOTE: a side-effect of B not being sensitive to changes in { 2, 3, 4 } is that changes to those files will not trigger the addition of new cache entries...

Future cold builds where { 1, 2, 3, 4 } are the same (and the action A as yielded by analysis is the same..) will have a cache hit:

  • they'll then download U, modify A such that it has the narrowed input list (i.e. A becomes B) and subsequent builds of the action will (subject to the same caveats above) not be sensitive to changes in { 2, 3, 4 }

Cold builds where { 2, 3, 4 } have changes will not get a cache hit.


This is unfortunate but I don't think there's a way around it? There's kind of a circular dependency here.

  • consider: if we had entered our initial build of A into the cache as { 1 } => { U, O } this wouldn't solve the problem
    • future cold builds would still need to know to use that key for A instead of { 1, 2, 3, 4 }
  • if we introduced a special cache blob for "unused inputs lists" and rewired our action execution machinery to query this cache + narrow inputs on actions before looking for actual cache hits, we still have an issue: what would this "unused inputs lists" cache be keyed on?
    • to be clear: unused_input_list_cache[ ??? ] = U
    • if we key it on { 1, 2, 3, 4 } that lines up nicely with the version of A we get out of analysis but... it's too sensitive!
      • changes to { 2, 3, 4 } would cause us to miss the cache which defeats the point — we'd only get hits in this cache in cases where { 2, 3, 4 } haven't changed since the original run which is the status quo
    • if we key it on just { 1 }, how do we know to use this key in a cold build?
      • also: if we already know to use this narrowed key, U has no extra information for us...
    • using only file paths instead of file contents for { 1, 2, 3, 4 } as the key is tempting and seems like it sidesteps the over-sensitivity issue but...
      • this is only safe to do for files that were in the unused list!
      • modifications to files that were used can easily affect the list of files that are used/unused and so we do want to be sensitive to changes to such files
      • knowing which files we can ignore content changes for is... once again, equivalent to knowing the list of unused inputs
    • ultimately we want some way of associating the action with the list of unused inputs
      • fuzzy schemes that use the action's target label and such all have correctness issues
      • the only canonical identifier we have for the action is its hash which by definition is too sensitive
      • what we really want is a caching scheme that lets us express:
        • cache[action { inputs(specific 1, any 2, any 3, any 4), ... }]
        • such that actions that are structurally identical but have different hashes for 2, 3, and 4 still match this cache entry
        • unfortunately I think ^ is fundamentally incompatible with merkle-tree based caching schemes
          • we could have the lookup process start by querying for a "structure-only" version of the action (i.e. paths only, not content hashes) that then produces a manifest describing the sensitivity for each of the inputs (i.e. requires exact match or not)
            • i.e. structure_cache[A: action { cmd, inputs(path1, path2, path3, path4), ... }] = { sensitivity(S1, S2, S3, S4) }
            • i.e. component_sensitivity[S1] = exact_hash(...)
            • i.e. component_sensitivity[S2 .. S4] = unused
          • but... this has some issues:
            • expensive: many lookups
            • the structure-only cache entries are not unique!
              • as discussed, changes in inputs can very much affect the list of inputs that are unused or not
              • this doesn't exactly pose a correctness issue — consumers would need to check that all the listed sensitivity items match before using the cache entry — but it's still annoying
              • to account for this we'd really need to have the structure_cache in the above return a list of "sensitivity"s...
                • not clear how we "resolve" any particular one with hashing; seems like we'd have to test them in sequence (though we can sort and do clever things, etc.)
                • bottom line is that it's expensive
      • the approach implemented by Bazel/buck2 works because it uses an out-of-band method of associating the action with its "narrowed" input list; i.e. relying on the fact that the action won't be replaced in memory until/unless its full input list/structure is modified

note: actual dynamic deps sidesteps this issue


Note

You can see the pruning in action by running (without caching) bazel aquery followed by bazel build:

bazel aquery //:b
action '//:b: a.txt -> bazel-out/k8-fastbuild/bin/b.out'
  Mnemonic: Ex
  Target: //:b
  Configuration: k8-fastbuild
  Execution platform: @local_config_platform//:host
  ActionKey: f504c1367cbd0ab79ea4a3d6e8d099d32678fb86e994d9b9d5e68eba619ab6a4
  Inputs: [a.txt, b.txt]
  Outputs: [bazel-out/k8-fastbuild/bin/b.out]
  Command Line: (exec cp \
    b.txt \
    bazel-out/k8-fastbuild/bin/b.out)
# Configuration: 7e9df8fa2f6dac8029396f42123b7f41cd174371128cf0a776032c570d3d69c8
# Execution platform: @@local_config_platform//:host
  ExecutionInfo: {internal-inline-outputs: bazel-out/k8-fastbuild/bin/b.unused}
bazel build //:b
Target //:b up-to-date:
  bazel-bin/b.out
  bazel-bin/b.unused
INFO: Elapsed time: 0.253s, Critical Path: 0.05s
INFO: 3 processes: 2 internal, 1 linux-sandbox.
INFO: Build completed successfully, 3 total actions
bazel aquery //:b
action '//:b: b.txt -> bazel-out/k8-fastbuild/bin/b.out'
  Mnemonic: Ex
  Target: //:b
  Configuration: k8-fastbuild
  Execution platform: @local_config_platform//:host
  ActionKey: f504c1367cbd0ab79ea4a3d6e8d099d32678fb86e994d9b9d5e68eba619ab6a4
  Inputs: [b.txt] # !!!
  Outputs: [bazel-out/k8-fastbuild/bin/b.out]
  Command Line: (exec cp \
    b.txt \
    bazel-out/k8-fastbuild/bin/b.out)
# Configuration: 7e9df8fa2f6dac8029396f42123b7f41cd174371128cf0a776032c570d3d69c8
# Execution platform: @@local_config_platform//:host
  ExecutionInfo: {internal-inline-outputs: bazel-out/k8-fastbuild/bin/b.unused}

Can verify the behavior described by running the follow sequences:

  • unused inputs are respected:
    • build b, modify a.txt -> b should not be re-built
  • the cache key covers all inputs (including those that are pruned):
    • build a, b with a disk cache
    • do a bazel clean --expunge
    • modify a
    • build b -> should re-build
  • the unused inputs list is used from the cache, if available
    • build a, b with a disk cache
    • do a bazel clean --expunge
    • build b -> should hit in the cache
    • modify a
    • build b -> should not be re-built
  • the cache is not populated when there are changes to the unused inputs:
    • build a, b with a disk cache
    • modify a
    • build b -> should not be re-built
    • do a bazel clean --expunge, leave a in its modified state
    • build b -> should rebuild (i.e. won't hit in the cache..)
  • the on-disk analysis cache includes the updated (pruned) input list:
    • build a, b
    • do a bazel shutdown (note: not clean --expunge)
    • modify a
    • build b -> should not be re-built
  • changes to things that influence the key of Action cause the in-memory and on-disk action to be replaced with a freshly computed version:
    • ...
    • note: in practice this is actually pretty good; the set of things that can influence an action's key is pretty tight
    • also, the Action in memory retains its original (full input list) key which is exactly what we want re: comparing with newly constructed versions of this Action

misc:

# shellcheck shell=bash
use nix -p bazel_7
/.direnv
/bazel-*
/MODULE.bazel.lock
load(":defs.bzl", "example")
example(
name = "a",
src = ":a.txt",
)
example(
name = "b",
src = ":b.txt",
deps = [":a"],
)
ExampleInfo = provider(
fields = dict(files = "depset[File]"), # only source files for this example
)
def _example_impl(ctx):
# contrived usage of `unused_inputs_list`: files in `deps` are always unused
# by the action but we add them to `inputs` anyways
#
# our `actions.write` call below produces an `unused_inputs_list` file
# containing all the files in `deps`
src = ctx.file.src
out = ctx.actions.declare_file(ctx.attr.name + ".out")
unused = ctx.actions.declare_file(ctx.attr.name + ".unused")
deps = depset(transitive = [ e[ExampleInfo].files for e in ctx.attr.deps ])
ctx.actions.write(
unused, is_executable = False, content = "\n".join([
d.path
for d in deps.to_list()
])
)
ctx.actions.run(
inputs = depset(direct = [src], transitive = [deps]),
outputs = [out],
unused_inputs_list = unused,
executable = "cp",
arguments = [src.path, out.path],
mnemonic = "Ex",
progress_message = "%{label}: %{input} -> %{output}",
)
return [
DefaultInfo(
files = depset(direct = [out, unused]),
),
ExampleInfo(
files = depset(direct = [src], transitive = [deps])
),
]
example = rule(
implementation = _example_impl,
attrs = dict(
src = attr.label(allow_single_file = True),
deps = attr.label_list(providers = [ExampleInfo]),
),
provides = [ExampleInfo],
)
@rrbutani
Copy link
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment