Background

See this gist about post-hoc action input pruning and its limitations w.r.t. caching.

buck2 has starlark APIs for (limited) dynamic deps (i.e. dynamic action creation, albeit only with input subsetting): https://buck2.build/docs/rule_authors/dynamic_dependencies/

Bazel's internal dependency engine (Skyframe) is capable of representing such dependencies, and internal rulesets make use of this functionality (i.e. ThinLTO and C++20 module support in rules_cc); starlark rules, however, have no way to express such graphs.

This PoC

In the specific case described in the previous gist the dynamism is constrained enough (i.e. the number of actions is known statically; only the relevant subset of inputs is "dynamic") that we can attempt to model it in Bazel with TreeArtifacts [1] [2]. At a high level this works by:

  1. running a (quick) action that has all of the files listed as inputs and that produces (for each actual action) a TreeArtifact containing the slimmed-down set of inputs
  2. running the actual action with the TreeArtifact from #1 listed as its inputs
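
Concretely, the wiring looks like this. This is a condensed sketch of the _binary_impl in defs.bzl later in this gist (MyInfo is that file's provider carrying the transitive header depset; other names are simplified):

def _binary_impl(ctx):
    src = ctx.file.src
    all_headers = depset(
        direct = ctx.files.hdrs,
        transitive = [d[MyInfo].headers for d in ctx.attr.deps],
    )
    out = ctx.actions.declare_file(ctx.attr.name + ".out")
    pruned = ctx.actions.declare_directory(ctx.attr.name + ".headers")  # the TreeArtifact

    # 1. quick scan: lists *all* headers as inputs and emits a directory of
    #    symlinks pointing at just the headers `src` actually reaches
    ctx.actions.run(
        outputs = [pruned],
        inputs = depset(direct = [src], transitive = [all_headers]),
        executable = ctx.executable._compiler,
        arguments = [ctx.actions.args()
            .add("scan-deps")
            .add("--input", src)
            .add("--pruned-header-out-dir", pruned.path)
            .add_all("--available-headers", all_headers)],
    )

    # 2. the actual action: only the pruned TreeArtifact is listed, so this
    #    action (and its cache key) never sees the unused headers
    ctx.actions.run(
        outputs = [out],
        inputs = depset(direct = [src, pruned]),
        executable = ctx.executable._compiler,
        arguments = [ctx.actions.args()
            .add("compile")
            .add("--input", src)
            .add("--output", out)
            .add_all("--available-headers", [pruned])],
    )
    return [DefaultInfo(files = depset([out]))]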

This kind of a priori pruning of unused inputs has a couple of upsides:

  • no potential correctness issues — unlike with post-hoc input pruning (i.e. unused_inputs_list; see the sketch after this list) where Bazel has to take it on faith that the action genuinely didn't rely on any of the file inputs it reported as unused, here the sandbox enforces that "unused" inputs actually are unused
    • note that the correctness issues are limited in scope to local incremental builds and cannot result in cache corruption, due to the nature of unused_inputs_list and its interaction with caching — see the previous gist for details
  • better interaction with caching: because #2 has the narrowed set of inputs listed when it's executed, it's genuinely not sensitive to the "unused" inputs — even on clean builds
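
For contrast, post-hoc pruning with unused_inputs_list looks roughly like this (a hypothetical sketch: unused_inputs_list is the real ctx.actions.run parameter, but the --report-unused-to flag and the surrounding names are made up for illustration):

def _binary_posthoc_impl(ctx):
    src = ctx.file.src
    all_headers = depset(
        direct = ctx.files.hdrs,
        transitive = [d[MyInfo].headers for d in ctx.attr.deps],
    )
    out = ctx.actions.declare_file(ctx.attr.name + ".out")
    unused = ctx.actions.declare_file(ctx.attr.name + ".unused")

    # A single action that sees *everything*; it writes the inputs it didn't
    # need to `unused`. Bazel trusts that report and only uses it to skip
    # re-execution on later incremental builds; the cache key of this
    # execution still covers every header in `all_headers`.
    ctx.actions.run(
        outputs = [out, unused],
        inputs = depset(direct = [src], transitive = [all_headers]),
        unused_inputs_list = unused,
        executable = ctx.executable._compiler,
        arguments = [ctx.actions.args()
            .add("compile")
            .add("--input", src)
            .add("--output", out)
            .add("--report-unused-to", unused)  # hypothetical flag
            .add_all("--available-headers", all_headers)],
    )
    return [DefaultInfo(files = depset([out]))]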

The downsides are that:

  • some work is done twice; i.e. we'll end up parsing inputs to discover used headers in both #1 and #2 (in the steps listed above)
  • a little bit of work needs to be done to create a driver that invokes the actual tool to discover headers and then assembles the TreeArtifact
    • potentially an added bit of maintenance burden? probably not burdensome in practice though, provided your tool has a way to discover headers and stop
  • getting the symlinks right is a little tricky; see the long NOTE in compiler.py below and the sketch after this list
  • this only provides upside if the tool invocation in #1 is very fast compared with #2
  • leaning heavily on this technique (instead of expressing the exact set of used headers statically) reduces the fidelity of the static build graph
    • it's a tradeoff between user burden and perf/static information, as always
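
On the symlink point above: the links placed in the TreeArtifact must be expressed relative to the tree's own path. A minimal Python sketch (assuming Python 3.12+ for walk_up=True; this mirrors the long NOTE in compiler.py's scan_deps below):

import os
from pathlib import Path

def link_header_into_tree(out_dir: Path, header: Path) -> None:
    # Pointing the link at `header` as staged (an input symlink) trips
    # Bazel's "Too many levels of symbolic links" validation; resolving it
    # fully bakes in a sandbox/host absolute path that isn't hermetic.
    # A path relative to the tree itself survives restaging.
    rel = header.relative_to(out_dir, walk_up=True)
    os.symlink(rel, out_dir / header.name)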

Note

This approach is not dissimilar to Bazel's "include scanning" feature for rules_cc (effectively Blaze-only, i.e. Google-internal).

The motivation for ^ is to reduce the number/size of files that must be sent to remote workers when using RBE (remote build execution). Unlike this approach, include scanning (today — there has been discussion about using clang-scan-deps) uses a brittle grep-based tool to prune down the set of headers an action references. Additionally, as far as I know, include scanning does not actually rewrite action cache keys and thus has caveats similar to unused_inputs_list.

Note

If you're familiar with how build systems model ThinLTO or C++20 Modules the above probably sounds familiar and janky — as mentioned, Bazel does use its native dynamic dependency capabilities (1, 2a, 2b) to model these language/toolchain features.

Testing

Run the following:

  • bazel build //:a --disk_cache=./disk-cache
    • note the times in the output
  • bazel clean --expunge
  • modify one of the unused headers (i.e. d.header, e.header)
  • bazel build //:a --disk_cache=./disk-cache
    • 1 action should run (scan deps), 1 should hit in the cache (compile)
    • the compile action's printed timestamp should predate the scan-deps invocation's, since it was replayed from the cache rather than re-executed

^ demonstrates that even on a cold clean build, the action cache key for the compile action is scoped down to the headers that are actually used (unlike with the unused_inputs_list approach)
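
The same sequence as a script (the echo line is just one arbitrary way to modify an unused header):

bazel build //:a --disk_cache=./disk-cache  # warm the disk cache
bazel clean --expunge                       # drop *all* local state
echo '"tweak"' >> d.header                  # edit an unused header
bazel build //:a --disk_cache=./disk-cache  # ScanDeps reruns; Compile hits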


Extra Checks and Hygiene Restrictions

Note that an actual rules_cc-esque ruleset would probably also run several validation actions:

  • check that each public header parses and does not have any implicit header dependencies (--process_headers_in_dependencies + parse_headers feature)
  • check that no (non-public) headers are unused by a library
    • afaik Bazel does not do this
  • mostly unrelated to header input pruning and requires support from the tool: check that no source files are implicitly relying on transitively available headers (layering_check)
    • see clang's documentation for details on layering_check
    • note that clang actually has support for specifying a user-facing module name (i.e. a bazel target label!) in the module maps so that layering check errors are more useful to users
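
For reference, with real rules_cc these checks are enabled roughly like so (a sketch; the features only take effect with a toolchain that supports them, e.g. a hermetic LLVM toolchain for layering_check):

# in .bazelrc:
build --process_headers_in_dependencies  # also validate headers of deps
build --features=parse_headers           # each public header must parse standalone
build --features=layering_check          # only direct deps' headers may be included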

Footnotes

  1. https://bazel.build/reference/glossary#tree-artifact

  2. https://jmmv.dev/2019/12/bazel-dynamic-execution-tree-artifacts.html

Files

.bazelrc

# TODO(nixos): necessary for `$PATH` for `/usr/bin/env python3`
common --action_env=PATH
common --action_env=NIX_LD

.envrc

# shellcheck shell=bash
use nix -p bazel_7 python3

# NOTE(nixos): necessary to use the python3 interpreter that rules_python
# fetches...
export NIX_LD="$(
  cat "$(
    nix eval --expr "((import <nixpkgs> {}).stdenv.cc).outPath" --impure --raw
  )/nix-support/dynamic-linker"
)"

.gitignore

/.direnv
/bazel-*
/MODULE.bazel.lock
"0: hello from header a"
include(b)
include(c)
"3: end of header a"

a.source

include(c)
include(a)
"4: hello from a"
include(c)
include(c)
"1: header b"
include(c)
"2: header b"
include(c)
include(c)
include(c)
load("@rules_python//python:defs.bzl", "py_binary")
load(":defs.bzl", "library", "binary")
py_binary(
name = "compiler",
srcs = ["compiler.py"],
)
################################################################################
library(name = "b", hdrs = ["b.header"], deps = [":c"])
library(name = "c", hdrs = ["c.header"])
library(name = "d", hdrs = ["d.header"], deps = [":e", ":b", ":c"])
library(name = "e", hdrs = ["e.header"], deps = [":c"])
binary(
name = "a",
src = "a.source",
hdrs = ["a.header"],
deps = [
":b",
":c",
# actually unused:
":d",
":e",
],
)
"##############################################################################"

compiler.py

#!/usr/bin/env python3

# Strawman example "compiler" for a very simple "language":
#  - source files end in `.source` and have only 1 syntactic construct:
#    + `include(<name>)` lines that result in the "compiler" looking for a file
#      named `<name>.header` and substituting it in
#    + all other lines are left as is
#    + `<name>` must be a single path component (i.e. no directories)
#  - header files end in `.header` and have the same syntax
#
# Recursive includes are disallowed though this is not enforced anywhere.
#
# A particular header can be included multiple times.

import argparse
import functools
import os
from pathlib import Path
import sys
from typing import Iterable


def arg_parser():
    parser = argparse.ArgumentParser()
    subs = parser.add_subparsers()

    comp = subs.add_parser("compile")
    comp.add_argument("--input", type=Path, required=True)
    comp.add_argument("--output", type=Path, required=True)
    comp.add_argument("--available-headers", type=Path, nargs="*", default=[])
    comp.set_defaults(func=compile)

    scan = subs.add_parser("scan-deps")
    scan.add_argument("--input", type=Path, required=True)
    scan.add_argument("--pruned-header-out-dir", type=Path, required=True)
    scan.add_argument("--available-headers", type=Path, nargs="*", default=[])
    scan.set_defaults(func=scan_deps)

    def subcmd_error(_args): raise ValueError("must provide a subcommand")
    parser.set_defaults(func=subcmd_error)

    return parser


p = lambda *a, **kw: print(*a, **kw, file=sys.stderr)


class Include(str): pass
class Line(str): pass


def parse_file(file_path: Path) -> Iterable[Include | Line]:
    p(f"Reading '{file_path}'")
    with open(file_path, "r") as f:
        while (line := f.readline()):
            line_ = line.strip()
            if line_.startswith("include(") and line_.endswith(")"):
                yield Include(line_.removeprefix("include(").removesuffix(")"))
            else:
                yield Line(line)


def _make_header_map(available_headers: frozenset[Path]) -> dict[str, Path]:
    out = {}
    for h in available_headers:
        base = h.name
        assert base.endswith(".header")
        name = base.removesuffix(".header")
        assert name
        assert name not in out
        out[name] = h
    return out


@functools.cache
def resolve_header_path(available_headers: frozenset[Path], name: str) -> Path:
    map = _make_header_map(available_headers)
    if name not in map: raise ValueError(f"no header found for `{name}`")
    return map[name]

################################################################################

def scan_deps(args):
    hdrs = frozenset(args.available_headers)
    out: Path = args.pruned_header_out_dir
    p(f"Scanning inputs of {args.input}; pruning into → {out}")
    p(f"{len(args.available_headers)} headers provided.")

    @functools.cache
    def direct_references(file_path: Path) -> list[str]: return [
        str(entry) for entry in parse_file(file_path) if type(entry) is Include
    ]

    def recursive_references(file_path: Path) -> Iterable[tuple[str, Path]]:
        for include_name in direct_references(file_path):
            include_path = resolve_header_path(hdrs, include_name)
            yield include_name, include_path
            yield from recursive_references(include_path)

    referenced_headers = { n: p for n, p in recursive_references(args.input) }
    referenced_header_paths: set[Path] = set(referenced_headers.values())

    p()
    had_unused = False
    for h in args.available_headers:
        if h not in referenced_header_paths:
            had_unused = True
            p(f"Header {h} was not used.")
    if not had_unused:
        p("No unused headers.")

    if out.exists():
        assert not os.listdir(out)
    os.makedirs(out, exist_ok=True)

    for name, path in referenced_headers.items():
        # NOTE: we need to make a new output symlink that references the header
        # file in question relative to the output directory's path
        #
        # if we just create a symlink pointing at the input path (i.e. a staged
        # input symlink) we get this error:
        #   "error while validating output tree artifact <tree>: <file> (Too many levels of symbolic links)"
        #
        # if we point at the (one level deep) resolved path of the staged input
        # symlink we get the sandbox-only absolute path for the file (i.e.
        # `/tmp/bazel-source-roots/0/a.header`) which Bazel considers dangling
        # when validating the TreeArtifact
        #
        # if we resolve the symlink all the way, we get a host filesystem
        # absolute path which is not hermetic and which (I believe) Bazel will
        # not track in SkyFrame (not certain) and will not interact with RBE
        # well
        #
        # See: https://github.com/bazelbuild/bazel/issues/20891
        # See: https://github.com/bazel-contrib/rules_oci/pull/559/files
        relative_header_path = path.relative_to(out, walk_up=True)
        os.symlink(relative_header_path, out.joinpath(name + ".header"))

    p(f"\n{len(referenced_header_paths)} headers used.")


def compile(args):
    hdrs = frozenset(args.available_headers)
    p(f"Compiling: {args.input} → {args.output}")
    p(f"{len(hdrs)} headers provided.")

    @functools.cache
    def cache_parse(file_path: Path) -> list[Include | Line]:
        return list(parse_file(file_path))

    def recursively_expand(file_path: Path) -> Iterable[str]:
        for item in cache_parse(file_path):
            match item:
                case Include(name):
                    yield from recursively_expand(resolve_header_path(
                        hdrs, name,
                    ))
                case Line(line): yield line
                case other: raise ValueError(f"unreachable: {other}")

    os.makedirs(args.output.parent, exist_ok=True)
    with open(args.output, "w") as out:
        out.writelines(recursively_expand(args.input))


if __name__ == "__main__":
    (args := arg_parser().parse_args()).func(args)

    # Just so it's apparent from stdout whether we ran or hit in the cache.
    import datetime
    print(f"\nFinished at {datetime.datetime.now()}")
"header d: you should not see this!"
include(e)
include(b)
include(c)

defs.bzl

MyInfo = provider(
    fields = dict(
        headers = 'depset[File]',
    ),
)

def _library_impl(ctx):
    direct_headers = ctx.files.hdrs
    deps = ctx.attr.deps
    all_headers = depset(
        direct = direct_headers,
        transitive = [d[MyInfo].headers for d in deps],
    )
    return [
        DefaultInfo(files = all_headers),
        MyInfo(headers = all_headers),
    ]

library = rule(
    implementation = _library_impl,
    attrs = dict(
        hdrs = attr.label_list(allow_files = [".header"]),
        deps = attr.label_list(providers = [MyInfo]),
    ),
    provides = [MyInfo],
)

def _binary_impl(ctx):
    all_headers = depset(
        direct = ctx.files.hdrs,
        transitive = [d[MyInfo].headers for d in ctx.attr.deps],
    )
    src = ctx.file.src
    compiler = ctx.executable._compiler

    out = ctx.actions.declare_file(ctx.attr.name + ".out")
    headers_for_src = ctx.actions.declare_directory("_" + ctx.attr.name + ".headers")
    # NOTE: `TreeArtifact`s

    # First, run `scan-deps` to winnow the set of headers:
    ctx.actions.run(
        outputs = [headers_for_src],
        inputs = depset(direct = [src], transitive = [all_headers]),
        executable = compiler,
        arguments = [
            ctx.actions.args()
                .add("scan-deps")
                .add("--input", src)
                .add("--pruned-header-out-dir", headers_for_src.path)
                .add_all("--available-headers", all_headers)
        ],
        mnemonic = "ScanDeps",
        progress_message = "Scanning %{input} for deps (%{label})",
        # NOTE: in a "real" use case we might tag this action as "local" to cut
        # down on the number of files that need to be copied to RBE workers.
        # execution_requirements = { ... },
        # TODO(nixos): necessary for `$PATH` for `/usr/bin/env python3`
        use_default_shell_env = True,
    )

    # Then run `compile` with the narrowed set of headers (symlinks):
    ctx.actions.run(
        outputs = [out],
        inputs = depset(direct = [src, headers_for_src]),
        executable = compiler,
        arguments = [
            ctx.actions.args()
                .add("compile")
                .add("--input", src)
                .add("--output", out)
                .add_all("--available-headers", [headers_for_src])
        ],
        mnemonic = "Compile",
        progress_message = "Compiling %{input} -> %{output} (%{label})",
        # TODO(nixos): necessary for `$PATH` for `/usr/bin/env python3`
        use_default_shell_env = True,
    )

    return [DefaultInfo(files = depset([out]))]

binary = rule(
    implementation = _binary_impl,
    attrs = dict(
        src = attr.label(allow_single_file = [".source"]),
        hdrs = attr.label_list(allow_files = [".header"]),
        deps = attr.label_list(providers = [MyInfo]),
        _compiler = attr.label(
            executable = True,
            cfg = "exec",
            default = Label("//:compiler"),
        ),
    ),
)
"header e: you should not see this either!"
include(c)

MODULE.bazel

module(name = "dynamic_input_subsetting_with_tree_artifacts")

bazel_dep(name = "rules_python", version = "0.35.0")

python = use_extension("@rules_python//python/extensions:python.bzl", "python")
python.toolchain(python_version = "3.12", is_default = True)