Previously we explored a way to have some degree of dynamic dependencies in Bazel (input "subsetting") using TreeArtifacts.

The particular use case modeled in the previous gist involved:

  • a language with somewhat coarse library and binary rules (i.e. each rule describes a collection of files and their collective dependencies — the default for most Bazel rulesets)
  • a monolithic (and slow!) compiler whose compilation unit size is the entire binary (rather than something smaller like modules or source files)
    • i.e. exacerbating the pain of not having "perfect" file-level dependency information
  • source files that can be easily (and quickly) scanned to determine which dependencies are unused

To increase cache hit rates (i.e. to keep binaries from being rebuilt when inputs that are not actually used to produce them change), the previous gist employs a "scan-deps" action that runs before the compiler is invoked and "winnows" the set of inputs, producing a subset in a TreeArtifact.
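
As a rough illustration of what "winnowing" means here, the sketch below builds a symlink-tree subset outside of Bazel. The directory names (`src`, `subset`) and the hard-coded `used` list are made up for this example; in the previous gist the equivalent work is done by an action whose output is a TreeArtifact.

```shell
#!/bin/sh
set -eu

work="$(mktemp -d)"
mkdir -p "$work/src" "$work/subset"

# Hypothetical inputs: five headers, of which a scan found three in use.
for h in a b c d e; do
    echo "header $h" > "$work/src/$h.header"
done
used="a b c"

# "Winnow": expose only the used headers, as symlinks, in a separate directory.
for h in $used; do
    ln -s "$work/src/$h.header" "$work/subset/$h.header"
done

subset_count="$(ls "$work/subset" | wc -l | tr -d ' ')"
echo "$subset_count header(s) available to the compiler"
```

A compiler pointed at `subset/` instead of `src/` cannot read `d.header` or `e.header` at all, which is what makes a wrong scan result fail loudly instead of silently.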

Important

As mentioned in the previous gist, normally the way this would be modeled in Bazel is to use unused_inputs_list; i.e.:

  • the compiler action would produce a list of files that were not actually needed
  • in subsequent builds (at least those with the same Skyframe in-memory state/persistent action cache), the compiler action is not sensitive to changes to the inputs that were declared as unused
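
To make the first bullet concrete, here is a minimal sketch of how a tool could derive such a list: take everything that was declared as an input and subtract what was actually read. The file names and the `declared`/`read` lists are hypothetical; the one-path-per-line output mirrors what the `scan` mode of `compiler.bash` in this gist writes.

```shell
#!/bin/sh
set -eu

work="$(mktemp -d)"

# Hypothetical data: every declared input, and the inputs the tool actually read.
printf '%s\n' a.header b.header c.header d.header e.header | sort > "$work/declared"
printf '%s\n' c.header a.header b.header | sort > "$work/read"

# `comm -23` keeps lines that appear only in the first file:
# inputs that were declared but never read.
comm -23 "$work/declared" "$work/read" > "$work/unused_inputs"

cat "$work/unused_inputs"
```

The resulting file is the kind of thing an action would hand to Bazel via `unused_inputs_list`.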

There are a couple of reasons to prefer a priori (explicit) input pruning to the unused_inputs_list approach:

  • incremental build correctness: unused_inputs_list hinges on the tool accurately describing what inputs it does not need
    • if the tool incorrectly lists an input as unused:
      • changes to that input will not result in rebuilds of the action
      • clean builds (where the action is rebuilt) will return different results (i.e. a correctness issue)
    • in contrast, with a priori input pruning the compiler action is executed (and sandboxed) such that the unused inputs are not available:
      • if the analysis about which inputs are unused was incorrect, the action will fail to execute
  • CI/remote caching:
    • in Bazel (and buck2), unused_inputs_list is only able to eliminate rebuilds if an action has already executed on a particular daemon
      • i.e. information about which inputs have been "pruned" from the action cannot make it into the action cache; it lives on a particular machine either in-memory (SkyFrame) or in the persistent action cache (on disk within an execroot)
    • this means that CI setups without a persistent Bazel daemon will be unable to leverage this unused dependency information and will need to rebuild
    • in contrast, with a priori input pruning, the "scan-deps" action will have to run when unused inputs change but the subsequent compiler action will not (will hit in the cache, ECO)
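
The caching argument can be sketched with a deliberately simplified model in which an action's cache key is just a digest over the contents of its declared inputs. This is an assumption made for illustration only: Bazel's real action key also covers the command line, environment, and more, and the `key` helper below is not a Bazel API.

```shell
#!/bin/sh
set -eu

work="$(mktemp -d)"
cd "$work"
for h in a b c d e; do echo "header $h" > "$h.header"; done

# Simplified stand-in for an action cache key: a digest over the
# concatenated contents of the listed inputs.
key() { cat "$@" | sha256sum | cut -d' ' -f1; }

full_before="$(key a.header b.header c.header d.header e.header)"
pruned_before="$(key a.header b.header c.header)"

echo "hey" >> e.header   # edit an input the compiler never reads

full_after="$(key a.header b.header c.header d.header e.header)"
pruned_after="$(key a.header b.header c.header)"

if [ "$full_before" != "$full_after" ]; then
    echo "full input set: key changed (cache miss)"
fi
if [ "$pruned_before" = "$pruned_after" ]; then
    echo "pruned input set: key unchanged (cache hit)"
fi
```

In this model an action declared over the full input set misses whenever `e.header` changes, while an action declared over the pruned set keeps the same key on any machine, with no daemon-local state involved.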

The downsides to a priori input pruning (with TreeArtifacts) are:

  • longer critical path; in the case where you do have to rerun the actual action (i.e. run the compiler) you're doing work twice
    • reading in the files to scan for deps and again when the actual action runs
    • just running the compiler would have been faster
    • (the assumption is that we're dealing with a tool where scanning for dependencies is much cheaper than running the compiler)
  • more complexity
    • you have to produce a "copy" (or symlink tree subset, at least) of the inputs and feed it to the actual action which can lead to some annoying issues
    • i.e. paths in diagnostics looking "wrong"

In this gist we model the same use case as the original gist except using shadowed_actions instead of subsetting inputs via TreeArtifact symlinks.

The upsides here are mostly ergonomics/reduced complexity:

  • don't need to adjust inputs or paths in flags/diagnostics output for the actual action
  • lets us sidestep TreeArtifact weirdness

Tip

The "double" aspect is our way of smuggling the ScanDeps Action (which we want to produce) as part of an aspect into the rule that wishes to use it as a shadowed_action.

(you can get the Action directly within the confines of one rule using _skylark_testable and ctx.created_actions() but that's not what that API is intended for...)

Warning

An artificial dependency from the Compile action on an output of the ScanDeps action is required to force the ScanDeps action to run (and its inputs to be pruned) before the Compile action runs.

Otherwise, the Compile action will just run with the full (unpruned) set of inputs.

Important

This patch is needed to get Bazel to inherit only the unpruned set of inputs from the shadowed action:

https://github.com/rrbutani/bazel/commit/ccd4b2763c0e75d03e67a42f71e3ba941f50404f

Without this patch, Compile inherits all the inputs from ScanDeps, including the pruned inputs.


When running with the patch linked above you should see:

bazel build //:a
INFO: From ScanDeps a.inner.unused_hdrs:
5 header(s) available
3 header(s) used
INFO: From Compile a:
3 header(s) available
Target //:a up-to-date:
  bazel-bin/a
echo "hey" >> e.header
bazel build //:a
Target //:a up-to-date:
  bazel-bin/a
INFO: Elapsed time: 0.047s, Critical Path: 0.00s
INFO: 1 process: 2 action cache hit, 1 internal.
# no rebuilds
echo "hey" >> c.header
bazel build //:a
INFO: From ScanDeps a.inner.unused_hdrs:
5 header(s) available
3 header(s) used
INFO: From Compile a:
3 header(s) available
INFO: Found 1 target...
Target //:a up-to-date:
  bazel-bin/a
INFO: Elapsed time: 0.072s, Critical Path: 0.04s
INFO: 3 processes: 1 internal, 2 linux-sandbox.
# `a` *is* rebuilt
echo "include(d)" >> a.header
bazel build //:a
INFO: From ScanDeps a.inner.unused_hdrs:
5 header(s) available
5 header(s) used
INFO: From Compile a:
5 header(s) available
INFO: Found 1 target...
Target //:a up-to-date:
  bazel-bin/a
INFO: Elapsed time: 0.120s, Critical Path: 0.06s
INFO: 3 processes: 1 internal, 2 linux-sandbox.
# `a` is rebuilt; d and e were now made available to the `Compile` action
# enable path mapping:
common --experimental_output_paths=strip
common --action_env=PATH
use nix -p bazel_8
/.direnv
/bazel-*
/MODULE.bazel.lock
"0: hello from header a"
include(b)
include(c)
"3: end of header a"
include(c)
include(a)
"4: hello from a"
include(c)
include(c)
"1: header b"
include(c)
"2: header b"
include(c)
include(c)
include(c)
load(":defs.bzl", "library", "binary")

library(name = "b", hdrs = ["b.header"], deps = [":c"])
library(name = "c", hdrs = ["c.header"])
library(name = "d", hdrs = ["d.header"], deps = [":e", ":b", ":c"])
library(name = "e", hdrs = ["e.header"], deps = [":c"])

binary(
    name = "a",
    src = "a.source",
    hdrs = ["a.header"],
    deps = [
        ":b",
        ":c",
        # actually unused:
        ":d",
        ":e",
    ],
)

exports_files(["compiler.bash"])
"##############################################################################"
#!/usr/bin/env bash

set -euo pipefail
shopt -s globstar
shopt -s nullglob
shopt -s lastpipe

readonly MODE="$1"
readonly SOURCE="$2"
readonly OUT="$3"

declare -A headers
declare -A used_headers

# $1: file path, $2: emit (yes/no)
walk_recursively() {
    local -a lines
    readarray -t lines < "$1"
    for line in "${lines[@]}"; do
        if [[ "$line" =~ ^include\(.*\)$ ]]; then
            header_key=$(cut -d\( -f2 <<<"$line" | cut -d\) -f1)
            if [[ -v headers["$header_key"] ]]; then
                used_headers["$header_key"]=1
                walk_recursively "${headers["$header_key"]}" "$2"
            fi
        elif [[ $2 == "yes" ]]; then
            echo "$line"
        fi
    done
}

scan() {
    walk_recursively "$SOURCE" no
    >&2 echo "${#used_headers[@]} header(s) used"
    for header_key in "${!headers[@]}"; do
        if ! [[ -v used_headers["$header_key"] ]]; then
            echo "${headers["$header_key"]}" # unused header path
        fi
    done
}

compile() {
    walk_recursively "$SOURCE" yes
}

################################################################################

ls **/*.header | while read path; do
    key="$(basename "$path" .header)"
    headers["$key"]="$path"
done
>&2 echo "${#headers[@]} header(s) available"

mkdir -p "$(dirname "$OUT")"
exec > "$OUT"

case "$MODE" in
    scan) scan ;;
    compile) compile ;;
esac
"header d: you should not see this!"
include(e)
include(b)
include(c)
LibInfo = provider(fields = dict(headers = "depset[File]"))

library = rule(
    implementation = lambda ctx: [
        LibInfo(headers = depset(ctx.files.hdrs, transitive = [
            d[LibInfo].headers for d in ctx.attr.deps
        ]))
    ],
    attrs = dict(
        hdrs = attr.label_list(allow_files = [".header"]),
        deps = attr.label_list(providers = [LibInfo]),
    ),
    provides = [LibInfo],
)

#-------------------------------------------------------------------------------

BinInfo = provider(fields = dict(source = "File", headers = "depset[File]"))

_binary_inner = rule(
    implementation = lambda ctx: [
        BinInfo(
            source = ctx.file.src,
            headers = depset(
                direct = ctx.files.hdrs,
                transitive = [ d[LibInfo].headers for d in ctx.attr.deps],
            )
        )
    ],
    attrs = dict(
        src = attr.label(allow_single_file = [".source"]),
        hdrs = attr.label_list(allow_files = [".header"]),
        deps = attr.label_list(providers = [LibInfo]),
    ),
    provides = [BinInfo],
)

#-------------------------------------------------------------------------------

# NOTE: double aspect to make the `Action` available
#   - alternatively we could put the `scan` action on `_binary_inner` of course
#
# NOTE: artificial dependency from the `Compile` action on an output of the
# `ScanDeps` action, to force the `ScanDeps` action to run (and its inputs to
# be pruned) before the `Compile` action runs
#
# Otherwise, the `Compile` action will just run with the full (unpruned) set of
# inputs.
#
# NOTE: this patch is needed to get Bazel to inherit only the unpruned set of
# inputs from the shadowed action:
#   - https://github.com/rrbutani/bazel/commit/ccd4b2763c0e75d03e67a42f71e3ba941f50404f
#
# Without this patch, `Compile` inherits all the inputs from `ScanDeps`,
# including the pruned inputs.
ScanInnerInfo = provider(fields = dict(unused_inputs = "File"))
ScanInfo = provider(fields = dict(
    action = "Action to shadow", unused_inputs = "File",
))

def _scan_deps_inner_impl(target, ctx):
    unused_headers = ctx.actions.declare_file(ctx.label.name + ".unused_hdrs")
    bin_info = target[BinInfo]

    ctx.actions.run(
        outputs = [unused_headers],
        inputs = depset([bin_info.source], transitive = [bin_info.headers]),
        unused_inputs_list = unused_headers,
        executable = ctx.executable._compiler,
        arguments = [ctx.actions.args()
            .add("scan")
            .add(bin_info.source)
            .add(unused_headers)
        ],
        mnemonic = "ScanDeps",
        execution_requirements = { "supports-path-mapping": "1" },
        use_default_shell_env = True,
    )

    return [ScanInnerInfo(unused_inputs = unused_headers)]

_scan_deps_inner = aspect(
    implementation = _scan_deps_inner_impl,
    required_providers = [BinInfo],
    attrs = dict(
        _compiler = attr.label(
            executable = True,
            allow_single_file = True,
            default = Label("//:compiler.bash"),
            cfg = config.exec(),
        ),
    ),
)

_scan_deps = aspect(
    implementation = lambda target, _ctx: ScanInfo(
        action = [ a for a in target.actions if a.mnemonic == "ScanDeps" ][0],
        unused_inputs = target[ScanInnerInfo].unused_inputs,
    ),
    required_providers = [BinInfo],
    requires = [_scan_deps_inner],
    provides = [ScanInfo],
)

#-------------------------------------------------------------------------------

def _binary_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name)

    ctx.actions.run(
        outputs = [out],
        shadowed_action = ctx.attr.bin[ScanInfo].action,
        # see above: artificial dependency on an output of `ScanDeps` is needed:
        inputs = [ctx.attr.bin[ScanInfo].unused_inputs],
        executable = ctx.executable._compiler,
        arguments = [(ctx.actions.args()
            .add("compile")
            .add(ctx.attr.bin[BinInfo].source)
            .add(out)
        )],
        mnemonic = "Compile",
        execution_requirements = { "supports-path-mapping": "1" },
        use_default_shell_env = True,
    )

    return [DefaultInfo(files = depset([out]))]

_binary = rule(
    implementation = _binary_impl,
    attrs = dict(
        bin = attr.label(
            providers = [BinInfo],
            aspects = [_scan_deps],
        ),
        _compiler = attr.label(
            executable = True,
            allow_single_file = True,
            default = Label("//:compiler.bash"),
            cfg = config.exec(),
        ),
    ),
)

def _binary_macro_impl(name, visibility, src, hdrs, deps, **kw):
    inner_n = name + ".inner"
    _binary_inner(name = inner_n, src = src, hdrs = hdrs, deps = deps, **kw)
    _binary(name = name, bin = inner_n, **kw)

binary = macro(
    implementation = _binary_macro_impl,
    inherit_attrs = _binary_inner,
)

"header e: you should not see this either!"
include(c)