This document describes a systematic method for detecting, isolating, correcting, and preventing compiler-performance regressions. The method focuses on the creation of small end-to-end scaling tests that characterize, and stand in for, larger codebases on which the compiler is seen to perform poorly. These scaling tests have a number of advantages over "full sized" performance-testing codebases:
- They run quickly, just long enough to measure counters
- They use a clear criterion (super-linearity) for "bad performance"
- They are small and comprehensible
- They can focus on a single, isolated performance problem
- They can be written and shared by users, characterizing patterns in private codebases
The method is first described in general; we then work through an example of finding and fixing a compilation-performance bug.
(The same method can also be used for measuring and correcting problems in generated code, insofar as the problem you're interested in manifests as something that can be counted while compiling, e.g. the quantity of LLVM code emitted.)
1. Formulate a linear (or sub-linear) relationship that should hold between some aspect of program input size `N` and some measurement `W` of work done by the compiler.
2. Add a statistic counter to the compiler that measures `W`.
3. Write a small scaling testcase: a `.gyb` file varying in `N`, that can be run by `utils/scale-test`.
4. Run `utils/scale-test` selecting for `W`, and see if `W` exceeds `O(N^1)`.
5. If not, go back to step 1 and examine a new relationship.
6. If so, run the testcase at a small scale under a debugger, breaking in the context where `W` is counted, and find the unexpected contexts and causes of the unexpected work.
7. Fix the unexpected causes of work, and rerun the test to confirm (sub-)linearity.
8. Add the testcase to the `validation-tests/compiler_scale` directory to prevent future performance regressions.
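The super-linearity criterion in step 4 can be made concrete: given measurements of `W` at several values of `N`, the scaling exponent is the slope of a least-squares fit of `log W` against `log N`. The following is only an illustrative sketch of that estimate, not the actual `scale-test` implementation:

```python
import math

def scaling_exponent(ns, ws):
    """Estimate k in W ~ c * N^k via a least-squares fit of log W on log N."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(w) for w in ws]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Work that grows linearly in N fits exponent 1:
print(round(scaling_exponent([10, 20, 30, 40], [30, 60, 90, 120]), 1))    # 1.0
# Work that grows quadratically fits exponent 2:
print(round(scaling_exponent([10, 20, 30, 40], [100, 400, 900, 1600]), 1))  # 2.0
```

A counter whose fitted exponent comes out well above 1 is the signal that some phase of the compiler is doing super-linear work.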
As an example (arising from a bug reported in the field), consider the work done typechecking the function bodies of property getters and setters that are synthesized for nominal types. Ideally, when compiling a multi-file module, we expect that only the getters and setters in the file currently being compiled are subject to function-body-level typechecking.
That is, our independent scaling variable `N` will scale the number of files, and our dependent work variable `W` will count the number of calls to the function-body typechecking routine in the compiler. We will check to see that this relationship is linear.
Formulating this relationship precisely (as a testcase), we begin by ensuring that the compiler has a statistic to measure the dependent variable. To do this, we select a location in the compiler to instrument, ensure the containing file includes `swift/Basic/Statistic.h` and defines a local macro `DEBUG_TYPE` naming the local context, and then add a single `SWIFT_FUNC_STAT` macro in the function we want to count. In this case, we place a counter immediately inside `typeCheckAbstractFunctionBody`:
```diff
--- a/lib/Sema/TypeCheckStmt.cpp
+++ b/lib/Sema/TypeCheckStmt.cpp
@@ -27,2 +27,3 @@
 #include "swift/Basic/SourceManager.h"
+#include "swift/Basic/Statistic.h"
 #include "swift/Parse/Lexer.h"
@@ -40,2 +41,4 @@ using namespace swift;
+#define DEBUG_TYPE "TypeCheckStmt"
+
 namespace {
@@ -1250,2 +1253,4 @@ bool TypeChecker::typeCheckAbstractFunctionBody(AbstractFunctionDecl *AFD) {
+  SWIFT_FUNC_STAT;
+
   Optional<FunctionBodyTimer> timer;
```
Recall that the relationship we're interested in testing is one that arises when doing a multi-file compilation: when `swiftc` is invoked with multiple input files comprising a module, and its driver subsequently runs one frontend job for each file in the module, treating one file as primary and parsing (but not translating) all the other files as additional inputs. This type of scale test is not the default mode of the `scale-test` script, but it is supported with the `--sum-multi` command-line flag. In this mode, `scale-test` collects and sums together all statistics of all the primary-file frontend jobs in a multi-file driver job.
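In other words, each frontend job reports its own counters, and the per-job counts for identically-named statistics are added together before fitting. A minimal sketch of that aggregation step, using invented counter values (the real script reads the JSON files that `-stats-json` makes each frontend job write):

```python
from collections import Counter

def sum_job_stats(per_job_stats):
    """Sum identically-named counters across all frontend jobs of one driver run."""
    total = Counter()
    for stats in per_job_stats:
        total.update(stats)  # adds counts key-by-key
    return dict(total)

# Hypothetical per-job counts for a 3-file module:
jobs = [
    {"TypeCheckStmt.typeCheckAbstractFunctionBody": 10},
    {"TypeCheckStmt.typeCheckAbstractFunctionBody": 12},
    {"TypeCheckStmt.typeCheckAbstractFunctionBody": 8},
]
print(sum_job_stats(jobs))
# {'TypeCheckStmt.typeCheckAbstractFunctionBody': 30}
```

Summing across primary-file jobs is what lets a per-file counter expose module-level scaling: redundant work repeated in every job shows up in the module-wide total.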
In order to test the relationship, we start with a single declaration in each file, and vary the number of files. The simplest test therefore looks like this:
```
struct Struct${N} {
  var Field : Int?
}
```
When we run this file under `scale-test`, however, we see only a linear relationship:
```
$ utils/scale-test --sum-multi --select typeCheckAbstractFunctionBody scale.gyb
O(n^1.0) : TypeCheckStmt.typeCheckAbstractFunctionBody
```
To see the actual counts (to be sure we're not misunderstanding), we can pass `--values`:
```
$ utils/scale-test --values --sum-multi --select typeCheckAbstractFunctionBody scale.gyb
O(n^1.0) : TypeCheckStmt.typeCheckAbstractFunctionBody
= [30, 60, 90, 120, 150, 180, 210, 240, 270]
```
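These sums bear out the fitted exponent by inspection: a linear series has constant first differences, and each step here adds exactly 30 calls. A quick arithmetic check, with the values copied from the run above:

```python
# Summed counter values from the nine scale-test runs above.
counts = [30, 60, 90, 120, 150, 180, 210, 240, 270]

# A linear series has constant first differences.
diffs = [b - a for a, b in zip(counts, counts[1:])]
print(diffs)
# [30, 30, 30, 30, 30, 30, 30, 30]
```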
This is encouraging, but since it doesn't reproduce the bad behaviour described in the bug report, we make our testcase just a little more complicated by making the struct in each file refer to the struct defined in the adjacent file in the module:
```
struct Struct${N} {
% if int(N) > 1:
  var Field : Struct${int(N)-1}?
% else:
  var Field : Int?
% end
}
```
Now the 0th struct has an `Int?`-typed property and every other struct has a property that refers to a definition in a neighbouring file; this is a bit more pathological than most codebases, but not too unrealistic. When we run this case under `scale-test`, we see the problematic behaviour very clearly:
```
$ utils/scale-test --values --sum-multi --select typeCheckAbstractFunctionBody scale.gyb
O(n^2.0) : TypeCheckStmt.typeCheckAbstractFunctionBody
= [102, 402, 902, 1602, 2502, 3602, 4902, 6402, 8102]
```
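The quadratic fit is again visible by inspection: a quadratic series has constant *second* differences, and here they are a constant 200. A quick check on the values above:

```python
# Summed counter values from the quadratic-looking runs above.
counts = [102, 402, 902, 1602, 2502, 3602, 4902, 6402, 8102]

first = [b - a for a, b in zip(counts, counts[1:])]
second = [b - a for a, b in zip(first, first[1:])]
print(second)
# [200, 200, 200, 200, 200, 200, 200]
```

Constant second differences mean every extra file adds a growing increment of typechecking work, i.e. each file's getters and setters are being re-typechecked on behalf of its neighbours.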
Next we get to the part that's hard to describe through anything other than debugging experience: finding and fixing the bug. The `scale-test` script can save us some setup difficulty by invoking `lldb` on one of the frontend jobs if we pass it `--debug`, but beyond that it's up to us to diagnose and fix the problem:
```
$ utils/scale-test --debug --sum-multi scale.gyb
(lldb) target create "swiftc"
Current executable set to 'swiftc' (x86_64).
(lldb) settings set -- target.run-args "-frontend" "-c" "-o" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/out.o" "-primary-file" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in0.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in0.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in1.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in2.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in3.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in4.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in5.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in6.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in7.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in8.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in9.swift" "-Xllvm" "-stats" "-Xllvm" "-stats-json" "-Xllvm" "-info-output-file=/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/stats.json"
(lldb) br set --name typeCheckAbstractFunctionBody
Breakpoint 1: where = swiftc`swift::TypeChecker::typeCheckAbstractFunctionBody(swift::AbstractFunctionDecl*) + 21 [inlined] swift::AbstractFunctionDecl::getBodyKind() const at Decl.h:4657, address = 0x000000010094a465
(lldb) r
...
```
In this case the unwanted work can be inhibited by modifying `addTrivialAccessorsToStorage`; the fix involves redirecting some calls from `typeCheckDecl` to the less-involved `validateDecl` (along with some other compensating changes, omitted for brevity here):
```diff
...
@@ -765,4 +765,3 @@ void swift::addTrivialAccessorsToStorage(AbstractStorageDecl *storage,
   synthesizeTrivialGetter(getter, storage, TC);
-  TC.typeCheckDecl(getter, true);
-  TC.typeCheckDecl(getter, false);
+  TC.validateDecl(getter);
@@ -774,4 +773,3 @@ void swift::addTrivialAccessorsToStorage(AbstractStorageDecl *storage,
   synthesizeTrivialSetter(setter, storage, setterValueParam, TC);
-  TC.typeCheckDecl(setter, true);
-  TC.typeCheckDecl(setter, false);
+  TC.validateDecl(setter);
 }
...
```
With our fix applied, the scale test now shows correct behaviour:
```
$ utils/scale-test --values --sum-multi --select typeCheckAbstractFunctionBody scale.gyb
O(n^1.0) : TypeCheckStmt.typeCheckAbstractFunctionBody
= [30, 60, 90, 120, 150, 180, 210, 240, 270]
```
To prevent future regressions, we put the scale test in the testsuite. The `scale-test` driver script will consider any test as failing if it selects a counter that scales worse than `O(n^1.2)` (to give a little room for error).
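The pass/fail decision is thus a simple threshold on the fitted exponent; a sketch of that check (the fitting itself is elided, and the 1.2 cutoff simply mirrors the tolerance described above):

```python
def scale_test_passes(fitted_exponent, threshold=1.2):
    """A selected counter passes if its fitted growth exponent stays within the threshold."""
    return fitted_exponent <= threshold

print(scale_test_passes(1.0))  # linear counter: passes
print(scale_test_passes(2.0))  # quadratic counter: fails
```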
So our investigation testcase is nearly usable as a regression test; we make a few modifications. First, we'll pass `--parse`, since there's no need to run full code generation each time. Second, we'll pass a more limited set of scaling steps, sufficient to show the problem but faster to run. Finally, we'll use the `lit.py` test-execution framework both to run the test and to make it conditional on a release-mode compiler, to further limit the impact on testsuite execution time. The final regression test, which we put in `validation-test/compiler_scale/scale_neighbouring_getset.gyb`, looks like this:
```
// RUN: %scale-test --sum-multi --parse --begin 5 --end 16 --step 5 --select TypeCheckStmt %s
// REQUIRES: tools-release

struct Struct${N} {
% if int(N) > 1:
  var Field : Struct${int(N)-1}?
% else:
  var Field : Int?
% end
}
```