This document describes a systematic method for detecting, isolating, correcting, and preventing compiler-performance regressions. The method focuses on the creation of small end-to-end scaling tests that characterize, and stand in for, larger codebases on which the compiler is seen to perform poorly. These scaling tests have a number of advantages over "full sized" performance-testing codebases:
- They run quickly, just long enough to measure counters
- They use a clear criterion (super-linearity) for "bad performance"
- They are small and comprehensible
- They can focus on a single, isolated performance problem
- They can be written and shared by users, characterizing patterns in private codebases
The method is first described in general; we then work through an example of finding and fixing a compilation-performance bug.
(The same method can also be used for measuring and correcting problems in generated code, insofar as the problem you're interested in manifests as something that can be counted while compiling, e.g. the quantity of LLVM code emitted.)
1. Formulate a linear (or sub-linear) relationship that should hold between some aspect of program input size `N` and some measurement `W` of work done by the compiler.
2. Add a statistic counter to the compiler that measures `W`.
3. Write a small scaling testcase: a `.gyb` file varying in `N`, that can be run by `utils/scale-test`.
4. Run `utils/scale-test` selecting for `W`, and see if `W` exceeds `O(N^1)`.
5. If not, go back to step 1 and examine a new relationship.
6. If so, run the testcase at a small scale under a debugger, breaking in the context where `W` is counted, and find the unexpected contexts and causes of the unexpected work.
7. Fix the unexpected causes of work, and rerun the test to confirm (sub-)linearity.
8. Add the testcase to the `validation-tests/compiler_scale` directory to prevent future performance regressions.
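The super-linearity criterion in step 4 can be made concrete: given measurements of `W` at several values of `N`, the scaling exponent is the slope of a least-squares fit of `log W` against `log N`. The following is only an illustrative sketch of that estimate, not the actual `scale-test` implementation:

```python
import math

def scaling_exponent(ns, ws):
    """Estimate k in W ~ c * N^k via a least-squares fit of log W on log N."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(w) for w in ws]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Work that grows linearly in N fits exponent 1:
print(round(scaling_exponent([10, 20, 30, 40], [30, 60, 90, 120]), 1))    # 1.0
# Work that grows quadratically fits exponent 2:
print(round(scaling_exponent([10, 20, 30, 40], [100, 400, 900, 1600]), 1))  # 2.0
```

A counter whose fitted exponent comes out well above 1 is the signal that some phase of the compiler is doing super-linear work.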
As an example (arising from a bug reported in the field), consider the work done typechecking the function bodies of property getters and setters that are synthesized for nominal types. Ideally, when compiling a multi-file module, we expect that only the getters and setters in the file currently being compiled are subject to function-body-level typechecking.
That is, our independent scaling variable `N` will scale the number of files, and our dependent work variable `W` will count the number of calls to the function-body typechecking routine in the compiler. We will check to see that this relationship is linear.
Formulating this relationship precisely (as a testcase), we begin by ensuring that the compiler has a statistic to measure the dependent variable. To do this, we select a location in the compiler to instrument, ensure the containing file includes `swift/Basic/Statistic.h` and defines a local macro `DEBUG_TYPE` naming the local context, and then add a single `SWIFT_FUNC_STAT` macro in the function we want to count. In this case, we place a counter immediately inside `typeCheckAbstractFunctionBody`:
```diff
--- a/lib/Sema/TypeCheckStmt.cpp
+++ b/lib/Sema/TypeCheckStmt.cpp
@@ -27,2 +27,3 @@
 #include "swift/Basic/SourceManager.h"
+#include "swift/Basic/Statistic.h"
 #include "swift/Parse/Lexer.h"
@@ -40,2 +41,4 @@ using namespace swift;
+#define DEBUG_TYPE "TypeCheckStmt"
+
 namespace {
@@ -1250,2 +1253,4 @@ bool TypeChecker::typeCheckAbstractFunctionBody(AbstractFunctionDecl *AFD) {
+  SWIFT_FUNC_STAT;
+
   Optional<FunctionBodyTimer> timer;
```
Recall that the relationship we're interested in testing is one that arises when doing a multi-file compilation: when `swiftc` is invoked with multiple input files comprising a module, and its driver subsequently runs one frontend job for each file in the module, treating one file as primary and parsing (but not translating) all the other files as additional inputs. This type of scale test is not the default mode of the `scale-test` script, but it is supported with the `--sum-multi` command-line flag. In this mode, `scale-test` collects and sums together all statistics of all the primary-file frontend jobs in a multi-file driver job.
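In other words, each frontend job reports its own counters, and the per-job counts for identically-named statistics are added together before fitting. A minimal sketch of that aggregation step, using invented counter values (the real script reads the JSON files that `-stats-json` makes each frontend job write):

```python
from collections import Counter

def sum_job_stats(per_job_stats):
    """Sum identically-named counters across all frontend jobs of one driver run."""
    total = Counter()
    for stats in per_job_stats:
        total.update(stats)  # adds counts key-by-key
    return dict(total)

# Hypothetical per-job counts for a 3-file module:
jobs = [
    {"TypeCheckStmt.typeCheckAbstractFunctionBody": 10},
    {"TypeCheckStmt.typeCheckAbstractFunctionBody": 12},
    {"TypeCheckStmt.typeCheckAbstractFunctionBody": 8},
]
print(sum_job_stats(jobs))
# {'TypeCheckStmt.typeCheckAbstractFunctionBody': 30}
```

Summing across primary-file jobs is what lets a per-file counter expose module-level scaling: redundant work repeated in every job shows up in the module-wide total.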
In order to test the relationship, we start with a single declaration in each file, and vary the number of files. The simplest test therefore looks like this:
```
struct Struct${N} {
  var Field : Int?
}
```
When we run this file under `scale-test`, however, we see only a linear relationship:
```
$ utils/scale-test --sum-multi --select typeCheckAbstractFunctionBody scale.gyb
O(n^1.0) : TypeCheckStmt.typeCheckAbstractFunctionBody
```
To see the actual counts (to be sure we're not misunderstanding), we can pass `--values`:
```
$ utils/scale-test --values --sum-multi --select typeCheckAbstractFunctionBody scale.gyb
O(n^1.0) : TypeCheckStmt.typeCheckAbstractFunctionBody
= [30, 60, 90, 120, 150, 180, 210, 240, 270]
```
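These sums bear out the fitted exponent by inspection: a linear series has constant first differences, and each step here adds exactly 30 calls. A quick arithmetic check, with the values copied from the run above:

```python
# Summed counter values from the nine scale-test runs above.
counts = [30, 60, 90, 120, 150, 180, 210, 240, 270]

# A linear series has constant first differences.
diffs = [b - a for a, b in zip(counts, counts[1:])]
print(diffs)
# [30, 30, 30, 30, 30, 30, 30, 30]
```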
This is encouraging, but since it doesn't reproduce the bad behaviour described in the bug report, we make our testcase just a little more complicated by making the struct in each file refer to the struct defined in the adjacent file in the module:
```
struct Struct${N} {
% if int(N) > 1:
  var Field : Struct${int(N)-1}?
% else:
  var Field : Int?
% end
}
```
Now the 0th struct has an `Int?`-typed property and every other struct has a property that refers to a definition in a neighbouring file; this is a bit more pathological than most codebases, but not too unrealistic. When we run this case under `scale-test`, we see the problematic behaviour very clearly:
```
$ utils/scale-test --values --sum-multi --select typeCheckAbstractFunctionBody scale.gyb
O(n^2.0) : TypeCheckStmt.typeCheckAbstractFunctionBody
= [102, 402, 902, 1602, 2502, 3602, 4902, 6402, 8102]
```
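The quadratic fit is again visible by inspection: a quadratic series has constant *second* differences, and here they are a constant 200. A quick check on the values above:

```python
# Summed counter values from the quadratic-looking runs above.
counts = [102, 402, 902, 1602, 2502, 3602, 4902, 6402, 8102]

first = [b - a for a, b in zip(counts, counts[1:])]
second = [b - a for a, b in zip(first, first[1:])]
print(second)
# [200, 200, 200, 200, 200, 200, 200]
```

Constant second differences mean every extra file adds a growing increment of typechecking work, i.e. each file's getters and setters are being re-typechecked on behalf of its neighbours.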
Next we get to the part that's hard to describe through anything other than debugging experience: finding and fixing the bug. The `scale-test` script can save us some setup difficulty by invoking `lldb` on one of the frontend jobs if we pass it `--debug`, but beyond that it's up to us to diagnose and fix the problem:
```
$ utils/scale-test --debug --sum-multi scale.gyb
(lldb) target create "swiftc"
Current executable set to 'swiftc' (x86_64).
(lldb) settings set -- target.run-args "-frontend" "-c" "-o" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/out.o" "-primary-file" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in0.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in0.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in1.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in2.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in3.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in4.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in5.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in6.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in7.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in8.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in9.swift" "-Xllvm" "-stats" "-Xllvm" "-stats-json" "-Xllvm" "-info-output-file=/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/stats.json"
(lldb) br set --name typeCheckAbstractFunctionBody
Breakpoint 1: where = swiftc`swift::TypeChecker::typeCheckAbstractFunctionBody(swift::AbstractFunctionDecl*) + 21 [inlined] swift::AbstractFunctionDecl::getBodyKind() const at Decl.h:4657, address = 0x000000010094a465
(lldb) r
...
```
In this case the unwanted work can be inhibited by modifying `addTrivialAccessorsToStorage`; the fix involves redirecting some calls from `typeCheckDecl` to the less-involved `validateDecl` (along with some other compensating changes, omitted for brevity here):
```diff
...
@@ -765,4 +765,3 @@ void swift::addTrivialAccessorsToStorage(AbstractStorageDecl *storage,
   synthesizeTrivialGetter(getter, storage, TC);
-  TC.typeCheckDecl(getter, true);
-  TC.typeCheckDecl(getter, false);
+  TC.validateDecl(getter);
@@ -774,4 +773,3 @@ void swift::addTrivialAccessorsToStorage(AbstractStorageDecl *storage,
   synthesizeTrivialSetter(setter, storage, setterValueParam, TC);
-  TC.typeCheckDecl(setter, true);
-  TC.typeCheckDecl(setter, false);
+  TC.validateDecl(setter);
 }
...
```
With our fix applied, the scale test now shows correct behaviour:
```
$ utils/scale-test --values --sum-multi --select typeCheckAbstractFunctionBody scale.gyb
O(n^1.0) : TypeCheckStmt.typeCheckAbstractFunctionBody
= [30, 60, 90, 120, 150, 180, 210, 240, 270]
```
To prevent future regressions, we put the scale test in the testsuite. The `scale-test` driver script will consider any test as failing if it selects a counter that scales worse than `O(n^1.2)` (to give a little room for error).
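The pass/fail decision is thus a simple threshold on the fitted exponent; a sketch of that check (the fitting itself is elided, and the 1.2 cutoff simply mirrors the tolerance described above):

```python
def scale_test_passes(fitted_exponent, threshold=1.2):
    """A selected counter passes if its fitted growth exponent stays within the threshold."""
    return fitted_exponent <= threshold

print(scale_test_passes(1.0))  # linear counter: passes
print(scale_test_passes(2.0))  # quadratic counter: fails
```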
So our investigation testcase is nearly usable as a regression test; we make a few modifications. First, we'll pass `--parse`, since there's no need to run full code generation each time. Second, we'll pass a more limited set of scaling steps, sufficient to show the problem but faster to run. Finally, we'll use the `lit.py` test-execution framework both to run the test and to make it conditional on a release-mode compiler, to further limit the impact on testsuite execution time. The final regression test, which we put in `validation-test/compiler_scale/scale_neighbouring_getset.gyb`, looks like this:
```
// RUN: %scale-test --sum-multi --parse --begin 5 --end 16 --step 5 --select TypeCheckStmt %s
// REQUIRES: tools-release

struct Struct${N} {
% if int(N) > 1:
  var Field : Struct${int(N)-1}?
% else:
  var Field : Int?
% end
}
```