effect of bounds and overflow checks in zlib-rs

Some further zlib-rs benchmarks using commit c9b7299fa81ca2a8448cbe3e96a42275c9b41b24

Bounds checks

Rust will panic when you try to index out of bounds

This check has a cost: your CPU must perform this check. Typically this involves two instructions: a compare "is the index in bounds" and a jump "go to the panic code". Hence, we intuitively expect programs that perform bounds checking to be slightly slower. In theory, C has an edge here because it does not, by default, check that array access is in bounds.

So we can run a little experiment. We can disable bounds checks in the rust compiler, and observe whether that gives any speedup. The bounds check insertion code is conveniently centralized in a single call. We can comment out this call (and some other thing the compiler tells you to because they are now unused).

https://github.com/rust-lang/rust/blob/ae9f5019f9ce6eb3ecd96206ade4a612efe20fd5/compiler/rustc_mir_build/src/build/expr/as_place.rs#L621

And run our benchmarks

Benchmark 1 (61 runs): ./baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          82.1ms ±  992us    80.7ms … 86.8ms          2 ( 3%)        0%
  peak_rss           26.6MB ± 50.8KB    26.6MB … 26.7MB         11 (18%)        0%
  cpu_cycles          296M  ± 3.28M      292M  …  311M           4 ( 7%)        0%
  instructions        645M  ±  269       645M  …  645M           0 ( 0%)        0%
  cache_references   20.0M  ±  170K     19.7M  … 20.5M           3 ( 5%)        0%
  cache_misses        492K  ± 82.2K      379K  …  768K           3 ( 5%)        0%
  branch_misses      3.11M  ± 5.07K     3.08M  … 3.12M           3 ( 5%)        0%
Benchmark 2 (62 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.7ms ±  993us    80.4ms … 86.1ms          3 ( 5%)          -  0.5% ±  0.4%
  peak_rss           26.6MB ± 50.5KB    26.6MB … 26.7MB         11 (18%)          -  0.0% ±  0.1%
  cpu_cycles          294M  ± 3.70M      288M  …  312M           3 ( 5%)          -  0.8% ±  0.4%
  instructions        590M  ±  286       590M  …  590M           0 ( 0%)        ⚡-  8.5% ±  0.0%
  cache_references   19.8M  ±  181K     19.5M  … 20.4M           5 ( 8%)          -  0.7% ±  0.3%
  cache_misses        429K  ±  105K      311K  …  772K           2 ( 3%)        ⚡- 12.9% ±  6.8%
  branch_misses      3.06M  ± 5.40K     3.05M  … 3.09M           5 ( 8%)        ⚡-  1.8% ±  0.1%

Benchmark 1 (21 runs): ./baseline 6 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           240ms ± 1.88ms     237ms …  245ms          2 (10%)        0%
  peak_rss           24.7MB ± 52.7KB    24.6MB … 24.8MB          4 (19%)        0%
  cpu_cycles         1.02G  ± 6.25M     1.01G  … 1.03G           4 (19%)        0%
  instructions       2.11G  ±  285      2.11G  … 2.11G           0 ( 0%)        0%
  cache_references    104M  ±  792K      103M  …  106M           1 ( 5%)        0%
  cache_misses       1.63M  ±  398K     1.08M  … 2.65M           2 (10%)        0%
  branch_misses      9.26M  ± 8.45K     9.25M  … 9.28M           0 ( 0%)        0%
Benchmark 2 (21 runs): ./target/release/examples/blogpost-compress 6 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           238ms ± 1.74ms     237ms …  245ms          2 (10%)          -  0.7% ±  0.5%
  peak_rss           24.7MB ± 73.5KB    24.5MB … 24.8MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles         1.01G  ± 8.82M     1.00G  … 1.04G           2 (10%)          -  0.7% ±  0.5%
  instructions       1.91G  ±  272      1.91G  … 1.91G           0 ( 0%)        ⚡-  9.5% ±  0.0%
  cache_references    105M  ± 1.05M      103M  …  107M           2 (10%)          +  0.3% ±  0.6%
  cache_misses       1.77M  ±  695K     1.27M  … 4.25M           1 ( 5%)          +  8.7% ± 21.6%
  branch_misses      9.53M  ± 9.02K     9.51M  … 9.55M           0 ( 0%)        💩+  2.8% ±  0.1%

Benchmark 1 (12 runs): ./baseline 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           439ms ± 2.42ms     437ms …  445ms          0 ( 0%)        0%
  peak_rss           24.5MB ± 67.5KB    24.4MB … 24.6MB          3 (25%)        0%
  cpu_cycles         1.93G  ± 8.92M     1.92G  … 1.95G           1 ( 8%)        0%
  instructions       3.34G  ±  462      3.34G  … 3.34G           1 ( 8%)        0%
  cache_references    194M  ±  828K      192M  …  195M           0 ( 0%)        0%
  cache_misses       2.07M  ±  326K     1.71M  … 2.90M           1 ( 8%)        0%
  branch_misses      19.1M  ± 10.3K     19.1M  … 19.1M           0 ( 0%)        0%
Benchmark 2 (12 runs): ./target/release/examples/blogpost-compress 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           436ms ± 2.70ms     433ms …  441ms          0 ( 0%)          -  0.8% ±  0.5%
  peak_rss           24.6MB ± 64.5KB    24.5MB … 24.6MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles         1.91G  ± 6.34M     1.90G  … 1.93G           0 ( 0%)          -  0.9% ±  0.3%
  instructions       3.07G  ±  356      3.07G  … 3.07G           0 ( 0%)        ⚡-  8.2% ±  0.0%
  cache_references    193M  ±  661K      192M  …  195M           0 ( 0%)          -  0.3% ±  0.3%
  cache_misses       1.37M  ±  209K     1.09M  … 1.84M           0 ( 0%)        ⚡- 33.6% ± 11.2%
  branch_misses      19.3M  ± 74.3K     19.2M  … 19.4M           0 ( 0%)        💩+  1.3% ±  0.2%

What we see here is a clear drop in the number of instructions executed, but no significant change in wall time. This is first of all a confirmation that our change to rustc did have an effect (you have to be careful with cargo caching things). Secondly, it shows that the instructions for the bounds check have no real cost, most likely due to instruction-level parallelism (making the comparison effectively free) and modern branch predictors (making the branch effectively free).

So, the difference between zlib-ng and zlib-rs is not the addition of bounds checks.

Overflow Checks

With the standard debug profile, rust will panic when numerical operations overflow. In release mode, wrapping arithmetic is used instead which may mask bugs, but is (said to be) faster.

Overflow bugs usually don't show up in your tests, and with rust's standard behavior, you'll never know if they do exist in production, until your whole app comes crashing down. This sort of issue is tough to reproduce or troubleshoot from some unrelated backtrace, so enabling overflow checks sounds useful.

We can enable overflow checks in release builds in the top-level Cargo.toml:

[profile.release]
overflow-checks = true

Turns out, overflow checks do actually slightly degrade performance in our case. Here I compare against a build of zlib-rs without overflow checks

Benchmark 1 (61 runs): ./target/release/examples/baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          82.3ms ±  679us    80.8ms … 84.0ms          0 ( 0%)        0%
  peak_rss           26.6MB ± 59.2KB    26.3MB … 26.6MB         12 (20%)        0%
  cpu_cycles          295M  ± 2.05M      292M  …  305M           2 ( 3%)        0%
  instructions        640M  ±  274       640M  …  640M           0 ( 0%)        0%
  cache_references   19.9M  ±  193K     19.6M  … 20.9M           3 ( 5%)        0%
  cache_misses        470K  ±  110K      356K  … 1.05M           3 ( 5%)        0%
  branch_misses      3.11M  ± 2.96K     3.11M  … 3.13M           1 ( 2%)        0%
Benchmark 2 (59 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          85.1ms ± 1.27ms    82.8ms … 89.7ms          3 ( 5%)        💩+  3.5% ±  0.4%
  peak_rss           26.6MB ± 56.2KB    26.5MB … 26.6MB         14 (24%)          -  0.0% ±  0.1%
  cpu_cycles          307M  ± 4.70M      300M  …  326M           4 ( 7%)        💩+  4.1% ±  0.4%
  instructions        758M  ±  245       758M  …  758M           0 ( 0%)        💩+ 18.4% ±  0.0%
  cache_references   19.8M  ±  185K     19.4M  … 20.2M           0 ( 0%)          -  0.6% ±  0.3%
  cache_misses        452K  ± 84.6K      316K  …  694K           1 ( 2%)          -  3.7% ±  7.6%
  branch_misses      3.09M  ± 8.71K     3.08M  … 3.11M           0 ( 0%)          -  0.6% ±  0.1%

Benchmark 1 (21 runs): ./target/release/examples/baseline 6 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           239ms ± 3.01ms     236ms …  248ms          2 (10%)        0%
  peak_rss           24.6MB ± 57.2KB    24.5MB … 24.6MB          5 (24%)        0%
  cpu_cycles         1.02G  ± 11.2M     1.01G  … 1.05G           2 (10%)        0%
  instructions       2.10G  ±  311      2.10G  … 2.10G           0 ( 0%)        0%
  cache_references    105M  ±  819K      104M  …  108M           1 ( 5%)        0%
  cache_misses       2.09M  ±  562K     1.54M  … 4.02M           1 ( 5%)        0%
  branch_misses      9.27M  ± 7.43K     9.26M  … 9.28M           0 ( 0%)        0%
Benchmark 2 (21 runs): ./target/release/examples/blogpost-compress 6 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           247ms ± 2.36ms     244ms …  252ms          0 ( 0%)        💩+  3.2% ±  0.7%
  peak_rss           24.6MB ± 75.7KB    24.4MB … 24.6MB          0 ( 0%)          -  0.1% ±  0.2%
  cpu_cycles         1.05G  ± 9.69M     1.04G  … 1.07G           0 ( 0%)        💩+  3.2% ±  0.6%
  instructions       2.34G  ±  234      2.34G  … 2.34G           0 ( 0%)        💩+ 11.8% ±  0.0%
  cache_references    105M  ± 1.02M      103M  …  107M           0 ( 0%)          -  0.5% ±  0.5%
  cache_misses       1.70M  ±  491K     1.12M  … 2.69M           0 ( 0%)        ⚡- 18.7% ± 15.8%
  branch_misses      9.28M  ± 9.09K     9.27M  … 9.31M           1 ( 5%)          +  0.2% ±  0.1%

Benchmark 1 (12 runs): ./target/release/examples/baseline 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           437ms ± 2.22ms     435ms …  443ms          1 ( 8%)        0%
  peak_rss           24.5MB ± 59.3KB    24.4MB … 24.5MB          3 (25%)        0%
  cpu_cycles         1.92G  ± 8.83M     1.91G  … 1.95G           2 (17%)        0%
  instructions       3.32G  ±  316      3.32G  … 3.32G           1 ( 8%)        0%
  cache_references    194M  ±  721K      193M  …  196M           1 ( 8%)        0%
  cache_misses       1.92M  ±  374K     1.35M  … 2.77M           0 ( 0%)        0%
  branch_misses      19.2M  ± 14.6K     19.2M  … 19.2M           1 ( 8%)        0%
Benchmark 2 (12 runs): ./target/release/examples/blogpost-compress 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           451ms ± 1.81ms     449ms …  454ms          0 ( 0%)        💩+  3.1% ±  0.4%
  peak_rss           24.5MB ± 64.5KB    24.4MB … 24.5MB          0 ( 0%)          -  0.0% ±  0.2%
  cpu_cycles         1.98G  ± 7.52M     1.97G  … 2.00G           0 ( 0%)        💩+  3.1% ±  0.4%
  instructions       3.60G  ±  216      3.60G  … 3.60G           0 ( 0%)        💩+  8.5% ±  0.0%
  cache_references    194M  ±  791K      193M  …  195M           0 ( 0%)          -  0.0% ±  0.3%
  cache_misses       2.36M  ±  828K     1.58M  … 4.46M           1 ( 8%)          + 22.6% ± 28.3%
  branch_misses      19.1M  ± 14.0K     19.1M  … 19.2M           0 ( 0%)          -  0.5% ±  0.1%

That performance degradation is not acceptable for us, but I think the impact of overflow checks is much lower in most applications (doing much less number crunching) and the benefits of finding bugs this way is worth it.

folkertdev/bounds and overflow checks.md

Bounds checks

Overflow Checks