Some further zlib-rs benchmarks using commit c9b7299fa81ca2a8448cbe3e96a42275c9b41b24
Rust will panic when you try to index out of bounds
This check has a cost: your CPU must perform this check. Typically this involves two instructions: a compare "is the index in bounds" and a jump "go to the panic code". Hence, we intuitively expect programs that perform bounds checking to be slightly slower. In theory, C has an edge here because it does not, by default, check that array access is in bounds.
So we can run a little experiment. We can disable bounds checks in the rust compiler, and observe whether that gives any speedup. The bounds check insertion code is conveniently centralized in a single call. We can comment out this call (and some other thing the compiler tells you to because they are now unused).
And run our benchmarks
Benchmark 1 (61 runs): ./baseline 1 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 82.1ms ± 992us 80.7ms … 86.8ms 2 ( 3%) 0%
peak_rss 26.6MB ± 50.8KB 26.6MB … 26.7MB 11 (18%) 0%
cpu_cycles 296M ± 3.28M 292M … 311M 4 ( 7%) 0%
instructions 645M ± 269 645M … 645M 0 ( 0%) 0%
cache_references 20.0M ± 170K 19.7M … 20.5M 3 ( 5%) 0%
cache_misses 492K ± 82.2K 379K … 768K 3 ( 5%) 0%
branch_misses 3.11M ± 5.07K 3.08M … 3.12M 3 ( 5%) 0%
Benchmark 2 (62 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 81.7ms ± 993us 80.4ms … 86.1ms 3 ( 5%) - 0.5% ± 0.4%
peak_rss 26.6MB ± 50.5KB 26.6MB … 26.7MB 11 (18%) - 0.0% ± 0.1%
cpu_cycles 294M ± 3.70M 288M … 312M 3 ( 5%) - 0.8% ± 0.4%
instructions 590M ± 286 590M … 590M 0 ( 0%) ⚡- 8.5% ± 0.0%
cache_references 19.8M ± 181K 19.5M … 20.4M 5 ( 8%) - 0.7% ± 0.3%
cache_misses 429K ± 105K 311K … 772K 2 ( 3%) ⚡- 12.9% ± 6.8%
branch_misses 3.06M ± 5.40K 3.05M … 3.09M 5 ( 8%) ⚡- 1.8% ± 0.1%
Benchmark 1 (21 runs): ./baseline 6 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 240ms ± 1.88ms 237ms … 245ms 2 (10%) 0%
peak_rss 24.7MB ± 52.7KB 24.6MB … 24.8MB 4 (19%) 0%
cpu_cycles 1.02G ± 6.25M 1.01G … 1.03G 4 (19%) 0%
instructions 2.11G ± 285 2.11G … 2.11G 0 ( 0%) 0%
cache_references 104M ± 792K 103M … 106M 1 ( 5%) 0%
cache_misses 1.63M ± 398K 1.08M … 2.65M 2 (10%) 0%
branch_misses 9.26M ± 8.45K 9.25M … 9.28M 0 ( 0%) 0%
Benchmark 2 (21 runs): ./target/release/examples/blogpost-compress 6 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 238ms ± 1.74ms 237ms … 245ms 2 (10%) - 0.7% ± 0.5%
peak_rss 24.7MB ± 73.5KB 24.5MB … 24.8MB 0 ( 0%) + 0.1% ± 0.2%
cpu_cycles 1.01G ± 8.82M 1.00G … 1.04G 2 (10%) - 0.7% ± 0.5%
instructions 1.91G ± 272 1.91G … 1.91G 0 ( 0%) ⚡- 9.5% ± 0.0%
cache_references 105M ± 1.05M 103M … 107M 2 (10%) + 0.3% ± 0.6%
cache_misses 1.77M ± 695K 1.27M … 4.25M 1 ( 5%) + 8.7% ± 21.6%
branch_misses 9.53M ± 9.02K 9.51M … 9.55M 0 ( 0%) 💩+ 2.8% ± 0.1%
Benchmark 1 (12 runs): ./baseline 9 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 439ms ± 2.42ms 437ms … 445ms 0 ( 0%) 0%
peak_rss 24.5MB ± 67.5KB 24.4MB … 24.6MB 3 (25%) 0%
cpu_cycles 1.93G ± 8.92M 1.92G … 1.95G 1 ( 8%) 0%
instructions 3.34G ± 462 3.34G … 3.34G 1 ( 8%) 0%
cache_references 194M ± 828K 192M … 195M 0 ( 0%) 0%
cache_misses 2.07M ± 326K 1.71M … 2.90M 1 ( 8%) 0%
branch_misses 19.1M ± 10.3K 19.1M … 19.1M 0 ( 0%) 0%
Benchmark 2 (12 runs): ./target/release/examples/blogpost-compress 9 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 436ms ± 2.70ms 433ms … 441ms 0 ( 0%) - 0.8% ± 0.5%
peak_rss 24.6MB ± 64.5KB 24.5MB … 24.6MB 0 ( 0%) + 0.1% ± 0.2%
cpu_cycles 1.91G ± 6.34M 1.90G … 1.93G 0 ( 0%) - 0.9% ± 0.3%
instructions 3.07G ± 356 3.07G … 3.07G 0 ( 0%) ⚡- 8.2% ± 0.0%
cache_references 193M ± 661K 192M … 195M 0 ( 0%) - 0.3% ± 0.3%
cache_misses 1.37M ± 209K 1.09M … 1.84M 0 ( 0%) ⚡- 33.6% ± 11.2%
branch_misses 19.3M ± 74.3K 19.2M … 19.4M 0 ( 0%) 💩+ 1.3% ± 0.2%
What we see here is a clear drop in the number of instructions executed, but no significant change in wall time. This is first of all a confirmation that our change to rustc did have an effect (you have to be careful with cargo caching things). Secondly, it shows that the instructions for the bounds check have no real cost, most likely due to instruction-level parallelism (making the comparison effectively free) and modern branch predictors (making the branch effectively free).
So, the difference between zlib-ng
and zlib-rs
is not the addition of bounds checks.
With the standard debug
profile, rust will panic when numerical operations overflow.
In release mode, wrapping arithmetic is used instead which may mask bugs, but is (said to be) faster.
Overflow bugs usually don't show up in your tests, and with rust's standard behavior, you'll never know if they do exist in production, until your whole app comes crashing down. This sort of issue is tough to reproduce or troubleshoot from some unrelated backtrace, so enabling overflow checks sounds useful.
We can enable overflow checks in release builds in the top-level Cargo.toml
:
[profile.release]
overflow-checks = true
Turns out, overflow checks do actually slightly degrade performance in our case.
Here I compare against a build of zlib-rs
without overflow checks
Benchmark 1 (61 runs): ./target/release/examples/baseline 1 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 82.3ms ± 679us 80.8ms … 84.0ms 0 ( 0%) 0%
peak_rss 26.6MB ± 59.2KB 26.3MB … 26.6MB 12 (20%) 0%
cpu_cycles 295M ± 2.05M 292M … 305M 2 ( 3%) 0%
instructions 640M ± 274 640M … 640M 0 ( 0%) 0%
cache_references 19.9M ± 193K 19.6M … 20.9M 3 ( 5%) 0%
cache_misses 470K ± 110K 356K … 1.05M 3 ( 5%) 0%
branch_misses 3.11M ± 2.96K 3.11M … 3.13M 1 ( 2%) 0%
Benchmark 2 (59 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 85.1ms ± 1.27ms 82.8ms … 89.7ms 3 ( 5%) 💩+ 3.5% ± 0.4%
peak_rss 26.6MB ± 56.2KB 26.5MB … 26.6MB 14 (24%) - 0.0% ± 0.1%
cpu_cycles 307M ± 4.70M 300M … 326M 4 ( 7%) 💩+ 4.1% ± 0.4%
instructions 758M ± 245 758M … 758M 0 ( 0%) 💩+ 18.4% ± 0.0%
cache_references 19.8M ± 185K 19.4M … 20.2M 0 ( 0%) - 0.6% ± 0.3%
cache_misses 452K ± 84.6K 316K … 694K 1 ( 2%) - 3.7% ± 7.6%
branch_misses 3.09M ± 8.71K 3.08M … 3.11M 0 ( 0%) - 0.6% ± 0.1%
Benchmark 1 (21 runs): ./target/release/examples/baseline 6 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 239ms ± 3.01ms 236ms … 248ms 2 (10%) 0%
peak_rss 24.6MB ± 57.2KB 24.5MB … 24.6MB 5 (24%) 0%
cpu_cycles 1.02G ± 11.2M 1.01G … 1.05G 2 (10%) 0%
instructions 2.10G ± 311 2.10G … 2.10G 0 ( 0%) 0%
cache_references 105M ± 819K 104M … 108M 1 ( 5%) 0%
cache_misses 2.09M ± 562K 1.54M … 4.02M 1 ( 5%) 0%
branch_misses 9.27M ± 7.43K 9.26M … 9.28M 0 ( 0%) 0%
Benchmark 2 (21 runs): ./target/release/examples/blogpost-compress 6 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 247ms ± 2.36ms 244ms … 252ms 0 ( 0%) 💩+ 3.2% ± 0.7%
peak_rss 24.6MB ± 75.7KB 24.4MB … 24.6MB 0 ( 0%) - 0.1% ± 0.2%
cpu_cycles 1.05G ± 9.69M 1.04G … 1.07G 0 ( 0%) 💩+ 3.2% ± 0.6%
instructions 2.34G ± 234 2.34G … 2.34G 0 ( 0%) 💩+ 11.8% ± 0.0%
cache_references 105M ± 1.02M 103M … 107M 0 ( 0%) - 0.5% ± 0.5%
cache_misses 1.70M ± 491K 1.12M … 2.69M 0 ( 0%) ⚡- 18.7% ± 15.8%
branch_misses 9.28M ± 9.09K 9.27M … 9.31M 1 ( 5%) + 0.2% ± 0.1%
Benchmark 1 (12 runs): ./target/release/examples/baseline 9 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 437ms ± 2.22ms 435ms … 443ms 1 ( 8%) 0%
peak_rss 24.5MB ± 59.3KB 24.4MB … 24.5MB 3 (25%) 0%
cpu_cycles 1.92G ± 8.83M 1.91G … 1.95G 2 (17%) 0%
instructions 3.32G ± 316 3.32G … 3.32G 1 ( 8%) 0%
cache_references 194M ± 721K 193M … 196M 1 ( 8%) 0%
cache_misses 1.92M ± 374K 1.35M … 2.77M 0 ( 0%) 0%
branch_misses 19.2M ± 14.6K 19.2M … 19.2M 1 ( 8%) 0%
Benchmark 2 (12 runs): ./target/release/examples/blogpost-compress 9 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 451ms ± 1.81ms 449ms … 454ms 0 ( 0%) 💩+ 3.1% ± 0.4%
peak_rss 24.5MB ± 64.5KB 24.4MB … 24.5MB 0 ( 0%) - 0.0% ± 0.2%
cpu_cycles 1.98G ± 7.52M 1.97G … 2.00G 0 ( 0%) 💩+ 3.1% ± 0.4%
instructions 3.60G ± 216 3.60G … 3.60G 0 ( 0%) 💩+ 8.5% ± 0.0%
cache_references 194M ± 791K 193M … 195M 0 ( 0%) - 0.0% ± 0.3%
cache_misses 2.36M ± 828K 1.58M … 4.46M 1 ( 8%) + 22.6% ± 28.3%
branch_misses 19.1M ± 14.0K 19.1M … 19.2M 0 ( 0%) - 0.5% ± 0.1%
That performance degradation is not acceptable for us, but I think the impact of overflow checks is much lower in most applications (doing much less number crunching) and the benefits of finding bugs this way is worth it.