A Rubyist's Walk Along the C-side (Part 10): Benchmarking

This is an article in a multi-part series called “A Rubyist’s Walk Along the C-side”

In the previous article, you implemented a circular buffer in two different ways as C extensions. In this article, we’ll benchmark the three implementations (pure Ruby vs. instance variables in C vs. TypedData objects) to compare how the implementation affects performance!

Benchmark tools

There are many different tools that can be used to benchmark Ruby code. I’ll be introducing two: the Benchmark module and benchmark-ips.

Ruby’s Benchmark module

The Benchmark module built into Ruby provides a way to time a block of code. You can look at the documentation to learn about the various features it provides. We’ll look at the Benchmark.measure method to measure a single block of code.

Let’s see an example of code that calculates the Fibonacci sequence:

require "benchmark"

puts(Benchmark.measure do
  fib_hash = { 0 => 0, 1 => 1 }
  fib_hash.default_proc = proc { |h, k| h[k - 2] + h[k - 1] }

  puts "F(30) = #{fib_hash[30]}"
end)

Let’s run this code and see the output:

F(30) = 832040
  0.094039   0.000224   0.094263 (  0.094286)

We see that the 31st Fibonacci number is 832040. It’s followed by four numbers on the next line, which are the output from Benchmark. From left to right, they are (see the snippet after this list for how to read these values programmatically):

  1. User CPU time: time spent inside of the Ruby process. You can think of this as the time spent running the code.
  2. System CPU time: time spent inside of the system kernel for the Ruby process. While this isn’t time spent by the program’s execution, it’s directly impacted by the code. Privileged operations like file I/O, network I/O, and allocating memory require switching to the system kernel. So reducing the number and complexity of system calls can reduce the system CPU time.
  3. Sum of user and system CPU times: total time the program occupied the CPU.
  4. Elapsed time: the time spent executing the program as if it were timed with a stopwatch. This is the time that the user experiences.

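Benchmark.measure returns a Benchmark::Tms object, so you can also read these values programmatically instead of parsing the printed string. A minimal sketch, using the same Fibonacci code as above:

require "benchmark"

tms = Benchmark.measure do
  fib_hash = { 0 => 0, 1 => 1 }
  fib_hash.default_proc = proc { |h, k| h[k - 2] + h[k - 1] }
  fib_hash[30]
end

# The four numbers above, as methods on Benchmark::Tms
puts tms.utime # user CPU time
puts tms.stime # system CPU time
puts tms.total # user CPU time + system CPU time
puts tms.real  # elapsed (wall-clock) time
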
It’s interesting to note that the elapsed time may differ significantly from the sum of user and system CPU times in certain scenarios. The elapsed time may be higher when the system is under high load and the system scheduler decides to allocate time to other processes instead. The elapsed time may be lower when we use multithreading, since the user and system CPU times are calculated as the sum of time spent on every CPU core. You can see this behavior by running the code above in multiple Ractors, but this is left as an exercise for the reader.
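
If you want a rough starting point for that exercise, here is a sketch. Note that Ractors are still experimental, so Ruby prints a warning when one is created, and the Ractor API may change between versions:

require "benchmark"

puts(Benchmark.measure do
  # Start four Ractors, each computing F(30) with the unmemoized default_proc
  ractors = 4.times.map do
    Ractor.new do
      fib_hash = { 0 => 0, 1 => 1 }
      fib_hash.default_proc = proc { |h, k| h[k - 2] + h[k - 1] }
      fib_hash[30]
    end
  end

  # Wait for every Ractor to finish
  ractors.each(&:take)
end)

On a machine with multiple cores, the combined user and system CPU time should come out noticeably larger than the elapsed time, since the Ractors run in parallel on separate cores.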

Going back to the original Fibonacci example, if you have a keen eye, you might have noticed a memoization optimization that I missed: computed values are never stored back into the hash. We can change the default_proc line to this:

fib_hash.default_proc = proc { |h, k| h[k] = h[k - 2] + h[k - 1] }

And run the benchmark again:

F(30) = 832040
  0.000014   0.000011   0.000025 (  0.000024)

Much faster.

Problem with the Benchmark module

A problem with the Benchmark module is that it takes a good estimate (or some trial and error) to find a workload that runs for a reasonable amount of time. If it runs for too long, the feedback loop is slow. If it finishes too quickly, the measurements have few significant digits and high variance, so they carry little statistical significance. For example, I can say that the improvement in the memoized version above is about 3700x (0.094263 / 0.000025). However, if I rerun the memoized version, I get a total CPU time of 0.000019, which changes the improvement to about 5000x (0.094263 / 0.000019). That’s quite a big difference!
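
To see this variance for yourself, you can simply repeat the measurement a few times; a minimal sketch:

require "benchmark"

# Measure the memoized version five times to see how much these
# tiny measurements fluctuate from run to run
5.times do
  puts(Benchmark.measure do
    fib_hash = { 0 => 0, 1 => 1 }
    fib_hash.default_proc = proc { |h, k| h[k] = h[k - 2] + h[k - 1] }
    fib_hash[30]
  end)
end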

The benchmark-ips gem

The benchmark-ips gem aims to take more of the guesswork out of benchmarking. It works in two phases: a warmup phase, where it estimates how long the code takes to execute, followed by a calculation phase, where it runs the code for a specified duration (5 seconds by default) using the estimate from the warmup phase.

Let’s write a small benchmark to measure the two implementations above using benchmark-ips! Here’s the code:

# Install benchmark-ips from RubyGems
require "bundler/inline"
gemfile(true) do
  source "https://rubygems.org"
  gem "benchmark-ips"
end

Benchmark.ips do |x|
  # Benchmark for unoptimized Fibonacci
  x.report("unoptimized") do
    fib_hash = { 0 => 0, 1 => 1 }
    fib_hash.default_proc = proc { |h, k| h[k - 2] + h[k - 1] }
    fib_hash[30]
  end

  # Benchmark for the memoized Fibonacci
  x.report("memoized") do
    fib_hash = { 0 => 0, 1 => 1 }
    fib_hash.default_proc = proc { |h, k| h[k] = h[k - 2] + h[k - 1] }
    fib_hash[30]
  end

  # Compare the results
  x.compare!
end

Let’s run the code and see the output:

Warming up --------------------------------------
         unoptimized     1.000  i/100ms
            memoized    27.928k i/100ms
Calculating -------------------------------------
         unoptimized     10.413  (± 0.0%) i/s -     53.000  in   5.094071s
            memoized    287.847k (± 1.3%) i/s -      1.452M in   5.046056s

Comparison:
            memoized:   287847.2 i/s
         unoptimized:       10.4 i/s - 27642.42x  slower

benchmark-ips first runs a warmup where it estimates how many iterations can be performed per 100 milliseconds. It then runs the benchmark and reports the iterations per second for each report block. In this case, the unoptimized implementation ran about 10 times per second while the memoized version ran about 287 thousand times per second! We also see the standard deviation, which shows the spread of our runs (the unoptimized version had an unmeasurably small variation between runs, whereas the memoized version varied by 1.3%). The comparison shows that the memoized version is almost 28 thousand times faster at calculating the 31st Fibonacci number!

Since benchmark-ips runs for a constant amount of time, it takes the guesswork out of benchmarking. We can run this benchmark multiple times and get similar results each time.
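
If the default warmup and measurement durations don’t suit your workload, benchmark-ips also lets you configure them from inside the block. A minimal sketch (the durations chosen here are arbitrary):

Benchmark.ips do |x|
  # Warm up for 2 seconds and measure for 10 seconds
  x.config(warmup: 2, time: 10)

  x.report("memoized") do
    fib_hash = { 0 => 0, 1 => 1 }
    fib_hash.default_proc = proc { |h, k| h[k] = h[k - 2] + h[k - 1] }
    fib_hash[30]
  end
end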

Benchmarking the three implementations of circular buffers

If you completed the previous article, you should have three implementations of the circular buffer: one written in pure Ruby, one using instance variables through the C API, and one using TypedData objects. If you just want the answers, you can find the solutions here. We’re now going to benchmark these solutions and see what kinds of performance improvements we can achieve.

The benchmark script can be found here. It can be run by first executing bundle install to install the dependencies, and then bundle exec rake benchmark.
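
If you just want a feel for its structure, the script looks roughly like the sketch below. The class names (CircularBufferRuby, CircularBufferIvar, CircularBufferTypedData), the require paths, and the exercised operations are assumptions made for illustration; refer to the linked script for the exact code:

require "benchmark/ips"

# Hypothetical require paths for the three implementations
require_relative "circular_buffer_ruby" # pure Ruby implementation
require_relative "circular_buffer_ext"  # C extension with the two C implementations

IMPLEMENTATIONS = {
  "circular_buffer_ruby" => CircularBufferRuby,
  "circular_buffer_ivar" => CircularBufferIvar,
  "circular_buffer_typeddata" => CircularBufferTypedData,
}

Benchmark.ips do |x|
  IMPLEMENTATIONS.each do |name, klass|
    x.report(name) do
      # Exercise the buffer: fill it up, then drain it
      buffer = klass.new(1000)
      1000.times { |i| buffer.write(i) }
      1000.times { buffer.read }
    end
  end

  x.compare!
end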

On my machine, I get the following output:

Warming up --------------------------------------
circular_buffer_ruby   403.000  i/100ms
circular_buffer_ivar   740.000  i/100ms
circular_buffer_typeddata
                         1.465k i/100ms
Calculating -------------------------------------
circular_buffer_ruby      4.026k (± 0.6%) i/s -     20.150k in   5.004632s
circular_buffer_ivar      7.362k (± 1.2%) i/s -     37.000k in   5.026645s
circular_buffer_typeddata
                         14.757k (± 2.0%) i/s -     74.715k in   5.065155s

Comparison:
circular_buffer_typeddata:    14757.1 i/s
circular_buffer_ivar:     7361.8 i/s - 2.00x  slower
circular_buffer_ruby:     4026.4 i/s - 3.67x  slower

We can see that the pure Ruby implementation is the slowest, that we achieve nearly double the performance by using instance variables in our C extension, and that we double the performance again by using TypedData objects! Ultimately, our TypedData implementation ends up being nearly 4x faster than the pure Ruby implementation. Not bad.

Performance penalty when using instance variables through the C API

While using instance variables through the C API was faster than the pure Ruby implementation in the example above, this is not always the case. The Ruby VM understands Ruby code better than it understands C extensions, so it can apply more advanced optimization techniques, such as inline caching, to instance variable accesses in Ruby code. Ruby code can also be further optimized by the JIT (just-in-time) compiler. We can see how the JIT affects performance by enabling YJIT with RUBY_YJIT_ENABLE=1 bundle exec rake benchmark:

Warming up --------------------------------------
circular_buffer_ruby     1.108k i/100ms
circular_buffer_ivar   820.000  i/100ms
circular_buffer_typeddata
                         1.886k i/100ms
Calculating -------------------------------------
circular_buffer_ruby     11.269k (± 1.2%) i/s -     56.508k in   5.015224s
circular_buffer_ivar      8.171k (± 1.7%) i/s -     41.000k in   5.018950s
circular_buffer_typeddata
                         18.954k (± 2.9%) i/s -     96.186k in   5.079668s

Comparison:
circular_buffer_typeddata:    18954.4 i/s
circular_buffer_ruby:    11269.1 i/s - 1.68x  slower
circular_buffer_ivar:     8171.4 i/s - 2.32x  slower

The story now changes: the implementation using instance variables in our C extension is by far the slowest, with the pure Ruby implementation now 1.38x faster than it. The TypedData implementation remains the fastest, at 1.68x faster than the pure Ruby implementation. We can clearly see here how pure Ruby can often be faster than a poorly optimized C implementation.
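
If you want to double-check that YJIT was actually active during a run, you can ask the VM directly. This assumes your Ruby was built with YJIT support:

# Prints true when the current process is running with YJIT enabled,
# e.g. when started with RUBY_YJIT_ENABLE=1 or --yjit
puts RubyVM::YJIT.enabled?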

Conclusion

In this article, we looked at two different tools to benchmark Ruby code, and we benchmarked the project from the previous part. The benchmark showed a significant performance improvement using TypedData objects over the pure Ruby implementation. In the next article, we’ll look at some ways to make our lives less miserable when the inevitable day comes: our C extension crashes.