The Ruby Inplace Bug
It is strongly recommended that you read my article on the ways to create Ruby strings in C extensions if you’re not familiar with Ruby’s C string API.
Here’s a story of string corruption in MRI when the wrong function is used to create the Ruby string. But before that, let me explain a few Ruby features that you may not know.
Ruby features that you may not know
Editing a file in place
Consider the following script:
while gets
  puts $_.gsub(/perl/, "ruby")
end
And then we run:
$ echo "I like perl, it is my favourite language." > temp.txt
$ ruby -i script.rb temp.txt
$ cat temp.txt
I like ruby, it is my favourite language.
As you can see, with the -i flag we can read the text file line-by-line through gets, and replace each line with whatever is written to standard output (the puts).
Backup files
Let’s run the script above again, but instead of passing -i to Ruby, let’s pass -i.bak:
$ echo "I like perl, it is my favourite language." > temp.txt
$ ruby -i.bak script.rb temp.txt
$ cat temp.txt
I like ruby, it is my favourite language.
$ cat temp.txt.bak
I like perl, it is my favourite language.
All this does is create a backup file (with the extension .bak in our case) holding the original contents before the file is modified.
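As a rough mental model (and not MRI's actual implementation), in-place editing with a backup suffix can be pictured as moving the original file aside and then writing the transformed lines back under the original name. The helper name edit_in_place is made up for illustration, and the sketch only covers the case where a non-empty suffix is given:

```ruby
require "tmpdir"

# Simplified model of `ruby -i<suffix>`: keep the original under a backup name,
# then write the transformed lines back to the original path.
# `edit_in_place` is a hypothetical helper, not part of Ruby.
def edit_in_place(path, suffix)
  backup = path + suffix
  File.rename(path, backup) # the original contents now live in the backup
  File.open(path, "w") do |out|
    # Each line is passed to the caller's block; its result replaces the line.
    File.foreach(backup) { |line| out.print(yield(line)) }
  end
  backup
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "temp.txt")
  File.write(path, "I like perl, it is my favourite language.\n")
  backup = edit_in_place(path, ".bak") { |line| line.gsub(/perl/, "ruby") }
  puts File.read(path)   # the edited file
  puts File.read(backup) # the untouched original
end
```

The important detail for the rest of this story is the first line of the helper: the backup name is built by appending the suffix string to the file name, so a corrupted suffix means a corrupted backup file name.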
Run the Ruby script line-by-line
Did you know we can do the above, but with a single line of Ruby code? Consider the following Ruby script:
$_.gsub!(/perl/, "ruby")
And then we run it as follows (notice we are running the script with an extra -p flag):
$ echo "I like perl, it is my favourite language." > temp.txt
$ ruby -pi.bak script.rb temp.txt
$ cat temp.txt
I like ruby, it is my favourite language.
$ cat temp.txt.bak
I like perl, it is my favourite language.
The -p flag conveniently wraps your code in an implicit while gets(); ...; puts $_; end loop.
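To make the implicit loop concrete, here is a small sketch that spells it out by hand, using StringIO objects in place of the real input file and standard output. The helper name each_line_like_dash_p is invented for this example, and the real -p loop works on $_ rather than a block argument:

```ruby
require "stringio"

# Hypothetical helper that mimics the loop -p wraps around a script:
# read a line, run the script body on it, then print the line.
def each_line_like_dash_p(input, output)
  while (line = input.gets)
    yield line       # the script body; it may modify the line in place
    output.puts line # the (possibly modified) line is printed
  end
end

input = StringIO.new("I like perl, it is my favourite language.\n")
output = StringIO.new
each_line_like_dash_p(input, output) { |line| line.gsub!(/perl/, "ruby") }
puts output.string
```

This is also why the one-line script uses gsub! rather than gsub: the loop prints the line itself, so the body has to modify it in place.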
BEGIN blocks
One question you might ask yourself is, if I use the -p flag, how do I do global setup? Ruby’s got a feature for that! Enter BEGIN blocks.
Note: Ruby’s keywords are case sensitive, so BEGIN is not the same as begin.
So now, if we do:
BEGIN {
  puts "It is starting!"
}
$_.gsub!(/perl/, "ruby")
And then we can run it as follows:
$ echo "I like perl, it is my favourite language." > temp.txt
$ ruby -pi.bak script.rb temp.txt
It is starting!
$ cat temp.txt
I like ruby, it is my favourite language.
$ cat temp.txt.bak
I like perl, it is my favourite language.
Notice we get the It is starting! output in the terminal, and not in the file.
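A quick way to see that a BEGIN block runs once, before the per-line loop, is to feed a tiny script two lines of input. The sketch below launches a child Ruby with the -n flag (the same loop as -p, but without the automatic printing) and captures what it prints. The helper name begin_block_demo is made up for this example:

```ruby
require "open3"
require "rbconfig"

# Runs a tiny -n script over two input lines in a child Ruby process and
# returns everything it printed. `begin_block_demo` is a hypothetical helper.
def begin_block_demo
  # Single quotes keep #{$_} from being interpolated here, so the child
  # process receives it literally and evaluates it once per input line.
  script = 'BEGIN { puts "setup" }; puts "line: #{$_}"'
  output, _status = Open3.capture2(RbConfig.ruby, "-ne", script, stdin_data: "a\nb\n")
  output
end

puts begin_block_demo
```

The setup line should appear once, followed by one line of output per input line, which is what makes BEGIN a natural place for global setup.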
Shebang
We want to minimize the number of characters we have to type. So we can omit the flags from the terminal and instead include them in the file using a shebang line.
#!/usr/bin/ruby -pi.bak
BEGIN {
  puts "It is starting!"
}
$_.gsub!(/perl/, "ruby")
And then we can run it as follows:
$ echo "I like perl, it is my favourite language." > temp.txt
$ ruby script.rb temp.txt
It is starting!
$ cat temp.txt
I like ruby, it is my favourite language.
$ cat temp.txt.bak
I like perl, it is my favourite language.
In fact, Ruby doesn’t actually check that the binary in the shebang is valid, just that it ends in ruby. So the following shebang would have worked too:
#!/some/invalid/dir/ruby -pi.bak
The Ruby inplace bug
Ok, we finally have everything we need to reproduce the bug. Consider the following script:
#!/usr/bin/ruby -pi.bak
BEGIN {
  GC.start(full_mark: true)
  arr = []
  1000000.times do |x|
    arr << "fooo#{x}"
  end
}
$_.gsub!(/perl/, "ruby")
So what do we expect it to do? Well, this is pretty much the same script as the one in the Shebang section above (with some seemingly useless work added to the BEGIN block).
So let’s run this script:
$ echo "I like perl, it is my favourite language." > temp.txt
$ ruby script.rb temp.txt
$ cat temp.txt
I like ruby, it is my favourite language.
$ cat temp.txt.bak
cat: temp.txt.bak: No such file or directory
$ ls
script.rb
temp.txt
temp.txto106
Wait what!?!? Where is our backup file temp.txt.bak? And what is temp.txto106? Let’s inspect these files.
$ cat temp.txt
I like ruby, it is my favourite language.
$ cat temp.txto106
I like perl, it is my favourite language.
It seems like temp.txto106 contains the original file (the one that should have been the backup file). So what’s going on? You might have guessed it: the wrong C function was used to create a Ruby string!
So, what’s wrong?
When you pass the flag -i.bak in the shebang, Ruby parses the arguments from a Ruby string read from the contents of the Ruby script.[1] It then sets the extension of the backup file by calling ruby_set_inplace_mode. This function is really simple; it’s defined like this:
void
ruby_set_inplace_mode(const char *suffix)
{
    ARGF.inplace = !suffix ? Qfalse : !*suffix ? Qnil : rb_fstring_cstr(suffix);
}
The problem here is the usage of rb_fstring_cstr, which (similar to rb_str_new_static) expects a pointer to a region of memory that is not freed before this Ruby string is swept (because it points the Ruby string directly at the memory passed in, rather than copying it).
To explain using diagrams (don’t we all love diagrams?), we have a Ruby string of the flags in the shebang stored in argv. We then call ruby_set_inplace_mode and set ARGF.inplace to a Ruby string that points to .bak.
But then we no longer need argv, while ARGF.inplace is still pointing into the string that argv was referencing.
So then we call GC.start, which guarantees that argv is swept, and then we create a large number of fooo#{x} strings. This helps us make sure the memory that held the string -pi.bak is overwritten, so ARGF.inplace ends up pointing to some other value. Once the script terminates, the backup extension we use is o106 (or some other gibberish).
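The sequence above can be imitated in plain Ruby with a toy model. Here a mutable string named memory stands in for the C buffer that held -pi.bak, and a lambda stands in for a Ruby string that only remembers where its bytes live. Both names, and the helper borrow_suffix, are invented for this sketch. It assumes the freed buffer happens to be reused for the string fooo106, which is one plausible way to end up with o106 but is not something the sketch can prove about the real allocator:

```ruby
# Toy model only: `memory` stands in for the C buffer that held "-pi.bak".
# The returned lambda behaves like a borrowed C pointer: it looks at the
# buffer every time it is read instead of keeping its own copy.
def borrow_suffix(memory, start, len)
  -> { memory.byteslice(start, len) }
end

memory = +"-pi.bak"
suffix = borrow_suffix(memory, 3, 4) # the 4 bytes after "-pi"
puts "temp.txt" + suffix.call        # temp.txt.bak

# Assumption: after the buffer is freed, the same bytes are handed to one
# of the "fooo#{x}" strings created in the BEGIN block.
memory.replace("fooo106")
puts "temp.txt" + suffix.call        # temp.txto106
```

Note that -pi.bak and fooo106 are both seven bytes long, so the last four bytes line up exactly with where .bak used to be.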
The fix
The fix only changes one line and a few characters![2] The solution is, instead of calling rb_fstring_cstr to create the string, to use rb_str_new, which allocates memory for the string and copies over the contents of the string. Once we do that, we no longer need to ensure that the original string is not freed.
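In the same toy-model spirit (a mutable Ruby string standing in for the C buffer, and nothing here being MRI's real code), copying can be pictured as taking a snapshot of the bytes up front. The helper name copy_suffix is made up for this sketch:

```ruby
# Toy model only: copying takes a snapshot of the bytes instead of
# remembering where they live, so later reuse of the buffer cannot
# change the suffix.
def copy_suffix(memory, start, len)
  memory.byteslice(start, len).freeze
end

memory = +"-pi.bak"
suffix = copy_suffix(memory, 3, 4)

# The buffer is reused for an unrelated string, as in the bug scenario.
memory.replace("fooo106")
puts "temp.txt" + suffix # temp.txt.bak
```

The cost is one extra allocation and copy for a handful of bytes, which is negligible for something that happens once at startup.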
Acknowledgements
I would like to acknowledge Matt Valentine-House as the co-discoverer of the bug and the co-author of the fix.
[1] The string is created in load_file_internal and parsed in proc_options.