A Rubyist's Walk Along the C-side (Part 4): Primitive Data Types
This is an article in a multi-part series called “A Rubyist’s Walk Along the C-side”.
In the previous article, we saw how to call Ruby methods in C extensions. In this article, we’ll look at the primitive data types in the Ruby C API.
In Ruby, everything is an object. However, that is not true when writing a C extension as not all types are created equal. For the more “primitive” types, there are often more efficient ways to manipulate them than to call the Ruby method. For example, there’s a more efficient way to get an element at a particular index in Ruby arrays in C than to call Array#[]
.
Primitive data types
Every type that is listed in the union of the RVALUE
struct is a primitive data type (if you’re interested in what the RVALUE
struct is and how the garbage collector works in Ruby, take a look at my article titled “Garbage Collection in Ruby”). The primitive data types are the following:
Array
Bignum
Class
Complex
File
Float
Hash
MatchData
Object
Rational
Regexp
String
Struct
Symbol
All the types above are “heap allocated” in Ruby, meaning that they require memory allocation and are managed by the garbage collector. However, Ruby also has the concept of immediates, which don’t require an object allocation at all! Fixnum
(i.e. a small integer value) is represented as an immediate. But how does it store data without allocating an object? Remember that the VALUE
type is an unsigned long
? Fixnum
takes advantage of that by setting a special bit in the VALUE
and then it’s able to directly store the integer value in the VALUE
. In addition to Fixnum
, Ruby’s true
, false
, and nil
types are also represented using immediates.
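To make the tagging idea concrete, here is a minimal sketch in plain C that does not use Ruby's headers. The names toy_value, toy_long2fix, toy_is_fixnum, and toy_fix2long are hypothetical, and the layout (lowest bit set to 1, integer stored in the remaining bits) is a simplified model of the scheme described above rather than Ruby's actual source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for Ruby's VALUE: a pointer-sized unsigned integer. */
typedef uintptr_t toy_value;

/* In this model, a set lowest bit means "small integer, not a pointer". */
#define TOY_FIXNUM_FLAG ((toy_value)1)

/* Shift n up by one bit and set the flag bit. */
static toy_value toy_long2fix(intptr_t n) {
    return ((toy_value)n << 1) | TOY_FIXNUM_FLAG;
}

/* Check the flag bit to tell a tagged integer apart from a pointer. */
static bool toy_is_fixnum(toy_value v) {
    return (v & TOY_FIXNUM_FLAG) != 0;
}

/* Clear the flag bit, then halve the (even) result to recover n.
 * Assumes the usual two's complement conversion back to a signed type. */
static intptr_t toy_fix2long(toy_value v) {
    return (intptr_t)(v - TOY_FIXNUM_FLAG) / 2;
}
```

Heap-allocated objects live at aligned addresses, so the lowest bit of a real pointer is always 0 and never collides with the flag. The cost is one bit of range: a tagged integer holds one bit less than the underlying C type, which is why larger integers fall back to Bignum.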
In this article, we’ll only be covering the following types:
Special constants (nil, true, false)
Fixnum
Array
Hash
String
Symbol
Exploring the other types is left as an exercise for the reader!
Constants
The Ruby C API has many builtin constants for our convenience. By convention, Ruby core modules are prefixed with rb_m
(e.g. rb_mKernel
for the Kernel
module), classes are prefixed with rb_c
(e.g. rb_cObject
for the Object
class), and exceptions are prefixed with rb_e
(e.g. rb_eRuntimeError
for the RuntimeError
exception). You can find the list of builtin modules, classes, and exceptions here.
Special constants
There are special constants for Ruby’s true
, false
, and nil
values. We can use Qtrue
, Qfalse
, and Qnil
in our C code for each of the Ruby values. We’ve seen Qnil
in the previous articles when we want to return nil
from a Ruby method.
In Ruby, all values are considered truthy except false
and nil
. Ruby’s C API provides the RTEST
macro, which evaluates to a false (zero) C value if the Ruby value is nil
or false
, and to a true (non-zero) C value otherwise.
VALUE my_obj = ...;
if (RTEST(my_obj)) {
// my_obj is not Qfalse or Qnil
} else {
// my_obj is either Qfalse or Qnil
}
Fixnum
If you recall Fixnum
from Ruby 2.3 and earlier, it’s a type used to represent small integers efficiently. Since Ruby 2.4, Fixnum
and Bignum
have been merged to form the Integer
class so we no longer have to differentiate the two when using Ruby. However, the two types are still distinct in the C API since they are represented differently internally.
To convert a C long
to a Fixnum
, use the LONG2FIX
macro. Similarly, to convert a Fixnum
back to a long
, use the FIX2LONG
macro. The example below shows how to use these two macros.
// Create Ruby fixnum zero
VALUE zero_ruby = LONG2FIX(0);
// Convert Ruby fixnum to C long
long zero_c = FIX2LONG(zero_ruby);
Array
Creating arrays
There are two ways to create a Ruby array:
rb_ary_new
: This creates a new, empty Ruby array.
VALUE my_arr = rb_ary_new();
rb_ary_new_capa
: This creates a new, empty Ruby array with a specific capacity. This is more efficient than rb_ary_new
if the number of elements is known ahead of time since no resizing will be needed within the capacity.
// New Ruby array with capacity of 100
VALUE my_arr = rb_ary_new_capa(100);
Adding to arrays
If we want to add one element to the array, use rb_ary_push
. It accepts two arguments and returns the original array ary
:
ary
: The array to append to.
item
: The Ruby object to be added to the array.
// Function prototype for rb_ary_push
VALUE rb_ary_push(VALUE ary, VALUE item);
// Creating a new array and pushing a fixnum
VALUE my_array = rb_ary_new();
rb_ary_push(my_array, LONG2FIX(42));
To more efficiently add a large number of elements from a C array to a Ruby array, we can use rb_ary_cat
. It accepts three arguments and returns the original array ary
:
ary
: The array to append to.
argv
: The C array of Ruby objects to add to the array.
len
: The number of elements to add.
// Function prototype for rb_ary_cat
VALUE rb_ary_cat(VALUE ary, const VALUE *argv, long len);
// Creating a new array and pushing three elements
VALUE my_array = rb_ary_new();
VALUE ruby_constants[3] = { Qtrue, Qfalse, Qnil };
rb_ary_cat(my_array, ruby_constants, 3);
Removing from arrays
Just like in Ruby, we can remove from Ruby arrays using functions like rb_ary_pop
and rb_ary_shift
. Exploring these functions is left as an exercise for the reader. You can find the list of exported Ruby array functions in array.h
and the implementation in array.c
.
Indexing arrays
We can use RARRAY_LEN
, RARRAY_PTR
, and RARRAY_AREF
to get the length, backing C array pointer, and an element at a specific index of a Ruby array, respectively. Here are examples of how it’s used (my_array
is assumed to be a Ruby array that already exists):
// Get the length of my_array
long len = RARRAY_LEN(my_array);
// Get the backing C array of my_array
VALUE *elements = RARRAY_PTR(my_array);
// Read the first element of my_array
VALUE first_element = elements[0];
// Set the 42nd element to the fixnum 0
elements[41] = LONG2FIX(0);
// Another way to read the first element of my_array
VALUE same_element = RARRAY_AREF(my_array, 0);
We should be careful when using RARRAY_PTR
and RARRAY_AREF
. RARRAY_PTR
may return a different pointer after elements are added or removed from the array since Ruby may decide to resize the backing C array. Reading or writing to the original pointer may lead to undefined behavior such as segmentation faults. Unlike Ruby’s Array#[]
, RARRAY_AREF
does not check the index that is passed in so we must ensure the index is in the range 0 <= index < RARRAY_LEN
. Any indexes out of the range will result in undefined behavior including returning a garbage value or a segmentation fault.
Hash
Creating hashes
Creating hashes is very simple, just use rb_hash_new
, which accepts no arguments and returns the hash.
// Function prototype for rb_hash_new
VALUE rb_hash_new(void);
// Creating a new hash
VALUE my_hash = rb_hash_new();
Look up
To look up an entry in the hash, we can use rb_hash_aref
which is the implementation for Hash#[]
. It accepts two arguments and returns the value of the key (or the default value if none is found):
hash
: The hash object.
key
: The key object.
// Function prototype for rb_hash_aref
VALUE rb_hash_aref(VALUE hash, VALUE key);
// Lookup my_key from my_hash
VALUE my_val = rb_hash_aref(my_hash, my_key);
Set
To set an entry in the hash, we can use rb_hash_aset
which is the implementation for Hash#[]=
. It accepts three arguments and returns the value val
:
hash
: The hash object.
key
: The key object.
val
: The value to set at key.
// Function prototype for rb_hash_aset
VALUE rb_hash_aset(VALUE hash, VALUE key, VALUE val);
// Set my_key to my_val in my_hash
rb_hash_aset(my_hash, my_key, my_val);
Iteration
To iterate over every key/value pair in the hash, we can use rb_hash_foreach
. Just like iterating through a hash in Ruby with Hash#each
, we should not insert to or delete from the hash while iterating. rb_hash_foreach
accepts three arguments and does not return anything:
hash
: The hash to iterate over.
func
: The callback function that is called for every key/value pair in the hash. This function must accept three arguments and return either ST_CONTINUE to continue iterating, ST_STOP to stop iterating, or ST_DELETE to delete the current entry. The function signature looks like the following:
int rb_foreach_func(VALUE key, VALUE value, VALUE arg);
key
: The key of the entry.
value
: The value of the corresponding key.
arg
: The value that is passed into farg during the rb_hash_foreach call.
farg
: Any data that we want to pass into the func callback as the third argument. This could be anything and does not have to be a valid Ruby object.
// Function prototype for rb_hash_foreach
void rb_hash_foreach(VALUE hash, rb_foreach_func *func, VALUE farg);
// Iterate over every key/value pair in my_hash
int my_hash_iter_func(VALUE key, VALUE value, VALUE arg) {
// Implementation goes here
return ST_CONTINUE;
}
rb_hash_foreach(my_hash, my_hash_iter_func, 0);
String
Creating strings
There are too many ways to create strings. I have a whole article written about the common ways to create strings. Which one we use will depend on the situation, and if the wrong one is used, subtle and catastrophic bugs can be introduced like “The Ruby inplace bug”.
Appending to Ruby strings
We can use rb_str_cat
or rb_str_cat_cstr
to append to a string.
rb_str_cat
accepts three arguments and returns the original string str
:
str
: The string to append to.
ptr
: Pointer to a character buffer.
len
: Number of bytes of the character buffer to append.
// Function prototype for rb_str_cat
VALUE rb_str_cat(VALUE str, const char *ptr, long len);
// Appending "Hello world!" to a Ruby string my_string
size_t string_length = 12;
char *c_str = malloc(string_length);
// The C string c_str may or may not contain a null terminator
memcpy(c_str, "Hello world!", string_length);
rb_str_cat(my_string, c_str, string_length);
free(c_str);
rb_str_cat_cstr
is simpler to use than rb_str_cat
but the caveat is that our C string must be null-terminated. It accepts two arguments and returns the original string str
:
str
: The string to append to.
ptr
: Pointer to a C string that must be null-terminated.
// Function prototype for rb_str_cat_cstr
VALUE rb_str_cat_cstr(VALUE str, const char *ptr);
// Appending "Hello world!" to a Ruby string my_string
rb_str_cat_cstr(my_string, "Hello world!");
Reading and writing to strings
Just like how we can get the backing C array from a Ruby array, we can similarly get the C character array that backs the Ruby string. We can use StringValuePtr
to get the backing character array and RSTRING_LEN
to get the length of the string. Note that despite the name RSTRING_LEN
, it behaves like String#bytesize
and not String#length
. The difference is that String#bytesize
will return the number of bytes that the string occupies and String#length
will return the number of characters. These two values will differ when multi-byte characters exist in the string.
// Get the length of my_string
long length = RSTRING_LEN(my_string);
// Get the backing C character array
char *buff = StringValuePtr(my_string);
// Change the 11th character of my_string to 'a'
buff[10] = 'a';
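To see why the byte count and the character count can differ, here is a small sketch in plain C (no Ruby headers needed) that counts the characters in a UTF-8 buffer. The helper utf8_char_count is hypothetical and assumes the buffer holds valid UTF-8. It is not how Ruby implements String#length.

```c
#include <stddef.h>

/* Hypothetical helper: count the UTF-8 characters in `len` bytes.
 * Continuation bytes have the bit pattern 10xxxxxx, so every byte
 * that does not match that pattern starts a new character. */
static size_t utf8_char_count(const char *buf, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)buf[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the string "héllo", the é is encoded as two bytes, so the buffer is 6 bytes long but holds 5 characters. This also means an index into the backing buffer is a byte offset, so the "11th character" in the snippet above is really the 11th byte.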
We can also use StringValueCStr
to get a pointer to the backing character buffer. Unlike StringValuePtr
, StringValueCStr
will raise an ArgumentError
if the Ruby string contains null characters in the middle (i.e. the string cannot be treated as a C string) and will ensure the string is properly null-terminated. Using StringValueCStr
will allow us to safely use C functions that require null-terminated strings like strcat
, strlen
, strcmp
, etc.
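The reason embedded null characters matter is that C string functions stop at the first one they see. The plain C sketch below illustrates the check. has_embedded_nul is a hypothetical helper and only models the kind of condition being guarded against, not Ruby's own implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if any of the `len` bytes is a NUL byte,
 * in which case C string functions would see a truncated string. */
static bool has_embedded_nul(const char *buf, size_t len) {
    return memchr(buf, '\0', len) != NULL;
}
```

For example, the 7-byte buffer "foo\0bar" is a perfectly valid Ruby string, but strlen reports a length of 3 for it because it stops at the embedded null character.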
Symbol
We’ve used rb_intern
many times to get an ID
type to call methods. In fact, this ID
type is the backing implementation of a symbol and has better performance by avoiding allocating the symbol object itself. Let’s see how rb_intern
works again:
// Function prototype for rb_intern
ID rb_intern(const char *name);
// Getting the ID of "hello"
ID hello = rb_intern("hello");
Creating Ruby symbol
However, since ID
is not a VALUE
type, it is not a Ruby object and we cannot pass an ID
back to Ruby. To create the Ruby symbol, we can use rb_id2sym
which accepts one argument and returns the Ruby symbol:
x
: The ID of the symbol.
// Function prototype for rb_id2sym
VALUE rb_id2sym(ID x);
// Converting ID "hello" into Ruby symbol :hello
VALUE hello = rb_id2sym(rb_intern("hello"));
Checking types
In Ruby, we often take advantage of duck typing in our code. However, our C code often has assumptions on the Ruby type of an object and may misbehave when a type we don’t expect is passed in. When writing C extension code (especially in public APIs in gems), it is often a good idea to check the type of the parameters passed in. We can use RB_TYPE_P
to check the type and Check_Type
to enforce the type.
RB_TYPE_P
RB_TYPE_P
accepts two arguments:
obj
: The object.
t
: The type. You can see the list of Ruby types that can be passed in.
It will return true if the object is of the type t
and false otherwise.
// Function prototype for RB_TYPE_P
bool RB_TYPE_P(VALUE obj, enum ruby_value_type t);
// Demo of checking whether an object is a fixnum
VALUE my_obj = ...;
if (RB_TYPE_P(my_obj, T_FIXNUM)) {
// my_obj is a Ruby fixnum
} else {
// my_obj is not a Ruby fixnum
}
Check_Type
Check_Type
accepts the same arguments as RB_TYPE_P
but will raise a TypeError
if the object is not of the correct type.
// Function prototype for Check_Type
void Check_Type(VALUE obj, enum ruby_value_type t);
// Demo of ensuring object is a fixnum
VALUE my_obj = ...;
// Raise TypeError if my_obj is not a fixnum
Check_Type(my_obj, T_FIXNUM);
// my_obj is for sure a fixnum
Conclusion
In this article, we discussed Ruby’s primitive types. Specifically, we took a deeper look at Ruby’s immediate types, arrays, strings, hashes, and symbols.
There was quite a lot of information to unpack! But be sure to take the time to try out these data types yourself to make sure you understand how they work. Having a solid understanding of the primitive data types is important as we’ll be using them very frequently in future articles. In the next article, we’ll look at using various scopes of variables using the C API.