Using Rust like C
Not the Usual Introduction
Any attempt to persuade C programmers to try out Rust usually starts by describing higher-level features, while emphasizing that Rust has the same hard-core attitude to systems programming as C: the most efficient possible use of resources, and no sneaky allocations.
The higher-level features make it much more pleasant and safe to do things like extensive string manipulation and working with containers, while still having full control of code generation.
This is all true, but if you are used to C it is interesting to be introduced to unsafe Rust first.
Rust Pointers
Although references are fundamental, there are also pointers in Rust. The real difference
is that pointers do not follow the usual strict borrowing rules of references.
(Under the hood, &u8 and *const u8 look exactly the same.)
But to get a pointer, you need a reference first.
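As a minimal sketch of that step (the helper name read_through_pointer is made up for this illustration), a reference coerces to a raw pointer without any unsafe; only the dereference needs it:

```rust
use std::mem::size_of;

// Hypothetical helper: turn a reference into a raw pointer and read through it.
fn read_through_pointer(r: &u8) -> u8 {
    // Creating the pointer is safe: a reference coerces to *const u8.
    let p: *const u8 = r;
    // Dereferencing is the unsafe part; here we know p came from a live reference.
    unsafe { *p }
}

fn main() {
    let x: u8 = 42;
    println!("{}", read_through_pointer(&x));
    // A reference and a raw pointer to the same type have the same size.
    assert_eq!(size_of::<&u8>(), size_of::<*const u8>());
}
```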
C is unusual among programming languages in using the null byte to delimit the end
of a string. A C string is therefore an arbitrary sequence of non-null bytes.
In C++, std::string does not work like this; it can contain embedded nulls so
can represent any arbitrary data. It does go to some trouble to provide a
null-terminated C-style string with the c_str() method. (Although C and C++
divorced in the 1980s, they tend to still live in the same apartment.)
The Rust String type works very much like std::string, but insists that the
bytes must represent valid UTF-8. The “borrowed” form &str (“string slice”) has
a simple representation: a struct with a pointer to the character data, and a size
field. So regular Rust string literals won’t have the representation we need.
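A small sketch of that pointer-plus-size view, using a hypothetical helper str_parts that is not part of the original text. It only relies on the standard as_ptr and len methods of &str:

```rust
use std::mem::size_of;

// Hypothetical helper: split a string slice into its data pointer and byte length.
fn str_parts(s: &str) -> (*const u8, usize) {
    (s.as_ptr(), s.len())
}

fn main() {
    let (ptr, len) = str_parts("hello");
    // A &str carries a pointer and a length, so it occupies two machine words.
    assert_eq!(size_of::<&str>(), 2 * size_of::<usize>());
    assert_eq!(len, 5);
    // Nothing guarantees a terminating NUL after the data, so this pointer
    // must not be handed to C functions that expect a C string.
    println!("{:p} {}", ptr, len);
}
```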
A byte string literal with an explicit trailing ‘\0’ will do fine - we ask for a pointer to its data and can now do the usual C pointer dance. Here we find out the length.
fn main() {
    let s = b"hello\0".as_ptr();
    let mut p = s;
    while unsafe { *p } != b'\0' {
        p = unsafe { p.add(1) };
    }
    let offs = (p as usize) - (s as usize);
    println!("{}", offs);
    // 5
}
There aren’t many explicit types here - only the as usize cast to get the
difference between the initial and the final pointer.
It comes as no surprise that Rust regards dereferencing a pointer as fundamentally
unsafe - in general, we really don’t know where a pointer comes from and whether
it points to valid data. But advancing pointers can fail as well, if there is
overflow. Obviously the first unsafe is much more likely to be a source of
problems than the second, but Rust tries to rigorously flag all possible sources
of undefined behaviour.
Let’s separate out the loop as a function to make the types more explicit:
fn r_strlen(s: *const u8) -> usize {
    let mut p = s;
    while unsafe { *p } != b'\0' {
        p = unsafe { p.add(1) };
    }
    (p as usize) - (s as usize)
}

fn main() {
    let s = b"hello\0".as_ptr();
    println!("{:?}", r_strlen(s));
    // 5
}
Rust pointer types are written backwards, but mean exactly the same thing as in C.
The loop pointer variable p has to be mutable, and the return value of a
function is the last value of the function block.
Otherwise, this is fairly straightforward curly-bracket syntax.
Premature Safety
The new function r_strlen is considered safe by any code that calls it. This
is a problem, because it will segfault if passed a rubbish pointer.
Rust programs should not segfault!
(They may panic, but that’s a controlled unwinding of the stack and is completely memory safe.)
So we are trusting this function prematurely.
Better to flag this function explicitly as being unsafe:
unsafe fn r_strlen(s: *const u8) -> usize {
    // as before....
}

fn main() {
    let s = b"hello\0".as_ptr();
    println!("{:?}", unsafe { r_strlen(s) });
    // 5
}
It is now the responsibility of the caller to ensure that the function is called with a valid pointer to a C string.
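One way a caller can discharge that responsibility is to check the invariant before making the unsafe call. The following is only a sketch: checked_strlen is a name invented here, and r_strlen is filled in from the earlier loop with explicit unsafe blocks.

```rust
// r_strlen as before: the caller must pass a pointer to a NUL-terminated buffer.
unsafe fn r_strlen(s: *const u8) -> usize {
    let mut p = s;
    while unsafe { *p } != b'\0' {
        p = unsafe { p.add(1) };
    }
    (p as usize) - (s as usize)
}

// Hypothetical safe wrapper: only call r_strlen when a NUL is known to exist.
fn checked_strlen(bytes: &[u8]) -> Option<usize> {
    if bytes.contains(&0) {
        // SAFETY: the slice contains a NUL, so the loop stops inside the slice.
        Some(unsafe { r_strlen(bytes.as_ptr()) })
    } else {
        None
    }
}

fn main() {
    println!("{:?}", checked_strlen(b"hello\0"));
    println!("{:?}", checked_strlen(b"hello"));
}
```

The wrapper can be exposed as an ordinary safe function because the check makes the pointer precondition hold for every possible input.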
We can easily bring in strlen from the system C library.
#[link(name = "c")]
// Spell out the C ABI; on the 2024 edition this block must be `unsafe extern "C"`.
extern "C" {
    fn strlen(s: *const u8) -> usize;
}

fn main() {
    let s = b"hello\0".as_ptr();
    let len = unsafe { strlen(s) };
    assert_eq!(len, 5);
}
Any function linked in with the FFI (“Foreign Function Interface”) is considered unsafe.
Generally, interfacing with external C code is straightforward and efficient, but
requires care to match the differences in representation - particularly with things
like strings. Here there are helper types like CString and CStr which help
bridge the gap.
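A brief sketch of how these two helper types are typically used. The function name to_rust_string is invented for this example; the std::ffi calls themselves are standard library API.

```rust
use std::ffi::{CStr, CString};

// Hypothetical helper: borrow NUL-terminated bytes as a CStr, then convert
// to an owned Rust String. Returns None if the bytes are not a valid C
// string or are not valid UTF-8.
fn to_rust_string(bytes_with_nul: &[u8]) -> Option<String> {
    let c = CStr::from_bytes_with_nul(bytes_with_nul).ok()?;
    c.to_str().ok().map(|s| s.to_owned())
}

fn main() {
    // Borrowed direction: C-style bytes to Rust string.
    assert_eq!(to_rust_string(b"hello\0"), Some("hello".to_string()));

    // Owned direction: CString::new copies the data and appends the NUL.
    let owned = CString::new("hello").expect("no interior NUL expected");
    assert_eq!(owned.as_bytes_with_nul(), &b"hello\0"[..]);

    // as_ptr gives a pointer suitable for C; reading it back is unsafe.
    let back = unsafe { CStr::from_ptr(owned.as_ptr()) };
    assert_eq!(back.to_bytes(), &b"hello"[..]);
}
```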
It is good practice, however, to use the libc crate to link in the standard C library,
since this handles the inevitable platform differences, much as C headers do.
Copying Buffers
I’ll do another classic K&R kind of function, which is strcpy. Just to make life
a little less tedious, we will define a macro for advancing a pointer:
#[link(name = "c")]
// Spell out the C ABI; on the 2024 edition this block must be `unsafe extern "C"`.
extern "C" {
    fn strlen(s: *const u8) -> usize;
}

// Advance a raw pointer variable by one element.
macro_rules! next {
    ($p:expr) => {
        unsafe { $p = $p.add(1) }
    }
}

fn main() {
    let src = b"hello\0";
    let mut dest = [0; 25];
    let mut p = src.as_ptr();
    let mut q = dest.as_mut_ptr();
    while unsafe { *p } != b'\0' {
        unsafe { *q = *p };
        next!(p);
        next!(q);
    }
    // copy the terminating null as well
    unsafe { *q = *p };
    let len = unsafe { strlen(dest.as_ptr()) };
    assert_eq!(len, 5);
}
Here, the source is a slice of bytes (which must end with ‘\0’) and the
destination is an array of 25 elements, initialized to zero. Again,
the actual element type of that array (u8) will be worked out by type inference.
I did C for many years, and
I will admit that the notation here is not as elegant as C. We do not have
pre- and post-increment operators and the “C axiom” p[i] == *(p + i). But
source elegance is overrated, since C notation can be abused to the point where the simplest
string manipulation becomes a bravura display of pointer gymnastics. This could
be seen as just harmless artistic expression, except these are often points of attack.
Reasoning about unsafe code is hard - even experienced and careful C programmers make mistakes
which the Rust compiler would flag as errors.
Here, the unsafe points are clearly marked.
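For comparison, here is a sketch of the same copy written with an index instead of two moving pointers, where *p.add(i) plays the role of C's p[i]. The name r_strcpy is chosen for this illustration; as with C's strcpy, the caller must guarantee that the destination is large enough.

```rust
// Sketch of a strcpy-like function. Caller must ensure src is NUL-terminated
// and dest has room for the string plus its terminator.
unsafe fn r_strcpy(dest: *mut u8, src: *const u8) -> *mut u8 {
    let mut i = 0;
    loop {
        let ch = unsafe { *src.add(i) };  // C: src[i]
        unsafe { *dest.add(i) = ch };     // C: dest[i] = ch
        if ch == 0 {
            break; // the terminator has been copied too
        }
        i += 1;
    }
    dest
}

fn main() {
    let mut dest = [0u8; 25];
    unsafe { r_strcpy(dest.as_mut_ptr(), b"hello\0".as_ptr()) };
    assert_eq!(&dest[..6], b"hello\0");
}
```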
Iterating over C-style Strings
This pattern of looping over all bytes can be wrapped up neatly as an iterator.
The iterator will take each byte in the source, and return it as Option<char>.
Then the Rust for-loop will iterate over each char.
macro_rules! next {
    ($p:expr) => { unsafe { $p = $p.add(1) } }
}

struct StrIter {
    p: *const u8
}

impl Iterator for StrIter {
    type Item = char;

    fn next(&mut self) -> Option<Self::Item> {
        let ch = unsafe { *self.p };
        if ch == 0 { // string is finished
            None
        } else {
            next!(self.p);
            Some(ch as char)
        }
    }
}

fn str_iter(s: *const u8) -> StrIter {
    StrIter {
        p: s
    }
}

fn main() {
    for c in str_iter(b"hello\0".as_ptr()) {
        println!("{}", c);
    }
    // h
    // e
    // l
    // l
    // o
}
The cast from u8 to char is interesting, because char is a four-byte
Unicode code point. Obviously 7-bit ASCII works fine
but the ISO-8859-1 superset (Latin-1) works
here just as well because the 2nd Unicode block represents the rest of the
characters exactly (128-255).
However, if the original encoding isn’t ISO-8859-1, then nonsense can be generated. The strength of C strings is that any encoding is permitted; the weakness is that the encoding is not well-specified.
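To make both cases concrete, here is a small sketch (latin1_to_string is a name made up for this example) that applies the same u8-to-char cast to Latin-1 bytes and then to UTF-8 bytes:

```rust
// Hypothetical helper: interpret each byte as a Latin-1 character.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    // "café" in ISO-8859-1: é is the single byte 0xE9.
    assert_eq!(latin1_to_string(&[0x63, 0x61, 0x66, 0xE9]), "café");

    // The same word in UTF-8 encodes é as the two bytes 0xC3 0xA9.
    // Casting byte by byte turns them into two separate characters.
    let garbled = latin1_to_string("café".as_bytes());
    assert_eq!(garbled, "cafÃ©");
}
```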
Once you have an iterator over chars, it’s trivial to build a Rust string
by collecting those chars (here collect needs a type hint).
let s: String = str_iter(b"hello\0".as_ptr()).collect();
Life Beyond Borrowing
If you look at StrIter, it is a struct that borrows the character data.
Because we are using pointers, the struct does not have to track the
lifetime of that data, and Rust can’t enforce any lifetime rules. On one
hand, this is an opportunity to do things that the borrow checker would otherwise
prohibit. But on the other hand, doing too much unsafe coding like this is defeating the
purpose of Rust, which is to guarantee memory safety automatically. unsafe really means that
it is up to you to make that guarantee. People underestimate how hard it is
to make that analysis, so generally we keep use of unsafe as restricted as possible.
So StrIter is just a demonstration, not good coding practice. It is doing the flying
trapeze without a safety net for no particular benefit.
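For contrast, here is one possible way to put the net back, sketched under the assumption that the data arrives as a &CStr. The lifetime parameter and the PhantomData field are additions for this illustration, not part of the original StrIter; they let the borrow checker tie the iterator to the buffer it reads from.

```rust
use std::ffi::CStr;
use std::marker::PhantomData;

// Variant of StrIter that records the lifetime of the borrowed data.
struct StrIter<'a> {
    p: *const u8,
    _marker: PhantomData<&'a u8>,
}

impl<'a> Iterator for StrIter<'a> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // SAFETY: p was obtained from a &CStr that lives for 'a, and it is
        // never advanced past the terminating NUL.
        let ch = unsafe { *self.p };
        if ch == 0 {
            None
        } else {
            self.p = unsafe { self.p.add(1) };
            Some(ch as char)
        }
    }
}

// The returned iterator cannot outlive the CStr it was created from.
fn str_iter(s: &CStr) -> StrIter<'_> {
    StrIter { p: s.as_ptr() as *const u8, _marker: PhantomData }
}

fn main() {
    let c = CStr::from_bytes_with_nul(b"hello\0").expect("valid C string");
    let s: String = str_iter(c).collect();
    assert_eq!(s, "hello");
}
```

The unsafe blocks are still there, but the constructor is now a safe function: a caller can no longer hand in a rubbish pointer or let the data go away while the iterator is alive.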
The risk/benefit makes more sense if unsafe is used to efficiently implement
low-level data structures.
My advice (again, not so usual) is to implement doubly-linked lists and trees using the unsafe subset, to get a feeling for the syntactical differences without having to immediately hit the conceptual wall of doing it safely.
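As a starting point for that exercise, here is a deliberately small sketch of a doubly-linked list built on raw pointers. The names Node, List, push_back, to_vec and to_vec_rev are choices made for this illustration, and it is nowhere near a complete container.

```rust
use std::ptr;

struct Node {
    value: i32,
    prev: *mut Node,
    next: *mut Node,
}

struct List {
    head: *mut Node,
    tail: *mut Node,
}

impl List {
    fn new() -> List {
        List { head: ptr::null_mut(), tail: ptr::null_mut() }
    }

    fn push_back(&mut self, value: i32) {
        // Allocate on the heap and take ownership of the raw pointer.
        let node = Box::into_raw(Box::new(Node {
            value,
            prev: self.tail,
            next: ptr::null_mut(),
        }));
        if self.tail.is_null() {
            self.head = node;
        } else {
            // SAFETY: tail is non-null and points to a node this list owns.
            unsafe { (*self.tail).next = node };
        }
        self.tail = node;
    }

    // Walk forwards via the next pointers.
    fn to_vec(&self) -> Vec<i32> {
        let mut out = Vec::new();
        let mut p = self.head;
        while !p.is_null() {
            unsafe {
                out.push((*p).value);
                p = (*p).next;
            }
        }
        out
    }

    // Walk backwards via the prev pointers.
    fn to_vec_rev(&self) -> Vec<i32> {
        let mut out = Vec::new();
        let mut p = self.tail;
        while !p.is_null() {
            unsafe {
                out.push((*p).value);
                p = (*p).prev;
            }
        }
        out
    }
}

impl Drop for List {
    fn drop(&mut self) {
        let mut p = self.head;
        while !p.is_null() {
            // SAFETY: each node came from Box::into_raw and is freed exactly once.
            let boxed = unsafe { Box::from_raw(p) };
            p = boxed.next;
        }
    }
}

fn main() {
    let mut list = List::new();
    for v in [1, 2, 3] {
        list.push_back(v);
    }
    assert_eq!(list.to_vec(), vec![1, 2, 3]);
    assert_eq!(list.to_vec_rev(), vec![3, 2, 1]);
}
```

Note that all the unsafe blocks live inside the methods, so code using List never has to write unsafe itself.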
It is no secret that the Rust standard library uses unsafe liberally. Some think
this demonstrates that the Rust memory model is insufficient, but Rust is pragmatic:
if a little unsafe helps, in a few well-audited places, then fine.
Homeopathic use, not wholesale.