Using Rust like C

Not the Usual Introduction

Any attempt to persuade C programmers to try out Rust usually starts by describing its higher-level features, while emphasizing that Rust has the same hard-core attitude to systems programming as C: use resources as efficiently as possible, and no sneaky allocations.

The higher-level features make it much more pleasant and safe to do things like extensive string manipulation and working with containers, while still having full control of code generation.

This is all true, but it’s interesting to introduce unsafe Rust initially if you are used to C.

Rust Pointers

Although references are fundamental, there are also pointers in Rust. The real difference is that pointers do not follow the usual strict borrowing rules of references. (Under the hood, &u8 and *const u8 look exactly the same.) The usual way to get a pointer is to start from a reference.
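As a rough sketch of that relationship (ptr_roundtrip is a made-up name for illustration only), a reference turns into a raw pointer without ceremony, and only reading through the pointer needs unsafe:

```rust
use std::mem::size_of;

// Hypothetical helper: turns a reference into a raw pointer and reads it back.
fn ptr_roundtrip(x: &u8) -> u8 {
    let p: *const u8 = x; // creating the pointer is safe
    // Sound here only because p was just derived from a live reference.
    unsafe { *p }
}

fn main() {
    let x: u8 = 42;
    println!("{}", ptr_roundtrip(&x));
    // A reference and a raw pointer to u8 are both one machine address wide.
    assert_eq!(size_of::<&u8>(), size_of::<*const u8>());
}
```

The borrow checker stops following the data once it is behind p; from then on, keeping it valid is our job.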

C is unusual among programming languages in using the null byte to delimit the end of a string. A C string is therefore an arbitrary sequence of non-null bytes. In C++, std::string does not work like this; it can contain embedded nulls and so can represent arbitrary data. It does go to some trouble to provide a null-terminated C-style string with the c_str() method. (Although C and C++ divorced in the 1980s, they tend to still live in the same apartment.)

The Rust String type works very much like std::string, but insists that the bytes must represent valid UTF-8. The “borrowed” form &str (“string slice”) has a simple representation: a struct with a pointer to the character data, and a size field. So regular Rust string literals won’t have the representation we need.
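To make the "pointer plus size" description concrete, here is a small sketch (str_parts is a hypothetical helper, not a standard function) that pulls a string slice apart into those two fields:

```rust
use std::mem::size_of;

// Hypothetical helper: the data pointer and the byte length of a string slice.
fn str_parts(s: &str) -> (*const u8, usize) {
    (s.as_ptr(), s.len())
}

fn main() {
    let (ptr, len) = str_parts("hello");
    println!("{:?} {}", ptr, len);
    // The length is stored alongside the pointer, so no terminator is needed.
    assert_eq!(len, 5);
    // In practice a &str is two words wide: the pointer and the length.
    assert_eq!(size_of::<&str>(), 2 * size_of::<usize>());
}
```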

A byte string literal with an explicit embedded ‘\0’ will do fine - we ask for a pointer to its data and can now do the usual C pointer dance. Here we find out the length.

fn main() {
    let s = b"hello\0".as_ptr();
    let mut p = s;
    while unsafe { *p } != b'\0' {
        p = unsafe { p.add(1) };
    }
    let offs = (p as usize) - (s as usize);
    println!("{}",offs);
    // 5
}

There aren’t many explicit types here - only the as usize cast to get the difference between the initial and the final pointer.

It comes as no surprise that Rust regards dereferencing a pointer as fundamentally unsafe - in general, we really don’t know where a pointer comes from and whether it points to valid data. But advancing a pointer with add can go wrong as well: the result must stay inside the allocation the pointer came from (or one past its end), and the address computation must not overflow. Obviously the first unsafe is much more likely to be a source of problems than the second, but Rust tries to rigorously flag all possible sources of undefined behaviour.
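As a sketch of where the line is drawn (len_by_offset is a name invented here), the standard pointer methods include a safe wrapping_add next to the unsafe add, and offset_from as an alternative to subtracting addresses by hand:

```rust
// Length of a NUL-terminated string, measured with offset_from.
// The caller must pass a pointer to a NUL-terminated buffer.
unsafe fn len_by_offset(s: *const u8) -> usize {
    let mut p = s;
    unsafe {
        while *p != b'\0' {
            p = p.add(1); // unsafe: must stay inside the same allocation
        }
        // Distance in elements between two pointers into the same allocation.
        p.offset_from(s) as usize
    }
}

fn main() {
    let s = b"hello\0".as_ptr();
    // wrapping_add is a safe method: merely computing an address is harmless...
    let q = s.wrapping_add(1);
    // ...but reading through the result still needs unsafe.
    let second = unsafe { *q };
    assert_eq!(second, b'e');
    println!("{}", unsafe { len_by_offset(s) });
}
```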

Let’s separate out the loop as a function to make the types more explicit:

fn r_strlen(s: *const u8) -> usize {
    let mut p = s;
    while unsafe { *p } != b'\0' {
        p = unsafe { p.add(1) };
    }
    (p as usize) - (s as usize)
}

fn main() {
    let s = b"hello\0".as_ptr();
    println!("{:?}",r_strlen(s));
    // 5
}

Rust pointer types are written backwards, but mean exactly the same thing as in C. The loop pointer variable p has to be mutable, and the return value of a function is the last value of the function block. Otherwise, this is fairly straightforward curly-bracket syntax.

Premature Safety

The new function r_strlen is considered safe by any code that calls it. This is a problem, because it will segfault if passed a rubbish pointer. Rust programs should not segfault! (They may panic, but that’s a controlled unwinding of the stack and is completely memory safe.) So we are trusting this function prematurely. It is better to flag this function explicitly as being unsafe:

unsafe fn r_strlen(s: *const u8) -> usize {
    // as before....
}

fn main() {
    let s = b"hello\0".as_ptr();
    println!("{:?}",unsafe { r_strlen(s) });
    // 5
}

It is now the responsibility of the caller to ensure that the function is called with a valid pointer to a C string.
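One way a caller can discharge that responsibility is to check the precondition before crossing into unsafe code. This is only a sketch: checked_strlen is a hypothetical wrapper, and it assumes the caller has the bytes as a slice in the first place.

```rust
// The same loop as before, marked unsafe:
// the caller promises that s points to a NUL-terminated buffer.
unsafe fn r_strlen(s: *const u8) -> usize {
    let mut p = s;
    unsafe {
        while *p != b'\0' {
            p = p.add(1);
        }
    }
    (p as usize) - (s as usize)
}

// Hypothetical safe wrapper: refuses slices that contain no terminator.
fn checked_strlen(bytes: &[u8]) -> Option<usize> {
    if bytes.contains(&0) {
        // A NUL lies inside the slice, so the loop cannot run off the end.
        Some(unsafe { r_strlen(bytes.as_ptr()) })
    } else {
        None
    }
}

fn main() {
    assert_eq!(checked_strlen(b"hello\0"), Some(5));
    assert_eq!(checked_strlen(b"hello"), None);
}
```

The check costs an extra pass over the data, which is exactly the kind of trade-off a C programmer would want to make knowingly.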

We can easily bring in strlen from the system C library.

#[link(name = "c")]
// Explicit ABI string; the 2024 edition also requires `unsafe extern "C"` here.
extern "C" {
    fn strlen(s: *const u8) -> usize;
}

fn main() {
    let s = b"hello\0".as_ptr();
    let len = unsafe { strlen(s) };
    assert_eq!(len, 5);
}

Any function linked in with the FFI (“Foreign Function Interface”) is considered unsafe.

Generally, interfacing with external C code is straightforward and efficient, but requires care to match the differences in representation - particularly with things like strings. Here there are helper types like CString and CStr which help bridge the gap.
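A minimal sketch of those helper types, using only std::ffi (to_c and from_c_bytes are names invented for this example):

```rust
use std::ffi::{CStr, CString};

// Rust string -> owned C string. CString::new appends the NUL itself
// and fails if the input already contains one.
fn to_c(s: &str) -> Option<CString> {
    CString::new(s).ok()
}

// NUL-terminated bytes -> &str, checking the terminator and then the UTF-8.
fn from_c_bytes(bytes: &[u8]) -> Option<&str> {
    CStr::from_bytes_with_nul(bytes).ok()?.to_str().ok()
}

fn main() {
    let owned = to_c("hello").expect("no interior NUL");
    // as_ptr() yields the *const c_char that a C function would expect;
    // it is valid only while `owned` is alive.
    let p = owned.as_ptr();
    let back = unsafe { CStr::from_ptr(p) };
    assert_eq!(back.to_bytes().len(), 5);

    assert_eq!(from_c_bytes(b"hello\0"), Some("hello"));
    assert!(to_c("he\0llo").is_none());
}
```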

It is good practice, however, to use the libc crate to link in the standard C library, since this handles the inevitable platform differences, much as C headers do.

Copying Buffers

I’ll do another classic K&R kind of function, which is strcpy. Just to make life a little less tedious, we will define a macro for advancing a pointer:

#[link(name = "c")]
extern "C" {
    fn strlen(s: *const u8) -> usize;
}

macro_rules! next {
    ($p:expr) => {
        unsafe { $p = $p.add(1) }
    }
}

fn main() {
    let src = b"hello\0";
    let mut dest = [0; 25];

    let mut p = src.as_ptr();
    let mut q = dest.as_mut_ptr();
    while unsafe { *p } != b'\0' {
        unsafe { *q = *p };
        next!(p);
        next!(q);
    }
    unsafe { *q = *p };

    let len = unsafe {strlen(dest.as_ptr())};

    assert_eq!(len, 5);

}

Here, the source is a byte string (which must end with ‘\0’) and the destination is an array of 25 elements, initialized to zero. Again, the actual element type of that array (u8) will be worked out by type inference.

I did C for many years, and I will admit that the notation here is not as elegant as C. We do not have pre- and post-increment operators, or the “C axiom” p[i] == *(p + i). But source elegance is overrated, since C notation can be abused to the point where the simplest string manipulation becomes a bravura display of pointer gymnastics. This could be seen as just harmless artistic expression, except these are often points of attack. Reasoning about unsafe code is hard - even experienced and careful C programmers make mistakes which the Rust compiler would flag as errors.

Here, the unsafe points are clearly marked.
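Following the earlier r_strlen pattern, the copy loop can be pulled out into an unsafe function of its own. This is a sketch rather than a drop-in strcpy; r_strcpy and copy_hello are names made up here, and the safety conditions are the usual C ones.

```rust
// Copies the NUL-terminated string at src into dest, terminator included.
// The caller must guarantee that src is NUL-terminated, that dest has room
// for the string plus its terminator, and that the two do not overlap.
unsafe fn r_strcpy(dest: *mut u8, src: *const u8) -> *mut u8 {
    let mut p = src;
    let mut q = dest;
    unsafe {
        loop {
            let b = *p;
            *q = b;
            if b == b'\0' {
                break;
            }
            p = p.add(1);
            q = q.add(1);
        }
    }
    dest
}

// Hypothetical demo helper: copies "hello" into a buffer pre-filled with 0xff.
fn copy_hello() -> [u8; 25] {
    let mut dest = [0xffu8; 25];
    unsafe { r_strcpy(dest.as_mut_ptr(), b"hello\0".as_ptr()) };
    dest
}

fn main() {
    let dest = copy_hello();
    assert_eq!(&dest[..6], b"hello\0");
    assert_eq!(dest[6], 0xff); // bytes past the terminator are untouched
}
```

Filling the buffer with 0xff instead of zero shows that the terminator really was copied, not merely left over from initialization.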

Iterating over C-style Strings

This pattern of looping over all bytes can be wrapped up neatly as an iterator. The iterator will take each byte in the source and return it as Option<char>. Then the Rust for-loop will iterate over each char.

macro_rules! next {
    ($p:expr) => { unsafe { $p = $p.add(1) } }
}

struct StrIter {
    p: *const u8
}

impl Iterator for StrIter {
    type Item = char;

    fn next(&mut self) -> Option<Self::Item> {
        let ch = unsafe { *self.p };
        if ch == 0 { // string is finished
            None
        } else {
            next!(self.p);
            Some(ch as char)
        }
    }
}

fn str_iter(s: *const u8) -> StrIter {
    StrIter {
        p: s
    }
}

fn main() {
    for c in str_iter(b"hello\0".as_ptr()) {
        println!("{}",c);
    }
    // h
    // e
    // l
    // l
    // o
}

The cast from u8 to char is interesting, because char is a four-byte Unicode scalar value. Obviously 7-bit ASCII works fine, but the ISO-8859-1 superset (Latin-1) works here just as well, because the second Unicode block (Latin-1 Supplement) represents the rest of the characters (128-255) exactly.

However, if the original encoding isn’t ISO-8859-1, then nonsense can be generated. The strength of C strings is that any encoding is permitted; the weakness is that the encoding is not well-specified.
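A short sketch of both cases (latin1_to_string is a hypothetical helper that applies the same cast to every byte):

```rust
// Interprets each byte as a Latin-1 code point, exactly as `b as char` does.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    // 0xE9 is e-acute in ISO-8859-1 and also U+00E9 in Unicode,
    // so Latin-1 bytes come through intact.
    assert_eq!(latin1_to_string(&[0x63, 0x61, 0x66, 0xE9]), "caf\u{e9}");

    // The same word as UTF-8 stores e-acute as two bytes (0xC3 0xA9),
    // so the byte-by-byte cast yields two unrelated characters instead.
    let utf8 = "caf\u{e9}".as_bytes();
    assert_eq!(latin1_to_string(utf8), "caf\u{c3}\u{a9}");
}
```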

Once you have an iterator over chars, it’s trivial to build a Rust string by collecting those chars (here collect needs a type hint).

let s: String = str_iter(b"hello\0".as_ptr()).collect();

Life Beyond Borrowing

If you look at StrIter, it is a struct that borrows the character data. Because we are using pointers, the struct does not have to track the lifetime of that data, and Rust can’t enforce any lifetime rules. On one hand, this is an opportunity to do things that the borrow checker would otherwise prohibit. But on the other hand, doing too much unsafe coding like this is defeating the purpose of Rust, which is to guarantee memory safety automatically. unsafe really means that it is up to you to make that guarantee. People underestimate how hard it is to make that analysis, so generally we keep use of unsafe as restricted as possible.

So StrIter is just a demonstration, not good coding practice. It is doing the flying trapeze without a safety net for no particular benefit.
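For comparison, here is roughly what the safe counterpart could look like, assuming the bytes are available as a slice (safe_str_iter is a name invented for this sketch):

```rust
// Safe counterpart of str_iter: chars up to, but not including, the first NUL.
// The returned iterator borrows from the slice, so the borrow checker keeps
// the data alive for as long as the iterator is in use.
fn safe_str_iter<'a>(bytes: &'a [u8]) -> impl Iterator<Item = char> + 'a {
    bytes.iter().take_while(|&&b| b != 0).map(|&b| b as char)
}

fn main() {
    let s: String = safe_str_iter(b"hello\0").collect();
    assert_eq!(s, "hello");
    // With no terminator, iteration simply stops at the end of the slice.
    let t: String = safe_str_iter(b"hi").collect();
    assert_eq!(t, "hi");
}
```

A missing terminator ends the loop at the slice boundary rather than wandering off into whatever memory happens to follow.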

The risk/benefit makes more sense if unsafe is used to efficiently implement low-level data structures.

My advice (again, not so usual) is to implement doubly-linked lists and trees using the unsafe subset, to get a feeling for the syntactical differences without having to immediately hit the conceptual wall of doing it safely.
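As a starting point for that exercise, here is a minimal sketch of a doubly-linked list written the C way. The Node and List types and their methods are all invented for illustration; Box::into_raw and Box::from_raw stand in for malloc and free.

```rust
use std::ptr;

struct Node {
    value: i32,
    prev: *mut Node,
    next: *mut Node,
}

struct List {
    head: *mut Node,
    tail: *mut Node,
}

impl List {
    fn new() -> List {
        List { head: ptr::null_mut(), tail: ptr::null_mut() }
    }

    fn push_back(&mut self, value: i32) {
        // Allocate on the heap and take over ownership as a raw pointer.
        let node = Box::into_raw(Box::new(Node {
            value,
            prev: self.tail,
            next: ptr::null_mut(),
        }));
        if self.tail.is_null() {
            self.head = node;
        } else {
            unsafe { (*self.tail).next = node };
        }
        self.tail = node;
    }

    // Walks the prev links from the tail, collecting the values.
    fn to_vec_rev(&self) -> Vec<i32> {
        let mut out = Vec::new();
        let mut p = self.tail;
        while !p.is_null() {
            unsafe {
                out.push((*p).value);
                p = (*p).prev;
            }
        }
        out
    }
}

impl Drop for List {
    fn drop(&mut self) {
        let mut p = self.head;
        while !p.is_null() {
            // Rebuilding the Box hands the node back to be freed.
            let node = unsafe { Box::from_raw(p) };
            p = node.next;
        }
    }
}

fn main() {
    let mut list = List::new();
    for v in 1..=3 {
        list.push_back(v);
    }
    assert_eq!(list.to_vec_rev(), vec![3, 2, 1]);
}
```

The public methods are safe to call only because every pointer stored in the list came from Box::into_raw and is freed exactly once in drop; that invariant is ours to maintain, not the compiler's.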

It is no secret that the Rust standard library uses unsafe liberally. Some think this demonstrates that the Rust memory model is insufficient, but Rust is pragmatic: if a little unsafe helps, in a few well-audited places, then fine. Homeopathic use, not wholesale.