A Gentle Introduction To Rust
Why learn a new Programming Language?
The aim of this tutorial is to take you to a place where you can read and write enough Rust to fully appreciate the excellent learning resources available online, in particular The Book. It's an opportunity to try before you buy, and get enough feeling for the power of the language to want to go deeper.
As Einstein might have said, "As gentle as possible, but no gentler.". There is a lot of new stuff to learn here, and it's different enough to require some rearrangement of your mental furniture. By 'gentle' I mean that the features are presented practically with examples; as we encounter difficulties, I hope to show how Rust solves these problems. It is important to understand the problems before the solutions make sense. To put it in flowery language, we are going for a hike in hilly country and I will point out some interesting rock formations on the way, with only a few geology lectures. There will be some uphill but the view will be inspiring; the community is unusually pleasant and happy to help. There is the Rust Users Forum and an active subreddit which is unusually well-moderated. The FAQ is a good resource if you have specific questions.
First, why learn a new programming language? It is an investment of time and energy and that needs some justification. Even if you do not immediately land a cool job using that language, it stretches the mental muscles and makes you a better programmer. That seems a poor kind of return-on-investment but if you're not learning something genuinely new all the time then you will stagnate and be like the person who has ten years of experience in doing the same thing over and over.
Where Rust Shines
Rust is a statically and strongly typed systems programming language. statically means that all types are known at compile-time, strongly means that these types are designed to make it harder to write incorrect programs. A successful compilation means you have a much better guarantee of correctness than with a cowboy language like C. systems means generating the best possible machine code with full control of memory use. So the uses are pretty hardcore: operating systems, device drivers and embedded systems that might not even have an operating system. However, it's actually a very pleasant language to write normal application code in as well.
The big difference from C and C++ is that Rust is safe by default; all memory accesses are checked. It is not possible to corrupt memory by accident.
The unifying principles behind Rust are:
- strictly enforcing safe borrowing of data
- functions, methods and closures to operate on data
- tuples, structs and enums to aggregate data
- pattern matching to select and destructure data
- traits to define behaviour on data
There is a fast-growing ecosystem of available libraries through Cargo
but here we will concentrate on the core principles of the language
by learning to use the standard library. My advice is to write lots of small programs,
so learning to use rustc
directly is a core skill. When doing the examples in this
tutorial I defined a little script called rrun
which does a compilation and runs
the result:
rustc $1.rs && ./$1
Setting Up
This tutorial assumes that you have Rust installed locally. Fortunately this is very straightforward.
$ curl https://sh.rustup.rs -sSf | sh
$ rustup component add rust-docs
I would recommend getting the default stable version; it's easy to download unstable versions later and to switch between.
This gets the compiler, the Cargo package manager, the API documentation, and the Rust Book. The journey of a thousand miles starts with one step, and this first step is painless.
rustup
is the command you use to manage your Rust installation. When a new stable release
appears, you just have to say rustup update
to upgrade. rustup doc
will open
the offline documentation in your browser.
You will probably already have an editor you like, and basic Rust support is good. I'd suggest you start out with basic syntax highlighting at first, and work up as your programs get larger.
Personally I'm a fan of Geany which is one of the few editors with Rust support out-of-the-box; it's particularly easy on Linux since it's available through the package manager, but it works fine on other platforms.
The main thing is knowing how to edit, compile and run Rust programs. You learn to program with your fingers; type in the code yourself, and learn to rearrange things efficiently with your editor.
Zed Shaw's advice about learning to program in Python remains good, whatever the language. He says learning to program is like learning a musical instrument - the secret is practice and persistence. There's also good advice from Yoga and the soft martial arts like Tai Chi; feel the strain, but don't over-strain. You are not building dumb muscle here.
I'd like to thank the many contributors who caught bad English or bad Rust for me, and thanks to David Marino for his cool characterization of Rust as a friendly-but-hardcore no-nonsense knight in shining armour.
Steve Donovan © 2017-2018 MIT license version 0.4.0
Basics
Hello, World!
The original purpose of "hello world", ever since the first C version was written, was to test the compiler and run an actual program.
// hello.rs
fn main() {
    println!("Hello, World!");
}
$ rustc hello.rs
$ ./hello
Hello, World!
Rust is a curly-braces language with semicolons, C++-style comments and a main
function - so far, so familiar. The exclamation mark indicates that this is a
macro call. For C++ programmers, this can be a turn-off, since they are used to
seriously stupid C macros - but I can assure you that these macros are more
capable and sane.
For anybody else, it's probably "Great, now I have to remember when to say bang!". However, the compiler is unusually helpful; if you leave out that exclamation, you get:
error[E0425]: unresolved name `println`
--> hello2.rs:2:5
|
2 | println("Hello, World!");
| ^^^^^^^ did you mean the macro `println!`?
Learning a language means getting comfortable with its errors. Try to see the compiler as a strict but friendly helper rather than a computer shouting at you, because you are going to see a lot of red ink in the beginning. It's much better for the compiler to catch you out than for your program to blow up in front of actual humans.
The next step is to introduce a variable:
// let1.rs
fn main() {
    let answer = 42;
    println!("Hello {}", answer);
}
Spelling mistakes are compile errors, not runtime errors as they would be in dynamic languages like Python or JavaScript. This will save you a lot of stress later! And if I had written 'answr' instead of 'answer', the compiler is actually nice about it:
4 | println!("Hello {}", answr);
| ^^^^^ did you mean `answer`?
The println!
macro takes a format string
and some values; it's very similar to the formatting used by Python 3.
Another very useful macro is assert_eq!
. This is the workhorse of testing
in Rust; you assert that two things must be equal, and if not, panic.
// let2.rs
fn main() {
    let answer = 42;
    assert_eq!(answer, 42);
}
Which won't produce any output. But change 42 to 40:
thread 'main' panicked at
'assertion failed: `(left == right)` (left: `42`, right: `40`)',
let2.rs:4
note: Run with `RUST_BACKTRACE=1` for a backtrace.
And that's our first runtime error in Rust.
Looping and Ifing
Anything interesting can be done more than once:
// for1.rs
fn main() {
    for i in 0..5 {
        println!("Hello {}", i);
    }
}
The range is not inclusive, so i
goes from 0 to 4. This is convenient in a
language which indexes things like arrays from 0.
And interesting things have to be done conditionally:
// for2.rs
fn main() {
    for i in 0..5 {
        if i % 2 == 0 {
            println!("even {}", i);
        } else {
            println!("odd {}", i);
        }
    }
}
even 0
odd 1
even 2
odd 3
even 4
i % 2
is zero if 2 can divide into i
cleanly; Rust uses C-style operators.
There are no brackets around the condition, just like in Go, but
you must use curly brackets around the block.
This does the same, written in a more interesting way:
// for3.rs
fn main() {
    for i in 0..5 {
        let even_odd = if i % 2 == 0 {"even"} else {"odd"};
        println!("{} {}", even_odd, i);
    }
}
Traditionally, programming languages have statements (like if
) and
expressions (like 1+i
). In Rust, nearly everything has a value and can
be an expression. The seriously ugly C 'ternary operator' i % 2 == 0 ? "even" : "odd"
is not needed.
Note that there aren't any semi-colons in those blocks!
Adding Things Up
Computers are very good at arithmetic. Here is a first attempt at adding all the numbers from 0 to 4:
// add1.rs
fn main() {
    let sum = 0;
    for i in 0..5 {
        sum += i;
    }
    println!("sum is {}", sum);
}
But it fails to compile:
error[E0384]: re-assignment of immutable variable `sum`
--> add1.rs:5:9
3 | let sum = 0;
| --- first assignment to `sum`
4 | for i in 0..5 {
5 | sum += i;
| ^^^^^^^^ re-assignment of immutable variable
'Immutable'? A variable that cannot vary? let
variables by default can only
be assigned a value when declared. Adding the magic word mut
(please make
this variable mutable) does the trick:
// add2.rs
fn main() {
    let mut sum = 0;
    for i in 0..5 {
        sum += i;
    }
    println!("sum is {}", sum);
}
This can be puzzling when coming from other languages, where variables can be re-written by default. What makes something a 'variable' is that it gets assigned a computed value at run-time - it is not a constant. It is also how the word is used in mathematics, like when we say 'let n be the largest number in set S'.
There is a reason for declaring variables to be read-only by default. In a larger program, it gets hard to track where writes are taking place. So Rust makes things like mutability ('write-ability') explicit. There's a lot of cleverness in the language, but it tries not to hide anything.
Rust is both statically-typed and strongly-typed - these are often confused, but think of C (statically but weakly typed) and Python (dynamically but strongly typed). In static types the type is known at compile time, and dynamic types are only known at run time.
At the moment, however, it feels like Rust is hiding those types from you. What
exactly is the type of i
? The compiler can work it out, starting with 0,
with type inference, and comes up with i32 (a four-byte signed integer).
Let's make exactly one change - turn that 0
into 0.0
. Then we get errors:
error[E0277]: the trait bound `{float}: std::ops::AddAssign<{integer}>` is not satisfied
--> add3.rs:5:9
|
5 | sum += i;
| ^^^^^^^^ the trait `std::ops::AddAssign<{integer}>` is not implemented for `{float}`
|
Ok, so the honeymoon is over: what does this mean? Each operator (like +=
) corresponds
to a trait, which is like an abstract interface that must be implemented for each concrete type.
We'll deal with traits in detail later, but here all you need to know is that
AddAssign
is the name of the trait implementing the +=
operator, and the error is saying
that floating point numbers do not implement this operator for integers. (The full list of
operator traits is here.)
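To give a taste of what such an implementation looks like (traits get a proper treatment later), here is a minimal sketch using a made-up Counter type; the type and field names are purely for illustration:

use std::ops::AddAssign;

struct Counter {
    total: i32
}

impl AddAssign<i32> for Counter {
    fn add_assign(&mut self, n: i32) {
        self.total += n;
    }
}

fn main() {
    let mut c = Counter { total: 0 };
    c += 5;  // allowed because Counter implements AddAssign<i32>
    assert_eq!(c.total, 5);
}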
Again, Rust likes to be explicit - it will not silently convert that integer into a float for you.
We have to cast that value to a floating-point value explicitly.
// add3.rs
fn main() {
    let mut sum = 0.0;
    for i in 0..5 {
        sum += i as f64;
    }
    println!("sum is {}", sum);
}
Function Types are Explicit
Functions are one place where the compiler will not work out types for you. And this in fact was a deliberate decision, since languages like Haskell have such powerful type inference that there are hardly any explicit type names. It's actually good Haskell style to put in explicit type signatures for functions. Rust requires this always.
Here is a simple user-defined function:
// fun1.rs
fn sqr(x: f64) -> f64 {
    return x * x;
}

fn main() {
    let res = sqr(2.0);
    println!("square is {}", res);
}
Rust goes back to an older style of argument declaration, where the type follows the name. This is how it was done in Algol-derived languages like Pascal.
Again, no integer-to-float conversions - if you replace the 2.0
with 2
then we
get a clear error:
8 | let res = sqr(2);
| ^ expected f64, found integral variable
|
You will actually rarely see functions written using a return
statement. More
often, it will look like this:
fn sqr(x: f64) -> f64 {
    x * x
}
This is because the body of the function (inside {}
) has the value of its
last expression, just like with if-as-an-expression.
Since semicolons are inserted semi-automatically by human fingers, you might add it here and get the following error:
|
3 | fn sqr(x: f64) -> f64 {
| ^ expected f64, found ()
|
= note: expected type `f64`
= note: found type `()`
help: consider removing this semicolon:
--> fun2.rs:4:8
|
4 | x * x;
| ^
The ()
type is the empty type, nada, void
, zilch, nothing. Everything in Rust
has a value, but sometimes it's just nothing. The compiler knows this is
a common mistake, and actually helps you. (Anybody who has spent time with a
C++ compiler will know how damn unusual this is.)
A few more examples of this no-return expression style:
// absolute value of a floating-point number
fn abs(x: f64) -> f64 {
    if x > 0.0 {
        x
    } else {
        -x
    }
}

// ensure the number always falls in the given range
fn clamp(x: f64, x1: f64, x2: f64) -> f64 {
    if x < x1 {
        x1
    } else if x > x2 {
        x2
    } else {
        x
    }
}
It's not wrong to use return
, but code is cleaner without it. You will still
use return
for returning early from a function.
Some operations can be elegantly expressed recursively:
fn factorial(n: u64) -> u64 {
    if n == 0 {
        1
    } else {
        n * factorial(n-1)
    }
}
This can be a little strange at first, and the best thing is then to use pencil and paper and work out some examples. It isn't usually the most efficient way to do that operation however.
Values can also be passed by reference. A reference is created by &
and dereferenced
by *
.
fn by_ref(x: &i32) -> i32 {
    *x + 1
}

fn main() {
    let i = 10;
    let res1 = by_ref(&i);
    let res2 = by_ref(&41);
    println!("{} {}", res1, res2);
}
// 11 42
What if you want a function to modify one of its arguments? Enter mutable references:
// fun4.rs
fn modifies(x: &mut f64) {
    *x = 1.0;
}

fn main() {
    let mut res = 0.0;
    modifies(&mut res);
    println!("res is {}", res);
}
This is more how C would do it than C++. You have to explicitly pass the
reference (with &
) and explicitly dereference with *
. And then throw in mut
because it's not the default. (I've always felt that C++ references are
too easy to miss compared to C.)
Basically, Rust is introducing some friction here, and not-so-subtly pushing
you towards returning values from functions directly. Fortunately, Rust has
powerful ways to express things like "operation succeeded and here's the result"
so &mut
isn't needed that often. Passing by reference is important when we have a
large object and don't wish to copy it.
The type-after-variable style applies to let
as well, when you really want to nail
down the type of a variable:
let bigint: i64 = 0;
Learning Where to Find the Ropes
It's time to start using the documentation. This will be installed on your machine,
and you can use rustup doc --std
to open it in a browser.
Note the search field at the top, since this is going to be your friend; it operates completely offline.
Let's say we want to see where the mathematical functions are, so search for 'cos'. The first two hits show it defined for both single and double-precision floating point numbers. It is defined on the value itself as a method, like so:
let pi: f64 = 3.1416;
let x = pi/2.0;
let cosine = x.cos();
And the result will be sort-of zero; we obviously need a more authoritative source of pi-ness!
(Why do we need an explicit f64
type? Because without it, the constant could
be either f32
or f64
, and these are very different.)
Let me quote the example given for cos
, but written as a complete program
( assert!
is a cousin of assert_eq!
; the expression must be true):
fn main() {
    let x = 2.0 * std::f64::consts::PI;

    let abs_difference = (x.cos() - 1.0).abs();

    assert!(abs_difference < 1e-10);
}
std::f64::consts::PI
is a mouthful! ::
means much the same as it does in C++,
(often written using '.' in other languages) - it is a fully qualified name. We get
this full name from the second hit on searching for PI
.
Up to now, our little Rust programs have been free of all that import
and
include
stuff that tends to slow down the discussion of 'Hello World' programs.
Let's make this program more readable with a use
statement:
use std::f64::consts;

fn main() {
    let x = 2.0 * consts::PI;

    let abs_difference = (x.cos() - 1.0).abs();

    assert!(abs_difference < 1e-10);
}
Why haven't we needed to do this up to now?
This is because Rust helpfully makes a lot of basic functionality visible without
explicit use
statements through the Rust prelude.
Arrays and Slices
All statically-typed languages have arrays, which are values packed nose to tail in memory. Arrays are indexed from zero:
// array1.rs
fn main() {
    let arr = [10, 20, 30, 40];
    let first = arr[0];
    println!("first {}", first);

    for i in 0..4 {
        println!("[{}] = {}", i, arr[i]);
    }

    println!("length {}", arr.len());
}
And the output is:
first 10
[0] = 10
[1] = 20
[2] = 30
[3] = 40
length 4
In this case, Rust knows exactly how big the array is and if you try to
access arr[4]
it will be a compile error.
Learning a new language often involves unlearning mental habits from languages
you already know; if you are a Pythonista, then those brackets say List
. We will
come to the Rust equivalent of List
soon, but arrays are not the droids you're looking
for; they are fixed in size. They can be mutable (if we ask nicely) but you
cannot add new elements.
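A quick sketch of that point: individual elements can change if the array is declared mut, but there is no way to grow it:

fn main() {
    let mut arr = [10, 20, 30, 40];
    arr[0] = 15;        // fine - mutating an element in place
    println!("{:?}", arr);
    // arr.push(50);    // no such method - an array's length is fixed at compile-time
}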
Arrays are not used that often in Rust, because the type of an array includes its
size. The type of the array in the example is
[i32; 4]
; the type of [10, 20]
would be [i32; 2]
and so forth: they
have different types. So they are bastards to pass around as
function arguments.
What are used often are slices. You can think of these as views into an underlying array of values. They otherwise behave very much like an array, and know their size, unlike those dangerous animals C pointers.
Note two important things here - how to write a slice's type, and that
you have to use &
to pass it to the function.
// array2.rs
// read as: slice of i32
fn sum(values: &[i32]) -> i32 {
    let mut res = 0;
    for i in 0..values.len() {
        res += values[i]
    }
    res
}

fn main() {
    let arr = [10, 20, 30, 40];
    // look at that &
    let res = sum(&arr);
    println!("sum {}", res);
}
Ignore the code of sum
for a while, and look at &[i32]
. The relationship between
Rust arrays and slices is similar to that between C arrays and pointers, except for
two important differences - Rust slices keep track of their size (and will
panic if you try to access outside that size) and you have to explicitly say that
you want to pass an array as a slice using the &
operator.
A C programmer pronounces &
as 'address of'; a Rust programmer pronounces it
'borrow'. This is going to be the key word when learning Rust. Borrowing is the name
given to a common pattern in programming; whenever you pass something by reference
(as nearly always happens in dynamic languages) or pass a pointer in C. Anything
borrowed remains owned by the original owner.
Slicing and Dicing
You cannot print out an array in the usual way with {}
but you can do a debug
print with {:?}
.
// array3.rs
fn main() {
    let ints = [1, 2, 3];
    let floats = [1.1, 2.1, 3.1];
    let strings = ["hello", "world"];
    let ints_ints = [[1, 2], [10, 20]];
    println!("ints {:?}", ints);
    println!("floats {:?}", floats);
    println!("strings {:?}", strings);
    println!("ints_ints {:?}", ints_ints);
}
Which gives:
ints [1, 2, 3]
floats [1.1, 2.1, 3.1]
strings ["hello", "world"]
ints_ints [[1, 2], [10, 20]]
So, arrays of arrays are no problem, but the important thing is that an array contains values of only one type. The values in an array are arranged next to each other in memory so that they are very efficient to access.
If you are curious about the actual types of these variables, here is a useful trick. Just declare a variable with an explicit type which you know will be wrong:
let var: () = [1.1, 1.2];
Here is the informative error:
3 | let var: () = [1.1, 1.2];
| ^^^^^^^^^^ expected (), found array of 2 elements
|
= note: expected type `()`
= note: found type `[{float}; 2]`
({float}
means 'some floating-point type which is not fully specified yet')
Slices give you different views of the same array:
// slice1.rs
fn main() {
    let ints = [1, 2, 3, 4, 5];
    let slice1 = &ints[0..2];
    let slice2 = &ints[1..];  // open range!

    println!("ints {:?}", ints);
    println!("slice1 {:?}", slice1);
    println!("slice2 {:?}", slice2);
}
ints [1, 2, 3, 4, 5]
slice1 [1, 2]
slice2 [2, 3, 4, 5]
This is a neat notation which looks similar to Python slices but with a big difference: a copy of the data is never made. These slices all borrow their data from their arrays. They have a very intimate relationship with that array, and Rust spends a lot of effort to make sure that relationship does not break down.
Optional Values
Slices, like arrays, can be indexed. Rust knows the size of an array at
compile-time, but the size of a slice is only known at run-time. So s[i]
can
cause an out-of-bounds error when running and will panic. This is really not
what you want to happen - it can be the difference between a safe launch abort and
scattering pieces of a very expensive satellite all over Florida. And there are
no exceptions.
Let that sink in, because it comes as a shock. You cannot wrap dodgy-may-panic code in some try-block and 'catch the error' - at least not in a way you'd want to use every day. So how can Rust be safe?
There is a slice method get
which does not panic. But what does it return?
// slice2.rs
fn main() {
    let ints = [1, 2, 3, 4, 5];
    let slice = &ints;
    let first = slice.get(0);
    let last = slice.get(5);

    println!("first {:?}", first);
    println!("last {:?}", last);
}
// first Some(1)
// last None
last
failed (we forgot about zero-based indexing), but returned something called None
.
first
is fine, but appears as a value wrapped in Some
. Welcome to the Option
type! It may be either Some
or None
.
The Option
type has some useful methods:
# #![allow(unused_variables)] # #fn main() { println!("first {} {}", first.is_some(), first.is_none()); println!("last {} {}", last.is_some(), last.is_none()); println!("first value {}", first.unwrap()); // first true false // last false true // first value 1 #}
If you were to unwrap last
, you would get a panic. But at least you can call
is_some
first to make sure - for instance, if you had a distinct no-value default:
let maybe_last = slice.get(5);
let last = if maybe_last.is_some() {
    *maybe_last.unwrap()
} else {
    -1
};
Note the *
- the precise type inside the Some
is &i32
, which is a reference. We need
to dereference this to get back to an i32
value.
Which is long-winded, so there's a shortcut - unwrap_or
will return the value it
is given if the Option
was None
. The types must match up - get
returns
a reference, so you have to make up a &i32
with &-1
. Finally, again use *
to get the value as i32
.
let last = *slice.get(5).unwrap_or(&-1);
It's easy to miss the &
, but the compiler has your back here. If it was -1
,
rustc
says 'expected &{integer}, found integral variable' and then
'help: try with &-1
'.
You can think of Option
as a box which may contain a value, or nothing (None
).
(It is called Maybe
in Haskell). It may contain any kind of value, which is
its type parameter. In this case, the full type is Option<&i32>
, using
C++-style notation for generics. Unwrapping this box may cause an explosion,
but unlike Schroedinger's Cat, we can know in advance if it contains a value.
It is very common for Rust functions/methods to return such maybe-boxes, so learn how to use them comfortably.
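As a small sketch of that habit, here is a made-up function that may or may not find what it is looking for:

fn first_even(values: &[i32]) -> Option<i32> {
    for v in values {
        if v % 2 == 0 {
            return Some(*v);
        }
    }
    None
}

fn main() {
    assert_eq!(first_even(&[1, 3, 4, 5]), Some(4));
    assert_eq!(first_even(&[1, 3, 5]), None);
}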
Vectors
We'll return to slice methods again, but first: vectors. These are re-sizeable
arrays and behave much like Python List
and C++ std::vector
. The Rust type
Vec
(pronounced 'vector') behaves very much like a slice in fact; the
difference is that you can append extra values to a vector - note that it must
be declared as mutable.
// vec1.rs
fn main() {
    let mut v = Vec::new();
    v.push(10);
    v.push(20);
    v.push(30);

    let first = v[0];  // will panic if out-of-range
    let maybe_first = v.get(0);

    println!("v is {:?}", v);
    println!("first is {}", first);
    println!("maybe_first is {:?}", maybe_first);
}
// v is [10, 20, 30]
// first is 10
// maybe_first is Some(10)
A common beginner mistake is to forget the mut
; you will get a helpful error
message:
3 | let v = Vec::new();
| - use `mut v` here to make mutable
4 | v.push(10);
| ^ cannot borrow mutably
There is a very intimate relation between vectors and slices:
// vec2.rs
fn dump(arr: &[i32]) {
    println!("arr is {:?}", arr);
}

fn main() {
    let mut v = Vec::new();
    v.push(10);
    v.push(20);
    v.push(30);

    dump(&v);

    let slice = &v[1..];
    println!("slice is {:?}", slice);
}
That little, so-important borrow operator &
is coercing the vector into a
slice. And it makes complete sense, because the vector is also looking after an array of
values, with the difference that the array is allocated dynamically.
If you come from a dynamic language, now is time for that little talk. In systems languages, program memory comes in two kinds: the stack and the heap. It is very fast to allocate data on the stack, but the stack is limited; typically of the order of megabytes. The heap can be gigabytes, but allocating is relatively expensive, and such memory must be freed later. In so-called 'managed' languages (like Java, Go and the so-called 'scripting' languages) these details are hidden from you by that convenient municipal utility called the garbage collector. Once the system is sure that data is no longer referenced by other data, it goes back into the pool of available memory.
Generally, this is a price worth paying. Playing with the stack is terribly unsafe, because if you make one mistake you can overwrite the return address of the current function, and you die an ignominious death or (worse) get pwned by some guy living in his Mom's basement in Minsk.
The first C program I wrote (on a DOS PC) took out the whole computer. Unix systems always behaved better, and only the process died with a segfault. Why is this worse than a Rust (or Go) program panicking? Because a panic happens when the original problem happens, not when the program has become hopelessly confused and eaten all your homework. Panics are memory safe because they happen before any illegal access to memory. This is a common cause of security problems in C, because all memory accesses are unsafe and a cunning attacker can exploit this weakness.
Panicking sounds desperate and unplanned, but Rust panics are structured - the stack is unwound just as with exceptions. All allocated objects are dropped, and a backtrace is generated.
The downsides of garbage collection? The first is that it is wasteful of memory, which matters in those small embedded microchips which increasingly rule our world. The second is that it will decide, at the worst possible time, that a clean up must happen now. (The Mom analogy is that she wants to clean your room when you are at a delicate stage with a new lover). Those embedded systems need to respond to things when they happen ('real-time') and can't tolerate unscheduled outbreaks of cleaning. Roberto Ierusalimschy, the chief designer of Lua (one of the most elegant dynamic languages ever) said that he would not like to fly on an airplane that relied on garbage-collected software.
Back to vectors: when a vector is modified or created, it allocates from the heap and becomes the owner of that memory. The slice borrows the memory from the vector. When the vector dies or drops, it lets the memory go.
Iterators
We have got so far without mentioning a key part of the Rust puzzle - iterators.
The for-loop over a range was using an iterator (0..n
is actually similar to the
Python 3 range
function).
An iterator is easy to define informally. It is an 'object' with a next
method
which returns an Option
. As long as that value is not None
, we keep calling
next
:
// iter1.rs
fn main() {
    let mut iter = 0..3;
    assert_eq!(iter.next(), Some(0));
    assert_eq!(iter.next(), Some(1));
    assert_eq!(iter.next(), Some(2));
    assert_eq!(iter.next(), None);
}
And that is exactly what for var in iter {}
does.
This may seem an inefficient way to define a for-loop, but rustc
does crazy-ass
optimizations in release mode and it will be just as fast as a while
loop.
Here is the first attempt to iterate over an array:
// iter2.rs
fn main() {
    let arr = [10, 20, 30];
    for i in arr {
        println!("{}", i);
    }
}
which fails, but helpfully:
4 | for i in arr {
| ^ the trait `std::iter::Iterator` is not implemented for `[{integer}; 3]`
|
= note: `[{integer}; 3]` is not an iterator; maybe try calling
`.iter()` or a similar method
= note: required by `std::iter::IntoIterator::into_iter`
Following rustc
's advice, the following program works as expected.
// iter3.rs
fn main() {
    let arr = [10, 20, 30];
    for i in arr.iter() {
        println!("{}", i);
    }

    // slices will be converted implicitly to iterators...
    let slice = &arr;
    for i in slice {
        println!("{}", i);
    }
}
In fact, it is more efficient to iterate over an array or slice this way
than to use for i in 0..slice.len() {}
because Rust does not have to obsessively
check every index operation.
We had an example of summing up a range of integers earlier. It involved a mut
variable and a loop. Here's the idiomatic, pro-level way of doing the sum:
// sum1.rs
fn main() {
    let sum: i32 = (0..5).sum();
    println!("sum was {}", sum);

    let sum: i64 = [10, 20, 30].iter().sum();
    println!("sum was {}", sum);
}
Note that this is one of those cases where you need to be explicit about the type of the variable, since otherwise Rust doesn't have enough information. Here we do sums with two different integer sizes, no problem. (It is also no problem to create a new variable of the same name if you run out of names to give things.)
With this background, some more of the slice methods will make more sense. (Another documentation tip; on the right-hand side of every doc page there's a '[-]' which you can click to collapse the method list. You can then expand the details of anything that looks interesting. Anything that looks too weird, just ignore for now.)
The windows
method gives you an iterator of slices - overlapping windows of
values!
// slice4.rs
fn main() {
    let ints = [1, 2, 3, 4, 5];
    let slice = &ints;

    for s in slice.windows(2) {
        println!("window {:?}", s);
    }
}
// window [1, 2]
// window [2, 3]
// window [3, 4]
// window [4, 5]
Or chunks
:
for s in slice.chunks(2) {
    println!("chunks {:?}", s);
}
// chunks [1, 2]
// chunks [3, 4]
// chunks [5]
More about vectors...
There is a useful little macro vec!
for initializing a vector. Note that you
can remove values from the end of a vector using pop
, and extend a vector
using any compatible iterator.
// vec3.rs
fn main() {
    let mut v1 = vec![10, 20, 30, 40];
    v1.pop();

    let mut v2 = Vec::new();
    v2.push(10);
    v2.push(20);
    v2.push(30);

    assert_eq!(v1, v2);

    v2.extend(0..2);
    assert_eq!(v2, &[10, 20, 30, 0, 1]);
}
Vectors compare with each other and with slices by value.
You can insert values into a vector at arbitrary positions with insert
,
and remove with remove
. This is not as efficient as pushing and popping since
the values will have to be moved to make room, so watch out for these operations on big
vectors.
Vectors have a size and a capacity. If you clear
a vector, its size becomes zero,
but it still retains its old capacity. So refilling it with push
, etc only requires
reallocation when the size gets larger than that capacity.
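Here is a little sketch of that, just printing the numbers (the exact capacity chosen is up to the implementation, so treat the values in the comments as illustrative):

fn main() {
    let mut v = Vec::with_capacity(4);
    v.push(10);
    v.push(20);
    println!("len {} cap {}", v.len(), v.capacity());  // len 2 cap 4
    v.clear();
    println!("len {} cap {}", v.len(), v.capacity());  // len 0 cap 4 - the capacity is kept
}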
Vectors can be sorted, and then duplicates can be removed - these operate in-place
on the vector. (If you want to make a copy first, use clone
.)
// vec4.rs
fn main() {
    let mut v1 = vec![1, 10, 5, 1, 2, 11, 2, 40];
    v1.sort();
    v1.dedup();
    assert_eq!(v1, &[1, 2, 5, 10, 11, 40]);
}
Strings
Strings in Rust are a little more involved than in other languages; the String
type,
like Vec
, allocates dynamically and is resizeable. (So it's like C++'s std::string
but not like the immutable strings of Java and Python.) But a program may contain a lot
of string literals (like "hello") and a system language should be able to store
these statically in the executable itself. In embedded micros, that could mean putting
them in cheap ROM rather than expensive RAM (for low-power devices, RAM is
also expensive in terms of power consumption.) A system language has to have
two kinds of string, allocated or static.
So "hello" is not of type String
. It is of type &str
(pronounced 'string slice').
It's like the distinction between const char*
and std::string
in C++, except
&str
is much more intelligent. In fact, &str
and String
have a very
similar relationship to each other as do &[T]
to Vec<T>
.
// string1.rs
fn dump(s: &str) {
    println!("str '{}'", s);
}

fn main() {
    let text = "hello dolly";  // the string slice
    let s = text.to_string();  // it's now an allocated string

    dump(text);
    dump(&s);
}
Again, the borrow operator can coerce String
into &str
, just as Vec<T>
could
be coerced into &[T]
.
Under the hood, String
is basically a Vec<u8>
and &str
is &[u8]
, but
those bytes must represent valid UTF-8 text.
Like a vector, you can push
a character and pop
one off the end of String
:
// string5.rs
fn main() {
    let mut s = String::new();
    // initially empty!
    s.push('H');
    s.push_str("ello");
    s.push(' ');
    s += "World!"; // short for `push_str`
    // remove the last char
    s.pop();

    assert_eq!(s, "Hello World");
}
You can convert many types to strings using to_string
(if you can display them with '{}' then they can be converted).
The format!
macro is a very useful way to build
up more complicated strings using the same format strings as println!
.
// string6.rs
fn array_to_str(arr: &[i32]) -> String {
    let mut res = '['.to_string();
    for v in arr {
        res += &v.to_string();
        res.push(',');
    }
    res.pop();
    res.push(']');
    res
}

fn main() {
    let arr = array_to_str(&[10, 20, 30]);
    let res = format!("hello {}", arr);

    assert_eq!(res, "hello [10,20,30]");
}
Note the &
in front of v.to_string()
- the operator is defined on a string
slice, not a String
itself, so it needs a little persuasion to match.
The notation used for slices works with strings as well:
// string2.rs
fn main() {
    let text = "static";
    let string = "dynamic".to_string();

    let text_s = &text[1..];
    let string_s = &string[2..4];

    println!("slices {:?} {:?}", text_s, string_s);
}
// slices "tatic" "na"
But, you cannot index strings! This is because they use the One True Encoding, UTF-8, where a 'character' may be a number of bytes.
// string3.rs
fn main() {
    let multilingual = "Hi! ¡Hola! привет!";
    for ch in multilingual.chars() {
        print!("'{}' ", ch);
    }
    println!("");
    println!("len {}", multilingual.len());
    println!("count {}", multilingual.chars().count());

    let maybe = multilingual.find('п');
    if maybe.is_some() {
        let hi = &multilingual[maybe.unwrap()..];
        println!("Russian hi {}", hi);
    }
}
// 'H' 'i' '!' ' ' '¡' 'H' 'o' 'l' 'a' '!' ' ' 'п' 'р' 'и' 'в' 'е' 'т' '!'
// len 25
// count 18
// Russian hi привет!
Now, let that sink in - there are 25 bytes, but only 18 characters! However, if
you use a method like find
, you will get a valid index (if found) and then
any slice will be fine.
(The Rust char
type is a 4-byte Unicode code point. Strings are not arrays
of chars!)
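A quick sketch to make that concrete: a char is always four bytes in memory, however few bytes it needs once encoded as UTF-8:

fn main() {
    let ch = 'п';
    assert_eq!(std::mem::size_of::<char>(), 4);
    assert_eq!(ch.len_utf8(), 2);  // 'п' takes only two bytes inside a UTF-8 string
}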
String slicing may explode like vector indexing, because it uses byte offsets. In this case, the string consists of two bytes, so trying to pull out the first byte is a Unicode error. So be careful to only slice strings using valid offsets that come from string methods.
# #![allow(unused_variables)] # #fn main() { let s = "¡"; println!("{}", &s[0..1]); <-- bad, first byte of a multibyte character #}
Breaking up strings is a popular and useful pastime. The string split_whitespace
method returns an iterator, and we then choose what to do with it. A common need
is to create a vector of the split substrings.
collect
is very general and so needs some clues about what it is collecting - hence
the explicit type.
let text = "the red fox and the lazy dog";
let words: Vec<&str> = text.split_whitespace().collect();
// ["the", "red", "fox", "and", "the", "lazy", "dog"]
You could also say it like this, passing the iterator into the extend
method:
let mut words = Vec::new();
words.extend(text.split_whitespace());
In most languages, we would have to make these separately allocated strings, whereas here each slice in the vector is borrowing from the original string. All we allocate is the space to keep the slices.
Have a look at this cute two-liner; we get an iterator over the chars,
and only take those characters which are not space. Again, collect
needs
a clue (we may have wanted a vector of chars, say):
let stripped: String = text.chars()
    .filter(|ch| ! ch.is_whitespace()).collect();
// theredfoxandthelazydog
The filter
method takes a closure, which is Rust-speak for
lambdas or anonymous functions. Here the argument type is clear from the
context, so the explicit rule is relaxed.
Yes, you can do this as an explicit loop over chars, pushing the returned slices into a mutable vector, but this is shorter, reads well (when you are used to it, of course) and just as fast. It is not a sin to use a loop, however, and I encourage you to write that version as well.
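For the curious, here is a sketch of the explicit-loop version of the last example; it does the same job, just more verbosely:

fn main() {
    let text = "the red fox and the lazy dog";
    let mut stripped = String::new();
    for ch in text.chars() {
        if ! ch.is_whitespace() {
            stripped.push(ch);
        }
    }
    assert_eq!(stripped, "theredfoxandthelazydog");
}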
Interlude: Getting Command Line Arguments
Up to now our programs have lived in blissful ignorance of the outside world; now it's time to feed them data.
std::env::args
is how you access command-line arguments; it returns an iterator
over the arguments as strings, including the program name.
// args0.rs
fn main() {
    for arg in std::env::args() {
        println!("'{}'", arg);
    }
}
src$ rustc args0.rs
src$ ./args0 42 'hello dolly' frodo
'./args0'
'42'
'hello dolly'
'frodo'
Would it have been better to return a Vec
? It's easy enough to use collect
to
make that vector, using the iterator skip
method to move past the program
name.
let args: Vec<String> = std::env::args().skip(1).collect();
if args.len() > 0 { // we have args!
    ...
}
Which is fine; it's pretty much how you would do it in most languages.
A more Rust-y approach to reading a single argument (together with parsing an integer value):
// args1.rs
use std::env;

fn main() {
    let first = env::args().nth(1).expect("please supply an argument");
    let n: i32 = first.parse().expect("not an integer!");
    // do your magic
}
nth(1)
gives you the second value of the iterator, and expect
is like an unwrap
with a readable message.
Converting a string into a number is straightforward, but you do need to specify
the type of the value - how else could parse
know?
This program can panic, which is fine for dinky test programs. But don't get too comfortable with this convenient habit.
Matching
The code in string3.rs
where we extract the Russian greeting is not how it would
be usually written. Enter match:
match multilingual.find('п') {
    Some(idx) => {
        let hi = &multilingual[idx..];
        println!("Russian hi {}", hi);
    },
    None => println!("couldn't find the greeting, Товарищ")
};
match
consists of several patterns with a matching value following the fat arrow,
separated by commas. It has conveniently unwrapped the value from the Option
and
bound it to idx
. You must specify all the possibilities, so we have to handle
None
.
Once you are used to it (and by that I mean, typed it out in full a few times) it
feels more natural than the explicit is_some
check which needed an extra
variable to store the Option
.
But if you're not interested in failure here, then if let
is your friend:
if let Some(idx) = multilingual.find('п') {
    println!("Russian hi {}", &multilingual[idx..]);
}
This is convenient if you want to do a match and are only interested in one possible result.
match
can also operate like a C switch
statement, and like other Rust constructs
can return a value:
# #![allow(unused_variables)] # #fn main() { let text = match n { 0 => "zero", 1 => "one", 2 => "two", _ => "many", }; #}
The _
is like C default
- it's a fall-back case. If you don't provide one then
rustc
will consider it an error. (In C++ the best you can expect is a warning, which
says a lot about the respective languages).
Rust match
statements can also match on ranges. Note that these ranges have
three dots and are inclusive ranges, so that the first condition would match 3.
let text = match n {
    0...3 => "small",
    4...6 => "medium",
    _ => "large",
};
Reading from Files
The next step in exposing our programs to the world is reading files.
Recall that expect
is like unwrap
but gives a custom error message. We are
going to throw away a few errors here:
// file1.rs
use std::env;
use std::fs::File;
use std::io::Read;

fn main() {
    let first = env::args().nth(1).expect("please supply a filename");

    let mut file = File::open(&first).expect("can't open the file");

    let mut text = String::new();
    file.read_to_string(&mut text).expect("can't read the file");

    println!("file had {} bytes", text.len());
}
src$ file1 file1.rs
file had 366 bytes
src$ ./file1 frodo.txt
thread 'main' panicked at 'can't open the file: Error { repr: Os { code: 2, message: "No such file or directory" } }', ../src/libcore/result.rs:837
note: Run with `RUST_BACKTRACE=1` for a backtrace.
src$ file1 file1
thread 'main' panicked at 'can't read the file: Error { repr: Custom(Custom { kind: InvalidData, error: StringError("stream did not contain valid UTF-8") }) }', ../src/libcore/result.rs:837
note: Run with `RUST_BACKTRACE=1` for a backtrace.
So open
can fail because the file doesn't exist or we aren't allowed to read it,
and read_to_string
can fail because the file doesn't contain valid UTF-8. (Which is
fair enough, you can use read_to_end
and put the contents into a vector of bytes
instead.) For files that aren't too big, reading them in one gulp is useful and
straightforward.
If you know anything about file handling in other languages, you may wonder when
the file is closed. If we were writing to this file, then not closing it could
result in loss of data.
But the file here is closed when the function ends and the file
variable is dropped.
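If you want the file closed earlier than the end of the function, you can arrange for the drop yourself; a small sketch (reusing file1.rs as the input file, just for illustration):

use std::fs::File;
use std::io::Read;

fn main() {
    let mut text = String::new();
    {
        let mut file = File::open("file1.rs").expect("can't open the file");
        file.read_to_string(&mut text).expect("can't read the file");
    }   // file goes out of scope here, so it is dropped - and closed
    println!("file had {} bytes", text.len());
}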
This 'throwing away errors' thing is getting too much of a habit. You do not
want to put this code into a function, knowing that it could so easily crash
the whole program. So now we have to talk about exactly what File::open
returns.
If Option
is a value that may contain something or nothing, then Result
is a value
that may contain something or an error. They both understand unwrap
(and its cousin
expect
) but they are quite different. Result
is defined by two type parameters,
for the Ok
value and the Err
value.
The Result
'box' has two compartments, one labelled Ok
and the other Err
.
fn good_or_bad(good: bool) -> Result<i32,String> {
    if good {
        Ok(42)
    } else {
        Err("bad".to_string())
    }
}

fn main() {
    println!("{:?}", good_or_bad(true));
    //Ok(42)
    println!("{:?}", good_or_bad(false));
    //Err("bad")

    match good_or_bad(true) {
        Ok(n) => println!("Cool, I got {}", n),
        Err(e) => println!("Huh, I just got {}", e)
    }
    // Cool, I got 42
}
(The actual 'error' type is arbitrary - a lot of people use strings until they are comfortable with Rust error types.) It's a convenient way to either return one value or another.
This version of the file reading function does not crash. It returns a Result
and
it is the caller who must decide how to handle the error.
// file2.rs
use std::env;
use std::fs::File;
use std::io::Read;
use std::io;

fn read_to_string(filename: &str) -> Result<String,io::Error> {
    let mut file = match File::open(&filename) {
        Ok(f) => f,
        Err(e) => return Err(e),
    };
    let mut text = String::new();
    match file.read_to_string(&mut text) {
        Ok(_) => Ok(text),
        Err(e) => Err(e),
    }
}

fn main() {
    let file = env::args().nth(1).expect("please supply a filename");

    let text = read_to_string(&file).expect("bad file man!");

    println!("file had {} bytes", text.len());
}
The first match safely extracts the value from Ok
, which
becomes the value of the match. If it's Err
it returns the error,
rewrapped as an Err
.
The second match returns the string, wrapped up as an Ok
, otherwise
(again) the error. The actual value in the Ok
is unimportant, so we ignore
it with _
.
This is not so pretty; when most of a function is error handling, then the 'happy path' gets lost. Go tends to have this problem, with lots of explicit early returns, or just ignoring errors. (That is, by the way, the closest thing to evil in the Rust universe.)
Fortunately, there is a shortcut.
The std::io
module defines a type alias io::Result<T>
which is exactly
the same as Result<T,io::Error>
and easier to type.
# #![allow(unused_variables)] # #fn main() { fn read_to_string(filename: &str) -> io::Result<String> { let mut file = File::open(&filename)?; let mut text = String::new(); file.read_to_string(&mut text)?; Ok(text) } #}
That ?
operator does almost exactly what the match on File::open
does;
if the result was an error, then it will immediately return that error.
Otherwise, it returns the Ok
result.
At the end, we still need to wrap up the string as a result.
2017 was a good year for Rust, and ?
was one of the cool things that
became stable. You will still see the macro try!
used in older code:
# #![allow(unused_variables)] # #fn main() { fn read_to_string(filename: &str) -> io::Result<String> { let mut file = try!(File::open(&filename)); let mut text = String::new(); try!(file.read_to_string(&mut text)); Ok(text) } #}
In summary, it's possible to write perfectly safe Rust that isn't ugly, without needing exceptions.
Structs, Enums and Matching
Rust likes to Move It, Move It
I'd like to move back a little, and show you something surprising:
// move1.rs
fn main() {
    let s1 = "hello dolly".to_string();
    let s2 = s1;
    println!("s1 {}", s1);
}
And we get the following error:
error[E0382]: use of moved value: `s1`
--> move1.rs:5:22
|
4 | let s2 = s1;
| -- value moved here
5 | println!("s1 {}", s1);
| ^^ value used here after move
|
= note: move occurs because `s1` has type `std::string::String`,
which does not implement the `Copy` trait
Rust has different behaviour than other languages. In a language where variables are
always references (like Java or Python), s2
becomes yet another reference to the
string object referenced by s1
. In C++, s1
is a value, and it is copied to s2
.
But Rust moves the value. It doesn't see strings as copyable
("does not implement the Copy trait").
We would not see this with 'primitive' types like numbers, since they are just values;
they are allowed to be copyable because they are cheap to copy. But String
has allocated
memory containing "Hello dolly", and copying will involve allocating some more memory
and copying the characters. Rust will not do this silently.
Consider a String
containing the whole text of 'Moby-Dick'. It's not a big struct,
just has the address in memory of the text, its size, and how big the allocated block is.
Copying this is going to be expensive, because that memory is allocated on the heap and
the copy will need its own allocated block.
String
| addr | ---------> Call me Ishmael.....
| size |                     |
| cap  |                     |
                             |
&str                         |
| addr | --------------------|
| size |

f64
| 8 bytes |
The second value is a string slice (&str
) which refers to the same memory as the string,
with a size - just the guy's name. Cheap to copy!
The third value is an f64
- just 8 bytes. It does not refer to any other memory, so
it's just as cheap to copy as to move.
Copy
values are only defined by their representation in memory, and when
Rust copies, it just copies those bytes elsewhere. Similarly, a non-Copy
value
is also just moved. There is no cleverness in copying and moving, unlike in C++.
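A tiny sketch of the difference in practice - integers are Copy, so both variables stay usable, while a String is not, so the second assignment is a move:

fn main() {
    let n1 = 42;
    let n2 = n1;            // copied - n1 is still usable
    println!("{} {}", n1, n2);

    let s1 = "dolly".to_string();
    let s2 = s1;            // moved - using s1 after this would be an error
    println!("{}", s2);
}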
Re-writing with a function call reveals exactly the same error:
// move2.rs
fn dump(s: String) {
    println!("{}", s);
}

fn main() {
    let s1 = "hello dolly".to_string();
    dump(s1);
    println!("s1 {}", s1); // <---error: 'value used here after move'
}
Here, you have a choice. You may pass a reference to that string, or
explicitly copy it using its clone
method. Generally, the first is the better way
to go.
fn dump(s: &String) { println!("{}", s); } fn main() { let s1 = "hello dolly".to_string(); dump(&s1); println!("s1 {}", s1); }
The error goes away. But you'll rarely see a plain
String
reference like this, since to pass a string literal is really ugly and involves
creating a temporary string.
# #![allow(unused_variables)] # #fn main() { dump(&"hello world".to_string()); #}
So altogether the best way to declare that function is:
# #![allow(unused_variables)] # #fn main() { fn dump(s: &str) { println!("{}", s); } #}
And then both dump(&s1)
and dump("hello world")
work properly. (Here Deref
coercion kicks in and Rust will convert &String
to &str
for you.)
To summarise, assignment of a non-Copy value moves the value from one location to another. Otherwise, Rust would be forced to implicitly do a copy and break its promise to make allocations explicit.
Scope of Variables
So, the rule of thumb is to prefer to keep references to the original data - to 'borrow' it.
But a reference must not outlive the owner!
First, Rust is a block-scoped language. Variables only exist for the duration of their block:
{
    let a = 10;
    let b = "hello";
    {
        let c = "hello".to_string();
        // a, b and c are visible
    }
    // the string c is dropped
    // a, b are visible
    for i in 0..a {
        let b = &b[1..];
        // original b is no longer visible - it is shadowed.
    }
    // the slice b is dropped
    // i is _not_ visible!
}
Loop variables (like i
) are a little different, they are only visible in the loop
block. It is not an error to create a new variable using the same name ('shadowing')
but it can be confusing.
When a variable 'goes out of scope' then it is dropped. Any memory used is reclaimed,
and any other resources owned by that variable are given back to the system - for
instance, dropping a File
closes it. This is a Good Thing. Unused resources are
reclaimed immediately when not needed.
(A further Rust-specific issue is that a variable may appear to be in scope, but its value has moved.)
Here a reference rs1
is made to a value tmp
which only lives for the duration
of its block:
01 // ref1.rs
02 fn main() {
03    let s1 = "hello dolly".to_string();
04    let mut rs1 = &s1;
05    {
06        let tmp = "hello world".to_string();
07        rs1 = &tmp;
08    }
09    println!("ref {}", rs1);
10 }
We borrow the value of s1
and then borrow the value of tmp
. But tmp
's value
does not exist outside that block!
error: `tmp` does not live long enough
--> ref1.rs:8:5
|
7 | rs1 = &tmp;
| --- borrow occurs here
8 | }
| ^ `tmp` dropped here while still borrowed
9 | println!("ref {}", rs1);
10 | }
| - borrowed value needs to live until here
Where is tmp
? Gone, dead, gone back to the Big Heap in the Sky: dropped.
Rust is here saving you from the dreaded 'dangling pointer' problem of C -
a reference that points to stale data.
Tuples
It's sometimes very useful to return multiple values from a function. Tuples are a convenient solution:
// tuple1.rs
fn add_mul(x: f64, y: f64) -> (f64,f64) {
    (x + y, x * y)
}

fn main() {
    let t = add_mul(2.0, 10.0);

    // can debug print
    println!("t {:?}", t);

    // can 'index' the values
    println!("add {} mul {}", t.0, t.1);

    // can _extract_ values
    let (add, mul) = t;
    println!("add {} mul {}", add, mul);
}
// t (12, 20)
// add 12 mul 20
// add 12 mul 20
Tuples may contain different types, which is the main difference from arrays.
let tuple = ("hello", 5, 'c');

assert_eq!(tuple.0, "hello");
assert_eq!(tuple.1, 5);
assert_eq!(tuple.2, 'c');
They appear in some Iterator
methods. enumerate
is like the Python generator
of the same name:
for t in ["zero", "one", "two"].iter().enumerate() {
    print!(" {} {};", t.0, t.1);
}
// 0 zero; 1 one; 2 two;
zip
combines two iterators into a single iterator of
tuples containing the values from both:
let names = ["ten", "hundred", "thousand"];
let nums = [10, 100, 1000];
for p in names.iter().zip(nums.iter()) {
    print!(" {} {};", p.0, p.1);
}
// ten 10; hundred 100; thousand 1000;
Structs
Tuples are convenient, but saying t.1
and keeping track of the meaning of each part
is tedious for anything that isn't straightforward.
Rust structs contain named fields:
// struct1.rs
struct Person {
    first_name: String,
    last_name: String
}

fn main() {
    let p = Person {
        first_name: "John".to_string(),
        last_name: "Smith".to_string()
    };
    println!("person {} {}", p.first_name, p.last_name);
}
The values of a struct will be placed next to each other in memory, although you should not assume any particular memory layout, since the compiler will organize the memory for efficiency, not size, and there may be padding.
Initializing this struct is a bit clumsy, so we want to move the construction of a Person
into its own function. This function can be made into an associated function of Person
by putting
it into a impl
block:
// struct2.rs
struct Person {
    first_name: String,
    last_name: String
}

impl Person {
    fn new(first: &str, name: &str) -> Person {
        Person {
            first_name: first.to_string(),
            last_name: name.to_string()
        }
    }
}

fn main() {
    let p = Person::new("John", "Smith");
    println!("person {} {}", p.first_name, p.last_name);
}
There is nothing magic or reserved about the name new
here. Note that it's accessed
using a C++-like notation using double-colon ::
.
Here's a Person
method , that takes a reference self argument:
impl Person {
    ...

    fn full_name(&self) -> String {
        format!("{} {}", self.first_name, self.last_name)
    }
}
...
println!("fullname {}", p.full_name());
// fullname John Smith
The self
is used explicitly and is passed as a reference.
(You can think of &self
as short for self: &Person
.)
The keyword Self
refers to the struct type - you can mentally substitute Person
for Self
here:
fn copy(&self) -> Self {
    Self::new(&self.first_name, &self.last_name)
}
Methods may allow the data to be modified using a mutable self argument:
fn set_first_name(&mut self, name: &str) {
    self.first_name = name.to_string();
}
And the data will move into the method when a plain self argument is used:
fn to_tuple(self) -> (String, String) {
    (self.first_name, self.last_name)
}
(Try that with &self
- structs will not let go of their data without a fight!)
Note that after v.to_tuple()
is called, then v
has moved and is no longer
available.
To summarize:
- no self argument: you can associate functions with structs, like the new "constructor".
- &self argument: can use the values of the struct, but not change them
- &mut self argument: can modify the values
- self argument: will consume the value, which will move.
If you try to do a debug dump of a Person
, you will get an informative error:
error[E0277]: the trait bound `Person: std::fmt::Debug` is not satisfied
--> struct2.rs:23:21
|
23 | println!("{:?}", p);
| ^ the trait `std::fmt::Debug` is not implemented for `Person`
|
= note: `Person` cannot be formatted using `:?`; if it is defined in your crate,
add `#[derive(Debug)]` or manually implement it
= note: required by `std::fmt::Debug::fmt`
The compiler is giving advice, so we put #[derive(Debug)]
in front of Person
, and now
there is sensible output:
Person { first_name: "John", last_name: "Smith" }
The directive makes the compiler generate a Debug
implementation, which is very
helpful. It's good practice to do this for your structs, so they can be
printed out (or written as a string using format!
). (Doing so by default would be
very un-Rustlike.)
Here is the final little program:
// struct4.rs
use std::fmt;

#[derive(Debug)]
struct Person {
    first_name: String,
    last_name: String
}

impl Person {
    fn new(first: &str, name: &str) -> Person {
        Person {
            first_name: first.to_string(),
            last_name: name.to_string()
        }
    }

    fn full_name(&self) -> String {
        format!("{} {}", self.first_name, self.last_name)
    }

    fn set_first_name(&mut self, name: &str) {
        self.first_name = name.to_string();
    }

    fn to_tuple(self) -> (String, String) {
        (self.first_name, self.last_name)
    }
}

fn main() {
    let mut p = Person::new("John", "Smith");
    println!("{:?}", p);

    p.set_first_name("Jane");
    println!("{:?}", p);

    println!("{:?}", p.to_tuple());
    // p has now moved.
}
// Person { first_name: "John", last_name: "Smith" }
// Person { first_name: "Jane", last_name: "Smith" }
// ("Jane", "Smith")
Lifetimes Start to Bite
Usually structs contain values, but often they also need to contain references. Say we want to put a string slice, not a string value, in a struct.
// life1.rs
#[derive(Debug)]
struct A {
    s: &str
}

fn main() {
    let a = A { s: "hello dammit" };
    println!("{:?}", a);
}
error[E0106]: missing lifetime specifier
--> life1.rs:5:8
|
5 | s: &str
| ^ expected lifetime parameter
To understand the complaint, you have to see the problem from the point of view of Rust. It will not allow a reference to be stored without knowing its lifetime. All references are borrowed from some value, and all values have lifetimes. The lifetime of a reference cannot be longer than the lifetime of that value. Rust cannot allow a situation where that reference could suddenly become invalid.
Now, string slices borrow from string literals
like "hello" or from String
values. String literals exist for the duration
of the whole program, which is called the 'static' lifetime.
So this works - we assure Rust that the string slice always refers to such static strings:
// life2.rs
#[derive(Debug)]
struct A {
    s: &'static str
}

fn main() {
    let a = A { s: "hello dammit" };
    println!("{:?}", a);
}
// A { s: "hello dammit" }
It is not the prettiest notation, but sometimes ugliness is the necessary price of being precise.
This can also be used to specify a string slice that is returned from a function:
fn how(i: u32) -> &'static str {
    match i {
        0 => "none",
        1 => "one",
        _ => "many"
    }
}
That works for the special case of static strings, but this is very restrictive.
However, we can specify that the lifetime of the reference is at least as long as that of the struct itself.
// life3.rs
#[derive(Debug)]
struct A <'a> {
    s: &'a str
}

fn main() {
    let s = "I'm a little string".to_string();
    let a = A { s: &s };
    println!("{:?}", a);
}
Lifetimes are conventionally called 'a, 'b, etc. but you could just as well call this one 'me'.
After this point, our a
struct and the s
string are bound by a strict contract:
a
borrows from s
, and cannot outlive it.
With this struct definition, we would like to write a function that returns an A
value:
fn makes_a() -> A {
    let string = "I'm a little string".to_string();
    A { s: &string }
}
But A
needs a lifetime - "expected lifetime parameter":
= help: this function's return type contains a borrowed value,
but there is no value for it to be borrowed from
= help: consider giving it a 'static lifetime
rustc
is giving advice, so we follow it:
fn makes_a() -> A<'static> {
    let string = "I'm a little string".to_string();
    A { s: &string }
}
And now the error is
8 | A { s: &string }
| ^^^^^^ does not live long enough
9 | }
| - borrowed value only lives until here
There is no way that this could safely work, because string
will be dropped when the
function ends, and no reference to string
can outlast it.
You can usefully think of lifetime parameters as being part of the type of a value.
Sometimes it seems like a good idea for a struct to contain a value and a reference that borrows from that value. It's basically impossible because structs must be moveable, and any move will invalidate the reference. It isn't necessary to do this - for instance, if your struct has a string field, and needs to provide slices, then it could keep indices and have a method to generate the actual slices.
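Here is a minimal sketch of that idea - the Sentence type and first_word method are just made up for this illustration:

struct Sentence {
    text: String,
    // start and end byte indices of the first word, instead of a borrowed &str
    first: (usize, usize)
}

impl Sentence {
    fn new(text: &str) -> Sentence {
        let end = text.find(' ').unwrap_or(text.len());
        Sentence { text: text.to_string(), first: (0, end) }
    }

    // the slice is generated on demand, borrowed from self for the duration of the call
    fn first_word(&self) -> &str {
        &self.text[self.first.0 .. self.first.1]
    }
}

The struct stays freely moveable, because it owns everything it contains; the borrow only exists while first_word is being used.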
Traits
Please note that Rust does not spell struct
class. The keyword class
in other
languages is so overloaded with meaning that it effectively shuts down original thinking.
Let's put it like this: Rust structs cannot inherit from other structs; they are all unique types. There is no sub-typing. They are dumb data.
So how does one establish relationships between types? This is where traits come in.
rustc
often talks about implementing X trait
and so it's time to talk about traits
properly.
Here's a little example of defining a trait and implementing it for a particular type.
// trait1.rs
trait Show {
    fn show(&self) -> String;
}

impl Show for i32 {
    fn show(&self) -> String {
        format!("four-byte signed {}", self)
    }
}

impl Show for f64 {
    fn show(&self) -> String {
        format!("eight-byte float {}", self)
    }
}

fn main() {
    let answer = 42;
    let maybe_pi = 3.14;
    let s1 = answer.show();
    let s2 = maybe_pi.show();
    println!("show {}", s1);
    println!("show {}", s2);
}
// show four-byte signed 42
// show eight-byte float 3.14
It's pretty cool; we have added a new method to both i32
and f64
!
Getting comfortable with Rust involves learning the basic traits of the standard library (they tend to hunt in packs.)
Debug
is very common.
We gave Person
a default implementation with the
convenient #[derive(Debug)]
, but say we want a Person
to display as its full name:
use std::fmt;

impl fmt::Debug for Person {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{}", self.full_name())
    }
}
...
println!("{:?}", p);
// John Smith
write!
is a very useful macro - here f
is anything that implements Write
.
(This would also work with a File
- or even a String
.)
Display
controls how values are printed out with "{}" and is implemented
just like Debug
. As a useful side-effect, ToString
is automatically
implemented for anything implementing Display
. So if we implement
Display
for Person
, then p.to_string()
also works.
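For instance, here's a sketch of such a Display implementation, mirroring the Debug one above:

use std::fmt;

impl fmt::Display for Person {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{}", self.full_name())
    }
}
...
println!("{}", p);        // John Smith
let s = p.to_string();    // also "John Smith", for free via ToString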
Clone
defines the method clone
, and can simply be defined with
"#[derive(Clone)]" if all the fields themselves implement Clone
.
Example: iterator over floating-point range
We have met ranges before (0..n
) but they don't work for floating-point values. (You
can force this but you'll end up with a step of 1.0 which is uninteresting.)
Recall the informal definition of an iterator: it is a struct with a next method which may return Some-thing or None. In the process, the iterator itself gets modified, since it keeps the state of the iteration (like the next index and so forth.) The data being iterated over usually doesn't change. (But see Vec::drain for an interesting iterator that does modify its data.)
And here is the formal definition: the Iterator trait.
trait Iterator {
    type Item;
    fn next(&mut self) -> Option<Self::Item>;
    ...
}
Here we meet an associated type of the Iterator
trait.
This trait must work for any type, so you must specify that return type somehow. The method next can then be written without mentioning a particular type - instead it refers to the associated Item type via Self.
The iterator trait for f64
is written Iterator<Item=f64>
, which can be read as
"an Iterator with its associated type Item set to f64".
The ...
refers to the provided methods of Iterator
. You only need to define Item
and next
, and the provided methods are defined for you.
// trait3.rs

struct FRange {
    val: f64,
    end: f64,
    incr: f64
}

fn range(x1: f64, x2: f64, skip: f64) -> FRange {
    FRange {val: x1, end: x2, incr: skip}
}

impl Iterator for FRange {
    type Item = f64;

    fn next(&mut self) -> Option<Self::Item> {
        let res = self.val;
        if res >= self.end {
            None
        } else {
            self.val += self.incr;
            Some(res)
        }
    }
}

fn main() {
    for x in range(0.0, 1.0, 0.1) {
        println!("{} ", x);
    }
}
And the rather messy looking result is
0
0.1
0.2
0.30000000000000004
0.4
0.5
0.6
0.7
0.7999999999999999
0.8999999999999999
0.9999999999999999
This is because 0.1 is not precisely representable as a float, so a little formatting
help is needed. Replace the println!
with this
println!("{:.1} ", x);
And we get cleaner output (this format means 'one decimal after dot'.)
All of the default iterator methods are available, so we can collect these values into a vector, map them, and so forth.
let v: Vec<f64> = range(0.0, 1.0, 0.1).map(|x| x.sin()).collect();
Generic Functions
We want a function which will dump out any value that implements Debug
. Here is
a first attempt at a generic function, where we can pass a reference to any type
of value. T
is a type parameter, which needs to be declared just after the
function name:
fn dump<T> (value: &T) {
    println!("value is {:?}", value);
}

let n = 42;
dump(&n);
However, Rust clearly knows nothing about this generic type T
:
error[E0277]: the trait bound `T: std::fmt::Debug` is not satisfied
...
= help: the trait `std::fmt::Debug` is not implemented for `T`
= help: consider adding a `where T: std::fmt::Debug` bound
For this to work, Rust needs to be told that T
does in fact implement Debug
!
fn dump<T> (value: &T)
where T: std::fmt::Debug {
    println!("value is {:?}", value);
}

let n = 42;
dump(&n);
// value is 42
Rust generic functions need trait bounds on types - we are saying here that
"T is any type that implements Debug". rustc
is being very helpful, and
suggests exactly what bound needs to be provided.
Now that Rust knows the trait bounds for T
, it can give you sensible compiler messages:
struct Foo {
    name: String
}

let foo = Foo{name: "hello".to_string()};

dump(&foo)
And the error is "the trait std::fmt::Debug
is not implemented for Foo
".
Functions are already generic in dynamic languages because values carry their actual type around, and the type checking happens at run-time - or fails miserably. For larger programs, we really do want to know about problems at compile-time! Rather than sitting down calmly with compiler errors, a programmer in these languages has to deal with problems that only show up when the program is running. Murphy's Law then implies that these problems will tend to happen at the most inconvenient/disastrous time.
The operation of squaring a number is generic: x*x
will work for integers,
floats and generally for anything that knows about the multiplication operator *
.
But what are the type bounds?
// gen1.rs

fn sqr<T> (x: T) -> T {
    x * x
}

fn main() {
    let res = sqr(10.0);
    println!("res {}", res);
}
The first problem is that Rust does not know that T
can be multiplied:
error[E0369]: binary operation `*` cannot be applied to type `T`
--> gen1.rs:4:5
|
4 | x * x
| ^
|
note: an implementation of `std::ops::Mul` might be missing for `T`
--> gen1.rs:4:5
|
4 | x * x
| ^
Following the advice of the compiler, let's constrain that type parameter using
that trait, which is used to implement the multiplication operator *
:
fn sqr<T> (x: T) -> T
where T: std::ops::Mul {
    x * x
}
Which still doesn't work:
error[E0308]: mismatched types
--> gen2.rs:6:5
|
6 | x * x
| ^^^ expected type parameter, found associated type
|
= note: expected type `T`
= note: found type `<T as std::ops::Mul>::Output`
What rustc is saying is that the type of x*x is the associated type T::Output, not T.
There's actually no reason that the type of x*x
is the same as the type of x
, e.g. the dot product
of two vectors is a scalar.
fn sqr<T> (x: T) -> T::Output
where T: std::ops::Mul {
    x * x
}
and now the error is:
error[E0382]: use of moved value: `x`
--> gen2.rs:6:7
|
6 | x * x
| - ^ value used here after move
| |
| value moved here
|
= note: move occurs because `x` has type `T`, which does not implement the `Copy` trait
So, we need to constrain the type even further!
fn sqr<T> (x: T) -> T::Output
where T: std::ops::Mul + Copy {
    x * x
}
And that (finally) works. Calmly listening to the compiler will often get you closer to the magic point when ... things compile cleanly.
It is a bit simpler in C++:

template <typename T>
T sqr(T x) {
    return x * x;
}
but (to be honest) C++ is adopting cowboy tactics here. C++ template errors are famously bad, because all the compiler knows (ultimately) is that some operator or method is not defined. The C++ committee knows this is a problem and so they are working toward concepts, which are pretty much like trait-constrained type parameters in Rust.
Rust generic functions may look a bit overwhelming at first, but being explicit means you will know exactly what kind of values you can safely feed it, just by looking at the definition.
These functions are called monomorphic, in contrast to polymorphic. The body of the function is compiled separately for each unique type. With polymorphic functions, the same machine code works with each matching type, dynamically dispatching the correct method.
Monomorphization produces faster code,
specialized for the particular type, and can often be inlined. So when sqr(x)
is
seen, it's effectively replaced with x*x
. The downside is that large generic
functions produce a lot of code, for each type used, which can result in code bloat.
As always, there are trade-offs; an experienced person learns to make the right choice
for the job.
Simple Enums
Enums are types which have a few definite values. For instance, a direction has only four possible values.
enum Direction {
    Up,
    Down,
    Left,
    Right
}
...
// `start` is type `Direction`
let start = Direction::Left;
They can have methods defined on them, just like structs.
The match
expression is the basic way to handle enum
values.
impl Direction {
    fn as_str(&self) -> &'static str {
        match *self { // *self has type Direction
            Direction::Up => "Up",
            Direction::Down => "Down",
            Direction::Left => "Left",
            Direction::Right => "Right"
        }
    }
}
Punctuation matters. Note the * before self. It's easy to forget, because often
before self
. It's easy to forget, because often
Rust will assume it (we said self.first_name
, not (*self).first_name
). However,
matching is a more exact business. Leaving it out would give a whole spew of messages,
which boil down to this type mismatch:
= note: expected type `&Direction`
= note: found type `Direction`
This is because self
has type &Direction
, so we have to throw in the *
to
dereference the type.
Like structs, enums can implement traits, and our friend #[derive(Debug)]
can
be added to Direction
:
println!("start {:?}", start);
// start Left
So that as_str
method isn't really necessary, since we can always get the name from Debug
.
(But as_str
does not allocate, which may be important.)
You should not assume any particular ordering here - there's no implied integer 'ordinal' value.
Here's a method which defines the 'successor' of each Direction
value. The
very handy wildcard use temporarily puts the enum names into the method context:
fn next(&self) -> Direction {
    use Direction::*;
    match *self {
        Up => Right,
        Right => Down,
        Down => Left,
        Left => Up
    }
}
...
let mut d = start;
for _ in 0..8 {
    println!("d {:?}", d);
    d = d.next();
}
// d Left
// d Up
// d Right
// d Down
// d Left
// d Up
// d Right
// d Down
So this will cycle endlessly through the various directions in this particular, arbitrary, order. It is (in fact) a very simple state machine.
These enum values can't be compared:
assert_eq!(start, Direction::Left);
error[E0369]: binary operation `==` cannot be applied to type `Direction`
--> enum1.rs:42:5
|
42 | assert_eq!(start, Direction::Left);
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
note: an implementation of `std::cmp::PartialEq` might be missing for `Direction`
--> enum1.rs:42:5
The solution is to say #[derive(Debug,PartialEq)]
in front of enum Direction
.
This is an important point - Rust user-defined types start out fresh and unadorned.
You give them sensible default behaviours by implementing the common traits. This
applies also to structs - if you ask for Rust to derive PartialEq
for a struct it
will do the sensible thing, assume that all fields implement it and build up
a comparison. If this isn't so, or you want to redefine equality, then you are free
to define PartialEq
explicitly.
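For example, here is a sketch of an explicit implementation, reusing the Person struct from earlier. The "equal if last names match" rule is made up purely for illustration:

// equality determined by last name only, rather than the field-by-field
// comparison that #[derive(PartialEq)] would generate
impl PartialEq for Person {
    fn eq(&self, other: &Person) -> bool {
        self.last_name == other.last_name
    }
}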
Rust does 'C style enums' as well:
// enum2.rs

enum Speed {
    Slow = 10,
    Medium = 20,
    Fast = 50
}

fn main() {
    let s = Speed::Slow;
    let speed = s as u32;
    println!("speed {}", speed);
}
They are initialized with an integer value, and can be converted into that integer with a type cast.
You only need to give the first name a value, and thereafter the value goes up by one each time:
enum Difficulty {
    Easy = 1,
    Medium, // is 2
    Hard // is 3
}
By the way, 'name' is too vague, like saying 'thingy' all the time. The proper term here
is variant - Speed
has variants Slow
,Medium
and Fast
.
These enums do have a natural ordering, but you have to ask nicely.
After placing #[derive(PartialEq,PartialOrd)]
in front of enum Speed
, then it's indeed
true that Speed::Fast > Speed::Slow
and Speed::Medium != Speed::Slow
.
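A small sketch of the derive in action:

#[derive(PartialEq, PartialOrd)]
enum Speed {
    Slow = 10,
    Medium = 20,
    Fast = 50
}
...
assert!(Speed::Fast > Speed::Slow);
assert!(Speed::Medium != Speed::Slow);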
Enums in their Full Glory
Rust enums in their full form are like C unions on steroids, like a Ferrari compared to a Fiat Uno. Consider the problem of storing different values in a type-safe way.
// enum3.rs

#[derive(Debug)]
enum Value {
    Number(f64),
    Str(String),
    Bool(bool)
}

fn main() {
    use Value::*;

    let n = Number(2.3);
    let s = Str("hello".to_string());
    let b = Bool(true);

    println!("n {:?} s {:?} b {:?}", n, s, b);
}
// n Number(2.3) s Str("hello") b Bool(true)
Again, this enum can only contain one of these values; its size will be the size of the largest variant.
So far, not really a supercar, although it's cool that enums know how to print themselves
out. But they also know what kind of value they contain, and that is the
superpower of match
:
fn eat_and_dump(v: Value) {
    use Value::*;
    match v {
        Number(n) => println!("number is {}", n),
        Str(s) => println!("string is '{}'", s),
        Bool(b) => println!("boolean is {}", b)
    }
}
....
eat_and_dump(n);
eat_and_dump(s);
eat_and_dump(b);
//number is 2.3
//string is 'hello'
//boolean is true
(And that's what Option
and Result
are - enums.)
We like this eat_and_dump
function, but we want to pass the value as a reference, because currently
a move takes place and the value is 'eaten':
fn dump(v: &Value) {
    use Value::*;
    match *v {  // type of *v is Value
        Number(n) => println!("number is {}", n),
        Str(s) => println!("string is '{}'", s),
        Bool(b) => println!("boolean is {}", b)
    }
}

error[E0507]: cannot move out of borrowed content
  --> enum3.rs:12:11
   |
12 |     match *v {
   |           ^^ cannot move out of borrowed content
13 |     Number(n) => println!("number is {}",n),
14 |     Str(s) => println!("string is '{}'",s),
   |         - hint: to prevent move, use `ref s` or `ref mut s`
There are things you cannot do with borrowed references. Rust is not letting
you extract the string contained in the original value. It did not complain about Number
because it's happy to copy f64
, but String
does not implement Copy
.
I mentioned earlier that match
is picky about exact types;
here we follow the hint and things will work; now we are just borrowing a reference
to that contained string.
fn dump(v: &Value) {
    use Value::*;
    match *v {
        Number(n) => println!("number is {}", n),
        Str(ref s) => println!("string is '{}'", s),
        Bool(b) => println!("boolean is {}", b)
    }
}
....
dump(&s);
// string is 'hello'
Before we move on, filled with the euphoria of a successful Rust compilation, let's
pause a little. rustc
is unusually good at generating errors that have enough
context for a human to fix the error without necessarily understanding the error.
The issue is a combination of the exactness of matching, with the determination of the
borrow checker to foil any attempt to break the Rules. One of those Rules is that
you cannot yank out a value which belongs to some owning type. Some knowledge of
C++ is a hindrance here, since C++ will copy its way out of the problem, whether or not that copy even makes sense. You will get exactly the same error if you try to pull out
a string from a vector, say with *v.get(0).unwrap()
(*
because indexing returns references.)
It will simply not let you do this. (Sometimes clone
isn't such a bad solution to this.)
(By the way, v[0]
does not work for non-copyable values like strings for precisely this reason.
You must either borrow with &v[0]
or clone with v[0].clone()
)
As for match
, you can see Str(s) =>
as short for Str(s: String) =>
. A local variable
(often called a binding) is created. Often that inferred type is cool, when you
eat up a value and extract its contents. But here what we really need is s: &String, and the
ref
is a hint that ensures this: we just want to borrow that string.
Here we do want to extract that string, and don't care about
the enum value afterwards. _
as usual will match anything.
impl Value {
    fn to_str(self) -> Option<String> {
        match self {
            Value::Str(s) => Some(s),
            _ => None
        }
    }
}
...
println!("s? {:?}", s.to_str());
// s? Some("hello")
// println!("{:?}", s) // error! s has moved...
Naming matters - this is called to_str
, not as_str
. You can write a
method that just borrows that string as an Option<&String>
(The reference will need
the same lifetime as the enum value.) But you would not call it to_str
.
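For instance, a borrowing version might look like this (a sketch; the name as_str is our choice, not anything from the standard library):

impl Value {
    // borrows rather than consumes - hence as_str, not to_str
    fn as_str(&self) -> Option<&String> {
        match *self {
            Value::Str(ref s) => Some(s),
            _ => None
        }
    }
}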
You can write to_str
like this - it is completely equivalent:
fn to_str(self) -> Option<String> {
    if let Value::Str(s) = self {
        Some(s)
    } else {
        None
    }
}
More about Matching
Recall that the values of a tuple can be extracted with '()':
let t = (10, "hello".to_string());
...
let (n, s) = t;
// t has been moved. It is No More
// n is i32, s is String
This is a special case of destructuring; we have some data and wish to either pull it apart (like here) or just borrow its values. Either way, we get the parts of a structure.
The syntax is like that used in match
. Here
we are explicitly borrowing the values.
let (ref n, ref s) = t;
// n and s are borrowed from t. It still lives!
// n is &i32, s is &String
Destructuring works with structs as well:
struct Point {
    x: f32,
    y: f32
}

let p = Point{x: 1.0, y: 2.0};
...
let Point{x, y} = p;
// p still lives, since x and y can and will be copied
// both x and y are f32
Time to revisit match
with some new patterns. The first two patterns are exactly like let
destructuring - it only matches tuples with first element zero, but any string;
the second adds an if
so that it only matches (1,"hello")
.
Finally, just a variable matches anything. This is useful if the match
applies
to an expression and you don't want to bind a variable to that expression. _
works
like a variable but is ignored. It's a common
way to finish off a match
.
fn match_tuple(t: (i32, String)) {
    let text = match t {
        (0, s) => format!("zero {}", s),
        (1, ref s) if s == "hello" => format!("hello one!"),
        tt => format!("no match {:?}", tt),
        // or say _ => format!("no match") if you're not interested in the value
    };
    println!("{}", text);
}
Why not just match against (1,"hello")
? Matching is an exact business, and the compiler
will complain:
= note: expected type `std::string::String`
= note: found type `&'static str`
Why do we need ref s? It's a slightly obscure gotcha (look up the E0008 error): if you have an if guard you need to borrow, since the if guard happens in a different context and a move would take place otherwise. It's a case of the implementation leaking ever so slightly.
If the type was &str
then we match it directly:
match (42, "answer") {
    (42, "answer") => println!("yes"),
    _ => println!("no")
};
What applies to match
applies to if let
. This is a cool example, since if we
get a Some
, we can match inside it and only extract the string from the tuple. So it
isn't necessary to have nested if let
statements here. We use _
because we aren't interested
in the first part of the tuple.
let ot = Some((2, "hello".to_string()));

if let Some((_, ref s)) = ot {
    assert_eq!(s, "hello");
}
// we just borrowed the string, no 'destructive destructuring'
An interesting problem happens when using parse
(or any function which needs to work
out its return type from context)
if let Ok(n) = "42".parse() {
    ...
}
So what's the type of n
? You have to give a hint somehow - what kind of integer? Is it
even an integer?
if let Ok(n) = "42".parse::<i32>() {
    ...
}
This somewhat non-elegant syntax is called the 'turbofish operator'.
If you are in a function returning Result
, then the question-mark operator provides a much
more elegant solution:
let n: i32 = "42".parse()?;
However, the parse error needs to be convertible to the error type of the Result
, which is a topic
we'll take up later when discussing error handling.
Closures
A great deal of Rust's power comes from closures. In their simplest form, they act like shortcut functions:
let f = |x| x * x;

let res = f(10);

println!("res {}", res);
// res 100
There are no explicit types in this example - everything is deduced, starting with the integer literal 10.
We get an error if we call f
on different types - Rust has already decided that
f
must be called on an integer type:
let res = f(10);
let resf = f(1.2);
|
8 | let resf = f(1.2);
| ^^^ expected integral variable, found floating-point variable
|
= note: expected type `{integer}`
= note: found type `{float}`
So, the first call fixes the type of the argument x
. It's equivalent to this function:
fn f (x: i32) -> i32 {
    x * x
}
But there's a big difference between functions and closures, apart from the need for explicit typing. Here we evaluate a linear function:
let m = 2.0;
let c = 1.0;

let lin = |x| m*x + c;

println!("res {} {}", lin(1.0), lin(2.0));
// res 3 5
You cannot do this with the explicit fn
form - it does not know about variables
in the enclosing scope. The closure has borrowed m
and c
from its context.
Now, what's the type of lin
? Only rustc
knows.
Under the hood, a closure is a struct that is callable ('implements the call operator').
It behaves as if it was written out like this:
struct MyAnonymousClosure1<'a> {
    m: &'a f64,
    c: &'a f64
}

impl <'a> MyAnonymousClosure1<'a> {
    fn call(&self, x: f64) -> f64 {
        self.m * x + self.c
    }
}
The compiler is certainly being helpful, turning simple closure syntax into all that code! You do need to know that a closure is a struct and it borrows values from its environment. And that therefore it has a lifetime.
All closures are unique types, but they have traits in common. So even though we don't know the exact type, we know the generic constraint:
fn apply<F>(x: f64, f: F) -> f64
where F: Fn(f64)->f64  {
    f(x)
}
...
let res1 = apply(3.0, lin);
let res2 = apply(3.14, |x| x.sin());
In English: apply works for any type F such that F implements Fn(f64)->f64 - that is, it is a function which takes f64 and returns f64.
After the call to apply(3.0,lin)
, trying to access lin
gives an interesting error:
let l = lin;
error[E0382]: use of moved value: `lin`
--> closure2.rs:22:9
|
16 | let res = apply(3.0,lin);
| --- value moved here
...
22 | let l = lin;
| ^ value used here after move
|
= note: move occurs because `lin` has type
`[closure@closure2.rs:12:15: 12:26 m:&f64, c:&f64]`,
which does not implement the `Copy` trait
That's it, apply
ate our closure. And there's the actual type of the struct that
rustc
made up to implement it. Always thinking of closures as structs is helpful.
Calling a closure is a method call: the three kinds of function traits correspond to the three kinds of methods:
- Fn: struct passed as &self
- FnMut: struct passed as &mut self
- FnOnce: struct passed as self
So it's possible for a closure to mutate its captured references:
fn mutate<F>(mut f: F)
where F: FnMut() {
    f()
}
let mut s = "world";
mutate(|| s = "hello");
assert_eq!(s, "hello");
Note that mut
- f
needs to be mutable for this to work.
However, you cannot escape the rules for borrowing. Consider this:
let mut s = "world";

// closure does a mutable borrow of s
let mut changer = || s = "world";

changer();
// does an immutable borrow of s
assert_eq!(s, "world");
Can't be done! The error is that we cannot borrow s
in the assert statement, because it has been previously borrowed by the
closure changer
as mutable. As long as that closure lives, no other
code can access s
, so the solution is to control that lifetime by
putting the closure in a limited scope:
let mut s = "world";
{
    let mut changer = || s = "world";
    changer();
}
assert_eq!(s, "world");
At this point, if you are used to languages like JavaScript or Lua, you may wonder at the
complexity of Rust closures compared with how straightforward they are in those languages.
This is the necessary cost of Rust's promise to not sneakily make any allocations. In JavaScript,
the equivalent mutate(function() {s = "hello";})
will always result in a dynamically
allocated closure.
Sometimes you don't want a closure to borrow those variables, but instead move them.
let name = "dolly".to_string();
let age = 42;

let c = move || {
    println!("name {} age {}", name, age);
};
c();

println!("name {}", name);
And the error at the last println
is: "use of moved value: name
". So one solution
here - if we did want to keep name
alive - is to move a cloned copy into the closure:
let cname = name.to_string();
let c = move || {
    println!("name {} age {}", cname, age);
};
Why are moved closures needed? Because we might need to call them at a point where the original context no longer exists. A classic case is when creating a thread. A moved closure does not borrow, so does not have a lifetime.
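Here is a minimal sketch of the thread case, using std::thread::spawn:

use std::thread;

let name = "dolly".to_string();
// without `move` this would not compile - the new thread may outlive
// the scope that owns `name`, so the closure must take ownership of it
let handle = thread::spawn(move || {
    println!("hello {}", name);
});
handle.join().unwrap();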
A major use of closures is within iterator methods. Recall the range
iterator we
defined to go over a range of floating-point numbers. It's straightforward to operate
on this (or any other iterator) using closures:
let sine: Vec<f64> = range(0.0, 1.0, 0.1).map(|x| x.sin()).collect();
map
isn't defined on vectors (although it's easy enough to create a trait that does this),
because then every map will create a new vector. This way, we have a choice. In this
sum, no temporary objects are created:
let sum: f64 = range(0.0, 1.0, 0.1).map(|x| x.sin()).sum();
It will (in fact) be as fast as writing it out as an explicit loop! That performance guarantee would be impossible if Rust closures were as 'frictionless' as Javascript closures.
filter
is another useful iterator method - it only lets through values that match
a condition:
let tuples = [(10,"ten"), (20,"twenty"), (30,"thirty"), (40,"forty")];

let iter = tuples.iter().filter(|t| t.0 > 20).map(|t| t.1);

for name in iter {
    println!("{} ", name);
}
// thirty
// forty
The Three Kinds of Iterators
The three kinds correspond (again) to the three basic argument types. Assume we
have a vector of String
values. Here are the iterator types explicitly, and
then implicitly, together with the actual type returned by the iterator.
for s in vec.iter() {...} // &String
for s in vec.iter_mut() {...} // &mut String
for s in vec.into_iter() {...} // String

// implicit!
for s in &vec {...} // &String
for s in &mut vec {...} // &mut String
for s in vec {...} // String
Personally I prefer being explicit, but it's important to understand both forms, and their implications.
into_iter
consumes the vector and extracts its strings,
and so afterwards the vector is no longer available - it has been moved. It's
a definite gotcha for Pythonistas used to saying for s in vec
!
So the
implicit form for s in &vec
is usually the one you want, just as &T
is a good
default in passing arguments to functions.
It's important to understand how the three kinds work, because Rust relies heavily on type deduction - you won't often see explicit types in closure arguments. And this is a Good Thing, because it would be noisy if all those types were spelled out. However, the price of this compact code is that you need to know what the implicit types actually are!
map
takes whatever value the iterator returns and converts it into something else,
but filter
takes a reference to that value. In this case, we're using iter
so
the iterator item type is &String
. Note that filter
receives a reference to this type.
for n in vec.iter().map(|x: &String| x.len()) {...} // n is usize
....
for s in vec.iter().filter(|x: &&String| x.len() > 2) { // s is &String
...
}
When calling methods, Rust will dereference automatically, so the problem isn't obvious. But |x: &&String| x == "one" will not work, because operators are more strict about type matching. rustc will complain that there is no such operator that compares &&String and &str. So you need an explicit dereference to make that &&String into a &String which does match.
for s in vec.iter().filter(|x: &&String| *x == "one") {...}
// same as implicit form:
for s in vec.iter().filter(|x| *x == "one") {...}
If you leave out the explicit type, you can modify the argument so that the type of s
is now &String
:
for s in vec.iter().filter(|&x| x == "one")
And that's usually how you will see it written.
Structs with Dynamic Data
A most powerful technique is a struct that contains references to its own type.
Here is the basic building block of a binary tree, expressed in C (everyone's favourite old relative with a frightening fondness for using power tools without protection.)
struct Node {
    const char *payload;
    struct Node *left;
    struct Node *right;
};
You can not do this by directly including Node
fields, because then the size of
Node
depends on the size of Node
... it just doesn't compute. So we use pointers
to Node
structs, since the size of a pointer is always known.
If left
isn't NULL
, the Node
will have a left pointing to another node, and so on, more or less indefinitely.
Rust does not do NULL
(at least not safely) so it's clearly a job for Option
.
But you cannot just put a Node
in that Option
, because we don't know the size
of Node
(and so forth.) This is a job for Box
, since it contains an allocated
pointer to the data, and always has a fixed size.
So here's the Rust equivalent, using type
to create an alias:
type NodeBox = Option<Box<Node>>;

#[derive(Debug)]
struct Node {
    payload: String,
    left: NodeBox,
    right: NodeBox
}
(Rust is forgiving in this way - no need for forward declarations.)
And a first test program:
impl Node {
    fn new(s: &str) -> Node {
        Node{payload: s.to_string(), left: None, right: None}
    }

    fn boxer(node: Node) -> NodeBox {
        Some(Box::new(node))
    }

    fn set_left(&mut self, node: Node) {
        self.left = Self::boxer(node);
    }

    fn set_right(&mut self, node: Node) {
        self.right = Self::boxer(node);
    }
}

fn main() {
    let mut root = Node::new("root");
    root.set_left(Node::new("left"));
    root.set_right(Node::new("right"));

    println!("arr {:#?}", root);
}
The output is surprisingly pretty, thanks to "{:#?}" ('#' means 'extended'.)
root Node {
payload: "root",
left: Some(
Node {
payload: "left",
left: None,
right: None
}
),
right: Some(
Node {
payload: "right",
left: None,
right: None
}
)
}
Now, what happens when root
is dropped? All fields are dropped; if the 'branches' of
the tree are dropped, they drop their fields and so on. Box::new
may be the
closest you will get to a new
keyword, but we have no need for delete
or free
.
We must now work out a use for this tree. Note that strings can be ordered: 'bar' < 'foo', 'abba' > 'aardvark'; so-called 'alphabetical order'. (Strictly speaking, this is lexical order, since human languages are very diverse and have strange rules.)
Here is a method which inserts nodes in lexical order of the strings. We compare the new data
to the current node - if it's less, then we try to insert on the left, otherwise try to insert
on the right. There may be no node on the left, so then set_left
and so forth.
fn insert(&mut self, data: &str) {
    if data < &self.payload {
        match self.left {
            Some(ref mut n) => n.insert(data),
            None => self.set_left(Self::new(data)),
        }
    } else {
        match self.right {
            Some(ref mut n) => n.insert(data),
            None => self.set_right(Self::new(data)),
        }
    }
}
...
fn main() {
    let mut root = Node::new("root");
    root.insert("one");
    root.insert("two");
    root.insert("four");

    println!("root {:#?}", root);
}
Note the match
- we're pulling out a mutable reference to the box, if the Option
is Some
, and applying the insert
method. Otherwise, we need to create a new Node
for the left side and so forth. Box
is a smart pointer; note that no 'unboxing' was
needed to call Node
methods on it!
And here's the output tree:
root Node {
payload: "root",
left: Some(
Node {
payload: "one",
left: Some(
Node {
payload: "four",
left: None,
right: None
}
),
right: None
}
),
right: Some(
Node {
payload: "two",
left: None,
right: None
}
)
}
The strings that are 'less' than other strings get put down the left side, otherwise the right side.
Time for a visit. This is in-order traversal - we visit the left, do something on the node, and then visit the right.
fn visit(&self) {
    if let Some(ref left) = self.left {
        left.visit();
    }
    println!("'{}'", self.payload);
    if let Some(ref right) = self.right {
        right.visit();
    }
}
...
root.visit();
// 'four'
// 'one'
// 'root'
// 'two'
So we're visiting the strings in order! Please note the reappearance of ref
- if let
uses exactly the same rules as match
.
Generic Structs
Consider the previous example of a binary tree. It would be seriously irritating to
have to rewrite it for all possible kinds of payload.
So here's our generic Node
with its type parameter T
.
type NodeBox<T> = Option<Box<Node<T>>>;

#[derive(Debug)]
struct Node<T> {
    payload: T,
    left: NodeBox<T>,
    right: NodeBox<T>
}
The implementation shows the difference between the languages. The fundamental operation
on the payload is comparison, so T must be comparable with <
, i.e. implements PartialOrd
.
The type parameter must be declared in the impl
block with its constraints:
impl <T: PartialOrd> Node<T> {
    fn new(s: T) -> Node<T> {
        Node{payload: s, left: None, right: None}
    }

    fn boxer(node: Node<T>) -> NodeBox<T> {
        Some(Box::new(node))
    }

    fn set_left(&mut self, node: Node<T>) {
        self.left = Self::boxer(node);
    }

    fn set_right(&mut self, node: Node<T>) {
        self.right = Self::boxer(node);
    }

    fn insert(&mut self, data: T) {
        if data < self.payload {
            match self.left {
                Some(ref mut n) => n.insert(data),
                None => self.set_left(Self::new(data)),
            }
        } else {
            match self.right {
                Some(ref mut n) => n.insert(data),
                None => self.set_right(Self::new(data)),
            }
        }
    }
}

fn main() {
    let mut root = Node::new("root".to_string());
    root.insert("one".to_string());
    root.insert("two".to_string());
    root.insert("four".to_string());

    println!("root {:#?}", root);
}
So generic structs need their type parameter(s) specified
in angle brackets, like C++. Rust is usually smart enough to work out
that type parameter from context - it knows it has a Node<T>
, and knows
that its insert
method is passed T
. The first call of insert
nails
down T
to be String
. If any further calls are inconsistent it will complain.
But you do need to constrain that type appropriately!
Filesystem and Processes
Another look at Reading Files
At the end of Part 1, I showed how to read a whole file into a string. Naturally this isn't always such a good idea, so here is how to read a file line-by-line.
fs::File
implements io::Read
, which is the trait for anything readable.
This trait defines a read
method which will fill a slice of u8
with bytes -
this is the only required method of the trait, and you get some provided methods
for free, much like with Iterator
. You can use read_to_end
to fill a vector of
bytes with contents from the readable, and read_to_string
to fill a string - which
must be UTF-8 encoded.
This is a 'raw' read, with no buffering. For buffered reading there is the
io::BufRead
trait which gives us read_line
and a lines
iterator.
io::BufReader
will provide an implementation of io::BufRead
for any readable.
fs::File
also implements io::Write
.
The easiest way to make sure all these traits are visible is use std::io::prelude::*
.
use std::fs::File;
use std::io;
use std::io::prelude::*;

fn read_all_lines(filename: &str) -> io::Result<()> {
    let file = File::open(&filename)?;

    let reader = io::BufReader::new(file);

    for line in reader.lines() {
        let line = line?;
        println!("{}", line);
    }
    Ok(())
}
The let line = line?
may look a bit strange. The line
returned by the
iterator is actually an io::Result<String>
which we unwrap with ?
.
Because things can go wrong during this iteration - I/O errors, swallowing
a chunk of bytes that aren't UTF-8, and so forth.
lines
being an iterator, it is straightforward to read a file into a vector
of strings using collect
, or print out the line with line numbers using the
enumerate
iterator.
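A quick sketch of both uses, assuming we are inside a function returning io::Result<()> (so ? works) and filename is in scope, as in read_all_lines above:

// read the whole file into a vector of strings
// (collect gathers the io::Result<String> items into one Result)
let reader = io::BufReader::new(File::open(&filename)?);
let lines = reader.lines().collect::<io::Result<Vec<String>>>()?;
println!("{} lines", lines.len());

// or print each line with its line number
let reader = io::BufReader::new(File::open(&filename)?);
for (i, line) in reader.lines().enumerate() {
    println!("{:>4} {}", i + 1, line?);
}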
It isn't the most efficient way to read all the lines, however, because a new
string is allocated for each line. It is more efficient to use read_line
, although
more awkward. Note that the returned line includes the linefeed, which
can be removed using trim_right
.
let mut reader = io::BufReader::new(file);
let mut buf = String::new();
while reader.read_line(&mut buf)? > 0 {
    {
        let line = buf.trim_right();
        println!("{}", line);
    }
    buf.clear();
}
This results in far fewer allocations, because clearing that string does not free its allocated memory; once the string has enough capacity, no more allocations will take place.
This is one of those cases where we use a block to control a borrow. line
is
borrowed from buf
, and this borrow must finish before we modify buf
. Again,
Rust is trying to stop us doing something stupid, which is to access line
after
we've cleared the buffer. (The borrow checker can be restrictive sometimes.
Rust is due to get 'non-lexical lifetimes', where
it will analyze the code and see that line
isn't used after buf.clear()
.)
This isn't very pretty. I cannot give you a proper iterator that returns references to a buffer, but I can give you something that looks like an iterator.
First define a generic struct;
the type parameter R
is 'any type that implements Read'. It contains the reader
and the buffer which we are going to borrow from.
// file5.rs
use std::fs::File;
use std::io;
use std::io::prelude::*;

struct Lines<R> {
    reader: io::BufReader<R>,
    buf: String
}

impl <R: Read> Lines<R> {
    fn new(r: R) -> Lines<R> {
        Lines{reader: io::BufReader::new(r), buf: String::new()}
    }
    ...
}
Then the next
method. It returns an Option
- just like an iterator, when it
returns None
the iterator finishes. The returned type is a Result
because
read_line
might fail, and we never throw errors away. So if it fails, we wrap up its error in a Some. Otherwise, it may have read zero bytes,
which is the natural end of the file - not an error, just a None
.
At this point, the buffer contains the line with a linefeed ('\n') appended. Trim this away, and package up the string slice.
fn next<'a>(&'a mut self) -> Option<io::Result<&'a str>> {
    self.buf.clear();
    match self.reader.read_line(&mut self.buf) {
        Ok(nbytes) => if nbytes == 0 {
            None // no more lines!
        } else {
            let line = self.buf.trim_right();
            Some(Ok(line))
        },
        Err(e) => Some(Err(e))
    }
}
Now, note how the lifetimes work. We need an explicit lifetime because Rust will never
allow us to hand out borrowed string slices without knowing their lifetime. And here
we say that the lifetime of this borrowed string is within the lifetime of self
.
And this signature, with the lifetime, is incompatible with the interface of Iterator
.
But it's easy to see problems if it were compatible; consider collect
trying to make
a vector of these string slices. There's no way this could work, since they're all
borrowed from the same mutable string! (If you had read all the file into a string, then
the string's lines
iterator can return string slices because they are all borrowed from
distinct parts of the original string.)
The resulting loop is much cleaner, and the file buffering is invisible to the user.
fn read_all_lines(filename: &str) -> io::Result<()> {
    let file = File::open(&filename)?;

    let mut lines = Lines::new(file);
    while let Some(line) = lines.next() {
        let line = line?;
        println!("{}", line);
    }

    Ok(())
}
You can even write the loop like this, since the explicit match can pull out the string slice:
while let Some(Ok(line)) = lines.next() {
    println!("{}", line);
}
It's tempting, but you are throwing away a possible error here; this loop will silently stop whenever an error occurs. In particular, it will stop at the first place where Rust can't convert a line to UTF-8. Fine for casual code, bad for production code!
Writing To Files
We met the write!
macro when implementing Debug
- it also works with anything
that implements Write
. So here's another way of saying print!
:
let mut stdout = io::stdout();
...
write!(stdout, "answer is {}\n", 42).expect("write failed");
If an error is possible, you must handle it. It may not be
very likely but it can happen. It's usually fine, because if you
are doing file i/o you should be in a context where ?
works.
But there is a difference: print!
locks stdout for each write. This is usually
what you want for output, because without that locking multithreaded
programs can mix up that output in interesting ways. But if you are pumping out
a lot of text, then write!
is going to be faster.
For arbitrary files we need write!
. The
file is closed when out
is dropped at the end of write_out
, which is
both convenient and important.
// file6.rs
use std::fs::File;
use std::io;
use std::io::prelude::*;

fn write_out(f: &str) -> io::Result<()> {
    let mut out = File::create(f)?;
    write!(out, "answer is {}\n", 42)?;
    Ok(())
}

fn main() {
    write_out("test.txt").expect("write failed");
}
If you care about performance, you need to know that Rust files are unbuffered
by default. So each little write request goes straight to the OS, and this is
going to be significantly slower. I mention this because this default is different
from other programming languages, and could lead to the shocking discovery that Rust
can be left in the dust by scripting languages!
Just as with Read
and io::BufReader
, there is io::BufWriter
for
buffering any Write
.
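A minimal sketch of wrapping a file in a buffered writer - the write_many function and the loop are just for illustration:

use std::fs::File;
use std::io;
use std::io::prelude::*;

fn write_many(f: &str) -> io::Result<()> {
    // wrap the file so the many small writes go into a buffer
    let mut out = io::BufWriter::new(File::create(f)?);
    for i in 0..1000 {
        write!(out, "line {}\n", i)?;
    }
    Ok(()) // the buffer is flushed when `out` is dropped
}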
Files, Paths and Directories
Here is a little program for printing out the Cargo directory on a machine. The
simplest case is that it's '~/.cargo'. This is a Unix shell expansion,
so we use env::home_dir
because it's cross-platform. (It might fail, but a
computer without a home directory isn't going to be hosting Rust tools anyway.)
We then create a PathBuf
and use its push
method to build up the full file path from its components.
(This is much easier than fooling around with '/','' or whatever, depending on
the system.)
// file7.rs
use std::env;
use std::path::PathBuf;

fn main() {
    let home = env::home_dir().expect("no home!");
    let mut path = PathBuf::new();
    path.push(home);
    path.push(".cargo");
    if path.is_dir() {
        println!("{}", path.display());
    }
}
A PathBuf
is like String
- it owns a growable set of characters, but with methods
specialized to building up paths. Most of its functionality however comes from
the borrowed version Path
, which is like &str
. So, for instance, is_dir
is
a Path
method.
This might sound suspiciously like a form of inheritance, but the magic Deref
trait works differently. It works just like it does with String/&str
-
a reference to PathBuf
can be coerced into a reference to Path
.
('Coerce' is a strong word, but this really
is one of the few places where Rust does conversions for you.)
fn foo(p: &Path) {...}
...
let path = PathBuf::from(home);
foo(&path);
PathBuf
has an intimate relationship with OsString
, which represents strings we get
directly from the system. (There is a corresponding OsString/&OsStr
relationship.)
Such strings are not guaranteed to be representable as UTF-8! Real life is a complicated matter, particularly see the answer to 'Why are they so hard?'. To summarize, first there are years of ASCII legacy coding, and multiple special encodings for other languages. Second, human languages are complicated. For instance 'noël' is five Unicode code points!
It's true that most of the time
with modern operating systems file names will be Unicode (UTF-8 on the Unix side, UTF-16
for Windows), except when they're not. And Rust must handle that possibility
rigorously. For instance,
Path
has a method as_os_str
which returns a &OsStr
, but the to_str
method
returns an Option<&str>
. Not always possible!
People have trouble at this point because they have become too attached to 'string' and
'character' as the only necessary abstractions. As Einstein could have said, a programming language
has to be as simple as possible, but no simpler. A systems language needs a
String/&str
distinction (owned versus borrowed: this is also very convenient)
and if it wishes to standardize on Unicode strings then it needs another type to handle
text which isn't valid Unicode - hence OsString/&OsStr
. Notice that there aren't
any interesting string-like methods for these types, precisely because we don't know the
encoding.
But, people are used to processing filenames as if they were strings, which is why
Rust makes it easier to manipulate file paths using PathBuf
methods.
You can pop
to successively remove path components. Here we start with the
current directory of the program:
// file8.rs
use std::env;

fn main() {
    let mut path = env::current_dir().expect("can't access current dir");
    loop {
        println!("{}", path.display());
        if ! path.pop() {
            break;
        }
    }
}
// /home/steve/rust/gentle-intro/code
// /home/steve/rust/gentle-intro
// /home/steve/rust
// /home/steve
// /home
// /
Here's a useful variation. I have a program which searches for a configuration file,
and the rule is that it may appear in the current directory or in any of its parent directories.
So I create /home/steve/rust/config.txt
and start this program up in /home/steve/rust/gentle-intro/code
:
// file9.rs
use std::env;

fn main() {
    let mut path = env::current_dir().expect("can't access current dir");
    loop {
        path.push("config.txt");
        if path.is_file() {
            println!("gotcha {}", path.display());
            break;
        } else {
            path.pop();
        }
        if ! path.pop() {
            break;
        }
    }
}
// gotcha /home/steve/rust/config.txt
This is pretty much how git works when it wants to know what the current repo is.
The details about a file (its size, type, etc) are called its metadata. As always, there may be an error - not just 'not found' but also if we don't have permission to read this file.
// file10.rs
use std::env;
use std::path::Path;

fn main() {
    let file = env::args().skip(1).next().unwrap_or("file10.rs".to_string());
    let path = Path::new(&file);
    match path.metadata() {
        Ok(data) => {
            println!("type {:?}", data.file_type());
            println!("len {}", data.len());
            println!("perm {:?}", data.permissions());
            println!("modified {:?}", data.modified());
        },
        Err(e) => println!("error {:?}", e)
    }
}
// type FileType(FileType { mode: 33204 })
// len 488
// perm Permissions(FilePermissions { mode: 436 })
// modified Ok(SystemTime { tv_sec: 1483866529, tv_nsec: 600495644 })
The length of the file (in bytes) and modified time are straightforward to interpret.
(Note we may not be able to get this time!) The file type has methods is_dir
,
is_file
and is_symlink
.
permissions
is an interesting one. Rust strives to be cross-platform, and so it's
a case of the 'lowest common denominator'. In general, all you can query is whether
the file is read-only - the 'permissions' concept is extended in Unix and encodes
read/write/executable for user/group/others.
But, if you are not interested in Windows, then bringing in a platform-specific trait will give us at least the permission mode bits. (As usual, a trait only kicks in when it is visible.) Then, applying the program to its own executable gives:
use std::os::unix::fs::PermissionsExt;
...
println!("perm {:o}", data.permissions().mode());
// perm 755
(Note '{:o}' for printing out in octal)
(Whether a file is executable on Windows is determined by its extension. The executable
extensions are found in the PATHEXT
environment variable - '.exe','.bat' and so forth).
std::fs
contains a number of useful functions for working with files, such as copying or
moving files, making symbolic links and creating directories.
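A small sketch of a few of these (the backup directory and the backup_one function are hypothetical, and the ? assumes a function returning io::Result):

use std::fs;
use std::io;

fn backup_one(file: &str) -> io::Result<()> {
    fs::create_dir_all("backup")?;                         // like mkdir -p
    let bytes = fs::copy(file, format!("backup/{}", file))?; // returns bytes copied
    println!("copied {} bytes", bytes);
    Ok(())
}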
To find the contents of a directory, std::fs::read_dir
provides an iterator.
Here are all files with extension '.rs' and size greater than 1024 bytes:
fn dump_dir(dir: &str) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let data = entry.metadata()?;
        let path = entry.path();
        if data.is_file() {
            if let Some(ex) = path.extension() {
                if ex == "rs" && data.len() > 1024 {
                    println!("{} length {}", path.display(), data.len());
                }
            }
        }
    }
    Ok(())
}
// ./enum4.rs length 2401
// ./struct7.rs length 1151
// ./sexpr.rs length 7483
// ./struct6.rs length 1359
// ./new-sexpr.rs length 7719
Obviously read_dir
might fail (usually 'not found' or 'no permission'), but
also getting each new entry might fail (it's like the lines
iterator over a buffered
reader's contents). Plus, we might not be able to get the metadata corresponding to
the entry. A file might have no extension, so we have to check for that as well.
Why not just an iterator over paths? On Unix this is the way the opendir
system call works,
but on Windows you cannot iterate over a directory's contents without getting the
metadata. So this is a reasonably elegant compromise that allows cross-platform
code to be as efficient as possible.
You can be forgiven for feeling 'error fatigue' at this point. But please note that the errors always existed - it's not that Rust is inventing new ones. It's just trying hard to make it impossible for you to ignore them. Any operating system call may fail.
Languages like Java and Python throw exceptions; languages like Go and Lua return two values, where the first is the result and the second is the error: like Rust it is considered bad manners for library functions to raise errors. So there is a lot of error checking and early-returns from functions.
Rust uses Result
because it's either-or: you cannot get both a result and an error.
And the question-mark operator makes handling errors much cleaner.
Processes
A fundamental need is for programs to run programs, or to launch processes. Your program can spawn as many child processes as it likes, and as the name suggests they have a special relationship with their parent.
To run a program is straightforward using the Command
struct, which builds up
arguments to pass to the program:
use std::process::Command;

fn main() {
    let status = Command::new("rustc")
        .arg("-V")
        .status()
        .expect("no rustc?");

    println!("cool {} code {}", status.success(), status.code().unwrap());
}
// rustc 1.15.0-nightly (8f02c429a 2016-12-15)
// cool true code 0
So new
receives the name of the program (it will be looked up on PATH
if not
an absolute filename), arg
adds a new argument, and status
causes it to be run.
This returns a Result
, which is Ok
if the program actually ran, containing an
ExitStatus
. In this case, the program succeeded, and returned an exit code 0. (The
unwrap
is because we can't always get the code if the program was killed by
a signal).
If we change the -V
to -v
(an easy mistake) then rustc
fails:
error: no input filename given
cool false code 101
So there are three possibilities:
- program didn't exist, was bad, or we were not allowed to run it
- program ran, but was not successful - non-zero exit code
- program ran, with zero exit code. Success!
By default, the program's standard output and standard error streams go to the terminal.
Often we are very interested in capturing that output, so there's the output
method.
// process2.rs
use std::process::Command;

fn main() {
    let output = Command::new("rustc")
        .arg("-V")
        .output()
        .expect("no rustc?");

    if output.status.success() {
        println!("ok!");
    }
    println!("len stdout {} stderr {}", output.stdout.len(), output.stderr.len());
}
// ok!
// len stdout 44 stderr 0
As with status
our program blocks until the child process is finished, and we get
back three things - the status (as before), the contents of stdout and the contents
of stderr.
The captured output is simply Vec<u8>
- just bytes. Recall we have no guarantee
that data we receive from the operating system is a properly encoded UTF-8 string. In
fact, we have no guarantee that it even is a string - programs may return arbitrary
binary data.
If we are pretty sure the output is UTF-8, then String::from_utf8
will convert those
vectors of bytes - it returns a Result
because this conversion may not succeed.
A more sloppy function is String::from_utf8_lossy
which will make a good attempt at
conversion and insert the invalid Unicode mark � where it failed.
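For example, continuing from the previous program - rustc's version string should be valid UTF-8, so a lossy conversion is a reasonable sketch here:

let version = String::from_utf8_lossy(&output.stdout);
println!("rustc says '{}'", version.trim());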
Here is a useful function which runs a program using the shell. This uses the usual shell mechanism for joining stderr to stdout. The name of the shell is different on Windows, but otherwise things work as expected.
fn shell(cmd: &str) -> (String,bool) {
    let cmd = format!("{} 2>&1",cmd);
    let shell = if cfg!(windows) {"cmd.exe"} else {"/bin/sh"};
    let flag = if cfg!(windows) {"/c"} else {"-c"};
    let output = Command::new(shell)
        .arg(flag)
        .arg(&cmd)
        .output()
        .expect("no shell?");
    (
        String::from_utf8_lossy(&output.stdout).trim_right().to_string(),
        output.status.success()
    )
}

fn shell_success(cmd: &str) -> Option<String> {
    let (output,success) = shell(cmd);
    if success {Some(output)} else {None}
}
I'm trimming any whitespace from the right so that if you said shell("which rustc")
you will get the path without any extra linefeed.
You can control the execution of a program launched by Command
by specifying the directory it will run
in using the current_dir
method and the environment variables it sees using env
.
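As a sketch - the program, directory and environment variable here are made up:

let status = Command::new("make")
    .current_dir("/tmp/build")
    .env("VERBOSE", "1")
    .status()
    .expect("cannot run make");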
Up to now, our program simply waits for the child process to finish. If you use
the spawn
method then we return immediately, and must explicitly wait for it to
finish - or go off and do something else in the meantime! This example also
shows how to suppress both standard out and standard error:
// process5.rs
use std::process::{Command,Stdio};

fn main() {
    let mut child = Command::new("rustc")
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()
        .expect("no rustc?");

    let res = child.wait();
    println!("res {:?}", res);
}
By default, the child 'inherits' the standard input and output of the parent. In this case,
we redirect the child's output handles into 'nowhere'. It's equivalent to saying
> /dev/null 2> /dev/null
in the Unix shell.
Now, it's possible to do these things using the shell (sh
or cmd
) in Rust.
But this way you get full programmatic control of process creation.
For example, if we just had .stdout(Stdio::piped())
then the child's standard output
is redirected to a pipe. Then child.stdout
is something you can use to directly
read the output (i.e. implements Read
). Likewise, you can use the .stdin(Stdio::piped()) method so that you can write to child.stdin.
But if we used wait_with_output
instead of wait
then
it returns a Result<Output>
and the child's output is captured into the stdout
field of that Output
as a Vec<u8>
just as before.
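Here is a sketch that combines the two ideas - write into the child's standard input, then capture what it prints. It assumes a Unix-like system with sort on the PATH:

use std::io::Write;
use std::process::{Command,Stdio};

fn main() {
    let mut child = Command::new("sort")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()
        .expect("no sort?");

    // write some unsorted lines into the child's stdin
    child.stdin.as_mut().unwrap()
        .write_all(b"zebra\napple\nmango\n")
        .expect("write failed");

    // closes stdin, waits for the child, and captures its stdout
    let output = child.wait_with_output().expect("sort failed");
    println!("{}", String::from_utf8_lossy(&output.stdout));
}
// apple
// mango
// zebra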
The Child
struct also gives you an explicit kill
method.
Modules and Cargo
Modules
As programs get larger, it's necessary to spread them over more than one file and put functions and types in different namespaces. The Rust solution for both of these is modules.
C does the first, and not
the second, so you end up with awful names like primitive_display_set_width
and
so forth. The actual filenames can be named arbitrarily.
In Rust the full name would look like primitive::display::set_width
,
and after saying use primitive::display
you can then refer to it as display::set_width
.
You can even say use primitive::display::set_width
and then just say set_width
, but
it's not a good idea to get carried away with this. rustc
will not be confused, but you
may get confused later. But for this to work, filenames must follow some simple rules.
A new keyword mod
is used to define a module as a block
where Rust types or functions can be written:
mod foo {
    #[derive(Debug)]
    struct Foo {
        s: &'static str
    }
}

fn main() {
    let f = foo::Foo{s: "hello"};
    println!("{:?}", f);
}
But it's still not quite right - we get 'struct Foo is private'. To solve this, we need the pub
keyword
to export Foo
. The error then changes to 'field s of struct foo::Foo is private', so put pub
before the field s
to export Foo::s
. Then things will work.
pub struct Foo {
    pub s: &'static str
}
Needing an explicit pub
means that you must choose what items to make public from a module.
The set of functions and types exported from a module is called its interface.
It is usually better to hide the insides of a struct, and only allow access through methods:
mod foo {
    #[derive(Debug)]
    pub struct Foo {
        s: &'static str
    }

    impl Foo {
        pub fn new(s: &'static str) -> Foo {
            Foo{s: s}
        }
    }
}

fn main() {
    let f = foo::Foo::new("hello");
    println!("{:?}", f);
}
Why is hiding the implementation a good thing? Because it means you may change it later without breaking the interface, without consumers of a module getting too dependent on its details. The great enemy of large-scale programming is a tendency for code to get entangled, so that understanding a piece of code is impossible in isolation.
In a perfect world a module does one thing, does it well, and keeps its own secrets.
When not to hide? As Stroustrup says, when the interface is the implementation, like
struct Point{x: f32, y: f32}
.
Within a module, all items are visible to all other items. It's a cozy place where everyone can be friends and know intimate details about each other.
Everyone gets to a point where they want to break a program up into separate files, depending on taste. I start getting uncomfortable around 500 lines, but we all agree that more than 2000 lines is pushing it.
So how to break this program into separate files?
We put the foo
code into foo.rs
:
// foo.rs
#[derive(Debug)]
pub struct Foo {
    s: &'static str
}

impl Foo {
    pub fn new(s: &'static str) -> Foo {
        Foo{s: s}
    }
}
And use a mod foo
statement without a block in the main program:
// mod3.rs mod foo; fn main() { let f = foo::Foo::new("hello"); println!("{:?}", f); }
Now rustc mod3.rs
will cause foo.rs
to be compiled as well. There is no need to fool around
with makefiles!
The compiler will also look at MODNAME/mod.rs
, so this will work
if I create a directory boo
containing a file mod.rs
:
// boo/mod.rs
pub fn answer()->u32 {
    42
}
And now the main program can use both modules as separate files:
// mod3.rs
mod foo;
mod boo;
fn main() {
let f = foo::Foo::new("hello");
let res = boo::answer();
println!("{:?} {}", f,res);
}
So far, there's mod3.rs
, containing main
, a module foo.rs
and a directory boo
containing
mod.rs
. The usual convention is that the file containing main
is just called main.rs
.
Why two ways to do the same thing? Because boo/mod.rs
can refer to other modules defined in boo.
Update boo/mod.rs
and add a new module - note that this is explicitly exported. (Without the pub
,
bar
can only be seen inside the boo
module.)
// boo/mod.rs
pub fn answer()->u32 {
    42
}

pub mod bar {
    pub fn question() -> &'static str {
        "the meaning of everything"
    }
}
and then we have the question corresponding to the answer (the bar
module is inside boo
):
let q = boo::bar::question();
That module block can be pulled out as boo/bar.rs
:
// boo/bar.rs
pub fn question() -> &'static str {
    "the meaning of everything"
}
And boo/mod.rs
becomes:
// boo/mod.rs
pub fn answer()->u32 {
    42
}

pub mod bar;
In summary, modules are about organization and visibility, and this may or may not involve separate files.
Please note that use
has nothing to do with importing, and simply specifies visibility
of module names. For example:
{
    use boo::bar;
    let q = bar::question();
    ...
}
{
    use boo::bar::question;
    let q = question();
    ...
}
An important point to note is there is no separate compilation here. The main program and its
module files will be recompiled each time. Larger programs will take a fair amount of time to build,
although rustc
is getting better at incremental compilation.
Crates
The 'compilation unit' for Rust is the crate, which is either an executable or a library.
To separately compile the files from the last section,
first build foo.rs
as a Rust static library crate:
src$ rustc foo.rs --crate-type=lib
src$ ls -l libfoo.rlib
-rw-rw-r-- 1 steve steve 7888 Jan 5 13:35 libfoo.rlib
We can now link this into our main program:
src$ rustc mod4.rs --extern foo=libfoo.rlib
But the main program must now look like this, where the extern
name is the same
as the one used when linking. There is an implicit top-level module foo
associated
with the library crate:
// mod4.rs
extern crate foo;
fn main() {
let f = foo::Foo::new("hello");
println!("{:?}", f);
}
Before people start chanting 'Cargo! Cargo!' let me justify this lower-level look at building Rust. I'm a great believer in 'Know Thy Toolchain', and this will reduce the amount of new magic you need to learn when we look at managing projects with Cargo. Modules are basic language features and can be used outside Cargo projects.
It's time to understand why Rust binaries are so large:
src$ ls -lh mod4
-rwxrwxr-x 1 steve steve 3,4M Jan 5 13:39 mod4
That's rather fat! There is a lot of debug information in that executable. This is not a Bad Thing, if you want to use a debugger and actually want meaningful backtraces when your program panics.
So let's strip that debug information and see:
src$ strip mod4
src$ ls -lh mod4
-rwxrwxr-x 1 steve steve 300K Jan 5 13:49 mod4
Still feels a little large for something so simple, but this program links statically against
the Rust standard library. This is a Good Thing, since you can hand this executable to anyone
with the right operating system - they will not need a 'Rust runtime'. (And rustup
will even let
you cross-compile for other operating systems and platforms as well.)
We can link dynamically against the Rust runtime and get truly tiny exes:
src$ rustc -C prefer-dynamic mod4.rs --extern foo=libfoo.rlib
src$ ls -lh mod4
-rwxrwxr-x 1 steve steve 14K Jan 5 13:53 mod4
src$ ldd mod4
linux-vdso.so.1 => (0x00007fffa8746000)
libstd-b4054fae3db32020.so => not found
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3cd47aa000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3cd4d72000)
That 'not found' is because rustup
doesn't install the dynamic libraries globally. We
can hack our way to happiness, at least on Unix (yes, I know the best solution is a symlink.)
src$ export LD_LIBRARY_PATH=~/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib
src$ ./mod4
Foo { s: "hello" }
Rust does not have a philosophical problem with dynamic linking, in the same way as Go does. It's just that when there's a stable release every 6 weeks it becomes inconvenient to have to recompile everything. If you have a stable version that Works For You, then cool. As stable versions of Rust get increasingly delivered by the OS package manager, dynamic linking will become more popular.
Cargo
The Rust standard library is not very large compared to those of Java or Python, although it is much more fully featured than C or C++, which lean heavily on libraries provided by the operating system.
But it is straightforward to access community-provided libraries in crates.io using Cargo. Cargo will look up the correct version and download the source for you, and ensures that any other needed crates are downloaded as well.
Let's create a simple program which needs to read JSON. This data format is very widely used, but is too specialized for inclusion in the standard library. So we initialize a Cargo project, using '--bin' because the default is to create a library project.
test$ cargo init --bin test-json
Created binary (application) project
test$ cd test-json
test$ cat Cargo.toml
[package]
name = "test-json"
version = "0.1.0"
authors = ["Your Name <you@example.org>"]
[dependencies]
To make the project depend on the JSON crate, edit the 'Cargo.toml' file so:
[dependencies]
json="0.11.4"
Then do a first build with Cargo:
test-json$ cargo build
Updating registry `https://github.com/rust-lang/crates.io-index`
Downloading json v0.11.4
Compiling json v0.11.4
Compiling test-json v0.1.0 (file:///home/steve/c/rust/test/test-json)
Finished debug [unoptimized + debuginfo] target(s) in 1.75 secs
The main file of this project has already been created - it's 'main.rs' in the 'src' directory. It starts out just as a 'hello world' app, so let's edit it to be a proper test program.
Note the very convenient 'raw' string literal - otherwise we would need to escape those double quotes and end up with ugliness:
// test-json/src/main.rs
extern crate json;

fn main() {
    let doc = json::parse(r#"
    {
        "code": 200,
        "success": true,
        "payload": {
            "features": [
                "awesome",
                "easyAPI",
                "lowLearningCurve"
            ]
        }
    }
    "#).expect("parse failed");

    println!("debug {:?}", doc);
    println!("display {}", doc);
}
You can now build and run this project - only main.rs
has changed.
test-json$ cargo run
Compiling test-json v0.1.0 (file:///home/steve/c/rust/test/test-json)
Finished debug [unoptimized + debuginfo] target(s) in 0.21 secs
Running `target/debug/test-json`
debug Object(Object { store: [("code", Number(Number { category: 1, exponent: 0, mantissa: 200 }),
0, 1), ("success", Boolean(true), 0, 2), ("payload", Object(Object { store: [("features",
Array([Short("awesome"), Short("easyAPI"), Short("lowLearningCurve")]), 0, 0)] }), 0, 0)] })
display {"code":200,"success":true,"payload":{"features":["awesome","easyAPI","lowLearningCurve"]}}
The debug output shows some internal details of the JSON document, but a
plain '{}', using the Display
trait, regenerates JSON from the parsed document.
Let's explore the JSON API.
It would not be useful if we could not extract values. The as_TYPE
methods
return Option<TYPE>
since we cannot be sure that the field exists or is of the correct type.
(see the docs for JsonValue)
let code = doc["code"].as_u32().unwrap_or(0);
let success = doc["success"].as_bool().unwrap_or(false);

assert_eq!(code, 200);
assert_eq!(success, true);

let features = &doc["payload"]["features"];
for v in features.members() {
    println!("{}", v.as_str().unwrap()); // MIGHT explode
}
// awesome
// easyAPI
// lowLearningCurve
features
here is a reference to JsonValue
- it has to be a reference because otherwise
we would be trying to move a value out of the JSON document. Here we know it's an array,
so members()
will return a non-empty iterator over &JsonValue
.
What if the 'payload' object didn't have a 'features' key? Then features
would be set to Null
.
There will be no explosion. This convenience expresses the free-form, anything-goes nature of JSON
very well. It is up to you to examine the structure of any document you receive and create your own
errors if the structure does not match.
You can modify these structures. If we had let mut doc
then this would do what you expect:
let features = &mut doc["payload"]["features"];
features.push("cargo!").expect("couldn't push");
The push
will fail if features
wasn't an array, hence it returns Result<()>
.
Here's a truly beautiful use of macros to generate JSON literals:
let data = object!{
    "name"    => "John Doe",
    "age"     => 30,
    "numbers" => array![10,53,553]
};
assert_eq!(
    data.dump(),
    r#"{"name":"John Doe","age":30,"numbers":[10,53,553]}"#
);
For this to work, you need to explicitly import macros from the JSON crate thus:
#[macro_use]
extern crate json;
There is a downside to using this crate, because of the mismatch between the amorphous, dynamically-typed nature of JSON and the structured, static nature of Rust. (The readme explicitly speaks of 'friction') So if you did want to map JSON to Rust data structures, you would end up doing a lot of checking, because you can not assume that the received structure matches your structs! For that, a better solution is serde_json where you serialize Rust data structures into JSON and deserialize JSON into Rust.
For this, create another Cargo binary project with cargo new --bin test-serde-json
, go into
the test-serde-json
directory and edit Cargo.toml
. Edit it like so:
[dependencies]
serde="0.9"
serde_derive="0.9"
serde_json="0.9"
And edit src/main.rs
to be this:
#[macro_use]
extern crate serde_derive;
extern crate serde_json;

#[derive(Serialize, Deserialize, Debug)]
struct Person {
    name: String,
    age: u8,
    address: Address,
    phones: Vec<String>,
}

#[derive(Serialize, Deserialize, Debug)]
struct Address {
    street: String,
    city: String,
}

fn main() {
    let data = r#" {
        "name": "John Doe", "age": 43,
        "address": {"street": "main", "city": "Downtown"},
        "phones": ["27726550023"]
    } "#;
    let p: Person = serde_json::from_str(data).expect("deserialize error");
    println!("Please call {} at the number {}", p.name, p.phones[0]);

    println!("{:#?}",p);
}
You have seen the derive
attribute before, but the serde_derive
crate defines custom derives
for the special Serialize
and Deserialize
traits. And the result shows the generated Rust struct:
Please call John Doe at the number 27726550023
Person {
name: "John Doe",
age: 43,
address: Address {
street: "main",
city: "Downtown"
},
phones: [
"27726550023"
]
}
Now, if you did this using the json
crate, you would need a few hundred lines of custom conversion
code, mostly error handling. Tedious, easy to mess up, and not where you want to put effort anyway.
This is clearly the best solution if you are processing well-structured JSON from outside sources (it's
possible to remap field names if needed) and provides a robust way for Rust programs to share data
with other programs over the network (since everything understands JSON these days.) The cool thing
about serde
(for SERialization DEserialization) is that other file formats are also supported, such
as toml
, which is the popular configuration-friendly format used in Cargo. So your program can read .toml
files into structs, and write those structs out as .json.
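Here is a minimal sketch of that round trip. It assumes the toml crate (which has serde support) is also listed under [dependencies]; the Config struct and its fields are made up for the example:

#[macro_use]
extern crate serde_derive;
extern crate serde_json;
extern crate toml;

#[derive(Serialize, Deserialize, Debug)]
struct Config {
    name: String,
    port: u16,
}

fn main() {
    let text = r#"
        name = "demo"
        port = 8080
    "#;
    // read the TOML text into a struct...
    let config: Config = toml::from_str(text).expect("bad toml");
    println!("{:#?}", config);
    // ...and write the same struct out as JSON
    println!("{}", serde_json::to_string(&config).expect("to json"));
}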
Serialization is an important technique and similar solutions exist for Java and Go - but with a big difference. In those languages the structure of the data is found at run-time using reflection, but in this case the serialization code is generated at compile-time - altogether more efficient!
Cargo is considered to be one of the great strengths of the Rust ecosystem, because it does a lot of work for us. Otherwise we would have had to download these libraries from Github, build as static library crates, and link them against the program. It's painful to do this for C++ projects, and would be nearly as painful for Rust projects if Cargo did not exist. C++ is somewhat unique in its painfulness here, so we should compare this with other languages' package managers. npm (for JavaScript) and pip (for Python) manage dependencies and downloads for you, but the distribution story is harder, since the user of your program needs NodeJS or Python installed. But Rust programs are statically linked against their dependencies, so again they can be handed out to your buddies without external dependencies.
More Gems
When processing anything except simple text, regular expressions make your life much easier. These are commonly available for most languages and I'll here assume a basic familiarity with regex notation. To use the regex crate, put 'regex = "0.2.1"' after "[dependencies]" in your Cargo.toml.
We'll use 'raw strings' again so that the backslashes don't have to be escaped. In English, this regular expression is "match exactly two digits, the character ':', and then any number of digits. Capture both sets of digits":
extern crate regex;
use regex::Regex;

let re = Regex::new(r"(\d{2}):(\d+)").unwrap();
println!("{:?}", re.captures(" 10:230"));
println!("{:?}", re.captures("[22:2]"));
println!("{:?}", re.captures("10:x23"));
// Some(Captures({0: Some("10:230"), 1: Some("10"), 2: Some("230")}))
// Some(Captures({0: Some("22:2"), 1: Some("22"), 2: Some("2")}))
// None
The successful output actually has three captures - the whole match, and the two sets of digits. These regular expressions are not anchored by default, so regex will hunt for the first occurring match, skipping anything that doesn't match. (If you left out the '()' it would just give us the whole match.)
It's possible to name those captures, and spread the regular expression over several lines, even including comments! Compiling the regex might fail (the first expect) or the match might fail (the second expect). Here we can use the result as an associative array and look up captures by name.
let re = Regex::new(r"(?x)
(?P<year>\d{4})  # the year
-
(?P<month>\d{2}) # the month
-
(?P<day>\d{2})   # the day
").expect("bad regex");
let caps = re.captures("2010-03-14").expect("match failed");

assert_eq!("2010", &caps["year"]);
assert_eq!("03", &caps["month"]);
assert_eq!("14", &caps["day"]);
Regular expressions can break up strings that match a pattern, but won't check whether they make sense. That is, you can specify and match the syntax of ISO-style dates, but semantically they may be nonsense, like "2014-24-52".
For this, you need dedicated date-time processing, which is provided by chrono. You need to decide on a time zone when doing dates:
extern crate chrono;
use chrono::*;

fn main() {
    let date = Local.ymd(2010,3,14);
    println!("date was {}", date);
}
// date was 2010-03-14+02:00
However, this isn't recommended because feeding it bad dates will cause a panic! (try a bogus date
to see this.) The method you need here is ymd_opt
which returns LocalResult<Date>
let date = Local.ymd_opt(2010,3,14);
println!("date was {:?}", date);
// date was Single(2010-03-14+02:00)

let date = Local.ymd_opt(2014,24,52);
println!("date was {:?}", date);
// date was None
You can also directly parse date-times, either in standard UTC format or using custom formats. These self-same formats allow you to print out dates in exactly the format you want.
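For instance, a small sketch using an explicit format string (NaiveDate carries no time zone, which keeps the example simple):

extern crate chrono;
use chrono::NaiveDate;

fn main() {
    let date = NaiveDate::parse_from_str("2010-03-14", "%Y-%m-%d")
        .expect("bad date");
    println!("{}", date.format("%d %B %Y"));
    // 14 March 2010

    // parsing returns a Result, so nonsense dates are errors, not panics
    let bad = NaiveDate::parse_from_str("2014-24-52", "%Y-%m-%d");
    println!("{}", bad.is_err());
    // true
}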
I specifically highlighted these two useful crates because they would be part of the standard library in most other languages. And, in fact, the embryonic form of these crates was once part of the Rust stdlib, but were cut loose. This was a deliberate decision: the Rust team takes stdlib stability very seriously so features only arrive in stable once they have gone through incubation in unstable nightly versions, and only then beta and stable. For libraries that need experimentation and refinement, it's much better that they remain independent and get tracked with Cargo. For all practical purposes, these two crates are standard - they are not going away and may be folded back into the stdlib at some point.
Standard Library Containers
Reading the Documentation
In this section I'll briefly introduce some common parts of the Rust standard library. The documentation is excellent but a little context and a few examples is always useful.
Initially, reading the Rust documentation can be challenging, so I'll go through Vec
as an
example. A useful tip is to tick the '[-]' box to collapse the docs. (If you download the
standard library source using rustup component add rust-src
a '[src]' link will appear next to this.)
This gives you a bird's eye view of all the available methods.
The first point to notice is that not all possible methods are defined on Vec
itself. They are (mostly)
mutable methods that change the vector, e.g. push
. Some methods are only implemented for vectors where
the type matches some constraint. For example, you can only call dedup
(remove duplicates) if the
type is indeed something that can be compared for equality. There are multiple impl
blocks that
define Vec
for different type constraints.
Then there's the very special relationship between Vec<T>
and &[T]
. Any method that works on
slices will also directly work on vectors, without explicitly having to use the as_slice
method.
This relationship is expressed by Deref<Target=[T]>
. This also kicks in when you pass a vector by
reference to something that expects a slice - this is one of the few places where
a conversion between types happens automatically. So slice methods like first
, which maybe-returns
a reference to the first element, or last
, work for vectors as well. Many of the methods are similar
to the corresponding string methods, so there's split_at
for getting a pair of slices split at an index,
starts_with
to check whether a vector starts with sequence of values, and contains
to check whether
a vector contains a particular value.
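For instance:

let v = vec![10,20,30,40,50];
assert_eq!(v.first(), Some(&10));
assert_eq!(v.last(), Some(&50));
assert!(v.starts_with(&[10,20]));
assert!(v.contains(&30));

let (left, right) = v.split_at(2);
assert_eq!(left, &[10,20]);
assert_eq!(right, &[30,40,50]);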
There's no search
method for finding the index of a particular value, but here's a rule of thumb:
if you can't find a method on the container, look for a method on the iterator:
let v = vec![10,20,30,40,50];
assert_eq!(v.iter().position(|&i| i == 30).unwrap(), 2);
(The &
is because this is an iterator over references - alternatively you could say *i == 30
for
the comparison.)
Likewise, there's no map
method on vectors, because iter().map(...).collect()
will do the job
just as well. Rust does not like to allocate unnecessarily - often you don't need the result of that map
as an actual allocated vector.
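For example:

let v = vec![10,20,30];
let doubled: Vec<i32> = v.iter().map(|x| x * 2).collect();
assert_eq!(doubled, [20,40,60]);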
So I'd suggest you become familiar with all the iterator methods, because they are crucial to writing good Rust code without having to write loops out all the time. As always, write little programs to explore iterator methods, rather than wrestling with them in the context of a more complicated program.
The Vec<T>
and &[T]
methods are followed by the common traits: vectors know how to do a debug display of themselves
(but only if the elements implement Debug
). Likewise, they are clonable if their elements are clonable.
They implement Drop
, which happens when vectors get to finally die; memory is released,
and all the elements are dropped as well.
The Extend
trait means values from iterators can be added to a vector without a loop:
v.extend([60,70,80].iter());
let mut strings = vec!["hello".to_string(), "dolly".to_string()];
strings.extend(["you","are","fine"].iter().map(|s| s.to_string()));
There's also FromIterator
, which lets vectors be constructed from iterators. (The iterator collect
method leans on this.)
Any container needs to be iterable as well. Recall that there are three kinds of iterators
for x in v {...} // returns T, consumes v
for x in &v {...} // returns &T
for x in &mut v {...} // returns &mut T
The for
statement relies on the IntoIterator
trait, and there's indeed three implementations.
Then there is indexing, controlled by Index
(reading from a vector) and IndexMut
(modifying a
vector.) There are many possibilities, because there's slice indexing as well, like v[0..2], which returns a slice, as well as plain v[0], which returns a reference to the first element.
There's a few implementations of the From
trait. For instance Vec::from("hello".to_string())
will give you a vector of the underlying bytes of the string Vec<u8>
.
Now, there's already a method into_bytes
on String
, so why the redundancy?
It seems confusing to have multiple ways of doing the same thing.
But it's needed because explicit traits make generic methods possible.
Sometimes limitations of the Rust type system make things clumsy. An example here is how PartialEq
is separately defined for arrays up to size 32! (This will get better.) This allows the convenience
of directly comparing vectors with arrays, but beware the size limit.
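For example:

let v = vec![1,2,3];
assert_eq!(v, [1,2,3]);     // a vector compared directly with an array
assert_eq!(v[1..], [2,3]);  // and a slice of it compared with an array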
And there are Hidden Gems buried
deep in the documentation. As Karol Kuczmarski says "Because let’s be honest: no one scrolls that far".
How does one handle errors in an iterator? Say you map over some operation that might
fail and so returns Result
, and then want to collect the results:
fn main() {
    let nums = ["5","52","65"];
    let iter = nums.iter().map(|s| s.parse::<i32>());
    let converted: Vec<_> = iter.collect();
    println!("{:?}",converted);
}
//[Ok(5), Ok(52), Ok(65)]
Fair enough, but now you have to unwrap these errors - carefully!
But Rust already knows how to do the Right Thing,
if you ask for the vector to be contained in a Result
- that is,
either is a vector or an error:
let converted: Result<Vec<_>,_> = iter.collect();
//Ok([5, 52, 65])
And if there was a bad conversion? Then you would just get Err
with the first
error encountered. It's a good example of how extremely flexible collect
is.
(The notation here can be intimidating - Vec<_> means "this is a vector, work out the actual type for me", and Result<Vec<_>,_> is furthermore asking Rust to work out the error type as well.)
So there's a lot of detail in the documentation.
But it's certainly clearer than what the C++ docs say about std::vector
The requirements that are imposed on the elements depend on the actual operations performed on the container. Generally, it is required that element type is a complete type and meets the requirements of Erasable, but many member functions impose stricter requirements.
With C++, you're on your own. The explicitness of Rust is daunting at first, but as you learn to
read the constraints you will know exactly what any particular method of Vec
requires.
I would suggest that you get the source using rustup component add rust-src
, since the
standard library source is very readable and the method implementations are usually less scary than the
method declarations.
Maps
Maps (sometimes called associative arrays or dicts) let you look up values associated with a key. It's not really a fancy concept, and can be done with an array of tuples:
let entries = [("one","eins"),("two","zwei"),("three","drei")];

if let Some(val) = entries.iter().find(|t| t.0 == "two") {
    assert_eq!(val.1,"zwei");
}
This is fine for small maps and just requires equality to be defined for the keys, but the search takes linear time - proportional to the size of the map.
A HashMap
does much better when there are a lot of key/value pairs to be
searched:
use std::collections::HashMap;

let mut map = HashMap::new();
map.insert("one","eins");
map.insert("two","zwei");
map.insert("three","drei");

assert_eq!(map.contains_key("two"), true);
assert_eq!(map.get("two"), Some(&"zwei"));
&"zwei"
? This is because get
returns a reference to the value, not the value
itself. Here the value type is &str
, so we get a &&str
. In general it has to be
a reference, because we can't just move a value out of its owning type.
get_mut
is like get
but returns a possible mutable reference. Here we have
a map from strings to integers, and wish to update the value for the key 'two':
let mut map = HashMap::new();
map.insert("one",1);
map.insert("two",2);
map.insert("three",3);

println!("before {}", map.get("two").unwrap());

{
    let mut mref = map.get_mut("two").unwrap();
    *mref = 20;
}

println!("after {}", map.get("two").unwrap());
// before 2
// after 20
Note that getting that writable reference takes place in its own block - otherwise,
we would have a mutable borrow lasting until the end, and then Rust won't allow
you to borrow from map
again with map.get("two")
; it cannot allow any readable
references while there's already a writable reference in scope. (If it did, it could
not guarantee that those readable references would remain valid.)
So the solution is to make
sure that mutable borrow doesn't last very long.
It is not the most elegant API possible, but we can't throw away any possible
errors. Python would bail out with an exception, and C++ would just create
a default value. (This is convenient but sneaky; easy to forget that the price
of a_map["two"]
always returning an integer is that we can't tell the difference
between zero and 'not found', plus an extra entry is created!)
And no-one just calls unwrap
, except in examples. However, most Rust code you see consists
of little standalone examples! Much more likely for a match to take place:
if let Some(v) = map.get("two") {
    let res = v + 1;
    assert_eq!(res, 3);
}
...
match map.get_mut("two") {
    Some(mref) => *mref = 20,
    None => panic!("_now_ we can panic!")
}
We can iterate over the key/value pairs, but not in any particular order.
for (k,v) in map.iter() {
    println!("key {} value {}", k,v);
}
// key one value eins
// key three value drei
// key two value zwei
There are also keys
and values
methods returning iterators over the keys and
values respectively, which makes creating vectors of values easy.
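For instance, with the string map from above:

let keys: Vec<_> = map.keys().collect();
let values: Vec<_> = map.values().collect();
println!("{:?}", keys);   // ["one", "three", "two"] - in no particular order
println!("{:?}", values); // ["eins", "drei", "zwei"]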
Example: Counting Words
An entertaining thing to do with text is count word frequency. It is straightforward
to break text into words with split_whitespace
, but really we must respect punctuation.
So the words should be defined as consisting only of alphabetic characters.
And the words need to be compared as lower-case as well.
Doing a mutable lookup on a map is straightforward, but also handling the case where the lookup fails is a little awkward. Fortunately there's an elegant way to update the values of a map:
let mut map = HashMap::new();

for s in text.split(|c: char| ! c.is_alphabetic()) {
    let word = s.to_lowercase();
    let mut count = map.entry(word).or_insert(0);
    *count += 1;
}
If there's no existing count corresponding to a word, then let's create a new entry containing zero for that word and insert it into the map. It's exactly what a C++ map does, except it's done explicitly and not sneakily.
There is exactly one explicit type in this snippet, and that's the char
needed
because of a quirk of the string Pattern
trait used by split
.
But we can deduce that the key type is String
and the value type is i32
.
Using The Adventures of Sherlock Holmes
from Project Gutenberg, we can test this out
more thoroughly. The total number of unique words (map.len()
) is 8071.
How to find the twenty most common words? First, convert the map into a vector
of (key,value) tuples. (This consumes the map, since we used into_iter
.)
let mut entries: Vec<_> = map.into_iter().collect();
Next we can sort in descending order. sort_by
expects the result of the cmp
method that comes from the Ord
trait, which is implemented by the
integer value type:
entries.sort_by(|a,b| b.1.cmp(&a.1));
And finally print out the first twenty entries:
for e in entries.iter().take(20) {
    println!("{} {}", e.0, e.1);
}
(Well, you could just loop over 0..20
and index the vector here - it isn't wrong,
just a little un-idiomatic - and potentially more expensive for big iterations.)
38765
the 5810
and 3088
i 3038
to 2823
of 2778
a 2701
in 1823
that 1767
it 1749
you 1572
he 1486
was 1411
his 1159
is 1150
my 1007
have 929
with 877
as 863
had 830
A little surprise - what's that empty word? It is because split
works on single-character
delimiters, so any punctuation or extra spaces causes a new split.
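One way to avoid counting that empty word is to filter out empty strings before they reach the map - a small variation on the loop above:

for s in text.split(|c: char| ! c.is_alphabetic()).filter(|s| ! s.is_empty()) {
    let word = s.to_lowercase();
    *map.entry(word).or_insert(0) += 1;
}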
Sets
Sets are maps where you care only about the keys, not any associated values.
So insert
only takes one value, and you use contains
for testing whether a value
is in a set.
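For example:

use std::collections::HashSet;

let mut set = HashSet::new();
set.insert("apple");
set.insert("orange");
set.insert("apple"); // inserting a duplicate has no effect
assert!(set.contains("apple"));
assert_eq!(set.len(), 2);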
Like all containers, you can create a HashSet
from an iterator. And this
is exactly what collect
does, once you have given it the necessary type hint.
// set1.rs
use std::collections::HashSet;

fn make_set(words: &str) -> HashSet<&str> {
    words.split_whitespace().collect()
}

fn main() {
    let fruit = make_set("apple orange pear orange");

    println!("{:?}", fruit);
}
// {"orange", "pear", "apple"}
Note (as expected) that repeated insertions of the same key have no effect, and the order of values in a set is not important.
They would not be sets without the usual operations:
let fruit = make_set("apple orange pear");
let colours = make_set("brown purple orange yellow");

for c in fruit.intersection(&colours) {
    println!("{:?}",c);
}
// "orange"
They all create iterators, and you can use collect
to make these into sets.
Here's a shortcut, just as we defined for vectors:
use std::hash::Hash;

trait ToSet<T> {
    fn to_set(self) -> HashSet<T>;
}

impl <T,I> ToSet<T> for I
where T: Eq + Hash, I: Iterator<Item=T> {

    fn to_set(self) -> HashSet<T> {
        self.collect()
    }
}
...
let intersect = fruit.intersection(&colours).to_set();
As with all Rust generics, you do need to constrain types - this can only be
implemented for types that understand equality (Eq
) and for which a 'hash function'
exists (Hash
). Remember that there is no type called Iterator
, so I
represents
any type that implements Iterator
.
This technique for implementing our own methods on standard library types may appear to be a little too powerful, but again, there are Rules. We can only do this for our own traits. If both the struct and the trait came from the same crate (particularly, the stdlib) then such an implementation would not be allowed. In this way, you are saved from creating confusion.
Before congratulating ourselves on such a clever, convenient shortcut, you should be
aware of the consequences. If make_set
was written like this, so that these are sets
of owned strings, then the actual type of intersect
could come as a surprise:
fn make_set(words: &str) -> HashSet<String> {
    words.split_whitespace().map(|s| s.to_string()).collect()
}
...
// intersect is HashSet<&String>!
let intersect = fruit.intersection(&colours).to_set();
And it cannot be otherwise, since Rust will not suddenly start making copies of owned
strings. intersect
contains a single &String
borrowed from fruit
. I can promise
you that this will cause you trouble later, when you start patching up lifetimes!
A better solution is to use the iterator's cloned
method to make owned string copies
of the intersection.
// intersect is HashSet<String> - much better
let intersect = fruit.intersection(&colours).cloned().to_set();
A more robust definition of to_set
might be self.cloned().collect()
,
which I invite you to try out.
Example: Interactive command processing
It's often useful to have an interactive session with a program. Each line is read in and split into words; the command is looked up on the first word, and the rest of the words are passed as an argument to that command.
A natural implementation is a map from command names to closures. But how do we store closures, given that they will all have different sizes? Boxing them will move them onto the heap:
Here's a first try:
let mut v = Vec::new();
v.push(Box::new(|x| x * x));
v.push(Box::new(|x| x / 2.0));

for f in v.iter() {
    let res = f(1.0);
    println!("res {}", res);
}
We get a very definite error on the second push:
= note: expected type `[closure@closure4.rs:4:21: 4:28]`
= note: found type `[closure@closure4.rs:5:21: 5:28]`
note: no two closures, even if identical, have the same type
rustc
has deduced a type which is too specific, so it's necessary to force that
vector to have the boxed trait type before things just work:
let mut v: Vec<Box<Fn(f64)->f64>> = Vec::new();
We can now use the same trick and keep these boxed closures in a HashMap
. We still
have to watch out for lifetimes, since closures can borrow from their environment.
It's tempting at first to make them FnMut
- that is, they can modify any captured variables. But we will
have more than one command, each with its own closure, and you cannot then mutably borrow
the same variables.
So the closures are passed a mutable reference as an argument, plus
a slice of string slices (&[&str]
) representing the command arguments.
They will return some Result
- We'll use String
errors at first.
D
is the data type, which can be anything with a size.
type CliResult = Result<String,String>;

struct Cli<'a,D> {
    data: D,
    callbacks: HashMap<String, Box<Fn(&mut D,&[&str])->CliResult + 'a>>
}

impl <'a,D: Sized> Cli<'a,D> {
    fn new(data: D) -> Cli<'a,D> {
        Cli{data: data, callbacks: HashMap::new()}
    }

    fn cmd<F>(&mut self, name: &str, callback: F)
    where F: Fn(&mut D, &[&str])->CliResult + 'a {
        self.callbacks.insert(name.to_string(),Box::new(callback));
    }
cmd
is passed a name and any closure that matches our signature, which is boxed
and entered into the map. Fn
means that our closures borrow their environment
but can't modify it. It's one of those generic methods where the declaration is scarier than
the actual implementation! Forgetting the explicit lifetime is a common error
here - Rust won't let us forget that these closures have a lifetime limited to
their environment!
Now for reading and running commands:
    fn process(&mut self,line: &str) -> CliResult {
        let parts: Vec<_> = line.split_whitespace().collect();
        if parts.len() == 0 {
            return Ok("".to_string());
        }
        match self.callbacks.get(parts[0]) {
            Some(callback) => callback(&mut self.data,&parts[1..]),
            None => Err("no such command".to_string())
        }
    }

    fn go(&mut self) {
        let mut buff = String::new();
        while io::stdin().read_line(&mut buff).expect("error") > 0 {
            {
                let line = buff.trim_left();
                let res = self.process(line);
                println!("{:?}", res);
            }
            buff.clear();
        }
    }
This is all reasonably straightforward - split the line into words as a vector, look up the first word in the map and call the closure with our stored mutable data and the rest of the words. An empty line is ignored and not considered an error.
Next, let's define some helper functions to make it easier for our closures to
return correct and incorrect results. There's a little bit of cleverness going on;
they are generic functions that work for any type that can be converted to a String
.
fn ok<T: ToString>(s: T) -> CliResult {
    Ok(s.to_string())
}

fn err<T: ToString>(s: T) -> CliResult {
    Err(s.to_string())
}
So finally, the Main Program. Look at how ok(answer)
works - because
integers know how to convert themselves to strings!
use std::error::Error;

fn main() {
    println!("Welcome to the Interactive Prompt! ");

    struct Data {
        answer: i32
    }

    let mut cli = Cli::new(Data{answer: 42});

    cli.cmd("go",|data,args| {
        if args.len() == 0 { return err("need 1 argument"); }
        data.answer = match args[0].parse::<i32>() {
            Ok(n) => n,
            Err(e) => return err(e.description())
        };
        println!("got {:?}", args);
        ok(data.answer)
    });

    cli.cmd("show",|data,_| {
        ok(data.answer)
    });

    cli.go();
}
The error handling is a bit clunky here, and we'll later see how to use the question
mark operator in cases like this.
Basically, the particular error std::num::ParseIntError
implements
the trait std::error::Error
, which we must bring into scope to use the description
method - Rust doesn't let traits operate unless they're visible.
And in action:
Welcome to the Interactive Prompt!
go 32
got ["32"]
Ok("32")
show
Ok("32")
goop one two three
Err("no such command")
go 42 one two three
got ["42", "one", "two", "three"]
Ok("42")
go boo!
Err("invalid digit found in string")
Here are some obvious improvements for you to try. First, if we give cmd
three
arguments with the second being a help line, then we can store these help lines
and automatically implement a 'help' command. Second, having some command editing and
history is very convenient, so use the rustyline crate
from Cargo.
Error Handling
Basic Error Handling
Error handling in Rust can be clumsy if you can't use the question-mark operator.
To achieve happiness, we need to return a Result
which can accept any error.
All errors implement the trait std::error::Error
, and
so any error can convert into a Box<Error>
.
Say we needed to handle both i/o errors and errors from converting strings into numbers:
// box-error.rs
use std::fs::File;
use std::io::prelude::*;
use std::error::Error;

fn run(file: &str) -> Result<i32,Box<Error>> {
    let mut file = File::open(file)?;
    let mut contents = String::new();
    file.read_to_string(&mut contents)?;
    Ok(contents.trim().parse()?)
}
So that's two question-marks for the i/o errors (can't open file, or can't read as string)
and one question-mark for the conversion error. Finally, we wrap the result in Ok
.
Rust can work out from the return type that parse
should convert to i32
.
It's easy to create a shortcut for this Result
type:
type BoxResult<T> = Result<T,Box<Error>>;
However, our programs will have application-specific error conditions, and so we need to create our own error type. The basic requirements are straightforward:
- May implement
Debug
- Must implement
Display
- Must implement
Error
Otherwise, your error can do pretty much what it likes.
// error1.rs
use std::error::Error;
use std::fmt;

#[derive(Debug)]
struct MyError {
    details: String
}

impl MyError {
    fn new(msg: &str) -> MyError {
        MyError{details: msg.to_string()}
    }
}

impl fmt::Display for MyError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f,"{}",self.details)
    }
}

impl Error for MyError {
    fn description(&self) -> &str {
        &self.details
    }
}

// a test function that returns our error result
fn raises_my_error(yes: bool) -> Result<(),MyError> {
    if yes {
        Err(MyError::new("borked"))
    } else {
        Ok(())
    }
}
Typing Result<T,MyError>
gets tedious and many Rust modules define their own
Result
- e.g. io::Result<T>
is short for Result<T,io::Error>
.
In this next example we need to handle the specific error when a string can't be parsed as a floating-point number.
Now the way that ?
works
is to look for a conversion from the error of the expression to the error that must
be returned. And this conversion is expressed by the From
trait. Box<Error>
works as it does because it implements From
for all types implementing Error
.
At this point you can continue to use the convenient alias BoxResult
and catch everything
as before; there will be a conversion from our error into Box<Error>
.
This is a good option for smaller applications. But I want to show other errors can
be explicitly made to cooperate with our error type.
ParseFloatError
implements Error
so description()
is defined.
use std::num::ParseFloatError;

impl From<ParseFloatError> for MyError {
    fn from(err: ParseFloatError) -> Self {
        MyError::new(err.description())
    }
}

// and test!
fn parse_f64(s: &str, yes: bool) -> Result<f64,MyError> {
    raises_my_error(yes)?;
    let x: f64 = s.parse()?;
    Ok(x)
}
The first ?
is fine (a type always converts to itself with From
) and the
second ?
will convert the ParseFloatError
to MyError
.
And the results:
fn main() {
    println!(" {:?}", parse_f64("42",false));
    println!(" {:?}", parse_f64("42",true));
    println!(" {:?}", parse_f64("?42",false));
}
// Ok(42)
// Err(MyError { details: "borked" })
// Err(MyError { details: "invalid float literal" })
Not too complicated, although a little long-winded. The tedious bit is having to
write From
conversions for all the other error types that need to play nice
with MyError
- or you simply lean on Box<Error>
. Newcomers get confused
by the multitude of ways to do the same thing in Rust; there is always another
way to peel the avocado (or skin the cat, if you're feeling bloodthirsty). The price
of flexibility is having many options. Error-handling for a 200 line program can afford
to be simpler than for a large application. And if you ever want to package your precious
droppings as a Cargo crate, then error handling becomes crucial.
Currently, the question-mark operator only works for Result
, not Option
, and this is
a feature, not a limitation. Option
has a ok_or_else
which converts itself into a Result
.
For example, say we had a HashMap
and must fail if a key isn't defined:
let val = map.get("my_key").ok_or_else(|| MyError::new("my_key not defined"))?;
Now here the error returned is completely clear! (This form uses a closure, so the error value is only created if the lookup fails.)
simple-error for Simple Errors
The simple-error crate provides you with
a basic error type based on a string, as we have defined it here, and a few convenient macros.
Like any error, it works fine with Box<Error>
:
#[macro_use]
extern crate simple_error;

use std::error::Error;

type BoxResult<T> = Result<T,Box<Error>>;

fn run(s: &str) -> BoxResult<i32> {
    if s.len() == 0 {
        bail!("empty string");
    }
    Ok(s.trim().parse()?)
}

fn main() {
    println!("{:?}", run("23"));
    println!("{:?}", run("2x"));
    println!("{:?}", run(""));
}
// Ok(23)
// Err(ParseIntError { kind: InvalidDigit })
// Err(StringError("empty string"))
bail!(s)
expands to return SimpleError::new(s).into();
- return early with a conversion into
the receiving type.
You need to use BoxResult
for mixing the SimpleError
type with other errors, since
we can't implement From
for it, since both the trait and the type come from other crates.
error-chain for Serious Errors
For non-trivial applications have a look at the error_chain crate. A little macro magic can go a long way in Rust...
Create a binary crate with cargo new --bin test-error-chain
and
change to this directory. Edit Cargo.toml
and add error-chain="0.8.1"
to the end.
What error-chain does for you is create all the definitions we needed for manually implementing
an error type; creating a struct, and implementing the necessary traits: Display
, Debug
and Error
.
It also by default implements From
so strings can be converted into errors.
Our first src/main.rs
file looks like this. All the main program does is call run
, print out any
errors, and end the program with a non-zero exit code. The macro error_chain
generates all the
definitions needed, within an error
module - in a larger program you would put this in its own file.
We need to bring everything in error
back into global scope because our code will need to see
the generated traits. By default, there will be an Error
struct and a Result
defined with that
error.
Here we also ask for From
to be implemented so that std::io::Error
will convert into
our error type using foreign_links
:
#[macro_use]
extern crate error_chain;

mod errors {
    error_chain!{
        foreign_links {
            Io(::std::io::Error);
        }
    }
}
use errors::*;

fn run() -> Result<()> {
    use std::fs::File;

    File::open("file")?;

    Ok(())
}

fn main() {
    if let Err(e) = run() {
        println!("error: {}", e);

        std::process::exit(1);
    }
}
// error: No such file or directory (os error 2)
The 'foreign_links' has made our life easier, since the question mark operator now knows how to
convert std::io::Error
into our error::Error
. (Under the hood, the macro is creating a
From<std::io::Error>
conversion, exactly as spelt out earlier.)
All the action happens in run
; let's make it print out the first 10 lines of a file given as the
first program argument. There may or may not be such an argument, which isn't necessarily an
error. Here we want to convert an Option<String>
into a Result<String>
. There are two Option
methods for doing this conversion, and I've picked the simplest one. Our Error
type implements
From
for &str
, so it's straightforward to make an error with a simple text message.
fn run() -> Result<()> {
    use std::env::args;
    use std::fs::File;
    use std::io::BufReader;
    use std::io::prelude::*;

    let file = args().skip(1).next()
        .ok_or(Error::from("provide a file"))?;

    let f = File::open(&file)?;
    let mut l = 0;
    for line in BufReader::new(f).lines() {
        let line = line?;
        println!("{}", line);
        l += 1;
        if l == 10 {
            break;
        }
    }

    Ok(())
}
There is (again) a useful little macro bail!
for 'throwing' errors.
An alternative to the ok_or
method here could be:
let file = match args().skip(1).next() {
    Some(s) => s,
    None => bail!("provide a file")
};
Like ?
it does an early return.
The returned error contains an enum ErrorKind
, which allows us to distinguish between various
kinds of errors. There's always a variant Msg
(when you say Error::from(str)
) and the foreign_links
has declared Io
which wraps I/O errors:
fn main() {
    if let Err(e) = run() {
        match e.kind() {
            &ErrorKind::Msg(ref s) => println!("msg {}",s),
            &ErrorKind::Io(ref s) => println!("io {}",s),
        }
        std::process::exit(1);
    }
}
// $ cargo run
// msg provide a file
// $ cargo run foo
// io No such file or directory (os error 2)
It's straightforward to add new kinds of errors. Add an errors
section to the error_chain!
macro:
error_chain!{
    foreign_links {
        Io(::std::io::Error);
    }

    errors {
        NoArgument(t: String) {
            display("no argument provided: '{}'", t)
        }
    }
}
This defines how Display
works for this new kind of error. And now we can handle
'no argument' errors more specifically, feeding ErrorKind::NoArgument
a String
value:
let file = args().skip(1).next()
    .ok_or(ErrorKind::NoArgument("filename needed".to_string()))?;
There's now an extra ErrorKind
variant that you must match:
fn main() {
    if let Err(e) = run() {
        println!("error {}",e);

        match e.kind() {
            &ErrorKind::Msg(ref s) => println!("msg {}", s),
            &ErrorKind::Io(ref s) => println!("io {}", s),
            &ErrorKind::NoArgument(ref s) => println!("no argument {:?}", s),
        }

        std::process::exit(1);
    }
}
// cargo run
// error no argument provided: 'filename needed'
// no argument "filename needed"
Generally, it's useful to make errors as specific as possible, particularly if this is a library
function! This match-on-kind technique is pretty much the equivalent of traditional exception handling,
where you match on exception types in a catch
or except
block.
In summary, error-chain creates a type Error
for you, and defines Result<T>
to be std::result::Result<T,Error>
.
Error
contains an enum ErrorKind
and by default there is one variant Msg
for errors created from
strings. You define external errors with foreign_links
which does two things. First, it creates a new
ErrorKind
variant. Second, it defines From
on these external errors so they can be converted to our
error. New error variants can be easily added. A lot of irritating boilerplate code is eliminated.
Chaining Errors
But the really cool thing that this crate provides is error chaining.
As a library user, it's irritating when a method simply just 'throws' a generic I/O error. OK, it could not open a file, fine, but what file? Basically, what use is this information to me?
error_chain
does error chaining which helps solve this problem of over-generic errors. When we
try to open the file, we can lazily lean on the conversion to io::Error
using ?
, or chain the error.
// non-specific error
let f = File::open(&file)?;

// a specific chained error
let f = File::open(&file).chain_err(|| "unable to read the damn file")?;
Here's a new version of the program, with no imported 'foreign' errors, just the defaults:
#[macro_use]
extern crate error_chain;

mod errors {
    error_chain!{
    }
}
use errors::*;

fn run() -> Result<()> {
    use std::env::args;
    use std::fs::File;
    use std::io::BufReader;
    use std::io::prelude::*;

    let file = args().skip(1).next()
        .ok_or(Error::from("filename needed"))?;

    ///////// chain explicitly! ///////////
    let f = File::open(&file).chain_err(|| "unable to read the damn file")?;

    let mut l = 0;
    for line in BufReader::new(f).lines() {
        let line = line.chain_err(|| "cannot read a line")?;
        println!("{}", line);
        l += 1;
        if l == 10 {
            break;
        }
    }

    Ok(())
}

fn main() {
    if let Err(e) = run() {
        println!("error {}", e);

        /////// look at the chain of errors... ///////
        for e in e.iter().skip(1) {
            println!("caused by: {}", e);
        }

        std::process::exit(1);
    }
}
// $ cargo run foo
// error unable to read the damn file
// caused by: No such file or directory (os error 2)
So the chain_err
method takes the original error, and creates a new error which contains the
original error - this can be continued indefinitely. The closure is expected to return any
value which can be converted into an error.
Rust macros can clearly save you a lot of typing. error-chain
even provides a shortcut that
replaces the whole main program:
quick_main!(run);
(run
is where all the action takes place, anyway.)
Threads, Networking and Sharing
Changing the Unchangeable
If you're feeling pig-headed (as I sometimes am), you may wonder whether it's ever possible to get around the restrictions of the borrow checker.
Consider the following little program, which compiles and runs without problems.
// cell.rs
use std::cell::Cell;

fn main() {
    let answer = Cell::new(42);

    assert_eq!(answer.get(), 42);

    answer.set(77);

    assert_eq!(answer.get(), 77);
}
The answer was changed - and yet the variable answer
was not mutable!
This is obviously perfectly safe, since the value inside the cell is only accessed
through set
and get
. This goes by the grand name of interior mutability. The
usual is called inherited mutability: if I have a struct value v
, then I can only
write to a field v.a
if v
itself is writeable. Cell
values relax this rule, since
we can change the value contained within them with set
even if the cell itself is
not mutable.
However, Cell
only works with Copy
types
(e.g primitive types and user types deriving the Copy
trait).
For other values, we have to get a reference we can work on, either mutable or immutable.
This is what RefCell
provides - you ask it explicitly for a reference to the contained
value:
// refcell.rs
use std::cell::RefCell;

fn main() {
    let greeting = RefCell::new("hello".to_string());

    assert_eq!(*greeting.borrow(), "hello");
    assert_eq!(greeting.borrow().len(), 5);

    *greeting.borrow_mut() = "hola".to_string();

    assert_eq!(*greeting.borrow(), "hola");
}
Again, greeting
was not declared as mutable!
The explicit dereference operator *
can be a bit confusing in Rust, because
often you don't need it - for instance greeting.borrow().len()
is fine since method
calls will dereference implicitly. But you do need *
to pull out the underlying
&String
from greeting.borrow()
or the &mut String
from greeting.borrow_mut()
.
Using a RefCell
isn't always safe, because any references returned from these
methods must follow the usual rules.
let mut gr = greeting.borrow_mut(); // gr is a mutable borrow
*gr = "hola".to_string();

assert_eq!(*greeting.borrow(), "hola"); // <== we blow up here!
// ....
// thread 'main' panicked at 'already mutably borrowed: BorrowError'
You cannot borrow immutably if you have already borrowed mutably! Except - and this
is important - the violation of the rules happens at runtime. The solution (as always)
is to keep the scope of mutable borrows as limited as possible - in this case, you could
put a block around the first two lines here so that the mutable reference gr
gets
dropped before we borrow again.
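Something like this minimal sketch, using the same greeting value as above:

{
    let mut gr = greeting.borrow_mut();
    *gr = "hola".to_string();
}   // the mutable borrow gr is dropped here

assert_eq!(*greeting.borrow(), "hola");  // now this is fine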
So, this is not a feature you use without good reason, since you will not get a compile-time error. These types provide dynamic borrowing in cases where the usual rules make some things impossible.
Shared References
Up to now, the relationship between a value and its borrowed references has been clear
and known at compile time. The value is the owner, and the references cannot outlive it.
But many cases simply don't fit into this neat pattern. For example, say we have
a Player
struct and a Role
struct. A Player
keeps a vector of references to Role
objects. There isn't a neat one-to-one relationship between these values, and persuading
rustc
to cooperate becomes nasty.
Rc
works like Box
- heap memory is allocated and the value is moved to it. If you
clone a Box
, it allocates a full cloned copy of the value. But cloning an Rc
is
cheap, because each time you clone it just updates a reference count to the same data.
This is an old and very popular strategy for memory management,
for example it's used in the Objective C runtime on iOS/MacOS.
(In modern C++, it is implemented with std::shared_ptr
.)
When a Rc
is dropped, the reference count is decremented. When that count goes to zero
the owned value is dropped and the memory freed.
// rc1.rs
use std::rc::Rc;

fn main() {
    let s = "hello dolly".to_string();
    let rs1 = Rc::new(s); // s moves to heap; ref count 1
    let rs2 = rs1.clone(); // ref count 2

    println!("len {}, {}", rs1.len(), rs2.len());
} // both rs1 and rs2 drop, string dies.
You may make as many references as you like to the original value - it's dynamic borrowing
again. You do not have to carefully track the relationship between the value T
and
its references &T
. There is some runtime cost involved, so it isn't the first
solution you choose, but it makes patterns of sharing possible which would fall foul
of the borrow checker. Note that Rc
gives you immutable shared references, since
otherwise that would break one of the very basic rules of borrowing.
A leopard can't change its spots without ceasing to be a leopard.
In the case of a Player
, it can now keep its roles as a Vec<Rc<Role>>
and things
work out fine - we can add or remove roles but not change them after their creation.
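Here is a rough sketch of what that might look like - the Player and Role types here are hypothetical, not from any library:

use std::rc::Rc;

struct Role {
    name: String,
}

struct Player {
    roles: Vec<Rc<Role>>,
}

fn main() {
    let tank = Rc::new(Role{ name: "tank".to_string() });
    let healer = Rc::new(Role{ name: "healer".to_string() });

    // cloning an Rc just bumps the count, so players share the same Role values
    let p1 = Player{ roles: vec![tank.clone(), healer.clone()] };
    let p2 = Player{ roles: vec![tank.clone()] };

    println!("p1 has {} roles, p2's first is '{}'", p1.roles.len(), p2.roles[0].name);
}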
However, what if each Player
needs to keep references to a team as a vector of
Player
references? Then everything becomes immutable, because all the Player
values
need to be stored as Rc
! This is the place where RefCell
becomes necessary. The team
may be then defined as Vec<Rc<RefCell<Player>>>
. It is now possible to change
a Player
value using borrow_mut
, provided no-one has 'checked out' a reference
to a Player
at the same time. For example, say we have a rule that if something special
happens to a player, then all of their team gets stronger:
for p in &self.team {
    p.borrow_mut().make_stronger();
}
So the application code isn't too bad, but the type signatures get a bit scary. You can
always simplify them with a type
alias:
type PlayerRef = Rc<RefCell<Player>>;
Multithreading
Over the last twenty years, there has been a shift away from raw processing speed
to CPUs having multiple cores. So the only way to get the most out of a modern computer
is to keep all of those cores busy. It's certainly possible to spawn child processes
in the background as we saw with Command
but there's still a synchronization problem:
we don't know exactly when those children are finished without waiting on them.
There are other reasons for needing separate threads of execution, of course. You cannot afford to lock up your whole process just to wait on blocking i/o, for instance.
Spawning threads is straightforward in Rust - you feed spawn
a closure which is
executed in the background.
// thread1.rs
use std::thread;
use std::time;

fn main() {
    thread::spawn(|| println!("hello"));
    thread::spawn(|| println!("dolly"));

    println!("so fine");
    // wait a little bit
    thread::sleep(time::Duration::from_millis(100));
}
// so fine
// hello
// dolly
Well obviously just 'wait a little bit' is not a very rigorous solution! It's better
to call join
on the returned object - then the main thread waits for the
spawned thread to finish.
// thread2.rs
use std::thread;

fn main() {
    let t = thread::spawn(|| {
        println!("hello");
    });
    println!("wait {:?}", t.join());
}
// hello
// wait Ok(())
Here's an interesting variation: force the new thread to panic.
let t = thread::spawn(|| {
    println!("hello");
    panic!("I give up!");
});
println!("wait {:?}", t.join());
We get a panic as expected, but only the panicking thread dies! We still manage
to print out the error message from the join
. So yes, panics are not always fatal,
but threads are relatively expensive, so this should not be seen as a routine way
of handling panics.
hello
thread '<unnamed>' panicked at 'I give up!', thread2.rs:7
note: Run with `RUST_BACKTRACE=1` for a backtrace.
wait Err(Any)
The returned objects can be used to keep track of multiple threads:
// thread4.rs
use std::thread;

fn main() {
    let mut threads = Vec::new();

    for i in 0..5 {
        let t = thread::spawn(move || {
            println!("hello {}", i);
        });
        threads.push(t);
    }

    for t in threads {
        t.join().expect("thread failed");
    }
}
// hello 0
// hello 2
// hello 4
// hello 3
// hello 1
Rust insists that we handle the case where the join failed - i.e. that thread panicked. (You would typically not bail out of the main program when this happens, just note the error, retry etc)
There is no particular order to thread execution (this program gives different orders for different runs), and this is key - they really are independent threads of execution. Multithreading is easy; what's hard is concurrency - managing and synchronizing multiple threads of execution.
Threads Don't Borrow
It's possible for the thread closure to capture values, but by moving, not by borrowing!
// thread3.rs
use std::thread;

fn main() {
    let name = "dolly".to_string();
    let t = thread::spawn(|| {
        println!("hello {}", name);
    });
    println!("wait {:?}", t.join());
}
And here's the helpful error message:
error[E0373]: closure may outlive the current function, but it borrows `name`, which is owned by the current function
--> thread3.rs:6:27
|
6 | let t = thread::spawn(|| {
| ^^ may outlive borrowed value `name`
7 | println!("hello {}", name);
| ---- `name` is borrowed here
|
help: to force the closure to take ownership of `name` (and any other referenced variables), use the `move` keyword, as shown:
| let t = thread::spawn(move || {
That's fair enough! Imagine spawning this thread from a function - it will exist
after the function call has finished and name
gets dropped. So adding move
solves our
problem.
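Here is thread3.rs again with that one-word fix:

// thread3.rs, fixed: the closure now takes ownership of name
use std::thread;

fn main() {
    let name = "dolly".to_string();
    let t = thread::spawn(move || {
        println!("hello {}", name);
    });
    println!("wait {:?}", t.join());
}
// hello dolly
// wait Ok(())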
But this is a move, so name
may only appear in one thread! I'd like to emphasize
that it is possible to share references, but they need to have static
lifetime:
let name = "dolly";
let t1 = thread::spawn(move || {
    println!("hello {}", name);
});
let t2 = thread::spawn(move || {
    println!("goodbye {}", name);
});
name
exists for the whole duration of the program (static
), so
rustc
is satisfied that the closure will never outlive name
. However, most interesting
references do not have static
lifetimes!
Threads can't share the same environment - by design in Rust. In particular, they cannot share regular references because the closures move their captured variables.
shared references are fine however, because their lifetime is 'as long as needed' -
but you cannot use Rc
for this. This is because
Rc
is not thread safe - it's optimized to be fast for the non-threaded case.
Fortunately it is a compile error to use Rc
here; the compiler is watching your
back as always.
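For example, here is a sketch (my own, not from the text) of what the compiler rejects - thread::spawn requires its closure to be Send, and Rc is not:

use std::rc::Rc;
use std::thread;

fn main() {
    let name = Rc::new("dolly".to_string());
    let shared = name.clone();

    // does not compile: `Rc<String>` cannot be sent between threads safely
    let t = thread::spawn(move || {
        println!("hello {}", shared);
    });
    t.join().unwrap();
}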
For threads, you need std::sync::Arc
- 'Arc' stands for 'Atomic Reference Counting'.
That is, it guarantees that the reference count will be modified in one logical operation.
To make this guarantee, it must ensure that the operation is locked so that only the current
thread has access. clone
is still much cheaper than actually making a copy however.
// thread5.rs
use std::thread;
use std::sync::Arc;

struct MyString(String);

impl MyString {
    fn new(s: &str) -> MyString {
        MyString(s.to_string())
    }
}

fn main() {
    let mut threads = Vec::new();
    let name = Arc::new(MyString::new("dolly"));

    for i in 0..5 {
        let tname = name.clone();
        let t = thread::spawn(move || {
            println!("hello {} count {}", tname.0, i);
        });
        threads.push(t);
    }

    for t in threads {
        t.join().expect("thread failed");
    }
}
I"ve deliberately created a wrapper type for String
here (a 'newtype') since
our MyString
does not implement Clone
. But the shared reference can be cloned!
The shared reference name
is passed to each new thread by making a new reference
with clone
and moving it into the closure. It's a little verbose, but this is a safe
pattern. Safety is important in concurrency precisely because the problems are so
unpredictable. A program may run fine on your machine, but occasionally crash on the
server, usually on the weekend. Worse still, the symptoms of such problems are
not easy to diagnose.
Channels
Threads need a way to send data to each other; in Rust this is done using channels. std::sync::mpsc::channel()
returns a tuple consisting
returns a tuple consisting
of the receiver channel and the sender channel. Each thread is passed a copy
of the sender with clone
, and calls send
. Meanwhile the main thread calls
recv
on the receiver.
'MPSC' stands for 'Multiple Producer Single Consumer'. We create multiple threads which attempt to send to the channel, and the main thread 'consumes' the channel.
// thread9.rs
use std::thread;
use std::sync::mpsc;

fn main() {
    let nthreads = 5;
    let (tx, rx) = mpsc::channel();

    for i in 0..nthreads {
        let tx = tx.clone();
        thread::spawn(move || {
            let response = format!("hello {}", i);
            tx.send(response).unwrap();
        });
    }

    for _ in 0..nthreads {
        println!("got {:?}", rx.recv());
    }
}
// got Ok("hello 0")
// got Ok("hello 1")
// got Ok("hello 3")
// got Ok("hello 4")
// got Ok("hello 2")
There's no need to join here since the threads send their response just before they
end execution, but obviously this can happen at any time. recv
will block, and will
return an error if the sender channel is disconnected. recv_timeout
will only block
for a given time period, and may return a timeout error as well.
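As a minimal sketch, assuming the same rx receiver as in the example above:

use std::time::Duration;

// wait at most half a second for the next message
match rx.recv_timeout(Duration::from_millis(500)) {
    Ok(msg) => println!("got {}", msg),
    Err(e) => println!("no message: {:?}", e),  // Timeout or Disconnected
}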
send
never blocks, which is useful because threads can push out data without waiting
for the receiver to process. In addition, the channel is buffered so multiple
send operations can take place, which will be received in order.
However, not blocking means that Ok
does not automatically mean 'successfully delivered message'!
A sync_channel
does block on send. With an argument of zero, the send blocks until the
recv happens. The threads must meet up or rendezvous (on the sound principle that most things
sound better in French.)
let (tx, rx) = mpsc::sync_channel(0);

let t1 = thread::spawn(move || {
    for i in 0..5 {
        tx.send(i).unwrap();
    }
});

for _ in 0..5 {
    let res = rx.recv().unwrap();
    println!("{}",res);
}
t1.join().unwrap();
We can easily cause an error here by calling recv
when there has been no corresponding send
, e.g
by looping for i in 0..4
. The thread ends, and tx
drops, and then recv
will fail. This will also
happen if the thread panics, which causes its stack to be unwound, dropping any values.
If the sync_channel
was created with a non-zero argument n
, then it acts like a queue with a
maximum size of n
- send
will only block when it tries to add more than n
values to the queue.
Channels are strongly typed - here the channel had type i32
- but type inference makes this implicit.
If you need to pass different kinds of data, then enums are a good way to express this.
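For instance, a small sketch with a hypothetical Message enum - the channel is then typed as carrying Message values:

use std::sync::mpsc;
use std::thread;

// a hypothetical message type for this sketch
enum Message {
    Line(String),
    Count(u64),
    Finished,
}

fn main() {
    let (tx, rx) = mpsc::channel();

    thread::spawn(move || {
        tx.send(Message::Line("hello".to_string())).unwrap();
        tx.send(Message::Count(42)).unwrap();
        tx.send(Message::Finished).unwrap();
    });

    for msg in rx.iter() {
        match msg {
            Message::Line(s) => println!("line '{}'", s),
            Message::Count(n) => println!("count {}", n),
            Message::Finished => break,
        }
    }
}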
Synchronization
Let's look at synchronization. join
is very basic, and merely waits until a
particular thread has finished. A sync_channel
synchronizes two threads - in the last example, the
spawned thread and the main thread are completely locked together.
Barrier synchronization is a checkpoint where the threads must wait until all of
them have reached that point. Then they can keep going as before. The barrier is
created with the number of threads that we want to wait for. As before we use Arc
to share the barrier with all the threads.
// thread7.rs
use std::thread;
use std::sync::Arc;
use std::sync::Barrier;

fn main() {
    let nthreads = 5;
    let mut threads = Vec::new();
    let barrier = Arc::new(Barrier::new(nthreads));

    for i in 0..nthreads {
        let barrier = barrier.clone();
        let t = thread::spawn(move || {
            println!("before wait {}", i);
            barrier.wait();
            println!("after wait {}", i);
        });
        threads.push(t);
    }

    for t in threads {
        t.join().unwrap();
    }
}
// before wait 2
// before wait 0
// before wait 1
// before wait 3
// before wait 4
// after wait 4
// after wait 2
// after wait 3
// after wait 0
// after wait 1
The threads do their semi-random thing, all meet up, and then continue. It's like a kind
of resumable join
and useful when you need to farm off pieces of a job to
different threads and want to take some action when all the pieces are finished.
Shared State
How can threads modify shared state?
Recall the Rc<RefCell<T>>
strategy for dynamically doing a
mutable borrow on shared references. The threading equivalent to RefCell
is
Mutex
- you may get your mutable reference by calling lock
. While this reference
exists, no other thread can access it. mutex
stands for 'Mutual Exclusion' - we lock
a section of code so that only one thread can access it, and then unlock it. You get the
lock with the lock
method, and it is unlocked when the reference is dropped.
// thread9.rs
use std::thread;
use std::sync::Arc;
use std::sync::Mutex;

fn main() {
    let answer = Arc::new(Mutex::new(42));

    let answer_ref = answer.clone();
    let t = thread::spawn(move || {
        let mut answer = answer_ref.lock().unwrap();
        *answer = 55;
    });

    t.join().unwrap();

    let ar = answer.lock().unwrap();
    assert_eq!(*ar, 55);
}
This isn't so straightforward as using RefCell
because asking for the lock on
the mutex might fail, if another thread has panicked while holding the lock.
(In this case, the documentation actually recommends just exiting the thread with unwrap
because things have gone seriously wrong!)
It's even more important to keep this mutable borrow as short as possible, because as long as the mutex is locked, other threads are blocked. This is not the place for expensive calculations! So typically such code would be used like this:
// ... do something in the thread
// get a locked reference and use it briefly!
{
    let mut data = data_ref.lock().unwrap();
    // modify data
}
//... continue with the thread
Higher-Level Operations
It's better to find higher-level ways of doing threading, rather than managing the synchronization
yourself. An example is when you need to do things in parallel and collect the results. One very
cool crate is pipeliner which has a very straightforward
API. Here's the 'Hello, World!' - an iterator feeds us inputs and we execute up to n
of the operations
on the values in parallel.
extern crate pipeliner;
use pipeliner::Pipeline;

fn main() {
    for result in (0..10).with_threads(4).map(|x| x + 1) {
        println!("result: {}", result);
    }
}
// result: 1
// result: 2
// result: 5
// result: 3
// result: 6
// result: 7
// result: 8
// result: 9
// result: 10
// result: 4
It's a silly example of course, because the operation is so cheap to calculate, but shows how easy it is to run code in parallel.
Here's something more useful. Doing network operations in parallel is very useful, because they can take time, and you don't want to wait for them all to finish before starting to do work.
This example is pretty crude (believe me, there are better ways of doing it) but here we want to focus
on the principle. We reuse the shell
function defined in section 4 to call ping
on a range
of IP4 addresses.
extern crate pipeliner;
use pipeliner::Pipeline;

use std::process::Command;

fn shell(cmd: &str) -> (String, bool) {
    let cmd = format!("{} 2>&1", cmd);
    let output = Command::new("/bin/sh")
        .arg("-c")
        .arg(&cmd)
        .output()
        .expect("no shell?");
    (
        String::from_utf8_lossy(&output.stdout).trim_right().to_string(),
        output.status.success()
    )
}

fn main() {
    let addresses: Vec<_> = (1..40).map(|n| format!("ping -c1 192.168.0.{}",n)).collect();
    let n = addresses.len();

    for result in addresses.with_threads(n).map(|s| shell(&s)) {
        if result.1 {
            println!("got: {}", result.0);
        }
    }
}
And the result on my home network looks like this:
got: PING 192.168.0.1 (192.168.0.1) 56(84) bytes of data.
64 bytes from 192.168.0.1: icmp_seq=1 ttl=64 time=43.2 ms
--- 192.168.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 43.284/43.284/43.284/0.000 ms
got: PING 192.168.0.18 (192.168.0.18) 56(84) bytes of data.
64 bytes from 192.168.0.18: icmp_seq=1 ttl=64 time=0.029 ms
--- 192.168.0.18 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms
got: PING 192.168.0.3 (192.168.0.3) 56(84) bytes of data.
64 bytes from 192.168.0.3: icmp_seq=1 ttl=64 time=110 ms
--- 192.168.0.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 110.008/110.008/110.008/0.000 ms
got: PING 192.168.0.5 (192.168.0.5) 56(84) bytes of data.
64 bytes from 192.168.0.5: icmp_seq=1 ttl=64 time=207 ms
...
The active addresses come through pretty fast within the first half-second, and we then wait for the negative
results to come in. Otherwise, we would wait for the better part of a minute! You can now proceed
to scrape things like ping times from the output, although this would only work on Linux. ping
is universal, but the exact output format is different for each platform. To do better we need to use
the cross-platform Rust networking API, and so let's move onto Networking.
A Better Way to Resolve Addresses
If you just want availability and not detailed ping statistics, the std::net::ToSocketAddrs
trait
will do any DNS resolution for you:
use std::net::*;

fn main() {
    for res in "google.com:80".to_socket_addrs().expect("bad") {
        println!("got {:?}", res);
    }
}
// got V4(216.58.223.14:80)
// got V6([2c0f:fb50:4002:803::200e]:80)
It's an iterator because there is often more than one interface associated with a domain - there are both IPV4 and IPV6 interfaces to Google.
So, let's naively use this method to rewrite the pipeliner example. Most networking protocols use both an address and a port:
extern crate pipeliner;
use pipeliner::Pipeline;

use std::net::*;

fn main() {
    let addresses: Vec<_> = (1..40).map(|n| format!("192.168.0.{}:0",n)).collect();
    let n = addresses.len();

    for result in addresses.with_threads(n).map(|s| s.to_socket_addrs()) {
        println!("got: {:?}", result);
    }
}
// got: Ok(IntoIter([V4(192.168.0.1:0)]))
// got: Ok(IntoIter([V4(192.168.0.39:0)]))
// got: Ok(IntoIter([V4(192.168.0.2:0)]))
// got: Ok(IntoIter([V4(192.168.0.3:0)]))
// got: Ok(IntoIter([V4(192.168.0.5:0)]))
// ....
This is much faster than the ping example because it's just checking that the IP address is valid - if we fed it a list of actual domain names the DNS lookup could take some time, hence the importance of parallelism.
Surprisingly, it sort-of Just Works. The fact that everything in the standard library implements Debug
is great for exploration as well as debugging. The iterator is returning Result
(hence Ok
) and
in that Result
is an IntoIter
into a SocketAddr
which is an enum with either an IPv4 or an IPv6 address.
Why IntoIter
? Because a socket may have multiple addresses (e.g. both ipv4 and ipv6).
for result in addresses.with_threads(n)
    .map(|s| s.to_socket_addrs().unwrap().next().unwrap()) {
    println!("got: {:?}", result);
}
// got: V4(192.168.0.1:0)
// got: V4(192.168.0.39:0)
// got: V4(192.168.0.3:0)
This also works, surprisingly enough, at least for our simple example. The first unwrap
gets rid of
the Result
, and then we explicitly pull the first value out of the iterator. The Result
will get
bad typically when we give a nonsense address (like an address name without a port.)
TCP Client Server
Rust provides a straightforward interface to the most commonly used network protocol, TCP. It is very fault-resistant and is the base on which our networked world is built - packets of data are sent and received, with acknowledgement. By contrast, UDP sends packets out into the wild without acknowledgement - there's a joke that goes "I could tell you a joke about UDP but you might not get it." (Jokes about networking are only funny for a specialized meaning of the word 'funny')
However, error handling is very important with networking, because anything can happen, and will, eventually.
TCP works as a client/server model; the server listens on a address and a particular network port, and the client connects to that server. A connection is established and thereafter the client and server can communicate with a socket.
TcpStream::connect
takes anything that can convert into a SocketAddr
, in particular the plain strings
we have been using.
A simple TCP client in Rust is easy - a TcpStream
struct is both readable and writeable. As usual, we
have to bring the Read
, Write
and other std::io
traits into scope:
// client.rs
use std::net::TcpStream;
use std::io::prelude::*;

fn main() {
    let mut stream = TcpStream::connect("127.0.0.1:8000").expect("connection failed");

    write!(stream,"hello from the client!\n").expect("write failed");
}
The server is not much more complicated; we set up a listener and wait for connections. When a
client connects, we get a TcpStream
on the server side. In this
case, we read everything that the client has written into a string.
// server.rs
use std::net::TcpListener;
use std::io::prelude::*;

fn main() {
    let listener = TcpListener::bind("127.0.0.1:8000").expect("could not start server");

    // accept connections and get a TcpStream
    for connection in listener.incoming() {
        match connection {
            Ok(mut stream) => {
                let mut text = String::new();
                stream.read_to_string(&mut text).expect("read failed");
                println!("got '{}'", text);
            }
            Err(e) => { println!("connection failed {}", e); }
        }
    }
}
Here I've chosen a port number more or less at random, but most ports are assigned some meaning.
Note that both parties have to agree on a protocol - the client expects it can write text to the stream, and the server expects to read text from the stream. If they don't play the same game, then situations can occur where one party is blocked, waiting for bytes that never come.
Error checking is important - network I/O can fail for many reasons, and errors that might appear once in a blue moon on a local filesystem can happen on a regular basis. Someone can trip over the network cable, the other party could crash, and so forth. This little server isn't very robust, because it will fall over on the first read error.
Here is a more solid server that handles the error without failing. It also specifically reads a line
from the stream, which is done using io::BufReader
to create an io::BufRead
on which we can call
read_line
.
// server2.rs
use std::net::{TcpListener, TcpStream};
use std::io::prelude::*;
use std::io;

fn handle_connection(stream: TcpStream) -> io::Result<()> {
    let mut rdr = io::BufReader::new(stream);
    let mut text = String::new();
    rdr.read_line(&mut text)?;
    println!("got '{}'", text.trim_right());
    Ok(())
}

fn main() {
    let listener = TcpListener::bind("127.0.0.1:8000").expect("could not start server");

    // accept connections and get a TcpStream
    for connection in listener.incoming() {
        match connection {
            Ok(stream) => {
                if let Err(e) = handle_connection(stream) {
                    println!("error {:?}", e);
                }
            }
            Err(e) => { print!("connection failed {}\n", e); }
        }
    }
}
read_line
might fail in handle_connection
, but the resulting error is safely handled.
One-way communications like this are certainly useful - for instance. a set of services across a network which want to collect their status reports together in one central place. But it's reasonable to expect a polite reply, even if just 'ok'!
A simple example is a basic 'echo' server. The client writes some text ending in a newline to the server, and receives the same text back with a newline - the stream is readable and writeable.
// client_echo.rs
use std::io::prelude::*;
use std::net::TcpStream;

fn main() {
    let mut stream = TcpStream::connect("127.0.0.1:8000").expect("connection failed");
    let msg = "hello from the client!";

    write!(stream,"{}\n", msg).expect("write failed");

    let mut resp = String::new();
    stream.read_to_string(&mut resp).expect("read failed");
    let text = resp.trim_right();
    assert_eq!(msg, text);
}
The server has an interesting twist. Only handle_connection
changes:
fn handle_connection(stream: TcpStream) -> io::Result<()> {
    let mut ostream = stream.try_clone()?;
    let mut rdr = io::BufReader::new(stream);
    let mut text = String::new();
    rdr.read_line(&mut text)?;
    ostream.write_all(text.as_bytes())?;
    Ok(())
}
This is a common gotcha with simple two-way socket communication; we want to read a line, so
need to feed the readable stream to BufReader
- but it consumes the stream! So we have to
clone the stream, creating a new struct which refers to the same underlying socket. Then we
have happiness.
Object-Orientation in Rust
Everyone comes from somewhere, and the chances are good that your previous programming language implemented Object-Oriented Programming (OOP) in a particular way:
- 'classes' act as factories for generating objects (often called instances) and define unique types.
- Classes may inherit from other classes (their parents), inheriting both data (fields) and behaviour (methods)
- If B inherits from A, then an instance of B can be passed to something expecting A (subtyping)
- An object should hide its data (encapsulation), which can only be operated on with methods.
Object-oriented design is then about identifying the classes (the 'nouns') and the methods (the 'verbs') and then establishing relationships between them, is-a and has-a.
There was a point in the old Star Trek series where the doctor would say to the captain,
"It's Life, Jim, just not Life as we know it". And this applies very much to Rust-flavoured
object-orientation: it comes as a shock, because Rust data aggregates (structs, enums
and tuples) are dumb. You can define methods on them, and make the data itself private,
all the usual tactics of encapsulation, but they are all unrelated types.
There is no subtyping and no inheritance of data (apart from the specialized
case of Deref
coercions.)
The relationships between various data types in Rust are established using traits. A large part of learning Rust is understanding how the standard library traits operate, because that's the web of meaning that glues all the data types together.
Traits are interesting because there's no one-to-one correspondence between them and concepts from mainstream languages. It depends if you're thinking dynamically or statically. In the dynamic case, they're rather like Java or Go interfaces.
Trait Objects
Consider the example first used to introduce traits:
trait Show {
    fn show(&self) -> String;
}

impl Show for i32 {
    fn show(&self) -> String {
        format!("four-byte signed {}", self)
    }
}

impl Show for f64 {
    fn show(&self) -> String {
        format!("eight-byte float {}", self)
    }
}
Here's a little program with big implications:
fn main() {
    let answer = 42;
    let maybe_pi = 3.14;
    let v: Vec<&Show> = vec![&answer,&maybe_pi];
    for d in v.iter() {
        println!("show {}",d.show());
    }
}
// show four-byte signed 42
// show eight-byte float 3.14
This is a case where Rust needs some type guidance - I specifically want a vector
of references to anything that implements Show
. Now note that i32
and f64
have no relationship to each other, but they both understand the show
method
because they both implement the same trait. This method is virtual, because
the actual method has different code for different types, and yet the correct
method is invoked based on runtime information. These references
are called trait objects.
And that is how you can put objects of different types in the same vector. If
you come from a Java or Go background, you can think of Show
as acting like an interface.
A little refinement of this example - we box the values. A box contains a reference to data
allocated on the heap, and acts very much like a reference - it's a smart pointer. When boxes
go out of scope and Drop
kicks in, then that memory is released.
let answer = Box::new(42);
let maybe_pi = Box::new(3.14);

let show_list: Vec<Box<Show>> = vec![answer, maybe_pi];
for d in &show_list {
    println!("show {}",d.show());
}
The difference is that you can now take this vector, pass it as a reference or give it away without having to track any borrowed references. When the vector is dropped, the boxes will be dropped, and all memory is reclaimed.
Animals
For some reason, any discussion of OOP and inheritance seems to end up talking about animals. It
makes for a nice story: "See, a Cat is a Carnivore. And a Carnivore is an Animal". But I'll start
with a classic slogan from the Ruby universe: "if it quacks, it's a duck". All your objects have
to do is define quack
and they can be considered to be ducks, albeit in a very narrow way.
trait Quack {
    fn quack(&self);
}

struct Duck ();

impl Quack for Duck {
    fn quack(&self) {
        println!("quack!");
    }
}

struct RandomBird {
    is_a_parrot: bool
}

impl Quack for RandomBird {
    fn quack(&self) {
        if ! self.is_a_parrot {
            println!("quack!");
        } else {
            println!("squawk!");
        }
    }
}

let duck1 = Duck();
let duck2 = RandomBird{is_a_parrot: false};
let parrot = RandomBird{is_a_parrot: true};

let ducks: Vec<&Quack> = vec![&duck1,&duck2,&parrot];

for d in &ducks {
    d.quack();
}
// quack!
// quack!
// squawk!
Here we have two completely different types (one is so dumb it doesn't even have data), and yes,
they all quack()
. One is behaving a little odd (for a duck) but they share the same method name
and Rust can keep a collection of such objects in a type-safe way.
Type safety is a fantastic thing. Without static typing, you could insert a cat into that collection of Quackers, resulting in run-time chaos.
Here's a funny one:
// and why the hell not!
impl Quack for i32 {
    fn quack(&self) {
        for i in 0..*self {
            print!("quack {} ",i);
        }
        println!("");
    }
}

let int = 4;

let ducks: Vec<&Quack> = vec![&duck1,&duck2,&parrot,&int];
...
// quack!
// quack!
// squawk!
// quack 0 quack 1 quack 2 quack 3
What can I say? It quacks, it must be a duck. What's interesting is that you can apply your traits
to any Rust value, not just 'objects'. (Since quack
is passed a reference, there's an explicit
dereference *
to get the integer.)
However, you can only do this with a trait and a type from the same crate, so the standard library cannot be 'monkey patched', which is another piece of Ruby folk practice (and not the most wildly admired either.)
Up to this point, the trait Quack
was behaving very much like a Java interface, and like
modern Java interfaces you can have provided methods which supply a default implementation
if you have implemented the required methods. (The Iterator
trait is a good example.)
But, note that traits are not part of the definition of a type and you can define and implement new traits on any type, subject to the same-crate restriction.
It's possible to pass a reference to any Quack
implementor:
fn quack_ref (q: &Quack) {
    q.quack();
}

quack_ref(&d);
And that's subtyping, Rust-style.
Since we're doing Programming Language Comparisons 101 here, I'll mention that Go has an interesting
take on the quacking business - if there's a Go interface Quack
, and a type has a quack
method,
then that type satisfies Quack
without any need for explicit definition. This also breaks the
baked-into-definition Java model, and allows compile-time duck-typing, at the cost of some
clarity and type-safety.
But there is a problem with duck-typing.
One of the signs of bad OOP is too many methods which have some
generic name like run
. "If it has run(), it must be Runnable" doesn't sound so catchy as
the original! So it is possible for a Go interface to be accidentally valid. In Rust,
both the Debug
and Display
traits define fmt
methods, but they really mean different
things.
So Rust traits allow traditional polymorphic OOP. But what about inheritance? People usually
mean implementation inheritance whereas Rust does interface inheritance. It's as if a Java
programmer never used extend
and instead used implements
. And this is actually
recommended practice
by Alan Holub. He says:
I once attended a Java user group meeting where James Gosling (Java's inventor) was the featured speaker. During the memorable Q&A session, someone asked him: "If you could do Java over again, what would you change?" "I'd leave out classes," he replied. After the laughter died down, he explained that the real problem wasn't classes per se, but rather implementation inheritance (the extends relationship). Interface inheritance (the implements relationship) is preferable. You should avoid implementation inheritance whenever possible
So even in Java, you've probably been overdoing classes!
Implementation inheritance has some serious problems. But it does feel so very
convenient. There's this fat base class called Animal
and it has loads of useful
functionality (it may even expose its innards!) which our derived class Cat
can use. That is,
it is a form of code reuse. But code reuse is a separate concern.
Getting the distinction between implementation and interface inheritance is important when understanding Rust.
Note that traits may have provided methods. Consider Iterator
- you only have to override
next
, but get a whole host of methods free. This is similar to 'default' methods of modern
Java interfaces. Here we only define name
and upper_case
is defined for us. We could
override upper_case
as well, but it isn't required.
trait Named {
    fn name(&self) -> String;

    fn upper_case(&self) -> String {
        self.name().to_uppercase()
    }
}

struct Boo();

impl Named for Boo {
    fn name(&self) -> String {
        "boo".to_string()
    }
}

let f = Boo();

assert_eq!(f.name(),"boo".to_string());
assert_eq!(f.upper_case(),"BOO".to_string());
This is a kind of code reuse, true, but note that it does not apply to the data, only the interface!
Ducks and Generics
An example of generic-friendly duck function in Rust would be this trivial one:
fn quack<Q> (q: &Q)
where Q: Quack {
    q.quack();
}

let d = Duck();
quack(&d);
The type parameter is any type which implements Quack
. There's an important difference
between quack
and the quack_ref
defined in the last section.
The body of this function is compiled for each of the calling
types and no virtual method is needed; such functions can be completely inlined. It
uses the trait Quack
in a different way, as a constraint on generic types.
This is the C++ equivalent to the generic quack
(note the const
):
template <class Q>
void quack(const Q& q) {
q.quack();
}
Note that the type parameter is not constrained in any way.
This is very much compile-time duck-typing - if we pass a reference to a
non-quackable type, then the compiler will complain bitterly about no quack
method.
At least the error is found at compile-time, but it's worse when a type is accidentally
Quackable, as happens with Go. More involved template functions and classes lead to
terrible error messages, because there are no constraints on the generic types.
You could define a function which could handle an iteration over Quacker pointers:
template <class It>
void quack_everyone (It start, It finish) {
for (It i = start; i != finish; i++) {
(*i)->quack();
}
}
This would then be implemented for each iterator type It
.
The Rust equivalent is a little more challenging:
fn quack_everyone <I> (iter: I)
where I: Iterator<Item=Box<Quack>> {
    for d in iter {
        d.quack();
    }
}

let ducks: Vec<Box<Quack>> = vec![Box::new(duck1),Box::new(duck2),Box::new(parrot),Box::new(int)];

quack_everyone(ducks.into_iter());
Iterators in Rust aren't duck-typed but are types that must implement Iterator
, and in
this case the iterator provides boxes of Quack
. There's no ambiguity about the types
involved, and the values must satisfy Quack
. Often the function signature is the most challenging
thing about a generic Rust function, which is why I recommend reading
the source of the standard library - the implementation is often much simpler than the declaration!
Here the only type parameter is the actual iterator type,
which means that this will work with anything that can deliver a sequence of Box<Quack>
, not just
a vector iterator.
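For instance, here is a sketch of my own (reusing the impl Quack for i32 from earlier) that feeds quack_everyone boxes built on the fly from a range rather than from a vector:

// any iterator that delivers Box<Quack> will do
quack_everyone((1..4).map(|i| Box::new(i) as Box<Quack>));
// quack 0
// quack 0 quack 1
// quack 0 quack 1 quack 2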
Inheritance
A common problem with object-oriented design is trying to force things into a is-a relationship, and neglecting has-a relationships. The GoF said "Prefer Composition to Inheritance" in their Design Patterns book, twenty-two years ago.
Here's an example: you want to model the employees of some company, and Employee
seems a good
name for a class. Then, Manager is-a Employee (this is true) so we start building our
hierarchy with a Manager
subclass of Employee
. This isn't as smart as it seems. Maybe we got
carried away with identifying important Nouns, maybe we (unconsciously) think that managers and
employees are different kinds of animals? It's better for Employee
to has-a Roles
collection,
and then a manager is just an Employee
with more responsibilities and capabilities.
Or consider Vehicles - ranging from bicycles to 300t ore trucks. There are multiple ways to think about vehicles, road-worthiness (all-terrain, city, rail-bound, etc), power-source (electric, diesel, diesel-electric, etc), cargo-or-people, and so forth. Any fixed hierarchy of classes you create based on one aspect ignores all other aspects. That is, there are multiple possible classifications of vehicles!
Composition is more important in Rust for the obvious reason that you can't inherit functionality in a lazy way from a base class.
Composition is also important because the borrow checker is smart enough to know that borrowing different struct fields are separate borrows. You can have a mutable borrow of one field while having an immutable borrow of another field, and so forth. Rust cannot tell that a method only accesses one field, so the fields should be structs with their own methods for implementation convenience. (The external interface of the struct can be anything you like using suitable traits.)
A concrete example of 'split borrowing' will make this clearer. We have a struct that owns some strings, with a method for borrowing the first string mutably.
struct Foo {
    one: String,
    two: String
}

impl Foo {
    fn borrow_one_mut(&mut self) -> &mut String {
        &mut self.one
    }
    ....
}
(This is an example of a Rust naming convention - such methods should end in _mut
)
Now, a method for borrowing both strings, reusing the first method:
fn borrow_both(&self) -> (&str,&str) {
    (self.borrow_one_mut(), &self.two)
}
Which can't work! We've borrowed mutably from self and also borrowed immutably from self.
If Rust allowed situations like this, then that immutable reference could not be guaranteed not to
change.
The solution is simple:
fn borrow_both(&self) -> (&str,&str) {
    (&self.one, &self.two)
}
And this is fine, because the borrow checker considers these to be independent borrows. So imagine that the fields were some arbitrary types, and you can see that methods called on these fields will not cause borrowing problems.
There is a restricted but very important kind of
'inheritance' with Deref,
which is the trait for the 'dereference' operator *
.
String
implements Deref<Target=str>
and so all the methods defined on &str
are automatically
available for String
as well! In a similar way, the methods of Foo
can be directly
called on Box<Foo>
. Some find this a little ... magical, but it is tremendously convenient.
There is a simpler language inside modern Rust, but it would not be half as pleasant to use.
It really should be used for cases where there is an owned, mutable type and a simpler borrowed
type.
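As a sketch of how this looks for your own types (the Name wrapper here is hypothetical):

use std::ops::Deref;

// a hypothetical owned wrapper around a string
struct Name {
    text: String,
}

impl Deref for Name {
    type Target = str;

    fn deref(&self) -> &str {
        &self.text
    }
}

fn main() {
    let n = Name{ text: "bonzo dog".to_string() };
    // &str methods are now available directly on Name
    println!("{}", n.to_uppercase());
    println!("{}", n.len());
}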
Generally in Rust there is trait inheritance:
trait Show {
    fn show(&self) -> String;
}

trait Location {
    fn location(&self) -> String;
}

trait ShowTell: Show + Location {}
The last trait simply combines our two distinct traits into one, although it could specify other methods.
Things now proceed as before:
#[derive(Debug)]
struct Foo {
    name: String,
    location: String
}

impl Foo {
    fn new(name: &str, location: &str) -> Foo {
        Foo{
            name: name.to_string(),
            location: location.to_string()
        }
    }
}

impl Show for Foo {
    fn show(&self) -> String {
        self.name.clone()
    }
}

impl Location for Foo {
    fn location(&self) -> String {
        self.location.clone()
    }
}

impl ShowTell for Foo {}
Now, if I have a value foo
of type Foo
, then a reference to that value will
satisfy &Show
, &Location
or &ShowTell
(which implies both.)
Here's a useful little macro:
macro_rules! dbg {
    ($x:expr) => {
        println!("{} = {:?}",stringify!($x),$x);
    }
}
It takes one argument (represented by $x
) which must be an 'expression'. We print out its
value, and a stringified version of the value. C programmers can be a little smug at this point,
but this means that if I passed 1+2
(an expression) then stringify!(1+2)
is the literal
string "1+2". This will save us some typing when playing with code:
let foo = Foo::new("Pete","bathroom");
dbg!(foo.show());
dbg!(foo.location());

let st: &ShowTell = &foo;

dbg!(st.show());
dbg!(st.location());

fn show_it_all(r: &ShowTell) {
    dbg!(r.show());
    dbg!(r.location());
}

let boo = Foo::new("Alice","cupboard");
show_it_all(&boo);

fn show(s: &Show) {
    dbg!(s.show());
}

show(&boo);

// foo.show() = "Pete"
// foo.location() = "bathroom"
// st.show() = "Pete"
// st.location() = "bathroom"
// r.show() = "Alice"
// r.location() = "cupboard"
// s.show() = "Alice"
This is object-orientation, just not the kind you may be used to.
Please note that the Show
reference passed to show
can not be dynamically
upgraded to a ShowTell
! Languages with more dynamic class systems allow you to
check whether a given object is an instance of a class and then to do a
dynamic cast to that type. It isn't really a good idea in general, and specifically
cannot work in Rust because that Show
reference has 'forgotten' that it was originally
a ShowTell
reference.
You always have a choice: polymorphic, via trait objects, or monomorphic, via generics constrainted by traits. Modern C++ and the Rust standard library tends to take the generic route, but the polymorphic route is not obsolete. You do have to understand the different trade-offs - generics generate the fastest code, which can be inlined. This may lead to code bloat. But not everything needs to be as fast as possible - it may only happen a 'few' times in the lifetime of a typical program run.
So, here's a summary:
- the role played by
class
is shared between data and traits - structs and enums are dumb, although you can define methods and do data hiding
- a limited form of subtyping is possible on data using the
Deref
trait - traits don't have any data, but can be implemented for any type (not just structs)
- traits can inherit from other traits
- traits can have provided methods, allowing interface code re-use
- traits give you both virtual methods (polymorphism) and generic constraints (monomorphism)
Example: Windows API
One of the areas where traditional OOP is used extensively is GUI toolkits. An EditControl
or a ListWindow
is-a Window
, and so forth. This makes writing Rust bindings to GUI toolkits more difficult
than it needs to be.
Win32 programming can be done directly in Rust, and it's a little less awkward than the original C. As soon as I graduated from C to C++ I wanted something cleaner and did my own OOP wrapper.
A typical Win32 API function is ShowWindow
which is used to control the visibility of a window. Now, an EditControl
has some specialized
functionality, but it's all done with a Win32 HWND
('handle to window') opaque value.
You would like EditControl
to also have a show
method, which traditionally would be done
by implementation inheritance. You do not want to have to type out all these inherited methods
for each type! But Rust traits provide a solution. There would be a Window
trait:
trait Window {
    // you need to define this!
    fn get_hwnd(&self) -> HWND;

    // and all these will be provided
    fn show(&self, visible: bool) {
        unsafe {
            user32_sys::ShowWindow(self.get_hwnd(), if visible {1} else {0})
        }
    }

    // ..... oodles of methods operating on Windows
}
So, the implementation struct for EditControl
can just contain a HWND
and implement Window
by defining one method; EditControl
is a trait that inherits from Window
and defines the extended
interface. Something like ComboBox
- which behaves like an EditControl
and a
ListWindow
can be easily implemented with trait inheritance.
The Win32 API ('32' no longer means '32-bit') is in fact object-oriented, but an
older style, influenced by Alan Kay's definition: objects contain hidden data, and are operated
on by messages. So at the heart of any Windows application there's a message loop, and
the various kinds of windows (called 'window classes') implement these methods with their
own switch statements. There is a message called WM_SETTEXT
but the implementation can be
different: a label's text changes, but a top-level window's caption changes.
Here is a rather promising
minimal Windows GUI framework. But to my taste, there are too many unwrap
instances
going on - and some of them aren't even errors. This is because NWG is exploiting the
loose dynamic nature of messaging. With a proper type-safe interface, more errors are
caught at compile-time.
The next edition of The Rust Programming Language book has a very good discussion on what 'object-oriented' means in Rust.
Parsing Text with Nom
Nom, (documented here) is a parser library for Rust which is well worth the initial time investment.
If you have to parse a known data format, like CSV or JSON, then it's best to use a specialized library like Rust CSV or the JSON libraries discussed in Section 4.
Likewise, for configuration files use dedicated parsers like ini or toml. (The last one is particularly cool since it integrates with the Serde framework, just as we saw with serde_json.
But if the text is not regular, or some made-up format, then you need to scan that text without writing a lot of tedious string-processing code. The suggested go-to is often regex, but regexes can be frustratingly opaque when sufficiently involved. Nom provides a way to parse text which is just as powerful and can be built up by combining simpler parsers. And regexes have their limits: for instance, you should not use regexes for parsing HTML, but you could use Nom to parse HTML. If you ever had the itch to write your own programming language, Nom is a good place for you to start on that hard road to obscurity.
There are some excellent tutorials for learning Nom, but I want to start at the hello-world level to build some initial familiarity. The basic things you need to know - first, Nom is macros all the way down, and second, Nom prefers to work with byte slices, not strings. The first means that you have to be especially careful to get Nom expressions right, because the error messages are not going to be friendly. And the second means that Nom can be used for any data format, not just text. People have used Nom to decode binary protocols and file headers. It can also work with 'text' in encodings other than UTF-8.
Recent versions of Nom work fine with string slices, although you need to use the macros that
end with _s
.
#[macro_use]
extern crate nom;

named!(get_greeting<&str,&str>,
    tag_s!("hi")
);

fn main() {
    let res = get_greeting("hi there");
    println!("{:?}",res);
}
// Done(" there", "hi")
The named!
macro creates functions which take some input type (&[u8]
by default)
and return the second type in angle brackets.
tag_s!
matches a literal string in the stream of characters, and its value is
a string slice representing that literal. (If you wanted to work with &[u8]
then
use the tag!
macro.)
We call the defined parser get_greeting
with a &str
and
get back an IResult.
And indeed we get back the matching value.
Look at " there" - this is the string slice left over after the match.
We want to ignore whitespace. By just wrapping the tag!
in ws!
we can match "hi" anywhere
among spaces, tabs or newlines:
named!(get_greeting<&str,&str>,
    ws!(tag_s!("hi"))
);

fn main() {
    let res = get_greeting("hi there");
    println!("{:?}",res);
}
// Done("there", "hi")
The result is "hi" as before, and the remaining string is "there"! The spaces have been skipped.
"hi" is matching nicely, although this isn't very useful yet.
Let's match either "hi" or "bye". The alt!
macro ("alternate") takes parser expressions
separated by |
and matches any of them. Note that you can use whitespace here to make
the parser function easier to read:
named!(get_greeting<&str,&str>,
    ws!(alt!(tag_s!("hi") | tag_s!("bye")))
);

println!("{:?}", get_greeting(" hi "));
println!("{:?}", get_greeting(" bye "));
println!("{:?}", get_greeting(" hola "));
// Done("", "hi")
// Done("", "bye")
// Error(Alt)
The last match failed because there is no alternative that matches "hola".
Clearly we need to understand this IResult
type to go further, but first let's compare this
with the regex solution:
let greetings = Regex::new(r"\s*(hi|bye)\s*").expect("bad regex");
let caps = greetings.captures(" hi ").expect("match failed");
println!("{:?}",caps);
// Captures({0: Some(" hi "), 1: Some("hi")})
Regular expressions are certainly more compact!
We needed to put '()' around the two possibilities
separated by '|' so that we will capture the greeting and nothing else. The first result is the
whole string, the second is the matched capture. ('|' is the so-called 'alternation' operator in
regexes, which is the motivation for the alt!
macro syntax.)
But this is a very simple regex, and they get complicated very quickly. Being a text mini-language, you
have to escape significant characters like *
and (
. If I wanted to match "(hi)" or
"(bye)" the regex becomes "\s*((hi|bye))\s*" but the Nom parser simply becomes
alt!(tag_s!("(hi)") | tag_s!("(bye)"))
.
It's also a heavy-weight dependency. On this fairly feeble i5 laptop, Nom examples take about 0.55 seconds to compile, which is not much more than "Hello world". But the regex examples take about 0.90s. And the stripped release build executable of the Nom example is about 0.3Mb (which is about as small as statically linked Rust programs go) versus 0.8Mb for the regex example.
What a Nom Parser returns
IResult has an interesting difference
from the standard Result
type - there are three possibilities:
- Done - success: you get both the result and the remaining bytes
- Error - failed to parse: you get an error
- Incomplete - more data needed
We can write a generic dump
function that handles any return value that can be debug-printed.
This also demonstrates the to_result
method which returns a regular Result
- this is probably
the method you will use for most cases since it returns either the returned value or an error.
#[macro_use]
extern crate nom;
use nom::IResult;
use std::str::from_utf8;
use std::fmt::Debug;

fn dump<T: Debug>(res: IResult<&str,T>) {
    match res {
        IResult::Done(rest, value) => {println!("Done {:?} {:?}",rest,value)},
        IResult::Error(err) => {println!("Err {:?}",err)},
        IResult::Incomplete(needed) => {println!("Needed {:?}",needed)}
    }
}

fn main() {
    named!(get_greeting<&str,&str>,
        ws!(
            alt!( tag_s!("hi") | tag_s!("bye"))
        )
    );

    dump(get_greeting(" hi "));
    dump(get_greeting(" bye hi"));
    dump(get_greeting(" hola "));

    println!("result {:?}", get_greeting(" bye ").to_result());
}
// Done Ok("") "hi"
// Done Ok("hi") "bye"
// Err Alt
// result Ok("bye")
Parsers returning any unparsed text, and being able to indicate that they don't have enough
input characters to decide, is very useful for stream parsing. But usually to_result
is your friend.
Combining Parsers
Let's continue the greeting example and imagine that a greeting consists of "hi" or "bye", plus
a name. nom::alpha
matches a series of alphabetical characters.
The pair!
macro will collect the result of matching two parsers as a tuple:
named!(full_greeting<&str,(&str,&str)>,
    pair!(
        get_greeting,
        nom::alpha
    )
);

println!("result {:?}", full_greeting(" hi Bob ").to_result());
// result Ok(("hi", "Bob"))
Now, further imagine that the greeter is perhaps a little shy or doesn't know anybody's name:
let us make the name optional. Naturally, the second value of the tuple becomes an Option
.
named!(full_greeting<&str, (&str,Option<&str>)>,
    pair!(
        get_greeting,
        opt!(nom::alpha)
    )
);

println!("result {:?}", full_greeting(" hi Bob ").to_result());
println!("result {:?}", full_greeting(" bye ?").to_result());
// result Ok(("hi", Some("Bob")))
// result Ok(("bye", None))
Notice that it was straightforward to combine an existing parser for greetings with a parser that picks up names, and then it was easy to make that name optional. This is the great power of Nom, and it's why it's called a "parser combinator library". You can build up your complicated parsers from simpler parsers, which you can test individually. (At this point, the equivalent regex is starting to look like a Perl program: regexes do not combine well.)
However, we are not yet home and dry! full_greeting(" bye ")
will fail with an
Incomplete
error. Nom knows that "bye" may be followed by a name and wants us to give it more
data. This is how a streaming parser needs to work, so you can feed it a file chunk by chunk,
but here we need to tell Nom that the input is complete.
named!(full_greeting<&str,(&str,Option<&str>)>,
    pair!(
        get_greeting,
        opt!(complete!(nom::alpha))
    )
);

println!("result {:?}", full_greeting(" bye ").to_result());
// result Ok(("bye", None))
Parsing Numbers
Nom provides a function digit
which matches a series of numerical digits.
So we use map!
, to convert the string into an integer,
and return the full Result
type.
use nom::digit;
use std::str::FromStr;
use std::num::ParseIntError;

named!(int8 <&str, Result<i8,ParseIntError>>,
    map!(digit, FromStr::from_str)
);

named!(int32 <&str, Result<i32,ParseIntError>>,
    map!(digit, FromStr::from_str)
);

println!("{:?}", int8("120"));
println!("{:?}", int8("1200"));
println!("{:?}", int8("x120"));
println!("{:?}", int32("1202"));

// Done("", Ok(120))
// Done("", Err(ParseIntError { kind: Overflow }))
// Error(Digit)
// Done("", Ok(1202))
So what we get is a parser IResult
containing a conversion Result
- and sure enough, there
is more than one way to fail here. Note that the body of our converting function has exactly
the same code; the actual conversion depends on the return type of the function.
Integers may have a sign. We can capture integers as a pair, where the first value may be a sign, and the second value would be any digits following.
Consider:
named!(signed_digits<&str, (Option<&str>,&str)>,
    pair!(
        opt!(alt!(tag_s!("+") | tag_s!("-"))),  // maybe sign?
        digit
    )
);

println!("signed {:?}", signed_digits("4"));
println!("signed {:?}", signed_digits("+12"));
// signed Done("", (None, "4"))
// signed Done("", (Some("+"), "12"))
When we aren't interested in the intermediate results, but just want all the matching
input, then recognize!
is what you need.
named!(maybe_signed_digits<&str,&str>,
    recognize!(signed_digits)
);

println!("signed {:?}", maybe_signed_digits("+12"));
// signed Done("", "+12")
With this technique, we can recognize floating-point numbers. Again, recognize!
gives us all the matched input back as a single string slice. tuple!
is the generalization of pair!
,
although we aren't interested in the generated tuple here. complete!
is needed to resolve
the same problem we had with incomplete greetings - "12" is a valid number without the
optional floating-point part.
named!(floating_point<&str,&str>,
    recognize!(
        tuple!(
            maybe_signed_digits,
            opt!(complete!(pair!(
                tag_s!("."),
                digit
            ))),
            opt!(complete!(pair!(
                alt!(tag_s!("e") | tag_s!("E")),
                maybe_signed_digits
            )))
        )
    )
);
By defining a little helper macro, we get some passing tests. The test
passes if floating_point
matches all of the string that it is given.
macro_rules! nom_eq {
    ($p:expr, $e:expr) => (
        assert_eq!($p($e).to_result().unwrap(), $e)
    )
}

nom_eq!(floating_point, "+2343");
nom_eq!(floating_point, "-2343");
nom_eq!(floating_point, "2343");
nom_eq!(floating_point, "2343.23");
nom_eq!(floating_point, "2e20");
nom_eq!(floating_point, "2.0e-6");
(Although sometimes macros feel a little dirty, making your tests pretty is a fine thing.)
And then we can parse and convert floating point numbers. Here I'll throw caution to the winds and throw away the error:
named!(float64<&str, f64>,
    map_res!(floating_point, FromStr::from_str)
);
Please note how it's possible to build up complicated parsers step by step, testing each part in isolation first. That's a strong advantage of parser combinators over regexes. It is very much the classic programming tactic of divide-and-rule.
Operations over Multiple Matches
We've met pair!
and tuple!
which capture a fixed number of matches as Rust tuples.
There is also many0
and many1
- they both capture indefinite numbers of matches as vectors.
The difference is that the first may capture 'zero or many' and the second 'one or many' (like the
difference between the regex *
versus +
modifiers.) So many1!(ws!(float64))
would
parse "1 2 3" into vec![1.0,2.0,3.0]
, but will fail on the empty string.
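As a minimal sketch (the parser name numbers is invented; it simply reuses the float64 parser defined above):

named!(numbers<&str, Vec<f64>>,
    many1!(ws!(float64))
);

println!("{:?}", numbers("1 2 3").to_result());
// should print Ok([1.0, 2.0, 3.0])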
fold_many0
is a reducing operation. The match values are combined into a single value,
using a binary operator.
For instance, this is how Rust people did sums over iterators before sum
was added; this fold
starts with an initial value (here zero) for the accumulator and keeps adding values to
that accumulator using +
.
let res = [1,2,3].iter().fold(0, |acc, v| acc + v);
println!("{}", res);
// 6
Here's the Nom equivalent:
named!(fold_sum<&str,f64>,
    fold_many1!(
        ws!(float64),
        0.0,
        |acc, v| acc + v
    )
);

println!("fold {}", fold_sum("1 2 3").to_result().unwrap());
//fold 6
Up to now, we've had to capture every expression, or just grab all matching bytes with recognize!
:
named!(pointf<(f64,&[u8],f64)>,
    tuple!(
        float64,
        tag_s!(","),
        float64
    )
);

println!("got {:?}", nom_res!(pointf, "20,52.2").unwrap());
//got (20, ",", 52.2)
For more complicated expressions, capturing the results of all the parsers leads to rather untidy types! We can do better.
do_parse!
lets you extract only the values you're interested in. The matches are separated
with >>
- the matches of interest are of the form name: parser
. Finally, there's a code
block in parentheses.
#[derive(Debug)]
struct Point {
    x: f64,
    y: f64
}

named!(pointf<Point>,
    do_parse!(
        first: float64 >>
        tag_s!(",") >>
        second: float64 >>
        (Point{x: first, y: second})
    )
);

println!("got {:?}", nom_res!(pointf, "20,52.2").unwrap());
// got Point { x: 20, y: 52.2 }
We're not interested in that tag's value (it can only be a comma) but we assign the two float values to temporary values which are used to build a struct. The code at the end can be any Rust expression.
Parsing Arithmetic Expressions
With the necessary background established, we can do simple arithmetic expressions. This is a good example of something that really can't be done with regexes.
The idea is to build up expressions from the bottom up. Expressions consist of terms, which are added or subtracted. Terms consist of factors, which are multiplied or divided. And (for now) factors are just floating-point numbers:
named!(factor<f64>,
    ws!(float64)
);

named!(term<&str,f64>, do_parse!(
    init: factor >>
    res: fold_many0!(
        tuple!(
            alt!(tag_s!("*") | tag_s!("/")),
            factor
        ),
        init,
        |acc, v:(_,f64)| {
            if v.0 == "*" {acc * v.1} else {acc / v.1}
        }
    )
    >> (res)
));

named!(expr<&str,f64>, do_parse!(
    init: term >>
    res: fold_many0!(
        tuple!(
            alt!(tag_s!("+") | tag_s!("-")),
            term
        ),
        init,
        |acc, v:(_,f64)| {
            if v.0 == "+" {acc + v.1} else {acc - v.1}
        }
    )
    >> (res)
));
This expresses our definitions more precisely - an expression consists of at least one term, and then
zero or many plus-or-minus terms. We don't collect them, but fold them using the appropriate
operator. (It's one of those cases where Rust can't quite work out the type of the expression, so
we need a type hint.) Doing it like this establishes the correct operator precedence - *
always
wins over +
and so forth.
We're going to need floating-point asserts here, and there's a crate for that.
Add the line approx="0.1.1" to the dependencies in your Cargo.toml, and away we go:
#[macro_use]
extern crate approx;
...
assert_relative_eq!(fold_sum("1 2 3").to_result().unwrap(), 6.0);
Let's define a convenient little testing macro. stringify!
turns the expression into a string
literal which we can feed into expr
and then compare the result with how Rust would
evaluate it.
macro_rules! expr_eq {
    ($e:expr) => (
        assert_relative_eq!(
            expr(stringify!($e)).to_result().unwrap(),
            $e
        )
    )
}

expr_eq!(2.3);
expr_eq!(2.0 + 3.0 - 4.0);
expr_eq!(2.0*3.0 - 4.0);
This is very cool - a few lines to get an expression evaluator! But it gets better.
We add an alternative to numbers in the factor
parser - expressions contained inside
parentheses:
named!(factor<&str,f64>,
    alt!(
        ws!(float64) |
        ws!(delimited!( tag_s!("("), expr, tag_s!(")") ))
    )
);

expr_eq!(2.2*(1.1 + 4.5)/3.4);
expr_eq!((1.0 + 2.0)*(3.0 + 4.0*(5.0 + 6.0)));
The coolness is that expressions are now defined recursively in terms of expressions!
The particular magic of delimited!
is that parentheses may be nested - Nom makes sure
the brackets match up.
We are now way past the capabilities of regular expressions, and the stripped executable at 0.5Mb is still half the size of a "hello world" regex program.
Pain Points
It is true to say that Rust is a harder language to learn than most 'mainstream' languages. There are exceptional people who don't find it so difficult, but note the strict meaning of 'exceptional' - they are exceptions. Many struggle at first, and then succeed. Initial difficulties aren't predictive of later competency!
We all come from somewhere, and in the case of programming languages this means previous exposure to mainstream languages like one of the 'dynamic' languages like Python or one of the 'static' languages like C++. Either way, Rust is sufficiently different to require mental retooling. Clever people with experience jump in and are disappointed that their cleverness is not immediately rewarded; people with less self-worth think they aren't 'clever' enough.
For those with dynamic language experience (in which I would include Java) everything is a reference, and all references are mutable by default. And garbage collection does make it easier to write memory-safe programs. A lot has gone into making the JVM pretty fast, at the cost of memory use and predictability. Often that cost is considered worth it - the old new idea that programmer productivity is more important than computer performance.
But most computers in the world - the ones that handle really important things like throttle control on cars - don't have the massive resources that even a cheap laptop has, and they need to respond to events in real time. Likewise, basic software infrastructure needs to be correct, robust, and fast (the old engineering trinity). Much of this is done in C and C++ which are inherently unsafe - the total cost of this unsafety is the thing to look at here. Maybe you knock the program together quicker, but then the real development starts.
System languages can't afford garbage collection, because they are the bedrock on which everything rests. They allow you to be free to waste resources as you see fit.
If there is no garbage collection, then memory must be managed in other ways. Manual memory management - I grab memory, use it, and explicitly give it back - is hard to get right. You can learn enough C to be productive and dangerous in a few weeks - but it takes years to become a good safe C programmer, checking every possible error condition.
Rust manages memory like modern C++ - as objects are destroyed, their
memory is reclaimed. You can allocate memory on the heap with Box
, but
as soon as that box 'goes out of scope' at the end of the function, the
memory is reclaimed. So there is something like new
but nothing like
delete
. You create a File
and at the end, the file handle (a precious
resource) is closed. In Rust this is called dropping.
You need to share resources - it's very inefficient to make copies of
everything - and that's where things get interesting. C++ also has
references, although Rust references are rather more like C pointers -
you need to say *r
to refer to the value, you need to say &
to
pass a value as a reference.
Rust's borrow checker makes sure that it is impossible for a reference to exist after the original value is destroyed.
Type Inference
The distinction between 'static' and 'dynamic' isn't everything. Like with
most things, there are more dimensions in play. C is statically-typed
(every variable has a type at compile-time) but weakly-typed (e.g. void*
can point to anything); Python is dynamically-typed (the type is in
the value, not the variable) but strongly-typed. Java is static/sorta strong
(with reflection as convenient/dangerous escape valve) and Rust is
static/strong, with no runtime reflection.
Java is famous for needing all those types typed out in numbing detail, whereas
Rust likes to infer types. This is generally a good idea, but it does
mean that you sometimes need to work out what the actual types are. You
will see let n = 100
and wonder - what kind of integer is this? By
default, it would be i32
- a four-byte signed integer. Everyone agrees
by now that C's unspecified integer types (like int
and long
) are
a bad idea; better to be explicit. You can always spell out the type,
as in let n: u32 = 100
or let the literal force the type, as in
let n = 100u32
. But type inference goes much further than that!
If you declare let n = 100
then all rustc
knows that n
must be
some integer type. If you then passed n
to a function expecting
a u64
then that must be the type of n
!
After that, you try to pass n
to a function expecting u32
.
rustc
will not let you do this, because n
has been tied down to
u64
and it will not take the easy way out and convert that
integer for you. This is strong typing in action - there are none
of those little conversions and promotions which make your life
smoother until integer overflow bites your ass suddenly. You would have
to explicitly pass n
as n as u32
- a Rust typecast. Fortunately,
rustc
is good at breaking the bad news in an 'actionable' way - that is,
you can follow the compiler's advice about fixing the problem.
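To make that concrete, here is a minimal sketch - the two functions are invented purely for illustration:

fn take_u64(n: u64) -> u64 { n * 2 }
fn take_u32(n: u32) -> u32 { n * 2 }

fn main() {
    let n = 100;                          // just 'some integer' at this point
    println!("{}", take_u64(n));          // now n is pinned down to u64
    // println!("{}", take_u32(n));       // error: expected u32, found u64
    println!("{}", take_u32(n as u32));   // the explicit cast is required
}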
So, Rust code can be very free of explicit types:
let mut v = Vec::new();  // v is deduced to have type Vec<i32>
v.push(10);
v.push(20);
v.push("hello")  // <--- just can't do this, man!
Not being able to put strings into a vector of integers is a feature, not a bug. The flexibility of dynamic typing is also a curse.
(If you do need to put integers and strings into the same vector, then
Rust enum
types are the way to do it safely.)
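For instance, a minimal sketch of that idea (the enum name Value is invented for illustration):

// A little enum that can hold either an integer or an owned string
#[derive(Debug)]
enum Value {
    Int(i64),
    Str(String),
}

fn main() {
    let v = vec![Value::Int(42), Value::Str("hello".to_string())];
    for x in &v {
        println!("{:?}", x);
    }
}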
Sometimes you need to at least give a type hint. collect
is a
fantastic iterator method, but it needs a hint. Say I have a
iterator returning char
. Then collect
can swing two ways:
// a vector of char ['h','e','l','l','o']
let v: Vec<_> = "hello".chars().collect();
// a string "doy"
let m: String = "dolly".chars().filter(|&c| c != 'l').collect();
When feeling uncertain about the type of a variable, there's always this
trick, which forces rustc
to reveal the actual type name in an
error message:
let x: () = var;
rustc
may pick an over-specific type. Here we want to put different
references into a vector as &Debug
but need to declare the type
explicitly.
use std::fmt::Debug;

let answer = 42;
let message = "hello";
let float = 2.7212;

let display: Vec<&Debug> = vec![&message, &answer, &float];

for d in display {
    println!("got {:?}", d);
}
Mutable References
The rule is: only one mutable reference at a time. The reason is that tracking mutability is hard when it can happen all over the place. Not obvious in dinky little programs, but things can get bad in big codebases.
The further constraint is that you can't have immutable references while
there's a mutable reference out. Otherwise, anybody who has those
references doesn't have a guarantee that they won't change. C++ also
has immutable references (e.g. const string&
) but does not give
you this guarantee that someone can't keep a string&
reference and modify it
behind your back.
This is a challenge if you are used to languages where every reference is mutable! Unsafe, 'relaxed' languages depend on people understanding their own programs and nobly deciding not to do Bad Things. But big programs are written by more than one person and are beyond the power of a single individual to understand in detail.
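Here is a minimal sketch of the rule in action - the commented-out lines are what the borrow checker forbids:

fn main() {
    let mut x = 42;
    let r = &mut x;          // the one allowed mutable borrow of x
    // let r2 = &mut x;      // error: cannot borrow `x` as mutable more than once
    // println!("{}", x);    // error: cannot use `x` while it is mutably borrowed
    *r += 1;
    println!("{}", r);       // 43
}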
The irritating thing is that the borrow checker is not as smart as it could be.
let mut m = HashMap::new();
m.insert("one", 1);
m.insert("two", 2);

if let Some(r) = m.get_mut("one") { // <-- mutable borrow of m
    *r = 10;
} else {
    m.insert("one", 1); // can't borrow mutably again!
}
Clearly this does not really violate the Rules since if we got None
we
haven't actually borrowed anything from the map.
There are various ugly workarounds:
let mut found = false;
if let Some(r) = m.get_mut("one") {
    *r = 10;
    found = true;
}
if ! found {
    m.insert("one", 1);
}
Which is yucky, but it works because the bothersome borrow is kept to the first if-statement.
The better way here is to use HashMap
's entry API.
use std::collections::hash_map::Entry;

match m.entry("one") {
    Entry::Occupied(e) => {
        *e.into_mut() = 10;
    },
    Entry::Vacant(e) => {
        e.insert(1);
    }
};
The borrow checker will get less frustrating when non-lexical lifetimes arrive sometime this year.
The borrow checker does understand some important cases, however. If you have a struct, fields can be independently borrowed. So composition is your friend; a big struct should contain smaller structs, which have their own methods. Defining all the mutable methods on the big struct will lead to a situation where you can't modify things, even though the methods might only refer to one field.
With mutable data, there are special methods for treating parts of the
data independently. For instance, if you have a mutable slice, then split_at_mut
will split this into two mutable slices. This is perfectly safe, since Rust
knows that the slices do not overlap.
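For example, a minimal sketch of split_at_mut in action:

fn main() {
    let mut v = [1, 2, 3, 4, 5];
    {
        // two non-overlapping mutable slices over the same array
        let (left, right) = v.split_at_mut(2);
        left[0] = 10;
        right[0] = 30;
    }
    println!("{:?}", v); // [10, 2, 30, 4, 5]
}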
References and Lifetimes
Rust cannot allow a situation where a reference outlives the value. Otherwise we would have a 'dangling reference' where it refers to a dead value - a segfault is inevitable.
rustc
can often make sensible assumptions about lifetimes in functions:
fn pair(s: &str, ch: char) -> (&str, &str) {
    if let Some(idx) = s.find(ch) {
        (&s[0..idx], &s[idx+1..])
    } else {
        (s, "")
    }
}

fn main() {
    let p = pair("hello:dolly", ':');
    println!("{:?}", p);
}
// ("hello", "dolly")
This is quite safe because we cope with the case where the delimiter isn't found.
rustc
is here assuming that both strings in the tuple are borrowed from the
string passed as an argument to the function.
Explicitly, the function definition looks like this:
fn pair<'a>(s: &'a str, ch: char) -> (&'a str, &'a str) {...}
What the notation says is that the output strings live at least as long as the
input string. It's not saying that the lifetimes are the same, we could drop them
at any time, just that they cannot outlive s
.
So, rustc
makes common cases prettier with lifetime elision.
Now, if that function received two strings, then you would need to explicitly do lifetime annotation to tell Rust what output string is borrowed from what input string.
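As a minimal sketch (the function up_to is invented for illustration), two lifetime parameters say that the result is borrowed from the first argument only:

// The returned slice comes from `text`, never from `sep`.
fn up_to<'a, 'b>(text: &'a str, sep: &'b str) -> &'a str {
    match text.find(sep) {
        Some(idx) => &text[0..idx],
        None => text,
    }
}

fn main() {
    let owned = String::from("hello:dolly");
    println!("{}", up_to(&owned, ":")); // hello
}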
You always need an explicit lifetime when a struct borrows a reference:
struct Container<'a> {
    s: &'a str
}
Which is again insisting that the struct cannot outlive the reference.
For both structs and functions, the lifetime needs to be declared in <>
like a type parameter.
Closures are very convenient and a powerful feature - a lot of the power
of Rust iterators comes from them. But if you store them, you have
to specify a lifetime. This is because basically a closure is a generated
struct that can be called, and that by default borrows its environment.
Here the linear
closure has immutable references to m
and c
.
let m = 2.0;
let c = 0.5;

let linear = |x| m*x + c;
let sc = |x| m*x.cos();
...
Both linear and sc implement Fn(f64) -> f64 but they are not
the same animal - they have different types and sizes! So to store
them you have to make a Box<Fn(f64) -> f64 + 'a>.
Very irritating if you're used to how fluent closures are in Javascript
or Lua, but C++ does a similar thing to Rust and needs std::function
to store different closures, taking a little penalty for the virtual
call.
Strings
It is common to feel irritated with Rust strings in the beginning. There are different ways to create them, and they all feel verbose:
let s1 = "hello".to_string();
let s2 = String::from("dolly");
Isn't "hello" already a string? Well, in a way. String
is an owned string,
allocated on the heap; a string literal "hello" is of type &str
("string slice")
and might be either baked into the executable ("static") or borrowed from a String
.
System languages need this distinction - consider a tiny microcontroller, which has
a little bit of RAM and rather more ROM. Literal strings will get stored in ROM
("read-only") which is both cheaper and consumes much less power.
But (you may say) it's so simple in C++:
std::string s = "hello";
Which is shorter yes, but hides the implicit creation of a string object.
Rust likes to be explicit about memory allocations, hence to_string
.
On the other hand, to borrow from a C++ string requires c_str
, and
C strings are stupid.
Fortunately, things are better in Rust - once you accept that both String
and &str
are necessary. The methods of String
are mostly for changing the string,
like push
adding a char (under the hood it's very much like a Vec<u8>
).
But all the methods of &str
are also available. By the same Deref
mechanism, a String
can be passed as &str
to a function - which is
why you rarely see &String
in function definitions.
There are a number of ways to convert &str
to String
, corresponding
to various traits. Rust needs these traits to work with types generically.
As a rule of thumb, anything that implements Display
also knows to_string
,
like 42.to_string()
.
Some operators may not behave according to intuition:
let s1 = "hello".to_string();
let s2 = s1.clone();
assert!(s1 == s2);       // cool
assert!(s1 == "hello");  // fine
assert!(s1 == &s2);      // WTF?
Remember, String
and &String
are different types, and ==
isn't
defined for that combination. This might puzzle a C++ person who is
used to references being almost interchangeable with values.
Furthermore, &s2
doesn't magically become a &str
, that's
a deref coercion which only happens when assigning to a &str
variable or argument. (The explicit s2.as_str()
would work.)
However, this more genuinely deserves a WTF:
let s3 = s1 + s2;  // <--- no can do
You cannot concatenate two String
values, but you can concatenate
a String
with a &str
. You furthermore cannot concatenate a
&str
with a String
. So mostly people don't use +
and use
the format!
macro, which is convenient but not so efficient.
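Here is a minimal sketch of what does and does not compile (the variable names are invented):

fn main() {
    let s1 = "hello".to_string();
    let s2 = "dolly".to_string();

    // let s3 = s1 + s2;        // error: String + String is not defined
    // let s4 = "hi " + s2;     // error: &str + String is not defined either
    let s5 = s1 + " " + &s2;    // String + &str works (s1 is moved here)
    let s6 = format!("{} and {}", s5, s2); // often clearer, if less efficient
    println!("{}", s6);
}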
Some string operations are available but work differently. For instance,
languages often have a split
method for breaking up a string into an array
of strings. This method for Rust strings returns an iterator, which
you can then collect into a vector.
let parts: Vec<_> = s.split(',').collect();
This is a bit clumsy if you are in a hurry to get a vector. But you can do operations on the parts without allocating a vector! For instance, length of largest string in the split?
let max = s.split(',').map(|s| s.len()).max().unwrap();
(The unwrap
is because an empty iterator has no maximum and we must
cover this case.)
The collect
method returns a Vec<&str>
, where the parts are
borrowed from the original string - we only need allocate space
for the references. There is no method like this in C++, but until
recently it would have to individually allocate each substring. (C++ 17
has std::string_view
which behaves like a Rust string slice.)
A Note on Semicolons
Semicolons are not optional, but usually left out in the same places as
in C, e.g. after {}
blocks. They also aren't needed after enum
or
struct
(that's a C peculiarity.) However, if the block must have a
value, then the semi-colons are dropped:
let msg = if ok {"ok"} else {"error"};
Note that there must be a semi-colon after this let
statement!
If there were semicolons after these string literals then the returned
value would be ()
(like Nothing
or void
). It's a common error when
defining functions:
fn sqr(x: f64) -> f64 {
    x * x;
}
rustc
will give you a clear error in this case.
C++-specific Issues
Rust value semantics are Different
In C++, it's possible to define types which behave exactly like primitives and copy themselves. In addition, a move constructor can be defined to specify how a value can be moved out of a temporary context.
In Rust, primitives behave as expected, but the Copy
trait can only
be defined if the aggregate type (struct, tuple or enum) itself contains
only copyable types. Arbitrary types may have Clone
, but you have
to call the clone
method on values. Rust requires any allocation
to be explicit and not hide in copy constructors or assignment operators.
So, copying and moving is always defined as just moving bits around and is not overrideable.
If s1
is a non Copy
value type, then s2 = s1;
causes a move to happen,
and this consumes s1
! So, when you really want a copy, use clone
.
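A minimal sketch of the difference:

fn main() {
    let s1 = "hello".to_string();
    let s2 = s1;                // a move: s1 can no longer be used
    // println!("{}", s1);      // error: value used here after move
    let s3 = s2.clone();        // an explicit copy of the data
    println!("{} {}", s2, s3);  // both are fine
}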
Borrowing is often better than copying, but then you must follow the
rules of borrowing. Fortunately, borrowing is an overridable behaviour.
For instance, String
can be borrowed as &str
, and shares all the
immutable methods of &str
. String slices are very powerful compared
to the analogous C++ 'borrowing' operation, which is to extract a const char*
using c_str
. &str
consists of a pointer to some owned bytes (or a string
literal) and a size. This leads to some very memory-efficient patterns.
You can have a Vec<&str>
where all the strings have been borrowed from
some underlying string - only space for the vector needs to be allocated:
For example, splitting by whitespace:
fn split_whitespace(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}
Likewise, a C++ s.substr(0,2)
call will always copy the string, but a slice
will just borrow: &s[0..2]
.
There is an equivalent relationship between Vec<T>
and &[T]
.
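For instance, a minimal sketch of borrowing a slice (the function name sum_all is invented):

// Accepts any contiguous sequence of f64: a Vec, an array, or part of one
fn sum_all(values: &[f64]) -> f64 {
    values.iter().sum()
}

fn main() {
    let v = vec![1.0, 2.0, 3.0];
    println!("{}", sum_all(&v));        // a Vec<f64> coerces to &[f64]
    println!("{}", sum_all(&v[0..2]));  // so does a slice of it
}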
Shared References
Rust has smart pointers like C++ - for instance, the equivalent of
std::unique_ptr
is Box
. There's no need for delete
, since any
memory or other resources will be reclaimed when the box goes out of
scope (Rust very much embraces RAII).
let mut answer = Box::new("hello".to_string());
*answer = "world".to_string();
answer.push('!');
println!("{} {}", answer, answer.len());
People find to_string
irritating at first, but it is explicit.
Note the explicit dereference *
, but methods on smart pointers
don't need any special notation (we do not say (*answer).push('!')
)
Obviously, borrowing only works if there is a clearly defined owner of the original content. In many designs this isn't possible.
In C++, this is where std::shared_ptr
is used; copying just involves
modifying a reference count on the common data. This is not without
cost, however:
- even if the data is read-only, constantly modifying the reference count can cause cache invalidation
- std::shared_ptr is designed to be thread-safe and carries locking overhead as well
In Rust, std::rc::Rc
also acts like a shared smart pointer using
reference-counting. However, it is for immutable references only! If you
want a thread-safe variant, use std::sync::Arc
(for 'Atomic Rc').
So Rust is being a little awkward here in providing two variants, but you
get to avoid the locking overhead for non-threaded operations.
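A minimal sketch of shared ownership with Rc:

use std::rc::Rc;

fn main() {
    let first = Rc::new("hello".to_string());
    let second = first.clone();        // bumps the reference count; no deep copy
    println!("{} {}", first, second);  // both point at the same String
    println!("count {}", Rc::strong_count(&first)); // 2
}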
These must be immutable references because that is fundamental to Rust's
memory model. However, there's a get-out card: std::cell::RefCell
.
If you have a shared reference defined as Rc<RefCell<T>>
then you
can mutably borrow using its borrow_mut
method. This applies the
Rust borrowing rules dynamically - so e.g. any attempt to call
borrow_mut
when a borrow was already happening will cause a panic.
This is still safe. Panics will happen before any memory has been touched inappropriately! Like exceptions, they unroll the call stack. So it's an unfortunate word for such a structured process - it's an ordered withdrawal rather than a panicked retreat.
The full Rc<RefCell<T>>
type is clumsy, but the application code isn't
unpleasant. Here Rust (again) is preferring to be explicit.
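A minimal sketch of shared, mutable state with Rc<RefCell<T>>:

use std::rc::Rc;
use std::cell::RefCell;

fn main() {
    let shared = Rc::new(RefCell::new(vec![1, 2, 3]));
    let also_shared = shared.clone();

    shared.borrow_mut().push(4);             // dynamically checked mutable borrow
    println!("{:?}", also_shared.borrow());  // [1, 2, 3, 4]
    // calling borrow_mut() while a borrow() is still live would panic at runtime
}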
If you wanted thread-safe access to shared state, then Arc<T>
is the
only safe way to go. If you need mutable access, then Arc<Mutex<T>>
is the equivalent of Rc<RefCell<T>>
. Mutex
works a little differently
than how it's usually defined: it is a container for a value. You get
a lock on the value and can then modify it.
let answer = Arc::new(Mutex::new(10));

// in another thread ..
{
    let mut answer_ref = answer.lock().unwrap();
    *answer_ref = 42;
}
Why the unwrap
? If the previous holding thread panicked, then
this lock
fails. (It's one place in the documentation where unwrap
is considered a reasonable thing to do, since clearly things have
gone seriously wrong. Panics can always be caught on threads.)
It's important (as always with mutexes) that this exclusive lock is held for as little time as possible. So it's common for them to happen in a limited scope - then the lock ends when the mutable reference goes out of scope.
Compared with the apparently simpler situation in C++ ("use shared_ptr dude")
this seems awkward. But now any modifications of shared state become obvious,
and the Mutex
lock pattern forces thread safety.
Like everything, use shared references with caution.
Iterators
Iterators in C++ are defined fairly informally; they involve smart pointers,
usually starting with c.begin()
and ending with c.end()
. Operations on
iterators are then implemented as stand-alone template functions, like std::find_if
.
Rust iterators are defined by the Iterator
trait; next
returns an Option
and when
the Option
is None
we are finished.
The most common operations are now methods.
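As a minimal sketch, here is a hand-rolled iterator implementing that trait (the Countdown type is invented for illustration):

// Counts down from a starting value to 1, then stops.
struct Countdown {
    n: u32,
}

impl Iterator for Countdown {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.n == 0 {
            None                // finished
        } else {
            let current = self.n;
            self.n -= 1;
            Some(current)
        }
    }
}

fn main() {
    let v: Vec<u32> = Countdown { n: 3 }.collect();
    println!("{:?}", v); // [3, 2, 1]
}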
Here is the equivalent of find_if
. It returns an Option
(case
of not finding is None
) and here the if let
statement is convenient for
extracting the non-None
case:
let arr = [10, 2, 30, 5];
if let Some(res) = arr.iter().find(|&&x| x == 2) {
    // res is a reference to the matching element, here &2
}
Unsafety and Linked Lists
It's no secret that parts of the Rust stdlib are implemented using unsafe
. This
does not invalidate the conservative approach of the borrow checker. Remember that
"unsafe" has a particular meaning - operations which Rust cannot fully verify at
compile time. From Rust's perspective, C++ operates in unsafe mode all the time!
So if a large application needs a few dozen lines of unsafe code, then that's fine,
since these few lines can be carefully checked by a human. Humans are not good at
checking 100Kloc+ of code.
I mention this, because there appears to be a pattern:
an experienced C++ person tries to implement a linked list or a tree structure,
and gets frustrated. Well, a double-linked list is possible in safe Rust,
with Rc
references going forward, and Weak
references going back. But the
standard library gets more performance out of using... pointers.