Filesystem and Processes
Another look at Reading Files
At the end of Part 1, I showed how to read a whole file into a string. Naturally this isn't always such a good idea, so here is how to read a file line-by-line.
fs::File
implements io::Read
, which is the trait for anything readable.
This trait defines a read
method which will fill a slice of u8
with bytes -
this is the only required method of the trait, and you get some provided methods
for free, much like with Iterator
. You can use read_to_end
to fill a vector of
bytes with contents from the readable, and read_to_string
to fill a string - which
must be UTF-8 encoded.
This is a 'raw' read, with no buffering. For buffered reading there is the
io::BufRead
trait which gives us read_line
and a lines
iterator.
io::BufReader
will provide an implementation of io::BufRead
for any readable.
fs::File
also implements io::Write
.
The easiest way to make sure all these traits are visible is use std::io::prelude::*
.
# #![allow(unused_variables)] # #fn main() { use std::fs::File; use std::io; use std::io::prelude::*; fn read_all_lines(filename: &str) -> io::Result<()> { let file = File::open(&filename)?; let reader = io::BufReader::new(file); for line in reader.lines() { let line = line?; println!("{}", line); } Ok(()) } #}
The let line = line?
may look a bit strange. The line
returned by the
iterator is actually an io::Result<String>
which we unwrap with ?
.
Because things can go wrong during this iteration - I/O errors, swallowing
a chunk of bytes that aren't UTF-8, and so forth.
lines
being an iterator, it is straightforward to read a file into a vector
of strings using collect
, or print out the line with line numbers using the
enumerate
iterator.
It isn't the most efficient way to read all the lines, however, because a new
string is allocated for each line. It is more efficient to use read_line
, although
more awkward. Note that the returned line includes the linefeed, which
can be removed using trim_right
.
# #![allow(unused_variables)] # #fn main() { let mut reader = io::BufReader::new(file); let mut buf = String::new(); while reader.read_line(&mut buf)? > 0 { { let line = buf.trim_right(); println!("{}", line); } buf.clear(); } #}
This results in far less allocations, because clearing that string does not free its allocated memory; once the string has enough capacity, no more allocations will take place.
This is one of those cases where we use a block to control a borrow. line
is
borrowed from buf
, and this borrow must finish before we modify buf
. Again,
Rust is trying to stop us doing something stupid, which is to access line
after
we've cleared the buffer. (The borrow checker can be restrictive sometimes.
Rust is due to get 'non-lexical lifetimes', where
it will analyze the code and see that line
isn't used after buf.clear()
.)
This isn't very pretty. I cannot give you a proper iterator that returns references to a buffer, but I can give you something that looks like an iterator.
First define a generic struct;
the type parameter R
is 'any type that implements Read'. It contains the reader
and the buffer which we are going to borrow from.
# #![allow(unused_variables)] # #fn main() { // file5.rs use std::fs::File; use std::io; use std::io::prelude::*; struct Lines<R> { reader: io::BufReader<R>, buf: String } impl <R: Read> Lines<R> { fn new(r: R) -> Lines<R> { Lines{reader: io::BufReader::new(r), buf: String::new()} } ... } #}
Then the next
method. It returns an Option
- just like an iterator, when it
returns None
the iterator finishes. The returned type is a Result
because
read_line
might fail, and we never throw errors away. So if fails, we
wrap up its error in a Some<Result>
. Otherwise, it may have read zero bytes,
which is the natural end of the file - not an error, just a None
.
At this point, the buffer contains the line with a linefeed (`\n') appended. Trim this away, and package up the string slice.
# #![allow(unused_variables)] # #fn main() { fn next<'a>(&'a mut self) -> Option<io::Result<&'a str>>{ self.buf.clear(); match self.reader.read_line(&mut self.buf) { Ok(nbytes) => if nbytes == 0 { None // no more lines! } else { let line = self.buf.trim_right(); Some(Ok(line)) }, Err(e) => Some(Err(e)) } } #}
Now, note how the lifetimes work. We need an explicit lifetime because Rust will never
allow us to hand out borrowed string slices without knowing their lifetime. And here
we say that the lifetime of this borrowed string is within the lifetime of self
.
And this signature, with the lifetime, is incompatible with the interface of Iterator
.
But it's easy to see problems if it were compatible; consider collect
trying to make
a vector of these string slices. There's no way this could work, since they're all
borrowed from the same mutable string! (If you had read all the file into a string, then
the string's lines
iterator can return string slices because they are all borrowed from
distinct parts of the original string.)
The resulting loop is much cleaner, and the file buffering is invisible to the user.
# #![allow(unused_variables)] # #fn main() { fn read_all_lines(filename: &str) -> io::Result<()> { let file = File::open(&filename)?; let mut lines = Lines::new(file); while let Some(line) = lines.next() { let line = line?; println!("{}", line); } Ok(()) } #}
You can even write the loop like this, since the explicit match can pull out the string slice:
# #![allow(unused_variables)] # #fn main() { while let Some(Ok(line)) = lines.next() { println!("{}", line)?; } #}
It's tempting, but you are throwing away a possible error here; this loop will silently stop whenever an error occurs. In particular, it will stop at the first place where Rust can't convert a line to UTF-8. Fine for casual code, bad for production code!
Writing To Files
We met the write!
macro when implementing Debug
- it also works with anything
that implements Write
. So here's a another way of saying print!
:
# #![allow(unused_variables)] # #fn main() { let mut stdout = io::stdout(); ... write!(stdout,"answer is {}\n", 42).expect("write failed"); #}
If an error is possible, you must handle it. It may not be
very likely but it can happen. It's usually fine, because if you
are doing file i/o you should be in a context where ?
works.
But there is a difference: print!
locks stdout for each write. This is usually
what you want for output, because without that locking multithreaded
programs can mix up that output in interesting ways. But if you are pumping out
a lot of text, then write!
is going to be faster.
For arbitrary files we need write!
. The
file is closed when out
is dropped at the end of write_out
, which is
both convenient and important.
// file6.rs use std::fs::File; use std::io; use std::io::prelude::*; fn write_out(f: &str) -> io::Result<()> { let mut out = File::create(f)?; write!(out,"answer is {}\n", 42)?; Ok(()) } fn main() { write_out("test.txt").expect("write failed"); }
If you care about performance, you need to know that Rust files are unbuffered
by default. So each little write request goes straight to the OS, and this is
going to be significantly slower. I mention this because this default is different
from other programming languages, and could lead to the shocking discovery that Rust
can be left in the dust by scripting languages!
Just as with Read
and io::BufReader
, there is io::BufWriter
for
buffering any Write
.
Files, Paths and Directories
Here is a little program for printing out the Cargo directory on a machine. The
simplest case is that it's '~/.cargo'. This is a Unix shell expansion,
so we use env::home_dir
because it's cross-platform. (It might fail, but a
computer without a home directory isn't going to be hosting Rust tools anyway.)
We then create a PathBuf
and use its push
method to build up the full file path from its components.
(This is much easier than fooling around with '/','' or whatever, depending on
the system.)
// file7.rs use std::env; use std::path::PathBuf; fn main() { let home = env::home_dir().expect("no home!"); let mut path = PathBuf::new(); path.push(home); path.push(".cargo"); if path.is_dir() { println!("{}", path.display()); } }
A PathBuf
is like String
- it owns a growable set of characters, but with methods
specialized to building up paths. Most of its functionality however comes from
the borrowed version Path
, which is like &str
. So, for instance, is_dir
is
a Path
method.
This might sound suspiciously like a form of inheritance, but the magic Deref
trait works differently. It works just like it does with String/&str
-
a reference to PathBuf
can be coerced into a reference to Path
.
('Coerce' is a strong word, but this really
is one of the few places where Rust does conversions for you.)
# #![allow(unused_variables)] # #fn main() { fn foo(p: &Path) {...} ... let path = PathBuf::from(home); foo(&path); #}
PathBuf
has an intimate relationship with OsString
, which represents strings we get
directly from the system. (There is a corresponding OsString/&OsStr
relationship.)
Such strings are not guaranteed to be representable as UTF-8! Real life is a complicated matter, particularly see the answer to 'Why are they so hard?'. To summarize, first there are years of ASCII legacy coding, and multiple special encodings for other languages. Second, human languages are complicated. For instance 'noël' is five Unicode code points!
It's true that most of the time
with modern operating systems file names will be Unicode (UTF-8 on the Unix side, UTF-16
for Windows), except when they're not. And Rust must handle that possibility
rigorously. For instance,
Path
has a method as_os_str
which returns a &OsStr
, but the to_str
method
returns an Option<&str>
. Not always possible!
People have trouble at this point because they have become too attached to 'string' and
'character' as the only necessary abstractions. As Einstein could have said, a programming language
has to be as simple as possible, but no simpler. A systems language needs a
String/&str
distinction (owned versus borrowed: this is also very convenient)
and if it wishes to standardize on Unicode strings then it needs another type to handle
text which isn't valid Unicode - hence OsString/&OsStr
. Notice that there aren't
any interesting string-like methods for these types, precisely because we don't know the
encoding.
But, people are used to processing filenames as if they were strings, which is why
Rust makes it easier to manipulate file paths using PathBuf
methods.
You can pop
to successively remove path components. Here we start with the
current directory of the program:
// file8.rs use std::env; fn main() { let mut path = env::current_dir().expect("can't access current dir"); loop { println!("{}", path.display()); if ! path.pop() { break; } } } // /home/steve/rust/gentle-intro/code // /home/steve/rust/gentle-intro // /home/steve/rust // /home/steve // /home // /
Here's a useful variation. I have a program which searches for a configuration file,
and the rule is that it may appear in any subdirectory of the current directory.
So I create /home/steve/rust/config.txt
and start this program up in /home/steve/rust/gentle-intro/code
:
// file9.rs use std::env; fn main() { let mut path = env::current_dir().expect("can't access current dir"); loop { path.push("config.txt"); if path.is_file() { println!("gotcha {}", path.display()); break; } else { path.pop(); } if ! path.pop() { break; } } } // gotcha /home/steve/rust/config.txt
This is pretty much how git works when it wants to know what the current repo is.
The details about a file (its size, type, etc) are called its metadata. As always, there may be an error - not just 'not found' but also if we don't have permission to read this file.
// file10.rs use std::env; use std::path::Path; fn main() { let file = env::args().skip(1).next().unwrap_or("file10.rs".to_string()); let path = Path::new(&file); match path.metadata() { Ok(data) => { println!("type {:?}", data.file_type()); println!("len {}", data.len()); println!("perm {:?}", data.permissions()); println!("modified {:?}", data.modified()); }, Err(e) => println!("error {:?}", e) } } // type FileType(FileType { mode: 33204 }) // len 488 // perm Permissions(FilePermissions { mode: 436 }) // modified Ok(SystemTime { tv_sec: 1483866529, tv_nsec: 600495644 })
The length of the file (in bytes) and modified time are straightforward to interpret.
(Note we may not be able to get this time!) The file type has methods is_dir
,
is_file
and is_symlink
.
permissions
is an interesting one. Rust strives to be cross-platform, and so it's
a case of the 'lowest common denominator'. In general, all you can query is whether
the file is read-only - the 'permissions' concept is extended in Unix and encodes
read/write/executable for user/group/others.
But, if you are not interested in Windows, then bringing in a platform-specific trait will give us at least the permission mode bits. (As usual, a trait only kicks in when it is visible.) Then, applying the program to its own executable gives:
# #![allow(unused_variables)] # #fn main() { use std::os::unix::fs::PermissionsExt; ... println!("perm {:o}",data.permissions().mode()); // perm 755 #}
(Note '{:o}' for printing out in octal)
(Whether a file is executable on Windows is determined by its extension. The executable
extensions are found in the PATHEXT
environment variable - '.exe','.bat' and so forth).
std::fs
contains a number of useful functions for working with files, such as copying or
moving files, making symbolic links and creating directories.
To find the contents of a directory, std::fs::read_dir
provides an iterator.
Here are all files with extension '.rs' and size greater than 1024 bytes:
# #![allow(unused_variables)] # #fn main() { fn dump_dir(dir: &str) -> io::Result<()> { for entry in fs::read_dir(dir)? { let entry = entry?; let data = entry.metadata()?; let path = entry.path(); if data.is_file() { if let Some(ex) = path.extension() { if ex == "rs" && data.len() > 1024 { println!("{} length {}", path.display(),data.len()); } } } } Ok(()) } // ./enum4.rs length 2401 // ./struct7.rs length 1151 // ./sexpr.rs length 7483 // ./struct6.rs length 1359 // ./new-sexpr.rs length 7719 #}
Obviously read_dir
might fail (usually 'not found' or 'no permission'), but
also getting each new entry might fail (it's like the lines
iterator over a buffered
reader's contents). Plus, we might not be able to get the metadata corresponding to
the entry. A file might have no extension, so we have to check for that as well.
Why not just an iterator over paths? On Unix this is the way the opendir
system call works,
but on Windows you cannot iterate over a directory's contents without getting the
metadata. So this is a reasonably elegant compromise that allows cross-platform
code to be as efficient as possible.
You can be forgiven for feeling 'error fatigue' at this point. But please note that the errors always existed - it's not that Rust is inventing new ones. It's just trying hard to make it impossible for you to ignore them. Any operating system call may fail.
Languages like Java and Python throw exceptions; languages like Go and Lua return two values, where the first is the result and the second is the error: like Rust it is considered bad manners for library functions to raise errors. So there is a lot of error checking and early-returns from functions.
Rust uses Result
because it's either-or: you cannot get both a result and an error.
And the question-mark operator makes handling errors much cleaner.
Processes
A fundamental need is for programs to run programs, or to launch processes. Your program can spawn as many child processes it likes, and as the name suggests they have a special relationship with their parent.
To run a program is straightforward using the Command
struct, which builds up
arguments to pass to the program:
use std::process::Command; fn main() { let status = Command::new("rustc") .arg("-V") .status() .expect("no rustc?"); println!("cool {} code {}", status.success(), status.code().unwrap()); } // rustc 1.15.0-nightly (8f02c429a 2016-12-15) // cool true code 0
So new
receives the name of the program (it will be looked up on PATH
if not
an absolute filename), arg
adds a new argument, and status
causes it to be run.
This returns a Result
, which is Ok
if the program actually run, containing an
ExitStatus
. In this case, the program succeeded, and returned an exit code 0. (The
unwrap
is because we can't always get the code if the program was killed by
a signal).
If we change the -V
to -v
(an easy mistake) then rustc
fails:
error: no input filename given
cool false code 101
So there are three possibilities:
- program didn't exist, was bad, or we were not allowed to run it
- program ran, but was not successful - non-zero exit code
- program ran, with zero exit code. Success!
By default, the program's standard output and standard error streams go to the terminal.
Often we are very interested in capturing that output, so there's the output
method.
// process2.rs use std::process::Command; fn main() { let output = Command::new("rustc") .arg("-V") .output() .expect("no rustc?"); if output.status.success() { println!("ok!"); } println!("len stdout {} stderr {}", output.stdout.len(), output.stderr.len()); } // ok! // len stdout 44 stderr 0
As with status
our program blocks until the child process is finished, and we get
back three things - the status (as before), the contents of stdout and the contents
of stderr.
The captured output is simply Vec<u8>
- just bytes. Recall we have no guarantee
that data we receive from the operating system is a properly encoded UTF-8 string. In
fact, we have no guarantee that it even is a string - programs may return arbitrary
binary data.
If we are pretty sure the output is UTF-8, then String::from_utf8
will convert those
vectors or bytes - it returns a Result
because this conversion may not succeed.
A more sloppy function is String::from_utf8_lossy
which will make a good attempt at
conversion and insert the invalid Unicode mark � where it failed.
Here is a useful function which runs a program using the shell. This uses the usual shell mechanism for joining stderr to stdout. The name of the shell is different on Windows, but otherwise things work as expected.
# #![allow(unused_variables)] # #fn main() { fn shell(cmd: &str) -> (String,bool) { let cmd = format!("{} 2>&1",cmd); let shell = if cfg!(windows) {"cmd.exe"} else {"/bin/sh"}; let flag = if cfg!(windows) {"/c"} else {"-c"}; let output = Command::new(shell) .arg(flag) .arg(&cmd) .output() .expect("no shell?"); ( String::from_utf8_lossy(&output.stdout).trim_right().to_string(), output.status.success() ) } fn shell_success(cmd: &str) -> Option<String> { let (output,success) = shell(cmd); if success {Some(output)} else {None} } #}
I'm trimming any whitespace from the right so that if you said shell("which rustc")
you will get the path without any extra linefeed.
You can control the execution of a program launched by Process
by specifying the directory it will run
in using the current_dir
method and the environment variables it sees using env
.
Up to now, our program simply waits for the child process to finish. If you use
the spawn
method then we return immediately, and must explicitly wait for it to
finish - or go off and do something else in the meantime! This example also
shows how to suppress both standard out and standard error:
// process5.rs use std::process::{Command,Stdio}; fn main() { let mut child = Command::new("rustc") .stdout(Stdio::null()) .stderr(Stdio::null()) .spawn() .expect("no rustc?"); let res = child.wait(); println!("res {:?}", res); }
By default, the child 'inherits' the standard input and output of the parent. In this case,
we redirect the child's output handles into 'nowhere'. It's equivalent to saying
> /dev/null 2> /dev/null
in the Unix shell.
Now, it's possible to do these things using the shell (sh
or cmd
) in Rust.
But this way you get full programmatic control of process creation.
For example, if we just had .stdout(Stdio::piped())
then the child's standard output
is redirected to a pipe. Then child.stdout
is something you can use to directly
read the output (i.e. implements Read
). Likewise, you can use the .stdout(Stdio::piped())
method so you can write to child.stdin
.
But if we used wait_with_output
instead of wait
then
it returns a Result<Output>
and the child's output is captured into the stdout
field of that Output
as a Vec<u8>
just as before.
The Child
struct also gives you an explicit kill
method.