Created
September 16, 2022 21:57
use std::str;

fn main() {
    // A string literal is a &'static str: a string slice that points at UTF-8 bytes compiled
    // into the binary itself. Those bytes are read-only, live for the whole run of the program,
    // and are not on the heap.
    // To get text we own and can grow or modify, we convert the literal to a String, which
    // copies the bytes into a heap-allocated buffer. That is the job of .to_string(). It might
    // seem strange to take a string and immediately convert it to a String, but a literal on
    // its own can only be read; anything that needs ownership or mutation wants a String.
    //
    // We will use a phrase in German to make it look slightly more interesting.
    let s = "große Feuer-Bälle 🔥!".to_string();
    // A String is a wrapper around a Vec<u8> that is guaranteed to hold valid UTF-8.
    // as_bytes() borrows that buffer as a byte slice (&[u8]) without copying anything.
    let byte_array = s.as_bytes();
    // let's have a look at them
    for n in byte_array {
        print!("{}|", n);
    }
    println!();
    // =>
    // 103|114|111|195|159|101|32|70|101|117|101|114|45|66|195|164|108|108|101|32|240|159|148|165|33|
    //
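    // Nice, but where do all these bytes live?
    // {:p} prints the address held by a pointer or reference. It is a virtual address inside
    // this process, not a physical one, and it will differ from run to run. as_ptr() gives a
    // raw pointer to the first byte of the string's buffer.
    println!("heap string: {:p}", s.as_ptr());
    // =>
    // heap string: 0x600000e31120
    //
    // Ok, there it be. That's a big heaping number.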
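    // what if we make a ref to the string?
    let a_ref = &s;
    println!("ref: {:p}", a_ref);
    // =>
    // ref: 0x309fb8808
    //
    // interesting, an address in a very different region! {:p} on a reference prints the address
    // it points to, and a_ref points at `s` itself: the small String value (pointer, length,
    // capacity) that lives on the stack, not the text it manages.
    //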
    // Wait. What is the stack again?
    // It's the region of memory where each function call keeps its local variables, whose sizes
    // must be known at compile time. Space for a call is reserved when the function is entered
    // and released when it returns, so values are only ever added or removed at the top of the
    // stack. Allocations on the stack are very cheap.
    //
    // It's distinct from the heap, which is allocated dynamically at runtime through an
    // allocator. How the allocator organizes memory is an implementation detail, and heap
    // allocations are more expensive than stack allocations, so it pays to avoid them when
    // possible. String, Vec, HashMap, Box and similar types keep their contents on the heap,
    // while the small handle that points to those contents usually lives on the stack.
    // References are often stored on the stack too, and can point to data on the stack, on the
    // heap, or in the binary. When an owner goes out of scope its memory is handed back to the
    // allocator, which may re-use it or return it to the OS. Since Rust tracks the "owner" of
    // every value at compile time, it knows where to insert the code that frees each one and
    // therefore does not need garbage collection. On top of that, safe Rust cannot be used to
    // write non-memory-safe code (miraculous!); only `unsafe` blocks relax those checks.
    //
    // Thus, taking the reference (a &String, which derefs to &str) and calling as_ptr() reveals,
    // again, the same location as s.as_ptr(): the text, somewhere out in the heap.
    println!("{:p}", a_ref.as_ptr());
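    //
    // A small sketch to make the stack/heap split concrete. This is an illustrative addition
    // and the `sketch_` names are made up for it. The assumption being shown: a literal's bytes
    // sit in the binary, to_string() copies them into a separate heap buffer, and the String
    // value itself is just a small fixed-size handle to that buffer.
    let sketch_literal: &'static str = "große Feuer-Bälle 🔥!";
    let sketch_owned = sketch_literal.to_string();
    println!("literal bytes: {:p}", sketch_literal.as_ptr());
    println!("owned bytes:   {:p}", sketch_owned.as_ptr());
    println!("handle size:   {} bytes", std::mem::size_of::<String>());
    // Two separate buffers holding identical text.
    assert_ne!(sketch_literal.as_ptr(), sketch_owned.as_ptr());
    assert_eq!(sketch_literal, sketch_owned);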
    // what is a &str again?
    // A &str is a borrowed string slice: a pointer plus a length describing a run of UTF-8
    // bytes that somebody else owns. Those bytes might be a literal in the binary's read-only
    // memory, or all or part of a String on the heap. The text can't be changed through a &str.
    // The slice value itself (the pointer and length) is small and usually sits on the stack.
    //
    // So in general, if a function needs to be called with a string that doesn't need to change,
    // it should receive a &str. A &String is converted to &str automatically (deref coercion).
    hark(a_ref);
    // => Hark! große Feuer-Bälle 🔥!
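    // a &str doesn't have to reference the entire underlying string. The String itself is still
    // owned by someone else, and can't be borrowed mutably while the slice is in use. This just
    // creates another pointer (plus length) on the stack we can pass around like any other.
    // The range is in bytes, not characters, and must fall on character boundaries or the
    // slice will panic. Byte 19 is the space just before the emoji.
    let ending = &s[19..];
    hark(ending);
    // => Hark!  🔥!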
    // and what if the underlying string needs to be changed?
    // To change a string in place we must declare it as mutable, then pass it mutably borrowed
    // (&mut). Here, `clone()` allocates a new buffer on the heap and copies the original string
    // into it byte-for-byte, so `s` is left untouched. Note that to_uppercase() does not
    // actually modify its input: it builds and returns a brand new String, so upper() would
    // work just as well with a plain &str.
    let mut s2 = s.clone();
    hark(&upper(&mut s2));
    // => Hark! GROSSE FEUER-BÄLLE 🔥!
    //
    // Note the ß is expanded to "SS" by .to_uppercase(), which follows Unicode's uppercase
    // mapping for ß. This is also why it cannot work in place: the result can be longer.
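    //
    // A sketch of an edit that does happen in place through a &mut str. This is an illustrative
    // addition and the `sketch_` names are made up for it. Unlike to_uppercase(), which returns
    // a new String, make_ascii_uppercase() only rewrites ASCII bytes, so the length never
    // changes and no new buffer is needed. The price is that ß and ä are left alone.
    let mut sketch_shout = String::from("große Feuer-Bälle 🔥!");
    let sketch_before = sketch_shout.as_ptr();
    sketch_shout.as_mut_str().make_ascii_uppercase();
    println!("Hark! {}", sketch_shout);
    // => Hark! GROßE FEUER-BäLLE 🔥!
    // Same buffer as before: nothing was reallocated.
    assert_eq!(sketch_before, sketch_shout.as_ptr());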
    // what if we want to print the individual bytes without a loop? Here we convert byte_array
    // to a Vec<String> of decimal numbers, and can then print them with join(), which gives the
    // same output as above, minus the trailing separator.
    let bytes_as_strings: Vec<String> = byte_array.iter().map(|i| i.to_string()).collect();
    println!("{}", bytes_as_strings.join("|"));
    // => 103|114|111|195|159|101|32|70|101|117|101|114|45|66|195|164|108|108|101|32|240|159|148|165|33
    // boring decimal numbers again. But what if we were interested in Unicode? (who isn't???? right?)
    // chars() gives us an iterator over the string's chars, and a char is a ‘Unicode scalar value’.
    for c in s.chars() {
        print!("{} U+({:04X}) ", c, c as u32);
    }
    println!();
    // =>
    // g U+(0067) r U+(0072) o U+(006F) ß U+(00DF) e U+(0065) U+(0020) F U+(0046) e U+(0065) u
    // U+(0075) e U+(0065) r U+(0072) - U+(002D) B U+(0042) ä U+(00E4) l U+(006C) l U+(006C) e
    // U+(0065) U+(0020) 🔥 U+(1F525) ! U+(0021)
    //
    // yep, that emoji has a mighty big code point and takes up 4 bytes in UTF-8. I guess that's
    // why there are so many damn emoji 🤓
    // Now, what if we wanted to construct a byte array of our own, and convert it to a String?
    // let's copy the bytes into a mutable Vec, and replace the emoji's 4 bytes with U+2661,
    // which should be a heart shape! yay!
    //
    // first we have to copy the original string's bytes into a mutable Vec
    let mut new_bytes: Vec<u8> = s.bytes().collect();
    // splice() makes it easy to swap in our desired values: the 4-byte emoji at 20..24 is
    // replaced by the 3 bytes that encode U+2661 in UTF-8, so the Vec gets one byte shorter.
    // I cheated to figure out what the 3 decimal values should be.
    new_bytes.splice(20..24, [226, 153, 161]);
    for n in new_bytes.iter() {
        print!("{}|", n);
    }
    println!();
    // Here we use std::str::from_utf8 instead of String::from_utf8 because the latter takes the
    // Vec by value rather than by reference, and therefore takes ownership of (moves) new_bytes.
    // This upsets the compiler when we want to use new_bytes again further down. (The function
    // does "give back" ownership in a sense: the buffer becomes the returned String, and
    // into_bytes() turns that back into a Vec.) In any case this version only borrows the bytes
    // and returns a &str view of them, so new_bytes stays ours.
    let edited = str::from_utf8(&new_bytes).unwrap();
    println!("{}", edited);
    // =>
    // große Feuer-Bälle ♡!
    //
    // Sweet, we are well on our way to cloning emacs!
    //
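    // A sketch of the owning route, for comparison. This is an illustrative addition and the
    // `sketch_` names are made up for it. String::from_utf8 consumes the Vec and keeps its
    // buffer for the String; into_bytes() hands the bytes back as a Vec again.
    let sketch_bytes: Vec<u8> = vec![226, 153, 161];
    let sketch_heart = String::from_utf8(sketch_bytes).unwrap();
    // sketch_bytes has been moved into sketch_heart and can no longer be used here.
    println!("{}", sketch_heart);
    // => ♡
    let sketch_back: Vec<u8> = sketch_heart.into_bytes();
    assert_eq!(sketch_back, [226, 153, 161]);
    //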
    // and what does Rust do if you try to String-ify a sequence of bytes that isn't valid UTF-8?
    // 199 (0xC7) announces a 2-byte sequence, but the byte after it is a plain ASCII 'e'.
    new_bytes[7] = 199;
    if let Err(ohno) = str::from_utf8(&new_bytes) {
        eprintln!("{}", ohno);
    }
    // => invalid utf-8 sequence of 1 bytes from index 7
    //
    // Beautiful. Rust is annoyingly pedantic and uncompromising. This particular check happens
    // at runtime, but because it comes back as a Result the compiler makes us deal with it, and
    // it's nice to know there is no tolerance for the nonsense bugs that are so common in other
    // languages.
}

fn hark(text: &str) {
    println!("Hark! {}", text);
}

fn upper(text: &mut str) -> String {
    text.to_uppercase()
}

/*
 * Credit to:
 * https://blog.thoughtram.io/string-vs-str-in-rust/
 * https://fasterthanli.me/articles/working-with-strings-in-rust
 * https://www.reddit.com/r/rust/comments/fcuq8x/understanding_string_and_str_in_rust/
 */