Last active
July 15, 2020 08:39
-
-
Save dbyr/7f114d85db462603bd810f241b32f559 to your computer and use it in GitHub Desktop.
I came up with an algorithm that provably selects a value at random in a completely balanced and fair manner from an iterator (applicable to streams too) of indeterminate length. I tried searching (admittedly briefly) for such an algorithm but didn't find anything, so I thought I'd publish this gist in case it's never been done before (which I d…
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
use std::iter::Iterator; | |
use rand; // requires the rand crate | |
use rand::Rng; | |
use std::f64; | |
// I came up with this little algorithm because it seems | |
// (as far as I could google) there are no simple methods | |
// for selecting a single value randomly _and_ fairly | |
// from a variable-size stream/iterator in a single pass. | |
/// Method for collecting a single, _evenly_ random | |
/// value from an iterator/stream of unknown size in a | |
/// single pass. Only requires that the number of values | |
/// so far passed is known/tracked. It does so by selecting | |
/// the current item at random proportional to how many items | |
/// have been seen up to (and including) this item. For example, | |
/// the first three numbers have probabilities 1/1, 1/2, 1/3. | |
/// This obviously continues for indefinite number of items | |
/// in the collection. | |
/// | |
/// The correctness of the algorithm can be easily proven | |
/// (assuming the random number generator sufficiently | |
/// generates numbers within an interval randomly): | |
/// | |
/// Base case: For a single item, any number generated | |
/// will yield that item. For the second item, there is | |
/// 1/2 chance it will be selected, meaning the first item | |
/// also has 1/2 of being selected (P(~1/2) = 1/2). Therefore, | |
/// in the base case(s) | |
/// Recursive case: Assume all values up to this value have been | |
/// selected from fairly at random. Therefore, the probability | |
/// with which the currently selected point was selected is 1/(n-1). | |
/// Now select the current value with probability 1/n. This means | |
/// the previously selected value is selected with probability | |
/// 1/(n-1) * P(~current) = 1/(n-1) * (n-1)/n = (n-1)/n(n-1) = 1/n | |
/// = P(current). Therefore, since all previous values were selected | |
/// from evenly (with prob. 1/(n-1)), then all values remain selected | |
/// from evenly with the introduction of the nth value (that is, P(any)=1/n). | |
// The method described above written in Rust | |
fn random_value<'a, V, T>(vals: V) -> Option<&'a T> | |
where V: Iterator<Item=&'a T> { | |
let mut gen = rand::thread_rng(); | |
let mut counted = 1f64; | |
let mut selected = None; | |
let mut select: f64; | |
let mut prob: f64; | |
for val in vals { | |
select = gen.gen_range(0f64, f64::MAX); | |
prob = f64::MAX / counted; | |
if prob >= select { | |
selected = Some(val); | |
} | |
counted += 1f64; | |
} | |
selected | |
} | |
fn main() { | |
let vs: Vec<u32> = (0..100).map(|x| x).collect(); | |
let it = vs.iter(); | |
println!("{}", random_value(it).unwrap()); | |
} | |
#[cfg(test)] | |
mod test { | |
use super::random_value; | |
use std::collections::HashMap; | |
use std::fs::OpenOptions; | |
use std::io::Write; | |
const WITHIN_10000: i32 = 30; | |
// obviously this is going to fail some of the time, | |
// that's just how randomness is... but it's at least | |
// a good indicator if it works some/most of the time! | |
#[test] | |
fn freq_test_1000() { | |
let mut occurrences = HashMap::new(); | |
let data: Vec<i32> = (0..100).map(|x| x).collect(); | |
// get 100 values randomly | |
for _ in 0..10000 { | |
let res = random_value(data.iter()).unwrap(); // only None if list empty | |
if let Some(v) = occurrences.get_mut(&res) { | |
*v += 1; | |
} else { | |
occurrences.insert(res, 1i32); | |
} | |
} | |
// test each value to see they're within | |
// a reasonable number of occurences | |
for datum in data.iter() { | |
if let Some(v) = occurrences.get(datum) { | |
assert!( | |
(v - 100).abs() < WITHIN_10000, | |
format!("'{}' appeared {} times", datum, v) | |
); | |
} else { | |
assert!(false, format!("'{}' didn't appear at all", datum)); | |
} | |
} | |
output_stats(&occurrences); | |
} | |
fn output_stats(stats: &HashMap<&i32, i32>) { | |
let f = OpenOptions::new() | |
.create_new(true) | |
.read(true) | |
.write(true) | |
.open("./data/distribution.csv"); | |
match f { | |
Ok(mut file) => { | |
file.write("Number,Occurences\n".as_bytes()); | |
for (k, v) in stats { | |
file.write(format!("{},{}\n", k, v).as_bytes()); | |
} | |
file.flush(); | |
}, | |
Err(e) => println!("Could not open file: {}", e) | |
} | |
} | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment