sortedvec

published on 2019-01-21

Vectors are the most widely used data structures in the Rust programming language, and for good reason too: these array types are both simple and perform well at many tasks. Its elements are stored next to each other in memory — we call such containers contiguous — and that makes their representation in memory compact and iteration quick (due to cache locality). Further, it is possible to index items without traversal, providing constant access times.

Although vectors have many advantages, they aren't ideal for every task. One such operation is search. Without additional structure, the best way to find an element in a vector that satisfies a certain condition (also called predicates) is to simply try every single element until you either find one or hit the end of the vector. This is called linear search, because the time it takes to look up an element scales linearly with the number of elements in the collection.

The sortedvec crate for Rust tries to improve these lookup times, while keeping many of the advantages of vectors. It introduces a way to define contiguous data structures that are fairly simple wrapper type around the standard library's Vec type, with the added property that the internal data is sorted with respect to some derived key.

Storing the data in a sorted state allows the lookup of values by key using binary search. For all but the smallest arrays, binary search is much faster than linear search since it only requires a number of comparisons that is logarithmic in the number of elements in the structure. For example, in an array of a million values, linear search will on average require 500,000 comparisons. Even in the worst case, binary search will never require more than 20 comparisons. Though this is good, it's not quite as fast as the specialized HashMap data structure, which achieves a near flat scaling of lookups! Sorted vectors are perfect for situations where fast key based lookups are required, but the memory and complexity overhead of hashmaps are not warranted.

712151616192224253950515355576364697979key: 53binary iterations: 4linear iterations: 13

Rust's own Vec provides methods to look up values using binary search by key already, but that assumes that you provide a vector that is already sorted with respect to that key. In fact, sortedvec uses these methods internally. Its added value is that it maintains that invariant, so lookups are guaranteed to be quick and correct. To achieve this, the crate leverages Rust's unique type system. The sortedvec types provide immutable references to the underlying Vec, so you can still use all the immutable methods you are used to for vectors and slices. However, you won't be able to get a mutable reference, because that could invalidate the data's order invariant.

Here's an overview of how lookups scale on HashMap, SortedVec and Vec when using string keys:

2345102030405010020030040050010000102030405060708090100lookup time (ns)collection sizehashmapsortedvecvec

Note that this does not take into account the upfront costs of constructing the data structures, which is a bit more expensive for sorted vectors compared to regular vectors, since it'll require a sort.

crate details

The sortedvec crate only exposes a single macro, sortedvec!, which can be used to define a data type like so

sortedvec! {
    pub struct SortedVec {
        fn key_deriv(x: &T) -> K { x.as_k() }
    }
}

where T is the type of the value stored, K the type of its derived key and key_deriv the function that constructs a key K from a value reference &T. You can think of the resulting struct SortedVec as a map similar to HashMap. However, unlike the HashMap, the defined struct does not have any type parameters.

The initial versions of the crate actually defined a single parametric struct SortedVec, where F: for<'t> FnMut(&'t T) -> &'t K was the key derivation function. It worked, but since Rust doesn't have any way to actually write the type of a function, all functions and data structures dealing with it would also have to be parametric in this function type F: an ergonomics nightmare.

In version 0.4 of sortedvec, defining and using a collection may look something like this:

#[derive(Debug, Default)]
struct SomeComplexValue {
    some_map: HashMap,
    name: String,
    prio: u64,
}

sortedvec! {
    /// Vec of `SomeComplexValues` that allows quick
    /// lookup by (name, prio) keys
    #[derive(Debug)]
    struct ComplexMap {
        fn key_deriv(val: &SomeComplexValue) -> (&str, u64) {
            (val.name.as_str(), val.prio)
        }
    }
}

let mut sv = ComplexMap::default();
sv.insert(SomeComplexValue {
    some_map: Default::default(),
    name: "test".to_owned(),
    prio: 0,
});
assert!(sv.find(&("hello", 1)).is_none());
assert!(sv.find(&("test", 0)).is_some());

for val in sv {
    println!("{:?}", val);
}

The crate is at the time of writing still in an experimental and unstable state, but all are encouraged to try it out. If you find any issues, bugs or missing use cases, please let us know on the GitHub page.