Allow `Unstructured` to be backed by an iterator of bytes instead of a byte slice? #103

maackle · 2022-02-18T21:59:53Z

In our use case, we are not using Arbitrary for fuzzing, but simply for creating arbitrary fixture values in tests. Currently we create a 10MB static Vec<u8> of random noise and use that as our Unstructured data. However this is annoying since it requires 10MB of memory overhead, and sometimes even then we run out of bytes.

I am wondering if it would be valid to have two flavors of Unstructured, one backed by a byte slice, presumably for fuzzing, and one backed by an infinite iterator of bytes which can never be exhausted. I experimented with a PR for doing this and got some basic tests passing, but don't know how valid it is in the grand scheme. I did find that some functionality is indeed dependent on the fixed byte slice, so at the very least some extra UX effort would have to be made to provide slightly different interfaces for different Unstructured flavors.

I understand if this wouldn't be worth the effort but I guess I am primarily wondering from a motivational standpoint why Unstructured is backed by a byte array instead of an iterator, and only secondarily asking for some feedback on the feasibility of using infinite iterators. Note: I know next to nothing about fuzzing.

nagisa · 2022-02-18T22:12:55Z

It is more straightforward to derive the relationship between data passed in as a buffer and its use in code, than when it needs to go through the iterator machinery.

The predecessor to Unstructured was also intended to be inexhaustible, so it was important for it to be able to read the data from the buffer multiple times in a ring buffer like fashion.

It would be an interesting challenge/exercise to implement a virtually infinite buffer of virtual memory that is filled with data as accesses from Unstructured fault the pages in. Basically something along the lines of:

mmap(a long inaccessible chunk of virtual memory)
set up signal handlers for sigsegv on the mmaped pages
make Unstructured with the mmaped buffer
when signal handler fires, fill the faulting page with data and make page RW before returning control to the faulting instruction.

maackle · 2022-02-18T23:11:51Z

To be clear, I was thinking of something as simple as std::iter::repeat_with(rand::thread_rng().gen) as the iterator.
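As a rough illustration of the "infinite iterator of bytes" idea, here is a minimal std-only sketch. The names `noise_bytes` and `take_buffer` are hypothetical, and the xorshift64 generator is only a stand-in for `rand`; nothing here is part of the `arbitrary` API. It shows that an endless iterator can still feed a slice-based consumer by materializing however many bytes are needed at the time.

```rust
/// Hypothetical stand-in for a real RNG: a xorshift64 generator
/// exposed as an endless iterator of bytes.
pub fn noise_bytes(seed: u64) -> impl Iterator<Item = u8> {
    // xorshift must not start from an all-zero state.
    let mut state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
    std::iter::repeat_with(move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        // Each step yields eight bytes; `flatten` turns them into a byte stream.
        state.to_le_bytes()
    })
    .flatten()
}

/// Pull exactly `len` bytes off the iterator into an owned buffer that
/// could then be handed to a slice-based consumer.
pub fn take_buffer(bytes: &mut impl Iterator<Item = u8>, len: usize) -> Vec<u8> {
    bytes.by_ref().take(len).collect()
}
```

With the real crates, the resulting `Vec<u8>` would presumably be passed to `Unstructured::new(&buf)`; the iterator itself never runs dry, so a larger buffer can always be requested.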

I guess what I don't understand is what relationship needs to be understood between the data and its use in code, if the unstructured data is truly arbitrary. This may be where my lack of knowledge of fuzzing methodology is holding me back.

fitzgen · 2022-02-22T19:17:54Z

The Arbitrary+Unstructured APIs are already fairly complex and I wouldn't want to make them any more complex by adding type parameters or anything like that.

I think you should be able to get your ultimately desired functionality in a performant manner by using the size_hint method and doing something like

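let (min, max) = MyType::size_hint(0);
// Start with at least one byte so that doubling can actually grow the buffer.
let capacity = max.unwrap_or(min * 2).max(1);
let mut data = vec![0u8; capacity];
loop {
    // `rng` is assumed to implement `rand::Rng`; `fill` wants a slice, not a `&mut Vec`.
    rng.fill(&mut data[..]);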
    let u = Unstructured::new(&data);
    let x = match MyType::arbitrary_take_rest(u) {
        Ok(x) => x,
        Err(arbitrary::Error::NotEnoughData) => {
            // Double the buffer's size. Optionally have a max
            // buffer size.
            let new_len = data.len() * 2;
            data.resize(new_len, 0);
            continue;
        }
        Err(_) => {
            // Just try again with new data. Optionally have a
            // max number of retries.
            continue;
        }
    };

    // Do stuff with `x`...
}
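The "optionally have a max buffer size" and "max number of retries" comments in the loop above can be made concrete. The following is a std-only sketch under stated assumptions: `GenError` and `generate_with_growth` are hypothetical names, the `fill` closure stands in for the RNG, and the `generate` closure stands in for a call such as `MyType::arbitrary_take_rest(Unstructured::new(data))`. It is not code from the `arbitrary` crate.

```rust
/// Hypothetical error type mirroring the two cases the loop distinguishes.
#[derive(Debug, PartialEq)]
pub enum GenError {
    NotEnoughData,
    Other,
}

/// Refill a buffer and retry generation, doubling the buffer when the
/// generator runs out of bytes. Gives up (returns `None`) once the buffer
/// would exceed `max_capacity` or more than `max_retries` other errors occur.
pub fn generate_with_growth<T>(
    initial_capacity: usize,
    max_capacity: usize,
    max_retries: usize,
    mut fill: impl FnMut(&mut [u8]),
    mut generate: impl FnMut(&[u8]) -> Result<T, GenError>,
) -> Option<T> {
    // At least one byte, so that doubling makes progress.
    let mut data = vec![0u8; initial_capacity.max(1)];
    let mut retries = 0;
    loop {
        fill(&mut data);
        match generate(&data) {
            Ok(x) => return Some(x),
            Err(GenError::NotEnoughData) => {
                let new_len = data.len() * 2;
                if new_len > max_capacity {
                    return None;
                }
                data.resize(new_len, 0);
            }
            Err(GenError::Other) => {
                // Try again with fresh data, up to the retry limit.
                retries += 1;
                if retries > max_retries {
                    return None;
                }
            }
        }
    }
}
```

Bounding both the buffer size and the retry count keeps a test from spinning forever when a type can never be built from the bytes it is given.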

maackle · 2022-02-22T20:08:33Z

Interesting, this could work. Thinking it could be made ergonomic by creating a trait parallel to Arbitrary, with a similar API but which instead of taking Unstructured, takes a new struct which wraps a SeededRng and a Vec<u8> buffer, and produces Unstructured on the fly when asked to generate an arbitrary type. If I wind up needing this, I'll give that a try.
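One possible shape for such a wrapper, as a std-only sketch: `NoiseSource` and its methods are hypothetical names, and the inline xorshift64 step is a stand-in for a real seeded RNG. In an actual implementation the closure passed to `with_fresh_bytes` would presumably build an `Unstructured` from the lent slice.

```rust
/// Hypothetical helper: owns a seeded generator state and a reusable buffer.
pub struct NoiseSource {
    state: u64,
    buf: Vec<u8>,
}

impl NoiseSource {
    pub fn new(seed: u64, capacity: usize) -> Self {
        // A zero state would make xorshift emit only zeros.
        let state = if seed == 0 { 1 } else { seed };
        NoiseSource { state, buf: vec![0; capacity.max(1)] }
    }

    /// Overwrite the buffer with fresh pseudo-random bytes.
    fn refill(&mut self) {
        for chunk in self.buf.chunks_mut(8) {
            self.state ^= self.state << 13;
            self.state ^= self.state >> 7;
            self.state ^= self.state << 17;
            let bytes = self.state.to_le_bytes();
            // The last chunk may be shorter than eight bytes.
            chunk.copy_from_slice(&bytes[..chunk.len()]);
        }
    }

    /// Double the buffer, for callers that ran out of bytes.
    pub fn grow(&mut self) {
        let new_len = self.buf.len() * 2;
        self.buf.resize(new_len, 0);
    }

    /// Current buffer length in bytes.
    pub fn capacity(&self) -> usize {
        self.buf.len()
    }

    /// Refill the buffer and lend it to `f` for one generation attempt.
    pub fn with_fresh_bytes<T>(&mut self, f: impl FnOnce(&[u8]) -> T) -> T {
        self.refill();
        f(&self.buf)
    }
}
```

Because the state is seeded, two sources created with the same seed and capacity hand out the same byte sequences, which keeps fixtures reproducible across test runs.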
