Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving wasmtime performance on illumos via mmapping /dev/null #9544

Open
sunshowers opened this issue Nov 2, 2024 · 4 comments
Open

Improving wasmtime performance on illumos via mmapping /dev/null #9544

sunshowers opened this issue Nov 2, 2024 · 4 comments

Comments

@sunshowers
Copy link
Contributor

sunshowers commented Nov 2, 2024

(I'm still a novice to the wasmtime code so please let me know if I've made any mistakes. Thanks!)

Motivation

In #9535 we added initial support for illumos. (Thanks!) As noted there, we did notice some performance issues, particularly around freeing memory at shutdown.

Some of my colleagues and I investigated the situation today, and we have a pretty good sense of what's going on (some of my colleagues will be writing up the related illumos issues at some point, but it's essentially that anonymous MAP_NORESERVE allocations of size N take up O(N) CPU, rather than apparently O(1) as on some other platforms. The constant factor is very small, but wasmtime allocates enough memory that it is noticeable.)

However, we've found that there's an alternative mmapping strategy that works well, bringing illumos perf to within the same ballpark as Linux (at least as far as the wast test suite goes). The alternative strategy is:

  • For PROT_NONE mappings, rather than creating a large amount of anonymous memory, create a map to /dev/null.
  • When a particular part of memory needs to be made accessible in any way, create an anonymous mmap. Accessible memory is a very small portion of the overall address space map, so this is fine.
  • The regions that have "real" anonymous mmaps associated with them need to be tracked via a data structure, ideally an interval tree. Essentially, some of the bookkeeping the kernel currently does makes its way into userspace.

I have a quick and dirty prototype that shows how /dev/null mapping might work, though it needs a lot of work to be made shippable. The prototype is at sunshowers#1 -- mmap.rs has the meaty interval tree logic if you're interested.

With this PR,

% cargo +beta test --test wast -- --test-threads 1 Cranelift/pooling/tests/spec_testsuite/load.wast
<snip>
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 2491 filtered out; finished in 0.08s

(Note that this strategy doesn't work on Linux -- mmaping /dev/null produces ENODEV. Naturally, if we go this route, this will be a platform-specific impl of mmapping on illumos, and possibly other systems where /dev/null-based mapping might be faster.)

Challenges

The most important implication of this approach is that all memory management must go through the mmap. This means that nothing outside the mmap code should call mmap, mmap_anonymous or mprotect directly.

This turns out to be a bit of a sticking point, sadly. Several parts of the wasmtime VM store pointers to base addresses and operate on them directly.

  • For example, cow.rs calls expose_existing_mapping directly, which on Unix calls mprotect. With /dev/null-based mapping, that would no longer be possible -- that code must go through Mmap instead.

While some components that currently store addresses can be passed a &Mmap, this turns out to be more challenging for other parts -- particularly the RuntimeLinearMemory API, which is generic over both an owned MmapMemory and a logically-borrowed StaticMemory.

In my prototype I just decided to store a SendSyncPtr<Mmap> inside StaticMemory, with the hope/promise that the Mmap outlives the StaticMemory. I don't think that's hugely worse than storing a raw *mut u8 as we do today where there's the same implicit promise, but it is arguably not great.

I'll also say that in general, this change leads to some nice internal improvements. For example, there are currently at least three different implementations of decommitting memory in the kernel, and only one of them does MADV_DONTNEED on Linux. With my prototype, all of that lives in one spot.

Possible solutions

So given that we've established the benefits of changing out the style of mapping, at least on illumos -- and given the challenges I encountered, I think there are a few approaches we could take:

  1. Do nothing. Wasmtime continues to use the current strategy and continues to have performance issues on illumos, though at least we now know what's going on. It is certainly fair to say that the illumos kernel shouldn't take O(N) time for NORESERVE mappings, but my understanding is that fixing that is likely going to take a while. /dev/null-based mapping is also the officially recommended way to do this on illumos.
  2. In the spots that currently store *mut u8, if it is too inconvenient to pass in a &Mmap, store a pointer to the Mmap instead. If I understand correctly, these spots all already have an implicit requirement that the Mmap outlives them, so this would probably not be worse than today. But it is new unsafe Rust.
  3. Refactor wasmtime to change APIs and add lifetimes to various spots as needed, so that they can all hold on to an &'map Mmap or similar. This seemed a bit daunting to me -- particularly the Box<dyn RuntimeLinearMemory> abstracting over both owned and borrowed memory mappings -- but as I said in the beginning, I'm a novice so it's probably a lot easier for an experienced wasmtime dev.
  4. Move Mmap into an Arc and store (strong? weak?) refs to it where needed.
  5. Other ideas?

I'd love to hear y'all's thoughts on this!

@alexcrichton
Copy link
Member

Thanks for writing this up and investigating! I really like the patch you have passing around &Mmap to various locations. That feels like a great design not only from a safety point of view but also the portability point of view too. In terms of what's the best road to take here at least from our side we can help out with reviews and design feedback but the implementation will be up to you/others for sending PRs/handling feedback/tests/etc. I personally would love to see this fixed without waiting for illumos kernel changes.

For a bit of history all our VM apis are a bit of a mish mash of organically evolved "abstractions" over time. The abstractions keep changing because we keep not being able to pin down exactly what we want. In that sense you're more than welcome to apply refactorings to shuffle things around (e.g. having one decommit instead of 3) and don't place too much weight on the current design.

What I might recommend for landing these changes if you're up for it is incremental progress towards the refactoring here. The memory-related bits are notoriously tricky in Wasmtime so we ideally prefer to change only small bits at a time to ensure thorough review and everything. In that sense:

  • Perhaps the first change could be to thread around &Mmap in more locations. That would remove the need for a bunch of raw pointers in places and VM operations would be "obviously based on an existing mmap" by construction.
  • Next VM operations could be pushed onto methods of Mmap which could include things like bounds checks assertions to ensure it's all in-range.
  • Next we could have an illumos-specific implementation of Mmap which does the /dev/null truck as you've implemented with trees too.

Those could all be separate PRs, but separate commits in the same PR is something I'd personally be ok with too. (would primarily prefer to not lump it all into one though). And if you feel there's better divisions of commits/PRs definitely feel free to slice/dice further too.

How's that all sound?

@sunshowers
Copy link
Contributor Author

My worry with passing around &Mmap just is about the complexity of that growing out of hand. Do you have a sense of how difficult the refactoring would be?

I'll give it a shot as time permits.

(I would absolutely split this up into several PRs. I'm one of the biggest stacked PR proponents around :) )

@alexcrichton
Copy link
Member

Personally I thought sunshowers#1 looked pretty reasonable. I'm notoriously bad at judging ahead of time whether something will work out, so I'm happy to leave it to your discretion too as to whether the refactoring feels right/wrong. Any difficult parts I or others can help out with during review too.

One thing I think is ok is to store Arc<Mmap> in more places. That shouldn't really be all that expensive since adding a (hypothetical) atomic increment/decrement to VM operations shouldn't really break the bank.

@sunshowers
Copy link
Contributor Author

In the PR I do store raw pointers to an Mmap in a couple of spots, because I found it a bit complex to change the APIs. I could probably replace those with Arc.

I'll also try doing a more aggressive refactor (eg introducing lifetimes in a few places) now that you're on board with the general approach.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants