Added tooling to validate ptrs passed to free() #75

BlitterObjectBob · 2024-02-02T22:40:08Z

Every allocation now gets a pre- and post structure, instead of just a size. Each structure has a tag so that every free() or realloc() is able to check them for corruption.

This will likely be the first of at least two such tooling PRs. This first one sets up the mechanism and the second expands on it to add leak checks.

bradh352 · 2024-02-04T13:51:39Z

This seems like a really really really bad idea. Creating a pipe() and writing to it to validate memory? So multiple syscalls and entry and exit from kernel on memory to try to validate memory pointers?

Then I also see a note about adding leak checking? Unless you're implementing a full stack unwind for each leaked block, its not going to be very usable. I don't know why you don't just LD_PRELOAD jemalloc and use the features built in there.

You can have jemalloc abort() on detected memory issues (so you can then capture that signal and write the stack returned from backtrace(3)). You can also turn on profiling so it dumps profiling data on an interval which you can then extract and analyze on a separate machine, it will graph memory usage between any 2 profiling dumps (so you can say ignore any memory allocated during application startup).

https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Leak-Checking

Oh, forgot to mention, this PR won't work on Windows either, breaks building.

BlitterObjectBob · 2024-02-04T14:27:33Z

Thank you, I will wrap the changes to prevent breaking Windows builds.

This code is essentially temporary in order to help us find the cause of a SEGV on Linux (we suspect a memory scribbler) and so we are planning to enable this code in a test environment to try to capture it. The idea is to create a build that liberally checks pointers before use so we can get flagged as close to the scribbling as possible. Using a pipe to validate a pointer is more efficient than other ways but I am open to suggestions in alternatives to validate pointers before the signal handler is invoked.

There are 100% crashers in jemalloc (as found in the SEGV investigation) and the latest release is approaching 2 years old with the repo having 290 issues, so the confidence in jemalloc has been diminished somewhat. From the latest evidence, we are (indirectly) relying on jemalloc to prevent memory leaks so the plan is to find the crasher first then look for leaks.

bradh352 · 2024-02-04T15:23:47Z

I'd be hard pressed to blame jemalloc. A SIGSEGV backtrace when there is memory corruption is basically garbage, its going to point to all different areas that you can't trust. Plus just scanning the issues for jemalloc, a lot are feature requests, questions on integration or usage, and interactions with 3rd party libraries that do non-standard things. I'm not seeing anything in a general use case that would indicate its going to be at fault.

I know for a fact that Monetra was using it in production for over 10 years to improve issues with memory fragmentation and was also used to debug memory leaks like previously described. If all of a sudden you're having crashes or out of memory conditions, the first place I'd look are any changes to your code base that might be causing memory corruption. If you haven't changed jemalloc versions, because there haven't been new releases, I don't see how you can blame something that is unchanged for new issues.

Regarding memory corruption, you could always temporarily deploy a production build with ASAN enabled to help debug such situations. Yes, ASAN will cause slowdown so you'd probably only want to deploy on a single node of a cluster, but with something like Monetra it would have a negligible performance impact on transactions (reports, however, would be where the larger impact comes in).

BlitterObjectBob · 2024-02-04T15:37:44Z

I wasn't blaming jemalloc for the SEGV (jemalloc doesn't appear in the stack trace), just pointing out that there are two (perhaps related) issues at play and that our reliance on jemalloc is one of the causes for concern.

You are correct about the stack trace not being trustworthy, which is where the need to trap the bad pointer before it's actually used comes from.

Added tooling to validate ptrs passed to free()

ec86d21

BlitterObjectBob closed this by deleting the head repository Jun 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added tooling to validate ptrs passed to free() #75

Added tooling to validate ptrs passed to free() #75

BlitterObjectBob commented Feb 2, 2024

bradh352 commented Feb 4, 2024 •

edited

Loading

BlitterObjectBob commented Feb 4, 2024

bradh352 commented Feb 4, 2024 •

edited

Loading

BlitterObjectBob commented Feb 4, 2024

Added tooling to validate ptrs passed to free() #75

Added tooling to validate ptrs passed to free() #75

Conversation

BlitterObjectBob commented Feb 2, 2024

bradh352 commented Feb 4, 2024 • edited Loading

BlitterObjectBob commented Feb 4, 2024

bradh352 commented Feb 4, 2024 • edited Loading

BlitterObjectBob commented Feb 4, 2024

bradh352 commented Feb 4, 2024 •

edited

Loading

bradh352 commented Feb 4, 2024 •

edited

Loading