Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is a proposed change to the storage mechanism of compiled regular expressions. Currently, a compiled regex is made up of an array of these fixed-size structures:
The newly proposed data structure is a sequence of variable-length nodes of this structure:
The first difference between the two is that the new structure does not contain any pointers to any other data. All the data for the regex is self-contained which means there's no ownership dependencies to handle.
The next difference is fixed-length vs variable-length. The current fixed-length
regex_t
structure is 8 bytes on 32-bit systems and 16 bytes on 64-bit systems. This is because even though thetype
field is a single byte, the second field contains a pointer which needs to be aligned on the system's word boundary (4 for 32-bit 8 for 64-bit). Note that it also wouldn't matter whether the field order were changed, because the size of the structure also needs to align the next element if it were in an array. In comparison, the newly proposed data structure is variable length. It is 2 bytes for most types withCHAR
,CHAR_CLASS
andINV_CHAR_CLASS
taking more bytes to store their raw characters. The new structure requires less memory to store any regex and typically ends up taking 2 to 4 times less memory.It's also worth noting that the new structure has no alignment requirement, so it can be moved to any location in memory with something like
memcpy
.Another difference is the new data structure should be a bit more cache friendly while checking for matches. Along with being smaller, it also turned the original structure of 2 buffers into 1 buffer which guarantees that the type id is right next the data it needs while checking for matches.
Yet another difference is the buffers to store both the regex objects and the character data are "uncolored". This means you don't have to adjust two different sizes of two different buffers to make sure you have enough memory to store a regex, there is only one size to adjust.
I also tested run time using
time ./tests/test2
with this change. Before this changes times range from 4 to 5.5 seconds, after this change they ranged from 3.8 to 4 seconds. Times were more consistent and around 20% faster on average, which makes sense considering that it should be more cache friendly.