the regex `.{32768}` ended up running pretty slow. how can I make it go faster? #2530

robinwhittleton · 2023-06-12T07:24:30Z

My goal: find all lines in markup exceeding 32767 characters. My naïve solution: `rg -txml '.{32768}'`. This took an exceedingly long time.

Could I have done it better? Could ripgrep be better optimised for this unusual use case?

(My data is the corpus of books at https://github.com/standardebooks/, so a typical line length might be between 15 and a few thousand characters. There’s a little over 1.3GB of markup in total to scan. I didn’t log a specific runtime but it was around 15 minutes on a recentish Intel MacBook Air, sitting at 600-700% CPU.)

Answered by BurntSushi

BurntSushi · 2023-06-12T11:43:17Z

Maintainer

The problem is that when you write something like `.{5}`, it is quite literally translated as `.....`. So when you do `.{32768}`, it turns into a very giant regex. `.` is also further complicated by the fact that it is itself a small little state machine that matches the UTF-8 encoding of any Unicode scalar value (sans `\n`). It's small in the sense that it's only about 12x bigger than `(?-u:.)` (which matches any byte value except for `\n`), but when you repeat it a large number of times, that small increase can add up. So you could try using `(?-u:.)` instead if your data set is mostly ASCII, or if you can abide a single codepoint being matched by multiple `.`s.
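To make the codepoint-versus-byte distinction concrete, here is a minimal sketch using Python's `re` module as a stand-in. It is not ripgrep's regex engine; the assumption is only that a `str` pattern counts codepoints (loosely like Unicode-aware `.`) while a `bytes` pattern counts bytes (loosely like `(?-u:.)`). The sample text is made up for illustration.

```python
import re

text = "n\u00e9!"             # 3 codepoints; "é" takes 2 bytes in UTF-8
data = text.encode("utf-8")   # 4 bytes in total

# A bounded repeat is shorthand for the pattern written out n times.
expanded = "." * 3            # same meaning as ".{3}"
codepoint_match = re.fullmatch(expanded, text)

# A str pattern counts codepoints; a bytes pattern counts bytes.
byte_match_3 = re.fullmatch(rb".{3}", data)
byte_match_4 = re.fullmatch(rb".{4}", data)

print(codepoint_match is not None)  # three dots cover the three codepoints
print(byte_match_3 is not None)     # three dots do not cover four bytes
print(byte_match_4 is not None)     # four dots cover the four bytes
```

So with a byte-oriented pattern, a line with non-ASCII text reaches the 32768 threshold at fewer visible characters than its codepoint count would suggest.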

Otherwise, the main thing you can probably do is use `--dfa-size-limit` to set a very large capacity value. See, when a regex gets too big, it switches from a faster engine that requires more space to a slower engine that requires more time. It's plausible that `.{32768}` is big enough to trip that and that increasing the size limit will help by enabling the faster regex engine to be used. But it may use more memory.

So, that's `rg -txml '(?-u:.){32768}' --dfa-size-limit 999999999` or something like it.

Generally speaking, using a regex with crazy high bounded repeats like this is not a good idea. Regex engines just don't tend to handle them that well because the bounded repeats are really just terse syntax for writing bigger regexes. The bigger the regex, the longer the search takes, generally speaking. And when it gets so big that it crosses internal heuristic thresholds for optimizations, the speed difference can become quite noticeable.

With that said, for simple cases like this, we can probably do better. And when the regex is your only interface to a tool, there isn't much other choice here other than writing your own little quick program to do what you want. That's what this issue is for in the regex crate: rust-lang/regex#802
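For reference, a "little quick program" of that kind might look like the sketch below. It is only an illustration, not something from ripgrep: the names `long_lines` and `main` are made up, the `*.xml` glob is an assumed (and narrower) stand-in for ripgrep's `-txml` file type, and lengths are counted in codepoints after decoding as UTF-8.

```python
import sys
from pathlib import Path

LIMIT = 32767  # report lines longer than this many characters


def long_lines(lines, limit=LIMIT):
    """Yield (line_number, length) for each line longer than `limit`."""
    for number, line in enumerate(lines, start=1):
        length = len(line.rstrip("\n"))  # do not count the line terminator
        if length > limit:
            yield number, length


def main(root):
    # Assumption: only *.xml files are of interest; adjust the glob as needed.
    for path in sorted(Path(root).rglob("*.xml")):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, length in long_lines(handle):
                print(f"{path}:{number}: {length} characters")


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Run it with the directory to scan as its single argument. Because it only measures line lengths, no regex needs to be compiled at all.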

> My data is the corpus of books at https://github.com/standardebooks/

Whenever possible, please either share the actual data or the steps required to get the data. Otherwise this particular line doesn't really help me much. :-)

1 reply

robinwhittleton Jun 12, 2023
Author

Thanks for the super-detailed answer! And for the link to the issue, which seems to be asking for exactly the same thing I wanted. I’ll track that.

OnlineCop · 2023-07-13T18:55:04Z

While late to the party, another suggestion is to add an anchor to the beginning of the pattern: `rg -txml '^.{32768}'` (or, applied to the earlier non-Unicode suggestion: `rg -txml '^(?-u:.){32768}'`).

This should prevent ripgrep from continuing to match additional characters on the same line, after it has already reported that there was at least one match there.

As an example, let's say that Line 2 contains 32768 characters, Line 3 contains 65536 (32768 × 2) characters, and Line 5 contains 98304 (32768 × 3) characters.
Using the `^` anchor means that only the first 32768 characters need to be tested before moving on to the next line for processing.
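The effect of the anchor on the number of matches per line can be sketched with Python's `re` module, scaled down to a limit of 4. This is only an analogy: Python's engine is not ripgrep's, and `LIMIT` and the sample line are made up for illustration.

```python
import re

LIMIT = 4                 # small stand-in for 32768
line = "a" * (LIMIT * 3)  # a line three times as long as the limit

# Unanchored: the engine keeps finding further matches along the line.
unanchored = re.findall(".{%d}" % LIMIT, line)

# Anchored: only the start of the line can match, so there is one match.
anchored = re.findall("^.{%d}" % LIMIT, line)

print(len(unanchored))  # 3
print(len(anchored))    # 1
```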

By default, ripgrep will display the contents of the match, so for every match, you are getting 32768+ characters returned to the terminal (or whatever process is consuming ripgrep's stdout).

Try this: `rg -txml '^.{32768,}' -nr LONG_LINE`

- Again, use the `^` anchor to speed up the detection.
- Use `-n`, `--line-number` so the report shows you which lines were too long.
- Use `-r`, `--replace` with some custom text so instead of 32768+ characters being returned to your terminal/stdout, a much more manageable block of text can follow the line number.
  - Note that `-r` does not modify your original file; it only modifies its own output.
- Change the quantifier from `{32768}` to `{32768,}` (note the trailing comma) so any remaining text on the line ALSO gets replaced by the `-r` option.
  - Otherwise, a line containing 32769-or-more characters would replace only the first 32768 characters with LONG_LINE but the rest of the text will be returned to your terminal/stdout.
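The difference between the exact and the open-ended quantifier under replacement can be sketched with Python's `re.sub`, again scaled down to a limit of 4. This is an analogy to what `-r` prints, not ripgrep itself; `LIMIT` and the sample line are made up for illustration.

```python
import re

LIMIT = 4         # small stand-in for 32768
line = "abcdefg"  # 7 characters, so longer than the limit

# Exact quantifier: only the first LIMIT characters are replaced.
exact = re.sub("^.{%d}" % LIMIT, "LONG_LINE", line)

# Open-ended quantifier: the rest of the line is swallowed as well.
open_ended = re.sub("^.{%d,}" % LIMIT, "LONG_LINE", line)

print(exact)       # LONG_LINEefg
print(open_ended)  # LONG_LINE
```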
