How to stop rg after num matches? #2695
I have a huge file without EOL chars and I want to find a string in it in a reasonable amount of time. If the string is found, I need the byte offset of this string printed to stdout. Since the file doesn't have EOL chars, I need rg to stop after a number of matches, not a number of lines processed.

Such an option can be emulated with external tools, but with poor performance. The code below is executed in PowerShell:

```
> Get-Item .\rsl01.mrc | % Length
5612554780
> rg -m1 -o -b "nam" .\rsl01.mrc | head -n1
5:nam
# ^ This brings the desired output, but it took 22 seconds:
> Measure-Command { rg -m1 -o -b "nam" .\rsl01.mrc | head -n1 } | % Seconds
22
# For some reason the native pwsh cmdlet is much faster:
> Measure-Command { rg -m1 -o -b "nam" .\rsl01.mrc | Select-Object -First 1 } | % Seconds
2
```

I don't think two seconds is a reasonable amount of time for finding a string at the 6th byte of a file on an SSD. That was Win11 + pwsh; this is the performance in WSL Ubuntu:

```
> time rg -m1 -o -b "nam" ./rsl01.mrc | head -n1
5:nam
real 2m17.032s
user 0m1.694s
sys 0m10.831s
```

Performance is much worse here because rsl01.mrc is on a different filesystem. Obviously it would be much faster if rg had an option to limit the number of matches. There was an issue requesting this option in its title, and it was closed as completed, but I still can't find this option; it doesn't look like it's listed in the docs.
Replies: 1 comment
That's your problem. ripgrep (and grep) are line-oriented search tools. Fundamentally, they operate as if iterating over every line in a file and printing each matching line. This is not just an implementation detail; it is the conceptual model on which the program works. If you don't have line-oriented data, the grep programs are usually less useful. There are "hacks" to make grep tools work on binary data by changing the line terminator (e.g., the `--null-data` flag makes ripgrep use the NUL byte as a line terminator instead of `\n`), but generally speaking, if you can't formulate your data as lines then grep is not the right tool.

This should make it clear why just adding an option to limit the number of matches is not going to solve your problem. Before ripgrep even gets to the point of applying a limit, it has to load the first line of the file into memory. If your file is 5GB and it's just one long line, well, then it has to load 5GB into memory. It doesn't matter that it only needs to search the first 6 bytes to find the first match. It still has to load the 5GB into memory first, because that's the size of the first line.

You could jump to the next thing, which is, "well, you should implement search without loading the full line into memory." And indeed, if that were easy to do, then that might be a good idea. But it isn't easy to do. (And if you want to talk about that, then please open a separate Discussion question, because it is a deep and nasty pile of weeds.)

With all that said, I find your timings to be quite high. You didn't provide a reproduction for me (you should), so I'll have to make one up myself. Here's a Rust program:

```rust
use std::{env::args_os, io::Write};

fn main() -> anyhow::Result<()> {
    let Some(arg) = args_os().nth(1) else {
        anyhow::bail!("Usage: d2695 <one-line | many-lines>")
    };
    let Ok(which) = arg.into_string() else {
        anyhow::bail!("command given is not valid UTF-8")
    };
    match &*which {
        "one-line" | "many-lines" => {}
        unknown => anyhow::bail!("unrecognized command '{unknown}'"),
    }
    let mut out = std::io::BufWriter::new(std::io::stdout().lock());
    write!(out, "XXXXXXfooXXXXXX")?;
    for _ in 0..1000 {
        for (i, ch) in ('\u{1}'..='\u{10FFFF}').enumerate() {
            if ch == '\r' || ch == '\n' {
                continue;
            }
            out.write_all(ch.encode_utf8(&mut [0; 4]).as_bytes())?;
            if i % 1000 == 0 && which == "many-lines" {
                out.write_all(b"\n")?;
            }
        }
    }
    out.write_all(b"\n")?;
    out.flush()?;
    Ok(())
}
```

And a `Cargo.toml`:

```toml
[package]
publish = false
name = "d2695"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1.0.77"

[[bin]]
name = "d2695"
path = "main.rs"
```

And now generate two haystacks of similar size, but one with many lines and one with only one line:
And finally let's run some searches:
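The exact commands aren't reproduced above. As a scaled-down stand-in, the same byte-offset behavior can be seen on a tiny one-line haystack; this sketch uses GNU grep for portability, whose `-o`, `-b`, and `-m` flags behave analogously to ripgrep's flags of the same names here:

```shell
# Build a small stand-in haystack: one long line with the needle near the start.
printf 'XXXXXXfooXXXXXX' > one-line.txt
head -c 1000000 /dev/zero | tr '\0' 'a' >> one-line.txt
printf '\n' >> one-line.txt

# -o: print only the match, -b: prefix its byte offset, -m1: stop after the
# first matching line. "foo" begins at byte 6, so this prints "6:foo".
grep -obm1 foo one-line.txt
```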
This makes it clear that what I said above is pretty much what is happening here... except that ripgrep sometimes uses file-backed memory maps. And I expect that may be helping especially here, because it side-steps the issue of loading each line onto the heap. We can test that explicitly:
So memory maps are helping a lot here. When they aren't used, ripgrep is indeed reading the full file onto the heap. The case of multiple lines works as one would expect: since each line is reasonably short, there's no problem with reading them onto the heap and then stopping after the first match. The timings and memory use above match up with what one would expect.
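To make the "the whole line must be buffered before line-oriented search can report it" point concrete, here is a small std-only Rust sketch. This is not ripgrep's code; `first_line_buffered_len` is a name made up for illustration, and `read_until` stands in for the line reading any line-oriented tool must do:

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Read the first "line" of a reader and return how many bytes had to be
/// buffered before the line terminator (or EOF) was seen.
fn first_line_buffered_len<R: std::io::Read>(rdr: R) -> usize {
    let mut buf = Vec::new();
    BufReader::new(rdr)
        .read_until(b'\n', &mut buf)
        .expect("read failed");
    buf.len()
}

fn main() {
    // A "file" whose only match is in the first few bytes, but whose first
    // line is ~1 MB long: the entire line lands on the heap before any
    // line-oriented search can print anything.
    let mut one_line = b"XXXXXXfooXXXXXX".to_vec();
    one_line.resize(1 << 20, b'a');
    one_line.push(b'\n');
    let buffered = first_line_buffered_len(Cursor::new(&one_line));
    assert_eq!(buffered, one_line.len());

    // With short lines, only the first short line is buffered: 16 bytes,
    // including the terminator.
    let many_lines = b"XXXXXXfooXXXXXX\nrest of file...\n";
    assert_eq!(first_line_buffered_len(Cursor::new(&many_lines[..])), 16);
    println!("one long line forced {buffered} bytes onto the heap");
}
```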
This should hopefully be addressed by the commentary above. If ripgrep had a limit on the number of matches, that wouldn't help you on its own, because it still might need to load each line onto the heap. Now, the use of memory maps can side-step that issue, as demonstrated above; in that case, I believe a limit on the number of matches would help you, but only when the memory map optimization kicks in. And that's not guaranteed to happen. Adding an option that only really makes sense in this sort of niche scenario doesn't make a ton of sense IMO. It's too in the weeds to be used in a sensible way.
I already explained this in that issue. The request was implemented. The flag is called `-m/--max-count`.
What? Yes it is...
Again, that is not what it says:
It looks like ripgrep 13 did indeed have this wrong:
But ripgrep 14 is the latest release, and the wording was apparently fixed there.
There may be bespoke ways to make your particular search faster, but since I don't know the full context of the problem you're trying to solve nor the specific haystack contents, I can't really help much here. But yes, in general, ripgrep needs to be able to read each line onto the heap. If the lines are so huge as to make this impractical, then you either need to reformat the data to make it line oriented in some way, or change your search tool.
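For the narrow "byte offset of the first occurrence" use case, a byte-oriented search that streams the file in fixed-size chunks never needs to hold a whole line in memory. This is a hand-rolled std-only sketch, not a ripgrep feature; `find_offset` and the 64 KiB chunk size are illustrative choices, and a real implementation would use a faster substring algorithm:

```rust
use std::io::Read;

/// Stream `rdr` in fixed-size chunks and return the byte offset of the first
/// occurrence of `needle`, keeping at most one chunk (plus a small overlap
/// for matches that straddle a chunk boundary) in memory.
fn find_offset<R: Read>(mut rdr: R, needle: &[u8]) -> std::io::Result<Option<u64>> {
    assert!(!needle.is_empty());
    let overlap = needle.len() - 1;
    let mut buf = vec![0u8; 64 * 1024 + overlap];
    let mut filled = 0usize; // bytes currently valid in `buf`
    let mut base = 0u64; // input offset corresponding to `buf[0]`
    loop {
        let n = rdr.read(&mut buf[filled..])?;
        if n == 0 {
            return Ok(None);
        }
        filled += n;
        // Naive scan; no full window fits inside the carried-over overlap
        // alone, so matches are never reported twice.
        if let Some(i) = buf[..filled].windows(needle.len()).position(|w| w == needle) {
            return Ok(Some(base + i as u64));
        }
        if filled > overlap {
            // Carry the last `overlap` bytes forward so a match spanning the
            // chunk boundary is still found on the next iteration.
            let keep = filled - overlap;
            buf.copy_within(keep..filled, 0);
            base += keep as u64;
            filled = overlap;
        }
    }
}

fn main() -> std::io::Result<()> {
    // A long "one line" haystack with the needle deep inside it.
    let mut haystack = vec![b'a'; 200_000];
    haystack[150_000..150_003].copy_from_slice(b"nam");
    let off = find_offset(&haystack[..], b"nam")?;
    assert_eq!(off, Some(150_000));
    println!("found at byte offset {off:?}");
    Ok(())
}
```

Memory use stays bounded by the chunk size regardless of how long the "line" is, and the search stops as soon as the first match is found.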