How to stop rg after num matches? #2695
I have a huge file without EOL chars and I want to find a string in it in a reasonable amount of time. If the string is found, I need the byte offset of this string printed to stdout. Since the file doesn't have EOL chars, I need rg to stop after a number of matches, not a number of lines processed.

Such an option can be emulated with external tools, but with poor performance. The code below is executed in PowerShell:

```
> Get-Item .\rsl01.mrc | % Length
5612554780
> rg -m1 -o -b "nam" .\rsl01.mrc | head -n1
5:nam
# ^ This brings the desired output, but it took 22 seconds:
> Measure-Command { rg -m1 -o -b "nam" .\rsl01.mrc | head -n1 } | % Seconds
22
# For some reason the native pwsh cmdlet is much faster:
> Measure-Command { rg -m1 -o -b "nam" .\rsl01.mrc | Select-Object -First 1 } | % Seconds
2
```

I don't think two seconds is a reasonable amount of time for finding a string at the 6th byte of a file on an SSD. That was Win11 + pwsh; this is the performance in WSL Ubuntu:

```
> time rg -m1 -o -b "nam" ./rsl01.mrc | head -n1
5:nam
real 2m17.032s
user 0m1.694s
sys 0m10.831s
```

Performance is much worse here because rsl01.mrc is on a different filesystem. Obviously it would be much faster if rg had an option to limit the number of matches. There was an issue requesting this option in its title, and it was closed as completed, but I still can't find this option; it doesn't look like it's listed in the docs.
Replies: 1 comment
That's your problem. ripgrep (and grep) are line-oriented search tools. Fundamentally, they operate as if iterating over every line in a file and printing each matching line. This is not just an implementation detail; it is the conceptual model on which the program works. If you don't have line-oriented data, the grep programs are usually less useful. There are "hacks" to make grep tools work on binary data by changing the line terminator (e.g., the `--null-data` flag makes ripgrep use the NUL byte as a line terminator instead of `\n`), but generally speaking, if you can't formulate your data as lines then grep is not the right tool.

This should make it clear why just adding an option to limit the number of matches is not going to solve your problem. Before ripgrep even gets to the point of applying a limit, it has to load the first line of the file into memory. If your file is 5GB and it's just one long line, well, then it has to load 5GB into memory. It doesn't matter that it only needs to search the first 6 bytes to find the first match. It still has to load the 5GB into memory first, because that's the size of the first line.

You could jump to the next thing, which is, "well, you should implement search without loading the full line into memory." And indeed, if that were easy to do, then that might be a good idea. But it isn't easy to do. (And if you want to talk about that, then please open a separate Discussion question, because it is a deep and nasty pile of weeds.)

With all that said, I find your timings to be quite high. You didn't provide a reproduction for me (you should), so I'll have to make one up myself. Here's a Rust program:

```rust
use std::{env::args_os, io::Write};

fn main() -> anyhow::Result<()> {
    let Some(arg) = args_os().nth(1) else {
        anyhow::bail!("Usage: d2695 <one-line | many-lines>")
    };
    let Ok(which) = arg.into_string() else {
        anyhow::bail!("command given is not valid UTF-8")
    };
    match &*which {
        "one-line" | "many-lines" => {}
        unknown => anyhow::bail!("unrecognized command '{unknown}'"),
    }
    let mut out = std::io::BufWriter::new(std::io::stdout().lock());
    write!(out, "XXXXXXfooXXXXXX")?;
    for _ in 0..1000 {
        for (i, ch) in ('\u{1}'..='\u{10FFFF}').enumerate() {
            if ch == '\r' || ch == '\n' {
                continue;
            }
            out.write_all(ch.encode_utf8(&mut [0; 4]).as_bytes())?;
            if i % 1000 == 0 && which == "many-lines" {
                out.write_all(b"\n")?;
            }
        }
    }
    out.write_all(b"\n")?;
    out.flush()?;
    Ok(())
}
```

And a `Cargo.toml`:

```toml
[package]
publish = false
name = "d2695"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1.0.77"

[[bin]]
name = "d2695"
path = "main.rs"
```

And now generate two haystacks of similar size, but one with many lines and one with only one line:
And finally let's run some searches:
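The exact commands aren't reproduced above. As a scaled-down stand-in, the same byte-offset behavior can be seen on a tiny one-line haystack; this sketch uses GNU grep for portability, whose `-o`, `-b`, and `-m` flags behave analogously to ripgrep's flags of the same names here:

```shell
# Build a small stand-in haystack: one long line with the needle near the start.
printf 'XXXXXXfooXXXXXX' > one-line.txt
head -c 1000000 /dev/zero | tr '\0' 'a' >> one-line.txt
printf '\n' >> one-line.txt

# -o: print only the match, -b: prefix its byte offset, -m1: stop after the
# first matching line. "foo" begins at byte 6, so this prints "6:foo".
grep -obm1 foo one-line.txt
```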
This makes it clear that what I said above is pretty much what is happening here... except that ripgrep sometimes uses file-backed memory maps. And I expect that may be helping especially here, because it side-steps the issue of loading each line onto the heap. We can test that explicitly:
So memory maps are helping a lot here. When they aren't used, ripgrep is indeed reading the full file onto the heap. The case of multiple lines works as one would expect: since each line is reasonably short, there's no problem with reading them onto the heap and then stopping after the first match. The timings and memory use above match up with what one would expect.
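To make the "the whole line must be buffered before line-oriented search can report it" point concrete, here is a small std-only Rust sketch. This is not ripgrep's code; `first_line_buffered_len` is a name made up for illustration, and `read_until` stands in for the line reading any line-oriented tool must do:

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Read the first "line" of a reader and return how many bytes had to be
/// buffered before the line terminator (or EOF) was seen.
fn first_line_buffered_len<R: std::io::Read>(rdr: R) -> usize {
    let mut buf = Vec::new();
    BufReader::new(rdr)
        .read_until(b'\n', &mut buf)
        .expect("read failed");
    buf.len()
}

fn main() {
    // A "file" whose only match is in the first few bytes, but whose first
    // line is ~1 MB long: the entire line lands on the heap before any
    // line-oriented search can print anything.
    let mut one_line = b"XXXXXXfooXXXXXX".to_vec();
    one_line.resize(1 << 20, b'a');
    one_line.push(b'\n');
    let buffered = first_line_buffered_len(Cursor::new(&one_line));
    assert_eq!(buffered, one_line.len());

    // With short lines, only the first short line is buffered: 16 bytes,
    // including the terminator.
    let many_lines = b"XXXXXXfooXXXXXX\nrest of file...\n";
    assert_eq!(first_line_buffered_len(Cursor::new(&many_lines[..])), 16);
    println!("one long line forced {buffered} bytes onto the heap");
}
```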
This should hopefully be addressed by the commentary above. If ripgrep had a limit on the number of matches, that wouldn't help you on its own, because it still might need to load each line onto the heap. Now, the use of memory maps can side-step that issue, as demonstrated above; in that case, I believe a limit on the number of matches would help you, but only when the memory map optimization kicks in. And that's not guaranteed to happen. Adding an option that only really makes sense in this sort of niche scenario doesn't make a ton of sense IMO. It's too in the weeds to be used in a sensible way.
I already explained this in that issue. The request was implemented. The flag is called `-m/--max-count`.
What? Yes it is...
Again, that is not what it says:
It looks like ripgrep 13 did indeed have this wrong:
But ripgrep 14 is the latest release, and the wording was apparently fixed there.
There may be bespoke ways to make your particular search faster, but since I don't know the full context of the problem you're trying to solve nor the specific haystack contents, I can't really help much here. But yes, in general, ripgrep needs to be able to read each line onto the heap. If the lines are so huge as to make this impractical, then you either need to reformat the data to make it line oriented in some way, or change your search tool.
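For the narrow "byte offset of the first occurrence" use case, a byte-oriented search that streams the file in fixed-size chunks never needs to hold a whole line in memory. This is a hand-rolled std-only sketch, not a ripgrep feature; `find_offset` and the 64 KiB chunk size are illustrative choices, and a real implementation would use a faster substring algorithm:

```rust
use std::io::Read;

/// Stream `rdr` in fixed-size chunks and return the byte offset of the first
/// occurrence of `needle`, keeping at most one chunk (plus a small overlap
/// for matches that straddle a chunk boundary) in memory.
fn find_offset<R: Read>(mut rdr: R, needle: &[u8]) -> std::io::Result<Option<u64>> {
    assert!(!needle.is_empty());
    let overlap = needle.len() - 1;
    let mut buf = vec![0u8; 64 * 1024 + overlap];
    let mut filled = 0usize; // bytes currently valid in `buf`
    let mut base = 0u64; // input offset corresponding to `buf[0]`
    loop {
        let n = rdr.read(&mut buf[filled..])?;
        if n == 0 {
            return Ok(None);
        }
        filled += n;
        // Naive scan; no full window fits inside the carried-over overlap
        // alone, so matches are never reported twice.
        if let Some(i) = buf[..filled].windows(needle.len()).position(|w| w == needle) {
            return Ok(Some(base + i as u64));
        }
        if filled > overlap {
            // Carry the last `overlap` bytes forward so a match spanning the
            // chunk boundary is still found on the next iteration.
            let keep = filled - overlap;
            buf.copy_within(keep..filled, 0);
            base += keep as u64;
            filled = overlap;
        }
    }
}

fn main() -> std::io::Result<()> {
    // A long "one line" haystack with the needle deep inside it.
    let mut haystack = vec![b'a'; 200_000];
    haystack[150_000..150_003].copy_from_slice(b"nam");
    let off = find_offset(&haystack[..], b"nam")?;
    assert_eq!(off, Some(150_000));
    println!("found at byte offset {off:?}");
    Ok(())
}
```

Memory use stays bounded by the chunk size regardless of how long the "line" is, and the search stops as soon as the first match is found.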