
Automatically add exclusion rules based on robots.txt #631

Open
benoit74 opened this issue Jun 27, 2024 · 5 comments

Comments

@benoit74
Contributor

It would be nice if the crawler could automatically fetch robots.txt and add an exclusion rule for every rule present in that file.

I think this functionality should even be turned on by default, to avoid annoying servers which have clearly expressed what they do not want "external systems" to mess with.

At Kiwix, we have lots of non-tech users configuring zimit to do a browsertrix crawl. In most cases, they have no idea what a robots.txt is, so having the switch turned on by default would help a lot. That being said, I don't mind if it is off by default, we can do the magic to turn it on by default in zimit ^^

@rgaudin
Contributor

rgaudin commented Jun 27, 2024

Despite its name, robots.txt's purpose is to prevent (well, actually just give directions to) indexing robots exploring resources. browsertrix-crawler is a technical bot, but it acts as a user and certainly not as an indexing bot.

I don't see value in such a feature, but I can imagine there are scenarios where it can be useful. @benoit74 do you have one to share?

Without further information, I'd advise against having this (not yet existent) feature on by default, as it changes the crawler's behavior, while I think this project relies on explicit flags for that.

@benoit74
Contributor Author

The first use case is https://forums.gentoo.org/robots.txt, where the robots.txt content indicates fairly accurately what we should exclude from a crawl of the https://forums.gentoo.org/ website:

Disallow: /cgi-bin/
Disallow: /search.php
Disallow: /admin/
Disallow: /memberlist.php
Disallow: /groupcp.php
Disallow: /statistics.php
Disallow: /profile.php
Disallow: /privmsg.php
Disallow: /login.php
Disallow: /posting.php

The idea behind automatically using robots.txt is to help lazy / less knowledgeable users get a first version of a WARC/ZIM which is likely to contain only useful content, rather than wasting time and resources (ours and the upstream server's) building a WARC/ZIM with too many unneeded pages.

Currently in self-service mode, users tend to simply input the URL https://forums.gentoo.org/ and say "Zimit!". And this is true for "young" Kiwix editors as well.

After that initial run, it might prove interesting in this case to still include /profile.php (user profiles) in the crawl; at the very least, such a choice probably needs to be discussed by humans. But this kind of refinement can be done in a second step, once we realize we are missing it.

If we do not automate something here, it means the self-service approach is mostly doomed to produce only bad archives, which is a bit sad.
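
For illustration, here is a minimal sketch of what that automation could look like, assuming we simply turn every specific Disallow path into an exclusion regex (the function name and behaviour below are hypothetical, not the crawler's actual API):

```ts
// Hypothetical sketch: fetch a site's robots.txt and turn each specific
// Disallow path into an exclusion regex. Not the crawler's actual API.
async function robotsDisallowsToExclusions(siteUrl: string): Promise<RegExp[]> {
  const robotsUrl = new URL("/robots.txt", siteUrl).href;
  const res = await fetch(robotsUrl);
  if (!res.ok) {
    return [];
  }
  const text = await res.text();

  const exclusions: RegExp[] = [];
  for (const rawLine of text.split("\n")) {
    // Drop comments and surrounding whitespace.
    const line = rawLine.split("#")[0].trim();
    const match = line.match(/^Disallow:\s*(\S+)/i);
    if (!match) {
      continue;
    }
    const path = match[1];
    // Ignore a blanket "Disallow: /" -- excluding everything defeats the crawl.
    if (path === "/") {
      continue;
    }
    // Escape regex metacharacters so the path is matched literally.
    const escaped = path.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    exclusions.push(new RegExp(escaped));
  }
  return exclusions;
}
```

For https://forums.gentoo.org/ this would produce patterns such as /cgi-bin/ and /search\.php, which could then be fed into the crawler's existing exclusion mechanism.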

@rgaudin
Contributor

rgaudin commented Jun 27, 2024

This confirms that it can be useful in zimit, via an option (that you'd turn on).

@ikreymer
Member

ikreymer commented Jul 4, 2024

We're definitely aware of robots.txt and generally haven't used these, as they may be too restrictive for browser-based archiving. However, robots.txt may provide a hint for paths to exclude, as you suggest.

The idea would be to gather all the specific Disallow rules, while ignoring something like Disallow: /. Of course, some of the robots rules are URL-specific, but they could also apply to in-page block rules.

An interesting idea: we could extend the existing sitemap parsing, which already parses robots.txt:
https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/sitemapper.ts#L209
and simply parse all of the Disallow and Allow rules to create exclusions and inclusions.

Not quite sure how to handle different user agents - perhaps grabbing rules from all of them, or a specific one?
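
Something like this very rough parsing sketch, where by default rules from every User-agent group are merged (placeholder names, not the existing sitemapper.ts code):

```ts
// Rough sketch only: collect Allow / Disallow paths from robots.txt text,
// merging rules from every User-agent group by default. Placeholder names,
// not the actual sitemapper.ts implementation.
interface RobotsRules {
  allow: string[];
  disallow: string[];
}

function parseRobotsRules(robotsTxt: string, userAgent = "*"): RobotsRules {
  const rules: RobotsRules = { allow: [], disallow: [] };
  let groupAgents: string[] = [];
  let collecting = false;

  for (const rawLine of robotsTxt.split("\n")) {
    const line = rawLine.split("#")[0].trim();
    const sep = line.indexOf(":");
    if (!line || sep === -1) {
      continue;
    }
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      groupAgents.push(value.toLowerCase());
      // "*" means: merge rules from all groups; otherwise only groups that
      // list "*" or the requested agent are collected.
      collecting =
        userAgent === "*" ||
        groupAgents.includes("*") ||
        groupAgents.includes(userAgent.toLowerCase());
    } else if (field === "allow" || field === "disallow") {
      // Skip empty values and a blanket "Disallow: /", which is too
      // restrictive for browser-based archiving.
      if (collecting && value && !(field === "disallow" && value === "/")) {
        (field === "allow" ? rules.allow : rules.disallow).push(value);
      }
      // A later User-agent line will start a new group.
      groupAgents = [];
    }
  }
  return rules;
}
```

The disallow paths would then be mapped to exclusion regexes and the allow paths to inclusions.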

This isn't a priority for us at the moment, but would welcome a PR that does this!

@benoit74
Contributor Author

benoit74 commented Jul 8, 2024

Good points!

This is not a high priority for us either, let's hope we find time to work on it ^^
