-
-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate Mozilla's Readibility.js #42
base: main
Are you sure you want to change the base?
Integrate Mozilla's Readibility.js #42
Conversation
crawler.js
Outdated
@@ -654,14 +683,20 @@ class Crawler { | |||
if (!fs.existsSync(this.pagesDir)) { | |||
fs.mkdirSync(this.pagesDir); | |||
const header = {"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}; | |||
header["hasText"] = this.params.text; | |||
header["hasReaderView"] = this.params.readerView; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding "hasReaderView" how about using "textSource". That way we can reference using readability or the browser dom or 'boilerpipe'. the boilerpipe library lives in py-wacz and we have the dom extraction method used in browsertrix-crawler and in archiveweb.page. This way we can be specific about what method was used for text extraction.
So the current default would be browser-dom and when reader view is set it would instead be 'readability'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can change this. Just to confirm, you mean?
header["textSource"] = (this.params.readerView ? "readability" : "browser-dom");
Also: by now, the article
object is added to a page in case there is the reader view is available, see isProbablyReaderable. If the intention is to replace title and text by the readerable values, shouldn't the property set per page?
Would you be willing to add tests and a line in the readme where we detail all available flags? |
Sure, I'll add a test. Is it possible to provide a static page source? Could take https://www.iana.org/domains/reserved (Firefox applies the reader view), but with a remote page source any subtle change in the page content or layout may break the test. |
@sebastian-nagel sorry for the radio silence. We've been using www.example.com or example.org for tests. |
- see https://github.com/mozilla/readability - if enabled (command-line flag --readerView): - remove boilerplate from text and HTML - (if available) extract article metadat (author, etc.) - add readable 'article' object to page records in pages.jsonl
893f2b5
to
2acffd6
Compare
- indicate in header "textSource" from where the text extract could be taken (via readability or via DOM dump)
- add unit test reading https://www.iana.org/about
@emmadickson: no problem. sorry as well, some time has passed. I've rebased the branch to the current main branch to start work.
Sure. The point is: the page is so simple that there is no "readable" version of it. Only a short and clean text, no boilerplate to strip. So I've added a unit test reading https://www.iana.org/about - no cons if the tests shall be limited to example.com. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good practice to validate user input and handle any potential errors. In this case, there is no validation of the --readerView flag, and if the user enters an invalid value, the code will still run without showing any warnings or errors. It may be worth adding a check for invalid flag values and show an error message to the user.
The Readability library is loaded from a local file path, which could cause issues if the file is missing or corrupted. It may be more reliable to include Readability as a package dependency and load it through require() instead of reading from a file.
There are some console log statements in the code, which are useful for debugging during development, but they should be removed before the code is deployed to production.
There is a typo in the commit message: "metadata" is misspelled as "metadat".
Other than these minor issues, the code changes look good.
--readerView
The readability library is useful to get a clean text for news articles or blog posts, see the example below. Of course, the results are not always perfect.