Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selenium grid support #89

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
Open

Selenium grid support #89

wants to merge 13 commits into from

Conversation

muodov
Copy link
Member

@muodov muodov commented Feb 6, 2023


@muodov muodov requested a review from kdzwinel February 7, 2023 07:59
Copy link
Member

@kdzwinel kdzwinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I left you some comments.

@@ -0,0 +1,25 @@
const DEFAULT_USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, I like that you extracted this!

reporters/CLIReporter.js Show resolved Hide resolved
@@ -56,6 +56,7 @@ class CLIReporter extends BaseReporter {
this.progressBar.total = data.urls;
this.progressBar.tick({
site: data.site,
finished: `${finished} / ${data.urls}`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is that different from :percent ? why do we need this number?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On bigger crawls, percents change very slowly, and when something goes wrong (e.g. a deadlock or slow network), it is not always obvious. With this change, we will show the exact number of analyzed sites.

@@ -30,6 +30,7 @@ Available options:
- `-r, --region-code <region>` - optional 2 letter region code. For metadata only
- `-a, --disable-anti-bot` - disable simple build-in anti bot detection script injected to every frame
- `--chromium-version <version_number>` - use custom version of Chromium (e.g. "843427") instead of using the default
- `--selenium-hub <url>` - instead of running a local browser, request a remote browser at the provided Selenium hub endpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also add it to other examples in the readme (below)

*/
function filterUrls(inputUrls, logFunction, outputPath) {
return new Promise((resolveFilterUrls, rejectFilterUrls) => {
asyncLib.filter(inputUrls, (item, filterCallback) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is performance here worth the the complexity of doing filtering async?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do a detailed profiling, but with hundreds of thousands URLs, this takes quite a lot of time (tens of seconds). Depending on the FS speed it could make a difference I think.
The latest async lib supports promises, so if/when we upgrade it it should look less ugly.

if (seleniumHub) {
try {
prefixedLog(`Getting remote browser...`);
driver = await getRemoteDriver({seleniumHub, chromiumVersion, proxyHost});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is chromiumVersion in the expected format here? Is selenium also expecting "revision" instead of regular "version" number? I understand that we are reusing --chromium-version flag to pass it?

@@ -0,0 +1,111 @@
const {Builder} = require("selenium-webdriver");
const puppeteer = require('puppeteer');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have inconsistent quotes (' vs ") can we require ' in linter?

extraExecutionTimeMs,
collectorFlags,
});
let browserContext = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this selenium stuff doesn't feel right in here, this function is not about setting up a browser, but rather doing a single site crawl and dumping the data. I'd rather do it in the crawler.js/openBrowser - that's where we are setting up the browser. Additionally, it will help with some of the repetition (e.g. flags - '--enable-blink-features=InterestCohortAPI',)

const opts = new chrome.Options();

// default chrome arguments passed by puppeteer 10.2.0
opts.addArguments(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we extract those flags from puppeteer somehow? e.g. aren't they in some public constant?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants