Fix enqueueing of Minion jobs breaking PARALLEL_ONE_HOST_ONLY=1
#6048
base: master
Conversation
* Avoid having the same identical loop twice
* Use better variable names
When restarting openQA jobs, additional Minion jobs are enqueued (of the `git_clone` task; this happens especially often when `git_auto_clone` is enabled). So far this didn't happen in a transaction, so the scheduler might see the openQA jobs but not the Minion jobs they are blocked by. This is problematic because the scheduler might assign jobs to a worker too soon. It is especially problematic if it does not happen consistently across a parallel job cluster, as that breaks the `PARALLEL_ONE_HOST_ONLY=1` feature. Related ticket: https://progress.opensuse.org/issues/169342
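To illustrate the idea, here is a minimal sketch of enqueueing the Minion jobs within the same database transaction that creates the restarted jobs. The `enqueue_git_clones` call and the `%clones`/`@clone_ids` variables come from the diff below; the surrounding subroutine and the `auto_duplicate` call are assumptions about the job-restart code, not the actual implementation.

```perl
# Sketch only: run job duplication and Minion/GRU enqueueing in one
# DBIx::Class transaction so the scheduler never sees the new openQA jobs
# without the git_clone tasks that are supposed to block them.
sub duplicate_jobs_and_enqueue_clones {
    my ($schema, $jobs, $clones, $clone_ids) = @_;
    $schema->txn_do(
        sub {
            # create the restarted jobs (and their dependencies) first
            $_->auto_duplicate for @$jobs;

            # enqueue git_clone Minion jobs in the same transaction so the
            # blocking dependencies become visible together with the jobs
            OpenQA::App->singleton->gru->enqueue_git_clones($clones, $clone_ids);
        });
}
```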
}

# create comments on original jobs
$result_source->schema->resultset('Comments')
  ->create_for_jobs(\@original_job_ids, $comment_text, $comment_user_id, $comments)
  if defined $comment_text;

# enqueue Minion jobs to clone required Git repositories
OpenQA::App->singleton->gru->enqueue_git_clones(\%clones, \@clone_ids);
This would create a Minion job for every job; only cluster jobs would be grouped together, right?
That would be far too many jobs in some cases.
The detection of identical `git_clone` tasks is not perfect, especially if they are created in quick succession.
I know this leads to more attempts to enqueue Minion jobs, but I was hoping that the de-duplication of Minion jobs you recently introduced would ensure we don't end up with too many after all.
Note that there will still only be one enqueuing attempt per job cluster. Only if one specified multiple job IDs explicitly (e.g. using #5971 once it gets merged) would there be more enqueuing attempts.
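This is not openQA's actual de-duplication code; the following is just a generic sketch of the idea, comparing the serialized arguments of still-inactive jobs of the same task before enqueueing another one. The helper name and the fingerprinting via JSON are assumptions for illustration.

```perl
use Mojo::JSON 'encode_json';
use Mojo::Util 'sha1_sum';

# Generic sketch of de-duplication (not openQA's implementation): skip
# enqueueing a Minion job if an identical one is still waiting to run.
sub enqueue_unless_pending {
    my ($minion, $task, $args) = @_;
    my $fingerprint = sha1_sum(encode_json($args));
    my $jobs = $minion->jobs({tasks => [$task], states => ['inactive']});
    while (my $job = $jobs->next) {
        # an identical job is already queued, reuse it instead
        return $job->{id} if sha1_sum(encode_json($job->{args})) eq $fingerprint;
    }
    return $minion->enqueue($task, $args);
}
```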
Otherwise we would really need a "preparing" state. I suppose it would work like this (a rough sketch follows below):
1. We create jobs with the initial state "preparing" instead of "scheduled", keeping track of the job IDs.
2. We enqueue Minion jobs.
3. We immediately set all jobs we haven't created Minion jobs for in step 2 to "scheduled".
4. Before deleting GRU tasks we set the related jobs to "scheduled". (This would still not fix the scheduling problem when there is a different set of e.g. download jobs within a parallel cluster, unless we take job dependencies into account here.)
Steps 2 and 3 would happen in one transaction; step 4 would also happen in one transaction.
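To make the proposal concrete, here is a rough sketch of steps 1 to 3; the "preparing" state and all helpers used here (`create_clones_in_state`, `enqueue_git_clones_for`, `set_job_state`) are hypothetical, openQA does not implement any of this today.

```perl
# Rough sketch of the proposed flow with a hypothetical "preparing" job state.
sub restart_with_preparing_state {
    my ($schema, $gru, $job_ids, $git_clones) = @_;

    # 1. create the clones in the "preparing" state, keeping track of their IDs
    my @clone_ids = create_clones_in_state($schema, $job_ids, 'preparing');

    # 2. + 3. in one transaction: enqueue the required Minion jobs, then move
    #    every clone that is not blocked by one of them to "scheduled" right away
    $schema->txn_do(
        sub {
            my %blocked = map { $_ => 1 } enqueue_git_clones_for($gru, $git_clones, \@clone_ids);
            set_job_state($schema, $_, 'scheduled') for grep { !$blocked{$_} } @clone_ids;
        });

    # 4. when a GRU task finishes, its related jobs are moved from "preparing"
    #    to "scheduled" in one transaction as well, right before the task is
    #    deleted (not shown here)
    return \@clone_ids;
}
```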
But I suppose I can fix the scheduler first while we think about what's best here. With the scheduler fixed we would still have this race condition, but at least job clusters wouldn't be torn apart.
For the sake of simplicity we could also reduce the number of Minion jobs on job restarts by avoiding Minion jobs for `git_auto_update`. Of course, `git_auto_update` would then rely only on the periodic updates for restarted jobs.
I don't understand step 4. Maybe you can explain tomorrow after the daily? OTOH we could try this PR and see how it works out in practice.
Theoretically we could find out whether any new Minion job is required by keeping track of the Git directories in the `%git_clones` hash while iterating over the jobs to restart (see the sketch below). I guess in most cases we will only have one CASEDIR/NEEDLES_DIR or DISTRI. It's just that we would need to pass an additional `%git_clones_not_yet_enqueued` hash down from the top-level `job_restart`, which doesn't sound nice.
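A minimal sketch of that bookkeeping, assuming the hash is keyed by checkout location; the helper name and the settings keys inspected are illustrative, not the actual restart code. The `$seen_locations` hash corresponds to the state that would have to be passed down from `job_restart`.

```perl
# Sketch only: remember which Git checkout locations already got a git_clone
# Minion job enqueued while iterating over the jobs to restart, so identical
# clones are enqueued at most once per restart request.
sub collect_new_git_clones {
    my ($job_settings, $seen_locations) = @_;
    my %new_clones;
    for my $key (qw(CASEDIR NEEDLES_DIR)) {
        my $location = $job_settings->{$key} // next;
        # skip locations already handled for a previously visited job
        next if $seen_locations->{$location}++;
        $new_clones{$location} = 1;
    }
    return \%new_clones;
}
```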
> For the sake of simplicity we could also reduce the number of Minion jobs on job restarts by avoiding Minion jobs for `git_auto_update`. Of course, `git_auto_update` would then rely only on the periodic updates for restarted jobs.

That was a requirement, though. We did have a complaint from someone who restarted a job and wondered why it wasn't using the updated code.
btw, I guess this should be:
- OpenQA::App->singleton->gru->enqueue_git_clones(\%clones, \@clone_ids);
+ OpenQA::App->singleton->gru->enqueue_git_clones(\%git_clones, \@clone_ids);
Probably the reason for the test failures