Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use needles from correct ref of NEEDLES_DIR #5175

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

sdclarke
Copy link
Contributor

@sdclarke sdclarke commented Jun 1, 2023

If the CASEDIR variable is a git repo URL with a fragment specifying the ref of the repository, we attempt to use it as the source of needles when viewing needle diffs in test results.

If the TEST_GIT_SHA variable is set, we use that instead as it should be the most accurate way to determine exactly which commit was used

Copy link
Contributor

@Martchus Martchus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think creating another worktree for every rev we might encounter scales very well. It would make more sense to use git show … to obtain a specific needle for a specific rev.

Note that we've been pondering about this problem ourselves and coming up with an approach that scales well enough for being usable on our production instances is hard (e.g. see https://progress.opensuse.org/issues/56789. https://progress.opensuse.org/issues/91905 and other related tickets).

lib/OpenQA/WebAPI/Controller/Step.pm Outdated Show resolved Hide resolved
lib/OpenQA/WebAPI/Controller/Step.pm Outdated Show resolved Hide resolved
lib/OpenQA/WebAPI/Controller/Step.pm Outdated Show resolved Hide resolved
lib/OpenQA/WebAPI/Controller/Step.pm Outdated Show resolved Hide resolved
lib/OpenQA/WebAPI/Controller/Step.pm Outdated Show resolved Hide resolved
lib/OpenQA/WebAPI/Controller/Step.pm Outdated Show resolved Hide resolved
@sdclarke
Copy link
Contributor Author

sdclarke commented Jun 1, 2023

I appreciate the feedback!

I don't think creating another worktree for every rev we might encounter scales very well. It would make more sense to use git show … to obtain a specific needle for a specific rev.

I agree that it's unlikely to scale well, using git show seems like a good idea, I'll take a look into that and see if I can come up with something functional.

@Martchus
Copy link
Contributor

Martchus commented Jun 1, 2023

By the way, I think the most important point of my review was that this should be put behind a configuration option and be disabled by default - at least until we are confident this works well in all kinds of instances (and currently we are likely pretty far from that).

@sdclarke
Copy link
Contributor Author

sdclarke commented Jun 1, 2023

By the way, I think the most important point of my review was that this should be put behind a configuration option and be disabled by default - at least until we are confident this works well in all kinds of instances (and currently we are likely pretty far from that).

Yes, definitely, this patch is very much a hack that was put together to get a quick and nasty solution to the problem for a specific use case, I realise there's an awful lot of change required for it to be anything like production-ready.

@sdclarke sdclarke force-pushed the needles_from_casedir branch 2 times, most recently from 562ad9e to 30a6b8e Compare June 2, 2023 14:41
@sdclarke
Copy link
Contributor Author

sdclarke commented Jun 2, 2023

I've changed this to make use of git show to get the contents of the needle files from the correct SHA of the repository.

It now also has a config option to prevent the feature being used by default.

The hacky clean up has been removed for now, so another method of cleaning up will be needed.

@stale
Copy link

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 17, 2023
@sdclarke sdclarke marked this pull request as ready for review September 20, 2023 09:13
@stale stale bot removed the stale label Sep 20, 2023
@sdclarke sdclarke force-pushed the needles_from_casedir branch 5 times, most recently from f2d4db8 to e6992d7 Compare February 1, 2024 17:05
Copy link
Contributor

@Martchus Martchus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error handling should be improved. I commented on the most important places.

Also, see my other inline-comments.

lib/OpenQA/Task/Needle/RemoveVersions.pm Outdated Show resolved Hide resolved
lib/OpenQA/Task/Needle/RemoveVersions.pm Outdated Show resolved Hide resolved
lib/OpenQA/Task/Needle/RemoveVersions.pm Outdated Show resolved Hide resolved
lib/OpenQA/Task/Needle/RemoveVersions.pm Outdated Show resolved Hide resolved
lib/OpenQA/Task/Needle/RemoveVersions.pm Outdated Show resolved Hide resolved
lib/OpenQA/WebAPI/Controller/Step.pm Outdated Show resolved Hide resolved
script/openqa-enqueue-needle-versions-cleanup Outdated Show resolved Hide resolved
t/14-grutasks.t Outdated Show resolved Hide resolved
lib/OpenQA/WebAPI/Controller/File.pm Outdated Show resolved Hide resolved
lib/OpenQA/Utils.pm Outdated Show resolved Hide resolved
@foursixnine
Copy link
Member

@Martchus sorry I thought you hadn't reviewed and clicked again :) (just noticed the thingie, @sdclarke mentioned this during our chat at FOSDEM)

Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@Martchus Martchus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error handling should be improved as mentioned in my last review.

The cleanup code also needs further improvement, see https://github.com/os-autoinst/openQA/pull/5175/files#r1480067530.

@Martchus
Copy link
Contributor

Martchus commented Mar 6, 2024

@sdclarke You're not actively working on this, right? I would take over as we'd actually also like to have this feature ourselves (see https://progress.opensuse.org/issues/154783). I would keep your commits as-is (except for rebasing) and add my own commit on top of them.

@neillydun
Copy link

Is this currently being worked on, or is there anything that could be done to help get it merged?

@okurz
Copy link
Member

okurz commented Jun 10, 2024

There are at least three points necessary:

  • Rebase
  • Fix rebase/merge conflicts
  • Add tests for missing coverage

@Martchus
Copy link
Contributor

Also see https://progress.opensuse.org/issues/154783#note-10 and my subsequent comments in that ticket for what's left.

@Martchus
Copy link
Contributor

Martchus commented Jun 13, 2024

The minimum effort we need to put into this PR to be mergable is:

  1. Resolve conflicts
  2. Ensure test coverage checks pass (I suppose UI tests are missing)
  3. Maybe some minor tweaks coming up in further reviews of the PR

If anyone wants to work on it (e.g. @sdclarke or someone from our team), drop a comment here for better coordination. We now also have https://progress.opensuse.org/issues/162035 for tracking this.

@okurz
Copy link
Member

okurz commented Aug 7, 2024

@sdclarke I saw that you pushed an update. Can you state what changes we should expect?

@sdclarke
Copy link
Contributor Author

sdclarke commented Aug 7, 2024

@sdclarke I saw that you pushed an update. Can you state what changes we should expect?

This push was just fixing the conflicts, I'm trying to look for the correct places to improve the test coverage but I'm having some trouble working out where the changes are needed. Any pointers in that regard would be appreciated.

Assuming I can deal with the test coverage issues, would that be enough to get this merged for now? From my perspective it would be good to get this over the line as it works well enough for our purposes, and it is disabled by default in the config. We can potentially revisit it later to address any outstanding points.

@okurz
Copy link
Member

okurz commented Aug 7, 2024

@sdclarke I saw that you pushed an update. Can you state what changes we should expect?

This push was just fixing the conflicts, I'm trying to look for the correct places to improve the test coverage but I'm having some trouble working out where the changes are needed. Any pointers in that regard would be appreciated.

As you are also adding new modules you could create new test modules like
t/15-shared-plugin-gru.t
which are basically unit tests with mocking all external dependencies.

Assuming I can deal with the test coverage issues, would that be enough to get this merged for now?

Yes. I already asked the team to support for which I created ticket https://progress.opensuse.org/issues/162035 but unfortunately no one picked it up yet.

@Martchus
Copy link
Contributor

Martchus commented Aug 7, 2024

Assuming I can deal with the test coverage issues, would that be enough to get this merged for now? From my perspective it would be good to get this over the line as it works well enough for our purposes, and it is disabled by default in the config. We can potentially revisit it later to address any outstanding points.

@sdclarke As stated in #5175 (comment) the answer is no - nothing else needs to be done except further problems are found in PR review. However, I'd strongly recommend looking into point 2 of this note because in my opinion it makes not much sense to enable it on a production instance without it.

Copy link
Contributor

mergify bot commented Aug 14, 2024

This pull request is now in conflicts. Could you fix it? 🙏

};
chomp $needles_ref;
return undef unless $needles_ref;
return undef unless run_cmd_with_log ['git', '-C', $needle_dirs, 'fetch', '--depth', 1, 'origin', $needles_ref];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdclarke We have not forgotten about this.

We are currently working on some git features and fixes, and I just wanted to note that we have a git_auto_clone feature now that will fetch any requested ref on the webui (branch or sha) in a minion job before the job starts running. (Note that it only works if it is the same origin and not some fork.)

That would make this fetch here not necessary anymore and also avoid the need for a lockfile.
The sha should always be in NEEDLES_GIT_HASH so the rev-parse also seems unnecessary.

Would you consider to use the git_auto_clone feature in order to not have to implement that part here?

I'm not sure if there are any cases which git_auto_clone wouldn't cover. @Martchus said there might be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that git_auto_clone only helps if a Git-URL is specified as NEEDLES_DIR (or CASEDIR which is relevant if needles are part of the tests).

Even then it is not guaranteed to fetch the refs the test actually runs on. The test might specify a branch name (or no branch which means "default branch") and when the test is finally executed on the worker host the ref that branch points to might differ from the ref checked out by the web UI at the time the test was scheduled.

Additionally, we probably want to have some cleanup for the fetching done by the git_auto_clone feature. So the ref might have also already been cleaned up.

So I think we need that fetch in any case.


That would make this fetch here not necessary anymore and also avoid the need for a lockfile.

I don't think we need a lockfile here. We can probably arrange it so that it would not matter if any other commands are executed in the middle. (The use of FETCH_HEAD definitely needs to go way but it can probably be replaced with whatever we specify on the fetch command.)

The sha should always be in NEEDLES_GIT_HASH so the rev-parse also seems unnecessary.

That's true. So we could (and probably should) get rid of it. If one specifies a branch name via NEEDLES_DIR the NEEDLES_GIT_HASH still has the concrete sha.

Copy link
Contributor

@perlpunk perlpunk Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that git_auto_clone only helps if a Git-URL is specified as NEEDLES_DIR (or CASEDIR which is relevant if needles are part of the tests).

  • The PR description says that it is about CASEDIR/NEEDLES_DIR being set.
  • And if it is not set, there is no certain ref to fetch, just the default remote branch
  • In the near future, git_auto_clone will also automatically update clones if CASEDIR/NEEDLES_DIR is empty. It will just update the checkout branch in the checked out directory.

And this is all we need, no?
What else needs to be fetched that git_auto_clone does not fetch?

The only thing could be forks, but as far as I can see this PR also doesn't handle forks.

Even then it is not guaranteed to fetch the refs the test actually runs on. The test might specify a branch name (or no branch which means "default branch") and when the test is finally executed on the worker host the ref that branch points to might differ from the ref checked out by the web UI at the time the test was scheduled.

What's checked out on the webui does not matter. The important thing is that it has been fetched at some point, so we can do a git show on it, right?

Additionally, we probably want to have some cleanup for the fetching done by the git_auto_clone feature. So the ref might have also already been cleaned up.

Yes. I'm not sure what git already offers there and how to know what we can cleanup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's checked out on the webui does not matter. The important thing is that it has been fetched at some point, so we can do a git show on it, right?

I meant "fetched by the web UI". And I meant the following sequence of events:

  1. A test is scheduled pointing to a branch, e.g. just the default branch.
  2. The web UI fetches the latest ref of that branch because git_auto_clone = 1 is configured.
  3. Someone pushes additional commits to that branch or even force-pushes there.
  4. The test is finally assigned to a worker and that worker now fetches a different ref than the web UI in step 2 so NEEDLES_GIT_HASH points to a ref that is not yet checked out (despite git_auto_clone = 1).
  5. Someone wants to display the candidate needles for that job. We need to fetch the missing ref.

And this is not very contrived. That someone specifies a branch where many commits happen (e.g. the default branch of some repo with many contributors) is probably a common use case. That it can take between the scheduling of a job and the time it actually gets executed is also something we need to expect.

@perlpunk
Copy link
Contributor

perlpunk commented Aug 20, 2024

Just to clarify a few things, I'm currently trying to understand the feature.

If the CASEDIR variable is a git repo URL with a fragment specifying the ref of the repository, we attempt to use it as the source of needles when viewing needle diffs in test results.

If the TEST_GIT_SHA variable is set, we use that instead as it should be the most accurate way to determine exactly which commit was used

The title is talking about NEEDLES_DIR, the code mentions NEEDLES_GIT_SHA, but the PR description talks about CASEDIR and TEST_GIT_SHA.
Of course there can be needles in a CASEDIR if the repo has a needles dir itself. I just found it a bit confusing.

I was also wondering: Even if you don't specify CASEDIR/NEEDLES_DIR, the feature is about showing the correct ref from the time the test was running, even if there was a newer version in git meanwhile, right?

Because we have this ticket, but I can't find a clear description of the story, so no acceptance criteria, so I have to figure it out from the code.

@kalikiana
Copy link
Member

Because we have this ticket, but I can't find a clear description of the story, so no acceptance criteria, so I have to figure it out from the code.

@sdclarke Hey! 👋 Did you have a chance to look at the ticket description and the acceptance criteria? To be sure we're on the same page now on what's being implemented here.

@kalikiana
Copy link
Member

@sdclarke Hey! 👋 Did you have a chance to look at the ticket description and the acceptance criteria? To be sure we're on the same page now on what's being implemented here.

@sdclarke ping? In case you missed the previous one 🙃

@sdclarke
Copy link
Contributor Author

Hi all, sorry for the lack of updates, I haven't had time to follow up on this lately.

Because we have this ticket, but I can't find a clear description of the story, so no acceptance criteria, so I have to figure it out from the code.

@sdclarke Hey! 👋 Did you have a chance to look at the ticket description and the acceptance criteria? To be sure we're on the same page now on what's being implemented here.

Looking at the ticket, the acceptance criteria look to match what I was trying to achieve.

The title is talking about NEEDLES_DIR, the code mentions NEEDLES_GIT_SHA, but the PR description talks about CASEDIR and TEST_GIT_SHA. Of course there can be needles in a CASEDIR if the repo has a needles dir itself. I just found it a bit confusing.

I think the title became out of sync with the description after an initial round of reviews, when I first wrote it the workflow I was using had the needles directory within the CASEDIR so I didn't initially consider them separately. I subsequently used the NEEDLES_DIR in this PR but must have missed the description.

I was also wondering: Even if you don't specify CASEDIR/NEEDLES_DIR, the feature is about showing the correct ref from the time the test was running, even if there was a newer version in git meanwhile, right?

The original problem we were having that inspired this was that the web UI might not have any version of a particular needle, as we had a workflow which involved testing new needles in branches of the tests/needles repository before merging them to the main branch.

But ultimately, yes your description of the problem this PR is attempting to solve is correct, I based it on CASEDIR/NEEDLES_DIR as that was how we used it, and it provided the path of least resistance to doing what we needed.

As mentioned I don't have time to work on this for now, once again sorry for my delayed response.

@kalikiana kalikiana force-pushed the needles_from_casedir branch 3 times, most recently from b31f643 to 124af5c Compare October 22, 2024 16:36
@@ -117,6 +117,10 @@
#git_auto_clone = yes
## Experimental - Ensure a git update of all test code and needles
#git_auto_update = no
# whether openQA should attempt to display needles of the correct version in the web UI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling that the comment is a bit cryptic from users' side. What does it mean attempt? Also correct version I think is not very explanatory. I am talking from the POV of someone who do not know much about the openQA or a new user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we'll be able to explain this feature within the config file. I think this comment is good enough for people trying to enable this feature after reading documentation about it elsewhere. So we could document this feature as part of the normal openQA documentation. However, I'd be fine keeping this as an almost undocumented "expert feature" for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe we can still change the word "correct". Who wouldn't want "correct versions"? :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

Suggested change
# whether openQA should attempt to display needles of the correct version in the web UI
# whether openQA should display needles from different branches in the web UI

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not about the branch that was used (branches are moving targets) but really about the exact version that was used (not a moving target). This lookup will probably also be best effort. So I find the version of this comment as it is right now much better than the suggested change.

Who wouldn't want "correct versions"? :)

Someone who doesn't want to pay the price for it. We could state the disadvantages here, especially that the feature is still experimental.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So rather say "exact version" and state the drawbacks

t/14-grutasks.t Outdated Show resolved Hide resolved
Copy link
Contributor

mergify bot commented Oct 31, 2024

This pull request is now in conflicts. Could you fix it? 🙏

sdclarke and others added 6 commits October 31, 2024 15:50
If the NEEDLES_DIR (or CASEDIR as a backup) variable is a git repo URL
with a fragment specifying the ref of the repository, we attempt to use
it as the source of needles when viewing needle diffs in test results.

If the NEEDLES_GIT_SHA variable is set, we use that instead as it should
be the most accurate way to determine exactly which commit was used
There is now a minion task which will clean up alternate needle files
created by the WebUI. The minimum retention time of the needle files can
be set in the config, and defaults to 30 minutes
* Improve error handling
* Simplify code
* Split code into smaller functions
* Move needle-related code into its own module so utilities don't become
  too big
* Avoid using a shell to run Git commands; this breaks when needle paths
  contain characters that needed escaping
* Avoid invoking Git commands if temporary needle files are still present
* Use the usual directory the web UI stores cache files in (instead of
  hard-coding `/tmp/…`)
* Output temporary files directly instead of using intermediate buffer
* Keep using the real needle path as before for populating last seen/match
  because using the relative path breaks compatibility with existing
  databases and this change is supposedly not required anyway
* See https://progress.opensuse.org/issues/154783
* Use the settings key `temp_needle_refs_retention` which makes it clear
  that this setting is about temporary needle refs
* Document settings key in default `openqa.ini`
* Adapt the cleanup to be in accordance with the changed path for temporary
  needle refs
* Rename cleanup task to `limit_temp_needle_refs` to be more in-line with
  existing cleanup tasks
* Add systemd units to enqueue the cleanup task automatically
* Change the default retention to two hours (from 30 minutes)
* Use the modification time instead of the access time because the access
  time might be updated very to easily unintendedly
* Use signal guard to retry cleanup in case the gru service is restarted
* See https://progress.opensuse.org/issues/154783
Copy link
Contributor

mergify bot commented Nov 5, 2024

This pull request is now in conflicts. Could you fix it? 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants