Enrich packages from OpenSSF Security Scorecard Data #134
-
See https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/load_sbom.py#L27
Yes, that is the plan, but this needs a bit more research. We also need some structure to store this data, though initially it can live in the extra_data attributes.
This data won't be fetched by default when we create packages in SCIO. So if we are implementing this in SCIO, other pipelines would detect or create packages, and then an additional pipeline would be run to get this data from the APIs and store it. We would then also modify the SBOM output creation to use this data appropriately. That's what I was thinking if we implement this on the SCIO side (a sketch follows below).

On the purldb side, we are fetching metadata for and scanning source/binary packages, given a purl (see for example the support for Debian which I added: aboutcode-org/purldb#300; we also had maven and npm support previously). The scans are essentially SCIO pipelines being run, with the data imported back. So to enrich the packages with scorecard data we have two options: implement this on the SCIO side as discussed above, or fetch this data when we are getting metadata for a purl, based on the source repo information present in the metadata. In both cases, we'd be storing the data in purldb too.
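To make that concrete, here is a minimal sketch of what such an add-on pipeline could look like on the SCIO side. This is illustrative only, not an existing pipeline: it assumes the public REST API at api.securityscorecards.dev, models its layout on existing ScanCode.io pipelines, and stashes results in `extra_data` pending a dedicated model.

```python
# Illustrative sketch only; class and step names are made up, and the
# exact model/queryset attributes should be verified against scanpipe.
from urllib.parse import urlparse

import requests

from scanpipe.pipelines import Pipeline


class FetchScorecardData(Pipeline):
    """Fetch OpenSSF Scorecard data for packages already cataloged in the project."""

    is_addon = True

    @classmethod
    def steps(cls):
        return (cls.fetch_scorecard_data,)

    def fetch_scorecard_data(self):
        for package in self.project.discoveredpackages.all():
            if not package.vcs_url:
                continue  # Scorecard results are keyed by the source repo URL.
            parsed = urlparse(package.vcs_url)
            repo_path = parsed.path.strip("/").removesuffix(".git")
            response = requests.get(
                f"https://api.securityscorecards.dev/projects/{parsed.netloc}/{repo_path}"
            )
            # Repos outside the scorecard scan set return a 404.
            if response.status_code == 200:
                # Stored in extra_data initially, pending a dedicated model.
                package.extra_data["scorecard"] = response.json()
                package.save(update_fields=["extra_data"])
```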
-
@AyanSinhaMahapatra @pombredanne I was exploring the API and found that we get scorecard results only for those repos that are part of the scorecard workflow, though for some weird reason I could not find the entry for one repo even though a scorecard report exists for it.

I also realized that if we are creating a pipeline out of this, it depends on the other pipelines (e.g. scan_codebase, scan_single_package, etc.) which are supposed to create the package entries in the project. Related to that: why can't we keep it as a step that is integrated into the existing pipelines and runs after the packages are populated? Can we add fetching scorecard details as such a step if we are going to integrate it?
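For illustration, the "extra step" alternative could look roughly like this. It is a sketch only: it assumes subclassing the existing ScanCodebase pipeline class, and the added step name is made up.

```python
# Sketch of extending an existing pipeline's steps so scorecard fetching
# runs after packages are populated; the subclass name is illustrative.
from scanpipe.pipelines.scan_codebase import ScanCodebase


class ScanCodebaseWithScorecard(ScanCodebase):
    """Run scan_codebase, then fetch scorecard data for the detected packages."""

    @classmethod
    def steps(cls):
        # Append the extra step after all the original scan_codebase steps.
        return super().steps() + (cls.fetch_scorecard_details,)

    def fetch_scorecard_details(self):
        ...  # same fetching logic as in the add-on pipeline sketch above
```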
Regarding enriching the SBOM, I saw that it is currently being populated on demand. On the purlDB side, I think it will be a lot easier, as you have done for the Debian packages; following along the same lines, we can integrate the scorecard API details for every indexed package.

Note: I just noticed that the crucial thing for OpenSSF Scorecard to work effectively is that we need the VCS URL given as input, but in most cases the vcs_url is empty (in both SCIO and purlDB). Let me know what you think of this approach.
-
This is not true, see https://github.com/ossf/scorecard?tab=readme-ov-file#public-data:
We are critical OSS ;)

I am also not sure whether the data from the API and the data in the BigQuery dataset are the same, or whether the BigQuery one has more repo info in it. If the BigQuery dataset has more info, I would be inclined to see if we can get data out of it, even if this is a one-time operation on the purldb side (this is also why I mentioned a purldb-side implementation).

We could also create or look for a small library that does this in Python (import and store the data in models), and use it as a library in both SCIO and purldb, but this is fine if we are only doing this in SCIO for now. It depends on whether the purldb-side implementation is useful or not.
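If we do go the BigQuery route for a one-time bulk pull, it could look roughly like the sketch below. The dataset/table name follows the public-data section of the scorecard README and the column names follow the v2 schema as documented there; both should be verified, and GCP credentials plus the google-cloud-bigquery package are required.

```python
# Sketch of a one-time bulk pull from the public scorecard BigQuery dataset.
# Dataset and column names are taken from the scorecard public-data docs
# and should be double-checked before relying on them.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT repo.name AS repo_name, score, date
    FROM `openssf.scorecardcron.scorecard-v2_latest`
    LIMIT 100
"""
for row in client.query(query).result():
    print(row["repo_name"], row["score"], row["date"])
```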
Yes! This is very important, as we need the vcs URL, which is basically the GitHub/GitLab repo link, and scorecard data is keyed by these links. There are efforts in purldb with purl2git (aboutcode-org/purldb#258) trying to improve this, but we would need to ramp this up in SCTK and add support for detecting these in package manifests better, wherever applicable and whenever the data is present in manifests. We also have some code which can be used here: https://github.com/package-url/packageurl-python/blob/main/src/packageurl/contrib/purl2url.py (an example follows below).

A good test for the project would be making sure we scan ~10-15 packages from different ecosystems, detect their vcs_url in the metadata correctly in SCTK (and thus in SCIO), and then verify that the scorecard pipeline correctly fetches the scorecard data back into SCIO. This would be a separate add-on pipeline like https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/find_vulnerabilities.py, which is only run when requested specifically. So check all the implementation details there on how we get data about packages from a different data source, and then store and display it.

All of this is food for thought, of course; we are still very much open to discussion on the exact implementation details, and this would likely be updated a couple of times with community feedback as we go. Looking forward to seeing what you research and propose on this.
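As a quick example of the purl2url code mentioned above, deriving a repository URL from a purl looks like this (the exact URL returned depends on the ecosystem, and only some ecosystems map to a VCS repo):

```python
# Deriving a repository URL from a purl with packageurl-python's purl2url
# module; a scorecard lookup could then be keyed off this URL when the
# package manifest itself carries no vcs_url.
from packageurl.contrib import purl2url

repo_url = purl2url.get_repo_url("pkg:github/package-url/packageurl-python")
print(repo_url)  # e.g. https://github.com/package-url/packageurl-python
```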
-
From @404-geek:
https://github.com/nexB/aboutcode/wiki/GSOC-2024-Project-Ideas/#purldbscancodeio-enrich-an-sbom-based-on-ossf-security-score-card
I was going through the above project and related issues.
I had a few doubts: right now I see export options for SBOMs in SPDX and CycloneDX, but how is an SBOM imported into SCIO?
https://github.com/ossf/scorecard#public-data
https://api.securityscorecards.dev/
Are we going to use the data from the above two APIs to enhance the SBOM JSON data?
If we are integrating this into the SBOM data, then why is there a need to run it as a pipeline, since it would fetch the data by default anyway?
I see that if there is an integration to be done, roughly 90% of it is on the SCIO side.
Please correct me if I have misunderstood anything.