Create Binary Transparency for Artifact Registries guide #48
Fixes ossf#47 Signed-off-by: Hayden B <[email protected]>
BT makes package identifiers immutable and transparent:

* A package registry will publish a commitment that a specific package identifier maps to a cryptographic digest. This commitment creates an immutable mapping between the identifier and digest. The mapping can only be done per package identifier and can never be updated or deleted.
Most of the ecosystems listed above have some form of package deletion. Many provide a way for a user to request one of the packages be deleted, in a self-service manner, under various safeguards. Even the ones that don't will remove an artifact when compelled by a court. Generally, if they allow deletion they also allow there to be a new owner who is allowed to publish new packages at the same version number. Unfortunately, these corner cases cannot be ignored. How does the system accommodate this reality?
Great question! In one of the footnotes, I touch on whether the ecosystem supports yanking a package. Supporting package deletion isn't at odds with binary transparency, as long as the ecosystem doesn't allow package identifier reuse, e.g. if [email protected] is deleted from the registry, then the next package version must be v1.2.4 or higher (or more accurately, a version that's never been used before; BT is not opinionated on ever-increasing versions, that's an ecosystem decision).
Lockfile support in an ecosystem also should require this to be true, otherwise yanking and recreating a package at a given version would break existing lockfiles.
A question for the registries, do you allow version reuse? At a glance:
- npm does not allow version reuse
- PyPI does not allow version reuse if a file was deleted
- Maven Central doesn't support deletion
- RubyGems supports yanking, but its docs don't specify whether version reuse is allowed (Edit: A Stack Overflow thread notes that an error is returned when trying to reupload at a given version)
- NuGet supports unlisting, but not deleting iiuc
- Go catches version reuse through its proxy since it runs a binary transparency log. For example, if a tag is deleted and recreated pointing to a different commit hash, the Go proxy will throw an error.
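To make the "deletion is fine, identifier reuse is not" rule concrete, here is a minimal sketch of an append-only identifier-to-digest index. Everything in it is an assumption for illustration: `AppendOnlyIndex`, `VersionReuseError`, and the `foo@v1.2.3` style identifiers are hypothetical names, and the in-memory dictionary stands in for whatever transparency log a real registry would run. It is not how any of the registries listed above implement this.

```python
import hashlib


class VersionReuseError(Exception):
    """Raised when an identifier is republished with different content."""


class AppendOnlyIndex:
    """Toy in-memory stand-in for a registry's identifier -> digest commitments."""

    def __init__(self):
        self._digests = {}       # identifier -> sha256 hex digest, never removed
        self._available = set()  # identifiers whose artifact can still be served

    def publish(self, identifier, artifact):
        digest = hashlib.sha256(artifact).hexdigest()
        recorded = self._digests.get(identifier)
        # A commitment already exists and points at different bytes: refuse.
        if recorded is not None and recorded != digest:
            raise VersionReuseError(
                f"{identifier} is already committed to {recorded}"
            )
        self._digests[identifier] = digest
        self._available.add(identifier)
        return digest

    def delete(self, identifier):
        # Deletion withdraws the artifact only; the commitment stays on record.
        self._available.discard(identifier)

    def is_available(self, identifier):
        return identifier in self._available

    def digest_for(self, identifier):
        return self._digests.get(identifier)


index = AppendOnlyIndex()
index.publish("foo@v1.2.3", b"original contents")
index.delete("foo@v1.2.3")

try:
    index.publish("foo@v1.2.3", b"different contents")
except VersionReuseError as err:
    print("rejected:", err)

# A never-used identifier is accepted.
index.publish("foo@v1.2.4", b"different contents")
```

One design choice in this sketch: republishing byte-identical content under a deleted identifier is accepted, since the recorded mapping does not change. A registry could just as reasonably reject that case too.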
To tack onto what @haydentherapper said: PyPI's behavior is subtle in that filenames are always unique and immutable on PyPI, but releases themselves are not. In other words: a project foo that gets deleted or turned over to a new user can't overwrite foo-1.2.3.tar.gz if that distribution file was already uploaded by a previous maintainer, but the new maintainer might be able to upload foo-1.2.3-py3-none-any.whl or similar if no wheel was previously uploaded for that version of foo.

In practice this means that resolving foo==1.2.3 isn't guaranteed to be stable for a given host, since the new maintainer can always upload a new (unique) file to the old version that's more specific/matches the host's target configuration more precisely.
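The effect described above can be sketched with a toy model. This is a deliberately simplified illustration, not PyPI's or pip's actual logic: `upload`, `pick_file`, and `FilenameTakenError` are hypothetical names, and the resolver only knows one rule (prefer the pure-Python wheel over the sdist when both exist).

```python
class FilenameTakenError(Exception):
    """Raised when an upload would overwrite an existing filename."""


def upload(files, filename, uploader):
    # Filenames are unique and immutable: an existing one can never be replaced.
    if filename in files:
        raise FilenameTakenError(filename)
    files[filename] = uploader


def pick_file(files, name, version):
    # Simplified resolver: a matching wheel wins over the source distribution.
    wheel = f"{name}-{version}-py3-none-any.whl"
    sdist = f"{name}-{version}.tar.gz"
    if wheel in files:
        return wheel
    if sdist in files:
        return sdist
    return None


files = {}
upload(files, "foo-1.2.3.tar.gz", "previous-maintainer")
before = pick_file(files, "foo", "1.2.3")

# The new maintainer cannot overwrite the sdist...
try:
    upload(files, "foo-1.2.3.tar.gz", "new-maintainer")
except FilenameTakenError:
    pass

# ...but can add a wheel to the same release, which changes what gets resolved.
upload(files, "foo-1.2.3-py3-none-any.whl", "new-maintainer")
after = pick_file(files, "foo", "1.2.3")

print(before, "->", after)
```

The same requirement, foo==1.2.3, selects a different file before and after the second upload even though no existing file was modified.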
woodruffw 16 hours ago said:
That sounds extremely problematic to me. I expect that repositories must sometimes remove packages of a given version (e.g., by court order), but I think most users would expect that a given version number would be stable. This loophole is a great way to hide attacks. Is this functionality critical to PyPI somehow? Could PyPI be changed to prevent it (at least, say, after a day or so of the "initial" version being uploaded)?
This property is used because the ABI of CPython, architectures, and platforms (known in Python-land as "tags") aren't known in advance and new ones are added over time. With this property Python packages can build and release new artifacts that are compiled for new "tags" without issuing a whole new release (since many times the source code doesn't change, only the tool for building). I'm not sure how often this happens in practice and whether or not it's worth the additional risk because I don't maintain many packages that have these requirements.

This property also exists because artifacts are typically built in different processes, so they arrive at different times. There's currently no mechanism for "drafting" a release, so it'd be a race to get all your artifacts built before a timer expired. I think adding support for "draft" releases would make removing this property of an index viable, but even then I am not sure of the impact for maintainers; it needs more study to be sure.
There's a long-ish thread on the current behavior here: https://discuss.python.org/t/restricting-open-ended-releases-on-pypi/43566

The TL;DR of it is that PyPI having "open-ended" releases is currently relied upon for some packaging workflows, e.g. there are maintainers who update their releases to contain wheels for new versions of Python rather than publishing an entirely new version with no functional changes. There's also some debate about how serious the vector is, given that (1) the attacker can only upload new files, not overwrite existing release files, and (2) could always just make a new release instead, given that Python as an ecosystem tends to avoid exact version-pinning.

But apart from that, +1 to everything @sethmlarson said, especially drafting -- there is a PEP that enables support for drafting on PyPI and other indices, and I've (very) recently been given the resources to begin work on actually implementing it 🙂
It happens quite often!
The risk is entirely mitigated by using lockfiles or hash-pinned requirements files.
Draft releases definitely help with the "artifacts come from different places at different times" issue, and are desirable for other reasons, but they don't resolve the "build new artifacts for old releases against new Python versions, ABIs or platforms" case, so they are not a panacea here.
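As a rough illustration of why hash pinning closes this gap, the sketch below checks downloaded bytes against a set of digests recorded at lock time. It is a toy model under stated assumptions: `is_allowed` and the dictionary-based lock format are hypothetical and are not pip's or any lockfile's real format.

```python
import hashlib


def is_allowed(requirement, content, lock):
    """Return True only if content hashes to a digest pinned for requirement.

    lock maps a requirement string to the set of sha256 hex digests that
    were recorded when the lock was created (a made-up format).
    """
    allowed = lock.get(requirement, set())
    return hashlib.sha256(content).hexdigest() in allowed


# Lock created while only the original sdist existed.
sdist_bytes = b"contents of foo-1.2.3.tar.gz"
lock = {"foo==1.2.3": {hashlib.sha256(sdist_bytes).hexdigest()}}

# The pinned file still verifies.
print(is_allowed("foo==1.2.3", sdist_bytes, lock))

# A wheel uploaded to the same release later has an unknown digest.
later_wheel_bytes = b"contents of a wheel added afterwards"
print(is_allowed("foo==1.2.3", later_wheel_bytes, lock))
```

Because the check is on content digests rather than on the version string, a file added to an old release after the lock was made is rejected regardless of who uploaded it.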
Generally this looks good to me!
Let's start out with some easier items:
- Time and time again this working group struggles with terminology. In keeping with the working group name, we usually use "repository" and "software repository" where this document uses "registry" and "artifact / package registry". Let's standardize on one term in the document, and it might be a good idea to have a terminology / definition section near the front.
- In this pull request, can we add a link to this doc on both https://github.com/ossf/wg-securing-software-repos/blob/main/README.md and https://github.com/ossf/wg-securing-software-repos/blob/main/docs/index.md?
- Now for the tricky one! Generally speaking, our other guides cover an existing successful implementation and high-level guidance on how other repositories can implement it. Of course, Go does have binary transparency (as this doc mentions), but the Design Ideas section (where I think the implementation guidance would live) is TBD in this version. Do you want to land a first version of this doc and flesh that out further later? How do you want to proceed here?
It happens often, I'm afraid.
Obviously people differ in what they mean by terms. For example, I should note that I don't normally use these terms as synonyms. Here's how I normally use the terms: A (package) repository stores information. PyPI and CPAN, for example, actually store the packages that can be installed by a package manager. A source repository stores the source code that you might use (e.g., to build). A registry is an "official" record of "where to get the information" - but often doesn't store the data itself. Registries redirect users to 1+ repositories. Depending on the registry, different components in the registry might be served by different repositories. I know quicklisp works this way, and I think others do too. I don't claim everyone uses the terms the same way. That's part of the challenge here :-).
100% agree. Trying to get everyone to change terminology throughout the world to the same thing is, er, hard. Documenting definitions of key terms, as they are used in the document, definitely sounds like a way forward.
I think we're pretty close to this with PEP 740 for PyPI, but I agree that we might want to wait to publish until that has been fully baked and any unforeseen issues are sorted out.
Fixes #47