Create Binary Transparency for Artifact Registries guide #48

Open
wants to merge 1 commit into
base: main
from

haydentherapper
Fixes #47


BT makes package identifiers immutable and transparent:

* A package registry will publish a commitment that a specific package identifier maps to a cryptographic digest. This commitment creates an immutable mapping between the identifier and digest. The mapping can only be done per package identifier and can never be updated or deleted.
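For illustration, the commitment described in this bullet might be sketched as follows. This is only a minimal in-memory sketch: the `CommitmentLog` class and its method names are hypothetical, SHA-256 is an assumed digest choice, and a real transparency log would also need verifiable append-only storage (for example a Merkle tree), which is omitted here.

```python
import hashlib


class CommitmentLog:
    """Hypothetical in-memory log mapping package identifiers to digests."""

    def __init__(self):
        self._digests = {}  # identifier -> hex digest, never updated or removed

    def commit(self, identifier: str, artifact: bytes) -> str:
        """Record identifier -> SHA-256 digest and refuse any remapping."""
        digest = hashlib.sha256(artifact).hexdigest()
        existing = self._digests.get(identifier)
        if existing is not None and existing != digest:
            raise ValueError(f"{identifier} is already committed to {existing}")
        self._digests[identifier] = digest
        return digest

    def lookup(self, identifier: str) -> str:
        """Return the digest committed for an identifier."""
        return self._digests[identifier]
```

Committing the same bytes twice is harmless, while committing different bytes under an existing identifier raises an error, which is the immutability property the bullet asks for.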

Most of the ecosystems listed above have some form of package deletion. Many provide a way for a user to request one of the packages be deleted, in a self-service manner, under various safeguards. Even the ones that don't will remove an artifact when compelled by a court. Generally, if they allow deletion they also allow there to be a new owner who is allowed to publish new packages at the same version number. Unfortunately, these corner cases cannot be ignored. How does the system accommodate this reality?

Author

@haydentherapper haydentherapper Sep 4, 2024

Great question! In one of the footnotes, I touch on whether the ecosystem supports yanking a package. Supporting package deletion wouldn't be at odds with binary transparency, as long as the ecosystem doesn't allow package identifier reuse. For example, if v1.2.3 of a package is deleted from the registry, then the next package version must be v1.2.4 or higher (or, more accurately, a version that's never been used before; BT is not opinionated on ever-increasing versions, as that's an ecosystem decision).

Lockfile support in an ecosystem also should require this to be true, otherwise yanking and recreating a package at a given version would break existing lockfiles.
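To make the "deletion without identifier reuse" rule concrete, here is a minimal sketch. The `Registry` class and its methods are hypothetical and do not reflect any real registry's API; the only point is that a deleted identifier stays reserved.

```python
class Registry:
    """Hypothetical registry that allows deletion but never identifier reuse."""

    def __init__(self):
        self._available = {}     # identifier -> artifact bytes currently served
        self._ever_used = set()  # every identifier that was ever published

    def publish(self, identifier: str, artifact: bytes) -> None:
        if identifier in self._ever_used:
            raise ValueError(f"{identifier} was already used and cannot be reused")
        self._ever_used.add(identifier)
        self._available[identifier] = artifact

    def delete(self, identifier: str) -> None:
        # The artifact is no longer served, but the identifier stays reserved.
        self._available.pop(identifier, None)

    def is_available(self, identifier: str) -> bool:
        return identifier in self._available
```

Under this sketch, a lockfile that pinned a deleted version fails loudly (the artifact is gone) instead of silently resolving to different content.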

A question for the registries, do you allow version reuse? At a glance:

Contributor

To tack onto what @haydentherapper said: PyPI's behavior is subtle in that filenames are always unique and immutable on PyPI, but releases themselves are not. In other words: a project foo that gets deleted or turned over to a new user can't overwrite foo-1.2.3.tar.gz if that distribution file was already uploaded by a previous maintainer, but the new maintainer might be able to upload foo-1.2.3-py3-none-any.whl or similar if no wheel was previously uploaded for that version of foo.

In practice this means that resolving foo==1.2.3 isn't guaranteed to be stable for a given host, since the new maintainer can always upload a new (unique) file to the old version that's more specific/matches the host's target configuration more precisely.
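The effect can be illustrated with a deliberately simplified stand-in for an installer's file selection. The `pick_file` function below is hypothetical and far cruder than real tag matching; it only assumes that a wheel matching the host's tag is preferred over the source distribution.

```python
def pick_file(filenames, host_tag):
    """Pick a wheel whose tag matches the host, else fall back to the sdist.

    Simplified illustration only; real installers rank many compatible tags.
    """
    for name in sorted(filenames):
        if name.endswith(".whl") and name[: -len(".whl")].endswith(host_tag):
            return name
    for name in sorted(filenames):
        if name.endswith(".tar.gz"):
            return name
    return None


# The release initially contains only a source distribution.
release = {"foo-1.2.3.tar.gz"}
before = pick_file(release, "py3-none-any")

# Later, a new (unique) file is uploaded to the same version.
release.add("foo-1.2.3-py3-none-any.whl")
after = pick_file(release, "py3-none-any")
```

Here `before` is the sdist and `after` is the newly uploaded wheel, even though the requested version never changed and no existing file was overwritten.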

@david-a-wheeler

woodruffw 16 hours ago said:

To tack onto what @haydentherapper said: PyPI's behavior is subtle in that filenames are always unique and immutable on PyPI, but releases themselves are not. In other words: a project foo that gets deleted or turned over to a new user can't overwrite foo-1.2.3.tar.gz if that distribution file was already uploaded by a previous maintainer, but the new maintainer might be able to upload foo-1.2.3-py3-none-any.whl or similar if no wheel was previously uploaded for that version of foo.

In practice this means that resolving foo==1.2.3 isn't guaranteed to be stable for a given host, since the new maintainer can always upload a new (unique) file to the old version that's more specific/matches the host's target configuration more precisely.

That sounds extremely problematic to me. I expect that repositories must sometimes remove packages of a given version (e.g., by court order), but I think most users would expect that a given version# would be stable. This loophole is a great way to hide attacks. Is this functionality critical to PyPI somehow? Could PyPI be changed to prevent it (at least, say, after a day or so of the "initial" version being uploaded)?

@sethmlarson
Contributor

sethmlarson commented Sep 5, 2024

@david-a-wheeler

Is this functionality critical to PyPI somehow?

This property is used because the ABI of CPython, architectures, and platforms (known in Python-land as "tags") aren't known in advance, and new ones are added over time. With this property, Python packages can build and release new artifacts that are compiled for new "tags" without issuing a whole new release (since many times the source code doesn't change, only the tool for building).

I'm not sure how often this happens in practice and whether or not it's worth the additional risk because I don't maintain many packages that have these requirements.

This property also exists because artifacts are typically built in different processes so arrive at different times. There's currently no mechanism for "drafting" a release, so it'd be a race to get all your artifacts built before a timer expired.

I think adding support for "draft" releases would make removing this property of an index viable, but even then I am not sure of the impact for maintainers, needs more studying to be sure.

@woodruffw
Contributor

This loophole is a great way to hide attacks. Is this functionality critical to PyPI somehow? Could PyPI be changed to prevent it (at least, say, after a day or so of the "initial" version being uploaded)?

There's a long-ish thread on the current behavior here: https://discuss.python.org/t/restricting-open-ended-releases-on-pypi/43566

The TL;DR of it is that PyPI having "open-ended" releases is currently relied upon for some packaging workflows, e.g. there are maintainers who update their releases to contain wheels for new versions of Python rather than publishing an entirely new version with no functional changes. There's also some debate about how serious the vector is, given that (1) the attacker can only upload new files, not overwrite existing release files, and (2) could always just make a new release instead, given that Python as an ecosystem tends to avoid exact version-pinning.

But apart from that, +1 to everything @sethmlarson said, especially drafting -- there is a PEP that enables support for drafting on PyPI and other indices, and I've (very) recently been given the resources to begin work on actually implementing it 🙂

@di
Member

di commented Sep 16, 2024

I'm not sure how often this happens in practice and whether or not it's worth the additional risk because I don't maintain many packages that have these requirements.

It happens quite often!

That sounds extremely problematic to me. I expect that repositories must sometimes remove packages of a given version (e.g., by court order), but I think most users would expect that a given version# would be stable. This loophole is a great way to hide attacks.

The risk is entirely mitigated by using lockfiles or hash-pinned requirements files.
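As a rough sketch of why hash pinning helps: the installer only accepts files whose digest was recorded when the lockfile was created, so a file uploaded later to the same version is rejected. The `verify_pinned` helper is hypothetical; pip's actual mechanism is the `--hash=sha256:...` option in requirements files together with `--require-hashes`.

```python
import hashlib


def verify_pinned(artifact: bytes, pinned_sha256: set) -> bool:
    """Return True only if the artifact's SHA-256 digest was pinned."""
    return hashlib.sha256(artifact).hexdigest() in pinned_sha256


# Digest recorded when the lockfile was generated (placeholder bytes).
sdist = b"contents of foo-1.2.3.tar.gz"
pinned = {hashlib.sha256(sdist).hexdigest()}

# A file added to the same version afterwards has an unpinned digest.
late_wheel = b"contents of a wheel uploaded later"
```

With these placeholder inputs, `verify_pinned(sdist, pinned)` succeeds and `verify_pinned(late_wheel, pinned)` fails.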

I think adding support for "draft" releases would make removing this property of an index viable, but even then I am not sure of the impact for maintainers, needs more studying to be sure.

Draft releases definitely help with the "artifacts come from different places at different times" issue, and are desirable for other reasons, but they don't resolve the "build new artifacts for old releases against new Python versions, ABIs or platforms" issue, so they are not a panacea here.

@di di self-requested a review September 19, 2024 13:06
Member

@steiza steiza left a comment

Generally this looks good to me!

Let's start out with some easier items:

  • Time and time again this working group struggles with terminology. In keeping with the working group name, we usually use "repository" and "software repository" where this document uses "registry" and "artifact / package registry". Let's standardize on one term in the document, and it might be a good idea to have a terminology / definition section near the front.

  • In this pull request, can we add a link to this doc on both https://github.com/ossf/wg-securing-software-repos/blob/main/README.md and https://github.com/ossf/wg-securing-software-repos/blob/main/docs/index.md?

  • Now for the tricky one! Generally speaking, our other guides cover an existing successful implementation and high-level guidance on how other repositories can implement it. Of course, Go does have binary transparency (as this doc mentions), but the Design Ideas section (where I think the implementation guidance would live) is TBD in this version. Do you want to land a first version of this doc and flesh that out further later? How do you want to proceed here?

@david-a-wheeler

david-a-wheeler commented Oct 24, 2024

Time and time again this working group struggles with terminology. In keeping with the working group name

It happens often, I'm afraid.

we usually use "repository" and "software repository" where this document uses "registry" and "artifact / package registry".

Obviously people differ in what they mean by terms. For example, I should note that I don't normally use these terms as synonyms. Here's how I normally use the terms:

A (package) repository stores information. PyPI and CPAN, for example, actually store the packages that can be installed by a package manager. A source repository stores the source code that you might use (e.g., to build).

A registry is an "official" record of "where to get the information" - but often doesn't store the data itself. Registries redirect users to 1+ repositories. Depending on the registry, different components in the registry might be served by different repositories. I know quicklisp works this way, and I think others do too.

I don't claim everyone uses the terms the same way. That's part of the challenge here :-).

Let's standardize on one term in the document, and it might be a good idea to have a terminology / definition section near the front.

100% agree. Trying to get everyone to change terminology throughout the world to the same thing is er, hard. Documenting definitions of key terms, as they are used in the document, definitely sounds like a way forward.

@di
Member

di commented Nov 4, 2024

Now for the tricky one! Generally speaking, our other guides cover an existing successful implementation and high-level guidance on how other repositories can implement it.

I think we're pretty close to this with PEP 740 for PyPI, but I agree that we might want to wait to publish until that has been fully baked and any unforeseen issues are sorted out.

Successfully merging this pull request may close these issues.

Create "Binary Transparency for Artifact Registries" guide
7 participants