Add Values and Representation chapter #1664

chorman0773 · 2024-10-25T19:44:06Z

This documents the values for most types (where it has been decided), as well as the representation of these values in memory. It also defines what a byte is in Rust (including initialized and uninitialized memory).

The chapter does not define what Provenance carries in Rust. repr(Rust) enums are also not fully elaborated, as there are things undecided.

This makes a normative reference to https://ieeexplore.ieee.org/document/8766229 for floating-point format.

src/values.md

traviscross · 2024-11-05T22:01:22Z

src/values.md

+A floating-point value consists of either a rational number, which is within the range and precision dictated by the type, an infinity, or a NaN value.
+
+r[value.primitive.float-repr]
+A floating-point value is represented the same as a value of the unsigned integer type with the same width given by its [IEEE 754-2019] encoding.


We looked at this one sentence in particular in our lang-docs call today, and we were having a lot of trouble parsing it. Perhaps you could say this another way for us to better understand the intent here. Why does this need to reference "an unsigned integer type of the same width" at all, e.g.?

It's to inherit the definition of endianness from unsigned integer types (where it's easiest to define). Signed integers do the same thing.

When the entire lang team is confused by a sentence, I think it's fair to say that it needs to be rewritten.
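
One way to read the intent, as a hedged sketch rather than proposed spec text: the bytes of a float are the bytes of the unsigned integer holding its IEEE 754 bit pattern, so byte order is inherited from the integer type. The helper name `same_bytes_as_bits` is made up for illustration; `to_bits`, `from_bits`, and `to_ne_bytes` are standard library methods.

```rust
/// Hypothetical helper: checks that an `f32` has the same in-memory bytes
/// as the `u32` holding its IEEE 754 bit pattern.
fn same_bytes_as_bits(x: f32) -> bool {
    x.to_ne_bytes() == x.to_bits().to_ne_bytes()
}

fn main() {
    // 1.0f32 is sign 0, biased exponent 127, fraction 0: 0x3F80_0000.
    assert_eq!(1.0f32.to_bits(), 0x3F80_0000);
    assert!(same_bytes_as_bits(1.0));
    assert!(same_bytes_as_bits(-0.0));
    assert!(same_bytes_as_bits(f32::NAN));
    // Round trip through the integer encoding.
    assert_eq!(f32::from_bits(0x3F80_0000), 1.0);
}
```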

traviscross · 2024-11-05T22:04:31Z

@rust-lang/opsem: We're interested in your review on this. From the lang-docs side, we're particularly interested to confirm that this text is both correct from your perspective and is not making any new guarantees about the language.

traviscross · 2024-11-05T22:09:21Z

cc @rust-lang/lang

src/values.md

src/SUMMARY.md

ehuss

I'm feeling uncertain about this approach of having a chapter specifically for these representations. In the past, we have placed these definitions in the chapters for the types they are defining (for example, the char chapter defines the char). Can we move these rules into those chapters (and avoid any duplication with things that are already there)?

ehuss · 2024-11-05T22:09:46Z

src/values.md

+r[value.pointer]
+
+r[value.pointer.thin]
+Each thin pointer consists of an address and an optional provenance. The address refers to which byte the pointer points to. The provenance refers to which bytes the pointer is allowed to access, and the allocation those bytes are within.


This starts using the term provenance without defining what it means. Would it be possible to at least start that in the glossary? I'm also not sure if we'll need a more elaborate introduction, since rfc#3559 is quite weighty, so maybe there will need to be a more dedicated section for that?

There will be eventually, but it is also an incredibly weakly defined topic right now. The main definition we have today is, frankly, "It exists". Anything more requires a ton more litigation from T-opsem. Defining a byte requires referring to provenance, though, at least in the most abstract sense.

I added as much as is really decided.

chorman0773 · 2024-11-05T23:03:45Z

Can we move these rules into those chapters (and avoid any duplication with things that are already there)?

We still have to define what a byte is, which is relatively small for its own chapter, but doesn't have any other chapter to go in (glossary shouldn't provide normative definitions). It's also going to duplicate content excessively for aggregate types, since tuples and structs have different chapters, but have the exact same values and representation (according to a particular layout).

src/values.md

Co-authored-by: Ruby Lazuli <[email protected]>

chorman0773 · 2024-11-21T20:24:10Z

Need to figure out how to disentangle the aggregate representation rules to move them into a separate chapter.

chorman0773 · 2024-11-22T17:07:19Z

@rustbot ready

chorman0773 · 2024-11-25T15:26:50Z

I just talked about this on the Community Discord, and apparently this makes one guarantee that is "new": it guarantees that wide pointers are represented like some repr(Rust) struct of the data pointer and its metadata. While this seems like it's doing nothing, it does say that the fields exist somewhere in the representation (and that the compiler isn't doing far more interesting shenanigans than it might do to a tuple or a struct).
Is this a problematic guarantee to make?
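
For context, a small sketch of what is already observable today for one kind of wide pointer, a slice reference: the data pointer and the length metadata can both be recovered and used to rebuild the slice. This says nothing about where those two parts sit inside the pointer's representation, which is the open question here. The function name `parts` is hypothetical.

```rust
/// Hypothetical helper: splits a slice reference into its data pointer
/// and its length metadata.
fn parts(s: &[u16]) -> (*const u16, usize) {
    (s.as_ptr(), s.len())
}

fn main() {
    let data = [1u16, 2, 3];
    let (ptr, len) = parts(&data);
    assert_eq!(len, 3);
    // SAFETY: `ptr` and `len` were taken from a live slice just above.
    let rebuilt = unsafe { std::slice::from_raw_parts(ptr, len) };
    assert_eq!(rebuilt, &data[..]);
    // A wide pointer is at least as large as a thin pointer.
    assert!(std::mem::size_of::<&[u16]>() >= std::mem::size_of::<&u16>());
}
```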

RalfJung · 2024-12-01T15:32:45Z

src/memory-model.md

+> While bytes in Rust are typically lowered to hardware bytes, they may contain additional values,
+> such as being uninitialized, or storing part of a pointer.


Suggested change

-> While bytes in Rust are typically lowered to hardware bytes, they may contain additional values,
-> such as being uninitialized, or storing part of a pointer.
+> While bytes in Rust are typically lowered to hardware bytes, Rust uses an "abstract"
+> notion of bytes that can make distinctions which are absent in hardware,
+> such as being uninitialized, or storing part of a pointer.

RalfJung · 2024-12-01T15:33:19Z

src/memory-model.md

+r[memory.byte]
+
+r[memory.byte.intro]
+The most basic unit of memory in Rust is a byte. All values in Rust are computed from 0 or more bytes read from an allocation.


Is "value" defined anywhere?

Values are defined constructively, with each "class" of types. As mentioned in another comment, these are present in different chapters at the request of T-lang-doc and T-spec.

RalfJung · 2024-12-01T15:34:58Z

src/memory-model.md

+r[memory.byte.init]
+Each byte may be initialized, and contain a value of type `u8`, as well as an optional pointer fragment. When present, the pointer fragment carries [provenance][type.pointer.provenance] information.
+
+r[memory.byte.uninit]
+Each byte may be uninitialized.


It seems quite odd to split the definition of a "byte" over two separate items. It is very crucial that this list is exhaustive, and cannot be extended by some other clause somewhere else saying "a byte may also be X". So IMO this should be a single paragraph saying something like:

Each byte is described by one of the following cases:

It is uninitialized.

It is initialized, ...

Separately addressing two points:

I'd like there to be separate rule ids for init and uninit so they can be more easily cited separately (though the items of a list can have their own rule ids - there is ample precedent for that),

As noted, it may not yet be fully desirable to define the exhaustive list, though it is useful to define the initialized/uninitialized separation on its own (clearly set out that types like integers or raw pointers can have "invalid" representations)

We do want "machine-readable" names for the variants, fair. But making them separate paragraphs just feels entirely wrong. This should be a list, formatted like a list. Can we have lists where each item has a "name"?
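
To make the "single list with named cases" idea concrete, here is an illustrative model only, not proposed spec text. The names `AbstractByte` and `Provenance` are hypothetical, and the `u64` inside `Provenance` is a placeholder, since what provenance carries is left undefined in this PR. The `data` accessor is just a convenience for the sketch and is not a claim about how reads at integer type treat provenance.

```rust
/// Hypothetical stand-in for provenance; its contents are a placeholder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Provenance(u64);

/// Illustrative model of a byte as one closed list of cases.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AbstractByte {
    /// The byte is uninitialized.
    Uninit,
    /// The byte is initialized: a `u8` value plus optional provenance.
    Init(u8, Option<Provenance>),
}

impl AbstractByte {
    /// The `u8` value, if the byte is initialized.
    fn data(self) -> Option<u8> {
        match self {
            AbstractByte::Uninit => None,
            AbstractByte::Init(value, _) => Some(value),
        }
    }
}

fn main() {
    assert_eq!(AbstractByte::Uninit.data(), None);
    assert_eq!(AbstractByte::Init(0x2A, None).data(), Some(0x2A));
    assert_eq!(AbstractByte::Init(0x2A, Some(Provenance(1))).data(), Some(0x2A));
}
```

Because the `match` has no wildcard arm, adding a third case to the enum would force every such function to be revisited, which mirrors the concern about the list being exhaustive.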

RalfJung · 2024-12-01T15:39:20Z

src/memory-model.md

+r[memory.encoding]
+
+r[memory.encoding.intro]
+Each type in Rust has 0 or more values, which can have operations performed on them


This seems to be setting up for "memory.encoding", but really it defines the notion of a "value". I think that notion deserves a separate definition, in a r[memory.value], which can then be referenced elsewhere. I am not fully convinced putting this in "memory" is the best choice, since one defining characteristic of values is that they do not exist in memory, but I also don't know the structure well enough to be able to suggest a better place.

This should then also give some sort of definition of a value. Something like:

A value in Rust describes a high-level, "mathematical" view on data at a given type. Examples of values include:

mathematical integers

Boolean truth values

A tuple of values of potentially different types

A homogeneous list of values of identical type

This list is not exhaustive; as Rust evolves, more kinds of values can be added.

This used to be one chapter that fully defined value. I was asked to move those definitions to the respective type chapters to avoid duplication.

Having the encode/decode defined with each type makes sense IMO. But the concept of a value is a central concept that should be defined somewhere, it can't just be a decentral list. (Think: there is one place that defines trait Value, and then many places define what all the things inhabiting that trait are.) Having the list of "what is a value" spread out is fine (after all, my suggestion already said that the list is non-exhaustive), but the central definition should give some examples of what values are, just to make it understandable and to clearly differentiate it from "sequence of bytes".

RalfJung · 2024-12-01T15:43:50Z

src/memory-model.md

+Each value of a type can be encoded into a sequence of bytes, and decoded from a sequence of bytes, which has a length equal to the size of the type.
+The operation to encode or decode a value is determined by the representation of the type.


The representation relation does not determine encode/decode, since there can be more than one representation of a value -- encode makes the choice of which representation to pick. So it's the other way around, encode/decode determine the representation relation.

I would phrase this to be more centered around types, not values. Something like:

Each type defines an operation to decode a sequence of bytes into a value, and to encode a value into a sequence of bytes. When a sequence of bytes decodes to a value, we say that it represents that value. A value that is valid for a given type may be represented by multiple different sequences of bytes, and a sequence of bytes may represent 0 or 1 values of any given type.
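
A minimal sketch of that phrasing for `bool`, under a simplifying assumption: a byte is modelled as `Option<u8>`, with `None` standing for uninitialized, and provenance is ignored. The function names are hypothetical. It shows the "0 or 1 values" part: most byte sequences of length 1 do not represent any `bool`.

```rust
/// Simplified byte for this sketch: `None` stands for uninitialized.
type Byte = Option<u8>;

/// Decodes one byte at type `bool`: only 0x00 and 0x01 represent a value.
fn decode_bool(bytes: [Byte; 1]) -> Option<bool> {
    match bytes[0] {
        Some(0) => Some(false),
        Some(1) => Some(true),
        _ => None,
    }
}

/// Encodes a `bool` value into its single byte.
fn encode_bool(value: bool) -> [Byte; 1] {
    [Some(value as u8)]
}

fn main() {
    // Encoding then decoding gives back the value.
    assert_eq!(decode_bool(encode_bool(true)), Some(true));
    assert_eq!(decode_bool(encode_bool(false)), Some(false));
    // These sequences represent no `bool` value at all.
    assert_eq!(decode_bool([Some(2)]), None);
    assert_eq!(decode_bool([None]), None);
}
```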

RalfJung · 2024-12-01T15:53:19Z

src/types/numeric.md

+Each value of an integer type is a whole number. For unsigned types, this is a positive integer or `0`. For signed types, this can either be a positive integer, negative integer, or `0`.
+
+r[type.numeric.repr.integer-width]
+The range of values an integer type can represent depends on its signedness and its width, in bits. The width of type `uN` or `iN` is `N`. The width of type `usize` or `isize` is the value of the `target_pointer_width` property.


Shouldn't this say how it depends on that?

Specifically, unsigned integers carry values in 0 .. 2^N, and signed integers carry values in -2^(N-1) .. 2^(N-1) (left-inclusive, right-exclusive).
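
As a quick check of those formulas against the standard library constants (a sketch; the function names are hypothetical and the ranges are computed in `i128`, so it only covers widths up to 64):

```rust
use std::ops::Range;

/// Half-open range of values of `uN`: 0 .. 2^N.
fn unsigned_range(n: u32) -> Range<i128> {
    0..(1i128 << n)
}

/// Half-open range of values of `iN`: -2^(N-1) .. 2^(N-1).
fn signed_range(n: u32) -> Range<i128> {
    -(1i128 << (n - 1))..(1i128 << (n - 1))
}

fn main() {
    assert_eq!(unsigned_range(8), 0..256);
    assert_eq!(signed_range(8), -128..128);
    // The right end is exclusive, so the maximum is `end - 1`.
    assert_eq!(unsigned_range(64).end - 1, u64::MAX as i128);
    assert_eq!(signed_range(64).start, i64::MIN as i128);
    assert_eq!(signed_range(64).end - 1, i64::MAX as i128);
    // The width of `usize` follows `target_pointer_width`.
    if cfg!(target_pointer_width = "64") {
        assert_eq!(usize::BITS, 64);
    }
}
```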

RalfJung · 2024-12-01T15:55:20Z

src/types/struct.md

@@ -29,6 +29,32 @@ A _unit-like struct_ type is like a struct type, except that it has no fields.
 The one value constructed by the associated [struct expression] is the only
 value that inhabits such a type.

+## Struct and aggregate values


What is an "aggregate" here? I would have expected that structs and enums are aggregates, but enums are defined elsewhere and structs seem to be a separate class as well, so I am confused.

Tuples are aggregates also. The current definition of a wide pointer also ends up using aggregate representation.

Tuples have their own section on this, though. So I find the current structure a bit confusing.

Tuples would have the exact same set of values and the exact same representation formula (given the same chosen field offsets, size, and alignment constraints) as a struct with the same type fields, though, so this would just end up duplicating the same text. The tuple chapter does have a clause that explicitly delegates to this section, though, so it's not implicit.

The array section also used to delegate in the same way, but I changed it in a later revision to simply do it manually since, unlike structs or tuples, arrays can't shallowly have embedded padding bytes (only padding that exists inside of an element because of the element's representation).

I agree with the goal of not duplicating this for tuples. I just think the way it is currently done is confusing.

Even if this here just says "Struct values", we can still say in the tuple section that they use the same values and encode/decode as structs, can't we?

RalfJung · 2024-12-01T15:55:42Z

src/types/struct.md

@@ -29,6 +29,32 @@ A _unit-like struct_ type is like a struct type, except that it has no fields.
 The one value constructed by the associated [struct expression] is the only
 value that inhabits such a type.

+## Struct and aggregate values
+
+r[type.struct.value]


This should say somewhere that a value of struct type is a list of values with one value for each field of the struct.

RalfJung · 2024-12-01T15:57:36Z

src/types/struct.md

+The representation of such a struct contains the representation of the value of each field at its corresponding offset.
+
+r[type.struct.value.padding-uninit]
+When a value of an aggregate is encoded, each padding byte is left as uninit


As above, this can IMO be defined more nicely without having to talk about padding at all:

To decode a struct value, decode each field at its respective offset in the byte sequence, and use that to form the decoded value. To encode a struct value, encode each field, and place the encoding at the respective offset, leaving all bytes that are not in any field uninitialized.

NOTE: the bytes that are not in any field are also often called "padding bytes". The representation defined above implies that the contents of padding bytes are lost and reset to uninitialized each time a typed copy of a struct value is performed.

(Have we defined "typed copy" yet?)

We don't really need to define typed copy for this - a typed copy is just a decoding of some memory to a value, then re-encoding that value elsewhere. It may be useful to define non-normatively, or elsewhere, but we don't need to define it for this purpose.
As with enums, we need to define padding anyways b/c unions.

It may be useful to define non-normatively, or elsewhere, but we don't need to define it for this purpose.

Fair. I feel like (non-normatively) stating the fact that padding gets reset can be useful for understanding though. So maybe we can have something like

NOTE: As a consequence of this definition, if some sequence of bytes is decoded and then re-encoded, all information stored in padding bytes is lost (reset to uninitialized) as part of this round-trip.
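
The round-trip in that note can be sketched as follows. Everything here is an assumption for illustration: a byte is modelled as `Option<u8>` (`None` for uninitialized), the struct `Pair` is given a made-up layout with `a` at offset 0, `b` at offset 2, and size 4, and field bytes are taken as little-endian. The function names are hypothetical.

```rust
/// Simplified byte for this sketch: `None` stands for uninitialized.
type Byte = Option<u8>;

/// Assumed layout for the sketch: `a` at offset 0, `b` at offset 2, size 4.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Pair {
    a: u8,
    b: u16,
}

/// Decodes each field at its offset; the byte at offset 1 is never read.
fn decode_pair(bytes: [Byte; 4]) -> Option<Pair> {
    let a = bytes[0]?;
    let b = u16::from_le_bytes([bytes[2]?, bytes[3]?]);
    Some(Pair { a, b })
}

/// Encodes each field at its offset; bytes in no field are left uninitialized.
fn encode_pair(value: Pair) -> [Byte; 4] {
    let [lo, hi] = value.b.to_le_bytes();
    [Some(value.a), None, Some(lo), Some(hi)]
}

/// A typed copy in this model: decode, then re-encode.
fn typed_copy(bytes: [Byte; 4]) -> Option<[Byte; 4]> {
    decode_pair(bytes).map(encode_pair)
}

fn main() {
    let src = [Some(7), Some(0xFF), Some(0x34), Some(0x12)];
    assert_eq!(decode_pair(src), Some(Pair { a: 7, b: 0x1234 }));
    // The 0xFF stored in the padding byte does not survive the round trip.
    assert_eq!(typed_copy(src), Some([Some(7), None, Some(0x34), Some(0x12)]));
}
```

The two byte sequences before and after the copy differ only in the padding byte and decode to the same value, which is also an instance of one value having several representations.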

RalfJung · 2024-12-01T15:59:05Z

src/types/union.md

+r[type.union.value]
+A value of a union type consists of a sequence of bytes, corresponding to each [value byte][type.struct.value.value-bytes]. The value bytes of a union are represented exactly. Each [padding byte][type.struct.value.padding] is set to uninit when encoded.


Given that the fields of a union do not have to be structs, it is odd to refer to type.struct here. This definition requires there to be a general notion of "padding bytes in any type", and each type needs to say what its padding bytes are.
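
A small example of a union whose fields are not structs, to ground the point. The union `Bits` and the function `float_via_union` are hypothetical names. Writing one field and reading the other reinterprets the same stored bytes at the other field's type.

```rust
/// Example union with two non-struct fields of the same size.
#[repr(C)]
union Bits {
    int: u32,
    float: f32,
}

/// Stores a `u32` in the union and reads the same bytes back as an `f32`.
fn float_via_union(int: u32) -> f32 {
    let u = Bits { int };
    // SAFETY: all four bytes are initialized, and every initialized
    // 4-byte pattern is a valid `f32`.
    unsafe { u.float }
}

fn main() {
    assert_eq!(float_via_union(0x3F80_0000), 1.0);
    assert_eq!(float_via_union(0x4000_0000), f32::from_bits(0x4000_0000));
}
```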

RalfJung · 2024-12-01T16:01:54Z

src/memory-model.md

+> such as being uninitialized, or storing part of a pointer.
+
+r[memory.byte.init]
+Each byte may be initialized, and contain a value of type `u8`, as well as an optional pointer fragment. When present, the pointer fragment carries [provenance][type.pointer.provenance] information.


What is the difference between a "pointer fragment" and "provenance"? Wouldn't it be easier to just say, "[...] as well as optional provenance information"?

RalfJung · 2024-12-01T16:07:43Z

We're interested in your review on this. From the lang-docs side, we're particularly interested to confirm that this text is both correct from your perspective and is not making any new guarantees about the language.

This makes several new guarantees:

We have never defined what Rust's concept of a "byte" is. Uninitialized memory is a thing (though AFAIK this was never explicitly FCP'd either), provenance is a thing, but this PR basically guarantees that that's it, and there's nothing else "odd" about our bytes. I can't immediately think of anything else, but e.g. maybe we need more extra magic state to give an abstract definition of the behavior of coroutines?
We have not defined that reading a pointer byte at integer type just ignores the provenance. It'd probably be better if this PR didn't make a commitment here.

RalfJung · 2024-12-01T17:30:42Z

I should also add that overall I am very happy with the structure and style of the new text here. :) I can finally see how this could hold together as a proper spec also for the operational aspects of the language.

chorman0773 · 2024-12-01T17:42:45Z

We have never defined what Rust's concept of a "byte" is. Uninitialized memory is a thing (though AFAIK this was never explicitly FCP'd either), provenance is a thing, but this PR basically guarantees that that's it, and there's nothing else "odd" about our bytes. I can't immediately think of anything else, but e.g. maybe we need more extra magic state to give an abstract definition of the behavior of coroutines?

I don't believe this does confine the definition of byte to the current list. As you pointed out, the current definition (over two separate clauses) is not clearly exhaustive. I don't think there's anything else that would clearly prohibit a new fancy type of byte from coming into existence (though once we do affirm that, I hope to reflect that in the text).

We have not defined that reading a pointer byte at integer type just ignores the provenance. It'd probably be better if this PR didn't make a commitment here.

This part is fair, though part of me wants to just ask for a T-opsem FCP on that (could be done over on ucg though), as I don't think there's any other rule that satisfies monotonicity (other than PVI which has been ruled out through other means).

RalfJung · 2024-12-01T17:59:05Z

I don't believe this does confine the definition of byte to the current list. As you pointed out, the current definition (over two separate clauses) is not clearly exhaustive. I don't think there's anything else that would clearly prohibit a new fancy type of byte from coming into existence (though once we do affirm that, I hope to reflect that in the text).

I still think "byte" should be defined in a single clause, spreading it out is IMO quite confusing. We can add that the list can be extended in future versions of the spec. (However, this might make formal reasoning about programs impossible, so at some point we have to nail this down.)

This part is fair, though part of me wants to just ask for a T-opsem FCP on that (could be done over on ucg though), as I don't think there's any other rule that satisfies monotonicity (other than PVI which has been ruled out through other means).

I'd rather not block this PR on making that commitment. I agree that if we want provenance monotonicity I don't see another option, but there might be designs that entirely forego the need for provenance monotonicity... OTOH we've already committed to some spec changes for offset and offset_from that may imply that provenance monotonicity is unavoidable at this point.

chorman0773 added 2 commits October 25, 2024 15:30

Add Values and Representation chapter

c8da0a4

Specify representation of floating-point types

7ad36b3

chorman0773 added the S-waiting-on-review, T-opsem, and T-spec labels Oct 25, 2024

Fix lines must not end with spaces

51c1a8a

PatchMixolydic reviewed Oct 31, 2024

traviscross reviewed Nov 5, 2024

traviscross reviewed Nov 5, 2024

ehuss reviewed Nov 5, 2024

joshtriplett reviewed Nov 10, 2024

chorman0773 and others added 7 commits November 11, 2024 11:47

Apply requested changes from PR Reviews

00fc377

Update src/values.md

1de1fc5

Co-authored-by: Ruby Lazuli <[email protected]>

Add section giving a brief explainer of provenance.

c59d504

Fix "Line Must End with Spaces"

b74f458

Move value definitions to appropriate chapters under types.

d6b6744

Remove double line break issues

c176393

Fix "File must end with a newline"

dbafce3

chorman0773 added the S-waiting-on-author label and removed the S-waiting-on-review label Nov 21, 2024

Move aggregate values into appropriate chapters

2f75e78

rustbot added the S-waiting-on-review label and removed the S-waiting-on-author label Nov 22, 2024

Add note about producing ! at runtime being UB

b026ace

chorman0773 added 3 commits November 22, 2024 13:14

Redefine array layout directly

58573b9

Remove redundant sections on bit validity

5964acc

Elaborate on how union constructors produce union values

5465cc0

RalfJung reviewed Dec 1, 2024
