Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A type-safe representation of tagged unions #1

Open
hirrolot opened this issue Dec 2, 2023 · 3 comments
Open

A type-safe representation of tagged unions #1

hirrolot opened this issue Dec 2, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@hirrolot
Copy link

hirrolot commented Dec 2, 2023

Hi there! I came from your Reddit post and found this project fascinating.

While reading the sources, I've found the header include/lone/types.h, which uses a common technique of tagged unions. Most notably, struct lone_value:

struct lone_value {
	// Auxiliary data...

	enum lone_type type;

	union {
		struct lone_module module;
		struct lone_function function;
		struct lone_primitive primitive;
		struct lone_list list;
		struct lone_vector vector;
		struct lone_table table;
		struct lone_bytes bytes;
		struct lone_pointer pointer;
		long integer;
	};
};

The downside of this approach is that it's possible to mess up during case analysis by 1) not checking a tag, or by 2) checking a tag A and then using B. The compiler cannot check this automatically, causing the bug to silently creep into the final executable.

Instead of managing tagged unions manually, I would suggest using Datatype99, which is a header-only library designed specifically to deal with the problem of tagged unions. struct lone_value would look as follows 1:

datatype(
    LoneValue,
    (LoneModule, struct lone_module),
    (LoneFunction, struct lone_function),
    (LonePrimitive, struct lone_primitive),
    (LoneList, struct lone_list),
    (LoneVector, struct lone_vector),
    (LoneTable, struct lone_table),
    (LoneBytes, struct lone_bytes),
    (LonePointer, struct lone_pointer),
    (LoneInteger, long)
);

And case-analyzed as follows:

void handle(LoneValue value) {
    match(value) {
        of(LoneModule, module) { /* ... */ }
        of(LoneFunction, function) { /* ... */ }
        of(LonePrimitive, primitive) { /* ... */ }
        of(LoneList, list) { /* ... */ }
        of(LoneVector, vector) { /* ... */ }
        of(LoneTable, table) { /* ... */ }
        of(LoneBytes, bytes) { /* ... */ }
        of(LonePointer, pointer) { /* ... */ }
        of(LoneInteger, integer) { /* ... */ }
    }
}

Since of explicitly provides a variable binding, such as module or function, it's now much harder to make a mistake. The datatype encoding is also more concise than the corresponding tagged union representation, since the former defines both enum lone_type and struct lone_value 2.

Since Datatype99 has no run-time dependencies (not even the C standard library), and has a transparent and formally specified semantics, I think it might be a great fit for Lone 3.

Let me know if you have any thoughts or questions, which I should be able to answer.

Footnotes

  1. Auxiliary data can be added as a separate structure.

  2. Although both types can be manipulated separately under the names LoneValueTag and LoneValue.

  3. Other tagged unions in the project can be rewritten in the same way, if there are any.

@matheusmoreira
Copy link
Collaborator

Hello, thank you for your interest. Your C preprocessor metalanguage project is very impressive! I tried to do something along those lines in GNU Make once and it quickly became unmanageable.

The downside of this approach is that it's possible to mess up during case analysis by 1) not checking a tag

Compilers provide warnings in this case. When switching on an enumeration, both gcc and clang warn me if I forget a case. I wrote the code so as to fully resolve those warnings.

For example, clang warns me if I delete a case:

source/lone/hash.c:26:10: warning: enumeration value 'LONE_PRIMITIVE' not handled in switch [-Wswitch]
   26 |         switch (key->type) {
      |                 ^~~~~~~~~

or by 2) checking a tag A and then using B.

This is definitely a huge benefit, certainly something that I want. However I'm not sure it's enough to justify importing an external dependency into lone. I'll have to try it out to fully evaluate it but from a cursory exploration of your projects I think it would make gdb even more difficult to use due to the wrapping of lone's data structures in the generated types.

Your point did not fall on deaf ears though. I think compilers should be able to provide warnings for this. It would be neat if we could apply some kind of annotation to each union value that tells the compiler which enumerated tag it corresponds to.

struct lone_value {
  /* ... */
  enum lone_type type;
  union {
    struct lone_module module __attribute__((tag, type, LONE_MODULE));
    struct lone_function function __attribute__((tag, type, LONE_FUNCTION));
    struct lone_primitive primitive __attribute__((tag, type, LONE_PRIMITIVE));
    /* ... */
  };
};

Then the compiler would be able to warn if the union value is used outside a conditional block where it knows that type equals the specified tag.

I don't know how to add this feature to the C compilers but I'm going to look for the proper upstream channels to suggest/request it. Such a feature would improve safety for all C projects.

@hirrolot
Copy link
Author

hirrolot commented Dec 2, 2023

Compilers provide warnings in this case. When switching on an enumeration, both gcc and clang warn me if I forget a case.

I was a bit unclear in my initial statement. The explicit tagged union style encourages us to deal with tagged unions explicitly; as a result, it can be tempting to just write value.variant outside of switch. When writing code with sum types (implicit style), we typically only use match and of to deconstruct values. Of course, since Datatype99 is just a syntax sugar over tagged unions, it's still possible to write value.data.variant._0, but it's much less likely for such an expression to be written unintentionally.

The check you've mentioned is called exhaustiveness analysis. It is done by modern compilers both for explicit tagged unions and for match.

However I'm not sure it's enough to justify importing an external dependency into lone.

The only downside I see is a slight complication of the build process. Also, there are scripts that turn header-only libraries into a single header (SQLite uses it IIRC) for easier inclusion.

It would be neat if we could apply some kind of annotation to each union value that tells the compiler which enumerated tag it corresponds to.

I agree, it can be a nice feature. Let me know how it goes with requesting it.

Anyhow, feel free to comment here and ask any further questions.

@matheusmoreira
Copy link
Collaborator

matheusmoreira commented Dec 3, 2023

Alright, I opened issues on both gcc and clang issue trackers.

GCC 112840 - feature request: warn on incorrect tagged union value access

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants