-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
preserve stucture of timezone designation list #5581
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Petros Angelatos <[email protected]>
d4b9bd0
to
e77d1b2
Compare
@nordzilla Can you take a look? It is in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⭐ Praise
@petrosagg, thanks for taking the time to sift through the parsing code, diagnose a potential issue, and submit a pull request.
Background
It's been over two years since I wrote this code, so I had to take a moment to get back up to speed on everything.
I see what you're wanting to do here and understand why, but I'm not sure I fully agree with the changes, at least in terms of ICU4X's use case. I also don't think the implementation here is quite what you intended it to be.
I'm going to address the following in this review comment:
- Explain the rationale behind the original implementation.
- Discuss the rationale behind your new implementation.
- Discuss why the current implementation is not quite correct, along with would need to happen to make the implementation correct.
- Open the floor for further discussion about what to do next.
The Original Implementation
TZif
is a compressed binary format that utilizes self-referential indices for these time-zone designations.
Consider the string "LMT\0AEDT\0AEST\0"
.
The compressed format contains only this single string, and defines indices into that string, which read until the next null terminator character b'\0'
.
- The index
0
would refer toLMT
. - The index
4
would refer toAEDT
. - The index
9
would refer toAEST
.
The compression goes even further, by allowing for an index to start in the middle of a string.
- The index
5
would refer toEDT
. - The index
10
would refer toEST
. - (Hypothetically) the index
12
would refer toT
.
The purpose for this compression in TZif
is because there can be many records that share the same string. There may be five records that all map to the substring section AEDT
.
Since performance, allocations, runtime, etc. are not critical concerns for the data-generation side of ICU4X, I made the choice to unroll all of these strings into a Vec
that is a 1:1 mapping with the records.
Consider the following test case that corresponds to the explanation above:
#[test]
fn parse_time_zone_designations() {
assert_parse_eq!(
time_zone_designations(
14,
vec![
LocalTimeTypeRecord {
idx: 0,
..Default::default()
},
LocalTimeTypeRecord {
idx: 4,
..Default::default()
},
LocalTimeTypeRecord {
idx: 9,
..Default::default()
},
LocalTimeTypeRecord {
idx: 5,
..Default::default()
},
LocalTimeTypeRecord {
idx: 10,
..Default::default()
},
LocalTimeTypeRecord {
idx: 12,
..Default::default()
},
LocalTimeTypeRecord {
idx: 0,
..Default::default()
},
]
),
"LMT\0AEDT\0AEST\0",
vec![
String::from("LMT"),
String::from("AEDT"),
String::from("AEST"),
String::from("EDT"),
String::from("EST"),
String::from("T"),
String::from("LMT"),
],
);
}
In this implementation, the 0
index local time type record (of the vector, not the idx
value) maps to the 0
index time zone designation string from its vector, and so on.
This makes it easy to zip together the two collections, pulling the offset from the record and matching it with the string.
It does, however, mean that the same string may be allocated multiple times in the parsed data.
The Implementation In This PR
@petrosagg it looks like you noticed that the idx
values in the parsed local time type records no longer correspond to anything meaningful: the index 9
doesn't mean anything anymore, because that is the index into the compressed TZif string.
It looks like what you've attempted here is to preserve the compressed TZif format, making each idx
value mean something again within the source data.
The implementation, however, is stuck half way between.
You're still splitting the string up into a Vec<String>
delimited by the null-terminator character:
count_min_max(charcnt, charcnt, any()).map(|bytes: Vec<u8>| {
bytes
.split_inclusive(|b| *b == 0)
.map(|s| String::from_utf8_lossy(&s[0..s.len() - 1]).into_owned())
.collect()
})
But your helper function treats the idx
value, which is an index into the original byte string with multiple null terminators, instead as an index into the first string in the vector that is long enough for the index.
impl DataBlock {
/// Retrieves the timezone designation at index `idx`.
pub fn time_zone_designation(&self, mut idx: usize) -> Option<&str> {
self.time_zone_designations.iter().find_map(|d| {
if idx <= d.len() {
Some(&d[idx..])
} else {
idx -= d.len() + 1;
None
}
})
}
}
How to Make This Implementation Correct
In order to fully achieve what you're trying to do here, you would have to preserve the original string "LMT\0AEDT\0AEST\0"
(not a vector of separated strings), and then the indices would truly match to something meaningful again.
A Discussion About How To Move Forward
I'm still not convinced that reverting back to a compressed string is the best path here.
Perhaps @petrosagg could share more about their use case, or the reason why they want this.
@sffc could you read this over, let me know if anything feels confusing or needs further clarification, and give your input from an ICU4X-maintainer perspective?
Yeah, definitely! I'm porting porting a medium size C datetime parsing library in Rust piece by piece. I'm currently attempting to replace the code that loads a The unrolling of timezone designations that happens in this library prevents me from doing so since some information is lost during parsing about how the timezone list was laid out. So this PR is maintaining the ability of determining the correct timezone designation in all the edge cases that we discussed in the other comment while also preserving more of the original structure of the file. An additional benefit, which I don't need right now but I find beneficial, is that when a tzif file is parsed with the current code it is no longer in a form that can be serialized back into disk in a correct way. In order to do so all the index fields of each timezone designation would have to be re-written and the list of designations would also have to be deduplicated. This complication can be avoided if we keep the original structure as done in this PR, which turns serialization back into a tzif file into a simple process of writing out the fields one by one. |
Hey @petrosagg Thanks for all the further explanation and discussion! I had misinterpreted your code in an attempt to finish this review quickly before the end of my day. I apologize for that! I do believe that the code/implementation is correct. After thinking about it more last night, I was going to suggest that we could just remove the But now it's more clear that there is value for you in being able to round-trip the parsed data back into the original format (which is not a requirement for ICU4X's use case). I have no strong desire to keep the old code if it is more useful to the Rust community to have it structured like this. Using it for ICU4X data generation would only be slightly different. I'm happy to accept this code if it's fine with @sffc. I haven't been an active maintainer/contributor to ICU4X for a while, so I feel better delegating that final decision. This would be a breaking change to the In terms of further review, I would like to see a slightly more detailed explanation of the helper function, more than just the one-liner: It would be helpful to mention what the original format looks like, that the vector of strings are preserved in the original order, and that is why subtracting from the That will help prevent people like me from misinterpreting the function in the future 😅 |
time_zone_designations | ||
count_min_max(charcnt, charcnt, any()).map(|bytes: Vec<u8>| { | ||
bytes | ||
.split_inclusive(|b| *b == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤏 nitpick: style
I would prefer to see *b == b'\0'
here to preserve the relationship to the idea that this acts like a C-style null terminator character.
The intention of this PR is to maintain the structure of the binary data in the
DataBlock
struct. The current code produces a list of timezone designations that has the same length as the list of local time records with the i-th timezone designation corresponding to the i-th local time record. In this context theidx
field of the local time record does not correspond to a position in the timezone designation list.In this alternative implementation the timezone designation list directly corresponds to the null separated list of designations in the binary data. Since the
idx
field of each local time record can refer anywhere in that string, even in the middle of a designation, a helper method is added so that the correct substring is returned.This allows the
DataBlock
to capture the data exactly and also potentially re-encode them without loss.