Timezone Transformation/Casting #338
Hi, thanks for the write-up and details. I've put some draft changes below, if you want to try those in your local build:

type ParquetWriter struct {
    Writer      *parquet.Writer
    WriterMap   *parquet.GenericWriter[map[string]any]
    tzOffset    int64 // offset in seconds
    columns     Columns
    decNumScale []*big.Rat
    recBuffer   []map[string]any
}

func (pw *ParquetWriter) SetTzOffsetValue(offset int64) {
    pw.tzOffset = offset * 60 * 60 // hours -> seconds
}

func (pw *ParquetWriter) WriteRec(row []any) error {
    rec := map[string]any{}
    for i, col := range pw.columns {
        switch {
        case col.IsBool():
            rec[col.Name] = cast.ToBool(row[i]) // since it is stored as a string
        case col.Type == FloatType:
            rec[col.Name] = cast.ToFloat64(row[i])
        case col.Type == DecimalType:
            rec[col.Name] = StringToDecimalByteArray(cast.ToString(row[i]), pw.decNumScale[i], arrowParquet.Types.FixedLenByteArray, 16)
        case col.IsDatetime() || col.IsDate():
            switch valT := row[i].(type) {
            case time.Time:
                // shift the epoch value by the offset, scaled to the column's precision
                switch col.DbPrecision {
                case 1, 2, 3:
                    rec[col.Name] = valT.UnixMilli() + pw.tzOffset*1000
                case 4, 5, 6:
                    rec[col.Name] = valT.UnixMicro() + pw.tzOffset*1000000
                default: // 7, 8, 9, or unspecified
                    rec[col.Name] = valT.UnixNano() + pw.tzOffset*1000000000
                }
            default:
                rec[col.Name] = row[i]
            }
        default:
            rec[col.Name] = row[i]
        }
    }
    pw.recBuffer = append(pw.recBuffer, rec)
    if len(pw.recBuffer) == 100 {
        if err := pw.writeBuffer(); err != nil {
            return g.Error(err, "error writing buffer")
        }
    }
    return nil
}
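To make the precision scaling above concrete, here is a tiny standalone sketch of the same arithmetic (the -7 offset is just Snowflake's default mentioned later in this thread; nothing here is sling API):

package main

import (
    "fmt"
    "time"
)

// Illustrates the offset math in the draft above: SetTzOffsetValue stores
// hours as seconds, and WriteRec scales that to the unit implied by the
// column's precision. Values here are made up for demonstration.
func main() {
    offsetHours := int64(-7)          // e.g. Snowflake's default, per this thread
    tzOffset := offsetHours * 60 * 60 // seconds, as in SetTzOffsetValue

    t := time.Date(2024, 1, 1, 13, 0, 0, 0, time.UTC)
    fmt.Println(t.UnixMilli() + tzOffset*1000)       // precision 1-3: milliseconds
    fmt.Println(t.UnixMicro() + tzOffset*1000000)    // precision 4-6: microseconds
    fmt.Println(t.UnixNano() + tzOffset*1000000000)  // precision 7-9: nanoseconds
}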
I have thought about this more and I agree that adding another env variable or option does not seem like the correct way to go. What if it wasn't parquet-specific? My example was parquet and I focused on it, but it seems like it is more of a column-definition problem than a file problem. The code here would fix parquet files, but a JSON-lines file would still have the incorrect date. I checked the source and found a relevant column type. So the question becomes: can I infer what time zone the timestamp column is in (DB-specific behavior)? If not, does the column definition in the target options tell me?
I like that approach.
I am also leaning towards the transform with params. This will open more possibilities for transforms in general.

# all columns (only applies to timestamp/datetime columns)
transforms: [to_timezone(-0400)]

# column specific
transforms:
  my_col_ts: [to_timezone(-0400)]
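As a rough illustration of what a fixed-offset param could mean in Go, here is a sketch; toTimezone and its signature are hypothetical, not sling's actual transform API:

package main

import (
    "fmt"
    "time"
)

// A sketch of what a to_timezone transform with a ±HHMM offset param
// could do: parse the offset and re-render the value in that fixed zone.
func toTimezone(t time.Time, offset string) (time.Time, error) {
    sign := 1
    rest := offset
    if offset[0] == '+' || offset[0] == '-' {
        if offset[0] == '-' {
            sign = -1
        }
        rest = offset[1:]
    }
    var hh, mm int
    if _, err := fmt.Sscanf(rest, "%2d%2d", &hh, &mm); err != nil {
        return t, err
    }
    loc := time.FixedZone(offset, sign*(hh*3600+mm*60))
    return t.In(loc), nil
}

func main() {
    utc := time.Date(2024, 1, 1, 13, 0, 0, 0, time.UTC)
    local, err := toTimezone(utc, "-0400")
    if err != nil {
        panic(err)
    }
    fmt.Println(local.Format("2006-01-02 15:04:05 -0700")) // 2024-01-01 09:00:00 -0400
}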
I see that you have closed #348 and merged in changes. Does that open the door to what we discussed here? It allows defining a column type or transform at the root of options. I can take a look at adding the transform in the middle.

Another thing I have been thinking about is Daylight Savings: if there is a static timezone modifier, it will be incorrect for half the year. I have -0700 in my examples, but only because that is Snowflake's default; I am in the eastern timezone, where we are -0400 or -0500 depending on the season. It looks like the Go standard library has a way to load timezone location info from a string: https://pkg.go.dev/time#LoadLocation. That way a location can be loaded by name instead of hard-coding an offset.

And with columns, is there a way to specify that a column is just a timestamp and not a timestampz? That is really the root of the problem I had: it was the application of UTC to the timestamp that later threw it off by 7 hours, as Snowflake saw UTC and then applied its default timezone. I tried it with sling 1.2.15 and inspected the resulting parquet with Python; the schema tags the timestamp column as UTC.
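A quick standalone demo (dates picked arbitrarily) of why a location name beats a static offset: the same location resolves to -0500 in winter and -0400 in summer, with DST handled by the tz database.

package main

import (
    "fmt"
    "time"
)

func main() {
    loc, err := time.LoadLocation("America/New_York")
    if err != nil {
        panic(err)
    }
    winter := time.Date(2024, 1, 15, 12, 0, 0, 0, time.UTC).In(loc)
    summer := time.Date(2024, 7, 15, 12, 0, 0, 0, time.UTC).In(loc)
    fmt.Println(winter.Format("2006-01-02 15:04 -0700")) // 2024-01-15 07:00 -0500
    fmt.Println(summer.Format("2006-01-02 15:04 -0700")) // 2024-07-15 08:00 -0400
}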
Yes, that's the plan. Interesting, good point about DST. I agree that it should be a location name instead of a numerical offset, so the transform will be something like to_timezone('America/New_York').
Yes, I think we are on the same page here. I am thinking that keeping the internal type as always UTC would ease figuring out how to output. The transform function just tells Sling how to get the datetime to UTC if it is not stored that way in the db to start with. I work mainly with MS SQL Server, Postgres, and Snowflake. All of them have column types to store a datetime with a timezone and without.

With Timezone info

Take Snowflake's current time in eastern stored as TIMESTAMP_TZ. I am not sure which of these is closer to how Sling currently works, but inferring or transforming should end in the same spot: a timestamp with a timezone that is correct in UTC. We just need a way to get from the column value to UTC.

Without Timezone info

On the other side, look at Snowflake with no time zone info: the current time stored as TIMESTAMP_NTZ. Here the value alone does not say what zone it is in, so something external has to supply it.

Previous example

To expand this into the example I started with: the transform would convert the timestamp coming out of Snowflake to UTC before the parquet is written, so that when the parquet is ingested back into Snowflake, the value comes out correct.
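A minimal sketch of both paths to UTC described above, using only the Go standard library; the literal values and the America/New_York location are illustrative, not taken from sling:

package main

import (
    "fmt"
    "time"
)

func main() {
    // With timezone info (e.g. TIMESTAMP_TZ): the offset travels with the
    // value, so getting to UTC is just a conversion.
    withTZ, err := time.Parse("2006-01-02 15:04:05 -0700", "2024-01-01 13:00:00 -0500")
    if err != nil {
        panic(err)
    }
    fmt.Println(withTZ.UTC()) // 2024-01-01 18:00:00 +0000 UTC

    // Without timezone info (e.g. TIMESTAMP_NTZ): the wall clock alone is
    // ambiguous, so a transform/option must supply the source location.
    loc, err := time.LoadLocation("America/New_York")
    if err != nil {
        panic(err)
    }
    noTZ, err := time.ParseInLocation("2006-01-02 15:04:05", "2024-01-01 13:00:00", loc)
    if err != nil {
        panic(err)
    }
    fmt.Println(noTZ.UTC()) // 2024-01-01 18:00:00 +0000 UTC
}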
Correct, Parquet without a transform will always output UTC; with a transform, it should be in the specified timezone. I have added this feature for databases (not parquet). The transform is actually set_timezone:

$ ./sling conns exec postgres 'create table public.test6 (col1 timestamp, col2 timestamptz)'
$ ./sling conns exec postgres "insert into public.test6 values ('2024-01-01 13:00:00', '2024-01-01 13:13:13 +0400')"
$ ./sling run --src-conn postgres --src-stream public.test6 --stdout
col1,col2
2024-01-01 13:00:00.000000 +00,2024-01-01 09:13:13.000000 +00
$ ./sling run --src-conn postgres --src-stream public.test6 --stdout --transforms "[ set_timezone('America/New_York') ]"
col1,col2
2024-01-01 08:00:00.000000 -05,2024-01-01 04:13:13.000000 -05
Feature Description

Is there a way to force the timezone in a file? I am selecting out of Snowflake into Parquet (and I have tried jsonlines as well) and it always outputs the time in UTC. The logical type in the parquet is UTC. The column is not in UTC, so I am getting the incorrect time when I use the parquet.

I know timezones are the bane of using any sort of datetime field. I had looked through Discord and GitHub issues, and I know I can change the column when I am selecting, or run a post_sql to change it after inserting.

My understanding is that parquet is going to store timestamps in UTC. I think I am asking for something like datetime_format, but only for the timezone: a source option (?) that can add the timezone info to the timestamp. It would be optional, and if you supply it then it will take the datetime as given and translate it into UTC using the supplied timezone. The output would always be UTC, but there would be an optional step of translating it first. An example: a LASTUPDATED TIMESTAMP_NTZ column interpreted with an America/Los_Angeles option.

To summarize: have a source option, defaulting to UTC (I think this is current behavior), that will calculate the actual UTC timestamp without having to write custom SQL. It would apply to every datetime column whose timezone cannot be inferred. I can try and see if I can add this, but I wanted to get your thoughts before starting.
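For illustration, such an option might look like this in a replication config; the name source_timezone is hypothetical, not an existing sling option:

source:
  options:
    source_timezone: America/Los_Angeles # hypothetical; default would be UTC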