Parquet invalid encoding / not valid UTF8 #435

Open
OneCyrus opened this issue Nov 6, 2024 · 2 comments

OneCyrus commented Nov 6, 2024

Issue Description

  • Description of the issue:
    When exporting an MSSQL table to Parquet, I get a Parquet file that DuckDB rejects because of string encoding issues.

"select * from output.parquet" with the latest duckdb results in:
Invalid Input Error: Invalid string encoding found in Parquet file: value "\x00\x00\x00\x00\xA8Y\xE2w" is not valid UTF8!

  • Sling version (sling --version): 1.2.22

  • Operating System (linux, mac, windows): Ubuntu 20.04

  • Replication Configuration:

export MSSQL='sqlserver://user:pw@server:1433?database=mytable'
sling run --src-conn MSSQL --src-stream "SELECT * FROM dbo.[ConfigurationItem]" --tgt-object 'file:///runner/project/output.parquet' -d
  • Log Output (please run command with -d):
2024-11-06 15:01:15 DBG Sling version: 1.2.22 (linux amd64)
2024-11-06 15:01:15 DBG type is db-file
2024-11-06 15:01:15 DBG using: {"columns":null,"mode":"full-refresh","transforms":null}
2024-11-06 15:01:15 DBG using source options: {"empty_as_null":false,"null_if":"NULL","datetime_format":"AUTO","max_decimals":-1}
2024-11-06 15:01:15 DBG using target options: {"header":true,"compression":"auto","concurrency":7,"datetime_format":"auto","delimiter":",","file_max_rows":0,"file_max_bytes":0,"max_decimals":-1,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-11-06 15:01:15 DBG opened "sqlserver" connection (conn-sqlserver-nU9)
2024-11-06 15:01:15 INF connecting to source database (sqlserver)
2024-11-06 15:01:15 INF reading from source database
2024-11-06 15:01:15 DBG SELECT * FROM dbo.[ConfigurationItem]
2024-11-06 15:01:16 INF writing to target file system (file)
2024-11-06 15:01:16 DBG opened "file" connection (conn-file-DLa)
2024-11-06 15:01:16 DBG writing to file:///runner/project/output.parquet [fileRowLimit=0 fileBytesLimit=0 compression=auto concurrency=7 useBufferedStream=false fileFormat=parquet singleFile=true]
2024-11-06 15:05:47 DBG wrote 138 MB: 467182 rows [1,714 r/s]
4m29s 466,602 1737 r/s 1.4 GB | 58% MEM | 86% CPU
2024-11-06 15:05:47 INF wrote 467182 rows [1,714 r/s] to file:///runner/project/output.parquet
2024-11-06 15:05:47 DBG closed "sqlserver" connection (conn-sqlserver-nU9)
2024-11-06 15:05:47 INF execution succeeded
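
For anyone wanting to see what the DuckDB error above means, here is a minimal sketch in Python that checks the reported bytes. The helper name `first_invalid_utf8_offset` is hypothetical and not part of sling or DuckDB. The value is 8 bytes long and decoding fails at offset 4, because 0xA8 is a continuation byte with no lead byte before it. That pattern would be consistent with a binary column (for example an 8-byte `rowversion`) being written as a string, but that is an assumption, not something the logs confirm.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte.
        return exc.start
    return None


# Value quoted in the DuckDB error message from the report.
reported = b"\x00\x00\x00\x00\xA8Y\xE2w"

offset = first_invalid_utf8_offset(reported)
print(len(reported), offset)  # prints: 8 4
```

If the source column does turn out to be binary, casting it explicitly in the source query (or excluding it) might be a workaround until the writer is fixed.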

flarco (Collaborator) commented Nov 6, 2024

Yes, sling will soon use DuckDB under the hood to read and write Parquet files.
The Go driver (github.com/apache/arrow/go) is unfortunately not of great quality and has caused many issues. Stay tuned.

OneCyrus (Author) commented Nov 7, 2024

good to know. thanks!
