Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Errors with NULL columns #9669

Open
1 task done
Riezebos opened this issue Jul 23, 2024 · 4 comments · May be fixed by #9810
Open
1 task done

bug: Errors with NULL columns #9669

Riezebos opened this issue Jul 23, 2024 · 4 comments · May be fixed by #9810
Labels
bug Incorrect behavior inside of ibis

Comments

@Riezebos
Copy link
Contributor

Riezebos commented Jul 23, 2024

What happened?

When a column has only nulls, I can't create a DuckDB table or pyarrow table from it.

Here are some reproducible examples:

import ibis

con = ibis.duckdb.connect()

data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]

t = ibis.memtable(data)
con.create_table("test", t)

Result: ParserException: Parser Error: syntax error at or near "NULL"

t.execute() does work in the above example.

import ibis

con = ibis.duckdb.connect()

data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]

ibis.memtable(data).to_pyarrow()

Result: ArrowNotImplementedError: Unsupported cast from int32 to null using function cast_null

import ibis
import pyarrow as pa

con = ibis.duckdb.connect()

data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]

array = pa.array(data)
pa_table = pa.Table.from_struct_array(array)

con.create_table("test", pa_table)

Result: ParserException: Parser Error: syntax error at or near "NULL"

I am guessing the problem is that PyArrow supports columns having a datatype of null, which most databases probably don't?

DuckDB apparently converts the null datatype into int32:

import duckdb

duckdb.from_arrow(pa_table)
┌───────┬───────┐
│ col1  │ col2  │
│ int64 │ int32 │
├───────┼───────┤
│     1 │  NULL │
│     4 │  NULL │
└───────┴───────┘

I have no idea what the best way to handle this is, maybe raising an exception asking the user to specify a schema when a NULL column exists?

A job I run daily suddenly started giving the first error. With the error I got, it took some experimenting to figure out that it was actually caused by this issue. A column in the source data (from some API) that usually has strings and nulls now had only nulls.

What version of ibis are you using?

9.2.0

What backend(s) are you using, if any?

DuckDB

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@Riezebos Riezebos added the bug Incorrect behavior inside of ibis label Jul 23, 2024
@cpcloud
Copy link
Member

cpcloud commented Jul 23, 2024

Based on

A job I run daily suddenly started giving the first error. With the error I got, it took some experimenting to figure out that it was actually caused by this issue. A column in the source data (from some API) that usually has strings and nulls now had only nulls.

it sounds like you know the type of the column: string.

The best thing to do then is to pass that information to ibis.memtable, like so:

In [1]: import ibis
   ...:
   ...: con = ibis.duckdb.connect()
   ...:
   ...: data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]
   ...:
   ...: t = ibis.memtable(data, schema={"col1": "int", "col2": "str"})
   ...: con.create_table("test", t)
Out[1]:
DatabaseTable: memory.main.test
  col1 int64
  col2 string

In [2]: from ibis.interactive import *

In [3]: con.tables.test
Out[3]:
┏━━━━━━━┳━━━━━━━━┓
┃ col1  ┃ col2   ┃
┡━━━━━━━╇━━━━━━━━┩
│ int64 │ string │
├───────┼────────┤
│     1 │ NULL   │
│     4 │ NULL   │
└───────┴────────┘

@Riezebos
Copy link
Contributor Author

Thanks! Yeah, once I knew the cause solving the immediate problem was not difficult, but I still thought I'd raise this issue as the current behavior might not be ideal.

@cpcloud
Copy link
Member

cpcloud commented Jul 24, 2024

@Riezebos Any interest in submitting a PR?

Adding this to the memtable docstring somewhere would be the place to start!

@Riezebos
Copy link
Contributor Author

Riezebos commented Jul 25, 2024

I'd be interested, but then I'd rather like to fix it in a way that doesn't give the errors I had. Both of the errors don't make it clear that the cause of the problem is a column full of nulls. The current way of handling null columns with duckdb/sqlglot seems a little broken, I'll explain why I think so below.

memtable to pyarrow (via duckdb)

When using .to_pyarrow() on the memtable, it is first converted to duckdb using _to_duckdb_relation. Duckdb automatically converts null columns to int32. Then PyArrowData.convert_table is used on the result which gives:

ArrowNotImplementedError: Unsupported cast from int32 to null using function cast_null

The schema of the expression still thinks the datatype of the column is null in self.schema(), but it actually is int32 as duckdb changed it.

This can be checked using:

import ibis

data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]
t = ibis.memtable(data)

con = ibis.duckdb.connect()
print(con._to_duckdb_relation(t).arrow())

Output:

pyarrow.Table
col1: int64
col2: int32
----
col1: [[1,4]]
col2: [[null,null]]

Relevant lines:

Potential solutions for this would be:

  1. Cast the column datatype to null after it has been converted from duckdb to arrow (before comparing the pa.table schema to the desired_schema
  2. Don't try to cast the table to desired_schema in cases like this (maybe giving a warning that the null column has been cast to int32 by duckdb)

create_table in duckdb from memtable

I'm pretty sure this just fails because sqlglot writes a query that tries to create a table with a datatype of NULL which duckdb doesn't support. The method self.compiler.type_mapper.from_ibis(typ) when getting the column definitions returns DataType(this=Type.NULL).

This can be tested by running this after the example above:

con.compiler.type_mapper.from_ibis(t.schema().types[1])

Relevant lines:

Potential solutions for this would be:

  1. Raise a warning/exception when create_table is called with a table that has a null column in its schema (the warning/exception could tell the user to add a schema with a different datatype as an argument)
  2. Cast the column to int32 like duckdb does (with a warning)
  3. Add an optional argument to create_table: cast_null_columns_to: ibis.expr.datatypes.DataType = None. A datatype can be given here to convert null columns to.

I don't know if null columns is something which ibis wants to handle fully, but the null datatype does exist, it works in pyarrow and somewhat works in duckdb. Without solving the above problems, the combination of ibis+duckdb+null column is a bit broken in my opinion.

I personally think that in both cases solution 1 is the best solution. If you agree with the problems and the solutions, I'd be happy to start working on a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Status: backlog
2 participants