
Escaping identifiers makes them different #1

Open
Matts966 opened this issue May 23, 2020 · 4 comments

Comments

Matts966 (Owner) commented May 23, 2020

`dataset.table` and dataset.table are recognized as different identifiers.

Unifying the identifier style in queries prevents this problem for now.

lukas-at-harren commented Feb 28, 2022

On our end, we handle this by passing table names through a "bigquery_normalize_identifier(input, default_project_id)" function in our consuming Python application.
Since BigQuery is currently our only target, the same approach could be passed down to AlphaSQL as well.

Background on how BigQuery handles unqualified table names

The normalisation function receives two arguments: the identifier and the default BigQuery project id the query should be executed in.

BigQuery supports queries against tables identified only by their dataset + table name. E.g. core.orders will then query the table orders in the dataset core, and the project_id (i.e. database) will be inferred from the caller/where the job is executed.

A fully qualified table identifier can be written like so, encased in two backtick (`) characters: `project_id.dataset.table_name`
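
Such a normalisation function could look roughly like the following. This is a minimal sketch of the `bigquery_normalize_identifier` function named in this comment; the real implementation is not shown in this thread, so the exact behaviour (e.g. error handling) is an assumption:

```python
def bigquery_normalize_identifier(identifier: str, default_project_id: str) -> str:
    """Fully qualify a BigQuery table identifier.

    Strips any backtick quoting, then prepends the default project id
    when the identifier only contains dataset.table parts.
    """
    parts = identifier.strip("`").split(".")
    if len(parts) == 2:  # dataset.table -> qualify with the default project
        parts.insert(0, default_project_id)
    elif len(parts) != 3:
        # Hypothetical: reject anything that is not 2- or 3-part
        raise ValueError(f"unexpected identifier: {identifier!r}")
    return "`" + ".".join(parts) + "`"
```

With this, core.foo and `project-123.core.foo` both normalise to the same quoted identifier when the default project id is project-123.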

AlphaSQL could handle table name normalisation

I can imagine AlphaSQL normalising all identifiers it finds in a given set of queries.
Imagine the following setup:

Given a query foo.sql that creates a table:

-- samples/issue-1/foo.sql
create or replace table core.foo as
select 'foo' as col1

And a query assert_foo.sql that queries the other table:

-- samples/issue-1/assert_foo.sql
select count(1) > 0
from `project-123.core.foo`
where col1 = 'foo'

Now AlphaSQL could support a command-line argument to "qualify" all identifiers it encounters:

  • --default_project_id= in order to provide a default BigQuery project id
  • --qualify_identifiers a flag to enable the feature when parsing
# Imaginary example
$ alphadag --with_tables --default_project_id=project-123 --qualify_identifiers samples/issue-1/

Currently AlphaSQL outputs the following information:

# At the time of writing, AlphaSQL does not yet support these command line flags
$ alphadag --with_tables samples/issue-1/
Reading paths passed as a command line arguments...
Only files that end with .sql or .bq are analyzed.
Reading "samples/issue-1/assert_foo.sql"
Reading "samples/issue-1/foo.sql"
digraph G {
0 [label="samples/issue-1/assert_foo.sql", shape="", type=query];
1 [label="samples/issue-1/foo.sql", shape="", type=query];
2 [label="core.foo", shape=box, type=table];
3 [label="project-123.core.foo", shape=box, type=table];
1->2 ;
3->0 ;
}
EXTERNAL REQUIRED TABLES:
project-123.core.foo

But it could (given the imaginary options) output the following normalised information. Note the labels on the digraph output.

$ alphadag --with_tables --default_project_id=project-123 --qualify_identifiers samples/issue-1/
Reading paths passed as a command line arguments...
Only files that end with .sql or .bq are analyzed.
Reading "samples/issue-1/assert_foo.sql"
Reading "samples/issue-1/foo.sql"
digraph G {
0 [label="samples/issue-1/assert_foo.sql", shape="", type=query];
1 [label="samples/issue-1/foo.sql", shape="", type=query];
2 [label="`project-123.core.foo`", shape=box, type=table];
1->2 ;
2->0 ;
}
EXTERNAL REQUIRED TABLES:
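
The effect on the graph can be illustrated in isolation: normalising both table labels collapses the two table nodes from the first digraph into the single node of the second. This is a self-contained sketch, not AlphaSQL code:

```python
def qualify(label: str, default_project_id: str) -> str:
    """Normalise a table label: drop backticks, add the default project id."""
    parts = label.strip("`").split(".")
    if len(parts) == 2:  # dataset.table only -> prepend the project
        parts.insert(0, default_project_id)
    return ".".join(parts)

# The two table nodes AlphaSQL currently emits for the same table:
labels = ["core.foo", "project-123.core.foo"]
merged = {qualify(label, "project-123") for label in labels}
print(merged)  # a single node remains
```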

Matts966 (Owner, Author) commented:

@lukas-at-harren
Thank you for your detailed verification!

I think this issue is no longer an actual problem, because `table` and table are now recognized as the same identifier.
Are there any edge cases remaining for this issue?

Also, project_id is a dynamic property, and making it an option seems complex to me.
We automatically suggest unifying project.dataset1.table1 and dataset1.table1 using Python.

Could this issue be closed?

Matts966 (Owner, Author) commented:

Found that rewriting `dataset1.table1` as dataset1.table1 causes an error in the case where dataset1.table1 comes from JSON. The JSON input is dataset1.table1, which makes this inconsistent.

Matts966 (Owner, Author) commented:

Currently, CREATE statements for resources like dataset.resource_name are not supported, but CREATE statements for resources like `dataset.resource_name` are.

I found that this is because the nested catalog is not initialized. Instead, the current implementation simply adds resources with identifier vectors, which leads to errors like Resources not found: resource_name.

We should improve the way the AddOwned~ functions are called.

For now, the workaround is simply to escape identifiers like `dataset.resource_name`.
