[hive][spark] Support create table with specified location using hive catalog #3843
Conversation
cc @Zouxxyy @YannByron Can you help review this? Thanks!
-1, It is best for Spark to be unified with other engines, as using a new option may harm the interoperability between engines
Hi @JingsongLi, this is not introducing a new option, but rather adapting better to the Spark engine, because Spark has always used the reserved `location` property to represent the table location.
Are you suggesting that users should retrieve the table location information from the table options? However, I would like to point out that the meaning of the DESC command inherently varies across engines and is closely tied to each engine's functionality. For instance, the DESC command in Trino and Flink does not even display table options, only column information; yet Flink can display the primary key attribute of columns because primary keys are part of the Flink specification (this is not the case in Spark). Therefore, I believe we should also adapt this to Spark, which is more in line with the habits of Spark users. BTW, for Iceberg and Hive tables, Spark's DESC EXTENDED DDL command displays the location information through the `location` field. For Paimon tables, however, the location information is hidden within the options under the `path` field, which is less convenient for users. This is the motivation behind this PR.
@@ -60,6 +60,8 @@ case class SparkTable(table: Table)
      properties.put(CoreOptions.PRIMARY_KEY.key, String.join(",", table.primaryKeys))
    }
    properties.put(TableCatalog.PROP_PROVIDER, SparkSource.NAME)
    val location = properties.remove(CoreOptions.PATH.key())
This interface (`default Map<String, String> properties()`) is only used in show create or desc table, so adding `TableCatalog.PROP_LOCATION` is ok, but I wonder whether we need to remove `path`.
Besides, another point worth noting: does Paimon currently not actually support `create table x location('')`? The new result of `show create table` may cause misunderstanding, or we need to support it first.
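For illustration only, here is a rough Java-style sketch of the idea under discussion (the real SparkTable code is Scala, and the class and method names here are hypothetical): expose the storage path under Spark's reserved `location` property while keeping the `path` option intact.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: build the properties map Spark reads, duplicating the
// Paimon "path" value under Spark's reserved "location" key instead of removing it.
public final class LocationPropertySketch {
    public static Map<String, String> exposeLocation(Map<String, String> tableOptions) {
        Map<String, String> properties = new HashMap<>(tableOptions);
        String path = properties.get("path"); // CoreOptions.PATH.key()
        if (path != null) {
            // "location" is TableCatalog.PROP_LOCATION in Spark; "path" stays so that
            // other engines reading the table options are unaffected.
            properties.put("location", path);
        }
        return properties;
    }
}
```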
Hi @zhongyujiang, we can discuss in this thread. @Zouxxyy raised two very good points:
- We could keep the `path` option.
- We should also support execution for `show create table`.
> We could keep path option.

Sure.

> We should also support execution for show create table.

I've tested this and found that the location info passed through Spark DDL is ignored right now.
IIUC, the table location is closely related to the warehouse address, and different catalogs have different implementations:
- AbstractCatalog calculates the table location from the warehouse, database, and table name when loading a table, and both FileSystemCatalog and JdbcCatalog follow this behavior.
- HiveCatalog retrieves the table location from the HMS (Hive Metastore).

Therefore, if we want to support this, we need to differentiate the catalog being used. This is feasible with HiveCatalog, but not with other catalogs. What do you think? @Zouxxyy @JingsongLi
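As a rough illustration of the two behaviors above (not the actual Paimon code; it assumes the usual `warehouse/<db>.db/<table>` layout of filesystem-based catalogs):

```java
import org.apache.hadoop.fs.Path;

// Filesystem-style catalogs derive the table location from the warehouse root,
// so there is no natural place for a user-specified location; HiveCatalog instead
// stores and reads the location through the Hive Metastore.
public final class TableLocationSketch {
    public static Path defaultTableLocation(Path warehouse, String database, String table) {
        return new Path(new Path(warehouse, database + ".db"), table);
    }
}
```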
For other catalogs, we can check the location and, if it does not match, throw an exception.
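A minimal sketch of that check, reusing the hypothetical defaultTableLocation helper from the previous snippet (the exception type and message are illustrative):

```java
import org.apache.hadoop.fs.Path;

// Reject a user-specified location unless it equals the location the catalog
// would assign anyway.
public final class LocationCheckSketch {
    public static void checkLocation(Path warehouse, String database, String table, String specified) {
        Path assigned = TableLocationSketch.defaultTableLocation(warehouse, database, table);
        if (specified != null && !new Path(specified).equals(assigned)) {
            throw new IllegalArgumentException(
                    "Specified location " + specified
                            + " does not match the catalog-assigned location " + assigned);
        }
    }
}
```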
See #2619.
I have updated this to allow customizing the table location when using the Hive catalog and creating an external table. Please take a look when you have time. Thanks!
@@ -117,6 +119,9 @@ protected void dropTableImpl(Identifier identifier) {

    @Override
    public void createTableImpl(Identifier identifier, Schema schema) {
        checkArgument(
            !schema.options().containsKey(CoreOptions.PATH.key()),
Can we just relax this check? For example, if it is the same path, we should allow it.
I am quite concerned that SHOW CREATE TABLE may not run properly.
Thanks for reviewing. I would like to share my thoughts on the prohibition of setting the table location in FileSystemCatalog.

I think the `SHOW CREATE TABLE` command is usually used to create a test table with the same schema as the source table: users typically use `SHOW CREATE TABLE` to get the DDL of a table and then create a test table with a different name for testing purposes. In this scenario, the location in the DDL will inevitably mismatch the path assigned by the FileSystemCatalog for the new table, requiring users to modify the DDL in order to create the table successfully.

So even if the restrictions were relaxed, I believe it would still be somewhat confusing for users, as success would only be guaranteed when the specified location matches the one assigned by the catalog. If that's the case, why bother passing the location at all?

Therefore, instead of relaxing the checks here, I suggest we optimize the documentation by clearly stating this restriction: we can declare the location management restrictions of FileSystemCatalog and the recommended usage of Spark DDL. What do you think?
This is also our current approach to managing Iceberg tables in our platform. We host the location for all Iceberg tables on behalf of the users and strongly advise against specifying a location when creating tables. So far, this has been working well.
I still think a relaxed check would be better. For "create a test table with a different name for testing purposes", we can throw a very clear exception for this case.
> Therefore, instead of relaxing the checks here, I suggest we optimize the documentation by clearly stating this restriction. We can declare the location management restrictions of FileSystemCatalog and the recommended usage of Spark DDL. What do you think?

I'm +1 for this.
Thanks @zhongyujiang for the update. Left some comments.
I see the new commits provide the support for external tables, thanks. Can you add the following tests (a rough sketch of the flow is shown after this list)?

For the Hive catalog (can be added in DDLWithHiveCatalogTestBase):
- Create an internal table, drop it, and check whether the files are deleted.
- Create an external table using a location (and check it), drop it, and check whether the files are deleted.

For the filesystem catalog (can be added to DDLTestBase):
- All tables should be internal tables.
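For illustration only, a rough sketch of that test flow (table names and the location path are hypothetical, and it assumes a SparkSession already configured with a Paimon Hive catalog, which the real DDLWithHiveCatalogTestBase provides):

```java
import org.apache.spark.sql.SparkSession;

// Exercise internal vs. external table creation and drop; the file-level assertions
// are left as comments because they depend on the test harness.
public class ExternalTableDdlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[1]").getOrCreate();

        // Internal table: the location is managed by the catalog.
        spark.sql("CREATE TABLE t_internal (id INT) USING paimon");
        spark.sql("DROP TABLE t_internal");
        // TODO: assert that the table directory has been removed.

        // External table: the location is specified by the user.
        spark.sql("CREATE TABLE t_external (id INT) USING paimon LOCATION '/tmp/paimon/t_external'");
        spark.sql("DROP TABLE t_external");
        // TODO: assert the expected behavior for files under the specified location.

        spark.stop();
    }
}
```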
I made some updates, summary:
String path = normalizedProperties.remove(TableCatalog.PROP_LOCATION);
normalizedProperties.put(CoreOptions.PATH.key(), path);
// For v2 table, as long as it has specified the location, treat it as external
normalizedProperties.put(Catalog.EXTERNAL_PROP, "true");
We don't need this option; the Hive catalog can know whether it is an external table.
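A rough illustration of that point, using the Hive Metastore table type as one possible signal (simplified; not the actual HiveCatalog logic):

```java
import org.apache.hadoop.hive.metastore.TableType;
import org.apache.hadoop.hive.metastore.api.Table;

// The metastore already records whether a table is EXTERNAL, so an extra Paimon
// option to mark externality is redundant for the Hive catalog.
public final class ExternalTableSketch {
    public static boolean isExternal(Table hmsTable) {
        return TableType.EXTERNAL_TABLE.name().equalsIgnoreCase(hmsTable.getTableType());
    }
}
```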
+1
@Zouxxyy @JingsongLi Thanks for updating this, and I'm sorry for not following up on this PR in time.
Purpose

This relocates the 'path' property in the table options to 'location' for better presentation. Spark uses the reserved property `location` to indicate the location of the table, and in the output of `DESC EXTENDED`, the location information will be displayed under the "# Detailed Table Information" section for better visibility.

Tests

API and Format

Documentation