
[hive][spark] Support create table with specified location using hive catalog #3843

Merged: 7 commits, Nov 21, 2024

Conversation

@zhongyujiang (Contributor) commented Jul 30, 2024

Purpose

This relocates the 'path' property in the table options to 'location' for better presentation.

  • Adapt 'path' to Spark's 'location' in table props (the 'path' option is still preserved in table props)
  • Support customizing the table location when creating a table (only allowed when creating an external table using the hive catalog)

Spark uses the reserved property location to indicate the location of the table, and in the output of DESC EXTENDED, the location information will be displayed under the "# Detailed Table Information" section for better visibility.

+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                            |comment|
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
|a                           |bigint                                                                                               |NULL   |
|b                           |varchar(10)                                                                                          |NULL   |
|c                           |char(10)                                                                                             |NULL   |
|                            |                                                                                                     |       |
|# Metadata Columns          |                                                                                                     |       |
|__paimon_file_path          |string                                                                                               |       |
|__paimon_row_index          |bigint                                                                                               |       |
|                            |                                                                                                     |       |
|# Detailed Table Information|                                                                                                     |       |
|Name                        |default.testTableAs                                                                                  |       |
|Type                        |MANAGED                                                                                              |       |
|Location                    |file:/var/folders/2r/v_2n6mbj41v7q14m8f3j9q4w0000gn/T/junit3802231298897594813/default.db/testTableAs|       |
|Provider                    |paimon                                                                                               |       |
|Owner                       |zhongyujiang                                                                                         |       |
|Table Properties            |[file.format=parquet]                                                                                |       |
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+

Tests

API and Format

Documentation

@zhongyujiang (Contributor, Author)

cc @Zouxxyy @YannByron Can you help review this? Thanks!

@JingsongLi (Contributor) left a comment

-1. It is best for Spark to stay unified with the other engines; using a new option may harm interoperability between engines.

@zhongyujiang (Contributor, Author)

Hi @JingsongLi, this does not introduce a new option; rather, it adapts better to the Spark engine, because Spark has always used 'location' to represent a table's location.

the interoperability between engines

Are you suggesting that you want users to retrieve the table location information from the path option within the options when using the DESC command in different engines?

However, I would like to point out that the meaning of the DESC command inherently varies across engines and is closely related to each engine's functionality. For instance, the DESC command in Trino and Flink does not even display table options, only column information. Yet Flink can display the primary key attribute of columns, because primary keys are part of the Flink specification (this is not the case in Spark).

Therefore, I believe we should also adapt this to Spark, which would be more in line with the usage habits of Spark users.

BTW, for Iceberg and Hive tables, Spark's DESC EXTENDED DDL command displays the location information through the location field. For Paimon tables, however, the location information is hidden within the options under the path field, making it less convenient for users. This is the motivation behind this PR.

@@ -60,6 +60,8 @@ case class SparkTable(table: Table)
properties.put(CoreOptions.PRIMARY_KEY.key, String.join(",", table.primaryKeys))
}
properties.put(TableCatalog.PROP_PROVIDER, SparkSource.NAME)
val location = properties.remove(CoreOptions.PATH.key())
@Zouxxyy (Contributor)
This interface (default Map<String, String> properties()) is only used in SHOW CREATE or DESC TABLE, so adding TableCatalog.PROP_LOCATION is OK, but I wonder whether we need to remove path.

Besides, another point worth noting is that Paimon currently does not actually support CREATE TABLE x LOCATION ''. The new result of SHOW CREATE TABLE may cause misunderstanding, or we need to support it first.

Contributor

Hi @zhongyujiang, we can discuss in this thread. @Zouxxyy raised two very good points:

  1. We could keep the path option.
  2. We should also support execution of SHOW CREATE TABLE.
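The first point — expose Spark's location property in the reported table properties while still keeping the path option — can be sketched roughly as follows. This is a minimal illustration in plain Java, with the string keys "path" and "location" standing in for CoreOptions.PATH.key() and TableCatalog.PROP_LOCATION; it is not Paimon's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class TableProps {
    // Hypothetical stand-ins for CoreOptions.PATH.key() and TableCatalog.PROP_LOCATION.
    static final String PATH_KEY = "path";
    static final String LOCATION_KEY = "location";

    // Build the properties map reported by DESC / SHOW CREATE TABLE:
    // mirror 'path' into 'location' instead of removing it, so engines
    // that still read the 'path' option keep working.
    static Map<String, String> reportedProperties(Map<String, String> options) {
        Map<String, String> props = new HashMap<>(options);
        String path = props.get(PATH_KEY);
        if (path != null) {
            props.put(LOCATION_KEY, path); // 'path' is intentionally kept as well
        }
        return props;
    }
}
```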

Contributor Author

We could keep path option.

Sure

We should also support execution for show create table

I've tested this and found that the location info passed via Spark DDL is currently ignored.

IIUC, the table location is closely related to the warehouse address, and of course, different catalogs have different implementations:

  • AbstractCatalog calculates the table location based on the warehouse, database, and table name when loading a table, and both FileSystemCatalog and JdbcCatalog follow this behavior
  • HiveCatalog retrieves the table location information from the HMS (Hive Metastore)

Therefore, if we want to support this, we need to differentiate the catalog being used. This is feasible with HiveCatalog, but not with other catalogs. What do you think? @Zouxxyy @JingsongLi
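The distinction described above can be sketched in plain Java (hypothetical helper names, not Paimon's actual API): a filesystem-style catalog derives the table path deterministically from the warehouse, database, and table name, so a user-supplied location only makes sense if it matches that derived path.

```java
public class TableLocations {
    // Mirrors how an AbstractCatalog-style catalog derives a table path from the
    // warehouse root, database, and table name (assumed "<db>.db/<table>" layout).
    static String defaultTablePath(String warehouse, String db, String table) {
        return warehouse + "/" + db + ".db/" + table;
    }

    // A filesystem-style catalog can only accept a user-specified location
    // if it equals the path it would have derived itself.
    static boolean canAccept(String warehouse, String db, String table, String requested) {
        return requested == null
                || requested.equals(defaultTablePath(warehouse, db, table));
    }
}
```

A hive catalog, by contrast, stores the location in the HMS and can accept an arbitrary path.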

Contributor
For other catalogs, we can check the location; if it is not equal, throw an exception.

Contributor

See #2619.

Contributor Author

I have updated this to allow customizing the table location when using the Hive catalog and creating an external table. Please take a look when you have time. Thanks!

@zhongyujiang zhongyujiang changed the title [Spark]: Relocate the 'path' property in the table options to 'location' for better presentation [Spark]: Adapting 'path' to Spark's 'location' in table props and supporting the customization of the table location when creating a table Aug 4, 2024
@zhongyujiang zhongyujiang force-pushed the spark-desc-table branch 2 times, most recently from 27eaf8d to cd60d6c Compare August 5, 2024 13:01
@@ -117,6 +119,9 @@ protected void dropTableImpl(Identifier identifier) {

@Override
public void createTableImpl(Identifier identifier, Schema schema) {
checkArgument(
!schema.options().containsKey(CoreOptions.PATH.key()),
Contributor

Can we just relax this check? For example, if it is the same path, we should allow it.

I am quite concerned that SHOW CREATE TABLE may not run properly.

Contributor Author

Thanks for reviewing. I would like to share my thoughts on the prohibition of setting table location in FileSystemCatalog.

I think the SHOW CREATE TABLE command is usually used to create a test table with the same schema as the source table. Users typically use SHOW CREATE TABLE to get the DDL of a table and then create a test table with a different name for testing purposes. In this scenario, the location information in the DDL will inevitably mismatch the path assigned by the FileSystemCatalog for the new table, requiring users to modify the DDL in order to successfully create the table.

So even if the restrictions were relaxed, I believe it would still be somewhat confusing for users, as success would only be guaranteed when the specified location matches the one assigned by the catalog. If that's the case, why bother passing the location at all?

Therefore, instead of relaxing the checks here, I suggest we optimize the documentation by clearly stating this restriction in the documentation. We can declare the location management restrictions of FileSystemCatalog and the recommended usage of Spark DDL. What do you think?

Contributor Author

This is also our current approach to managing Iceberg tables in our platform. We host the location for all Iceberg tables on behalf of the users and strongly advise against specifying a location when creating tables. So far, this has been working well.

Contributor

I still think a relaxed check would be better. For the "create a test table with a different name for testing purposes" case, we can throw a very clear exception.

Contributor

Therefore, instead of relaxing the checks here, I suggest we optimize the documentation by clearly stating this restriction in the documentation. We can declare the location management restrictions of FileSystemCatalog and the recommended usage of Spark DDL. What do you think?

I'm +1 for this

@JingsongLi (Contributor)

Thanks @zhongyujiang for the update. Left some comments.

@Zouxxyy (Contributor) left a comment

I see the new commits provide support for external tables, thanks. Can you add the following tests:
For the hive catalog (can add in DDLWithHiveCatalogTestBase):

  1. Create an internal table, drop it, and check whether the files are deleted
  2. Create an external table using LOCATION (and check it), drop it, and check whether the files are deleted

For the filesystem catalog (can add to DDLTestBase):
All tables should be internal tables

@Zouxxyy Zouxxyy changed the title [Spark]: Adapting 'path' to Spark's 'location' in table props and supporting the customization of the table location when creating a table [hive][spark] Support create table with specified location using hive catalog Nov 20, 2024
@Zouxxyy (Contributor) commented Nov 20, 2024

I made some updates; summary:

  1. Support CREATE TABLE x LOCATION 'xxx' when using the hive catalog (for other catalogs, like the filesystem catalog, an exception will be thrown). A table created this way is treated as an external table.
  2. Dropping a managed table deletes the data files.
  3. Dropping an external table does not delete the data files.
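The resulting create-table behavior can be summarized as a small decision sketch (plain Java with hypothetical names; not the actual Paimon code):

```java
public class CreateTableCheck {
    enum CatalogKind { HIVE, FILESYSTEM, JDBC } // illustrative catalog kinds

    // If a LOCATION is specified, only the hive catalog accepts it, and the
    // table is then treated as external; other catalogs reject the DDL.
    static boolean isExternal(CatalogKind catalog, String location) {
        if (location == null) {
            return false; // no LOCATION clause: managed table
        }
        if (catalog != CatalogKind.HIVE) {
            throw new UnsupportedOperationException(
                    "CREATE TABLE ... LOCATION is only supported with the hive catalog");
        }
        return true; // hive catalog + explicit location: external table
    }
}
```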

CC @zhongyujiang @JingsongLi

String path = normalizedProperties.remove(TableCatalog.PROP_LOCATION);
normalizedProperties.put(CoreOptions.PATH.key(), path);
// For v2 table, as long as it has specified the location, treat it as external
normalizedProperties.put(Catalog.EXTERNAL_PROP, "true");
Contributor

We don't need this option; the hive catalog can know whether it is an external table.

@JingsongLi (Contributor) left a comment

+1

@JingsongLi JingsongLi merged commit 187825a into apache:master Nov 21, 2024
12 checks passed
@zhongyujiang (Contributor, Author)

@Zouxxyy @JingsongLi Thanks for updating this; sorry for not following up on this PR in time.

@zhongyujiang zhongyujiang deleted the spark-desc-table branch November 26, 2024 10:43