
[hive][spark] Support create table with specified location using hive catalog #3843

Merged: 7 commits, Nov 21, 2024

Conversation

@zhongyujiang (Contributor) commented Jul 30, 2024

Purpose

This relocates the 'path' property in the table options to 'location' for better presentation.

  • Adapt 'path' to Spark's 'location' in table props (the 'path' option is still preserved in table props)
  • Support customizing the table location when creating a table (only allowed when creating an external table using the hive catalog)

Spark uses the reserved property location to indicate the location of the table, and in the output of DESC EXTENDED, the location information will be displayed under the "# Detailed Table Information" section for better visibility.

+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                            |comment|
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
|a                           |bigint                                                                                               |NULL   |
|b                           |varchar(10)                                                                                          |NULL   |
|c                           |char(10)                                                                                             |NULL   |
|                            |                                                                                                     |       |
|# Metadata Columns          |                                                                                                     |       |
|__paimon_file_path          |string                                                                                               |       |
|__paimon_row_index          |bigint                                                                                               |       |
|                            |                                                                                                     |       |
|# Detailed Table Information|                                                                                                     |       |
|Name                        |default.testTableAs                                                                                  |       |
|Type                        |MANAGED                                                                                              |       |
|Location                    |file:/var/folders/2r/v_2n6mbj41v7q14m8f3j9q4w0000gn/T/junit3802231298897594813/default.db/testTableAs|       |
|Provider                    |paimon                                                                                               |       |
|Owner                       |zhongyujiang                                                                                         |       |
|Table Properties            |[file.format=parquet]                                                                                |       |
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+

Tests

API and Format

Documentation

@zhongyujiang (Contributor, Author)

cc @Zouxxyy @YannByron Can you help review this? Thanks!

@JingsongLi (Contributor) left a comment

-1. It is best for Spark to stay unified with the other engines; using a new option may harm interoperability between engines.

@zhongyujiang (Contributor, Author)

Hi @JingsongLi, this does not introduce a new option; rather, it adapts better to the Spark engine, because Spark has always used 'location' to represent a table's location.

the interoperability between engines

Are you suggesting that you want users to retrieve the table location information from the path option within the options when using the DESC command in different engines?

However, I would like to point out that the meaning of the DESC command inherently varies across engines and is closely related to each engine's functionality. For instance, the DESC command in Trino and Flink does not even display table options, only column information. Yet Flink can display the primary key attribute of columns, because primary keys are part of the Flink specification (this is not the case in Spark).

Therefore, I believe we should also adapt this to Spark, which would be more in line with the usage habits of Spark users.

BTW, for Iceberg and Hive tables, Spark's DESC EXTENDED DDL command displays the location information through the location field. For Paimon tables, however, the location information is hidden within the options under the path field, making it less convenient for users. This is the motivation behind this PR.

@@ -60,6 +60,8 @@ case class SparkTable(table: Table)
properties.put(CoreOptions.PRIMARY_KEY.key, String.join(",", table.primaryKeys))
}
properties.put(TableCatalog.PROP_PROVIDER, SparkSource.NAME)
val location = properties.remove(CoreOptions.PATH.key())
@Zouxxyy (Contributor)
This interface (default Map<String, String> properties()) is only used in SHOW CREATE or DESC TABLE, so adding TableCatalog.PROP_LOCATION is OK, but I wonder whether we need to remove path.

Besides, another point worth noting is that Paimon currently does not actually support CREATE TABLE x LOCATION ''. The new result of SHOW CREATE TABLE may cause misunderstanding, or we need to support it first.

Contributor

Hi @zhongyujiang, we can discuss in this thread. @Zouxxyy raised two very good points:

  1. We could keep the path option.
  2. We should also support execution of SHOW CREATE TABLE.
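The first point — expose Spark's location property in the reported table properties while still keeping the path option — can be sketched roughly as follows. This is a minimal illustration in plain Java, with the string keys "path" and "location" standing in for CoreOptions.PATH.key() and TableCatalog.PROP_LOCATION; it is not Paimon's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class TableProps {
    // Hypothetical stand-ins for CoreOptions.PATH.key() and TableCatalog.PROP_LOCATION.
    static final String PATH_KEY = "path";
    static final String LOCATION_KEY = "location";

    // Build the properties map reported by DESC / SHOW CREATE TABLE:
    // mirror 'path' into 'location' instead of removing it, so engines
    // that still read the 'path' option keep working.
    static Map<String, String> reportedProperties(Map<String, String> options) {
        Map<String, String> props = new HashMap<>(options);
        String path = props.get(PATH_KEY);
        if (path != null) {
            props.put(LOCATION_KEY, path); // 'path' is intentionally kept as well
        }
        return props;
    }
}
```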

Contributor Author

We could keep path option.

Sure

We should also support execution for show create table

I've tested this and found that the location info passed via Spark DDL is currently ignored.

IIUC, the table location is closely related to the warehouse address, and of course, different catalogs have different implementations:

  • AbstractCatalog calculates the table location based on the warehouse, database, and table name when loading a table, and both FileSystemCatalog and JdbcCatalog follow this behavior
  • HiveCatalog retrieves the table location information from the HMS (Hive Metastore)

Therefore, if we want to support this, we need to differentiate the catalog being used. This is feasible with HiveCatalog, but not with other catalogs. What do you think? @Zouxxyy @JingsongLi
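The distinction described above can be sketched in plain Java (hypothetical helper names, not Paimon's actual API): a filesystem-style catalog derives the table path deterministically from the warehouse, database, and table name, so a user-supplied location only makes sense if it matches that derived path.

```java
public class TableLocations {
    // Mirrors how an AbstractCatalog-style catalog derives a table path from the
    // warehouse root, database, and table name (assumed "<db>.db/<table>" layout).
    static String defaultTablePath(String warehouse, String db, String table) {
        return warehouse + "/" + db + ".db/" + table;
    }

    // A filesystem-style catalog can only accept a user-specified location
    // if it equals the path it would have derived itself.
    static boolean canAccept(String warehouse, String db, String table, String requested) {
        return requested == null
                || requested.equals(defaultTablePath(warehouse, db, table));
    }
}
```

A hive catalog, by contrast, stores the location in the HMS and can accept an arbitrary path.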

Contributor
For other catalogs, we can check the location; if it is not equal, throw an exception.

Contributor

See #2619.

Contributor Author

I have updated this to allow customizing the table location when using the Hive catalog and creating an external table. Please take a look when you have time. Thanks!

@zhongyujiang zhongyujiang changed the title [Spark]: Relocate the 'path' property in the table options to 'location' for better presentation [Spark]: Adapting 'path' to Spark's 'location' in table props and supporting the customization of the table location when creating a table Aug 4, 2024
@zhongyujiang zhongyujiang force-pushed the spark-desc-table branch 2 times, most recently from 27eaf8d to cd60d6c Compare August 5, 2024 13:01
@@ -117,6 +119,9 @@ protected void dropTableImpl(Identifier identifier) {

@Override
public void createTableImpl(Identifier identifier, Schema schema) {
checkArgument(
!schema.options().containsKey(CoreOptions.PATH.key()),
Contributor

Can we just relax this check? For example, if it is the same path, we should allow it.

I am quite concerned that SHOW CREATE TABLE may not run properly.

Contributor Author

Thanks for reviewing. I would like to share my thoughts on the prohibition of setting table location in FileSystemCatalog.

I think the SHOW CREATE TABLE command is usually used to create a test table with the same schema as the source table. Users typically use SHOW CREATE TABLE to get the DDL of a table and then create a test table with a different name for testing purposes. In this scenario, the location information in the DDL will inevitably mismatch the path assigned by the FileSystemCatalog for the new table, requiring users to modify the DDL in order to successfully create the table.

So even if the restrictions were relaxed, I believe it would still be somewhat confusing for users, as success would only be guaranteed when the specified location matches the one assigned by the catalog. If that's the case, why bother passing the location at all?

Therefore, instead of relaxing the checks here, I suggest we optimize the documentation by clearly stating this restriction in the documentation. We can declare the location management restrictions of FileSystemCatalog and the recommended usage of Spark DDL. What do you think?

Contributor Author

This is also our current approach to managing Iceberg tables in our platform. We host the location for all Iceberg tables on behalf of the users and strongly advise against specifying a location when creating tables. So far, this has been working well.

Contributor

I still think a relaxed check would be better. For the "create a test table with a different name for testing purposes" case, we can throw a very clear exception.

Contributor

Therefore, instead of relaxing the checks here, I suggest we optimize the documentation by clearly stating this restriction in the documentation. We can declare the location management restrictions of FileSystemCatalog and the recommended usage of Spark DDL. What do you think?

I'm +1 for this

@JingsongLi (Contributor)

Thanks @zhongyujiang for the update. Left some comments.

@Zouxxyy (Contributor) left a comment

I see the new commits provide support for external tables, thanks. Can you add the following tests:
For the hive catalog (can add in DDLWithHiveCatalogTestBase):

  1. Create an internal table, drop it, and check whether the files are deleted
  2. Create an external table using LOCATION (and check it), drop it, and check whether the files are deleted

For the filesystem catalog (can add to DDLTestBase):
All tables should be internal tables

@Zouxxyy Zouxxyy changed the title [Spark]: Adapting 'path' to Spark's 'location' in table props and supporting the customization of the table location when creating a table [hive][spark] Support create table with specified location using hive catalog Nov 20, 2024
@Zouxxyy (Contributor) commented Nov 20, 2024

I made some updates; summary:

  1. Support CREATE TABLE x LOCATION 'xxx' when using the hive catalog (for other catalogs, like the filesystem catalog, an exception will be thrown). A table created this way is treated as an external table.
  2. Dropping a managed table deletes the data files.
  3. Dropping an external table does not delete the data files.
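The resulting create-table behavior can be summarized as a small decision sketch (plain Java with hypothetical names; not the actual Paimon code):

```java
public class CreateTableCheck {
    enum CatalogKind { HIVE, FILESYSTEM, JDBC } // illustrative catalog kinds

    // If a LOCATION is specified, only the hive catalog accepts it, and the
    // table is then treated as external; other catalogs reject the DDL.
    static boolean isExternal(CatalogKind catalog, String location) {
        if (location == null) {
            return false; // no LOCATION clause: managed table
        }
        if (catalog != CatalogKind.HIVE) {
            throw new UnsupportedOperationException(
                    "CREATE TABLE ... LOCATION is only supported with the hive catalog");
        }
        return true; // hive catalog + explicit location: external table
    }
}
```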

CC @zhongyujiang @JingsongLi

String path = normalizedProperties.remove(TableCatalog.PROP_LOCATION);
normalizedProperties.put(CoreOptions.PATH.key(), path);
// For v2 table, as long as it has specified the location, treat it as external
normalizedProperties.put(Catalog.EXTERNAL_PROP, "true");
Contributor

We don't need this option; the hive catalog can know whether it is an external table.

@JingsongLi (Contributor) left a comment

+1

@JingsongLi JingsongLi merged commit 187825a into apache:master Nov 21, 2024
12 checks passed
@zhongyujiang (Contributor, Author)

@Zouxxyy @JingsongLi Thanks for updating this; sorry for not following up on this PR in time.

@zhongyujiang zhongyujiang deleted the spark-desc-table branch November 26, 2024 10:43