
Support persisting TableMetadata in the metastore #433

Open

eric-maynard wants to merge 59 commits into main
Conversation

@eric-maynard (Contributor) commented Nov 7, 2024

Description

This adds a new flag METADATA_CACHE_MAX_BYTES which allows the catalog to store table metadata in the metastore and vend it from there when loadTable is called.

Entries are cached based on the metadata location. Currently, the entire metadata.json content is cached.


Features not included in this PR:

  1. Support for updating the cache when a table is updated
  2. Support for invalidating cache entries in the background, rather than waiting for loadTable to be called
  3. Structured storage for table metadata

There is partial support for (1) here and I want to extend it, but the goal is to structure things in a way that will allow us to implement (2) and (3) in the future as well.
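For illustration only, here is a minimal sketch of the lookup flow this enables, assuming a hypothetical MetadataCache keyed by metadata location; the interface, the readMetadataFromObjectStore helper, and the length-based size check are placeholders, not the PR's actual code:

// Illustrative sketch only -- MetadataCache and readMetadataFromObjectStore are hypothetical names.
interface MetadataCache {
  java.util.Optional<String> get(String metadataLocation);
  void put(String metadataLocation, String metadataJson);
}

TableMetadata loadTableMetadata(String metadataLocation, MetadataCache cache, long maxBytes) {
  // Cache hit: vend the metadata.json content stored in the metastore.
  return cache.get(metadataLocation)
      .map(TableMetadataParser::fromJson)
      .orElseGet(() -> {
        // Cache miss: fall back to reading metadata.json from object storage.
        String metadataJson = readMetadataFromObjectStore(metadataLocation);
        // Persist in the metastore only when it fits under METADATA_CACHE_MAX_BYTES.
        if (metadataJson.length() <= maxBytes) {
          cache.put(metadataLocation, metadataJson);
        }
        return TableMetadataParser.fromJson(metadataJson);
      });
}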

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Existing tests vend table metadata correctly when caching is enabled.

Added a small test in BasePolarisCatalogTest to cover the basic semantics of caching.

Manual testing with EclipseLink -- I observed the entities getting created in Postgres and saw large metadata being cached:

db=# select length(internalproperties), substring(internalproperties, 1, 1000) from entities where id = 152;
...
 768691 | {"metadata_location":"file:/tmp/quickstart/ns/tn1731005976265/metadata/00000-e77a2576-5efa-4b7a-b948-121813d713f8.metadata.json","content":"{\"format ...

With MySQL, small metadata is persisted:

mysql> SELECT length(internalproperties), substring(internalproperties, 1, 1000) FROM ENTITIES WHERE id = (SELECT MAX(id) FROM ENTITIES WHERE typecode = 10);
. . .
8159 | {"metadata_location":"file:/tmp/quickstart/ns/t2/metadata/00000-64f975bd-c3a8-4069-bb56-f282003e9157.metadata.json","content":"{\"format-version\"

However, large metadata may cause internalproperties to exceed the column size limit, in which case nothing is cached; calls still return safely.

@snazy (Member) commented Nov 25, 2024

Actually, I think loading/saving table metadata should be done transparently, using a delegating code architecture.
Think: a (REST) resource asks for metadata "abc" using some interface IcebergMetadataStore (or so). There would be multiple implementations of that interface:

  • one that interacts with the actual object store
  • one that is implemented as a delegate to the one that interacts with the object store, but also provides a caching mechanism

That way the calling code does not need to care about caching at all, and using such an interface with various implementations/delegates keeps testing quite isolated.
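To make the idea concrete, a minimal sketch of such an interface and a caching delegate (only the name IcebergMetadataStore comes from the comment above; the class names and the simple map-based cache are assumptions for illustration):

interface IcebergMetadataStore {
  String loadMetadata(String metadataLocation);
}

// Implementation that talks to the actual object store.
class ObjectStoreMetadataStore implements IcebergMetadataStore {
  @Override
  public String loadMetadata(String metadataLocation) {
    // read metadata.json from the object store (omitted in this sketch)
    throw new UnsupportedOperationException("sketch only");
  }
}

// Delegate that adds caching in front of any other implementation.
class CachingMetadataStore implements IcebergMetadataStore {
  private final IcebergMetadataStore delegate;
  private final java.util.concurrent.ConcurrentMap<String, String> cache =
      new java.util.concurrent.ConcurrentHashMap<>();

  CachingMetadataStore(IcebergMetadataStore delegate) {
    this.delegate = delegate;
  }

  @Override
  public String loadMetadata(String metadataLocation) {
    return cache.computeIfAbsent(metadataLocation, delegate::loadMetadata);
  }
}

The calling code only ever sees IcebergMetadataStore, so caching (or tracing, metrics, and so on) can be layered in or removed without touching the REST resource, and each layer can be tested in isolation.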

@eric-maynard (Contributor, Author) commented Nov 25, 2024

It would be nice to have the information in a more structured way, but also efficiently accessible.

Yes @snazy, I completely agree here. It's part of why I'm advocating for the loadTableV2 proposal in Iceberg.

Even for this workstream, I noted at the very beginning that this is a non-goal for this PR:

Structured storage for table metadata

I think it will be important to add in the future, but I think this PR is valuable as-is without this optimization.

Actually, I think loading/saving table metadata should be done transparently, using a delegating code architecture.
Think: a (REST) resource asks for metadata "abc" using some interface IcebergMetadataStore (or so). There would be multiple implementations of that interface:

I think this idea sounds good. Ideally, if loadTableV2 goes in, we could couple our interface with that API.

@snazy (Member) commented Nov 25, 2024

I think we can already experiment with and reason about good approaches for how to "chunk" the huge table/view/UDF metadata blobs.

My concerns about overly increased heap pressure, especially when metadata becomes big/bigger/huge, still stand. We should really not rush things that have the potential to excessively over-use resources.

@snazy (Member) commented Nov 25, 2024

However, it would help a lot if different functionalities (think: persistence and cache) are abstracted and separated. So some REST resource implementation calls "loadMetadataOfThatTable()" on some interface - the first implementation is maybe a tracing facade, the next delegate is a caching layer, the next persistence metrics, and the next the actual persistence layer.
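Reusing the hypothetical IcebergMetadataStore sketch from the earlier comment, the chain described here might be wired roughly like this (every wrapper class name is assumed, not existing code):

// Hypothetical delegate chain: tracing -> caching -> persistence metrics -> actual persistence.
IcebergMetadataStore store =
    new TracingMetadataStore(
        new CachingMetadataStore(
            new PersistenceMetricsMetadataStore(
                new ObjectStoreMetadataStore())));

// The REST resource implementation only calls the interface:
String metadataJson = store.loadMetadata("s3://bucket/ns/table/metadata/v1.metadata.json");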

@eric-maynard (Contributor, Author) commented Nov 25, 2024

Unless I am missing something, this is "safe" if METADATA_CACHE_MAX_BYTES is set to a relatively low value, right?

In the absence of testing showing that there is measurable performance degradation due to churning large strings in the heap when METADATA_CACHE_MAX_BYTES is set to e.g. 10000, I posit that the performance is probably OK for now. Indeed, in the limited testing I have done, performance is actually much better compared to a solution that has no caching and always needs to go to object storage for the metadata.

If a benchmark attached to the PR would make you feel better about the approach, I would be happy to add something like that -- but if you think the approach is totally dead in the water we should have that discussion.

@snazy (Member) commented Nov 25, 2024

when METADATA_CACHE_MAX_BYTES is set to e.g. 10000, I posit that the performance is probably OK for now.

Still, your approach re-serializes especially the big ones over and over again - and in fact, that is completely unbounded. To be clear: that's opening the door wider for a potential denial-of-service due to out-of-memory situations.

@RussellSpitzer (Member) commented:

when METADATA_CACHE_MAX_BYTES is set to e.g. 10000, I posit that the performance is probably OK for now.

Still, your approach re-serializes especially the big ones over and over again - and in fact, that is completely unbounded. To be clear: that's opening the door wider for a potential denial-of-service due to out-of-memory situations.

Doesn't the catalog always have to do this regardless? We have to load the metadata.json from disk no matter what, so the risk from unbounded JSON is always present regardless of whether the catalog also writes and reads a copy for itself.

@snazy (Member) commented Nov 25, 2024

Doesn't the catalog always have to do this regardless? We have to load the metadata.json from disk no matter what, so the risk from unbounded JSON is always present regardless of whether the catalog also writes and reads a copy for itself.

The answer is that a catalog does not technically have to materialize it in full in memory. It could stream it directly from the object store - or in chunks from another source.

@RussellSpitzer The bigger problem than materializing the big thing once is that this change RE-serializes it again - and that is the real issue.

@eric-maynard (Contributor, Author) commented Nov 25, 2024

The answer is that a catalog does not technically have to materialize it in full in memory. It could stream it directly from the object store - or in chunks from another source.

This is true, but we are not talking about what an arbitrary catalog technically has to do; we are talking about what Polaris actually does.

In the future I think we should optimize this with a structured schema in the metastore (perhaps that more closely matches the schema of the Java TableMetadata object) -- and when we do, we should optimize the cache as well.

The question is whether this future optimization is a blocker for adding caching on top of the current behavior.

this change RE-serializes it again

I have not observed much overhead associated with this. If this is really the key point here, let's add a benchmark?
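If a benchmark does get added, a minimal JMH sketch along these lines could isolate the re-serialization cost; the tiny single-column schema here is just a stand-in, and a real benchmark would load a representative large metadata.json instead:

import java.util.Collections;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableMetadataParser;
import org.apache.iceberg.types.Types;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class MetadataSerdeBenchmark {
  private TableMetadata metadata;

  @Setup
  public void setup() {
    // Stand-in metadata; a realistic run would parse a large production metadata.json here.
    metadata = TableMetadata.newTableMetadata(
        new Schema(Types.NestedField.required(1, "id", Types.LongType.get())),
        PartitionSpec.unpartitioned(),
        "file:/tmp/bench-table",
        Collections.emptyMap());
  }

  @Benchmark
  public String reserializeMetadata() {
    // Measures the extra TableMetadataParser.toJson pass that the caching path performs.
    return TableMetadataParser.toJson(metadata);
  }
}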

case PolarisConfiguration.METADATA_CACHE_MAX_BYTES_INFINITE_CACHING -> true;
case PolarisConfiguration.METADATA_CACHE_MAX_BYTES_NO_CACHING -> false;
default -> {
  metadataJson = TableMetadataParser.toJson(metadata);
@eric-maynard (Contributor, Author) commented on the diff:

@snazy in this way, the re-serialization only happens when the feature is enabled.

@RussellSpitzer (Member) commented:

@RussellSpitzer The bigger problem than materializing the big thing once is that this change RE-serializes it again - and that is the real issue.

I don't think this is actually a big deal, especially if users can decide whether or not to use it. Users can potentially reduce their latency now at the cost of more memory usage and possible CPU cost. I honestly think most folks would want this now regardless of the current cost, since I've seen several private implementations of REST catalogs that do this. I would also probably have it on by default.

@eric-maynard (Contributor, Author) commented Nov 25, 2024

@snazy sorry for all the pings. I took another look at your comments and the PR, and I identified one idea that I wanted to get your POV on.

I added this commit to bound the extra serde.

I am not in love with this exact implementation, but what do you think about an approach like this? Does this address your concerns around unbounded serde in the event that someone has a metadata.json which exceeds the configured limit?

this.key = key;
this.description = description;
this.defaultValue = defaultValue;
this.catalogConfigImpl = catalogConfig;
this.typ = (Class<T>) defaultValue.getClass();
this.validation = validation;

validate(cast(defaultValue));
@RussellSpitzer (Member) commented Nov 25, 2024 on the diff:

Is this just a safety check? Feels like this shouldn't be able to break, right?

@RussellSpitzer (Member) commented Nov 25, 2024:

Also, isn't this redundant now? (cast calls validate)

So wouldn't this be

validate(
   cast(defaultValue) {
      validate(defaultValue)
   }
)

@@ -53,14 +62,27 @@ public String catalogConfig() {
}

T cast(Object value) {
A reviewer (Member) commented on the diff:

I kinda want to call this apply() now instead of cast ... :)

Just because it does dual duty now

TableMetadataParser.toJson(metadata, generator);
generator.flush();
String result = boundedWriter.toString();
if (boundedWriter.isLimitExceeded()) {
@eric-maynard (Contributor, Author) commented on the diff:

A previous implementation had BoundedStringWriter throw an exception when the limit is hit, so the JsonGenerator actually stops instead of just sending strings into the void.

I am okay going back to this, but I don't like using an exception as flow control. WDYT @snazy ?
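For readers following along, a rough sketch of the non-throwing variant being described, where the writer simply records that the limit was exceeded; this is an illustration, not the PR's actual BoundedStringWriter:

// Sketch: stops retaining output once the limit is hit and records that fact,
// so the caller can discard the truncated result instead of growing an unbounded string.
class BoundedStringWriter extends java.io.StringWriter {
  private final int limit;
  private boolean limitExceeded = false;

  BoundedStringWriter(int limit) {
    this.limit = limit;
  }

  @Override
  public void write(char[] cbuf, int off, int len) {
    if (limitExceeded || getBuffer().length() + len > limit) {
      limitExceeded = true;
      return;
    }
    super.write(cbuf, off, len);
  }

  @Override
  public void write(String str) {
    if (limitExceeded || getBuffer().length() + str.length() > limit) {
      limitExceeded = true;
      return;
    }
    super.write(str);
  }

  boolean isLimitExceeded() {
    return limitExceeded;
  }
}

A throwing variant would let the JsonGenerator stop early, at the cost of using an exception for flow control, which is exactly the trade-off raised above.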

@snazy (Member) commented Nov 26, 2024

I added this commit to bound the extra serde.

I am not in love with this exact implementation, but what do you think about an approach like this? Does this address your concerns around unbounded serde in the event that someone has a metadata.json which exceeds the configured limit?

Not sure - but the direction is correct.

However, I think there's a much bigger problem introduced by this PR: the permanent heap usage. Even if you bound each metadata to 1MB, that can accumulate to 100,000 * 1MB (~= 95GB) of permanent heap usage. With 10MB that's nearly 1TB of heap. Plus all the other attributes, plus heap pressure during runtime. (100,000 entries is the hard-coded limit in o.a.p.core.persistence.cache.EntityCache.)

@eric-maynard (Contributor, Author) commented Nov 26, 2024

Hi @snazy; I agree that the EntityCache is potentially a problem. I think it can be optimized pretty easily with a Weigher. Although the metastore caching is disabled by default, do you view this as a blocker by itself?

One very simple thing we could do in this PR is just adjust to use weakValues.

Indeed, in older versions of this PR I originally had the metadata skip the EntityCache entirely, but @dennishuo convinced me that in many cases it's quite useful to have the in-memory caching.
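For example, assuming the EntityCache is Caffeine-based, a Weigher-bounded configuration could look roughly like this; the weight budget, the CachedEntity type, and its size method are placeholders, not actual Polaris code:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

interface CachedEntity {
  int approximateSizeInBytes(); // placeholder for however the real entity exposes its size
}

// Bound the cache by approximate bytes rather than only by entry count.
Cache<Long, CachedEntity> entityCache =
    Caffeine.newBuilder()
        .maximumWeight(1_000_000_000L) // e.g. ~1 GB budget for cached entities
        .softValues()                  // or weakValues(), letting the GC reclaim entries under pressure
        .weigher((Long id, CachedEntity entity) -> entity.approximateSizeInBytes())
        .build();

Either knob (a byte-based Weigher or weak/soft values) would keep large cached metadata from pinning tens of gigabytes of heap the way a pure entry-count limit can.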

@snazy (Member) commented Nov 27, 2024

Yes, the heap usage leading to OOMs is a blocker.

@eric-maynard (Contributor, Author) commented:
Sounds good @snazy; please see #490 for a proposal on how we can limit the size of the EntityCache.
