
Support persisting TableMetadata in the metastore #433

Open

eric-maynard wants to merge 59 commits into main
Conversation

@eric-maynard (Contributor) commented Nov 7, 2024

Description

This adds a new flag METADATA_CACHE_MAX_BYTES which allows the catalog to store table metadata in the metastore and vend it from there when loadTable is called.

Entries are cached based on the metadata location. Currently, the entire metadata.json content is cached.


Features not included in this PR:

  1. Support for updating the cache when a table is updated
  2. Support for invalidating cache entries in the background, rather than waiting for loadTable to be called
  3. Structured storage for table metadata

There is partial support for (1) here and I want to extend it, but the goal is to structure things in a way that will allow us to implement (2) and (3) in the future as well.
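For illustration only, here is a minimal sketch of the lookup flow this enables, assuming a hypothetical MetadataCache keyed by metadata location; the interface, the readMetadataFromObjectStore helper, and the length-based size check are placeholders, not the PR's actual code:

// Illustrative sketch only -- MetadataCache and readMetadataFromObjectStore are hypothetical names.
interface MetadataCache {
  java.util.Optional<String> get(String metadataLocation);
  void put(String metadataLocation, String metadataJson);
}

TableMetadata loadTableMetadata(String metadataLocation, MetadataCache cache, long maxBytes) {
  // Cache hit: vend the metadata.json content stored in the metastore.
  return cache.get(metadataLocation)
      .map(TableMetadataParser::fromJson)
      .orElseGet(() -> {
        // Cache miss: fall back to reading metadata.json from object storage.
        String metadataJson = readMetadataFromObjectStore(metadataLocation);
        // Persist in the metastore only when it fits under METADATA_CACHE_MAX_BYTES.
        if (metadataJson.length() <= maxBytes) {
          cache.put(metadataLocation, metadataJson);
        }
        return TableMetadataParser.fromJson(metadataJson);
      });
}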

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Existing tests vend table metadata correctly when caching is enabled.

Added a small test in BasePolarisCatalogTest to cover the basic semantics of caching.

Manual testing with EclipseLink -- I observed the entities getting created in Postgres and saw large metadata being cached:

db=# select length(internalproperties), substring(internalproperties, 1, 1000) from entities where id = 152;
...
 768691 | {"metadata_location":"file:/tmp/quickstart/ns/tn1731005976265/metadata/00000-e77a2576-5efa-4b7a-b948-121813d713f8.metadata.json","content":"{\"format ...

With MySQL, small metadata is persisted:

mysql> SELECT length(internalproperties), substring(internalproperties, 1, 1000) FROM ENTITIES WHERE id = (SELECT MAX(id) FROM ENTITIES WHERE typecode = 10);
. . .
8159 | {"metadata_location":"file:/tmp/quickstart/ns/t2/metadata/00000-64f975bd-c3a8-4069-bb56-f282003e9157.metadata.json","content":"{\"format-version\"

However, large metadata may cause internalproperties to exceed the column size limit, in which case nothing is cached; calls still return safely.

@snazy (Member) commented Nov 25, 2024

Actually, I think loading/saving table metadata should be done transparently, using a delegating code architecture.
Think: a (REST) resource asks for metadata "abc" using some interface IcebergMetadataStore (or so). There would be multiple implementations of that interface:

  • one that interacts with the actual object store
  • one that is implemented as a delegate to the one that interacts with the object store, but also provides a caching mechanism

That way the calling code does not need to care about caching at all, and using such an interface with various implementations/delegates keeps testing quite isolated.
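To make the idea concrete, a minimal sketch of such an interface and a caching delegate (only the name IcebergMetadataStore comes from the comment above; the class names and the simple map-based cache are assumptions for illustration):

interface IcebergMetadataStore {
  String loadMetadata(String metadataLocation);
}

// Implementation that talks to the actual object store.
class ObjectStoreMetadataStore implements IcebergMetadataStore {
  @Override
  public String loadMetadata(String metadataLocation) {
    // read metadata.json from the object store (omitted in this sketch)
    throw new UnsupportedOperationException("sketch only");
  }
}

// Delegate that adds caching in front of any other implementation.
class CachingMetadataStore implements IcebergMetadataStore {
  private final IcebergMetadataStore delegate;
  private final java.util.concurrent.ConcurrentMap<String, String> cache =
      new java.util.concurrent.ConcurrentHashMap<>();

  CachingMetadataStore(IcebergMetadataStore delegate) {
    this.delegate = delegate;
  }

  @Override
  public String loadMetadata(String metadataLocation) {
    return cache.computeIfAbsent(metadataLocation, delegate::loadMetadata);
  }
}

The calling code only ever sees IcebergMetadataStore, so caching (or tracing, metrics, and so on) can be layered in or removed without touching the REST resource, and each layer can be tested in isolation.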

@eric-maynard (Contributor, Author) commented Nov 25, 2024

It would be nice to have the information in a more structured way, but also efficiently accessible.

Yes @snazy, I completely agree here. It's part of why I'm advocating for the loadTableV2 proposal in Iceberg.

Even for this workstream, I noted at the very beginning that this is a non-goal for this PR:

Structured storage for table metadata

I think it will be important to add in the future, but I think this PR is valuable as-is without this optimization.

Actually, I think loading/saving table metadata should be done transparently, using a delegating code architecture.
Think: a (REST) resource asks for metadata "abc" using some interface IcebergMetadataStore (or so). There would be multiple implementations of that interface:

I think this idea sounds good. Ideally, if loadTableV2 goes in, we could couple our interface with that API.

@snazy (Member) commented Nov 25, 2024

I think we can already experiment with and reason about good approaches for how to "chunk" the huge table/view/UDF metadata blobs.

My concerns about overly increased heap pressure, especially when metadata becomes big/bigger/huge, still stand. We should really not rush things that have the potential to excessively over-use resources.

@snazy (Member) commented Nov 25, 2024

However, it would help a lot if different functionalities (think: persistence and cache) are abstracted and separated. So some REST resource implementation calls "loadMetadataOfThatTable()" on some interface - the first implementation is maybe a tracing facade, the next delegate is a caching layer, the next persistence metrics, and the next the actual persistence layer.
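Reusing the hypothetical IcebergMetadataStore sketch from the earlier comment, the chain described here might be wired roughly like this (every wrapper class name is assumed, not existing code):

// Hypothetical delegate chain: tracing -> caching -> persistence metrics -> actual persistence.
IcebergMetadataStore store =
    new TracingMetadataStore(
        new CachingMetadataStore(
            new PersistenceMetricsMetadataStore(
                new ObjectStoreMetadataStore())));

// The REST resource implementation only calls the interface:
String metadataJson = store.loadMetadata("s3://bucket/ns/table/metadata/v1.metadata.json");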

@eric-maynard (Contributor, Author) commented Nov 25, 2024

Unless I am missing something, this is "safe" if METADATA_CACHE_MAX_BYTES is set to a relatively low value, right?

In the absence of testing showing that there is measurable performance degradation due to churning large strings in the heap when METADATA_CACHE_MAX_BYTES is set to e.g. 10000, I posit that the performance is probably OK for now. Indeed, in the limited testing I have done, performance is actually much better compared to a solution that has no caching and always needs to go to object storage for the metadata.

If a benchmark attached to the PR would make you feel better about the approach, I would be happy to add something like that -- but if you think the approach is totally dead in the water we should have that discussion.

@snazy (Member) commented Nov 25, 2024

when METADATA_CACHE_MAX_BYTES is set to e.g. 10000, I posit that the performance is probably OK for now.

Still, your approach re-serializes especially the big ones over and over again - and in fact, that is completely unbounded. To be clear: that's opening the door wider for a potential denial-of-service due to out-of-memory situations.

@RussellSpitzer (Member) commented:

when METADATA_CACHE_MAX_BYTES is set to e.g. 10000, I posit that the performance is probably OK for now.

Still, your approach re-serializes especially the big ones over and over again - and in fact, that is completely unbounded. To be clear: that's opening the door wider for a potential denial-of-service due to out-of-memory situations.

Doesn't the catalog always have to do this regardless? We have to load the metadata.json from disk no matter what, so the risk from unbounded JSON is always present regardless of whether the catalog also writes and reads a copy for itself.

@snazy (Member) commented Nov 25, 2024

Doesn't the catalog always have to do this regardless? We have to load the metadata.json from disk no matter what, so the risk from unbounded JSON is always present regardless of whether the catalog also writes and reads a copy for itself.

The answer is that a catalog does not technically have to materialize it in full in memory. It could stream it directly from the object store - or in chunks from another source.

@RussellSpitzer The bigger problem than materializing the big thing once is that this change RE-serializes it again - and that is the real issue.

@eric-maynard (Contributor, Author) commented Nov 25, 2024

The answer is that a catalog does not technically have to materialize it in full in memory. It could stream it directly from the object store - or in chunks from another source.

This is true, but we are not talking about what an arbitrary catalog technically has to do; we are talking about what Polaris actually does.

In the future I think we should optimize this with a structured schema in the metastore (perhaps that more closely matches the schema of the Java TableMetadata object) -- and when we do, we should optimize the cache as well.

The question is whether this future optimization is a blocker for adding caching on top of the current behavior.

this change RE-serializes it again

I have not observed much overhead associated with this. If this is really the key point here, let's add a benchmark?
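If a benchmark does get added, a minimal JMH sketch along these lines could isolate the re-serialization cost; the tiny single-column schema here is just a stand-in, and a real benchmark would load a representative large metadata.json instead:

import java.util.Collections;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableMetadataParser;
import org.apache.iceberg.types.Types;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class MetadataSerdeBenchmark {
  private TableMetadata metadata;

  @Setup
  public void setup() {
    // Stand-in metadata; a realistic run would parse a large production metadata.json here.
    metadata = TableMetadata.newTableMetadata(
        new Schema(Types.NestedField.required(1, "id", Types.LongType.get())),
        PartitionSpec.unpartitioned(),
        "file:/tmp/bench-table",
        Collections.emptyMap());
  }

  @Benchmark
  public String reserializeMetadata() {
    // Measures the extra TableMetadataParser.toJson pass that the caching path performs.
    return TableMetadataParser.toJson(metadata);
  }
}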

case PolarisConfiguration.METADATA_CACHE_MAX_BYTES_INFINITE_CACHING -> true;
case PolarisConfiguration.METADATA_CACHE_MAX_BYTES_NO_CACHING -> false;
default -> {
  metadataJson = TableMetadataParser.toJson(metadata);
@eric-maynard (Contributor, Author) commented on the diff:

@snazy in this way, the re-serialization only happens when the feature is enabled.

@RussellSpitzer (Member) commented:

@RussellSpitzer The bigger problem than materializing the big thing once is that this change RE-serializes it again - and that is the real issue.

I don't think this is actually a big deal, especially if users can decide whether or not to use it. Users can potentially reduce their latency now at the cost of more memory usage and possible CPU cost. I honestly think most folks would want this now regardless of the current cost, since I've seen several private implementations of REST catalogs that do this. I would also probably have it on by default.

@eric-maynard (Contributor, Author) commented Nov 25, 2024

@snazy sorry for all the pings. I took another look at your comments and the PR, and I identified one idea that I wanted to get your POV on.

I added this commit to bound the extra serde.

I am not in love with this exact implementation, but what do you think about an approach like this? Does this address your concerns around unbounded serde in the event that someone has a metadata.json which exceeds the configured limit?

this.key = key;
this.description = description;
this.defaultValue = defaultValue;
this.catalogConfigImpl = catalogConfig;
this.typ = (Class<T>) defaultValue.getClass();
this.validation = validation;

validate(cast(defaultValue));
@RussellSpitzer (Member) commented Nov 25, 2024 on the diff:

Is this just a safety check? Feels like this shouldn't be able to break, right?

@RussellSpitzer (Member) commented Nov 25, 2024:

Also, isn't this redundant now? (cast calls validate)

So wouldn't this be

validate(
   cast(defaultValue) {
      validate(defaultValue)
   }
)

@@ -53,14 +62,27 @@ public String catalogConfig() {
}

T cast(Object value) {
A reviewer (Member) commented on the diff:

I kinda want to call this apply() now instead of cast ... :)

Just because it does dual duty now

TableMetadataParser.toJson(metadata, generator);
generator.flush();
String result = boundedWriter.toString();
if (boundedWriter.isLimitExceeded()) {
@eric-maynard (Contributor, Author) commented on the diff:

A previous implementation had BoundedStringWriter throw an exception when the limit is hit, so the JsonGenerator actually stops instead of just sending strings into the void.

I am okay going back to this, but I don't like using an exception as flow control. WDYT @snazy ?
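For readers following along, a rough sketch of the non-throwing variant being described, where the writer simply records that the limit was exceeded; this is an illustration, not the PR's actual BoundedStringWriter:

// Sketch: stops retaining output once the limit is hit and records that fact,
// so the caller can discard the truncated result instead of growing an unbounded string.
class BoundedStringWriter extends java.io.StringWriter {
  private final int limit;
  private boolean limitExceeded = false;

  BoundedStringWriter(int limit) {
    this.limit = limit;
  }

  @Override
  public void write(char[] cbuf, int off, int len) {
    if (limitExceeded || getBuffer().length() + len > limit) {
      limitExceeded = true;
      return;
    }
    super.write(cbuf, off, len);
  }

  @Override
  public void write(String str) {
    if (limitExceeded || getBuffer().length() + str.length() > limit) {
      limitExceeded = true;
      return;
    }
    super.write(str);
  }

  boolean isLimitExceeded() {
    return limitExceeded;
  }
}

A throwing variant would let the JsonGenerator stop early, at the cost of using an exception for flow control, which is exactly the trade-off raised above.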

@snazy (Member) commented Nov 26, 2024

I added this commit to bound the extra serde.

I am not in love with this exact implementation, but what do you think about an approach like this? Does this address your concerns around unbounded serde in the event that someone has a metadata.json which exceeds the configured limit?

Not sure - but the direction is correct.

However, I think there's a much bigger problem introduced by this PR: the permanent heap usage. Even if you bound each metadata to 1MB, that can accumulate to 100,000 * 1MB (~= 95GB) of permanent heap usage. With 10MB that's nearly 1TB of heap. Plus all the other attributes, plus heap pressure during runtime. (100,000 entries is the hard-coded limit in o.a.p.core.persistence.cache.EntityCache.)

@eric-maynard (Contributor, Author) commented Nov 26, 2024

Hi @snazy; I agree that the EntityCache is potentially a problem. I think it can be optimized pretty easily with a Weigher. Although the metastore caching is disabled by default, do you view this as a blocker by itself?

One very simple thing we could do in this PR is just adjust to use weakValues.

Indeed, in older versions of this PR I originally had the metadata skip the EntityCache entirely, but @dennishuo convinced me that in many cases it's quite useful to have the in-memory caching.
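For example, assuming the EntityCache is Caffeine-based, a Weigher-bounded configuration could look roughly like this; the weight budget, the CachedEntity type, and its size method are placeholders, not actual Polaris code:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

interface CachedEntity {
  int approximateSizeInBytes(); // placeholder for however the real entity exposes its size
}

// Bound the cache by approximate bytes rather than only by entry count.
Cache<Long, CachedEntity> entityCache =
    Caffeine.newBuilder()
        .maximumWeight(1_000_000_000L) // e.g. ~1 GB budget for cached entities
        .softValues()                  // or weakValues(), letting the GC reclaim entries under pressure
        .weigher((Long id, CachedEntity entity) -> entity.approximateSizeInBytes())
        .build();

Either knob (a byte-based Weigher or weak/soft values) would keep large cached metadata from pinning tens of gigabytes of heap the way a pure entry-count limit can.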

@snazy (Member) commented Nov 27, 2024

Yes, the heap usage leading to OOMs is a blocker.

@eric-maynard (Contributor, Author) commented:
Sounds good @snazy; please see #490 for a proposal on how we can limit the size of the EntityCache.
