[vmware] Cache images as VM templates #206
base: stable/wallaby-m3
Conversation
Needs sapcc/oslo.vmware#46

Why not use cinder's built-in cache mechanism?

I agree that the reasoning for this should be in the commit message.

I added the reason to the commit message: "We're not using the cinder built-in cache functionality because we …"
How do we account for the consumed space of the cached volume in the specific datastore, so that the scheduler knows how much we are using on that datastore?

The driver reports the …

The scheduler still needs to account for the capacity allocated against the datastore, though. These cached images will be hidden from what is actually allocated against the datastore.
Force-pushed da700b1 to c763bce
OK, thanks for the hint. Now the "image cache" capacity is being added to the pool's … The volume backend reports a new … Could you please check the new code @hemna?
If we disable `enable_image_cache`, we might still have images around. Would it make sense to run some cleanup, or at least count the existing cached images as usage, or do we have to make sure that we clean up manually?
        return list(itertools.chain(
            *[self._get_cached_images_in_folder(folder_ref)
              for folder_ref in folder_refs]))
This looks like you could use `itertools.chain.from_iterable()` instead of `itertools.chain(*[…])`.
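For illustration, a minimal standalone sketch of the suggested form (the folder refs and the helper here are stand-ins, not the driver's actual objects):

```
import itertools

# Hypothetical stand-in for the driver's _get_cached_images_in_folder()
def get_cached_images_in_folder(folder_ref):
    return ["image-a", "image-b"] if folder_ref == "folder-1" else ["image-c"]

folder_refs = ["folder-1", "folder-2"]

# chain.from_iterable() flattens lazily, without first unpacking all
# per-folder lists as chain(*[...]) does.
cached_images = list(itertools.chain.from_iterable(
    get_cached_images_in_folder(ref) for ref in folder_refs))

print(cached_images)  # ['image-a', 'image-b', 'image-c']
```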
cinder/volume/drivers/vmware/vmdk.py (outdated)
"created_at": date}] | ||
|
||
Where | ||
- name: the name of the template VM (set to the image_id) |
Should we maybe add a prefix or postfix or something that makes it possible to distinguish image-cache template-vms from shadow-vms? Especially when things are orphaned and only left as directories on the datastore, it's helpful to distinguish them by name and know what can definitely be deleted.
cinder/volume/drivers/vmware/vmdk.py (outdated)
        max_objects = self.configuration.vmware_max_objects_retrieval
        options.maxObjects = max_objects
        try:
            result = self.session.vim.RetrievePropertiesEx(
Why can't we use `WithRetrieval` as in `_get_image_cache_folder_ref()`? This looks like a lot of code and I would have expected it to already exist.
Do you mean using oslo.vmware's `get_objects()`? We couldn't use it because that one looks in the `rootFolder` and we only want to look into the cache folder here.
`with vutil.WithRetrieval(self._session.vim, retr_res) as retr_objects` is what I mean. We only get `result` here, and I would assume all the exception handling and such would already be handled in `WithRetrieval`. We later only use `result.objects`.

How do we do it in Nova? Did we also copy the contents of `get_objects()` there? (That's what I understood from you that we basically have to do here.)
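A rough sketch of the suggested pattern, assuming a retrieve result already obtained from `RetrievePropertiesEx` (the surrounding function and session names are illustrative, not the driver's actual code):

```
from oslo_vmware import vim_util as vutil

def retrieve_all_objects(session, retrieve_result):
    # Illustrative only: WithRetrieval pages through the result via
    # ContinueRetrievePropertiesEx and cancels the retrieval on exit,
    # so the caller does not need its own paging or cleanup code.
    with vutil.WithRetrieval(session.vim, retrieve_result) as objects:
        return list(objects)
```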
Nova has its own `get_objects()`-like helper called `get_inner_objects()`. Additionally, `WithRetrieval` doesn't handle the NOT_AUTHENTICATED exception that's thrown when there are no objects in the folder (see Nova's `_get_image_template_vms`).
Ah, ok. Thank you. What's the use of `WithRetrieval` then? I thought we added it everywhere in Nova because the code sometimes missed handling such cases.
cinder/volume/drivers/vmware/vmdk.py (outdated)
        cached_images = []
        for obj in result.objects:
            props = vim_util.propset_dict(obj.propSet)
Could it be that `propSet` doesn't exist? We sometimes have exceptions like that in Nova.
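Purely as an illustration, a defensive variant of the quoted loop could guard the attribute access (the function wrapper is hypothetical):

```
from oslo_vmware import vim_util

def collect_propsets(retrieve_result):
    # A RetrievePropertiesEx object may come back without a propSet
    # attribute, so guard the access instead of assuming it exists.
    cached_images = []
    for obj in getattr(retrieve_result, 'objects', None) or []:
        prop_set = getattr(obj, 'propSet', None)
        if not prop_set:
            continue
        cached_images.append(vim_util.propset_dict(prop_set))
    return cached_images
```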
cinder/volume/drivers/vmware/vmdk.py (outdated)
        img_volume = copy.deepcopy(volume)
        img_volume['id'] = image_id
        img_volume['project_id'] = self._cache_project_name()
        img_volume['size'] = image_size_in_bytes / units.Gi
Why do we keep the other data and what happens with it? Do we really need whatever else is in the `volume` dict? Would it maybe make sense to be explicit about what we expect to be used going forward? Can there be private information (metadata) that somehow gets copied to another project's cloned root-disk metadata?
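One way to be explicit, as suggested above, would be to build the cache volume dict from a whitelist of fields instead of deep-copying everything. A sketch; the selected fields are illustrative guesses, not the driver's final choice:

```
from oslo_utils import units

def build_cache_volume(volume, image_id, cache_project, image_size_in_bytes):
    # Copy only the fields the image-cache backing actually needs,
    # so no tenant metadata leaks into the cached template.
    return {
        'id': image_id,
        'project_id': cache_project,
        'size': image_size_in_bytes / units.Gi,
        'volume_type_id': volume.get('volume_type_id'),
        'availability_zone': volume.get('availability_zone'),
    }
```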
cinder/volume/drivers/vmware/vmdk.py (outdated)
@@ -1998,6 +2052,50 @@ def copy_image_to_volume(self, context, volume, image_service, image_id):
                                  VMwareVcVmdkDriver._get_disk_type(volume))
        # TODO(vbala): handle volume_size < disk_size case.

    def _can_use_image_cache(self, volume, image_size):
        requested = (volume['metadata']
                     .get(CREATE_PARAM_USE_IMAGE_CACHE) == "true")
Do we maybe want to add a `.lower()`, or does that come in automatically?
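For illustration, a case-insensitive variant could lean on `oslo_utils.strutils` instead of a plain string comparison (a sketch; the constant value follows the `use_image_cache` metadata key from the commit message):

```
from oslo_utils import strutils

CREATE_PARAM_USE_IMAGE_CACHE = 'use_image_cache'  # assumed constant value

def image_cache_requested(volume):
    # bool_from_string() accepts 'true', 'True', '1', 'yes', etc.,
    # so no manual .lower() is needed.
    value = volume['metadata'].get(CREATE_PARAM_USE_IMAGE_CACHE, 'false')
    return strutils.bool_from_string(value, default=False)
```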
            LOG.debug("The requested image cannot be cached because it's "
                      "too big (%(image_size)s > %(max_size)s)",
                      {'image_size': image_size,
                       'max_size': max_size})
Should we error out here? The customer requested to use the image cache and thus expects speedy creations, but with this image they can't get that. Same question for requested but not enabled, I guess.
From the user-experience point of view, I'd expect such an error to be returned in the API response. Otherwise they will ask why their volume is in `error` state.
Fair point. We're too late in the process, because scheduling needs to have happened already. On the other hand, nobody will ever know ...
Any conclusion for this?
AFAIK, there's a way to add error messages and retrieve them with `cinder message list`. I think we can go with "ignore what the customer requested" for now, but we should investigate this as a follow-up and maybe bring it up in a bigger round for a decision.
cinder/volume/drivers/vmware/vmdk.py (outdated)
        img_backing = None
        if self._can_use_image_cache(volume, metadata['size']):
            img_backing = self._get_cached_image_backing(
I would rename `_get_cached_image_backing()` to `_get_or_create_cached_image_backing()`, mainly because I was wondering where we would create the cached image backing if we only call a get.
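A minimal sketch of the get-or-create shape the rename hints at (the two inner helpers are hypothetical, not existing driver methods):

```
def _get_or_create_cached_image_backing(self, context, volume, image_service,
                                        image_id):
    # Look for an existing cached template first; fall back to creating
    # and caching it when the image has not been seen yet.
    backing = self._find_cached_image_backing(image_id)
    if backing is None:
        backing = self._create_cached_image_backing(
            context, volume, image_service, image_id)
    return backing
```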
                LOG.exception("Failed to delete the expired image %s",
                              cached_image['name'])

    def _get_cached_images(self):
We run this function every minute with the stats generation (I think). How long does it take to run in a real-world environment? Do we have to optimize somewhere?
Force-pushed 3f6d754 to 8333237
Force-pushed 2b59588 to 17e2b86
Upon user request, the driver can cache the image as a VM template and reuse it to create the volume(s). This feature is useful when creating many volumes in parallel from the same image.

We're not using the cinder built-in cache functionality because we need a few extra features:
- the built-in cache doesn't account for shards; the cache entry would be placed on any backend/shard and could trigger a lot of slower cross-VC migrations when creating volumes from it
- the built-in cache doesn't have a periodic task for deleting expired cache entries
- we want to cache images only when the customer requests it

Users can request the image cache feature when creating the volume by passing use_image_cache='true' as a property (metadata).

The feature must be enabled per backend, for example:

```
[vmware]
enable_image_cache = true
```

This enables the image cache feature for the vmware backend. The image templates are then stored in a folder similar to the volumes folder, OpenStack/Project (vmware_image_cache)/Volumes, where {backend}_image_cache is used as the project name.

The driver periodically deletes cached images that have expired. The expiry time can be controlled via the `image_cache_age_seconds` property set on the backend configuration. Only images smaller than the configured `image_cache_max_size_gb` will be cached.

Change-Id: I6f5e481f6997a180a455b47abe525b93bcf9aa4e
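As a usage example, requesting the cache from the client side could look roughly like this (standard python-openstackclient options; the image name, size, and volume name are placeholders):

```
openstack volume create \
    --image my-image \
    --size 20 \
    --property use_image_cache=true \
    my-volume
```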