Implement KatibConfig API #2176

Merged: 19 commits into kubeflow:master from improve-katib-config-ux on Aug 1, 2023

Member

@tenzen-y tenzen-y commented Jul 19, 2023

What this PR does / why we need it:
I implemented the new KatibConfig API to set all parameters for Katib.
The new KatibConfig can also be written in JSON or YAML, like a Kubernetes manifest.

NOTE: Since KatibConfig is NOT a CustomResource, we still need to create a ConfigMap that embeds the KatibConfig, as in the past.

apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  katib-config.yaml: |
    ---
    apiVersion: config.kubeflow.org/v1beta1
    kind: KatibConfig
    init:
      controller:
        webhookPort: 8443
        trialResources:
          - Job.v1.batch
          - TFJob.v1.kubeflow.org
    runtime:
      metricsCollectors:
        - kind: StdOut
          image: docker.io/kubeflowkatib/file-metrics-collector:latest
        - kind: TensorFlowEvent
          image: docker.io/kubeflowkatib/tfevent-metrics-collector:latest
          resources:
            limits:
              memory: 1Gi
      suggestions:
        - algorithmName: random
          image: docker.io/kubeflowkatib/suggestion-hyperopt:latest
        - algorithmName: pbt
          image: docker.io/kubeflowkatib/suggestion-pbt:latest
          persistentVolumeClaimSpec:
            accessModes:
              - ReadWriteMany
            resources:
              requests:
                storage: 5Gi
      earlyStoppings:
        - algorithmName: medianstop
          image: docker.io/kubeflowkatib/earlystopping-medianstop:latest

The KatibConfig above is used as follows:

  • .init.controller: This holds parameters for the katib-controller. It is evaluated only when the controller launches, so we need to restart the controller's pod to apply new parameters. We also need to mount the ConfigMap that embeds the KatibConfig into the controller and pass the mounted path to the controller's --katib-config option, like:
apiVersion: apps/v1
kind: Deployment
...
spec:
...
  template:
...
    spec:
      serviceAccountName: katib-controller
      containers:
        - name: katib-controller
          args:
            - --katib-config=/katib-config.yaml
...
          volumeMounts:
            - mountPath: /katib-config.yaml
              name: katib-config
              subPath: katib-config.yaml
              readOnly: true
...
      volumes:
        - name: katib-config
          configMap:
            name: katib-config
...
  • .runtime: This holds parameters for the metrics-collectors, suggestion-services, and earlystoppings. It is re-evaluated on every query, so we don't need to restart the controller's pods to apply new parameters.
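The difference between the two sections can be illustrated with a minimal Python sketch. This is not the Go code in this PR: `KatibConfigView`, `load_text`, and `store` are hypothetical names, the stand-in payload is the JSON form of a KatibConfig, and how the real controller fetches the config is not modelled here.

```python
import json
from typing import Any, Callable, Dict


class KatibConfigView:
    """Illustrative only: snapshot `.init` once, re-read `.runtime` on every lookup."""

    def __init__(self, load_text: Callable[[], str]) -> None:
        self._load_text = load_text
        # `.init` is captured once, mirroring "evaluated only when the controller launches".
        self.init: Dict[str, Any] = json.loads(load_text()).get("init", {})

    def runtime(self) -> Dict[str, Any]:
        # `.runtime` is parsed again on each call, so edits are visible without a restart.
        return json.loads(self._load_text()).get("runtime", {})


# A mutable stand-in for the ConfigMap payload (hypothetical, JSON form).
store = {
    "text": json.dumps({
        "apiVersion": "config.kubeflow.org/v1beta1",
        "kind": "KatibConfig",
        "init": {"controller": {"webhookPort": 8443}},
        "runtime": {"suggestions": [{"algorithmName": "random"}]},
    })
}
view = KatibConfigView(lambda: store["text"])
```

If `store["text"]` is edited afterwards, `view.init` keeps the values captured at construction, while the next call to `view.runtime()` reflects the edit.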

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2150

Checklist:

  • Docs included if any changes are user facing

@tenzen-y
Member Author

Hmm... I cannot reproduce the CI error locally...

Error from server: error when creating "../../testdata/valid-experiment.yaml": admission webhook "validator.experiment.katib.kubeflow.org" denied the request: GetMetricsCollectorConfigData failed: failed to find metrics collector config for kind: StdOut in ConfigMap: katib-config

https://github.com/kubeflow/katib/actions/runs/5613737623/job/15210405916?pr=2176#step:4:9781

@tenzen-y force-pushed the improve-katib-config-ux branch 5 times, most recently from ef2a9ae to 2cec0b2 (July 20, 2023 21:06)
@tenzen-y
Member Author

The CI error above was resolved.

@tenzen-y changed the title from "WIP: Implement KatibConfig API" to "Implement KatibConfig API" (Jul 21, 2023)
@tenzen-y force-pushed the improve-katib-config-ux branch 2 times, most recently from 529b005 to 6595c5b (July 21, 2023 08:10)
@tenzen-y
Member Author

@andreyvelich @johnugeorge @gaocegege This PR is ready for review. Please take a look.

Member

@andreyvelich andreyvelich left a comment

Thank you for this great feature @tenzen-y!
This should give users a much better experience.
I left a few initial comments.

return nil
}

func SetDefaults_KatibConfig(cfg *KatibConfig) {
Member

I think that, in the future, we should migrate the defaults for Katib Experiment to a similar model, which is more Kubernetes-native.
Currently, we set defaults via a mutating webhook: https://github.com/tenzen-y/katib/blob/067c11933792f8060c3f9cd5349ef1b508a5b17c/pkg/webhook/v1beta1/experiment/mutate_webhook.go#L60.

Member Author

@tenzen-y tenzen-y Jul 21, 2023

Yes, that's right. If it's possible, it would be good.
By doing that, we can reduce the kube-apiserver load and then improve cluster performance :)
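The defaulting model discussed here, where a function such as SetDefaults_KatibConfig fills unset fields in-process instead of a webhook doing it through the API server, can be sketched roughly as follows. This is a Python illustration, not the Go code under review; `set_defaults` and `ASSUMED_CONTROLLER_DEFAULTS` are hypothetical names, and the default values are assumptions chosen for the example.

```python
import copy
from typing import Any, Dict

# Illustrative values only: 8443 is the port shown in the example ConfigMap in the
# PR description, and leader election is assumed to be off unless requested.
ASSUMED_CONTROLLER_DEFAULTS: Dict[str, Any] = {
    "webhookPort": 8443,
    "enableLeaderElection": False,
}


def set_defaults(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of cfg with unset `.init.controller` fields filled in."""
    out = copy.deepcopy(cfg)
    controller = out.setdefault("init", {}).setdefault("controller", {})
    for key, value in ASSUMED_CONTROLLER_DEFAULTS.items():
        # Only fill fields the user left out; explicit values always win.
        controller.setdefault(key, value)
    return out
```

Because the function works on a copy and never overrides explicit values, it can run on every load of the config without an extra round trip to the API server.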

Member Author

Currently, kustomize cannot merge (patch) structured content inside the ConfigMap's .data.
So, we need a separate KatibConfig for the leaderElection install.

NOTE: In the future, we might be able to merge structured .data in the ConfigMap using kustomize:
kubernetes-sigs/kustomize#4517

# This KatibConfig is mostly same as https://github.com/kubeflow/katib/manifests/v1beta1/components/controller/katib-config.yaml.
# Only `.init.controller.enableLeaderElection` field is different.
---
Member

Unfortunately, Kustomize doesn't allow us to patch the ConfigMap data, which is why we need to keep the whole config for each overlay 😞
This means we will need to update the image tags in these overlays for each release.
I think we need to do something with this script:

echo -e "Update Katib Metrics Collectors, Suggestions and EarlyStopping images\n"
update_yaml_files "${CONFIG_PATH}" "${OLD_PREFIX}" "${NEW_PREFIX}"
update_yaml_files "${CONFIG_PATH}" ":[^[:space:]].*\"" ":${TAG}\""
.

I think at this stage it may be worth discussing how we want to maintain our suggestion image versions.
Currently, we maintain the tags for suggestions under /manifests/v1beta1/components/controller/katib-config.yaml and the tags for the controller under /manifests/v1beta1/installs/katib-controller/kustomization.yaml.

Member

What do you think @kubeflow/wg-automl-leads @tenzen-y ?

Member Author

> This means we will need to update the image tags in these overlays for each release.
> I think we need to do something with this script:

It's a good point.

> I think at this stage it may be worth discussing how we want to maintain our suggestion image versions.
> Currently, we maintain the tags for suggestions under /manifests/v1beta1/components/controller/katib-config.yaml and the tags for the controller under /manifests/v1beta1/installs/katib-controller/kustomization.yaml.

I think generating the KatibConfig for each install with a script would be good. That is, we would generate manifests/v1beta1/installs/katib-leader-election/katib-config.yaml from manifests/v1beta1/components/katib-config.yaml with a script.

If so, we could maintain the suggestion service image versions the same way as in the past.

Member

> I think generating KatibConfig for each installs by script would be good.

What do you mean by that?

I was thinking that, since Kustomize doesn't allow us to patch the ConfigMap data, we can just replace it where required, similar to what you did for the Katib Leader Election install.
So for the Katib Standalone install and the Katib with Kubeflow install, we need to do the same: replace the Katib Config ConfigMap with one that has the appropriate image tags.
E.g. for the master branch the image tags will be :latest, and for a release-0.x branch they will be :v0.x.0.

WDYT @tenzen-y @johnugeorge ?

Member Author

> > I think generating KatibConfig for each installs by script would be good.
>
> What do you mean by that ?

@andreyvelich NVM. My suggestion wouldn't work well.

> So for Katib Standalone install and Katib with Kubeflow install we need to do the same to replace Katib Config with ConfigMap with appropriate image tags.

Does that mean we would use the kustomize imageTagTransformer, like

images:
  - name: docker.io/kubeflowkatib/katib-controller
    newName: docker.io/kubeflowkatib/katib-controller
    newTag: latest
  - name: docker.io/kubeflowkatib/katib-db-manager
    newName: docker.io/kubeflowkatib/katib-db-manager
    newTag: latest
  - name: docker.io/kubeflowkatib/katib-ui
    newName: docker.io/kubeflowkatib/katib-ui
    newTag: latest
  - name: docker.io/kubeflowkatib/cert-generator
    newName: docker.io/kubeflowkatib/cert-generator
    newTag: latest
?

Kustomize treats the .data field as string data, not as structured data such as YAML. So we cannot replace the image tags in the kind: KatibConfig embedded in the ConfigMap using imageTagTransformer.

So, we need to replace the image tags using vars like

vars:
  - fieldref:
      fieldPath: metadata.namespace
    name: KATIB_UI_NAMESPACE
    objref:
      apiVersion: apps/v1
      kind: Deployment
      name: katib-ui
configurations:
  - params.yaml
.
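Because the embedded KatibConfig is an opaque string to Kustomize, any tag update has to operate on text rather than on structured fields. A minimal Python sketch of that idea follows; it is not the release script from this repository. `retag_images` and `IMAGE_LINE` are hypothetical names, and the pattern assumes one `image:` entry per line that ends in a tag rather than a digest.

```python
import re

# Matches `image: <repository>:<tag>` lines and captures everything before the tag.
# Assumption: images are referenced by tag (not by digest), one per line.
IMAGE_LINE = re.compile(
    r"^([ \t]*(?:-[ \t]+)?image:[ \t]*\S+):[^\s:/]+[ \t]*$",
    re.MULTILINE,
)


def retag_images(config_text: str, new_tag: str) -> str:
    """Return config_text with the tag of every `image:` entry replaced by new_tag."""
    # A function replacement avoids backslash/group-reference surprises in new_tag.
    return IMAGE_LINE.sub(lambda m: f"{m.group(1)}:{new_tag}", config_text)
```

For example, calling `retag_images(text, "v0.x.0")` on the `katib-config.yaml` payload would rewrite each `:latest` image tag, while lines without an `image:` key and untagged images are left untouched.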

Member Author

> I meant that for every install we are going to replace the whole Katib Config, similar to Katib Leader Election. E.g. for Katib Standalone we just have the same config as in /components/katib-config, but with different image tags (latest tag for the master branch, v0.x tag for a release- branch). Does it make sense?

@andreyvelich Oh, I see. Thanks for the clarification.

> I think the idea of installs/overlays was to have one place where we set settings, images, or any other changes for our Katib Control Plane.
>
> Another solution could be to just remove Katib Config from the components/ manifests and add it only in installs. We can modify our docs to explain it.

It sounds reasonable. I will do it ASAP.

Member

@andreyvelich andreyvelich Jul 31, 2023

@johnugeorge @gaocegege I am happy to discuss any other suggestions on how to configure our Katib manifests.

Member Author

@andreyvelich I've done it: 42fe278
Is this what you expected?

Member

Yeah, that's right @tenzen-y, that is what I proposed.

Member Author

Thanks for the check :)

@tenzen-y
Member Author

tenzen-y commented Aug 1, 2023

> Thank you for this great contribution @tenzen-y! I think we should update our docs in the next few months: https://www.kubeflow.org/docs/components/katib/katib-config/ Could you please create an issue to track this?
>
> /lgtm
> /hold for review from @gaocegege and @johnugeorge
> /assign @johnugeorge @gaocegege

Thanks for the review. I created an issue: #2186

@andreyvelich
Member

@tenzen-y Let's also rebase this PR, so we can test it in K8s 1.26.

Signed-off-by: Yuki Iwai <[email protected]>
@tenzen-y
Member Author

tenzen-y commented Aug 1, 2023

> @tenzen-y Let's also rebase this PR, so we can test it in K8s 1.26.

Yea, I rebased this PR.

@johnugeorge
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Aug 1, 2023
@tenzen-y
Member Author

tenzen-y commented Aug 1, 2023

/hold cancel

@google-oss-prow google-oss-prow bot merged commit e69235d into kubeflow:master Aug 1, 2023
@tenzen-y tenzen-y deleted the improve-katib-config-ux branch August 1, 2023 18:45