diff --git a/docs/design-proposals/resources/eSDKTopologyAwareness.jpg b/docs/design-proposals/resources/eSDKTopologyAwareness.jpg new file mode 100644 index 00000000..4acdbe3f Binary files /dev/null and b/docs/design-proposals/resources/eSDKTopologyAwareness.jpg differ diff --git a/docs/design-proposals/topology-support.md b/docs/design-proposals/topology-support.md new file mode 100644 index 00000000..1f412d18 --- /dev/null +++ b/docs/design-proposals/topology-support.md @@ -0,0 +1,305 @@ +# eSDK support CSI Topology-Aware Volume Provisioning with Kubernetes + +**Author(s)**: [Amit Roushan](https://github.com/AmitRoushan) + +## Version Updates +Date | Version | Description | Author +---|---|---|--- +Aug 5th 2021 | 0.1.0 | Initial design draft for storage topology support for eSDK kubernetes plugin| Amit Roushan + +## Terminology + + Term | Definition +------|------ +CSI | A specification attempting to establish an industry standard interface that Container Orchestration Systems (COs) can use to expose arbitrary storage systems to their containerized workloads. +PV | A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes +PVC | A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a Pod. Pods consume node resources and PVCs consume PV resources. + +## Motivation and background +Some storage systems expose volumes that are not equally accessible by all nodes in a +Kubernetes cluster. Instead volumes may be constrained to some subset of node(s) in the cluster. +The cluster may be segmented into, for example, “racks” or “regions” and “zones” or some other +grouping, and a given volume may be accessible only from one of those groups. + +To enable orchestration systems, like Kubernetes, to work well with storage systems which expose +volumes that are not equally accessible by all nodes, the CSI spec enables: + +- Ability for a CSI Driver to opaquely specify where a particular node exists with respect to + storage system (e.g. "node A" is in "zone 1"). +- Ability for Kubernetes (users or components) to influence where a volume is provisioned + (e.g. provision new volume in either "zone 1" or "zone 2"). +- Ability for a CSI Driver to opaquely specify where a particular volume exists + (e.g. "volume X" is accessible by all nodes in "zone 1" and "zone 2"). + +Kubernetes support this CSI abilities to make intelligent scheduling and provisioning decisions. + +Being a CSI plugin, eSDK strive to support topological scheduling and provisioning for end customer. + +## Goals +The document present detailed design to make eSDK enable for topological volume scheduling +and provisioning in kubernetes cluster. + +The design should +- Enable operator/cluster admin to configure eSDK for topological distribution +- Enable end user to provision volumes based on configured topology +- Add recommendations for topological name and configuration strategy + +### Non-Goals +The document will not explicitly define, provide or explain: +- kubernetes [Volume Topology-aware Scheduling](https://github.com/jsafrane/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md) +- [CSI spec](https://github.com/container-storage-interface/spec/blob/master/spec.md) for topology support + +### Assumptions and Constraints +- The document has only considered kubernetes as orchestrator/provisioner. +- Volume provisioning/scheduling over Kubernetes nodes are part of [kubernetes Volume Topology-aware Scheduling](https://github.com/jsafrane/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md). + +### Input Requirements +Support topology awareness for eSDK + +### Feature Requirements +- Enable operator/cluster admin to configure eSDK for topological distribution +- Enable end user to provision volumes based on configured topology +- Add recommendations for topological name and configuration strategy + +#### Requirement Analysis +Support topology aware volume provisioning on kubernetes for eSDK plugin: +- Should work for pre-provisioned persistent volume (PV) + - Cluster admin can create a PV with NodeAffinity which mean PV can only be accessed from Nodes that + satisfy the NodeSelector + - Kubernetes ensures: + - Scheduler predicate: if a Pod references a PVC that is bound to a PV with NodeAffinity, the predicate will + evaluate the NodeSelector against the Node's labels to filter the nodes that the Pod can be schedule to. + Kubelet: PV NodeAffinity is verified against the Node when mounting PVs. +- Should work for Dynamic provisioned persistent volume (PV) + - Dynamic provisioning aware of pod scheduling decisions, delayed volume binding must also be enabled + - Scheduler will pass its selected node to the dynamic provisioner, and the provisioner will create a + volume in the topology domain that the selected node is part of. +- Operator/Cluster admin should + - Able to specify topological distribution of kubernetes nodes + - Able to provision topology aware dynamic or pre-provisioned persistent volume (PV) + - Able to configure eSDK for topology aware volume provisioning +- Application developer/deployer should + - Able to configure topology aware volume for workload + +##### Functional Requirements +- Should work for pre-provisioned persistent volume (PV) +- Should work for Dynamic provisioned persistent volume (PV) +- Able to specify topological distribution of kubernetes nodes +- Able to provision topology aware dynamic or pre-provisioned persistent volume (PV) +- Able to configure topology aware volume for workload + +##### Non Functional Requirements +- Should support old version of volume provisioning + +### Performance Requirements +- Volume provisioned without topology remains performant + +### Security Requirements +NA +### Other Non Functional Requirements (Scalability, HA etc…) +NA + +## Architecture Analysis + +### System Architecture + +![System Architecture](resources/eSDKTopologyAwareness.jpg) + +Kubernetes supports CSI specification to enable storage provider to write their storage plugin for volume provisioning. +Normally Kubernetes nodes are equally accessible for volume provisioning hence PV controller in controller manager trigger +volume provisioning as soon as PV/PVC are defined. Therefore, scheduler cannot take into account any of the pod’s other +scheduling constraints. This makes it possible for the PV controller to bind a PVC to a PV or provision a PV with +constraints that can make a pod unschedulable. +Detail design of [Volume Topology-aware Scheduling](https://github.com/jsafrane/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md) is in purview of kubernetes +Summary: +- Admin pre-provisions PVs and/or StorageClasses. +- User creates unbound PVC and there are no prebound PVs for it. +- PVC binding and provisioning is delayed until a pod is created that references it. +- User creates a pod that uses the PVC. +- Pod starts to get processed by the scheduler. +- Scheduler processes predicates. the predicate function, will process both bound and unbound PVCs of the Pod. It will + validate the VolumeNodeAffinity for bound PVCs. For unbound PVCs, it will try to find matching PVs for that node + based on the PV NodeAffinity. If there are no matching PVs, then it checks if dynamic provisioning is possible for + that node based on StorageClass AllowedTopologies. +- After evaluation, the scheduler will pick a node. +- Schedule triggers volume provisioning by annotating PV with the selected nodes +- PV controller get informed by the event and start provisioning by passing selected node topogical info to external provisioner +- External provisioner eventually calls ```CreateVolume``` gRPC request to eSDK controller plugin with topological data. +- CSI controller plugin consumes topology data and provision volume accordingly. + +The document handles eSDK adaption towards topological volume provisioning. +eSDK is centralized and split component CSI plugin. The component involves: +- ```Controller plugin``` communicates indirectly with Kubernetes master components and storage backend to + implement CSI Controller service functionality +- ```Node plugin``` communicates indirectly with Kubernetes master components adn storage backend to implement + node service functionality + + +## Detailed Design + +Enabling topology aware provisioning with eSDK has the following aspects: +- Enabling Feature Gates in kubernetes + - Topology aware volume scheduling is controlled by the VolumeScheduling feature gate, + and must be configured in + - kube-scheduler + - kube-controller-manager + - all kubelets. + +- Set ```VOLUME_ACCESSIBILITY_CONSTRAINTS``` flag in ```GetPluginCapabilities``` call for identity service + +- Make Kubernetes aware about topology + - Enable eSDK node plugin to publish Specifies where (regions, zones, racks, etc.) the node is accessible from + - Cluster admin MUST add topology aware labels for each node. + + Ex: topology.kubernetes.io/region or topology.kubernetes.io/zone + - eSDK node plugin fetches node labels and pass on the topological data to kubelet in ```NodeGetInfo``` gRPC call + ``` + NodeGetInfoResponse{ + node_id = "{HostName: k8s-node-1}" + accessible_topology = + {"topology.kubernetes.io/region": "R1", + "topology.kubernetes.io/zone": "Z2"} + } + ``` + - Kubelet creates ```CSINodeInfo``` with topological data. + - Same ```CSINodeInfo``` object is used during Volume provisioning/scheduling + + +- Topology aware volume provisioning: + - Controller plugin responsible to make intelligent volume provisioning decision based on topological data + provided in ```CreateVolume``` CSI gRPC call + - To support topology aware provisioning, configuration of different backend are provided with supported + topological distributions during deployment + ```json + csi.json: { + "backends": [ + { + "storage": "***", + "name": "storage1", + "urls": ["https://*.*.*.*:28443"], + "pools": ["***"], + "parameters": {"protocol": "iscsi", "portals": ["*.*.*.*"]} + "supportedTopologies":[ + {"topology.kubernetes.io/region": "R1", + "topology.kubernetes.io/zone": "Z1"}, + ] + }, + { + "storage": "***", + "name": "storage2", + "urls": ["https://*.*.*.*:28443"], + "pools": ["***"], + "parameters": {"protocol": "iscsi", "portals": ["*.*.*.*"]} + "supportedTopologies":[ + {"topology.kubernetes.io/region": "R1", + "topology.kubernetes.io/zone": "Z2"}, + ] + } + ] + } + ``` + - Topology aware provisioning by controller plugin: + - External provisioner initiate volume provisioning by ```CreateVolume``` CSI gRPC call to controller plugin + - External provisioner pass on ```accessibility_requirements``` parameter in ```CreateVolumeRequest```. + - #### Scenario : ```accessibility_requirements``` parameter supplied: + - Controller need to consider topological constraints provided in ```accessibility_requirements``` + - Controller can get topology attributes in two variants in ```accessibility_requirements```: + - ```requisite```: + - If requisite is specified, the volume MUST be accessible from at least one of requisite topologies + - ```preffered``` + - MUST attempt to make the provisioned volume accessible using the preferred topologies in order from first to last + - if requisite is specified, all topologies in preferred list