- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Alternatives
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
In #2232 we added a new flag to allow suspending jobs to control when the Pods of a Job get created by controller-manager. This was proposed as a primitive to allow a higher-level queue controller to implement job queuing: the queue controller unsuspends the job when resources become available.
To complement the above capability, a queue controller may also want to control on which group of nodes a job should run. For example, it may want to send the job to a specific partition (e.g., preemptibles) or a specific location (e.g., zone x).
This is a proposal to relax update validation on jobs that have never been unsuspended to allow mutating node scheduling directives, namely the node affinity, node selector, tolerations, schedulingGates, annotations, and labels of the job's pod template. This enables a higher-level queue controller to inject such directives before unsuspending a job to influence its placement.
Most kubernetes components are pod centric, including the scheduler and cluster autoscaler. This works well for service type workloads where the pods of a service are mostly independent and all services are expected to be running at all times.
However, for batch jobs, it doesn't make sense to focus only on pods. In most cases a parallel job will want the pods to run within specific constraints, like all in the same zone, or all either on GPU model x or y, but not a mix of both.
We made the first step towards achieving those semantics by introducing the suspend
flag to the Job API, which allowed a queue controller to decide when a job should start.
The idea was that such a controller would unsuspend the job when specific constraints
are met (e.g., capacity is provisioned). However, once a job is unsuspended, the queue
controller has no influence on where the pods of a job will actually land.
Adding the ability to mutate a job's scheduling bits before it starts gives a queue controller the ability to influence placement while at the same time offloading actual pod-to-node assignment to the existing kube-scheduler.
- Allow mutating node affinity, node selector, tolerations, annotations and labels of the pod template of jobs that have never been unsuspended.
- Implement a queue controller.
- Allow mutating scheduling bits of jobs that started at least once. This will not allow queue controllers to preempt jobs and reschedule them on a different partition of the cluster. As discussed in alternatives, this can be addressed in the future.
- Allow mutating scheduling bits of pods.
- Allow mutating scheduling bits of apps other than jobs.
The proposal is to relax update validation of scheduling bits of jobs that have never been unsuspended, specifically node affinity, node selector, tolerations, annotations and labels of the pod template.
This has no impact on the job controller, which has no dependency on the scheduling directives expressed in a job's pod template.
I want to build a controller that implements job queueing and influences when and where a job should run. Users create v1.Job objects, and to control when the job can run, I have a webhook that forces the jobs to be created in a suspended state, and the controller unsuspends them when capacity becomes available.
At the time of creating the job, we may not know on which part of the cluster (e.g., a zone or a VM type) the job should run. To control where it runs, the controller can update the node affinity or tolerations of the Job's pod template; by doing so, the queue controller is able to influence the scheduler's and autoscaler's decisions.
- New calls from a queue controller to the API server to update affinity or tolerations. The mitigation is for such controllers to make a single API call that both updates affinity/tolerations and unsuspends the job.
- A race condition could result in a mix of pods with different affinity/tolerations. Consider the following scenario:
- unsuspend the job
- suspend the job
- update affinity
After unsuspending the job, the job controller could have created some pods and not yet updated `Status.StartTime` by the time kube-apiserver received the suspend and affinity updates. This is not a typical sequence of events, but it is something that users should watch for.
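To sidestep both the extra API call and the race described above, a queue controller can fold the placement decision and the unsuspend into one write. A minimal sketch of that idea, using simplified stand-in types and a hypothetical helper name (`prepareForStart`) rather than the real `k8s.io/api` types:

```go
package main

import "fmt"

// Simplified stand-ins for the relevant Job fields; the real types
// live in k8s.io/api/batch/v1 and k8s.io/api/core/v1.
type PodSpec struct {
	NodeSelector map[string]string
}
type PodTemplateSpec struct {
	Spec PodSpec
}
type JobSpec struct {
	Suspend  *bool
	Template PodTemplateSpec
}
type Job struct {
	Spec JobSpec
}

// prepareForStart applies the placement decision and unsuspends the job
// in a single in-memory mutation, so the controller can persist both
// changes with one Update call instead of two separate requests.
func prepareForStart(j *Job, nodeSelector map[string]string) {
	j.Spec.Template.Spec.NodeSelector = nodeSelector
	unsuspended := false
	j.Spec.Suspend = &unsuspended
}

func main() {
	suspended := true
	j := &Job{Spec: JobSpec{Suspend: &suspended}}
	prepareForStart(j, map[string]string{"topology.kubernetes.io/zone": "us-central1-a"})
	fmt.Println(*j.Spec.Suspend, j.Spec.Template.Spec.NodeSelector["topology.kubernetes.io/zone"])
}
```

Because the scheduling directives and the suspend flag land in the same object version, the job controller never observes the unsuspended job without the new placement constraints.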
This change touches the pod template validation logic in the API server: we need to relax the validation of the Job's `Template` field, which is currently immutable, to allow mutating node affinity, node selector, tolerations, annotations and labels.
The condition we will check to verify that the job has never been unsuspended before is `Job.Spec.Suspend=true && Job.Status.StartTime=nil`.
- Unit and integration tests verifying that:
  - the pod template's node affinity, node selector, tolerations, annotations and labels are not mutable for jobs that have been unsuspended before.
  - the pod template's node affinity, node selector, tolerations, annotations and labels are not mutable for apps other than jobs.
  - the job controller observes the update and creates pods with the new scheduling directives.
- `k8s.io/kubernetes/pkg/registry/batch/job/`: `1/30/2023` - `76.8%`
Available under Job integration tests.
Integration tests offer enough coverage.
We will release the feature directly in Beta state. Because the feature is opt-in and doesn't add a new field, there is no benefit in having an alpha release.
- Feature implemented behind a feature flag
- Unit and integration tests passing
- Fix any potentially reported bugs
No changes required to existing cluster to use this feature.
N/A. This feature doesn't impact nodes.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: MutableSchedulingDirectivesForJobs
  - Components depending on the feature gate: kube-apiserver
- Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?
Yes, it relaxes validation of updates to jobs. Specifically, it will allow mutating the node affinity, node selector and tolerations.
Yes. If disabled, kube-apiserver will start rejecting updates that mutate node affinity/tolerations of jobs.
kube-apiserver will accept node affinity/tolerations updates of jobs.
Tests are in place; see the Job integration tests. The tests do not directly cover switching between enabled and disabled; that was tested manually (see below).
The change is opt-in; it doesn't impact already running workloads. However, problems with the updated validation logic may cause crashes in the apiserver.
Crashes in the apiserver because of potential problems with the updated validation logic.
Tested manually.
- Cluster on 1.22 was upgraded to 1.23
- Created a suspended job
- Mutated scheduling directives and checked that mutation was successful
- Started the job, verifying that the mutation took effect in practice as well
- Created a suspended job
- Downgraded the cluster back to 1.22
- Attempted to mutate the scheduling directives; the update was rejected.
Findings: the feature and the cluster are behaving as expected.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
N/A. This is not a feature that workloads use directly.
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details: Create a job then update the node affinity/tolerations of the pod template of the job.
N/A
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name: apiserver_request_total[resource=job, group=batch, verb=UPDATE, code=400]
- [Optional] Aggregation method:
- Components exposing the metric: kube-apiserver
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
N/A
No.
The feature itself doesn't generate API calls. But it will allow the apiserver to accept update requests that mutate part of the job spec, which will encourage implementing controllers that do this.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
Update requests will be rejected.
In a multi-master setup, when the cluster has skewed apiservers, some update requests may get accepted and some may get rejected.
N/A.
- 2021-09-01: Proposed KEP starting in beta status.
- 2021-10-28: Updated the KEP to include annotations and labels of the pod template.
- 2023-01-03: Updated the KEP to include schedulingGates of the pod template and graduated the feature to stable.
The risk with this alternative is the potential for ending up with a mix of pods: existing ones use the old constraints while new ones use the updated constraints. This may surprise users, since the old pods may not be schedulable, and the user's intention in changing the constraints is to make the job schedulable. A solution could be to guarantee consistent behavior by updating the job controller to re-create existing pods, but we decided to treat this as a follow-up and start with the proposed limited scope for now.
This is similar to this KEP's proposal, but for all suspended jobs, not just ones that have never been unsuspended before. Without ensuring that the job is indeed scaled down before accepting the update, this approach faces race conditions that may result in pods with mixed scheduling directives, creating unpleasant surprises.