- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Alternatives
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
In #2232 we added a new flag to allow suspending jobs to control when the Pods of a Job get created by controller-manager. This was proposed as a primitive to allow a higher-level queue controller to implement job queuing: the queue controller unsuspends the job when resources become available.
To complement the above capability, a queue controller may also want to control on which group of nodes a job should run. For example, it may want to send the job to a specific partition (e.g., preemptibles) or a specific location (e.g., zone x).
This is a proposal to relax update validation on jobs that have never been unsuspended to allow mutating node scheduling directives, namely the node affinity, node selector, tolerations, schedulingGates, annotations, and labels of the job's pod template. This enables a higher-level queue controller to inject such directives before unsuspending a job to influence its placement.
Most kubernetes components are pod centric, including the scheduler and cluster autoscaler. This works well for service type workloads where the pods of a service are mostly independent and all services are expected to be running at all times.
However, for batch jobs, it doesn't make sense to focus only on pods. In most cases a parallel job will want the pods to run within specific constraints, like all in the same zone, or all either on GPU model x or y, but not a mix of both.
We made the first step towards achieving those semantics by introducing the suspend
flag to the Job API, which allowed a queue controller to decide when a job should start.
The idea was that such a controller would unsuspend the job when specific constraints
are met (e.g., capacity is provisioned). However, once a job is unsuspended, the queue
controller has no influence on where the pods of a job will actually land.
Adding the ability to mutate a job's scheduling bits before it starts gives a queue controller the ability to influence placement while at the same time offloading actual pod-to-node assignment to the existing kube-scheduler.
- Allow mutating node affinity, node selector, tolerations, annotations and labels of the pod template of jobs that have never been unsuspended.
- Implement a queue controller.
- Allow mutating scheduling bits of jobs that started at least once. This will not allow queue controllers to preempt jobs and reschedule them on a different partition of the cluster. As discussed in alternatives, this can be addressed in the future.
- Allow mutating scheduling bits of pods.
- Allow mutating scheduling bits of apps other than jobs.
The proposal is to relax update validation of scheduling bits of jobs that have never been unsuspended, specifically node affinity, node selector, tolerations, annotations and labels of the pod template.
This has no impact on the job controller, which has no dependency on the scheduling directives expressed in a job's pod template.
I want to build a controller that implements job queueing and influences when and where a job should run. Users create v1.Job objects, and to control when the job can run, I have a webhook that forces the jobs to be created in a suspended state, and the controller unsuspends them when capacity becomes available.
At the time of creating the job, we may not know on which part of the cluster (e.g., a zone or a VM type) the job should run. To control where it runs, the controller can update the node affinity or tolerations of the Job's pod template; by doing so, the queue controller is able to influence the scheduler's and autoscaler's decisions.
- New calls from a queue controller to the API server to update affinity or tolerations. The mitigation is for such controllers to make a single API call that both updates affinity/tolerations and unsuspends the job.
- A race condition could result in a mix of pods with different affinity/tolerations. Consider the following scenario:
- unsuspend the job
- suspend the job
- update affinity
After unsuspending the job, the job controller could have created some pods and not yet updated `Status.StartTime` by the time kube-apiserver received the suspend and affinity updates. This is not a typical sequence of events, but it is something that users should watch for.
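To sidestep both the extra API call and the race described above, a queue controller can fold the placement decision and the unsuspend into one write. A minimal sketch of that idea, using simplified stand-in types and a hypothetical helper name (`prepareForStart`) rather than the real `k8s.io/api` types:

```go
package main

import "fmt"

// Simplified stand-ins for the relevant Job fields; the real types
// live in k8s.io/api/batch/v1 and k8s.io/api/core/v1.
type PodSpec struct {
	NodeSelector map[string]string
}
type PodTemplateSpec struct {
	Spec PodSpec
}
type JobSpec struct {
	Suspend  *bool
	Template PodTemplateSpec
}
type Job struct {
	Spec JobSpec
}

// prepareForStart applies the placement decision and unsuspends the job
// in a single in-memory mutation, so the controller can persist both
// changes with one Update call instead of two separate requests.
func prepareForStart(j *Job, nodeSelector map[string]string) {
	j.Spec.Template.Spec.NodeSelector = nodeSelector
	unsuspended := false
	j.Spec.Suspend = &unsuspended
}

func main() {
	suspended := true
	j := &Job{Spec: JobSpec{Suspend: &suspended}}
	prepareForStart(j, map[string]string{"topology.kubernetes.io/zone": "us-central1-a"})
	fmt.Println(*j.Spec.Suspend, j.Spec.Template.Spec.NodeSelector["topology.kubernetes.io/zone"])
}
```

Because the scheduling directives and the suspend flag land in the same object version, the job controller never observes the unsuspended job without the new placement constraints.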
This change touches the pod template validation logic in the API server: we need to relax the validation of the Job's `Template` field, which is currently immutable, to allow mutating node affinity, node selector, tolerations, annotations and labels.
The condition we will check to verify that the job has never been unsuspended before is `Job.Spec.Suspend=true && Job.Status.StartTime=nil`.
- Unit and integration tests verifying that:
  - the pod template's node affinity, node selector, tolerations, annotations and labels are not mutable for jobs that have been unsuspended before.
  - the pod template's node affinity, node selector, tolerations, annotations and labels are not mutable for apps other than jobs.
  - the job controller observes the update and creates pods with the new scheduling directives.
- `k8s.io/kubernetes/pkg/registry/batch/job/`: `1/30/2023` - `76.8%`
Available under Job integration tests.
Integration tests offer enough coverage.
We will release the feature directly in Beta state. Because the feature is opt-in and doesn't add a new field, there is no benefit in having an alpha release.
- Feature implemented behind a feature flag
- Unit and integration tests passing
- Fix any potentially reported bugs
No changes required to existing cluster to use this feature.
N/A. This feature doesn't impact nodes.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: MutableSchedulingDirectivesForJobs
  - Components depending on the feature gate: kube-apiserver
- Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?
Yes, it relaxes validation of updates to jobs. Specifically, it will allow mutating the node affinity, node selector and tolerations.
Yes. If disabled, kube-apiserver will start rejecting updates that mutate node affinity/tolerations of jobs.
kube-apiserver will accept node affinity/tolerations updates of jobs.
Tests are in place; see the Job integration tests. The tests do not directly cover switching between enabled and disabled; that was tested manually (see below).
The change is opt-in; it doesn't impact already running workloads. However, problems with the updated validation logic may cause crashes in the apiserver.
Crashes in the apiserver because of potential problems with the updated validation logic.
Tested manually.
- Cluster on 1.22 was upgraded to 1.23
- Created a suspended job
- Mutated scheduling directives and checked that mutation was successful
- Started the job, verifying that the mutation took effect in practice as well
- Created a suspended job
- Downgraded the cluster back to 1.22
- Attempted to mutate the scheduling directives; the update was rejected.
Findings: the feature and the cluster are behaving as expected.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
N/A. This is not a feature that workloads use directly.
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details: Create a job then update the node affinity/tolerations of the pod template of the job.
N/A
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name: apiserver_request_total[resource=job, group=batch, verb=UPDATE, code=400]
- [Optional] Aggregation method:
- Components exposing the metric: kube-apiserver
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
N/A
No.
The feature itself doesn't generate API calls. But it will allow the apiserver to accept update requests that mutate part of the job spec, which will encourage implementing controllers that do this.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
Update requests will be rejected.
In a multi-master setup, when the cluster has skewed apiservers, some update requests may get accepted and some may get rejected.
N/A.
- 2021-09-01: Proposed KEP starting in beta status.
- 2021-10-28: Updated the KEP to include annotations and labels of the pod template.
- 2023-01-03: Updated the KEP to include schedulingGates of the pod template and graduated the feature to stable.
The risk with this alternative is the potential for ending up with a mix of pods: existing ones use the old constraints while new ones use the updated constraints. This may surprise users, since the old pods may not be schedulable, and the user's intention in changing the constraints is to make the job schedulable. A solution could be to guarantee consistent behavior by updating the job controller to re-create existing pods, but we decided to treat this as a follow-up and start with the proposed limited scope for now.
This is similar to this KEP's proposal, but for all suspended jobs, not just ones that have never been unsuspended before. Without ensuring that the job is indeed scaled down before accepting the update, this approach faces race conditions that may result in pods with mixed scheduling directives, creating unpleasant surprises.