KEP-2599: minReadySeconds for StatefulSets

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

Summary

The main goal of this enhancement is to introduce minReadySeconds as an optional field to StatefulSets.

Motivation

minReadySeconds specifies the minimum number of seconds for which a newly created Pod should be ready without any of its containers crashing, for it to be considered available. This defaults to 0 (the pod will be considered available as soon as it is ready).

minReadySeconds is already available as an optional field for Deployments, DaemonSets, ReplicaSets and ReplicationControllers. Adding it brings StatefulSets on par with the other workload controllers.

Note: When a pod is considered Ready depends on the container probes configured. For example, for a pod with a single container that has a readiness probe with initialDelaySeconds, the pod is considered Available only once at least initialDelaySeconds + minReadySeconds have passed.

Goals

The StatefulSet controller honors minReadySeconds and marks a Pod as available only after it has been ready for the time given in minReadySeconds.

Non-Goals

Moving minReadySeconds to the Pod spec is beyond the scope of this KEP for the following reasons:

- The effort to change the Pod spec would be large. While such a change would also help other controllers, like the endpoint controller, to look at the pod status and propagate it, the main goal of this KEP is to introduce the minReadySeconds field to the StatefulSet spec to bring consistency.
- Our workload controllers are currently inconsistent, and we prioritize consistency of experience.
- StatefulSets are just different enough from DaemonSets and Deployments that real-world use of minReadySeconds for StatefulSets might influence any future design or point in a more appropriate direction.

More information about the discussion, and about why we are going ahead with this approach, can be found here. There was consensus to bring consistency of experience.

Proposal

StatefulSet spec should be expanded with an optional field called minReadySeconds. To reflect the StatefulSet pods that honored minReadySeconds, an additional field called AvailableReplicas will also be added to StatefulSet status.

User Stories (Optional)

Story 1

As a Kubernetes user, I want to ensure that my StatefulSet Pods have the desired state ready before they start serving requests. Some applications should be considered available only when the state transfer has happened properly.

Story 2

As a Kubernetes user, I want to leverage StatefulSet's minReadySeconds behind service loadbalancers in clouds. This is to prevent killing pods in rotation before new pods show up.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

We are proposing a new field called minReadySeconds whose default value is 0. With the default, a StatefulSet behaves exactly as it does today. It is possible that we introduce a bug in the implementation. Such a bug can cause a:

- Pod to be marked as available too soon. The end user may think that their workload is available to be used when, in reality, it is not available yet.
- Pod to be marked as available too late. The end user may think that their workload is not available yet when, in reality, it is ready to be used.

The current mitigation is that the feature is disabled by default in the Alpha phase, behind a feature gate, for people to try out and give feedback. In the Beta phase, when it is enabled by default, people will only see issues or bugs when minReadySeconds is set to something greater than 0. Since people will have tried the feature in Alpha, we will have had time to fix issues.

Design Details

StatefulSet

type StatefulSetSpec struct {
	// Minimum number of seconds for which a newly created pod should be ready
	// without any of its containers crashing, for it to be considered available.
	// Defaults to 0 (pod will be considered available as soon as it is ready)
	// This is an alpha field and requires enabling StatefulSetMinReadySeconds feature gate.
	// +optional
	MinReadySeconds int32
}

type StatefulSetStatus struct {
	// Total number of available pods (ready for at least minReadySeconds) targeted by this statefulset.
	// This is an alpha field and requires enabling StatefulSetMinReadySeconds feature gate.
	// +optional
	AvailableReplicas int32
}

Test Plan

Unit, integration and E2E tests cover the existing StatefulSet mechanics. Additionally, unit and integration tests will be added to cover the API validation, behavioral change of StatefulSet with feature gate enabled and disabled.

Prerequisite testing updates

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Unit tests

`k8s.io/kubernetes/pkg/apis/apps/validation`: `06/07/2022`: `90.6% of statements`. The tests added for the current feature in this package touch the StatefulSet Spec and Status fields. No new tests are needed for promotion to GA.
- `k8s.io/kubernetes/pkg/apis/apps/validation/validation.go:93`: `06/07/2022`: `93.1% of statements`
- `k8s.io/kubernetes/pkg/apis/apps/validation/validation.go:162`: `06/07/2022`: `100% of statements`
- `k8s.io/kubernetes/pkg/apis/apps/validation/validation.go:169`: `06/07/2022`: `94.1% of statements`
- `k8s.io/kubernetes/pkg/apis/apps/validation/validation.go:199`: `06/07/2022`: `100% of statements`
`k8s.io/kubernetes/pkg/controller/statefulset`: `06/07/2022`: `85.5% of statements`. The tests added for the current feature in this package touch the StatefulSet upgrade strategies. No new tests are needed for promotion to GA.
`k8s.io/kubernetes/pkg/registry/apps/statefulset`: `06/07/2022`: `76.7% of statements`. The tests added for the current feature in this package make sure that Kubernetes version upgrades won't have any impact on the new StatefulSet API fields when persisting to etcd. No new tests are needed for promotion to GA.
- `k8s.io/kubernetes/pkg/registry/apps/statefulset/strategy.go:95`: `06/07/2022`: `100.0% of statements`
- `k8s.io/kubernetes/pkg/registry/apps/statefulset/strategy.go:139`: `06/07/2022`: `100.0% of statements`
- `k8s.io/kubernetes/pkg/registry/apps/statefulset/strategy.go:223`: `06/07/2022`: `100.0% of statements`
- `k8s.io/kubernetes/pkg/registry/apps/statefulset/strategy.go:118`: `06/07/2022`: `100.0% of statements`

Integration tests

Added integration tests to test available replicas when minReadySeconds is set on the StatefulSet spec.

k8s.io/kubernetes/test/integration/statefulset.TestStatefulSetAvailable: test grid

e2e tests

The following e2e tests were added for StatefulSets:

StatefulSet MinReadySeconds should be honored when enabled: test grid
StatefulSet AvailableReplicas should get updated accordingly when MinReadySeconds is enabled: test grid

Graduation Criteria

Alpha

Complete feature behind a featuregate
Have proper unit and integration tests

Alpha -> Beta Graduation

Gather feedback from end users
Tests are in Testgrid and linked in KEP

Beta -> GA Graduation

2 examples of end users using this field
- The latest version of OpenShift is using this field for cluster monitoring operator
- Prometheus-operator alert manager uses this field
- Other users requesting this feature can be found in the GitHub issue

Upgrade / Downgrade Strategy

Upgrades: When upgrading from a release without this feature to a release with minReadySeconds, we will set minReadySeconds to 0. This gives users the same default behavior. The default value of AvailableReplicas will be the value of the existing ReadyReplicas field in the Status sub-resource.
Downgrades: When downgrading from a release with this feature to a release without minReadySeconds, there are two cases:
- If minReadySeconds is greater than 0, the StatefulSet controller won't honor minReadySeconds, which is expected. AvailableReplicas will be set to ReadyReplicas by the StatefulSet controller.
- If minReadySeconds is equal to 0, the user won't see any difference in behavior. AvailableReplicas will be set to ReadyReplicas by the StatefulSet controller.

We will ensure that the minReadySeconds field is properly validated before persisting. Validation checks that minReadySeconds is 0 or a positive number.
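The validation rule above amounts to a non-negativity check. The following is a minimal standalone sketch of that rule; validateMinReadySeconds and its error message are hypothetical and stand in for whatever the API validation package actually does.

```go
package main

import "fmt"

// validateMinReadySeconds is a hypothetical stand-in for the API validation
// described above: it accepts 0 and positive values and rejects negatives.
func validateMinReadySeconds(minReadySeconds int32) error {
	if minReadySeconds < 0 {
		return fmt.Errorf("spec.minReadySeconds: invalid value %d: must be greater than or equal to 0", minReadySeconds)
	}
	return nil
}
```

Under this sketch, values such as 0 and 30 pass, while -1 is rejected before the object would be persisted.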

Version Skew Strategy

If the feature gate is enabled, the controller manager updates AvailableReplicas to reflect the pods that have been available for minReadySeconds, as mentioned earlier; if not, the controller manager runs as is. This feature has no node runtime implications.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: StatefulSetMinReadySeconds
- Components depending on the feature gate:
  - kube-controller-manager
  - kube-apiserver

Does enabling the feature change any default behavior?

No, StatefulSet needs to have .spec.minReadySeconds explicitly set

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Using the feature gate is the only way to enable or disable this feature.

What happens if we reenable the feature if it was previously rolled back?

The StatefulSet controller starts respecting the minReadySeconds again

Are there any tests for feature enablement/disablement?

Yes, there are unit and integration tests for the feature being on and off. See the Test Plan section.

Rollout, Upgrade and Rollback Planning

How can a rollout fail? Can it impact already running workloads?

It shouldn't impact already running workloads. This is an opt-in feature, since users need to explicitly set the minReadySeconds parameter in the StatefulSet spec, i.e. the .spec.minReadySeconds field. If the feature is disabled, the field is preserved if it was already set in the persisted StatefulSet object; otherwise it is silently dropped.
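One reading of the answer above is that, with the feature gate off, an incoming minReadySeconds value is cleared unless the already persisted object uses the field. A minimal sketch of that reading follows; the statefulSetSpec type and dropDisabledFields function are hypothetical names for illustration only.

```go
package main

// statefulSetSpec is a hypothetical, trimmed-down StatefulSet spec.
type statefulSetSpec struct {
	MinReadySeconds int32
}

// dropDisabledFields clears MinReadySeconds on the incoming spec when the
// feature gate is disabled, unless the persisted (old) spec already uses it.
// oldSpec is nil on create.
func dropDisabledFields(gateEnabled bool, newSpec, oldSpec *statefulSetSpec) {
	if gateEnabled {
		return // gate on: keep whatever the user set
	}
	if oldSpec != nil && oldSpec.MinReadySeconds != 0 {
		return // field already in use in the persisted object: preserve it
	}
	newSpec.MinReadySeconds = 0 // silently drop the field
}
```

Under this sketch, creating a StatefulSet with minReadySeconds set to 10 while the gate is off stores 0, whereas updating an object that already had the field set keeps the new value.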

What specific metrics should inform a rollback?

We recently added a metric called kube_statefulset_status_replicas_available to track the number of available replicas. The cluster admin can use this metric to track problems. If its value immediately equals the number of Ready replicas even though minReadySeconds is greater than 0, or if it stays at 0, this can be considered a feature failure.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Manually tested. No issues were found when we enabled the feature gate -> disabled it -> re-enabled the feature gate. Upgrade -> downgrade -> upgrade scenario has been manually tested.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

None

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

By checking the kube_statefulset_status_replicas_available metric. If all the Ready replicas are accounted for in kube_statefulset_status_replicas_available after waiting for minReadySeconds, we can consider the feature to be in use by workloads.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: kube_statefulset_status_replicas_available
- [Optional] Aggregation method:
- Components exposing the metric: kube-controller-manager via kube-state-metrics. PR which adds the metric

The kube_statefulset_status_replicas_available metric gives the number of available replicas. Comparing it with the kube_statefulset_status_replicas_ready metric should give an understanding of the health of the feature: at certain times, kube_statefulset_status_replicas_available should lag behind kube_statefulset_status_replicas_ready for a duration of minReadySeconds. This lag indicates that the functionality is working correctly.

What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

Pods reported as Available should have been Ready for at least the time specified in .spec.minReadySeconds 99% of the time.

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

None. It is part of the StatefulSet controller.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes. API type(s): StatefulSet

Estimated increase in size:
- New field in StatefulSet spec about 4 bytes
- New field in StatefulSet status about 4 bytes

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The controller won't be able to make progress; all currently queued resources are re-queued. This feature does not change the current behavior of the controller in this regard.

What are other known failure modes?

minReadySeconds not respected and all the pods are shown as Available immediately
- Detection: Look at the kube_statefulset_status_replicas_available metric
- Mitigations: Disable the StatefulSetMinReadySeconds feature gate
- Diagnostics: Controller-manager logs when started at log level 4 and above
- Testing: Yes, e2e tests are already in place
minReadySeconds not respected and none of the pods are shown as Available after minReadySeconds
- Detection: Look at the kube_statefulset_status_replicas_available metric. None of the pods will be shown as available
- Mitigations: Disable the StatefulSetMinReadySeconds feature gate
- Diagnostics: Controller-manager logs when started at log level 4 and above
- Testing: Yes, e2e tests are already in place

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

2021-04-29: Initial KEP merged
2021-06-15: Initial implementation PR merged
2021-07-14: Graduation of the feature to Beta proposed

Drawbacks

Adds more complexity to the StatefulSet controller, which has to check Pod availability over a certain amount of time. Measuring availability may be hard, as there can be clock skew between master and node. However, we have successfully implemented this feature in the Deployment and DaemonSet controllers.

Alternatives

Increase the readinessProbe's initialDelaySeconds in the pod spec. This is not always foolproof and can cause problems.
