- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The main goal of this enhancement is to introduce minReadySeconds as an optional field to StatefulSets.
minReadySeconds specifies the minimum number of seconds for which a newly created Pod should be ready without any of its containers crashing, for it to be considered available. This defaults to 0 (the pod will be considered available as soon as it is ready).
minReadySeconds is already available as an optional field for Deployments, DaemonSets, ReplicaSets and Replication Controllers. Adding it to StatefulSets brings them on par with the other workload controllers.
Note: The important point to understand is when a pod will be considered Ready, which might depend on the container probes configured. For example, for a pod with a single container that has a readiness check with initialDelaySeconds, the pod is considered Available only after it has implicitly been Ready for initialDelaySeconds + minReadySeconds.
The StatefulSet controller honors minReadySeconds and marks a Pod available only if the Pod has been ready for the time specified in minReadySeconds.
Moving minReadySeconds to the Pod spec is beyond the scope of this KEP for the following reasons:
- The effort to change the Pod spec would be large. While it would also help other controllers, such as the endpoint controller, to look at the pod status and propagate it, the main goal of this KEP is to introduce the minReadySeconds field to the StatefulSet spec to bring consistency.
- Our workload controllers are currently inconsistent, and we prioritize consistency of experience.
- StatefulSets are just different enough from DaemonSets and Deployments that real-world use of minReadySeconds for StatefulSets might influence any future design or point in a more appropriate direction.
More information about the discussion, and about why we are going ahead with this approach, can be found here. There was consensus to bring consistency of experience.
The StatefulSet spec should be expanded with an optional field called minReadySeconds. To reflect the StatefulSet pods that honored minReadySeconds, an additional field called AvailableReplicas will also be added to the StatefulSet status.
As a Kubernetes user, I want to ensure that my StatefulSet Pods have the desired state ready before they start serving requests. Some applications should be considered available only when the state transfer has happened properly.
As a Kubernetes user, I want to leverage StatefulSet's minReadySeconds behind service loadbalancers in clouds. This is to prevent killing pods in rotation before new pods show up.
We are proposing a new field called minReadySeconds whose default value will be 0. In this mode, the StatefulSet will behave exactly as it does today.
It's possible that we introduce a bug in the implementation. The bug can cause the:
- Pod to be marked as available too soon. This means that the end user may think that their workload is available to be used when in reality it is not available yet.
- Pod to be marked as available too late. This means that the end user may think that their workload is not available yet when in reality it is ready to be used.
The current mitigation is that the feature is disabled by default in the Alpha phase, behind a feature gate, for people to try out and give feedback. In the Beta phase, when it is enabled by default, people will only see issues or bugs when minReadySeconds is set to something greater than 0. Since people would have tried this feature in Alpha, we would have had time to fix issues.
type StatefulSetSpec struct {
	// Minimum number of seconds for which a newly created pod should be ready
	// without any of its containers crashing, for it to be considered available.
	// Defaults to 0 (pod will be considered available as soon as it is ready).
	// This is an alpha field and requires enabling StatefulSetMinReadySeconds feature gate.
	// +optional
	MinReadySeconds int32
}
type StatefulSetStatus struct {
	// Total number of available pods (ready for at least minReadySeconds) targeted by this statefulset.
	// This is an alpha field and requires enabling StatefulSetMinReadySeconds feature gate.
	// +optional
	AvailableReplicas int32
}
Unit, integration and E2E tests cover the existing StatefulSet mechanics. Additionally, unit and integration tests will be added to cover the API validation, behavioral change of StatefulSet with feature gate enabled and disabled.
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
`k8s.io/kubernetes/pkg/apis/apps/validation`: `06/07/2022`: `90.6% of statements` `The tests added for the current feature in this package touch the StatefulSet Spec and Status fields. No new tests are needed for promotion to GA`
`k8s.io/kubernetes/pkg/apis/apps/validation/validation.go:93`: `06/07/2022`: `93.1% of statements`
`k8s.io/kubernetes/pkg/apis/apps/validation/validation.go:162`: `06/07/2022`: `100% of statements`
`k8s.io/kubernetes/pkg/apis/apps/validation/validation.go:169`: `06/07/2022`: `94.1% of statements`
`k8s.io/kubernetes/pkg/apis/apps/validation/validation.go:199`: `06/07/2022`: `100% of statements`
`k8s.io/kubernetes/pkg/controller/statefulset`: `06/07/2022`: `85.5% of statements` `The tests added for the current feature in this package touch the StatefulSet upgrade strategies. No new tests are needed for promotion to GA`
`k8s.io/kubernetes/pkg/registry/apps/statefulset`: `06/07/2022`: `76.7% of statements` `The tests added for the current feature in this package make sure that Kubernetes version upgrades won't have any impact on the new StatefulSet API fields when persisting to etcd. No new tests are needed for promotion to GA`
`k8s.io/kubernetes/pkg/registry/apps/statefulset/strategy.go:95`: `06/07/2022`: `100.0% of statements`
`k8s.io/kubernetes/pkg/registry/apps/statefulset/strategy.go:139`: `06/07/2022`: `100.0% of statements`
`k8s.io/kubernetes/pkg/registry/apps/statefulset/strategy.go:223`: `06/07/2022`: `100.0% of statements`
`k8s.io/kubernetes/pkg/registry/apps/statefulset/strategy.go:118`: `06/07/2022`: `100.0% of statements`
Integration tests were added to test available replicas when minReadySeconds is set on the StatefulSet spec.
k8s.io/kubernetes/test/integration/statefulset.TestStatefulSetAvailable: test grid
The following e2e tests were added for StatefulSets.
- StatefulSet MinReadySeconds should be honored when enabled: test grid
- StatefulSet AvailableReplicas should get updated accordingly when MinReadySeconds is enabled: test grid
- Complete feature behind a featuregate
- Have proper unit and integration tests
- Gather feedback from end users
- Tests are in Testgrid and linked in KEP
- 2 examples of end users using this field
- The latest version of OpenShift is using this field for cluster monitoring operator
- Prometheus-operator alert manager uses this field
- Other users requesting for this feature can be found in the github issue
- Upgrades
  When upgrading from a release without this feature to a release with minReadySeconds, we will set minReadySeconds to 0. This would give users the same default behavior. The default value of AvailableReplicas would be the value of the existing field in the Status sub-resource called ReadyReplicas.
- Downgrades
  When downgrading from a release with this feature to a release without minReadySeconds, there are two cases:
  - If minReadySeconds is greater than 0: the StatefulSet controller wouldn't honor minReadySeconds, which is expected. AvailableReplicas will be set to ReadyReplicas by the StatefulSet controller.
  - If minReadySeconds is equal to 0: in this case the user won't see any difference in behavior. AvailableReplicas will be set to ReadyReplicas by the StatefulSet controller.
We will ensure that the minReadySeconds field is properly validated before persisting. The validation checks that the value of minReadySeconds is 0 or a positive number.
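A minimal sketch of the non-negativity check described above follows. `validateMinReadySeconds` and its error text are hypothetical and only illustrate the rule. The actual validation lives in the `k8s.io/kubernetes/pkg/apis/apps/validation` package and is not reproduced here.

```go
package main

import "fmt"

// validateMinReadySeconds is a hypothetical helper illustrating the rule that
// minReadySeconds must be 0 or a positive number.
func validateMinReadySeconds(v int32) error {
	if v < 0 {
		return fmt.Errorf("spec.minReadySeconds: invalid value %d: must be greater than or equal to 0", v)
	}
	return nil
}

func main() {
	for _, v := range []int32{-1, 0, 30} {
		fmt.Printf("%d -> %v\n", v, validateMinReadySeconds(v))
	}
}
```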
If the feature gate is enabled, the controller manager updates AvailableReplicas to reflect the pods that have been available for minReadySeconds, as mentioned earlier. If it is not enabled, the controller manager runs as is. This feature has no node runtime implications.
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: StatefulSetMinReadySeconds
  - Components depending on the feature gate:
    - kube-controller-manager
    - kube-apiserver
No, the StatefulSet needs to have .spec.minReadySeconds explicitly set.
Yes. Using the feature gate is the only way to enable or disable this feature.
The StatefulSet controller starts respecting minReadySeconds again.
Yes, unit and integration tests cover the feature both enabled and disabled. See the test plan section.
It shouldn't impact already running workloads. This is an opt-in feature, since users need to explicitly set the minReadySeconds parameter in the StatefulSet spec, i.e. the .spec.minReadySeconds field.
If the feature is disabled, the field is preserved if it was already set in the persisted StatefulSet object; otherwise it is silently dropped.
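The preserve-or-drop rule can be sketched as below. This is an illustration under stated assumptions, not the registry strategy code: `statefulSetSpec` and `dropDisabledMinReadySeconds` are hypothetical names, and the feature gate is passed in as a plain boolean.

```go
package main

import "fmt"

// statefulSetSpec is a hypothetical, trimmed-down spec holding only the field
// under discussion.
type statefulSetSpec struct {
	MinReadySeconds int32
}

// dropDisabledMinReadySeconds clears MinReadySeconds on the incoming spec when
// the feature gate is off, unless the already persisted object (oldSpec) was
// using the field. oldSpec is nil on create.
func dropDisabledMinReadySeconds(gateEnabled bool, newSpec, oldSpec *statefulSetSpec) {
	if gateEnabled {
		return // feature on: keep whatever the user sent
	}
	if oldSpec != nil && oldSpec.MinReadySeconds != 0 {
		return // field already in use in the stored object: preserve it
	}
	newSpec.MinReadySeconds = 0 // silently drop
}

func main() {
	// Create with the gate disabled: the value is dropped.
	created := &statefulSetSpec{MinReadySeconds: 30}
	dropDisabledMinReadySeconds(false, created, nil)
	fmt.Println(created.MinReadySeconds) // 0

	// Update with the gate disabled, but the stored object already set the field: preserved.
	updated := &statefulSetSpec{MinReadySeconds: 45}
	dropDisabledMinReadySeconds(false, updated, &statefulSetSpec{MinReadySeconds: 30})
	fmt.Println(updated.MinReadySeconds) // 45
}
```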
We have a metric called kube_statefulset_status_replicas_available, which we added recently to track the number of available replicas. The cluster admin can use this metric to track problems. If the value is immediately equal to the number of Ready replicas, or if it is 0, it can be considered a feature failure.
Manually tested. No issues were found when we enabled the feature gate -> disabled it -> re-enabled the feature gate. Upgrade -> downgrade -> upgrade scenario has been manually tested.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
None
By checking the kube_statefulset_status_replicas_available metric. If all the Ready replicas are accounted for in kube_statefulset_status_replicas_available after waiting for minReadySeconds, we can consider the feature to be in use by workloads.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: kube_statefulset_status_replicas_available
  - [Optional] Aggregation method:
  - Components exposing the metric: kube-controller-manager via kube_state_metrics. PR which adds the metric
The kube_statefulset_status_replicas_available metric gives the number of replicas available. Since it tracks available replicas, comparing it with the kube_statefulset_status_replicas_ready metric should give us an understanding of the health of the feature. There should be certain times where kube_statefulset_status_replicas_available lags behind kube_statefulset_status_replicas_ready for a duration of minReadySeconds. This lag defines the correctness of the functionality.
All Available pods should have been Ready for at least the time specified in .spec.minReadySeconds, 99% of the time.
Are there any missing metrics that would be useful to have to improve observability of this feature?
None. It is part of the StatefulSet controller.
No.
No.
No.
Yes. API type(s): StatefulSet
- Estimated increase in size:
- New field in StatefulSet spec about 4 bytes
- New field in StatefulSet status about 4 bytes
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
The controller won't be able to make progress, all currently queued resources are re-queued. This feature does not change current behavior of the controller in this regard.
- minReadySeconds not respected and all the pods are shown Available immediately
  - Detection: Looking at the kube_statefulset_status_replicas_available metric
  - Mitigations: Disable the StatefulSetMinReadySeconds feature flag
  - Diagnostics: Controller-manager when starting at log-level 4 and above
  - Testing: Yes, e2e tests are already in place
- minReadySeconds not respected and none of the pods are shown as Available after minReadySeconds
  - Detection: Looking at kube_statefulset_status_replicas_available. None of the pods will be shown available
  - Mitigations: Disable the StatefulSetMinReadySeconds feature flag
  - Diagnostics: Controller-manager when starting at log-level 4 and above
  - Testing: Yes, e2e tests are already in place
- 2021-04-29: Initial KEP merged
- 2021-06-15: Initial implementation PR merged
- 2021-07-14: Graduate the feature to Beta proposed
Adds more complexity to the StatefulSet controller in terms of checking Pod availability for a certain amount of time. Measuring the availability may be hard, as there can be clock skew between master and node. However, we have successfully implemented this feature in the Deployment and DaemonSet controllers.
Add a readinessProbe with an increased initialDelaySeconds to the pod spec. This is not always foolproof and can cause problems.