KEP-2458: Resource Fit Scoring Strategy

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The default scheduler includes three score plugins (NodeResourcesLeastAllocated, NodeResourcesMostAllocated and RequestedToCapacityRatio) that implement different strategies for preferred resource allocation. Those plugins are mutually exclusive.

This KEP proposes to deprecate those plugins and combine them under one Score plugin, the same one used for filtering (namely NodeResourcesFit), and to add a ScoringStrategy parameter to the NodeResourcesFit plugin config that allows users to select which scoring strategy to run.

Motivation

The motivation is twofold:

  1. Reduce the complexity of configuring the default scheduler:

Configuring the scheduler plugins is a tedious task. The relatively large number of plugins that the default scheduler supports doesn't make that task easier: more plugins means a larger number of combinations of enabled/disabled plugins.

Moreover, some combinations don't make sense and are potentially harmful. For example, specific to this KEP, enabling both the least- and most-allocated scoring plugins is not useful.

Finally, we are planning to add more resource fit scoring strategies, such as Best/WorstFit, which prefers nodes with the least/most absolute amount of available resources that can host the pod, and Least/MostAllocatable, which prefers nodes with the least/most allocatable resources that can host the pod. Adding those strategies as separate plugins would make the scheduler configuration problem even worse.

  2. A step towards allowing workloads to express their resource fit scoring strategy via pod spec.

Similar to how we allow workloads to express spread and affinity preferences, we want to allow pods to express node resource fit preferences. This is important to achieve higher utilization when running a mix of serving and batch workloads on the same cluster. We believe that the changes we make here will enable this long term vision.

Goals

  • Allow users to configure resource fit preferences using a single plugin configuration
  • Deprecate resource-based scoring plugins that implement individual strategies, specifically:
    • NodeResourcesLeastAllocated
    • NodeResourcesMostAllocated
    • RequestedToCapacityRatio
  • Provide a flexible config API that allows adding new strategies in the future.

Non-Goals

  • Allow users to express resource fit preferences via the pod spec. This is left as follow-up work that will be done under a different KEP.

Proposal

User Stories (Optional)

Story 1

As a cluster operator, I want an easy way to configure the scoring behavior of the scheduler with respect to node resources.

Risks and Mitigations

Consolidating the plugins could reduce the number of allowed scoring configurations too much. This can be mitigated by adding more configurable strategies wherever applicable.

Design Details

Define a new ScoringStrategy type as follows:

// StrategyType selects one of the node resource scoring strategies.
type StrategyType string

const (
    // LeastAllocated favors nodes with a lower ratio of requested to allocatable resources.
    LeastAllocated StrategyType = "LeastAllocated"
    // MostAllocated favors nodes with a higher ratio of requested to allocatable resources.
    MostAllocated StrategyType = "MostAllocated"
    // RequestedToCapacityRatio scores nodes using a configurable function of the requested-to-capacity ratio.
    RequestedToCapacityRatio StrategyType = "RequestedToCapacityRatio"
)

type ScoringStrategy struct {
    metav1.TypeMeta

    // Strategy selects which strategy to run.
    Strategy StrategyType
    
    // Resources to consider when scoring.
    // The default resource set includes "cpu" and "memory" with an equal weight.
    // Allowed weights go from 1 to 100.
    Resources []ResourceSpec

    // Arguments specific to RequestedToCapacityRatio strategy.
    RequestedToCapacityRatio *RequestedToCapacityRatio
}

type RequestedToCapacityRatio struct {
    // Points defining priority function shape
    Shape []UtilizationShapePoint
}

// Note that the two types defined below already exist in the scheduler's component config API.
// ResourceSpec represents a single resource.
type ResourceSpec struct {
    // Name of the resource.
    Name string
    // Weight of the resource.
    Weight int64
}

// UtilizationShapePoint represents a single point of a priority function shape.
type UtilizationShapePoint struct {
    // Utilization (x axis). Valid values are 0 to 100. Fully utilized node maps to 100.
    Utilization int32
    // Score assigned to a given utilization (y axis). Valid values are 0 to 10.
    Score int32
}
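
For illustration only, the sketch below shows one way the shape points could be turned into a score: broken-linear interpolation between consecutive points, clamping utilization values below the first point and above the last point. The function name and the example shape are hypothetical and not part of the proposed API.

// scoreForUtilization is a sketch (not the proposed implementation) of mapping a
// node's resource utilization (0-100) to a score by interpolating linearly
// between the configured shape points, which are assumed to be sorted by
// increasing Utilization. Utilization below the first point gets the first
// point's score; utilization above the last point gets the last point's score.
func scoreForUtilization(shape []UtilizationShapePoint, utilization int32) int64 {
    if len(shape) == 0 {
        return 0
    }
    if utilization <= shape[0].Utilization {
        return int64(shape[0].Score)
    }
    for i := 1; i < len(shape); i++ {
        if utilization <= shape[i].Utilization {
            x0, y0 := int64(shape[i-1].Utilization), int64(shape[i-1].Score)
            x1, y1 := int64(shape[i].Utilization), int64(shape[i].Score)
            return y0 + (y1-y0)*(int64(utilization)-x0)/(x1-x0)
        }
    }
    return int64(shape[len(shape)-1].Score)
}

// Example: a bin-packing style shape where an empty node scores 0 and a fully
// utilized node scores 10; a node at 40% utilization scores 4.
// shape := []UtilizationShapePoint{{Utilization: 0, Score: 0}, {Utilization: 100, Score: 10}}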

Add ScoringStrategy to the existing NodeResourcesFitArgs:

// NodeResourcesFitArgs holds arguments used to configure the NodeResourcesFit plugin.
type NodeResourcesFitArgs struct {
    metav1.TypeMeta

    // IgnoredResources is the list of resources that NodeResources fit filter
    // should ignore.
    IgnoredResources []string
    // IgnoredResourceGroups defines the list of resource groups that NodeResources fit 
    // filter should ignore.
    // e.g. if group is ["example.com"], it will ignore all resource names that begin
    // with "example.com", such as "example.com/aaa" and "example.com/bbb".
    // A resource group name can't contain '/'.
    IgnoredResourceGroups []string
    
    // ScoringStrategy selects the node resource scoring strategy.
    ScoringStrategy *ScoringStrategy
}
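
As an illustration (not a prescribed default), a bin-packing style policy could then be expressed with a single args value such as the one below; the resource weights shown are examples only. The intent is that this configuration behaves the same as enabling the NodeResourcesMostAllocated plugin with the same resources today.

args := NodeResourcesFitArgs{
    ScoringStrategy: &ScoringStrategy{
        Strategy: MostAllocated,
        Resources: []ResourceSpec{
            {Name: "cpu", Weight: 1},
            {Name: "memory", Weight: 1},
        },
    },
}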

As of writing this KEP, the scheduler component config API is at v1beta1. The plugins we plan to deprecate will continue to be configurable in v1beta1, but will not be available in v1beta2.

Test Plan

  • Unit and integration tests covering the new configuration path. For example, configuring the MostAllocated strategy should result in the same behavior as enabling the NodeResourcesMostAllocated plugin separately.

Graduation Criteria

Alpha -> Beta Graduation

  • The KEP proposes an API change to scheduler component config to allow expressing an existing behavior in a different way. Since this configuration is opt-in, it will not be guarded by a feature flag, and will graduate with component config (i.e., will start in beta directly).
  • In v1beta1, the default scheduler configuration will continue to use the old plugins. In v1beta2, it will use the new config API.

Beta -> GA Graduation

  • Allowing time for feedback to ensure that the new API sufficiently expresses users' requirements.

Upgrade / Downgrade Strategy

N/A

Version Skew Strategy

N/A

Production Readiness Review Questionnaire

Feature Enablement and Rollback

This section must be completed when targeting alpha to a release.

  • How can this feature be enabled / disabled in a live cluster?

    • Feature gate (also fill in values in kep.yaml)
      • Feature gate name:
      • Components depending on the feature gate:
    • Other
      • Describe the mechanism: this is an opt-in scheduler component config parameter. If set, it allows configuring an existing scheduler behavior using this new parameter.
      • Will enabling / disabling the feature require downtime of the control plane? Yes, it requires restarting the scheduler.
      • Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled). No
  • Does enabling the feature change any default behavior? No.

  • Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? Yes, this will depend on the component config version used:

    • If using v1beta1, revert to using the legacy plugins directly
    • If using v1beta2, revert to v1beta1 and use the legacy plugins directly
  • What happens if we reenable the feature if it was previously rolled back? This results in rolling back to an older scheduler configuration. If the old configuration is semantically the same, then the pod scheduling behavior should not change.

  • Are there any tests for feature enablement/disablement? No, we will do manual testing.

Rollout, Upgrade and Rollback Planning

This section must be completed when targeting beta graduation to a release.

  • How can a rollout fail? Can it impact already running workloads? It shouldn't impact already running workloads. This is an opt-in feature to express existing behavior using a different API. Operators need to change the scheduler configuration to enable it.

  • What specific metrics should inform a rollback?

    • A spike in the metric schedule_attempts_total{result="error|unschedulable"}
  • Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? No, it will be tested manually.

  • Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? No.

Monitoring Requirements

This section must be completed when targeting beta graduation to a release.

  • How can an operator determine if the feature is in use by workloads? This is a scheduler configuration feature that operators themselves opt into.

  • What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

    • Metrics
      • Component exposing the metric: kube-scheduler
        • Metric name: pod_scheduling_duration_seconds
        • Metric name: schedule_attempts_total{result="error|unschedulable"}
    • Other (treat as last resort)
      • Details:
  • What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

    • 99% of pod scheduling latency is within x minutes
    • x% of schedule_attempts_total are successful
  • Are there any missing metrics that would be useful to have to improve observability of this feature? No.

Dependencies

This section must be completed when targeting beta graduation to a release.

  • Does this feature depend on any specific services running in the cluster? No

Scalability

For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.

For beta, this section is required: reviewers must answer these questions.

For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.

  • Will enabling / using this feature result in any new API calls? No.

  • Will enabling / using this feature result in introducing new API types? No.

  • Will enabling / using this feature result in any new calls to the cloud provider? No.

  • Will enabling / using this feature result in increasing size or count of the existing API objects? No.

  • Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? No.

  • Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? No.

Troubleshooting

The Troubleshooting section currently serves the Playbook role. We may consider splitting it into a dedicated Playbook document (potentially with some monitoring details). For now, we leave it here.

This section must be completed when targeting beta graduation to a release.

  • How does this feature react if the API server and/or etcd is unavailable?

Running workloads will not be impacted, but pods that are not scheduled yet will not get assigned nodes.

  • What are other known failure modes? N/A

  • What steps should be taken if SLOs are not being met to determine the problem? N/A

Implementation History

  • 2021-02-08: Initial KEP sent for review