Extend NoScaleUpInfo reporting by running simulation on skipped NodeGroups. #9346
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shaikenov. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Hi @shaikenov. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from fd176b0 to c9e137d (compare)
/uncc elmiko
@shaikenov: GitHub didn't allow me to request PR reviews from the following users: MartynaGrotek. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from c9e137d to 015d03d (compare)
norbertcyran left a comment:
Overall looks good, I've left some nits and suggestions, but nothing major
// GetRemainingPods returns information about pods which CA is unable to help
// at this moment.
func (o *ScaleUpOrchestrator) GetRemainingPods(egs []*equivalence.PodGroup, nodeGroups []cloudprovider.NodeGroup, skipped map[string]status.Reasons, nodeInfos map[string]*framework.NodeInfo) []status.NoScaleUpInfo {
	if !o.autoscalingCtx.ScaleUpSimulationForSkippedNodeGroupsEnabled {
nit: we recently had a discussion about putting more stuff into AutoscalingContext, and we agreed that we tend to overuse it: #9353 (comment)
I'd normally ask you to avoid using autoscalingCtx to store that flag and instead pass it to the orchestrator via dependency injection. However, IIRC, the orchestrator has a weird interface that makes DI a little more complicated (because of the Initialize method). I remember having some issues with that in #8835. Therefore, I won't push on it, but I'd still suggest checking whether injecting ScaleUpSimulationForSkippedNodeGroupsEnabled via DI would be a hassle.
I agree that AutoscalingContext seems to be huge, and DI indeed looks very complex with all the calls made from Initialize. TBH, I don't think it is worth it; it would make the implementation more complex.
Side comment:
While I understand that it is better to avoid huge objects with lots of things inside, such as AutoscalingContext, I personally don't see a better way to do it. We are adding a lot of flags which might be used in different parts of CA, and having one big object gives us much more flexibility there. You don't need to think twice about what to pass and where, since you have a context object that can be accessed everywhere. And as long as we have this object, I think it is better to use it than to avoid it.
On second thought, if a flag impacts only one particular area of CA, it might be worth injecting it into that area, and if a flag impacts several CA parts, we can put it in AutoscalingContext. If that is what was meant in that discussion, I fully agree.
This is just a comment to hear your opinion; maybe I am missing something.
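For illustration only, a minimal sketch of what construction-time injection of the flag could look like; the option struct, field, and constructor names below are invented for this example and are not the orchestrator's actual API, which still wires most dependencies via Initialize:

```go
// Hypothetical sketch of injecting the flag at construction time instead of
// reading it from AutoscalingContext. All names here are illustrative.
type ScaleUpOrchestratorOptions struct {
	// SimulateSkippedNodeGroups enables the extra predicate-check pass for
	// node groups that were skipped (e.g. because of backoff).
	SimulateSkippedNodeGroups bool
}

type ScaleUpOrchestrator struct {
	opts ScaleUpOrchestratorOptions
	// ...other dependencies would still be wired later via Initialize...
}

func NewScaleUpOrchestrator(opts ScaleUpOrchestratorOptions) *ScaleUpOrchestrator {
	return &ScaleUpOrchestrator{opts: opts}
}
```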
// This code here runs a simulation to see which pods can be scheduled on which node groups.
for _, nodeGroup := range validNodeGroups {
	schedulablePodGroups[nodeGroup.Id()] = o.SchedulablePodGroups(podEquivalenceGroups, nodeGroup, nodeInfos[nodeGroup.Id()])
}
Have you considered running the simulations for skipped node groups somewhere around here? I think it could be cleaner: with the current proposal, scheduling simulations get scattered over the orchestrator code, and the logic that was previously responsible only for processing the scale-up status now also does scheduling simulations.
We'd have to be extra careful, though, not to include skipped node groups in bin packing. I haven't investigated it in depth, so feel free to discard this if it's not feasible.
Exactly, here I wanted to do the simulations towards the end of the ScaleUp call, because of the bin packing and also because we need to somehow preserve the default behavior; doing it earlier did not seem feasible.
Another point: SchedulablePodGroups marks the pods as schedulable, so it would require something extra to keep track of the pods that were unschedulable before the "second" simulation as well as after it, and to manage all of that with feature flags.
Also, the simulation we intend to run for the skipped node groups is not a "full" scale-up simulation, but only a predicate check, so I placed it at the end and we run it only for the non-schedulable pod groups.
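As a rough, self-contained illustration of that design choice (a predicate-only check, run at the end, only for pod groups that are still unschedulable), assuming hypothetical stand-in types rather than the actual Cluster Autoscaler API:

```go
package sketch

// PredicateChecker is a hypothetical stand-in for the real predicate-checking
// dependency; it answers whether a pod group's pods would fit the template
// node of a node group.
type PredicateChecker interface {
	FitsNodeGroup(podGroupID, nodeGroupID string) bool
}

// skippedGroupsThatWouldFit sketches the late, predicate-only pass: for one
// pod group that is still unschedulable after the regular simulation, it
// returns the skipped node groups whose template node satisfies the pod's
// predicates. It deliberately does no bin packing and never marks any pod
// as schedulable.
func skippedGroupsThatWouldFit(podGroupID string, skippedNodeGroups []string, checker PredicateChecker) []string {
	var wouldFit []string
	for _, nodeGroupID := range skippedNodeGroups {
		if checker.FitsNodeGroup(podGroupID, nodeGroupID) {
			wouldFit = append(wouldFit, nodeGroupID)
		}
	}
	return wouldFit
}
```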
Extend NoScaleUpInfo reporting by running simulation on skipped NodeGroups.
This change introduces the following changes:
* run SchedulablePodGroups on skipped node groups (NGs) during the ScaleUp simulation to check whether the skipped NGs satisfy the predicates of the podEquivalenceGroups:
  * if a skipped NG satisfies a pod group's predicates, it stays in the SkippedNodeGroups list associated with that pod group's pods;
  * otherwise the NG moves to the RejectedNodeGroups.
* run the SchedulablePodGroups simulation even for the AllOrNothing or ExpansionOptionsFilteredOutReason cases, after marking all pods unschedulable: this gives us a better idea of whether the simulation would have succeeded if some NGs had not been skipped.
* since this change introduces scale-up performance overhead, it is gated behind a feature flag.
This change gives the user a better understanding of why a scale-up failed. If a NG is in backoff but does not satisfy the predicates, the user will know that right away, instead of waiting until the NG becomes available and is considered again.
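For the reporting side of the mechanism described above, here is a simplified, self-contained sketch of how a skipped NG could either stay in SkippedNodeGroups or move to RejectedNodeGroups based on the predicate check, with the feature flag preserving the previous behavior when disabled; the types and function below are stand-ins for illustration, not the PR's actual code:

```go
package sketch

// Reasons and NoScaleUpInfo are simplified stand-ins for the status types.
type Reasons interface{ Reasons() []string }

type NoScaleUpInfo struct {
	RejectedNodeGroups map[string]Reasons
	SkippedNodeGroups  map[string]Reasons
}

// classifySkippedNodeGroups keeps a skipped node group in SkippedNodeGroups
// only if the pod group's predicates are satisfied on it (i.e. it would have
// helped had it not been skipped); otherwise it is reported as rejected.
// When the feature is disabled, every skipped group keeps its original
// reason, preserving the previous behavior.
func classifySkippedNodeGroups(info *NoScaleUpInfo, skipped map[string]Reasons, predicatesPass func(nodeGroupID string) bool, rejectedReason Reasons, featureEnabled bool) {
	for nodeGroupID, skipReason := range skipped {
		if !featureEnabled || predicatesPass(nodeGroupID) {
			info.SkippedNodeGroups[nodeGroupID] = skipReason
		} else {
			info.RejectedNodeGroups[nodeGroupID] = rejectedReason
		}
	}
}
```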
Force-pushed from 015d03d to 5b75b41 (compare)
What type of PR is this?
/kind feature
What this PR does / why we need it:
This change gives the user a better understanding of why a scale-up failed and improves overall observability. If a NG is in backoff but does not satisfy the predicates, the user will know that right away, instead of waiting until the NG becomes available and is considered again.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: