Type: Bug
Resolution: Unresolved
Priority: Undefined
Version(s): 4.19.z, 4.20.0
(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
Significant regression detected.
Fisher's Exact probability of a regression: 100.00%.
Test pass rate dropped from 98.81% to 93.80%. (A rough reproduction of this calculation from the raw counts is sketched below the numbers.)
Sample (being evaluated) Release: 4.19
  Start Time: 2025-05-27T00:00:00Z
  End Time: 2025-06-03T16:00:00Z
  Success Rate: 93.80%
  Successes: 121
  Failures: 8
  Flakes: 0
Base (historical) Release: 4.18
  Start Time: 2025-01-26T00:00:00Z
  End Time: 2025-02-25T23:59:59Z
  Success Rate: 98.81%
  Successes: 662
  Failures: 8
  Flakes: 0
View the test details report for additional context.
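As a sanity check, the regression call can be roughly reproduced from the success/failure counts above with a one-tailed Fisher's exact test. The sketch below is illustrative only; the exact calculation Component Readiness performs (including how it treats flakes and rounding) may differ, so the printed percentage won't necessarily match the dashboard's 100.00% exactly.

package main

import (
	"fmt"
	"math"
)

// logChoose returns log(C(n, k)) via the log-gamma function.
func logChoose(n, k float64) float64 {
	ln, _ := math.Lgamma(n + 1)
	lk, _ := math.Lgamma(k + 1)
	lnk, _ := math.Lgamma(n - k + 1)
	return ln - lk - lnk
}

// fisherGreater returns the one-tailed Fisher's exact p-value for the sample
// showing at least as many failures as observed, given failure/total counts
// for the sample and base populations.
func fisherGreater(sampleFail, sampleTotal, baseFail, baseTotal int) float64 {
	totalRuns := sampleTotal + baseTotal
	totalFail := sampleFail + baseFail
	p := 0.0
	for k := sampleFail; k <= totalFail && k <= sampleTotal; k++ {
		// Hypergeometric probability of exactly k sample failures with the margins fixed.
		logP := logChoose(float64(totalFail), float64(k)) +
			logChoose(float64(totalRuns-totalFail), float64(sampleTotal-k)) -
			logChoose(float64(totalRuns), float64(sampleTotal))
		p += math.Exp(logP)
	}
	return p
}

func main() {
	// Counts from the report: sample 4.19 saw 8 failures in 129 runs,
	// base 4.18 saw 8 failures in 670 runs.
	p := fisherGreater(8, 121+8, 8, 662+8)
	fmt.Printf("one-tailed p-value: %.6f\n", p)
	fmt.Printf("approximate regression probability: %.2f%%\n", (1-p)*100)
}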
Test failure always shows:
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] (1h12m32s)
{ fail [k8s.io/kubernetes@v1.32.5/test/e2e/upgrades/apps/daemonsets.go:92]: expected DaemonSet pod to be running on all nodes, it was not
Ginkgo exit error 1: exit with code 1}
This test is actually a vendored upstream kube test.
Digging into the stdout for the test failure:
I0603 09:40:38.456094 1193 fixtures.go:126] Number of nodes with available pods controlled by daemonset ds1: 5
I0603 09:40:38.456119 1193 fixtures.go:131] Node ci-op-y4txrgim-e4826-j68sf-worker-c-mh5gb is running 0 daemon pod, expected 1
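For orientation, the check behind "Number of nodes with available pods controlled by daemonset" is essentially "every node should have a running daemon pod". The snippet below is an illustrative client-go sketch of that shape, not the vendored fixture code; the real check also accounts for unschedulable nodes, tolerations, and pod availability rather than bare pod phase.

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodesMissingDaemonPod is a hypothetical helper: list the nodes, list the
// DaemonSet's pods by label selector, and report nodes that have no
// Running, non-terminating pod.
func nodesMissingDaemonPod(ctx context.Context, c kubernetes.Interface, ns, selector string) ([]string, error) {
	nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return nil, err
	}
	running := map[string]bool{}
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodRunning && p.DeletionTimestamp == nil {
			running[p.Spec.NodeName] = true
		}
	}
	var missing []string
	for _, n := range nodes.Items {
		if !running[n.Name] {
			missing = append(missing, n.Name)
		}
	}
	return missing, nil
}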
These logs in Loki indicate the host did get a ds1 pod:
I0603 09:41:45.761536 1 log.go:245] Awaiting pod deletion.
I0603 09:41:45.761510 1 log.go:245] Shutting down after receiving signal: terminated.
I0603 09:38:25.110302 1 log.go:245] Awaiting pod deletion.
I0603 09:38:25.110262 1 log.go:245] Shutting down after receiving signal: terminated.
I0603 09:40:39.573562 1 log.go:245] Serving on port 9376.
I0603 08:29:10.584175 1 log.go:245] Serving on port 9376.
I0603 08:29:10.584175 1 log.go:245] Serving on port 9376.
However, it's possible the pod wasn't fully running at the time of the check: the test reports the failure at 09:40:38, and the pod logs that it is serving on port 9376 at 09:40:39, one second later. Is this a race condition, i.e. is the test not waiting long enough?
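If the check really is racing a pod that is still starting (or the poll window is too short), the usual remedy is to poll the condition with a generous timeout rather than assert it at a single point in time. A sketch, reusing the hypothetical nodesMissingDaemonPod helper above; the namespace, label selector, interval, and timeout are placeholders:

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonPodsOnAllNodes retries the per-node check until it passes or
// the timeout expires, instead of failing on the first observation.
func waitForDaemonPodsOnAllNodes(ctx context.Context, c kubernetes.Interface) error {
	var missing []string
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			var pollErr error
			missing, pollErr = nodesMissingDaemonPod(ctx, c, "ds-namespace", "daemonset-name=ds1")
			if pollErr != nil {
				return false, pollErr
			}
			return len(missing) == 0, nil
		})
	if err != nil {
		return fmt.Errorf("DaemonSet pod not running on all nodes, missing on %v: %w", missing, err)
	}
	return nil
}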
Note that the Loki logs don't appear to be in order; there must be some delay between logging and ingestion. I believe the timestamps above are as logged on the origin system/pod.
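Because the ingested lines come back interleaved, a small helper that re-sorts them by the klog header timestamp makes the sequence easier to follow. This is a convenience sketch that assumes the standard klog prefix (e.g. "I0603 09:41:45.761536"); the year is not part of the header, so ordering only holds within a single run.

package main

import (
	"fmt"
	"sort"
	"time"
)

// sortByKlogTimestamp orders log lines by the timestamp in their klog header,
// e.g. "I0603 09:41:45.761536 1 log.go:245] ...".
func sortByKlogTimestamp(lines []string) []string {
	const layout = "0102 15:04:05.000000" // MMDD HH:MM:SS.micros
	parse := func(line string) time.Time {
		if len(line) < 21 {
			return time.Time{}
		}
		t, err := time.Parse(layout, line[1:21])
		if err != nil {
			return time.Time{}
		}
		return t
	}
	sorted := append([]string(nil), lines...)
	sort.Slice(sorted, func(i, j int) bool {
		return parse(sorted[i]).Before(parse(sorted[j]))
	})
	return sorted
}

func main() {
	lines := []string{
		"I0603 09:41:45.761536 1 log.go:245] Awaiting pod deletion.",
		"I0603 09:38:25.110302 1 log.go:245] Awaiting pod deletion.",
		"I0603 09:40:39.573562 1 log.go:245] Serving on port 9376.",
		"I0603 08:29:10.584175 1 log.go:245] Serving on port 9376.",
	}
	for _, l := range sortByKlogTimestamp(lines) {
		fmt.Println(l)
	}
}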