- Bug
- Resolution: Done
- Major
- None
- 4.15.z
- Moderate
- None
- False
Description of problem:
The customer upgraded a three-node bare-metal cluster with schedulable masters from 4.14.44 to 4.15.45.
After the Machine Config Operator triggered the first reboot (machine-config CO and network CO still on version 4.14), the sosreport of the node started to show IPSec errors.
Pod communication to other nodes started to fail, and the node did not come back healthy.
At this point it was decided to pause the Machine Config Pool so that the cluster would not end up with two nodes unavailable.
Attempts to access arbitrary URLs in the cluster returned errors, and the workload was affected.
With the Machine Config Pool paused, we disabled IPSec; the daemonsets and pods were removed, the node was rebooted, and the MCP was unpaused.
With IPSec disabled, the cluster could be upgraded, and IPSec was re-enabled afterwards.
Version-Release number of selected component (if applicable):
4.15.45
How reproducible:
I tried to reproduce the issue with a three-node cluster on AWS, but the reproducer did not trigger it; the cluster upgraded just fine. I have no way to verify this on a bare-metal cluster. The issue could be related to bare metal and/or other networking configuration.
Steps to Reproduce:
n/a
Actual results:
The upgrade of an IPSec-enabled cluster to 4.15.45 failed, although the known IPSec issues had been resolved by pinning a working libreswan version. The upgrade risk had been removed, and the customer was surprised by this issue because nothing indicated that IPSec should be disabled in this scenario.
Expected results:
A cluster upgrade with IPSec enabled should work.
Additional info:
Affected Platforms:
It is a customer issue. An sosreport was gathered. It captured the following timings:
Boots:
-- Boot 715ec5ad93d34fc6bfea58276ca92702 --
Feb 24 09:55:52
-- Boot 3b83e5b40b8745a39e5bb2e8d2b364aa --
Feb 24 10:42:37
-- Boot f717881f9ef44ebfbd68279ac3ffbf41 -- IPSec was disabled
Feb 24 11:47:52
-- Boot 0d696af186c94580a3a26257c675f0ff -- Regular reboot triggered by MachineConfig
Feb 24 12:43:02
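For orientation, the gaps between the recorded boots can be computed from the timestamps above. This is only a minimal Python sketch: the sosreport excerpt does not state the year, so `ASSUMED_YEAR` is a placeholder, and the helper names `parse` and `boot_gaps` are made up for illustration.

```python
from datetime import datetime

# Assumption: the excerpt does not include a year; any non-leap-sensitive
# value works because all boots fall on the same day.
ASSUMED_YEAR = 2025

# Boot IDs and start times exactly as listed in the report.
boots = [
    ("715ec5ad93d34fc6bfea58276ca92702", "Feb 24 09:55:52"),
    ("3b83e5b40b8745a39e5bb2e8d2b364aa", "Feb 24 10:42:37"),
    ("f717881f9ef44ebfbd68279ac3ffbf41", "Feb 24 11:47:52"),
    ("0d696af186c94580a3a26257c675f0ff", "Feb 24 12:43:02"),
]


def parse(ts):
    """Parse a syslog-style timestamp (English month abbreviation)."""
    return datetime.strptime(f"{ASSUMED_YEAR} {ts}", "%Y %b %d %H:%M:%S")


def boot_gaps(entries):
    """Return the time elapsed between each pair of consecutive boots."""
    times = [parse(ts) for _, ts in entries]
    return [later - earlier for earlier, later in zip(times, times[1:])]


gaps = boot_gaps(boots)
for (boot_id, _), gap in zip(boots[1:], gaps):
    print(f"boot {boot_id[:8]} started {gap} after the previous boot")
```

The second gap (between the 10:42 and 11:47 boots) is the window in which the node ran with the Machine Config Pool paused before IPSec was disabled.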
After the boot at 09:55, the sosreport shows IPSec error messages such as:
Feb 24 09:58:13 pluto[18037]: EXPECTATION FAILED: *p == ((void *)0) (unpack_string() +172 /lib/libwhack/whacklib.c)
lasting until
Feb 24 11:40:30 pluto[11682]: EXPECTATION FAILED: *p == ((void *)0) (unpack_string() +172 /lib/libwhack/whacklib.c)
Between the reboots at 10:42 and 11:47, additional messages appear, starting at 10:44:
Feb 24 10:44:22 pluto[11682]: packet from 141.73.143.5:500: CREATE_CHILD_SA request has no corresponding IKE SA; message dropped
and lasting until IPSec was disabled and the node was rebooted at 11:47.
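If the full journal is pulled from Splunk or the sosreport, the two pluto message types quoted above could be summarized per type (first seen, last seen, count, pluto PIDs) with something like the following. This is a rough Python sketch, not an official tool: the signature labels are hypothetical names chosen here, the optional hostname field is an assumption about the full journal format, and the input is assumed to be in chronological order.

```python
import re

# Assumed line shape: "<Mon> <day> <HH:MM:SS> [hostname] pluto[<pid>]: <message>".
# The hostname is optional because the excerpts in this report omit it.
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?:\S+ )?pluto\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)

# Hypothetical labels mapped to substrings taken from the quoted log lines.
SIGNATURES = {
    "whack_expectation_failed": "EXPECTATION FAILED",
    "child_sa_without_ike_sa": "CREATE_CHILD_SA request has no corresponding IKE SA",
}


def summarize(lines):
    """Group pluto lines by signature; lines must be in chronological order."""
    summary = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a pluto line in the assumed format
        for name, needle in SIGNATURES.items():
            if needle in match["msg"]:
                entry = summary.setdefault(
                    name,
                    {"first": match["ts"], "last": match["ts"], "count": 0, "pids": set()},
                )
                entry["last"] = match["ts"]
                entry["count"] += 1
                entry["pids"].add(int(match["pid"]))
    return summary


# The three lines quoted in this report, in time order.
sample = [
    "Feb 24 09:58:13 pluto[18037]: EXPECTATION FAILED: *p == ((void *)0) (unpack_string() +172 /lib/libwhack/whacklib.c)",
    "Feb 24 10:44:22 pluto[11682]: packet from 141.73.143.5:500: CREATE_CHILD_SA request has no corresponding IKE SA; message dropped",
    "Feb 24 11:40:30 pluto[11682]: EXPECTATION FAILED: *p == ((void *)0) (unpack_string() +172 /lib/libwhack/whacklib.c)",
]

result = summarize(sample)
for name, entry in result.items():
    print(name, entry["first"], "->", entry["last"], "count:", entry["count"], "pids:", sorted(entry["pids"]))
```

Run against the complete log, the per-type PID sets would also show which pluto process (and therefore which boot) each message type belongs to.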
No must-gather was taken at the time of the issue, but the logs are in Splunk, so if specific pod logs are needed I can ask for them.
- duplicates
- OCPBUGS-50582 [release-4.18] Graceful cleanup of IPsec states
- Closed