Postmortem: GPU Driver Crashed the Kubernetes Master Node
General Information
- Date of Incident: 2025-03-03 14:15:34
- Owner(s): Barush Mendez (barush1203@gmail.com)
- Shared With: @Repo_viewers
- Status: Final
Executive Summary
Impact:
- All services were unresponsive, and the Kube-API was down for 6 days, making Kubernetes management impossible.
Root Cause:
- A broken Nvidia GPU driver installation on the K8s master node caused a Kernel Panic at boot, leaving the node unable to start. The broken packages had been present for weeks but only became apparent when a power outage forced a reboot.
Resolution:
- The issue was resolved by booting into a previous kernel version, removing broken packages, reinstalling necessary dependencies, and updating the system.
Problem Summary
- Duration: Main outage: Mon, March 3, 14:15:34 to Sun, March 9, 19:45:21
- Affected Services/Products:
- V-Rising Server
- Kube-API
- Grafana
- Jellyfin
- Linkding
- Scope: 100% of hosted services were unavailable.
- User Impact: All applications hosted on Kubernetes were inaccessible for 6 days, and Kubernetes management was not possible due to the Kube-API outage.
- Revenue Impact: Since this is a homelab, there was no financial impact.
Detection
- How It Was Detected: The issue was detected manually when services failed to respond; there was no automated alerting in place (a scriptable health check is sketched after this list).
- Time to Detection: 3 hours.
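The Kube-API exposes health endpoints that make this kind of check scriptable. A minimal probe, assuming a kubeadm-style cluster; the 192.168.1.10 address is a placeholder for the master node's private IP:

```bash
# Probe the kube-apiserver health endpoint directly; kubeadm clusters typically
# allow unauthenticated access to /healthz. The IP below is a placeholder.
curl -k https://192.168.1.10:6443/healthz

# From any machine with a kubeconfig, confirm the control plane and nodes are reachable.
kubectl cluster-info
kubectl get nodes
```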
Root Causes and Trigger
The Kubernetes master node is equipped with an Nvidia GPU to provide hardware acceleration for the Jellyfin service. During installation, the nvidia-dkms-390 driver failed to build against the running kernel, and as a result the kernel headers and other dependencies also failed to configure. The issue was left unattended, and the system remained operational until a power outage forced a reboot.
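A failed DKMS build like this can be caught at install time rather than at the next reboot. A minimal check, where the log path is an assumption based on the standard DKMS layout for the nvidia-dkms-390 package:

```bash
# Show whether each DKMS module (including the NVIDIA driver) actually built
# and installed for the running kernel.
dkms status

# If the module is missing or marked broken, inspect its build log; the exact
# version directory is an assumption and should match what dkms status reports.
less /var/lib/dkms/nvidia/*/build/make.log

# List packages that are not in the fully-installed "ii" state.
dpkg -l | grep -v '^ii'

# Report packages left half-configured or otherwise broken.
dpkg --audit
```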
Upon reboot, the system encountered a Kernel Panic, rendering the master node unbootable. Since this node was the only master node in the cluster, its failure resulted in:
- Total cluster downtime (no Kubernetes management available).
- All hosted applications becoming inaccessible since services relied on the master node’s private IP.
This incident highlights the risks of a single-point-of-failure architecture and the need for a high-availability (HA) cluster setup.
Resolution
- Steps Taken:
- Since remote access was impossible, a monitor was connected to review boot logs, revealing the Kernel Panic.
- Booted using a previous kernel version, which restored system functionality.
- Attempted to update the server with `sudo apt update && sudo apt upgrade`, but this failed due to broken packages.
- Removed the broken packages using `sudo apt-get remove --purge '^nvidia-.*'`, then ran `sudo dpkg --configure -a` to fix package configuration issues.
- Executed `dpkg -l | grep -v ^ii` to identify any remaining broken packages and manually removed them with `sudo apt-get remove --purge <package_name>`.
- Ran `sudo apt-get install -f` to fix missing dependencies and executed `sudo apt-get install --reinstall linux-headers-$(uname -r)` to ensure the kernel headers were properly configured.
- Rebooted the machine and confirmed that the Kernel Panic was resolved. (The full command sequence is collected into a single sketch after this list.)
- Time to Resolution: 6 days.
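For future incidents, the recovery sequence above can be kept as a single script. This is a sketch only, assuming a Debian/Ubuntu system that has already been booted into a known-good previous kernel from the GRUB menu:

```bash
#!/usr/bin/env bash
# Recovery sketch: clean up a broken NVIDIA/DKMS installation after booting
# into a previous, working kernel. Run with care; this purges all NVIDIA packages.
set -euo pipefail

# Purge every NVIDIA package, since the failed nvidia-dkms-390 build left dpkg
# in an inconsistent state.
sudo apt-get remove --purge '^nvidia-.*'

# Finish configuring any packages that were left half-configured.
sudo dpkg --configure -a

# Show anything still not in the fully-installed "ii" state for manual cleanup.
dpkg -l | grep -v '^ii'

# Pull in missing dependencies and reinstall the headers for the running kernel.
sudo apt-get install -f
sudo apt-get install --reinstall "linux-headers-$(uname -r)"

# Bring the system up to date, then reboot to confirm the Kernel Panic is gone.
sudo apt update && sudo apt upgrade
sudo reboot
```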
Lessons Learned
What Went Well
- The issue was relatively simple to resolve once identified.
- No data loss occurred.
What Went Poorly
- There were no automated alerts indicating the outage.
- Resolution time was too long due to a lack of monitoring and immediate detection.
- Single-node master architecture resulted in complete cluster downtime.
Where We Got Lucky
- No luck here; this was a full failure scenario.
Action Items
Prevention Measures
- Ensure GPU driver installations are fully successful or immediately revert changes if errors occur.
- Migrate cluster to a high-availability (HA) setup to prevent a single point of failure.
- Expose the Kube-API through a load balancer in an HA setup to improve reliability (a kubeadm sketch follows this list).
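A minimal sketch of what the HA migration could look like with kubeadm. The load-balancer address k8s-lb.home.lan:6443 and the join credentials are placeholders, not values from this cluster:

```bash
# On the first control-plane node: initialise the cluster against a stable,
# load-balanced endpoint instead of this node's private IP, and upload the
# control-plane certificates so additional masters can join.
sudo kubeadm init \
  --control-plane-endpoint "k8s-lb.home.lan:6443" \
  --upload-certs

# On each additional control-plane node: join as a control-plane member using
# the token, CA hash, and certificate key printed by the init command above.
sudo kubeadm join k8s-lb.home.lan:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <key>
```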
Emergency Response Improvements
- Set up a ticketing system to document incidents and track resolutions.
- Improve response times for homelab issues by creating predefined troubleshooting workflows, such as the boot-log triage sketched below.
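One workflow worth writing down first is boot-failure triage: once the node is reachable again (for example after booting a previous kernel), the failed boot's kernel log can be reviewed remotely. A sketch, assuming persistent journald logging is enabled on the node:

```bash
# List the boots journald has recorded; the previous (crashed) boot is index -1.
journalctl --list-boots

# Review kernel messages from the previous boot. Failed module loads and most
# boot errors land here; a hard panic very early in boot may still need console access.
journalctl -k -b -1 --no-pager | tail -n 200
```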
Monitoring & Alerting Enhancements
- Implement infrastructure monitoring (e.g., Prometheus + Grafana) to detect issues proactively.
- Set up blackbox monitoring to detect failures outside the Kubernetes cluster (e.g., HTTP/TCP checks for critical services).
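Until the Prometheus/blackbox setup is in place, even a small external probe run from a machine outside the cluster would have caught this outage within minutes rather than hours. A sketch in which the target URLs and the notification command are placeholders:

```bash
#!/usr/bin/env bash
# Minimal external HTTP probe, intended to run via cron from a host outside
# the Kubernetes cluster. All URLs and the alert action are placeholders.
set -u

targets=(
  "https://jellyfin.home.lan/health"
  "https://grafana.home.lan/api/health"
  "https://k8s-master.home.lan:6443/healthz"
)

for url in "${targets[@]}"; do
  # -k because homelab endpoints often use self-signed certificates (assumption);
  # -f makes curl fail on HTTP errors; --max-time bounds hung connections.
  if ! curl -ksf --max-time 10 "$url" > /dev/null; then
    # Replace echo with a real notification (ntfy, email, webhook, ...).
    echo "ALERT: $url unreachable at $(date -Is)"
  fi
done
```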