
Postmortem: GPU Driver Crashed the Kubernetes Master Node

General Information


Executive Summary

Impact:

Root Cause:

Resolution:


Problem Summary


Detection


Root Causes and Trigger

The Kubernetes master node is equipped with an NVIDIA GPU to provide hardware acceleration for the Jellyfin service. When the nvidia-dkms-390 driver was installed, its kernel module failed to build against the installed kernel version. As a result, the kernel headers and other dependent packages also failed to configure. The failure was left unattended, and the system kept running on the already-loaded kernel until a power outage forced a reboot.

Upon reboot, the system hit a kernel panic, rendering the master node unbootable. Because this node was the only master node in the cluster, its failure resulted in:

This incident highlights the risks of a single-point-of-failure architecture and the need for a high-availability (HA) cluster setup.
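
The unattended driver failure described above is the kind of state that can be checked for before a planned reboot. The following is a minimal sketch, not the procedure used in this incident. It assumes a Debian/Ubuntu host where `dpkg --audit` lists packages left half-installed or unconfigured, and where each line of `dkms status` ends with `: <state>`. The function names `dkms_problems`, `run`, and `safe_to_reboot` are hypothetical, and the exact `dkms status` line format varies between DKMS versions, so the parsing should be treated as an assumption.

```python
import subprocess


def dkms_problems(status_text: str) -> list[str]:
    """Return the `dkms status` lines whose state is not 'installed'.

    Assumption: every line ends with ': <state>'. Lines that do not match
    are reported as problems so that unknown formats fail loudly.
    """
    problems = []
    for line in status_text.splitlines():
        line = line.strip()
        if not line:
            continue
        _, sep, state = line.rpartition(":")
        if not sep or not state.strip().startswith("installed"):
            problems.append(line)
    return problems


def run(cmd: list[str]) -> str:
    """Run a command and return its stdout, or '' if the tool is absent."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return ""
    return result.stdout


def safe_to_reboot() -> bool:
    """Report broken packages and DKMS modules; True if none were found."""
    broken_packages = run(["dpkg", "--audit"]).strip()
    broken_modules = dkms_problems(run(["dkms", "status"]))

    if broken_packages:
        print("dpkg reports packages in a broken state:")
        print(broken_packages)
    for line in broken_modules:
        print(f"DKMS module not installed: {line}")

    return not broken_packages and not broken_modules
```

Calling `safe_to_reboot()` before maintenance, or from a periodic job, would surface a failed driver build while the node is still running. A missing tool is treated as "nothing to report", which is a deliberate simplification; a stricter check might treat it as an error.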


Resolution


Lessons Learned

What Went Well

What Went Poorly

Where We Got Lucky


Action Items

Prevention Measures

Emergency Response Improvements

Monitoring & Alerting Enhancements