Kubernetes Pod Abnormal Restart Inspection
Background
Kubernetes helps users automatically schedule and scale containerized applications, but modern Kubernetes environments are becoming increasingly complex. When platform and application engineers need to investigate events in dynamic, containerized environments, finding the most meaningful signals may involve many trial-and-error steps. Through intelligent inspection, exceptions can be filtered based on the current search context, thus accelerating event investigation, reducing pressure on engineers, decreasing mean time to repair, and improving end-user experience.
Prerequisites
- Enable 「Container Data Collection」 in TrueWatch
- Deploy [DataFlux Func](../quick-start/) offline
- Enable the script market for your self-built DataFlux Func
- Create an API Key for operations under 「Management / API Key Management」 in TrueWatch
- Install 「Self-built Inspection Core Package」, 「Algorithm Library」, and 「Self-built Inspection (K8S-Pod Restart Detection)」 via 「Script Market」 in your self-built DataFlux Func
- Write a custom inspection processing function in your self-built DataFlux Func
- Create a scheduled task (Old version: Automatic Trigger Configuration) for the written function under 「Management / Scheduled Tasks (Old version: Automatic Trigger Configuration)」 in your self-built DataFlux Func.
If you plan to deploy DataFlux Func offline on a cloud server, deploy it with the same cloud provider and in the same region as the TrueWatch SaaS deployment you currently use.
Configure Inspection
Create a new script set in your self-built DataFlux Func to enable the configuration for Kubernetes Pod abnormal restart inspection.
Enable Inspection
After configuring the inspection in DataFlux Func, you can test it by selecting the run() method directly on the page and clicking Run. After publishing, you can view and configure the task under 「Management / Scheduled Tasks」 in DataFlux Func.
View Events
Using its inspection algorithms, the intelligent inspection looks for abnormal Pod restarts in the currently configured clusters. For each anomaly it detects, it generates a corresponding event, which you can then view in the 「Event Center」.
Event Details
- Event Summary: Describes the object and content of the anomaly inspection event
- Abnormal Pods: You can view the status of abnormal pods under the current namespace
- Container Status: You can view the error time, container ID, status, current resource usage, and container type; clicking the container ID opens the details page for that container
Common Issues
1. How do I configure the detection frequency for Kubernetes Pod abnormal restart inspection?
- In your self-built DataFlux Func, add fixed_crontab='*/30 * * * *', timeout=900 to the decorator when writing the custom inspection processing function, then configure the task under 「Management / Scheduled Tasks (Old version: Automatic Trigger Configuration)」.
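To make the schedule concrete, the following minimal sketch expands the minute field of the crontab expression above. It assumes standard five-field cron semantics and that the timeout value is in seconds. The helper `expand_minute_field` is hypothetical, handles only `*`, `*/N`, and a plain number, and is not part of DataFlux Func.

```python
def expand_minute_field(field):
    """Return the sorted minutes (0-59) matched by a cron minute field.

    Only '*', '*/N' and a plain number are handled. This is an
    illustrative helper, not part of DataFlux Func.
    """
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(0, 60, step))
    minute = int(field)
    if not 0 <= minute <= 59:
        raise ValueError("minute out of range")
    return [minute]


fixed_crontab = "*/30 * * * *"  # value from the decorator arguments above
timeout = 900                   # assumed to be in seconds

# The first of the five cron fields is the minute field.
minutes = expand_minute_field(fixed_crontab.split()[0])

# Rough spacing between runs; valid here because the minutes are evenly spaced.
interval_seconds = (60 // len(minutes)) * 60

print(minutes)                     # [0, 30]
print(timeout < interval_seconds)  # True
```

Under these assumptions the task runs at minute 0 and minute 30 of every hour, and a 900-second timeout ends a run well before the next one starts.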
2. Why might no anomaly analysis be triggered during Kubernetes Pod abnormal restart inspection?
When the inspection report does not include anomaly analysis, check the data collection status of the current datakit.
3. Under what circumstances will Kubernetes Pod abnormal restart inspection events be generated?
The inspection uses the percentage of restarted Pods per cluster_name + namespace as its entry point. If this metric has increased within the last 30 minutes, the inspection generates an event and performs root cause analysis.
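The trigger condition can be sketched roughly as follows. This is an illustration only, not the inspection package's actual implementation: the record fields (`cluster_name`, `namespace`, `restarts`), the function names, and the choice to compare two snapshots taken 30 minutes apart are all assumptions made for the example.

```python
from collections import defaultdict


def restart_percentage(pods):
    """Map (cluster_name, namespace) to the percentage of pods with restarts."""
    total = defaultdict(int)
    restarted = defaultdict(int)
    for pod in pods:
        key = (pod["cluster_name"], pod["namespace"])
        total[key] += 1
        if pod["restarts"] > 0:
            restarted[key] += 1
    return {key: 100.0 * restarted[key] / total[key] for key in total}


def groups_to_alert(previous_pods, current_pods):
    """Return the groups whose restart percentage rose between two snapshots."""
    before = restart_percentage(previous_pods)
    after = restart_percentage(current_pods)
    return sorted(key for key, pct in after.items() if pct > before.get(key, 0.0))


# Hypothetical snapshots: one from 30 minutes ago, one from now.
previous = [
    {"cluster_name": "prod", "namespace": "web", "restarts": 0},
    {"cluster_name": "prod", "namespace": "web", "restarts": 0},
    {"cluster_name": "prod", "namespace": "db", "restarts": 1},
    {"cluster_name": "prod", "namespace": "db", "restarts": 0},
]
current = [
    {"cluster_name": "prod", "namespace": "web", "restarts": 2},
    {"cluster_name": "prod", "namespace": "web", "restarts": 0},
    {"cluster_name": "prod", "namespace": "db", "restarts": 1},
    {"cluster_name": "prod", "namespace": "db", "restarts": 0},
]

print(groups_to_alert(previous, current))  # [('prod', 'web')]
```

In this sample, the web namespace rises from 0% to 50% restarted Pods and would produce an event, while db stays at 50% and would not.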
4. What should I do if previously working scripts start failing during inspection?
Update the referenced script sets in the script market of DataFlux Func. The Change Log lists the script market's update records, so you can see when updates are available and apply them promptly.