Kubernetes Pod Abnormal Restart Inspection

Background

Kubernetes helps users automatically schedule and scale containerized applications, but modern Kubernetes environments are becoming increasingly complex. When platform and application engineers need to investigate events in dynamic, containerized environments, finding the most meaningful signals may involve many trial-and-error steps. Through intelligent inspection, exceptions can be filtered based on the current search context, thus accelerating event investigation, reducing pressure on engineers, decreasing mean time to repair, and improving end-user experience.

Prerequisites

Enable 「Container Data Collection」 in TrueWatch
Deploy [DataFlux Func](../quick-start/) offline
Enable the script market for your self-built DataFlux Func
Create an API Key for operations under 「Management / API Key Management」 in TrueWatch
Install 「Self-built Inspection Core Package」, 「Algorithm Library」, and 「Self-built Inspection (K8S-Pod Restart Detection)」 via 「Script Market」 in your self-built DataFlux Func
Write a custom inspection processing function in your self-built DataFlux Func
Create a scheduled task (Old version: Automatic Trigger Configuration) for the written function under 「Management / Scheduled Tasks (Old version: Automatic Trigger Configuration)」 in your self-built DataFlux Func.

If you deploy DataFlux Func offline on a cloud server, make sure it runs with the same cloud provider and in the same region as the TrueWatch SaaS deployment you currently use.

Configure Inspection

Create a new script set in your self-built DataFlux Func to enable the configuration for Kubernetes Pod abnormal restart inspection.

Python
from guance_monitor__runner import Runner
from guance_monitor__register import self_hosted_monitor
import guance_monitor_k8s_pod_restart__main as k8s_pod_restart


# Workspace API_KEY configuration (configured by user)
API_KEY_ID  = 'wsak_xxx'
API_KEY     = '5Kxxx'

# Filter precedence: what gets inspected can be restricted either by the filter function below or by the detection settings under Monitoring / Intelligent Inspection in Studio. If a filter function is configured here, you do not need to change the detection settings in Studio. If both are configured, the filter function in this script takes precedence.

def filter_namespace(cluster_namespaces):
    '''
    Filter namespaces by custom conditions.
    Return True for namespaces that should be inspected, False otherwise.
    '''

    cluster_name = cluster_namespaces.get('cluster_name','')
    namespace = cluster_namespaces.get('namespace','')  # available for namespace-based conditions
    # Return an explicit bool so non-matching entries yield False rather than None
    return cluster_name in ['k8s-prod']

'''
Task configuration parameters should use:
@DFF.API('K8S-Pod Abnormal Restart Inspection', fixed_crontab='*/30 * * * *', timeout=900)

fixed_crontab: Fixed execution frequency 「every 30 minutes」
timeout: Task execution timeout in seconds; 900 keeps each run within 15 minutes
'''

# Kubernetes Pod abnormal restart inspection configuration - no modification needed by user
@self_hosted_monitor(API_KEY_ID, API_KEY)
@DFF.API('K8S-Pod Abnormal Restart Inspection', fixed_crontab='*/30 * * * *', timeout=900)
def run(configs=[]):
    """
    Parameters:
        configs:
            cluster_name: cluster name (optional; if omitted, detection is based on namespace only)
            namespace: namespace (required)

        Example configuration: namespace accepts either a single name or a list of names
        configs = [
        {
            "cluster_name": "xxx",
            "namespace": ["xxx1", "xxx2"]
        },
        {
            "cluster_name": "yyy",
            "namespace": "yyy1"
        }
        ]

    """
    checkers = [
        # Configure Kubernetes Pod abnormal restart inspection
        k8s_pod_restart.K8SPodRestartCheck(configs=configs, filters=[filter_namespace]),
    ]

    Runner(checkers, debug=False).run()
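
To see which cluster and namespace pairs a filter would let through, the standalone sketch below can be run outside DataFlux Func. It assumes that each config entry is expanded into one dict per namespace with the keys cluster_name and namespace, matching what filter_namespace reads above. That expansion, the helper expand_configs, and the names ALLOWED_CLUSTERS and EXCLUDED_NAMESPACES are illustrative assumptions, not documented behavior of the inspection package.

```python
# Standalone sketch: does not import or call the inspection package.
ALLOWED_CLUSTERS = {'k8s-prod'}        # assumed cluster names to inspect
EXCLUDED_NAMESPACES = {'kube-system'}  # assumed namespaces to skip


def filter_namespace(cluster_namespaces):
    """Return True only for allowed clusters whose namespace is not excluded."""
    cluster_name = cluster_namespaces.get('cluster_name', '')
    namespace = cluster_namespaces.get('namespace', '')
    return cluster_name in ALLOWED_CLUSTERS and namespace not in EXCLUDED_NAMESPACES


def expand_configs(configs):
    """Hypothetical helper: flatten configs into one dict per namespace."""
    pairs = []
    for entry in configs:
        namespaces = entry.get('namespace', [])
        if isinstance(namespaces, str):  # a single namespace may be a plain string
            namespaces = [namespaces]
        for ns in namespaces:
            pairs.append({'cluster_name': entry.get('cluster_name', ''),
                          'namespace': ns})
    return pairs


configs = [
    {'cluster_name': 'k8s-prod', 'namespace': ['default', 'kube-system']},
    {'cluster_name': 'k8s-test', 'namespace': 'default'},
]
selected = [pair for pair in expand_configs(configs) if filter_namespace(pair)]
print(selected)
```

Checking a filter this way before publishing the script makes it easier to confirm that it always returns an explicit True or False for every entry.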

Enable Inspection

After configuring the inspection in DataFlux Func, you can test it by selecting the run() method directly on the page and clicking Run. After publishing, you can view and configure the task under 「Management / Scheduled Tasks」 in DataFlux Func.

View Events

Based on its inspection algorithms, the intelligent inspection looks for abnormal Pod restarts within the configured clusters. For each anomaly it detects, it generates a corresponding event, which you can view in the 「Event Center」.

Event Details

Event Summary: Describes the object and content of the anomaly inspection event
Abnormal Pods: You can view the status of abnormal pods under the current namespace
Container Status: You can view the detailed error time, container ID, status, current resource usage, and container type; clicking the container ID opens the details page for that container

Common Issues

1. How to configure the detection frequency for Kubernetes Pod abnormal restart inspection

When writing the custom inspection processing function in your self-built DataFlux Func, add fixed_crontab='*/30 * * * *', timeout=900 to the @DFF.API decorator, then configure the task under 「Management / Scheduled Tasks (Old version: Automatic Trigger Configuration)」.
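
As an illustration of what the crontab expression '*/30 * * * *' means, the sketch below lists the next few times a step expression in the minute field would fire. It relies only on standard cron semantics (the task fires whenever the minute is divisible by the step, so at minute 0 and 30 of every hour); the function next_runs is a hypothetical helper and does not reproduce the DataFlux Func scheduler.

```python
from datetime import datetime, timedelta


def next_runs(start, step_minutes=30, count=3):
    """Return the next `count` fire times of '*/step * * * *' strictly after `start`."""
    # Cron fires at second 0 of every minute whose value is divisible by the step.
    t = start.replace(second=0, microsecond=0) + timedelta(minutes=1)
    runs = []
    while len(runs) < count:
        if t.minute % step_minutes == 0:
            runs.append(t)
        t += timedelta(minutes=1)
    return runs


for run_time in next_runs(datetime(2024, 1, 1, 10, 7, 30)):
    print(run_time.strftime('%H:%M'))
# prints 10:30, 11:00 and 11:30 on separate lines
```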

2. Why might there be no anomaly analysis triggered during Kubernetes Pod abnormal restart inspection

If the inspection report does not include anomaly analysis, check whether DataKit is currently collecting data as expected.

3. Under what circumstances will Kubernetes Pod abnormal restart inspection events be generated

The inspection tracks the percentage of restarted pods for each cluster_name + namespace combination. If this metric has increased within the last 30 minutes, the inspection generates an event and performs root cause analysis.
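
The trigger condition can be pictured with the minimal sketch below. It is not the inspection's actual algorithm: the pod records, the restarts field, and both function names are assumptions made for illustration, and the comparison simply checks whether the restarted-pod percentage rose between two consecutive windows.

```python
def restarted_pod_percentage(pods):
    """Percentage (0-100) of pods with at least one restart in the window."""
    if not pods:
        return 0.0
    restarted = sum(1 for pod in pods if pod.get('restarts', 0) > 0)
    return 100.0 * restarted / len(pods)


def should_generate_event(previous_window, current_window):
    """True when the restarted-pod percentage increased between the two windows."""
    return (restarted_pod_percentage(current_window)
            > restarted_pod_percentage(previous_window))


# Hypothetical pod data for one cluster_name + namespace combination
previous = [{'name': 'api-1', 'restarts': 0}, {'name': 'api-2', 'restarts': 0}]
current = [{'name': 'api-1', 'restarts': 3}, {'name': 'api-2', 'restarts': 0}]

print(restarted_pod_percentage(current))         # 50.0
print(should_generate_event(previous, current))  # True
```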

4. What to do if previously normal scripts encounter errors during inspection

Update the referenced script sets in the script market of DataFlux Func. The script market's Change Log lists the update records, so you can see when a newer version is available and update the scripts promptly.