Application Performance Inspection

Background

"Application Performance Inspection" is built on the APM incident root cause analysis detector. Select the service, resource, project, and env to be monitored, and the inspection regularly checks application performance intelligently. When an application service's metrics become abnormal, it automatically analyzes upstream and downstream service information to identify the root cause incident for the application.

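To make the upstream/downstream idea concrete, here is a minimal sketch of one common way to localize a root cause. It assumes the service call graph and the set of services with abnormal metrics are already known. This is not the TrueWatch detector's actual algorithm; `find_root_causes` and its data shapes are hypothetical.

```python
from collections import deque


def find_root_causes(calls, anomalous, entry):
    """Hypothetical sketch: walk downstream from `entry` through abnormal
    services only, and return the abnormal services that have no abnormal
    downstream callee (the candidate root causes).

    calls: dict mapping a service to the list of services it calls.
    anomalous: set of services whose metrics are currently abnormal.
    entry: the service where the anomaly was first observed.
    """
    if entry not in anomalous:
        return []
    roots = []
    seen = {entry}
    queue = deque([entry])
    while queue:
        svc = queue.popleft()
        bad_callees = [c for c in calls.get(svc, []) if c in anomalous]
        if not bad_callees:
            # Nothing abnormal further downstream: svc is a candidate root cause.
            roots.append(svc)
        for callee in bad_callees:
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return sorted(roots)


calls = {
    "ruoyi-gateway": ["ruoyi-modules-system"],
    "ruoyi-modules-system": ["mysql", "redis"],
}
anomalous = {"ruoyi-gateway", "ruoyi-modules-system", "mysql"}
print(find_root_causes(calls, anomalous, "ruoyi-gateway"))  # ['mysql']
```

The gateway and the system module are only reporting symptoms here; the walk stops at `mysql` because nothing it calls is abnormal. A real detector also has to handle call cycles and noisy metrics, which this sketch ignores.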
Prerequisites

An application already integrated into TrueWatch "APM"
A self-hosted, offline deployment of DataFlux Func
The Script Market enabled in the self-hosted DataFlux Func
An API Key for operations, created in TrueWatch "Manage / API Key Management"
The following packages installed from the "Script Market" in the self-hosted DataFlux Func: "Self-built Inspection Core Package", "Algorithm Library", and "Self-built Inspection (APM Performance)"
A self-built inspection processing function written in the self-hosted DataFlux Func
An automatic trigger configuration for that function, created in the self-hosted DataFlux Func under "Manage / Automatic Trigger Configuration"

If you plan to deploy DataFlux Func offline on a cloud server, make sure the server uses the same cloud provider and region as your current TrueWatch SaaS deployment.

Configure Inspection

In the self-hosted DataFlux Func, create a new script set and add the following script to enable application performance inspection:

Python
from guance_monitor__register import self_hosted_monitor
from guance_monitor__runner import Runner
import guance_monitor_apm_performance__main as apm_main

# Account configuration: replace with the API Key ID and API Key created in TrueWatch "Manage / API Key Management"
API_KEY_ID  = 'wsak_xxx'
API_KEY     = 'wsak_xxx'

# The filters parameter can be set both in this script and in the detection configuration under Studio "Monitoring / Intelligent Inspection".
# Once filters are set in the script, the detection configuration in Studio does not need to be changed.
# If both are set, the script's filters take precedence.

def filter_project_service_sub(data):
    # The filter receives one candidate dict at a time, for example:
    # {'project': None, 'service_sub': 'mysql:dev'}, {'project': None, 'service_sub': 'redis:dev'},
    # {'project': None, 'service_sub': 'ruoyi-gateway:dev'}, {'project': None, 'service_sub': 'ruoyi-modules-system:dev'}
    # data['project'] is also available if you need to filter by project.
    # Return True to inspect the candidate, False to skip it.
    service_sub = data['service_sub']
    return service_sub in ['ruoyi-gateway:dev', 'ruoyi-modules-system:dev']

'''
Task configuration parameters should be set as follows:
@DFF.API('Application Performance Inspection', fixed_crontab='0 * * * *', timeout=900)

fixed_crontab: Fixed execution frequency; '0 * * * *' runs once per hour, on the hour
timeout: Task execution timeout in seconds; 900 keeps each run within 15 minutes
'''

# DFF is provided by the DataFlux Func script runtime and needs no import
@self_hosted_monitor(API_KEY_ID, API_KEY)
@DFF.API('Application Performance Inspection', fixed_crontab='0 * * * *', timeout=900)
def run(configs=[]):
    '''
    Parameters:
    configs: list of dicts, each with the keys:
        project: Project that the service belongs to
        service_sub: Service (service), environment (env), and version (version) joined by ":", for example "service:env:version", "service:env", or "service:version"

    Example:
        configs = [
            {"project": "project1", "service_sub": "service1:env1:version1"},
            {"project": "project2", "service_sub": "service2:env2:version2"}
        ]
    '''
    checkers = [
        apm_main.APMCheck(configs=configs, filters=[filter_project_service_sub]),
    ]

    Runner(checkers, debug=False).run()
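Before publishing, the shape of the configs entries and the filter logic can be checked in plain Python, without the DataFlux Func runtime. This is a minimal sketch: `make_config`, `WATCHED`, and `keep_service` are hypothetical names, not part of the Script Market packages. `keep_service` applies the same rule as the example filter function in the script, run on the sample candidates listed in its comment.

```python
def make_config(project, service, env=None, version=None):
    """Hypothetical helper: build one `configs` entry for run().

    service_sub joins the non-empty parts with ":", giving
    "service:env:version", "service:env", or "service:version".
    """
    parts = [service] + [p for p in (env, version) if p]
    return {"project": project, "service_sub": ":".join(parts)}


# Same rule as the example filter in the script: keep only the listed services.
WATCHED = {"ruoyi-gateway:dev", "ruoyi-modules-system:dev"}


def keep_service(data):
    return data["service_sub"] in WATCHED


# Sample candidates taken from the comment in the example filter.
samples = [
    {"project": None, "service_sub": "mysql:dev"},
    {"project": None, "service_sub": "redis:dev"},
    {"project": None, "service_sub": "ruoyi-gateway:dev"},
    {"project": None, "service_sub": "ruoyi-modules-system:dev"},
]

configs = [make_config("project1", "service1", "env1", "version1")]
selected = [d["service_sub"] for d in samples if keep_service(d)]
print(configs)   # [{'project': 'project1', 'service_sub': 'service1:env1:version1'}]
print(selected)  # ['ruoyi-gateway:dev', 'ruoyi-modules-system:dev']
```

Returning an explicit boolean from the filter keeps its intent clear; a bare `return True` with no `else` branch also works only because `None` is falsy.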

Start Inspection

In DataFlux Func, after configuring the inspection, you can test it by selecting the run() method directly from the page. After publishing, you can view and configure it in DataFlux Func "Manage / Scheduled Tasks".

View Events

Using its inspection algorithms, intelligent inspection searches APM metrics for anomalies, such as a sudden abnormal change in a resource. When it detects an anomaly, it generates a corresponding event, which you can then view in the "Event Center".
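The inspection algorithm itself is not documented here. As an illustration of what a "sudden anomaly" check can look like, here is a minimal sketch using a simple k-sigma rule against a recent baseline. The k-sigma rule, the default threshold, and the name `is_sudden_change` are assumptions, not TrueWatch's algorithm.

```python
from statistics import mean, stdev


def is_sudden_change(history, current, k=3.0, min_points=5):
    """Flag `current` as a sudden anomaly when it deviates from the mean of
    `history` (the recent baseline) by more than k standard deviations."""
    if len(history) < min_points:
        return False  # not enough baseline data to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat baseline: any change at all counts as a deviation.
        return current != mu
    return abs(current - mu) > k * sigma


# Hypothetical P99 response times (ms) of one resource over recent intervals.
history = [120, 118, 125, 122, 119, 121]
print(is_sudden_change(history, 480))  # True
print(is_sudden_change(history, 123))  # False
```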

Event Details

Event Overview: Describes the object and content of the abnormal inspection event
Error Trends: You can view performance metrics of the current application over the last hour
Abnormal Impact: You can see which services and resources are affected by the abnormal services in the current chain
Abnormal Chain Sampling: View detailed error times, services, resources, and chain IDs; clicking on services or resources will enter the corresponding data explorer; clicking on chain ID will enter the specific chain detail page.

Common Issues

1. How do I configure the detection frequency of application performance inspection?

In the self-hosted DataFlux Func, when writing the self-built inspection processing function, set fixed_crontab='0 * * * *' (once per hour) and timeout=900 (seconds) in the @DFF.API decorator, then configure it in "Manage / Automatic Trigger Configuration".
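To see why '0 * * * *' means "once per hour", here is a minimal, simplified cron matcher. It supports only `*` and comma-separated integers in each of the five fields and is not the scheduler DataFlux Func uses; `field_matches` and `cron_matches` are hypothetical names.

```python
from datetime import datetime, timedelta


def field_matches(field, value):
    """Match one cron field; supports only "*" and comma-separated integers."""
    if field == "*":
        return True
    return value in {int(v) for v in field.split(",")}


def cron_matches(expr, when):
    """Return True if `when` matches a 5-field cron expression
    (minute hour day-of-month month day-of-week).

    Simplified: real cron treats day-of-month and day-of-week specially
    when both are restricted; here all fields must match."""
    minute, hour, dom, month, dow = expr.split()
    # cron counts Sunday as 0; isoweekday() gives Monday=1 ... Sunday=7.
    weekday = when.isoweekday() % 7
    return (field_matches(minute, when.minute)
            and field_matches(hour, when.hour)
            and field_matches(dom, when.day)
            and field_matches(month, when.month)
            and field_matches(dow, weekday))


# Every minute of one day that "0 * * * *" selects.
day = datetime(2024, 1, 1)
hourly_runs = [day + timedelta(minutes=m) for m in range(24 * 60)
               if cron_matches("0 * * * *", day + timedelta(minutes=m))]
print(len(hourly_runs))  # 24
```

The minute field is fixed at 0 and every other field is `*`, so the task fires at the top of each hour, 24 times a day. With timeout=900, each run must finish within 15 minutes, well before the next hourly run starts.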

2. Why might an application performance inspection run produce no anomaly analysis?

If the inspection report contains no anomaly analysis, check the data collection status of the current DataKit.

3. Under what circumstances are application performance inspection events generated?

Metrics such as error rate and P99 latency serve as entry points. When one of them changes abnormally and the change affects upstream or downstream chains, the inspection collects the alarm information and performs root cause analysis.
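The entry-point metrics named here can be computed from raw trace spans. This is a minimal sketch, assuming each span is a dict with hypothetical `duration_ms` and `status` keys. P99 uses the nearest-rank method; TrueWatch may compute percentiles differently.

```python
def error_rate(spans):
    """Fraction of spans whose status is "error"."""
    if not spans:
        return 0.0
    errors = sum(1 for s in spans if s["status"] == "error")
    return errors / len(spans)


def p99(values):
    """99th percentile by the nearest-rank method."""
    if not values:
        raise ValueError("p99 of an empty sequence")
    ordered = sorted(values)
    # 1-based rank ceil(0.99 * n), computed with integers to avoid float rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical spans: durations 1..100 ms, the five slowest marked as errors.
spans = [{"duration_ms": d, "status": "error" if d > 95 else "ok"}
         for d in range(1, 101)]
print(error_rate(spans))                        # 0.05
print(p99([s["duration_ms"] for s in spans]))   # 99
```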

4. Why do previously working scripts raise errors during inspection?

Update the referenced script sets from the Script Market in DataFlux Func. The Script Market's Change Log lists update records, so you can see when a new version is available and update the scripts promptly.