FAQs about "Monitors"
2023-08-28
Due to their complex configuration and susceptibility to external factors, many people encounter situations where the actual behavior of monitors does not match expectations.
This article summarizes frequently encountered issues and the correct way to ask questions.
Note
A monitor simply executes DQL statements at specified times according to its configuration and makes judgments based on the returned values from the DQL query. (If no data is found, it further handles the situation based on the data gap configuration.)
- Monitors do not handle data reporting: Issues regarding data reporting frequency and database ingestion delays should be directed to the data collection side; monitors cannot answer related questions.
- Monitors do not manage databases: A monitor executes exactly the DQL the user configured. If the query returns no data at execution time, then no data was available at that moment. You can refer to the sections below on cases where no data is found, but please do not ask why data was not found unless you executed the same DQL at that exact time, and do not cite data that only became queryable after the monitor finished executing.
Second reminder: Do not ask afterward why data could not be found. Being able to find data afterward does not mean the monitor could have found it at execution time.
Quick Diagnosis Table
Expected vs Actual Behavior | Possible Causes |
---|---|
Expected events / alerts but none occurred | 1. The monitor configuration has been changed and no longer matches its historical behavior. 2. Data reporting and ingestion delays are too large, so no data is found during detection. 3. The aggregation function is Count while the judgment condition involves 0. 4. The monitor is disabled and not running, or mute / silence rules are in effect. 5. A database query or write operation failed. |
Did not expect events / alerts but they occurred | 1. The monitor configuration has been changed and no longer matches its historical behavior. 2. Data reporting and ingestion delays are too large, so no data is found during detection and data gap alerts are generated. |
Monitor runs slowly / Queue 8 blocked | 1. Too few worker-8 instances, too many monitors to run within the given timeframe. 2. Slow DQL execution by the monitor, reducing the number of monitors that can run per unit of time. 3. Excessive detection objects configured by the user (i.e., a massive number of time series returned by the DQL). 4. Too many associated configurations, such as a large number of alert strategies, notification targets, and silence rules created using scripts. |
When you encounter an issue, preserve the original state as much as possible. If testing is required, create a new monitor. Modifying the monitor believed to be problematic makes the issue difficult to investigate later.
1. Large Data Reporting Delays Prevent Monitors from Querying Data
Physical Limitations
It is known that light travels at 300,000 kilometers per second in a vacuum. Electrical signals travel approximately 0.7 times the speed of light in copper wires.
Therefore, even if user-reported data completely ignores various software and hardware delays and arrives at the server at light speed for immediate storage and querying, only customers within 300 kilometers can achieve a latency of less than 1 millisecond.
Since data reporting and ingestion delays are objectively present and unavoidable, monitors postpone the actual detection by 1 minute so that each run matches expectations as closely as possible.
For example, a monitor set to run every 1 minute and originally planned to run at 00:30:00 will actually run after 00:31:00.
This ensures that data reported with a delay of up to 1 minute can still be detected normally.
Therefore, if the data reporting delay exceeds 1 minute, the monitor will not be able to perform effective detection.
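The rule above can be sketched in a few lines of Python. This is only an illustrative model, not the monitor's actual implementation: the function names are hypothetical, and the only fact taken from the text is the 1-minute postponement.

```python
from datetime import datetime, timedelta

# Taken from the text: execution is postponed by 1 minute.
EXECUTION_DELAY = timedelta(minutes=1)


def actual_run_time(scheduled: datetime) -> datetime:
    """Earliest moment at which the run planned for `scheduled` executes."""
    return scheduled + EXECUTION_DELAY


def is_visible(data_time: datetime, ingestion_delay: timedelta, scheduled: datetime) -> bool:
    """True if a point stamped `data_time` is queryable when the run executes."""
    queryable_at = data_time + ingestion_delay
    return queryable_at <= actual_run_time(scheduled)


scheduled = datetime(2023, 1, 1, 0, 30, 0)
point = datetime(2023, 1, 1, 0, 29, 59)

print(is_visible(point, timedelta(seconds=2), scheduled))   # True
print(is_visible(point, timedelta(seconds=90), scheduled))  # False
```

A point that becomes queryable 2 seconds after its timestamp is seen by the postponed run, while a point delayed by 90 seconds is not.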
The following tables illustrate how data reporting and ingestion delays prevent data from being queried.
With the current monitor solution, which allows for a 1-minute delay:
Time | Data Reported | Monitor Detection |
---|---|---|
00:29:59 | Reported, not stored | Waiting |
00:30:00 | Not stored, not queryable | Scheduled to run but not yet executed |
00:30:01 | Stored, queryable | Waiting |
00:31:00 | Stored, queryable | Actually executed |
Without the delay, the monitor would execute before the data becomes queryable:
Time | Data Reported | Monitor Detection |
---|---|---|
00:29:59 | Reported, not stored | Waiting |
00:30:00 | Not stored, not queryable | Actually executed |
00:30:01 | Stored, queryable | |
How to Determine if Data Reporting Ingestion Delays Are Too Large?
Currently, GuanceDB automatically adds `create_time` to all reported data. You can use the DQL query tool to check the difference between `time` and `create_time`.
This method is only for reference and is not comprehensive. It is also invalid in the following scenarios:
- The data reporting endpoint specifies the data time `time`, but the actual report is sent much later than the specified time (e.g., reporting a timestamp of 00:30 at 01:00).
- Disk writing lags, i.e., although `create_time` is added at the time of writing, the data cannot be immediately queried from the database.
Thus, based on the content of this section, the only conclusion that can be drawn is:
When the reporting endpoint does not specify `time`, and a large gap is found between `time` and `create_time` in the DQL query tool, it indicates that there is certainly a significant reporting delay.
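Once query results are exported, the gap can be computed as in the following sketch. The sample rows are hypothetical, and it is an assumption that both fields are Unix timestamps in milliseconds, so check the units in your own data. As stated above, the result is only meaningful when the reporting endpoint does not specify `time` itself.

```python
from statistics import median

# Hypothetical sample rows, as if exported from a DQL query result.
# Assumption: both fields are Unix timestamps in milliseconds.
rows = [
    {"time": 1_672_531_200_000, "create_time": 1_672_531_201_500},
    {"time": 1_672_531_260_000, "create_time": 1_672_531_335_000},
    {"time": 1_672_531_320_000, "create_time": 1_672_531_322_000},
]


def ingestion_gaps_seconds(rows):
    """Gap between write time and data time, in seconds, per row."""
    return [(r["create_time"] - r["time"]) / 1000 for r in rows]


def exceeds_tolerance(rows, tolerance_seconds=60):
    """Rows whose gap is larger than the monitor's 1-minute allowance."""
    gaps = ingestion_gaps_seconds(rows)
    return [r for r, gap in zip(rows, gaps) if gap > tolerance_seconds]


gaps = ingestion_gaps_seconds(rows)
print(gaps)                          # [1.5, 75.0, 2.0]
print(max(gaps), median(gaps))       # 75.0 2.0
print(len(exceeds_tolerance(rows)))  # 1
```

Here one of the three rows was written 75 seconds after its data time, which is beyond what the 1-minute postponement can absorb.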
P implies Q, but Q does not imply P
When "the reporting endpoint does not specify `time`, and a large gap between `time` and `create_time` is found in the DQL query tool," then "there is a significant reporting delay" holds true.
This does not mean that "a significant reporting delay" implies "the reporting endpoint does not specify `time`, and a large gap between `time` and `create_time` is found in the DQL query tool."
P implies Q, but not P does not imply not Q
When "the reporting endpoint does not specify `time`, and a large gap between `time` and `create_time` is found in the DQL query tool," then "there is a significant reporting delay" holds true.
This does not mean that when "the reporting endpoint does specify `time`," then "there is no significant reporting delay."
It also does not mean that when "in the DQL query tool, no large gap between `time` and `create_time` is found," then "there is no significant reporting delay."
What to Do if Data Reporting Ingestion Delays Are Too Large?
Currently, there is no real solution: a large reporting and ingestion delay is the direct cause of detection failure.
The only mitigation is to set the monitor's detection time range larger than the detection frequency (e.g., detecting 5 minutes of data every minute), so that data from the same time period is detected multiple times, increasing the success rate of queries.
However, this will inevitably lead to the side effect of the same data being detected multiple times, possibly resulting in more alerts.
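The trade-off can be illustrated with a small simulation. It is a sketch under stated assumptions, not the monitor's real scheduling code: the function name is hypothetical, a run scheduled at second `T` is assumed to query the range `[T - window, T)`, and it is assumed to execute 60 seconds later, as described earlier.

```python
def runs_that_see(point_s, delay_s, window_s,
                  frequency_s=60, execution_delay_s=60, horizon_s=3600):
    """Scheduled times (in seconds) of the runs that can query the point.

    point_s: data timestamp; delay_s: reporting plus ingestion delay.
    """
    seen = []
    for scheduled in range(0, horizon_s + 1, frequency_s):
        in_window = scheduled - window_s <= point_s < scheduled
        queryable = point_s + delay_s <= scheduled + execution_delay_s
        if in_window and queryable:
            seen.append(scheduled)
    return seen


# A point stamped at 590 s that only becomes queryable 150 s later.
print(runs_that_see(590, 150, window_s=60))   # []
print(runs_that_see(590, 150, window_s=300))  # [720, 780, 840]

# A point with a small delay is seen by every overlapping run.
print(runs_that_see(590, 10, window_s=300))   # [600, 660, 720, 780, 840]
```

With a 1-minute range the late point is never detected. With a 5-minute range it is detected, but by three consecutive runs, and a punctual point is detected by five, which is the side effect described above.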
2. Monitor Configurations Have Been Changed and Do Not Match Historical Behaviors
Please check the current monitor configuration and determine whether it matches your expectations.
If unsure whether someone has made changes, you can go to "Manage / Settings / Security / Operation Audit" to view operation records.
3. Aggregation Functions Select Count While Judgment Conditions Involve 0
Roster Principle
Without a roster, you can only know who is present but cannot know who is absent.
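A minimal sketch of this principle, using a grouped count over hypothetical rows (the `host` field and the host names are invented for illustration): grouping only ever yields keys that appear in the data, so a count of zero can only be produced when a roster is supplied separately.

```python
from collections import Counter

# Hypothetical rows; "host" plays the role of the grouping key.
rows = [
    {"host": "web-1", "status": "error"},
    {"host": "web-1", "status": "error"},
    {"host": "web-2", "status": "error"},
]


def count_by(rows, key):
    """Grouped count: only keys present in the data can appear."""
    return dict(Counter(r[key] for r in rows))


counts = count_by(rows, "host")
print(counts)               # {'web-1': 2, 'web-2': 1}
print("web-3" in counts)    # False: there is no zero-count entry

# Only with an explicit roster can absent objects be reported as 0.
roster = ["web-1", "web-2", "web-3"]
print({host: counts.get(host, 0) for host in roster})
# {'web-1': 2, 'web-2': 1, 'web-3': 0}
```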
Since the data in Guance is unstructured, and detection distinguishes different objects according to the `BY` statement, the set of objects that should exist is unknowable.
Meanwhile, any database (whether MySQL or InfluxDB), when using the aggregation function `COUNT` combined with `BY`, cannot return entries where `COUNT == 0`.
Therefore, in monitor configurations, simply using "aggregation function `COUNT` combined with `BY`, and setting `Result >= 0`" is meaningless.
How to Detect "Non-existence"?
Such detection is actually configured as "alert when XXX exists," combined with "generate a recovery event on data gap," so that recovery events are produced for objects that COUNT no longer returns.
How Is Data Gap Detection Implemented?
According to the aforementioned "roster principle," monitors do not know "how many objects there are in total," but they can know "how many objects were in the detection range for this round / last round."
By comparing the objects within the detection range of this round with the objects within the detection range of the last round, the monitor finds the objects that have disappeared and treats them as "data gap" objects.
Below is an illustration of data gap object determination:
Detection Round | Objects Within Detection Range | Result |
---|---|---|
First Round | Zhang San, Li Si | - |
Second Round | Li Si | Compared to the previous round, "Zhang San" is missing. Therefore, "Zhang San" is the data gap object. |
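The round-to-round comparison amounts to a set difference. The sketch below is illustrative only, and the function name is hypothetical.

```python
def data_gap_objects(previous_round, current_round):
    """Objects present in the last round but absent from this round."""
    return set(previous_round) - set(current_round)


first_round = {"Zhang San", "Li Si"}
second_round = {"Li Si"}

print(data_gap_objects(first_round, second_round))  # {'Zhang San'}

# An object that was never seen in the previous round cannot be a gap.
print(data_gap_objects(second_round, {"Li Si", "Wang Wu"}))  # set()
```

Because only the previous round serves as the reference, an object that never appeared in any round cannot be reported as a data gap.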
4. Mute / Silence Rules Take Effect
Mute / Silence rules only affect whether alerts are sent, not the generation of events.
You can check the information in the event details to see why an alert was muted / silenced.
5. Monitor Executes Slowly / Queue 8 Blocked
- Too few `worker-8` instances, too many monitors to run within the given timeframe
- Monitor executes DQL slowly, reducing the number of monitors that can run per unit of time
- Users configure monitors to detect too many objects (i.e., DQL returns a huge number of time series)
- Excessive associated configurations, such as a large number of alert strategies, notification targets, and silence rules generated using scripts
X. How to Ask Effective Questions
If the above common problems do not solve your issue, you can try asking questions in the following manner:
If events / alerts were expected but did not occur, please submit the following 5 items to the development team. The development team will query system logs based on the submitted information to determine if there is an issue.
Please fully submit all 5 items. Missing content will prevent the problem from being located.
# | Submission Content | Source |
---|---|---|
1 | Which Guance node? | You can directly copy the URL from the address bar. For example, cn3-console.guance.com refers to cn3. |
2 | Which workspace? | You can view the "Workspace ID" in "Manage / Settings". The workspace ID format is wksp_xxxxx. |
3 | Which monitor? | You can directly copy the URL from the monitor configuration page. The monitor ID format must be rul_xxxxx. |
4 | When? | Please provide an accurate time, such as 2023-01-01 00:10:00. Avoid vague descriptions like "yesterday noon," "previously," or "just now," whose meaning changes over time. |
5 | Problem Description | Briefly explain "why you think events / alerts should have occurred." |
If events / alerts occurred that were not expected, please submit the following 2 items to the development team. The development team will query system logs based on the submitted information to determine if there is an issue.
Please submit both items in full. Missing content will prevent the problem from being located.
# | Submission Content | Source |
---|---|---|
1 | Event JSON File | You can download the event's JSON file from "Event Details / Export / Export JSON File". |
2 | Problem Description | Briefly explain "why you think events / alerts should not have occurred." |