Deployment and Maintenance / Architecture, Scaling, and Resource Limits
This article mainly introduces the overall architecture of DataFlux Func and how to scale to improve processing capacity.
1. Architecture
Internally, the system follows a typical "producer -> consumer" model. Any execution of a Python function goes through the process of "task generation -> enqueue -> dequeue -> execution -> result return".
Any Python function is first packaged into a "task" and enters its designated "work queue" (numbered starting from #0). Subsequently, the corresponding "work unit" (numbered starting from worker-0) retrieves the task from the queue for execution.
flowchart TB
USER[User]
FUNC_SERVER[Func Server Service]
REDIS_QUEUE_N[Redis Queue #N]
FUNC_WORKER_N[Func Worker-N Service]
FUNC_BEAT[Func Beat Service]
USER --HTTP Request--> FUNC_SERVER
FUNC_SERVER --Function Execution Task Enqueues--> REDIS_QUEUE_N
REDIS_QUEUE_N --Function Execution Task Dequeues--> FUNC_WORKER_N
FUNC_BEAT --"Function Execution Task Enqueues
(Scheduled Tasks)"--> REDIS_QUEUE_N
1.1 Services and Their Purposes
DataFlux Func consists of multiple services, each with different responsibilities. The specific services are as follows:
| Service | Purpose |
|---|---|
| server | Web service, providing the following functionalities: 1. Web interface 2. API interfaces 3. Maintaining subscribers |
| worker-{queue number} | Work unit, used to execute user scripts, including: 1. Function APIs 2. Function APIs 3. Scheduled tasks Also handles some system-level background tasks See queue description for details |
| beat | Trigger for scheduled tasks |
| mysql | Database |
| redis | Cache / Function execution task queue |
1.2 Work Unit and Queue Listening Relationships
For the service worker-{queue number} (work unit), each Worker service only listens to specific queues:
Queues and work units do not have to correspond one-to-one
Queues and work units are not required to have a one-to-one correspondence. For example, work unit worker-0 is not limited to listening only to tasks from queue #0. Each work unit can listen to any one or more queues.
Furthermore, the same queue can be listened to by multiple work units simultaneously, or not listened to at all (not recommended).
Different queues for standalone Func and data platform attached Func
Since most standalone Func deployments are for relatively light usage, to reduce unnecessary resource consumption, standalone Func has fewer work units than queues.
Conversely, data platform attached Func, due to handling heavy business like monitors and message sending modules (Message Desk), has a one-to-one correspondence between work units and queues, and has more numbered work units and queues than standalone Func.
| Work Unit | Queues Standalone Deployment |
Queues Data Platform Attached |
|---|---|---|
| worker-0 | #0, #4, #7, #8, #9 | #0 |
| worker-1 | #1 | #1 |
| worker-2 | #2 | #2 |
| worker-3 | #3 | #3 |
| worker-4 | - | #4 |
| worker-5 | #5 | #5 |
| worker-6 | #6 | #6 |
| worker-7 | - | #7 |
| worker-8 | - | #8 |
| worker-9 | - | #9 |
| worker-10 | - | #10 |
| worker-11 | - | #11 |
| worker-12 | - | #12 |
| worker-13 | - | #13 |
| worker-14 | - | #14 |
| worker-15 | - | #15 |
| Work Unit | Queues Standalone Deployment |
Queues Data Platform Attached |
|---|---|---|
| worker-0 | #0, #4, #7, #8, #9 | #0 |
| worker-1 | #1 | #1 |
| worker-2 | #2 | #2 |
| worker-3 | #3 | #3 |
| worker-4 | - | #4 |
| worker-5 | #5 | #5 |
| worker-6 | #6 | #6 |
| worker-7 | - | #7 |
| worker-8 | - | #8 |
| worker-9 | - | #9 |
| Work Unit | Queues |
|---|---|
| worker-0 | #0 |
| worker-1-6 | #1, #2, #3, #4, #5, #6 |
| worker-7 | #7 |
| worker-8-9 | #8, #9 |
2. Services / Queues, Their Responsibilities, and Scaling Recommendations
Scaling requires more hardware investment
Scaling requires higher performance from the server, including but not limited to the server itself, database service, Redis, etc.
Generally, scaling DataFlux Func only requires increasing the replica count of the corresponding services. Therefore, users should first understand their actual business situation to scale targeted.
The complete services, queues, their responsibilities, and scaling recommendations are as follows:
| Service / Queue | Responsibility Standalone Deployment |
Responsibility Data Platform Attached |
Default Pod Count Data Platform Attached |
Scaling Recommendation |
|---|---|---|---|---|
| server | Web service, providing the following functionalities: 1. Web interface 2. API interfaces 3. Maintaining subscribers |
← Same as left | 1 | Generally no need to scale |
| server-inner | (No such service) | Web service, dedicated for internal cluster API calls | 1 | Generally no need to scale |
| worker-0 Queue #0 |
System work unit, not directly involved in user code processing | ← Same as left | 2 | Generally no need to scale |
| worker-1 Queue #1 |
Executes function tasks from synchronously executed function APIs | ← Same as left | 1 | Scale when needing to increase concurrency of synchronously executed function APIs |
| worker-2 Queue #2 |
Executes function tasks from scheduled tasks | ← Same as left | 1 | Scale when needing to increase concurrency of scheduled tasks |
| worker-3 Queue #3 |
Executes function tasks from asynchronously executed function APIs | ← Same as left | 1 | Scale when needing to increase concurrency of asynchronously executed function APIs |
| worker-4 Queue #4 |
(Reserved) | (Reserved) | 0 | No need to scale |
| worker-5 Queue #5 |
Debug code execution i.e., directly running functions in the Web interface |
← Same as left | 1 | Scale when needing to support more users developing scripts simultaneously |
| worker-6 Queue #6 |
Executes function tasks from connector subscription message processing | ← Same as left | 1 | Scale when needing to increase concurrency of connector subscription message processing |
| worker-7 Queue #7 |
(Reserved) | Executes data platform system business function tasks e.g., logging in via data platform backend admin, updating various caches, releasing message aggregation pools, etc. |
2 | Scale when the total number of monitors is high |
| worker-8 Queue #8 |
(Reserved) | Executes data platform threshold detection and other ordinary monitor, metric generation related function tasks | 5 | Scale when the number of ordinary monitors is high |
| worker-9 Queue #9 |
(Reserved) | Executes data platform advanced detection, intelligent monitor function tasks | 3 | Scale when the number of advanced detection and intelligent monitors is high |
| worker-10 Queue #10 |
(No such service) | Executes data platform user-reported event receiving function tasks | 1 | Scale when the volume of user-reported events is large |
| worker-11 Queue #11 |
(No such service) | Executes Message Desk message sending tasks | 3 | Scale when the volume of message sending is large |
| worker-12 Queue #12 |
(No such service) | (Reserved) | 0 | No need to scale |
| worker-13 Queue #13 |
(No such service) | (Reserved) | 0 | No need to scale |
| worker-14 Queue #14 |
(No such service) | Executes AI-related processing requiring immediate user response e.g., calling "Auto-write Pipeline" etc. |
2 | Scale when needing to support more users writing Pipelines simultaneously |
| worker-15 Queue #15 |
(No such service) | Executes AI-related processing not requiring immediate user response e.g., calling "Alert compression and merging" processing, etc. |
2 | Scale when there are many monitors using AI for alert aggregation |
| beat | Trigger for scheduled tasks | ← Same as left | 1 | Must not scale, ensure global single replica |
| mysql | Database | (No such service) | - | No need to scale, choose self-built or cloud service for higher demands |
| redis | Cache / Function execution task queue | (No such service) | - | No need to scale, choose self-built or cloud service for higher demands |
| Service / Queue | Responsibility Standalone Deployment |
Responsibility Data Platform Attached |
Scaling Recommendation |
|---|---|---|---|
| server | Web service, providing the following functionalities: 1. Web interface 2. API interfaces 3. Maintaining subscribers |
← Same as left | Generally no need to scale |
| server-inner | (No such service) | Web service, dedicated for internal cluster API calls | Generally no need to scale |
| worker-0 Queue #0 |
System work unit, not directly involved in user code processing | ← Same as left | Generally no need to scale |
| worker-1 Queue #1 |
Executes function tasks from synchronously executed function APIs | ← Same as left | Scale when needing to increase concurrency of synchronously executed function APIs |
| worker-2 Queue #2 |
Executes function tasks from scheduled tasks | ← Same as left | Scale when needing to increase concurrency of scheduled tasks |
| worker-3 Queue #3 |
Executes function tasks from asynchronously executed function APIs | ← Same as left | Scale when needing to increase concurrency of asynchronously executed function APIs |
| worker-4 Queue #4 |
(Reserved) | (Reserved) | No need to scale |
| worker-5 Queue #5 |
Debug code execution i.e., directly running functions in the Web interface |
← Same as left | Scale when needing to support more users developing scripts simultaneously |
| worker-6 Queue #6 |
Executes function tasks from connector subscription message processing | ← Same as left | Scale when needing to increase concurrency of connector subscription message processing |
| worker-7 Queue #7 |
(Reserved) | Executes data platform system business and message sending function tasks e.g., logging in via data platform backend admin, updating various caches, releasing message aggregation pools, Message Desk message sending |
Scale when the volume of message sending is large |
| worker-8 Queue #8 |
(Reserved) | Executes data platform threshold detection and other ordinary monitor related function tasks | Scale when the number of ordinary monitors is high |
| worker-9 Queue #9 |
(Reserved) | Executes data platform advanced detection, intelligent monitor function tasks | Scale when the number of advanced detection and intelligent monitors is high |
| beat | Trigger for scheduled tasks | ← Same as left | Must not scale, ensure global single replica |
| mysql | Database | (No such service) | No need to scale, choose self-built or cloud service for higher demands |
| redis | Cache / Function execution task queue | (No such service) | No need to scale, choose self-built or cloud service for higher demands |
| Service | Responsibility | Scaling Recommendation |
|---|---|---|
| server | Web service, providing the following functionalities: 1. Web interface 2. API interfaces 3. Maintaining subscribers |
Generally no need to scale |
| worker-0 Queue #0 |
System work unit, not directly involved in user code processing | Generally no need to scale |
| worker-1-6 Queue #1, #2, #3, #4, #5, #6 |
By default, responsible for synchronous function call processing, such as: 1. Synchronously executed function APIs 2. Subscription message processing |
Scale when needing to increase concurrency of synchronously executed function APIs and subscription message processing |
| worker-7 Queue #7 |
By default, responsible for debug code processing (i.e., directly running functions in the Web interface) | Scale when needing to support more users developing scripts simultaneously |
| worker-8-9 Queue #8, #9 |
By default, responsible for asynchronous function call processing, such as: 1. Asynchronously executed function APIs 2. Scheduled tasks |
Scale when needing to increase concurrency of scheduled tasks and asynchronously executed function API processing |
| beat | Trigger for scheduled tasks | Must not scale, ensure global single replica |
| mysql | Database | No need to scale, choose self-built or cloud service for higher demands |
| redis | Cache / Function execution task queue | No need to scale, choose self-built or cloud service for higher demands |
Example: When needing to enhance scheduled task processing capacity...
From the above, scheduled tasks are located in "Queue #8", and "Queue #8" corresponds to "Service worker-8". Therefore, scaling "Service worker-8" is sufficient.
Estimating Scaling Amount
Taking the common worker-8 as an example:
worker-8 in the data platform attached version is mainly responsible for executing monitor tasks. Assuming one detection task takes T milliseconds, then 1 minute can execute 60 × 1,000 ÷ T detections. By default, each worker-8 Pod starts 5 processes.
That is, the detection capacity of a single worker-8 Pod is 5 × (60 × 1,000 ÷ T) monitors.
Formula
| Text Only | |
|---|---|
1 2 | |
A: Detection capacity
T: Detection task execution time (milliseconds)
Based on different monitor execution durations, the following table can be listed:
| Single Detection Time (ms) | Single Pod Detection Capacity | Compared to Baseline |
|---|---|---|
| 300 | 1,000 | 167% |
| 500 | 600 | baseline |
| 800 | 375 | 63% |
| 1,000 | 300 | 50% |
| 2,000 | 150 | 25% |
| 3,000 | 100 | 17% |
Conversely, assuming the total number of monitors is M, then the required number of Pods can be derived from M ÷ (5 × (60 × 1,000 ÷ T)).
Formula
| Text Only | |
|---|---|
1 2 | |
P: Required number of Pods
M: Number of monitors
T: Detection task execution time (milliseconds)
Based on different monitor counts and execution durations, the following tables can be listed:
| Monitor Count | Single Detection Time (ms) | Required Pods | Compared to Baseline |
|---|---|---|---|
| 1,000 | 300 | 1 | 50% |
| 1,000 | 500 | 2 | baseline |
| 1,000 | 800 | 3 | 150% |
| 1,000 | 1,000 | 4 | 200% |
| 1,000 | 2,000 | 7 | 350% |
| 1,000 | 3,000 | 10 | 500% |
| Monitor Count | Single Detection Time (ms) | Required Pods | Compared to Baseline |
|---|---|---|---|
| 5,000 | 300 | 5 | 56% |
| 5,000 | 500 | 9 | baseline |
| 5,000 | 800 | 14 | 156% |
| 5,000 | 1,000 | 17 | 189% |
| 5,000 | 2,000 | 34 | 378% |
| 5,000 | 3,000 | 50 | 556% |
| Monitor Count | Single Detection Time (ms) | Required Pods | Compared to Baseline |
|---|---|---|---|
| 10,000 | 300 | 10 | 59% |
| 10,000 | 500 | 17 | baseline |
| 10,000 | 800 | 27 | 159% |
| 10,000 | 1,000 | 34 | 200% |
| 10,000 | 2,000 | 67 | 394% |
| 10,000 | 3,000 | 100 | 588% |
Operation Method
For single-machine deployed DataFlux Func, scaling can be achieved by modifying the configuration ({installation directory}/docker-stack.yaml) and increasing the deploy.replicas for the corresponding service.
Please refer to the official documentation
For complete information on the deploy.replicas option, please refer to the Docker official documentation: Docker Documentation / Compose file deploy reference / replicas
Taking improving worker-8 processing capacity as an example, the specific modified part is as follows:
Example is only an excerpt
The example only shows the key modified parts. Please ensure the configuration is complete during actual operation.
| docker-stack.yaml Key Modified Part | |
|---|---|
1 2 3 4 5 | |
3. Resource Limits
Resource limits should be adjusted reasonably based on actual business
Please adjust resource limits reasonably based on actual business.
Blindly limiting resources may lead to longer task execution times or insufficient memory preventing code execution.
Operation Method
For single-machine deployed DataFlux Func, resource limits can be implemented by modifying the configuration ({installation directory}/docker-stack.yaml) and adding deploy.resources for the corresponding service.
Please refer to the official documentation
For complete information on the deploy.resources option, please refer to the Docker official documentation: Docker Documentation / Compose file deploy reference / resources
By default, each worker-N replica can occupy up to 5 CPU cores (i.e., each work unit has 5 worker processes).
Taking limiting resources for worker-8 as an example, the specific modified part is as follows:
Example is only an excerpt
The example only shows the key modified parts. Please ensure the configuration is complete during actual operation.
| docker-stack.yaml Key Modified Part | |
|---|---|
1 2 3 4 5 6 7 | |
4. Splitting Work Units
All work units are already split in the new version
In standalone Func version 3.2.0 and later, all non-reserved work units are split by default. Users can enable reserved queues as needed.
In data platform attached Func version 1.77.145 and later, all work units are split by default, and users no longer need to split them manually.
Under special circumstances, default merged work units (e.g., worker-1-6) can be split to achieve finer-grained task scheduling, enabling scaling and resource limiting for work units responsible for specific queues.
Assuming based on business requirements, DataFlux Func has high performance requirements for subscription processing and hopes that subscription message processing does not interfere with synchronous function API processing, then worker-1-6 can be split into worker-1-5 and worker-6.
Operation Method
For single-machine deployed DataFlux Func, splitting work units can be achieved by modifying the configuration ({installation directory}/docker-stack.yaml), adding or modifying corresponding services, and changing the queue numbers specified in the command.
Specifying the queues a work unit listens to is achieved through the parameters after ./run-worker-by-queue.sh. The service name itself is mainly for labeling purposes; it is recommended to keep it consistent with the actual listening queues to avoid confusion.
Example is only an excerpt
The example only shows the key modified parts. Please ensure the configuration is complete during actual operation.
| docker-stack.yaml Key Modified Part | |
|---|---|
1 2 3 4 5 6 7 8 9 10 | |