Troubleshooting / Package Cannot Be Imported or Version Error
1. Introduction
In normal Python development, any change to a third-party package requires restarting the running program before it takes effect.
However, in DataFlux Func, for the convenience of development, installing third-party packages does not require restarting the entire DataFlux Func system.
But this does not guarantee 100% reliability.
For third-party packages to be installed or upgraded without restarting DataFlux Func, the following conditions must be met:
- The package is written purely in Python (e.g., does not contain C extensions)
- The package does not have preloaded content (e.g., large data models)
When a third-party package does not meet the above conditions, it is only guaranteed to work properly after its first installation; any subsequent update requires restarting the entire DataFlux Func system.
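As a rough way to check the first condition, the sketch below scans an installed package folder for compiled extension modules. The function name `find_compiled_extensions` is hypothetical and not part of DataFlux Func; it only looks for the extension suffixes of the current Python interpreter, so it cannot detect preloaded content such as data models.

```python
import importlib.machinery
from pathlib import Path


def find_compiled_extensions(package_dir):
    """Return sorted relative paths of compiled extension modules under package_dir.

    Hypothetical helper: it matches file names against the extension-module
    suffixes known to the running interpreter (e.g. ".so" on Linux).
    """
    root = Path(package_dir)
    suffixes = tuple(importlib.machinery.EXTENSION_SUFFIXES)
    found = []
    for path in root.rglob("*"):
        # Only regular files whose name ends with an extension-module suffix
        if path.is_file() and path.name.endswith(suffixes):
            found.append(str(path.relative_to(root)))
    return sorted(found)
```

An empty result suggests the package is pure Python for this interpreter; a non-empty result means later upgrades are likely to need a restart of DataFlux Func.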
2. Case Study: numpy
The commonly used numpy package is a typical example of a package that uses C extensions. After deploying DataFlux Func, the first installation of numpy works fine, but if you later install a different version of numpy, you may encounter the following issues:
2.1 numpy Cannot Be Imported
Taking numpy as an example, when the script executes `import numpy`, the import fails and an error is thrown.
2.2 numpy Can Be Used but Calls the Old Version
Taking numpy as an example, printing the `__version__` attribute of the package object (`numpy.__version__`) shows the version of numpy actually in use during code execution, which is inconsistent with the installed version.
Being in this state does not mean the code will keep running normally. Simply restarting DataFlux Func will turn it into the issue described in 2.1 above. Do not ignore this.
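One way to spot this state from inside a script is to compare the version of the loaded module with the version recorded in the installed package metadata. The following is a minimal sketch, assuming the extra-python-packages folder is on the module search path of the process so that `importlib.metadata` can see the package; the helper names are hypothetical.

```python
import importlib
import importlib.metadata


def loaded_and_installed_versions(package_name):
    """Return (loaded_version, installed_version); either may be None."""
    try:
        module = importlib.import_module(package_name)
        # Version reported by the module object held in this process
        loaded = getattr(module, "__version__", None)
    except ImportError:
        loaded = None
    try:
        # Version read from the package metadata (*.dist-info) on disk
        installed = importlib.metadata.version(package_name)
    except importlib.metadata.PackageNotFoundError:
        installed = None
    return loaded, installed


def versions_differ(loaded, installed):
    """True only when both versions are known and they do not match."""
    return loaded is not None and installed is not None and loaded != installed
```

For example, `versions_differ(*loaded_and_installed_versions("numpy"))` returning `True` would indicate that the process is still running an older copy than the one recorded on disk.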
3. Explanation
Third-party packages like numpy may download additional resources (e.g., code in other languages that needs to be compiled, data models, etc.) during installation. After the first installation, Python can read this third-party package normally and load external data.
However, when installing a different version of the same package again, the resource files from the previous version are still in use and cannot be updated/overwritten, leading to incorrect installation.
At this point, DataFlux Func may experience the following scenarios:
- If a Worker process has previously `import`ed this third-party package, it may still run using the cached data, but it will appear to call the old version (i.e., the version first cached)
- If a Worker process has never `import`ed this third-party package before, it will fail to `import` it when trying to load the related resources for the first time, because the package was not correctly installed during the second installation
- At this point, if DataFlux Func is restarted, scenario 1 above will release the cache due to the restart and turn into scenario 2 above.
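The caching behavior behind scenario 1 can be illustrated with a pure-Python analogy. The sketch below uses a throwaway module named `demo_pkg_cache` (a hypothetical name) and shows that a process that has already imported a module keeps using the loaded copy even after the file on disk changes. It does not reproduce the failed installation of a C extension package.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_cached_import():
    """Import a module, change it on disk, import again, and return both versions."""
    with tempfile.TemporaryDirectory() as tmp:
        module_file = Path(tmp) / "demo_pkg_cache.py"
        module_file.write_text('__version__ = "1.0"\n')
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            first = importlib.import_module("demo_pkg_cache").__version__
            # Simulate installing a different version while the process keeps running
            module_file.write_text('__version__ = "2.0"\n')
            # The second import is served from sys.modules, not from disk
            second = importlib.import_module("demo_pkg_cache").__version__
            return first, second
        finally:
            # Clean up so the demo leaves no trace in the current process
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg_cache", None)
```

Both returned values are `"1.0"`: only a fresh process would read the new file, which is why restarting changes the observed behavior.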
4. Solution
Taking numpy 1.22.1 as an example, after installation, the following folders will be generated in the Resource Catalog / extra-python-packages:
- numpy
- numpy-1.22.1.dist-info
- numpy.libs
The default host location for the Resource Catalog is: /usr/local/dataflux-func/data/resources/extra-python-packages
The container location for the Resource Catalog is: /data/resources/extra-python-packages
The steps to resolve this issue are as follows:
- Completely delete the above folders related to numpy
- Restart DataFlux Func
- Reinstall numpy
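The first step can be sketched as a small script. This is a minimal sketch, not a DataFlux Func tool: the function names are hypothetical, the matching assumes the folder names follow the pattern shown above (package name, `<package>-<version>.dist-info`, `<package>.libs`), and it defaults to a dry run so the matches can be reviewed before anything is deleted.

```python
import shutil
from pathlib import Path


def find_package_folders(resource_dir, package):
    """Return entries in resource_dir that belong to the given package name."""
    matches = []
    for entry in Path(resource_dir).iterdir():
        name = entry.name
        is_dist_info = name.startswith(package + "-") and name.endswith(".dist-info")
        # Exact names only, so e.g. "numpy_financial" is not matched for "numpy"
        if name in (package, package + ".libs") or is_dist_info:
            matches.append(entry)
    return sorted(matches)


def remove_package_folders(resource_dir, package, dry_run=True):
    """List matching entries, and delete them only when dry_run is False."""
    targets = find_package_folders(resource_dir, package)
    if not dry_run:
        for target in targets:
            if target.is_dir():
                shutil.rmtree(target)
            else:
                target.unlink()
    return [target.name for target in targets]


# Example (inside the container), review first and then delete:
# remove_package_folders("/data/resources/extra-python-packages", "numpy")
# remove_package_folders("/data/resources/extra-python-packages", "numpy", dry_run=False)
```

After the folders are removed, the remaining steps (restarting DataFlux Func and reinstalling the package) are still performed as described above.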