Troubleshooting / Package Cannot Be Imported or Version Error
1. Preface
In normal Python development, any change to a third-party package requires restarting the running application before it takes effect.
In DataFlux Func, however, installing third-party packages does not require restarting the entire DataFlux Func system, which makes development more convenient.
This convenience is not 100% reliable.
For third-party packages to be installed or upgraded normally without restarting DataFlux Func, the following conditions must be met:
- The package is written purely in Python (e.g., without C extensions).
- It has no preloaded content (e.g., various large data models).
When a third-party package does not meet the above conditions, it is only guaranteed to work properly after its first installation. Any later upgrade or downgrade requires restarting the entire DataFlux Func system.
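As a rough way to check the first condition, the sketch below scans an installed package folder for compiled extension modules. The function name find_extension_modules is hypothetical and not part of DataFlux Func; it relies only on the Python standard library and assumes that the package folder is readable from where the code runs.

```python
import importlib.machinery
from pathlib import Path


def find_extension_modules(package_dir):
    """Return the compiled extension files found under package_dir.

    Hypothetical helper: an empty result suggests (but does not prove)
    that the package is written purely in Python.
    """
    # Suffixes the running interpreter treats as extension modules,
    # e.g. ".so" on Linux or ".pyd" on Windows.
    suffixes = tuple(importlib.machinery.EXTENSION_SUFFIXES)
    root = Path(package_dir)
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*")
        if path.is_file() and path.name.endswith(suffixes)
    )
```

A non-empty result for a package such as numpy indicates that upgrading it is likely to require a restart. This check does not detect preloaded content such as large data models.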
2. Case Study: numpy
The commonly used numpy package is a typical example of a package that uses C extensions. After deploying DataFlux Func, numpy can be installed and used normally the first time. However, if a different version of numpy is installed later, you may encounter the following types of failures:
2.1 numpy Cannot Be Imported
Taking numpy as an example, the script throws an import error as soon as it executes import numpy.
2.2 numpy Is Usable but Calls the Old Version
Taking numpy as an example, printing the __version__ attribute of the package object shows the numpy version actually in use during code execution, and this version does not match the version that was most recently installed.
Being in this state does not mean the code will keep running normally. Simply restarting DataFlux Func will turn it into the problem described in 2.1 above. Do not ignore it.
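The mismatch can be checked from inside a script. The following is a minimal sketch, assuming the package exposes __version__ and that its distribution name equals its import name (true for numpy, not for every package). The helper names loaded_and_installed_versions and is_stale are hypothetical.

```python
import importlib
import importlib.metadata


def loaded_and_installed_versions(package_name):
    """Return (version loaded in this process, version recorded on disk)."""
    module = importlib.import_module(package_name)
    loaded = getattr(module, "__version__", None)
    try:
        # Reads the *.dist-info metadata found on the current sys.path.
        installed = importlib.metadata.version(package_name)
    except importlib.metadata.PackageNotFoundError:
        installed = None
    return loaded, installed


def is_stale(loaded, installed):
    """True when both versions are known and they differ."""
    return loaded is not None and installed is not None and loaded != installed


if __name__ == "__main__":
    loaded, installed = loaded_and_installed_versions("numpy")
    print("loaded:", loaded, "installed:", installed)
    print("stale:", is_stale(loaded, installed))
```

If is_stale returns True, the process is still running the previously imported version, which is the situation described in this section.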
3. Explanation of the Cause
For third-party packages like numpy, during the installation process, there may be situations where additional resources are downloaded (e.g., code in other languages that needs to be compiled, data models, etc.). After the first installation, Python can normally read this third-party package and load external data.
However, when installing a different version of the same package again, the resource files from the previous version are still in use and cannot actually be updated/overwritten, leading to an incorrect installation.
At this point, the following scenarios exist for DataFlux Func:
- If a Worker process has previously imported this third-party package, that Worker process may still be able to run relying on cached data, but it manifests as calling the old version (the version cached during the first import).
- If a Worker process has never imported this third-party package before, then when it first imports the package and loads the related resources, the import fails because the package was not installed correctly during the second installation.
- At this point, if DataFlux Func is restarted, the first scenario above changes to the second scenario, because the restart releases the cache.
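The caching behavior behind the first scenario can be reproduced with plain Python. The sketch below is only an illustration of how a running process keeps an already imported module; it uses a throwaway module named demo_pkg_cache (a made-up name) rather than numpy, and it does not reproduce the failed import of a C extension.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_import_cache():
    """Import a module, change it on disk, then import it again."""
    with tempfile.TemporaryDirectory() as tmp:
        module_path = Path(tmp) / "demo_pkg_cache.py"
        module_path.write_text('__version__ = "1.0"\n')
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            first = importlib.import_module("demo_pkg_cache")
            seen_first = first.__version__

            # Simulate installing a different version while the process runs.
            module_path.write_text('__version__ = "2.0"\n')

            # The second import is served from sys.modules, so the file
            # on disk is not read again.
            second = importlib.import_module("demo_pkg_cache")
            seen_second = second.__version__
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg_cache", None)
    return seen_first, seen_second
```

Both values returned are the version that was imported first, even though the file on disk has changed. A process that never imported the module would read the new files instead, which is where an incorrectly installed package fails.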
4. Solution
Taking numpy 1.22.1 as an example, installation generates the following folders in the Resource Catalog / extra-python-packages:
- numpy
- numpy-1.22.1.dist-info
- numpy.libs
The default host location for the Resource Catalog is: /usr/local/dataflux-func/data/resources/extra-python-packages
The container location for the Resource Catalog is: /data/resources/extra-python-packages
The steps to resolve this issue are as follows:
- Completely delete the folders related to numpy mentioned above.
- Restart DataFlux Func.
- Reinstall numpy.
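The deletion step can be sketched in Python as follows. The function names are hypothetical, and the folder name patterns are an assumption based on the numpy 1.22.1 example above; other packages may leave differently named folders. The sketch defaults to a dry run so that the matches can be reviewed before anything is removed. Restarting DataFlux Func and reinstalling the package are still done in the usual way and are not covered here.

```python
import shutil
from pathlib import Path


def find_package_folders(packages_dir, package_name):
    """List the folders in packages_dir that appear to belong to package_name."""
    root = Path(packages_dir)
    candidates = [root / package_name, root / f"{package_name}.libs"]
    # Assumed naming: <package>-<version>.dist-info
    candidates.extend(root.glob(f"{package_name}-*.dist-info"))
    return sorted((p for p in candidates if p.is_dir()), key=lambda p: p.name)


def remove_package_folders(packages_dir, package_name, dry_run=True):
    """Delete the matching folders and return their names.

    With dry_run=True (the default) nothing is deleted.
    """
    folders = find_package_folders(packages_dir, package_name)
    if not dry_run:
        for folder in folders:
            shutil.rmtree(folder)
    return [folder.name for folder in folders]


if __name__ == "__main__":
    # Container location given in this document.
    print(remove_package_folders("/data/resources/extra-python-packages", "numpy"))
```

Run it once with the default dry_run=True, confirm that only the expected folders are listed, and then call it again with dry_run=False.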