Troubleshooting / Package cannot be imported or version error
1. Preface
In ordinary Python development, the running program must be restarted before any change to third-party packages takes effect.
In DataFlux Func, for convenience during development, installing third-party packages does not require restarting the whole DataFlux Func system.
However, this approach cannot guarantee 100% stability.
For a third-party package to install and upgrade normally without restarting DataFlux Func, it must meet the following conditions:
- Written purely in Python (i.e., contains no C extensions)
- Has no preloaded content (e.g., large data models)
When a third-party package does not meet the above conditions, only the first installation can be guaranteed to work properly, and subsequent updates require a full restart of DataFlux Func.
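As a rough way to check the first condition, the sketch below scans an installed package folder for files that look like compiled code. The helper names and the suffix list are assumptions made for illustration and are not part of DataFlux Func; a package with no such files may still fail the second condition.

```python
from pathlib import Path

# File suffixes that usually indicate compiled (non-Python) code.
# This list is an assumption for illustration and is not exhaustive.
COMPILED_SUFFIXES = {".so", ".pyd", ".dll", ".dylib"}


def find_compiled_files(package_dir):
    """Return compiled-looking files found under package_dir, sorted."""
    root = Path(package_dir)
    return sorted(
        path
        for path in root.rglob("*")
        # path.suffixes also catches versioned names such as libfoo.so.0
        if path.is_file() and any(s in COMPILED_SUFFIXES for s in path.suffixes)
    )


def is_probably_pure_python(package_dir):
    """True when no compiled-looking files are found under package_dir."""
    return not find_compiled_files(package_dir)
```

For example, calling `is_probably_pure_python("/data/resources/extra-python-packages/numpy")` inside the container would be expected to return `False`, since numpy ships C extensions.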
2. Failure Case: numpy
The commonly used numpy package is a typical example of a package that uses C extensions. After deploying DataFlux Func, the initially installed numpy works normally, but if you later install a different version of numpy, you may encounter the following types of failures:
2.1 numpy cannot be imported
Taking numpy as an example, when the script reaches import numpy, the following error is thrown:
[Error output, 20 lines; content not preserved in this copy]
2.2 numpy can be used but calls the old version
Taking numpy as an example, printing the __version__ of the package object shows that the version of numpy actually used at run time does not match the installed one.
[Python code, 1 line; content not preserved in this copy]
Being in this situation does not mean the code will keep running normally. Simply restarting DataFlux Func turns it into the issue described in 2.1 above. Do not overlook this.
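One possible way to detect this mismatch from a script is to compare the version reported by the loaded module with the version recorded in the `*.dist-info` folder on disk. The following is a minimal sketch that uses only the standard library; the function names are hypothetical, and it assumes the package exposes `__version__` and that its metadata contains a `Name` field.

```python
import importlib
import importlib.metadata


def installed_versions(package_name, packages_dir):
    """Versions recorded in *.dist-info metadata under packages_dir."""
    return sorted(
        dist.version
        for dist in importlib.metadata.distributions(path=[str(packages_dir)])
        if (dist.metadata.get("Name") or "").lower() == package_name.lower()
    )


def loaded_version(package_name):
    """Version reported by the module object in the current process."""
    module = importlib.import_module(package_name)
    return getattr(module, "__version__", None)


def check_version(package_name, packages_dir):
    """Return (loaded, installed, matches) for package_name."""
    loaded = loaded_version(package_name)
    installed = installed_versions(package_name, packages_dir)
    return loaded, installed, loaded in installed
```

Inside the container this could be called as `check_version("numpy", "/data/resources/extra-python-packages")`. A result whose last element is `False` suggests the process is still using a previously loaded version.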
3. Cause Explanation
Third-party packages like numpy may pull in additional resources during installation (such as code in other languages that must be compiled, data models, etc.). After the first installation, Python can read the package normally and load this external data.
However, when a different version of the same package is installed again, the resource files from the previous version are still in use and cannot actually be updated or overwritten, so the new version is not installed properly.
At this point, there are several scenarios for DataFlux Func:
- If a Worker process has already imported this third-party package, it may keep running on cached data, but it will report the old version (the one that was cached first).
- If a Worker process has never imported this third-party package, its first import has to load the related resources, and because the second installation did not complete properly, the import fails.
- If DataFlux Func is restarted at this point, the first scenario above turns into the second, because the cache is released on restart.
4. Solution
For example, after installing numpy 1.22.1, the following folders are generated under the Resource Catalog / extra-python-packages:
numpy
numpy-1.22.1.dist-info
numpy.libs
The default location of the Resource Catalog on the host machine is: /usr/local/dataflux-func/data/resources/extra-python-packages
The location of the Resource Catalog inside the container is: /data/resources/extra-python-packages
The steps to resolve this issue are as follows:
- Thoroughly delete the folders related to numpy mentioned above
- Restart DataFlux Func
- Reinstall numpy
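The first step can be scripted. The sketch below is one possible approach, not an official DataFlux Func tool: the helper names are hypothetical, and the matching rules (`<name>`, `<name>-<version>.dist-info`, `<name>.libs`) are inferred from the folder list above. It defaults to a dry run so the matched folders can be reviewed before anything is deleted.

```python
import re
import shutil
from pathlib import Path


def find_package_dirs(packages_dir, package_name):
    """List folders that belong to package_name under packages_dir."""
    # Matches e.g. numpy-1.22.1.dist-info (the version must start with a digit).
    dist_info = re.compile(rf"{re.escape(package_name)}-\d.*\.dist-info")
    matches = []
    for entry in Path(packages_dir).iterdir():
        if not entry.is_dir():
            continue
        name = entry.name
        if (
            name == package_name
            or name == f"{package_name}.libs"
            or dist_info.fullmatch(name)
        ):
            matches.append(entry)
    return sorted(matches)


def remove_package_dirs(packages_dir, package_name, dry_run=True):
    """Delete the matching folders; with dry_run=True only report them."""
    targets = find_package_dirs(packages_dir, package_name)
    if not dry_run:
        for target in targets:
            shutil.rmtree(target)
    return [target.name for target in targets]
```

On the host machine, with the default location, this could be run as `remove_package_dirs("/usr/local/dataflux-func/data/resources/extra-python-packages", "numpy", dry_run=False)`, followed by restarting DataFlux Func and reinstalling numpy.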