Troubleshooting / Package Cannot Be Imported or Version Error

1. Introduction

In normal Python development, changes to third-party packages generally take effect only after the running Python program is restarted.

However, in DataFlux Func, for the convenience of development, installing third-party packages does not require restarting the entire DataFlux Func system.

However, this mechanism is not 100% reliable.

For third-party packages to be installed or upgraded without restarting DataFlux Func, the following conditions must be met:

  1. The package is written purely in Python (e.g., it contains no C extensions)
  2. The package does not preload content (e.g., large data models)

When a third-party package does not meet these conditions, it is only guaranteed to work properly after its first installation. Any subsequent update requires restarting the entire DataFlux Func system.
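As a rough way to check the first condition, the sketch below scans an installed package directory for files that look like compiled extension modules or bundled shared libraries. The function name `find_compiled_files` is hypothetical, and the suffix matching is only a heuristic, so an empty result does not prove that a package is safe to upgrade without a restart.

```python
import importlib.machinery
from pathlib import Path


def find_compiled_files(package_dir):
    """Return relative paths of files that look like compiled binaries.

    Heuristic only: matches the extension-module suffixes of the running
    interpreter plus common shared-library suffixes.
    """
    suffixes = tuple(importlib.machinery.EXTENSION_SUFFIXES) + (
        ".so", ".pyd", ".dll", ".dylib",
    )
    root = Path(package_dir)
    found = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Also catch versioned shared libraries such as "libfoo.so.3"
        if path.name.endswith(suffixes) or ".so." in path.name:
            found.append(path.relative_to(root).as_posix())
    return sorted(found)
```

Running it against the directory where the extra packages are installed lists any binary files a package ships with, which suggests that the package falls outside the "pure Python" condition.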

2. Case Study: numpy

The commonly used numpy package is a typical example of a package that uses C extensions. After deploying DataFlux Func, the first installation of numpy works fine, but if you later install a different version of numpy, you may encounter the following issues:

2.1 numpy Cannot Be Imported

Taking numpy as an example, when the script executes import numpy, the following error is thrown:

#1 --------------------
Executing function: demo__demo.test_numpy()

Error stack:
Traceback (most recent call last):
  File "demo__demo", line 1, in <module>
    import numpy
ImportError:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
    https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
  * The Python version is: Python3.8 from "/opt/python/bin/python3.8"
  * The NumPy version is: "1.22.1"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: libopenblas64_p-r0-2f7c42d4.3.18.so: cannot open shared object file: No such file or directory

2.2 numpy Can Be Used but Calls the Old Version

Taking numpy as an example, printing the __version__ attribute of the package object shows the version of numpy actually loaded during code execution, which is inconsistent with the installed version.

import numpy
print(numpy.__version__)

Being able to import numpy in this state does not mean the code will run correctly. Restarting DataFlux Func will simply turn this into the import failure described in 2.1 above, so do not ignore it.
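To make the mismatch visible in one place, the following sketch compares the version reported by the loaded module with the version recorded in the installed package metadata. The helper `compare_versions` is a hypothetical name. It assumes the install directory is on `sys.path` (so that `importlib.metadata` can find the `.dist-info` folder) and that the import name matches the distribution name, which holds for numpy but not for every package.

```python
import importlib
import importlib.metadata


def compare_versions(package_name):
    """Return (loaded_version, installed_version) for a package.

    loaded_version comes from the module object in this process;
    installed_version comes from the .dist-info metadata on disk.
    Either value is None when it cannot be determined.
    """
    module = importlib.import_module(package_name)
    loaded = getattr(module, "__version__", None)
    try:
        installed = importlib.metadata.version(package_name)
    except importlib.metadata.PackageNotFoundError:
        installed = None
    return loaded, installed
```

If `compare_versions("numpy")` returns two different version strings, the process is most likely still running the previously loaded copy, as described above.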

3. Explanation

Third-party packages like numpy may download additional resources (e.g., code in other languages that needs to be compiled, data models, etc.) during installation. After the first installation, Python can read this third-party package normally and load external data.

However, when installing a different version of the same package again, the resource files from the previous version are still in use and cannot be updated/overwritten, leading to incorrect installation.

At this point, DataFlux Func may experience the following scenarios:

  1. If a Worker process has previously imported this third-party package, it may still run using the cached data, but it will appear to call the old version (i.e., the version first cached)
  2. If a Worker process has never imported this third-party package before, it will fail to import it when trying to load the related resources for the first time, because the package was not correctly installed during the second installation
  3. At this point, if DataFlux Func is restarted, scenario 1 above will release the cache due to the restart and turn into scenario 2 above.
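The caching behavior in scenario 1 can be illustrated with plain Python, independent of DataFlux Func. The sketch below uses a throwaway pure-Python module (`cached_demo_pkg`, a made-up name) rather than numpy: the process imports it, the file on disk is replaced with a "new version", and a second import in the same process still returns the module that was loaded first. It demonstrates only Python's module cache, not DataFlux Func's own package-loading mechanism.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_import_cache():
    """Import a module, change it on disk, and import it again."""
    with tempfile.TemporaryDirectory() as tmp:
        module_file = Path(tmp) / "cached_demo_pkg.py"
        module_file.write_text('__version__ = "1.0"\n')
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            first = importlib.import_module("cached_demo_pkg").__version__
            # Simulate installing a different version over the old one
            module_file.write_text('__version__ = "2.0"\n')
            # The second import is served from sys.modules, not from disk
            second = importlib.import_module("cached_demo_pkg").__version__
            return first, second
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("cached_demo_pkg", None)


print(demo_import_cache())  # ('1.0', '1.0')
```

A process that never imported the module would read whatever is on disk at that moment, which corresponds to scenario 2 when the files on disk are in an inconsistent state.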

4. Solution

Taking numpy 1.22.1 as an example, after installation, the following folders will be generated in the Resource Catalog / extra-python-packages:

  1. numpy
  2. numpy-1.22.1.dist-info
  3. numpy.libs

The default host location for the Resource Catalog is: /usr/local/dataflux-func/data/resources/extra-python-packages

The container location for the Resource Catalog is: /data/resources/extra-python-packages

The steps to resolve this issue are as follows:

  1. Completely delete the above folders related to numpy
  2. Restart DataFlux Func
  3. Reinstall numpy
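Step 1 can be scripted. The sketch below is one possible approach, not an official DataFlux Func tool: the function name `remove_package_dirs` and its `dry_run` parameter are hypothetical, and the three name patterns are derived only from the numpy folder list above, so other packages may leave differently named folders behind. By default it only reports what it would delete.

```python
import shutil
from pathlib import Path


def remove_package_dirs(packages_dir, package_name, dry_run=True):
    """Find (and optionally delete) the folders belonging to a package.

    Matches "<name>", "<name>.libs" and "<name>-*.dist-info" directly
    under packages_dir. Returns the matched folder names.
    """
    root = Path(packages_dir)
    patterns = [
        package_name,
        f"{package_name}.libs",
        f"{package_name}-*.dist-info",
    ]
    targets = sorted(
        {p for pattern in patterns for p in root.glob(pattern) if p.is_dir()}
    )
    if not dry_run:
        for path in targets:
            shutil.rmtree(path)
    return [p.name for p in targets]
```

For example, `remove_package_dirs("/usr/local/dataflux-func/data/resources/extra-python-packages", "numpy")` lists the matching folders on the host, and passing `dry_run=False` deletes them. Steps 2 and 3 (restarting DataFlux Func and reinstalling numpy) still have to be carried out afterwards.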