How to import a Python file as a module in a Jupyter notebook?


I am developing AWS Glue scripts and I am trying to use the Dev Endpoint. I followed the wizard to create a Dev Endpoint and a SageMaker notebook attached to it. When I open the SageMaker notebook, it directs me to a web page called Jupyter.

In Jupyter, I created several notebooks alongside my Python files. The problem is that some shared Python files cannot be imported into the notebooks as modules. I get the following error:

    Traceback (most recent call last):
    ImportError: No module named shared.helper

Here is my project structure on the Jupyter notebook:

I tried many approaches that I found on the Internet, but none of them worked.

In a_notebook.ipynb, I just use import shared.helper as helper, and it shows me the above error.

I don’t know whether this is related to AWS Glue, since I am opening Jupyter from the SageMaker notebook under the AWS Glue console.



According to the docs

You need to upload your Python files to an S3 bucket. If you have more than one file, you need to zip them. When you start the dev endpoint, there is a setting, Python library path, under Security configuration, script libraries, and job parameters (optional), for the path to the S3 location containing your custom libraries (including scripts, modules, and packages). You’ll also need to make sure the IAM policy attached to the IAM role used by the dev endpoint grants access to that bucket (e.g. s3:ListBucket and s3:GetObject).
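The zipping step can be sketched with Python’s standard library. This is a minimal sketch, not the official tooling — the `shared/` directory, `deps.zip`, and the bucket name in the comments are placeholders for illustration:

```python
import os
import zipfile

def zip_package(package_dir, zip_path):
    """Zip a Python package so every file keeps its package-relative
    path (e.g. shared/helper.py), which is what you want when the
    archive is unpacked on each DPU in the cluster."""
    # Anchor archive names at the package's parent directory so the
    # zip contains shared/helper.py rather than an absolute path.
    base = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full = os.path.join(root, name)
                arcname = os.path.relpath(os.path.abspath(full), base)
                zf.write(full, arcname)

# Example (paths are hypothetical):
#   zip_package("shared", "deps.zip")
# Then upload deps.zip to S3, e.g. with the AWS CLI:
#   aws s3 cp deps.zip s3://my-glue-libs/deps.zip
# and point the dev endpoint's "Python library path" at that S3 object.
```

The important detail is the `arcname`: if the archive stores absolute paths instead of package-relative ones, `import shared.helper` will still fail after the library is unpacked.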


It’s a bit of extra work but the main reason is that the libraries need to be loaded to each and every DPU (execution container) in the Spark cluster.

When you use the Sparkmagic (pyspark) kernel, it uses Apache Livy to connect to a remote Spark cluster and run your code there. The dev endpoint is effectively a Spark cluster, and your “Sagemaker notebook”^ is connecting to the Livy endpoint on that cluster.

This is quite different from a normal Python environment, mainly because the present working directory and the place where your code actually executes are not the same. Sagemaker allows use of a lot of the Jupyter magics, so you can test this out and see.

For example, run this in a cell:

    %pwd

It will show you what you expect to see, something like:

    /home/ec2-user/SageMaker


And try this:

    %ls

You’ll see something like this:

    Glue Examples/  lost+found/  shared/  a_notebook.ipynb

Those magics are using the Notebook’s context and showing you directories relative to it.
If you try this:

    import os
    print(os.getcwd())

You’ll see something quite different: a directory path on the remote cluster, not your notebook’s home directory.
That’s a Spark (really a Hadoop HDFS) directory from the driver container on the cluster. Hadoop directories are distributed with redundancy, so it’s not strictly correct to say the directory is in that container, nor does that really matter. The point is that the directory is on the remote cluster, not on the EC2 instance running your notebook.

A nice trick for loading modules is sometimes to modify sys.path to include a directory you want to import modules from. Unfortunately, that doesn’t work here: if you appended /home/ec2-user/SageMaker to the path, firstly that path won’t exist on the cluster, and secondly the pyspark context can’t search a path on your notebook’s EC2 host.
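To see why the trick is appealing in the first place, here is what it looks like in a plain local Python environment, where it does work. This is a self-contained sketch: the throwaway directory and the `mymod` module stand in for /home/ec2-user/SageMaker and your real shared code:

```python
import os
import sys
import tempfile

# Create a throwaway directory containing a module, standing in for
# the notebook instance's local directory holding your shared code.
lib_dir = tempfile.mkdtemp()
with open(os.path.join(lib_dir, "mymod.py"), "w") as f:
    f.write("GREETING = 'hello'\n")

# Appending the directory to sys.path makes its modules importable --
# locally. Over Livy this fails, because the directory only exists on
# the notebook's EC2 host, not on the cluster where the code runs.
sys.path.append(lib_dir)
import mymod

print(mymod.GREETING)  # -> hello
```

The same two lines executed through the Sparkmagic kernel would be evaluated on the remote driver, where `lib_dir` does not exist, which is exactly the failure described above.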

Another thing you can do to prove this is all true is to change your kernel in the running notebook. There’s a kernel menu option for that in Jupyter. I suggest conda_python3.

Of course, this kernel will not be connected to the Spark cluster, so no Spark code will work, but you can rerun the tests above, %pwd and print(os.getcwd()), and see that they now show the same local directory. You should also be able to import your module, although you may need to modify the path first, e.g.:

    import sys
    sys.path.append('/home/ec2-user/SageMaker')

You should then be able to run this:

    import shared.helper as helper

But at this point, you’re not in the Sparkmagic (pyspark) kernel, so that’s no good to you.

It’s a long explanation, but it should help make clear why the annoying requirement to upload scripts to an S3 bucket exists. When your dev endpoint launches, it has a hook to load your custom libraries from that location so they are available to the Spark cluster containers.

^ Note that Sagemaker is AWS’s re-branding of Jupyter notebooks, which is a little confusing. SageMaker is also the name of an AWS service for automated machine-learning model training/testing/deployment lifecycle management. It’s essentially Jupyter notebooks plus some scheduling plus some API endpoints on top. I’d be surprised if it weren’t something like papermill running under the hood.
