Question:
I am trying to create an SKLearn processing job in Amazon SageMaker to perform some data transformation on my input data before model training.
I wrote a custom Python script, preprocessing.py,
which does the transformation. I use a few Python packages in this script. Here is the SageMaker example I followed.
When I try to submit the Processing Job I get an error:
Traceback (most recent call last):
  File "/opt/ml/processing/input/code/preprocessing.py", line 6, in <module>
    import snowflake.connector
ModuleNotFoundError: No module named 'snowflake.connector'
I understand that my processing job cannot find this package and that I need to install it. My question is: how can I accomplish this through the SageMaker Processing Job API? Ideally there would be a way to specify a requirements.txt
in the API call, but I don’t see such functionality in the docs.
I know I can build a custom image with the relevant packages and then use that image in the Processing Job, but this seems like a lot of work for something that should be built in.
Is there an easier, more elegant way to install the packages a SageMaker Processing Job needs?
Answer:
One way would be to call pip from Python inside your script:
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", package])
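As a minimal sketch of how that could look at the top of preprocessing.py (the helper name `ensure_package` is hypothetical; `snowflake-connector-python` is the PyPI package that provides the `snowflake.connector` module from the error above):

```python
import importlib.util
import subprocess
import sys

def ensure_package(pip_name: str, module_name: str) -> None:
    """Install pip_name at runtime only if module_name is not importable."""
    if importlib.util.find_spec(module_name) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])

# Call before the import that fails in the processing container, e.g.:
# ensure_package("snowflake-connector-python", "snowflake.connector")
# import snowflake.connector
```

The guard avoids re-running pip when the package is already present, which also makes the script usable locally.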
Another way would be to use an SKLearn Estimator (a training job) instead, to do the same thing. You can provide a
source_dir
, which can include a requirements.txt
file, and those requirements will be installed for you:
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="foo.py",
    source_dir="./foo",  # no trailing slash! put requirements.txt here
    framework_version="0.23-1",
    role=...,  # your SageMaker execution role
    instance_count=1,
    instance_type="ml.m5.large",
)
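For reference, a sketch of the source_dir layout that estimator expects (the `foo` names come from the snippet above; the requirements.txt content is an assumption based on the question's missing module):

```python
import pathlib

# Build the source_dir layout the SKLearn estimator expects:
#   foo/
#   ├── foo.py            (the entry_point script)
#   └── requirements.txt  (installed in the container before foo.py runs)
src = pathlib.Path("foo")
src.mkdir(exist_ok=True)
(src / "requirements.txt").write_text("snowflake-connector-python\n")
(src / "foo.py").write_text("import snowflake.connector\n# ... transformation code ...\n")
```

With that in place, estimator.fit(...) uploads the directory, and SageMaker installs requirements.txt before invoking foo.py.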