Question:
When creating a new cluster using boto3, I want to reuse the configuration of an existing cluster (which is terminated) and thus clone it.
As far as I know, emr_client.run_job_flow requires all of the configuration (Instances, InstanceFleets, etc.) to be provided as parameters.
Is there any way I can clone an existing cluster, like I can do from the AWS console for EMR?
Answer:
What I can recommend is using the AWS CLI to fire up your cluster.
It lets you keep your cluster configuration under version control, and you can easily load the step configuration from a JSON file.
```shell
aws emr create-cluster --name "Cluster's name" --ec2-attributes KeyName=SSH_KEY \
  --instance-type m3.xlarge --release-label emr-5.2.1 --log-uri s3://mybucket/logs/ \
  --enable-debugging --instance-count 1 --use-default-roles \
  --applications Name=Spark --steps file://step.json
```
Where step.json looks like:
```json
[
  {
    "Name": "Step #1",
    "Type": "SPARK",
    "Jar": "command-runner.jar",
    "Args": [
      "--deploy-mode", "cluster",
      "--class", "com.your.data.set.class",
      "s3://path/to/your/spark-job.jar",
      "-c", "s3://path/to/your/config/or/not",
      "--aws-access-key", "ACCESS_KEY",
      "--aws-secret-key", "SECRET_KEY"
    ],
    "ActionOnFailure": "CANCEL_AND_WAIT"
  }
]
```
(Multiple steps are okay too.)
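If you do end up staying in boto3, a step like the one above maps onto an entry of run_job_flow's Steps parameter. A sketch, carrying over the same placeholder paths and class name from the JSON; note that boto3 has no "Type": "SPARK" shorthand, so with command-runner.jar the first argument must be spark-submit:

```python
# The step.json entry above, expressed as a Steps item for emr_client.run_job_flow().
# Placeholders (class name, S3 paths, keys) are carried over from the JSON verbatim.
step = {
    "Name": "Step #1",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        # No "Type": "SPARK" shorthand in boto3, so spark-submit is spelled out.
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "--class", "com.your.data.set.class",
            "s3://path/to/your/spark-job.jar",
            "-c", "s3://path/to/your/config/or/not",
            "--aws-access-key", "ACCESS_KEY",
            "--aws-secret-key", "SECRET_KEY",
        ],
    },
}
```

You would then pass `Steps=[step]` to run_job_flow (or to add_job_flow_steps for a running cluster).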
After that, you can always start up the same configured cluster.
You can even, for example, schedule the whole cluster and its steps from a single Airflow job.
But if you really want to use boto3, the describe_cluster() method can give you most of the information, and you can use the returned object to fire up a new one.
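A minimal sketch of that boto3 route, assuming a hypothetical helper that maps the describe_cluster() response onto run_job_flow() kwargs. Field names follow the EMR API, but note that instance groups/fleets are not part of the describe_cluster() output and would have to be fetched with list_instance_groups() or list_instance_fleets() and merged in:

```python
# Hypothetical helper: turn a describe_cluster() "Cluster" dict into run_job_flow() kwargs.
# Only a subset of fields is copied; instance groups/fleets must be fetched separately
# (emr.list_instance_groups / emr.list_instance_fleets) and merged into "Instances".

def clone_params(cluster):
    ec2 = cluster.get("Ec2InstanceAttributes", {})
    return {
        "Name": cluster["Name"] + " (clone)",
        "ReleaseLabel": cluster["ReleaseLabel"],
        "LogUri": cluster.get("LogUri", ""),
        "Applications": [{"Name": a["Name"]} for a in cluster.get("Applications", [])],
        "ServiceRole": cluster["ServiceRole"],
        "JobFlowRole": ec2.get("IamInstanceProfile", "EMR_EC2_DefaultRole"),
        "Instances": {
            "Ec2KeyName": ec2.get("Ec2KeyName", ""),
            "Ec2SubnetId": ec2.get("Ec2SubnetId", ""),
            "KeepJobFlowAliveWhenNoSteps": not cluster.get("AutoTerminate", False),
        },
    }

# Against a live account (not run here):
# import boto3
# emr = boto3.client("emr")
# old = emr.describe_cluster(ClusterId="j-XXXXXXXXXXX")["Cluster"]
# new = emr.run_job_flow(**clone_params(old))
```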