Question:
I’m trying to launch a cluster and run a job all using boto.
I find lots of examples of creating job flows. But I can’t, for the life of me, find an example that shows:
- How to define the cluster to be used (by cluster_id)
- How to configure and launch a cluster (for example, if I want to use spot instances for some task nodes)
Am I missing something?
Answer:
Boto and the underlying EMR API currently mix the terms cluster and job flow, and job flow is being deprecated. I consider them synonyms.
You create a new cluster by calling the run_jobflow() method on the EMR connection (boto.emr.connection.EmrConnection.run_jobflow()). It returns the cluster ID that EMR generates for you.
First, all the mandatory things:
```python
#!/usr/bin/env python
import boto
import boto.emr
from boto.emr.instance_group import InstanceGroup

conn = boto.emr.connect_to_region('us-east-1')
```
Then we specify instance groups, including the spot price we want to pay for the TASK nodes:
```python
instance_groups = []
instance_groups.append(InstanceGroup(
    num_instances=1,
    role="MASTER",
    type="m1.small",
    market="ON_DEMAND",
    name="Main node"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="CORE",
    type="m1.small",
    market="ON_DEMAND",
    name="Worker nodes"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="TASK",
    type="m1.small",
    market="SPOT",
    name="My cheap spot nodes",
    bidprice="0.002"))
```
Finally we start a new cluster:
```python
cluster_id = conn.run_jobflow(
    "Name for my cluster",
    instance_groups=instance_groups,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    enable_debugging=True,
    log_uri="s3://mybucket/logs/",
    hadoop_version=None,
    ami_version="2.4.9",
    steps=[],
    bootstrap_actions=[],
    ec2_keyname="my-ec2-key",
    visible_to_all_users=True,
    job_flow_role="EMR_EC2_DefaultRole",
    service_role="EMR_DefaultRole")
```
We can also print the cluster ID if we care about that:
```python
print "Starting cluster", cluster_id
```