EMR slave bootstrap failure in node provisioner AFTER bootstrap action succeeds

Question:

I am trying to spin up a cluster in AWS using EMR w/ Spark. I have a single bash bootstrap script that installs some Python packages, downloads credentials, and applies some configuration. The bootstrap action succeeds on the master but fails on the slave. The only hint of an error is “i-#####: failed to start. bootstrap action 2 failed with non-zero exit code”. The message immediately preceding it is “i-#####: bootstrap action 1 completed”. (Both messages refer to the instance ID of the slave; the master also reports success of bootstrap action 1.)

So it looks like the last command to execute in bootstrap action 2 hits an error and causes the bootstrap script to return a non-zero exit code. However, I only configured one bootstrap action. Do non-master nodes automatically run another bootstrap action?

None of the logs show what the actual error is. I’ve looked at the bootstrap logs on S3 (which do not reliably show up) and tried tailing the /var/log/bootstrap-actions/ logs on the slave and master during startup.
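
One way to dig through whatever does land in S3 is to pull the whole log prefix down and search it; a rough sketch (the bucket and cluster ID are placeholders, and the files in S3 are gzipped):

    # sync the cluster's log prefix locally, then search the gzipped logs
    aws s3 sync s3://my-log-bucket/logs/j-XXXXXXXXXXXXX/ ./emr-logs/
    find ./emr-logs -name '*.gz' -exec zgrep -il "error" {} +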

I am pretty sure the error is not in my script (says every developer ever…). I am able to create a vanilla EMR cluster w/ no bootstrapping, log in once it is in the waiting state, and run my bootstrap script as the hadoop user w/ no errors. I have also reviewed the last few commands (a grep and an echo) and verified that they neither exit non-zero themselves nor cause the script to exit with a non-zero code.
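
For reference, the manual test amounts to something like this (the script name and S3 path are placeholders):

    # as the hadoop user on a node of a vanilla (no-bootstrap) cluster
    aws s3 cp s3://my-bucket/bootstrap/my-bootstrap.sh .
    chmod +x my-bootstrap.sh
    ./my-bootstrap.sh
    echo "exit code: $?"    # prints 0, so the script itself looks clean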

I think the problem must be in some mystery second bootstrap action. Is this the case? How do I determine the error?

UPDATE
I logged into the slave node during boot-up. I found the bootstrap action in /emr/instance-controller/lib/bootstrap-actions. There was only one subfolder, and it contained my bootstrap script. I then ran
tail -f /emr/instance-controller/log/instance-controller.log and verified that my script launched. After about 15 cycles of status checks (15 minutes), I see

Then I see another AWS script starting up and this seems to be the one that fails.

I’m not familiar with that /var/lib/aws/emr/provision-node script, but its sole contents are

Looking at /usr/share/aws/emr/node-provisioner/bin/provision-node, I can see that this script does a bunch of work to identify the path to $EMR_NODE_PROVISIONER_HOME and then runs the following Java class from there:

java -classpath '/usr/share/aws/emr/node-provisioner/lib/*' com.amazonaws.emr.node.provisioner.Program --phase hadoop _UUID_

I figured this out by viewing the source of the provision-node script and running it independently. I haven’t been able to catch the log or failure in real time to see what goes wrong. When I ran it separately, I got the following exception, but I think this is because I passed garbage data instead of the UUID (I don’t know where the UUID comes from, and it is different on each slave’s launch).

So my question now is: what is com.amazonaws.emr.node.provisioner.Program, and why is it failing (or how do I figure out why)?
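
One way to at least locate which jar ships that class (a sketch; assumes unzip is available on the node):

    # find the jar under the node-provisioner lib dir that contains the Program class
    for jar in /usr/share/aws/emr/node-provisioner/lib/*.jar; do
        unzip -l "$jar" 2>/dev/null \
            | grep -q 'com/amazonaws/emr/node/provisioner/Program.class' \
            && echo "$jar"
    done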

UPDATE 2

I managed to tail the output of /usr/share/aws/emr/node-provisioner/bin/provision-node all the way to failure, and the result is the same as when I ran it independently above.

I’m guessing this might be a firewall/security group issue, but I’m using the default security groups generated by EMR so I would expect the ports to be open. I am building this cluster in a private subnet of a VPC so that may be an issue. However, when I build a cluster with no bootstrapping, I don’t get this failure. My next debug step is to build a vanilla cluster w/o bootstrapping and watch for this same command.
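
A quick way to sanity-check the network theory is to confirm from a node that the yum repos it has configured are reachable (a sketch; repo file names vary by EMR release, and the probed URL should be a baseurl printed by the first command):

    # list the baseurls of the repos configured on the node
    grep -h '^baseurl' /etc/yum.repos.d/*.repo
    # probe one of them (substitute a baseurl printed above)
    curl -sS -o /dev/null -w '%{http_code}\n' "https://example-repo-baseurl/repodata/repomd.xml"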

UPDATE 3
Confirmed that a vanilla EMR w/ Spark deployment succeeds w/ no network changes. No errors in /usr/share/aws/emr/node-provisioner/bin/provision-node. After launching the java command, the next line of stderr shows a JSON dump of platform config parameters. However, stdout shows a yum package install from the Bigtop repo. I don’t see a yum command in the script or in the stderr output (from set -xe), so I think the yum commands must be inside that Java program. Not sure why they succeed here but not w/ a bootstrap action.
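
To compare what yum actually did on this vanilla node against a failing bootstrapped one, the standard yum history commands should help (the transaction id below is a placeholder taken from the listing):

    # list yum transactions and inspect the one run by the node provisioner
    sudo yum history list all
    sudo yum history info 3          # placeholder transaction id from the list above
    # see which installed packages came from the Bigtop repo
    yum list installed | grep -i bigtop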

My private VPC does have an S3 endpoint w/ subnet routing and firewall rules allowing access to the endpoint’s prefix list. My bootstrap script is able to successfully install packages using yum (not from the Bigtop repo), copy files from S3, and download code from an external git repo on the internet.

Answer:

I ran into this problem myself, and here is what I have figured out. “Bootstrap action 2” refers to node provisioning, which fails because of something done in your custom bootstrap action 1. In my case, for example, my bootstrap action installed a MySQL server at version X, and node provisioning then failed because yum tried to install a different version of the MySQL server. Conflicting yum updates seem to be one major source of the problem.
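
In other words, if your bootstrap action installs a package that the provisioner later wants at a different version, the provisioning yum transaction fails. A rough way to spot that kind of collision before installing your own copy (mysql-server here is just the package from my case):

    # versions of the package the configured repos can provide
    yum --showduplicates list available mysql-server
    # versions already installed (e.g. by a bootstrap action)
    yum list installed | grep -i mysql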

The messaging is very unclear and there is no good documentation either.

If you run into a similar situation, go through all the logs of your cluster. In my case, I found the problematic message in the following logfile
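
The exact file will differ per setup, so a brute-force sweep over the node’s local log locations is a reasonable start (paths as seen on my cluster):

    # search everything EMR writes locally for provisioning/bootstrap errors
    sudo grep -rilI "error" /emr/instance-controller/log/ /var/log/ 2>/dev/null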
