Question:
I am trying to spin up a cluster in AWS using EMR w/ Spark. I have a single bash bootstrap script to install some python packages, download credentials, and apply some configuration. The bootstrap action succeeds on the master, but fails on the slave. The only hint of an error is “i-#####: failed to start. bootstrap action 2 failed with non-zero exit code”. The message immediately preceding it is “i-#####: bootstrap action 1 completed”. (In both cases referring to the instance ID of the slave. The master is also reporting success of bootstrap action 1).
So it looks like the last command to execute in bootstrap action 2 is having an error and causing the bootstrap script to return a non-zero exit code. However, I only configured one bootstrap action. Do non-master nodes have another bootstrap action running automatically?
None of the logs show what the actual error is. I’ve looked at the bootstrap logs on S3 (which do not reliably show up) and tried tailing the /var/log/bootstrap-actions/ logs on the slave and master during startup.
I am pretty sure the error is not in my script (says every developer ever…). I am able to create a vanilla EMR cluster w/ no bootstrapping, then login once it is waiting and run my bootstrap script as user hadoop w/ no errors. I have also reviewed the last few commands (a grep and an echo) and verified that they do not return a non-zero exit nor cause the script to return a non-zero exit code.
I think the problem must be in some mystery second bootstrap action. Is this the case? How do I determine the error?
UPDATE
I logged into the slave node during boot up. I found the bootstrap action in /emr/instance-controller/lib/bootstrap-actions
. There was only 1 sub folder and it contained my bootstrap script. I then ran
tail -f /emr/instance-controller/log/instance-controller.log
. I verified that my script launched. After about 15 cycles of status checks (15 minutes), I see
1 2 |
2017-06-02 13:44:30,173 INFO InstanceConfigurer: Script 1 - Execution succeeded |
Then I see another AWS script starting up and this seems to be the one that fails.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
2017-06-02 13:44:30,181 INFO InstanceConfigurer: Running provision-node, with id 5aed1c54-4210-4387-944a-4fdbbce6dc8d 2017-06-02 13:44:30,188 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Fetching file '/var/lib/aws/emr/provision-node' 2017-06-02 13:44:30,188 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - startExec '/var/lib/aws/emr/provision-node' 2017-06-02 13:44:30,189 INFO InstanceConfigurer: startExec '/var/lib/aws/emr/provision-node' 2017-06-02 13:44:30,190 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Environment: ... 2017-06-02 13:44:54,201 INFO InstanceConfigurer: Output from command '/var/lib/aws/emr/provision-node': stdout: stderr: 2017-06-02 13:44:54,202 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - waitProcessCompletion ended with exit code 255 : /var/lib/aws/emr/provision-node 2017-06-02 13:44:54,202 INFO InstanceConfigurer: waitProcessCompletion ended with exit code 255 : /var/lib/aws/emr/provision-node 2017-06-02 13:44:54,203 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - total process run time: 24 seconds 2017-06-02 13:44:54,203 INFO InstanceConfigurer: total process run time: 24 seconds 2017-06-02 13:44:54,217 ERROR InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Execution for /var/lib/aws/emr/provision-node failed with code '255' 2017-06-02 13:44:54,219 ERROR InstanceConfigurer: Startup failed with aws157.instancecontroller.common.model.InstanceConfiguratorException: Source: PROVISION_NODE | ErrorCode: SCRIPT_EXECUTION_FAILED_CODE | Execution for /var/lib/aws/emr/provision-node failed with code '255' at aws157.instancecontroller.common.InstanceConfigurator.runScript(InstanceConfigurator.java:563) at aws157.instancecontroller.common.InstanceConfigurator.provisionNode(InstanceConfigurator.java:225) at aws157.instancecontroller.common.InstanceConfigurator.doDistributionConfigure(InstanceConfigurator.java:201) at aws157.instancecontroller.common.InstanceConfigurator.access$200(InstanceConfigurator.java:70) at aws157.instancecontroller.common.InstanceConfigurator$1.run(InstanceConfigurator.java:251) |
I’m not familiar with that /var/lib/aws/emr/provision-node
script, but its sole contents are
1 2 3 4 5 |
#!/bin/bash set -ex sudo /usr/share/aws/emr/node-provisioner/bin/provision-node "$@" |
Looking at /usr/share/aws/emr/node-provisioner/bin/provision-node
, i can see that this script does a bunch of work to identify the path to $EMR_NODE_PROVISIONER_HOME and then run the following Java class from there
java -classpath '/usr/share/aws/emr/node-provisioner/lib/*' com.amazonaws.emr.node.provisioner.Program --phase hadoop _UUID_
I figured this out by viewing the source of the provision-node script and running it independently. I haven’t been able to catch the log or failure in real-time to see what goes wrong. When I ran it separately, I got the following exception. but I think this is because I passed garbage data instead of the UUID (I don’t know where the UUID comes from and it is different on the launch of each slave).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
2017-06-02 14:55:13,593 ERROR main: Encountered a problem while provisioning java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) at com.google.api.client.http.javanet.NetHttpResponse. at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94) at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972) at com.amazonaws.emr.node.provisioner.http.JsonHttpClient.doRequest(JsonHttpClient.java:49) at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:38) at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:31) at com.amazonaws.emr.node.provisioner.bigtop.config.PlatformContextProvider.provide(PlatformContextProvider.java:32) at com.amazonaws.emr.node.provisioner.phase.PhaseWorkflow.work(PhaseWorkflow.java:51) at com.amazonaws.emr.node.provisioner.phase.ProvisionHadoopPhase.perform(ProvisionHadoopPhase.java:21) at com.amazonaws.emr.node.provisioner.Program.main(Program.java:20) |
So my question now is what is com.amazonaws.emr.node.provisioner.Program and why is it failing (or how do I figure out why?)?
UPDATE 2
I managed to tail the outputs of /usr/share/aws/emr/node-provisioner/bin/provision-node all the way to failure and the result is the same as in my independent running above.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
java -classpath '/usr/share/aws/emr/node-provisioner/lib/*' com.amazonaws.emr.node.provisioner.Program --phase hadoop 2017-06-02 17:05:37,869 ERROR main: Encountered a problem while provisioning java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) at com.google.api.client.http.javanet.NetHttpResponse. at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94) at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972) at com.amazonaws.emr.node.provisioner.http.JsonHttpClient.doRequest(JsonHttpClient.java:49) at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:38) at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:31) at com.amazonaws.emr.node.provisioner.bigtop.config.PlatformContextProvider.provide(PlatformContextProvider.java:32) at com.amazonaws.emr.node.provisioner.phase.PhaseWorkflow.work(PhaseWorkflow.java:51) at com.amazonaws.emr.node.provisioner.phase.ProvisionHadoopPhase.perform(ProvisionHadoopPhase.java:21) at com.amazonaws.emr.node.provisioner.Program.main(Program.java:20) |
I’m guessing this might be a firewall/security group issue, but I’m using the default security groups generated by EMR so I would expect the ports to be open. I am building this cluster in a private subnet of a VPC so that may be an issue. However, when I build a cluster with no bootstrapping, I don’t get this failure. My next debug step is to build a vanilla cluster w/o bootstrapping and watch for this same command.
UPDATE 3
Confirmed that a vanilla EMR w/ Spark deployment succeeds w/ no network changes. No errors in /usr/share/aws/emr/node-provisioner/bin/provision-node. After launching the java command, the next line of stderr shows a JSON dump of platform config parameters. However, stdout shows a yum package install from repo Bigtop. I don’t see a yum command in the script or stderr output (from set -xe) so I think the yum commands must be within that Java program. Not sure why they succeed here but not w/ a bootstrap action.
My private VPC does have an S3 endpoint w/ subnet routing and firewall rules allowing access to the endpoint plist. My bootstrap script is able to successfully install packages using yum (not from Bigtop repo), copy files from S3, and download code from an external git repo on the internet.
Answer:
I ran into this problem myself and here is what I have figured out. The bootstrap action 2 refers to the node provisioning, which fails due to a misstep done in custom bootstrap action 1. For example in my case, I was installing a MySQL server with version X and then during node provisioning, the execution fails since yum was trying to install a different version of MySQL server. It seems conflicting yum updates are one major source of the problem.
The messaging is very unclear and there is no good documentation either.
If you run into a similar situation, go through all the logs of your cluster. In my case, I found the problematic message in the following logfile
1 2 |
s3://aws-logs- |