I am fairly new to Amazon Web Services. I have a Java program that reads gigabytes of crawled data, and I run it using the AWS Toolkit for Eclipse. The disadvantage is that I would have to keep my machine running for weeks to read the entire crawled data set, which is not possible. Apart from that, I can’t download gigabytes of data onto my local PC (which is what happens while the program is reading the data).
Is there any way to upload the jar to Amazon and have Amazon run it without involving my computer? I have heard about web crawlers running on Amazon for weeks without downloading data onto the developer’s machine, and without the developer having to keep that machine switched on for months.
The feature I am asking about is just like “job flows” in Amazon Elastic MapReduce: you upload the code and it runs there. It doesn’t matter whether you keep your own machine turned on or not.
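You can run it with the nohup command on *nix:
nohup java -jar myjar.jar >> logfile.log 2>&1 &
This will run your jar file, directing the output [stderr and stdout] to logfile.log. The shell applies redirections from left to right, so >> logfile.log comes first and 2>&1 then points stderr at the same file.
The & is needed so that it runs in the background, freeing up the command line / shell.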
!! EDIT !!
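It’s worth noting that the easiest way I’ve found for stopping the job once it’s started is to look up its process ID:
ps -ef | grep java
ec2-user 19082 19056 98 18:12 pts/0 00:00:11 java -jar myjar.jar
and then pass that PID (the second column, 19082 in this example) to kill.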
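As an alternative to searching the ps output by hand, the PID can be recorded when the job is launched. The following is only a sketch under assumptions: the function names start_job and stop_job and the file name job.pid are hypothetical and do not come from the original answer.

```shell
#!/bin/sh
# Sketch: start a long-running command detached and remember its PID,
# so it can be stopped later without searching the ps output.
# PIDFILE and LOGFILE are hypothetical names chosen for this example.
PIDFILE="${PIDFILE:-job.pid}"
LOGFILE="${LOGFILE:-logfile.log}"

start_job() {
    # Run the given command immune to hangups, append stdout and stderr
    # to the log file, and put it in the background.
    nohup "$@" >> "$LOGFILE" 2>&1 &
    # $! holds the PID of the most recently started background command.
    echo "$!" > "$PIDFILE"
}

stop_job() {
    # kill sends SIGTERM by default; remove the PID file once it succeeds.
    kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
}
```

After sourcing this file, start_job java -jar myjar.jar launches the jar, and stop_job ends it later, even from a new login session, because the PID is kept on disk.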
Note, you can use tail -f logfile.log, or other such tools [less, cat, head], to view the output from the jar.
Answer to question/comment
Hi. Yes, you can use System.out.println(), and that’ll end up in logfile.log. The redirections in the command take care of that: 2>&1 means “redirect stream 2 into stream 1”, which in Unix speak means send stderr to wherever stdout is going, and >> logfile.log means “append output to logfile.log”. As System.out.println() writes to stdout, it’ll end up in logfile.log.
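The effect of the two redirections can be checked without a jar at all. This is a small sketch in which the emit function is a hypothetical stand-in for the Java program, writing one line to each stream; the log file is a temporary file created just for the demo.

```shell
#!/bin/sh
# Stand-in for the Java program: one line on stdout, one on stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Temporary file used only for this demo.
LOG=$(mktemp)

# stdout is pointed at the file first, then stderr is made a copy of it,
# so both lines are appended to the log.
emit >> "$LOG" 2>&1

# Show what was captured.
cat "$LOG"
```

Both lines show up in the file, which is the same thing that happens to System.out and System.err output from the jar.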
However, if your app is set up to use Log4j/commons-logging, then statements logged with
LOG.info("statement"); will end up in the log file configured in ‘log4j.properties’. With this configuration, the only statements that will end up in
logfile.log will be those that are either system generated (errors, Linux internal system messages) or anything that’s written explicitly to stdout (i.e.