MapReduce on AWS


Anybody played around with MapReduce on AWS yet? Any thoughts? How’s the implementation?


It’s easy to get started.

Here’s a FAQ:

And here’s the Getting Started Guide:

If you have an EC2 account already, you can enable MapReduce and have a sample application up and running in less than 10 minutes using the AWS Management Console.

I ran the pre-packaged Word Count sample application, which counts the occurrences of each word in about 20 MB of text. You can provision up to 20 instances to run concurrently; I used just 2, and the job completed in about 3 minutes.

The job returns a 300 KB alphabetized list of words and how often each word appears in the sample corpus.
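For anyone who hasn’t seen the pattern before, here is a minimal local sketch in Python of what a word count job does conceptually. This is not the code of the AWS sample application. The function names (map_words, reduce_counts, word_count) and the simple letters-only tokenizer are my own assumptions for illustration, and everything runs in one process rather than across instances.

```python
import re
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    # Assumption: a "word" is a run of ASCII letters, compared case-insensitively.
    for word in re.findall(r"[a-z]+", line.lower()):
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return totals


def word_count(lines):
    """Run map then reduce locally and return an alphabetized list of (word, count)."""
    pairs = (pair for line in lines for pair in map_words(line))
    return sorted(reduce_counts(pairs).items())


if __name__ == "__main__":
    for word, count in word_count(["the cat", "The dog the"]):
        print(word, count)
```

On a real cluster the map and reduce steps run on different instances and the framework handles grouping the pairs by word in between; the sketch collapses all of that into a single dictionary.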

I really like that MapReduce jobs can be written in my choice of Perl, Python, Ruby, PHP, C++, R, or Java. The process was painless and straightforward, and the interface gives good feedback on the status of your instances and the job flow.

Be aware that AWS charges for a full hour whenever an instance is created, and MapReduce instances are automatically terminated at the end of each job flow, so every re-run starts new instances and a new billed hour. The cost of multiple fast-running job flows can add up quickly.

For example, if I create a job flow that uses 20 instances and returns results in 15 minutes, and then re-run it 3 more times, I’ll be charged for 80 instance-hours (20 instances × 4 runs × 1 billed hour each), even though the instances were only in use for 20 instance-hours in total (20 instances × 1 hour).
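The arithmetic is easy to check with a few lines of Python. This is only a sketch of the billing rule as described above (each instance billed in whole hours, rounded up, per job flow); the function names are made up for illustration and this is not an AWS API.

```python
import math


def billed_instance_hours(instances, minutes_per_run, runs):
    """Instance-hours charged when every run starts fresh instances
    and each instance is billed in whole hours, rounded up."""
    hours_per_instance = math.ceil(minutes_per_run / 60)
    return instances * hours_per_instance * runs


def used_instance_hours(instances, minutes_per_run, runs):
    """Instance-hours the instances were actually running."""
    return instances * minutes_per_run * runs / 60


if __name__ == "__main__":
    # 20 instances, 15 minutes per run, 4 runs in total (1 original + 3 re-runs)
    print(billed_instance_hours(20, 15, 4))  # 80
    print(used_instance_hours(20, 15, 4))    # 20.0
```

Under this rule, batching the work into one longer job flow, or using fewer instances for short test runs, keeps the billed figure closer to the used one.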
