EMR: How to join files into one?

Question:

I’ve split a big binary file into 2 GB chunks and uploaded them to Amazon S3.
Now I want to join them back into one file and process it with my custom

I’ve tried to run

but it failed because -cat writes its output to my local terminal; it does not work remotely…

How can I do this?

P.S. I’ve tried to run cat as a streaming MR job:

This job finished successfully, but I had 3 file parts in dir/in and now I have 6 parts in /dir/out

And there is a _SUCCESS file, of course, which is not part of my output…

So, how do I join a file that was split earlier?

Answer:

I’ve found a solution. It may not be the best one, but it works.

I’ve created an EMR job flow with a bootstrap action

In that joinfiles.sh I download my file pieces from S3 using wget and join them using a regular cat a b c > abc.
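
The join step can be sketched roughly as follows. This is not the original joinfiles.sh, which is not shown in the post; the function name join_parts, the PART_A_URL variable, and the tiny stand-in pieces are hypothetical and exist only for illustration. The download appears as a comment because the real S3 URLs are not given.

```shell
#!/bin/sh
# Hypothetical sketch of the join step; not the author's actual joinfiles.sh.

# join_parts OUTPUT PART...
# Concatenates the given parts, in the order given, into OUTPUT.
join_parts() {
    out=$1
    shift
    # cat copies bytes unchanged, so binary pieces are joined safely.
    cat "$@" > "$out"
}

# In the real script each piece would first be downloaded, for example:
#   wget -O a "$PART_A_URL"   # PART_A_URL is a hypothetical variable

# Local demonstration with three small stand-in pieces.
workdir=$(mktemp -d)
printf 'AAA' > "$workdir/a"
printf 'BBB' > "$workdir/b"
printf 'CCC' > "$workdir/c"
join_parts "$workdir/abc" "$workdir/a" "$workdir/b" "$workdir/c"
```

The order of the arguments is what determines the byte order of the result, so the pieces have to be listed in the same order in which the file was originally split.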

After that I added an s3distcp step, which copied the result back to S3 (a sample can be found at: https://stackoverflow.com/a/12302277/658346).
That is all.
