Merging files on AWS S3 (Using Apache Camel)

Question:

I have some files that are being uploaded to S3 and processed for some Redshift task. After that task is complete these files need to be merged. Currently I am deleting these files and uploading merged files again.
This eats up a lot of bandwidth. Is there any way the files can be merged directly on S3?

I am using Apache Camel for routing.

Answer:

S3 allows you to use an S3 file URI as the source for a copy operation. Combined with S3’s Multi-Part Upload API, you can supply several S3 object URIs as the source keys for a multi-part upload.
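As a rough illustration of that idea, here is a minimal Python sketch. It assumes a boto3-style S3 client is passed in as `s3` (the original answer does not name a client library), and the function name `concat_on_s3` is hypothetical. It ignores the 5MB part-size rule discussed next, so it only works as-is when every source except the last is large enough.

```python
def concat_on_s3(s3, bucket, source_keys, dest_key):
    """Concatenate S3 objects server-side by copying each one in as a part.

    `s3` is assumed to behave like a boto3 S3 client (an assumption, not
    something the original answer specifies).
    """
    upload = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)
    upload_id = upload["UploadId"]
    parts = []
    try:
        for number, key in enumerate(source_keys, start=1):
            # The source is another S3 object, so no data passes through us.
            result = s3.upload_part_copy(
                Bucket=bucket,
                Key=dest_key,
                UploadId=upload_id,
                PartNumber=number,
                CopySource={"Bucket": bucket, "Key": key},
            )
            parts.append(
                {"PartNumber": number, "ETag": result["CopyPartResult"]["ETag"]}
            )
        s3.complete_multipart_upload(
            Bucket=bucket,
            Key=dest_key,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Do not leave a half-finished upload behind on failure.
        s3.abort_multipart_upload(Bucket=bucket, Key=dest_key, UploadId=upload_id)
        raise
    return parts
```

With boto3 installed, a call might look like `concat_on_s3(boto3.client("s3"), "my-bucket", ["a.csv", "b.csv"], "merged.csv")`, where the bucket and key names are placeholders.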

However, the devil is in the details. S3’s multi-part upload API has a minimum part size of 5MB. Thus, if any file in the series being concatenated is < 5MB, the upload will fail.

However, you can work around this by exploiting the loophole that allows the final upload piece to be < 5MB (allowed because this happens in the real world when uploading remainder pieces).
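To make the rule concrete, here is a small Python sketch that flags which files in a manifest would be rejected as parts. The helper name `undersized_parts` is hypothetical, and 5MB is taken to mean 5 * 1024 * 1024 bytes.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # 5MB, taken here as 5 * 1024 * 1024 bytes


def undersized_parts(sizes):
    """Return the indexes of files that are too small to be used as parts.

    Every part except the last must be at least MIN_PART_SIZE, so the
    final entry in `sizes` is never flagged.
    """
    return [i for i, size in enumerate(sizes[:-1]) if size < MIN_PART_SIZE]
```

An empty result means the files can be concatenated in a single multi-part upload without any workaround.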

My production code does this by:

  1. Interrogating the manifest of files to be uploaded.
  2. If the first part is under 5MB, downloading pieces* and buffering to disk until 5MB is buffered.
  3. Appending parts sequentially until the file concatenation is complete.
  4. If a non-terminus file is < 5MB, appending it, then finishing the upload, creating a new upload, and continuing.
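The steps above can be sketched as a planning function. This is one reading of the list, not the production code itself: it assumes the first entry has already been brought up to 5MB (step 2), and that each upload after the first starts from the object produced by the previous upload. The name `plan_uploads` is hypothetical.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # 5MB, taken here as 5 * 1024 * 1024 bytes


def plan_uploads(sizes):
    """Group manifest entries into consecutive multi-part uploads.

    An upload is closed as soon as it takes an undersized file, because
    only the last part of an upload may be under MIN_PART_SIZE.
    Assumption: every upload after the first begins with the result of
    the previous one, which is large enough to be a valid first part.
    """
    if len(sizes) > 1 and sizes[0] < MIN_PART_SIZE:
        raise ValueError("first part must be buffered up to 5MB first (step 2)")
    uploads, current = [], []
    for index, size in enumerate(sizes):
        current.append(index)
        is_last_file = index == len(sizes) - 1
        if size < MIN_PART_SIZE and not is_last_file:
            uploads.append(current)  # the small file becomes the final part
            current = []
    if current:
        uploads.append(current)
    return uploads
```

For six files of 6MB, 1MB, 7MB, 2MB, 8MB and 3MB, this plans three uploads: files 0 and 1, then files 2 and 3, then files 4 and 5.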

Finally, there is a bug in the S3 API. The ETag (which is really an MD5 file checksum on S3) is not properly recalculated at the completion of a multi-part upload. To fix this, copy the file on completion. If you use a temp location during concatenation, this will be resolved on the final copy operation.
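For reference, here is a Python sketch of how to recognise such an ETag. It relies on a commonly observed form that the original answer does not describe, so treat it as an assumption: for a multi-part object, the ETag looks like the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. Both function names are hypothetical.

```python
import hashlib


def looks_multipart(etag):
    """True if the ETag has the '<hex>-<part count>' multi-part form."""
    head, sep, count = etag.strip('"').rpartition("-")
    return bool(sep) and count.isdigit()


def multipart_style_etag(parts):
    """Build an ETag in the multi-part form from the raw bytes of each part.

    Assumption: MD5 over the concatenated binary MD5 digests of the
    parts, then '-<number of parts>'.
    """
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return hashlib.md5(digests).hexdigest() + "-" + str(len(parts))
```

A value in this form will not match a plain MD5 of the whole file, which is why the copy step matters when something downstream compares checksums.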

* Note that you can download a byte range of a file. This way, if part 1 is 10K and part 2 is 5GB, you only need to read in 5110K of part 2 to meet the 5MB size needed to continue.
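The arithmetic in this note can be sketched in Python as follows. The helper name `ranges_to_buffer` is hypothetical, and it only computes which byte ranges to fetch. The download itself would be a ranged GET, for example with a `Range` value such as `bytes=0-5232639`.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # 5MB, taken here as 5 * 1024 * 1024 bytes


def ranges_to_buffer(sizes, target=MIN_PART_SIZE):
    """Return (file_index, first_byte, last_byte) reads that reach `target`.

    Byte offsets are zero-based and inclusive, matching HTTP Range
    headers. Files are consumed in order from their first byte.
    """
    reads, buffered = [], 0
    for index, size in enumerate(sizes):
        if buffered >= target:
            break
        take = min(size, target - buffered)
        if take > 0:
            reads.append((index, 0, take - 1))
            buffered += take
    return reads
```

For a 10K file followed by a 5GB file, this asks for all of the first file plus the first 5110K (5,232,640 bytes) of the second. The rest of the second file can then be appended as a later part.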

** You could also keep a 5MB block of zeros on S3 and use it as your default starting piece. Then, when the upload is complete, do a file copy using a byte range of 5MB+1 to EOF-1.
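Here is a small Python sketch of the range for that trimming copy, assuming zero-based inclusive offsets as used by HTTP Range headers. Under that convention the first real byte sits at offset 5242880 and the last at size minus one. The helper name `range_after_padding` is hypothetical.

```python
PAD_SIZE = 5 * 1024 * 1024  # size of the leading block of zeros


def range_after_padding(total_size, pad_size=PAD_SIZE):
    """Return a byte-range string covering everything after the padding.

    Offsets are zero-based and inclusive, so the range runs from
    `pad_size` to `total_size - 1`.
    """
    if total_size <= pad_size:
        raise ValueError("object holds no data beyond the padding block")
    return f"bytes={pad_size}-{total_size - 1}"
```

A string in this form is what a ranged part copy would take as its source range.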

P.S. When I have time to make a Gist of this code I’ll post the link here.
