I have a table with about 6 million records and want to start archiving them. I have thought of creating a backup copy of the same table and moving records across once they meet the criteria for being archived. However, I have been told that it is also possible to use Hive to copy this data to S3.
Could someone please explain why I would opt to copy the data into an S3 bucket rather than store it in another DynamoDB table?
DynamoDB has a time-to-live (TTL) mechanism, and you can enable a stream on the table so that each record deletion invokes an AWS Lambda function that writes the data to S3. Check this detailed guide on how to set it up. Alternatively, you can try AWS Data Pipeline with an EMR cluster, which is a common way to set up one-time or periodic migrations.
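To make the TTL-plus-stream route more concrete, here is a minimal sketch of what such a Lambda handler might look like. It assumes the stream is configured to include old images, and it relies on the fact that TTL deletions appear in the stream as REMOVE events whose `userIdentity` has type `Service` and principal `dynamodb.amazonaws.com`. The `ARCHIVE_BUCKET` environment variable, the `archive/` key prefix, and the function names are hypothetical choices for illustration, not part of any guide.

```python
import json
import os


def ttl_expired_images(event):
    """Return the OldImage of every stream record that was removed by TTL."""
    images = []
    for record in event.get("Records", []):
        identity = record.get("userIdentity") or {}
        # TTL deletions are REMOVE events performed by the DynamoDB service itself;
        # manual deletes carry no such userIdentity and are skipped.
        is_ttl_delete = (
            record.get("eventName") == "REMOVE"
            and identity.get("type") == "Service"
            and identity.get("principalId") == "dynamodb.amazonaws.com"
        )
        if is_ttl_delete and "OldImage" in record.get("dynamodb", {}):
            images.append(record["dynamodb"]["OldImage"])
    return images


def handler(event, context):
    """Lambda entry point: write the expired items of one batch to S3."""
    import boto3  # imported lazily so the filter above can be tested without AWS

    images = ttl_expired_images(event)
    if images:
        # One JSON document per line, still in DynamoDB attribute-value format.
        body = "\n".join(json.dumps(image) for image in images)
        boto3.client("s3").put_object(
            Bucket=os.environ["ARCHIVE_BUCKET"],  # hypothetical variable name
            Key=f"archive/{context.aws_request_id}.json",  # hypothetical key layout
            Body=body.encode("utf-8"),
        )
    return {"archived": len(images)}
```

Keeping the filtering in its own function means the handler archives only items that expired through TTL, and the logic can be checked locally with a hand-built event before anything is deployed.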
If you actively run full scan operations over your DynamoDB table, then it’s better to archive and remove the records you don’t use, because a scan reads every item and so gets slower and more expensive as the table grows. If you query the records only by primary key, then archiving is most probably not worth the effort. You can verify this against your bill, but storing the first 25 GB in DynamoDB is free.
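As a rough way to check the storage side before looking at the bill, the sketch below estimates the table size from an item count and an average item size and compares it with the 25 GB mentioned above. The average item sizes used here are made-up placeholders, and treating 1 GB as 1024^3 bytes is an assumption; the real figure is best read from the table's reported size.

```python
FREE_STORAGE_GB = 25  # free storage allowance mentioned above


def estimated_size_gb(item_count, avg_item_bytes):
    """Rough table size in GB, assuming 1 GB = 1024**3 bytes."""
    return item_count * avg_item_bytes / 1024**3


def fits_free_storage(item_count, avg_item_bytes):
    """True if the estimated size stays within the free allowance."""
    return estimated_size_gb(item_count, avg_item_bytes) <= FREE_STORAGE_GB


if __name__ == "__main__":
    # 6 million records as in the question; item sizes are hypothetical.
    for avg_bytes in (1024, 5 * 1024):
        size = estimated_size_gb(6_000_000, avg_bytes)
        print(f"{avg_bytes} B/item: {size:.1f} GB, "
              f"free: {fits_free_storage(6_000_000, avg_bytes)}")
```

With 6 million records, the table stays under 25 GB as long as the average item is smaller than roughly 4 KB, in which case storage cost alone is not a reason to archive.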