Question:
I have two DynamoDB tables: Table_1 and Table_2. I am trying to deprecate Table_1 by copying its data into Table_2, which has different GSIs and different LSIs.
Table_1 attributes are: Id, state, isReused, empty, normal
Table_2 attributes are: UserId, Status, isOld, normal
Id maps to UserId, state maps to Status, normal maps to normal, empty is dropped from Table_2, and if state is “OLD” then isOld is set to true.
What is the best way to export this data from Table_1, do the transform on the attributes/data, and then load the information back into Table_2?
Currently, I am able to use AWS Data Pipeline to import/export data from Table_1 to Table_2 with the provided templates, but this does not perform the transforms. I’m guessing that I need to use EMR to do the transforms.
I also use DynamoDB Streams to keep the tables in sync, but from my understanding, a DynamoDB stream only captures items that are created or modified after the stream is enabled, not items that already exist in the table.
Answer:
Assuming that you need this data movement only once, I can think of two options:
- Spawn an EMR cluster with Hive installed. Create two Hive tables, one mapped to Table_1 and one to Table_2. Then do something like ‘insert into Table_2 select … from Table_1’. This gives you full control over the transformations you need (see the HiveQL sketch after this list).
- It is true that only new and updated items are sent to a DynamoDB stream. If you already have a syncer set up, why not ‘update’ all existing items in Table_1 once? Just add a dummy field to each of them so that they are pushed to Table_1’s stream, after which they’ll be copied to Table_2 automatically (see the boto3 sketch after this list). Just make sure that your syncer ignores this dummy field. Advantages of this approach:
- Will handle race conditions well.
- Based on your business logic, you can quickly check whether an item in Table_1 has already been synced to Table_2. This gives you the freedom to copy only those Table_1 items that haven’t been written to Table_2 yet.
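For the first option, here is a minimal HiveQL sketch of what the two table definitions and the transform might look like. It assumes the attribute names and mappings given in the question, uses the DynamoDB storage handler that ships with EMR, and stores isOld as a string; adjust the column types, throughput settings, and the handling of isReused to match your actual schema.

```sql
-- External Hive tables backed by the two DynamoDB tables
-- (table names and column mappings are assumed from the question)
CREATE EXTERNAL TABLE table_1 (id STRING, state STRING, isreused STRING, `empty` STRING, normal STRING)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Table_1",
  "dynamodb.column.mapping" = "id:Id,state:state,isreused:isReused,empty:empty,normal:normal"
);

CREATE EXTERNAL TABLE table_2 (userid STRING, status STRING, isold STRING, normal STRING)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Table_2",
  "dynamodb.column.mapping" = "userid:UserId,status:Status,isold:isOld,normal:normal"
);

-- The transform: Id -> UserId, state -> Status, drop "empty", derive isOld from state
INSERT OVERWRITE TABLE table_2
SELECT id,
       state,
       CASE WHEN state = 'OLD' THEN 'true' ELSE 'false' END,
       normal
FROM table_1;
```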
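For the second option, the one-off ‘touch’ of every existing item could be sketched roughly like this with boto3. The table name, the key attribute (Id), and the dummy attribute name (migrationTouch) are assumptions; use whatever your schema and syncer expect, and make sure the syncer ignores the dummy attribute.

```python
import boto3

# One-off "touch" of every existing item in Table_1 so each item shows up as a
# MODIFY event on the table's stream and gets picked up by the existing syncer.
# Table name, key attribute ("Id"), and dummy attribute ("migrationTouch") are assumptions.
dynamodb = boto3.resource("dynamodb")
table_1 = dynamodb.Table("Table_1")

scan_kwargs = {
    "ProjectionExpression": "#id",              # only the key is needed for the update
    "ExpressionAttributeNames": {"#id": "Id"},
}

while True:
    page = table_1.scan(**scan_kwargs)
    for item in page["Items"]:
        table_1.update_item(
            Key={"Id": item["Id"]},
            UpdateExpression="SET migrationTouch = :t",
            ExpressionAttributeValues={":t": True},
        )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```

Note that this consumes write capacity on Table_1 and, through the syncer, on Table_2, so you may want to throttle the loop for large tables.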