How to use AWS Glue / Spark to convert CSVs partitioned and split in S3 to partitioned and split Parquet


In AWS Glue’s catalog, I have an external table defined with partitions that looks roughly like this in S3 and partitions for new dates are added daily:
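For illustration only (the bucket name, prefix, and partition column here are hypothetical, since the original listing isn’t shown), a date-partitioned CSV layout in S3 typically looks like:

```
s3://my-bucket/input/dt=2018-01-01/file-0001.csv
s3://my-bucket/input/dt=2018-01-01/file-0002.csv
s3://my-bucket/input/dt=2018-01-02/file-0001.csv
...
```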

How could I use Glue/Spark to convert this to Parquet that is also partitioned by date and split across n files per day? The examples don’t cover partitioning, splitting, or provisioning (how many nodes and how big). Each day contains a couple hundred GBs.

Because the source CSVs are not necessarily in the right partitions (wrong date) and are inconsistent in size, I’m hoping to write to partitioned Parquet with the correct partitions and more consistent file sizes.


Since the source CSV files are not necessarily under the right date, you could add columns carrying the collection date/time (or use an existing date column if one is already available):

Then your job can carry this information into the output DynamicFrame and ultimately use it as partition keys. Some sample code of how to achieve this:
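A minimal sketch of such a job, untested, and with several assumptions not taken from the original setup: the database/table names, the S3 output path, the `collect_timestamp` column, and the derived `dt` partition column are all placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col, to_date

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the catalog table backing the CSVs (database/table names are placeholders)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_csv_table")

# Derive the true partition date from a column in the data, rather than
# trusting the (possibly wrong) source partition; "collect_timestamp" is assumed
df = datasource0.toDF().withColumn("dt", to_date(col("collect_timestamp")))

# Repartition by the derived date so each day's rows are grouped together
df = df.repartition("dt")

partitioned_dynamicframe = DynamicFrame.fromDF(df, glueContext, "partitioned_df")

# Write Parquet partitioned by the derived date (output path is a placeholder)
glueContext.write_dynamic_frame.from_options(
    frame=partitioned_dynamicframe,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/", "partitionKeys": ["dt"]},
    format="parquet")

job.commit()
```

This only runs inside a Glue job context, so treat it as a starting point rather than a drop-in script.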

I did not test it, but the script is just a sample of how to achieve this and is fairly straightforward.


1) As for having specific file sizes/numbers in output partitions:

Spark’s coalesce and repartition features are not yet implemented in Glue’s Python API (only in Scala).

You can convert your dynamic frame into a data frame and leverage Spark’s partition capabilities.

# Convert to a DataFrame and repartition; repartition(1) collapses everything
# into a single output file per write (you could also repartition by a column)
partitioned_dataframe = datasource0.toDF().repartition(1)

# Convert back to a DynamicFrame for further processing
partitioned_dynamicframe = DynamicFrame.fromDF(
    partitioned_dataframe, glueContext, "partitioned_df")

The good news is that Glue has an interesting feature: if you have more than 50,000 input files per partition, it will automatically group them for you.

In case you want to enable this behavior regardless of the number of input files (your case), you may set the following connection_options while creating a dynamic frame from options:
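Concretely, the grouping options look like this (the S3 path is a placeholder; `groupFiles` and `groupSize` are Glue’s documented option names):

```python
# Options to pass as connection_options to
# glueContext.create_dynamic_frame.from_options(connection_type="s3", ...);
# the S3 path below is a placeholder.
connection_options = {
    "paths": ["s3://my-bucket/input/"],
    "groupFiles": "inPartition",  # group files within each S3 partition
    "groupSize": "1048576",       # target group size in bytes (1 MB)
}
```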

In the previous example, it would attempt to group files into 1MB groups.

It is worth mentioning that this is not the same as coalesce, but it may help if your goal is to reduce the number of files per partition.

2) If files already exist in the destination, will it just safely add them (not overwrite or delete)?

Glue’s default SaveMode for write_dynamic_frame.from_options is to append.

When saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.

3) Given each source partition may be 30-100 GB, what’s a guideline for the number of DPUs?

I’m afraid I won’t be able to answer that. It depends on how fast it’ll load your input files (size/number), your script’s transformations, etc.
