Question:
I need to define a grok pattern in an AWS Glue Classifier to capture the datestamp
with milliseconds in the datetime column of a file (which is converted to string
by the AWS Glue Crawler). I used the DATESTAMP_EVENTLOG pattern
predefined in AWS Glue and tried to add the milliseconds to the pattern.
Classification: datetime
Grok pattern: %{DATESTAMP_EVENTLOG:string}
Custom patterns:
MILLISECONDS (\d){3,7}
DATESTAMP_EVENTLOG %{YEAR}-%{MONTHNUM}-%{MONTHDAY}T%{HOUR}:%{MINUTE}:%{SECOND}.%{MILLISECONDS}
Answer:
A common misconception about Classifiers is that they parse individual values. They are for specifying file formats, in addition to the built-in ones like JSON, CSV, etc., and NOT for specifying parse formats for individual data types.
As user @lilline suggests, the best way to change a data type is with an ApplyMapping function.
When creating a Glue Job, you can select the option "A proposed script generated by AWS Glue".
Then when selecting the table from the Glue Catalog as a source, you can make changes to the datatypes, column names, etc.
The output code might look something like the following:
applymapping1 = ApplyMapping.apply(
    frame = datasource0,
    mappings = [
        ("paymentid", "string", "paymentid", "string"),
        ("updateddateutc", "string", "updateddateutc", "timestamp"),
        ...
    ],
    transformation_ctx = "applymapping1"
)
This effectively casts the updateddateutc string to a timestamp.
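To make the string-to-timestamp cast concrete for values with fractional seconds, here is a minimal plain-Python sketch that runs outside Glue. The helper name parse_utc_timestamp and the sample value are hypothetical, and the YYYY-MM-DDTHH:MM:SS.fff layout is assumed from the pattern in the question. It only illustrates the idea and is not what Glue executes internally.

```python
from datetime import datetime

def parse_utc_timestamp(value: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SS[.fraction]' into a datetime (hypothetical helper)."""
    # Split off the fractional seconds. strptime's %f accepts at most
    # 6 digits, so longer fractions (e.g. 7 digits) are truncated.
    head, _, frac = value.partition(".")
    frac = frac[:6] if frac else "0"
    return datetime.strptime(f"{head}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")

# Sample value is made up for illustration.
ts = parse_utc_timestamp("2019-03-05T10:15:30.1234567")
print(ts.isoformat())
```

Truncating to six digits is a design choice made here because Python's datetime only stores microseconds. How Glue or Spark handles longer fractions may differ.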
To create a Classifier, you would need to specify each individual column in the file:
Classifier type: Grok
Classification: Name
Grok pattern: %{MY_TIMESTAMP}
Custom patterns
MY_TIMESTAMP (%{USERNAME:test}[,]%{YEAR:year}[-]%{MONTHNUM:mm}[-]%{MONTHDAY:dd} %{TIME:time})
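As a rough way to see how such a pattern splits a whole line into columns, the sketch below approximates the custom pattern above with a Python regular expression. The sub-patterns are simplified stand-ins for the Grok built-ins (an assumption, not the exact definitions Glue ships), and the function name split_columns and the sample line are hypothetical.

```python
import re

# Simplified stand-ins for the Grok built-ins used in MY_TIMESTAMP.
# These are assumptions for illustration, not Glue's exact definitions.
LINE_RE = re.compile(
    r"(?P<test>[a-zA-Z0-9._-]+),"                 # %{USERNAME:test} then [,]
    r"(?P<year>\d{4})-"                           # %{YEAR:year} then [-]
    r"(?P<mm>0?[1-9]|1[0-2])-"                    # %{MONTHNUM:mm} then [-]
    r"(?P<dd>0[1-9]|[12]\d|3[01]|[1-9]) "         # %{MONTHDAY:dd} then a space
    r"(?P<time>\d{1,2}:\d{2}:\d{2}(?:[.,]\d+)?)"  # %{TIME:time}
)

def split_columns(line: str):
    """Return a dict of named columns for a matching line, or None."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None

# Sample line is made up for illustration.
print(split_columns("jdoe,2019-03-05 10:15:30.123"))
```

Each named group plays the role of one column, which is why every column in the file has to appear in the pattern.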