Question:
I’m following the Athena getting started guide and trying to parse my own Cloudfront logs. However, the fields are not being parsed.
I used a small test file, as follows:
1 2 3 4 5 |
#Version: 1.0 #Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type 2016-02-02 07:57:45 LHR5 5001 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /foo 404 - Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Error -tHYQ3YpojqpR8yFHCUg5YW4OC_yw7X0VWvqwsegPwDqDFkIqhZ_gA== d3g47gpj5mj0b.cloudfront.net https421 0.076 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error 2016-02-02 07:57:45 LHR5 1158241 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /images/posts/cover/404.jpg 200 https://d3g47gpj5mj0b.cloudfront.net/foo Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Miss oUdDIjmA1ON1GjWmFEKlrbNzZx60w6EHxzmaUdWEwGMbq8V536O4WA== d3g47gpj5mj0b.cloudfront.net https 419 0.440 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss |
And created the table with this SQL:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs ( `Date` DATE, Time STRING, Location STRING, Bytes INT, RequestIP STRING, Method STRING, Host STRING, Uri STRING, Status INT, Referrer STRING, os STRING, Browser STRING, BrowserVersion STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$" ) LOCATION 's3://test/athena-csv/' |
But no data comes back:
I can see it returns 4 rows, but the first 2 should be excluded because they start with a #, so it’s like the regex isn’t being parsed correctly.
Am I doing something wrong? Or is the regex wrong (seems unlikely, as it’s in the docs, and looks fine to me)?
Answer:
This is what I ended up with:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
CREATE EXTERNAL TABLE logs ( `date` date, `time` string, `location` string, `bytes` int, `request_ip` string, `method` string, `host` string, `uri` string, `status` int, `referer` string, `useragent` string, `uri_query` string, `cookie` string, `edge_type` string, `edget_requiest_id` string, `host_header` string, `cs_protocol` string, `cs_bytes` int, `time_taken` string, `x_forwarded_for` string, `ssl_protocol` string, `ssl_cipher` string, `result_type` string, `protocol` string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'input.regex' = '^(?!#.*)(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s*(\\S*)' ) LOCATION 's3://logs' |
Note the double backslashes are intentional.
The format of the cloudfront logs changed at some point to add the protocol
. This handles older and newer files.