Question:
Amazon S3 has a new feature called select from
which allows one to run simple SQL queries against simple data files – like CSV or JSON. So I thought I’d try it.
I created and uploaded the following CSV to my S3 bucket in Oregon (I consider this file to be extremely simple):
1 2 3 4 5 |
aaa,bbb,ccc 111,111,111 222,222,222 333,333,333 |
I indicated this was CSV with a header row and issued the following SQL:
select * from s3object s
…which worked as expected, returning:
1 2 3 4 |
111,111,111 222,222,222 333,333,333 |
Then I tried one of the provided sample queries, which failed:
1 2 |
select s._1, s._2 from s3object s |
…the error message was “Some headers in the query are missing from the file. Please check the file and try again.”.
Also tried the following, each time receiving the same error:
1 2 3 4 5 6 |
select aaa from s3object s select s.aaa from s3object s select * from s3object s where aaa = 111 select * from s3object s where s.aaa = 111 select * from s3object s where s._1 = 111 |
So anytime my query references a column, either by name or number, either in the SELECT or WHERE clauses, I get the “headers in the query are missing”. The AWS documentation provides no follow up information on this error.
So my question is, what’s wrong? Is there an undocumented requirement about the column headers? Is there an undocumented way to reference columns? Does the “Select From” feature have a bug in it?
Answer:
I did the following:
- Created a file with the contents you show above
- Entered S3 Select on the file, and ticked File has header row
- Changed no other settings
These queries did NOT work:
1 2 3 |
select s._1, s._2 from s3object s select * from s3object s where s._1 = 111 |
The reason they didn’t work is that the file contains headers, so the columns have actual names.
These queries DID work:
1 2 3 4 5 |
select aaa from s3object s select s.aaa from s3object s select * from s3object s where aaa = 111 (Gave empty result) select * from s3object s where s.aaa = 111 (Gave empty result) |
When I treated the last two queries as strings, they returned the row as expected:
1 2 3 |
select * from s3object s where aaa = '111' select * from s3object s where s.aaa = '111' |