Question:
I am trying to query data from json data in s3-select.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
{ person = [ { "Id": 1, "Name": "Anshu", "Address": "Templestowe", "Car": "Jeep" } { "Id": 2, "Name": "Ben Mostafa", "Address": "Las Vegas", "Car": "Mustang" } { "Id": 3, "Name": "Rohan Wood", "Address": "Wooddon", "Car": "VW" } ] } QUERY = "select * from S3Object s" QUERY = "select s.person from S3Object s" QUERY = "select s.person[0] from S3Object s" QUERY = "select s.person[0].Name from S3Object s" |
All these queries work fine and returns the respective object as desired but
when i am trying to search data on name/Car, it doesn’t work.
1 2 |
QUERY = "select * from S3Object s where s.person.Name = \"Anshu\" " |
error: com.amazonaws.services.s3.model.AmazonS3Exception: The column index at line 1, column 32 is invalid.
There is not much related content available on s3-select online.
Wondering whether we can query on field name or not!
There are no examples of select query with where clause for s3-select given in the documentation
Answer:
I can’t find this in any of AWS documentations, but I was just playing around and discovered a working syntax:
1 2 |
QUERY = "select * from S3Object s where 'Anshu' in s.person[*].Name" |
Based on some deductions:
- I know that syntax like WHERE (‘blah’ in s.tags) work when tags property is an array of string.
- AWS documentation also say that s.person[#] should work when # is a valid index/digit. Based on this, I discovered that using star (*) between square bracket, as in s.person[*].Name, also work. This is after failed testing of various syntax like s.Person[], s.Person[#], s.Person[?], etc…
Proof with Python and Boto3:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
import boto3 S3_BUCKET = 'your-bucket-name' s3 = boto3.client('s3') r = s3.select_object_content( Bucket=S3_BUCKET, Key='your-file-name.json', ExpressionType='SQL', Expression="select * from s3object s where 'Anshu' in s.person[*].Name", InputSerialization={'JSON': {"Type": "Lines"}}, OutputSerialization={'JSON': {}} ) for event in r['Payload']: if 'Records' in event: records = event['Records']['Payload'].decode('utf-8') print(records) |
Weird, I know. Remember to set [default] credentials in ~/.aws/credentials file.