How to query S3 objects using AWS S3 SELECT with example?
Hello Everyone
Welcome to CloudAffaire and this is Debjeet.
Today we will discuss how to query S3 objects (CSV, JSON, Compressed) using AWS S3 SELECT with examples.
What is AWS S3 SELECT?
With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data.
Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and server-side encrypted objects. You can specify the format of the results as either CSV or JSON, and you can determine how the records in the result are delimited.
You pass SQL expressions to Amazon S3 in the request. Amazon S3 Select supports a subset of SQL. You can perform SQL queries using AWS SDKs, the SELECT Object Content REST API, the AWS Command Line Interface (AWS CLI), or the Amazon S3 console. The Amazon S3 console limits the amount of data returned to 40 MB. To retrieve more data, use the AWS CLI or the API.
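For example, an expression passed to S3 Select is ordinary SQL over a single table named s3object. The column names in this sketch are purely illustrative and not taken from any particular dataset:

SELECT s.name, s.city
FROM s3object s
WHERE s.city = 'Bangalore'
LIMIT 10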
Requirements and limitations of S3 SELECT:
- You must have s3:GetObject permission for the object you are querying (see the example policy right after this list).
- If the object you are querying is encrypted with a customer-provided encryption key (SSE-C), you must use https, and you must provide the encryption key in the request.
- The maximum length of a SQL expression is 256 KB.
- The maximum length of a record in the input or result is 1 MB.
- Amazon S3 Select can only emit nested data using the JSON output format.
- You cannot specify the S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, or REDUCED_REDUNDANCY storage classes.
- For Parquet objects, Amazon S3 Select supports only columnar compression using GZIP or Snappy. Amazon S3 Select doesn’t support whole-object compression for Parquet objects.
- Amazon S3 Select doesn’t support Parquet output. You must specify the output format as CSV or JSON.
- For Parquet objects, the maximum uncompressed row group size is 512 MB.
- You must use the data types specified in the object’s schema.
- Selecting on a repeated field returns only the last value.
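For the first requirement in the list above, a minimal identity-based policy sketch that grants read access to the demo bucket used later in this post could look like the following (the Sid is arbitrary, and how you attach the policy, for example to a user or role, is up to you):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3SelectRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::cloudaffaire-s3-select-demo/*"]
    }
  ]
}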
Note: AWS also has a dedicated service named Athena that can be used to query data stored in S3 buckets.
S3 SELECT VS Athena:
S3 Select is an S3 feature that works by retrieving a subset of an object’s data (using simple SQL expressions) instead of the entire object, which can be up to 5 terabytes in size. S3 Select runs a query against a single object at a time in the S3 bucket.
Amazon Athena, on the other hand, is a query service that makes it easy to analyze data stored in S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. It scales automatically and executes queries in parallel, which produces fast results even with large datasets and complex queries.
How to query S3 objects using AWS S3 SELECT with example?
Prerequisites:
AWS CLI installed and configured.
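Before starting, you can quickly confirm that the CLI is installed and that credentials are configured. Both commands below are standard AWS CLI calls:

## Confirm the AWS CLI is installed
aws --version

## Confirm credentials are configured (prints the account and ARN of the caller)
aws sts get-caller-identity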
Step 1: Create an S3 bucket.
## Create an S3 bucket
aws s3api create-bucket \
--bucket cloudaffaire-s3-select-demo \
--region ap-south-1 \
--create-bucket-configuration LocationConstraint=ap-south-1
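If you want to confirm the bucket was created before moving on, the following optional check returns silently on success and raises an error if the bucket is missing or inaccessible:

## Optional: verify the bucket exists and is accessible
aws s3api head-bucket --bucket cloudaffaire-s3-select-demo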
Step 2: Create a bucket policy.
## Get S3 bucket ARN and AWS Account ARN
S3_BUCKET_ARN='arn:aws:s3:::cloudaffaire-s3-select-demo' &&
ACCOUNT_ARN=$(aws sts get-caller-identity | jq -r .Arn)

## Create an S3 bucket policy definition file
cat << EOF > bucket_policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3Access",
      "Principal": {"AWS": ["$ACCOUNT_ARN"]},
      "Effect": "Allow",
      "Action": ["s3:*"],
      "Resource": ["$S3_BUCKET_ARN/*"]
    }
  ]
}
EOF

## Attach the policy to the S3 bucket
aws s3api put-bucket-policy \
--bucket cloudaffaire-s3-select-demo \
--policy file://bucket_policy.json
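To double-check that the policy was attached, you can read it back. The command below simply prints the policy document stored on the bucket:

## Optional: view the policy attached to the bucket
aws s3api get-bucket-policy \
--bucket cloudaffaire-s3-select-demo \
--query Policy --output text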
Step 3: Upload some sample employee data in CSV and JSON format.
## Download sample employee data
git clone https://github.com/CloudAffaire/sample_data.git

## Upload employee sample data (in json and csv format)
aws s3api put-object \
--bucket cloudaffaire-s3-select-demo \
--key employees.csv \
--body sample_data/employees.csv &&
aws s3api put-object \
--bucket cloudaffaire-s3-select-demo \
--key employees.json \
--body sample_data/employees.json
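You can list the bucket to confirm both objects were uploaded. This is a plain listing call, nothing specific to S3 Select:

## Optional: confirm both objects landed in the bucket
aws s3api list-objects-v2 \
--bucket cloudaffaire-s3-select-demo \
--query 'Contents[].Key'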
Step 4: Query JSON and CSV data in the S3 bucket using S3 SELECT.
## Select CSV data using S3 select
aws s3api select-object-content \
--bucket cloudaffaire-s3-select-demo \
--key employees.csv \
--expression "select * from s3object limit 2" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {}, "CompressionType": "NONE"}' \
--output-serialization '{"CSV": {}}' "csv_output.csv"

## Check the output
cat csv_output.csv

## You can also output CSV data in JSON format
aws s3api select-object-content \
--bucket cloudaffaire-s3-select-demo \
--key employees.csv \
--expression "select * from s3object limit 2" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {}, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "csv_output.json"

## Check the output
cat csv_output.json

## Select JSON data using S3 select
aws s3api select-object-content \
--bucket cloudaffaire-s3-select-demo \
--key employees.json \
--expression "select * from s3object limit 2" \
--expression-type 'SQL' \
--input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "json_output.json"

## Check the output
cat json_output.json
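The queries above return every column. If your CSV file has a header row, you can also select individual columns by name by setting FileHeaderInfo to USE in the input serialization. The sketch below assumes the sample CSV has a header row containing a column named EMPLOYEE_ID (adjust the column name to match your file; the output file name is arbitrary):

## Select a single column by name (requires a CSV header row and FileHeaderInfo USE)
aws s3api select-object-content \
--bucket cloudaffaire-s3-select-demo \
--key employees.csv \
--expression "select s.EMPLOYEE_ID from s3object s limit 5" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"}' \
--output-serialization '{"CSV": {}}' "csv_columns.csv"

## Check the output
cat csv_columns.csv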
Step 5: Clean up.
## Delete the S3 bucket with objects
aws s3 rb \
s3://cloudaffaire-s3-select-demo --force
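If you also want to remove the local files created during this demo (the cloned sample data, the policy file, and the query outputs), something like the following works; adjust the list if you named your output files differently:

## Optional: remove local files created during the demo
rm -f bucket_policy.json csv_output.csv csv_output.json json_output.json
rm -rf sample_data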
S3 SELECT supports a range of SQL features such as the WHERE clause, operators, and aggregate functions. Please refer to the below document for a complete reference of the S3 SELECT SQL syntax.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference.html
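As a quick illustration of an aggregate function, a count over the CSV object from Step 4 looks like this (FileHeaderInfo is set to USE so a header row, if present, is not counted as data; the output file name is arbitrary):

## Count the records in the CSV object
aws s3api select-object-content \
--bucket cloudaffaire-s3-select-demo \
--key employees.csv \
--expression "select count(*) from s3object" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"}' \
--output-serialization '{"CSV": {}}' "count_output.csv"

## Check the output
cat count_output.csv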
Hope you have enjoyed this article. To get more details on AWS S3, please refer to the below documentation.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
A reader comment on this post:
No problem with the select *, but if I instead try to run the query with
select EMPLOYEE_ID from s3object limit 100
I get the error:
An error occurred (InvalidColumnIndex) when calling the SelectObjectContent operation: The column index at line 1, column 8 is invalid. Please check the service documentation and try again.
Any idea?
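A likely explanation, assuming this refers to the employees.csv object used above: when --input-serialization does not set FileHeaderInfo to USE, S3 Select does not treat the first CSV row as column names, so columns can only be referenced positionally (s._1, s._2, and so on), and a named column such as EMPLOYEE_ID fails with an error like the one shown. Enabling the header row should let the query run:

## Enable the CSV header row so columns can be selected by name
aws s3api select-object-content \
--bucket cloudaffaire-s3-select-demo \
--key employees.csv \
--expression "select s.EMPLOYEE_ID from s3object s limit 100" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"}' \
--output-serialization '{"CSV": {}}' "employee_id_output.csv"

## Alternatively, keep the default input serialization and reference the column by position, for example: select s._1 from s3object s limit 100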