How to query S3 objects using AWS S3 SELECT with example?

How to query S3 objects using AWS S3 SELECT with example?

How to query S3 objects using AWS S3 SELECT with example?

Hello Everyone

Welcome to CloudAffaire and this is Debjeet.

Today we will discuss how to query S3 objects (CSV, JSON, Compressed) using AWS S3 SELECT with examples.

What is AWS S3 SELECT?

With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data.

Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and server-side encrypted objects. You can specify the format of the results as either CSV or JSON, and you can determine how the records in the result are delimited.

You pass SQL expressions to Amazon S3 in the request. Amazon S3 Select supports a subset of SQL. You can perform SQL queries using AWS SDKs, the SELECT Object Content REST API, the AWS Command Line Interface (AWS CLI), or the Amazon S3 console. The Amazon S3 console limits the amount of data returned to 40 MB. To retrieve more data, use the AWS CLI or the API.

Requirements and limitations of S3 SELECT:

  • You must have s3:GetObject permission for the object you are querying.
  • If the object you are querying is encrypted with a customer-provided encryption key (SSE-C), you must use https, and you must provide the encryption key in the request.
  • The maximum length of a SQL expression is 256 KB.
  • The maximum length of a record in the input or result is 1 MB.
  • Amazon S3 Select can only emit nested data using the JSON output format.
  • You cannot specify the S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, or REDUCED_REDUNDANCY storage classes.
  • Amazon S3 Select supports only columnar compression using GZIP or Snappy. Amazon S3 Select doesn’t support whole-object compression for Parquet objects.
  • Amazon S3 Select doesn’t support Parquet output. You must specify the output format as CSV or JSON.
  • The maximum uncompressed row group size is 512 MB.
  • You must use the data types specified in the object’s schema.
  • Selecting on a repeated field returns only the last value.

Note: AWS also has a dedicated service named Athena that can be used to query S3 bucket.

S3 SELECT VS Athena:

S3 Select is an S3 feature designed It works by retrieving a subset of an object’s data (using simple SQL expressions) instead of the entire object, which can be up to 5 terabytes in size. s3 select runs query on a single object at a time in the s3 bucket.

Amazon Athena on the other hand is a query service that makes it easy to analyze data stored in S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, pay only for the queries. It scales automatically – executing queries in parallel, this makes it produce faster results, even with large datasets and complex queries.

How to query S3 objects using AWS S3 SELECT with example?

Prerequisites:

AWS CLI installed and configured.

Step 1: Create an S3 bucket.

Step 2: Create a bucket policy.

Step 3: Upload some sample employee data in CSV and JSON format.

Step 4: Query JSON and CSV data in the S3 bucket using S3 SELECT.

Step 5: Clean up.

S3 SELECT supports the full range of SQL features like WHERE clause, OPERATORS, AGGREGATE Functions, etc. Please refer to the below document for a complete reference of S3 SELECT SQL syntax.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference.html

Hope you have enjoyed this article. To get more details in AWS S3, please refer to the below documentation.

https://docs.aws.amazon.com/aws-backup/index.html

Leave a Reply

Close Menu