I am currently using AWS S3 + Athena for a project.
However, after using it for 1-2 months, I find some limitations about it.
I am not sure if I do not know how to use it or it is really a limitation.
But please do not ask why I choose to use it before enough research. I think that there are 2 points:
- It is required by the project
- The resources of Athena S3 and AWS is not quite centralized and its functionality keeps changing. It is difficult for me to find what Athena + S3 can do or cannot do before actually using it for some time.
Get carried away too far, now back to the topic. >_<
Currently, I am facing a problem. With the increasing data volume, the data scan size and the query are getting bigger and longer dramatically (sometimes, there is even exception happening, such as too many open files when I perform a query). However, it seems that there is only partition but not index for AWS S3 + Athena. Hence, the problems come.
Can I do something like an index at AWS S3 + Athena?
If I use a partition, it seems that only one composite key (one or more columns as labels at S3 folder) can be specified; otherwise, the data size will be doubled. Is this true?
Even I am willing to increase the data size, it is impossible for a table with 2 composite keys. I must have 2 Athena tables and 2 the same set of the data but at 2 types of partition in S3 in order to achieve this. Is this true?
For the error “too many open files”, after some research, it seems that it is a OS level issue with a predefined limited number of file descriptor. My current situation is that the SQL does not have exception most of the time, but at certain period, it has exception easily. My understanding is that Amazon will have a cluster of computers (for example, 32 node servers) to serve certain number of customers, including my company and other companies. Each server has a limited number of file descriptors available and shared among all the customers. Then, at some peak periods (other companies are performing heavy queries), the available number of file descriptor will drop and this also explains why my SQL with the same set of data has exception sometimes but not always. Is this true?
Due to lack of index function, S3 + Athena is not supposed to perform complicated SQL queries. That means, complicated joining logic can only be done somewhere at the transformation layer before loading into S3. Is this true?
This question follows the previous one, Question 5.
Let me use a simple example to illustrate:
A reporting system is developed to display Order and Trade. The relationship is that after execution of an order, a trade will be generated.The Order_ID is the key to link up a trade and its related Order Activities. The partition is set to the date.
Now, the following data comes:
The requirement is that:
1. For report on day 1, only Order record O001-Place Order is shown
2. For report on day 2, only Order record O002-Order Change Order is shown
3. For report on day 3, all records, including the 4 Order records and the 1 Trade record, are shown
4. For report on day 4, only Order record O004-Remark Change is shown
For Days 1, 2 and 4, it is easy as I just display what the data comes on the same day.
However, for Day 3, I need to display all the data, some in the past and some in the future (O001-Remark Change).
In order to avoid complicated SQL, I can only do the join logic at the Transformation layer.
However, when performing the transformation on Day 3, if the party does not send me the data on Days 1 and 2, you can only find the historical files, which is not good as you never know how many days you need to search back.
Even we do the search at Athena, as the Order_ID is not at the partition, a full table scan is needed.
The above is not the worst, the worst case is that at the transformation on Day 3, O001-Remark Change on Day 4 is future data and should not be known on Day 3.
Is any better way to do that? Or AWS S3 + Athena is just not suitable for such complicated case (The above case is just a simplified ver of my current situation)?
I know that my questions are too many and quite a lot. But all of these are what I really want to clarify and there is no clear answer I can find. Any help is highly appreciated and thanks so much.
No, Amazon Athena (and Presto, upon which it is based) does not support indexes. This is because Athena/Presto (and even Redshift) are designed for Big Data, so even an index on Big Data is also Big Data so it would not be efficient to maintain a huge index.
While traditional databases get faster with indexes, this does not apply to Big Data systems. Instead, use indexing, compression and columnar data formats to improve performance.
Partitions are hierarchical (eg
Year -> Month -> Day). The objective with partitions is to “skip over” files that don’t need to be read. Therefore, they will only provide a benefit if the
WHERE clause uses the partition hierarchy.
For example, using
SELECT ... WHERE year=2018 will use the partition and skip all other years.
If your query does not put one of these partitioned fields in the
WHERE clause, then all directories and files need to be scanned, so there is no benefit gained.
You say “the data size will be doubled”, but this is not the case. All data is stored only once. Partitioning does not modify data size.
too many open files
If this is an error generated by Amazon Athena, then you should raise it with AWS Support.
Athena can certainly do complex queries, but it would not be the ideal platform. If you are regularly doing complex SQL queries, consider using Amazon Redshift.
Question 6: Table/Query
I am unable to follow the flow of your requirements. If you are seeking help for SQL, please create a separate question showing a sample of the data in the table and a sample of the output you are seeking.