Question:
I’m trying to understand how to work with s3 inventory.
I’m following this tutorial
After loading inventory lists into my table I’m trying to query it and find two issues.
1) SELECT key, size FROM table;
Size column for all records show a magic number(value) 4923069104295859283
2) select * from table;
Query Id: cf07c309-c685-4bf4-9705-8bca69b00b3c
.
Receiving error:
1 2 |
HIVE_BAD_DATA: Field size's type LONG in ORC is incompatible with type varchar defined in table schema |
Here is my table schema:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
CREATE EXTERNAL TABLE `table`( `bucket` string, `key` string, `version_id` string, `is_latest` boolean, `is_delete_marker` boolean, `size` bigint, `last_modified_date` timestamp, `e_tag` string, `storage_class` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://......../hive' TBLPROPERTIES ( 'transient_lastDdlTime'='1516093603') |
Answer:
The following command of any of your orc file coming from the inventory generated by AWS S3 will give you the actual structure of your inventory:
1 2 |
gt; hive –orcfiledump ~/Downloads/017c2014-1205-4431-a30d-2d9ae15492d6.orc
…
Processing data file /tmp/017017c2014-1205-4431-a30d-2d9ae15492d6.orc [length: 4741786]
Structure for /mp/017c2014-1205-4431-a30d-2d9ae15492d6.orc
File Version: 0.12 with ORC_135
Rows: 223473
Compression: ZLIB
Compression size: 262144
Type: struct<bucket:string,key:string,size:bigint,last_modified_date:timestamp,e_tag:string,storage_class:string,is_multipart_uploaded:boolean,replication_status:string,encryption_status:string>
…
It appears that the example provided by aws here expects that your inventory is not just for the current version
but for all versions
of objects in your bucket.
The right table structure for Athena
is then for an encrypted bucket:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
CREATE EXTERNAL TABLE inventory( bucket string, key string, version_id string, is_latest boolean, is_delete_marker boolean, size bigint, last_modified_date timestamp, e_tag string, storage_class string, is_multipart_uploaded boolean, replication_status string, encryption_status string ) PARTITIONED BY (dt string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' LOCATION 's3://............/hive' TBLPROPERTIES ('has_encrypted_data'='true'); |