Question:
I have a dataset registered in Glue / Athena, call it my_db.table. I’m able to query it via Athena and everything generally seems to be in order.
I’m trying to use this table in a Glue job, but am getting the following fairly opaque error message:
py4j.protocol.Py4JJavaError: An error occurred while calling o54.getCatalogSource.
: java.lang.Error: No classification or connection in my_db.table
This would appear to indicate that Glue can’t see the catalog entry for my table, or can’t use the information in that entry, but I don’t have any further visibility than that.
Has anyone experienced this error, and does anyone know what might be causing it?
Answer:
The error message actually describes the problem well – there was no classification for the table being queried.
Tables created via Glue are registered with a Classification (csv, parquet, orc, avro, json). See Creating Tables Using Athena for AWS Glue Jobs.
The table I created ‘manually’ via Athena did not have a classification. See the below screenshot from the Glue ‘tables’ page.
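The same check can be made without the console by reading the table's parameters from the Glue Data Catalog. This is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names get_classification and fetch_table are made up for illustration.

```python
def get_classification(table):
    """Return the 'classification' parameter of a Glue table dict, or None.

    `table` is the dict found under the "Table" key of a Glue get_table response.
    """
    return (table.get("Parameters") or {}).get("classification")


def fetch_table(database, name):
    """Fetch a table definition from the Glue Data Catalog (needs AWS credentials)."""
    import boto3  # imported lazily so get_classification works without boto3

    glue = boto3.client("glue")
    return glue.get_table(DatabaseName=database, Name=name)["Table"]
```

With the names from the question, get_classification(fetch_table("my_db", "table")) should return None for a table in this state, and the classification string once one has been set.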
The solution is easy: at the end of the CREATE TABLE script, append a classification property like so:
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.my_table (
  `id` int,
  `description` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  'collection.delim' = 'undefined',
  'mapkey.delim' = 'undefined'
)
LOCATION 's3://my_bucket/'
TBLPROPERTIES ('classification'='csv');
Now the table has a classification within the Glue interface and is accessible via a Glue job.
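If recreating the table is inconvenient, it should also be possible to add the property to the existing catalog entry. The following is a sketch of that alternative, not the route taken above. It assumes boto3 and AWS credentials, the helper names are made up, and the key list reflects my reading of Glue's TableInput structure, so it may be incomplete for newer API versions.

```python
# Keys that Glue's TableInput structure accepts. get_table also returns
# read-only fields (CreateTime, DatabaseName, ...) that update_table rejects.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def with_classification(table, classification):
    """Build a TableInput dict from a get_table result, adding a classification."""
    table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}
    # Copy the parameters so the caller's dict is left untouched.
    params = dict(table_input.get("Parameters") or {})
    params["classification"] = classification
    table_input["Parameters"] = params
    return table_input


def set_classification(database, name, classification):
    """Write the classification back to the Glue Data Catalog (needs AWS credentials)."""
    import boto3  # imported lazily so with_classification works without boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=name)["Table"]
    glue.update_table(
        DatabaseName=database,
        TableInput=with_classification(table, classification),
    )
```

For the table above, the call would be set_classification("my_db", "my_table", "csv").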