Question:
I am trying to use Amazon Textract to read multiple multi-page PDF files using the StartDocumentTextDetection method, as follows:
client = boto3.client('textract')
textract_bucket = s3.Bucket('my_textract_console-us-east-2')
for s3_file in textract_bucket.objects.all():
    print(s3_file)
    response = client.start_document_text_detection(
        DocumentLocation = {
            "S3Object": {
                "Bucket": "my_textract_console_us-east-2",
                "Name": s3_file.key,
            }
        },
        ClientRequestToken=str(random.randint(1,1e10)))
    print(response)
    break
When I just print the object retrieved from S3, I see:
s3.ObjectSummary(bucket_name='my_textract_console-us-east-2', key='C:\\Users\\My_User\\Documents\\Folder\\Sub_Folder\\Sub_sub_folder\\filename.PDF')
Correspondingly, I’m using that s3_file.key to access the object later. But I’m getting the following error that I can’t figure out:
InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentTextDetection operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.
So far I have:
- Checked the region from the boto3 session; both the bucket and the AWS configuration settings are set to us-east-2.
- The key cannot be wrong; I’m passing it directly from the object response.
- Permissions-wise, I checked the IAM console, and it is set to AmazonS3FullAccess and AmazonTextractFullAccess.
What could be going wrong here?
[EDIT] I renamed the files so that they no longer contain \\, but it still does not work, which is odd.
Answer:
I ran into the same issue and solved it by specifying a region in the Textract client. In my case I used us-east-2:
client = boto3.client('textract', region_name='us-east-2')
The clue to do so came from this issue: https://github.com/aws/aws-sdk-js/issues/2714
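To avoid hard-coding the region, one option is to look up the bucket's region and create the Textract client in that same region. The following is only a minimal sketch under a few assumptions: the helper names (region_from_location_constraint, document_location, textract_client_for_bucket) are hypothetical and not part of boto3, and the caller is assumed to have permission to call GetBucketLocation. The sketch also builds DocumentLocation from the bucket name and key of the listed object rather than a retyped string, since the two bucket names in the question's code are spelled differently (one with a hyphen before "us-east-2", one with an underscore), which could produce the same error on its own.

```python
def region_from_location_constraint(constraint):
    """Translate the LocationConstraint returned by S3 GetBucketLocation
    into a region name usable as region_name.

    S3 reports None for buckets in us-east-1 and may report the legacy
    value "EU" for buckets in eu-west-1.
    """
    if constraint is None:
        return "us-east-1"
    if constraint == "EU":
        return "eu-west-1"
    return constraint


def document_location(bucket_name, key):
    """Build the DocumentLocation argument from the listed object's own
    bucket name and key, so the bucket name is never typed twice."""
    return {"S3Object": {"Bucket": bucket_name, "Name": key}}


def textract_client_for_bucket(bucket_name):
    """Hypothetical helper: return a Textract client in the bucket's region."""
    # Imported here so the pure helpers above can be used without boto3.
    import boto3

    s3_client = boto3.client("s3")
    location = s3_client.get_bucket_location(Bucket=bucket_name)
    region = region_from_location_constraint(location.get("LocationConstraint"))
    return boto3.client("textract", region_name=region)
```

With these helpers, the loop in the question could call textract_client_for_bucket(s3_file.bucket_name) once and then pass DocumentLocation=document_location(s3_file.bucket_name, s3_file.key) to start_document_text_detection.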