Question:
I have the AWS CLI and `boto3` installed in my Python 2.7 environment. I want to perform various operations, such as getting schema information and database details for all the tables present in the AWS Glue console. I tried the following sample script:
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
persons = glueContext.create_dynamic_frame.from_catalog(
    database="records",
    table_name="recordsrecords_converted_json")
print "Count: ", persons.count()
persons.printSchema()
```
I got the error `ImportError: No module named awsglue.transforms`, which makes sense, since no such package exists in `boto3` (I verified this with `dir(boto3)`). I found that `boto3` exposes the various service APIs as clients, which can be created with `client = boto3.client('glue')`. So, to get the same schema information, I tried the following sample code:
```python
import sys
import boto3

client = boto3.client('glue')
response = client.get_databases(
    CatalogId='string',
    NextToken='string',
    MaxResults=123
)
print client
```
But then I get this error:
AccessDeniedException: An error occurred (AccessDeniedException) when calling the GetDatabases operation: Cross account access is not allowed.
I am pretty sure that at least one of these approaches (probably both) is correct, but something isn't falling into place here. Any ideas on how to get the schema and database table details from AWS Glue locally with Python 2.7, as I tried above?
Answer:
The following code works for me; I am using a locally set up Zeppelin notebook connected to a dev endpoint. `printSchema` reads the schema from the Data Catalog. I hope you have enabled SSH tunnelling as well.
```python
%pyspark
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

# Create a DynamicFrame using the 'persons_json' table
medicare_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
    database="payments",
    table_name="medicaremedicare_hospital_provider_csv")

# Print out information about this data
print "Count: ", medicare_dynamicframe.count()
medicare_dynamicframe.printSchema()
```
You may also need to change some Spark interpreter settings in Zeppelin: tick the "Connect to existing process" option available at the top, and set the host (localhost) and port number (9007).
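For reference, the SSH tunnel to the dev endpoint is typically opened along these lines; this is a sketch, and the key file path and endpoint address are placeholders you must replace with your own values:

```shell
# Forward local port 9007 to the Glue dev endpoint so that Zeppelin's
# Spark interpreter can reach it at localhost:9007.
ssh -i /path/to/private-key.pem \
    -NTL 9007:169.254.76.1:9007 \
    glue@<dev-endpoint-public-dns>
```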
For the second part, you need to run `aws configure` to set up your credentials, and then create the Glue client after installing `boto3`. After this, check your proxy settings if you are behind a firewall or on a company network.
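As a minimal client-side sketch, assuming `aws configure` has stored valid credentials: note that the literal `CatalogId='string'` placeholder in the question is itself a plausible cause of the error, since Glue treats it as another account's catalog id, hence "Cross account access is not allowed". Omitting `CatalogId` defaults to your own account's catalog. The helper names below are illustrative, not part of any API.

```python
def database_names(response):
    # Pull the database names out of a get_databases response dict.
    return [db['Name'] for db in response.get('DatabaseList', [])]

def list_glue_databases():
    # List every database in this account's Glue Data Catalog.
    import boto3  # needs credentials from `aws configure`
    client = boto3.client('glue')
    names = []
    # No CatalogId argument: the call defaults to the caller's own
    # catalog instead of attempting cross-account access.
    paginator = client.get_paginator('get_databases')
    for page in paginator.paginate():
        names.extend(database_names(page))
    return names
```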
To be clear, the `boto3` client is the right tool for all client-side AWS API calls; for server-side work (running jobs and transforms), the Zeppelin approach is best.
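To illustrate the client-side route for the schema question specifically, here is a hedged sketch using `get_tables`; the database name and both helper functions are assumptions for illustration:

```python
def table_schema(table):
    # Extract (column name, type) pairs from one get_tables entry.
    columns = table.get('StorageDescriptor', {}).get('Columns', [])
    return [(col['Name'], col['Type']) for col in columns]

def print_catalog_schemas(database):
    # Print the schema of every table in one Glue database,
    # purely client-side: no Spark or awsglue modules needed.
    import boto3
    client = boto3.client('glue')
    paginator = client.get_paginator('get_tables')
    for page in paginator.paginate(DatabaseName=database):
        for table in page['TableList']:
            print(table['Name'])
            for name, col_type in table_schema(table):
                print('  %s: %s' % (name, col_type))
```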
Hope this helps.