Published on March 3, 2022 by Kevin Graham
The Glue connector is a metadata connector, which is used for querying and
creating tables in AWS Glue.
When you create an external table with this connector and give it the
name of a table that already exists in Glue, the connector looks up
the table’s column types, data location and storage format. The Kognitio external table is then created with the
appropriate column types, attributes and data connector. Alternatively, you can create a writable Kognitio external table with no corresponding table in Glue.
In this case the table is also created in Glue.
See external table connectors for
details of whether your deployment option supports this connector.
- Kognitio version 8.2.3 or later.
- Java 1.7 or later installed on all nodes.
- Access to AWS Glue. Make sure the AWS instances that make up your Kognitio system are assigned an IAM role that has permission to use Glue.
- Access to the S3 bucket in which the data for your Glue table is stored. The S3 data connector instance therefore needs credentials (see the S3 Connector reference sheet), or your AWS instance’s IAM role must have permission to access that S3 bucket.
To create a connector for the Glue data catalog on your cluster, make sure the java module is active and
then create a connector:
create module java mode active;
create connector myglue source java target 'class com.kognitio.javad.glueconnector.GlueConnector';
| Option | Type | Default | Description |
|---|---|---|---|
| catalog_id | string | AWS Account ID | The name of the Glue data catalogue in which to find the database and table. |
Note: On Kognitio on AWS systems a connector called GLUE connected to your default catalog is created by default.
The create external table command for use with the Glue connector accepts the following specific options – see
Create External Table for details of syntax and standard options.
The Glue connector queries the Glue catalog to get the column definitions for the table so they don’t need to be
specified in the create table statement.
| Option | Type | Default | Description |
|---|---|---|---|
| table | string | | The name of the table as known to AWS Glue, in the form <databasename>.<tablename> |
| use_glue_columns | boolean | true, except for Avro tables which provide a reader schema | Use the column types given by Glue to determine the column types of our external table. This is the default. If it’s set to false, it’s up to the data connector to infer the column types from the data files or from information provided by this connector. |
Example: Creating a connector and external table to read from a Glue table
To run the examples below you need permission to create connectors and external tables, the simple Glue tables must already exist, and a “test” schema must have been created on Kognitio.
The Glue connector uses the java plugin, so if the plugin is not already loaded and active, run:
create module java mode active;
Then create the connector:
create connector myglueconnector source java target 'class com.kognitio.javad.glueconnector.GlueConnector';
This creates a connector that reads tables from the default Glue data catalog, since no catalog_id is given.
Create the Kognitio external table referencing the Glue table. You do not need to specify the columns as they will be picked up from the Glue catalog:
create external table test.glue_elb_logs from myglueconnector target 'table test.elb_logs';
You can then test this with:
select top 100 * from test.glue_elb_logs;
If you want to run multiple queries against the table, you should create a view image and query that:
create view test.v_glue_elb_logs as select * from test.glue_elb_logs;
create view image test.v_glue_elb_logs;
select elb_name, count(*) from test.v_glue_elb_logs group by 1;
Currently, the data formats recognised are delimited text, ORC, Parquet and Avro.
Most Glue column types are represented by a similar type in Kognitio. The Glue BOOLEAN type maps to a Kognitio
TINYINT, using 0 for false and 1 for true. Only the scalar types are supported. Complex types such as ARRAY, MAP and
STRUCT are not supported.
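The type rules above can be checked before you create an external table. The following is a minimal Python sketch, not part of Kognitio or the connector: the function names are hypothetical, and it assumes Glue reports complex types as strings beginning with `array`, `map` or `struct` (for example `array<string>`). It only separates scalar from complex types and shows the 0/1 encoding of BOOLEAN; it does not attempt to reproduce the connector's full type mapping.

```python
# Illustrative only: these helpers are hypothetical and are not part of Kognitio.
# Assumption: complex Glue types are reported as strings such as
# "array<string>", "map<string,int>" or "struct<a:int>".
COMPLEX_PREFIXES = ("array", "map", "struct")


def is_supported_glue_type(glue_type: str) -> bool:
    """Return True for scalar types, False for ARRAY, MAP and STRUCT."""
    normalised = glue_type.strip().lower()
    return not normalised.startswith(COMPLEX_PREFIXES)


def unsupported_columns(columns: dict) -> list:
    """Given {column name: Glue type}, list the columns with complex types."""
    return [name for name, glue_type in columns.items()
            if not is_supported_glue_type(glue_type)]


def boolean_to_tinyint(value: bool) -> int:
    """Mirror the documented encoding: 0 for false, 1 for true."""
    return 1 if value else 0
```

For instance, passing a hypothetical column dictionary with one `bigint` column and one `array<string>` column to `unsupported_columns` returns only the name of the array column.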
If AWS Glue indicates that a table is partitioned, Kognitio expects the data files to be laid out according to Hive’s
partitioning scheme. This means a directory with a name of the form colname=value is assumed to contain rows in
which the value of the column colname is value, and the column colname should not appear in the file.
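To make the layout concrete, here is a minimal Python sketch of how partition values can be read back from a file path under Hive's naming scheme. It is an illustration only and not the connector's implementation: the function name and the example bucket and file names are hypothetical.

```python
# Illustrative only: a hypothetical helper, not the connector's own code.
def partition_values(path: str) -> dict:
    """Collect colname=value pairs from the directory components of a path."""
    values = {}
    # The last component is the data file itself; only directories
    # carry partition values, so it is skipped.
    for part in path.strip("/").split("/")[:-1]:
        name, sep, value = part.partition("=")
        if sep and name:
            values[name] = value
    return values


# Hypothetical file belonging to a table partitioned by year and month.
example = "s3://my-bucket/elb_logs/year=2022/month=03/part-0000.parquet"
print(partition_values(example))
```

In this layout the `year` and `month` values exist only in the directory names, which is why those columns should not appear in the data file.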
If the Glue table’s data is in Avro format, and an Avro reader schema is given by Glue for that table, the Glue connector
will not use Glue’s list of column types; instead it will pass the reader schema to the data connector and the data
connector is expected to infer the column types from that.
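The choice of where column types come from can be summarised as a small decision function. This Python sketch is only a restatement of the documented behaviour under stated assumptions: the function and its return labels are hypothetical, and an explicit `use_glue_columns` setting is assumed to take precedence over the default.

```python
from typing import Optional


# Illustrative only: hypothetical function and labels, not connector code.
def column_source(data_format: str,
                  avro_reader_schema: Optional[str] = None,
                  use_glue_columns: Optional[bool] = None) -> str:
    """Return "glue" or "data_connector" as the source of column types."""
    if use_glue_columns is None:
        # Documented default: true, except for Avro tables that
        # provide a reader schema.
        has_reader_schema = (data_format.lower() == "avro"
                             and bool(avro_reader_schema))
        use_glue_columns = not has_reader_schema
    return "glue" if use_glue_columns else "data_connector"
```

With no explicit setting, a Parquet table resolves to `"glue"`, while an Avro table with a reader schema resolves to `"data_connector"`.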
As with all metadata connectors, in order to create an external table from the Glue connector, you only need privileges on
the Glue connector itself, not on the data connector (e.g. the HDFS or Parquet connector) to which it delegates the table.