LogoLogo
  • Welcome to e6data
  • Introduction to e6data
    • Concepts
    • Architecture
      • e6data in VPC Deployment Model
      • Connect to e6data serverless compute
  • Get Started
  • Sign Up
  • Setup
    • AWS Setup
      • In VPC Deployment (AWS)
        • Prerequisite Infrastructure
        • Infrastructure & Permissions for e6data
        • Setup Kubernetes Components
        • Setup using Terraform in AWS
          • Update a AWS Terraform for your Workspace
        • AWS PrivateLink and e6data
        • VPC Peering | e6data on AWS
      • Connect to e6data serverless compute (AWS)
        • Workspace Creation
        • Catalog Creation
          • Glue Metastore
          • Hive Metastore
          • Unity Catalog
        • Cluster Creation
    • GCP Setup
      • In VPC Deployment (GCP)
        • Prerequisite Infrastructure
        • Infrastructure & Permissions for e6data
        • Setup Kubernetes Components
        • Setup using Terraform in GCP
        • Update a GCP Terraform for your Workspace
      • Connect to e6data serverless compute (GCP)
    • Azure Setup
      • Prerequisite Infrastructure
      • Infrastructure & Permissions for e6data
      • Setup Kubernetes Components
      • Setup using Terraform in AZURE
        • Update a AZURE Terraform for your Workspace
  • Workspaces
    • Create Workspaces
    • Enable/Disable Workspaces
    • Update a Workspace
    • Delete a Workspace
  • Catalogs
    • Create Catalogs
      • Hive Metastore
        • Connect to a Hive Metastore
        • Edit a Hive Metastore Connection
        • Delete a Hive Metastore Connection
      • Glue Metastore
        • Connect to a Glue Metastore
        • Edit a Glue Metastore Connection
        • Delete a Glue Metastore Connection
      • Unity Catalog
        • Connect to Unity Catalog
        • Edit Unity Catalog
        • Delete Unity Catalog
      • Cross-account Catalog Access
        • Configure Cross-account Catalog to Access AWS Hive Metastore
        • Configure Cross-account Catalog to Access Unity Catalog
        • Configure Cross-account Catalog to Access AWS Glue
        • Configure Cross-account Catalog to Access GCP Hive Metastore
    • Manage Catalogs
    • Privileges
      • Access Control
      • Column Masking
      • Row Filter
  • Clusters
    • Edit & Delete Clusters
    • Suspend & Resume Clusters
    • Cluster Size
    • Load Based Sizing
    • Auto Suspension
    • Query Timeout
    • Monitoring
    • Connection Info
  • Pools
    • Delete Pools
  • Query Editor
    • Editor Pane
    • Results Pane
    • Schema Explorer
    • Data Preview
  • Notebook
    • Editor Pane
    • Results Pane
    • Schema Explorer
    • Data Preview
  • Query History
    • Query Count API
  • Connectivity
    • IP Sets
    • Endpoints
    • Cloud Resources
    • Network Firewall
  • Access Control
    • Users
    • Groups
    • Roles
      • Permissions
      • Policies
    • Single Sign-On (SSO)
      • AWS SSO
      • Okta
      • Microsoft My Apps-SSO
      • Icons for IdP
    • Service Accounts
    • Multi-Factor Authentication (Beta)
  • Usage and Cost Management
  • Audit Log
  • User Settings
    • Profile
    • Personal Access Tokens (PAT)
  • Advanced Features
    • Cross-Catalog & Cross-Schema Querying
  • Supported Data Types
  • SQL Command Reference
    • Query Syntax
      • General functions
    • Aggregate Functions
    • Mathematical Functions & Operators
      • Arithematic Operators
      • Rounding and Truncation Functions
      • Exponential and Root Functions
      • Trigonometric Functions
      • Logarithmic Functions
    • String Functions
    • Date-Time Functions
      • Constant Functions
      • Conversion Functions
      • Date Truncate Function
      • Addition and Subtraction Functions
      • Extraction Functions
      • Format Functions
      • Timezone Functions
    • Conditional Expressions
    • Conversion Functions
    • Window Functions
    • Comparison Operators & Functions
    • Logical Operators
    • Statistical Functions
    • Bitwise Functions
    • Array Functions
    • Regular Expression Functions
    • Generate Functions
    • Cardinality Estimation Functions
    • JSON Functions
    • Checksum Functions
    • Unload Function (Copy into)
    • Struct Functions
  • Equivalent Functions & Operators
  • Connectors & Drivers
    • DBeaver
    • DbVisualiser
    • Apache Superset
    • Jupyter Notebook
    • Tableau Cloud
    • Tableau Desktop
    • Power BI
      • Setting up Power BI on-premises Gateway
    • Metabase
    • Zeppelin
    • Python Connector
      • Performance and Integration Guide
      • Code Samples
    • JDBC Driver
      • Code Samples
      • API Support
    • Configure Cluster Ingress
      • ALB Ingress in Kubernetes
      • GCE Ingress in Kubernetes
      • Ingress-Nginx in Kubernetes
  • Security & Trust
    • Best Practices
      • AWS Best Practices
    • Features & Responsibilities Matrix
    • Data Protection Addendum(DPA)
  • Tutorials and Best Practices
    • How to configure HIVE metastore if you don't have one?
    • How-To Videos
  • Known Limitations
    • SQL Limitations
    • Other Limitations
    • Restart Triggers
    • Cloud Provider Limitations
  • Error Codes
    • General Errors
    • User Account Errors
    • Workspace Errors
    • Catalog Errors
    • Cluster Errors
    • Data Governance Errors
    • Query History Errors
    • Query Editor Errors
    • Pool Errors
    • Connectivity Errors
  • Terms & Condition
  • Privacy Policy
    • Cookie Policy
  • FAQs
    • Workspace Setup
    • Security
    • Catalog Privileges
  • Services Utilised for e6data Deployment
    • AWS supported regions
    • GCP supported regions
    • AZURE supported regions
  • Release Notes & Updates
    • 6th Sept 2024
    • 6th June 2024
    • 18th April 2024
    • 9th April 2024
    • 30th March 2024
    • 16th March 2024
    • 14th March 2024
    • 12th March 2024
    • 2nd March 2024
    • 10th February 2024
    • 3rd February 2024
    • 17th January 2024
    • 9th January 2024
    • 3rd January 2024
    • 18th December 2023
    • 12th December 2023
    • 9th December 2023
    • 4th December 2023
    • 27th November 2023
    • 8th September 2023
    • 4th September 2023
    • 26th August 2023
    • 21st August 2023
    • 19th July 2023
    • 23rd May 2023
    • 5th May 2023
    • 28th April 2023
    • 19th April 2023
    • 15th April 2023
    • 10th April 2023
    • 30th March 2023
Powered by GitBook
On this page
  • Overview
  • System Features:
  • Components and Interfaces:
  • Execution Details and Timings
  • gRPC Options Configuration
  • Internal Working of the Connector
  • Using pyformat for Parameterized Queries
  • Error Handling and Recovery
  • Performance Testing
  • Leveraging Python 3.12 or Later for Improved Performance
  1. Connectors & Drivers
  2. Python Connector

Performance and Integration Guide

PreviousPython ConnectorNextCode Samples

Last updated 1 day ago

Overview

The e6Data Python Connector is a DB-API 2.0 interface for interacting with e6Data clusters via gRPC. It supports secure connections, query execution, metadata retrieval, and robust error handling, making it easy to integrate into Python applications.

This document covers key features, components, execution lifecycle, gRPC configuration, query parameterization, error handling, performance testing, and best practices for using Python 3.12 or later.

System Features:

  • Database Connection: Connect to e6Data clusters securely using SSL, with configurable connection parameters.

  • Query Execution: Support for prepared and parameterized queries using pyformat style.

  • Result Handling: Fetch results via fetchone(), fetchmany(), fetchall(), and memory-efficient fetchall_buffer().

  • Metadata Access: Retrieve schemas, tables, and columns dynamically; support explain_analyse for query diagnostics.

  • Error Handling & Authentication: Automatic retries on authentication failure and connection errors.

  • Testing: Comprehensive unit and integration tests ensure reliability.

Components and Interfaces:

Connection Object:

Represents a live database connection.

Parameter
Type
Description

host

string

IP or hostname of the e6Data cluster

port

int

gRPC service port (e.g., 80, 443)

username

string

User email for authentication

password

string

Personal Access Token from e6Data dashboard

catalog

string

Catalog name for metadata access

database

string

Database name

secure

bool

Use SSL-secured gRPC channel (default False)

grpc_options

dict

Advanced gRPC settings (timeouts, message sizes)

Cursor Object

Used to execute queries and fetch results. Tied to a Connection.

Method
Description

execute(sql, params)

Execute a SQL statement with parameters

fetchone()

Fetch a single row

fetchmany(size=N)

Fetch N rows

fetchall()

Fetch all rows (small datasets)

fetchall_buffer()

Fetch all rows (optimized for large data)

explain_analyse()

Retrieve runtime query statistics

get_tables()

List tables in current database

get columns()

List columns metadata for a table

clear()

Clear cached results to free memory

Execution Details and Timings

Execution Lifecycle

  1. Initialization: Establishes gRPC channel with optional SSL and connection options.

  2. Authentication: Authenticates using username and Personal Access Token.

  3. Prepare Statement: Queries compiled via prepare Statement API.

  4. Execution: Executes prepared queries, collecting optional metrics (parsing, queuing time).

  5. Fetching: Results streamed in batches for optimized memory use.

Best Practices to Minimize Execution Time

  • Enable gRPC query caching.

  • Limit result sets with SQL LIMIT clauses.

  • Use connection pooling to reduce overhead.

gRPC Options Configuration

The grpc_options dictionary allows tuning of gRPC behaviors for performance and stability.

Option
Default
Description

grpc_prepare_timeout

600000 ms (10 min)

Max wait time for query preparation

keepalive_timeout_ms

900000 ms

Idle time before sending keepalive ping

keepalive_time_ms

30000 ms (30 sec)

Interval between keepalive pings

keepalive_permit_without_calls

1

Allow keepalive even without active RPC calls

max_receive_message_length

-1 (unlimited)

Max size of incoming messages

max_send_message_length

300 MB

Max size of outgoing messages

http2.max_pings_without_data

0 (unlimited)

Max pings sent without data

http2.min_time_between_pings_ms

15000 ms (15 sec)

Minimum interval between pings

http2.min_ping_interval_without_data_ms

15000 ms (15 sec)

Interval between pings without data exchange

Example Use Cases

  • Large Queries:

grpc_options = {
    'max_receive_message_length': 500 * 1024 * 1024,
    'max_send_message_length': 500 * 1024 * 1024,
}
 

Connection Stability:

grpc_options = {
    'keepalive_time_ms': 10000,
    'keepalive_timeout_ms': 20000,
    'keepalive_permit_without_calls': 1,
}

Best Practices

  • Start with defaults unless tuning is required.

  • Monitor network and resource usage to adjust settings.

  • Configure message size limits cautiously for large datasets.

Internal Working of the Connector

During connection setup, gRPC channels are created with options passed in to configure behavior.

self._channel = grpc.insecure_channel(
    target=f"{self._host}:{self._port}",
    options=self._get_grpc_options
)
 

For SSL-secured connections:

self._channel = grpc.secure_channel(
    target=f"{self._host}:{self._port}",
    options=self._get_grpc_options,
    credentials=grpc.ssl_channel_credentials()
)

User-defined grpc_options override defaults during channel creation.

Using pyformat for Parameterized Queries

pyformat is a PEP-249 compliant parameter style using named placeholders (%(param)s) to write secure, readable SQL queries.

Example

sql = "SELECT * FROM users WHERE age > %(age)s"
params = {"age": 25}
cursor.execute(sql, params)

Benefits

  • Prevents SQL injection by safely escaping parameters.

  • Enhances query readability.

  • Allows flexible, dynamic parameter passing.

Common Errors and Fixes

KeyError: 'placeholder_name'

Missing parameter key in dictionary

Ensure all placeholders have matching keys

TypeError: %d format

Parameter type mismatch

Match parameter types to placeholders

Programming Error: syntax error

Incorrect SQL or placeholder usage

Verify query syntax and placeholder format

Best Practices

  • Always use parameterized queries.

  • Validate parameter types and inputs.

  • Log queries cautiously (avoid sensitive data).

  • Handle execution errors with try-except blocks.

Error Handling and Recovery

Robust error handling with retry logic on common gRPC errors:

Authentication

Invalid/expired credentials

Retry authentication up to 5 times

Connection

Timeout or server unavailability

Graceful retries on _InactiveRpcError

Performance Testing

Workflow

  • Validate connection using environment variables.

  • Execute various queries to measure latency and correctness.

  • Test with caching enabled/disabled.

  • Use explain_analyse to verify query plans.

Leveraging Python 3.12 or Later for Improved Performance

Key Benefits

  • Faster startup and module import times.

  • Improved iteration speed and memory management.

  • Enhanced garbage collection for short-lived objects.

  • Support for the perf module for profiling.

  • Better error tracing for debugging.

  • Typing enhancements for cleaner code.

Best Practices

  • Use built-in functions for better efficiency (len(), max(), min()).

  • Enable adaptive specialization via python3.12 -X adaptive.

  • Utilize async improvements for gRPC async calls.

  • Upgrade Python environment with virtual env and reinstall dependencies.

Transition Recommendations

python3.12 -m venv venv
source venv/bin/activate
pip install --upgrade .

Performance Testing Example

import time

start = time.perf_counter()
data = cursor.fetchall()
end = time.perf_counter()
print(f"Fetch Time: {end - start} seconds")

PEP-249 compliant