Performance and Integration Guide

Overview

The e6Data Python Connector is a PEP-249 compliant DB-API 2.0 interface for interacting with e6Data clusters via gRPC. It supports secure connections, query execution, metadata retrieval, and robust error handling, making it easy to integrate into Python applications.

This document covers key features, components, execution lifecycle, gRPC configuration, query parameterization, error handling, performance testing, and best practices for using Python 3.12 or later.

System Features:

  • Database Connection: Connect to e6Data clusters securely using SSL, with configurable connection parameters.

  • Query Execution: Support for prepared and parameterized queries using pyformat style.

  • Result Handling: Fetch results via fetchone(), fetchmany(), fetchall(), and memory-efficient fetchall_buffer().

  • Metadata Access: Retrieve schemas, tables, and columns dynamically; support explain_analyse for query diagnostics.

  • Error Handling & Authentication: Automatic retries on authentication failure and connection errors.

  • Testing: Comprehensive unit and integration tests ensure reliability.

Components and Interfaces:

Connection Object:

Represents a live database connection.

| Parameter | Type | Description |
|---|---|---|
| host | string | IP or hostname of the e6Data cluster |
| port | int | gRPC service port (e.g., 80, 443) |
| username | string | User email for authentication |
| password | string | Personal Access Token from the e6Data dashboard |
| catalog | string | Catalog name for metadata access |
| database | string | Database name |
| secure | bool | Use an SSL-secured gRPC channel (default False) |
| grpc_options | dict | Advanced gRPC settings (timeouts, message sizes) |
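As a minimal sketch, the parameters above map onto a keyword dictionary. Every value below is an illustrative placeholder, not a real endpoint or credential:

```python
# Illustrative connection parameters; every value here is a placeholder.
conn_params = {
    "host": "cluster.example.e6data.com",   # hypothetical hostname
    "port": 443,
    "username": "user@example.com",
    "password": "<personal-access-token>",  # from the e6Data dashboard
    "catalog": "my_catalog",
    "database": "my_database",
    "secure": True,                          # SSL-secured gRPC channel
    "grpc_options": {"keepalive_time_ms": 30000},
}

# With the connector installed and a live cluster available, these
# parameters would typically be passed to the Connection constructor:
# from e6data_python_connector import Connection
# conn = Connection(**conn_params)
```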

Cursor Object:

Used to execute queries and fetch results. Tied to a Connection.

| Method | Description |
|---|---|
| execute(sql, params) | Execute a SQL statement with parameters |
| fetchone() | Fetch a single row |
| fetchmany(size=N) | Fetch N rows |
| fetchall() | Fetch all rows (small datasets) |
| fetchall_buffer() | Fetch all rows (optimized for large datasets) |
| explain_analyse() | Retrieve runtime query statistics |
| get_tables() | List tables in the current database |
| get_columns() | List column metadata for a table |
| clear() | Clear cached results to free memory |
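Because the fetch methods follow standard DB-API 2.0 semantics, the batching pattern below works against any compliant cursor. The standard library's sqlite3 stands in for an e6Data cursor purely so the sketch is runnable:

```python
import sqlite3

# sqlite3 stands in for an e6Data connection here; the fetchmany()
# batching pattern is identical for any DB-API 2.0 cursor.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
cur.executemany("INSERT INTO users VALUES (?, ?)",
                [(i, 20 + i) for i in range(10)])

cur.execute("SELECT id FROM users WHERE age > 24")
rows = []
while True:
    batch = cur.fetchmany(3)   # stream results in small batches
    if not batch:
        break
    rows.extend(batch)

print(len(rows))  # prints 5
```

For large e6Data result sets, the same loop shape keeps peak memory bounded by the batch size rather than the full result set.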

Execution Details and Timings

Execution Lifecycle

  1. Initialization: Establishes gRPC channel with optional SSL and connection options.

  2. Authentication: Authenticates using username and Personal Access Token.

  3. Prepare Statement: Queries are compiled via the prepare statement API.

  4. Execution: Executes prepared queries, collecting optional metrics (parsing, queuing time).

  5. Fetching: Results streamed in batches for optimized memory use.

Best Practices to Minimize Execution Time

  • Enable gRPC query caching.

  • Limit result sets with SQL LIMIT clauses.

  • Use connection pooling to reduce overhead.
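Connection pooling is not built into the connector; a minimal pool can be sketched with the standard library, assuming a user-supplied factory that creates connections:

```python
import queue

class SimplePool:
    """Minimal pool sketch: pre-creates N connections via a factory and
    hands them out; callers must release connections when done."""

    def __init__(self, factory, size=4):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=None):
        # Blocks until a connection is free (or timeout elapses).
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

# Usage with a dummy factory; in practice the factory would create
# e6Data connections instead of plain objects.
pool = SimplePool(factory=lambda: object(), size=2)
conn = pool.acquire()
pool.release(conn)
```

A production pool would also need health checks and re-authentication on stale connections; this sketch only shows the reuse pattern that avoids per-query channel setup.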

gRPC Options Configuration

The grpc_options dictionary allows tuning of gRPC behaviors for performance and stability.

| Option | Default | Description |
|---|---|---|
| grpc_prepare_timeout | 600000 ms (10 min) | Maximum wait time for query preparation |
| keepalive_timeout_ms | 900000 ms | Time to wait for a keepalive ping acknowledgement before closing the connection |
| keepalive_time_ms | 30000 ms (30 sec) | Interval between keepalive pings |
| keepalive_permit_without_calls | 1 | Allow keepalive pings even without active RPC calls |
| max_receive_message_length | -1 (unlimited) | Maximum size of incoming messages |
| max_send_message_length | 300 MB | Maximum size of outgoing messages |
| http2.max_pings_without_data | 0 (unlimited) | Maximum pings sent without data |
| http2.min_time_between_pings_ms | 15000 ms (15 sec) | Minimum interval between pings |
| http2.min_ping_interval_without_data_ms | 15000 ms (15 sec) | Minimum interval between pings when no data is exchanged |

Example Use Cases

  • Large Queries:

grpc_options = {
    'max_receive_message_length': 500 * 1024 * 1024,
    'max_send_message_length': 500 * 1024 * 1024,
}

  • Connection Stability:

grpc_options = {
    'keepalive_time_ms': 10000,
    'keepalive_timeout_ms': 20000,
    'keepalive_permit_without_calls': 1,
}

Best Practices

  • Start with defaults unless tuning is required.

  • Monitor network and resource usage to adjust settings.

  • Configure message size limits cautiously for large datasets.

Internal Working of the Connector

During connection setup, gRPC channels are created with options passed in to configure behavior.

self._channel = grpc.insecure_channel(
    target=f"{self._host}:{self._port}",
    options=self._get_grpc_options
)

For SSL-secured connections:

self._channel = grpc.secure_channel(
    target=f"{self._host}:{self._port}",
    options=self._get_grpc_options,
    credentials=grpc.ssl_channel_credentials()
)

User-defined grpc_options override defaults during channel creation.
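The override behavior can be sketched as a dictionary merge. The default values below follow the options table above; the "grpc." prefix follows gRPC's channel-argument naming convention, and whether the connector maps names exactly this way is an assumption:

```python
# Defaults follow the gRPC options table above (illustrative subset).
DEFAULT_GRPC_OPTIONS = {
    "keepalive_time_ms": 30000,
    "keepalive_timeout_ms": 900000,
    "keepalive_permit_without_calls": 1,
    "max_receive_message_length": -1,
}

def build_channel_options(user_options=None):
    """Merge user-supplied grpc_options over the defaults and convert
    them to the (name, value) tuples a gRPC channel accepts."""
    merged = {**DEFAULT_GRPC_OPTIONS, **(user_options or {})}
    return [(f"grpc.{name}", value) for name, value in merged.items()]

# User value wins; untouched defaults pass through unchanged.
options = build_channel_options({"keepalive_time_ms": 10000})
```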

Using pyformat for Parameterized Queries

pyformat is a PEP-249 compliant parameter style using named placeholders (%(param)s) to write secure, readable SQL queries.

Example

sql = "SELECT * FROM users WHERE age > %(age)s"
params = {"age": 25}
cursor.execute(sql, params)

Benefits

  • Prevents SQL injection by safely escaping parameters.

  • Enhances query readability.

  • Allows flexible, dynamic parameter passing.

Common Errors and Fixes

| Error | Cause | Fix |
|---|---|---|
| KeyError: 'placeholder_name' | Missing parameter key in the dictionary | Ensure every placeholder has a matching key |
| TypeError: %d format | Parameter type mismatch | Match parameter types to placeholders |
| ProgrammingError: syntax error | Incorrect SQL or placeholder usage | Verify query syntax and placeholder format |
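The KeyError case can be caught before execution with a small validation helper. The regex below matches pyformat placeholders; the helper is an illustration, not part of the connector's API:

```python
import re

# Matches pyformat placeholders such as %(age)s.
PYFORMAT_PLACEHOLDER = re.compile(r"%\((\w+)\)s")

def validate_params(sql, params):
    """Raise early if any %(name)s placeholder lacks a matching key."""
    missing = set(PYFORMAT_PLACEHOLDER.findall(sql)) - set(params)
    if missing:
        raise KeyError(f"Missing parameter(s): {sorted(missing)}")

validate_params("SELECT * FROM users WHERE age > %(age)s", {"age": 25})  # ok
try:
    validate_params("SELECT * FROM users WHERE age > %(age)s", {})
except KeyError as exc:
    print(exc)  # reports the missing 'age' key
```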

Best Practices

  • Always use parameterized queries.

  • Validate parameter types and inputs.

  • Log queries cautiously (avoid sensitive data).

  • Handle execution errors with try-except blocks.
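The try-except practice above can be sketched as a small wrapper; the logging categories are illustrative, and sqlite3 again stands in for an e6Data cursor so the sketch is runnable:

```python
import logging
import sqlite3

def safe_execute(cursor, sql, params=None):
    """Execute a query with basic error handling; logs, then re-raises
    so callers can decide how to recover."""
    try:
        cursor.execute(sql, params or {})
        return cursor.fetchall()
    except KeyError:
        logging.exception("Missing query parameter")
        raise
    except Exception:
        logging.exception("Query execution failed")
        raise

# Demonstration with a stdlib sqlite3 cursor standing in for an
# e6Data cursor.
demo = sqlite3.connect(":memory:").cursor()
result = safe_execute(demo, "SELECT 1")
```

Re-raising keeps the wrapper transparent: callers still see the original exception, while the log captures context for debugging.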

Error Handling and Recovery

Robust error handling with retry logic on common gRPC errors:

| Error Type | Cause | Recovery |
|---|---|---|
| Authentication | Invalid or expired credentials | Retry authentication up to 5 times |
| Connection | Timeout or server unavailability | Graceful retries on _InactiveRpcError |
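A retry loop of this shape can be sketched generically. The attempt count of 5 follows the table above; the exponential backoff schedule and the exception type used in the demo are assumptions for illustration:

```python
import time

def with_retries(fn, attempts=5, base_delay=0.5, retry_on=(Exception,)):
    """Call fn(), retrying up to `attempts` times with exponential
    backoff on the given exception types."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # retries exhausted; surface the last error
            time.sleep(base_delay * (2 ** attempt))

# In the connector, retry_on would target gRPC's _InactiveRpcError;
# here a flaky stand-in fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky, base_delay=0.01)
print(result)  # prints ok
```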

Performance Testing

Workflow

  • Validate connection using environment variables.

  • Execute various queries to measure latency and correctness.

  • Test with caching enabled/disabled.

  • Use explain_analyse to verify query plans.
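The environment-variable step above can be sketched with a small loader; the variable names are assumptions chosen for illustration, not ones the connector defines:

```python
import os

# Hypothetical variable names; adapt to your own environment.
REQUIRED = ["E6DATA_HOST", "E6DATA_PORT", "E6DATA_USERNAME", "E6DATA_PASSWORD"]

def load_config(env=None):
    """Validate that connection settings are present and normalize
    them into a keyword dictionary."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {name.removeprefix("E6DATA_").lower(): env[name]
            for name in REQUIRED}

cfg = load_config({"E6DATA_HOST": "h", "E6DATA_PORT": "443",
                   "E6DATA_USERNAME": "u", "E6DATA_PASSWORD": "p"})
```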

Leveraging Python 3.12 or Later for Improved Performance

Key Benefits

  • Faster startup and module import times.

  • Improved iteration speed and memory management.

  • Enhanced garbage collection for short-lived objects.

  • Support for the Linux perf profiler (python -X perf) for profiling.

  • Better error tracing for debugging.

  • Typing enhancements for cleaner code.

Best Practices

  • Use built-in functions for better efficiency (len(), max(), min()).

  • Benefit from adaptive specialization (PEP 659), enabled automatically in CPython 3.11 and later.

  • Utilize async improvements for gRPC async calls.

  • Upgrade the Python environment using a virtual environment and reinstall dependencies.

Transition Recommendations

python3.12 -m venv venv
source venv/bin/activate
pip install --upgrade .

Performance Testing Example

import time

start = time.perf_counter()
data = cursor.fetchall()
end = time.perf_counter()
print(f"Fetch Time: {end - start} seconds")
