Performance and Integration Guide
Overview
The e6Data Python Connector is a PEP-249 compliant DB-API 2.0 interface for interacting with e6Data clusters via gRPC. It supports secure connections, query execution, metadata retrieval, and robust error handling, making it easy to integrate into Python applications.
This document covers key features, components, execution lifecycle, gRPC configuration, query parameterization, error handling, performance testing, and best practices for using Python 3.12 or later.
System Features:
Database Connection: Connect to e6Data clusters securely using SSL, with configurable connection parameters.
Query Execution: Support for prepared and parameterized queries using pyformat style.
Result Handling: Fetch results via fetchone(), fetchmany(), fetchall(), and memory-efficient fetchall_buffer().
Metadata Access: Retrieve schemas, tables, and columns dynamically; support explain_analyse for query diagnostics.
Error Handling & Authentication: Automatic retries on authentication failure and connection errors.
Testing: Comprehensive unit and integration tests ensure reliability.
Components and Interfaces:
Connection Object:
Represents a live database connection.
host
string
IP or hostname of the e6Data cluster
port
int
gRPC service port (e.g., 80, 443)
username
string
User email for authentication
password
string
Personal Access Token from e6Data dashboard
catalog
string
Catalog name for metadata access
database
string
Database name
secure
bool
Use SSL-secured gRPC channel (default False)
grpc_options
dict
Advanced gRPC settings (timeouts, message sizes)
Cursor Object
Used to execute queries and fetch results. Tied to a Connection.
execute(sql, params)
Execute a SQL statement with parameters
fetchone()
Fetch a single row
fetchmany(size=N)
Fetch N rows
fetchall()
Fetch all rows (small datasets)
fetchall_buffer()
Fetch all rows (optimized for large data)
explain_analyse()
Retrieve runtime query statistics
get_tables()
List tables in current database
get columns()
List columns metadata for a table
clear()
Clear cached results to free memory
Execution Details and Timings
Execution Lifecycle
Initialization: Establishes gRPC channel with optional SSL and connection options.
Authentication: Authenticates using username and Personal Access Token.
Prepare Statement: Queries compiled via prepare Statement API.
Execution: Executes prepared queries, collecting optional metrics (parsing, queuing time).
Fetching: Results streamed in batches for optimized memory use.
Best Practices to Minimize Execution Time
Enable gRPC query caching.
Limit result sets with SQL LIMIT clauses.
Use connection pooling to reduce overhead.
gRPC Options Configuration
The grpc_options dictionary allows tuning of gRPC behaviors for performance and stability.
grpc_prepare_timeout
600000 ms (10 min)
Max wait time for query preparation
keepalive_timeout_ms
900000 ms
Idle time before sending keepalive ping
keepalive_time_ms
30000 ms (30 sec)
Interval between keepalive pings
keepalive_permit_without_calls
1
Allow keepalive even without active RPC calls
max_receive_message_length
-1 (unlimited)
Max size of incoming messages
max_send_message_length
300 MB
Max size of outgoing messages
http2.max_pings_without_data
0 (unlimited)
Max pings sent without data
http2.min_time_between_pings_ms
15000 ms (15 sec)
Minimum interval between pings
http2.min_ping_interval_without_data_ms
15000 ms (15 sec)
Interval between pings without data exchange
Example Use Cases
Large Queries:
grpc_options = {
'max_receive_message_length': 500 * 1024 * 1024,
'max_send_message_length': 500 * 1024 * 1024,
}
Connection Stability:
grpc_options = {
'keepalive_time_ms': 10000,
'keepalive_timeout_ms': 20000,
'keepalive_permit_without_calls': 1,
}
Best Practices
Start with defaults unless tuning is required.
Monitor network and resource usage to adjust settings.
Configure message size limits cautiously for large datasets.
Internal Working of the Connector
During connection setup, gRPC channels are created with options passed in to configure behavior.
self._channel = grpc.insecure_channel(
target=f"{self._host}:{self._port}",
options=self._get_grpc_options
)
For SSL-secured connections:
self._channel = grpc.secure_channel(
target=f"{self._host}:{self._port}",
options=self._get_grpc_options,
credentials=grpc.ssl_channel_credentials()
)
User-defined grpc_options override defaults during channel creation.
Using pyformat for Parameterized Queries
pyformat is a PEP-249 compliant parameter style using named placeholders (%(param)s) to write secure, readable SQL queries.
Example
sql = "SELECT * FROM users WHERE age > %(age)s"
params = {"age": 25}
cursor.execute(sql, params)
Benefits
Prevents SQL injection by safely escaping parameters.
Enhances query readability.
Allows flexible, dynamic parameter passing.
Common Errors and Fixes
KeyError: 'placeholder_name'
Missing parameter key in dictionary
Ensure all placeholders have matching keys
TypeError: %d format
Parameter type mismatch
Match parameter types to placeholders
Programming Error: syntax error
Incorrect SQL or placeholder usage
Verify query syntax and placeholder format
Best Practices
Always use parameterized queries.
Validate parameter types and inputs.
Log queries cautiously (avoid sensitive data).
Handle execution errors with try-except blocks.
Error Handling and Recovery
Robust error handling with retry logic on common gRPC errors:
Authentication
Invalid/expired credentials
Retry authentication up to 5 times
Connection
Timeout or server unavailability
Graceful retries on _InactiveRpcError
Performance Testing
Workflow
Validate connection using environment variables.
Execute various queries to measure latency and correctness.
Test with caching enabled/disabled.
Use explain_analyse to verify query plans.
Leveraging Python 3.12 or Later for Improved Performance
Key Benefits
Faster startup and module import times.
Improved iteration speed and memory management.
Enhanced garbage collection for short-lived objects.
Support for the perf module for profiling.
Better error tracing for debugging.
Typing enhancements for cleaner code.
Best Practices
Use built-in functions for better efficiency (len(), max(), min()).
Enable adaptive specialization via python3.12 -X adaptive.
Utilize async improvements for gRPC async calls.
Upgrade Python environment with virtual env and reinstall dependencies.
Transition Recommendations
python3.12 -m venv venv
source venv/bin/activate
pip install --upgrade .
Performance Testing Example
import time
start = time.perf_counter()
data = cursor.fetchall()
end = time.perf_counter()
print(f"Fetch Time: {end - start} seconds")
Last updated