PySpark Compatibility Layer

Overview

e6-spark-compat is a drop-in compatibility library that lets you run existing PySpark and Apache Sedona code on e6data. Update your import statements, configure the e6data connection, and your Spark code works as-is — no rewrites needed.

DataFrame operations are lazily evaluated. Transformations build a query plan tree, and when an action (collect, show, count) is called, the plan is translated into optimized SQL using SQLGlot and executed on e6data.
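The mechanism above can be illustrated with a self-contained sketch. This is a simplified model of lazy evaluation, not the library's internals: transformations only append nodes to a plan, and the "action" is what compiles the plan into a SQL string.

```python
# Simplified illustration of lazy evaluation (not e6-spark-compat internals):
# transformations build up a plan; the action turns the plan into SQL.
class PlanFrame:
    def __init__(self, table, plan=None):
        self.table = table
        self.plan = plan or []

    def filter(self, cond):
        # Transformation: nothing is executed, the plan just grows.
        return PlanFrame(self.table, self.plan + [("WHERE", cond)])

    def select(self, *cols):
        # Transformation: records the projection, still no execution.
        return PlanFrame(self.table, self.plan + [("SELECT", ", ".join(cols))])

    def to_sql(self):
        # Action: walk the accumulated plan and emit a SQL statement.
        projection = next((v for k, v in self.plan if k == "SELECT"), "*")
        predicates = [v for k, v in self.plan if k == "WHERE"]
        sql = f"SELECT {projection} FROM {self.table}"
        if predicates:
            sql += " WHERE " + " AND ".join(predicates)
        return sql

df = PlanFrame("orders").filter("amount > 100").select("id", "amount")
print(df.to_sql())  # SELECT id, amount FROM orders WHERE amount > 100
```

In the real library the plan is translated through SQLGlot rather than by string concatenation, but the shape is the same: no query leaves the client until an action forces the plan to be compiled and run.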

Key Capabilities

  • Full PySpark DataFrame API — select, filter, join, groupBy, orderBy, union, pivot, and more

  • 130+ SQL functions — string, math, aggregate, date/time, window, conditional

  • Window functions with complete Window specification API

  • 70+ Apache Sedona-compatible spatial functions (ST_*)

  • File format support — Parquet, ORC, CSV, JSON, GeoParquet, Delta

  • Read and write operations
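As a concrete taste of that API surface, the fragment below is ordinary PySpark code touching several of the capabilities listed above (filter, select, window functions). It assumes a session and an `orders` DataFrame already exist; under the compatibility layer, only the import statements would change.

```
# Standard PySpark code; `orders` is assumed to be an existing DataFrame.
from pyspark.sql import functions as F, Window

w = Window.partitionBy("region").orderBy(F.col("amount").desc())

top_orders = (orders
    .filter(F.col("status") == "shipped")
    .withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") <= 3)
    .select("region", "id", "amount"))

top_orders.show()  # action: the plan is translated to SQL and run on e6data
```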

Installation

# Install from PyPI
pip install e6data-spark-compatibility

# With spatial support
pip install e6data-spark-compatibility[spatial]

# Install from GitHub
pip install git+https://github.com/e6data/e6-spark-compat.git

Prerequisites

  • An active e6data workspace and cluster

  • A Personal Access Token from the e6data console (User Settings > Personal Access Tokens)

  • Python 3.8+

Quick Example
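A hypothetical end-to-end sketch is shown below. The import path, builder calls, and configuration keys are placeholders for illustration only, not the library's documented API; consult the package documentation for the actual connection parameters.

```
# Hypothetical sketch -- import path and config keys are placeholders,
# not the documented e6-spark-compat API.
from e6_spark_compat import SparkSession  # placeholder import path

spark = (SparkSession.builder
         .config("host", "<cluster-host>")          # from your e6data console
         .config("token", "<personal-access-token>")
         .getOrCreate())

df = spark.read.parquet("s3://bucket/path/")
df.filter(df.amount > 100).groupBy("region").count().show()
```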
