# PySpark Compatibility

## PySpark Compatibility Layer

### Overview

e6-spark-compat is a drop-in compatibility library that lets you run existing PySpark and Apache Sedona code on e6data. Update your import statements and configure the e6data connection, and your existing Spark code runs without rewrites.

DataFrame operations are lazily evaluated: transformations only build a query plan tree and run nothing. When an action (such as `collect`, `show`, or `count`) is called, the plan is translated into optimized SQL using SQLGlot and executed on e6data.
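The lazy model described above can be illustrated with a toy sketch in plain Python. It is not the library's implementation: `LazyFrame`, `to_sql`, and the `execute` callback are hypothetical names, the plan is flattened to a single `SELECT` instead of a tree, and the SQL is built with string formatting rather than SQLGlot. It only shows that transformations accumulate a plan and that SQL is produced and run when an action is called.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class LazyFrame:
    """Toy lazily evaluated frame (hypothetical; not the e6-spark-compat class)."""
    table: str
    columns: Tuple[str, ...] = ("*",)
    predicates: Tuple[str, ...] = ()
    row_limit: Optional[int] = None

    # Transformations: each returns a new plan and executes nothing.
    def select(self, *cols: str) -> "LazyFrame":
        return replace(self, columns=cols)

    def filter(self, condition: str) -> "LazyFrame":
        return replace(self, predicates=self.predicates + (condition,))

    def limit(self, n: int) -> "LazyFrame":
        return replace(self, row_limit=n)

    def to_sql(self) -> str:
        """Render the accumulated plan as one SQL statement."""
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        if self.row_limit is not None:
            sql += f" LIMIT {self.row_limit}"
        return sql

    # Action: only here is SQL generated and handed to the engine.
    def collect(self, execute: Callable[[str], List[tuple]]) -> List[tuple]:
        return execute(self.to_sql())


if __name__ == "__main__":
    plan = LazyFrame("users").filter("age > 21").select("name", "city").limit(10)
    # `print` stands in for a real query executor in this sketch.
    plan.collect(lambda sql: print(sql) or [])
```

Because every transformation returns a new immutable plan, intermediate frames can be reused and extended without triggering any work until an action runs.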

### Key Capabilities

* Full PySpark DataFrame API — `select`, `filter`, `join`, `groupBy`, `orderBy`, `union`, `pivot`, and more
* 130+ SQL functions — string, math, aggregate, date/time, window, conditional
* Window functions with complete `Window` specification API
* 70+ Apache Sedona-compatible spatial functions (`ST_*`)
* File format support — Parquet, ORC, CSV, JSON, GeoParquet, Delta
* Read and write operations

### Installation

```bash
# Install from PyPI
pip install e6data-spark-compatibility

# With spatial support (quoted so shells such as zsh do not expand the brackets)
pip install "e6data-spark-compatibility[spatial]"

# Install from GitHub
pip install git+https://github.com/e6data/e6-spark-compat.git
```

### Prerequisites

* An active e6data workspace and cluster
* A Personal Access Token from the e6data console (User Settings > Personal Access Tokens)
* Python 3.8+

### Quick Example

```python
from e6_spark_compat import SparkSession
# Note: importing `sum` here shadows Python's built-in sum() in this module
from e6_spark_compat.sql.functions import col, upper, count, sum

spark = (SparkSession.builder
    .appName("MyApp")
    .config("spark.e6data.host", "<cluster-host>")
    .config("spark.e6data.username", "<username>")
    .config("spark.e6data.password", "<access-token>")
    .config("spark.e6data.database", "<database>")
    .config("spark.e6data.catalog", "<catalog>")
    .config("spark.e6data.cluster", "<cluster-name>")
    .config("spark.e6data.secure", True)
    .getOrCreate())

df = spark.read.parquet("s3://bucket/path/to/data.parquet")

result = (df.filter(col("age") > 21)
    .select("name", upper(col("city")).alias("city_upper"), "salary")
    .groupBy("city_upper")
    .agg(count("*").alias("total"), sum("salary").alias("total_salary"))
    .orderBy(col("total").desc()))

result.show()
```
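To keep the access token out of source code, the connection settings can be read from environment variables. The following is a minimal sketch under stated assumptions: the `spark.e6data.*` keys are the ones used in the example above, but the environment variable names and the helper `e6data_config_from_env` are hypothetical and not part of the library.

```python
import os
from typing import Dict, Mapping

# Hypothetical environment variable names mapped to the config keys
# used in the Quick Example.
ENV_TO_CONFIG = {
    "E6DATA_HOST": "spark.e6data.host",
    "E6DATA_USERNAME": "spark.e6data.username",
    "E6DATA_PASSWORD": "spark.e6data.password",
    "E6DATA_DATABASE": "spark.e6data.database",
    "E6DATA_CATALOG": "spark.e6data.catalog",
    "E6DATA_CLUSTER": "spark.e6data.cluster",
}


def e6data_config_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the connection config from environment variables, failing early if any is unset."""
    missing = [name for name in ENV_TO_CONFIG if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {ENV_TO_CONFIG[name]: env[name] for name in ENV_TO_CONFIG}


# Usage with the builder from the Quick Example (left as comments so the
# sketch runs without a cluster):
#
# builder = SparkSession.builder.appName("MyApp")
# for key, value in e6data_config_from_env().items():
#     builder = builder.config(key, value)
# spark = builder.config("spark.e6data.secure", True).getOrCreate()
```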

