# PySpark Compatibility

## PySpark Compatibility Layer

### Overview

e6-spark-compat is a drop-in compatibility library that lets you run existing PySpark and Apache Sedona code on e6data. Update your import statements and configure the e6data connection; your Spark code then works as-is, with no rewrites needed.
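
In practice, migrating a PySpark script starts with an import swap; the `e6_spark_compat` paths below match the Quick Example later on this page:

```python
# Before (plain PySpark):
# from pyspark.sql import SparkSession
# from pyspark.sql.functions import col, upper

# After (e6-spark-compat):
from e6_spark_compat import SparkSession
from e6_spark_compat.sql.functions import col, upper
```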

DataFrame operations are lazily evaluated. Transformations build a query plan tree, and when an action (`collect`, `show`, `count`) is called, the plan is translated into optimized SQL using SQLGlot and executed on e6data.
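
As a minimal sketch of that flow (assuming a configured `spark` session as in the Quick Example below; the S3 path is hypothetical):

```python
from e6_spark_compat.sql.functions import col

# Transformations return immediately and only extend the logical plan;
# nothing is sent to e6data yet.
adults = spark.read.parquet("s3://bucket/people.parquet").filter(col("age") > 21)

# The action triggers SQL generation (via SQLGlot) and execution on e6data.
adults.count()
```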

### Key Capabilities

* Full PySpark DataFrame API — `select`, `filter`, `join`, `groupBy`, `orderBy`, `union`, `pivot`, and more
* 130+ SQL functions — string, math, aggregate, date/time, window, conditional
* Window functions with complete `Window` specification API (see the sketch after this list)
* 70+ Apache Sedona-compatible spatial functions (ST\_\*)
* File format support — Parquet, ORC, CSV, JSON, GeoParquet, Delta
* Read and write operations
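
To illustrate the `Window` specification API, here is a hedged sketch. The `Window` import path is an assumption that mirrors PySpark's `pyspark.sql.Window`, the dataset is hypothetical, and `spark` is a session configured as in the Quick Example below:

```python
from e6_spark_compat.sql import Window  # assumed path, mirroring pyspark.sql.Window
from e6_spark_compat.sql.functions import col, row_number

df = spark.read.parquet("s3://bucket/employees.parquet")  # hypothetical dataset

# Rank rows by salary within each city, highest salary first.
w = Window.partitionBy("city").orderBy(col("salary").desc())

# Keep the top three earners per city.
top3 = df.withColumn("rank", row_number().over(w)).filter(col("rank") <= 3)
top3.show()
```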

### Installation

```bash
# Install from PyPI
pip install e6data-spark-compatibility

# With spatial support
pip install e6data-spark-compatibility[spatial]

# Install from GitHub
pip install git+https://github.com/e6data/e6-spark-compat.git
```
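
With the `[spatial]` extra installed, Sedona-style code should carry over. The sketch below is illustrative only: the `st_functions` import path, dataset, and column names are assumptions, modeled on Apache Sedona's `ST_Point` and `ST_Distance` function names:

```python
from e6_spark_compat.sql.functions import col
from e6_spark_compat.sql.st_functions import ST_Point, ST_Distance  # assumed path

trips = spark.read.parquet("s3://bucket/trips.parquet")  # hypothetical dataset

# Straight-line distance between pickup and dropoff points, Sedona-style.
with_dist = trips.withColumn(
    "dist",
    ST_Distance(
        ST_Point(col("pickup_lon"), col("pickup_lat")),
        ST_Point(col("dropoff_lon"), col("dropoff_lat")),
    ),
)
with_dist.show()
```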

### Prerequisites

* An active e6data workspace and cluster
* A Personal Access Token from the e6data console (User Settings > Personal Access Tokens)
* Python 3.8+

### Quick Example

```python
from e6_spark_compat import SparkSession
from e6_spark_compat.sql.functions import col, upper, count, sum  # note: `sum` shadows Python's builtin

# Configure the connection to your e6data cluster.
spark = (SparkSession.builder
    .appName("MyApp")
    .config("spark.e6data.host", "<cluster-host>")
    .config("spark.e6data.username", "<username>")
    .config("spark.e6data.password", "<access-token>")
    .config("spark.e6data.database", "<database>")
    .config("spark.e6data.catalog", "<catalog>")
    .config("spark.e6data.cluster", "<cluster-name>")
    .config("spark.e6data.secure", True)
    .getOrCreate())

# Lazy: defines the source without reading any data yet.
df = spark.read.parquet("s3://bucket/path/to/data.parquet")

# Transformations only build the query plan.
result = (df.filter(col("age") > 21)
    .select("name", upper(col("city")).alias("city_upper"), "salary")
    .groupBy("city_upper")
    .agg(count("*").alias("total"), sum("salary").alias("total_salary"))
    .orderBy(col("total").desc()))

# Action: translates the plan to SQL and executes it on e6data.
result.show()
```
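
Writing results back out follows the same pattern; a minimal sketch, assuming PySpark's `DataFrameWriter` API carries over unchanged and with a hypothetical output path:

```python
# Persist the aggregated result to object storage as Parquet.
result.write.mode("overwrite").parquet("s3://bucket/path/to/output/")
```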
