# Getting started

Migrate your PySpark application to e6data in three steps:

1. **Update imports**
2. **Configure the connection**
3. **Run your code**

### Step 1: Update Imports

Replace your PySpark imports with their e6-spark-compat equivalents. The API surface is identical; only the import path changes.

{% tabs %}
{% tab title="Before (PySpark)" %}

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper, count, sum, row_number
from pyspark.sql.window import Window
```

{% endtab %}

{% tab title="After (e6data)" %}

```python
from e6_spark_compat import SparkSession
from e6_spark_compat.sql.functions import col, upper, count, sum, row_number
from e6_spark_compat.sql.window import Window
```

{% endtab %}
{% endtabs %}

For spatial (Sedona) operations:

{% tabs %}
{% tab title="Before (Sedona)" %}

```python
from sedona.register import SedonaRegistrator
```

{% endtab %}

{% tab title="After (e6data)" %}

```python
from e6_spark_compat.sedona import SedonaRegistrator
```

{% endtab %}
{% endtabs %}

### Step 2: Configure the Connection

Create a `SparkSession` pointing to your e6data cluster.

```python
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.e6data.host", "<cluster-host>") \
    .config("spark.e6data.username", "<username>") \
    .config("spark.e6data.password", "<access-token>") \
    .config("spark.e6data.database", "<database>") \
    .config("spark.e6data.catalog", "<catalog>") \
    .config("spark.e6data.cluster", "<cluster-name>") \
    .config("spark.e6data.secure", True) \
    .getOrCreate()
```

#### Configuration Parameters

| Parameter               | Description                                                     | Required |
| ----------------------- | --------------------------------------------------------------- | -------- |
| `spark.e6data.host`     | Cluster hostname or IP address                                  | Yes      |
| `spark.e6data.username` | e6data account email                                            | Yes      |
| `spark.e6data.password` | Personal Access Token from the e6data console                   | Yes      |
| `spark.e6data.database` | Target database name                                            | Yes      |
| `spark.e6data.catalog`  | Catalog name                                                    | Yes      |
| `spark.e6data.cluster`  | Cluster name                                                    | Yes      |
| `spark.e6data.secure`   | Use TLS for the connection (`True` or `False`). Default: `True` | No       |

{% hint style="info" %}
You can find your cluster hostname and connection details in the e6data console under **Clusters > Connection Info**.
{% endhint %}

{% hint style="warning" %}
Do not hardcode your access token in source code. Use environment variables or a secrets manager instead.
{% endhint %}

#### Using Environment Variables

```python
import os

spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.e6data.host", os.getenv("E6DATA_HOST")) \
    .config("spark.e6data.username", os.getenv("E6DATA_USERNAME")) \
    .config("spark.e6data.password", os.getenv("E6DATA_TOKEN")) \
    .config("spark.e6data.database", os.getenv("E6DATA_DATABASE")) \
    .config("spark.e6data.catalog", os.getenv("E6DATA_CATALOG")) \
    .config("spark.e6data.cluster", os.getenv("E6DATA_CLUSTER")) \
    .config("spark.e6data.secure", True) \
    .getOrCreate()
```
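Note that `os.getenv` returns `None` for an unset variable, which would only surface later as a confusing connection error. A small fail-fast check before building the session can catch this early (the helper name and variable list below are illustrative, not part of e6-spark-compat):

```python
import os

REQUIRED_VARS = [
    "E6DATA_HOST",
    "E6DATA_USERNAME",
    "E6DATA_TOKEN",
    "E6DATA_DATABASE",
    "E6DATA_CATALOG",
    "E6DATA_CLUSTER",
]

def require_env(names):
    """Return a dict of env values, raising if any are unset or empty."""
    missing = [n for n in names if not os.getenv(n)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {n: os.environ[n] for n in names}
```

Call `require_env(REQUIRED_VARS)` before `SparkSession.builder` so a misconfigured environment fails with a clear message instead of a failed connection.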

### Step 3: Run Your Code

Your existing PySpark logic works without modification.

```python
# Read data
df = spark.read.parquet("s3://bucket/path/to/data.parquet")

# Transformations (lazy — no execution yet)
result = df.filter(col("age") > 21) \
    .select("name", "city", "salary") \
    .groupBy("city") \
    .agg(count("*").alias("total"), sum("salary").alias("total_salary")) \
    .orderBy(col("total").desc())

# Action triggers execution on e6data
result.show()
```
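The `Window` and `row_number` imports from Step 1 work the same way. A minimal sketch that ranks rows within each city (column names are illustrative, and running it requires a live session and the `df` from above):

```python
# Partition by city, highest salaries first
w = Window.partitionBy("city").orderBy(col("salary").desc())

# Keep the top 3 earners per city
top_earners = df.withColumn("rank", row_number().over(w)) \
    .filter(col("rank") <= 3)

top_earners.show()
```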

### Catalog Operations

Discover databases, tables, and columns programmatically.

```python
# List databases
spark.catalog.listDatabases()

# List tables in current database
spark.catalog.listTables()

# List columns of a table
spark.catalog.listColumns("my_table")

# Check if a table exists
spark.catalog.tableExists("my_table")
```

### Closing the Session

```python
spark.stop()
```
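If your script can fail midway, a `try`/`finally` guarantees the session is released even when an exception is raised. This is a standard Python pattern, not e6data-specific (the read path is a placeholder):

```python
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

try:
    spark.read.parquet("s3://bucket/path/to/data.parquet").show()
finally:
    # Runs whether the job succeeded or raised
    spark.stop()
```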
