# Getting started

Migrate your PySpark application to e6data in three steps:

1. **Update imports**
2. **Configure the connection**
3. **Run your code**

### Step 1: Update Imports

Replace your PySpark imports with their e6-spark-compat equivalents. The API surface is identical; only the import path changes.

{% tabs %}
{% tab title="Before (PySpark)" %}

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper, count, sum, row_number
from pyspark.sql.window import Window
```

{% endtab %}

{% tab title="After (e6data)" %}

```python
from e6_spark_compat import SparkSession
from e6_spark_compat.sql.functions import col, upper, count, sum, row_number
from e6_spark_compat.sql.window import Window
```

{% endtab %}
{% endtabs %}

For spatial (Sedona) operations:

{% tabs %}
{% tab title="Before (Sedona)" %}

```python
from sedona.register import SedonaRegistrator
```

{% endtab %}

{% tab title="After (e6data)" %}

```python
from e6_spark_compat.sedona import SedonaRegistrator
```

{% endtab %}
{% endtabs %}

### Step 2: Configure the Connection

Create a `SparkSession` pointing to your e6data cluster.

```python
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.e6data.host", "<cluster-host>") \
    .config("spark.e6data.username", "<username>") \
    .config("spark.e6data.password", "<access-token>") \
    .config("spark.e6data.database", "<database>") \
    .config("spark.e6data.catalog", "<catalog>") \
    .config("spark.e6data.cluster", "<cluster-name>") \
    .config("spark.e6data.secure", True) \
    .getOrCreate()
```

#### Configuration Parameters

| Parameter               | Description                                                     | Required |
| ----------------------- | --------------------------------------------------------------- | -------- |
| `spark.e6data.host`     | Cluster hostname or IP address                                  | Yes      |
| `spark.e6data.username` | e6data account email                                            | Yes      |
| `spark.e6data.password` | Personal Access Token from the e6data console                   | Yes      |
| `spark.e6data.database` | Target database name                                            | Yes      |
| `spark.e6data.catalog`  | Catalog name                                                    | Yes      |
| `spark.e6data.cluster`  | Cluster name                                                    | Yes      |
| `spark.e6data.secure`   | Use TLS for the connection (`True` or `False`). Default: `True` | No       |

{% hint style="info" %}
You can find your cluster hostname and connection details in the e6data console under **Clusters > Connection Info**.
{% endhint %}

{% hint style="warning" %}
Do not hardcode your access token in source code. Use environment variables or a secrets manager instead.
{% endhint %}

#### Using Environment Variables

```python
import os

spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.e6data.host", os.getenv("E6DATA_HOST")) \
    .config("spark.e6data.username", os.getenv("E6DATA_USERNAME")) \
    .config("spark.e6data.password", os.getenv("E6DATA_TOKEN")) \
    .config("spark.e6data.database", os.getenv("E6DATA_DATABASE")) \
    .config("spark.e6data.catalog", os.getenv("E6DATA_CATALOG")) \
    .config("spark.e6data.cluster", os.getenv("E6DATA_CLUSTER")) \
    .config("spark.e6data.secure", True) \
    .getOrCreate()
```
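Note that `os.getenv` returns `None` for an unset variable, which would only surface later as a confusing connection error. A small fail-fast check before building the session can catch this early (the helper name and variable list below are illustrative, not part of e6-spark-compat):

```python
import os

REQUIRED_VARS = [
    "E6DATA_HOST",
    "E6DATA_USERNAME",
    "E6DATA_TOKEN",
    "E6DATA_DATABASE",
    "E6DATA_CATALOG",
    "E6DATA_CLUSTER",
]

def require_env(names):
    """Return a dict of env values, raising if any are unset or empty."""
    missing = [n for n in names if not os.getenv(n)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {n: os.environ[n] for n in names}
```

Call `require_env(REQUIRED_VARS)` before `SparkSession.builder` so a misconfigured environment fails with a clear message instead of a failed connection.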

### Step 3: Run Your Code

Your existing PySpark logic works without modification.

```python
# Read data
df = spark.read.parquet("s3://bucket/path/to/data.parquet")

# Transformations (lazy — no execution yet)
result = df.filter(col("age") > 21) \
    .select("name", "city", "salary") \
    .groupBy("city") \
    .agg(count("*").alias("total"), sum("salary").alias("total_salary")) \
    .orderBy(col("total").desc())

# Action triggers execution on e6data
result.show()
```
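The `Window` and `row_number` imports from Step 1 work the same way. A minimal sketch that ranks rows within each city (column names are illustrative, and running it requires a live session and the `df` from above):

```python
# Partition by city, highest salaries first
w = Window.partitionBy("city").orderBy(col("salary").desc())

# Keep the top 3 earners per city
top_earners = df.withColumn("rank", row_number().over(w)) \
    .filter(col("rank") <= 3)

top_earners.show()
```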

### Catalog Operations

Discover databases, tables, and columns programmatically.

```python
# List databases
spark.catalog.listDatabases()

# List tables in current database
spark.catalog.listTables()

# List columns of a table
spark.catalog.listColumns("my_table")

# Check if a table exists
spark.catalog.tableExists("my_table")
```

### Closing the Session

```python
spark.stop()
```
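If your script can fail midway, a `try`/`finally` guarantees the session is released even when an exception is raised. This is a standard Python pattern, not e6data-specific (the read path is a placeholder):

```python
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

try:
    spark.read.parquet("s3://bucket/path/to/data.parquet").show()
finally:
    # Runs whether the job succeeded or raised
    spark.stop()
```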
