DataFrame Operations

Transformations

Transformations are lazy: each call extends a query plan without executing it. Execution is triggered only when an action is called.
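
As a rough mental model of lazy evaluation, the toy class below only records transformations and hands a SQL string to an engine when an action runs. `LazyQuery`, `fake_engine`, and the `users` table are hypothetical names invented for this sketch; it is not how e6data implements its planner.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class LazyQuery:
    """Toy stand-in for a lazy DataFrame: it records a plan, nothing more."""

    table: str
    columns: Tuple[str, ...] = ("*",)
    predicates: Tuple[str, ...] = ()

    def select(self, *cols: str) -> "LazyQuery":
        # Transformation: returns a new plan; nothing is executed.
        return replace(self, columns=cols)

    def filter(self, condition: str) -> "LazyQuery":
        # Transformation: appends one predicate to the plan.
        return replace(self, predicates=self.predicates + (condition,))

    def to_sql(self) -> str:
        # Render the recorded plan as a single SQL statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        return sql

    def collect(self, execute: Callable[[str], list]) -> list:
        # Action: only here is the query handed to the engine.
        return execute(self.to_sql())


executed: List[str] = []


def fake_engine(sql: str) -> list:
    # Hypothetical engine that just logs the SQL it receives.
    executed.append(sql)
    return []


plan = LazyQuery("users").filter("age > 21").select("name", "city")
# At this point `executed` is still empty: only a plan exists.
rows = plan.collect(fake_engine)
# Now `executed` holds exactly one SQL statement.
```

The point of the sketch is the split between methods that return a new plan and the one method that runs it, which mirrors the transformation/action split described above.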

Selection and Filtering

# Assumes `col` is already imported (in PySpark: from pyspark.sql.functions import col)
# Select columns
df.select("name", "age", "city")
df.select(col("name"), col("age") + 1)

# Select with SQL expressions
df.selectExpr("name", "age + 1 as age_plus_one", "UPPER(city) as city_upper")

# Filter rows
df.filter(col("age") > 21)
df.where(col("status") == "active")
df.filter("age > 21 AND status = 'active'")

Joins

# Inner join
df1.join(df2, df1["id"] == df2["id"], "inner")

# Left outer join
df1.join(df2, "id", "left")

# Cross join
df1.crossJoin(df2)

Supported join types: inner, left, right, full, cross, left_semi, left_anti.
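
The `left_semi` and `left_anti` types listed above are not shown in the snippets, so here is a minimal sketch of how they might be used together. The function name and the shared `id` column are assumptions made for illustration.

```python
def split_by_orders(customers, orders):
    """Return (customers with at least one order, customers with none).

    Assumes both DataFrames share an `id` column (hypothetical schema).
    """
    # left_semi keeps left rows that have a match; no columns from `orders` are added.
    with_orders = customers.join(orders, "id", "left_semi")
    # left_anti keeps left rows that have no match at all.
    without_orders = customers.join(orders, "id", "left_anti")
    return with_orders, without_orders
```

Both results are still lazy DataFrames; nothing runs until an action such as `count()` or `show()` is called on them.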

Grouping and Aggregation
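
A minimal sketch of grouping, assuming the grouped object returned by `groupBy` follows PySpark's conventions (`count()` and the dict form of `agg()`). Only `groupBy(*cols)` is confirmed by this page; the `city` and `age` columns are hypothetical.

```python
def summarize_by_city(df):
    """Per-city row count and per-city average age.

    Assumes `df` has `city` and `age` columns (hypothetical schema).
    """
    # groupBy is a transformation; the aggregate is attached afterwards.
    counts = df.groupBy("city").count()
    # Dict form: {column: aggregate function name}, as in PySpark.
    avg_age = df.groupBy("city").agg({"age": "avg"})
    return counts, avg_age
```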

Sorting
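
A short sketch of sorting combined with `limit(n)`. It assumes column objects expose PySpark-style `desc()` and `asc()` methods, which this page does not confirm; the `age` and `name` columns are hypothetical.

```python
def oldest(df, n=10):
    """Return the n oldest rows, ties broken alphabetically by name.

    Assumes `df` has `age` and `name` columns (hypothetical schema).
    """
    # Sort by age descending, then name ascending, and keep the first n rows.
    return df.orderBy(df["age"].desc(), df["name"].asc()).limit(n)
```

Because `orderBy` and `limit` are both transformations, the sort is only performed once an action is called on the result.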

Set Operations
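
The sketch below combines the set operations from the transformations table to compare two snapshots of the same table. The function name and dictionary keys are illustrative, and both inputs are assumed to share a schema.

```python
def diff_snapshots(old, new):
    """Compare two DataFrames that share a schema (assumption)."""
    return {
        # Rows present in `new` but not in `old`.
        "added": new.exceptAll(old),
        # Rows present in `old` but not in `new`.
        "removed": old.exceptAll(new),
        # Rows present in both snapshots.
        "unchanged": old.intersect(new),
        # union keeps duplicates, so distinct() removes the overlap.
        "all_rows": old.union(new).distinct(),
    }
```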

Column Operations
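
A minimal sketch chaining the three column methods from the transformations table. The column names (`age`, `city`, `status`) are hypothetical, and the arithmetic on `df["age"]` assumes column objects support `+` as they do in the selection examples above.

```python
def tidy_columns(df):
    """Add a derived column, rename one column, and drop another.

    Assumes `df` has `age`, `city`, and `status` columns (hypothetical schema).
    """
    return (
        df.withColumn("age_plus_one", df["age"] + 1)  # add a derived column
        .withColumnRenamed("city", "home_city")  # rename city -> home_city
        .drop("status")  # remove a column that is no longer needed
    )
```

Each call returns a new DataFrame, so the original `df` is left unchanged.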

All Transformations

Method                         Description
select(*cols)                  Select columns / expressions
selectExpr(*exprs)             Select using SQL expressions
filter(condition) / where()    Filter rows
join(other, on, how)           Join DataFrames
crossJoin(other)               Cross join
groupBy(*cols)                 Group for aggregation
orderBy(*cols) / sort()        Sort rows
limit(n)                       Limit to first n rows
distinct()                     Remove duplicate rows
union(other)                   Union (all)
unionByName(other)             Union matching by name
intersect(other)               Set intersection
exceptAll(other)               Set difference
withColumn(name, col)          Add / replace a column
withColumnRenamed(old, new)    Rename a column
drop(*cols)                    Drop columns
cache() / persist()            Cache hint (pass-through)
coalesce(n)                    Repartition hint

Actions

Actions trigger query execution on e6data and return results.

Method             Description
collect()          Return all rows as a list of Rows
count()            Return the total row count
show(n)            Print the first n rows (default 20)
first()            Return the first row
head(n)            Return the first n rows
take(n)            Return the first n rows as a list
toPandas()         Convert results to a Pandas DataFrame
explain()          Print the generated SQL query
describe(*cols)    Compute summary statistics

Examples
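
The helper below is a small sketch that strings several of the methods from the actions table together. The function name is illustrative; the method names and arguments come from the table.

```python
def preview(df, n=5):
    """Inspect a DataFrame: show its SQL, print a sample, return count and rows."""
    df.explain()  # print the generated SQL query
    df.show(n)  # print the first n rows
    total = df.count()  # total row count
    sample = df.take(n)  # first n rows as a list
    return total, sample
```

Since actions trigger query execution, a helper like this presumably issues more than one query, so it is better suited to exploration than to hot paths.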

Temporary Views

Register a DataFrame as a temporary view to query it with SQL.
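
A minimal sketch, assuming the PySpark method names `createOrReplaceTempView` and `spark.sql` are available here; this page does not name them, so treat both as assumptions. The view name `people` and its columns are hypothetical.

```python
def adults_via_sql(spark, df):
    """Register `df` as a view, then query it with SQL.

    Assumes `df` has `name` and `age` columns (hypothetical schema).
    """
    # Assumed PySpark-style call; the view name is illustrative.
    df.createOrReplaceTempView("people")
    # The result is a DataFrame, so further transformations can be chained.
    return spark.sql("SELECT name, age FROM people WHERE age > 21")
```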

Read Operations

Use spark.read to load data from various file formats.

Supported Read Formats

Format        Method
Parquet       spark.read.parquet(path)
ORC           spark.read.orc(path)
CSV           spark.read.csv(path)
JSON          spark.read.json(path)
GeoParquet    spark.read.format("geoparquet").load(path)
GeoJSON       spark.read.format("geojson").load(path)
Delta         spark.read.format("delta").load(path)
Text          spark.read.text(path)
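
The table above mixes two call styles: shortcut methods and the generic `format(...).load(path)` form. The dispatcher below is a sketch of one way to hide that difference; `read_by_format` and the two constant names are hypothetical helpers, not part of the API.

```python
# Formats with a dedicated shortcut method on spark.read.
SHORTCUT_FORMATS = {"parquet", "orc", "csv", "json", "text"}
# Formats that go through spark.read.format(name).load(path).
GENERIC_FORMATS = {"geoparquet", "geojson", "delta"}


def read_by_format(spark, fmt, path):
    """Load `path` using the reader call listed for `fmt` in the table."""
    fmt = fmt.lower()
    if fmt in SHORTCUT_FORMATS:
        # e.g. spark.read.parquet(path)
        return getattr(spark.read, fmt)(path)
    if fmt in GENERIC_FORMATS:
        # e.g. spark.read.format("delta").load(path)
        return spark.read.format(fmt).load(path)
    raise ValueError(f"unsupported read format: {fmt!r}")
```

A call such as `read_by_format(spark, "delta", path)` would then return a lazy DataFrame like any other read.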

Write Operations

Use df.write to save results.

Write Modes

Mode         Behavior
error        Throw an error if data already exists (default)
append       Append to existing data
overwrite    Overwrite existing data
ignore       Silently skip if data already exists
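
A minimal sketch of choosing a write mode, assuming `df.write` follows PySpark's writer interface (`mode(...)` followed by a format method such as `parquet(path)`). This page only confirms `df.write` and the four mode names, so the chained calls and the `save_parquet` helper are assumptions.

```python
# The four modes listed in the table above.
WRITE_MODES = ("error", "append", "overwrite", "ignore")


def save_parquet(df, path, mode="error"):
    """Write `df` to `path` as Parquet using one of the documented modes."""
    if mode not in WRITE_MODES:
        raise ValueError(f"unknown write mode: {mode!r}")
    # Assumed PySpark-style chain: set the mode, then pick the output format.
    df.write.mode(mode).parquet(path)
```

Validating the mode up front keeps a typo from surfacing later as an engine-side error.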
