LLM Mapping Demo

The map_column_with_llm helper (introduced in spark-fuse 0.2.0) normalizes free-form column values by asking an LLM to map each one onto a fixed list of target values. Caching and batching limit the number of API calls.

from spark_fuse.spark import create_session
from spark_fuse.utils.llm import map_column_with_llm

spark = create_session(app_name="spark-fuse-llm-demo")

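# Deliberately messy input: a plural, a clean value, and a misspelling.
df = spark.createDataFrame(
    [
        {"fruit": "apples"},
        {"fruit": "Banana"},
        {"fruit": "Cerry"},
    ]
)

# Canonical values that every input should be mapped onto.
standard = ["Apple", "Banana", "Cherry"]
mapped = map_column_with_llm(
    df,
    column="fruit",
    target_values=standard,
    model="o4-mini",
    temperature=None,
    dry_run=False,  # False sends real requests to the LLM
)

# Compare the raw values with the mapped column side by side.
mapped.select("fruit", "fruit_mapped").show()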

Key features:

  • Executor-side caching to avoid repeated API calls.
  • Configurable batching and retry logic.
  • Optional dry runs to gauge match rates before sending traffic.
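To make the caching, batching, and retry ideas above concrete, here is a minimal pure-Python sketch. It is not spark-fuse's implementation: map_values, fake_llm, and every parameter name (batch_size, max_retries, backoff_seconds, cache) are hypothetical, and fake_llm uses difflib fuzzy matching as a stand-in for a real model call so that the example runs offline.

```python
import difflib
import time
from typing import Callable, Dict, List, Optional


def fake_llm(batch: List[str], targets: List[str]) -> List[Optional[str]]:
    """Hypothetical stand-in for an LLM call: fuzzy-match each value to a target."""
    lowered = {t.lower(): t for t in targets}
    results: List[Optional[str]] = []
    for value in batch:
        hits = difflib.get_close_matches(value.lower(), list(lowered), n=1, cutoff=0.6)
        results.append(lowered[hits[0]] if hits else None)
    return results


def map_values(
    values: List[str],
    targets: List[str],
    llm: Callable[[List[str], List[str]], List[Optional[str]]] = fake_llm,
    batch_size: int = 2,
    max_retries: int = 3,
    backoff_seconds: float = 0.0,
    cache: Optional[Dict[str, Optional[str]]] = None,
) -> List[Optional[str]]:
    """Map each value to a target, calling `llm` once per batch of unseen values."""
    cache = {} if cache is None else cache
    # Caching: deduplicate the input and skip anything already answered.
    pending = [v for v in dict.fromkeys(values) if v not in cache]
    # Batching: send several values per request instead of one at a time.
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        # Retry: re-send a failed batch with exponential backoff.
        for attempt in range(1, max_retries + 1):
            try:
                answers = llm(batch, targets)
                break
            except Exception:
                if attempt == max_retries:
                    raise
                time.sleep(backoff_seconds * 2 ** (attempt - 1))
        cache.update(zip(batch, answers))
    return [cache[v] for v in values]


print(map_values(["apples", "Banana", "Cerry", "apples"], ["Apple", "Banana", "Cherry"]))
```

With the default batch_size of 2, the four inputs above contain three distinct values and therefore trigger two calls. Passing the same cache dictionary to a later call reuses the earlier answers without calling the model again.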

Notebook walkthrough

Explore the end-to-end workflow (including configuration tips) in the notebook: