spark-fuse

spark-fuse is an open-source toolkit for PySpark that provides utilities, connectors, and a CLI to fuse your data workflows across Azure Storage (ADLS Gen2), Databricks, Microsoft Fabric Lakehouses (via OneLake/Delta), Unity Catalog, Hive Metastore, and REST APIs.
Features
- Connectors for ADLS Gen2 (`abfss://`), Fabric OneLake (`onelake://` or `abfss://...onelake.dfs.fabric.microsoft.com/...`), Databricks DBFS and catalog tables, and REST APIs (JSON).
- Unity Catalog and Hive Metastore helpers to create catalogs/schemas and register external Delta tables.
- SparkSession helpers with sensible defaults and environment detection (Databricks/Fabric/local).
- DataFrame utilities for previews, name management, casts, whitespace cleanup, resilient date parsing, and LLM-backed semantic column mapping (see the sketch after this list).
- Typer-powered CLI: list connectors, preview datasets, register tables, submit Databricks jobs.
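To picture what the cleanup utilities do, here is a plain-PySpark sketch (this is not the spark-fuse API, and it assumes a `spark` session like the one created in the quickstart below): trim stray whitespace, then try several date formats and keep the first that parses.

```python
from pyspark.sql import functions as F

# Raw-PySpark illustration of whitespace cleanup and multi-format date parsing.
raw = spark.createDataFrame(
    [("  Alice ", "2024-01-31"), ("Bob", "31/01/2024")],
    ["name", "signup"],
)
cleaned = (
    raw.withColumn("name", F.trim("name"))
    .withColumn(
        "signup",
        # Non-matching formats yield NULL with ANSI mode off; on Spark 3.5+
        # under ANSI mode, prefer F.try_to_date instead.
        F.coalesce(
            F.to_date("signup", "yyyy-MM-dd"),
            F.to_date("signup", "dd/MM/yyyy"),
        ),
    )
)
cleaned.show()
```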
Quickstart

Create a session with sensible defaults and environment detection:

```python
from spark_fuse.spark import create_session

spark = create_session(app_name="spark-fuse-quickstart")
```

Read a Delta table from ADLS Gen2:

```python
from spark_fuse.io.azure_adls import ADLSGen2Connector

df = ADLSGen2Connector().read(spark, "abfss://container@account.dfs.core.windows.net/path/to/delta")
df.show(5)
```
Read a paginated JSON REST API:

```python
from spark_fuse.io.rest_api import RestAPIReader

reader = RestAPIReader()
pokemon = reader.read(
    spark,
    "https://pokeapi.co/api/v2/pokemon",
    source_config={
        "records_field": "results",
        "pagination": {"mode": "response", "field": "next", "max_pages": 2},
    },
)
pokemon.select("name").show(5)
```
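Here `records_field` points at the array of records inside each JSON payload (`results` in PokeAPI responses), and the pagination block follows the URL in each response's `next` field for at most `max_pages` requests.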
Create Unity Catalog objects and register an external Delta table:

```python
from spark_fuse.catalogs import unity

unity.create_catalog(spark, "analytics")
unity.create_schema(spark, catalog="analytics", schema="core")
unity.register_external_delta_table(
    spark,
    catalog="analytics",
    schema="core",
    table="events",
    location="abfss://container@account.dfs.core.windows.net/path/to/delta",
)
```
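These helpers presumably wrap standard Unity Catalog SQL; a minimal sketch of the equivalent statements (assumption: the exact SQL spark-fuse issues may differ) looks like this:

```python
# Equivalent Unity Catalog SQL, spelled out. Requires a UC-enabled workspace
# with permissions on the external storage location.
spark.sql("CREATE CATALOG IF NOT EXISTS analytics")
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics.core")
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS analytics.core.events
    USING DELTA
    LOCATION 'abfss://container@account.dfs.core.windows.net/path/to/delta'
    """
)
```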
LLM-powered semantic mapping

```python
from spark_fuse.utils.transformations import map_column_with_llm

targets = ["Apple", "Banana", "Cherry"]
normalized = map_column_with_llm(
    df,
    column="fruit",
    target_values=targets,
    model="o4-mini",
    temperature=None,
)
normalized.select("fruit", "fruit_mapped").show()
```
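As the example output suggests, the mapper leaves the original column untouched and appends the result as `<column>_mapped` (here, `fruit_mapped`).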
Azure OpenAI token estimate

For sizing, assume:

- 2,000 rows of ~50 characters each (≈50 tokens per row).
- `target_values` totalling 100 characters (≈25 tokens).
- The `o4-mini` model with `temperature=None`.
Each request sends ≈75 input tokens and returns ≈10 tokens, totalling ≈85 tokens per row.
At 2,000 rows that comes to about 170K tokens. Multiply by your provider's per-token price to estimate cost; adjust partitioning or use a dry run to manage quota.
As a reference, OpenAI currently lists o4-mini input tokens at $0.0006 per 1K tokens and output tokens at $0.0024 per 1K tokens.

- Input: 150K tokens × $0.0006 ≈ $0.09
- Output: 20K tokens × $0.0024 ≈ $0.048
- Approximate total: $0.14 for the batch
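The same arithmetic as a quick sanity check (using the list prices quoted above):

```python
rows = 2_000
input_tokens = rows * (50 + 25)   # row text + target_values ≈ 75 tokens/row -> 150K
output_tokens = rows * 10         # ≈ 10 tokens returned per row -> 20K
cost = input_tokens / 1_000 * 0.0006 + output_tokens / 1_000 * 0.0024
print(f"{input_tokens + output_tokens:,} tokens, ~${cost:.2f}")  # 170,000 tokens, ~$0.14
```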
Azure OpenAI pricing may differ; always confirm with your subscription's rate card before running large workloads.
Use `dry_run=True` during development to avoid external API calls until credentials and prompts are ready. Some models only accept their default sampling configuration; use `temperature=None` to omit the parameter when required. The LLM mapper is available starting in spark-fuse 0.2.0.
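For example (assuming `dry_run` is a keyword argument of `map_column_with_llm`, as the note above implies):

```python
# Development-time call: per the note above, no external API requests are made.
preview = map_column_with_llm(
    df,
    column="fruit",
    target_values=targets,
    model="o4-mini",
    temperature=None,  # omit sampling config for models that require defaults
    dry_run=True,
)
```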
CLI

```bash
spark-fuse connectors
spark-fuse read --path <uri> --show 5
spark-fuse uc-create --catalog <name> --schema <name>
spark-fuse uc-register-table --catalog <c> --schema <s> --table <t> --path <uri>
spark-fuse hive-register-external --database <db> --table <t> --path <uri>
spark-fuse fabric-register --table <t> --path <onelake-or-abfss-on-onelake>
spark-fuse databricks-submit --json job.json
```
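For example, `spark-fuse read --path abfss://container@account.dfs.core.windows.net/path/to/delta --show 5` previews the first rows of the quickstart Delta table.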
See the Install and CLI pages for more.