spark-fuse — Install Guide

Prerequisites
- Python: 3.9+
- Java: JDK 8 or 11 (required by PySpark 3.4.x)
- OS packages: build tools for compiling native dependencies, if needed
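
Before installing, you can sanity-check the interpreter and JDK with a short script along these lines (a minimal sketch; the exact banner printed by java -version varies by JDK vendor):

import shutil
import subprocess
import sys

# spark-fuse requires Python 3.9+.
assert sys.version_info >= (3, 9), f"Python 3.9+ required, found {sys.version}"

# PySpark 3.4.x needs JDK 8 or 11 on PATH (or via JAVA_HOME).
java = shutil.which("java")
assert java is not None, "No java executable found; install JDK 8 or 11"

# Most JDKs print the version banner to stderr, not stdout.
result = subprocess.run([java, "-version"], capture_output=True, text=True)
print(result.stderr.strip() or result.stdout.strip())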

Virtual Environment (recommended)
- macOS/Linux:
  - python3 -m venv .venv
  - source .venv/bin/activate
  - python -m pip install --upgrade pip
- Windows (PowerShell):
  - python -m venv .venv
  - .\.venv\Scripts\Activate.ps1
  - python -m pip install --upgrade pip

Quick Install (PyPI)
- pip install spark-fuse

Development Install (editable)
- python -m pip install --upgrade pip
- pip install -e ".[dev]"

Verify Installation
- python -c "import spark_fuse; print(spark_fuse.__version__)"
- spark-fuse --help

Java Setup Notes
- macOS (Homebrew): brew install openjdk@11, then add to your shell profile:
  - export JAVA_HOME="$(/usr/libexec/java_home -v 11)"
- Linux: install OpenJDK 11 with your distro’s package manager and set JAVA_HOME accordingly.

Optional: Authentication and Environment
- Azure ADLS Gen2 (abfss://): set environment variables for a Service Principal and pass Spark configs as needed (see the sketch after this list).
  - Env vars: AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET
  - Example Spark configs (set via code, as sketched below):
    - fs.azure.account.auth.type.<account>.dfs.core.windows.net=OAuth
    - fs.azure.account.oauth.provider.type.<account>.dfs.core.windows.net=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    - fs.azure.account.oauth2.client.id.<account>.dfs.core.windows.net=$AZURE_CLIENT_ID
    - fs.azure.account.oauth2.client.secret.<account>.dfs.core.windows.net=$AZURE_CLIENT_SECRET
    - fs.azure.account.oauth2.client.endpoint.<account>.dfs.core.windows.net=https://login.microsoftonline.com/$AZURE_TENANT_ID/oauth2/token
- Databricks REST submit:
  - Env vars: DATABRICKS_HOST, DATABRICKS_TOKEN
  - CLI: spark-fuse databricks-submit --json payload.json
- Microsoft Fabric (OneLake): use onelake://... or abfss://...@onelake.dfs.fabric.microsoft.com/... URIs. Access is managed by your Fabric workspace identity.
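
The following sketch applies those configs when building a session. It uses PySpark's stock SparkSession.builder rather than spark-fuse's create_session, reads the Service Principal credentials from the environment variables listed above, and uses a placeholder account name you should replace with your storage account:

import os

from pyspark.sql import SparkSession

account = "account"  # placeholder: your ADLS Gen2 storage account name
suffix = f"{account}.dfs.core.windows.net"

spark = (
    SparkSession.builder.appName("spark-fuse-adls")
    # Authenticate to ADLS Gen2 with a Service Principal (OAuth client credentials).
    .config(f"fs.azure.account.auth.type.{suffix}", "OAuth")
    .config(
        f"fs.azure.account.oauth.provider.type.{suffix}",
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    )
    .config(f"fs.azure.account.oauth2.client.id.{suffix}", os.environ["AZURE_CLIENT_ID"])
    .config(f"fs.azure.account.oauth2.client.secret.{suffix}", os.environ["AZURE_CLIENT_SECRET"])
    .config(
        f"fs.azure.account.oauth2.client.endpoint.{suffix}",
        f"https://login.microsoftonline.com/{os.environ['AZURE_TENANT_ID']}/oauth2/token",
    )
    .getOrCreate()
)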

Minimal Usage Example

from spark_fuse.spark import create_session
from spark_fuse.io.registry import connector_for_path

# Build a SparkSession via spark-fuse.
spark = create_session(app_name="spark-fuse-demo")

# Pick the connector registered for the path's scheme (abfss:// here).
path = "abfss://container@account.dfs.core.windows.net/path/to/delta"
conn = connector_for_path(path)
df = conn.read(spark, path)  # reads Delta by default
df.show(5)

Testing and Linting
- Run tests: pytest
- Lint: ruff check src tests
- Format: ruff format src tests

Publishing (Maintainers)
- Bump the version in pyproject.toml.
- Create a GitHub Release (tag vX.Y.Z).
- The workflow .github/workflows/publish.yml builds and uploads to PyPI using the protected pypi environment (PYPI_API_TOKEN).