spark-fuse — Install Guide
Prerequisites
- Python: 3.9+
- Java: JDK 17 (recommended; PySpark 4.x)
- OS packages: build tools to compile native dependencies, if needed
Virtual Environment (recommended)
- macOS/Linux:
  - python3 -m venv .venv
  - source .venv/bin/activate
  - python -m pip install --upgrade pip
- Windows (PowerShell):
  - python -m venv .venv
  - .\.venv\Scripts\Activate.ps1
  - python -m pip install --upgrade pip
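To confirm that the environment is actually active before installing anything, a quick check from Python can help. This is a minimal sketch using only the standard library; the helper name in_virtualenv is illustrative and not part of spark-fuse.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If the first line prints False, the activation step was probably skipped or run in a different shell.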
Quick Install (PyPI)
- pip install "spark-fuse>=1.0.2"
Development Install (editable)
- python -m pip install --upgrade pip
- pip install -e ".[dev]"
Verify Installation
- python -c "import spark_fuse; print(spark_fuse.__version__)"
- spark-fuse --help
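The version check above can also be scripted, for example in a CI step. The sketch below reads the installed distribution's version through importlib.metadata and compares it with the 1.0.2 minimum from the Quick Install section. The helper names are hypothetical, and the comparison is deliberately simple: it looks only at the leading numeric components and ignores pre-release suffixes.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '1.0.2' into (1, 0, 2), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:
            # A suffix such as "rc1" ends the numeric release segment.
            break
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.0.2") -> bool:
    """Compare numeric release segments only (assumption: no pre-release logic)."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(dist_name: str = "spark-fuse"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("spark-fuse is not installed in this environment")
    else:
        print(f"spark-fuse {found}; meets minimum: {meets_minimum(found)}")
```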
Java Setup Notes
- macOS (Homebrew): brew install openjdk@17, then add to your shell profile:
  - export JAVA_HOME="$(/usr/libexec/java_home -v 17)"
- Linux: install OpenJDK 17 using your distro’s package manager and set JAVA_HOME accordingly. JDK 11 is the minimum, but PySpark 4.x requires Java 17 or newer.
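Before starting Spark, it can be useful to check which Java the environment will pick up. The following is a rough sketch, not part of spark-fuse: it looks for a java executable under JAVA_HOME first, falls back to PATH, and parses the major version from the output of java -version. The function names are illustrative.

```python
import os
import re
import shutil
import subprocess


def parse_java_major(version_output: str):
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy numbering: Java 8 reports itself as "1.8.0_292".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def find_java():
    """Prefer $JAVA_HOME/bin/java, then fall back to the first java on PATH."""
    home = os.environ.get("JAVA_HOME")
    if home:
        candidate = shutil.which("java", path=os.path.join(home, "bin"))
        if candidate:
            return candidate
    return shutil.which("java")


def detect_java_major():
    """Return the major version of the detected Java, or None if unavailable."""
    java = find_java()
    if java is None:
        return None
    # `java -version` writes its report to stderr rather than stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr or result.stdout)


if __name__ == "__main__":
    print("JAVA_HOME:", os.environ.get("JAVA_HOME", "(not set)"))
    print("detected Java major version:", detect_java_major())
```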
Optional: Authentication and Environment
- REST APIs: set headers / request_kwargs in the REST data source options (see build_rest_api_config) for API keys, OAuth tokens, or proxies. Use environment variables to avoid committing secrets.
- SPARQL endpoints: many public services require a descriptive User-Agent header; pass it via the data source config.
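One way to keep secrets out of the repository is to assemble the headers from environment variables at runtime. The sketch below is only an illustration: the variable names API_TOKEN and HTTP_USER_AGENT, the fallback User-Agent string, and the helper build_auth_headers are all assumptions, not spark-fuse conventions.

```python
import os


def build_auth_headers(env=None, token_var="API_TOKEN", agent_var="HTTP_USER_AGENT"):
    """Build an HTTP headers dict from environment variables.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    The variable names are hypothetical placeholders.
    """
    env = os.environ if env is None else env
    headers = {}
    token = env.get(token_var)
    if token:
        # Only add the Authorization header when a token is actually set.
        headers["Authorization"] = f"Bearer {token}"
    # Many public endpoints expect a descriptive User-Agent.
    headers["User-Agent"] = env.get(agent_var) or "spark-fuse-demo (contact: you@example.com)"
    return headers


if __name__ == "__main__":
    # Print only the header names so a token never ends up in logs.
    print(sorted(build_auth_headers()))
```

The resulting dict would then be supplied as the headers option mentioned above; check build_rest_api_config for the exact parameter shape it expects.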
Minimal Usage Example
import json

from spark_fuse.spark import create_session
from spark_fuse.io import (
    REST_API_CONFIG_OPTION,
    REST_API_FORMAT,
    build_rest_api_config,
    register_rest_data_source,
)

# Create a Spark session and register the REST data source with it.
spark = create_session(app_name="spark-fuse-demo")
register_rest_data_source(spark)

# Records are read from the "results" field of each response; pagination
# follows the "next" field of the response.
config = build_rest_api_config(
    spark,
    "https://pokeapi.co/api/v2/pokemon",
    source_config={"records_field": "results", "pagination": {"mode": "response", "field": "next"}},
)

# The config is passed to the reader as a JSON string option.
df = (
    spark.read.format(REST_API_FORMAT)
    .option(REST_API_CONFIG_OPTION, json.dumps(config))
    .load()
)
df.select("name").show(5)
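To make the source_config in the example above more concrete, here is a plain-Python sketch of what response-driven pagination generally means: read the records from one field, then follow the URL found in another field until it is empty. This is not spark-fuse's implementation. The function iter_records, the max_pages guard, and the in-memory FAKE_PAGES stand-in for HTTP are all illustrative assumptions.

```python
from typing import Callable, Iterator, Optional


def iter_records(
    first_url: str,
    fetch: Callable[[str], dict],
    records_field: str = "results",
    next_field: str = "next",
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield records page by page, following the `next_field` link."""
    url: Optional[str] = first_url
    pages = 0
    while url and pages < max_pages:
        payload = fetch(url)
        # Emit every record found under the configured field.
        yield from payload.get(records_field, [])
        # A missing or null link ends the loop.
        url = payload.get(next_field)
        pages += 1


# Stand-in for real HTTP calls: two fake pages keyed by "URL".
FAKE_PAGES = {
    "page1": {"results": [{"name": "bulbasaur"}, {"name": "ivysaur"}], "next": "page2"},
    "page2": {"results": [{"name": "venusaur"}], "next": None},
}

names = [row["name"] for row in iter_records("page1", FAKE_PAGES.__getitem__)]
print(names)  # ['bulbasaur', 'ivysaur', 'venusaur']
```

The max_pages cap is a defensive choice for the sketch, so that a misbehaving endpoint that always returns a next link cannot loop forever.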
Testing and Linting
- Run tests: pytest
- Lint: ruff check src tests
- Format: ruff format src tests
Publishing (Maintainers)
- Bump the version in pyproject.toml
- Create GitHub Release (tag vX.Y.Z)
- The workflow .github/workflows/publish.yml builds and uploads to PyPI using the protected pypi environment (PYPI_API_TOKEN).