REST API Data Source Demo
The REST API data source ingests paginated JSON endpoints into Spark DataFrames with optional request throttling and retry support.
import json

from spark_fuse.spark import create_session
from spark_fuse.io import (
    REST_API_CONFIG_OPTION,
    REST_API_FORMAT,
    build_rest_api_config,
    register_rest_data_source,
)

spark = create_session(app_name="spark-fuse-rest-demo")
register_rest_data_source(spark)

payload = build_rest_api_config(
    spark,
    "https://pokeapi.co/api/v2/pokemon",
    source_config={
        "request_type": "GET",  # switch to "POST" when the API expects a payload
        "records_field": "results",
        "pagination": {"mode": "response", "field": "next", "max_pages": 2},
    },
)

pokemon = (
    spark.read.format(REST_API_FORMAT)
    .option(REST_API_CONFIG_OPTION, json.dumps(payload))
    .load()
)
pokemon.select("name").show(5)
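To make the "response" pagination mode above more concrete, here is a minimal pure-Python sketch of the general idea: read the records from one page, then follow the URL found in the `next` field until it is empty or a page limit is reached. This is not spark-fuse's implementation. The helper `iter_response_pages`, the `fetch` callable, and the `FAKE_PAGES` data are hypothetical names introduced only for illustration, and the fake pages stand in for real HTTP responses.

```python
from typing import Callable, Iterator, Optional


def iter_response_pages(
    first_url: str,
    fetch: Callable[[str], dict],
    records_field: str = "results",
    next_field: str = "next",
    max_pages: Optional[int] = None,
) -> Iterator[dict]:
    """Yield records page by page, following the URL stored in `next_field`."""
    url: Optional[str] = first_url
    pages_read = 0
    while url and (max_pages is None or pages_read < max_pages):
        body = fetch(url)
        yield from body.get(records_field, [])
        url = body.get(next_field)  # None (or a missing key) ends the loop
        pages_read += 1


# Hypothetical stand-in for an HTTP client: three fake pages keyed by URL.
FAKE_PAGES = {
    "page1": {"results": [{"name": "bulbasaur"}], "next": "page2"},
    "page2": {"results": [{"name": "ivysaur"}], "next": "page3"},
    "page3": {"results": [{"name": "venusaur"}], "next": None},
}

names = [
    record["name"]
    for record in iter_response_pages("page1", FAKE_PAGES.__getitem__, max_pages=2)
]
print(names)  # ['bulbasaur', 'ivysaur']
```

With `max_pages=2` the loop stops after the second page even though a third is available, which mirrors the intent of the `max_pages` setting in the configuration above.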
Token pagination example (e.g., APIs that return paging.next.after):
payload = build_rest_api_config(
    spark,
    "https://api.hubapi.com/marketing/v3/marketing-events/",
    source_config={
        "records_field": "results",
        "params": {"limit": 100},
        "pagination": {"mode": "token", "param": "after", "field": "paging.next.after"},
    },
)
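The token configuration above pairs a dotted response path (`paging.next.after`) with a query parameter (`after`). The sketch below shows one plausible way such a loop can work, assuming the token is read from the nested field and sent back as the query parameter on the next request. It is an illustration rather than the library's code: `get_path`, `iter_token_pages`, and `fake_fetch` are hypothetical helpers, and the fake responses only imitate the shape of a token-paginated API.

```python
from functools import reduce
from typing import Any, Callable, Dict, Iterator, Optional


def get_path(body: dict, dotted: str) -> Optional[Any]:
    """Resolve a dotted path such as 'paging.next.after'; None if any key is missing."""
    def step(node: Any, key: str) -> Any:
        return node.get(key) if isinstance(node, dict) else None
    return reduce(step, dotted.split("."), body)


def iter_token_pages(
    fetch: Callable[[Dict[str, Any]], dict],
    params: Dict[str, Any],
    records_field: str = "results",
    token_param: str = "after",
    token_field: str = "paging.next.after",
) -> Iterator[dict]:
    """Yield records, passing each page's token back as a query parameter."""
    query = dict(params)  # copy so the caller's dict is not mutated
    while True:
        body = fetch(query)
        yield from body.get(records_field, [])
        token = get_path(body, token_field)
        if token is None:  # no token means this was the last page
            break
        query[token_param] = token


def fake_fetch(query: Dict[str, Any]) -> dict:
    """Hypothetical two-page API: the second page carries no paging block."""
    if query.get("after") is None:
        return {"results": [{"id": 1}, {"id": 2}], "paging": {"next": {"after": "2"}}}
    return {"results": [{"id": 3}]}


ids = [record["id"] for record in iter_token_pages(fake_fetch, {"limit": 100})]
print(ids)  # [1, 2, 3]
```

Resolving the path key by key means a missing `paging` block simply yields `None`, so the final page ends the loop without raising a `KeyError`.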
Highlights:
- Cursor-based (response) and token-based (token) pagination.
- Optional request headers and query parameters.
- Issue GET or POST calls by setting request_type, attaching payloads with request_body.
- Built-in retry/backoff controls.
- Optional include_response_payload column to capture the full server JSON per row.
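The retry/backoff highlight refers to a common pattern: retry a failed request a bounded number of times, waiting longer after each failure. The following is a generic sketch of exponential backoff, not the data source's actual retry logic, and the option names the library uses are not shown here. `call_with_retries`, `flaky`, and the delay values are hypothetical, chosen only to illustrate the technique.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying on any exception with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1.0s, 2.0s, ...
    raise AssertionError("unreachable")


# Hypothetical flaky request: fails twice, then succeeds.
attempts = {"count": 0}


def flaky() -> dict:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"results": []}


delays: List[float] = []
result = call_with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # {'results': []} [0.5, 1.0]
```

Injecting `sleep` keeps the example fast to run and easy to test. A production version would typically retry only on transient errors (timeouts, HTTP 429 or 5xx responses) rather than on every exception.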
Notebook walkthrough
See the data source in action with additional configuration examples: