Qdrant Data Source

The spark-fuse-qdrant connector streams points from (and into) Qdrant over the HTTP API. Reads use the Scroll endpoint with optional payload/vector selection, pagination, and retries; writes send points in batches to the Points endpoint with configurable payload extraction.

Reading from Qdrant

Use build_qdrant_config with spark.read.format(QDRANT_FORMAT) after registering the data source. The connector can infer the schema from returned points or consume an explicit schema.

import json
from spark_fuse.io import (
    QDRANT_CONFIG_OPTION,
    QDRANT_FORMAT,
    build_qdrant_config,
    register_qdrant_data_source,
)

register_qdrant_data_source(spark)
config = build_qdrant_config(
    spark,
    endpoint="http://localhost:6333",
    collection="demo",
    with_vectors=True,
    limit=200,
)

df = (
    spark.read.format(QDRANT_FORMAT)
    .option(QDRANT_CONFIG_OPTION, json.dumps(config))
    .load()
)
df.show(5)

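Set infer_schema=False in the config when you want to provide a schema explicitly. In that case, pass the StructType as schema= to build_qdrant_config (it raises ValueError when infer_schema=False and no schema is given), and pass QDRANT_SCHEMA_OPTION with the schema's JSON string, schema.json(), in the DataFrame reader options.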
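As a rough illustration of what the schema option carries: StructType.json() yields a JSON document of the shape below, and the data source parses it back with StructType.fromJson. The sketch builds that document with the standard library only. The helper names (field, struct) and the columns (id, text, vector) are hypothetical and may not match the columns the connector actually emits.

```python
import json
from typing import Any, Dict, List


def field(name: str, data_type: Any, nullable: bool = True) -> Dict[str, Any]:
    """One StructField in the JSON layout Spark uses for schemas."""
    return {"name": name, "type": data_type, "nullable": nullable, "metadata": {}}


def struct(fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """A StructType in the same JSON layout."""
    return {"type": "struct", "fields": fields}


# Hypothetical columns, chosen only for illustration.
schema_doc = struct(
    [
        field("id", "string", nullable=False),
        field("text", "string"),
        field(
            "vector",
            {"type": "array", "elementType": "float", "containsNull": False},
        ),
    ]
)

# A string of this shape is what would be passed as the
# QDRANT_SCHEMA_OPTION value on the DataFrame reader.
schema_json = json.dumps(schema_doc)
print(schema_json)
```

With PySpark available, the same string comes from calling json() on a StructType, which avoids writing the layout by hand.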

Reader options

Option	Type / Default	Description
`endpoint`	string, required	Base HTTP URL for Qdrant (must start with `http://` or `https://`).
`collection`	string, required	Target collection to scroll.
`api_key`	string, optional	Adds `api-key` header when provided.
`headers`	mapping, optional	Extra headers merged into every request.
`timeout`	float, `30.0`	Request timeout in seconds.
`max_retries`	int, `3`	Retry attempts per request.
`backoff_factor`	float, `0.5`	Exponential backoff multiplier between retries.
`with_payload`	bool | str | sequence | mapping, `True`	Controls the `with_payload` scroll flag. `True` includes all payload, `False` drops it, strings/sequences select payload keys, mappings pass through advanced payload selectors.
`with_vectors`	bool | str | sequence, `False`	Controls the `with_vectors` scroll flag. `True` includes all vectors, strings/sequences select named vectors.
`limit`	int, optional	Total point cap; must be positive.
`page_size`	int, `128`	Points per scroll page (clipped to `limit` when set).
`max_pages`	int, optional	Maximum number of pages to request.
`filter`	mapping, optional	Qdrant filter object sent with every scroll request.
`offset`	any, optional	Scroll offset token to start from.
`infer_schema`	bool, `True`	When `False`, an explicit schema is required via `QDRANT_SCHEMA_OPTION`.
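To make max_retries and backoff_factor more concrete, here is a minimal sketch of retrying with exponential backoff. The delay formula (backoff_factor * 2**attempt), the helper names, and the choice to retry only on ConnectionError are assumptions for illustration; the connector's actual schedule and retryable errors are not specified on this page.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int = 3, backoff_factor: float = 0.5) -> List[float]:
    """Delays in seconds before each retry, using factor * 2**attempt (assumed)."""
    return [backoff_factor * (2 ** attempt) for attempt in range(max_retries)]


def call_with_retries(
    request: Callable[[], T],
    *,
    max_retries: int = 3,
    backoff_factor: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `request` once, then retry up to `max_retries` times on ConnectionError."""
    delays = backoff_delays(max_retries, backoff_factor)
    for attempt in range(max_retries + 1):
        try:
            return request()
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])
    # Only reachable when max_retries is negative.
    raise AssertionError("max_retries must be non-negative")
```

Under these assumptions the defaults wait 0.5 s, 1 s, and 2 s between attempts. Passing a custom sleep callable keeps the sketch easy to test without real delays.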

Writing to Qdrant

Build a writer config with build_qdrant_write_config and pass it to df.write.format(QDRANT_FORMAT), or call write_qdrant_points directly for non-Spark workflows.

import json
from spark_fuse.io import (
    QDRANT_CONFIG_OPTION,
    QDRANT_FORMAT,
    build_qdrant_write_config,
    register_qdrant_data_source,
)

register_qdrant_data_source(spark)
write_config = build_qdrant_write_config(
    endpoint="http://localhost:6333",
    collection="demo",
    vector_field="embedding",
    payload_fields=["text", "source"],
)

(
    df.write.format(QDRANT_FORMAT)
    .option(QDRANT_CONFIG_OPTION, json.dumps(write_config))
    .mode("append")
    .save()
)

By default, the writer pulls payload columns from all fields except the vector (and id if used); set payload_fields to restrict which columns become payload.
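The payload rule above can be sketched as a small function that splits one record into id, vector, and payload. This is an illustrative reading of the documented behavior, not the connector's code: record_to_point is a hypothetical name, and silently skipping payload columns that are absent from the record is an assumption.

```python
from typing import Any, Dict, Mapping, Optional, Sequence, Union


def record_to_point(
    record: Mapping[str, Any],
    *,
    id_field: Optional[str] = "id",
    vector_field: str = "vector",
    payload_fields: Union[str, Sequence[str], None] = None,
) -> Dict[str, Any]:
    """Split one record into Qdrant-style id / vector / payload parts."""
    if vector_field not in record:
        raise KeyError(f"record is missing vector field {vector_field!r}")

    if isinstance(payload_fields, str):
        payload_fields = [payload_fields]  # a single column name

    if payload_fields is None:
        # Default: every column except the vector and (if used) the id.
        excluded = {vector_field}
        if id_field is not None:
            excluded.add(id_field)
        payload = {k: v for k, v in record.items() if k not in excluded}
    else:
        # Restricted: only the listed columns (absent ones are skipped here).
        payload = {k: record[k] for k in payload_fields if k in record}

    point: Dict[str, Any] = {
        "vector": list(record[vector_field]),
        "payload": payload,
    }
    if id_field is not None:
        point["id"] = record[id_field]
    return point


row = {"id": 7, "embedding": [0.1, 0.2], "text": "hi", "source": "web", "lang": "en"}
print(record_to_point(row, vector_field="embedding", payload_fields=["text", "source"]))
```

In this sketch, leaving payload_fields unset for the same row would also carry lang into the payload.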

Writer options

Option	Type / Default	Description
`endpoint`	string, required	Base HTTP URL for Qdrant (must start with `http://` or `https://`).
`collection`	string, required	Target collection to write into.
`api_key`	string, optional	Adds `api-key` header when provided.
`headers`	mapping, optional	Extra headers merged into every request.
`timeout`	float, `30.0`	Request timeout in seconds.
`max_retries`	int, `3`	Retry attempts per batch.
`backoff_factor`	float, `0.5`	Exponential backoff multiplier between retries.
`batch_size`	int, `128`	Number of points sent per HTTP request.
`wait`	bool, `True`	Passes the `wait` flag to Qdrant to block until the write is applied.
`id_field`	string | None, `"id"`	Column to use as the point ID. Set to `None` to let Qdrant assign IDs.
`vector_field`	string, `"vector"`	Column containing the vector to index; required in every record.
`payload_fields`	string | sequence, optional	If set, only these columns are sent as payload. When omitted, all non-ID and non-vector columns become payload.
`create_collection`	bool, `False`	When true, auto-creates the collection (using the first point to infer vector size) if a GET on the collection returns 404.
`distance`	string, `"Cosine"`	Distance metric to use when creating a collection automatically.
`payload_format`	string, `"auto"`	Payload encoding for writes: `points` (list-of-points), `batch` (ids/vectors/payloads arrays), or `auto` (try points then fall back to batch on 400 “missing ids”).
`write_method`	string, `"auto"`	HTTP method for writes: `put`, `post`, or `auto` (tries PUT then POST).
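The two explicit payload_format values correspond to the two request body shapes Qdrant accepts for upserting points: a list of point objects, or parallel arrays of ids, vectors, and payloads. A minimal sketch of both follows. The function name build_upsert_body is hypothetical, and the auto mode (try points, then fall back to batch) is deliberately left out because its exact fallback logic is not shown on this page.

```python
from typing import Any, Dict, Mapping, Sequence


def build_upsert_body(
    points: Sequence[Mapping[str, Any]],
    payload_format: str = "points",
) -> Dict[str, Any]:
    """Build a request body in either the `points` or the `batch` layout."""
    if payload_format == "points":
        # One object per point: {"id": ..., "vector": ..., "payload": ...}
        return {"points": [dict(p) for p in points]}
    if payload_format == "batch":
        # Column-oriented layout: three parallel arrays of equal length.
        return {
            "batch": {
                "ids": [p["id"] for p in points],
                "vectors": [p["vector"] for p in points],
                "payloads": [p.get("payload") for p in points],
            }
        }
    raise ValueError(f"unsupported payload_format in this sketch: {payload_format!r}")


sample = [
    {"id": 1, "vector": [0.1, 0.2], "payload": {"text": "a"}},
    {"id": 2, "vector": [0.3, 0.4], "payload": {"text": "b"}},
]
print(build_upsert_body(sample, "points"))
print(build_upsert_body(sample, "batch"))
```

Both layouts carry the same information; the batch form repeats fewer keys, while the points form keeps each point self-contained.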

API Reference

Build the options payload consumed by the Qdrant data source.

Source code in src/spark_fuse/io/qdrant/reader.py

def build_qdrant_config(
    spark: SparkSession,
    endpoint: Any,
    *,
    collection: Optional[str] = None,
    schema: Optional[StructType] = None,
    source_config: Optional[Mapping[str, Any]] = None,
    headers: Optional[Mapping[str, str]] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Build the options payload consumed by the Qdrant data source."""

    config: Dict[str, Any] = {}
    for mapping in (source_config, kwargs):
        if mapping:
            config.update(mapping)

    endpoint_str = str(endpoint)
    if not _validate_http_url(endpoint_str):
        raise ValueError("endpoint must start with http:// or https:// for Qdrant reads")

    collection_name = collection or config.get("collection")
    if not collection_name or not str(collection_name).strip():
        raise ValueError("collection must be provided for Qdrant reads")
    config["collection"] = str(collection_name).strip()

    infer_schema = bool(config.get("infer_schema", schema is None))
    if not infer_schema and schema is None:
        raise ValueError("schema must be provided when infer_schema=False for Qdrant reads")

    base_headers: Dict[str, str] = {}
    for header_map in (config.get("headers"), headers):
        if isinstance(header_map, Mapping):
            base_headers.update({str(k): str(v) for k, v in header_map.items()})

    limit_value = config.get("limit")
    if limit_value is not None:
        limit_value = int(limit_value)
        if limit_value <= 0:
            raise ValueError("limit must be positive when provided")
        config["limit"] = limit_value

    page_size = int(config.get("page_size", _DEFAULT_PAGE_SIZE))
    if page_size <= 0:
        raise ValueError("page_size must be a positive integer")
    if limit_value is not None:
        page_size = min(page_size, int(limit_value))
    config["page_size"] = page_size

    max_pages = config.get("max_pages")
    if max_pages is not None:
        max_pages = int(max_pages)
        if max_pages <= 0:
            raise ValueError("max_pages must be positive when provided")
        config["max_pages"] = max_pages

    filter_value = config.get("filter")
    if filter_value is not None and not isinstance(filter_value, Mapping):
        raise TypeError("filter must be a mapping when provided")
    if isinstance(filter_value, Mapping):
        config["filter"] = _normalize_jsonable(filter_value)

    config_payload = {
        "endpoint": endpoint_str.rstrip("/"),
        "collection": config["collection"],
        "api_key": config.get("api_key"),
        "headers": base_headers,
        "timeout": float(config.get("timeout", 30.0)),
        "max_retries": int(config.get("max_retries", 3)),
        "backoff_factor": float(config.get("backoff_factor", 0.5)),
        "with_payload": _normalize_payload_option(config.get("with_payload", True)),
        "with_vectors": _normalize_vectors_option(config.get("with_vectors", False)),
        "limit": config.get("limit"),
        "page_size": config["page_size"],
        "max_pages": config.get("max_pages"),
        "filter": config.get("filter"),
        "offset": config.get("offset"),
        "infer_schema": infer_schema,
    }

    return config_payload
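The limit, page_size, and max_pages values validated above could drive a scroll loop along the following lines. This is a sketch under stated assumptions rather than the connector's reader: scroll_points and make_fake_fetch are hypothetical names, and fetch_page stands in for one Scroll request that returns a page of points plus the next offset (None when the collection is exhausted).

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

Point = Dict[str, Any]
Page = Tuple[List[Point], Optional[Any]]  # (points, next_offset)


def scroll_points(
    fetch_page: Callable[[Optional[Any], int], Page],
    *,
    page_size: int = 128,
    limit: Optional[int] = None,
    max_pages: Optional[int] = None,
    offset: Optional[Any] = None,
) -> Iterator[Point]:
    """Yield points page by page until limit, max_pages, or the end is hit."""
    if limit is not None:
        page_size = min(page_size, limit)  # same clipping as the builder
    emitted = 0
    pages = 0
    while max_pages is None or pages < max_pages:
        points, offset = fetch_page(offset, page_size)
        pages += 1
        for point in points:
            yield point
            emitted += 1
            if limit is not None and emitted >= limit:
                return
        if offset is None or not points:
            return  # no further pages


def make_fake_fetch(total: int) -> Callable[[Optional[Any], int], Page]:
    """In-memory stand-in for the HTTP call, for demonstration only."""
    data = [{"id": i} for i in range(total)]

    def fetch(offset: Optional[Any], size: int) -> Page:
        start = offset or 0
        end = start + size
        return data[start:end], (end if end < total else None)

    return fetch


print(len(list(scroll_points(make_fake_fetch(10), page_size=4, limit=5))))
```

In this sketch limit caps the number of points, while max_pages caps the number of requests, so whichever bound is reached first ends the scan.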

Build the config payload used for Qdrant writes (DataFrameWriter options).

Source code in src/spark_fuse/io/qdrant/writer.py

def build_qdrant_write_config(
    endpoint: Any,
    *,
    collection: str,
    id_field: Optional[str] = "id",
    vector_field: str = "vector",
    payload_fields: Optional[Sequence[str]] = None,
    wait: bool = True,
    batch_size: int = 128,
    api_key: Optional[str] = None,
    headers: Optional[Mapping[str, str]] = None,
    timeout: float = 30.0,
    max_retries: int = 3,
    backoff_factor: float = 0.5,
    create_collection: bool = False,
    distance: str = "Cosine",
    payload_format: str = "auto",
    write_method: str = "auto",
    **overrides: Any,
) -> Dict[str, Any]:
    """Build the config payload used for Qdrant writes (DataFrameWriter options)."""

    config: Dict[str, Any] = {}
    for mapping in (overrides,):
        if mapping:
            config.update(mapping)

    config["endpoint"] = endpoint
    config["collection"] = collection
    config["api_key"] = api_key
    config["headers"] = headers or {}
    config["timeout"] = timeout
    config["max_retries"] = max_retries
    config["backoff_factor"] = backoff_factor
    config["batch_size"] = batch_size
    config["wait"] = wait
    config["id_field"] = id_field
    config["vector_field"] = vector_field
    config["payload_fields"] = payload_fields
    config["create_collection"] = create_collection
    config["distance"] = distance
    config["payload_format"] = payload_format
    config["write_method"] = write_method

    # Validate by constructing the resolved config; return raw dict for JSON serialization.
    _QdrantWriteConfig.from_dict(config)
    return config

Write an iterable of records to a Qdrant collection via the HTTP API.

Source code in src/spark_fuse/io/qdrant/writer.py

def write_qdrant_points(
    records: Iterable[Mapping[str, Any]],
    endpoint: Any,
    *,
    collection: str,
    id_field: Optional[str] = "id",
    vector_field: str = "vector",
    payload_fields: Optional[Sequence[str]] = None,
    wait: bool = True,
    batch_size: int = 128,
    api_key: Optional[str] = None,
    headers: Optional[Mapping[str, str]] = None,
    timeout: float = 30.0,
    max_retries: int = 3,
    backoff_factor: float = 0.5,
    create_collection: bool = False,
    distance: str = "Cosine",
    payload_format: str = "auto",
    write_method: str = "auto",
) -> int:
    """Write an iterable of records to a Qdrant collection via the HTTP API."""

    config_dict = {
        "endpoint": endpoint,
        "collection": collection,
        "api_key": api_key,
        "headers": headers or {},
        "timeout": timeout,
        "max_retries": max_retries,
        "backoff_factor": backoff_factor,
        "batch_size": batch_size,
        "wait": wait,
        "id_field": id_field,
        "vector_field": vector_field,
        "payload_fields": payload_fields,
        "create_collection": create_collection,
        "distance": distance,
        "payload_format": payload_format,
        "write_method": write_method,
    }
    config = _QdrantWriteConfig.from_dict(config_dict)
    return _write_points_iter(records, config)
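Because records is any iterable and batch_size sets how many points go into each HTTP request, the batching step can be pictured as follows. This is a generic chunking sketch, not the body of _write_points_iter, which is not shown here; iter_batches is a hypothetical name.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Mapping


def iter_batches(
    records: Iterable[Mapping[str, Any]],
    batch_size: int = 128,
) -> Iterator[List[Mapping[str, Any]]]:
    """Group an iterable of records into lists of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(records)
    while True:
        # islice consumes lazily, so generators are never fully materialized.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A generator of 300 hypothetical records, consumed lazily.
rows = ({"id": i, "vector": [float(i)]} for i in range(300))
print([len(batch) for batch in iter_batches(rows, batch_size=128)])
```

Consuming the input lazily means only one batch is held in memory at a time, which matters when the records come from a large generator.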

Bases: DataSource

Source code in src/spark_fuse/io/qdrant/datasource.py

def __init__(self, options: Mapping[str, str]) -> None:
    super().__init__(options)
    raw_config = options.get(QDRANT_CONFIG_OPTION)
    if not raw_config:
        raise ValueError("Qdrant data source requires the config option")
    config_data = json.loads(raw_config)

    self._read_config = _QdrantResolvedConfig.from_dict(config_data)
    self._write_config = _QdrantWriteConfig.from_dict(config_data)

    schema_json = options.get(QDRANT_SCHEMA_OPTION)
    self._user_schema = StructType.fromJson(json.loads(schema_json)) if schema_json else None
    self._schema_cache: Optional[StructType] = None