The Pipeline

mindoff-dataport is built around a single, linear pipeline. Data flows one way: from a template file, through an in-memory schema, into a portable bundle, and out to a rendered file. Each stage has one job and one output.

For a user-facing walkthrough of these steps, see the Quick Start.

Data Flow

.xlsx template  ──extract()──►  WorkbookSchema
                                      │
                            compile(schema, data)
                                      │
                                      ▼
                            ReportBundle (directory)
                            ├── manifest.json
                            ├── report.json
                            └── data/*.parquet
                                      │
                       export(bundle, path, format=…)
                                      │
                              ┌───────┴───────┐
                              ▼               ▼
                            .xlsx           .pdf

The Three Stages

| Stage   | Entry point                       | Input → Output               | Owns                                                           |
|---------|-----------------------------------|------------------------------|----------------------------------------------------------------|
| Extract | extractor.py / extract_template   | .xlsx → WorkbookSchema       | Reading styles, dimensions, merges, breaks, placeholders       |
| Compile | bundle.py / compile_report_bundle | schema + data → ReportBundle | Validation, scalar resolution, anchors, repeat plans, shifting |
| Export  | xlsx_renderer.py, pdf_renderer.py | bundle → .xlsx / .pdf        | Materialising the plan into a styled file                      |

Each stage is covered in depth on its own page: Extraction, Compilation & Bundle, XLSX Rendering, and PDF Rendering.
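
To make the hand-off between stages concrete, here is a minimal sketch of the three-stage shape using stand-in types. Everything in it is hypothetical: the `_stub` functions, the dataclass fields, and the rule that list values become anchors are illustrative assumptions, not the library's actual signatures or behaviour.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class WorkbookSchema:
    """Hypothetical stand-in for the output of the Extract stage."""
    sheets: tuple[str, ...]
    placeholders: tuple[str, ...]


@dataclass(frozen=True)
class ReportBundle:
    """Hypothetical stand-in for the output of the Compile stage."""
    scalars: dict[str, Any]
    anchors: tuple[str, ...]


def extract_stub(template_path: str) -> WorkbookSchema:
    # A real extractor would read the .xlsx file; this stub returns fixed data.
    return WorkbookSchema(sheets=("Summary",), placeholders=("title", "orders"))


def compile_stub(schema: WorkbookSchema, data: dict[str, Any]) -> ReportBundle:
    # Validation: every placeholder in the schema needs a value.
    missing = [name for name in schema.placeholders if name not in data]
    if missing:
        raise KeyError(f"no data supplied for placeholders: {missing}")
    # Assumption for this sketch: list values are tables and become anchors,
    # everything else is a scalar that is resolved immediately.
    scalars = {k: v for k, v in data.items() if not isinstance(v, list)}
    anchors = tuple(k for k, v in data.items() if isinstance(v, list))
    return ReportBundle(scalars=scalars, anchors=anchors)


def export_stub(bundle: ReportBundle, path: str, format: str = "xlsx") -> str:
    # A real renderer would write a file; this stub only returns the target name.
    if format not in ("xlsx", "pdf"):
        raise ValueError(f"unsupported format: {format}")
    return f"{path}.{format}"


schema = extract_stub("template.xlsx")
bundle = compile_stub(schema, {"title": "Q3", "orders": [1, 2]})
target = export_stub(bundle, "out", format="pdf")
```

Each function takes only the previous stage's output, which is what keeps the flow one-directional.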

The Core Design Idea

The pipeline's defining decision is that the bundle stores a plan, not a rendering. Scalars are resolved into cells, but tables are not expanded; report.json holds compact anchors (column names, start position, style), and the actual rows stay in Parquet under data/. Expansion happens only at export time, row by row.

That single choice is what makes large exports possible. A million-row table costs the same in report.json as a thousand-row one, because the plan never contains the rows, only the instructions for where and how to write them. The streaming renderers then read Parquet in batches and emit output incrementally, so peak memory is governed by batch size rather than dataset size. See Benchmarking for the measurements.
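
The idea of expanding an anchor batch by batch can be sketched with plain Python iterators. This is an illustration under stated assumptions, not the renderer's code: `Anchor`, `iter_batches`, and `expand` are hypothetical names, and an in-memory generator stands in for the Parquet reader.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Anchor:
    """Hypothetical anchor: where a table starts and which columns it writes."""
    columns: tuple[str, ...]
    start_row: int
    start_col: int


def iter_batches(rows: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    # Pull at most batch_size records at a time from the source.
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def expand(anchor: Anchor, batches: Iterable[list[dict]]) -> Iterator[tuple[int, int, object]]:
    # Yield (row, column, value) cell writes one at a time; only the current
    # batch is held in memory, never the whole table.
    row_index = anchor.start_row
    for batch in batches:
        for record in batch:
            for offset, name in enumerate(anchor.columns):
                yield (row_index, anchor.start_col + offset, record[name])
            row_index += 1


# A generator stands in for a Parquet source; rows are produced lazily.
source = ({"id": i, "total": i * 10} for i in range(5))
anchor = Anchor(columns=("id", "total"), start_row=2, start_col=1)
cells = list(expand(anchor, iter_batches(source, batch_size=2)))
```

The anchor itself is the same size whether the source yields five rows or five million; only the number of batches changes.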

The ReportBundle Directory

When you pass bundle_path to compile(), the bundle is written as a directory. The same directory can be reloaded and re-exported without recompiling.

report_bundle/
├── manifest.json      # bundle version, inputs, sheet metadata, dataframe sources, capabilities
├── report.json        # resolved scalar cells and dataframe anchor/repeat plans
└── data/
    └── *.parquet      # dataframe sources materialised from Polars inputs

| File / dir     | Holds                                                                                     |
|----------------|-------------------------------------------------------------------------------------------|
| manifest.json  | Bundle version, the input contract, per-sheet metadata, dataframe sources, capabilities   |
| report.json    | Resolved scalar/static cells, and dataframe anchors plus repeat plans (not expanded rows) |
| data/*.parquet | Dataframe sources materialised from the Polars inputs; read in batches at export time     |
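
Because the bundle is an ordinary directory, its layout can be inspected with the standard library alone. The sketch below writes a toy directory with the same three parts and reads it back. The JSON keys (`bundle_version`, `cells`, `anchors`, `source`) are invented for illustration and are not the real file schema, and an empty file stands in for a real Parquet source.

```python
import json
import tempfile
from pathlib import Path


def write_demo_bundle(root: Path) -> None:
    # Create report_bundle/ and report_bundle/data/ in one call.
    (root / "data").mkdir(parents=True)
    manifest = {"bundle_version": 1, "sheets": ["Summary"]}
    report = {
        "cells": {"A1": "Q3 Report"},
        "anchors": [{"source": "orders.parquet", "start_row": 2, "start_col": 1}],
    }
    (root / "manifest.json").write_text(json.dumps(manifest), encoding="utf-8")
    (root / "report.json").write_text(json.dumps(report), encoding="utf-8")
    # Placeholder only: a real bundle would hold actual Parquet data here.
    (root / "data" / "orders.parquet").write_bytes(b"")


def load_bundle(root: Path) -> dict:
    # Read the two JSON files eagerly; list Parquet sources without opening them.
    return {
        "manifest": json.loads((root / "manifest.json").read_text(encoding="utf-8")),
        "report": json.loads((root / "report.json").read_text(encoding="utf-8")),
        "sources": sorted(p.name for p in (root / "data").glob("*.parquet")),
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "report_bundle"
    write_demo_bundle(root)
    loaded = load_bundle(root)
```

Loading touches only the two small JSON files; the Parquet sources are discovered by name and left unread until something needs their rows.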

Anchors, not rows

report.json stores dataframe anchors (column names, start row/column, and style) and never the expanded row data. Rows live in Parquet and are read at export time. This is why report.json stays small even for enormous datasets: its size depends on the number of anchors, not the number of rows.

Invariants the Pipeline Guarantees

A few properties hold across the whole pipeline, by design:

  • Templates are never mutated. Compilation builds a new bundle; the input schema is untouched.
  • Formulas are preserved. Extraction reads with data_only=False, so formula cells survive.
  • XLSX and PDF consume the same resolved layout. The two renderers share one plan; the only differences are which of the supplied fonts are available and the deterministic page scaling applied in PDF.
  • Output sheet names stay unique, and sheet order follows template order.

These are the contracts the rest of the architecture relies on. The pages that follow show how each stage upholds them.
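
As an illustration of the sheet-name invariant, the sketch below shows one way to keep names unique while preserving input order. The function name and the " (2)" suffix policy are assumptions made for this example; the library's actual renaming rule is not described on this page. The sketch also ignores Excel's 31-character limit on sheet names.

```python
def unique_sheet_names(names: list[str]) -> list[str]:
    """Return names in their original order, suffixing any repeats."""
    seen: set[str] = set()
    result: list[str] = []
    for name in names:
        candidate = name
        counter = 2
        # Excel treats sheet names as case-insensitive, so compare casefolded.
        while candidate.casefold() in seen:
            candidate = f"{name} ({counter})"
            counter += 1
        seen.add(candidate.casefold())
        result.append(candidate)
    return result


names = unique_sheet_names(["Summary", "Data", "summary", "Data"])
```

Because the result list is built in a single pass over the input, template order carries through to the output unchanged.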