Provenance

Provenance describes the origin and processing history of a research object. It records where data came from, which transformations were applied, which tools or workflows were used, and how a derived result relates to its sources.

What provenance captures

Good provenance usually records:

  • the source datasets, documents, or observations used as inputs
  • the transformation steps applied to those inputs
  • the software, versions, parameters, and workflow logic involved
  • timestamps, responsible people, and identifiers that connect one stage to the next

Provenance is therefore more than a citation list: it is the chain of derivation that links each result back to its sources.
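
As a rough illustration, the items listed above could be captured in one small record per derivation step. This is only a sketch: the class name ProvenanceRecord, its field names, and the example values (anomaly_tool, jdoe, the dataset identifiers) are hypothetical and do not follow any particular standard.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    """One derivation step: which inputs produced which output, and how.

    Hypothetical structure for illustration, not a standard schema.
    """
    output_id: str                 # identifier of the derived object
    input_ids: list[str]           # identifiers of the sources used
    transformation: str            # what was done to the inputs
    software: str                  # tool or workflow name
    software_version: str
    parameters: dict = field(default_factory=dict)
    agent: str = "unknown"         # responsible person or service
    created_at: str = field(       # UTC timestamp set when the record is made
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# All values below are invented for the example.
record = ProvenanceRecord(
    output_id="anomaly-v1",
    input_ids=["station-obs-raw", "baseline-1991-2020"],
    transformation="subtract baseline mean from observations",
    software="anomaly_tool",
    software_version="0.3.1",
    parameters={"baseline_period": "1991-2020"},
    agent="jdoe",
)

print(asdict(record))
```

Because each record names its inputs by identifier, a series of such records forms a chain: the output_id of one step can appear in the input_ids of the next.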

Why it matters

Without provenance, a dataset or figure may be technically available but scientifically difficult to trust or reuse. Provenance is what makes it possible to answer questions such as:

  • Which raw observations or inputs were used?
  • Which processing steps created this output?
  • Which software, parameters, or workflow version were involved?
  • Can the result be reproduced or audited?

Levels of provenance

Provenance can be recorded at different levels:

  • file level: where a single output file came from
  • dataset level: how a collection of files was assembled and versioned
  • workflow level: how a pipeline transforms inputs into outputs
  • claim level: which figure, table, or research statement depends on which evidence

Relation to FAIR

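Provenance is a key part of making research outputs FAIR (findable, accessible, interoperable, reusable). The FAIR principles name it directly: R1.2 asks that data and metadata be associated with detailed provenance. Provenance strengthens reusability by exposing context, assumptions, and derivation steps instead of presenting outputs as isolated files.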

In climate and data workflows

Provenance is especially important when combining observational data, reanalysis products, model simulations, and derived diagnostics. In those settings, knowing the chain from source to result is often as important as the result itself.

For example, a climate figure may depend on several linked stages: raw observations or reanalysis, preprocessing, model execution, post-processing, aggregation, and finally visualization. Without provenance, the final figure can be hard to audit even if the underlying files still exist.
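
The chain of stages described above can be sketched as a small lineage graph that is walked backwards from the figure to its sources. The stage names, the DERIVED_FROM mapping, and both helper functions are assumptions made up for this example; a real workflow would record these links as it runs.

```python
from collections import deque

# Hypothetical lineage: each artifact maps to the artifacts it was derived from.
DERIVED_FROM = {
    "figure": ["aggregated"],
    "aggregated": ["postprocessed"],
    "postprocessed": ["model_output"],
    "model_output": ["preprocessed"],
    "preprocessed": ["raw_observations", "reanalysis"],
    "raw_observations": [],
    "reanalysis": [],
}


def trace_sources(artifact, derived_from):
    """Return every upstream artifact of `artifact`, nearest first."""
    seen = []
    queue = deque(derived_from.get(artifact, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited through another path
        seen.append(current)
        queue.extend(derived_from.get(current, []))
    return seen


def root_sources(artifact, derived_from):
    """Return the upstream artifacts that have no recorded inputs."""
    return [
        name for name in trace_sources(artifact, derived_from)
        if not derived_from.get(name)
    ]


print(trace_sources("figure", DERIVED_FROM))
print(root_sources("figure", DERIVED_FROM))
```

If any link in the mapping is missing, the walk stops early and the figure can no longer be traced to its raw inputs, which is the audit problem described above.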

Common failure modes

  • derived outputs are saved without references to their inputs
  • manual spreadsheet or notebook steps are not documented
  • file versions change without clear versioning or changelogs
  • figures and claims become disconnected from the exact workflow that produced them
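
One way to guard against the first and third failure modes is to write a small sidecar file next to every derived output, listing its inputs with checksums. The sketch below assumes a ".prov.json" suffix and a simple JSON layout; both are conventions invented for this example, as are the function names and the sample files.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_sidecar(output_path, input_paths, step, parameters=None):
    """Write <output>.prov.json next to a derived file and return its path."""
    output_path = Path(output_path)
    sidecar = {
        "output": {"name": output_path.name, "sha256": sha256_of(output_path)},
        "inputs": [
            {"name": Path(p).name, "sha256": sha256_of(p)} for p in input_paths
        ],
        "step": step,
        "parameters": parameters or {},
    }
    sidecar_path = output_path.with_name(output_path.name + ".prov.json")
    sidecar_path.write_text(json.dumps(sidecar, indent=2))
    return sidecar_path


def inputs_unchanged(sidecar_path, input_dir):
    """Check that the recorded inputs still match their stored checksums."""
    sidecar = json.loads(Path(sidecar_path).read_text())
    return all(
        sha256_of(Path(input_dir) / entry["name"]) == entry["sha256"]
        for entry in sidecar["inputs"]
    )


# Demonstration with throwaway files (contents are invented).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    raw = tmp / "raw.csv"
    raw.write_text("t,temp\n0,14.1\n1,14.3\n")
    derived = tmp / "mean.txt"
    derived.write_text("14.2\n")

    sidecar = write_sidecar(derived, [raw], step="mean of temp column")
    print(inputs_unchanged(sidecar, tmp))  # checked right after writing

    raw.write_text("t,temp\n0,99.9\n")     # simulate a silent edit to the input
    print(inputs_unchanged(sidecar, tmp))  # checked again after the edit
```

The checksum comparison only detects that an input changed. It does not say what changed or why, so it complements versioning and changelogs rather than replacing them.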

Metadata vs provenance

Metadata describes the research object itself: variables, units, creators, scope, and format. Provenance explains how that object came to exist. Strong research-data practice usually needs both.

In this garden

Provenance connects metadata practice, data standards, and structured knowledge representation.

This note is the conceptual bridge between data-description notes and workflow notes.

See also: Metadata, FAIR, ATMODAT, UC2 data standard, Open Research Knowledge Graph, MOC Open Science Data and Knowledge Graphs