Stoplight Test Framework for Multispecies Model Ensemble Inclusion
GOACLIM / ACLIM Topic 2 — Proposed Model Comparison Methods
1 Overview
This document proposes a practical screening framework for deciding whether a multispecies or ecosystem model is credible enough to include in an ensemble analysis. The stoplight framework is not a formal statistical test. Its purpose is to support a transparent group decision about model inclusion, model caveats, and model fitness for a specific use.
In plain terms, the framework asks three questions:
- Does the model produce biologically and dynamically plausible behavior?
- Does that behavior agree with available assessment information closely enough for the intended use?
- If the model disagrees, is that disagreement informative, or is it a sign that the model is not ready for inclusion?
The stoplight language is meant to summarize those answers:
- Green: The model passes the criterion and can be used without special concern for that issue.
- Yellow: The model is usable, but the result should be interpreted cautiously and documented explicitly.
- Red: The model fails a key criterion badly enough that it should be revised, excluded, or only discussed as a clear outlier.
Tests are organized by purpose (research synthesis vs. management advice) and by phase (hindcast/initialization vs. projection period). Focal species for Topic 2 are walleye pollock, Pacific cod, and arrowtooth flounder.
2 Tiered Approach
The rigor of the stoplight criteria should scale with the intended use of the ensemble. A model that is acceptable for a research synthesis may still be too weak for direct management use.
- Tier 1 (Research/synthesis paper): The model should reproduce broad ecological patterns and plausible species dynamics. This is the minimum standard for including a model in the Topic 2 (F=0) climate comparison.
- Tier 2 (Management advice): The model should not only be plausible, but should also agree quantitatively with assessment-based benchmarks closely enough to support management interpretation. This is the more demanding standard for harvest strategy evaluation and advice-oriented work (Topics 3–4).
The framework's value is therefore decision-oriented rather than statistical: it gives the working group a transparent basis for explaining why a model was included, included with caveats, or left out.
3 Criteria: Hindcast / Model Initialization Phase
Table 1 summarizes the stoplight criteria applied during the hindcast and initialization phase.
| Criterion | Metric | Tier 1 (Research) | Tier 2 (Management) | Notes |
|---|---|---|---|---|
| Species persistence | All focal species present at end of hindcast | Required (hard stop) | Required (hard stop) | No extinctions of pollock, cod, or ATF |
| Biomass plausibility | Biomass of focal species vs. historical bounds | Within order of magnitude of assessment estimates | Within assessment CI at start of projection period | Absolute scale differs across model types; relative metrics preferred |
| Trend direction | Direction of recent biomass trend (increasing/stable/declining) | Qualitative agreement with assessment | Quantitative agreement (similar slope sign and approximate magnitude) | Compare over final 5–10 years of hindcast |
| No steep artifacts | Biomass trajectories free of sharp artificial trends | No >50% change in first 5 years unless data-supported | No >25% change in first 5 years unless data-supported | Particularly relevant for Atlantis/Ecopath spin-up |
| Stock status agreement | Relative status (B/BMSY or B/B0) | Same broad category (above/below reference point) | Within ±20% of assessment-based status ratio | Key criterion — relative metric normalizes across model structures |
| Fref consistency | Multispecies FMSY or proxy vs. single-species Fref | Multispecies Fref ≤ single-species Fref (directional check) | Multispecies Fref within 0.3–1.0× single-species Fref | Multispecies Fref expected to be lower due to explicit predation mortality; less critical for F=0 runs but useful diagnostic |
| Diet composition | Proportion consumed by major predators | Major prey items present in diet; qualitative agreement with diet data | Proportions within ±20% of observed diet fractions for key predator–prey pairs | Circular for models fit to diet data (e.g., CEATTLE); most useful for Atlantis, mizer, Ecopath |
| Age/size structure | Age or size composition at equilibrium or end of hindcast | Consistent with initial conditions (no major departures) | Consistent with assessment-estimated compositions | Model-type dependent; more applicable to age-structured models |
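As a concrete illustration, the "no steep artifacts" and "stock status agreement" rows of Table 1 could be scripted roughly as follows. This is a minimal sketch: the function names, the yellow band at twice the tier threshold, and the use of a status ratio of 1.0 as the above/below dividing line are illustrative assumptions, not part of the proposed framework.

```python
def artifact_check(biomass, tier="research"):
    """Flag sharp artificial trends early in the hindcast (Table 1).

    Thresholds follow Table 1: >50% change in the first 5 years fails
    Tier 1 (research); >25% fails Tier 2 (management). The yellow band
    at up to twice the threshold is an illustrative convention.
    """
    limit = 0.50 if tier == "research" else 0.25
    change = abs(biomass[4] - biomass[0]) / biomass[0]
    if change <= limit:
        return "green"
    return "yellow" if change <= 2 * limit else "red"


def status_check(model_ratio, assess_ratio, tier="research"):
    """Compare relative stock status (e.g. B/BMSY or B/B0) to the assessment.

    Tier 1: same broad category (above/below the reference point,
    taken here as a ratio of 1.0). Tier 2: within +/-20% of the
    assessment-based status ratio.
    """
    if tier == "research":
        same_side = (model_ratio >= 1.0) == (assess_ratio >= 1.0)
        return "green" if same_side else "red"
    rel_err = abs(model_ratio - assess_ratio) / assess_ratio
    return "green" if rel_err <= 0.20 else "red"
```

Either check returns a stoplight color string, which keeps the scripted output aligned with the green/yellow/red vocabulary used throughout this document.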
4 Criteria: Projection Period
Table 2 summarizes the stoplight criteria applied during the projection period.
| Criterion | Metric | Tier 1 (Research) | Tier 2 (Management) | Notes |
|---|---|---|---|---|
| Species persistence | All focal species present through 2100 | Required (hard stop) | Required (hard stop) | Under \(F=0\), extinction would indicate serious model issues |
| Trend plausibility | Direction and magnitude of projected change | Ecologically plausible (no runaway dynamics) | Consistent with expected climate responses from literature | Hard to define precisely; focus on identifying clear outliers |
| Ranking consistency | Relative ranking of species biomass | Models agree on which species are most/least abundant | Models agree on ranking and approximate relative proportions | Useful cross-model diagnostic |
| Climate signal | Difference between SSP126 and SSP585 trajectories | Detectable difference in expected direction for at least some species | Statistically distinguishable trajectories with ecologically consistent direction | If a model shows no climate signal, investigate before including |
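The "climate signal" row of Table 2 could be screened with a simple end-of-century comparison. The sketch below assumes annual biomass trajectories for the two scenarios and uses the mean of the final ten years; the 5% minimum relative difference used to call a signal "detectable" is an illustrative assumption, not a threshold proposed here.

```python
def climate_signal(traj_126, traj_585, expect_decline=True, min_rel_diff=0.05):
    """Check that SSP585 diverges from SSP126 in the expected direction
    (Table 2, 'Climate signal'), comparing the final decade of each run.

    Returns green if the difference has the expected sign and exceeds
    min_rel_diff, yellow if the sign is right but the signal is weak,
    and red if the direction is wrong (investigate before including).
    """
    late_126 = sum(traj_126[-10:]) / len(traj_126[-10:])
    late_585 = sum(traj_585[-10:]) / len(traj_585[-10:])
    rel_diff = (late_585 - late_126) / late_126
    right_direction = rel_diff < 0 if expect_decline else rel_diff > 0
    if right_direction and abs(rel_diff) >= min_rel_diff:
        return "green"
    return "yellow" if right_direction else "red"
```

A red result here does not by itself exclude a model; per Table 2, it marks the model for investigation before inclusion.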
5 Defining “Similar”
Because models span a wide range of structural complexity (fitted statistical models like CEATTLE vs. whole-ecosystem models like Atlantis), a single quantitative threshold for “similar” is inappropriate. We recommend a hierarchical definition:
5.1 Qualitative concordance (minimum for Tier 1)
Models agree on direction of change and relative ranking of species. For example, if the assessment indicates pollock biomass > cod biomass and pollock is stable while cod is declining, a passing model should reproduce this pattern.
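This qualitative check can be made explicit. A possible sketch, assuming each model and the assessment are summarized as a mapping from species name to a (biomass, trend) pair (the data layout is an assumption for illustration):

```python
def concordance(model, assessment):
    """Tier 1 'similar': agree on the biomass ranking across species
    and on the direction of each species' recent trend.

    `model` and `assessment` map species name -> (biomass, trend),
    where trend is "increasing", "stable", or "declining".
    """
    def rank(d):
        return sorted(d, key=lambda sp: d[sp][0], reverse=True)

    same_rank = rank(model) == rank(assessment)
    same_trend = all(model[sp][1] == assessment[sp][1] for sp in model)
    return same_rank and same_trend
```

In the pollock/cod example above, a model ranking pollock above cod with pollock stable and cod declining passes; flipping either the ranking or a trend direction fails.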
5.2 Model-class-specific quantitative thresholds
Each modeler proposes, and documents transparently, what constitutes reasonable agreement for their model type. Suggested starting points:
- Fitted statistical models (CEATTLE, mizer): biomass within ±20% of assessment, status ratios within ±0.1
- Minimally tuned ecosystem models (Atlantis, Ecopath/Ecosim/Rpath): biomass within ±50% of assessment, correct status category
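These class-specific tolerances could be encoded once and reused across models. In the sketch below, the tolerance values are the suggested starting points listed above, while the yellow band at up to twice the class tolerance is an illustrative convention, not part of the proposal.

```python
# Suggested starting-point tolerances from Section 5.2.
BIOMASS_TOL = {
    "fitted": 0.20,     # CEATTLE, mizer: biomass within +/-20%
    "ecosystem": 0.50,  # Atlantis, Ecopath/Ecosim/Rpath: within +/-50%
}


def biomass_light(model_b, assess_b, model_class):
    """Stoplight color for biomass agreement, scaled by model class."""
    tol = BIOMASS_TOL[model_class]
    rel_err = abs(model_b - assess_b) / assess_b
    if rel_err <= tol:
        return "green"
    return "yellow" if rel_err <= 2 * tol else "red"
```

Keeping the tolerances in a single table makes the class-specific standards easy to revise and to report alongside the stoplight results.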
5.3 Ensemble-level diagnostics
Rather than strict pass/fail for individual models, assess whether the ensemble as a whole brackets the assessment estimate. A model that is an outlier on every metric may warrant exclusion, while a model that is an outlier on one metric but informative on others may still contribute.
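Two minimal ensemble-level diagnostics along these lines, assuming one scalar estimate per model for a given metric (the median-absolute-deviation outlier rule and its cutoff are illustrative assumptions):

```python
import statistics


def ensemble_brackets(model_values, assess_value):
    """Does the spread of model estimates bracket the assessment estimate?"""
    return min(model_values) <= assess_value <= max(model_values)


def flag_outliers(model_values, z=2.0):
    """Flag models far from the ensemble median on one metric.

    Uses the median absolute deviation (MAD); a model flagged on every
    metric may warrant exclusion, per Section 5.3. The cutoff z is an
    illustrative choice.
    """
    med = statistics.median(model_values)
    mad = statistics.median(abs(v - med) for v in model_values) or 1e-9
    return [abs(v - med) / mad > z for v in model_values]
```

Applied metric by metric, these diagnostics distinguish a model that is a consistent outlier from one that disagrees only in a single, potentially informative, dimension.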
6 Implementation Notes
\(F=0\) context. Since Topic 2 runs use \(F=0\), criteria related to \(F_\text{ref}\) and catch are diagnostics of model setup rather than projection outputs. They remain useful for verifying that internal productivity is reasonable.
Model-specific expectations. Atlantis hindcasts will never reproduce observed data patterns as closely as a fitted model such as CEATTLE. Stoplight tests should be calibrated accordingly — the goal is not to hold all models to the same quantitative standard but to ensure each model is performing within the expectations of its structural class.
Post-hoc evaluation. Some stoplight assessments (particularly for projection plausibility) may need to occur after initial runs are completed. The framework should be viewed as iterative: run, evaluate, discuss, potentially adjust inclusion criteria.
Transparency over exclusion. Where possible, retain models that are borderline and document their performance rather than excluding them. The ensemble benefits from structural diversity, and understanding why a model disagrees can be as informative as the agreement among models that pass.
7 Suggested Workflow
1. Each modeler completes hindcast and \(F=0\) projection runs per Topic 2 specifications.
2. Each modeler evaluates their model against the Tier 1 criteria above and reports results in a standardized format.
3. The working group reviews results collectively and identifies any hard-stop failures.
4. For borderline cases, the group discusses whether to include the model with caveats or exclude it.
5. All decisions and rationale are documented in the synthesis paper methods section.
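The "standardized format" in step 2 has not been specified; one hypothetical shape is a per-model record of criterion-level stoplight colors, from which hard stops and a headline color can be derived. All field names and the example values below are illustrative, not a proposed schema.

```python
# Hypothetical stoplight report for one model (illustrative values).
report = {
    "model": "CEATTLE",
    "tier": "research",
    "criteria": {
        "species_persistence": "green",
        "biomass_plausibility": "green",
        "trend_direction": "yellow",
        "no_steep_artifacts": "green",
        "stock_status": "green",
        "climate_signal": "green",
    },
    "notes": "Trend direction borderline for one focal species; see caveats.",
}


def hard_stop(report):
    """Persistence failures are hard stops at both tiers (Tables 1-2)."""
    return report["criteria"]["species_persistence"] == "red"


def overall_light(report):
    """Worst color across criteria determines the headline light."""
    severity = {"green": 0, "yellow": 1, "red": 2}
    return max(report["criteria"].values(), key=severity.get)
```

A record like this keeps the collective review (step 3) and the methods-section documentation (step 5) working from the same structured summary.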