ML-Driven Injury Risk Function Development — Continuum Resources WP-CR-2026-01

Section 01

Executive Summary

Motor vehicle crashes impose a staggering human and economic burden on the United States — 42,939 fatalities in 2021 alone, with an estimated $340 billion in annual societal costs. The Federal Motor Vehicle Safety Standards (FMVSS) that govern crashworthiness requirements are grounded in injury risk functions (IRFs): mathematical models that translate physical crash measurements into probabilities of occupant injury. These functions underpin every compliance test, every New Car Assessment Program (NCAP) rating, and every regulatory rulemaking that shapes the vehicles Americans drive.

The prevailing IRF methodology was largely established in the 1970s and 1980s. It relies on small cadaveric and volunteer test series, single-parameter logistic regression, and anthropomorphic test device (ATD) measurements that are imperfect proxies for human biomechanical response. The result: functions that are structurally sound but limited in their ability to capture the multivariate, interaction-driven nature of real-world crash injury causation.

Modern machine learning — applied to the rich, nationally representative crash investigation data that NHTSA has accumulated for decades — offers a credible path to next-generation IRFs. This paper examines that path in detail: the data infrastructure, the algorithmic toolkit, and a validation framework aligned to regulatory requirements.

42,939: US Traffic Fatalities (2021)
$340B: Annual Societal Cost
8,000+: Annual CISS Investigations

Key Findings

Field data dominates. NHTSA's Crash Investigation Sampling System (CISS) provides a nationally representative, biomechanically coded sample of ~8,000 crashes per year — orders of magnitude more data than cadaveric or ATD test series. ML methods are uniquely equipped to exploit this volume.
Survival analysis provides a natural statistical framework. The dose-response relationship between crash severity and injury probability maps directly onto time-to-event models, enabling continuous IRF curves with covariate adjustment rather than discrete threshold criteria.
Ensemble methods outperform classical logistic IRFs on MAIS prediction tasks, particularly for interactions between delta-v, occupant age, restraint use, and vehicle type — relationships that parametric models systematically underfit.
Feature engineering from EDR data is the critical bottleneck. The quality of ML-derived IRFs is bounded by the fidelity of crash kinematics reconstruction, making Event Data Recorder (EDR) extraction methodology a foundational dependency.
Regulatory acceptance requires calibrated probability outputs. Point predictions are insufficient; probability calibration — alignment between predicted and observed injury rates — is a prerequisite for FMVSS rulemaking applications.

Section 02

Introduction — Traffic Safety as a Data Science Problem

NHTSA's Office of Vehicle Safety Research (NSR-200) frames its mission explicitly around data: crash investigation, data collection, and data analysis activities that support safety regulatory decision-making. This framing is not incidental — it reflects a half-century of institutional investment in the infrastructure required to transform crashes from discrete tragedies into analyzable populations.

The resulting data ecosystem is extraordinary in scope. NHTSA maintains census-level records of every U.S. fatal crash (FARS), deep-investigation samples of the broader crash population (CISS, formerly NASS-CDS), police-reported crash estimates (CRSS), and controlled laboratory test series from NCAP and biomechanics research programs. Across these sources, injury coding follows the Abbreviated Injury Scale (AIS) — a hierarchical severity taxonomy that enables consistent comparison across crash types, body regions, and time periods.

What has historically been missing is not data — it is the analytical methodology to extract multivariate signal from that data at scale. Classical IRF development operated within the statistical constraints of the 1970s: small samples, parametric assumptions, single-predictor models. Machine learning dissolves those constraints. The question this paper addresses is not whether ML can improve IRF development — it clearly can — but rather how: which algorithms, which features, which validation standards, and which architectural choices are most defensible in a regulatory context.

The injury risk function is the connective tissue between a crash test measurement and a safety standard. Getting it right — making it as accurate, generalizable, and well-calibrated as the data allow — is one of the highest-leverage problems in applied vehicle safety research.

— WP-CR-2026-01, Continuum Resources LLC

Scope and Limitations

This paper focuses on occupant injury risk prediction using field crash data — specifically CISS and FARS — supplemented by NCAP test series data where available. It does not address pedestrian or cyclist injury biomechanics, which involve distinct biomechanical mechanisms and data sources. The paper is analytical and architectural in nature; it does not present original empirical results from CISS data analysis, which would require NHTSA data use agreements and formal IRB protocols.

Section 03

The NHTSA Data Ecosystem

Four primary data sources constitute the foundation for ML-driven IRF development. Each has distinct strengths, limitations, and roles in the analytical pipeline. Understanding the data before selecting algorithms is not optional — it is the defining act of responsible injury biomechanics research.

Fatality Analysis Reporting System (FARS)

FARS is a census of all U.S. crashes resulting in at least one fatality within 30 days, maintained continuously since 1975. With approximately 35,000–43,000 records annually, it provides the largest single-source dataset for fatal crash analysis. Key variables include crash kinematics (speed, delta-v when EDR data is available), occupant demographics, restraint use, AIS-coded injury severity, vehicle characteristics, and crash configuration (frontal, side, rear, rollover).

For IRF development, FARS is most valuable for fatality-specific endpoints (AIS 6, MAIS 5–6) and longitudinal trend analysis — tasks where the completeness of census data offsets the absence of non-fatal comparators. The primary limitation is survivor exclusion: FARS cannot support standard binary classification of AIS 3+ injury because the non-fatal stratum is not represented.

Crash Investigation Sampling System (CISS)

CISS, which replaced NASS-CDS in 2017 with expanded sampling and enhanced biomechanical coding, is the workhorse dataset for IRF development. It is a nationally representative, stratified probability sample of approximately 8,000 crash investigations per year, spanning the full injury severity spectrum from property-damage-only to fatality. Each case includes detailed AIS coding by body region, vehicle damage documentation with Principal Direction of Force (PDOF) assessment, scene reconstruction, and — increasingly — EDR data extraction.

📋 CISS Key Variables for IRF Modeling

Delta-v: Change in velocity from EDR or crush-based reconstruction. Primary kinematic predictor.
PDOF (Principal Direction of Force): Crash direction in clock-face notation; decomposed into longitudinal and lateral components for multi-axis models.
AIS codes by body region: Head, neck, thorax, abdomen, spine, pelvis, lower extremity. Basis for MAIS calculation and regional injury models.
Occupant attributes: Age, sex, BMI, restraint system type, seating position, belt use, airbag deployment status.
Vehicle attributes: Make, model, year, curb weight, body type, structural category (car, LTV, truck).
Intrusion measures: Occupant compartment intrusion by zone — a critical predictor for lower extremity injury, particularly in frontal crashes.
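As a minimal sketch of how MAIS follows from the region-level AIS codes listed above, the snippet below takes the maximum coded severity across body regions. The dictionary layout, the use of `None` for an uncoded region, and the function name `mais` are illustrative assumptions, not the CISS file schema.

```python
from typing import Dict, Optional

def mais(ais_by_region: Dict[str, Optional[int]]) -> Optional[int]:
    """Return the maximum AIS severity across body regions, or None if nothing is coded."""
    coded = [s for s in ais_by_region.values() if s is not None]
    if not coded:
        return None
    if any(not 0 <= s <= 6 for s in coded):
        raise ValueError("AIS severity must be between 0 and 6")
    return max(coded)

# Hypothetical occupant record: spine is uncoded and is skipped.
occupant = {"head": 2, "thorax": 3, "lower_extremity": 1, "spine": None}
occupant_mais = mais(occupant)
print(occupant_mais)        # 3
print(occupant_mais >= 3)   # True: counts as a serious-injury case
```

The binary AIS ≥ 3 target used later in the paper is then a simple threshold on this value.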

NASS-CDS (Legacy) and Longitudinal Continuity

The National Automotive Sampling System Crashworthiness Data System operated from 1979 to 2015 and remains the only source for long-horizon trend analysis. Variable definition changes between NASS-CDS and CISS require careful harmonization — particularly for AIS version differences (AIS-90 vs. AIS-2005 Update 2008) and changes in EDR extraction protocols. For ML applications requiring large training sets, harmonized NASS-CDS + CISS data substantially expands the training population for pre-2017 vehicle cohorts.

NCAP Test Series Data

NCAP provides controlled experimental data — standardized crash tests with instrumented ATDs generating high-frequency time-series measurements (head acceleration, chest deflection, femur load, etc.). While the N is small relative to CISS, NCAP data offers precise biomechanical measurements unconstrained by the reconstruction uncertainty inherent in field data. Its primary role in ML-augmented IRF development is feature pre-training and calibration: ML models trained on CISS can be anchored to ATD-derived biomechanical parameters by linking NCAP vehicle performance scores to CISS-identified vehicle-crash configurations.

Database	Coverage	Annual Volume	Primary IRF Use	Key Limitation
FARS	Fatal crashes only (census)	~42,000	Fatality endpoint models, longitudinal trends	No non-fatal comparators
CISS	All severity levels (sample)	~8,000	Full-spectrum IRF, MAIS prediction	Reconstruction uncertainty, sample weights required
NASS-CDS	All severity levels (legacy)	~5,000 (peak)	Longitudinal and pre-2017 vehicle cohorts	Variable definition changes, AIS version drift
NCAP	Controlled tests	~200 tests	ATD calibration, biomechanical feature anchoring	Small N, controlled conditions only

Section 04

Classical IRF Methods — Foundation and Limitations

The modern injury risk function emerged from biomechanics research conducted primarily at Wayne State University, the University of Michigan Transportation Research Institute (UMTRI), and NHTSA's own biomechanics laboratory in the 1960s through 1980s. The canonical form — logistic regression of a binary injury outcome on a single biomechanical loading parameter — has proven durable in its simplicity and tractability. Understanding its structure is prerequisite to understanding what ML changes and what it does not.

The Logistic IRF Paradigm

Classical IRF development follows a consistent methodological sequence: select a biomechanical loading metric from ATD measurements, gather experimental data (cadaveric tests, volunteer sled experiments, or ATD test series), code injury outcomes using AIS, and fit a two-parameter logistic model relating the loading metric to injury probability. The resulting function defines the probability P of sustaining an injury of AIS severity k or greater as a function of loading metric x.

Classical Logistic IRF — Canonical Form

P(injury ≥ AIS k | x) = 1 / (1 + exp(−(β₀ + β₁·x)))

where x is the biomechanical loading metric (e.g., HIC₁₅, Nij, δ-chest)
β₀, β₁ are estimated from experimental data via maximum likelihood

The 50th-percentile injury threshold — P = 0.50 — defines the tolerance criterion used in FMVSS performance limits. The choice of k (commonly AIS 3 or AIS 4) is injury-region dependent.
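The canonical form can be evaluated and inverted in a few lines. The sketch below uses made-up coefficients (`BETA0`, `BETA1` are hypothetical, not fitted values for any published criterion) and shows how a tolerance value follows from solving the logistic curve for a chosen probability.

```python
import math

def logistic_irf(x: float, beta0: float, beta1: float) -> float:
    """P(injury >= AIS k | x) for the two-parameter logistic IRF."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def tolerance_value(p: float, beta0: float, beta1: float) -> float:
    """Invert the IRF: the loading metric x at which injury probability equals p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return (math.log(p / (1.0 - p)) - beta0) / beta1

# Hypothetical coefficients, for illustration only.
BETA0, BETA1 = -5.0, 0.01

x50 = tolerance_value(0.5, BETA0, BETA1)
print(x50, logistic_irf(x50, BETA0, BETA1))
```

At P = 0.50 the log-odds term vanishes, so the tolerance value reduces to −β₀/β₁.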

Major IRF families defined by this paradigm include the Head Injury Criterion (HIC, standardized at a 15ms integration window as HIC₁₅ with a 700 tolerance value for the 50th-percentile male), the Normalized Neck Injury Criterion (Nij, combining axial load and bending moment), the Thoracic Trauma Index (TTI for side impact), and lower extremity load criteria for frontal crash femur and tibia loadings. Each was derived from a distinct experimental dataset ranging from a few dozen to a few hundred observations.

Structural Limitations

The classical paradigm's limitations are not failures of execution — given the data and computational tools available at the time, the methodology was appropriate. They are structural constraints that become increasingly consequential as the research questions become more nuanced.

Small experimental N. Cadaveric test series rarely exceed 100 subjects; most foundational IRF datasets contain 20–60 observations. Confidence intervals on the resulting logistic curves are wide, and the functions are poorly constrained in the tails — precisely where regulatory tolerance thresholds are set.
Single-parameter assumption. Classical IRFs condition on one loading metric, ignoring documented interactions: the injury risk associated with a given HIC value differs systematically between belted and unbelted occupants, between males and females, and between young and elderly crash victims.
ATD-centricity. IRFs derived from Hybrid III measurements apply strictly to Hybrid III ATD performance — not directly to the human occupant population. The Hybrid III 50th-percentile male is not a representative human; it is a calibration reference. Extrapolation to real-world occupant diversity requires assumptions that are rarely made explicit.
Parametric form imposition. The logistic function is convenient but not necessarily correct. The true dose-response relationship may be non-monotonic in the tails, may exhibit threshold effects, or may be better described by alternative functional forms (Weibull, log-normal). The classical paradigm provides no mechanism for model selection.
Binary outcome loss of information. Reducing the continuous AIS scale to a binary AIS ≥ k indicator discards ordinal severity information. A model that distinguishes AIS 1 from AIS 4 provides strictly more information than one that collapses both to "injured."

⚠ The Calibration Gap

A classical IRF derived from cadaveric data predicts injury probability for a population that is by definition deceased at the time of testing. The occupant population of interest — living drivers and passengers in real-world crashes — differs in tissue properties, pre-existing conditions, and age distribution. This calibration gap is well-recognized in the biomechanics literature but has resisted systematic correction absent field data of sufficient scale and quality. CISS, at sufficient depth and combined with ML methods, provides the first credible path to closing it.

Section 05

ML-Driven IRF Development — Core Methods

Machine learning changes two things in IRF development simultaneously: the data scale from dozens to thousands of observations, and the hypothesis space from a fixed parametric form to a flexible function class capable of capturing nonlinear interactions. These two changes are related but distinct — and both are necessary. A flexible ML model trained on a cadaveric dataset of 40 observations will overfit catastrophically. The same model trained on 50,000 harmonized CISS-NASS records can generalize.

Problem Formulation

IRF development from field crash data can be formulated in three related ways, each appropriate for different regulatory contexts. The choice of formulation has downstream consequences for algorithm selection, validation approach, and regulatory interpretability.

Formulation	Target Variable	Methods	Regulatory Application
Binary classification	AIS ≥ 3 (serious injury)	Logistic regression, RF, XGBoost, neural networks	Performance limit thresholds (FMVSS pass/fail criteria)
Ordinal regression	MAIS (0–6 ordered)	Ordinal logistic, ordinal RF, neural ordinal heads	NCAP star rating models, injury severity scoring
Survival / dose-response	Injury probability as function of continuous loading	Cox PH, Kaplan-Meier, parametric survival	Tolerance curve derivation, dose-response characterization

Random Forest for Interaction Capture

Random Forest (RF) is a natural first choice for field crash data IRF development. Its core mechanism, averaging over many decorrelated decision trees that are each trained on a bootstrap sample with random feature subsets, is well-suited to the tabular structure of CISS data and is resistant to overfitting on moderate-sized datasets. Missing values, which are common in field data, can be accommodated through imputation or through implementation-specific handling, depending on the RF library used. Most importantly, RF captures interaction effects without requiring their explicit specification: the tree-splitting process automatically discovers that the injury risk associated with a given delta-v is modulated by occupant age, restraint status, and vehicle type simultaneously.

Feature importance from RF is a particularly valuable analytical tool in a regulatory context. Permutation importance and SHAP (SHapley Additive exPlanations) values quantify each predictor's contribution to injury risk — providing the kind of mechanistic insight that pure black-box predictions cannot. When delta-v, occupant age, and belt use emerge as the dominant predictors (as biomechanical theory would predict), this correspondence serves as a model sanity check. When unexpected features rank highly, it warrants investigation of potential confounders.

Random Forest IRF — Ensemble Prediction

P̂(injury | x) = (1/B) Σᵦ hᵦ(x)

where B is the number of trees, hᵦ(x) is the injury probability
predicted by tree b for feature vector x

For calibrated probability outputs (required for regulatory applications), Platt scaling or isotonic regression is applied to the raw RF probability estimates. Uncalibrated RF probabilities are systematically compressed toward 0.5.
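The two steps above, ensemble averaging and post-hoc recalibration, can be sketched in plain Python. Of the two calibration options named, this example shows isotonic regression, implemented with the pool-adjacent-violators algorithm. The scores and outcomes are toy values, and a production pipeline would more likely use a library implementation fitted on held-out data.

```python
from typing import List, Sequence

def ensemble_probability(tree_probs: Sequence[float]) -> float:
    """Average per-tree injury probabilities: (1/B) * sum over trees of h_b(x)."""
    return sum(tree_probs) / len(tree_probs)

def isotonic_fit(scores: Sequence[float], labels: Sequence[int]) -> List[float]:
    """Pool-adjacent-violators: non-decreasing fitted rates, returned in score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block holds [sum of labels, number of cases]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the newest block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted: List[float] = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted

# Toy raw ensemble scores and observed AIS >= 3 outcomes (hypothetical).
raw_scores = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
outcomes = [0, 0, 1, 0, 1, 1]
calibrated = isotonic_fit(raw_scores, outcomes)
print(calibrated)  # [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
```

The out-of-order pair (an injury at 0.4, no injury at 0.5) is pooled into a single rate of 0.5, which is how the method enforces a monotone score-to-probability map.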

Gradient Boosting for Tabular Data Performance

Gradient boosting — specifically XGBoost and LightGBM — consistently achieves state-of-the-art performance on tabular data prediction tasks, including injury prediction from crash data. The sequential tree-building process, where each tree corrects residuals from its predecessors, is particularly effective at capturing complex dose-response relationships in the middle of the delta-v distribution where injury probability is most sensitive to predictor values.

For MAIS prediction, gradient boosting's native handling of missing values is a meaningful practical advantage: CISS records systematically lack EDR data for older vehicles, certain ATD measurements may not be available for non-NCAP crash configurations, and injury coding completeness varies across body regions. XGBoost's sparsity-aware algorithm handles these patterns without imputation, avoiding the introduction of imputation-induced bias into the training set.

Class Imbalance in Severe Injury Prediction

Severe crash injuries (AIS 4–6, MAIS 5–6) are rare events even in a crash investigation sample. In CISS, serious-to-critical injuries represent approximately 8–12% of all occupant cases, with MAIS 5 and 6 below 2%. Standard ML training objectives optimize overall accuracy, which can result in models that achieve high accuracy by predicting "no serious injury" for every observation — a useless classifier from a safety standpoint.

Three complementary strategies address this: class weighting (penalizing misclassification of the minority class more heavily in the loss function), SMOTE (Synthetic Minority Over-sampling Technique, generating synthetic serious-injury cases in feature space), and threshold optimization (selecting a classification threshold that maximizes a utility-weighted metric rather than overall accuracy). For regulatory applications, recall on the serious-injury class is typically the binding constraint — a safety criterion that misses 30% of serious injuries is not an acceptable FMVSS performance criterion regardless of overall accuracy.
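Two of these strategies, class weighting and threshold selection under a recall floor, are simple enough to sketch directly. The weighting rule shown (n divided by the number of classes times the class count) is one common convention and is an assumption here, as are the toy probabilities and the function names.

```python
from typing import Dict, Sequence

def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class by n / (n_classes * n_in_class), so rare classes weigh more."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * sum(1 for y in labels if y == c)) for c in classes}

def recall_at(threshold: float, probs: Sequence[float], labels: Sequence[int]) -> float:
    """Share of true serious-injury cases flagged at the given threshold."""
    positives = [p for p, y in zip(probs, labels) if y == 1]
    return sum(p >= threshold for p in positives) / len(positives)

def highest_threshold_meeting_recall(probs: Sequence[float], labels: Sequence[int],
                                     min_recall: float) -> float:
    """Largest candidate threshold whose serious-injury recall meets the floor."""
    for t in sorted(set(probs), reverse=True):
        if recall_at(t, probs, labels) >= min_recall:
            return t
    raise ValueError("no threshold reaches the requested recall")

# Toy data: 2 serious-injury cases among 10 occupants (hypothetical).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.35, 0.80]

weights = balanced_class_weights(labels)
threshold = highest_threshold_meeting_recall(probs, labels, min_recall=1.0)
print(weights)    # {0: 0.625, 1: 2.5}
print(threshold)  # 0.35
```

Lowering the threshold to 0.35 captures both serious injuries at the cost of one false positive, which is the trade-off a recall-constrained criterion accepts.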

Section 06

Survival Analysis in Crash Biomechanics

Survival analysis — the statistical framework developed for time-to-event data in clinical medicine — maps naturally onto injury biomechanics. The conceptual substitution is straightforward: replace "time until death or disease" with "loading magnitude until injury." An occupant subjected to increasing crash severity traverses a dose-response curve from no injury to AIS 1 to AIS 3 to fatality. Survival analysis provides a principled, non-parametric toolkit for estimating this curve from observed data, with the critical capability to adjust for covariates that classical IRFs ignore.

Kaplan-Meier Injury Probability Curves

The Kaplan-Meier (KM) estimator provides a non-parametric estimate of the cumulative injury probability as a function of delta-v, requiring no assumptions about the functional form of the dose-response relationship. Applied to CISS data stratified by injury outcome (AIS ≥ 3 vs. not), KM curves produce an empirical IRF that directly reflects the observed population — not a parametric fit that may misspecify the functional form in the tails.

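The standard KM estimator handles right-censored observations directly. In a dose-response setting, censoring describes what is known about an occupant's injury threshold, not how precisely delta-v was measured. An occupant who was not injured at the observed delta-v is right-censored: injury might have occurred at higher loading. An occupant who was injured is left-censored: the injury may have occurred at a lower delta-v than the observed crash. Field crash data therefore contain both kinds of censoring, and the left-censored cases call for extensions of KM, such as the Turnbull estimator for interval-censored data, or for parametric survival models. Imprecise delta-v, common when EDR data is unavailable and reconstruction produces a range estimate, is a separate measurement-error problem that should be handled alongside censoring rather than confused with it.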
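A minimal sketch of the product-limit calculation with delta-v in place of time is shown below. As a simplifying assumption, it treats each injured occupant's observed delta-v as the event value and each uninjured occupant as right-censored; it does not handle left- or interval-censoring, and it ignores survey weights. The data are toy values.

```python
from typing import List, Sequence, Tuple

def kaplan_meier(delta_v: Sequence[float], injured: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (delta-v, cumulative injury probability) at each delta-v with an injury."""
    data = sorted(zip(delta_v, injured))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        v = data[i][0]
        events = removed = 0
        # Gather every case observed at this delta-v.
        while i < len(data) and data[i][0] == v:
            events += data[i][1]
            removed += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((v, 1.0 - survival))
        at_risk -= removed  # injured and censored cases both leave the risk set
    return curve

# Toy sample: delta-v in mph, 1 = AIS >= 3 injury, 0 = not injured (right-censored).
dv = [10, 15, 20, 25, 30, 40]
inj = [0, 0, 1, 0, 1, 1]
curve = kaplan_meier(dv, inj)
print(curve)  # [(20, 0.25), (30, 0.625), (40, 1.0)]
```

The step heights depend on how many cases remain at risk at each delta-v, which is why uninjured low-severity cases still shape the curve.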

Cox Proportional Hazards Model

The Cox Proportional Hazards (Cox PH) model extends the non-parametric KM approach to multivariate covariate adjustment — the critical capability that classical single-parameter IRFs lack. In the crash biomechanics context, the Cox PH model estimates the hazard of injury exceedance as a function of delta-v, conditional on occupant and vehicle covariates.

Cox Proportional Hazards — Crash Biomechanics Formulation

h(v | X) = h₀(v) · exp(β₁·age + β₂·belt + β₃·PDOF_cos + β₄·mass_ratio + …)

where v is delta-v (the "time" variable), h₀(v) is the baseline hazard function
(estimated non-parametrically), and X is the covariate vector

The proportional hazards assumption — that covariates scale the baseline hazard multiplicatively and independently of v — should be tested using Schoenfeld residuals. Crash data frequently violates this assumption for age and restraint use, requiring time-varying coefficient extensions or stratification.
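To make the formulation concrete, the sketch below turns a Cox-style linear predictor into a covariate-adjusted injury probability via P = 1 − exp(−H₀(v)·HR), where H₀(v) is the baseline cumulative hazard at delta-v v. The coefficients, the baseline value, and the covariate coding (age in decades, belt use as 0/1) are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict

# Hypothetical log-hazard coefficients, not fitted to any data.
BETA: Dict[str, float] = {"age_decades": 0.25, "belted": -0.9}

def hazard_ratio(x: Dict[str, float], ref: Dict[str, float]) -> float:
    """exp(beta . (x - ref)): multiplicative shift of the hazard relative to a reference occupant."""
    return math.exp(sum(BETA[k] * (x[k] - ref[k]) for k in BETA))

def injury_probability(baseline_cum_hazard: float, x: Dict[str, float],
                       ref: Dict[str, float]) -> float:
    """1 - exp(-H0(v) * HR), with H0(v) defined for the reference occupant."""
    return 1.0 - math.exp(-baseline_cum_hazard * hazard_ratio(x, ref))

reference = {"age_decades": 3.0, "belted": 1.0}   # belted 30-year-old
older = {"age_decades": 7.0, "belted": 1.0}       # belted 70-year-old
H0_AT_V = 0.2                                     # assumed baseline cumulative hazard at some delta-v

hr = hazard_ratio(older, reference)
p_ref = injury_probability(H0_AT_V, reference, reference)
p_old = injury_probability(H0_AT_V, older, reference)
print(round(hr, 3), round(p_ref, 3), round(p_old, 3))
```

Because the hazard ratio multiplies the whole baseline curve, the same delta-v yields a different injury probability for each occupant profile, which is exactly what a single-parameter IRF cannot express.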

The key insight from Cox PH modeling of crash data is that the age coefficient is large and statistically dominant. A 70-year-old occupant in a 30 mph delta-v crash faces an injury risk substantially higher than a 30-year-old in the same crash — a relationship that classical ATD-based IRFs cannot represent because the Hybrid III has no age. Covariate-adjusted IRFs derived from Cox PH models are therefore more representative of the actual occupant population than their ATD-based predecessors.

Competing Risks for Multi-Region Injury Modeling

Real-world crash occupants face multiple simultaneous injury risks — head, thorax, spine, lower extremity — that are not independent. A frontal impact with severe intrusion may preclude thorax injury because the occupant contacts the steering wheel before the airbag deploys, or it may compound thorax injury through belt interaction with redirected occupant kinematics. The competing risks framework, implemented via the Fine-Gray subdistribution hazard model, models each injury region's cumulative incidence function while treating other injury regions as competing events.

For NHTSA's research purposes, competing risks models provide a unified analytical framework for studying the priority ordering of injury countermeasures. If reducing head injury risk by 20% (through improved airbag geometry) is associated with a 5% increase in thorax injury risk (through altered occupant kinematics), the competing risks model quantifies this trade-off explicitly — a capability that single-region IRFs cannot provide.

Section 07

Ensemble Methods for MAIS Prediction

Maximum Abbreviated Injury Scale (MAIS) prediction — estimating the worst-body-region injury severity for a given crash occupant — is the most consequential prediction task in IRF development. MAIS drives NCAP scoring, FMVSS compliance, and crash cost modeling. It is also an intrinsically ordinal outcome: the severity ordering MAIS 0 < 1 < 2 < 3 < 4 < 5 < 6 carries information that binary classification discards and nominal classification ignores.

Ordinal Treatment of MAIS

Three approaches exist for incorporating the ordinal structure of MAIS into ML models. Ordinal logistic regression (proportional odds model) is the classical approach — it models the cumulative log-odds of MAIS ≥ k for each threshold k, sharing a common effect of predictors across thresholds. It is interpretable and well-calibrated but imposes the proportional odds assumption, which CISS data routinely violates for age and restraint interactions. Multi-class classification treats MAIS levels as nominal categories, ignoring the ordering — a loss of information that increases prediction error on out-of-sample data. Regression with ordinal loss (using quadratic weighted kappa as the optimization target) preserves the ordering while permitting flexible non-linear estimation.
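The proportional odds structure described above can be sketched as follows: a single linear predictor η is compared against ordered cutpoints, giving P(MAIS ≥ k) = sigmoid(η − θₖ), and the full MAIS distribution follows by differencing. The cutpoints and the value of η are hypothetical.

```python
import math
from typing import List, Sequence

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def mais_distribution(eta: float, cutpoints: Sequence[float]) -> List[float]:
    """P(MAIS = 0..K) from cumulative exceedance probabilities sigmoid(eta - theta_k)."""
    if list(cutpoints) != sorted(cutpoints):
        raise ValueError("cutpoints must be non-decreasing")
    # exceed[k] = P(MAIS >= k); P(MAIS >= 0) is 1 and P(MAIS >= K+1) is 0.
    exceed = [1.0] + [sigmoid(eta - c) for c in cutpoints] + [0.0]
    return [exceed[k] - exceed[k + 1] for k in range(len(cutpoints) + 1)]

# Hypothetical cutpoints for MAIS >= 1 ... MAIS >= 6.
CUTPOINTS = [0.0, 1.5, 3.0, 4.0, 5.0, 6.0]

dist = mais_distribution(3.0, CUTPOINTS)
p_mais3_plus = sum(dist[3:])
print([round(p, 3) for p in dist], round(p_mais3_plus, 3))
```

The shared η across all thresholds is the proportional odds assumption itself; relaxing it means letting the predictor effect vary by threshold.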

Stacked Ensemble Architecture

Stacked generalization (stacking) combines predictions from multiple base learners via a meta-learner, typically outperforming any individual base model on complex tabular tasks. For MAIS prediction from CISS data, the recommended architecture combines base learners that are complementary in their error structures: a gradient boosting model that excels on frequent crash configurations in the training data, a random forest that provides more stable probability estimates in data-sparse regions, and a Cox PH model that imposes biomechanically grounded structure on the delta-v dose-response.

Layer 0, Data Inputs: CISS Crash Records; EDR / Delta-v; NCAP Vehicle Scores; FARS Linkage. Inputs pass through feature engineering and survey-weight normalization.
Layer 1, Base Learners (cross-validated out-of-fold predictions): XGBoost (Ordinal); Random Forest (Ordinal); Cox PH (Delta-v); Neural Ordinal Net. Out-of-fold predictions are passed on as meta-features.
Layer 2, Meta-Learner: Regularized Ordinal Logistic (Lasso); Platt Calibration Layer. This layer produces the calibrated probability output.
Output, Regulatory Deliverables: P(MAIS ≥ 3 | crash params); Full MAIS Distribution; SHAP Explanations; Calibration Diagnostic.

Figure 1. Stacked Ensemble Architecture for MAIS Prediction from CISS Data. Survey-weighted training preserves national representativeness. Out-of-fold predictions prevent meta-learner overfitting.
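The out-of-fold step is the part of stacking most often implemented incorrectly, so a small sketch may help. Each base learner is refit with one fold held out, and only the held-out predictions become meta-features. The toy base learners (injury rate by one categorical feature), the interleaved fold assignment, and all names are assumptions for illustration; they stand in for the gradient boosting, random forest, and Cox PH learners named in the text.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, int]
Model = Callable[[Row], float]
Learner = Callable[[Sequence[Row], Sequence[int]], Model]

def rate_by(key: str) -> Learner:
    """Toy base learner: observed injury rate within each level of one feature."""
    def fit(rows: Sequence[Row], labels: Sequence[int]) -> Model:
        overall = sum(labels) / len(labels)
        totals: Dict[int, List[int]] = {}
        for r, y in zip(rows, labels):
            s = totals.setdefault(r[key], [0, 0])
            s[0] += y
            s[1] += 1
        rates = {level: s[0] / s[1] for level, s in totals.items()}
        return lambda row: rates.get(row[key], overall)
    return fit

def out_of_fold(learners: Sequence[Learner], rows: Sequence[Row],
                labels: Sequence[int], n_folds: int) -> List[List[float]]:
    """meta[i][j] = prediction for case i from learner j fitted without case i's fold."""
    meta = [[0.0] * len(learners) for _ in rows]
    for fold in range(n_folds):
        train = [i for i in range(len(rows)) if i % n_folds != fold]
        held = [i for i in range(len(rows)) if i % n_folds == fold]
        for j, learner in enumerate(learners):
            model = learner([rows[i] for i in train], [labels[i] for i in train])
            for i in held:
                meta[i][j] = model(rows[i])
    return meta

rows = [{"belted": 1, "high_dv": 0}, {"belted": 0, "high_dv": 1},
        {"belted": 1, "high_dv": 1}, {"belted": 0, "high_dv": 0},
        {"belted": 1, "high_dv": 0}, {"belted": 0, "high_dv": 1}]
labels = [0, 1, 1, 0, 0, 1]

meta = out_of_fold([rate_by("belted"), rate_by("high_dv")], rows, labels, n_folds=3)
print(meta[0], meta[3])  # [0.5, 0.0] [1.0, 0.0]
```

A meta-learner would then be fitted on `meta` against `labels`; because no row's meta-features were produced by a model that saw that row, the meta-learner cannot simply reward the base learner that memorized the training data.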

Survey Weights and National Representativeness

CISS is a stratified probability sample, not a simple random sample. Each case carries a survey weight reflecting how many real-world crashes that investigation case represents in the national crash population. Standard ML training procedures that treat all cases as equally weighted will bias the resulting model toward over-represented strata (urban crashes, accessible crash scenes, cooperative jurisdictions) and away from under-represented ones. Survey-weighted loss functions — where each training case is weighted by its inverse sampling probability — are the methodologically correct treatment and are straightforwardly implemented in all major ML frameworks.
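A survey-weighted loss is a small change to the usual objective. The sketch below computes a weighted binary log loss in plain Python; the weights and predictions are toy values. Many libraries expose the same idea through a per-case weight argument at fit time (for example, `sample_weight` in scikit-learn estimators).

```python
import math
from typing import Sequence

def weighted_log_loss(probs: Sequence[float], labels: Sequence[int],
                      weights: Sequence[float], eps: float = 1e-12) -> float:
    """Binary log loss where each case counts in proportion to its survey weight."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)

# Toy example: the second case represents nine times as many real-world crashes.
probs = [0.9, 0.5]
labels = [1, 0]
unweighted = weighted_log_loss(probs, labels, [1.0, 1.0])
weighted = weighted_log_loss(probs, labels, [100.0, 900.0])
print(round(unweighted, 4), round(weighted, 4))
```

The weighted loss is dominated by the poorly predicted high-weight case, so an optimizer minimizing it is pushed toward the strata that matter most nationally.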

Section 08

Feature Engineering from Crash Data

Model performance in injury prediction is bounded by feature quality. A gradient boosting model given imprecise delta-v estimates will underperform a logistic regression given accurate ones. Feature engineering for CISS-based IRF development is therefore not a peripheral concern — it is the binding constraint on model quality, particularly as crash investigation methodology continues its transition from field reconstruction to EDR-based measurement.

Primary Kinematic Features

Delta-v remains the single most important predictor of injury severity across crash configurations. EDR-extracted delta-v is the gold standard; field reconstruction from crush measurements (using CRASH3 or equivalent algorithms) introduces reconstruction uncertainty of ±15–25% that should be propagated through the model as a measurement error covariate rather than ignored. For vehicles without a recoverable EDR record, which is common among vehicles built before the September 1, 2012 compliance date of 49 CFR Part 563, reconstruction-based delta-v with appropriately wide confidence intervals is the only option.

PDOF decomposition transforms the clock-face direction variable into continuous Cartesian components, longitudinal (cos PDOF) and lateral (sin PDOF), enabling the model to treat oblique crashes as intermediate between pure frontal and pure side configurations rather than as a discrete third category. This decomposition substantially improves model performance on the increasingly prevalent oblique crash mode targeted by oblique moving deformable barrier (MDB) and small overlap rigid barrier (SORB) test procedures.
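The decomposition itself is a one-line trigonometric mapping. The sketch below assumes clock-face input with 12 o'clock as a pure frontal impact and 3 o'clock as a right-side impact, at 30 degrees per clock hour; the sign convention and function name are assumptions, and a dataset that codes PDOF in degrees would skip the hour conversion.

```python
import math
from typing import Tuple

def pdof_components(clock_hour: float) -> Tuple[float, float]:
    """Map clock-face PDOF to (longitudinal, lateral) unit components."""
    angle = math.radians((clock_hour % 12) * 30.0)  # 12 o'clock maps to 0 degrees
    return math.cos(angle), math.sin(angle)

frontal = pdof_components(12)
oblique = pdof_components(1)
right_side = pdof_components(3)
print(frontal, tuple(round(c, 3) for c in oblique), tuple(round(c, 3) for c in right_side))
```

A 1 o'clock impact lands between the frontal and side cases on both axes, so a tree or spline can interpolate across crash directions instead of treating each clock position as an unrelated category.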

Occupant Vulnerability Features

Age: The single largest non-kinematic predictor. The hazard ratio for MAIS ≥ 3 increases approximately 2.5–3.5x from age 25 to age 75, controlling for other predictors. Age enters the model best as a continuous predictor with a spline basis — the injury-age relationship is nonlinear, with accelerating risk above approximately 60.
Sex: Female occupants show higher thorax and lower extremity injury risk at matched delta-v, partly attributable to ATD calibration bias (the Hybrid III female ATD is a scaled male, not a biomechanically female model) and partly to real anatomical differences. Its effect is smaller than age but consistently significant.
BMI: Higher BMI is associated with increased abdominal and thorax injury risk through altered belt load distribution and airbag interaction geometry. Available in CISS from investigator estimation; subject to measurement uncertainty.
Restraint system features: Belt use (binary) and belt pretensioner deployment, airbag deployment, airbag type (frontal/curtain/thorax), and seating position. Restraint features interact strongly with delta-v — their protective effect is delta-v-dependent and should enter the model as interaction terms.

Vehicle Compatibility Features

Vehicle mass ratio (striking vehicle curb weight / struck vehicle curb weight) is the primary compatibility metric for multi-vehicle crashes. It captures the energy partitioning between vehicles that determines occupant loading in the struck vehicle. Structural category interactions (car-to-LTV vs. car-to-car vs. LTV-to-LTV) encode systematic stiffness and geometry differences that mass ratio alone does not capture: a high-profile LTV striking a car in a side impact has different injury consequences than a car striking a car, even at the same mass ratio and delta-v.

Intrusion as a Structural Feature

Compartment intrusion — the deformation of the occupant compartment measured in centimeters by body zone — is a critical predictor for lower extremity injury (particularly for frontal crashes) and spine injury (for side and rear impacts). It is systematically under-utilized in classical IRF development because ATDs do not experience intrusion in the same way human occupants do. In CISS, intrusion measures are coded by investigators from physical inspection of the damaged vehicle; their inclusion in ML IRF models consistently improves prediction of AIS 3+ lower extremity injury and is a primary source of performance advantage over ATD-based classical IRFs.

🔧 The EDR Extraction Imperative

The quality of kinematic features, and therefore model performance, is gated by the completeness and accuracy of EDR data extraction. NHTSA's EDR extraction protocols have evolved substantially since the standardized EDR data requirements of 49 CFR Part 563 took effect for EDR-equipped vehicles manufactured on or after September 1, 2012. For earlier vehicle cohorts in harmonized CISS/NASS data, EDR availability is substantially lower and extraction fidelity more variable. Any ML IRF development program should explicitly quantify the fraction of training cases with EDR-derived vs. reconstruction-derived delta-v and validate model performance separately on each stratum, because the two strata may need distinct models.
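The stratum check recommended above can be sketched as a simple grouped report: the share of cases from each delta-v source, together with the mean predicted and observed injury rates within that source. The field names (`dv_source`, `pred`, `injured`) and the toy records are hypothetical, and survey weights are omitted for brevity.

```python
from collections import defaultdict
from typing import Dict, List

def stratum_report(cases: List[dict]) -> Dict[str, Dict[str, float]]:
    """Per delta-v source: share of cases, mean predicted risk, observed injury rate."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for c in cases:
        groups[c["dv_source"]].append(c)
    n = len(cases)
    return {
        src: {
            "share": len(g) / n,
            "mean_pred": sum(c["pred"] for c in g) / len(g),
            "observed_rate": sum(c["injured"] for c in g) / len(g),
        }
        for src, g in groups.items()
    }

# Toy records: three EDR-derived cases and one reconstruction-derived case.
cases = [
    {"dv_source": "edr", "pred": 0.8, "injured": 1},
    {"dv_source": "edr", "pred": 0.2, "injured": 0},
    {"dv_source": "edr", "pred": 0.1, "injured": 0},
    {"dv_source": "reconstruction", "pred": 0.5, "injured": 1},
]
report = stratum_report(cases)
for source, stats in report.items():
    print(source, {k: round(v, 3) for k, v in stats.items()})
```

A large gap between mean predicted and observed rates in one stratum but not the other is the signal that a single pooled model is not serving both data sources.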

Section 09

Validation Framework — Regulatory Alignment

Validation for a regulatory application is categorically different from validation for a research publication. The gold standard in ML validation — out-of-sample AUC-ROC on a held-out test set — is necessary but not sufficient. A model that achieves 0.85 AUC but predicts P(MAIS ≥ 3) = 0.60 when the true rate is 0.35 will produce tolerance criteria that are systematically wrong, and regulatory errors in either direction have large real-world consequences. Calibration, not just discrimination, is the binding constraint.

Cross-Validation Strategy

Standard random k-fold cross-validation underestimates out-of-sample error for crash data because crashes cluster by geographic area, vehicle cohort, and model year, so correlated cases end up on both sides of a random split. A model trained on 2020 CISS data may encounter 2019 vehicles (slight feature shift) or 2025 vehicles (substantial feature shift representing newer restraint technology and structural design). Temporal out-of-sample validation, for example training on CISS 2017–2021 and validating on 2022–2023, provides a more honest performance estimate for prospective applications. Vehicle cohort stratification, in which a specific vehicle model year group is held out, tests generalization to structural designs not seen in training.
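The two splitting strategies can be sketched in a few lines. The field names `crash_year` and `model_year_group` are placeholders for whatever the harmonized dataset actually uses, and the toy records exist only to make the example runnable.

```python
def temporal_split(cases, last_train_year):
    """Train on crashes up to last_train_year; validate on later crashes."""
    train = [c for c in cases if c["crash_year"] <= last_train_year]
    test = [c for c in cases if c["crash_year"] > last_train_year]
    return train, test


def leave_one_cohort_out(cases, key="model_year_group"):
    """Yield (held_out_cohort, train, test), holding out one cohort at a time."""
    for held_out in sorted({c[key] for c in cases}):
        train = [c for c in cases if c[key] != held_out]
        test = [c for c in cases if c[key] == held_out]
        yield held_out, train, test


# Toy case list (hypothetical fields and values).
rows = [
    (2017, "2010-2014"), (2018, "2010-2014"), (2019, "2015-2019"),
    (2020, "2015-2019"), (2021, "2015-2019"), (2022, "2020-2024"),
    (2023, "2020-2024"),
]
cases = [
    {"case_id": i, "crash_year": year, "model_year_group": group}
    for i, (year, group) in enumerate(rows)
]

train, test = temporal_split(cases, last_train_year=2021)
cohort_folds = list(leave_one_cohort_out(cases))
```

Both splitters keep whole groups on one side of the boundary, which is the property a random k-fold split lacks.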

Calibration Assessment

Calibration is the agreement between predicted injury probabilities and observed injury rates. A well-calibrated model predicts P = 0.30 for a group of occupants of whom 30% actually sustain AIS ≥ 3 injuries. Calibration is assessed graphically via reliability diagrams (observed rate vs. predicted probability in decile bins) and quantitatively via the Expected Calibration Error (ECE) and Brier Score.
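For concreteness, the Brier Score and an equal-count-bin ECE can be computed as follows. This is a plain-Python sketch with illustrative function names. It is unweighted, so a production version would need to carry CISS survey weights through both statistics.

```python
def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def reliability_bins(y, p, n_bins=10):
    """Equal-count bins ordered by predicted probability.

    Returns (mean_predicted, observed_rate, count) per non-empty bin,
    i.e. the points of a reliability diagram.
    """
    order = sorted(range(len(p)), key=lambda i: p[i])
    n = len(order)
    bins = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue
        mean_p = sum(p[i] for i in idx) / len(idx)
        observed = sum(y[i] for i in idx) / len(idx)
        bins.append((mean_p, observed, len(idx)))
    return bins


def expected_calibration_error(y, p, n_bins=10):
    """Count-weighted mean absolute gap between observed and predicted."""
    n = len(y)
    return sum(
        count / n * abs(observed - mean_p)
        for mean_p, observed, count in reliability_bins(y, p, n_bins)
    )


# Toy example: six occupants, three bins of two.
y_obs = [0, 0, 1, 0, 1, 1]
p_hat = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
ece = expected_calibration_error(y_obs, p_hat, n_bins=3)
brier = brier_score(y_obs, p_hat)
```

With ten bins on a real dataset, `reliability_bins` yields the decile points plotted in a reliability diagram.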

For regulatory applications, the most consequential calibration region is P ∈ [0.40, 0.60] — the neighborhood of the 50th-percentile tolerance criterion. Miscalibration in this region shifts the implied tolerance criterion (the ATD performance value corresponding to P = 0.50) up or down, directly affecting FMVSS compliance thresholds. Post-hoc calibration via Platt scaling or isotonic regression should be applied to all base models before regulatory use.
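The sketch below illustrates both ideas in the paragraph above under simplifying assumptions. The first part is a step-function isotonic calibrator fitted with the pool-adjacent-violators algorithm (library implementations typically interpolate between steps). The second part shows how a Platt-style recalibration of a single-predictor logistic IRF moves the P = 0.50 criterion. All coefficients are invented for illustration.

```python
import math
from itertools import groupby


def isotonic_fit(scores, outcomes):
    """Pool-adjacent-violators fit; returns [(upper_score, probability), ...]."""
    pairs = sorted(zip(scores, outcomes))
    blocks = []  # each block: [sum of outcomes, count, largest score covered]
    for score, group in groupby(pairs, key=lambda t: t[0]):
        ys = [y for _, y in group]  # tied scores enter as one weighted point
        blocks.append([sum(ys), len(ys), score])
        # Merge backwards while the block means are not non-decreasing.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def isotonic_predict(model, score):
    """Calibrated probability for a raw score (step function)."""
    for upper, prob in model:
        if score <= upper:
            return prob
    return model[-1][1]


def implied_criterion(intercept, slope, target=0.5):
    """Predictor value x at which sigmoid(intercept + slope * x) = target."""
    return (math.log(target / (1.0 - target)) - intercept) / slope


def platt_adjust(intercept, slope, scale, shift):
    """Platt scaling on the logit: new logit = scale * old logit + shift."""
    return scale * intercept + shift, scale * slope


# Part 1: toy held-out scores and outcomes.
model = isotonic_fit([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 0, 1, 1])

# Part 2: hypothetical IRF logit = -4.0 + 0.1 * x, recalibrated by shift -0.5.
raw_x50 = implied_criterion(-4.0, 0.1)
cal_x50 = implied_criterion(*platt_adjust(-4.0, 0.1, scale=1.0, shift=-0.5))
```

In this invented case a modest downward shift in the logit moves the 50% criterion from about 40 to about 45 in predictor units, which is the mechanism by which miscalibration near P = 0.50 propagates into a compliance threshold. Calibrators should be fitted on data held out from model training.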

Performance Metrics for Regulatory Context

Metric	Formula / Method	Regulatory Relevance	Target
AUC-ROC	Area under the receiver operating characteristic curve for binary injury outcome	Overall discrimination; required baseline	> 0.80
Brier Score	Mean squared error of probability predictions	Combined discrimination + calibration	< 0.15
ECE	Expected Calibration Error (weighted mean calibration gap)	Calibration quality at tolerance criterion	< 0.03
QWK	Quadratic Weighted Kappa for MAIS ordinal prediction	Ordinal severity accuracy for NCAP scoring	> 0.65
Serious Injury Recall	Sensitivity for MAIS ≥ 3 at operational threshold	Safety criterion — missed serious injuries	> 0.80
Temporal Degradation	AUC drop from training cohort to +3yr holdout	Model longevity and update cadence planning	< 0.05
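Of the metrics in the table, Quadratic Weighted Kappa is the least familiar outside ordinal modeling, so a minimal reference implementation may help. It assumes MAIS levels are coded as integers from 0 to `n_levels - 1`, and the function name is illustrative.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_levels):
    """QWK = 1 - sum(w * observed) / sum(w * expected).

    Weights w[i][j] = (i - j)^2 / (n_levels - 1)^2 penalize a two-level
    miss four times as heavily as a one-level miss.
    """
    n = len(y_true)
    observed = [[0] * n_levels for _ in range(n_levels)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    true_hist = [sum(row) for row in observed]
    pred_hist = [
        sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)
    ]

    num = 0.0
    den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = (i - j) ** 2 / (n_levels - 1) ** 2
            num += w * observed[i][j]
            # Expected count under independence of the two marginals.
            den += w * true_hist[i] * pred_hist[j] / n

    # den is zero only when both ratings use one shared level.
    return 1.0 - num / den if den > 0 else 1.0


# Toy check with MAIS coded 0 to 6 (seven levels).
coded = [0, 1, 2, 3, 4, 2, 1]
perfect = quadratic_weighted_kappa(coded, coded, n_levels=7)
off_by_one = quadratic_weighted_kappa(coded, [0, 2, 2, 3, 3, 2, 1], n_levels=7)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance given the two marginal distributions.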

The Ground Truth Problem

AIS coding, the ground truth for all injury outcome models, is subject to inter-rater variability: coders are estimated to disagree on MAIS by one level in approximately 10–15% of cases. Coding is performed by trained CISS investigators from medical records, and coding accuracy varies with the completeness of available medical documentation. For ML models that will be used to derive regulatory criteria, this noise floor in the ground truth is a fundamental limitation: measured against these labels, no model can be shown to discriminate better than the coding reliability of the training data allows. Formal uncertainty quantification for the resulting IRF should account for coding noise as a source of irreducible model uncertainty, distinct from finite-sample statistical uncertainty.
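One way to get a feel for the noise floor is a closed-form calculation under a deliberately simple assumption: binary MAIS ≥ 3 labels are flipped independently and symmetrically at some rate. That flip rate is a hypothetical input. It is not the 10–15% one-level disagreement figure itself, since only disagreements that cross the MAIS 2/3 boundary change the binary label. The function gives the AUC that an oracle, which ranks every truly injured occupant above every truly uninjured one, would appear to achieve when scored against the noisy labels.

```python
def oracle_auc_under_label_noise(prevalence, flip_rate):
    """Measured AUC of a perfect ranker when labels are flipped at random.

    Assumes 0 < prevalence < 1 and symmetric, independent label flips.
    """
    pi, e = prevalence, flip_rate
    # Share of truly injured occupants among those coded injured (a)
    # and among those coded uninjured (b).
    a = pi * (1 - e) / (pi * (1 - e) + (1 - pi) * e)
    b = pi * e / (pi * e + (1 - pi) * (1 - e))
    # A (coded injured, coded uninjured) pair is ranked correctly when it is
    # truly (injured, uninjured), incorrectly when truly reversed, and is a
    # coin flip when both occupants share the same true class.
    return a * (1 - b) + 0.5 * (a * b + (1 - a) * (1 - b))


clean = oracle_auc_under_label_noise(prevalence=0.5, flip_rate=0.0)
noisy = oracle_auc_under_label_noise(prevalence=0.5, flip_rate=0.125)
```

Under these assumptions even a perfect model shows a measured AUC below 1, so observed performance ceilings should be read against the coding reliability rather than against an ideal of 1.0.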

Section 10

Interactive Model Performance Explorer

The following reference compares classical IRF approaches against ML-based methods across injury region and prediction task. Performance estimates reflect published literature on field crash data applications.

Section 11

The Continuum Approach

Continuum Resources' applied data science capabilities — built on our AI/ML platform portfolio and demonstrated through production deployments in defense and enterprise environments — map directly onto the analytical and computational research workstreams identified in this paper. We describe three specific capability alignments below.

PathRAG: Multi-Source Crash Data Intelligence

PathRAG, Continuum's knowledge graph-augmented retrieval platform, is designed for exactly the kind of multi-source, multi-hop analytical problem that NHTSA's crash data ecosystem presents. The platform ingests structured data from CISS case records, NCAP test reports, and FARS crash summaries into a Neo4j-backed knowledge graph, exposing entity-level queries that traverse relationships across data sources — for example, connecting a specific vehicle make/model to its NCAP structural performance scores, to its representation in CISS crash investigations, to the injury outcomes of its occupants, and to the FMVSS compliance history of its restraint systems.

In a crash data analytics context, multi-hop graph traversal enables questions that flat-file SQL queries cannot efficiently address: "For frontal crashes between 25–35 mph delta-v involving unbelted rear occupants in crossover SUVs model year 2018–2022, what is the observed MAIS distribution by age decile, and how does it compare to NCAP-predicted occupant performance?" This class of query is the foundation of evidence-based countermeasure prioritization.
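To show the shape of such a query without assuming anything about PathRAG's actual schema or API, the sketch below runs the same occupant-to-crash and occupant-to-vehicle hops over small in-memory dictionaries. Every field name and record is hypothetical, ten-year age bands stand in for age deciles, and the NCAP comparison step is omitted.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-made records; not CISS data or a PathRAG schema.
vehicles = {
    "V1": {"body_type": "crossover SUV", "model_year": 2020},
    "V2": {"body_type": "sedan", "model_year": 2019},
}
crashes = {
    "C1": {"mode": "frontal", "delta_v_mph": 30.0},
    "C2": {"mode": "side", "delta_v_mph": 28.0},
}
occupants = [
    {"crash": "C1", "vehicle": "V1", "row": 2, "belted": False, "age": 34, "mais": 3},
    {"crash": "C1", "vehicle": "V1", "row": 2, "belted": False, "age": 37, "mais": 2},
    {"crash": "C1", "vehicle": "V2", "row": 2, "belted": False, "age": 61, "mais": 4},
    {"crash": "C2", "vehicle": "V1", "row": 2, "belted": False, "age": 30, "mais": 1},
    {"crash": "C1", "vehicle": "V1", "row": 1, "belted": True, "age": 45, "mais": 0},
]


def mais_by_age_band(occupants, crashes, vehicles):
    """MAIS counts per ten-year age band for the example cohort."""
    dist = defaultdict(Counter)
    for occ in occupants:
        crash = crashes[occ["crash"]]        # hop: occupant -> crash
        vehicle = vehicles[occ["vehicle"]]   # hop: occupant -> vehicle
        if (crash["mode"] == "frontal"
                and 25 <= crash["delta_v_mph"] <= 35
                and vehicle["body_type"] == "crossover SUV"
                and 2018 <= vehicle["model_year"] <= 2022
                and occ["row"] >= 2
                and not occ["belted"]):
            dist[occ["age"] // 10 * 10][occ["mais"]] += 1
    return {band: dict(counts) for band, counts in dist.items()}


result = mais_by_age_band(occupants, crashes, vehicles)
```

In a graph store the two dictionary lookups would become relationship traversals, and further hops (for example from vehicle to NCAP test record) would extend the same pattern.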

DocOps Pipeline: NHTSA Report and Investigation Data Processing

NHTSA's crash investigation case files are document-rich: each CISS investigation includes narrative crash reports, medical coding worksheets, vehicle damage diagrams, and EDR data extraction reports in a mix of structured and unstructured formats. Continuum's DocOps Pipeline — currently in production with KBR's T&E team for test summary and requirements mapping — is designed for exactly this kind of multi-format document governance problem.

Applied to NHTSA's research environment, DocOps can automate the extraction of structured biomechanical features from investigator narratives, flag coding inconsistencies across case records, and maintain a versioned, auditable data governance trail for research datasets that are used in rulemaking proceedings — a critical requirement given the administrative record standards that apply to FMVSS rulemakings under the National Traffic and Motor Vehicle Safety Act.

ML Platform: IRF Model Development and Validation

Continuum's ML engineering capability — combining the ensemble methods, survival analysis frameworks, and calibration protocols described in this paper — provides the core analytical engine for IRF development from CISS and FARS data. Our team's experience with survey-weighted analysis, ordinal outcome prediction, and regulatory-grade calibration assessment positions us to deliver not just technically accurate models but models that are defensible in a federal rulemaking context.

✓ Continuum Capabilities — Vehicle Safety Research Alignment

CISS/FARS Data Analytics: Survey-weighted ML model development, MAIS prediction, temporal out-of-sample validation pipelines.
Knowledge Graph Integration: PathRAG deployment connecting CISS, NCAP, and FARS data sources with multi-hop query capability for countermeasure research.
Document Processing and Governance: DocOps Pipeline for crash investigation case data extraction, coding consistency validation, and audit-ready data management.
IRF Calibration and Regulatory Alignment: Platt scaling, isotonic regression, ECE assessment, and calibration reporting aligned to FMVSS performance criterion derivation requirements.
Computational Modeling Support: Feature engineering from EDR data, PDOF decomposition, vehicle compatibility metric development, and intrusion-based lower extremity injury prediction.

Section 12

Conclusion

The case for ML-driven injury risk function development is not a speculative argument about future capabilities — it is a practical argument about underutilized existing data. NHTSA has invested decades and hundreds of millions of dollars in building the world's most comprehensive crash investigation data infrastructure. The analytical methods to extract multivariate, interaction-aware, well-calibrated injury risk functions from that data now exist and have been validated in the literature. The primary remaining barriers are methodological standardization, regulatory acceptance criteria for ML-derived IRFs, and the engineering infrastructure to apply these methods rigorously at scale.

Survival analysis provides the statistical bridge between the classical IRF paradigm and modern machine learning — preserving the dose-response intuition that biomechanical engineers and regulators use to reason about tolerance criteria while extending it to the multivariate, covariate-adjusted domain that field crash data enables. Ensemble methods provide the performance advantage on MAIS prediction tasks. Feature engineering from EDR data — combined with careful handling of survey weights, missing data, and inter-rater coding variability — provides the data quality foundation that both approaches require.

The regulatory acceptance path is real but requires deliberate navigation. Models must be calibrated, not merely accurate. Validation must be temporal and structural, not just cross-sectional. Uncertainty quantification must account for coding noise, reconstruction error, and distributional shift across vehicle cohorts. And the resulting IRFs must be interpretable in biomechanical terms — not black boxes that produce correct numbers for opaque reasons, but transparent models whose feature importance and covariate effects align with established biomechanical knowledge.

The next generation of FMVSS injury criteria will be data-derived, covariate-adjusted, and continuously validated against the crash population they are designed to protect. The analytical toolkit to build them exists today.