LLM Defense Evaluation — Continuum Resources WP-CR-2025-09

Section 00

Executive Summary

The rapid proliferation of open-source large language models — Llama 3, Mistral, Falcon, Phi, and their continuously evolving derivatives — has created an unprecedented opportunity and a complex challenge for defense organizations. These models offer the ability to deploy sovereign AI capabilities within classified enclaves, on air-gapped networks, and in operational environments where commercial API-based LLMs are not viable. They also introduce evaluation obligations that defense organizations are not yet equipped to meet: the commercial model evaluation literature does not translate directly to environments where output failures can affect intelligence assessments, mission planning, force protection, or acquisition decisions.

This white paper, authored by Kurt A. Richardson, PhD, presents the LLM Defense Evaluation Framework (LDEF) — a structured, five-dimension methodology for evaluating open-source LLMs for defense deployment. LDEF addresses the unique evaluation requirements of defense contexts: quality assurance that reflects operational accuracy requirements, not just benchmark leaderboard rankings; fairness analysis that identifies demographic and entity-level biases with operational consequence; security evaluation that goes beyond prompt injection testing to assess model backdoor risks, supply chain integrity, and fine-tuning poisoning vulnerabilities; operational performance characterization under the resource constraints of classified environments; and governance assessment for RMF/ATO compliance.

47+

Open-source LLM families with defense-relevant capabilities as of Q1 2025

Fraction of commercial LLM benchmarks designed specifically for defense decision-support accuracy

NIST

AI RMF MEASURE function requires quantified evaluation — LDEF provides the defense-specific metrics

⚡ Core Finding

Commercial LLM benchmarks — MMLU, HellaSwag, HumanEval, TruthfulQA — measure academic and coding performance that does not predict operational quality in defense contexts. A model that ranks in the top 10% on academic benchmarks may have unacceptable failure modes on defense-specific tasks: intelligence analysis, acquisition support, military doctrine reasoning, or classified environment operation. Defense organizations must build their own evaluation infrastructure, not rely on commercial leaderboards.

Section 01

Introduction: The Open-Source LLM Opportunity

Commercial LLM services — GPT-4, Claude, Gemini — have demonstrated transformative potential for productivity, analysis, and decision support across virtually every domain of human activity. For defense organizations, their use is constrained by a fundamental architecture problem: commercial LLMs are cloud-hosted services that require data to leave the organization's control, traverse public networks, and be processed on infrastructure the organization does not own. For classified programs, sensitive intelligence analysis, and operational planning, this architecture is categorically unacceptable.

Open-source LLMs — models whose weights are publicly available for download, deployment, and modification — solve this architecture problem. A Llama 3.1-70B model can be deployed on government-owned hardware, inside a classified enclave, on an air-gapped network, with full control over data handling, logging, and access. No data leaves the boundary. No external API call is made. The model is the government's to configure, fine-tune, and operate as a sovereign AI capability. This is a genuinely transformative opportunity for defense AI adoption.

"Open-source LLMs give defense organizations the ability to deploy AI at every classification level — not just unclassified cloud environments. The evaluation challenge is that the same openness that enables sovereign deployment also means there is no vendor standing behind the model's performance, safety, or security. The organization must evaluate it themselves."

— Kurt A. Richardson, PhD, Head of R&D, Continuum Resources LLC

The Evaluation Gap

The evaluation challenge is as significant as the opportunity. Open-source LLMs arrive without the testing infrastructure, safety tuning validation, or operational performance documentation that would accompany a commercial enterprise software procurement. The model's license may be permissive; its evaluation documentation is typically absent. What evaluation exists — benchmark scores on academic datasets — was designed to compare models in commercial development contexts, not to assess whether a model is safe to deploy for defense decision support.

NIST AI RMF requires that AI systems deployed in government contexts be evaluated against quantified performance, fairness, and reliability criteria before deployment, and monitored continuously during operation. CDAO's Responsible AI guidance requires documented evaluation of AI systems for bias, reliability, and security. DoD Directive 3000.09 (Autonomous Weapons) creates evaluation obligations for AI systems involved in targeting-adjacent decisions. Meeting these obligations requires a defense-specific evaluation framework — which is what LDEF provides.

Relationship to the Continuum Research Series

LDEF is the evaluation counterpart to the adversarial security framework in WP-CR-2025-04 (Prompt Injection & Adversarial Attacks). Where WP-04 focuses on protecting deployed LLMs against adversarial manipulation, LDEF focuses on evaluating candidate LLMs before deployment to determine whether they meet the performance, safety, and security thresholds required for their intended use case. Together, they form the pre-deployment evaluation + operational security architecture for defense LLM programs.

Section 02

Why Open-Source LLMs for Defense

The case for open-source LLMs in defense deployments rests on five operational advantages that commercial API-based LLMs cannot match. Each advantage has direct implications for the evaluation framework required.

ADVANTAGE 01

Sovereign Data Control

All data — queries, retrieved documents, model outputs, interaction logs — remains within the organization's control boundary. No exposure to commercial provider data handling policies, no foreign intelligence service visibility into model usage patterns, no risk of training data incorporation from sensitive organizational queries. Enables deployment on classified networks (SIPRNET, JWICS) where commercial cloud access is prohibited.

ADVANTAGE 02

Fine-Tuning for Domain Adaptation

Open-source weights enable fine-tuning on domain-specific corpora — defense doctrine, acquisition regulations, intelligence analysis tradecraft, technical specifications — that commercial providers cannot be given access to. A fine-tuned model that has learned from thousands of Army field manuals or DFARS regulatory documents will dramatically outperform a general model on domain-specific tasks.

ADVANTAGE 03

Auditability and Explainability

Open weights enable mechanistic interpretability research — tools that examine which parts of the model's internal representations are driving specific outputs. For high-stakes defense applications requiring explainability, the ability to analyze model behavior at the weight level provides accountability that API-access-only commercial models cannot offer.

ADVANTAGE 04

Supply Chain Independence

Commercial LLM providers can change their models, adjust pricing, modify usage policies, or discontinue services. A defense program depending on an external API can lose its AI capability with 30 days notice. Open-source models, once downloaded and deployed, are immutable program assets — they cannot be changed by an external party without the organization's consent.

CRITICAL LIMITATION

Open-Source LLMs Require Active Evaluation — Commercial Models Do Not Have This Advantage

The same openness that enables sovereign deployment means there is no vendor maintaining safety tuning, responding to security vulnerabilities, or providing performance certifications. A commercial provider continuously monitors and updates their models; an open-source deployment at a fixed model version receives no updates unless the organization pulls and redeploys a new version. This places the full evaluation and monitoring burden on the deploying organization — which is the problem LDEF solves.

The Defense Open-Source LLM Landscape

Model Family	Origin	Largest Variant	License	Defense Relevance
Llama 3.x	Meta AI	405B	Llama Community License	General-purpose; strongest performance at scale; ITAR considerations for non-US deployment
Mistral / Mixtral	Mistral AI (EU)	8×22B MoE	Apache 2.0	Strong instruction following; European provenance; MoE efficiency for air-gap deployment
Falcon	TII (UAE)	180B	Falcon License (custom)	Non-Western origin creates evaluation obligation for national security applications; strong multilingual
Phi-3 / Phi-4	Microsoft Research	14B	MIT	High capability per parameter — suitable for edge/disconnected deployment; small footprint
Gemma 2	Google DeepMind	27B	Gemma ToS	Safety-focused training; strong reasoning; Google provenance provides some third-party safety validation
OLMo / OLMO2	Allen Institute for AI	13B	Apache 2.0	Full training transparency — weights, data, code fully public; highest auditability of any major model

Section 03

The Five Evaluation Dimensions

LDEF organizes LLM evaluation for defense deployment across five dimensions that together constitute a comprehensive pre-deployment assessment. Each dimension addresses a different failure mode that could affect operational outcomes or create legal, ethical, or national security risks.

Dimension 1 — Quality Assurance (Task Performance)

Factual Accuracy

Reasoning Quality

Domain Task Performance

Hallucination Rate

Calibration

Robustness to Distribution Shift

QA score threshold → deployment recommendation

Dimension 2 — Fairness & Bias

Demographic Bias

Geopolitical Bias

Entity Representation

Training Data Provenance

Stereotype Propagation

Bias assessment → use-case restrictions and mitigations

Dimension 3 — Security Evaluation

Adversarial Robustness

Backdoor Detection

Supply Chain Integrity

Fine-Tuning Poisoning Risk

Data Memorization

Security findings → accreditation gate (pass/conditional/fail)

Dimension 4 — Operational Performance

Inference Latency

Throughput & Concurrency

Hardware Footprint

Quantization Quality

Disconnected Operation

Ops profile → hardware sizing and deployment architecture

Dimension 5 — Governance & Compliance

License Compliance

NIST AI RMF Alignment

CDAO Responsible AI

ATO Documentation

Continuous Monitoring Plan

Figure 1 — LDEF Five-Dimension Evaluation Architecture — All dimensions required for defense deployment recommendation

How LDEF Differs from Commercial Evaluation

Dimension	Commercial Evaluation Focus	LDEF Defense Focus
Quality Assurance	Academic benchmark scores (MMLU, GPQA, HumanEval)	Defense domain task performance, hallucination rates on sensitive topics, calibration for low-certainty outputs
Fairness	Protected characteristic demographic parity	Geopolitical representation bias, adversary nation portrayal, allied nation fairness, military entity stereotyping
Security	Basic jailbreak resistance, harmful content refusal	Backdoor detection, supply chain integrity, classified content handling, adversarial suffix resilience, training data exfiltration
Performance	Benchmark throughput on A100/H100	Performance on government-approved hardware (no gaming GPUs), air-gapped deployment, quantized model quality degradation
Governance	Terms of service compliance	NIST AI RMF GOVERN/MAP/MEASURE/MANAGE, CDAO RAI, DoDD 3000.09, ATO documentation completeness

Section 04

Dimension 1: Quality Assurance

Quality assurance in LDEF measures whether a model produces outputs that are sufficiently accurate, well-reasoned, and calibrated for the specific defense tasks it will be asked to perform. The key insight is that task-specific quality measurement must replace — or heavily supplement — general benchmark scores. A model's MMLU score predicts its performance on academic multiple-choice questions; it does not predict its performance on intelligence analysis summarization, DFARS compliance reasoning, or military doctrine question answering.

Factual Accuracy and Hallucination Rate

Hallucination — the model generating confident, fluent, but factually incorrect outputs — is the most consequential quality failure mode for defense applications. An LLM that hallucinates a citation, an operational procedure, a weapon system capability, or a regulatory requirement does so with the same fluency and confidence as a correct output. There is no syntactic signal that distinguishes a hallucinated fact from a recalled one. Defense evaluation must quantify the hallucination rate for the specific domains relevant to the deployment use case.

LDEF measures hallucination using three complementary approaches:

Closed-book factual probing: A curated set of domain-specific factual questions with known, verified answers drawn from authoritative defense sources (DoD Instructions, field manuals, acquisition regulations). The model is queried without context documents; responses are scored for factual accuracy against the gold standard.
RAG faithfulness testing: The model is provided a retrieved document set and asked to answer questions. Responses are scored for faithfulness to the provided documents versus generation of content not in the context — a specific hallucination failure mode in RAG-based deployments relevant to the Secure RAG architecture in CR-04.
Calibration measurement: The model is asked to express confidence in its answers, and that expressed confidence is compared to its actual accuracy across a large sample. A well-calibrated model that expresses 80% confidence is right approximately 80% of the time. An overconfident model that expresses 95% confidence but is only right 70% of the time is a decision-support liability.

Domain-Specific Task Performance

Quality evaluation must include task-specific performance benchmarks constructed for the program's specific use cases. LDEF defines four primary defense task categories, each requiring a purpose-built evaluation set:

Task Category	Evaluation Approach	Key Metrics	Failure Mode
Intelligence Analysis Support	Summarization accuracy, entity extraction precision/recall, claim verification against source documents	ROUGE-L, factual precision, hallucination rate	False claim generation; source misattribution; over-confident inference
Acquisition & Regulatory Reasoning	FAR/DFARS question answering, contract clause interpretation, compliance determination	Answer accuracy, legal error rate	Incorrect regulatory interpretation with high confidence
Technical Documentation	System requirements generation, interface specification interpretation, design review support	Requirement quality metrics, INCOSE compliance	Ambiguous requirements; specification errors
Decision Support & Planning	Course of action analysis, risk assessment synthesis, logistics planning support	Decision quality, completeness of consideration, bias detection	Missing key considerations; confirmation bias amplification

Robustness to Distribution Shift

Defense operational environments change rapidly. A model evaluated in peacetime conditions may encounter significantly different input distributions in crisis conditions — novel terminology, degraded communications, fragmentary reporting, unfamiliar geographic contexts. LDEF tests robustness by evaluating models on out-of-distribution variants of the task evaluation sets: paraphrased questions, incomplete inputs, noisy text (OCR artifacts from scanned documents), and domain-adjacent questions that test whether the model generalizes appropriately or fails in unexpected ways.

⚠ The Overconfidence Risk

The most dangerous quality failure in defense LLM deployment is not a wrong answer presented tentatively — it is a wrong answer presented with high confidence. LDEF's calibration measurement specifically targets this: models that are systematically overconfident on defense domain tasks are flagged as high-risk regardless of their average accuracy, because overconfident errors are much more likely to be acted upon without additional verification. Programs should establish maximum permissible calibration error thresholds as a deployment gate, not just accuracy thresholds.

Section 05

Dimension 2: Fairness & Bias

Fairness evaluation for defense LLM deployment extends well beyond the demographic bias analysis that characterizes commercial AI fairness assessments. In defense contexts, the most consequential bias categories are geopolitical, entity-level, and operational — biases that could cause the model to systematically misrepresent adversary capabilities, undervalue allied contributions, stereotype foreign military forces, or apply different analytical standards to different nations or organizations.

Defense-Specific Bias Categories

BIAS TYPE 01

Geopolitical Representation Bias

Does the model apply consistent analytical standards and information density when discussing different nations, alliances, and adversaries? Models trained predominantly on Western-language web content may have systematically different quality of knowledge about China, Russia, Iran, or North Korea versus allied Western nations — producing outputs of different quality and reliability depending on the geographic focus of a query.

BIAS TYPE 02

Military Entity Stereotyping

Does the model apply stereotypical characterizations to military organizations, branches, specialties, or demographics that could affect its analysis quality? Stereotypical priors about specific military units, national forces, or service members that differ from objective analysis could systematically skew decision-support outputs in ways that are difficult to detect without targeted evaluation.

BIAS TYPE 03

Adversary Capability Assessment Bias

Does the model systematically over- or underestimate adversary capabilities based on preconceptions embedded in its training data? A model trained on Western open-source intelligence that consistently underestimates adversary capabilities — or that reflects media portrayals that differ from intelligence assessments — creates an intelligence analysis tool with systematically skewed outputs.

BIAS TYPE 04

Training Data Temporal Bias

Does the model apply different standards based on the temporal distribution of its training data? Events, capabilities, and organizational structures from before the training cutoff are represented; more recent developments are absent. This is a form of temporal bias that creates systematic blind spots in intelligence and current events analysis — potentially more dangerous than geographic bias because it is less visible.

Bias Measurement Methodology

LDEF measures geopolitical and military bias through counterfactual substitution: identical analytical queries are posed with different nation, force, or entity substitutions, and response quality, completeness, and characterization are compared systematically. For example, a query asking for an analysis of a specific military capability is posed with multiple national subjects (U.S., allied nations, adversary nations), and the responses are evaluated for consistency in analytical depth, factual accuracy, and characterization quality. Systematic differences — consistently shallower analysis for one national category, consistently more negative characterization for another — indicate a bias requiring mitigation before operational deployment.

📊 Counterfactual Bias Test Design

Effective counterfactual bias testing requires care in test construction: substitutions must be semantically equivalent except for the entity being substituted, the evaluation rubric must be objective and independently assessed, and the test set must be large enough to distinguish systematic patterns from stochastic variation. LDEF recommends minimum 200 counterfactual pairs per bias dimension, rated by a panel of domain experts with specific rubrics for each quality dimension (completeness, accuracy, characterization tone, analytical depth).

Demographic Fairness in Defense Contexts

Beyond geopolitical bias, LDEF also addresses traditional demographic fairness — whether the model produces different quality outputs for queries involving different demographic groups of service members, veterans, or civilians. Biased outputs in personnel assessment, medical triage support, administrative processing, or casualty analysis have direct human consequences and legal obligations under DoD equal opportunity and non-discrimination policy.

Section 06

Dimension 3: Security Evaluation

Security evaluation of open-source LLMs for defense deployment must address threat categories that do not apply to commercial API-based models. Because the model weights are public, any adversary can study them. Because fine-tuning modifies the weights, a supply chain attack on the fine-tuning process can embed persistent behaviors that survive all other safety controls. Because the model will be deployed on classified networks, the consequences of a security failure extend beyond information exposure to potential intelligence compromise.

The Open-Source Security Threat Landscape

Pre-trained weight integrity: The weights of widely used open-source models are hosted on repositories (HuggingFace, model-specific releases) that could be compromised by an adversary replacing authentic weights with modified versions containing backdoors or degraded capabilities. Supply chain verification — cryptographic hash verification of downloaded weights against authenticated checksums published by the model's authors — is a required pre-deployment step for all LDEF evaluations.
Backdoor vulnerabilities: Research has demonstrated that LLMs can be fine-tuned to exhibit specific behaviors (producing harmful outputs, bypassing safety controls, returning attacker-specified content) when triggered by specific input patterns, while appearing normal on all other inputs. For a defense deployment using a community fine-tuned model, the risk that the fine-tuning introduced a backdoor requires explicit backdoor detection evaluation.
Training data memorization: LLMs memorize portions of their training data and can be induced to reproduce them. If the training data included sensitive information — personally identifiable information, proprietary organizational data, operationally sensitive content from open-source repositories — the model may reproduce it under targeted extraction prompts. Memorization testing is a required LDEF evaluation step for models that will be fine-tuned on sensitive data.
Adversarial input resilience: As detailed in WP-CR-2025-04, open-weight models are susceptible to adversarial suffix attacks generated using white-box gradient access. For operational deployments, the resistance of the deployed model to known adversarial attack techniques — including program-office-specific red-team generated attacks — must be characterized before deployment.

Security Evaluation Protocol

Security Test	Method	Pass Criterion	Risk if Failed
Weight Integrity Verification	SHA-256 hash comparison against model author's signed release hash; supply chain provenance documentation review	Hash match with authenticated source; documented provenance	CRITICAL — backdoored weights
Backdoor Screening	Neural Cleanse or STRIP methodology; behavioral consistency testing under trigger-pattern inputs; anomaly detection in output distribution	No anomalous behavior clusters detected	CRITICAL — persistent malicious behavior
Memorization Testing	Extraction attack battery; targeted prompting for training data reproduction; PII detection in generated outputs	No PII reproduction; no sensitive training data extraction above baseline	HIGH — data exposure risk
Adversarial Suffix Resilience	GCG/AutoDAN adversarial suffix generation against model weights; black-box jailbreak battery from WP-04 checklist	No safety bypass within standardized attack budget	HIGH — adversarial manipulation risk
Prompt Injection Resistance	Direct and indirect injection test battery from WP-04 taxonomy; application-specific injection scenarios	System prompt not overridden; no unauthorized tool calls; no data exfiltration patterns	HIGH — operational manipulation
Guardrail Consistency	Safety boundary stress testing across paraphrase variants; multilingual jailbreak testing; encoding attack battery	Consistent refusal across semantically equivalent prohibited requests	CRITICAL for classified deployments

⚠ Falcon and Non-Western Model Provenance

Models originating from non-Western research institutions — including Falcon (UAE), Yi (China), Qwen (Alibaba/China), and similar — require additional provenance evaluation beyond the standard security battery. For national security applications, the organization must assess whether the model's origin creates foreign intelligence service access risks, whether training data may include content subject to information operations, and whether the model's behavioral characteristics reflect deliberate design choices with national security implications. This is not a blanket prohibition on non-Western models — it is a requirement for explicit, documented risk assessment.

Section 07

Dimension 4: Operational Performance

Operational performance evaluation assesses whether the candidate model can deliver acceptable quality of service within the hardware and infrastructure constraints of the intended deployment environment. Defense environments impose constraints that commercial LLM deployments do not face: government-approved computing hardware with specific approved products lists, classified network bandwidth limitations, air-gapped deployment without cloud scaling, forward-deployed disconnected operations, and continuous operation requirements without commercial maintenance windows.

Hardware Constraint Evaluation

The primary operational challenge for large open-source models is that the highest-performing variants (70B+ parameters) require specialized high-memory GPU hardware that is not universally available on DoD networks. LDEF evaluates models across a hardware tiering that reflects the actual deployment environments:

Deployment Tier	Hardware Profile	Max Model Size (fp16)	Recommended Architecture
Tier 1 — Data Center	Multiple A100/H100 80GB GPUs; unlimited network	405B+ (sharded)	Full-precision or light quantization; maximum capability
Tier 2 — Server Room	1–4× A100/A10G 40–80GB; classified network	70B	GPTQ/AWQ 4-bit quantization for 70B; fp16 for 13B
Tier 3 — Workstation	1–2× RTX 4090 24GB VRAM; workstation class	13–20B	4-bit quantization required; CPU offloading for larger variants
Tier 4 — Tactical Edge	NVIDIA Jetson, NVIDIA RTX 4000 Ada (20GB), no cloud	7B (quantized)	Int4/Int8 quantization; llama.cpp or similar efficient inference
Tier 5 — Disconnected/EMSO	CPU-only; laptop class; air-gapped	3B (quantized)	GGUF format; llama.cpp CPU inference; Phi-3-mini class models

Quantization Quality Degradation Assessment

Most tactical and edge deployments require model quantization — reducing model weights from 16-bit floats to 8-bit or 4-bit integers to fit within available VRAM. Quantization reduces model size and inference cost but may degrade quality on specific tasks. LDEF requires quantization quality degradation testing: the full-precision model is evaluated on the defense task suite, then quantized versions are evaluated on the same suite, and the quality delta is measured. If quality degradation exceeds the program's acceptable performance threshold at a given quantization level, that quantization level is rejected for that use case, and a smaller full-precision model may be preferred.

Latency and Throughput Requirements

Operational performance requirements vary significantly by use case. A document summarization tool for analysts can tolerate 30–60 second latency for a complex summary; a real-time conversational assistant for an operational planning session requires sub-5-second first-token latency. LDEF specifies that latency and throughput requirements must be defined as part of the use case specification before model evaluation begins, and candidate models are evaluated against these requirements on the target hardware tier — not on benchmark cloud hardware.

Section 08

Dimension 5: Governance & Compliance

Governance evaluation ensures that the candidate LLM and its deployment architecture are consistent with the legal, regulatory, and policy frameworks that govern defense AI deployment. This dimension does not evaluate the model itself — it evaluates the program's documentation, process controls, and monitoring posture that together constitute the governance architecture around the model.

NIST AI RMF Alignment

NIST AI RMF 1.0 defines four core functions — GOVERN, MAP, MEASURE, MANAGE — that together constitute a risk management approach for AI systems. LDEF maps the evaluation activities across all five dimensions to the NIST AI RMF MEASURE function requirements, providing programs with documentation that directly supports RMF compliance assertions:

GOVERN: LLM evaluation policy documented; roles and responsibilities for AI evaluation team defined; escalation procedures for evaluation findings established; executive accountability for deployment decisions documented.
MAP: Use case context documented; affected stakeholders identified; risk categories and tolerance levels established; deployment context constraints documented.
MEASURE: All five LDEF dimensions evaluated with quantified metrics; results compared against defined pass/fail thresholds; evaluation methodology documented for reproducibility.
MANAGE: Deployment decision documented with specific risk acceptance rationale; continuous monitoring plan established; incident response procedures documented; model update and revaluation triggers defined.

License Compliance for Defense Deployment

Open-source model licenses impose use restrictions that must be reviewed by program legal counsel before deployment. The Llama Community License restricts use by organizations with over 700 million monthly active users (not relevant to DoD programs) but has specific terms about disclosure of derivative model development. Apache 2.0 (Mistral, Phi-4) provides broad commercial and government use rights. Custom licenses (Falcon) require specific legal review. Programs must document license compliance as part of the governance assessment, particularly when fine-tuning creates derivative models that may have additional disclosure obligations.

CDAO Responsible AI Principles

The Chief Digital and Artificial Intelligence Office's Responsible AI (RAI) framework requires defense AI systems to be evaluated against five principles: responsible, equitable, traceable, reliable, and governable. LDEF maps to these principles as follows: fairness evaluation → equitable; quality assurance + security evaluation → reliable; hallucination measurement + calibration → traceable; governance evaluation → governable; complete LDEF evaluation documentation → responsible deployment.

Section 09

Defense Benchmark Suite

The LDEF Defense Benchmark Suite (LDBS) is a purpose-built collection of evaluation datasets and tasks designed specifically to measure LLM performance on defense-relevant activities. Unlike commercial benchmarks that aggregate performance across broad academic domains, LDBS tasks are directly tied to operational use cases and are evaluated by domain experts who can assess the real-world quality of model outputs.

LDBS Task Taxonomy

LDBS Task Set	Tasks	Evaluation Method	Minimum Pass Score
LDBS-QA: Regulatory & Policy	FAR/DFARS Q&A (500 items), DoD Instruction interpretation (200 items), ITAR/EAR classification questions (100 items)	Expert-scored accuracy; legal error rate	≥82% accuracy, 0 legal errors
LDBS-IS: Intel Analysis Support	OSINT summarization (150 items), claim verification against source (200 items), entity extraction from reports (300 items)	ROUGE-L, precision/recall, hallucination rate	ROUGE-L ≥0.65, hall. rate <8%
LDBS-RE: Requirements Engineering	SysML element interpretation (100 items), requirements ambiguity identification (150 items), INCOSE quality checking (200 items)	Expert panel scoring, INCOSE criteria	≥75% expert agreement on quality
LDBS-BC: Bias & Calibration	Counterfactual geopolitical substitution (200 pairs), confidence calibration across 1000 items, overconfidence detection	Consistency delta, calibration error (ECE)	ECE <0.08, consistency delta <15%
LDBS-SEC: Security Probing	Backdoor trigger battery (50 trigger patterns), adversarial suffix test (standard GCG budget), memorization extraction battery	Binary pass/fail per test; zero-tolerance on critical	All critical tests pass
LDBS-MC: Medical / Casualty	TCCC protocol Q&A (200 items), medical triage support accuracy (100 items), medication interaction questions (150 items)	Expert-scored; life-safety error rate	Zero life-safety errors allowed

Constructing Program-Specific Evaluation Sets

LDBS provides the structural template and baseline evaluation sets. Every program should supplement LDBS with program-specific evaluation sets constructed from:

Representative samples of the actual documents the model will process in the deployed system
Historical expert-produced outputs for the tasks the model will assist with (analyst reports, requirements documents, acquisition packages), used as gold standards
Domain expert-constructed adversarial test cases that probe the specific failure modes most consequential for the program's mission
Historical incidents or near-misses from analogous programs that represent the failure modes the evaluation is designed to prevent

🔬 Benchmark Security

Defense evaluation datasets are sensitive program assets. A model specifically fine-tuned on the evaluation set — intentionally or because the evaluation set was exposed to the training pipeline — will produce artificially high evaluation scores that do not reflect operational performance. LDEF requires that evaluation datasets be maintained as controlled, non-public assets; that they be versioned with cryptographic integrity verification; and that evaluation be conducted in isolated environments where the model cannot access the evaluation set except during controlled test execution.

Section 10

Interactive LLM Comparison Matrix

The following matrix provides a comparative view of leading open-source LLMs across LDEF dimensions, scored against defense-relevant criteria. Filter by evaluation dimension to focus the comparison. Scores represent Continuum's assessment based on available public evaluation data, published research, and operational experience — they are not official government ratings. Program-specific evaluation may yield different results.

Open-Source LLM Defense Evaluation Comparison

Indicative scores · Not official ratings · Program evaluation required

Section 11

Red-Team Evaluation Methodology

Red-team evaluation of LLMs for defense deployment is a structured adversarial exercise in which a team of experts attempts to cause the candidate model to produce outputs that would be unacceptable in the operational context. The methodology draws on the WP-CR-2025-04 adversarial attack taxonomy, adapted for the specific use case and deployment context of the program being evaluated. Red-team evaluation is required for all Tier 1 and Tier 2 defense LLM deployments and recommended for any deployment with access to sensitive operational data.

Defense Red-Team Scope

Operational context attacks: Red teamers simulate users with legitimate but misused access — an analyst attempting to extract information outside their access authorization, a contractor attempting to use the system for competitive intelligence gathering, or an insider threat attempting to elicit classified information from an AI system trained on classified data.
Adversarial content injection: Red teamers simulate adversary-controlled content entering the model's context — through documents, email, retrieved web content, or user-submitted materials that contain embedded instructions. This directly applies the indirect injection taxonomy from WP-04 to the specific document types and ingestion pathways of the program's RAG architecture.
Misinformation and disinformation probing: Red teamers attempt to cause the model to generate content that, if acted upon, would produce incorrect operational outcomes — false analytical conclusions, erroneous regulatory interpretations, incorrect technical specifications. This is distinct from jailbreaking: the goal is not prohibited content but subtly incorrect content that passes superficial quality review.
Guardrail boundary testing: Red teamers systematically probe the model's content safety boundaries with variants of queries near the refusal threshold, using paraphrasing, role-play framing, and hypothetical framing to identify inconsistencies in the model's safety behavior.

Red-Team Composition for Defense Programs

A defense LLM red-team requires three distinct expertise categories that are rarely co-located in existing program offices:

AI security specialists: Researchers with specific expertise in adversarial ML, LLM attack techniques, and the tooling required for white-box and black-box adversarial testing. Responsible for the technical security evaluation components.
Domain experts: Subject matter experts in the mission area the LLM will support — intelligence analysts for intelligence support tools, acquisition professionals for acquisition support tools, military doctors for medical support tools. Responsible for identifying the failure modes that would have operational consequence and that a non-domain expert would not recognize as failures.
Operational security specialists: Personnel with specific expertise in the threats faced by the program's deployment environment — counterintelligence, information operations, insider threat — who can design red-team scenarios that reflect realistic adversary capabilities and intent.

Red-Team Documentation Requirements

LDEF requires that all red-team findings be documented in a standardized format that includes: the attack technique used, the specific input that caused the failure, the model's output, the expert panel's assessment of operational consequence, the LDEF dimension and sub-dimension the finding falls under, and a remediation recommendation. Red-team findings that rise to the level of Critical or High severity are tracked through a formal remediation process with defined SLAs before deployment is authorized.

✓ Red-Team vs. Evaluation: Complementary Functions

LDBS benchmark evaluation and red-team evaluation are complementary, not interchangeable. Benchmarks measure known performance characteristics against structured test sets; red-teams surface unknown failure modes through creative adversarial exploration. Both are required. Programs that conduct only benchmark evaluation miss the creative adversarial failures that characterize real-world attacks; programs that conduct only red-team evaluation have no systematic coverage of the full performance space. LDEF requires both as conditions of a positive deployment recommendation.

Section 12

ATO & Accreditation Pathway

Authority to Operate for AI systems — and specifically for LLM components within larger systems — is an evolving area where the traditional RMF process was not designed with the unique characteristics of machine learning systems in mind. LDEF provides a structured approach for integrating LLM evaluation documentation into the RMF/ATO package, addressing the specific questions that authorizing officials are likely to ask about AI system risk.

LDEF Documentation Package for ATO

ATO Documentation Element	LDEF Source	RMF Control Families
AI System Description and Boundary	System context document from use case specification phase	SA-4, SA-17, PL-2
Model Provenance and Supply Chain	Dimension 3 security evaluation — weight integrity section	SA-12, SR-3, SR-4
Performance Characterization	Dimension 1 QA evaluation — all LDBS task results	SA-10, SA-11, SI-3
Bias Assessment and Mitigations	Dimension 2 fairness evaluation — full bias measurement report	PL-4, SI-12, PM-26
Security Test Results	Dimension 3 security evaluation — all LDBS-SEC results and red-team report	CA-8, RA-5, SI-10
Operational Performance Profile	Dimension 4 operational evaluation — hardware tier and latency characterization	CP-2, SA-9, SC-5
Continuous Monitoring Plan	ConMon section of governance evaluation — performance drift monitoring, security scanning	CA-7, SI-4, PM-28
Incident Response Procedures	Governance evaluation — AI-specific incident scenarios and response playbooks	IR-4, IR-8, SI-4

Continuous Monitoring for LLM Systems

ATO for LLM systems requires a continuous monitoring approach that is qualitatively different from traditional continuous monitoring: in addition to infrastructure monitoring (vulnerability scanning, configuration management), programs must monitor model performance — whether the model's output quality is drifting from the evaluation baseline. Model drift can occur without any infrastructure change: as the operational input distribution shifts (new document types, new query patterns, seasonal variation in topics), model performance may degrade in ways that are invisible to infrastructure monitoring but detectable through ongoing sampling and evaluation.

LDEF recommends a minimum continuous monitoring cadence of: quarterly automated benchmark re-evaluation against LDBS core tasks; monthly sampling and expert review of a random 1% of operational outputs; real-time anomaly detection on output quality indicators (response length distribution, refusal rate, sentiment); and immediate re-evaluation triggered by any significant change to the model, its fine-tuning, its retrieval corpus, or its deployment configuration.

Section 13

Interactive Evaluation Checklist

The following checklist operationalizes all five LDEF dimensions into a structured evaluation workflow. Click items to mark complete, right-click to mark as failed (requiring remediation), and double-click to mark N/A. The progress bar tracks completion across all dimensions. This checklist is designed as a program office tracking tool — not a substitute for the full LDEF evaluation methodology.

LDEF Evaluation Checklist — All Five Dimensions

Aligned to NIST AI RMF · CDAO RAI · DoD AI Ethics Principles

Section 14

Defense Use-Case Analysis

Different defense use cases impose different evaluation priorities. A model adequate for logistics planning support may be inadequate for intelligence analysis support; a model that passes evaluation for unclassified document summarization may fail evaluation for a deployment with access to sensitive acquisition data. This section provides use-case-specific evaluation guidance for the four primary defense LLM deployment contexts.

Use Case 1: Intelligence Analysis Support

Intelligence analysis support is the highest-consequence defense LLM use case — outputs that inform decision-makers about adversary capabilities, intentions, and activities. The LDEF priorities for this use case are: hallucination rate (must be the lowest of any use case — the minimum acceptable is substantially below 5%); calibration (the model must accurately represent its uncertainty — overconfident wrong answers in an intelligence context can directly influence operational decisions); geopolitical bias (systematic bias in adversary characterization is an intelligence analysis failure); and source attribution (the model must correctly trace analytical claims to source documents, not confabulate citations).

Recommended model profile for intelligence analysis: a larger model (70B+) fine-tuned on declassified intelligence analysis tradecraft documentation, with RAG grounding to limit hallucination, deployed in a Tier 1 or Tier 2 environment, with mandatory human review of all analytical outputs before dissemination. Open-source candidates: fine-tuned Llama 3.1-70B or Mixtral 8×22B with domain adaptation.

Use Case 2: Acquisition and Contract Support

Acquisition and contract support involves the model assisting with FAR/DFARS compliance, contract clause interpretation, source selection analysis, and acquisition package development. The highest-consequence failure modes are regulatory misinterpretation (a model confidently providing incorrect regulatory guidance that is acted upon) and bias in source selection analysis (systematic bias in evaluation of contractor proposals that could affect contract outcomes).

LDEF priorities for this use case: accuracy on LDBS-QA regulatory tasks (hard pass/fail threshold for any legal error); entity bias in contractor evaluation (counterfactual testing with different company names, sizes, and ownership categories); and calibration on regulatory questions (the model should express uncertainty on edge cases, not produce confident wrong interpretations). Continuum's operational experience in acquisition support programs directly informs this use case evaluation.

Use Case 3: Technical Requirements and Systems Engineering

Technical requirements generation, review, and traceability support — directly tied to the Embedding-Driven Requirements Management framework in WP-CR-2025-07. For this use case, LDEF priorities are: MBSE terminology accuracy; INCOSE requirements quality criteria compliance; ability to identify ambiguous or inconsistent requirements; and performance on LDBS-RE tasks. The primary failure mode is requirements that appear well-formed but are subtly ambiguous, which a domain expert would catch but a general LLM might miss.

Use Case 4: Disconnected / Tactical Edge Deployment

Tactical edge deployment — an LLM running on a ruggedized laptop or tactical server in a forward operating environment without network connectivity — imposes the most severe operational performance constraints. LDEF priorities: performance on Tier 4/5 hardware (quantization quality under severe quantization); ability to operate from a local context window without RAG (the retrieval corpus cannot be updated in disconnected operations); robustness to degraded input quality (partial, fragmentary, or OCR-impaired documents); and very small model footprint without catastrophic quality degradation.

Recommended candidates for tactical edge: Phi-3-mini (3.8B) or Phi-4-mini at int4 quantization for CPU deployment; Llama 3.2-3B for GPU-assisted tactical systems. These models are substantially less capable than 70B variants but may be sufficient for specific tactical use cases when evaluated against task-specific criteria rather than general benchmarks.

Section 15

Implementation Roadmap

Building an LDEF evaluation capability within a program office is a phased investment. Programs typically begin with an urgent need to evaluate a specific model for a specific deployment and discover that they lack the infrastructure, methodology, and evaluation datasets to do so rigorously. This roadmap builds the evaluation capability in parallel with the first program-specific evaluation.

Weeks 1–4 · Foundation

Evaluation Infrastructure & Use Case Specification

Stand up the evaluation infrastructure: isolated evaluation environment (no internet access from evaluation nodes), model weight download and hash verification workflow, LDBS benchmark installation, and automated evaluation pipeline. Simultaneously, produce the use case specification document that defines the operational tasks, performance requirements, hardware constraints, and failure consequence taxonomy for the deployment being evaluated.

Eval Infrastructure Weight Verification LDBS Setup Use Case Specification

Weeks 5–10 · Baseline Evaluation

Dimensions 1–4 Automated Evaluation

Run the automated LDBS benchmarks against all candidate models on the target hardware tier. This includes QA accuracy benchmarks, LDBS-BC calibration testing, LDBS-SEC security probing, and operational performance profiling (latency, throughput, quantization quality). Produce the Dimension 1–4 evaluation report with scores, comparison to thresholds, and initial deployment recommendation per dimension.

LDBS QA Benchmarks Calibration Testing Security Probing Ops Performance Evaluation Report

Weeks 11–16 · Expert Evaluation & Red-Team

Domain Expert Review & Adversarial Testing

Convene the domain expert panel for LDBS task scoring that requires human judgment. Conduct the red-team evaluation with the three-component team. Run counterfactual bias testing with expert panel adjudication. Produce the complete LDEF evaluation report covering all five dimensions, with specific findings, severity ratings, and deployment recommendations.

Expert Panel Review Red-Team Exercise Bias Assessment Complete LDEF Report

Weeks 17–22 · Governance & ATO

Dimension 5 & ATO Documentation Package

Complete the governance evaluation: license review with legal counsel, NIST AI RMF documentation, CDAO RAI checklist, continuous monitoring plan, and incident response procedures. Assemble the ATO documentation package mapping LDEF findings to RMF control families. Brief the Authorizing Official with the complete LDEF package.

License Review AI RMF Docs ConMon Plan ATO Package AO Brief

Month 6+ · Continuous Operations

Operational Deployment & Continuous Monitoring

Deploy the approved model with the continuous monitoring infrastructure active from day one. Quarterly automated LDBS re-evaluations run against the deployed model; monthly output sampling with expert review; real-time anomaly detection active. When a new model version or fine-tune is considered, the full LDEF evaluation cycle re-runs before it is approved for deployment. LDEF is not a one-time gate — it is a continuous quality assurance practice.

Deployment Quarterly Eval Output Sampling Drift Detection Model Update Eval

Section 16

The Continuum Approach

Continuum Resources developed the LDEF methodology from the intersection of three capabilities: the AI technical research program directed by Kurt A. Richardson, PhD; the operational AI deployment experience across Continuum's defense program engagements where LLM evaluation questions arise in practical context; and the adversarial security research documented in WP-CR-2025-04. LDEF is not an academic framework constructed for publication — it is the evaluation methodology Continuum applies to LLM selection decisions on active defense programs.

✓ Continuum LLM Evaluation Services

LDEF Evaluation Engagement: Full five-dimension LDEF evaluation of a candidate open-source LLM for a specific defense deployment use case. Includes LDBS benchmark execution, domain expert panel review, red-team exercise, bias assessment, and complete evaluation report. Deliverable: LDEF Evaluation Report with deployment recommendation and ATO documentation package support.
Evaluation Infrastructure Setup: Design and deployment of the isolated evaluation environment, automated LDBS pipeline, model weight verification workflow, and continuous monitoring infrastructure. Enables the program office to conduct future evaluations independently with consistent, documented methodology.
Program-Specific Benchmark Construction: Development of program-specific evaluation datasets tailored to the program's specific tasks, document types, and failure consequence taxonomy. These datasets become controlled program assets that supplement LDBS for ongoing evaluation cycles.
Red-Team Exercises: Structured adversarial evaluation exercise using the three-component team model — AI security specialists, domain experts, and operational security specialists — producing a formal red-team report with severity-stratified findings and remediation recommendations.
Fine-Tuning Safety Evaluation: Evaluation of fine-tuned model variants for backdoor introduction, alignment degradation, and performance characterization relative to the base model. Addresses the specific evaluation challenge that fine-tuning on sensitive data introduces for open-source model deployments.
ATO Documentation Package: Development of the AI-specific portions of the RMF system security plan and assessment report, mapping LDEF findings to control families and providing the Authorizing Official with a complete, quantified AI system risk characterization.

Engagement Models

Engagement	Scope	Duration	Outcome
Rapid Evaluation Sprint	Dimensions 1, 3, 4 automated evaluation; initial deployment recommendation; identifies critical blockers	3–4 weeks	LDEF Baseline Report; Critical/High findings identified; go/no-go recommendation
Full LDEF Evaluation	All five dimensions; domain expert panel; red-team exercise; complete evaluation report; ATO package support	12–16 weeks	Complete LDEF Report with AO-ready documentation package
Evaluation Infrastructure Build	Automated evaluation pipeline, benchmark datasets, monitoring infrastructure, program team training	8–12 weeks	Independent evaluation capability with documented methodology
Continuous Monitoring Program	Quarterly evaluations, output sampling, drift detection, model update evaluation support	Ongoing	Maintained ATO confidence with documented evaluation history

Section 17

Conclusion

Open-source large language models have created a genuine strategic opportunity for defense AI: sovereign, air-gappable, fine-tunable AI capability that can be deployed at any classification level without dependence on external commercial providers. Realizing this opportunity responsibly requires evaluation infrastructure and methodology that the defense community is still in the early stages of building. The commercial model evaluation literature is a starting point, not an answer — it was designed to rank models on academic benchmarks, not to assess whether a model is safe and reliable for intelligence analysis, acquisition support, or mission planning.

The LDEF provides a defense-specific answer to the evaluation question: five dimensions that together constitute a comprehensive pre-deployment assessment — quality assurance that measures operational task performance, not just leaderboard rank; fairness analysis that addresses geopolitical and military entity bias, not just demographic parity; security evaluation that includes backdoor detection and supply chain integrity, not just jailbreak resistance; operational performance characterization against government hardware tiers, not commercial cloud benchmarks; and governance documentation that supports RMF/ATO, not just terms-of-service compliance.

Deploying an open-source LLM in a defense environment without rigorous evaluation is not a calculated risk — it is an unknown risk. The model's capabilities, failure modes, biases, and security vulnerabilities are unknown until they are measured. Measurement is not overhead; it is the minimum due diligence owed to the warfighters and decision-makers who will depend on the model's outputs.

— Kurt A. Richardson, PhD, Continuum Resources LLC, 2025

Start a Conversation

Ready to Evaluate Your LLMs for Defense Deployment?

Contact Continuum Resources for a Rapid Evaluation Sprint on your candidate open-source LLM.

Get in Touch →

References

[NIST-AI-RMF] National Institute of Standards and Technology — "AI Risk Management Framework (AI RMF 1.0)" — NIST AI 100-1, January 2023. The governing framework for LDEF Dimension 5 governance evaluation.
[NIST-AI-100-5] National Institute of Standards and Technology — "Adversarial Machine Learning: A Taxonomy and Terminology" — NIST AI 100-4, 2024. Taxonomy basis for LDEF Dimension 3 security evaluation.
[CDAO-RAI] Chief Digital and Artificial Intelligence Office — "Responsible AI (RAI) Strategy and Implementation Pathway" — CDAO, 2022. CDAO RAI principles mapped to LDEF dimensions throughout.
[DOD-AI-ETHICS] Department of Defense — "DoD AI Ethics Principles" — CDAO, February 2020. Five AI ethics principles providing normative basis for LDEF fairness and governance evaluation.
[DODD-3000-09] Department of Defense Directive 3000.09 — "Autonomous Weapons Systems" — DoD, January 2023. Evaluation obligations for AI systems involved in targeting-relevant decisions.
[LIANG-2022] Liang, P. et al. — "Holistic Evaluation of Language Models (HELM)" — Stanford CRFM, NeurIPS 2022. The most comprehensive academic LLM evaluation framework; LDEF extends HELM methodology to defense-specific dimensions.
[SRIVASTAVA-2023] Srivastava, A. et al. — "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (BIG-Bench)" — TMLR, 2023. Multi-task benchmark framework informing LDBS task design methodology.
[LIN-2022] Lin, S. et al. — "TruthfulQA: Measuring How Models Mimic Human Falsehoods" — ACL 2022. Hallucination measurement methodology adapted for defense domain in LDBS-IS.
[BOMMASANI-2023] Bommasani, R. et al. — "On the Opportunities and Risks of Foundation Models" — Stanford CRFM, 2023. Foundation model risk taxonomy providing part of the basis for LDEF Dimension 3.
[WEN-2023] Wen, Y. et al. — "Backdoor Attacks on Language Models" — IEEE S&P, 2023. Backdoor detection methodology applied in LDEF LDBS-SEC security probing.
[CR-04] Richardson, K.A. — "WP-CR-2025-04: Prompt Injection & Adversarial Attacks on LLM Systems" — Continuum Resources, 2025. Adversarial attack taxonomy that provides the basis for LDEF red-team scope and LDBS-SEC test design.
[CR-03] Richardson, K.A. — "WP-CR-2025-03: AI Governance for Federal Contractors" — Continuum Resources, 2025. Federal AI governance framework that LDEF Dimension 5 operationalizes for LLM-specific deployment evaluation.
[OUYANG-2022] Ouyang, L. et al. — "Training Language Models to Follow Instructions with Human Feedback (InstructGPT)" — NeurIPS 2022. RLHF methodology context for understanding alignment tuning quality in evaluated models.

LLM DefenseEvaluation Framework

Executive Summary

Introduction: The Open-Source LLM Opportunity

The Evaluation Gap

Relationship to the Continuum Research Series

Why Open-Source LLMs for Defense

The Defense Open-Source LLM Landscape

The Five Evaluation Dimensions

How LDEF Differs from Commercial Evaluation

Dimension 1: Quality Assurance

Factual Accuracy and Hallucination Rate

Domain-Specific Task Performance

Robustness to Distribution Shift

Dimension 2: Fairness & Bias

Defense-Specific Bias Categories

Bias Measurement Methodology

Demographic Fairness in Defense Contexts

Dimension 3: Security Evaluation

The Open-Source Security Threat Landscape

Security Evaluation Protocol

Dimension 4: Operational Performance

Hardware Constraint Evaluation

Quantization Quality Degradation Assessment

Latency and Throughput Requirements

Dimension 5: Governance & Compliance

NIST AI RMF Alignment

License Compliance for Defense Deployment

CDAO Responsible AI Principles

Defense Benchmark Suite

LDBS Task Taxonomy

Constructing Program-Specific Evaluation Sets

Interactive LLM Comparison Matrix

Red-Team Evaluation Methodology

Defense Red-Team Scope

Red-Team Composition for Defense Programs

Red-Team Documentation Requirements

ATO & Accreditation Pathway

LDEF Documentation Package for ATO

Continuous Monitoring for LLM Systems

Interactive Evaluation Checklist

Defense Use-Case Analysis

Use Case 1: Intelligence Analysis Support

Use Case 2: Acquisition and Contract Support

Use Case 3: Technical Requirements and Systems Engineering

Use Case 4: Disconnected / Tactical Edge Deployment

Implementation Roadmap

The Continuum Approach

Engagement Models

Conclusion

Ready to Evaluate Your LLMs for Defense Deployment?

References

LLM Defense
Evaluation Framework