Executive Summary
The rapid proliferation of open-source large language models — Llama 3, Mistral, Falcon, Phi, and their continuously evolving derivatives — has created an unprecedented opportunity and a complex challenge for defense organizations. These models offer the ability to deploy sovereign AI capabilities within classified enclaves, on air-gapped networks, and in operational environments where commercial API-based LLMs are not viable. They also introduce evaluation obligations that defense organizations are not yet equipped to meet: the commercial model evaluation literature does not translate directly to environments where output failures can affect intelligence assessments, mission planning, force protection, or acquisition decisions.
This white paper, authored by Kurt A. Richardson, PhD, presents the LLM Defense Evaluation Framework (LDEF) — a structured, five-dimension methodology for evaluating open-source LLMs for defense deployment. LDEF addresses the unique evaluation requirements of defense contexts: quality assurance that reflects operational accuracy requirements, not just benchmark leaderboard rankings; fairness analysis that identifies demographic and entity-level biases with operational consequence; security evaluation that goes beyond prompt injection testing to assess model backdoor risks, supply chain integrity, and fine-tuning poisoning vulnerabilities; operational performance characterization under the resource constraints of classified environments; and governance assessment for RMF/ATO compliance.
Commercial LLM benchmarks — MMLU, HellaSwag, HumanEval, TruthfulQA — measure academic and coding performance that does not predict operational quality in defense contexts. A model that ranks in the top 10% on academic benchmarks may have unacceptable failure modes on defense-specific tasks: intelligence analysis, acquisition support, military doctrine reasoning, or classified environment operation. Defense organizations must build their own evaluation infrastructure, not rely on commercial leaderboards.
Introduction: The Open-Source LLM Opportunity
Commercial LLM services — GPT-4, Claude, Gemini — have demonstrated transformative potential for productivity, analysis, and decision support across virtually every domain of human activity. For defense organizations, their use is constrained by a fundamental architecture problem: commercial LLMs are cloud-hosted services that require data to leave the organization's control, traverse public networks, and be processed on infrastructure the organization does not own. For classified programs, sensitive intelligence analysis, and operational planning, this architecture is categorically unacceptable.
Open-source LLMs — models whose weights are publicly available for download, deployment, and modification — solve this architecture problem. A Llama 3.1-70B model can be deployed on government-owned hardware, inside a classified enclave, on an air-gapped network, with full control over data handling, logging, and access. No data leaves the boundary. No external API call is made. The model is the government's to configure, fine-tune, and operate as a sovereign AI capability. This is a genuinely transformative opportunity for defense AI adoption.
The Evaluation Gap
The evaluation challenge is as significant as the opportunity. Open-source LLMs arrive without the testing infrastructure, safety tuning validation, or operational performance documentation that would accompany a commercial enterprise software procurement. The model's license may be permissive; its evaluation documentation is typically absent. What evaluation exists — benchmark scores on academic datasets — was designed to compare models in commercial development contexts, not to assess whether a model is safe to deploy for defense decision support.
NIST AI RMF requires that AI systems deployed in government contexts be evaluated against quantified performance, fairness, and reliability criteria before deployment, and monitored continuously during operation. CDAO's Responsible AI guidance requires documented evaluation of AI systems for bias, reliability, and security. DoD Directive 3000.09 (Autonomous Weapons) creates evaluation obligations for AI systems involved in targeting-adjacent decisions. Meeting these obligations requires a defense-specific evaluation framework — which is what LDEF provides.
Relationship to the Continuum Research Series
LDEF is the evaluation counterpart to the adversarial security framework in WP-CR-2025-04 (Prompt Injection & Adversarial Attacks). Where WP-04 focuses on protecting deployed LLMs against adversarial manipulation, LDEF focuses on evaluating candidate LLMs before deployment to determine whether they meet the performance, safety, and security thresholds required for their intended use case. Together, they form the pre-deployment evaluation + operational security architecture for defense LLM programs.
Why Open-Source LLMs for Defense
The case for open-source LLMs in defense deployments rests on five operational advantages that commercial API-based LLMs cannot match. Each advantage has direct implications for the evaluation framework required.
The Defense Open-Source LLM Landscape
| Model Family | Origin | Largest Variant | License | Defense Relevance |
|---|---|---|---|---|
| Llama 3.x | Meta AI | 405B | Llama Community License | General-purpose; strongest performance at scale; ITAR considerations for non-US deployment |
| Mistral / Mixtral | Mistral AI (EU) | 8×22B MoE | Apache 2.0 | Strong instruction following; European provenance; MoE efficiency for air-gap deployment |
| Falcon | TII (UAE) | 180B | Falcon License (custom) | Non-Western origin creates evaluation obligation for national security applications; strong multilingual |
| Phi-3 / Phi-4 | Microsoft Research | 14B | MIT | High capability per parameter — suitable for edge/disconnected deployment; small footprint |
| Gemma 2 | Google DeepMind | 27B | Gemma ToS | Safety-focused training; strong reasoning; Google provenance provides some third-party safety validation |
| OLMo / OLMO2 | Allen Institute for AI | 13B | Apache 2.0 | Full training transparency — weights, data, code fully public; highest auditability of any major model |
The Five Evaluation Dimensions
LDEF organizes LLM evaluation for defense deployment across five dimensions that together constitute a comprehensive pre-deployment assessment. Each dimension addresses a different failure mode that could affect operational outcomes or create legal, ethical, or national security risks.
How LDEF Differs from Commercial Evaluation
| Dimension | Commercial Evaluation Focus | LDEF Defense Focus |
|---|---|---|
| Quality Assurance | Academic benchmark scores (MMLU, GPQA, HumanEval) | Defense domain task performance, hallucination rates on sensitive topics, calibration for low-certainty outputs |
| Fairness | Protected characteristic demographic parity | Geopolitical representation bias, adversary nation portrayal, allied nation fairness, military entity stereotyping |
| Security | Basic jailbreak resistance, harmful content refusal | Backdoor detection, supply chain integrity, classified content handling, adversarial suffix resilience, training data exfiltration |
| Performance | Benchmark throughput on A100/H100 | Performance on government-approved hardware (no gaming GPUs), air-gapped deployment, quantized model quality degradation |
| Governance | Terms of service compliance | NIST AI RMF GOVERN/MAP/MEASURE/MANAGE, CDAO RAI, DoDD 3000.09, ATO documentation completeness |
Dimension 1: Quality Assurance
Quality assurance in LDEF measures whether a model produces outputs that are sufficiently accurate, well-reasoned, and calibrated for the specific defense tasks it will be asked to perform. The key insight is that task-specific quality measurement must replace — or heavily supplement — general benchmark scores. A model's MMLU score predicts its performance on academic multiple-choice questions; it does not predict its performance on intelligence analysis summarization, DFARS compliance reasoning, or military doctrine question answering.
Factual Accuracy and Hallucination Rate
Hallucination — the model generating confident, fluent, but factually incorrect outputs — is the most consequential quality failure mode for defense applications. An LLM that hallucinates a citation, an operational procedure, a weapon system capability, or a regulatory requirement does so with the same fluency and confidence as a correct output. There is no syntactic signal that distinguishes a hallucinated fact from a recalled one. Defense evaluation must quantify the hallucination rate for the specific domains relevant to the deployment use case.
LDEF measures hallucination using three complementary approaches:
- Closed-book factual probing: A curated set of domain-specific factual questions with known, verified answers drawn from authoritative defense sources (DoD Instructions, field manuals, acquisition regulations). The model is queried without context documents; responses are scored for factual accuracy against the gold standard.
- RAG faithfulness testing: The model is provided a retrieved document set and asked to answer questions. Responses are scored for faithfulness to the provided documents versus generation of content not in the context — a specific hallucination failure mode in RAG-based deployments relevant to the Secure RAG architecture in CR-04.
- Calibration measurement: The model is asked to express confidence in its answers, and that expressed confidence is compared to its actual accuracy across a large sample. A well-calibrated model that expresses 80% confidence is right approximately 80% of the time. An overconfident model that expresses 95% confidence but is only right 70% of the time is a decision-support liability.
Domain-Specific Task Performance
Quality evaluation must include task-specific performance benchmarks constructed for the program's specific use cases. LDEF defines four primary defense task categories, each requiring a purpose-built evaluation set:
| Task Category | Evaluation Approach | Key Metrics | Failure Mode |
|---|---|---|---|
| Intelligence Analysis Support | Summarization accuracy, entity extraction precision/recall, claim verification against source documents | ROUGE-L, factual precision, hallucination rate | False claim generation; source misattribution; over-confident inference |
| Acquisition & Regulatory Reasoning | FAR/DFARS question answering, contract clause interpretation, compliance determination | Answer accuracy, legal error rate | Incorrect regulatory interpretation with high confidence |
| Technical Documentation | System requirements generation, interface specification interpretation, design review support | Requirement quality metrics, INCOSE compliance | Ambiguous requirements; specification errors |
| Decision Support & Planning | Course of action analysis, risk assessment synthesis, logistics planning support | Decision quality, completeness of consideration, bias detection | Missing key considerations; confirmation bias amplification |
Robustness to Distribution Shift
Defense operational environments change rapidly. A model evaluated in peacetime conditions may encounter significantly different input distributions in crisis conditions — novel terminology, degraded communications, fragmentary reporting, unfamiliar geographic contexts. LDEF tests robustness by evaluating models on out-of-distribution variants of the task evaluation sets: paraphrased questions, incomplete inputs, noisy text (OCR artifacts from scanned documents), and domain-adjacent questions that test whether the model generalizes appropriately or fails in unexpected ways.
The most dangerous quality failure in defense LLM deployment is not a wrong answer presented tentatively — it is a wrong answer presented with high confidence. LDEF's calibration measurement specifically targets this: models that are systematically overconfident on defense domain tasks are flagged as high-risk regardless of their average accuracy, because overconfident errors are much more likely to be acted upon without additional verification. Programs should establish maximum permissible calibration error thresholds as a deployment gate, not just accuracy thresholds.
Dimension 2: Fairness & Bias
Fairness evaluation for defense LLM deployment extends well beyond the demographic bias analysis that characterizes commercial AI fairness assessments. In defense contexts, the most consequential bias categories are geopolitical, entity-level, and operational — biases that could cause the model to systematically misrepresent adversary capabilities, undervalue allied contributions, stereotype foreign military forces, or apply different analytical standards to different nations or organizations.
Defense-Specific Bias Categories
Bias Measurement Methodology
LDEF measures geopolitical and military bias through counterfactual substitution: identical analytical queries are posed with different nation, force, or entity substitutions, and response quality, completeness, and characterization are compared systematically. For example, a query asking for an analysis of a specific military capability is posed with multiple national subjects (U.S., allied nations, adversary nations), and the responses are evaluated for consistency in analytical depth, factual accuracy, and characterization quality. Systematic differences — consistently shallower analysis for one national category, consistently more negative characterization for another — indicate a bias requiring mitigation before operational deployment.
Effective counterfactual bias testing requires care in test construction: substitutions must be semantically equivalent except for the entity being substituted, the evaluation rubric must be objective and independently assessed, and the test set must be large enough to distinguish systematic patterns from stochastic variation. LDEF recommends minimum 200 counterfactual pairs per bias dimension, rated by a panel of domain experts with specific rubrics for each quality dimension (completeness, accuracy, characterization tone, analytical depth).
Demographic Fairness in Defense Contexts
Beyond geopolitical bias, LDEF also addresses traditional demographic fairness — whether the model produces different quality outputs for queries involving different demographic groups of service members, veterans, or civilians. Biased outputs in personnel assessment, medical triage support, administrative processing, or casualty analysis have direct human consequences and legal obligations under DoD equal opportunity and non-discrimination policy.
Dimension 3: Security Evaluation
Security evaluation of open-source LLMs for defense deployment must address threat categories that do not apply to commercial API-based models. Because the model weights are public, any adversary can study them. Because fine-tuning modifies the weights, a supply chain attack on the fine-tuning process can embed persistent behaviors that survive all other safety controls. Because the model will be deployed on classified networks, the consequences of a security failure extend beyond information exposure to potential intelligence compromise.
The Open-Source Security Threat Landscape
- Pre-trained weight integrity: The weights of widely used open-source models are hosted on repositories (HuggingFace, model-specific releases) that could be compromised by an adversary replacing authentic weights with modified versions containing backdoors or degraded capabilities. Supply chain verification — cryptographic hash verification of downloaded weights against authenticated checksums published by the model's authors — is a required pre-deployment step for all LDEF evaluations.
- Backdoor vulnerabilities: Research has demonstrated that LLMs can be fine-tuned to exhibit specific behaviors (producing harmful outputs, bypassing safety controls, returning attacker-specified content) when triggered by specific input patterns, while appearing normal on all other inputs. For a defense deployment using a community fine-tuned model, the risk that the fine-tuning introduced a backdoor requires explicit backdoor detection evaluation.
- Training data memorization: LLMs memorize portions of their training data and can be induced to reproduce them. If the training data included sensitive information — personally identifiable information, proprietary organizational data, operationally sensitive content from open-source repositories — the model may reproduce it under targeted extraction prompts. Memorization testing is a required LDEF evaluation step for models that will be fine-tuned on sensitive data.
- Adversarial input resilience: As detailed in WP-CR-2025-04, open-weight models are susceptible to adversarial suffix attacks generated using white-box gradient access. For operational deployments, the resistance of the deployed model to known adversarial attack techniques — including program-office-specific red-team generated attacks — must be characterized before deployment.
Security Evaluation Protocol
| Security Test | Method | Pass Criterion | Risk if Failed |
|---|---|---|---|
| Weight Integrity Verification | SHA-256 hash comparison against model author's signed release hash; supply chain provenance documentation review | Hash match with authenticated source; documented provenance | CRITICAL — backdoored weights |
| Backdoor Screening | Neural Cleanse or STRIP methodology; behavioral consistency testing under trigger-pattern inputs; anomaly detection in output distribution | No anomalous behavior clusters detected | CRITICAL — persistent malicious behavior |
| Memorization Testing | Extraction attack battery; targeted prompting for training data reproduction; PII detection in generated outputs | No PII reproduction; no sensitive training data extraction above baseline | HIGH — data exposure risk |
| Adversarial Suffix Resilience | GCG/AutoDAN adversarial suffix generation against model weights; black-box jailbreak battery from WP-04 checklist | No safety bypass within standardized attack budget | HIGH — adversarial manipulation risk |
| Prompt Injection Resistance | Direct and indirect injection test battery from WP-04 taxonomy; application-specific injection scenarios | System prompt not overridden; no unauthorized tool calls; no data exfiltration patterns | HIGH — operational manipulation |
| Guardrail Consistency | Safety boundary stress testing across paraphrase variants; multilingual jailbreak testing; encoding attack battery | Consistent refusal across semantically equivalent prohibited requests | CRITICAL for classified deployments |
Models originating from non-Western research institutions — including Falcon (UAE), Yi (China), Qwen (Alibaba/China), and similar — require additional provenance evaluation beyond the standard security battery. For national security applications, the organization must assess whether the model's origin creates foreign intelligence service access risks, whether training data may include content subject to information operations, and whether the model's behavioral characteristics reflect deliberate design choices with national security implications. This is not a blanket prohibition on non-Western models — it is a requirement for explicit, documented risk assessment.
Dimension 4: Operational Performance
Operational performance evaluation assesses whether the candidate model can deliver acceptable quality of service within the hardware and infrastructure constraints of the intended deployment environment. Defense environments impose constraints that commercial LLM deployments do not face: government-approved computing hardware with specific approved products lists, classified network bandwidth limitations, air-gapped deployment without cloud scaling, forward-deployed disconnected operations, and continuous operation requirements without commercial maintenance windows.
Hardware Constraint Evaluation
The primary operational challenge for large open-source models is that the highest-performing variants (70B+ parameters) require specialized high-memory GPU hardware that is not universally available on DoD networks. LDEF evaluates models across a hardware tiering that reflects the actual deployment environments:
| Deployment Tier | Hardware Profile | Max Model Size (fp16) | Recommended Architecture |
|---|---|---|---|
| Tier 1 — Data Center | Multiple A100/H100 80GB GPUs; unlimited network | 405B+ (sharded) | Full-precision or light quantization; maximum capability |
| Tier 2 — Server Room | 1–4× A100/A10G 40–80GB; classified network | 70B | GPTQ/AWQ 4-bit quantization for 70B; fp16 for 13B |
| Tier 3 — Workstation | 1–2× RTX 4090 24GB VRAM; workstation class | 13–20B | 4-bit quantization required; CPU offloading for larger variants |
| Tier 4 — Tactical Edge | NVIDIA Jetson, NVIDIA RTX 4000 Ada (20GB), no cloud | 7B (quantized) | Int4/Int8 quantization; llama.cpp or similar efficient inference |
| Tier 5 — Disconnected/EMSO | CPU-only; laptop class; air-gapped | 3B (quantized) | GGUF format; llama.cpp CPU inference; Phi-3-mini class models |
Quantization Quality Degradation Assessment
Most tactical and edge deployments require model quantization — reducing model weights from 16-bit floats to 8-bit or 4-bit integers to fit within available VRAM. Quantization reduces model size and inference cost but may degrade quality on specific tasks. LDEF requires quantization quality degradation testing: the full-precision model is evaluated on the defense task suite, then quantized versions are evaluated on the same suite, and the quality delta is measured. If quality degradation exceeds the program's acceptable performance threshold at a given quantization level, that quantization level is rejected for that use case, and a smaller full-precision model may be preferred.
Latency and Throughput Requirements
Operational performance requirements vary significantly by use case. A document summarization tool for analysts can tolerate 30–60 second latency for a complex summary; a real-time conversational assistant for an operational planning session requires sub-5-second first-token latency. LDEF specifies that latency and throughput requirements must be defined as part of the use case specification before model evaluation begins, and candidate models are evaluated against these requirements on the target hardware tier — not on benchmark cloud hardware.
Dimension 5: Governance & Compliance
Governance evaluation ensures that the candidate LLM and its deployment architecture are consistent with the legal, regulatory, and policy frameworks that govern defense AI deployment. This dimension does not evaluate the model itself — it evaluates the program's documentation, process controls, and monitoring posture that together constitute the governance architecture around the model.
NIST AI RMF Alignment
NIST AI RMF 1.0 defines four core functions — GOVERN, MAP, MEASURE, MANAGE — that together constitute a risk management approach for AI systems. LDEF maps the evaluation activities across all five dimensions to the NIST AI RMF MEASURE function requirements, providing programs with documentation that directly supports RMF compliance assertions:
- GOVERN: LLM evaluation policy documented; roles and responsibilities for AI evaluation team defined; escalation procedures for evaluation findings established; executive accountability for deployment decisions documented.
- MAP: Use case context documented; affected stakeholders identified; risk categories and tolerance levels established; deployment context constraints documented.
- MEASURE: All five LDEF dimensions evaluated with quantified metrics; results compared against defined pass/fail thresholds; evaluation methodology documented for reproducibility.
- MANAGE: Deployment decision documented with specific risk acceptance rationale; continuous monitoring plan established; incident response procedures documented; model update and revaluation triggers defined.
License Compliance for Defense Deployment
Open-source model licenses impose use restrictions that must be reviewed by program legal counsel before deployment. The Llama Community License restricts use by organizations with over 700 million monthly active users (not relevant to DoD programs) but has specific terms about disclosure of derivative model development. Apache 2.0 (Mistral, Phi-4) provides broad commercial and government use rights. Custom licenses (Falcon) require specific legal review. Programs must document license compliance as part of the governance assessment, particularly when fine-tuning creates derivative models that may have additional disclosure obligations.
CDAO Responsible AI Principles
The Chief Digital and Artificial Intelligence Office's Responsible AI (RAI) framework requires defense AI systems to be evaluated against five principles: responsible, equitable, traceable, reliable, and governable. LDEF maps to these principles as follows: fairness evaluation → equitable; quality assurance + security evaluation → reliable; hallucination measurement + calibration → traceable; governance evaluation → governable; complete LDEF evaluation documentation → responsible deployment.
Defense Benchmark Suite
The LDEF Defense Benchmark Suite (LDBS) is a purpose-built collection of evaluation datasets and tasks designed specifically to measure LLM performance on defense-relevant activities. Unlike commercial benchmarks that aggregate performance across broad academic domains, LDBS tasks are directly tied to operational use cases and are evaluated by domain experts who can assess the real-world quality of model outputs.
LDBS Task Taxonomy
| LDBS Task Set | Tasks | Evaluation Method | Minimum Pass Score |
|---|---|---|---|
| LDBS-QA: Regulatory & Policy | FAR/DFARS Q&A (500 items), DoD Instruction interpretation (200 items), ITAR/EAR classification questions (100 items) | Expert-scored accuracy; legal error rate | ≥82% accuracy, 0 legal errors |
| LDBS-IS: Intel Analysis Support | OSINT summarization (150 items), claim verification against source (200 items), entity extraction from reports (300 items) | ROUGE-L, precision/recall, hallucination rate | ROUGE-L ≥0.65, hall. rate <8% |
| LDBS-RE: Requirements Engineering | SysML element interpretation (100 items), requirements ambiguity identification (150 items), INCOSE quality checking (200 items) | Expert panel scoring, INCOSE criteria | ≥75% expert agreement on quality |
| LDBS-BC: Bias & Calibration | Counterfactual geopolitical substitution (200 pairs), confidence calibration across 1000 items, overconfidence detection | Consistency delta, calibration error (ECE) | ECE <0.08, consistency delta <15% |
| LDBS-SEC: Security Probing | Backdoor trigger battery (50 trigger patterns), adversarial suffix test (standard GCG budget), memorization extraction battery | Binary pass/fail per test; zero-tolerance on critical | All critical tests pass |
| LDBS-MC: Medical / Casualty | TCCC protocol Q&A (200 items), medical triage support accuracy (100 items), medication interaction questions (150 items) | Expert-scored; life-safety error rate | Zero life-safety errors allowed |
Constructing Program-Specific Evaluation Sets
LDBS provides the structural template and baseline evaluation sets. Every program should supplement LDBS with program-specific evaluation sets constructed from:
- Representative samples of the actual documents the model will process in the deployed system
- Historical expert-produced outputs for the tasks the model will assist with (analyst reports, requirements documents, acquisition packages), used as gold standards
- Domain expert-constructed adversarial test cases that probe the specific failure modes most consequential for the program's mission
- Historical incidents or near-misses from analogous programs that represent the failure modes the evaluation is designed to prevent
Defense evaluation datasets are sensitive program assets. A model specifically fine-tuned on the evaluation set — intentionally or because the evaluation set was exposed to the training pipeline — will produce artificially high evaluation scores that do not reflect operational performance. LDEF requires that evaluation datasets be maintained as controlled, non-public assets; that they be versioned with cryptographic integrity verification; and that evaluation be conducted in isolated environments where the model cannot access the evaluation set except during controlled test execution.
Interactive LLM Comparison Matrix
The following matrix provides a comparative view of leading open-source LLMs across LDEF dimensions, scored against defense-relevant criteria. Filter by evaluation dimension to focus the comparison. Scores represent Continuum's assessment based on available public evaluation data, published research, and operational experience — they are not official government ratings. Program-specific evaluation may yield different results.
Red-Team Evaluation Methodology
Red-team evaluation of LLMs for defense deployment is a structured adversarial exercise in which a team of experts attempts to cause the candidate model to produce outputs that would be unacceptable in the operational context. The methodology draws on the WP-CR-2025-04 adversarial attack taxonomy, adapted for the specific use case and deployment context of the program being evaluated. Red-team evaluation is required for all Tier 1 and Tier 2 defense LLM deployments and recommended for any deployment with access to sensitive operational data.
Defense Red-Team Scope
- Operational context attacks: Red teamers simulate users with legitimate but misused access — an analyst attempting to extract information outside their access authorization, a contractor attempting to use the system for competitive intelligence gathering, or an insider threat attempting to elicit classified information from an AI system trained on classified data.
- Adversarial content injection: Red teamers simulate adversary-controlled content entering the model's context — through documents, email, retrieved web content, or user-submitted materials that contain embedded instructions. This directly applies the indirect injection taxonomy from WP-04 to the specific document types and ingestion pathways of the program's RAG architecture.
- Misinformation and disinformation probing: Red teamers attempt to cause the model to generate content that, if acted upon, would produce incorrect operational outcomes — false analytical conclusions, erroneous regulatory interpretations, incorrect technical specifications. This is distinct from jailbreaking: the goal is not prohibited content but subtly incorrect content that passes superficial quality review.
- Guardrail boundary testing: Red teamers systematically probe the model's content safety boundaries with variants of queries near the refusal threshold, using paraphrasing, role-play framing, and hypothetical framing to identify inconsistencies in the model's safety behavior.
Red-Team Composition for Defense Programs
A defense LLM red-team requires three distinct expertise categories that are rarely co-located in existing program offices:
- AI security specialists: Researchers with specific expertise in adversarial ML, LLM attack techniques, and the tooling required for white-box and black-box adversarial testing. Responsible for the technical security evaluation components.
- Domain experts: Subject matter experts in the mission area the LLM will support — intelligence analysts for intelligence support tools, acquisition professionals for acquisition support tools, military doctors for medical support tools. Responsible for identifying the failure modes that would have operational consequence and that a non-domain expert would not recognize as failures.
- Operational security specialists: Personnel with specific expertise in the threats faced by the program's deployment environment — counterintelligence, information operations, insider threat — who can design red-team scenarios that reflect realistic adversary capabilities and intent.
Red-Team Documentation Requirements
LDEF requires that all red-team findings be documented in a standardized format that includes: the attack technique used, the specific input that caused the failure, the model's output, the expert panel's assessment of operational consequence, the LDEF dimension and sub-dimension the finding falls under, and a remediation recommendation. Red-team findings that rise to the level of Critical or High severity are tracked through a formal remediation process with defined SLAs before deployment is authorized.
LDBS benchmark evaluation and red-team evaluation are complementary, not interchangeable. Benchmarks measure known performance characteristics against structured test sets; red-teams surface unknown failure modes through creative adversarial exploration. Both are required. Programs that conduct only benchmark evaluation miss the creative adversarial failures that characterize real-world attacks; programs that conduct only red-team evaluation have no systematic coverage of the full performance space. LDEF requires both as conditions of a positive deployment recommendation.
ATO & Accreditation Pathway
Authority to Operate for AI systems — and specifically for LLM components within larger systems — is an evolving area where the traditional RMF process was not designed with the unique characteristics of machine learning systems in mind. LDEF provides a structured approach for integrating LLM evaluation documentation into the RMF/ATO package, addressing the specific questions that authorizing officials are likely to ask about AI system risk.
LDEF Documentation Package for ATO
| ATO Documentation Element | LDEF Source | RMF Control Families |
|---|---|---|
| AI System Description and Boundary | System context document from use case specification phase | SA-4, SA-17, PL-2 |
| Model Provenance and Supply Chain | Dimension 3 security evaluation — weight integrity section | SA-12, SR-3, SR-4 |
| Performance Characterization | Dimension 1 QA evaluation — all LDBS task results | SA-10, SA-11, SI-3 |
| Bias Assessment and Mitigations | Dimension 2 fairness evaluation — full bias measurement report | PL-4, SI-12, PM-26 |
| Security Test Results | Dimension 3 security evaluation — all LDBS-SEC results and red-team report | CA-8, RA-5, SI-10 |
| Operational Performance Profile | Dimension 4 operational evaluation — hardware tier and latency characterization | CP-2, SA-9, SC-5 |
| Continuous Monitoring Plan | ConMon section of governance evaluation — performance drift monitoring, security scanning | CA-7, SI-4, PM-28 |
| Incident Response Procedures | Governance evaluation — AI-specific incident scenarios and response playbooks | IR-4, IR-8, SI-4 |
Continuous Monitoring for LLM Systems
ATO for LLM systems requires a continuous monitoring approach that is qualitatively different from traditional continuous monitoring: in addition to infrastructure monitoring (vulnerability scanning, configuration management), programs must monitor model performance — whether the model's output quality is drifting from the evaluation baseline. Model drift can occur without any infrastructure change: as the operational input distribution shifts (new document types, new query patterns, seasonal variation in topics), model performance may degrade in ways that are invisible to infrastructure monitoring but detectable through ongoing sampling and evaluation.
LDEF recommends a minimum continuous monitoring cadence of: quarterly automated benchmark re-evaluation against LDBS core tasks; monthly sampling and expert review of a random 1% of operational outputs; real-time anomaly detection on output quality indicators (response length distribution, refusal rate, sentiment); and immediate re-evaluation triggered by any significant change to the model, its fine-tuning, its retrieval corpus, or its deployment configuration.
Interactive Evaluation Checklist
The following checklist operationalizes all five LDEF dimensions into a structured evaluation workflow. Click items to mark complete, right-click to mark as failed (requiring remediation), and double-click to mark N/A. The progress bar tracks completion across all dimensions. This checklist is designed as a program office tracking tool — not a substitute for the full LDEF evaluation methodology.
Defense Use-Case Analysis
Different defense use cases impose different evaluation priorities. A model adequate for logistics planning support may be inadequate for intelligence analysis support; a model that passes evaluation for unclassified document summarization may fail evaluation for a deployment with access to sensitive acquisition data. This section provides use-case-specific evaluation guidance for the four primary defense LLM deployment contexts.
Use Case 1: Intelligence Analysis Support
Intelligence analysis support is the highest-consequence defense LLM use case — outputs that inform decision-makers about adversary capabilities, intentions, and activities. The LDEF priorities for this use case are: hallucination rate (must be the lowest of any use case — the minimum acceptable is substantially below 5%); calibration (the model must accurately represent its uncertainty — overconfident wrong answers in an intelligence context can directly influence operational decisions); geopolitical bias (systematic bias in adversary characterization is an intelligence analysis failure); and source attribution (the model must correctly trace analytical claims to source documents, not confabulate citations).
Recommended model profile for intelligence analysis: a larger model (70B+) fine-tuned on declassified intelligence analysis tradecraft documentation, with RAG grounding to limit hallucination, deployed in a Tier 1 or Tier 2 environment, with mandatory human review of all analytical outputs before dissemination. Open-source candidates: fine-tuned Llama 3.1-70B or Mixtral 8×22B with domain adaptation.
Use Case 2: Acquisition and Contract Support
Acquisition and contract support involves the model assisting with FAR/DFARS compliance, contract clause interpretation, source selection analysis, and acquisition package development. The highest-consequence failure modes are regulatory misinterpretation (a model confidently providing incorrect regulatory guidance that is acted upon) and bias in source selection analysis (systematic bias in evaluation of contractor proposals that could affect contract outcomes).
LDEF priorities for this use case: accuracy on LDBS-QA regulatory tasks (hard pass/fail threshold for any legal error); entity bias in contractor evaluation (counterfactual testing with different company names, sizes, and ownership categories); and calibration on regulatory questions (the model should express uncertainty on edge cases, not produce confident wrong interpretations). Continuum's operational experience in acquisition support programs directly informs this use case evaluation.
Use Case 3: Technical Requirements and Systems Engineering
Technical requirements generation, review, and traceability support — directly tied to the Embedding-Driven Requirements Management framework in WP-CR-2025-07. For this use case, LDEF priorities are: MBSE terminology accuracy; INCOSE requirements quality criteria compliance; ability to identify ambiguous or inconsistent requirements; and performance on LDBS-RE tasks. The primary failure mode is requirements that appear well-formed but are subtly ambiguous, which a domain expert would catch but a general LLM might miss.
Use Case 4: Disconnected / Tactical Edge Deployment
Tactical edge deployment — an LLM running on a ruggedized laptop or tactical server in a forward operating environment without network connectivity — imposes the most severe operational performance constraints. LDEF priorities: performance on Tier 4/5 hardware (quantization quality under severe quantization); ability to operate from a local context window without RAG (the retrieval corpus cannot be updated in disconnected operations); robustness to degraded input quality (partial, fragmentary, or OCR-impaired documents); and very small model footprint without catastrophic quality degradation.
Recommended candidates for tactical edge: Phi-3-mini (3.8B) or Phi-4-mini at int4 quantization for CPU deployment; Llama 3.2-3B for GPU-assisted tactical systems. These models are substantially less capable than 70B variants but may be sufficient for specific tactical use cases when evaluated against task-specific criteria rather than general benchmarks.
Implementation Roadmap
Building an LDEF evaluation capability within a program office is a phased investment. Programs typically begin with an urgent need to evaluate a specific model for a specific deployment and discover that they lack the infrastructure, methodology, and evaluation datasets to do so rigorously. This roadmap builds the evaluation capability in parallel with the first program-specific evaluation.
Stand up the evaluation infrastructure: isolated evaluation environment (no internet access from evaluation nodes), model weight download and hash verification workflow, LDBS benchmark installation, and automated evaluation pipeline. Simultaneously, produce the use case specification document that defines the operational tasks, performance requirements, hardware constraints, and failure consequence taxonomy for the deployment being evaluated.
Run the automated LDBS benchmarks against all candidate models on the target hardware tier. This includes QA accuracy benchmarks, LDBS-BC calibration testing, LDBS-SEC security probing, and operational performance profiling (latency, throughput, quantization quality). Produce the Dimension 1–4 evaluation report with scores, comparison to thresholds, and initial deployment recommendation per dimension.
Convene the domain expert panel for LDBS task scoring that requires human judgment. Conduct the red-team evaluation with the three-component team. Run counterfactual bias testing with expert panel adjudication. Produce the complete LDEF evaluation report covering all five dimensions, with specific findings, severity ratings, and deployment recommendations.
Complete the governance evaluation: license review with legal counsel, NIST AI RMF documentation, CDAO RAI checklist, continuous monitoring plan, and incident response procedures. Assemble the ATO documentation package mapping LDEF findings to RMF control families. Brief the Authorizing Official with the complete LDEF package.
Deploy the approved model with the continuous monitoring infrastructure active from day one. Quarterly automated LDBS re-evaluations run against the deployed model; monthly output sampling with expert review; real-time anomaly detection active. When a new model version or fine-tune is considered, the full LDEF evaluation cycle re-runs before it is approved for deployment. LDEF is not a one-time gate — it is a continuous quality assurance practice.
The Continuum Approach
Continuum Resources developed the LDEF methodology from the intersection of three capabilities: the AI technical research program directed by Kurt A. Richardson, PhD; the operational AI deployment experience across Continuum's defense program engagements where LLM evaluation questions arise in practical context; and the adversarial security research documented in WP-CR-2025-04. LDEF is not an academic framework constructed for publication — it is the evaluation methodology Continuum applies to LLM selection decisions on active defense programs.
- LDEF Evaluation Engagement: Full five-dimension LDEF evaluation of a candidate open-source LLM for a specific defense deployment use case. Includes LDBS benchmark execution, domain expert panel review, red-team exercise, bias assessment, and complete evaluation report. Deliverable: LDEF Evaluation Report with deployment recommendation and ATO documentation package support.
- Evaluation Infrastructure Setup: Design and deployment of the isolated evaluation environment, automated LDBS pipeline, model weight verification workflow, and continuous monitoring infrastructure. Enables the program office to conduct future evaluations independently with consistent, documented methodology.
- Program-Specific Benchmark Construction: Development of program-specific evaluation datasets tailored to the program's specific tasks, document types, and failure consequence taxonomy. These datasets become controlled program assets that supplement LDBS for ongoing evaluation cycles.
- Red-Team Exercises: Structured adversarial evaluation exercise using the three-component team model — AI security specialists, domain experts, and operational security specialists — producing a formal red-team report with severity-stratified findings and remediation recommendations.
- Fine-Tuning Safety Evaluation: Evaluation of fine-tuned model variants for backdoor introduction, alignment degradation, and performance characterization relative to the base model. Addresses the specific evaluation challenge that fine-tuning on sensitive data introduces for open-source model deployments.
- ATO Documentation Package: Development of the AI-specific portions of the RMF system security plan and assessment report, mapping LDEF findings to control families and providing the Authorizing Official with a complete, quantified AI system risk characterization.
Engagement Models
| Engagement | Scope | Duration | Outcome |
|---|---|---|---|
| Rapid Evaluation Sprint | Dimensions 1, 3, 4 automated evaluation; initial deployment recommendation; identifies critical blockers | 3–4 weeks | LDEF Baseline Report; Critical/High findings identified; go/no-go recommendation |
| Full LDEF Evaluation | All five dimensions; domain expert panel; red-team exercise; complete evaluation report; ATO package support | 12–16 weeks | Complete LDEF Report with AO-ready documentation package |
| Evaluation Infrastructure Build | Automated evaluation pipeline, benchmark datasets, monitoring infrastructure, program team training | 8–12 weeks | Independent evaluation capability with documented methodology |
| Continuous Monitoring Program | Quarterly evaluations, output sampling, drift detection, model update evaluation support | Ongoing | Maintained ATO confidence with documented evaluation history |
Conclusion
Open-source large language models have created a genuine strategic opportunity for defense AI: sovereign, air-gappable, fine-tunable AI capability that can be deployed at any classification level without dependence on external commercial providers. Realizing this opportunity responsibly requires evaluation infrastructure and methodology that the defense community is still in the early stages of building. The commercial model evaluation literature is a starting point, not an answer — it was designed to rank models on academic benchmarks, not to assess whether a model is safe and reliable for intelligence analysis, acquisition support, or mission planning.
The LDEF provides a defense-specific answer to the evaluation question: five dimensions that together constitute a comprehensive pre-deployment assessment — quality assurance that measures operational task performance, not just leaderboard rank; fairness analysis that addresses geopolitical and military entity bias, not just demographic parity; security evaluation that includes backdoor detection and supply chain integrity, not just jailbreak resistance; operational performance characterization against government hardware tiers, not commercial cloud benchmarks; and governance documentation that supports RMF/ATO, not just terms-of-service compliance.
Ready to Evaluate Your LLMs for Defense Deployment?
Contact Continuum Resources for a Rapid Evaluation Sprint on your candidate open-source LLM.
Get in Touch →References
- [NIST-AI-RMF] National Institute of Standards and Technology — "AI Risk Management Framework (AI RMF 1.0)" — NIST AI 100-1, January 2023. The governing framework for LDEF Dimension 5 governance evaluation.
- [NIST-AI-100-5] National Institute of Standards and Technology — "Adversarial Machine Learning: A Taxonomy and Terminology" — NIST AI 100-4, 2024. Taxonomy basis for LDEF Dimension 3 security evaluation.
- [CDAO-RAI] Chief Digital and Artificial Intelligence Office — "Responsible AI (RAI) Strategy and Implementation Pathway" — CDAO, 2022. CDAO RAI principles mapped to LDEF dimensions throughout.
- [DOD-AI-ETHICS] Department of Defense — "DoD AI Ethics Principles" — CDAO, February 2020. Five AI ethics principles providing normative basis for LDEF fairness and governance evaluation.
- [DODD-3000-09] Department of Defense Directive 3000.09 — "Autonomous Weapons Systems" — DoD, January 2023. Evaluation obligations for AI systems involved in targeting-relevant decisions.
- [LIANG-2022] Liang, P. et al. — "Holistic Evaluation of Language Models (HELM)" — Stanford CRFM, NeurIPS 2022. The most comprehensive academic LLM evaluation framework; LDEF extends HELM methodology to defense-specific dimensions.
- [SRIVASTAVA-2023] Srivastava, A. et al. — "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (BIG-Bench)" — TMLR, 2023. Multi-task benchmark framework informing LDBS task design methodology.
- [LIN-2022] Lin, S. et al. — "TruthfulQA: Measuring How Models Mimic Human Falsehoods" — ACL 2022. Hallucination measurement methodology adapted for defense domain in LDBS-IS.
- [BOMMASANI-2023] Bommasani, R. et al. — "On the Opportunities and Risks of Foundation Models" — Stanford CRFM, 2023. Foundation model risk taxonomy providing part of the basis for LDEF Dimension 3.
- [WEN-2023] Wen, Y. et al. — "Backdoor Attacks on Language Models" — IEEE S&P, 2023. Backdoor detection methodology applied in LDEF LDBS-SEC security probing.
- [CR-04] Richardson, K.A. — "WP-CR-2025-04: Prompt Injection & Adversarial Attacks on LLM Systems" — Continuum Resources, 2025. Adversarial attack taxonomy that provides the basis for LDEF red-team scope and LDBS-SEC test design.
- [CR-03] Richardson, K.A. — "WP-CR-2025-03: AI Governance for Federal Contractors" — Continuum Resources, 2025. Federal AI governance framework that LDEF Dimension 5 operationalizes for LLM-specific deployment evaluation.
- [OUYANG-2022] Ouyang, L. et al. — "Training Language Models to Follow Instructions with Human Feedback (InstructGPT)" — NeurIPS 2022. RLHF methodology context for understanding alignment tuning quality in evaluated models.