Executive Summary
The convergence of large language models (LLMs), tool-use capabilities, and autonomous reasoning has produced a new class of software system: the agentic AI. Unlike traditional AI models that respond to a single prompt, agentic systems decompose complex objectives into sub-tasks, coordinate specialized agents, invoke external tools, and persist state across multi-step reasoning chains — all with minimal human intervention.
For the Department of Defense, this transition is not theoretical. Space Force acquisition programs, Navy logistics networks, and Army intelligence workflows are already confronting the operational question: how do we deploy autonomous AI agents safely, securely, and effectively in environments where failure carries mission-critical consequences?
This white paper establishes a rigorous, practitioner-grade framework for deploying multi-agent AI systems in DoD environments. Drawing on Continuum Resources' direct experience with Space Force, Navy, and Army programs — as well as our published research on LLM evaluation, Secure RAG architectures, and MBSE — we address the full lifecycle: agent architecture, LLM selection, security hardening, governance, human-machine teaming, and a phased deployment roadmap.
Agentic AI does not replace warfighters or program managers. It removes the cognitive and administrative overhead that prevents them from operating at their highest level. The goal is not autonomy for autonomy's sake — it is decision superiority through informed, accelerated, human-commanded action.
Introduction: The Agentic Inflection Point
The history of computing in the DoD is a history of automation waves. Each wave — from mainframe batch processing to client-server ERP, from web-enabled logistics to cloud-native DevSecOps — followed the same arc: initial skepticism, isolated pilots, organizational friction, eventual transformation. We are now at the leading edge of the next wave.
What makes the agentic wave different is not raw computational power or improved accuracy metrics. It is the emergence of general-purpose reasoning that can be directed toward arbitrary tasks, combined with the ability to use tools — APIs, databases, file systems, communication channels — in service of those tasks. For the first time, software can be given an objective in natural language and pursue it through a series of self-directed actions.
The DoD Automation Gap
Despite decades of IT modernization investment, DoD workflows remain heavily manual. Program managers spend an estimated 40–60% of their time on administrative tasks: synthesizing reports, tracking compliance documentation, coordinating between stakeholders, and responding to status inquiries that could be answered by any sufficiently informed system.
This is not a workforce problem — it is a systems architecture problem. The tools available until recently were either too rigid (rule-based RPA) or too generic (standard LLM chatbots) to handle the heterogeneous, context-dependent nature of defense workflows. Agentic AI closes this gap.
Scope of This Paper
This document is written for program managers, CIOs, contracting officers, and technical leads who are evaluating or actively deploying AI agent systems in defense contexts. It is not a theoretical survey — it is an operational guide backed by Continuum's hands-on experience and peer-reviewed research. We cover:
- The technical definition of agentic AI and how it differs from conventional AI
- Multi-agent system architectures appropriate for classified and unclassified DoD environments
- Concrete use cases across Space Force, Navy, Army, and defense contractors
- LLM selection criteria specific to security, performance, and compliance requirements
- Risk taxonomy, security hardening, and adversarial threat models for agent systems
- A governance framework aligned with DoD AI Ethics Principles and NIST AI RMF
- A phased 18-month implementation roadmap
Defining Agentic AI
The term "agentic AI" is used loosely in industry. For this paper, we define it precisely: an agentic AI system is one in which a language model (or ensemble of models) autonomously plans and executes sequences of actions — including tool calls, memory retrieval, inter-agent delegation, and external API invocations — to accomplish a goal specified in natural language, with the ability to observe the results of its actions and adapt its plan accordingly.
The Autonomy Spectrum
Not all agentic systems are equally autonomous. The spectrum ranges from simple tool-augmented models to fully autonomous agents capable of extended, multi-day operation. DoD deployments should be mapped explicitly to the appropriate autonomy level — a decision that is fundamentally about risk tolerance, not capability.
The levels range from L1, where a human approves all actions, through L3, where the agent acts within guardrails and a human reviews only exceptions, to L5, fully autonomous operation with minimal oversight.
Under DoD Directive 3000.09 (Autonomy in Weapon Systems), autonomous weapon systems require explicit senior-official review and authorization. For non-kinetic AI systems, the DoD AI Ethics Principles require that all AI be "governable": human oversight must be technically possible at all times. L4 and L5 autonomy are therefore inappropriate for any DoD workflow involving consequence-bearing decisions unless explicit waivers and governance structures are in place.
Key Capabilities That Enable Agency
- Tool Use / Function Calling: The ability to invoke external APIs, query databases, write and execute code, read/write files, and trigger downstream systems.
- Planning & Task Decomposition: Breaking a high-level objective into ordered sub-tasks, tracking completion state, and replanning when sub-tasks fail.
- Memory Systems: Maintaining working context (short-term), retrieving relevant past interactions (episodic), and accessing persistent knowledge bases (semantic).
- Multi-Agent Coordination: Spawning specialized sub-agents, passing partial results, resolving conflicts, and aggregating outputs.
- Self-Reflection & Evaluation: Assessing output quality, detecting errors in reasoning, and triggering corrective actions without human prompting.
How Agentic AI Differs from Conventional AI
| Dimension | Conventional LLM | Agentic AI System |
|---|---|---|
| Interaction Model | Single-turn prompt → response | Multi-step plan → execute → observe → adapt |
| State | Stateless (each call independent) | Persistent state across actions |
| Tool Access | None (text only) | APIs, databases, code execution, files |
| Error Handling | Human must detect & retry | Agent detects, retries, or escalates |
| Duration | Seconds | Minutes to days |
| Risk Profile | Low (output only) | Higher (real-world actions) |
| Governance Complexity | Moderate | Substantially higher |
Multi-Agent System Architecture
Effective multi-agent systems for DoD environments require a layered architecture that separates orchestration, execution, memory, and security concerns. The design must accommodate both classified and unclassified operational contexts, integrate with legacy DoD systems (DCSA, JIRA-aligned program management tools, SAP-ERP variants), and maintain complete auditability of every agent action.
Orchestration Patterns
Five primary orchestration patterns are applicable to DoD deployments, each with distinct trade-offs in autonomy, auditability, and complexity; a minimal sketch of the simplest pattern follows the table:
| Pattern | Description | Best For | Oversight Level |
|---|---|---|---|
| Sequential Chain | Agents execute in a fixed order; each output is the next input | Document processing, compliance checks | L2–L3 |
| Hierarchical (Hub & Spoke) | Planner agent delegates to specialist agents dynamically | Complex multi-domain tasks, program management | L3 |
| Parallel Fan-Out | Multiple agents work on sub-tasks simultaneously, results merged | Intelligence aggregation, logistics optimization | L3 |
| Debate / Critic Model | Multiple agents propose solutions; a critic agent evaluates | Risk assessment, high-stakes recommendation generation | L2 |
| Reactive Event Loop | Agents respond to real-time events and sensor inputs | Monitoring, alert triage, anomaly detection | L2–L3 |
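As noted above, the Sequential Chain is the simplest pattern to reason about and to audit: each agent's output becomes the next agent's input, and every hand-off is logged. The sketch below is a minimal illustration; the agent callables and audit fields are assumptions for illustration, not a specific framework's API.

```python
# Minimal sketch of the Sequential Chain pattern: agents run in a fixed order,
# each output feeds the next input, and every hand-off produces an audit record.
from typing import Callable
import datetime

Agent = Callable[[str], str]      # in this sketch an agent maps text to text

def run_sequential_chain(agents: list[tuple[str, Agent]], document: str) -> tuple[str, list[dict]]:
    audit_trail: list[dict] = []
    payload = document
    for name, agent in agents:
        output = agent(payload)
        audit_trail.append({
            "agent": name,
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "input_chars": len(payload),
            "output_chars": len(output),
        })
        payload = output              # fixed order: this output is the next agent's input
    return payload, audit_trail

# Usage sketch (hypothetical agents): extraction -> compliance check -> summary drafting
# final, trail = run_sequential_chain(
#     [("extract", extract_agent), ("check", compliance_agent), ("draft", summary_agent)],
#     raw_document,
# )
```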
Memory Architecture for Classified Environments
Memory systems in agentic AI represent a novel attack surface that conventional security architectures do not address. For DoD contexts, we recommend a three-tier classified memory architecture:
- Ephemeral Working Memory: Encrypted in-flight, destroyed on session termination, never persisted to disk. Suitable for reasoning chains within a single task.
- Episodic Memory with Classification Labels: Prior interactions stored with mandatory classification markings. Cross-classification retrieval is blocked at the retrieval layer, not just the display layer.
- Semantic Knowledge Base: Vector stores partitioned by clearance level. All document ingestion goes through automated classification marking review before insertion. Continuum's published Secure RAG architecture provides the design patterns for this tier.
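A minimal sketch of the retrieval-layer control described in the second tier follows. The essential point is that classification filtering happens before ranking and before anything reaches the model's context; the classification ordering and record shape shown here are simplified assumptions, not an accredited design.

```python
# Illustrative sketch of blocking cross-classification retrieval at the retrieval
# layer (not the display layer). Levels and record layout are simplified assumptions.
from dataclasses import dataclass

LEVELS = {"UNCLASSIFIED": 0, "CUI": 1, "SECRET": 2, "TOP SECRET": 3}

@dataclass
class MemoryRecord:
    text: str
    classification: str
    score: float = 0.0               # similarity score from the vector store

def retrieve(records: list[MemoryRecord], query_clearance: str, top_k: int = 5) -> list[MemoryRecord]:
    ceiling = LEVELS[query_clearance]
    # Filter BEFORE ranking: records above the session's clearance never reach
    # the model context, so they cannot leak into prompts or outputs.
    eligible = [r for r in records if LEVELS[r.classification] <= ceiling]
    return sorted(eligible, key=lambda r: r.score, reverse=True)[:top_k]
```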
DoD Use Cases: Deployed & Emerging
The following use cases represent both active Continuum deployments and high-confidence near-term applications identified through our program engagements. Each summary describes the agent workflow and the measurable outcome it delivers.
Space Force: Acquisition Pipeline Automation
Continuum led the first SpOC Operational Acceptance under the Software Acquisition Pathway — establishing AI-augmented workflows that reduced manual coordination overhead by over 60%. The agentic layer now continuously monitors pipeline health, flags compliance deviations, and drafts exception reports for program manager review.
Navy: Predictive Logistics & Maintenance Scheduling
A multi-agent system combining time-series forecasting agents with inventory query agents and maintenance scheduling agents. The orchestrator ingests sensor data, NMCI-accessible supply records, and historical maintenance logs to produce prioritized work orders and predicted shortfall alerts — days ahead of conventional reporting cycles.
Army: Open-Source Intelligence Monitoring
An agent ensemble that continuously monitors designated open-source feeds (unclassified), performs entity extraction and relationship mapping, clusters emerging themes, and produces structured intelligence summaries formatted to unit-specific reporting templates — dramatically reducing the analyst's background monitoring burden.
Contracting: FAR/DFARS Pre-Award Compliance Review
A compliance agent that ingests draft solicitations and contract documents, maps every clause against the current FAR/DFARS clause matrix, identifies missing or incorrectly applied provisions, and generates a structured redline summary for contracting officer review. Reduces pre-award review cycle from days to hours.
SATCOM: Technology Refresh Monitoring
Based on Continuum's published SATCOM Innovation Framework, this agent system monitors long-duration satellite communication programs for technology refresh opportunities. It ingests emerging technology signals, maps them against current program architecture, and produces structured Technology Refresh Proposals for program office consideration.
Automation Viability by DoD Task Type
Not every DoD workflow is a candidate for automation. Each task category should be mapped to a recommended autonomy level and the oversight structure it requires, assessed domain by domain, before any agent is deployed against it.
LLM Selection for DoD Environments
Continuum's published LLM Defense Evaluation framework provides a structured methodology for assessing language models against the specific requirements of defense programs: accuracy, safety, alignment, tool-use reliability, context window adequacy, latency, and deployment model flexibility. The following assessment applies this framework to the current leading models.
Our framework evaluates models across: Instruction Following Accuracy, Tool-Use Reliability, Safety & Alignment, Hallucination Rate on Domain-Specific Content, Context Window (operational documents are large), Deployment Flexibility (on-prem vs. API), Security Certifications, and Latency/Cost at Scale.
On-Premises vs. API Deployment
For NIPR-level workloads, API-based deployment through approved cloud service providers (CSPs) with FedRAMP High authorization is acceptable. For SIPR and above, on-premises or dedicated GovCloud deployment with air-gap capability is required. Open-weight models (Llama 3.x series, Mistral variants) are the primary viable options for fully air-gapped classified environments.
| Scenario | Recommended Approach | Models | Certification Req. |
|---|---|---|---|
| Unclassified / NIPR | API via FedRAMP High CSP | GPT-4o, Claude 3.5+, Gemini | FedRAMP High, IL2–IL4 |
| NIPR Sensitive (CUI) | Dedicated GovCloud or CSP enclave | Claude Gov, Azure OpenAI Gov | FedRAMP High, CMMC L2 |
| SIPR / Classified | On-prem, air-gapped deployment | Llama 3.x, Mistral, fine-tuned open models | ATO required, IL5–IL6 |
| SAP / Above | Requires explicit NSA evaluation | No current commercial models approved | NSA/CSS evaluation required |
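A small configuration sketch, mirroring the table above, shows how a deployment can refuse to route a workload to anything other than an approved stack for its classification scenario. The endpoint and model identifiers are placeholders; any real mapping must come from the program's ATO documentation and the CSP's authorization.

```python
# Sketch of routing a workload to an approved model stack by classification scenario.
# Identifiers are illustrative placeholders, not authoritative approvals.
APPROVED_STACKS = {
    "NIPR": {"deployment": "fedramp_high_api", "models": ["gpt-4o", "claude-3.5", "gemini"]},
    "CUI":  {"deployment": "govcloud_enclave", "models": ["claude-gov", "azure-openai-gov"]},
    "SIPR": {"deployment": "on_prem_airgap",   "models": ["llama-3.x", "mistral"]},
}

def select_stack(classification: str) -> dict:
    if classification not in APPROVED_STACKS:
        raise ValueError(f"No approved commercial stack for '{classification}'; "
                         "SAP and above require explicit NSA/CSS evaluation.")
    return APPROVED_STACKS[classification]
```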
Embedding-Driven Requirement Traceability
Continuum's published research on Embedding-Driven Requirement Management provides directly applicable techniques for agentic systems. By representing requirements, test cases, and change requests as semantic embeddings, agent systems can automatically detect requirement drift, identify impacted test cases, and surface traceability gaps — critical capabilities for systems engineering programs under DoD 5000.87.
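The sketch below illustrates one of these techniques under simplified assumptions: flagging requirements that have no sufficiently similar test case. embed() is a placeholder for the program's approved embedding model, and the 0.75 similarity threshold is illustrative rather than validated.

```python
# Sketch of embedding-based traceability gap detection: requirements with no
# sufficiently similar test case are flagged for engineering review.
import math

def embed(text: str) -> list[float]:
    """Stand-in for an approved embedding model call."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def traceability_gaps(requirements: dict[str, str], test_cases: dict[str, str],
                      threshold: float = 0.75) -> list[str]:
    req_vecs = {rid: embed(text) for rid, text in requirements.items()}
    tc_vecs = {tid: embed(text) for tid, text in test_cases.items()}
    gaps = []
    for rid, rvec in req_vecs.items():
        best = max((cosine(rvec, tvec) for tvec in tc_vecs.values()), default=0.0)
        if best < threshold:
            gaps.append(rid)          # no test case is close enough to cover this requirement
    return gaps
```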
Security, Risk & Threat Modeling
Agentic AI systems introduce a fundamentally new attack surface. Unlike conventional software, an agent can be manipulated through its inputs — including data retrieved from memory, documents processed on behalf of users, and results returned by external tools. This section presents the threat taxonomy and corresponding mitigations that Continuum applies in all DoD-facing deployments.
Threat Taxonomy
| Threat Vector | Description | DoD Impact | Severity |
|---|---|---|---|
| Prompt Injection | Malicious instructions embedded in documents or tool outputs that redirect agent behavior | Agent executes unauthorized actions; exfiltrates data | Critical |
| Memory Poisoning | Attacker injects false information into long-term memory stores | Persistent misinformation affecting future decisions | Critical |
| Tool Abuse | Agent manipulated into misusing legitimate tool access (e.g., API calls, file writes) | Unauthorized system modifications or data leakage | High |
| Cross-Agent Privilege Escalation | Sub-agent inherits unintended permissions from orchestrator agent | Clearance boundary violations, unauthorized data access | High |
| Hallucination in High-Stakes Context | Model generates plausible-sounding but false information in reports or recommendations | Incorrect operational decisions; compliance violations | High |
| Confidentiality Bleed | Information from classified memory leaks into unclassified outputs | Classification violations; potential legal consequences | Critical |
| Supply Chain Attack on Model | Fine-tuned or open-weight model contains backdoor triggers | Unpredictable agent behavior under adversarial inputs | High |
| Denial of Service via Loops | Agent enters infinite reasoning loop consuming compute resources | System unavailability; mission interruption | Medium |
Security Control Architecture
Continuum implements a defense-in-depth approach to agentic system security, drawing on our DevSecOps expertise and Secure RAG research: controls are layered at the input, tool-invocation, memory, and output stages so that no single bypass gives an agent unchecked authority.
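As one example of a control in that stack, the sketch below gates every tool call against a per-agent allowlist and a call budget, and narrows (never widens) the scope handed to sub-agents. Scope names and limits are assumptions for illustration, not a fielded policy.

```python
# Illustrative sketch of one defense-in-depth layer: a tool-call gate enforcing
# per-agent allowlists and a step budget before any tool executes.
from dataclasses import dataclass

@dataclass
class AgentPolicy:
    allowed_tools: frozenset[str]
    max_tool_calls: int
    calls_made: int = 0

class ToolCallDenied(Exception):
    pass

def gate_tool_call(policy: AgentPolicy, tool_name: str) -> None:
    if tool_name not in policy.allowed_tools:
        raise ToolCallDenied(f"tool '{tool_name}' is outside this agent's permission scope")
    if policy.calls_made >= policy.max_tool_calls:
        raise ToolCallDenied("tool-call budget exhausted; possible runaway loop, escalating")
    policy.calls_made += 1

# Sub-agents receive an explicitly narrowed policy; they never inherit the
# orchestrator's full scope (mitigates cross-agent privilege escalation).
def spawn_subagent_policy(parent: AgentPolicy, subset: set[str], budget: int) -> AgentPolicy:
    return AgentPolicy(allowed_tools=frozenset(subset & parent.allowed_tools),
                       max_tool_calls=min(budget, parent.max_tool_calls - parent.calls_made))
```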
Governance Framework
Governance is not a constraint on agentic AI — it is the prerequisite for organizational trust that enables adoption. Without governance structures that make agent behavior transparent, auditable, and correctable, DoD programs will — correctly — decline to deploy. Continuum's governance framework aligns with the DoD AI Ethics Principles, NIST AI RMF, and Executive Order 13960 on AI in the Federal Government.
DoD AI Ethics Principles Alignment
The DoD AI Ethics Principles require AI to be Responsible, Equitable, Traceable, Reliable, and Governable. Each maps directly to agentic system design requirements:
| DoD AI Principle | Agentic System Requirement | Implementation |
|---|---|---|
| Responsible | Clear accountability for every agent action | Named owner for each agent; immutable audit trail |
| Equitable | No discriminatory bias in agent recommendations | Bias evaluation in LLM Defense Evaluation framework |
| Traceable | Complete reasoning chain available for review | Chain-of-thought logging; decision tree reconstruction |
| Reliable | Consistent behavior under adversarial conditions | Red-team testing; output validation; fallback modes |
| Governable | Humans can monitor, correct, and shut down at any time | Kill switches; human approval gates; override protocols |
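The Responsible and Traceable rows imply an audit trail that cannot be quietly edited after the fact. The sketch below shows one simple way to get that property, a hash-chained append-only log; the field names are illustrative, and a production system would also sign and externally replicate the entries.

```python
# Minimal sketch of an immutable (hash-chained) audit trail for agent actions.
import hashlib, json, datetime

class AuditLog:
    def __init__(self) -> None:
        self.entries: list[dict] = []

    def record(self, agent: str, action: str, rationale: str) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else "GENESIS"
        entry = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "agent": agent,
            "action": action,
            "rationale": rationale,          # reasoning summary available for later review
            "prev_hash": prev_hash,
        }
        entry["hash"] = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Any after-the-fact edit breaks the hash chain."""
        prev = "GENESIS"
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev_hash"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True
```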
Governance Structure
Continuum recommends a three-tier governance structure for DoD agentic AI programs, culminating in a standing AI oversight board (see the implementation roadmap below).
Human-Machine Teaming Principles
Effective human-machine teaming in agentic AI requires deliberate design of the hand-off protocols between autonomous action and human decision authority. Continuum's HMT design principles:
- Agents must always be able to explain their reasoning in plain language when queried
- Every agent action must be reversible or at minimum stoppable unless explicitly designated otherwise
- Uncertainty must be surfaced — agents should express low confidence, not hallucinate high confidence
- Human approval gates are mandatory for all consequential, irreversible, or escalated actions
- Operators must be able to override any agent decision at any point in the workflow
- Regular human-in-the-loop sampling of routine agent decisions to catch drift or degradation
- Never design agents that present decisions as final without surfacing the reasoning
- Never allow agents to self-modify their own permission scope
- Never implement "auto-approve" modes for high-risk action categories
- Never deploy without an incident response playbook specific to agentic AI failures
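The sketch below shows how several of these principles can be expressed in code: a named approver, a kill switch that halts the agent at any point, and a gate that blocks consequential or irreversible actions until a human approves. The consequence categories and method names are assumptions for illustration.

```python
# Sketch of a human approval gate and kill switch for agent actions.
# Categories, fields, and method names are illustrative assumptions.
from dataclasses import dataclass, field
from enum import Enum

class Consequence(Enum):
    ROUTINE = "routine"
    CONSEQUENTIAL = "consequential"
    IRREVERSIBLE = "irreversible"

@dataclass
class HumanGate:
    approver: str                       # named owner accountable for this agent
    halted: bool = False                # kill switch state
    pending: list[dict] = field(default_factory=list)

    def kill(self) -> None:
        self.halted = True              # operators can stop the agent at any time

    def request(self, action: str, level: Consequence, rationale: str) -> bool:
        if self.halted:
            return False                                # nothing executes after shutdown
        if level is Consequence.ROUTINE:
            return True                                 # routine actions are sampled later for drift
        self.pending.append({"action": action, "level": level.value,
                             "rationale": rationale})   # reasoning surfaced to the approver
        return False                                    # blocked until explicit human approval

    def approve(self, index: int) -> dict:
        return self.pending.pop(index)                  # the human decision is logged in the audit trail
```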
Implementation Roadmap
Successful agentic AI deployment in DoD environments does not happen in a single sprint. It requires a phased approach that builds organizational trust, validates security posture, and expands autonomy incrementally based on demonstrated performance. The following 18-month roadmap reflects Continuum's deployment methodology refined across multiple DoD programs.
Phase 1: Foundation & Planning
Define target workflows, assess data environments, select LLM stack for classification level, design agent architecture, and begin Authority to Operate (ATO) preparation with the program security officer. No agents deployed to production.
Phase 2: Controlled Pilot
Deploy one high-value, low-risk workflow in a controlled environment with full human oversight. Focus on auditability, accuracy measurement, and user trust-building. All agent actions reviewed by humans — no autonomous execution.
Phase 3: Guarded Expansion
Based on pilot performance, expand to additional workflows and move to L3 autonomy (agent acts within guardrails, human reviews exceptions). The ATO should be achieved by this point. Begin integrating with DoD backend systems, with continuous monitoring and weekly governance review.
Phase 4: Optimization & Hardening
Fine-tune agent behavior based on operational performance data. Introduce specialized agents for domain-specific tasks. Expand memory systems. Conduct full security audit against the threat taxonomy. Update governance documentation for SAF/AQ review.
Phase 5: Enterprise Scale
Scale to enterprise-wide deployment with full governance maturity. Establish the AI oversight board as a standing program element. Develop internal capability for ongoing agent development. Document lessons learned for program-of-record transition.
The Continuum Approach
Continuum Resources is uniquely positioned at the intersection of the capabilities required for successful DoD agentic AI deployment: deep AI engineering expertise, MBSE and systems engineering rigor, proven DevSecOps practice, and a track record of mission-critical delivery across Space Force, Navy, and Army programs. Our approach is not theoretical — it is battle-tested.
- Published Research: Our LLM Defense Evaluation, Secure RAG, SATCOM Innovation Framework, and Embedding-Driven Requirements publications are directly translated into deployment practice — not just reference material.
- WOSB & SBA Certified: A trusted government partner with the certifications and track record that DoD programs require.
- PhD-Level Technical Leadership: Kurt Richardson, PhD (Head of R&D) and Sudip Giri, PhD (Head of Product) provide the academic rigor that separates production-grade AI from proof-of-concept demos.
- Full-Stack Capability: AI + Agile + DevSecOps + Systems Engineering + Automated Testing — we can own the full deployment lifecycle without integration risk from multi-vendor fragmentation.
- Proven DoD Delivery: First SpOC Operational Acceptance under the Software Acquisition Pathway — a validated benchmark for what Agile + AI delivery looks like in Space Force contexts.
How We Engage
Continuum typically engages DoD programs in one of three modes, depending on program maturity:
| Engagement Mode | Scope | Duration | Best For |
|---|---|---|---|
| Assessment Sprint | Workflow discovery, feasibility analysis, ATO gap assessment, roadmap development | 4–6 weeks | Programs exploring agentic AI for the first time |
| Pilot Deployment | Single-workflow agent deployment through P2 of roadmap; full governance structure established | 3–4 months | Programs ready for controlled operational testing |
| Full Program Support | End-to-end agentic AI program delivery, architecture through ATO through enterprise scale | 12–24 months | Programs seeking a long-term innovation partner |
Conclusion
The deployment of agentic AI in DoD environments is not a distant future capability — it is an active operational imperative. Programs that wait for "more mature" technology or "clearer policy" will find themselves operating at a decision-speed disadvantage relative to adversaries who are not waiting. The question is not whether to deploy agentic AI, but how to deploy it responsibly, securely, and in a manner that amplifies — rather than replaces — human judgment.
This paper has established that responsible agentic AI deployment in the DoD is tractable today, provided programs approach it with architectural discipline, governance seriousness, and a phased autonomy model that builds trust incrementally. The framework, architecture patterns, security controls, and roadmap presented here represent a proven path forward — not aspirational theory.
Continuum Resources invites program offices, contracting officers, and technology leads to engage with our team for a no-cost Assessment Sprint to evaluate the readiness of your workflows for agentic AI deployment. Our commitment is the same as it has always been: from ideation to impact.
References
This paper builds on the following Continuum Resources research publications and external sources. All Continuum publications are available via the Publications page of our website.
- [CR-01] Richardson, K. — "Embedding-Driven Requirement Management" — Continuum Resources Research Publication 01, 2024. A semantic, embedding-based approach to requirements engineering for complex enterprise systems.
- [CR-02] Richardson, K. — "SATCOM Innovation Framework" — Continuum Resources Research Publication 02, 2024. Strategic analysis for technology incorporation in long-duration satellite communication programs.
- [CR-03] Richardson, K. — "LLM Defense Evaluation" — Continuum Resources Research Publication 03, 2024. Defense-focused framework for assessing open-source large language models.
- [CR-04] Richardson, K. — "Secure RAG Architectures" — Continuum Resources Research Publication 04, 2024. Design patterns for secure Retrieval-Augmented Generation in regulated environments.
- [DoD-01] Department of Defense — "DoD AI Ethics Principles" — Office of the Chief Digital and Artificial Intelligence Officer (CDAO), February 2020.
- [DoD-02] Department of Defense — "DoD Directive 3000.09: Autonomy in Weapon Systems" — January 2023 reissuance.
- [DoD-03] Department of Defense — "DoD Instruction 5000.87: Operation of the Software Acquisition Pathway" — October 2020.
- [NIST-01] National Institute of Standards and Technology — "AI Risk Management Framework (AI RMF 1.0)" — January 2023.
- [EO-01] Executive Order 13960 — "Promoting the Use of Trustworthy Artificial Intelligence in the Federal Government" — December 2020.
- [INCOSE-01] International Council on Systems Engineering — "Systems Engineering Handbook v4" — 2015.