| Q | Topic | One-Line Answer |
|---|---|---|
| Q1 | 5-Layer Cake | Energy → Chips → Infra → Models → Apps; NVIDIA plays all 5 |
| Q2 | Model Building | Curate → Train → Customize → Deploy; RAG adds knowledge without retraining |
| Q3 | Agents | LLM + loop + tools; specialization + orchestration solves the reliability problem |
| Q4 | Tokenomics/Energy | Tokens/watt = new miles/gallon; B200 is 4x H100; on-prem wins >60% utilization |
| Q5 | Infrastructure | GPU = parallel math at scale; InfiniBand for training; NVMe local cache |
| Q6 | Skills & ISVs | The opportunity is the bridge between existing ISV systems and AI orchestration layers |
| Q7 | Influence to Buy | Pain → Proof → Path → Partner; never pitch features before finding the failure |
| Q8 | Synthetic Data | Rare events + privacy + bias — the flight simulator principle |
| Q9 | OpenClaw/OpenShift | VoIP displacement repeating; OpenShift + GPU Operator = NVIDIA's enterprise K8s play |
| Q10 | Trust & Governance | Explainability + Audit + Compliance + Safety + Data Gov; CJIS is my proof |
The AI 5-Layer Cake
| # | Layer | Bakery Analogy | AI Reality | NVIDIA |
|---|---|---|---|---|
| L1 | Energy | Electricity powering the ovens | Data center power + cooling. H100 = 700W. Sovereign AI = your own power plant. | DGX-Ready DC Program, liquid cooling |
| L2 | Chips | The industrial ovens — faster = more cakes | GPU (parallel) vs CPU (sequential). GPU bakes 10,000 at once. | Blackwell B200, H100, L40S, Grace |
| L3 | Infrastructure | Kitchen layout — ovens, fridges, conveyors, staff network | DGX systems, NVLink, InfiniBand, NVMe storage, K8s clusters | DGX BasePOD, SuperPOD, GB200 NVL72 |
| L4 | Models | Master recipe + trained head baker | Foundation Models (Llama, Nemotron). Weights = recipe. Training = teaching the baker. | NeMo, AI Foundry, Nemotron family |
| L5 | Applications | Finished cake on the customer's table | ChatGPT, Copilot, ViQi ASSIST, enterprise RAG copilot | NIM, AI Enterprise, Blueprints, Triton |
How a Model Like GPT Is Built
Data Curation → NeMo Data Curator
Pre-training → NeMo Framework + DGX Cloud
Fine-tuning → NeMo SFT, PEFT (LoRA), RLHF
Inference → Triton + TensorRT-LLM
Deployment → NIM (OpenAI-compatible API)
RAG → NeMo Retriever + Milvus/pgvector
Safety → NeMo Guardrails
| Term | One-Line Definition | NVIDIA Solution |
|---|---|---|
| Tokens | ~0.75 words; the unit everything is measured in | — |
| Context Length | How much text the model holds "in mind" (e.g., 128K tokens) | NeMo long-context training |
| Training | Model learns by adjusting billions of parameters to minimize prediction error | NeMo + DGX Cloud |
| Inference | Frozen model generates output — milliseconds | Triton + TensorRT + NIM |
| Reasoning | Model "thinks step by step" before answering (Chain-of-Thought) | Nemotron Reasoning models |
| RAG | Retrieve relevant context from knowledge base; inject before generating | NeMo Retriever + NIM |
| SFT | Supervised Fine-Tuning — labeled input/output pairs teach a task | NeMo SFT |
| LoRA / PEFT | Fine-tune <1% of parameters — comparable quality, ~10× less compute | NeMo PEFT |
| MoE | Mixture of Experts — only activate relevant "specialist" neurons per token | Mixtral, Nemotron MoE |
| Mamba | State Space Model alternative to Transformer — better for very long sequences | Research/emerging |
| KV Cache | Store previous token computations — avoid recomputing per query | TensorRT-LLM optimization |
| Vector DB | Store embeddings (meaning), not text — semantic similarity search | Milvus, pgvector, NeMo Retriever |
| Chunking | Break large documents into smaller pieces for RAG retrieval | Preprocessing pipeline |
| Hallucination | Model generates plausible but incorrect output | NeMo Guardrails, RAG grounding |
| Quantization | Reduce model precision (FP32→INT8/FP4) — smaller, faster | TensorRT, NIM optimizations |
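A toy sketch of the Quantization entry above — symmetric INT8 quantization with a single scale factor. This only illustrates the idea; TensorRT's actual calibration and per-channel schemes are far more sophisticated.

```python
# Toy symmetric INT8 quantization -- illustrates the glossary entry,
# not TensorRT's actual implementation.

def quantize_int8(weights):
    """Map FP32 weights onto the INT8 range [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.82, -1.31, 0.05, 2.54, -0.77]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# INT8 storage is 4x smaller than FP32; round-trip error is bounded by
# half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}, max round-trip error={max_err:.4f}")
```

The 4× memory reduction (32 bits → 8 bits per weight) is exactly why quantized models fit on smaller GPUs and run faster.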
How Agents Are Built
| Framework | Best For | Key Feature | NVIDIA Integration |
|---|---|---|---|
| LlamaIndex | RAG-heavy doc pipelines | Deep retrieval primitives | NeMo Retriever compatible |
| CrewAI | Role-based multi-agent teams | Agent personas, task delegation | NIM as LLM backend |
| LangGraph | Complex state machines | Graph-based, supports cycles | NIM as LLM backend |
| AutoGen | Human-in-the-loop conversational agents | Chat-based coordination | NIM as LLM backend |
| NVIDIA AIQ Toolkit | Production NVIDIA-native agents | Blueprints, NIM-native, observability | Native |
User Input
↓
Orchestrator Agent (LLM: "what tool do I call next?")
↓
Tool Call (JSON schema → API / code / DB / sub-agent)
↓
Tool Result → back to Orchestrator
↓
Next decision → repeat until task complete
↓
Final Response (+ NeMo Guardrails check)
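The loop above can be sketched in a few lines. The "LLM" here is a canned stub (in a real system it would be a NIM/OpenAI-style chat call), and the tool names and plan are hypothetical — the point is the shape of the loop and the hard iteration limit.

```python
# Minimal sketch of the orchestrator loop. Tool names and the stubbed
# "LLM" policy are hypothetical.

TOOLS = {
    "lookup_ticket": lambda arg: f"ticket {arg}: database timeout",
    "draft_message": lambda arg: f"DRAFT: please investigate '{arg}'",
}

def fake_llm(history):
    """Stub policy: decide the next tool call, or finish."""
    if not any("ticket" in h for h in history):
        return ("lookup_ticket", "INC-42")
    if not any("DRAFT" in h for h in history):
        return ("draft_message", "database timeout")
    return ("FINISH", history[-1])

def run_agent(max_iterations=10):
    history = []
    for _ in range(max_iterations):    # hard limit prevents infinite loops
        action, arg = fake_llm(history)
        if action == "FINISH":
            return arg                 # final response
        result = TOOLS[action](arg)    # tool call
        history.append(result)         # observation fed back to the loop
    raise RuntimeError("max iterations reached without finishing")

print(run_agent())
```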
Trigger: New email arrives (IMAP / Graph API webhook)
↓
[Classifier Agent] LLM classifies: urgent/routine/spam/action-required
Extracts: sender, intent, deadline, entities
↓
[Router] ──branches──▶ [Draft Reply Agent] RAG on email history → human review
├──▶ [Calendar Agent] parse meetings → check availability
├──▶ [Task Agent] extract actions → push to Jira/Asana
└──▶ [Escalation Agent] flag VIP/time-critical → notify human
↓
[Review Gate] human approval for any outbound action
↓
[Execution] send reply / create event / create task
| Failure Mode | Cause | Fix |
|---|---|---|
| Wrong tool called | Ambiguous tool description | Better docstrings, few-shot examples |
| Infinite loop | No termination condition | Max iterations + circuit breakers |
| Hallucinated tool args | Model invents parameters | Structured output (JSON schema enforcement) |
| Compounding errors | Each step amplifies upstream mistake | Checkpoint validation, human-in-loop gates |
| Context overflow | History exceeds window | Summarization, sliding window, memory compression |
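The "hallucinated tool args" fix from the table can be sketched as schema validation before execution: reject invented or mistyped parameters and feed the error back instead of calling the API. Hand-rolled validation here for illustration; production code would use Pydantic or JSON Schema.

```python
# Validate tool-call arguments against a declared schema before executing.
# The tool name and schema are hypothetical examples.

TOOL_SCHEMA = {
    "name": "create_ticket",
    "params": {"title": str, "priority": int},   # required params + types
}

def validate_args(schema, args):
    errors = []
    for param, typ in schema["params"].items():
        if param not in args:
            errors.append(f"missing required param '{param}'")
        elif not isinstance(args[param], typ):
            errors.append(f"'{param}' must be {typ.__name__}")
    for param in args:
        if param not in schema["params"]:
            errors.append(f"unknown param '{param}' (hallucinated?)")
    return errors

# A model-invented argument ("assignee") and a type error are both caught
# before any API call happens:
bad_call = {"title": "DB timeout", "priority": "high", "assignee": "bob"}
print(validate_args(TOOL_SCHEMA, bad_call))
```

Feeding the returned error list back into the model's context usually lets it self-correct on the next attempt.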
Tokenomics & Energy
| Factor | Prefer Cloud | Prefer On-Prem |
|---|---|---|
| Utilization | <40% (bursty) | >60% (sustained production) |
| Data Sovereignty | No restrictions | CJIS, FedRAMP, GDPR, healthcare |
| Time to start | Hours | Months (hardware procurement) |
| Model size | Small to medium | Largest (trillion params) |
| 3-year TCO | Higher at scale | Lower at scale (>60% utilization) |
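A back-of-envelope model of the utilization rule of thumb above. All dollar figures are hypothetical placeholders (not NVIDIA or cloud pricing) — the shape of the result is what matters: cloud cost scales with utilization, on-prem is mostly fixed, and the curves cross somewhere around 50–60%.

```python
# Toy 3-year TCO break-even model. All prices are hypothetical.

def three_year_cloud_cost(gpu_hourly_rate, utilization):
    hours = 3 * 365 * 24
    return gpu_hourly_rate * hours * utilization   # pay only for used hours

def three_year_onprem_cost(capex, annual_opex):
    return capex + 3 * annual_opex                 # capex paid regardless of usage

CLOUD_RATE = 4.00    # $/GPU-hour (assumption)
CAPEX = 35_000       # $ per GPU over 3 years (assumption)
OPEX = 8_000         # $/GPU/year power + cooling + ops (assumption)

for util in (0.30, 0.60, 0.90):
    cloud = three_year_cloud_cost(CLOUD_RATE, util)
    onprem = three_year_onprem_cost(CAPEX, OPEX)
    winner = "on-prem" if onprem < cloud else "cloud"
    print(f"{util:.0%} utilization: cloud=${cloud:,.0f} vs on-prem=${onprem:,.0f} -> {winner}")
```

With these placeholder numbers the crossover lands just below 60% utilization, matching the table's heuristic.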
AI Infrastructure Layer
| Dimension | CPU | GPU |
|---|---|---|
| Cores | 8–128 large complex cores | Thousands of small parallel cores |
| Strength | Complex sequential logic, OS, branching | Parallel math (matrix multiply = AI) |
| Memory BW | ~100–500 GB/s | 3.35 TB/s (H100 SXM) |
| AI training | 1 epoch = hours | 1 epoch = minutes |
| NVIDIA product | Grace (ARM-based, HPC-optimized) | H100, B200, L40S, Jetson |
| Type | Speed | AI Use Case | When |
|---|---|---|---|
| Local NVMe | Highest | Data cache, checkpoints | Hot data, single-node training |
| Parallel FS (Lustre/Weka) | High | Training datasets (shared) | Multi-node training clusters |
| NFS | Medium | Home dirs, model repos | Dev environments |
| Object Storage (S3) | Lowest (high latency) | Data lakes, archival | Ingestion, cold storage |
| GPU | Architecture | VRAM | Best For |
|---|---|---|---|
| B200 | Blackwell | 192 GB HBM3e | Trillion-param training, large model inference |
| H100 SXM | Hopper | 80 GB HBM3 | Enterprise training, mid-size LLMs |
| L40S | Ada Lovelace | 48 GB GDDR6 | Inference + visualization + video (I used this on Brev) |
| Jetson Orin | Ampere | 16–64 GB unified | Edge AI, robotics, real-time inference |
Agentic AI Skills & ISV Solutions
Examples you already know: Salesforce (CRM) · SAP (ERP) · ServiceNow (IT ticketing) · Splunk (log analytics) · Snowflake (data warehouse).
The key insight: Enterprises already have 300–400 ISV apps deployed. They are not replacing them. The AI opportunity is building the orchestration layer that sits on top — reading from and writing to these systems via their existing REST APIs, without touching the underlying data or replacing the UI.
Think of it like a hospital: the patient (user) speaks to the triage nurse (orchestrator), who routes them to the right specialist (agent), who uses the hospital's existing IT systems (ISVs) to look up records and order tests (tools).
"Summarize all open P1 incidents from the last 7 days and draft a Slack message to the on-call team."
LLM breaks the task: (a) query ServiceNow for P1 tickets, (b) summarize with RAG Agent, (c) Action Agent drafts + sends Slack message.
Tool call: GET /api/now/table/incident?priority=1&opened_at>7d → returns 14 tickets.
Queries Vector DB for incident patterns → retrieves 3 most relevant historical runbooks → passes to NIM for synthesis.
NIM generates the message → NeMo Guardrails checks for sensitive data → Action Agent calls Slack API → message delivered. Full audit log written.
"Done. 14 P1 incidents summarized. Message sent to #oncall. Top pattern: database connection timeouts (8 of 14 incidents)." Total time: ~8 seconds.
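The P1-incident walkthrough above can be stubbed end-to-end in a few lines, with every step writing an audit record (the part that matters for governance). All three tools are mocks; nothing real is called, and the query string just mirrors the example.

```python
# Stub walkthrough of the P1-incident scenario with an audit trail.
# All tools are mocked; tool names and data are illustrative only.
import json
import time

AUDIT_LOG = []

def audited(step, fn, *args):
    """Execute a tool and record who/what/when for the audit trail."""
    result = fn(*args)
    AUDIT_LOG.append({"step": step, "args": args,
                      "result_preview": str(result)[:60], "ts": time.time()})
    return result

def query_servicenow(query):       # mock of GET /api/now/table/incident
    return [{"id": f"INC-{i}", "summary": "db connection timeout"} for i in range(14)]

def summarize(tickets):            # mock of the RAG agent
    return f"{len(tickets)} P1 incidents; top pattern: db connection timeouts"

def post_to_slack(channel, text):  # mock of the Action agent (behind a review gate)
    return f"posted to {channel}"

tickets = audited("query", query_servicenow, "priority=1&opened_at>7d")
summary = audited("summarize", summarize, tickets)
status  = audited("notify", post_to_slack, "#oncall", summary)

print(summary)
print(json.dumps([a["step"] for a in AUDIT_LOG]))
```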
Influencing a Customer to Buy
| Step | What You Do | Your Example |
|---|---|---|
| Pain | Find the failure mode first. Never pitch features before understanding the problem. | VirtualPartner: 98% failure rate. I showed the data before saying a word about ASSIST AI. |
| Proof | Build something in their environment, not a canned demo. | AIQ GSI Research Assistant built before the NVIDIA interview — not after. |
| Path | Make the journey to production feel short. ROI calculators, reference architectures, customer references. | $1.2M–$1.6M annual savings per 100-officer agency. ai-infra-advisor TCO model. |
| Partner | Leave them more capable than before. Train their team, not just their exec. | 100+ architects trained. Partners close deals without me in the room. |
| Stakeholder | Primary Concern | Your Move |
|---|---|---|
| CTO/CIO | Architecture fit, security, scalability | Reference architecture + security posture (NeMo Guardrails, CJIS, FedRAMP) |
| CFO | ROI, TCO, risk | ROI calculator + 3-year TCO model (ai-infra-advisor) |
| VP Business Unit | Time to value, competitive edge | POC timeline, quick win from similar vertical |
| Lead Developer | Real integration complexity | Working code, GitHub repo, deploy in 1 hour |
| Compliance Officer | Regulatory risk | Audit trail, data residency, model explainability, governance framework |
Why Synthetic Data Is Critical
| Problem | Without Synthetic Data | With Synthetic Data |
|---|---|---|
| Data scarcity | Rare events have too few examples (fraud, rare diseases, edge cases) | Generate realistic variants at scale |
| Privacy constraints | Real patient records, criminal histories cannot be shared | Synthetic records preserve statistics, expose no real individuals |
| Bias amplification | Historical data encodes historical biases | Synthetically balanced datasets correct representation gaps |
| Labeling cost | Human annotation at scale costs millions | Synthetic data comes pre-labeled |
| Product | Role |
|---|---|
| NeMo Data Curator | Data quality pipeline — filter, deduplicate, augment real data |
| Nemotron 4 340B | Trained substantially on synthetic instruction data — NVIDIA's landmark disclosure |
| NeMo synthetic data gen | Generate instruction-following datasets using teacher models |
| Omniverse / Replicator | Photorealistic synthetic data for computer vision (robots, defect detection, autonomous vehicles) |
OpenClaw, Red Hat AI / OpenShift, Hermes MLOps
Why it matters: 176,000+ GitHub stars (fastest in history). One developer displaced a million-dollar enterprise category.
| Product | What It Is | NVIDIA Integration |
|---|---|---|
| OpenShift | Red Hat's enterprise Kubernetes — dominant in regulated industries (government, finance, healthcare) | NVIDIA GPU Operator runs natively |
| Red Hat OpenShift AI | ML model serving on OpenShift (formerly RHODS) | NIM deployable as OpenShift workloads |
| InstructLab | Red Hat's open-source fine-tuning using synthetic data (LAB = Large-scale Alignment for chatBots) | NeMo-compatible pipeline |
| Granite Models | IBM/Red Hat open-source LLMs for enterprise tasks | NIM-deployable |
Data Version Control → DVC, Git-LFS
Training Jobs → Slurm + Enroot/Pyxis or Kubernetes
Experiment Tracking → MLflow, Weights & Biases
Model Registry → NGC Catalog, MLflow
CI/CD for ML → GitHub Actions + Helm
Deployment → NIM / Triton Inference Server
Infrastructure Monitoring → DCGM + Prometheus + Grafana
Drift Detection → Arize, NeMo Guardrails monitoring
Cluster Management → NVIDIA Base Command Manager
| MSI Project | NVIDIA Concept | Build/Extend |
|---|---|---|
| ViQi ASSIST (ASR) | NIM inference, real-time, TensorRT | ✅ RealtimeVoice repo (Nemotron vs Whisper benchmark done) |
| ASSIST AI (4 agents) | AIQ Toolkit, multi-agent, Guardrails | Rebuild one agent with AIQ v1.3 + NIM + Guardrails layer |
| DDaaS (sensor fusion) | Edge AI, Jetson, multi-modal | Architecture diagram mapping NVIDIA stack to each sensor |
| Policy RAG | NeMo Retriever, RAG, Vector DB | Add policy search mode to aiq-gsi-demo |
| ROI Calculator | DGX Cloud vs on-prem | Extend ai-infra-advisor with on-prem mode |
The Trust & Governance Layer
| Dimension | The Question | What Breaks Without It |
|---|---|---|
| Explainability | Why did the model produce this output? | Regulators reject black-box decisions; auditors can't investigate |
| Auditability | What was input, output, and reasoning for every decision? | No chain of evidence; unable to investigate failures; non-compliant |
| Compliance | Does the system meet regulatory requirements? | Legal liability, license revocation, agency defunding |
| Safety | Can harmful outputs be prevented? | Model says something illegal/dangerous; reputational damage |
| Data Governance | Who owns the data? How long retained? Who can access? | Privacy violations, GDPR/CJIS breach, data sovereignty failure |
| Layer | NVIDIA Product | What It Does |
|---|---|---|
| Safety / Guardrails | NeMo Guardrails | Topical, safety, security, factual rails — block harmful/off-topic outputs at inference |
| Confidential Compute | H100 CC / NVIDIA CC SDK | Encrypts model weights and data in-use — even cloud provider cannot see data |
| Secure supply chain | AI Enterprise Prod/LTS branches | CVE-monitored containers, 0 high vulnerabilities, monthly patches |
| Sovereign AI | DGX SuperPOD + NIM on-prem | Keep model, data, and inference within national/org boundaries |
| Data protection | FIPS 140-2 containers | Encryption at rest/transit in regulated deployments |
| Regulation | Industry | Key AI Requirement |
|---|---|---|
| CJIS v6.0 | Law enforcement | 1,414 security controls, audit trail, encryption, access logging |
| FedRAMP High | US Government | Continuous monitoring, incident response, ATO authorization |
| EU AI Act | EU deployments | Risk classification, human oversight, transparency for high-risk AI |
| HIPAA | Healthcare | Data minimization, access controls, audit logging |
| NIST AI RMF | US Federal + voluntary | Govern, Map, Measure, Manage — AI risk framework |
| GDPR | EU / Global | Right to explanation for automated decisions, data residency |
"I've operated in the highest AI governance environment outside of nuclear and defense: CJIS v6.0. Every query ViQi ASSIST processes — a real-time check of NCIC, state, county, and DMV databases — generates an immutable audit record: who queried, what was queried, what was returned, and when. 1,414 security controls govern the system. FedRAMP High ATO took 9 months. FIPS 140-2 encryption end-to-end.
The CJIS audit is not a compliance checkbox — it's the evidentiary chain that determines whether a criminal prosecution holds up in court. When an officer queries a suspect's record and the AI gives the wrong answer, someone's life is at risk. That's not a hallucination problem. That's a governance failure.
The GSI opportunity is massive: 95% of enterprise AI projects have no governance layer. They're demos pretending to be production systems. A GSI that walks in with a governance framework, audit toolkit, and NVIDIA Confidential Computing reference architecture will close every deal that point-solution AI vendors can't."
Story Bank
| Story | Questions | Key Metric | Opening Line |
|---|---|---|---|
| ViQi ASSIST — Nemotron ASR | Q1, Q2, Q4 | 43ms / 21× faster than Whisper | "I benchmarked Nemotron against Whisper on an L40S GPU..." |
| ASSIST AI — 98% → 90% | Q3, Q7 | 4 agents fixed what 1 LLM call couldn't | "The original system had a 98% failure rate. Here's why — and how I redesigned it." |
| DDaaS — sensor fusion | Q1, Q5, Q8 | 0–5s detection; $500M federal funding; FIFA 2026 | "We needed to detect a drone before it crossed a stadium boundary..." |
| CJIS v6.0 compliance | Q10 | 1,414 controls; FedRAMP High in 9 months | "Every query generates an immutable audit record. Here's what that architecture looks like." |
| 100+ architects trained | Q6, Q7 | 40% shorter sales cycles; $50M+ deals | "I designed an enablement program for architects who'd never touched AI..." |
| OpenClaw case study | Q9 | 176K stars; VoIP pattern repeating | "In 1999 I was inside the Lucent partner channel when Cisco started the VoIP disruption. OpenClaw is the same pattern." |
| ai-infra-advisor tool | Q4, Q7 | On-prem vs cloud TCO model (github.com/QbitLoop) | "I built the tool that answers this question — let me walk you through it." |
| aiq-gsi-demo on Brev L40S | Q1, Q2, Q3 | NIM + AIQ Toolkit + 96 verified NVIDIA sources; $1.49/hr | "I deployed an NVIDIA Blueprint on an L40S instance this week to prove it works in practice..." |
Agent Engineering
An agent is an LLM + a loop + tools. The loop: Observe → Think → Act → Observe. Tools are functions the LLM can call. Reliability comes from the harness around the loop, not from the model itself.
ReAct Pattern (Reason + Act)
| Step | What happens | Implementation |
|---|---|---|
| Thought | LLM reasons about what to do next | Structured output parsing (JSON mode) |
| Action | LLM selects a tool + arguments | Tool schema in system prompt; function calling API |
| Observation | Tool returns result back to context | Append tool result to conversation history |
| Repeat | LLM decides: done or next action? | Max iterations limit prevents infinite loops |
| Final | LLM synthesizes all observations | Structured final answer extraction |
Agent Harness — What I Build Around Every Agent
| Component | Why it matters | My implementation |
|---|---|---|
| Tool registry | Defines what the agent can do | Pydantic schemas; strict type validation on inputs/outputs |
| Retry logic | LLMs fail ~15% of tool calls on first attempt | 3 retries with exponential backoff + error injection into context |
| Max iterations | Prevents infinite reasoning loops | Hard limit of 10; soft warning at 7 with progress summary |
| Structured output | Parse-safe final answers | Instructor library + Pydantic; fallback to regex extraction |
| Trace logging | Debug which tool was called, with what args, and why | Append each thought/action/observation to trace file |
| Guardrails hook | Block unsafe outputs before they leave | NeMo Guardrails colang rules on final response |
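The retry-logic row above — exponential backoff plus injecting the error back into context so the next reasoning step can adapt — looks like this in miniature. The flaky tool is simulated and delays are computed but not actually slept, purely for illustration.

```python
# Retry harness sketch: exponential backoff + error injection into context.
# The flaky tool is simulated; no real sleeping or API calls.

def call_with_retries(tool, arg, context, max_retries=3, base_delay=1.0):
    delay = base_delay
    for attempt in range(1, max_retries + 1):
        try:
            return tool(arg)
        except Exception as exc:
            # Error injection: the failure text becomes part of the agent's
            # context so it can self-correct instead of blindly repeating.
            context.append(f"attempt {attempt} failed: {exc}; retrying in {delay:.0f}s")
            delay *= 2                  # exponential backoff (simulated)
    raise RuntimeError(f"tool failed after {max_retries} attempts")

calls = {"n": 0}
def flaky_api(arg):
    calls["n"] += 1
    if calls["n"] < 3:                  # fails twice, succeeds on attempt 3
        raise TimeoutError("NCIC API timeout")
    return f"record for {arg}"

context = []
print(call_with_retries(flaky_api, "plate ABC-123", context))
print(context)
```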
Multi-Agent Architecture Patterns
| Pattern | When to use | NVIDIA tool |
|---|---|---|
| Orchestrator + specialists | Complex tasks needing multiple domain experts | AIQ Toolkit + specialized NIM agents |
| Parallel fan-out | Independent sub-tasks that can run simultaneously | async tool dispatch; aggregate results before synthesis |
| Sequential pipeline | Each step depends on previous output | LangGraph stateful graph; node = agent, edge = condition |
| Critic / reviewer | High-stakes output that needs self-checking | Second NIM call with critic prompt; reject if score < threshold |
| Human-in-the-loop | Irreversible actions (API writes, emails, payments) | Approval gate before any write tool; async webhook |
Built a 4-agent system for Motorola Solutions: Orchestrator → [RAG Policy Agent | Case Lookup Agent | Form Gen Agent | Escalation Agent]. Key lesson: the hardest problem was not the LLM — it was tool reliability. NCIC APIs time out; forms have 200+ fields. Built a retry harness + confidence-threshold routing (if RAG <70% confident → escalate to human). Revenue-generating as of Q1 2026.

Context Engineering
Context engineering is the discipline of deciding WHAT knowledge to put in the context window, HOW to compress it, and in WHAT ORDER. A well-engineered context window eliminates the need for RAG on many tasks. Distillation > Retrieval when the knowledge set is bounded and known.
Context Budget Framework
| Layer | Token Budget | What goes here | Priority |
|---|---|---|---|
| System identity | ~2K | Role, persona, behavior rules, output format | P0 — always present |
| Core knowledge | ~20K | Distilled domain facts (not raw docs) | P0 — always present |
| Task examples | ~10K | Few-shot demonstrations, templates | P1 — include for complex tasks |
| Retrieved context | ~30K | RAG chunks for this specific query | P1 — query-time only |
| Conversation history | ~20K | Prior turns, compressed summaries | P2 — sliding window |
| Response buffer | ~8K | Space for model to generate output | P0 — always reserve |
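The budget table above can be expressed as a simple allocator: admit layers in priority order and drop whatever no longer fits. Token counts below are the table's rough budgets, not measured values, and the greedy strategy is a deliberate simplification.

```python
# Context budget allocator sketch: pack layers by priority, trim to the window.
# Budgets mirror the table above; the greedy policy is illustrative only.

LAYERS = [  # (name, budget_tokens, priority) -- P0 must always be present
    ("system_identity",       2_000, 0),
    ("core_knowledge",       20_000, 0),
    ("response_buffer",       8_000, 0),
    ("task_examples",        10_000, 1),
    ("retrieved_context",    30_000, 1),
    ("conversation_history", 20_000, 2),
]

def allocate(window_tokens):
    """Greedily admit layers in priority order; skip what no longer fits."""
    kept, used = [], 0
    for name, budget, prio in sorted(LAYERS, key=lambda l: l[2]):
        if used + budget <= window_tokens:
            kept.append(name)
            used += budget
    return kept, used

kept, used = allocate(64_000)    # a 64K window forces trade-offs
print(kept, used)
kept128, used128 = allocate(128_000)  # a 128K window fits everything
print(len(kept128), used128)
```

Note the greedy artifact at 64K: the large P1 retrieved-context layer gets skipped while the smaller P2 history still fits — a real allocator would also let you shrink a layer rather than drop it.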
Knowledge Distillation Techniques
| Technique | Compression ratio | When to use |
|---|---|---|
| Extractive summarization | 10:1 | Structured docs (manuals, specs) |
| Abstractive distillation | 20:1 | Unstructured prose, research papers |
| Fact tables | 50:1 | Numerical data, product specs, comparisons |
| Decision trees (text) | 30:1 | Troubleshooting guides, workflows |
| Q&A pairs | 15:1 | FAQ content, support docs |
Building a Claude Cowork connector that distills 4.8GB of NVIDIA knowledge into ~80K tokens. No RAG — pure context engineering. The 5 knowledge domains (AIIO training, interview team intel, technical manual, my experience, NVIDIA stack) are each distilled into optimized markdown with specific token budgets. The skill becomes an AI employee that knows everything I know about NVIDIA + GSI without any database.
System Prompt Design Patterns
| Pattern | Use case | Key principle |
|---|---|---|
| Role + Goal + Constraints | Any agent/assistant | Tell the model WHO it is, WHAT it does, what it must NOT do |
| Chain-of-thought seeding | Reasoning tasks | "Think step by step" or structured thought template in system |
| Output format schema | Structured extraction | JSON schema in system prompt enforces parseable output |
| Persona + tone calibration | Customer-facing apps | Explicit persona description with 3 example turns |
| Safety anchors | Production systems | Explicit refusal list beats guardrails as first line of defense |
Memory Systems in AI Agents
Human memory: short-term (working memory), long-term (episodic + semantic), procedural (muscle memory). AI agent memory maps exactly: context window, vector DB, document store, fine-tuned weights. The architecture question is: which tier for which information?
| Memory Type | Human Analogy | AI Implementation | NVIDIA Component | When to Use |
|---|---|---|---|---|
| In-context (working) | Active thought | Conversation history in prompt | KV Cache (TRT-LLM) | Current session; immediate context |
| Episodic (short-term) | Yesterday's events | Session summaries in vector DB | NeMo Retriever + Milvus | Multi-session continuity; user preferences |
| Semantic (long-term) | Domain knowledge | Knowledge base in vector DB | NeMo Retriever + RAG | Static facts; documentation; policies |
| Procedural (implicit) | Skill / habit | Fine-tuned model weights | NeMo Framework LoRA/SFT | Domain vocab, style, specialized reasoning |
Memory Architecture Decision Tree
| Question | If YES | If NO |
|---|---|---|
| Is the info needed only in this session? | In-context memory (append to prompt) | Needs persistence → vector DB |
| Does the info change frequently? | Vector DB (update embeddings) | Stable → consider fine-tuning |
| Is exact retrieval required? | Traditional DB + structured query | Semantic similarity → vector DB |
| Is it behavioral style/domain vocab? | Fine-tune (LoRA) the base model | Keep in external retrieval |
| Is it user-specific preferences? | User profile in episodic vector store | Global knowledge → semantic store |
Context Window Management (Sliding Window)
Long conversations overflow the context window. Strategies: (1) Sliding window — keep last N turns. (2) Summarization — compress older turns into a summary. (3) Hierarchical — compress into layers (turn → session → user history). Claude Code uses compaction summaries — you see this every session reset.
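Strategies (1) and (2) combine naturally: keep the last N turns verbatim and compress everything older into one summary line. The "summarizer" here is a string-truncating stub — a real system would use an LLM call for the compression step.

```python
# Sliding window + summarization sketch. The summarizer is a stub;
# production systems compress older turns with an LLM call.

def compact_history(turns, keep_last=4):
    if len(turns) <= keep_last:
        return turns                       # nothing to compress yet
    older, recent = turns[:-keep_last], turns[-keep_last:]
    summary = (f"[summary of {len(older)} earlier turns: "
               + "; ".join(t[:20] for t in older[:2]) + " ...]")
    return [summary] + recent              # one summary line + recent verbatim

turns = [f"turn {i}: user asked about topic {i}" for i in range(10)]
compacted = compact_history(turns)
print(len(compacted), compacted[0])
```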
Built a file-based persistent memory system for this project: user.md (who Waseem is), feedback.md (what works), project.md (current goals), reference.md (where things are). Each file has structured frontmatter. MEMORY.md is an index loaded every session. This IS context engineering applied to agent memory — 200-line limit on index, semantic content in separate files. Zero vector DB needed.
Agentic Loops
Single-shot LLM calls are not agents. Agents loop: perceive the environment, reason, act, observe the result, reason again. The loop design — how many iterations, when to stop, how to handle failure — determines whether an agent is reliable in production or a demo that breaks on Monday morning.
The Major Loop Patterns
| Pattern | Flow | Best for | Risk |
|---|---|---|---|
| ReAct | Think → Act → Observe → Think... | Tool-using agents with uncertainty | Can spiral in reasoning loops |
| Plan-then-Execute | Plan all steps → Execute in order | Well-defined tasks with known steps | Plan goes stale if environment changes |
| Reflection / Critic | Generate → Critique → Revise | Quality-sensitive outputs (reports, code) | 2x token cost; critic can be wrong |
| RLHF Loop | Generate → Human rates → Reward model learns → Fine-tune | Model behavior alignment | Reward hacking; expensive |
| RLAIF (AI feedback) | Generate → Stronger LLM rates → Reward model → Fine-tune | Scalable alignment without humans | Model collapse if teacher is weak |
| Self-play (synthetic data) | Model generates problems → solves them → trains on correct ones | Math, coding, reasoning skill improvement | Distribution shift from real data |
Loop Failure Modes (and Fixes)
| Failure mode | Root cause | Production fix |
|---|---|---|
| Infinite reasoning loop | No termination condition | Max iteration limit + progress check |
| Wrong tool called | Ambiguous tool description | Better docstrings; few-shot examples in tool schema |
| Hallucinated tool args | Model invents parameter values | Strict Pydantic validation; error fed back to context |
| Lost context mid-loop | Context window fills up | Compress intermediate observations; keep trace summary |
| Irreversible side effect | Action agent called write API | Human-in-the-loop gate for all write operations |
| Cost explosion | Too many LLM calls in loop | Cache repeated tool calls; use smaller model for sub-steps |
Eval Loop — Closing the Quality Feedback Loop
The loop that makes agents better over time: (1) Log every agent trace. (2) Sample failures. (3) Root-cause which loop step failed. (4) Fix: better prompt, better tool, add retry, add critic. (5) Re-run eval set. This loop, not the initial build, is what makes agents production-grade.
ASSIST AI had a 40% failure rate in week 1 of production. Loop analysis: 32% was wrong chunking in RAG retrieval (fixed: semantic chunking), 5% was NCIC API timeout not handled (fixed: retry + fallback message), 3% was hallucinated case numbers (fixed: guardrail on numeric patterns). Loop debugging reduced failure rate to <8% in 3 weeks. The loop trace log was the only tool that made this diagnosable.
Evaluation & RAG Bench
You cannot improve what you cannot measure. For RAG systems: answer correctness is not enough. You need retrieval quality + answer faithfulness + answer relevance — separately. For agents: you need per-step accuracy, not just final answer quality. RAGAS is the framework; LLM-as-judge is the scalable evaluation method.
RAGAS Metrics Framework
| Metric | What it measures | Target | How to improve |
|---|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? (No hallucination) | >0.90 | Better guardrails; citation enforcement |
| Answer Relevancy | Does the answer actually address the question? | >0.85 | Better reranking; query expansion |
| Context Precision | Are retrieved chunks actually relevant? | >0.80 | Better chunking; reranker model |
| Context Recall | Did retrieval find all the relevant info? | >0.75 | Larger k; hybrid search |
| Answer Correctness | Is the answer factually correct? | >0.85 | Better source docs; fact verification |
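A rough, hand-rolled proxy for the Faithfulness row: the fraction of answer sentences whose content words all appear in the retrieved context. RAGAS computes this with an LLM judge; this lexical version only sketches the idea and will miss paraphrases.

```python
# Lexical faithfulness proxy -- a sketch of the metric's intent, NOT the
# RAGAS implementation (which uses an LLM judge).
import re

def content_words(text):
    stop = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in stop}

def faithfulness_proxy(answer, context):
    ctx_words = content_words(context)
    sentences = [s for s in re.split(r"[.!?]", answer) if s.strip()]
    grounded = sum(1 for s in sentences if content_words(s) <= ctx_words)
    return grounded / len(sentences)

ctx = "The H100 SXM has 80 GB of HBM3 memory and 3.35 TB/s bandwidth."
good = "The H100 SXM has 80 GB of HBM3. Bandwidth is 3.35 TB/s."
bad  = "The H100 has 80 GB. It launched with 256 GB of GDDR7."
print(faithfulness_proxy(good, ctx), faithfulness_proxy(bad, ctx))
```

The ungrounded second sentence in `bad` (invented specs not present in the context) is exactly the kind of hallucination the real metric is built to catch.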
RAG Eval Bench — My GitHub Repo
Benchmarks 4 RAG approaches head-to-head: naive chunking, semantic chunking, hybrid search, and reranking. Runs RAGAS on all 4. Shows that adding a reranker (NV-RerankQA) improves context precision by ~20% with minimal latency cost. Used to justify the ASSIST AI architecture decisions.
LLM-as-Judge Pattern
| Step | What happens | Tool |
|---|---|---|
| Generate golden set | Create 50-200 Q&A pairs from ground truth docs | Nemotron or Claude to generate; human spot-check |
| Run the system | Feed each Q through the RAG/agent pipeline | Automated test harness |
| Score with LLM judge | Stronger model rates: correct, partially correct, wrong | Claude Opus or GPT-4 as judge; structured output |
| Compute RAGAS | Automated metric calculation | RAGAS library + Milvus + NIM embeddings |
| Root-cause failures | Cluster failing cases by failure mode | K-means on failure embeddings; manual review of top clusters |
RAG-Eval-Bench finding that changed ASSIST AI: naive 1000-char chunking scored 0.42 faithfulness. Semantic chunking (split on section boundaries) scored 0.87. The difference was that CJIS policy documents have cross-references — "as defined in Section 5.3.2" — which naive chunking split across chunk boundaries. Semantic chunking kept policy sections intact. One config change, 2x improvement.
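The chunking finding above is easy to demonstrate: fixed-size chunking can sever a cross-reference from its section, while splitting on section boundaries keeps each section whole. The document text below is invented for illustration.

```python
# Naive fixed-size chunking vs section-boundary ("semantic") chunking.
# Document text is invented; the split behavior is the point.

doc = ("## Section 5.3.1\nAudit records must be retained for one year "
       "as defined in Section 5.3.2.\n"
       "## Section 5.3.2\nRetention means immutable storage with access logging.")

def naive_chunks(text, size=80):
    # Fixed-size slices: cross-references can land in a different chunk
    # than the sentence that depends on them.
    return [text[i:i + size] for i in range(0, len(text), size)]

def semantic_chunks(text):
    # Split on section headers so each section stays intact.
    parts = text.split("## ")
    return ["## " + p.strip() for p in parts if p.strip()]

print(len(naive_chunks(doc)), len(semantic_chunks(doc)))
print(all(c.startswith("## Section") for c in semantic_chunks(doc)))
```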
Voice AI & Multimodal
Voice AI has a hard UX cliff: <3s total round-trip feels responsive; >5s feels broken. The pipeline: ASR (speech to text) + LLM inference + TTS (text to speech). Each step adds latency. Streaming is not optional — it is the only architecture that crosses the 3s threshold reliably.
Voice AI Pipeline — Components and Latencies
| Stage | Technology | Latency target | NVIDIA solution |
|---|---|---|---|
| Wake word / VAD | Voice Activity Detection | <50ms | NVIDIA Riva VAD (on-device) |
| ASR (speech to text) | Streaming Whisper or Parakeet | <200ms TTFT | Nemotron ASR: 43ms (21x faster than Whisper) |
| LLM inference | NIM + TRT-LLM | <800ms TTFT | NIM streaming; first token in <500ms |
| TTS (text to speech) | Streaming synthesis | First audio <300ms | NVIDIA Riva TTS; ElevenLabs for quality |
| Total (streaming) | End-to-end streamed | <3.3s perceived | ViQi ASSIST: 3.3s on Azure GPU nodes |
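A back-of-envelope check on why streaming is non-negotiable: with streaming, the user hears first audio after each stage's first-token latency; without it, every stage must fully finish before the next starts. The non-streaming completion times below are assumptions for illustration, not measured values.

```python
# Streaming vs sequential latency budget. First-token targets mirror the
# table above; the full-completion times are hypothetical.

STAGE_FIRST_TOKEN_MS = {   # streaming: time until each stage emits output
    "vad": 50, "asr": 200, "llm": 800, "tts": 300,
}
STAGE_FULL_MS = {          # non-streaming: each stage runs to completion
    "vad": 50, "asr": 1500, "llm": 3000, "tts": 1200,
}

perceived_streaming = sum(STAGE_FIRST_TOKEN_MS.values()) / 1000
sequential_total = sum(STAGE_FULL_MS.values()) / 1000

print(f"streamed first-audio: {perceived_streaming:.2f}s")  # under the 3s cliff
print(f"sequential pipeline:  {sequential_total:.2f}s")     # over the 5s "broken" line
```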
Multimodal Architecture Patterns
| Modality combo | Architecture | Use case | My example |
|---|---|---|---|
| Speech + Text | ASR → LLM → TTS pipeline | Voice assistants, dispatch, call centers | ViQi ASSIST — law enforcement dispatch AI |
| Vision + Text | Vision encoder + LLM (VLM) | Document OCR, scene understanding | DDaaS drone detection — camera + radar fusion |
| Video + Text | Frame sampling + VLM | Surveillance, body cam review, training | SmartVideo + SmartEvidence pipeline |
| Sensor + Text | Multi-modal encoder + LLM | IoT, robotics, physical AI | DDaaS: acoustic + radar + optical fusion |
| Voice + RAG | ASR → RAG → TTS | Voice-enabled knowledge search | SmartSearch / ASSIST AI voice mode |
QbitLoop/RealtimeVoice GitHub repo: benchmarked NVIDIA Nemotron ASR at 43ms vs Whisper at 900ms on the same audio clips — 21x faster. Architecture: WebSocket streaming from browser → GPU inference → streaming response back. Key finding: Whisper's latency is not model inference time, it's the encoder's non-streaming architecture. Nemotron uses streaming CTC decoder — first token in 43ms regardless of utterance length.
ViQi ASSIST — Full Architecture Story
Platform: APX Next radio (hardened Android) → ViQi Voice Gateway (Azure) → NLU → Intent Router → [NCIC agent | policy RAG agent | translation agent] → TTS → speaker. Latency: 3.3s 95th percentile. Compliance: CJIS v6.0 (1,414 controls), H100 Confidential Computing TEE. Scale: NTECC, CBP deployments. SmartSearch GA April 16, 2026.
AIIO Associate Exam Prep
38% Essential AI Knowledge (Units 1–6) | 40% AI Infrastructure (Units 7–12) | 22% AI Operations (Units 13–14)
Infrastructure is the biggest section. Focus there first.
1 — The 4-Step AI Workflow (always tested)
| Step | What happens | NVIDIA tool | Key detail |
|---|---|---|---|
| 1. Data Prep | Clean, label, augment datasets | RAPIDS, NeMo Data Curator | GPU-accelerated Python; most time-consuming in practice |
| 2. Training | Adjust model weights via gradient descent | PyTorch / TensorFlow / NeMo | Most compute-intensive step; uses FP8/FP16 mixed precision |
| 3. Optimization | Compress + optimize for target hardware | TensorRT (TRT) | Quantization, pruning, kernel fusion → 5x faster inference |
| 4. Inference / Deploy | Run model on new inputs in production | Triton Inference Server + NIM | Multi-GPU, multi-node, load balancing, multi-framework |
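The Optimization row lists quantization as one of TensorRT's techniques. A minimal pure-Python sketch of symmetric INT8 quantization shows the idea conceptually (this is not the TRT API, just the arithmetic behind the memory/bandwidth win):

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: map floats in [-max, max] onto the
    integer range [-127, 127]. Storing 1 byte instead of 4 is where the
    4x memory and bandwidth saving comes from."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats; error is bounded by scale/2."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.03, 1.0]
q, s = quantize_int8(w)
print(q)                 # integer codes, one byte each
print(dequantize(q, s))  # values round-trip with small error
```

Pruning and kernel fusion are separate techniques; quantization alone is typically the largest contributor to the inference speedup.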
2 — GPU Architectures (40% section, exam-critical)
| Architecture | GPU | Transistors | Key features | MIG? | Conf.Compute? |
|---|---|---|---|---|---|
| Blackwell | B200 | 208B | 2nd-gen Transformer Engine, FP8/FP4, 5th-gen NVLink (576 GPUs), RAS engine, decompression engine | Yes | Yes (built-in) |
| Hopper | H100 | 80B | 1st-gen Transformer Engine, FP8, 4th-gen NVLink (900GB/s), 4th-gen Tensor Cores | Yes (up to 7) | Yes (built-in) |
| Ada Lovelace | L40S | — | 4th-gen Tensor Cores, gaming+data center inference, advanced video accel, 2x power efficiency vs prior gen | No | No |
| Grace CPU | ARM CPU | — | NVIDIA's first data center CPU; ARM-based; large memory BW; HPC, genomics, quantum chemistry | N/A | N/A |
| Grace Hopper | GH200 | — | Grace CPU + H100 GPU via NVLink-C2C; 900GB/s bidirectional; unified memory; recommender systems | Yes | Yes |
| Grace Blackwell | GB200 | — | Grace CPU + 2x B200 via NVLink; 900GB/s; unified memory; trillion-parameter LLMs | Yes | Yes |
3 — DGX Systems Specs (know the numbers)
| System | GPUs | Key specs | Use case |
|---|---|---|---|
| DGX H100 | 8x H100 | 4x NVSwitch, 10x ConnectX-7 (400Gbps each = 1TB/s peak), 2TB RAM, 30TB NVMe, 32 petaFLOPS FP8 | Enterprise training + inference |
| DGX B200 | 8x B200 | 5th-gen NVLink, 72 petaFLOPS FP8 training, 144 petaFLOPS FP4 inference, 1.4TB GPU memory | Frontier LLM training + inference |
| GB200 NVL72 | 72x Blackwell | 36 Grace CPUs + 72 B200s; liquid-cooled; 130TB/s compute fabric; "exaflop AI supercomputer in a rack" | Trillion-parameter model training |
| DGX BasePOD | 2–16 nodes | DGX H100 or B200 nodes + QM9700 switch (compute) + SN5600 (GPU-to-GPU Ethernet) + SN4600 (storage/mgmt) + SN2201 (out-of-band) | On-prem integrated AI solution |
| DGX SuperPOD | Up to 127 | Scalable units (SU); recommended for GPT-4 scale LLM training; deployed at customer DCs + CSPs | Hyperscale AI training |
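The DGX H100 row's "10x ConnectX-7 (400Gbps each = 1TB/s peak)" can be reproduced with unit conversion, assuming the peak counts bidirectional traffic (400 Gb/s in each direction per NIC):

```python
def nic_peak_tb_per_s(num_nics, gbits_per_direction, bidirectional=True):
    """Convert NIC line rate to aggregate TB/s.
    400 Gb/s = 50 GB/s per direction; x2 when counting both directions."""
    gbytes = gbits_per_direction / 8 * (2 if bidirectional else 1)
    return num_nics * gbytes / 1000

print(nic_peak_tb_per_s(10, 400))  # 1.0 TB/s -- matches the DGX H100 row
```

The Gb-to-GB divide-by-8 step is a frequent source of off-by-8x errors when comparing network and memory bandwidth figures.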
4 — Interconnects: NVLink vs NVSwitch vs PCIe (multiple exam questions)
| Technology | What it does | Bandwidth | Exam key fact |
|---|---|---|---|
| NVLink | High-speed chip-to-chip: GPU-to-GPU and GPU-to-CPU | H100: 900GB/s; B200: 1.8TB/s (5th gen), scales to 576 GPUs | Overcomes PCIe bottleneck for scale-up multi-GPU |
| NVSwitch | Fabric switch enabling all-to-all GPU communication at full bandwidth | Full bisection bandwidth | Enables ALL GPUs to talk to each other simultaneously (no bottleneck) |
| PCIe | Traditional server bus, CPU-to-GPU | ~128 GB/s (PCIe 5.0) | The bottleneck NVLink replaces for GPU-to-GPU communication |
| InfiniBand | Multi-node compute fabric (node-to-node) | NDR: 400Gbps per port | Lossless, RDMA-native, ultra-low latency; QM9700 switch = BasePOD compute |
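The "PCIe bottleneck" claim in the table can be made concrete with the bandwidth numbers above. The 40GB payload is an illustrative assumption, not a figure from the source:

```python
def transfer_ms(gigabytes, gb_per_s):
    """Time to move a payload over a link at the given bandwidth."""
    return gigabytes / gb_per_s * 1000

payload_gb = 40  # illustrative gradient/activation payload
print(transfer_ms(payload_gb, 900))  # NVLink (H100): ~44 ms
print(transfer_ms(payload_gb, 128))  # PCIe 5.0:      ~313 ms, ~7x slower
```

That ~7x gap per transfer, repeated every training step, is why scale-up multi-GPU boxes use NVLink/NVSwitch rather than the server's PCIe bus.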
5 — Legacy Network vs AI Network (always on the exam)
| Characteristic | Legacy/Cloud Network | AI-Optimized Network |
|---|---|---|
| Protocol | TCP | RDMA (Remote Direct Memory Access) |
| Jitter tolerance | High jitter tolerated | No jitter — every GPU waits for slowest |
| Oversubscription | Oversubscribed (shared bandwidth) | Nonblocking (full bisection bandwidth) |
| Traffic pattern | North-south (client ↔ server) | East-west (GPU ↔ GPU, all-to-all) |
| NVIDIA product | Spectrum-4 (Ethernet) | QM9700 InfiniBand or Spectrum-4 AI |
RoCE (RDMA over Converged Ethernet) encapsulates InfiniBand transport packets inside Ethernet frames. Enables RDMA performance on Ethernet infrastructure. Key when a customer has Ethernet investment but needs RDMA for AI training.
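The encapsulation can be pictured as nested structures: the InfiniBand transport segment (carrying RDMA semantics) rides inside an Ethernet frame. A toy model with made-up field values (note that RoCE v2 additionally inserts UDP/IP headers between the two layers):

```python
from dataclasses import dataclass

@dataclass
class IBTransport:
    payload: bytes  # the RDMA operation data, semantics unchanged

@dataclass
class EthernetFrame:
    dst_mac: str
    src_mac: str
    inner: IBTransport  # IB transport preserved on an Ethernet wire

frame = EthernetFrame("02:00:00:00:00:01", "02:00:00:00:00:02",
                      IBTransport(b"rdma-write"))
print(frame.inner.payload)  # the RDMA payload, untouched by the outer frame
```

The point for the exam: RDMA semantics survive intact; only the wire format changes to Ethernet.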
6 — Management Tools Mapping (Exam question: match tool to function)
| Tool | Function | Layer | Key detail |
|---|---|---|---|
| Ansible | Provisioning automation | Infrastructure | Install OS, configure servers — automation layer |
| Terraform | Infrastructure as code | Infrastructure | Define data center resources declaratively |
| Foreman | Server lifecycle management | Infrastructure | Provisioning + OS deployment + reporting |
| DCGM | GPU health monitoring | Resource monitoring | Telemetry → DCGM Exporter → Prometheus → Grafana |
| Prometheus | Metrics collection/storage | Resource monitoring | Scrapes DCGM Exporter; time-series DB |
| Grafana | Visualization dashboards | Resource monitoring | Connects to Prometheus; GPU dashboards |
| Slurm | Job scheduling (HPC/bare metal) | Workload management | Open source; batch + interactive; priority/preemption; no kernel mods |
| Kubernetes | Container orchestration | Workload management | Microservices; scale up/down; advanced features built-in |
| Base Command Manager | On-prem cluster management | All three layers | Deploys K8s + Slurm + Jupyter; job monitoring + GPU metrics |
| Base Command Platform | DGX Cloud web interface | Workload management | Submit/monitor AI training jobs in browser; web GUI for DGX Cloud |
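The DCGM → Prometheus → Grafana chain in the table works because DCGM Exporter publishes metrics in Prometheus's text exposition format, which Prometheus scrapes over HTTP. A sketch of what such a scrape target emits (the metric names model real DCGM Exporter fields like DCGM_FI_DEV_GPU_UTIL; the renderer itself is illustrative):

```python
def exposition(metrics):
    """Render gauge samples in Prometheus text exposition format:
    name{label="value"} sample -- one line per sample."""
    lines = []
    for name, labels, value in metrics:
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

samples = [
    ("DCGM_FI_DEV_GPU_UTIL", {"gpu": "0"}, 97),  # utilization %
    ("DCGM_FI_DEV_GPU_TEMP", {"gpu": "0"}, 64),  # temperature C
]
print(exposition(samples))
```

Grafana then queries Prometheus's time-series store over these same metric names, which is why the three tools appear as one unit on the exam.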
7 — Kubernetes vs Slurm (Key exam distinction)
| Dimension | Kubernetes | Slurm |
|---|---|---|
| Workload type | Containers / microservices | Bare metal HPC / batch jobs |
| Scaling | Scale up and down dynamically | No dynamic scale-down (HPC cluster) |
| Advanced features | Via MLOps plugins | Built-in (priority, preemption, QoS) |
| Inferencing support | Yes (scales to demand) | No (not designed for inference serving) |
| NVIDIA integration | GPU Operator + Network Operator | enroot + Pyxis for containers |
| When to use | Inference serving, MLOps, microservices | Model training, simulation, HPC workloads |
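The decision rule in the table above can be encoded as a small helper, useful as a memory aid (the workload flags are illustrative, not any NVIDIA API):

```python
def pick_scheduler(workload: dict) -> str:
    """Encode the K8s-vs-Slurm table: serving/autoscaling workloads go
    to Kubernetes; bare-metal batch/HPC jobs go to Slurm."""
    if workload.get("serving") or workload.get("autoscale"):
        return "kubernetes"   # inference scales with demand
    if workload.get("batch") or workload.get("hpc"):
        return "slurm"        # priority, preemption, QoS built in
    return "kubernetes"       # default: containerized microservices

print(pick_scheduler({"serving": True}))             # kubernetes
print(pick_scheduler({"batch": True, "hpc": True}))  # slurm
```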
8 — MIG Deep Dive (22% operations section)
Multi-Instance GPU — partition one supported GPU (A100, H100, or B200) into up to 7 fully isolated instances. Each instance gets its own HBM (memory), cache, and compute cores — hardware-enforced isolation, not virtualization. Does NOT increase power consumption. MIG Manager handles this in Kubernetes clusters.
| Exam question | Answer |
|---|---|
| Max MIG instances per GPU? | 7 (on H100 or B200) |
| Does MIG increase power consumption? | No — key exam trap |
| Which GPUs support MIG? | A100 (Ampere), H100 (Hopper), B200 (Blackwell) — NOT Ada Lovelace/L40S |
| MIG vs NVSwitch — what's the difference? | MIG = GPU partitioning; NVSwitch = multi-GPU all-to-all communication fabric |
| What manages MIG in Kubernetes? | MIG Manager (separate from Network Operator) |
| Why use MIG? | Multi-tenant inference — 7 isolated workloads on one GPU; maximize utilization |
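The partitioning model behind these answers can be sketched as a toy: up to 7 instances, each owning a dedicated memory slice. The 10GB slice size mirrors the 1g.10gb profile of an 80GB H100; the code itself is illustrative, not the real MIG tooling:

```python
MAX_INSTANCES = 7   # hard limit per GPU (H100/B200)
SLICE_GB = 10       # illustrative: 1g.10gb profile on an 80GB H100

def partition(requested: int):
    """Hardware-enforced isolation: each instance owns its own HBM
    slice and compute cores; more than 7 is simply impossible."""
    if not 1 <= requested <= MAX_INSTANCES:
        raise ValueError(f"MIG supports 1..{MAX_INSTANCES} instances")
    return [{"instance": i, "mem_gb": SLICE_GB} for i in range(requested)]

print(len(partition(7)))  # 7 isolated tenants on one GPU, no extra power draw
```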
9 — Cloud Consumption Models (quick exam points)
| Model | What you manage | Example | Exam hook |
|---|---|---|---|
| On-prem | Everything | DGX BasePOD in your DC | Full control, full cost |
| IaaS | OS + software + data | EC2 with A100 instance | You manage the OS |
| PaaS | Data + applications | Managed Kubernetes | Managed Kubernetes = PaaS — exam answer |
| SaaS | Nothing (just use it) | DGX Cloud | Least management = SaaS — exam answer |
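The responsibility split in the table reduces to a lookup, which makes the two guaranteed exam answers (managed K8s = PaaS, DGX Cloud = SaaS) easy to check mentally. A sketch:

```python
# What the customer still manages under each consumption model,
# taken directly from the table above.
CUSTOMER_MANAGES = {
    "on-prem": ["hardware", "os", "software", "data", "apps"],
    "iaas": ["os", "software", "data", "apps"],
    "paas": ["data", "apps"],
    "saas": [],
}

def you_manage(model: str):
    return CUSTOMER_MANAGES[model.lower()]

print(you_manage("PaaS"))  # ['data', 'apps'] -- managed Kubernetes
print(you_manage("SaaS"))  # []              -- least management, DGX Cloud
```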
Also know these terms: Data Locality | Data Sovereignty | Hybrid IT Strategies | Real-Time Performance
10 — High-Frequency Practice Questions
| # | Question | Answer |
|---|---|---|
| 1 | What are the 4 steps of the AI workflow? | Data Prep → Training → Optimization → Inference/Deployment |
| 2 | What tool maps to the Optimization step? | TensorRT (TRT) |
| 3 | What tool maps to the Deployment step? | Triton Inference Server |
| 4 | Key difference: Legacy vs AI network? | Legacy = TCP; AI = RDMA; Legacy = oversubscribed; AI = nonblocking |
| 5 | What does NVSwitch do (vs NVLink)? | NVLink = chip-to-chip GPU-to-GPU; NVSwitch = all-to-all fabric at full bandwidth for all GPUs simultaneously |
| 6 | What is RoCE? | RDMA over Converged Ethernet — InfiniBand packets in Ethernet frames for RDMA over Ethernet |
| 7 | Managed Kubernetes = which cloud model? | PaaS |
| 8 | Minimal management cloud model? | SaaS |
| 9 | DCGM function? | GPU health monitoring and management in clusters; simplifies admin; improves reliability |
| 10 | Ansible vs DCGM vs Slurm — what does each do? | Ansible = provisioning; DCGM = GPU monitoring; Slurm = workload scheduling |
| 11 | Max MIG instances per GPU? | 7 |
| 12 | Does MIG increase power? | No |
| 13 | Which GPUs support MIG? | A100 (Ampere), H100 (Hopper), B200 (Blackwell) |
| 14 | DGX H100 GPU count + key BW? | 8x H100; 10x ConnectX-7 = 1TB/s peak network bandwidth |
| 15 | GB200 NVL72: how many GPUs? | 72x Blackwell + 36x Grace CPUs; "exaflop supercomputer in a rack" |
| 16 | BasePOD compute switch model? | QM9700 (NDR InfiniBand 400Gbps) |
| 17 | Primary storage requirement for AI? | Large capacity with focus on READ IO performance (training = repeated reads of datasets) |
| 18 | DPU role (3 words)? | Offload + Accelerate + Isolate infrastructure tasks from CPU |
| 19 | Web interface for DGX Cloud? | Base Command Platform |
| 20 | What are NVIDIA AI Foundations? | NeMo (text/LLM) + Picasso (visual) + BioNeMo (biology) — managed services on DGX Cloud |
| 21 | vGPU: true statements (select 2)? | Multiple VMs share a single GPU AND a single VM can access multiple GPUs |
| 22 | Cooling options for GPUs (select 2)? | Water cooling + Air cooling |
| 23 | 3 data center resources to consider? | Power + Cooling + Space |
| 24 | True about accelerated computing (select 3)? | GPUs use less energy over time; MIG boosts utilization without more power; GPUs handle compute-intensive functions |
| 25 | Blackwell: what generation NVLink? How many GPUs? | 5th-gen NVLink; scales to 576 GPUs |
Infrastructure is 40% — master GPU architectures, DGX specs, and networking first. Study the management tools mapping (Ansible/DCGM/Slurm) as a unit — it appears as a matching question. MIG questions are tricky: the exam tries to confuse MIG (partitioning) with NVSwitch (fabric) and GPU Operator (K8s software). Know RoCE definition exactly. Kubernetes = PaaS, DGX Cloud = SaaS are guaranteed questions.
NVIDIA Product Cheat Sheet
| Product | Category | One-Line | When to Recommend |
|---|---|---|---|
| NIM | Inference | Any model as OpenAI-compatible API | Always — fastest time to inference |
| NeMo Framework | Training | Train/fine-tune/customize LLMs end-to-end | Domain-specific models, SFT, LoRA |
| NeMo Guardrails | Safety | Safety/compliance rails at inference | Any production deployment |
| NeMo Retriever | RAG | Semantic search + vector DB integration | Knowledge-grounded RAG pipelines |
| AIQ Toolkit | Agents | NVIDIA-native multi-agent orchestration (v1.3) | Agentic apps on NVIDIA stack |
| Triton Inference Server | Serving | Multi-framework model serving for production | High-throughput, multi-model environments |
| TensorRT / TRT-LLM | Optimization | GPU inference optimization (5× speedup) | Latency-sensitive production inference |
| DGX Cloud | Cloud | Multi-node AI training as a service | Training without capex; bursty workloads |
| AI Enterprise | Platform | Production-grade software platform with SLA | Any enterprise going to production |
| Base Command Manager | Ops | Cluster provisioning + Slurm/K8s orchestration | On-prem DGX cluster management |
| DCGM | Monitoring | GPU telemetry, health checks, diagnostics | Any managed GPU cluster |
| NGC Catalog | Registry | GPU-optimized containers, models, Helm charts | Starting point for any NVIDIA deployment |
| NVLink 5 | Networking | 1.8 TB/s GPU-to-GPU (within node) | Multi-GPU training, large model inference |
| InfiniBand NDR | Networking | 400 Gb/s multi-node training fabric | Multi-node DGX training clusters |
| BlueField DPU | Infrastructure | Offload networking/storage from CPU | Large-scale inference, zero-trust security |
| GPU Operator | K8s | Automates NVIDIA software on Kubernetes | Any K8s cluster with NVIDIA GPUs |
| MIG | Virtualization | Partition one GPU into isolated instances | Multi-tenant inference, shared infrastructure |
| Omniverse / Replicator | Synthetic Data | Photorealistic synthetic data for computer vision | CV training, robotics, defect detection |
| H100 Confidential Computing | Security | Encrypt data + model in-use (TEE) | Healthcare, finance, government regulated AI |