| Q | Topic | One-Line Answer |
|---|---|---|
| Q1 | 5-Layer Cake | Energy → Chips → Infra → Models → Apps; NVIDIA plays all 5 |
| Q2 | Model Building | Curate → Train → Customize → Deploy; RAG adds knowledge without retraining |
| Q3 | Agents | LLM + loop + tools; specialization + orchestration solves the reliability problem |
| Q4 | Tokenomics/Energy | Tokens/watt = new miles/gallon; B200 is 4x H100; on-prem wins >60% utilization |
| Q5 | Infrastructure | GPU = parallel math at scale; InfiniBand for training; NVMe local cache |
| Q6 | Skills & ISVs | The opportunity is the bridge between existing ISV systems and AI orchestration layers |
| Q7 | Influence to Buy | Pain → Proof → Path → Partner; never pitch features before finding the failure |
| Q8 | Synthetic Data | Rare events + privacy + bias — the flight simulator principle |
| Q9 | OpenClaw/OpenShift | VoIP displacement repeating; OpenShift + GPU Operator = NVIDIA's enterprise K8s play |
| Q10 | Trust & Governance | Explainability + Audit + Compliance + Safety + Data Gov; CJIS is my proof |
The AI 5-Layer Cake
| # | Layer | Bakery Analogy | AI Reality | NVIDIA |
|---|---|---|---|---|
| L1 | Energy | Electricity powering the ovens | Data center power + cooling. H100 = 700W. Sovereign AI = your own power plant. | DGX-Ready DC Program, liquid cooling |
| L2 | Chips | The industrial ovens — faster = more cakes | GPU (parallel) vs CPU (sequential). GPU bakes 10,000 at once. | Blackwell B200, H100, L40S, Grace |
| L3 | Infrastructure | Kitchen layout — ovens, fridges, conveyors, staff network | DGX systems, NVLink, InfiniBand, NVMe storage, K8s clusters | DGX BasePOD, SuperPOD, GB200 NVL72 |
| L4 | Models | Master recipe + trained head baker | Foundation Models (Llama, Nemotron). Weights = recipe. Training = teaching the baker. | NeMo, AI Foundry, Nemotron family |
| L5 | Applications | Finished cake on the customer's table | ChatGPT, Copilot, ViQi ASSIST, enterprise RAG copilot | NIM, AI Enterprise, Blueprints, Triton |
How a Model Like GPT Is Built
Data Curation → NeMo Data Curator
Pre-training → NeMo Framework + DGX Cloud
Fine-tuning → NeMo SFT, PEFT (LoRA), RLHF
Inference → Triton + TensorRT-LLM
Deployment → NIM (OpenAI-compatible API)
RAG → NeMo Retriever + Milvus/pgvector
Safety → NeMo Guardrails
| Term | One-Line Definition | NVIDIA Solution |
|---|---|---|
| Tokens | ~0.75 words; the unit everything is measured in | — |
| Context Length | How much text the model holds "in mind" (e.g., 128K tokens) | NeMo long-context training |
| Training | Model learns by adjusting billions of parameters to minimize prediction error | NeMo + DGX Cloud |
| Inference | Frozen model generates output — milliseconds | Triton + TensorRT + NIM |
| Reasoning | Model "thinks step by step" before answering (Chain-of-Thought) | Nemotron Reasoning models |
| RAG | Retrieve relevant context from knowledge base; inject before generating | NeMo Retriever + NIM |
| SFT | Supervised Fine-Tuning — labeled input/output pairs teach a task | NeMo SFT |
| LoRA / PEFT | Fine-tune <1% of parameters — comparable quality, ~10× less compute | NeMo PEFT |
| MoE | Mixture of Experts — only activate relevant "specialist" neurons per token | Mixtral, Nemotron MoE |
| Mamba | State Space Model alternative to Transformer — better for very long sequences | Research/emerging |
| KV Cache | Store previous token computations — avoid recomputing per query | TensorRT-LLM optimization |
| Vector DB | Store embeddings (meaning), not text — semantic similarity search | Milvus, pgvector, NeMo Retriever |
| Chunking | Break large documents into smaller pieces for RAG retrieval | Preprocessing pipeline |
| Hallucination | Model generates plausible but incorrect output | NeMo Guardrails, RAG grounding |
| Quantization | Reduce model precision (FP32→INT8/FP4) — smaller, faster | TensorRT, NIM optimizations |
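A toy sketch of the Quantization entry above — symmetric INT8 quantization with a single scale factor. This only illustrates the idea; TensorRT's actual calibration and per-channel schemes are far more sophisticated.

```python
# Toy symmetric INT8 quantization -- illustrates the glossary entry,
# not TensorRT's actual implementation.

def quantize_int8(weights):
    """Map FP32 weights onto the INT8 range [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.82, -1.31, 0.05, 2.54, -0.77]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# INT8 storage is 4x smaller than FP32; round-trip error is bounded by
# half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}, max round-trip error={max_err:.4f}")
```

The 4× memory reduction (32 bits → 8 bits per weight) is exactly why quantized models fit on smaller GPUs and run faster.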
How Agents Are Built
| Framework | Best For | Key Feature | NVIDIA Integration |
|---|---|---|---|
| LlamaIndex | RAG-heavy doc pipelines | Deep retrieval primitives | NeMo Retriever compatible |
| CrewAI | Role-based multi-agent teams | Agent personas, task delegation | NIM as LLM backend |
| LangGraph | Complex state machines | Graph-based, supports cycles | NIM as LLM backend |
| AutoGen | Human-in-the-loop conversational agents | Chat-based coordination | NIM as LLM backend |
| NVIDIA AIQ Toolkit | Production NVIDIA-native agents | Blueprints, NIM-native, observability | Native |
User Input
↓
Orchestrator Agent (LLM: "what tool do I call next?")
↓
Tool Call (JSON schema → API / code / DB / sub-agent)
↓
Tool Result → back to Orchestrator
↓
Next decision → repeat until task complete
↓
Final Response (+ NeMo Guardrails check)
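The loop above can be sketched in a few lines. The "LLM" here is a canned stub (in a real system it would be a NIM/OpenAI-style chat call), and the tool names and plan are hypothetical — the point is the shape of the loop and the hard iteration limit.

```python
# Minimal sketch of the orchestrator loop. Tool names and the stubbed
# "LLM" policy are hypothetical.

TOOLS = {
    "lookup_ticket": lambda arg: f"ticket {arg}: database timeout",
    "draft_message": lambda arg: f"DRAFT: please investigate '{arg}'",
}

def fake_llm(history):
    """Stub policy: decide the next tool call, or finish."""
    if not any("ticket" in h for h in history):
        return ("lookup_ticket", "INC-42")
    if not any("DRAFT" in h for h in history):
        return ("draft_message", "database timeout")
    return ("FINISH", history[-1])

def run_agent(max_iterations=10):
    history = []
    for _ in range(max_iterations):    # hard limit prevents infinite loops
        action, arg = fake_llm(history)
        if action == "FINISH":
            return arg                 # final response
        result = TOOLS[action](arg)    # tool call
        history.append(result)         # observation fed back to the loop
    raise RuntimeError("max iterations reached without finishing")

print(run_agent())
```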
Trigger: New email arrives (IMAP / Graph API webhook)
↓
[Classifier Agent] LLM classifies: urgent/routine/spam/action-required
Extracts: sender, intent, deadline, entities
↓
[Router] ──branches──▶ [Draft Reply Agent] RAG on email history → human review
├──▶ [Calendar Agent] parse meetings → check availability
├──▶ [Task Agent] extract actions → push to Jira/Asana
└──▶ [Escalation Agent] flag VIP/time-critical → notify human
↓
[Review Gate] human approval for any outbound action
↓
[Execution] send reply / create event / create task
| Failure Mode | Cause | Fix |
|---|---|---|
| Wrong tool called | Ambiguous tool description | Better docstrings, few-shot examples |
| Infinite loop | No termination condition | Max iterations + circuit breakers |
| Hallucinated tool args | Model invents parameters | Structured output (JSON schema enforcement) |
| Compounding errors | Each step amplifies upstream mistake | Checkpoint validation, human-in-loop gates |
| Context overflow | History exceeds window | Summarization, sliding window, memory compression |
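The "hallucinated tool args" fix from the table can be sketched as schema validation before execution: reject invented or mistyped parameters and feed the error back instead of calling the API. Hand-rolled validation here for illustration; production code would use Pydantic or JSON Schema.

```python
# Validate tool-call arguments against a declared schema before executing.
# The tool name and schema are hypothetical examples.

TOOL_SCHEMA = {
    "name": "create_ticket",
    "params": {"title": str, "priority": int},   # required params + types
}

def validate_args(schema, args):
    errors = []
    for param, typ in schema["params"].items():
        if param not in args:
            errors.append(f"missing required param '{param}'")
        elif not isinstance(args[param], typ):
            errors.append(f"'{param}' must be {typ.__name__}")
    for param in args:
        if param not in schema["params"]:
            errors.append(f"unknown param '{param}' (hallucinated?)")
    return errors

# A model-invented argument ("assignee") and a type error are both caught
# before any API call happens:
bad_call = {"title": "DB timeout", "priority": "high", "assignee": "bob"}
print(validate_args(TOOL_SCHEMA, bad_call))
```

Feeding the returned error list back into the model's context usually lets it self-correct on the next attempt.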
Tokenomics & Energy
| Factor | Prefer Cloud | Prefer On-Prem |
|---|---|---|
| Utilization | <40% (bursty) | >60% (sustained production) |
| Data Sovereignty | No restrictions | CJIS, FedRAMP, GDPR, healthcare |
| Time to start | Hours | Months (hardware procurement) |
| Model size | Small to medium | Largest (trillion params) |
| 3-year TCO | Higher at scale | Lower at scale (>60% utilization) |
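A back-of-envelope model of the utilization rule of thumb above. All dollar figures are hypothetical placeholders (not NVIDIA or cloud pricing) — the shape of the result is what matters: cloud cost scales with utilization, on-prem is mostly fixed, and the curves cross somewhere around 50–60%.

```python
# Toy 3-year TCO break-even model. All prices are hypothetical.

def three_year_cloud_cost(gpu_hourly_rate, utilization):
    hours = 3 * 365 * 24
    return gpu_hourly_rate * hours * utilization   # pay only for used hours

def three_year_onprem_cost(capex, annual_opex):
    return capex + 3 * annual_opex                 # capex paid regardless of usage

CLOUD_RATE = 4.00    # $/GPU-hour (assumption)
CAPEX = 35_000       # $ per GPU over 3 years (assumption)
OPEX = 8_000         # $/GPU/year power + cooling + ops (assumption)

for util in (0.30, 0.60, 0.90):
    cloud = three_year_cloud_cost(CLOUD_RATE, util)
    onprem = three_year_onprem_cost(CAPEX, OPEX)
    winner = "on-prem" if onprem < cloud else "cloud"
    print(f"{util:.0%} utilization: cloud=${cloud:,.0f} vs on-prem=${onprem:,.0f} -> {winner}")
```

With these placeholder numbers the crossover lands just below 60% utilization, matching the table's heuristic.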
AI Infrastructure Layer
| Dimension | CPU | GPU |
|---|---|---|
| Cores | 8–128 large complex cores | Thousands of small parallel cores |
| Strength | Complex sequential logic, OS, branching | Parallel math (matrix multiply = AI) |
| Memory BW | ~100–500 GB/s | 3.35 TB/s (H100 SXM) |
| AI training | 1 epoch = hours | 1 epoch = minutes |
| NVIDIA product | Grace (ARM-based, HPC-optimized) | H100, B200, L40S, Jetson |
| Type | Speed | AI Use Case | When |
|---|---|---|---|
| Local NVMe | Highest | Data cache, checkpoints | Hot data, single-node training |
| Parallel FS (Lustre/Weka) | High | Training datasets (shared) | Multi-node training clusters |
| NFS | Medium | Home dirs, model repos | Dev environments |
| Object Storage (S3) | Lowest (high latency) | Data lakes, archival | Ingestion, cold storage |
| GPU | Architecture | VRAM | Best For |
|---|---|---|---|
| B200 | Blackwell | 192 GB HBM3e | Trillion-param training, large model inference |
| H100 SXM | Hopper | 80 GB HBM3 | Enterprise training, mid-size LLMs |
| L40S | Ada Lovelace | 48 GB GDDR6 | Inference + visualization + video (I used this on Brev) |
| Jetson Orin | Ampere | 16–64 GB unified | Edge AI, robotics, real-time inference |
Agentic AI Skills & ISV Solutions
Examples you already know: Salesforce (CRM) · SAP (ERP) · ServiceNow (IT ticketing) · Splunk (log analytics) · Snowflake (data warehouse).
The key insight: Enterprises already have 300–400 ISV apps deployed. They are not replacing them. The AI opportunity is building the orchestration layer that sits on top — reading from and writing to these systems via their existing REST APIs, without touching the underlying data or replacing the UI.
Think of it like a hospital: the patient (user) speaks to the triage nurse (orchestrator), who routes them to the right specialist (agent), who uses the hospital's existing IT systems (ISVs) to look up records and order tests (tools).
"Summarize all open P1 incidents from the last 7 days and draft a Slack message to the on-call team."
LLM breaks the task: (a) query ServiceNow for P1 tickets, (b) summarize with RAG Agent, (c) Action Agent drafts + sends Slack message.
Tool call: GET /api/now/table/incident?priority=1&opened_at>7d → returns 14 tickets.
Queries Vector DB for incident patterns → retrieves 3 most relevant historical runbooks → passes to NIM for synthesis.
NIM generates the message → NeMo Guardrails checks for sensitive data → Action Agent calls Slack API → message delivered. Full audit log written.
"Done. 14 P1 incidents summarized. Message sent to #oncall. Top pattern: database connection timeouts (8 of 14 incidents)." Total time: ~8 seconds.
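The P1-incident walkthrough above can be stubbed end-to-end in a few lines, with every step writing an audit record (the part that matters for governance). All three tools are mocks; nothing real is called, and the query string just mirrors the example.

```python
# Stub walkthrough of the P1-incident scenario with an audit trail.
# All tools are mocked; tool names and data are illustrative only.
import json
import time

AUDIT_LOG = []

def audited(step, fn, *args):
    """Execute a tool and record who/what/when for the audit trail."""
    result = fn(*args)
    AUDIT_LOG.append({"step": step, "args": args,
                      "result_preview": str(result)[:60], "ts": time.time()})
    return result

def query_servicenow(query):       # mock of GET /api/now/table/incident
    return [{"id": f"INC-{i}", "summary": "db connection timeout"} for i in range(14)]

def summarize(tickets):            # mock of the RAG agent
    return f"{len(tickets)} P1 incidents; top pattern: db connection timeouts"

def post_to_slack(channel, text):  # mock of the Action agent (behind a review gate)
    return f"posted to {channel}"

tickets = audited("query", query_servicenow, "priority=1&opened_at>7d")
summary = audited("summarize", summarize, tickets)
status  = audited("notify", post_to_slack, "#oncall", summary)

print(summary)
print(json.dumps([a["step"] for a in AUDIT_LOG]))
```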
Influencing a Customer to Buy
| Step | What You Do | Your Example |
|---|---|---|
| Pain | Find the failure mode first. Never pitch features before understanding the problem. | VirtualPartner: 98% failure rate. I showed the data before saying a word about ASSIST AI. |
| Proof | Build something in their environment, not a canned demo. | AIQ GSI Research Assistant built before the NVIDIA interview — not after. |
| Path | Make the journey to production feel short. ROI calculators, reference architectures, customer references. | $1.2M–$1.6M annual savings per 100-officer agency. ai-infra-advisor TCO model. |
| Partner | Leave them more capable than before. Train their team, not just their exec. | 100+ architects trained. Partners close deals without me in the room. |
| Stakeholder | Primary Concern | Your Move |
|---|---|---|
| CTO/CIO | Architecture fit, security, scalability | Reference architecture + security posture (NeMo Guardrails, CJIS, FedRAMP) |
| CFO | ROI, TCO, risk | ROI calculator + 3-year TCO model (ai-infra-advisor) |
| VP Business Unit | Time to value, competitive edge | POC timeline, quick win from similar vertical |
| Lead Developer | Real integration complexity | Working code, GitHub repo, deploy in 1 hour |
| Compliance Officer | Regulatory risk | Audit trail, data residency, model explainability, governance framework |
Why Synthetic Data Is Critical
| Problem | Without Synthetic Data | With Synthetic Data |
|---|---|---|
| Data scarcity | Rare events have too few examples (fraud, rare diseases, edge cases) | Generate realistic variants at scale |
| Privacy constraints | Real patient records, criminal histories cannot be shared | Synthetic records preserve statistics, expose no real individuals |
| Bias amplification | Historical data encodes historical biases | Synthetically balanced datasets correct representation gaps |
| Labeling cost | Human annotation at scale costs millions | Synthetic data comes pre-labeled |
| Product | Role |
|---|---|
| NeMo Data Curator | Data quality pipeline — filter, deduplicate, augment real data |
| Nemotron 4 340B | Trained substantially on synthetic instruction data — NVIDIA's landmark disclosure |
| NeMo synthetic data gen | Generate instruction-following datasets using teacher models |
| Omniverse / Replicator | Photorealistic synthetic data for computer vision (robots, defect detection, autonomous vehicles) |
OpenClaw, Red Hat AI / OpenShift, Hermes MLOps
Why it matters: 176,000+ GitHub stars (fastest in history). One developer displaced a million-dollar enterprise category.
| Product | What It Is | NVIDIA Integration |
|---|---|---|
| OpenShift | Red Hat's enterprise Kubernetes — dominant in regulated industries (government, finance, healthcare) | NVIDIA GPU Operator runs natively |
| Red Hat OpenShift AI | ML model serving on OpenShift (formerly RHODS) | NIM deployable as OpenShift workloads |
| InstructLab | Red Hat's open-source fine-tuning using synthetic data (LAB = Large-scale Alignment for chatBots) | NeMo-compatible pipeline |
| Granite Models | IBM/Red Hat open-source LLMs for enterprise tasks | NIM-deployable |
Data Version Control → DVC, Git-LFS
Training Jobs → Slurm + Enroot/Pyxis or Kubernetes
Experiment Tracking → MLflow, Weights & Biases
Model Registry → NGC Catalog, MLflow
CI/CD for ML → GitHub Actions + Helm
Deployment → NIM / Triton Inference Server
Infrastructure Monitoring → DCGM + Prometheus + Grafana
Drift Detection → Arize, NeMo Guardrails monitoring
Cluster Management → NVIDIA Base Command Manager
| MSI Project | NVIDIA Concept | Build/Extend |
|---|---|---|
| ViQi ASSIST (ASR) | NIM inference, real-time, TensorRT | ✅ RealtimeVoice repo (Nemotron vs Whisper benchmark done) |
| ASSIST AI (4 agents) | AIQ Toolkit, multi-agent, Guardrails | Rebuild one agent with AIQ v1.3 + NIM + Guardrails layer |
| DDaaS (sensor fusion) | Edge AI, Jetson, multi-modal | Architecture diagram mapping NVIDIA stack to each sensor |
| Policy RAG | NeMo Retriever, RAG, Vector DB | Add policy search mode to aiq-gsi-demo |
| ROI Calculator | DGX Cloud vs on-prem | Extend ai-infra-advisor with on-prem mode |
The Trust & Governance Layer
| Dimension | The Question | What Breaks Without It |
|---|---|---|
| Explainability | Why did the model produce this output? | Regulators reject black-box decisions; auditors can't investigate |
| Auditability | What was input, output, and reasoning for every decision? | No chain of evidence; unable to investigate failures; non-compliant |
| Compliance | Does the system meet regulatory requirements? | Legal liability, license revocation, agency defunding |
| Safety | Can harmful outputs be prevented? | Model says something illegal/dangerous; reputational damage |
| Data Governance | Who owns the data? How long retained? Who can access? | Privacy violations, GDPR/CJIS breach, data sovereignty failure |
| Layer | NVIDIA Product | What It Does |
|---|---|---|
| Safety / Guardrails | NeMo Guardrails | Topical, safety, security, factual rails — block harmful/off-topic outputs at inference |
| Confidential Compute | H100 CC / NVIDIA CC SDK | Encrypts model weights and data in-use — even cloud provider cannot see data |
| Secure supply chain | AI Enterprise Prod/LTS branches | CVE-monitored containers, 0 high vulnerabilities, monthly patches |
| Sovereign AI | DGX SuperPOD + NIM on-prem | Keep model, data, and inference within national/org boundaries |
| Data protection | FIPS 140-2 containers | Encryption at rest/transit in regulated deployments |
| Regulation | Industry | Key AI Requirement |
|---|---|---|
| CJIS v6.0 | Law enforcement | 1,414 security controls, audit trail, encryption, access logging |
| FedRAMP High | US Government | Continuous monitoring, incident response, ATO authorization |
| EU AI Act | EU deployments | Risk classification, human oversight, transparency for high-risk AI |
| HIPAA | Healthcare | Data minimization, access controls, audit logging |
| NIST AI RMF | US Federal + voluntary | Govern, Map, Measure, Manage — AI risk framework |
| GDPR | EU / Global | Right to explanation for automated decisions, data residency |
"I've operated in the highest AI governance environment outside of nuclear and defense: CJIS v6.0. Every query ViQi ASSIST processes — a real-time check of NCIC, state, county, and DMV databases — generates an immutable audit record: who queried, what was queried, what was returned, and when. 1,414 security controls govern the system. FedRAMP High ATO took 9 months. FIPS 140-2 encryption end-to-end.
The CJIS audit is not a compliance checkbox — it's the evidentiary chain that determines whether a criminal prosecution holds up in court. When an officer queries a suspect's record and the AI gives the wrong answer, someone's life is at risk. That's not a hallucination problem. That's a governance failure.
The GSI opportunity is massive: 95% of enterprise AI projects have no governance layer. They're demos pretending to be production systems. A GSI that walks in with a governance framework, audit toolkit, and NVIDIA Confidential Computing reference architecture will close every deal that point-solution AI vendors can't."
Story Bank
| Story | Questions | Key Metric | Opening Line |
|---|---|---|---|
| ViQi ASSIST — Nemotron ASR | Q1, Q2, Q4 | 43ms / 21× faster than Whisper | "I benchmarked Nemotron against Whisper on an L40S GPU..." |
| ASSIST AI — 98% → 90% | Q3, Q7 | 4 agents fixed what 1 LLM call couldn't | "The original system had a 98% failure rate. Here's why — and how I redesigned it." |
| DDaaS — sensor fusion | Q1, Q5, Q8 | 0–5s detection; $500M federal funding; FIFA 2026 | "We needed to detect a drone before it crossed a stadium boundary..." |
| CJIS v6.0 compliance | Q10 | 1,414 controls; FedRAMP High in 9 months | "Every query generates an immutable audit record. Here's what that architecture looks like." |
| 100+ architects trained | Q6, Q7 | 40% shorter sales cycles; $50M+ deals | "I designed an enablement program for architects who'd never touched AI..." |
| OpenClaw case study | Q9 | 176K stars; VoIP pattern repeating | "In 1999 I was inside the Lucent partner channel when Cisco started the VoIP disruption. OpenClaw is the same pattern." |
| ai-infra-advisor tool | Q4, Q7 | On-prem vs cloud TCO model (github.com/QbitLoop) | "I built the tool that answers this question — let me walk you through it." |
| aiq-gsi-demo on Brev L40S | Q1, Q2, Q3 | NIM + AIQ Toolkit + 96 verified NVIDIA sources; $1.49/hr | "I deployed an NVIDIA Blueprint on an L40S instance this week to prove it works in practice..." |
Agent Engineering
An agent is an LLM + a loop + tools. The loop: Observe → Think → Act → Observe. Tools are functions the LLM can call. Reliability comes from the harness around the loop, not from the model itself.
ReAct Pattern (Reason + Act)
| Step | What happens | Implementation |
|---|---|---|
| Thought | LLM reasons about what to do next | Structured output parsing (JSON mode) |
| Action | LLM selects a tool + arguments | Tool schema in system prompt; function calling API |
| Observation | Tool returns result back to context | Append tool result to conversation history |
| Repeat | LLM decides: done or next action? | Max iterations limit prevents infinite loops |
| Final | LLM synthesizes all observations | Structured final answer extraction |
Agent Harness — What I Build Around Every Agent
| Component | Why it matters | My implementation |
|---|---|---|
| Tool registry | Defines what the agent can do | Pydantic schemas; strict type validation on inputs/outputs |
| Retry logic | LLMs fail ~15% of tool calls on first attempt | 3 retries with exponential backoff + error injection into context |
| Max iterations | Prevents infinite reasoning loops | Hard limit of 10; soft warning at 7 with progress summary |
| Structured output | Parse-safe final answers | Instructor library + Pydantic; fallback to regex extraction |
| Trace logging | Debug which tool was called, with what args, and why | Append each thought/action/observation to trace file |
| Guardrails hook | Block unsafe outputs before they leave | NeMo Guardrails colang rules on final response |
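The retry-logic row above — exponential backoff plus injecting the error back into context so the next reasoning step can adapt — looks like this in miniature. The flaky tool is simulated and delays are computed but not actually slept, purely for illustration.

```python
# Retry harness sketch: exponential backoff + error injection into context.
# The flaky tool is simulated; no real sleeping or API calls.

def call_with_retries(tool, arg, context, max_retries=3, base_delay=1.0):
    delay = base_delay
    for attempt in range(1, max_retries + 1):
        try:
            return tool(arg)
        except Exception as exc:
            # Error injection: the failure text becomes part of the agent's
            # context so it can self-correct instead of blindly repeating.
            context.append(f"attempt {attempt} failed: {exc}; retrying in {delay:.0f}s")
            delay *= 2                  # exponential backoff (simulated)
    raise RuntimeError(f"tool failed after {max_retries} attempts")

calls = {"n": 0}
def flaky_api(arg):
    calls["n"] += 1
    if calls["n"] < 3:                  # fails twice, succeeds on attempt 3
        raise TimeoutError("NCIC API timeout")
    return f"record for {arg}"

context = []
print(call_with_retries(flaky_api, "plate ABC-123", context))
print(context)
```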
Multi-Agent Architecture Patterns
| Pattern | When to use | NVIDIA tool |
|---|---|---|
| Orchestrator + specialists | Complex tasks needing multiple domain experts | AIQ Toolkit + specialized NIM agents |
| Parallel fan-out | Independent sub-tasks that can run simultaneously | async tool dispatch; aggregate results before synthesis |
| Sequential pipeline | Each step depends on previous output | LangGraph stateful graph; node = agent, edge = condition |
| Critic / reviewer | High-stakes output that needs self-checking | Second NIM call with critic prompt; reject if score < threshold |
| Human-in-the-loop | Irreversible actions (API writes, emails, payments) | Approval gate before any write tool; async webhook |
Built a 4-agent system for Motorola Solutions: Orchestrator → [RAG Policy Agent | Case Lookup Agent | Form Gen Agent | Escalation Agent]. Key lesson: the hardest problem was not the LLM — it was tool reliability. NCIC APIs time out; forms have 200+ fields. Built a retry harness + confidence-threshold routing (if RAG <70% confident → escalate to human). Revenue-generating as of Q1 2026.

Context Engineering
Context engineering is the discipline of deciding WHAT knowledge to put in the context window, HOW to compress it, and in WHAT ORDER. A well-engineered context window eliminates the need for RAG on many tasks. Distillation > Retrieval when the knowledge set is bounded and known.
Context Budget Framework
| Layer | Token Budget | What goes here | Priority |
|---|---|---|---|
| System identity | ~2K | Role, persona, behavior rules, output format | P0 — always present |
| Core knowledge | ~20K | Distilled domain facts (not raw docs) | P0 — always present |
| Task examples | ~10K | Few-shot demonstrations, templates | P1 — include for complex tasks |
| Retrieved context | ~30K | RAG chunks for this specific query | P1 — query-time only |
| Conversation history | ~20K | Prior turns, compressed summaries | P2 — sliding window |
| Response buffer | ~8K | Space for model to generate output | P0 — always reserve |
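The budget table above can be expressed as a simple allocator: admit layers in priority order and drop whatever no longer fits. Token counts below are the table's rough budgets, not measured values, and the greedy strategy is a deliberate simplification.

```python
# Context budget allocator sketch: pack layers by priority, trim to the window.
# Budgets mirror the table above; the greedy policy is illustrative only.

LAYERS = [  # (name, budget_tokens, priority) -- P0 must always be present
    ("system_identity",       2_000, 0),
    ("core_knowledge",       20_000, 0),
    ("response_buffer",       8_000, 0),
    ("task_examples",        10_000, 1),
    ("retrieved_context",    30_000, 1),
    ("conversation_history", 20_000, 2),
]

def allocate(window_tokens):
    """Greedily admit layers in priority order; skip what no longer fits."""
    kept, used = [], 0
    for name, budget, prio in sorted(LAYERS, key=lambda l: l[2]):
        if used + budget <= window_tokens:
            kept.append(name)
            used += budget
    return kept, used

kept, used = allocate(64_000)    # a 64K window forces trade-offs
print(kept, used)
kept128, used128 = allocate(128_000)  # a 128K window fits everything
print(len(kept128), used128)
```

Note the greedy artifact at 64K: the large P1 retrieved-context layer gets skipped while the smaller P2 history still fits — a real allocator would also let you shrink a layer rather than drop it.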
Knowledge Distillation Techniques
| Technique | Compression ratio | When to use |
|---|---|---|
| Extractive summarization | 10:1 | Structured docs (manuals, specs) |
| Abstractive distillation | 20:1 | Unstructured prose, research papers |
| Fact tables | 50:1 | Numerical data, product specs, comparisons |
| Decision trees (text) | 30:1 | Troubleshooting guides, workflows |
| Q&A pairs | 15:1 | FAQ content, support docs |
Building a Claude Cowork connector that distills 4.8GB of NVIDIA knowledge into ~80K tokens. No RAG — pure context engineering. The 5 knowledge domains (AIIO training, interview team intel, technical manual, my experience, NVIDIA stack) are each distilled into optimized markdown with specific token budgets. The skill becomes an AI employee that knows everything I know about NVIDIA + GSI without any database.
System Prompt Design Patterns
| Pattern | Use case | Key principle |
|---|---|---|
| Role + Goal + Constraints | Any agent/assistant | Tell the model WHO it is, WHAT it does, what it must NOT do |
| Chain-of-thought seeding | Reasoning tasks | "Think step by step" or structured thought template in system |
| Output format schema | Structured extraction | JSON schema in system prompt enforces parseable output |
| Persona + tone calibration | Customer-facing apps | Explicit persona description with 3 example turns |
| Safety anchors | Production systems | Explicit refusal list beats guardrails as first line of defense |
Memory Systems in AI Agents
Human memory: short-term (working memory), long-term (episodic + semantic), procedural (muscle memory). AI agent memory maps exactly: context window, vector DB, document store, fine-tuned weights. The architecture question is: which tier for which information?
| Memory Type | Human Analogy | AI Implementation | NVIDIA Component | When to Use |
|---|---|---|---|---|
| In-context (working) | Active thought | Conversation history in prompt | KV Cache (TRT-LLM) | Current session; immediate context |
| Episodic (short-term) | Yesterday's events | Session summaries in vector DB | NeMo Retriever + Milvus | Multi-session continuity; user preferences |
| Semantic (long-term) | Domain knowledge | Knowledge base in vector DB | NeMo Retriever + RAG | Static facts; documentation; policies |
| Procedural (implicit) | Skill / habit | Fine-tuned model weights | NeMo Framework LoRA/SFT | Domain vocab, style, specialized reasoning |
Memory Architecture Decision Tree
| Question | If YES | If NO |
|---|---|---|
| Is the info needed only in this session? | In-context memory (append to prompt) | Needs persistence → vector DB |
| Does the info change frequently? | Vector DB (update embeddings) | Stable → consider fine-tuning |
| Is exact retrieval required? | Traditional DB + structured query | Semantic similarity → vector DB |
| Is it behavioral style/domain vocab? | Fine-tune (LoRA) the base model | Keep in external retrieval |
| Is it user-specific preferences? | User profile in episodic vector store | Global knowledge → semantic store |
Context Window Management (Sliding Window)
Long conversations overflow the context window. Strategies: (1) Sliding window — keep last N turns. (2) Summarization — compress older turns into a summary. (3) Hierarchical — compress into layers (turn → session → user history). Claude Code uses compaction summaries — you see this every session reset.
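Strategies (1) and (2) combine naturally: keep the last N turns verbatim and compress everything older into one summary line. The "summarizer" here is a string-truncating stub — a real system would use an LLM call for the compression step.

```python
# Sliding window + summarization sketch. The summarizer is a stub;
# production systems compress older turns with an LLM call.

def compact_history(turns, keep_last=4):
    if len(turns) <= keep_last:
        return turns                       # nothing to compress yet
    older, recent = turns[:-keep_last], turns[-keep_last:]
    summary = (f"[summary of {len(older)} earlier turns: "
               + "; ".join(t[:20] for t in older[:2]) + " ...]")
    return [summary] + recent              # one summary line + recent verbatim

turns = [f"turn {i}: user asked about topic {i}" for i in range(10)]
compacted = compact_history(turns)
print(len(compacted), compacted[0])
```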
Built a file-based persistent memory system for this project: user.md (who Waseem is), feedback.md (what works), project.md (current goals), reference.md (where things are). Each file has structured frontmatter. MEMORY.md is an index loaded every session. This IS context engineering applied to agent memory — 200-line limit on index, semantic content in separate files. Zero vector DB needed.
Agentic Loops
Single-shot LLM calls are not agents. Agents loop: perceive the environment, reason, act, observe the result, reason again. The loop design — how many iterations, when to stop, how to handle failure — determines whether an agent is reliable in production or a demo that breaks on Monday morning.
The Major Loop Patterns
| Pattern | Flow | Best for | Risk |
|---|---|---|---|
| ReAct | Think → Act → Observe → Think... | Tool-using agents with uncertainty | Can spiral in reasoning loops |
| Plan-then-Execute | Plan all steps → Execute in order | Well-defined tasks with known steps | Plan goes stale if environment changes |
| Reflection / Critic | Generate → Critique → Revise | Quality-sensitive outputs (reports, code) | 2x token cost; critic can be wrong |
| RLHF Loop | Generate → Human rates → Reward model learns → Fine-tune | Model behavior alignment | Reward hacking; expensive |
| RLAIF (AI feedback) | Generate → Stronger LLM rates → Reward model → Fine-tune | Scalable alignment without humans | Model collapse if teacher is weak |
| Self-play (synthetic data) | Model generates problems → solves them → trains on correct ones | Math, coding, reasoning skill improvement | Distribution shift from real data |
Loop Failure Modes (and Fixes)
| Failure mode | Root cause | Production fix |
|---|---|---|
| Infinite reasoning loop | No termination condition | Max iteration limit + progress check |
| Wrong tool called | Ambiguous tool description | Better docstrings; few-shot examples in tool schema |
| Hallucinated tool args | Model invents parameter values | Strict Pydantic validation; error fed back to context |
| Lost context mid-loop | Context window fills up | Compress intermediate observations; keep trace summary |
| Irreversible side effect | Action agent called write API | Human-in-the-loop gate for all write operations |
| Cost explosion | Too many LLM calls in loop | Cache repeated tool calls; use smaller model for sub-steps |
Eval Loop — Closing the Quality Feedback Loop
The loop that makes agents better over time: (1) Log every agent trace. (2) Sample failures. (3) Root-cause which loop step failed. (4) Fix: better prompt, better tool, add retry, add critic. (5) Re-run eval set. This loop, not the initial build, is what makes agents production-grade.
ASSIST AI had a 40% failure rate in week 1 of production. Loop analysis: 32% was wrong chunking in RAG retrieval (fixed: semantic chunking), 5% was NCIC API timeout not handled (fixed: retry + fallback message), 3% was hallucinated case numbers (fixed: guardrail on numeric patterns). Loop debugging reduced failure rate to <8% in 3 weeks. The loop trace log was the only tool that made this diagnosable.
Evaluation & RAG Bench
You cannot improve what you cannot measure. For RAG systems: answer correctness is not enough. You need retrieval quality + answer faithfulness + answer relevance — separately. For agents: you need per-step accuracy, not just final answer quality. RAGAS is the framework; LLM-as-judge is the scalable evaluation method.
RAGAS Metrics Framework
| Metric | What it measures | Target | How to improve |
|---|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? (No hallucination) | >0.90 | Better guardrails; citation enforcement |
| Answer Relevancy | Does the answer actually address the question? | >0.85 | Better reranking; query expansion |
| Context Precision | Are retrieved chunks actually relevant? | >0.80 | Better chunking; reranker model |
| Context Recall | Did retrieval find all the relevant info? | >0.75 | Larger k; hybrid search |
| Answer Correctness | Is the answer factually correct? | >0.85 | Better source docs; fact verification |
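A rough, hand-rolled proxy for the Faithfulness row: the fraction of answer sentences whose content words all appear in the retrieved context. RAGAS computes this with an LLM judge; this lexical version only sketches the idea and will miss paraphrases.

```python
# Lexical faithfulness proxy -- a sketch of the metric's intent, NOT the
# RAGAS implementation (which uses an LLM judge).
import re

def content_words(text):
    stop = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in stop}

def faithfulness_proxy(answer, context):
    ctx_words = content_words(context)
    sentences = [s for s in re.split(r"[.!?]", answer) if s.strip()]
    grounded = sum(1 for s in sentences if content_words(s) <= ctx_words)
    return grounded / len(sentences)

ctx = "The H100 SXM has 80 GB of HBM3 memory and 3.35 TB/s bandwidth."
good = "The H100 SXM has 80 GB of HBM3. Bandwidth is 3.35 TB/s."
bad  = "The H100 has 80 GB. It launched with 256 GB of GDDR7."
print(faithfulness_proxy(good, ctx), faithfulness_proxy(bad, ctx))
```

The ungrounded second sentence in `bad` (invented specs not present in the context) is exactly the kind of hallucination the real metric is built to catch.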
RAG Eval Bench — My GitHub Repo
Benchmarks 4 RAG approaches head-to-head: naive chunking, semantic chunking, hybrid search, and reranking. Runs RAGAS on all 4. Shows that adding a reranker (NV-RerankQA) improves context precision by ~20% with minimal latency cost. Used to justify the ASSIST AI architecture decisions.
LLM-as-Judge Pattern
| Step | What happens | Tool |
|---|---|---|
| Generate golden set | Create 50-200 Q&A pairs from ground truth docs | Nemotron or Claude to generate; human spot-check |
| Run the system | Feed each Q through the RAG/agent pipeline | Automated test harness |
| Score with LLM judge | Stronger model rates: correct, partially correct, wrong | Claude Opus or GPT-4 as judge; structured output |
| Compute RAGAS | Automated metric calculation | RAGAS library + Milvus + NIM embeddings |
| Root-cause failures | Cluster failing cases by failure mode | K-means on failure embeddings; manual review of top clusters |
RAG-Eval-Bench finding that changed ASSIST AI: naive 1000-char chunking scored 0.42 faithfulness. Semantic chunking (split on section boundaries) scored 0.87. The difference was that CJIS policy documents have cross-references — "as defined in Section 5.3.2" — which naive chunking split across chunk boundaries. Semantic chunking kept policy sections intact. One config change, 2x improvement.
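The chunking finding above is easy to demonstrate: fixed-size chunking can sever a cross-reference from its section, while splitting on section boundaries keeps each section whole. The document text below is invented for illustration.

```python
# Naive fixed-size chunking vs section-boundary ("semantic") chunking.
# Document text is invented; the split behavior is the point.

doc = ("## Section 5.3.1\nAudit records must be retained for one year "
       "as defined in Section 5.3.2.\n"
       "## Section 5.3.2\nRetention means immutable storage with access logging.")

def naive_chunks(text, size=80):
    # Fixed-size slices: cross-references can land in a different chunk
    # than the sentence that depends on them.
    return [text[i:i + size] for i in range(0, len(text), size)]

def semantic_chunks(text):
    # Split on section headers so each section stays intact.
    parts = text.split("## ")
    return ["## " + p.strip() for p in parts if p.strip()]

print(len(naive_chunks(doc)), len(semantic_chunks(doc)))
print(all(c.startswith("## Section") for c in semantic_chunks(doc)))
```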
Voice AI & Multimodal
Voice AI has a hard UX cliff: <3s total round-trip feels responsive; >5s feels broken. The pipeline: ASR (speech to text) + LLM inference + TTS (text to speech). Each step adds latency. Streaming is not optional — it is the only architecture that crosses the 3s threshold reliably.
Voice AI Pipeline — Components and Latencies
| Stage | Technology | Latency target | NVIDIA solution |
|---|---|---|---|
| Wake word / VAD | Voice Activity Detection | <50ms | NVIDIA Riva VAD (on-device) |
| ASR (speech to text) | Streaming Whisper or Parakeet | <200ms TTFT | Nemotron ASR: 43ms (21x faster than Whisper) |
| LLM inference | NIM + TRT-LLM | <800ms TTFT | NIM streaming; first token in <500ms |
| TTS (text to speech) | Streaming synthesis | First audio <300ms | NVIDIA Riva TTS; ElevenLabs for quality |
| Total (streaming) | End-to-end streamed | <3.3s perceived | ViQi ASSIST: 3.3s on Azure GPU nodes |
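A back-of-envelope check on why streaming is non-negotiable: with streaming, the user hears first audio after each stage's first-token latency; without it, every stage must fully finish before the next starts. The non-streaming completion times below are assumptions for illustration, not measured values.

```python
# Streaming vs sequential latency budget. First-token targets mirror the
# table above; the full-completion times are hypothetical.

STAGE_FIRST_TOKEN_MS = {   # streaming: time until each stage emits output
    "vad": 50, "asr": 200, "llm": 800, "tts": 300,
}
STAGE_FULL_MS = {          # non-streaming: each stage runs to completion
    "vad": 50, "asr": 1500, "llm": 3000, "tts": 1200,
}

perceived_streaming = sum(STAGE_FIRST_TOKEN_MS.values()) / 1000
sequential_total = sum(STAGE_FULL_MS.values()) / 1000

print(f"streamed first-audio: {perceived_streaming:.2f}s")  # under the 3s cliff
print(f"sequential pipeline:  {sequential_total:.2f}s")     # over the 5s "broken" line
```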
Multimodal Architecture Patterns
| Modality combo | Architecture | Use case | My example |
|---|---|---|---|
| Speech + Text | ASR → LLM → TTS pipeline | Voice assistants, dispatch, call centers | ViQi ASSIST — law enforcement dispatch AI |
| Vision + Text | Vision encoder + LLM (VLM) | Document OCR, scene understanding | DDaaS drone detection — camera + radar fusion |
| Video + Text | Frame sampling + VLM | Surveillance, body cam review, training | SmartVideo + SmartEvidence pipeline |
| Sensor + Text | Multi-modal encoder + LLM | IoT, robotics, physical AI | DDaaS: acoustic + radar + optical fusion |
| Voice + RAG | ASR → RAG → TTS | Voice-enabled knowledge search | SmartSearch / ASSIST AI voice mode |
QbitLoop/RealtimeVoice GitHub repo: benchmarked NVIDIA Nemotron ASR at 43ms vs Whisper at 900ms on the same audio clips — 21x faster. Architecture: WebSocket streaming from browser → GPU inference → streaming response back. Key finding: Whisper's latency is not model inference time, it's the encoder's non-streaming architecture. Nemotron uses streaming CTC decoder — first token in 43ms regardless of utterance length.
ViQi ASSIST — Full Architecture Story
Platform: APX Next radio (hardened Android) → ViQi Voice Gateway (Azure) → NLU → Intent Router → [NCIC agent | policy RAG agent | translation agent] → TTS → speaker. Latency: 3.3s 95th percentile. Compliance: CJIS v6.0 (1,414 controls), H100 Confidential Computing TEE. Scale: NTECC, CBP deployments. SmartSearch GA April 16, 2026.
AIIO Associate Exam Prep
38% Essential AI Knowledge (Units 1–6) | 40% AI Infrastructure (Units 7–12) | 22% AI Operations (Units 13–14)
Infrastructure is the biggest section. Focus there first.
1 — The 4-Step AI Workflow (always tested)
| Step | What happens | NVIDIA tool | Key detail |
|---|---|---|---|
| 1. Data Prep | Clean, label, augment datasets | RAPIDS, NeMo Data Curator | GPU-accelerated Python; most time-consuming in practice |
| 2. Training | Adjust model weights via gradient descent | PyTorch / TensorFlow / NeMo | Most compute-intensive step; uses FP8/FP16 mixed precision |
| 3. Optimization | Compress + optimize for target hardware | TensorRT (TRT) | Quantization, pruning, kernel fusion → 5x faster inference |
| 4. Inference / Deploy | Run model on new inputs in production | Triton Inference Server + NIM | Multi-GPU, multi-node, load balancing, multi-framework |
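The Optimization row lists quantization as one of TensorRT's techniques. A minimal pure-Python sketch of symmetric INT8 quantization shows the idea conceptually (this is not the TRT API, just the arithmetic behind the memory/bandwidth win):

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: map floats in [-max, max] onto the
    integer range [-127, 127]. Storing 1 byte instead of 4 is where the
    4x memory and bandwidth saving comes from."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats; error is bounded by scale/2."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.03, 1.0]
q, s = quantize_int8(w)
print(q)                 # integer codes, one byte each
print(dequantize(q, s))  # values round-trip with small error
```

Pruning and kernel fusion are separate techniques; quantization alone is typically the largest contributor to the inference speedup.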
2 — GPU Architectures (40% section, exam-critical)
| Architecture | GPU | Transistors | Key features | MIG? | Conf.Compute? |
|---|---|---|---|---|---|
| Blackwell | B200 | 208B | 2nd-gen Transformer Engine, FP8/FP4, 5th-gen NVLink (576 GPUs), RAS engine, decompression engine | Yes | Yes (built-in) |
| Hopper | H100 | 80B | 1st-gen Transformer Engine, FP8, 4th-gen NVLink (900GB/s), 4th-gen Tensor Cores | Yes (up to 7) | Yes (built-in) |
| Ada Lovelace | L40S | — | 4th-gen Tensor Cores, gaming+data center inference, advanced video accel, 2x power efficiency vs prior gen | No | No |
| Grace CPU | ARM CPU | — | NVIDIA's first data center CPU; ARM-based; large memory BW; HPC, genomics, quantum chemistry | N/A | N/A |
| Grace Hopper | GH200 | — | Grace CPU + H100 GPU via NVLink-C2C; 900GB/s bidirectional; unified memory; recommender systems | Yes | Yes |
| Grace Blackwell | GB200 | — | Grace CPU + 2x B200 via NVLink; 900GB/s; unified memory; trillion-parameter LLMs | Yes | Yes |
3 — DGX Systems Specs (know the numbers)
| System | GPUs | Key specs | Use case |
|---|---|---|---|
| DGX H100 | 8x H100 | 4x NVSwitch, 10x ConnectX-7 (400Gbps each = 1TB/s peak), 2TB RAM, 30TB NVMe, 32 petaFLOPS FP8 | Enterprise training + inference |
| DGX B200 | 8x B200 | 5th-gen NVLink, 72 petaFLOPS FP8 training, 144 petaFLOPS FP4 inference, 1.4TB GPU memory | Frontier LLM training + inference |
| GB200 NVL72 | 72x Blackwell | 36 Grace CPUs + 72 B200s; liquid-cooled; 130TB/s compute fabric; "exaflop AI supercomputer in a rack" | Trillion-parameter model training |
| DGX BasePOD | 2–16 nodes | DGX H100 or B200 nodes + QM9700 switch (compute) + SN5600 (GPU-to-GPU Ethernet) + SN4600 (storage/mgmt) + SN2201 (out-of-band) | On-prem integrated AI solution |
| DGX SuperPOD | Up to 127 | Scalable units (SU); recommended for GPT-4 scale LLM training; deployed at customer DCs + CSPs | Hyperscale AI training |
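The DGX H100 row's "10x ConnectX-7 (400Gbps each = 1TB/s peak)" can be reproduced with unit conversion, assuming the peak counts bidirectional traffic (400 Gb/s in each direction per NIC):

```python
def nic_peak_tb_per_s(num_nics, gbits_per_direction, bidirectional=True):
    """Convert NIC line rate to aggregate TB/s.
    400 Gb/s = 50 GB/s per direction; x2 when counting both directions."""
    gbytes = gbits_per_direction / 8 * (2 if bidirectional else 1)
    return num_nics * gbytes / 1000

print(nic_peak_tb_per_s(10, 400))  # 1.0 TB/s -- matches the DGX H100 row
```

The Gb-to-GB divide-by-8 step is a frequent source of off-by-8x errors when comparing network and memory bandwidth figures.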
4 — Interconnects: NVLink vs NVSwitch vs PCIe (multiple exam questions)
| Technology | What it does | Bandwidth | Exam key fact |
|---|---|---|---|
| NVLink | High-speed chip-to-chip: GPU-to-GPU and GPU-to-CPU | H100: 900GB/s; B200: 1.8TB/s (5th gen), scales to 576 GPUs | Overcomes PCIe bottleneck for scale-up multi-GPU |
| NVSwitch | Fabric switch enabling all-to-all GPU communication at full bandwidth | Full bisection bandwidth | Enables ALL GPUs to talk to each other simultaneously (no bottleneck) |
| PCIe | Traditional server bus, CPU-to-GPU | ~128 GB/s (PCIe 5.0) | The bottleneck NVLink replaces for GPU-to-GPU communication |
| InfiniBand | Multi-node compute fabric (node-to-node) | NDR: 400Gbps per port | Lossless, RDMA-native, ultra-low latency; QM9700 switch = BasePOD compute |
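The "PCIe bottleneck" claim in the table can be made concrete with the bandwidth numbers above. The 40GB payload is an illustrative assumption, not a figure from the source:

```python
def transfer_ms(gigabytes, gb_per_s):
    """Time to move a payload over a link at the given bandwidth."""
    return gigabytes / gb_per_s * 1000

payload_gb = 40  # illustrative gradient/activation payload
print(transfer_ms(payload_gb, 900))  # NVLink (H100): ~44 ms
print(transfer_ms(payload_gb, 128))  # PCIe 5.0:      ~313 ms, ~7x slower
```

That ~7x gap per transfer, repeated every training step, is why scale-up multi-GPU boxes use NVLink/NVSwitch rather than the server's PCIe bus.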
5 — Legacy Network vs AI Network (always on the exam)
| Characteristic | Legacy/Cloud Network | AI-Optimized Network |
|---|---|---|
| Protocol | TCP | RDMA (Remote Direct Memory Access) |
| Jitter tolerance | High jitter tolerated | No jitter — every GPU waits for slowest |
| Oversubscription | Oversubscribed (shared bandwidth) | Nonblocking (full bisection bandwidth) |
| Traffic pattern | North-south (client ↔ server) | East-west (GPU ↔ GPU, all-to-all) |
| NVIDIA product | Spectrum-4 (Ethernet) | QM9700 InfiniBand or Spectrum-4 AI |
RoCE (RDMA over Converged Ethernet) encapsulates InfiniBand transport packets inside Ethernet frames. Enables RDMA performance on Ethernet infrastructure. Key when a customer has Ethernet investment but needs RDMA for AI training.
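The encapsulation can be pictured as nested structures: the InfiniBand transport segment (carrying RDMA semantics) rides inside an Ethernet frame. A toy model with made-up field values (note that RoCE v2 additionally inserts UDP/IP headers between the two layers):

```python
from dataclasses import dataclass

@dataclass
class IBTransport:
    payload: bytes  # the RDMA operation data, semantics unchanged

@dataclass
class EthernetFrame:
    dst_mac: str
    src_mac: str
    inner: IBTransport  # IB transport preserved on an Ethernet wire

frame = EthernetFrame("02:00:00:00:00:01", "02:00:00:00:00:02",
                      IBTransport(b"rdma-write"))
print(frame.inner.payload)  # the RDMA payload, untouched by the outer frame
```

The point for the exam: RDMA semantics survive intact; only the wire format changes to Ethernet.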
6 — Management Tools Mapping (Exam question: match tool to function)
| Tool | Function | Layer | Key detail |
|---|---|---|---|
| Ansible | Provisioning automation | Infrastructure | Install OS, configure servers — automation layer |
| Terraform | Infrastructure as code | Infrastructure | Define data center resources declaratively |
| Foreman | Server lifecycle management | Infrastructure | Provisioning + OS deployment + reporting |
| DCGM | GPU health monitoring | Resource monitoring | Telemetry → DCGM Exporter → Prometheus → Grafana |
| Prometheus | Metrics collection/storage | Resource monitoring | Scrapes DCGM Exporter; time-series DB |
| Grafana | Visualization dashboards | Resource monitoring | Connects to Prometheus; GPU dashboards |
| Slurm | Job scheduling (HPC/bare metal) | Workload management | Open source; batch + interactive; priority/preemption; no kernel mods |
| Kubernetes | Container orchestration | Workload management | Microservices; scale up/down; advanced features built-in |
| Base Command Manager | On-prem cluster management | All three layers | Deploys K8s + Slurm + Jupyter; job monitoring + GPU metrics |
| Base Command Platform | DGX Cloud web interface | Workload management | Submit/monitor AI training jobs in browser; web GUI for DGX Cloud |
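The DCGM → Prometheus → Grafana chain in the table works because DCGM Exporter publishes metrics in Prometheus's text exposition format, which Prometheus scrapes over HTTP. A sketch of what such a scrape target emits (the metric names model real DCGM Exporter fields like DCGM_FI_DEV_GPU_UTIL; the renderer itself is illustrative):

```python
def exposition(metrics):
    """Render gauge samples in Prometheus text exposition format:
    name{label="value"} sample -- one line per sample."""
    lines = []
    for name, labels, value in metrics:
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

samples = [
    ("DCGM_FI_DEV_GPU_UTIL", {"gpu": "0"}, 97),  # utilization %
    ("DCGM_FI_DEV_GPU_TEMP", {"gpu": "0"}, 64),  # temperature C
]
print(exposition(samples))
```

Grafana then queries Prometheus's time-series store over these same metric names, which is why the three tools appear as one unit on the exam.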
7 — Kubernetes vs Slurm (Key exam distinction)
| Dimension | Kubernetes | Slurm |
|---|---|---|
| Workload type | Containers / microservices | Bare metal HPC / batch jobs |
| Scaling | Scale up and down dynamically | No dynamic scale-down (HPC cluster) |
| Advanced features | Via MLOps plugins | Built-in (priority, preemption, QoS) |
| Inferencing support | Yes (scales to demand) | No (not designed for inference serving) |
| NVIDIA integration | GPU Operator + Network Operator | enroot + Pyxis for containers |
| When to use | Inference serving, MLOps, microservices | Model training, simulation, HPC workloads |
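The decision rule in the table above can be encoded as a small helper, useful as a memory aid (the workload flags are illustrative, not any NVIDIA API):

```python
def pick_scheduler(workload: dict) -> str:
    """Encode the K8s-vs-Slurm table: serving/autoscaling workloads go
    to Kubernetes; bare-metal batch/HPC jobs go to Slurm."""
    if workload.get("serving") or workload.get("autoscale"):
        return "kubernetes"   # inference scales with demand
    if workload.get("batch") or workload.get("hpc"):
        return "slurm"        # priority, preemption, QoS built in
    return "kubernetes"       # default: containerized microservices

print(pick_scheduler({"serving": True}))             # kubernetes
print(pick_scheduler({"batch": True, "hpc": True}))  # slurm
```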
8 — MIG Deep Dive (22% operations section)
Multi-Instance GPU — partition one supported GPU (A100, H100, or B200) into up to 7 fully isolated instances. Each instance gets its own HBM (memory), cache, and compute cores — hardware-enforced isolation, not virtualization. Does NOT increase power consumption. MIG Manager handles this in Kubernetes clusters.
| Exam question | Answer |
|---|---|
| Max MIG instances per GPU? | 7 (on H100 or B200) |
| Does MIG increase power consumption? | No — key exam trap |
| Which GPUs support MIG? | A100 (Ampere), H100 (Hopper), B200 (Blackwell) — NOT Ada Lovelace/L40S |
| MIG vs NVSwitch — what's the difference? | MIG = GPU partitioning; NVSwitch = multi-GPU all-to-all communication fabric |
| What manages MIG in Kubernetes? | MIG Manager (separate from Network Operator) |
| Why use MIG? | Multi-tenant inference — 7 isolated workloads on one GPU; maximize utilization |
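The partitioning model behind these answers can be sketched as a toy: up to 7 instances, each owning a dedicated memory slice. The 10GB slice size mirrors the 1g.10gb profile of an 80GB H100; the code itself is illustrative, not the real MIG tooling:

```python
MAX_INSTANCES = 7   # hard limit per GPU (H100/B200)
SLICE_GB = 10       # illustrative: 1g.10gb profile on an 80GB H100

def partition(requested: int):
    """Hardware-enforced isolation: each instance owns its own HBM
    slice and compute cores; more than 7 is simply impossible."""
    if not 1 <= requested <= MAX_INSTANCES:
        raise ValueError(f"MIG supports 1..{MAX_INSTANCES} instances")
    return [{"instance": i, "mem_gb": SLICE_GB} for i in range(requested)]

print(len(partition(7)))  # 7 isolated tenants on one GPU, no extra power draw
```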
9 — Cloud Consumption Models (quick exam points)
| Model | What you manage | Example | Exam hook |
|---|---|---|---|
| On-prem | Everything | DGX BasePOD in your DC | Full control, full cost |
| IaaS | OS + software + data | EC2 with A100 instance | You manage the OS |
| PaaS | Data + applications | Managed Kubernetes | Managed Kubernetes = PaaS — exam answer |
| SaaS | Nothing (just use it) | DGX Cloud | Least management = SaaS — exam answer |
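The responsibility split in the table reduces to a lookup, which makes the two guaranteed exam answers (managed K8s = PaaS, DGX Cloud = SaaS) easy to check mentally. A sketch:

```python
# What the customer still manages under each consumption model,
# taken directly from the table above.
CUSTOMER_MANAGES = {
    "on-prem": ["hardware", "os", "software", "data", "apps"],
    "iaas": ["os", "software", "data", "apps"],
    "paas": ["data", "apps"],
    "saas": [],
}

def you_manage(model: str):
    return CUSTOMER_MANAGES[model.lower()]

print(you_manage("PaaS"))  # ['data', 'apps'] -- managed Kubernetes
print(you_manage("SaaS"))  # []              -- least management, DGX Cloud
```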
Also know these terms: Data Locality | Data Sovereignty | Hybrid IT Strategies | Real-Time Performance
10 — High-Frequency Practice Questions
| # | Question | Answer |
|---|---|---|
| 1 | What are the 4 steps of the AI workflow? | Data Prep → Training → Optimization → Inference/Deployment |
| 2 | What tool maps to the Optimization step? | TensorRT (TRT) |
| 3 | What tool maps to the Deployment step? | Triton Inference Server |
| 4 | Key difference: Legacy vs AI network? | Legacy = TCP; AI = RDMA; Legacy = oversubscribed; AI = nonblocking |
| 5 | What does NVSwitch do (vs NVLink)? | NVLink = chip-to-chip GPU-to-GPU; NVSwitch = all-to-all fabric at full bandwidth for all GPUs simultaneously |
| 6 | What is RoCE? | RDMA over Converged Ethernet — InfiniBand packets in Ethernet frames for RDMA over Ethernet |
| 7 | Managed Kubernetes = which cloud model? | PaaS |
| 8 | Minimal management cloud model? | SaaS |
| 9 | DCGM function? | GPU health monitoring and management in clusters; simplifies admin; improves reliability |
| 10 | Ansible vs DCGM vs Slurm — what does each do? | Ansible = provisioning; DCGM = GPU monitoring; Slurm = workload scheduling |
| 11 | Max MIG instances per GPU? | 7 |
| 12 | Does MIG increase power? | No |
| 13 | Which GPUs support MIG? | A100 (Ampere), H100 (Hopper), B200 (Blackwell) |
| 14 | DGX H100 GPU count + key BW? | 8x H100; 10x ConnectX-7 = 1TB/s peak network bandwidth |
| 15 | GB200 NVL72: how many GPUs? | 72x Blackwell + 36x Grace CPUs; "exaflop supercomputer in a rack" |
| 16 | BasePOD compute switch model? | QM9700 (NDR InfiniBand 400Gbps) |
| 17 | Primary storage requirement for AI? | Large capacity with focus on READ IO performance (training = repeated reads of datasets) |
| 18 | DPU role (3 words)? | Offload + Accelerate + Isolate infrastructure tasks from CPU |
| 19 | Web interface for DGX Cloud? | Base Command Platform |
| 20 | What are NVIDIA AI Foundations? | NeMo (text/LLM) + Picasso (visual) + BioNeMo (biology) — managed services on DGX Cloud |
| 21 | vGPU: true statements (select 2)? | Multiple VMs share a single GPU AND a single VM can access multiple GPUs |
| 22 | Cooling options for GPUs (select 2)? | Water cooling + Air cooling |
| 23 | 3 data center resources to consider? | Power + Cooling + Space |
| 24 | True about accelerated computing (select 3)? | GPUs use less energy over time; MIG boosts utilization without more power; GPUs handle compute-intensive functions |
| 25 | Blackwell: what generation NVLink? How many GPUs? | 5th-gen NVLink; scales to 576 GPUs |
Infrastructure is 40% — master GPU architectures, DGX specs, and networking first. Study the management tools mapping (Ansible/DCGM/Slurm) as a unit — it appears as a matching question. MIG questions are tricky: the exam tries to confuse MIG (partitioning) with NVSwitch (fabric) and GPU Operator (K8s software). Know RoCE definition exactly. Kubernetes = PaaS, DGX Cloud = SaaS are guaranteed questions.
NVIDIA Product Cheat Sheet
| Product | Category | One-Line | When to Recommend |
|---|---|---|---|
| NIM | Inference | Any model as OpenAI-compatible API | Always — fastest time to inference |
| NeMo Framework | Training | Train/fine-tune/customize LLMs end-to-end | Domain-specific models, SFT, LoRA |
| NeMo Guardrails | Safety | Safety/compliance rails at inference | Any production deployment |
| NeMo Retriever | RAG | Semantic search + vector DB integration | Knowledge-grounded RAG pipelines |
| AIQ Toolkit | Agents | NVIDIA-native multi-agent orchestration (v1.3) | Agentic apps on NVIDIA stack |
| Triton Inference Server | Serving | Multi-framework model serving for production | High-throughput, multi-model environments |
| TensorRT / TRT-LLM | Optimization | GPU inference optimization (5× speedup) | Latency-sensitive production inference |
| DGX Cloud | Cloud | Multi-node AI training as a service | Training without capex; bursty workloads |
| AI Enterprise | Platform | Production-grade software platform with SLA | Any enterprise going to production |
| Base Command Manager | Ops | Cluster provisioning + Slurm/K8s orchestration | On-prem DGX cluster management |
| DCGM | Monitoring | GPU telemetry, health checks, diagnostics | Any managed GPU cluster |
| NGC Catalog | Registry | GPU-optimized containers, models, Helm charts | Starting point for any NVIDIA deployment |
| NVLink 5 | Networking | 1.8 TB/s GPU-to-GPU (within node) | Multi-GPU training, large model inference |
| InfiniBand NDR | Networking | 400 Gb/s multi-node training fabric | Multi-node DGX training clusters |
| BlueField DPU | Infrastructure | Offload networking/storage from CPU | Large-scale inference, zero-trust security |
| GPU Operator | K8s | Automates NVIDIA software on Kubernetes | Any K8s cluster with NVIDIA GPUs |
| MIG | Virtualization | Partition one GPU into isolated instances | Multi-tenant inference, shared infrastructure |
| Omniverse / Replicator | Synthetic Data | Photorealistic synthetic data for computer vision | CV training, robotics, defect detection |
| H100 Confidential Computing | Security | Encrypt data + model in-use (TEE) | Healthcare, finance, government regulated AI |