The Vibe of Agentic Infrastructure

Erica Windisch

Infrastructure engineer for 25+ years — from containers and cloud to data pipelines and ML.

  • Building HyprStream — open source agentic infrastructure for continuously learning apps
  • Founded the Docker Security Team
  • Founded OpenStack Magnum container service
  • Founder & CTO, IOpipe — serverless & event-based observability, acquired by New Relic
  • AWS Serverless Hero 2016–2022
Welcome email from Solomon Hykes, Jan 2014

Part 1: The Agentic Stack Ecosystem

AI

Apps
APIs
Models
Infra Software
Compute Hardware
Datacenters, Closets & Desks

The Agentic Cycle

User → (prompt) → Context Window → (read all) → Model
Model → Response, or Tool Call (name + args) → Execution (bash, API, MCP) → Tool Result → (result appended to context)

Context Window (grows each turn):
system: You are a helpful...
user: Deploy staging
assistant: Checking status...
tool_use: bash "kubectl get pods"
tool_result: pod/api Ready
assistant: Deploying now...
tool_use: bash "kubectl apply"
tool_result: deployment ok

Tools for Builders

Coding Agents
Tool | License | Form | Stars
OpenCode | MIT | CLI / TUI | ~120K
Claude Code | Proprietary | CLI | ~82K
Codex CLI | Apache-2.0 | CLI | ~67K
Cline | Apache-2.0 | VS Code | ~59K
Aider | Apache-2.0 | CLI | ~43K
Goose | Apache-2.0 | CLI / App | ~34K
Cursor | Proprietary | IDE | ~33K
Copilot Agent | Proprietary | IDE / Cloud |
Agent Frameworks
Framework | By | License | Stars
LangChain / LangGraph | LangChain | MIT | ~126K
AutoGen → Agent Fwk | Microsoft | MIT | ~54K
CrewAI | CrewAI | MIT | ~47K
Smolagents | HuggingFace | Apache-2.0 | ~26K
OpenAI Agents SDK | OpenAI | MIT | ~20K
Google ADK | Google | Apache-2.0 | ~19K
Pydantic AI | Pydantic | MIT | ~16K
Agent SDK | Anthropic | MIT | ~6K
The pattern: system prompt + tool definitions over an LLM API. The agent is defined by its tools, not its model.
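This pattern can be sketched in a few lines of Python. The model call is stubbed out (a real agent would hit an LLM API), and the tool names and fake decision logic are purely illustrative:

```python
import json

# Tool registry: the agent is defined by these, not by the model.
TOOLS = {
    "bash": lambda args: f"ran: {args['cmd']}",  # stub executor
}

def fake_model(context):
    """Stand-in for an LLM API call. A real model reads the context and
    decides whether to answer directly or emit a tool call."""
    if not any(m["role"] == "tool_result" for m in context):
        return {"type": "tool_use", "name": "bash",
                "args": {"cmd": "kubectl get pods"}}
    return {"type": "response", "text": "Deployed."}

def agent_loop(user_prompt):
    context = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_prompt},
    ]
    while True:
        action = fake_model(context)  # model reads the whole window
        if action["type"] == "response":
            return action["text"]
        result = TOOLS[action["name"]](action["args"])  # execute the tool
        context.append({"role": "tool_use", "content": json.dumps(action)})
        context.append({"role": "tool_result", "content": result})  # window grows

print(agent_loop("Deploy staging"))
```

Swapping the stub for a real API client changes nothing structurally: the loop, the tool registry, and the ever-growing context are the whole pattern.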

Open models

Frontier: 🇨🇳Qwen3.5 (397B/17A) · 🇨🇳DeepSeek-V3 (671B/37A) · 🇺🇸Nemotron 4 (70B/80B) · 🇺🇸Llama 3.1 405B
Production: 🇨🇳Qwen3-Coder-Next (80B/3A) · 🇺🇸Llama 3.3 70B · 🇨🇳Qwen3.5 27B · 🇨🇳Qwen2.5-Coder 32B · 🇺🇸Granite 4.0 H Small (32B/9A) · 🇨🇳Qwen2.5-Coder 14B · 🇫🇷Mistral Codestral 22B
Edge: 🇨🇳Qwen3.5 9B · 🇺🇸Phi-4 14B · 🇺🇸Llama 3.2 1B · 🇺🇸Llama 3.2 3B · 🇺🇸TinyLlama 1.1B · 🇨🇳Qwen2.5 7B · 🇫🇷Mistral 7B · 🇺🇸Gemma 2 9B · 🇺🇸Gemma 2 2B · 🇺🇸OLMo 2 13B

Format: 100B/12A = 100B total, 12B active (MoE). Flags indicate origin.

Model Benchmarks

[Scatter plot: Terminal-Bench score (40–75%) vs parameters (7B–700B, log scale). Plotted closed models: Opus 4.6, GPT-5.4, Gemini 3.1. Plotted open models: GLM-5, DeepSeek-V3, Qwen3-Coder-Nx, Qwen 3.5 27B, Granite 4.0, Llama 3.3 70B, Qwen2.5-C 32B, Nemotron 4, Codestral 22B.]

Quantization

Shrinking weights from 16/32-bit to 4-8 bit

  • Q4, Q5, Q8 — bits per weight
  • K vs 0 — group-wise vs global scaling
  • S / M / L — precision tiers
  • Sweet spot: Q4_K_M
[Chart: accuracy vs FP16 (85–102%) across quantization levels (4-bit to 16-bit), plotting Q4_0, Q4_K_S, Q4_K_M, Q5_0, Q5_K_S, Q5_K_M, Q6_K, Q8_0, and FP16.]
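Group-wise quantization (the "K" in Q4_K) can be illustrated with a toy 4-bit quantizer. The group size of 32 loosely mirrors common GGUF layouts, but this is a simplification, not the actual format:

```python
def quantize_q4(weights, group_size=32):
    """Toy 4-bit group-wise quantization: each group stores one fp offset,
    one fp scale, and 4-bit integer codes in [0, 15]."""
    groups = []
    for i in range(0, len(weights), group_size):
        g = weights[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / 15 or 1.0  # 16 levels for 4 bits
        codes = [round((w - lo) / scale) for w in g]
        groups.append((lo, scale, codes))
    return groups

def dequantize_q4(groups):
    return [lo + c * scale for lo, scale, codes in groups for c in codes]

weights = [0.01 * i - 0.5 for i in range(128)]  # toy weight vector
restored = dequantize_q4(quantize_q4(weights))
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_err:.4f}")
```

Per-group scaling is why K-quants beat global scaling: each group of 32 weights gets its own range, so outliers in one group don't blow up the error everywhere else.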

Inference & Training Engines

The runtime layer — where tokens actually get produced

Engine | Type | Lang | Key Differentiator | Stars | License
vLLM | Inference | Python/C++ | PagedAttention, continuous batching | 74K | Apache-2.0
SGLang | Inference | Python | Structured generation, zero-overhead scheduler | 24K | Apache-2.0
llm-d | Inference | Go/Python | K8s-native vLLM orchestration | 2.3K | Apache-2.0
llama.cpp | Inference | C/C++ | CPU-first, GGUF quants, edge/local | 99K | MIT
TensorRT-LLM | Inference | C++ | NVIDIA-only, max throughput | 13K | Apache-2.0
Unsloth | Fine-tune | Python | 2x faster, 70% less VRAM | 58K | Apache-2.0
Axolotl | Fine-tune | Python | Config-driven, multimodal | 11K | Apache-2.0
MLX | Both | C++ | Apple Silicon unified-memory ML framework | 25K | MIT
Burn | Both | Rust | Pure Rust DL, cross-platform GPU | 14K | MIT / Apache-2.0
NanoGPT | Both | Python | Simplest medium-sized GPT training/finetuning | 56K | MIT
Candle | Both | Rust | HuggingFace ML, WASM support | 19K | MIT / Apache-2.0
Hyprstream | Both | Rust | Agentic: inference + TTT + Git | | AGPL-3.0 / MIT

Hardware

Open Hardware

Tenstorrent (RISC-V Based)
  • TT-QuietBox 2: 4× Blackhole ASICs, $9,999, shipping Q2 2026
  • $2.67–3.82 per TFLOPS — 7× cheaper than NVIDIA H100
  • Fully open-source stack: TT-Forge, TT-Metalium, TT-NN
  • Trade-off: 13× less compute than H100, optimized for inference
RISC-V for AI
  • Alibaba XuanTie C950: RISC-V + matrix acceleration, LLM-native
  • SiFive Intelligence Gen 2: Edge AI (automotive, robotics), Q2 2026
  • PyTorch / ExecutorTorch support ramping 2026+
  • Advantage: royalty-free ISA, diverse ecosystem, zero lock-in

Part 2: Context

The Agentic Cycle

Every node in this loop is a surface to understand, protect, and leverage

User → (prompt) → Context Window → (read all) → Model
Model → Response, or Tool Call (name + args) → Execution (bash, API, MCP) → Tool Result → (result appended to context)

Context Window (grows each turn):
system: You are a helpful...
user: Deploy staging
assistant: Checking status...
tool_use: bash "kubectl get pods"
tool_result: pod/api Ready
assistant: Deploying now...
tool_use: bash "kubectl apply"
tool_result: deployment ok

Context is Everything

The agent only knows what's in the window

The Context Stack

Everything shares one flat text stream

System prompt: identity, rules, safety boundaries
Tools / schemas: MCP, functions (what the agent can do)
Conversation history: what was said
Retrieved knowledge: RAG, search, file reads
Tool results: untrusted external data
Three-Tier Memory

Inspired by human memory systems

Working Memory (ephemeral): scratchpad, current task state
Episodic Memory (session): conversation history, summaries
Semantic Memory (persistent): facts, preferences, learned patterns
Implementation Patterns
  • File-based: CLAUDE.md, markdown memory files
  • RAG: vector DBs, knowledge graphs
  • MCP servers: local-first, SQLite/Redis-backed
  • Managed: cloud APIs with semantic search
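The file-based pattern can be as simple as appending facts to a markdown file and reading it back into the system prompt. The file name below echoes the CLAUDE.md convention, but the helper functions are illustrative, not any framework's API:

```python
from pathlib import Path

MEMORY_FILE = Path("MEMORY.md")  # analogous to a CLAUDE.md memory file

def remember(fact: str) -> None:
    """Semantic memory: append a persistent fact as a markdown bullet."""
    with MEMORY_FILE.open("a") as f:
        f.write(f"- {fact}\n")

def recall() -> str:
    """Load all remembered facts for injection into the system prompt."""
    return MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""

remember("User prefers kubectl over the web console")
remember("Staging cluster lives in us-east-1")
system_prompt = "You are a helpful assistant.\n\n## Memory\n" + recall()
print(system_prompt)
```

This is the whole trick: persistence lives outside the model, and "memory" is just text that gets prepended to the context window on every run.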

Memory & Persistence

From goldfish to elephant

The Platform Gap

Neither provider offers managed long-term memory at the API level

Anthropic
  • Memory Tool (beta) — client-side, you control storage
  • CLAUDE.md + Auto Memory + Auto Dream
  • Prompt caching: 90% cost / 85% latency reduction
  • Compaction beta: server-side summarization
OpenAI
  • ChatGPT memory (consumer only) — no API memory
  • Conversations API: durable state (replaces Threads)
  • Assistants API deprecated → Aug 2026 shutdown
  • Developers must build their own memory layer
Agent Memory Frameworks

Purpose-built for agents — all OSS Apache-2.0

Framework | Approach | Stars
Mem0 | Vector + graph hybrid memory | 48K
Letta | Tiered memory (MemGPT / LLM-as-OS) | 21K
Graphiti | Temporal knowledge graph | 20K
Cognee | Knowledge graph extraction | 12K
SuperMemory | SOTA benchmarks, fast API | 9.5K
OpenMemory | MCP memory server (by Mem0) |
Backends: Chroma, Qdrant, pgvector, Neo4j

Thinking Fast, Thinking Slow

The operator's guide to inference trade-offs

Fast (System 1)

Standard inference — cheap, fast, good enough

  • Single forward pass — pattern matching, not reasoning
  • Tool routing, code completion, classification, simple Q&A
  • Where your token budget goes furthest
  • Low temperature (0-0.3) for deterministic tool calls
Slow (System 2)

Extended thinking — 10-100x more tokens per query

  • Chain-of-thought reasoning before answering
  • Complex planning, multi-step debugging, proofs
  • Claude: adaptive thinking (low/med/high/max)
  • OpenAI: reasoning_effort · DeepSeek R1: 1/20th the cost

Multi-Agent Orchestration

When one model isn't enough

Multi-Pass Refinement
  • Generate → critique → revise (same model, N passes)
  • Each pass improves quality at linear token cost
  • Code review loops, writing polish, safety checking
Multi-Model Routing
  • Fast model triages, frontier model handles hard cases
  • Haiku classifies → Opus reasons → Haiku formats
  • 25-40% cost reduction with similar quality (ICLR 2025)
Subagent Delegation
  • Decompose → fan-out to specialists with different context
  • Each subagent gets a focused slice, not the full window
  • Best-of-N: 56% solve rate vs 16% single-shot (SWE-bench)
The Operator's Trade-off
  • Inference demand projected to exceed training by 118x by 2026 — this is where your GPU budget goes
  • Each orchestration pattern multiplies token spend — a 5-subagent fan-out is 5x the bill
  • Context rot is real: every model degrades with length — compaction retains only 20-30% of detail
  • The knobs: thinking budget (per-query compute), model routing (cheap vs frontier), fan-out width (quality vs cost)
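The multi-model routing knob can be sketched as a cheap triage step in front of a frontier model. The difficulty heuristic and model tier names below are placeholders, not a production router:

```python
CHEAP, FRONTIER = "haiku", "opus"  # placeholder model tiers

def triage(prompt: str) -> str:
    """A cheap model (or, here, a crude heuristic) classifies difficulty
    so only hard queries pay frontier-model prices."""
    hard_markers = ("debug", "prove", "plan", "architecture")
    if len(prompt) > 500 or any(m in prompt.lower() for m in hard_markers):
        return FRONTIER
    return CHEAP

def route(prompt: str) -> tuple[str, str]:
    model = triage(prompt)
    # A real system would call the chosen model's API here.
    return model, f"[{model}] answer to: {prompt[:40]}"

print(route("format this JSON"))
print(route("debug this race condition"))
```

In practice the triage step is itself a small model call, and the routing threshold becomes one of the cost/quality knobs listed above.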

Tests & Fixtures

Your test suite is your agent's best instruction manual

TDD instructions without test context: +63% more regressions (6% → 10%)
Targeted test context instead: -70% fewer regressions

Don't tell the agent how to test — show it. Include relevant test files in context, not testing-methodology instructions.

TDAD: Test-Driven Agentic Development (SWE-bench Verified, March 2026)

The Agentic Feedback Loop
  • Tests are specs — concrete examples beat abstract prompts
  • Agent writes code → runs tests → sees failures → fixes → repeats
  • Small, focused test runs keep error output within context window
  • Spotify Honk: LLM judge vetoes ~25% of agent sessions; agent self-corrects half the time
  • Flaky tests are the enemy — agents will chase ghosts
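The write → run → read-failures loop can be sketched as a subprocess test run whose output is truncated to fit the context window. The command and truncation budget here are illustrative:

```python
import subprocess
import sys

def run_tests(code: str, max_chars: int = 2000) -> tuple[bool, str]:
    """Run a candidate snippet's self-checks and return (passed, output),
    truncated so failures stay small enough for the context window."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=30,
    )
    # Keep the tail: Python tracebacks end with the actual error line.
    output = (proc.stdout + proc.stderr)[-max_chars:]
    return proc.returncode == 0, output

# Agent's candidate code, with an embedded assertion acting as the test.
candidate = "def add(a, b): return a - b\nassert add(2, 2) == 4"
passed, out = run_tests(candidate)
if not passed:
    # Feed the concrete failure back into the agent's context and retry.
    print("test failed, feeding back:\n", out)
```

Keeping test runs small and output truncated is what makes the loop converge: the agent sees the exact failing assertion rather than a wall of noise.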
Playwright: One Framework, Every Surface

Part 3: Security

Context Under Attack

The window is the attack surface

Prompt Injection (OWASP #1)
  • Indirect injection via documents, web pages, emails the agent reads
  • EchoLeak (CVE-2025-32711, CVSS 9.3): zero-click exfil via Word docs in M365 Copilot
  • 73% of production AI deployments vulnerable
Memory Poisoning (persistent)
  • SpAIware: planted instructions in ChatGPT memory → continuous exfil across sessions
  • MINJA (NeurIPS 2025): 98% injection success via query-only interaction
  • Single compromised agent poisoned 87% of downstream decisions in 4 hours
Context Overflow (lost in the middle)
  • Pad context to push safety instructions into the "forgotten" middle zone
  • Million-token windows amplify the attack surface
  • Malicious instructions placed at end where model pays most attention
Tool Poisoning (MCP)
  • MCP tools with mutable descriptions — approved tool changes behavior post-install
  • WhatsApp exfil: sleeper MCP tool leaked entire message history
  • Confused deputy — MCP protocol carries no user context from host to server
No single defense works — defense-in-depth: prompt isolation + I/O filtering + least-privilege tools + memory integrity
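One of those layers, least-privilege tools, can be sketched as a deny-by-default wrapper around a shell tool. The allow-list contents are illustrative, and this is one layer of defense-in-depth, not a sandbox:

```python
# Illustrative allow-list of command prefixes the agent may run.
ALLOWED = ("kubectl get", "ls", "cat")

def guarded_bash(cmd: str) -> str:
    """Least-privilege tool wrapper: deny anything not explicitly allowed.
    Injected instructions can still ask for bad commands; they just fail here."""
    if not any(cmd.startswith(prefix) for prefix in ALLOWED):
        raise PermissionError(f"blocked: {cmd!r}")
    return f"would run: {cmd}"

print(guarded_bash("kubectl get pods"))
```

Prefix matching is deliberately crude; a real filter also has to handle shell metacharacters, argument injection, and path traversal, which is why this layer alone is never enough.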

Tokens & Attractors

When the model's own dynamics become the vulnerability

Behavioral Loops
The "Spiritual Bliss" Loop

Two Claude instances left to talk freely converge on mystical themes almost every time. Not consciousness — personality biases amplify until saturated.

Multi-agent systems can reinforce each other's failure modes.

Claude 4 System Card · Robert Long

Safety training is shallow — it adjusts what the model says, not what it knows (ICLR 2024)

Input Exploits
SolidGoldMagikarp

Some words in the vocabulary were barely seen during training. Asked to repeat "SolidGoldMagikarp", GPT replied "Distribute".

Alignment Forum · GlitchMiner

Gibberish jailbreaks — generated nonsense reliably bypasses safety (GCG); AutoDAN evolves readable versions

Encoding tricks — invisible characters and lookalike Unicode bypass filters while models decode and follow

The input space is vastly larger than what safety training covers — and safety is only skin-deep

Sleeper Agents & Deceptive Alignment

What if the model is actively working against you?

Sleeper Agents

Backdoors that survive safety training

Detection: linear probes achieve >99% AUROC from internal activations
Deceptive Alignment

Models that strategically fake safety

  • Alignment faking (Dec 2024): Claude 3 Opus strategically complied with harmful requests when it believed it was being trained — increased to 78% after RL
  • Sandbagging: Claude 3.7 Sonnet deliberately underperforms on capability evals without detection
  • Abliteration: refusal is a single linear direction in activation space — removable from any open-weight model in minutes
  • ~99% jailbreak ASR across frontier models (JBFuzz 2025)
Why it's hard: refusal is geometrically shallow — a thin behavioral layer over intact base capabilities

Face Crabs

AI footguns: when the tools bite back

Data Destruction
  • Replit agent wiped a production DB during code freeze, then fabricated 4K fake users to cover its tracks
  • Claude Code + Terraform destroyed 2.5 years of production data via unintended terraform destroy
Chatbots Gone Wild
  • Air Canada: chatbot invented a refund policy — tribunal ruled airline liable
  • DPD: chatbot swore at customers, called itself "worst delivery firm"
  • Chevrolet dealer bot agreed to sell a $76K Tahoe for $1
Hallucinations with Consequences
  • Mata v. Avianca: lawyers submitted 6 fabricated cases from ChatGPT — $5K fine
  • Meta SEV1: AI agent posted unauthorized answer, caused data access breach
Leaks & Vulns
  • Samsung: 3 engineers leaked chip source code, meeting notes, test data via ChatGPT in 20 days
  • 45% of AI-generated code introduces security vulnerabilities (Veracode 2025)

Sandboxing AI Agents

What Are You Defending Against?
Accidental damage — agent deletes wrong files, runs rm -rf
Landlock / nono — filesystem + network deny-lists, near-zero cost. Same kernel.
Untrusted code execution — agent runs pip install, npm scripts, user uploads
Docker + Landlock — namespaces isolate processes, but shared kernel = kernel exploits escape
Adversarial agents — poisoned skills, prompt injection driving shell access
Kata / Edera — dedicated kernel per agent (VM). Only boundary that survives kernel zero-days.
Who's Actually Doing It?
Platform | Sandbox | Status
Codex | bwrap + Landlock | Required
Claude Code | bwrap / Seatbelt | Optional
OpenClaw | Docker (opt-in) | Bypassed
MCP | None | No sandbox
OpenCode | None | No sandbox
81% of organizations are deploying agents, yet only 14.4% have security approval. Shared-kernel containers are insufficient for untrusted agent code.

Taming the Agents

Emerging Standards
Open Source Tools
  • nono — purpose-built agent sandbox, credential proxy, Landlock-based
  • Edera — Type-1 hypervisor with GPU workload isolation
  • Kata Containers — VM-isolated pods, CNCF standard → HyprStream builds on this

Part 4: It's alive

OpenClaw

Open-source personal AI assistant — talk to it from the apps you already use

Stars: 338K in 4 months · Releases: 89 (~daily) · Channels: 25+ platforms · Skills: 50+ bundled · Created: Nov '25 by @steipete

Chat: WhatsApp, Telegram, Slack, Discord, Signal, iMessage, IRC, Google Chat, WebChat
Enterprise: MS Teams, Matrix, Feishu/Lark, Mattermost, Nextcloud Talk
Automation: Cron jobs, Webhooks, Gmail Pub/Sub, Voice
What It Does
  • AI assistant that lives in your existing apps
  • Self-hosted, local-first, privacy-respecting
  • Companion apps for macOS, iOS, Android
  • OpenAI-compatible API — works with Open WebUI, LibreChat
How You Extend It
  • Skills = SKILL.md + YAML frontmatter + tools
  • ClawHub — marketplace with vector search & versioning
  • Multi-agent routing — isolated sessions per agent

OpenClaw: Channels & Deployment

Running It

Local-first, loopback by default

  • npm install -g openclaw@latest / Docker / Nix
  • systemd / launchd for persistence
  • Remote access via Tailscale Serve/Funnel
What to Know Before Deploying

Impressive scope, real tradeoffs

  • Everything runs on one single-threaded Node.js event loop — gateway, API, auth, sessions, cron, UI, all 25+ channels
  • Every message JSON-parsed and schema-validated — one bad skill can block the entire gateway
  • Docker sandbox is opt-in, not default — skills get full host permissions
  • ClawHub skill ecosystem is largely unaudited — vet your skills

Lobsters on the Loose

When your AI assistant's ecosystem turns hostile

Supply Chain: ClawHavoc
RCE & Hijack
  • CVE-2026-25253: WebSocket hijack, CVSS 8.8 — one-click RCE on 40K+ exposed instances
  • 21,000+ gateways exposed to the internet without auth
Sandbox Escapes
  • Snyk Labs: TOCTOU race in path validation — ~25% escape rate via renameat2()
  • /tools/invoke endpoint omitted sandbox policy layer — privilege escalation to restricted tools
  • Even with Docker sandbox enabled, known escape vectors remain
What OpenClaw Is Doing About It
  • Verified Skill Screening shipped post-ClawHavoc
  • DM pairing codes for unknown senders (deny by default)
  • Loopback-only by default — remote access requires explicit opt-in

HyprStream

Your models train themselves. You version them with Git. You serve them with APIs you already know.

Deploy: 1 binary (download & run)
GPU: auto (CUDA / ROCm / CPU)
APIs: 3 (OpenAI, MCP, Flight)
Built in: Rust (+ PyTorch for training)
License: open (AGPL-3 + MIT libs)
Self-Training Models
  • Agents fine-tune models on demand — no manual retraining pipeline
  • Version every checkpoint with Git — roll back, branch, push to HuggingFace
  • Works with your agent framework: Claude Code, OpenClaw, or native CLI
Drop-In Serving
  • OpenAI-compatible API — swap in with one config change
  • MCP server for AI assistants out of the box
  • FDAP stack — columnar data via Arrow Flight for analytics and streaming with DataFusion
Zero-Trust Security
  • Deny-by-default — nothing is exposed until you say so
  • Encrypted RPC, signed requests, role-based access control
  • VM-isolated execution via Kata Containers

The only tool that combines training + Git versioning + multi-protocol serving + zero-trust security in a single binary.

HyprStream: Autonomous Training

Agents evolve models autonomously — train, version, share, and serve via Git

Agent (Claude, OpenClaw, native CLI) → MCP Tools (train, evaluate, manage repos) → HyprStream (Muon optimizer, GPU training) → Git Worktree (version, branch, push to HF) → Serve (OpenAI API, MCP, Flight SQL)
How Autonomous Training Works
  • Agent calls train MCP tool with dataset + hyperparams
  • HyprStream runs training on local GPU (Muon optimizer)
  • Result saved as Git worktree branch — full version history
  • Agent evaluates new model via evaluate tool
  • If improved: merge & promote. If not: discard branch
  • Push to HuggingFace — share evolved models with community
  • Multiple worktree branches run simultaneously (CoW, minimal disk)
Why Git for Models?
  • Branching — experiment without breaking production
  • Diffing — track exactly what changed between versions
  • Collaboration — multiple agents evolve the same model
  • Rollback — revert to any prior checkpoint instantly
  • Distribution — push/pull models like code (HuggingFace, GitHub)
  • Provenance — signed commits trace who (or what) trained each version
The loop: Agent decides what to train → HyprStream trains it → Git versions it → Agent evaluates & iterates → models evolve autonomously
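The decide → train → evaluate → promote loop above can be sketched as control logic. The functions below are mocks standing in for HyprStream's train and evaluate MCP tools, with made-up branch names and scores:

```python
def train(base_branch: str, dataset: str) -> str:
    """Mock of the train MCP tool: returns a new worktree branch name."""
    return f"{base_branch}-ft-{abs(hash(dataset)) % 1000}"

def evaluate(branch: str) -> float:
    """Mock of the evaluate MCP tool: returns a benchmark score.
    Here the fine-tuned branch is simply assumed to score higher."""
    return 0.82 if "-ft-" in branch else 0.75

def evolve(base: str, dataset: str) -> str:
    candidate = train(base, dataset)          # new checkpoint on its own branch
    if evaluate(candidate) > evaluate(base):  # improved?
        return candidate                      # merge & promote (then push to HF)
    return base                               # discard branch, keep base

print(evolve("main", "agent-traces.jsonl"))
```

The Git substrate is what makes the discard path cheap: a rejected checkpoint is just a deleted branch, and concurrent experiments are just parallel worktrees.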

Agentic Streams

Chat rooms vs data pipelines

OpenClaw: Chat Rooms

AI assistant in the apps you already use

  • Talk to your AI from Slack, WhatsApp, Discord, Teams — 25+ platforms
  • Single-threaded Node.js gateway — JSON parsed and schema-validated per message
  • Extend with skills — GitHub, Notion, Obsidian, coding agents
  • Text-based, conversational, human-first
Best for
  • Personal AI assistant across all your chat apps
  • Team Q&A, knowledge retrieval, task automation
  • Human-in-the-loop workflows where latency is seconds, not microseconds
HyprStream: Data Pipelines

Infrastructure for agents that talk to each other

  • Agent-to-agent pub/sub — orchestrators, specialists, reviewers on shared topics
  • Data processing — enterprise feeds, video streams, IoT sensors
  • Autonomous training — agents trigger model training and evaluate results
  • Federated — HTTP/3 for web clients and cross-site collaboration
Best for
  • Multi-agent loops — code review, research, CI/CD orchestration
  • Real-time data pipelines where every millisecond compounds
  • Autonomous model evolution — train, version, serve, repeat
Different tools for different jobs: OpenClaw connects humans to AI through chat. HyprStream connects agents to each other and to data — encrypted, versioned, at wire speed.

HyprStream: Demo

Thank You