[{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/adr/","section":"Tags","summary":"","title":"Adr","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/agent-design/","section":"Tags","summary":"","title":"Agent-Design","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/agentic-workflows/","section":"Tags","summary":"","title":"Agentic-Workflows","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/agents/","section":"Tags","summary":"","title":"Agents","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/agile/","section":"Tags","summary":"","title":"Agile","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/ai/","section":"Tags","summary":"","title":"Ai","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/ai-agents/","section":"Tags","summary":"","title":"Ai-Agents","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/argocd/","section":"Tags","summary":"","title":"Argocd","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/automation/","section":"Tags","summary":"","title":"Automation","type":"tags"},{"content":" Where this started # The original homelab was simpler. A server, some Docker Compose files, Jellyfin for media, and a UGREEN NAS for storage. Useful, but not interesting.\nThen I started running local AI models. Ollama on the tower, a few inference endpoints, some scripts that called them. The problem wasn\u0026rsquo;t compute — heavyarms has an RTX 1060 that handles 7B and 13B models fine. The problem was coordination. The models were isolated. They couldn\u0026rsquo;t call each other, share context, or act on the cluster they lived in. Each one was a dead end.\nI wanted agents. 
Not chatbots — agents that register themselves, discover each other, maintain memory across sessions, and operate on real infrastructure. That meant building a platform, not just running models. And building a platform meant making real decisions about protocols, storage, secrets, observability, and discovery.\nThis is that story, from the first decision to where we landed.\nNode Type CPU RAM GPU Role heavyarms Tower Ryzen 8-core 80 GB RTX 1060 6 GB Primary compute exia NUC7i5 i5 — — k3s worker kyrios NUC7i5 i5 — — k3s worker dynames NUC7i5 i5 — — k3s worker (2x NUC11) NUC11 — — — Staged / waiting UGREEN NAS NAS — — — Persistent storage + Cloudflare tunnel The whole thing is powered by k3s, declaratively managed through ArgoCD, and exposed via Cloudflare Zero Trust — no VPN, no port forwarding, no /etc/hosts hacks.\nI didn\u0026rsquo;t build this for the flex. I built it because I wanted a real platform to run AI agents at home: agents that coordinate with each other, persist memory across sessions, and operate on actual infrastructure rather than a laptop demo environment.\nWhy Kubernetes for a homelab? # This comes up every time I talk about this setup. The honest answer: I want the same primitives I use at work. GitOps, declarative configuration, rolling updates, health probes, secrets management that doesn\u0026rsquo;t involve hardcoding tokens in a shell script. Kubernetes gives me all of that at home. k3s specifically gives me Kubernetes without the control-plane overhead — one binary, embedded etcd, runs fine on a NUC7.\nThe alternative is Docker Compose on the NAS, which I also do for a handful of services. But Compose doesn\u0026rsquo;t give me ArgoCD reconciliation, MetalLB load balancing, or Traefik ingress with cert-manager TLS. 
Once you have those, every new service is four files and a commit.\nThe stack # The platform layers break down clearly:\nLayer Tool Why Orchestration k3s v1.29.4 Lightweight k8s, embedded registry, vxlan CNI GitOps ArgoCD App-of-Apps pattern, auto-sync on commit Load Balancer MetalLB Bare-metal LoadBalancer type, IP pool 192.168.86.200-220 Ingress Traefik v26.1.0 IngressRoute CRDs, TLS termination Certificates cert-manager Self-signed cluster CA for internal TLS External Access Cloudflare Tunnel Zero Trust exposure, no inbound firewall rules Secrets External Secrets Operator GitOps-safe secret sync from Azure Key Vault Monitoring VictoriaMetrics + Grafana ~10x lighter than kube-prometheus-stack on NUC7 nodes AI Ops HolmesGPT AI-powered k8s investigator, Discord alert integration MCP Server stock-mcp Market data MCP for agent tool use Agent Registry Hermes Custom gRPC/REST agent discovery service (more on this below) Everything has a URL at *.apexarcology.com. ArgoCD, Grafana, HolmesGPT, stock-mcp, the NAS UI, Jellyfin, Photos — all behind Cloudflare Access with WARP device posture. No service touches the public internet without passing Zero Trust first.\nAdding a service takes four steps # The pattern is consistent enough that I have it memorized:\nCreate k8s/bootstrap/\u0026lt;app\u0026gt;/ with manifests and a kustomization.yaml Add an ArgoCD Application to k8s/apps/templates/\u0026lt;app\u0026gt;.yaml Add an IngressRoute pointing to \u0026lt;app\u0026gt;.apexarcology.com Add the ingress rule and DNS CNAME to nas/dns/main.tf Commit and push. 
ArgoCD picks it up within 3 minutes.\n# IngressRoute pattern apiVersion: traefik.io/v1alpha1 kind: IngressRoute metadata: name: \u0026lt;app\u0026gt; namespace: \u0026lt;app\u0026gt; spec: entryPoints: - websecure routes: - match: Host(`\u0026lt;app\u0026gt;.apexarcology.com`) kind: Rule services: - name: \u0026lt;app\u0026gt; port: 80 tls: secretName: \u0026lt;app\u0026gt;-tls The part that took the most thinking: agents # Everything above is solved infrastructure. MetalLB, Traefik, ArgoCD — all of these are well-documented, community-maintained, and just work if you follow the docs.\nAgents are not a solved problem. If you want AI agents to coordinate on a Kubernetes cluster — discovering each other, calling each other over gRPC, maintaining persistent memory, operating within a GitOps workflow — you have to make decisions. Which agent framework? Which memory system? How do secrets flow to agent pods? What does observability look like when the thing being monitored is an AI agent\u0026rsquo;s behavior, not just its CPU usage?\nI wrote eight Architectural Decision Records (ADRs) to answer these questions. Each one scored candidates against my actual requirements rather than hype. The results were sometimes surprising:\nAgent registry: Built my own in Go (Hermes) rather than adopting an existing platform, because no existing tool was k3s-native, gRPC-first, and ArgoCD-compatible without significant bending. Memory system: Replaced Mem0 (already deployed) with Hindsight for TEMPR four-strategy retrieval — graph + temporal + semantic + keyword, self-hosted, with a real Helm chart. Monitoring: VictoriaMetrics over kube-prometheus-stack. The NUC7 nodes have ~16 GB RAM shared across all workloads. VictoriaMetrics uses ~400 MB; kube-prometheus-stack uses 2–3 GB. DNS: Cloudflare Tunnel over Pi-hole/AdGuard, because the infrastructure was already there and the pattern was proven on the NAS. The next posts in this series go deep on each of these. 
Starting with Hermes, because it\u0026rsquo;s the most interesting thing I\u0026rsquo;ve built from scratch.\nWhat\u0026rsquo;s next # Post 2: Hermes — designing a gRPC agent registry in Go, the ADR process, and what it looks like deployed on k3s. Post 3: Five ADRs in a weekend — storage, secrets, monitoring, memory, and DNS, and what the scoring process taught me. Post 4: The Night City Crew — designing a roster of specialized AI agents named after Cyberpunk 2077 characters, each with a SOUL personality contract and dedicated MCP tool integrations. The repository is at github.com/gardlt/homelab. ADR specs live in specs/, the Hermes source is in apps/hermes/, and the k8s manifests are under k8s/bootstrap/.\n","date":"24 June 2026","externalUrl":null,"permalink":"/posts/homelab/001-building-an-ai-agent-platform-at-home/","section":"Posts","summary":"","title":"Building a Production-Grade AI Agent Platform in My Living Room","type":"posts"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/caveman/","section":"Tags","summary":"","title":"Caveman","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/ci-cd/","section":"Tags","summary":"","title":"Ci-Cd","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/claude-code/","section":"Tags","summary":"","title":"Claude-Code","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/cli/","section":"Tags","summary":"","title":"Cli","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/cloudflare/","section":"Tags","summary":"","title":"Cloudflare","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/cost-optimization/","section":"Tags","summary":"","title":"Cost-Optimization","type":"tags"},{"content":"","date":"24 June 
2026","externalUrl":null,"permalink":"/tags/cyberpunk/","section":"Tags","summary":"","title":"Cyberpunk","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/enterprise/","section":"Tags","summary":"","title":"Enterprise","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/fitness/","section":"Tags","summary":"","title":"Fitness","type":"tags"},{"content":"After deciding to build Hermes (see part 2), I had the agent registry. What I didn\u0026rsquo;t have was the infrastructure beneath it: persistent storage for agent state and metrics, a secrets backend that works in GitOps, an observability stack that fits on NUC7 nodes, a memory system for agents, and a DNS strategy that doesn\u0026rsquo;t require editing /etc/hosts on every client.\nI ran five ADRs over a weekend. The format was the same each time: define dimensions, score candidates honestly, accept the result. Here\u0026rsquo;s what I decided and why.\nADR 002 — Memory System: Hindsight over Mem0 # Problem: Agents need persistent memory — stored context, learned behavior, prior decisions — that survives restarts and is queryable across sessions.\nMem0 was already deployed on the cluster. It had 48K GitHub stars and a working Docker container. 
Switching to anything else meant decommissioning a running service and migrating data.\nThe alternative was Hindsight by Vectorize.io — a newer system built around TEMPR: four simultaneous retrieval strategies running in parallel.\nRetrieval Strategy What it does Semantic Vector similarity — \u0026ldquo;find memories about Kubernetes storage\u0026rdquo; Keyword/BM25 Exact keyword match — \u0026ldquo;find memories containing \u0026lsquo;Longhorn\u0026rsquo;\u0026rdquo; Graph Relationship traversal — \u0026ldquo;what did this agent decide that led to X\u0026rdquo; Temporal Time-range queries — \u0026ldquo;what happened last month\u0026rdquo; Mem0 offers semantic retrieval on the free/self-hosted tier. Graph retrieval — the ability to traverse relationships between memories — is locked behind the Pro tier at $249/month. I need graph retrieval. That\u0026rsquo;s not a nice-to-have; agents reasoning about prior decisions need to trace chains of causality, not just surface similar-sounding memories.\nScored four dimensions (retrieval quality, feature gating, Kubernetes/Helm deployment, per-agent memory bank config). Hindsight won all four, 12 vs 4. M5–M7 weren\u0026rsquo;t scored — the sweep was decisive enough that they couldn\u0026rsquo;t change the outcome.\nResult: Mem0 decommissioned. Hindsight deployed via official Helm chart, ArgoCD-managed. Each agent gets a configurable memory bank with Mission, Directives, and Disposition fields — a personality and behavioral contract baked into how it stores and retrieves memories.\nWhat it cost: Migrating away from a running service. Not insurmountable, but real switching cost. The ADR made the case that TEMPR\u0026rsquo;s graph retrieval was worth it.\nADR 005 — Storage: R2 for Objects, NAS for Block # Problem: No persistent storage layer existed. All current workloads were stateless or relied on NAS-hosted Docker Compose. 
The cluster needed PVCs for agent memory backends, Prometheus/VictoriaMetrics retention, and Grafana persistence.\nTwo tiers required:\nObject storage (S3-compatible, for backups and artifacts): Cloudflare R2 + csi-s3 vs MinIO Block storage (RWO PVCs for databases): Longhorn vs NFS from existing NAS Object storage: R2 won easily (17 vs 9). The complete design already existed in docs/network-storage-r2.md. The Terraform was partially written. R2 has zero egress fees. The operational cost is minimal — a CSI DaemonSet using csi-s3/geesefs for FUSE-mounted S3 buckets. MinIO is excellent for on-cluster S3 but requires significant disk allocation and operational overhead that wasn\u0026rsquo;t justified at homelab scale.\nBlock storage: More interesting. Longhorn is the \u0026ldquo;right\u0026rdquo; answer for production — configurable replication, scheduled snapshots, built-in UI. But P1 (block storage priority) was scored C: \u0026ldquo;not needed yet, defer block storage.\u0026rdquo; Without a hard requirement for RWO PVCs right now, Longhorn\u0026rsquo;s complexity wasn\u0026rsquo;t justified.\nNFS from the existing UGREEN NAS won (11 vs 7) for block. It\u0026rsquo;s already there, the NAS is reliable, and the nfs-subdir-external-provisioner StorageClass adds it to Kubernetes in one Helm install. The tradeoff is real: no HA, no true block semantics (NFS is a shared filesystem, so RWO is not enforced by the storage itself), and the NAS is a single point of failure. I\u0026rsquo;ll revisit Longhorn when agents need true block storage.\nResult: Cloudflare R2 + csi-s3 for objects. NAS NFS for block. Longhorn deferred.\nADR 006 — Monitoring: VictoriaMetrics over kube-prometheus-stack # Problem: The cluster had Grafana deployed with no data sources, no dashboards, a hardcoded admin password, and no persistence. No Prometheus, no AlertManager, no metrics collection.\nThe NUC7i5DNK nodes are resource-constrained — approximately 16 GB RAM per node, shared across all workloads. 
Every megabyte allocated to monitoring is a megabyte not available to agents.\nFour candidates: kube-prometheus-stack, VictoriaMetrics + Grafana, Grafana Cloud (Alloy agent), Netdata.\nThe scores broke down clearly once I hit M5 (resource budget):\nStack Approx RAM Notes kube-prometheus-stack 2–3 GB Full Prometheus + Grafana + AlertManager + exporters VictoriaMetrics 300–500 MB VictoriaMetrics + VMAgent + Grafana + VMAlert Grafana Cloud (Alloy) ~64 MB Alloy agent only; cloud backend required Netdata ~100 MB/node Per-node agent; not Prometheus-compatible Grafana Cloud would have won on resource usage, but M1 (self-hosted, no cloud dependency) eliminated it — metrics and logs staying on-cluster is a hard requirement. Netdata was eliminated by not being Prometheus-compatible (M2) and poor k8s-native metric support (M6). kube-prometheus-stack lost on M5 alone — 2–3 GB on NUC7 nodes is too much.\nVictoriaMetrics won at 17 vs 13 (kube-prometheus-stack) vs 15 (Grafana Cloud) vs 11 (Netdata). MetricsQL is a Prometheus superset — all existing dashboards and exporters work without modification. VictoriaLogs handles log aggregation without adding another system. VMAlert routes to Discord, complementing HolmesGPT\u0026rsquo;s existing AI-generated findings.\nResult: VictoriaMetrics + Grafana. Existing standalone Grafana HelmRelease replaced with the victoria-metrics-k8s-stack chart. Hardcoded admin password removed (routed through External Secrets Operator from ADR 004).\nADR 004 — Secrets: Azure Key Vault + External Secrets Operator # Problem: Secrets can\u0026rsquo;t live in Git. GitOps requires secrets to be declarative, but a Secret manifest with a plaintext password committed to a repository defeats the point.\nExternal Secrets Operator (ESO) was already in the cluster — it syncs secrets from an external backend into Kubernetes Secret objects on a defined schedule. 
The question was which backend.\nOptions evaluated: HashiCorp Vault (self-hosted), AWS Secrets Manager, Azure Key Vault with Workload Identity OIDC, and Doppler.\nThe homelab already had Azure investment (planned Azure AI Foundry and APIM workloads). Azure Key Vault with Kubernetes Workload Identity OIDC won: it requires no additional self-hosted service (unlike Vault), integrates cleanly with ESO\u0026rsquo;s ClusterSecretStore, and the Workload Identity OIDC flow means pods authenticate to Azure without any credentials in the cluster — just an annotated ServiceAccount and a federated identity credential in Azure AD.\nResult: Azure Key Vault as the ESO backend. Pod authentication via Workload Identity OIDC. All secrets flow: AKV → ESO → Kubernetes Secret → pod env var. No plaintext secrets in Git.\nADR 007 — DNS: Cloudflare Tunnel # Problem: k8s services (ArgoCD, Grafana, HolmesGPT, stock-mcp) were accessible only via manual /etc/hosts entries. Adding a new device meant editing the hosts file again.\nCandidates: Pi-hole, AdGuard Home, CoreDNS extension (already running in k3s), Cloudflare Tunnel.\nThe decision was made quickly. nas/dns/ already contained working Terraform (Cloudflare provider v4.52.7) exposing nas.apexarcology.com, photos.apexarcology.com, and jellyfin.apexarcology.com through a Cloudflare Tunnel connector running in UGOS Pro Docker. The pattern was proven and operational.\nExtending it to k8s services meant deploying a second cloudflared connector as a Kubernetes Deployment and adding ingress_rule blocks to the tunnel config for each new service. Same pattern, new tunnel target.\nResult: Cloudflare Tunnel. /etc/hosts instructions removed from README. All services at *.apexarcology.com route through Cloudflare Zero Trust. Access requires Cloudflare WARP device enrollment — no exposed ports on the home router.\nWhat five ADRs in a weekend taught me # The scoring process doesn\u0026rsquo;t eliminate judgment — it structures it. 
The dimensions you choose reflect the requirements you actually have. If you\u0026rsquo;re not honest about what you need, the scores will reflect your biases rather than your constraints.\nThe most important discipline was accepting the result when it was the uncomfortable answer. Replacing Mem0 (already deployed, already working) with Hindsight because graph retrieval was a hard requirement — that was uncomfortable. Choosing NFS over Longhorn when Longhorn is clearly the more capable system — that required acknowledging that \u0026ldquo;more capable\u0026rdquo; doesn\u0026rsquo;t mean \u0026ldquo;right for right now.\u0026rdquo;\nThe ADR spec for each of these decisions lives in specs/ in the repository. Each one has the full scoring table and rationale. Future me (and anyone else reading the repo) can see exactly why each choice was made and what the tradeoffs were.\nNext: the Night City Crew — a roster of seven specialized AI agents named after Cyberpunk 2077 characters, each with a defined role, a personality contract, and a set of tool integrations.\n","date":"24 June 2026","externalUrl":null,"permalink":"/posts/homelab/003-five-adrs-in-a-weekend/","section":"Posts","summary":"","title":"Five Infrastructure ADRs in a Weekend: Storage, Secrets, Monitoring, Memory, and DNS","type":"posts"},{"content":" The bill nobody\u0026rsquo;s reading # Token costs are the new cloud bill. Most developers don\u0026rsquo;t look at them until they\u0026rsquo;re large enough to matter — and by then the habits are already set.\nThe pattern is the same everywhere: every git status, every test run, every file read gets dumped into the model context. Raw, unfiltered, verbose. The model reads all of it, responds to all of it, and you pay for all of it. On cheap models, the waste is invisible. On Opus or GPT-5.5, it\u0026rsquo;s real money.\nCompanies are starting to notice. 
The question \u0026ldquo;how are you budgeting your tokens?\u0026rdquo; is showing up in engineering conversations the same way \u0026ldquo;how are you managing your AWS spend?\u0026rdquo; did five years ago.\nHere\u0026rsquo;s how I\u0026rsquo;m handling it.\nFour tools, four angles # The interesting thing about token optimization is that it\u0026rsquo;s a multi-layer problem. Verbose input, verbose output, uncompressed context, and bloated conversation history are separate failure modes. I found four open-source tools that each own one layer:\nrtk — squeeze the input # rtk intercepts CLI output before the model sees it. Every git status, every test result, every file listing gets filtered and compressed at the shell level. The model receives a summary, not a raw dump.\nThe key insight: most CLI output is noise. Status messages, progress bars, redundant headers, unchanged lines in a diff. Strip those before they enter the context and you eliminate a category of cost entirely.\ncaveman — shrink the output # caveman addresses the other side: model verbosity. Left to its own defaults, a language model will write you paragraphs when sentences would do. Caveman constrains the output format — short answers, no filler, no pleasantries, technical substance preserved.\nThis matters because output tokens are typically billed at a higher rate than input tokens, and verbose responses also grow your context on the next turn. Compound savings.\nheadroom — compress everything else # headroom handles the remaining context: conversation history, retrieved documents, anything that doesn\u0026rsquo;t fit the other two categories. It compresses long-form content before it enters the model without stripping the semantic content that makes it useful.\nThink of it as the catch-all layer. 
rtk handles CLI output, caveman handles response format, headroom handles everything else.\nponytail — trim the conversation tail # ponytail targets a specific and often overlooked cost center: long-running conversations. As a session grows, earlier turns accumulate in context even when they\u0026rsquo;re no longer relevant. ponytail progressively compresses or prunes the conversation tail — the older turns that are taking up context budget but contributing little to the current exchange.\nIt supports Claude, Codex, and Gemini, and ships with configurable intensity levels so you control the aggressiveness of the trim. The reported reductions are significant: ~54% on context from prior turns, ~20% on retrieved documents, ~27% on overall session length in longer agentic workflows.\nThe use case where this shines: multi-step agentic sessions that run for dozens of turns. Without tail trimming, context grows until you hit the window limit or costs spike. With it, the session stays lean regardless of how long it runs.\nWhat running all four looks like # Individual savings per tool vary. Combined, the reduction exceeds 80% of baseline token usage.\nThe part that surprised me: answer quality doesn\u0026rsquo;t degrade. Compression isn\u0026rsquo;t the same as loss. Most of what gets removed was noise the model was processing but not meaningfully using — verbose CLI headers, filler sentences in responses, redundant context that was already summarized, and stale early turns that stopped being relevant ten messages ago.\nIf you\u0026rsquo;re on a paid Opus or GPT-5.5 plan, 80% reduction is a direct budget impact. If you\u0026rsquo;re on a cheaper model, the gain shows up as speed — smaller context windows process faster.\nHow to think about token budgeting # The mental model that\u0026rsquo;s been useful for me: treat tokens like compute. You wouldn\u0026rsquo;t run a full database export to answer a query that only needs an index. 
The same discipline applies to what you put in and take out of model context.\nA few questions worth asking for any workflow:\nWhat\u0026rsquo;s entering the context? Is it raw output or pre-processed? Could a CLI filter remove half of it before the model sees it? What\u0026rsquo;s leaving the context? Are responses longer than they need to be? Is the model writing essays when it could write sentences? What\u0026rsquo;s accumulating in context? Is conversation history growing unbounded? Could earlier turns be compressed without losing the thread? These aren\u0026rsquo;t rhetorical. Each one is a category of waste with a tool or technique that addresses it.\nWhat\u0026rsquo;s next # Next post: a deep dive into token compression — how headroom works under the hood, where semantic compression wins over naive truncation, and what it looks like to build compression into a CI/CD pipeline.\nIn the meantime: if you\u0026rsquo;re paying Opus rates and not compressing your context, you\u0026rsquo;re leaving real money on the table.\nHow are you managing token budget in your workflows? 
Curious what approaches people are finding outside of the obvious \u0026ldquo;use a cheaper model\u0026rdquo; answer.\n","date":"24 June 2026","externalUrl":null,"permalink":"/posts/ai/three-tools-token-budget/","section":"Posts","summary":"","title":"Four Open-Source Tools That Cut My Token Costs by 80%+","type":"posts"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/github-copilot/","section":"Tags","summary":"","title":"Github-Copilot","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/gitops/","section":"Tags","summary":"","title":"Gitops","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/golang/","section":"Tags","summary":"","title":"Golang","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/grpc/","section":"Tags","summary":"","title":"Grpc","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/headroom/","section":"Tags","summary":"","title":"Headroom","type":"tags"},{"content":"When I decided to run AI agents on my home k3s cluster, the first question was: how do agents find each other?\nIn a monolith, this is trivial. In a microservices architecture, you have Consul, Kubernetes Service DNS, or an API gateway. But agents are different from services — they register themselves, announce their capabilities, go offline, come back, and need to be discoverable by capability rather than just by name. I needed something purpose-built.\nMy first instinct was to adopt an existing framework. So I evaluated the two most plausible options: Hermes (which I\u0026rsquo;d design from scratch) vs OpenClaw (an open-source personal AI assistant platform with 379K GitHub stars and a gateway architecture for WhatsApp, Telegram, Slack, Discord, and Signal).\nThe ADR process # I\u0026rsquo;ve started using scored Architectural Decision Records for every significant choice in this project. 
The format is simple: define decision dimensions, score candidates against your actual requirements, then accept the result even when it\u0026rsquo;s uncomfortable. It forces you to be honest about what you actually need rather than what sounds cool.\nFor the agent framework decision, I scored seven dimensions:\nDimension Hermes OpenClaw D1 — Use Case Fit 2 2 D2 — Kubernetes-Native 3 1 D3 — Maintenance Burden 3 1 D4 — Inter-Agent Protocol (gRPC/REST) 3 1 D5 — Dynamic Discovery 3 1 D6 — Channel Integration (WhatsApp/Slack) 3 0 D7 — Community \u0026amp; Self-Sufficiency 3 1 TOTAL 20 7 The result was unambiguous. OpenClaw is excellent software — 379K stars means a lot of people find it valuable. But it solves a fundamentally different problem: human-to-agent interaction via messaging channels. My problem is machine-to-machine coordination within a Kubernetes cluster. Those are adjacent but distinct.\nOpenClaw\u0026rsquo;s architecture reflects its use case: Node.js runtime, WebSocket node pairing, static workspace routing via AGENTS.md config files, and no Helm chart (I\u0026rsquo;d have to write the k8s manifests myself). For a Discord-accessible personal assistant, that\u0026rsquo;s fine. For a cluster-internal registry where agents register themselves dynamically and other agents query live state with sub-500ms latency — it\u0026rsquo;s the wrong tool.\nWhy build from scratch # Building Hermes from scratch carries real cost: I own the code, I fix the bugs, I add the features. OpenClaw has 1,170+ contributors; Hermes has one.\nThe counterargument: Hermes is small. The core service is a registry with three operations — Register, Query, Watch. 
The complexity surface is low, the ownership burden is proportional, and I get exactly what I need with no adapter layer over a framework designed for a different use case.\nI made the call: build Hermes, own it fully, keep it small.\nWhat Hermes does # Hermes is a Go service with a dual interface: gRPC on port 50051 and an HTTP/REST gateway on port 8080 (via grpc-gateway). The proto contract defines three operations:\nservice AgentRegistry { rpc Register(RegisterRequest) returns (RegisterResponse); rpc Query(QueryRequest) returns (QueryResponse); rpc Watch(WatchRequest) returns (stream AgentEvent); } Register: An agent starts, calls Register with its name, capabilities, and endpoint. Hermes stores it. The registration carries a TTL — if the agent doesn\u0026rsquo;t heartbeat, Hermes evicts it.\nQuery: Another agent or operator calls Query with optional capability filters. Hermes returns all matching live registrations.\nWatch: A server-streaming RPC. Callers subscribe to agent lifecycle events (registered, evicted, updated) and receive them in real time as the registry changes.\nThe internal store is an in-memory map protected by a sync.RWMutex. A background sweeper goroutine runs on a configurable interval, evicting registrations past their TTL. 
Tests cover the store independently from the gRPC layer, and the watch tests verify that event streams receive the right events in order.\nThe implementation # apps/hermes/ ├── proto/hermes.proto # Source of truth ├── gen/hermesv1/ # Generated Go from buf ├── internal/ │ ├── registry/ │ │ ├── record.go # AgentRecord struct + TTL logic │ │ ├── store.go # Thread-safe in-memory registry │ │ ├── store_test.go # Store unit tests │ │ └── sweeper.go # Background TTL eviction goroutine │ └── server/ │ ├── server.go # gRPC + HTTP server setup │ ├── registry_handlers.go # Register/Query/Watch handlers │ ├── health.go # /healthz endpoint │ ├── metrics.go # Prometheus metrics │ ├── register_test.go # Integration tests for Register │ └── watch_test.go # Integration tests for Watch streaming └── main.go # Flag parsing, signal handling, server start main.go is 47 lines. It parses two flags (--grpc-addr, --http-addr), creates the store, starts the sweeper, starts gRPC and HTTP servers in goroutines, and blocks on context cancellation or an error from either server.\nDeploying on k3s # The deployment follows the same ArgoCD App-of-Apps pattern as every other service in the cluster:\n# k8s/bootstrap/hermes/deployment.yaml (abbreviated) containers: - name: hermes image: ghcr.io/gardlt/homelab/hermes:latest ports: - containerPort: 50051 # gRPC - containerPort: 8080 # REST gateway readinessProbe: httpGet: path: /healthz port: 8080 The service exposes both ports. The IngressRoute routes hermes.apexarcology.com to port 8080 for REST access. In-cluster agents call the gRPC endpoint directly at hermes.hermes.svc.cluster.local:50051, which stays inside the cluster mesh and doesn\u0026rsquo;t route through Cloudflare.\nWhat the ADR taught me # The most useful part of the scoring process wasn\u0026rsquo;t the number — it was being forced to articulate why each dimension mattered. 
Writing out \u0026ldquo;D5: agents need to discover each other dynamically via a live registry with sub-500ms query latency\u0026rdquo; meant I couldn\u0026rsquo;t rationalize OpenClaw\u0026rsquo;s static workspace routing as \u0026ldquo;good enough.\u0026rdquo; The requirement was specific. The gap was real.\nI\u0026rsquo;ve applied the same format to every major infrastructure decision since. The scores don\u0026rsquo;t make the decision — the requirements do. The scores just prevent motivated reasoning from overriding them.\nNext up: the five infrastructure ADRs that shaped the platform beneath the agents — storage, secrets, monitoring, memory, and DNS.\n","date":"24 June 2026","externalUrl":null,"permalink":"/posts/homelab/002-hermes-building-a-grpc-agent-registry/","section":"Posts","summary":"","title":"Hermes: Why I Built a gRPC Agent Registry Instead of Using an Existing Framework","type":"posts"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/hindsight/","section":"Tags","summary":"","title":"Hindsight","type":"tags"},{"content":" What changed # Over the past two weeks I rebuilt the home lab with a specific constraint: every service has to do something useful in daily life, not just demonstrate that it runs.\nThe infrastructure is the same — k3s cluster across heavyarms and the NUC fleet, ArgoCD for GitOps, Cloudflare Tunnel for access. What changed is what runs on top of it. I added dedicated ML inference instances running Ollama and stood up an n8n workflow engine. The goal was to see how fast I could go from \u0026ldquo;idea\u0026rdquo; to \u0026ldquo;running automation\u0026rdquo; once the platform was in place.\nAnswer: pretty fast.\nFirst project: AI personal trainer # I wanted to build something with stakes. My partner and I are working on fitness goals — weight, gym consistency, progressive overload — and staying accountable when life gets busy is hard. 
A tool that watches the data and sends context-aware messages felt like a good test case.\nThree workflows, each targeting a different failure mode.\nWeight progress motivation # Problem: The scale moves slowly. It\u0026rsquo;s easy to lose context on whether you\u0026rsquo;re trending the right direction.\nHow it works:\nGoogle Sheets holds the weekly weigh-ins n8n polls on a schedule, calculates deltas Ollama generates a personalized message based on the trend (up, down, or flat) Message lands in Telegram The AI piece isn\u0026rsquo;t just \u0026ldquo;you lost 0.3 kg, good job.\u0026rdquo; It reads the trend across multiple weeks and adjusts tone accordingly — more energetic when momentum is building, steadier when things plateau. The message changes based on context, not just the latest number.\nNext iteration: Pull directly from the Google Fit API instead of manual sheet entries.\nGym consistency tracking # Problem: Knowing you should go 4 days a week and actually going 4 days a week are different things.\nHow it works:\nWorkout log lives in Google Sheets (one row per session) n8n counts sessions per week against the 4-day target Hit the goal → celebratory message Miss the goal → supportive nudge, not guilt The tone calibration here was important. Shame-based reminders don\u0026rsquo;t work long-term. The LLM prompt is tuned to stay in coach mode, not scorekeeper mode.\nAdaptive workout programming # Problem: Static workout plans stop working the moment life deviates from the plan.\nHow it works:\nWeekly plan is generated based on previous week\u0026rsquo;s performance Two paths: Success path: Increase reps, weight, or exercise complexity Adjustment path: Build on incomplete workouts instead of advancing past them Ollama analyzes the log and generates next week\u0026rsquo;s plan accordingly This is the most LLM-heavy workflow. The model receives the full training history for the week and outputs a structured plan. 
Running locally on heavyarms means no token costs and no data leaving the network — workout history stays private.\nThe stack in practice # Google Sheets → n8n trigger → conditional logic → Ollama (local) → Telegram n8n handles the orchestration. Each workflow is a visual graph: triggers, conditions, HTTP calls to the Ollama endpoint, and output formatting before the Telegram send. The visual representation turns out to be useful not just for building but for debugging — you can step through a run and see exactly where data transforms.\nThe Ollama endpoint is internal only. n8n calls it over the cluster network. No egress, no API keys, no usage limits.\nWhy this matters beyond fitness # The fitness workflows are the first real test of a pattern I want to apply more broadly:\nEdge AI (Ollama) — inference runs locally, no external dependencies Container orchestration (k3s) — workloads are managed, restartable, scalable Workflow automation (n8n) — complex conditional logic without writing glue code Real-world data integration — Google Sheets today, more APIs next The infrastructure cost to add a new AI-powered workflow is now close to zero. The pattern is established. Plugging in a new data source and a new prompt takes an afternoon.\nWhat\u0026rsquo;s next # Expanding the same framework to other areas:\nHabit tracking — general-purpose streaks and accountability Photography — automated culling or tagging workflows Learning goals — spaced repetition nudges, reading summaries The platform is in place. The next projects are mostly about writing the right prompts and connecting the right data sources.\nBehind the scenes # The n8n workflows look exactly like you\u0026rsquo;d expect — nodes connected by edges, data flowing left to right. Triggers on the left, conditional branches in the middle, output nodes on the right. The Ollama call is a standard HTTP node hitting the local inference endpoint.\nFair warning: the naming conventions in the current workflows are rough. 
Built fast to validate the idea. Cleanup is on the list.\nThe interesting part isn\u0026rsquo;t the individual nodes — it\u0026rsquo;s that the entire decision loop (collect data, reason about it, send a message) is visual, auditable, and restartable. When something goes wrong, you can see it.\n","date":"24 June 2026","externalUrl":null,"permalink":"/posts/homelab/005-ai-personal-trainer-workflow/","section":"Posts","summary":"","title":"Home Lab Evolution: Building an AI-Powered Personal Trainer Workflow","type":"posts"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/homelab/","section":"Tags","summary":"","title":"Homelab","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/series/homelab-ai-platform/","section":"Series","summary":"","title":"Homelab-Ai-Platform","type":"series"},{"content":" The problem # Street photography lives and dies by location. You need foot traffic, interesting light, and enough density that something worth capturing will happen in the time you\u0026rsquo;re willing to stand around.\nWinter kills most of my usual spots. Outdoor markets thin out. Tourist corridors empty. The places that work in July become ghost towns by February. I\u0026rsquo;ve been solving this manually for years — walking different neighborhoods, checking event calendars, asking other photographers — but it doesn\u0026rsquo;t scale. I needed a better way to surface high-traffic locations before I leave the house.\nThis weekend I built one.\nWhat I built # A dynamic heatmap that overlays foot traffic data by day of week, proximity to landmarks, and density of local dining hubs. The idea: restaurants and cafés are a reliable proxy for pedestrian activity. Where people eat, people walk. Where people walk, there\u0026rsquo;s something to photograph.\nThe tool pulls location data, scores each area across those three dimensions, and renders it as a map layer I can filter by day. 
Planning a Saturday shoot is now a ten-second lookup instead of guesswork.\nHow Claude Code changed my process # I\u0026rsquo;ve been using GitHub Copilot for a while. It\u0026rsquo;s good at autocomplete, decent at boilerplate. What I wanted to test was whether the \u0026ldquo;agent\u0026rdquo; model — giving a tool a goal and letting it reason about the implementation — actually felt different.\nIt does.\nSpeed from idea to MVP. The agentic approach collapsed the gap between \u0026ldquo;I want a heatmap\u0026rdquo; and \u0026ldquo;here is a working heatmap.\u0026rdquo; Instead of speccing out the component structure, writing stubs, then filling them in, I described what I wanted and iterated on the output. The scaffolding phase that usually takes hours happened in minutes.\nContext matters enormously. I added a CLAUDE.md file with project-specific context — my preferred stack, how I wanted data structured, constraints on the map library. The accuracy improvement was immediate. The agent stopped making generic choices and started making choices that fit my actual setup. Custom context isn\u0026rsquo;t optional; it\u0026rsquo;s the difference between a useful agent and a fast one that needs constant correction.\nDeveloper intuition still required. The tool hit a wall on some JavaScript logic around the scoring aggregation. The output was wrong in a non-obvious way — the map rendered, but the weights weren\u0026rsquo;t combining correctly. I had to diagnose it, explain the issue, and guide it through the fix. What impressed me: it didn\u0026rsquo;t just patch the reported line. It understood the underlying logic error and corrected the approach. That\u0026rsquo;s a different kind of collaboration than autocomplete.\nWhere this lands # Still loyal to Copilot for day-to-day coding. It\u0026rsquo;s fast, it\u0026rsquo;s integrated, it knows my habits. But the gap is narrowing in ways I didn\u0026rsquo;t expect a year ago. 
Agentic coding — give it a goal, guide it through edge cases, iterate — is a real mode of development, not just a demo.\nFor side projects where I want to go from zero to something fast, it\u0026rsquo;s now my first move.\nThe heatmap works. Next shoot is Saturday.\nWhat\u0026rsquo;s the most useful thing you\u0026rsquo;ve built with an AI agent? I\u0026rsquo;m genuinely curious what the non-obvious use cases look like for other people.\n","date":"24 June 2026","externalUrl":null,"permalink":"/posts/misc/street-photography-heatmap-with-claude-code/","section":"Posts","summary":"","title":"I Used an AI Agent to Solve My Biggest Street Photography Problem","type":"posts"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/k3s/","section":"Tags","summary":"","title":"K3s","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/kubernetes/","section":"Tags","summary":"","title":"Kubernetes","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/llm/","section":"Tags","summary":"","title":"Llm","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/mcp/","section":"Tags","summary":"","title":"Mcp","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/n8n/","section":"Tags","summary":"","title":"N8n","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/ollama/","section":"Tags","summary":"","title":"Ollama","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/open-source/","section":"Tags","summary":"","title":"Open-Source","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/photography/","section":"Tags","summary":"","title":"Photography","type":"tags"},{"content":"","date":"24 June 
2026","externalUrl":null,"permalink":"/tags/ponytail/","section":"Tags","summary":"","title":"Ponytail","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/rtk/","section":"Tags","summary":"","title":"Rtk","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/self-hosted/","section":"Tags","summary":"","title":"Self-Hosted","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/side-project/","section":"Tags","summary":"","title":"Side-Project","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/software-engineering/","section":"Tags","summary":"","title":"Software-Engineering","type":"tags"},{"content":" The mental model that changed everything # Most developers use AI as a smarter autocomplete. You\u0026rsquo;re in a file, you need a function, you ask, you get code. That\u0026rsquo;s useful. It\u0026rsquo;s not transformative.\nThe shift that unlocked real productivity for me: stop thinking about AI as a tool you invoke and start thinking about it as a team member with a defined role. Once I did that, I stopped asking \u0026ldquo;what can the AI do?\u0026rdquo; and started asking \u0026ldquo;who do I need right now?\u0026rdquo;\nIn an Agile workflow, the answer is rarely the same person twice.\nThe roster # My current setup within GitHub Copilot maps each agent to a distinct phase of the development lifecycle:\nThe Orchestrator Triages incoming work. Takes a feature request, breaks it down, and delegates to the right agent for each sub-task. The entry point for anything new. 
Without this layer, the other agents operate in isolation and you lose the coordination benefit.\nThe Planner Owns the architecture. Given a task from the Orchestrator, it maps out the approach — data flow, component boundaries, dependency order, edge cases to consider before writing a line of code. This is the agent that saves you from painting yourself into a corner at 11pm.\nThe Implementor Writes the core logic. Focused, no architecture decisions, no documentation concerns. Just: here\u0026rsquo;s the spec, here\u0026rsquo;s the code. Keeping this role narrow matters — it\u0026rsquo;s easy to let an implementor drift into planning, which produces code that solves the immediate problem but ignores the broader context.\nThe Tester Adversarial by design. Its job is to find the ways the Implementor\u0026rsquo;s code breaks. Unit edge cases, integration failures, malformed inputs, race conditions. The Tester doesn\u0026rsquo;t care about shipping — it cares about what happens when the Implementor was wrong. This tension is the point.\nThe Documentor Ensures the knowledge base evolves with the code. Not retrofitted docs written after the fact — documentation that gets updated as part of the same cycle that produces the feature. The Documentor reads the Implementor\u0026rsquo;s output and the Planner\u0026rsquo;s rationale and synthesizes both into something a future developer (or future agent) can actually use.\nWhat holds them together # A copilot-instructions.md file is the connective tissue. It carries:\nProject standards (naming conventions, architectural patterns, test coverage expectations) Current context (what phase we\u0026rsquo;re in, what decisions have already been made) Role boundaries (what each agent is and isn\u0026rsquo;t responsible for) Without this file, every agent session starts cold. With it, each agent inherits the full context of the project without you having to re-explain it. The instructions file is the manager. 
Maintaining it is non-negotiable.\nWhat\u0026rsquo;s next: moving out of the IDE # The workflows above live in the VS Code sidebar right now. That\u0026rsquo;s functional but limited — it requires a human in the loop to hand work from agent to agent.\nThe next phase is GitHub Actions. The goal is a CI/CD flow where a single prompt — in a PR description, a comment, an issue — triggers the full multi-agent cycle. Orchestrator receives the input, delegates to Planner, Planner outputs a spec, Implementor writes code against it, Tester validates, Documentor updates the relevant files. The commit lands in the branch. The PR is updated.\nThe \u0026ldquo;chat\u0026rdquo; becomes the only interface needed to initiate a full development cycle. The backend handles the coordination.\nWe\u0026rsquo;re close to this being practical, not experimental. The pieces are available — GitHub Models, Copilot Extensions, Actions workflows, MCP servers. The integration work is real, but it\u0026rsquo;s integration work, not research.\nThe question worth sitting with # If agents can handle orchestration, planning, implementation, testing, and documentation — what does the developer role look like in two years?\nMy current answer: you become the person who defines the standards, maintains the context, and makes the judgment calls the agents can\u0026rsquo;t make. You move up the stack. The work doesn\u0026rsquo;t disappear — it changes altitude.\nThat\u0026rsquo;s a better job, not a smaller one. But it requires building the muscle now.\n","date":"24 June 2026","externalUrl":null,"permalink":"/posts/ai/agentic-team-copilot-workflows/","section":"Posts","summary":"","title":"Stop Using AI as a Tool. 
Start Using It as a Team.","type":"posts"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/street-photography/","section":"Tags","summary":"","title":"Street-Photography","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":" The problem with letting the LLM do everything # Agentic skill development is addictive. You describe a task, the agent handles it — reading raw configs, parsing outputs, formatting results, making decisions. It works. For one developer, on one task, the cost is invisible.\nThe problem surfaces at scale. Every token spent on deterministic logic is money that didn\u0026rsquo;t have to be spent. And those tokens compound.\nThe numbers # I ran a comparison between two approaches to the same skill:\nApproach Cost per turn Verbose (LLM handles everything) ~$0.185 CLI-Optimized (pre-processed data) ~$0.017 10x difference per turn.\nThe verbose version passes raw data to the model and lets it parse, filter, and format before responding. The CLI-optimized version pre-processes that data in code — deterministic transformations that don\u0026rsquo;t need language model reasoning — and only sends the model what it actually needs to reason about.\nWhat this looks like at enterprise scale # Model this across an organization with 40,000 developers:\nScenario Annual cost Verbose skills $1.48M CLI-optimized skills $136K Savings ~$1.34M Push that across multiple skill categories, higher usage rates, or a larger developer population and you reach $16M+ in annual savings. The math isn\u0026rsquo;t hypothetical — it follows directly from the per-turn delta multiplied by usage volume.\nThe CLI wrapper pays for itself in under 30 minutes of developer time at enterprise scale. That\u0026rsquo;s not a performance optimization. 
That\u0026rsquo;s a budget decision.\nThe principle # Deterministic work belongs in deterministic code.\nA CLI can parse a JSON config in milliseconds for zero tokens. An LLM can also parse a JSON config — for 200 tokens and a non-trivial latency penalty. Using the LLM for that work isn\u0026rsquo;t wrong, it\u0026rsquo;s just expensive in a way that\u0026rsquo;s easy to miss at the individual level and impossible to ignore at the organizational level.\nThe pattern for CLI-optimized skills:\nPre-process in code. Filter, transform, and format data before it reaches the model. Pass structured summaries, not raw outputs. Reserve the model for reasoning. Let the LLM do what only an LLM can do: handle ambiguity, generate language, make contextual judgments. Wrap with a CLI. A lightweight CLI layer between raw data and model input is the lowest-effort, highest-leverage optimization available. The broader implication # Token cost is the new compute cost. A decade ago, engineering organizations started tracking cloud spend per feature. The teams that built cost awareness into their infrastructure early had a structural advantage over teams that optimized reactively.\nThe same dynamic is playing out now with LLM usage. Organizations that build token-efficient skills from the start will carry a lower cost basis as usage scales. Organizations that don\u0026rsquo;t will optimize under pressure, which is always more expensive than optimizing by design.\nThe curiosity phase is fine. Let developers experiment, let skills get verbose, see what actually gets used. 
But once a skill is load-bearing — once it\u0026rsquo;s running on thousands of turns per day — the cost of not optimizing becomes real money.\nThe takeaway: before promoting a skill to production, ask whether any part of it is paying model rates for work a CLI could do for free.\n","date":"24 June 2026","externalUrl":null,"permalink":"/posts/ai/cli-optimized-skills-token-cost/","section":"Posts","summary":"","title":"The Cost of Curiosity: Why CLI-Optimized Skills Win at Enterprise Scale","type":"posts"},{"content":"By the time I had Hermes running and the infrastructure ADRs settled, I had a platform without agents. The registry could register, query, and stream events. The memory system could store and retrieve. The monitoring stack was collecting metrics. What I didn\u0026rsquo;t have was anyone to run on the platform.\nThat\u0026rsquo;s where the Night City Crew came in.\nThe premise # The idea was simple: a roster of specialized AI agents, each named after a character from Cyberpunk 2077, each with a clearly defined domain, a recommended model tier, a personality contract (SOUL), and a specific set of MCP tool integrations. Declare the whole thing in version control. Provision it in one operation.\nThe character names aren\u0026rsquo;t cosmetic. Each name was chosen to fit the role:\nAgent Role Why this character Alt Cunningham System Architecture The most powerful netrunner in Night City. She sees systems as others can\u0026rsquo;t. Dexter DeShawn Planning \u0026amp; Orchestration A fixer who coordinates jobs, resources, and people with cold precision. Judy Alvarez Frontend \u0026amp; UI/UX Braindance technician, artist, the person who makes things feel right. Adam Smasher Security \u0026amp; Testing Full cyborg, zero mercy, finds every weakness. Security and adversarial testing. Panam Palmer DevOps \u0026amp; Automation Nomad engineer who keeps convoys moving. Infrastructure and pipelines. 
Hanako Arasaka Research \u0026amp; Analysis Corporate intel and deep research, long-game strategic thinking. Misty Olszewski Travel \u0026amp; Logistics The wildcard. Falls outside the homelab\u0026rsquo;s core infrastructure domain — included but flagged for operator confirmation. The personality contract (SOUL) # Each agent carries a SOUL — a personality contract committed to the repository alongside the agent definition. The contract specifies:\nVoice: Opinionated, zero filler, brief, dry wit. Failure reporting: Radically honest. State problems plainly without sugarcoating. Position-taking: Every response commits to a position. No \u0026ldquo;it depends\u0026rdquo; without a recommendation. Filler prohibition: No \u0026ldquo;Great question\u0026rdquo;, \u0026ldquo;I\u0026rsquo;d be happy to help\u0026rdquo;, \u0026ldquo;Certainly\u0026rdquo;, or \u0026ldquo;As an AI\u0026hellip;\u0026rdquo;. Ever. The contract is testable: pose the same question to an agent with and without the contract loaded. The contracted response is shorter, takes a position, and omits openers. That\u0026rsquo;s the acceptance criterion, not a vibe check.\nThe SOUL approach comes from Hindsight\u0026rsquo;s per-agent memory bank model — each agent\u0026rsquo;s Mission, Directives, and Disposition are stored alongside its memories, shaping how it retrieves and reasons about prior context. The personality contract integrates with this: the agent doesn\u0026rsquo;t just speak a certain way, it remembers a certain way.\nMCP tool integrations # Each agent is wired to exactly the MCP servers it needs for its role. Least privilege applies to tools, not just secrets.\nAdam Smasher (security) gets access to infrastructure testing tools and cluster introspection. Hanako Arasaka (research) gets search and data retrieval. Panam Palmer (DevOps) gets cluster management and CI tooling. 
Alt Cunningham (architecture) gets access to the full system topology through Hermes and the monitoring stack.\nThe requirement is strict: an agent can only use the integrations declared for it. An integration requiring an external credential that doesn\u0026rsquo;t exist in Azure Key Vault surfaces a clear, actionable failure at provisioning time — it does not silently skip the integration and start the agent in a broken state.\nThe declarative model # The crew is defined in version control. The full definition for each agent includes:\n# agents/alt-cunningham.yaml (illustrative) name: alt-cunningham character: Alt Cunningham role: System Architecture model: claude-sonnet-4-6 # resolved from model policy at planning time soul: souls/alt-cunningham.md integrations: - hermes-registry # query the agent registry - victoria-metrics # read cluster metrics - k8s-read # inspect cluster state Provisioning the roster runs a single operation that creates all seven agents. It is idempotent — running it twice produces no duplicates. Individual agent failures are isolated and reported with the agent name, not swallowed into a generic error. The remaining agents still come up if one fails.\nOutstanding decisions before implementation # The spec for spec 008 carries two open questions that need operator confirmation before the implementation plan can be finalized:\nMisty Olszewski (scope): The travel and logistics role falls outside the homelab\u0026rsquo;s infrastructure domain. Does it belong in the crew, or does it get dropped as out-of-domain? My instinct is to include her but constrain the integrations to things that actually exist in the homelab — no booking APIs, no flight data services. She becomes the agent for external-world data coordination that touches the homelab (package tracking, calendar events, etc.), not a general travel assistant.\nExternal providers: Several roles would benefit from external cloud models or SaaS MCP integrations. 
The homelab has Azure Key Vault and can hold API keys for Claude, GPT-4o, and MCP SaaS providers. The question is whether to allow them or restrict the crew to locally available models. My current lean: allow external models where a credential exists in AKV, fail loudly where it doesn\u0026rsquo;t. This keeps the crew useful without pretending the homelab is an island.\nWhere this lands # The Night City Crew is the first concrete use case for everything built before it. Hermes registers each agent at startup. Hindsight stores their memories. VictoriaMetrics tracks their behavior. External Secrets Operator delivers their credentials. Cloudflare Tunnel exposes any human-facing interface. ArgoCD reconciles their definitions when they change.\nThe platform wasn\u0026rsquo;t theoretical. It was built for this.\nThe full spec is in specs/008-night-city-crew/spec.md. Implementation begins once the two outstanding clarifications are resolved — the roster is clear, the contracts are written, and the tool mappings are assigned. What\u0026rsquo;s left is running the provisioning workflow and watching seven agents come online in a cluster sitting under my desk in my living room.\nThis is part 4 of a 4-part series on building an AI agent platform on a home Kubernetes cluster. 
The full repository, ADR specs, Hermes source code, and k8s manifests are at github.com/gardlt/homelab.\n","date":"24 June 2026","externalUrl":null,"permalink":"/posts/homelab/004-the-night-city-crew/","section":"Posts","summary":"","title":"The Night City Crew: Designing a Roster of Specialized AI Agents","type":"posts"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/series/token-economics/","section":"Series","summary":"","title":"Token-Economics","type":"series"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/tokens/","section":"Tags","summary":"","title":"Tokens","type":"tags"},{"content":"","date":"24 June 2026","externalUrl":null,"permalink":"/tags/victoria-metrics/","section":"Tags","summary":"","title":"Victoria-Metrics","type":"tags"},{"content":"","date":"18 June 2026","externalUrl":null,"permalink":"/tags/compression/","section":"Tags","summary":"","title":"Compression","type":"tags"},{"content":"","date":"18 June 2026","externalUrl":null,"permalink":"/tags/economics/","section":"Tags","summary":"","title":"Economics","type":"tags"},{"content":"Series: The LLM Token Optimization Ecosystem Part 1: Meet the Tools · Part 2: How They Fit Together · Part 3: Does It Pay Off? (you\u0026rsquo;re here)\nPart 1 met the tools; Part 2 mapped how they fit. Now the question that actually drives adoption: does compression pay for itself? 
This part runs the economics across every major model, then closes with risks and recommendations.\nToken Economics Analysis # Cost Model (Claude Sonnet 4.6 Pricing) # Assuming ~$3/M input tokens, ~$15/M output tokens:\nTool Mechanism Session Savings (tokens) $/session saved $/month (5 sessions/day) rtk CLI output compression ~94k input tokens ~$0.28 ~$42 caveman Output compression (~65%) ~600 output tokens ~$0.009 ~$1.35 caveman-compress Context file compression (46%) ~400 input tokens/session ~$0.001 ~$0.18 headroom (SRE workload) Full context compression (92%) ~60k input tokens ~$0.18 ~$27 headroom KV cache alignment Cache hit improvement Model-dependent $0.03–0.15 $4.5–22 Stacking all tools: a developer using rtk + caveman + headroom could realistically save $60–100/month in API costs on a typical coding workflow. For a team of 10, that\u0026rsquo;s $600–1,000/month — a reasonable ROI trigger for enterprise deployment.\nThe Real Value: Speed and Context Length # The dollar savings are real but secondary. The primary value proposition is:\nSpeed: 65–80% fewer tokens = 2–5× faster responses. Time is worth more than API cost. Context longevity: Compressed context stays within the model\u0026rsquo;s window longer. A 200k-token context window effectively becomes 400–1,000k tokens equivalent with headroom compression. Agent coherence: Shorter context = less attention dilution = more focused reasoning. Monetization Paths # None of these repos are currently monetized directly. 
Potential paths:\nSaaS proxy (headroom is closest with ENTERPRISE.md) — charge per million tokens compressed Self-hosted enterprise — team dashboards, compliance, SAML Managed fine-tuning (cavegemma model direction) — charge for model distillation as a service Developer tooling subscription — token savings analytics, session recording, team-level dashboards Multi-Model Cost Comparison # Assumptions \u0026amp; Methodology # Session model: A 30-minute agentic coding session using a typical TypeScript/Rust project.\nMetric Without compression With full stack (rtk + headroom + caveman) Input tokens 118,000 ~12,000 (~90% reduction) Output tokens 15,000 ~5,250 (~65% reduction) Net input savings — rtk: –80% on CLI outputs; headroom: –50% on remaining reads/RAG Net output savings — caveman: –65% average across query types Work schedule: 5 sessions/day × 22 working days = 110 sessions/month (individual developer). Team calculation: 10 developers × individual monthly cost. Batch discount: Anthropic and OpenAI both offer ~50% off for batch/async processing. Prompt cache discount: ~90% off cached input tokens (applies when prefixes are stable — KV cache alignment via headroom maximizes this).\nPer-Session and Monthly Cost by Model # Anthropic — Claude Family # Model Input $/M Output $/M Cost/session (raw) Cost/session (compressed) Monthly — 1 dev (raw) Monthly — 1 dev (compressed) Monthly savings Claude Opus 4.8 $5.00 $25.00 $0.97 $0.19 $106.70 $20.90 $85.80 Claude Sonnet 4.6 $3.00 $15.00 $0.58 $0.11 $63.80 $12.10 $51.70 Claude Haiku 4.5 $1.00 $5.00 $0.19 $0.04 $20.90 $4.40 $16.50 Opus 4.8 Fast Mode doubles cost ($10/$50); compressed fast-mode session ≈ $0.38, monthly ≈ $41.80. 
Batch API (50% discount): Opus 4.8 compressed drops to ~$0.095/session / $10.45/month.\nOpenAI — GPT \u0026amp; o-series # Model Input $/M Output $/M Cost/session (raw) Cost/session (compressed) Monthly — 1 dev (raw) Monthly — 1 dev (compressed) Monthly savings GPT-5.5 $5.00 $30.00 $1.04 $0.22 $114.40 $24.20 $90.20 GPT-5.5 Pro $30.00 $180.00 $6.24 $1.31 $686.40 $144.10 $542.30 GPT-4o $2.50 $10.00 $0.45 $0.08 $49.50 $8.80 $40.70 o3 $2.00 $8.00 $0.36 $0.07 $39.60 $7.70 $31.90 o4-mini $0.55 $2.20 $0.098 $0.018 $10.78 $1.98 $8.80 ⚠️ o-series reasoning tokens: o3 and o4-mini generate hidden reasoning tokens billed as output. A response showing 500 output tokens may consume 3,000+ actual billed tokens. Effective output cost can be 3–6× the listed rate for complex reasoning tasks. Compression reduces the input cost but cannot reduce reasoning tokens — caveman still reduces final response output tokens.\nGPT-5.5 Batch/Flex: 50% discount → compressed session ≈ $0.11/session.\nGoogle — Gemini Family # Model Input $/M Output $/M Cost/session (raw) Cost/session (compressed) Monthly — 1 dev (raw) Monthly — 1 dev (compressed) Monthly savings Gemini 2.5 Pro $1.25 $10.00 $0.30 $0.068 $33.00 $7.48 $25.52 Gemini 2.5 Flash $0.30 $2.50 $0.073 $0.017 $8.03 $1.87 $6.16 Gemini 2.5 Flash-Lite $0.10 $0.40 $0.018 $0.0033 $1.98 $0.36 $1.62 Gemini 2.5 Pro uses tiered pricing: prompts \u0026gt;200k tokens step up significantly. 
Headroom\u0026rsquo;s compression is particularly valuable here — keeping prompts under the 200k threshold avoids the tier-up surcharge entirely.\nOpen-Weight \u0026amp; Alternative Providers # Model Input $/M Output $/M Cost/session (raw) Cost/session (compressed) Monthly — 1 dev (raw) Monthly — 1 dev (compressed) Monthly savings Mistral Large 3 $0.50 $1.50 $0.082 $0.014 $9.02 $1.54 $7.48 Llama 4 Maverick (hosted) $0.15 $0.60 $0.027 $0.0050 $2.97 $0.55 $2.42 Llama 4 Scout (hosted) $0.08 $0.30 $0.014 $0.0026 $1.54 $0.29 $1.25 DeepSeek V3.2 $0.14 $0.28 $0.021 $0.0032 $2.31 $0.35 $1.96 DeepSeek R1 (reasoning) $3.00 $7.00 $0.459 $0.073 $50.49 $8.03 $42.46 Llama 4 pricing varies by host (Together.ai, Fireworks, Groq, DeepInfra). Scout at $0.08 input is among the cheapest hosted frontier-class models available. DeepSeek V3.2 offers Claude Haiku-class capability at ~1/10th the cost — compression ROI is lower in absolute terms but proportionally the same (~83% cost reduction with the full stack).\nSummary: Full Compression Stack ROI by Model Tier # Tier Model Raw monthly (1 dev) Compressed monthly $ Saved/mo % Saved Team of 10 savings Frontier-expensive GPT-5.5 Pro $686 $144 $542 79% $5,420/mo Frontier-expensive Claude Opus 4.8 $107 $21 $86 80% $860/mo Frontier-expensive GPT-5.5 $114 $24 $90 79% $900/mo Mid-tier DeepSeek R1 $50 $8 $42 84% $420/mo Mid-tier Claude Sonnet 4.6 $64 $12 $52 81% $520/mo Mid-tier GPT-4o $50 $9 $41 82% $410/mo Mid-tier o3 $40 $8 $32 80% $320/mo Mid-tier Gemini 2.5 Pro $33 $7.5 $25.5 77% $255/mo Budget Claude Haiku 4.5 $21 $4.4 $16.5 79% $165/mo Budget o4-mini $11 $2 $9 82% $90/mo Budget Gemini 2.5 Flash $8 $1.9 $6.1 77% $61/mo Budget Mistral Large 3 $9 $1.5 $7.5 83% $75/mo Ultra-cheap Llama 4 Maverick $3 $0.55 $2.4 81% $24/mo Ultra-cheap DeepSeek V3.2 $2.3 $0.35 $2 85% $20/mo Ultra-cheap Gemini 2.5 Flash-Lite $2 $0.36 $1.6 82% $16/mo Key insight: The percentage savings is nearly constant (~80%) across all models because the compression ratios (90% 
input, 65% output) are model-agnostic. The absolute dollar savings, however, scale linearly with model price — making compression tools most valuable with expensive frontier models.\nThe Diminishing Marginal ROI Problem # For ultra-cheap models (DeepSeek V3.2, Llama 4 Scout, Gemini Flash-Lite), the absolute savings from compression are $1–3/month per developer — low enough that the integration overhead of headroom/rtk may not be worth it. The break-even math:\nheadroom setup time: ~30–60 min for initial integration rtk setup: ~5 min caveman setup: ~30 sec For a developer using Gemini Flash-Lite, the $1.62/month savings on token costs doesn\u0026rsquo;t justify the headroom setup. For the same developer using Claude Opus 4.8 or GPT-5.5 Pro, the $86–542/month savings pay back the same setup within days to weeks, depending on what an hour of developer time costs.\nConclusion: Compression tools have the highest ROI when used with premium frontier models. As models get cheaper, the economic case for compression weakens, but the speed and context-length benefits remain constant.\nBatch Processing Multiplier # For workflows that tolerate latency (test runs, automated reviews, nightly analysis), batch API pricing compounds compression savings:\nModel Compressed + Batch cost/session Compressed standard Additional batch savings Claude Opus 4.8 $0.095 $0.191 -50% Claude Sonnet 4.6 $0.057 $0.114 -50% GPT-5.5 $0.110 $0.218 -50% GPT-4o $0.041 $0.083 -50% Combining compression + batch + prompt caching can reduce Claude Opus 4.8 per-session cost from $0.965 to under $0.05 — a 95%+ total cost reduction versus baseline.\nPrompt Cache Alignment Bonus (headroom-specific) # Headroom\u0026rsquo;s CacheAligner stabilizes prompt prefixes to maximize provider KV cache hit rates. 
With Anthropic prompt caching:\nCached input tokens: 90% discount (effectively $0.30/M for Sonnet 4.6 at its ~$3/M list price, $0.10/M for Haiku 4.5) Cache write tokens: 25% premium (one-time cost, amortized over re-reads) For a session where 80% of input tokens are repeated context (system prompts, CLAUDE.md, codebase context), the effective input cost drops to roughly 28% of list price (20% billed at full rate plus 80% billed at one-tenth), before the one-time cache-write premium. This benefit is orthogonal to compression and stacks multiplicatively — the combination of CacheAligner + compression can reduce the effective input token rate to near-zero for stable context.\nKey Observations and Risks # The compression arms race: As LLMs improve at long-context tasks, the value of compression decreases at the margins. However, token pricing creates a persistent economic incentive regardless of model capability improvements.\nBenchmark credibility: All three repos self-report benchmarks. rtk\u0026rsquo;s are the most conservative and granular; caveman\u0026rsquo;s cite an independent paper; headroom\u0026rsquo;s cover accuracy preservation most thoroughly. None have been independently reproduced.\nIntegration fragility: Hook-based approaches (rtk, lean-ctx) depend on agent hook APIs that can change with agent updates. Claude Code\u0026rsquo;s hook architecture has been stable but this is a dependency risk.\nPrivacy surface: Headroom proxies all LLM traffic locally, but \u0026ldquo;locally\u0026rdquo; means on the developer\u0026rsquo;s machine — fine for individual use, requires careful audit for enterprise deployment (data handling policies, compliance, secrets exposure in logs).\nThe lean-ctx wildcard: With only 4 stars but a well-differentiated technical approach (MCP server + file caching + TDD mode + web dashboard), lean-ctx could grow quickly if it executes on its roadmap. 
The Token Dense Dialect (mathematical symbols for code constructs) is a genuinely novel approach that RTK and headroom haven\u0026rsquo;t matched.\nRecommendations # For individual developers: Start with rtk (biggest immediate ROI, zero config) + caveman (simple output compression). Add headroom if you run automated agent workflows with heavy RAG or log analysis.\nFor teams: Headroom\u0026rsquo;s enterprise offering + rtk integration is the right architecture. The headroom learn feedback loop is the most defensible long-term value proposition.\nFor investors/observers: The most interesting bet is whether headroom\u0026rsquo;s ML-first approach (Kompress-v2-base) creates a durable moat, or whether rule-based tools like rtk are \u0026ldquo;good enough.\u0026rdquo; Given that rtk\u0026rsquo;s 63.1k stars came faster than headroom\u0026rsquo;s, and rtk has a more active PR/issue volume, market pull appears stronger for simple, transparent tools.\nWatch: lean-ctx — technically sound, first-mover on MCP + file caching combo, very early. Also watch cavegemma: if fine-tuned compression becomes good enough, it eliminates the need for all three runtime tools.\nSeries recap: Part 1 — Meet the Tools · Part 2 — How They Fit Together · Part 3 (this post).\nSources # juliusbrussee/caveman chopratejas/headroom rtk-ai/rtk yvgude/lean-ctx Medium: The Ultimate Token-Saving Stack DEV.to: Headroom writeup arxiv: Brevity Constraints paper (2604.00025) ","date":"18 June 2026","externalUrl":null,"permalink":"/posts/ai/tokenomics/llm-token-tools-part-3-does-it-pay-off/","section":"Posts","summary":"","title":"The LLM Token Optimization Ecosystem — Part 3: Does It Pay Off?","type":"posts"},{"content":"Series: The LLM Token Optimization Ecosystem Part 1: Meet the Tools · Part 2: How They Fit Together (you\u0026rsquo;re here) · Part 3: Does It Pay Off?\nPart 1 profiled rtk, caveman, and headroom individually. 
This part zooms out: where each one sits in the landscape, how they compose into a single stack, and — the dimension most comparisons miss — where they actually deploy.\nCompetitive Landscape Map # Direct Competitors # Project Stars Scope Deploy Local Reversible Lang rtk 63.1k CLI output Hook ✅ ❌ Rust caveman 62.1k LLM output Skill/plugin ✅ ❌ JS/Python headroom 29.9k All context Proxy/library/MCP ✅ ✅ Python/Rust lean-ctx 4 CLI + file reads Hook + MCP ✅ ❌ Rust Compresr.ai N/A Text Hosted API ❌ ❌ — Token Co. N/A Text Hosted API ❌ ❌ — OpenAI Compaction N/A Conversation history Provider-native ❌ ❌ — Related / Ecosystem Repos # From caveman\u0026rsquo;s own ecosystem:\nJuliusBrussee/cavemem — cross-agent memory via SQLite + MCP, session-compressed storage JuliusBrussee/cavekit — Claude Code plugin: NL → blueprint → parallel build plan → working software; per-task token budgets; automated iteration JuliusBrussee/finetune-caveman (cavegemma) — Gemma 4 31B fine-tuned on caveman prompt-completion pairs; bakes compression into model weights Adjacent tools:\nyvgude/lean-ctx — Hybrid Shell Hook + MCP Server; 89–99% claimed savings; introduces Token Dense Dialect (mathematical symbols: λ for functions, § for classes, ∂ for interfaces); 6-mode file reading with MD5 session cache; Shannon entropy analysis; web dashboard at localhost:3333. Only 4 stars currently — very early stage, but technically differentiated by MCP server + file caching. 
Differentiation Deep-Dive # Attack Surface Comparison # ┌─────────────────────────────────────────────────────────────┐ │ Token Bill Anatomy (typical agentic coding session) │ │ │ │ Input tokens breakdown: │ │ ├── Tool outputs (CLI, git, tests) ← rtk / lean-ctx │ │ ├── File reads ← headroom / lean-ctx │ │ ├── RAG / search results ← headroom │ │ ├── Conversation history ← headroom / OpenAI │ │ └── System prompt / context files ← caveman-compress │ │ │ │ Output tokens: ← caveman │ └─────────────────────────────────────────────────────────────┘ rtk has the biggest real-world impact per dollar of implementation effort because CLI outputs (git, tests, grep) are the most token-wasteful part of a coding session. headroom covers more surface area but requires more setup. caveman is the only tool addressing output tokens.\nTechnical Design Philosophies # rtk: Systems-first. Zero runtime dependencies. Predictable, rule-based compression. Fast path: \u0026lt;10ms overhead. Trust is easy — you can read the Rust source and understand exactly what gets stripped. No ML, no black boxes.\ncaveman: Prompt-engineering-first. Zero infrastructure required. The compression algorithm runs in the model\u0026rsquo;s reasoning — which means it\u0026rsquo;s flexible but also vulnerable to prompt drift or model updates that change response style.\nheadroom: ML-first. Trains proprietary models on agentic data. 
This creates a compounding advantage as the model improves, but also introduces latency, dependency complexity, and a trust surface (you\u0026rsquo;re proxying all traffic through headroom\u0026rsquo;s pipeline).\nThe Integration Stack # These tools are designed to compose:\n[caveman] → compress what the agent says [rtk/lean-ctx] → compress CLI tool outputs [headroom] → compress everything else (wraps rtk, adds file/RAG/log compression) [cavekit + cavemem] → orchestrate agents with token budgets + persistent memory [cavegemma] → bake caveman-style output compression into model weights Headroom\u0026rsquo;s explicit decision to bundle rtk (and support lean-ctx via env var) signals that the winning architecture isn\u0026rsquo;t one tool — it\u0026rsquo;s a composable layer.\nDeployment Model: Local-Only vs Enterprise Gateway # A dimension that matters enormously for teams but is easy to miss: where does the tool actually run? Most of this ecosystem is strictly per-developer. Only one tool has a credible path to central deployment.\nLocal-Only Tools # rtk — strictly local. Single binary, runs on each developer\u0026rsquo;s machine via shell hook. No server mode, no central deployment path. Each developer installs and manages their own instance. The closest it gets to \u0026ldquo;shared\u0026rdquo; is the openclaw plugin integration, but even that is per-machine. caveman — local by design. It\u0026rsquo;s a prompt instruction injected into each agent\u0026rsquo;s config directory. There\u0026rsquo;s no server component whatsoever — it\u0026rsquo;s essentially a SKILL.md file. Deployment means distributing the file to each developer\u0026rsquo;s ~/.claude or equivalent. No gateway, no central control. lean-ctx — local only. Shell hook + per-machine MCP server (running on localhost:3333). No multi-user mode, no centralized deployment documented. Very early stage. cavemem / cavekit / cavegemma — all local. 
SQLite-backed memory, local agent orchestration, local fine-tuned model weights. No enterprise architecture. Can Function as Enterprise Gateway # headroom is the only one of the three that has a credible enterprise gateway path. It has multiple deployment modes that go beyond local:\nheadroom proxy --port 8787 — runs as a drop-in HTTP proxy. Any team member (or CI system) routes their LLM API traffic through it. The compression happens server-side before forwarding to Anthropic/OpenAI/Bedrock. Docker image — ghcr.io/chopratejas/headroom:latest — deployable on any container infra (ECS, GKE, Kubernetes). This is the gateway deployment path. ASGI middleware — app.add_middleware(CompressionMiddleware) — embeddable in a FastAPI/Starlette service, meaning the gateway can be part of an internal API layer that all developers hit. SharedContext — compressed context passing across multi-agent workflows, implying team-level shared state. ENTERPRISE.md — they explicitly document an enterprise offering, though the details aren\u0026rsquo;t public without contacting them. The architecture allows an enterprise to deploy one headroom instance, point all developer API keys through it, and apply compression + KV cache alignment + cross-agent memory at the team level — without any per-developer installation.\nCloud / Hosted (Not Local) # Compresr.ai and The Token Company — hosted APIs. You send your text to their endpoint, they compress it, you get tokens back. Zero local install, but your data leaves your environment. OpenAI Compaction — provider-native, runs on OpenAI\u0026rsquo;s infrastructure. No control or visibility, only covers conversation history. 
Deployment Summary # Tool Local only Self-hosted gateway Hosted/cloud rtk ✅ — — caveman ✅ — — lean-ctx ✅ — — cavemem / cavekit ✅ — — headroom ✅ ✅ — Compresr.ai — — ✅ Token Company — — ✅ OpenAI Compaction — — ✅ (provider) Bottom line: If you need central IT control, compliance visibility, or team-level deployment, headroom is the only open-source option in this space. rtk and caveman are fundamentally per-developer tools — you can distribute their configs via dotfiles or onboarding scripts, but there\u0026rsquo;s no central enforcement or observability layer. The gap between \u0026ldquo;headroom the local tool\u0026rdquo; and \u0026ldquo;headroom the enterprise gateway\u0026rdquo; is real and appears to be their primary monetization lever given the ENTERPRISE.md file.\nNext: Part 3 — Does It Pay Off? runs the token economics: per-model cost tables, ROI by tier, and where compression stops being worth it.\nSources # rtk-ai/rtk JuliusBrussee/caveman chopratejas/headroom yvgude/lean-ctx JuliusBrussee/cavekit ","date":"18 June 2026","externalUrl":null,"permalink":"/posts/ai/tokenomics/llm-token-tools-part-2-how-they-fit/","section":"Posts","summary":"","title":"The LLM Token Optimization Ecosystem — Part 2: How They Fit Together","type":"posts"},{"content":"Series: The LLM Token Optimization Ecosystem Part 1: Meet the Tools (you\u0026rsquo;re here) · Part 2: How They Fit Together · Part 3: Does It Pay Off?\nA distinct open-source category has emerged around reducing the token cost of AI coding agents. These tools do not make models smarter — they make the surface area of every interaction smaller. 
Three high-traction repos dominate this space, each attacking a different slice of the token bill:\nrtk (63.1k ⭐) — compresses CLI command outputs before they hit the LLM caveman (62.1k ⭐) — compresses LLM responses by making the agent speak in terse fragments headroom (29.9k ⭐) — compresses everything the LLM reads (tool outputs, RAG, logs, files) Together they address both sides of the API invoice: input tokens (rtk/headroom) and output tokens (caveman). The design philosophies are fundamentally different, and the competitive dynamics are surprisingly cooperative — headroom explicitly bundles the rtk binary; both point to lean-ctx as an alt.\nThis part profiles all three. Part 2 covers how they compose and where they deploy; Part 3 runs the cost math.\n1. rtk — CLI Output Proxy # Repo: rtk-ai/rtk Stars: 63.1k · Forks: 3.9k · Issues: 657 open · PRs: 576 open Releases: 211 (latest v0.42.4, Jun 12 2026) Language: Rust 92.9%, Shell 4.8%, TypeScript 1.5% License: Apache 2.0 Team: Patrick Szymkowiak (founder), Florian Bruniaux, Adrien Eppling\nWhat It Does # rtk sits as a transparent proxy between the AI agent\u0026rsquo;s Bash tool and the shell. When an agent calls git status, a PreToolUse hook intercepts it, rewrites it to rtk git status, runs the command, and returns compressed output. The agent never knows compression happened.\nFour compression strategies are applied per command type:\nSmart Filtering — strips comments, whitespace, boilerplate Grouping — aggregates similar items (files by directory, errors by type) Truncation — drops redundancy, keeps signal Deduplication — collapses repeated log lines with counts A tee recovery mechanism saves the full unfiltered output on failure, so the LLM can read it without re-running the command.\nTechnical Architecture # Single Rust binary, zero runtime dependencies, \u0026lt;10ms overhead. 
Supports 100+ commands across files, git, GitHub CLI, test runners (Jest, Cargo, pytest, Go test, RSpec), build tools (ESLint, tsc, Next.js, Rust), package managers, AWS CLI, Docker, and Kubernetes.\nAgent integration is hook-based: rtk init -g writes a PreToolUse hook to the agent\u0026rsquo;s config. Fourteen agents are supported including Claude Code, GitHub Copilot (VS Code + CLI), Cursor, Gemini CLI, Codex, Windsurf, Cline/Roo Code, OpenCode, OpenClaw, Pi, Hermes, Kilo Code, and Google Antigravity.\nCritical limitation: The hook only intercepts Bash tool calls. Claude Code\u0026rsquo;s native Read, Grep, and Glob tools bypass the hook entirely — so file reads don\u0026rsquo;t compress through rtk.\nBenchmarks (Claimed) # Operation Standard rtk Savings git status (10×/session) 3,000 600 -80% cargo test (5×/session) 25,000 2,500 -90% cat/read (20×/session) 40,000 12,000 -70% 30-min session total ~118,000 ~23,900 -80% Token Economics Model # rtk\u0026rsquo;s model is simple: reduce input tokens by compressing tool outputs. Given Claude Sonnet 4.6 at ~$3/M input tokens, an 80% reduction on 118k input tokens per session = ~$0.28 saved per session, or ~$31/month for a developer running 5 sessions/day (about 110 sessions/month). 
The rtk gain command tracks this with USD estimates and supports JSON export for dashboards.\nStrengths # Highest stars in the space; most agent integrations (14) Single Rust binary, zero deps — trivial deployment Telemetry is opt-in only, GDPR-compliant 211 releases reflect fast iteration and active maintenance Multilingual README (7 languages) signals global adoption Weaknesses / Risks # 657 open issues and 576 open PRs suggest the team is stretched thin Hook-only architecture misses Read/Grep/Glob tool calls — leaves significant savings on the table No file caching, no MCP server, no reversible compression Windows support limited to CLAUDE.md injection mode (no hook) Claimed savings are estimates based on \u0026ldquo;medium TypeScript/Rust projects\u0026rdquo;; real savings vary 2. caveman — Output Token Compressor # Repo: JuliusBrussee/caveman Stars: 62.1k · Forks: 3.5k · Issues: 75 open · PRs: 133 open Releases: 14 (latest v1.8.2, May 12 2026) Language: JavaScript 62.9%, Python 27.6%, PowerShell 4.9%, Shell 4.6% License: MIT Author: Julius Brussee (solo maintainer)\nWhat It Does # Caveman is a skill/plugin — a prompt-injected instruction set — that makes AI agents respond in compressed, caveman-style fragments. The agent still reasons at full depth but strips filler words, pleasantries, and redundant phrasing from its output.\nKey distinction: caveman only affects output tokens. Thinking/reasoning tokens are untouched. Input tokens are unchanged unless you also use caveman-compress to rewrite CLAUDE.md files.\nNormal: \u0026#34;The reason your React component is re-rendering is because you\u0026#39;re creating a new object reference on each render cycle...\u0026#34; = 69 tokens Caveman: \u0026#34;New object ref each render. Inline object prop = new ref = re-render. Wrap in `useMemo`.\u0026#34; = 19 tokens Technical Architecture # Architecture is deliberately lightweight: a SKILL.md file injected into the agent\u0026rsquo;s skills directory. 
A Claude Code session hook auto-activates caveman on every new session. The mechanism is entirely prompt-engineering — no code transformation, no proxy, no binary.\nFour compression levels: lite (drop filler), full (default caveman), ultra (telegraphic), wenyan (classical Chinese notation, maximally dense).\nSub-skills included:\n/caveman-commit — conventional commit messages ≤50 chars /caveman-review — one-line PR comments /caveman-stats — real session token usage + lifetime savings + USD /caveman-compress — rewrites CLAUDE.md/memory files into caveman-speak (~46% input token reduction on context files) caveman-shrink — MCP middleware that wraps any MCP server and compresses tool descriptions The repo structure covers 30+ agent targets: .claude-plugin, .agents, .codex, .kiro, .roo, .junie directories, plus plugin formats for OpenClaw, Gemini extensions, and others.\nBenchmarks (Claimed) # Average 65% output token reduction across 10 diverse prompts (range 22–87%). A March 2026 paper \u0026ldquo;Brevity Constraints Reverse Performance Hierarchies\u0026rdquo; found that constraining models to brief responses improved accuracy by 26 points on certain benchmarks — caveman cites this as independent validation.\nTask Normal Caveman Saved React re-render bug 1,180 159 87% Fix auth middleware 704 121 83% PostgreSQL pool 2,347 380 84% Git rebase vs merge 702 292 58% Architecture decision 446 310 30% Average 1,214 294 65% caveman-compress on memory/context files averages 46% reduction, which reduces input token cost on every subsequent session.\nToken Economics Model # Output token savings are smaller in absolute dollar terms for most models (output tokens ~5× more expensive per token than input for Sonnet 4.6, but there are far fewer of them). The bigger ROI is speed: 65% fewer output tokens means ~3× faster responses, which compounds into developer productivity. 
For API users paying output token costs, caveman\u0026rsquo;s savings are real but secondary to the UX improvement.\nThe ecosystem play is more interesting: cavekit adds per-task token budgets, cavemem compresses cross-session memory, and cavegemma fine-tunes Gemma 4 31B on caveman-style pairs to bake compression into model weights (removing the per-session prompt overhead entirely).\nStrengths # Extremely easy to install (one curl command, works in 30 seconds) No external dependencies, no binaries, no proxies Proven traction: 62.1k stars despite being a solo-maintained project Ecosystem breadth: caveman → cavemem → cavekit → cavegemma forms a coherent stack Accuracy claim supported by third-party research Weaknesses / Risks # Solo maintainer — bus factor of 1; 133 open PRs suggests review bottleneck Prompt-only mechanism — can be overridden by complex prompts or model behavior changes Output-only savings; doesn\u0026rsquo;t address the growing input token cost problem No reversibility or guaranteed format — compressed responses can be harder to parse programmatically 14 releases vs. rtk\u0026rsquo;s 211 suggests slower iteration cadence 3. headroom — Full Context Compression Layer # Repo: chopratejas/headroom Stars: 29.9k · Forks: 2k · Issues: 177 open · PRs: 90 open Releases: 155 (latest v0.25.0, Jun 12 2026) Language: Python 78.1%, Rust 17.2%, TypeScript 2.5% License: Apache 2.0\nWhat It Does # Headroom is the most technically ambitious of the three. 
Where rtk compresses CLI outputs and caveman compresses responses, headroom compresses everything the LLM reads — tool outputs, logs, RAG chunks, files, conversation history — via a pipeline that sits between the application and the LLM provider.\nYour app/agent │ prompts · tool outputs · logs · RAG · files ▼ ┌──────────────────────────────────────────────────┐ │ Headroom (local) │ │ CacheAligner → ContentRouter → CCR │ │ ├─ SmartCrusher (JSON) │ │ ├─ CodeCompressor (AST) │ │ └─ Kompress-base (text, HF) │ └──────────────────────────────────────────────────┘ │ compressed prompt + retrieval tool ▼ LLM provider (Anthropic · OpenAI · Bedrock · …) Technical Architecture # Six compression algorithms:\nSmartCrusher — universal JSON compression (arrays, nested objects) CodeCompressor — AST-aware compression for Python, JS, Go, Rust, Java, C++ Kompress-base — proprietary HuggingFace model trained on agentic traces CacheAligner — stabilizes prompt prefixes to maximize KV cache hit rates IntelligentContext — score-based context fitting with learned importance CCR (Compressed Context with Retrieval) — reversible compression; originals cached locally and retrievable on demand Deployment modes:\nPython library: compress(messages, model=...) TypeScript SDK: await compress(messages, { model }) Drop-in proxy: headroom proxy --port 8787 Agent wrap: headroom wrap claude|codex|cursor|aider MCP server: headroom_compress, headroom_retrieve, headroom_stats Framework integrations: Anthropic SDK, OpenAI SDK, Vercel AI SDK, LiteLLM, LangChain, Agno, Strands, ASGI middleware, multi-agent SharedContext.\nheadroom learn mines failed sessions and writes corrections back to CLAUDE.md/AGENTS.md/GEMINI.md — a feedback loop that improves agent behavior over time.\nHeadroom explicitly ships the rtk binary and uses it for shell-output rewriting, then compresses everything downstream. 
It also supports lean-ctx via HEADROOM_CONTEXT_TOOL=lean-ctx.\nBenchmarks (Claimed) # Real agent workloads:\nWorkload Before After Savings Code search (100 results) 17,765 1,408 92% SRE incident debugging 65,694 5,118 92% GitHub issue triage 54,174 14,761 73% Codebase exploration 78,502 41,254 47% Accuracy benchmarks (N=100 each):\nBenchmark Category Baseline Headroom Delta GSM8K Math 0.870 0.870 ±0.000 TruthfulQA Factual 0.530 0.560 +0.030 SQuAD v2 QA — 97% 19% compression BFCL Tool use — 97% 32% compression Reversibility via CCR means the LLM can retrieve originals when needed — addressing the accuracy risk of lossy compression.\nToken Economics Model # Headroom attacks the largest input token costs: logs, RAG chunks, and tool outputs are often 10–100× larger than necessary. A 92% reduction on a 65k-token SRE debug session saves ~$0.18 per incident in API costs at Sonnet 4.6 pricing. For teams running many automated agent workflows this compounds quickly.\nThe KV cache alignment via CacheAligner is an underrated feature: stable prompt prefixes allow Anthropic/OpenAI\u0026rsquo;s prompt caching to actually work, creating a compounding effect (cached tokens are ~90% cheaper). 
This benefit is orthogonal to compression and comes for free.\nThe headroom learn loop is a long-term moat: agent sessions generate proprietary training signal that improves compression quality over time.\nStrengths # Most technically comprehensive solution in the space Reversible compression (CCR) — strong accuracy guarantee Proprietary ML model (Kompress-v2-base) trained on agentic traces KV cache optimization is orthogonal value with zero additional cost Framework agnostic; 10+ SDK integrations Enterprise offering with ENTERPRISE.md 1,614 commits and 155 releases — highest development velocity of the three Weaknesses / Risks # Most complex to install and configure (Python 3.10+ required, Rust build for some components) ML model adds latency (~50–200ms depending on hardware) 177 open issues — support surface is wide SSL inspection environments have a painful setup path (documented but non-trivial) Requires trust: headroom proxies all LLM traffic; enterprise buyers will scrutinize data handling Python-first limits use in non-Python stacks despite TypeScript SDK Next: Part 2 — How They Fit Together maps the competitive landscape, the composable stack, and where each tool actually deploys.\nSources # juliusbrussee/caveman chopratejas/headroom rtk-ai/rtk arxiv: Brevity Constraints paper (2604.00025) Kompress-v2-base on HuggingFace ","date":"18 June 2026","externalUrl":null,"permalink":"/posts/ai/tokenomics/llm-token-tools-part-1-meet-the-tools/","section":"Posts","summary":"","title":"The LLM Token Optimization Ecosystem — Part 1: Meet the Tools","type":"posts"},{"content":"Scope: Technical, Competitive, and Token Economics Analysis Repos analyzed: caveman · headroom · rtk + identified ecosystem\nExecutive Summary # A distinct open-source category has emerged around reducing the token cost of AI coding agents. These tools do not make models smarter — they make the surface area of every interaction smaller. 
Three high-traction repos dominate this space, each attacking a different slice of the token bill:\nrtk (63.1k ⭐) — compresses CLI command outputs before they hit the LLM caveman (62.1k ⭐) — compresses LLM responses by making the agent speak in terse fragments headroom (29.9k ⭐) — compresses everything the LLM reads (tool outputs, RAG, logs, files) Together they address both sides of the API invoice: input tokens (rtk/headroom) and output tokens (caveman). The design philosophies are fundamentally different, and the competitive dynamics are surprisingly cooperative — headroom explicitly bundles the rtk binary; both point to lean-ctx as an alt.\n1. rtk — CLI Output Proxy # Repo: rtk-ai/rtk Stars: 63.1k · Forks: 3.9k · Issues: 657 open · PRs: 576 open Releases: 211 (latest v0.42.4, Jun 12 2026) Language: Rust 92.9%, Shell 4.8%, TypeScript 1.5% License: Apache 2.0 Team: Patrick Szymkowiak (founder), Florian Bruniaux, Adrien Eppling\nWhat It Does # rtk sits as a transparent proxy between the AI agent\u0026rsquo;s Bash tool and the shell. When an agent calls git status, a PreToolUse hook intercepts it, rewrites it to rtk git status, runs the command, and returns compressed output. The agent never knows compression happened.\nFour compression strategies are applied per command type:\nSmart Filtering — strips comments, whitespace, boilerplate Grouping — aggregates similar items (files by directory, errors by type) Truncation — drops redundancy, keeps signal Deduplication — collapses repeated log lines with counts A tee recovery mechanism saves the full unfiltered output on failure, so the LLM can read it without re-running the command.\nTechnical Architecture # Single Rust binary, zero runtime dependencies, \u0026lt;10ms overhead. 
Supports 100+ commands across files, git, GitHub CLI, test runners (Jest, Cargo, pytest, Go test, RSpec), build tools (ESLint, tsc, Next.js, Rust), package managers, AWS CLI, Docker, and Kubernetes.\nAgent integration is hook-based: rtk init -g writes a PreToolUse hook to the agent\u0026rsquo;s config. Fourteen agents are supported including Claude Code, GitHub Copilot (VS Code + CLI), Cursor, Gemini CLI, Codex, Windsurf, Cline/Roo Code, OpenCode, OpenClaw, Pi, Hermes, Kilo Code, and Google Antigravity.\nCritical limitation: The hook only intercepts Bash tool calls. Claude Code\u0026rsquo;s native Read, Grep, and Glob tools bypass the hook entirely — so file reads don\u0026rsquo;t compress through rtk.\nBenchmarks (Claimed) # Operation Standard rtk Savings git status (10×/session) 3,000 600 -80% cargo test (5×/session) 25,000 2,500 -90% cat/read (20×/session) 40,000 12,000 -70% 30-min session total ~118,000 ~23,900 -80% Token Economics Model # rtk\u0026rsquo;s model is simple: reduce input tokens by compressing tool outputs. Given Claude Sonnet 4.6 at ~$3/M input tokens, an 80% reduction on 118k input tokens per session = ~$0.28 saved per session, or ~$31/month for a developer running 5 sessions/day (about 110 sessions/month). 
The rtk gain command tracks this with USD estimates and supports JSON export for dashboards.\nStrengths # Highest stars in the space; most agent integrations (14) Single Rust binary, zero deps — trivial deployment Telemetry is opt-in only, GDPR-compliant 211 releases reflect fast iteration and active maintenance Multilingual README (7 languages) signals global adoption Weaknesses / Risks # 657 open issues and 576 open PRs suggest the team is stretched thin Hook-only architecture misses Read/Grep/Glob tool calls — leaves significant savings on the table No file caching, no MCP server, no reversible compression Windows support limited to CLAUDE.md injection mode (no hook) Claimed savings are estimates based on \u0026ldquo;medium TypeScript/Rust projects\u0026rdquo;; real savings vary 2. caveman — Output Token Compressor # Repo: JuliusBrussee/caveman Stars: 62.1k · Forks: 3.5k · Issues: 75 open · PRs: 133 open Releases: 14 (latest v1.8.2, May 12 2026) Language: JavaScript 62.9%, Python 27.6%, PowerShell 4.9%, Shell 4.6% License: MIT Author: Julius Brussee (solo maintainer)\nWhat It Does # Caveman is a skill/plugin — a prompt-injected instruction set — that makes AI agents respond in compressed, caveman-style fragments. The agent still reasons at full depth but strips filler words, pleasantries, and redundant phrasing from its output.\nKey distinction: caveman only affects output tokens. Thinking/reasoning tokens are untouched. Input tokens are unchanged unless you also use caveman-compress to rewrite CLAUDE.md files.\nNormal: \u0026#34;The reason your React component is re-rendering is because you\u0026#39;re creating a new object reference on each render cycle...\u0026#34; = 69 tokens Caveman: \u0026#34;New object ref each render. Inline object prop = new ref = re-render. Wrap in `useMemo`.\u0026#34; = 19 tokens Technical Architecture # Architecture is deliberately lightweight: a SKILL.md file injected into the agent\u0026rsquo;s skills directory. 
A Claude Code session hook auto-activates caveman on every new session. The mechanism is entirely prompt-engineering — no code transformation, no proxy, no binary.\nFour compression levels: lite (drop filler), full (default caveman), ultra (telegraphic), wenyan (classical Chinese notation, maximally dense).\nSub-skills included:\n/caveman-commit — conventional commit messages ≤50 chars /caveman-review — one-line PR comments /caveman-stats — real session token usage + lifetime savings + USD /caveman-compress — rewrites CLAUDE.md/memory files into caveman-speak (~46% input token reduction on context files) caveman-shrink — MCP middleware that wraps any MCP server and compresses tool descriptions The repo structure covers 30+ agent targets: .claude-plugin, .agents, .codex, .kiro, .roo, .junie directories, plus plugin formats for OpenClaw, Gemini extensions, and others.\nBenchmarks (Claimed) # Average 65% output token reduction across 10 diverse prompts (range 22–87%). A March 2026 paper \u0026ldquo;Brevity Constraints Reverse Performance Hierarchies\u0026rdquo; found that constraining models to brief responses improved accuracy by 26 points on certain benchmarks — caveman cites this as independent validation.\nTask Normal Caveman Saved React re-render bug 1,180 159 87% Fix auth middleware 704 121 83% PostgreSQL pool 2,347 380 84% Git rebase vs merge 702 292 58% Architecture decision 446 310 30% Average 1,214 294 65% caveman-compress on memory/context files averages 46% reduction, which reduces input token cost on every subsequent session.\nToken Economics Model # Output token savings are smaller in absolute dollar terms for most models (output tokens ~5× more expensive per token than input for Sonnet 4.6, but there are far fewer of them). The bigger ROI is speed: 65% fewer output tokens means ~3× faster responses, which compounds into developer productivity. 
For API users paying output token costs, caveman\u0026rsquo;s savings are real but secondary to the UX improvement.\nThe ecosystem play is more interesting: cavekit adds per-task token budgets, cavemem compresses cross-session memory, and cavegemma fine-tunes Gemma 4 31B on caveman-style pairs to bake compression into model weights (removing the per-session prompt overhead entirely).\nStrengths # Extremely easy to install (one curl command, works in 30 seconds) No external dependencies, no binaries, no proxies Proven traction: 62.1k stars despite being a solo-maintained project Ecosystem breadth: caveman → cavemem → cavekit → cavegemma forms a coherent stack Accuracy claim supported by third-party research Weaknesses / Risks # Solo maintainer — bus factor of 1; 133 open PRs suggests review bottleneck Prompt-only mechanism — can be overridden by complex prompts or model behavior changes Output-only savings; doesn\u0026rsquo;t address the growing input token cost problem No reversibility or guaranteed format — compressed responses can be harder to parse programmatically 14 releases vs. rtk\u0026rsquo;s 211 suggests slower iteration cadence 3. headroom — Full Context Compression Layer # Repo: chopratejas/headroom Stars: 29.9k · Forks: 2k · Issues: 177 open · PRs: 90 open Releases: 155 (latest v0.25.0, Jun 12 2026) Language: Python 78.1%, Rust 17.2%, TypeScript 2.5% License: Apache 2.0\nWhat It Does # Headroom is the most technically ambitious of the three. 
Where rtk compresses CLI outputs and caveman compresses responses, headroom compresses everything the LLM reads — tool outputs, logs, RAG chunks, files, conversation history — via a pipeline that sits between the application and the LLM provider.\nYour app/agent │ prompts · tool outputs · logs · RAG · files ▼ ┌──────────────────────────────────────────────────┐ │ Headroom (local) │ │ CacheAligner → ContentRouter → CCR │ │ ├─ SmartCrusher (JSON) │ │ ├─ CodeCompressor (AST) │ │ └─ Kompress-base (text, HF) │ └──────────────────────────────────────────────────┘ │ compressed prompt + retrieval tool ▼ LLM provider (Anthropic · OpenAI · Bedrock · …) Technical Architecture # Six compression algorithms:\nSmartCrusher — universal JSON compression (arrays, nested objects) CodeCompressor — AST-aware compression for Python, JS, Go, Rust, Java, C++ Kompress-base — proprietary HuggingFace model trained on agentic traces CacheAligner — stabilizes prompt prefixes to maximize KV cache hit rates IntelligentContext — score-based context fitting with learned importance CCR (Compressed Context with Retrieval) — reversible compression; originals cached locally and retrievable on demand Deployment modes:\nPython library: compress(messages, model=...) TypeScript SDK: await compress(messages, { model }) Drop-in proxy: headroom proxy --port 8787 Agent wrap: headroom wrap claude|codex|cursor|aider MCP server: headroom_compress, headroom_retrieve, headroom_stats Framework integrations: Anthropic SDK, OpenAI SDK, Vercel AI SDK, LiteLLM, LangChain, Agno, Strands, ASGI middleware, multi-agent SharedContext.\nheadroom learn mines failed sessions and writes corrections back to CLAUDE.md/AGENTS.md/GEMINI.md — a feedback loop that improves agent behavior over time.\nHeadroom explicitly ships the rtk binary and uses it for shell-output rewriting, then compresses everything downstream. 
It also supports lean-ctx via HEADROOM_CONTEXT_TOOL=lean-ctx.\nBenchmarks (Claimed) # Real agent workloads:\nWorkload Before After Savings Code search (100 results) 17,765 1,408 92% SRE incident debugging 65,694 5,118 92% GitHub issue triage 54,174 14,761 73% Codebase exploration 78,502 41,254 47% Accuracy benchmarks (N=100 each):\nBenchmark Category Baseline Headroom Delta GSM8K Math 0.870 0.870 ±0.000 TruthfulQA Factual 0.530 0.560 +0.030 SQuAD v2 QA — 97% 19% compression BFCL Tool use — 97% 32% compression Reversibility via CCR means the LLM can retrieve originals when needed — addressing the accuracy risk of lossy compression.\nToken Economics Model # Headroom attacks the largest input token costs: logs, RAG chunks, and tool outputs are often 10–100× larger than necessary. A 92% reduction on a 65k-token SRE debug session saves ~$0.18 per incident in API costs at Sonnet 4.6 pricing. For teams running many automated agent workflows this compounds quickly.\nThe KV cache alignment via CacheAligner is an underrated feature: stable prompt prefixes allow Anthropic/OpenAI\u0026rsquo;s prompt caching to actually work, creating a compounding effect (cached tokens are ~90% cheaper). 
This benefit is orthogonal to compression and comes for free.\nThe headroom learn loop is a long-term moat: agent sessions generate proprietary training signal that improves compression quality over time.\nStrengths # Most technically comprehensive solution in the space Reversible compression (CCR) — strong accuracy guarantee Proprietary ML model (Kompress-v2-base) trained on agentic traces KV cache optimization is orthogonal value with zero additional cost Framework agnostic; 10+ SDK integrations Enterprise offering with ENTERPRISE.md 1,614 commits and 155 releases — highest development velocity of the three Weaknesses / Risks # Most complex to install and configure (Python 3.10+ required, Rust build for some components) ML model adds latency (~50–200ms depending on hardware) 177 open issues — support surface is wide SSL inspection environments have a painful setup path (documented but non-trivial) Requires trust: headroom proxies all LLM traffic; enterprise buyers will scrutinize data handling Python-first limits use in non-Python stacks despite TypeScript SDK 4. Competitive Landscape Map # Direct Competitors # Project Stars Scope Deploy Local Reversible Lang rtk 63.1k CLI output Hook ✅ ❌ Rust caveman 62.1k LLM output Skill/plugin ✅ ❌ JS/Python headroom 29.9k All context Proxy/library/MCP ✅ ✅ Python/Rust lean-ctx 4 CLI + file reads Hook + MCP ✅ ❌ Rust Compresr.ai N/A Text Hosted API ❌ ❌ — Token Co. 
N/A Text Hosted API ❌ ❌ — OpenAI Compaction N/A Conversation history Provider-native ❌ ❌ — Related / Ecosystem Repos # From caveman\u0026rsquo;s own ecosystem:\nJuliusBrussee/cavemem — cross-agent memory via SQLite + MCP, session-compressed storage JuliusBrussee/cavekit — Claude Code plugin: NL → blueprint → parallel build plan → working software; per-task token budgets; automated iteration JuliusBrussee/finetune-caveman (cavegemma) — Gemma 4 31B fine-tuned on caveman prompt-completion pairs; bakes compression into model weights Adjacent tools:\nyvgude/lean-ctx — Hybrid Shell Hook + MCP Server; 89–99% claimed savings; introduces Token Dense Dialect (mathematical symbols: λ for functions, § for classes, ∂ for interfaces); 6-mode file reading with MD5 session cache; Shannon entropy analysis; web dashboard at localhost:3333. Only 4 stars currently — very early stage, but technically differentiated by MCP server + file caching. 5. Differentiation Deep-Dive # Attack Surface Comparison # ┌─────────────────────────────────────────────────────────────┐ │ Token Bill Anatomy (typical agentic coding session) │ │ │ │ Input tokens breakdown: │ │ ├── Tool outputs (CLI, git, tests) ← rtk / lean-ctx │ │ ├── File reads ← headroom / lean-ctx │ │ ├── RAG / search results ← headroom │ │ ├── Conversation history ← headroom / OpenAI │ │ └── System prompt / context files ← caveman-compress │ │ │ │ Output tokens: ← caveman │ └─────────────────────────────────────────────────────────────┘ rtk has the biggest real-world impact per dollar of implementation effort because CLI outputs (git, tests, grep) are the most token-wasteful part of a coding session. headroom covers more surface area but requires more setup. caveman is the only tool addressing output tokens.\nTechnical Design Philosophies # rtk: Systems-first. Zero runtime dependencies. Predictable, rule-based compression. Fast path: \u0026lt;10ms overhead. 
Trust is easy — you can read the Rust source and understand exactly what gets stripped. No ML, no black boxes.\ncaveman: Prompt-engineering-first. Zero infrastructure required. The compression algorithm runs in the model\u0026rsquo;s reasoning — which means it\u0026rsquo;s flexible but also vulnerable to prompt drift or model updates that change response style.\nheadroom: ML-first. Trains proprietary models on agentic data. This creates a compounding advantage as the model improves, but also introduces latency, dependency complexity, and a trust surface (you\u0026rsquo;re proxying all traffic through headroom\u0026rsquo;s pipeline).\nThe Integration Stack # These tools are designed to compose:\n[caveman] → compress what the agent says [rtk/lean-ctx] → compress CLI tool outputs [headroom] → compress everything else (wraps rtk, adds file/RAG/log compression) [cavekit + cavemem] → orchestrate agents with token budgets + persistent memory [cavegemma] → bake caveman-style output compression into model weights Headroom\u0026rsquo;s explicit decision to bundle rtk (and support lean-ctx via env var) signals that the winning architecture isn\u0026rsquo;t one tool — it\u0026rsquo;s a composable layer.\n6. Token Economics Analysis # Cost Model (Claude Sonnet 4.6 Pricing) # Assuming ~$3/M input tokens, ~$15/M output tokens:\nTool Mechanism Session Savings (tokens) $/session saved $/month (5 sessions/day) rtk CLI output compression ~94k input tokens ~$0.28 ~$42 caveman Output compression (~65%) ~600 output tokens ~$0.009 ~$1.35 caveman-compress Context file compression (46%) ~400 input tokens/session ~$0.001 ~$0.18 headroom (SRE workload) Full context compression (92%) ~60k input tokens ~$0.18 ~$27 headroom KV cache alignment Cache hit improvement Model-dependent $0.03–0.15 $4.5–22 Stacking all tools: a developer using rtk + caveman + headroom could realistically save $60–100/month in API costs on a typical coding workflow. 
For a team of 10, that\u0026rsquo;s $600–1,000/month — a reasonable ROI trigger for enterprise deployment.\nThe Real Value: Speed and Context Length # The dollar savings are real but secondary. The primary value proposition is:\nSpeed: 65–80% fewer tokens = 2–5× faster responses. Time is worth more than API cost. Context longevity: Compressed context stays within the model\u0026rsquo;s window longer. A 200k-token context window effectively becomes 400–1,000k tokens equivalent with headroom compression. Agent coherence: Shorter context = less attention dilution = more focused reasoning. Monetization Paths # None of these repos are currently monetized directly. Potential paths:\nSaaS proxy (headroom is closest with ENTERPRISE.md) — charge per million tokens compressed Self-hosted enterprise — team dashboards, compliance, SAML Managed fine-tuning (cavegemma model direction) — charge for model distillation as a service Developer tooling subscription — token savings analytics, session recording, team-level dashboards 7. Key Observations and Risks # The compression arms race: As LLMs improve at long-context tasks, the value of compression decreases at the margins. However, token pricing creates a persistent economic incentive regardless of model capability improvements.\nBenchmark credibility: All three repos self-report benchmarks. rtk\u0026rsquo;s are the most conservative and granular; caveman\u0026rsquo;s cite an independent paper; headroom\u0026rsquo;s cover accuracy preservation most thoroughly. None have been independently reproduced.\nIntegration fragility: Hook-based approaches (rtk, lean-ctx) depend on agent hook APIs that can change with agent updates. 
Claude Code\u0026rsquo;s hook architecture has been stable, but this is a dependency risk.\nPrivacy surface: Headroom proxies all LLM traffic locally, but \u0026ldquo;locally\u0026rdquo; means on the developer\u0026rsquo;s machine. That is fine for individual use, but it requires a careful audit for enterprise deployment (data handling policies, compliance, secrets exposure in logs).\nThe lean-ctx wildcard: With only 4 stars but a well-differentiated technical approach (MCP server + file caching + TDD mode + web dashboard), lean-ctx could grow quickly if it executes on its roadmap. The Token Dense Dialect (mathematical symbols for code constructs) is a genuinely novel approach that rtk and headroom haven\u0026rsquo;t matched.\n8. Recommendations # For individual developers: Start with rtk (biggest immediate ROI, zero config) + caveman (simple output compression). Add headroom if you run automated agent workflows with heavy RAG or log analysis.\nFor teams: Headroom\u0026rsquo;s enterprise offering + rtk integration is the right architecture. The headroom learn feedback loop is the most defensible long-term value proposition.\nFor investors/observers: The most interesting bet is whether headroom\u0026rsquo;s ML-first approach (Kompress-v2-base) creates a durable moat, or whether rule-based tools like rtk are \u0026ldquo;good enough.\u0026rdquo; Given that rtk\u0026rsquo;s 63.1k stars came faster than headroom\u0026rsquo;s, and rtk has a more active PR/issue volume, market pull appears stronger for simple, transparent tools.\nWatch: lean-ctx — technically sound, first-mover on MCP + file caching combo, very early. Also watch cavegemma: if fine-tuned compression becomes good enough, it eliminates the need for all three runtime tools.\n9. 
Multi-Model Cost Comparison # Assumptions \u0026amp; Methodology # Session model: A 30-minute agentic coding session using a typical TypeScript/Rust project.\nMetric Without compression With full stack (rtk + headroom + caveman) Input tokens 118,000 ~12,000 (~90% reduction) Output tokens 15,000 ~5,250 (~65% reduction) Net input savings — rtk: –80% on CLI outputs; headroom: –50% on remaining reads/RAG Net output savings — caveman: –65% average across query types Work schedule: 5 sessions/day × 22 working days = 110 sessions/month (individual developer). Team calculation: 10 developers × individual monthly cost. Batch discount: Anthropic and OpenAI both offer ~50% off for batch/async processing. Prompt cache discount: ~90% off cached input tokens (applies when prefixes are stable — KV cache alignment via headroom maximizes this).\nPer-Session and Monthly Cost by Model # Anthropic — Claude Family # Model Input $/M Output $/M Cost/session (raw) Cost/session (compressed) Monthly — 1 dev (raw) Monthly — 1 dev (compressed) Monthly savings Claude Opus 4.8 $5.00 $25.00 $0.97 $0.19 $106.70 $20.90 $85.80 Claude Sonnet 4.6 $3.00 $15.00 $0.58 $0.11 $63.80 $12.10 $51.70 Claude Haiku 4.5 $1.00 $5.00 $0.19 $0.04 $20.90 $4.40 $16.50 Opus 4.8 Fast Mode doubles cost ($10/$50); compressed fast-mode session ≈ $0.38, monthly ≈ $41.80. Batch API (50% discount): Opus 4.8 compressed drops to ~$0.095/session / $10.45/month.\nOpenAI — GPT \u0026amp; o-series # Model Input $/M Output $/M Cost/session (raw) Cost/session (compressed) Monthly — 1 dev (raw) Monthly — 1 dev (compressed) Monthly savings GPT-5.5 $5.00 $30.00 $1.04 $0.22 $114.40 $24.20 $90.20 GPT-5.5 Pro $30.00 $180.00 $6.24 $1.31 $686.40 $144.10 $542.30 GPT-4o $2.50 $10.00 $0.45 $0.08 $49.50 $8.80 $40.70 o3 $2.00 $8.00 $0.36 $0.07 $39.60 $7.70 $31.90 o4-mini $0.55 $2.20 $0.098 $0.018 $10.78 $1.98 $8.80 ⚠️ o-series reasoning tokens: o3 and o4-mini generate hidden reasoning tokens billed as output. 
A response showing 500 output tokens may consume 3,000+ actual billed tokens. Effective output cost can be 3–6× the listed rate for complex reasoning tasks. Compression reduces the input cost but cannot reduce reasoning tokens — caveman still reduces final response output tokens.\nGPT-5.5 Batch/Flex: 50% discount → compressed session ≈ $0.11/session.\nGoogle — Gemini Family # Model Input $/M Output $/M Cost/session (raw) Cost/session (compressed) Monthly — 1 dev (raw) Monthly — 1 dev (compressed) Monthly savings Gemini 2.5 Pro $1.25 $10.00 $0.30 $0.068 $33.00 $7.48 $25.52 Gemini 2.5 Flash $0.30 $2.50 $0.073 $0.017 $8.03 $1.87 $6.16 Gemini 2.5 Flash-Lite $0.10 $0.40 $0.018 $0.0033 $1.98 $0.36 $1.62 Gemini 2.5 Pro uses tiered pricing: prompts \u0026gt;200k tokens step up significantly. Headroom\u0026rsquo;s compression is particularly valuable here — keeping prompts under the 200k threshold avoids the tier-up surcharge entirely.\nOpen-Weight \u0026amp; Alternative Providers # Model Input $/M Output $/M Cost/session (raw) Cost/session (compressed) Monthly — 1 dev (raw) Monthly — 1 dev (compressed) Monthly savings Mistral Large 3 $0.50 $1.50 $0.082 $0.014 $9.02 $1.54 $7.48 Llama 4 Maverick (hosted) $0.15 $0.60 $0.027 $0.0050 $2.97 $0.55 $2.42 Llama 4 Scout (hosted) $0.08 $0.30 $0.014 $0.0026 $1.54 $0.29 $1.25 DeepSeek V3.2 $0.14 $0.28 $0.021 $0.0032 $2.31 $0.35 $1.96 DeepSeek R1 (reasoning) $3.00 $7.00 $0.459 $0.073 $50.49 $8.03 $42.46 Llama 4 pricing varies by host (Together.ai, Fireworks, Groq, DeepInfra). Scout at $0.08 input is among the cheapest hosted frontier-class models available. 
DeepSeek V3.2 offers Claude Haiku-class capability at ~1/10th the cost — compression ROI is lower in absolute terms but proportionally the same (~83% cost reduction with the full stack).\nSummary: Full Compression Stack ROI by Model Tier # Tier Model Raw monthly (1 dev) Compressed monthly $ Saved/mo % Saved Team of 10 savings Frontier-expensive GPT-5.5 Pro $686 $144 $542 79% $5,420/mo Frontier-expensive Claude Opus 4.8 $107 $21 $86 80% $860/mo Frontier-expensive GPT-5.5 $114 $24 $90 79% $900/mo Mid-tier DeepSeek R1 $50 $8 $42 84% $420/mo Mid-tier Claude Sonnet 4.6 $64 $12 $52 81% $520/mo Mid-tier GPT-4o $50 $9 $41 82% $410/mo Mid-tier o3 $40 $8 $32 80% $320/mo Mid-tier Gemini 2.5 Pro $33 $7.5 $25.5 77% $255/mo Budget Claude Haiku 4.5 $21 $4.4 $16.5 79% $165/mo Budget o4-mini $11 $2 $9 82% $90/mo Budget Gemini 2.5 Flash $8 $1.9 $6.1 77% $61/mo Budget Mistral Large 3 $9 $1.5 $7.5 83% $75/mo Ultra-cheap Llama 4 Maverick $3 $0.55 $2.4 81% $24/mo Ultra-cheap DeepSeek V3.2 $2.3 $0.35 $2 85% $20/mo Ultra-cheap Gemini 2.5 Flash-Lite $2 $0.36 $1.6 82% $16/mo Key insight: The percentage savings is nearly constant (~80%) across all models because the compression ratios (90% input, 65% output) are model-agnostic. The absolute dollar savings, however, scale linearly with model price — making compression tools most valuable with expensive frontier models.\nThe Diminishing Marginal ROI Problem # For ultra-cheap models (DeepSeek V3.2, Llama 4 Scout, Gemini Flash-Lite), the absolute savings from compression are $1–3/month per developer — low enough that the integration overhead of headroom/rtk may not be worth it. The break-even math:\nheadroom setup time: ~30–60 min for initial integration rtk setup: ~5 min caveman setup: ~30 sec For a developer using Gemini Flash-Lite, the $1.62/month savings on token costs doesn\u0026rsquo;t justify the headroom setup. 
For the same developer using Claude Opus 4.8 or GPT-5.5 Pro, the $86–542/month savings pay back setup in minutes.\nConclusion: Compression tools have the highest ROI when used with premium frontier models. As models get cheaper, the economic case for compression weakens, but the speed and context-length benefits remain constant.\nBatch Processing Multiplier # For workflows that tolerate latency (test runs, automated reviews, nightly analysis), batch API pricing compounds compression savings:\nModel Compressed + Batch cost/session Compressed standard Additional batch savings Claude Opus 4.8 $0.095 $0.191 -50% Claude Sonnet 4.6 $0.057 $0.114 -50% GPT-5.5 $0.110 $0.218 -50% GPT-4o $0.041 $0.083 -50% Combining compression + batch + prompt caching can reduce Claude Opus 4.8 per-session cost from $0.965 to under $0.05 — a 95%+ total cost reduction versus baseline.\nPrompt Cache Alignment Bonus (headroom-specific) # Headroom\u0026rsquo;s CacheAligner stabilizes prompt prefixes to maximize provider KV cache hit rates. With Anthropic prompt caching:\nCached input tokens: 90% discount (effectively $0.30/M for Sonnet 4.6, $0.10/M for Haiku 4.5, given the $3.00/M and $1.00/M list input prices above) Cache write tokens: 25% premium (one-time cost, amortized over re-reads) For a session where 80% of input tokens are repeated context (system prompts, CLAUDE.md, codebase context), the effective input cost drops dramatically. This benefit is orthogonal to compression and stacks multiplicatively — the combination of CacheAligner + compression can reduce the effective input token rate to near-zero for stable context.\n10. Deployment Model: Local-Only vs Enterprise Gateway # A dimension that matters enormously for teams but is easy to miss: where does the tool actually run? Most of this ecosystem is strictly per-developer. Only one tool has a credible path to central deployment.\nLocal-Only Tools # rtk — strictly local. Single binary, runs on each developer\u0026rsquo;s machine via shell hook. No server mode, no central deployment path. 
Each developer installs and manages their own instance. The closest it gets to \u0026ldquo;shared\u0026rdquo; is the openclaw plugin integration, but even that is per-machine. caveman — local by design. It\u0026rsquo;s a prompt instruction injected into each agent\u0026rsquo;s config directory. There\u0026rsquo;s no server component whatsoever — it\u0026rsquo;s essentially a SKILL.md file. Deployment means distributing the file to each developer\u0026rsquo;s ~/.claude or equivalent. No gateway, no central control. lean-ctx — local only. Shell hook + per-machine MCP server (running on localhost:3333). No multi-user mode, no centralized deployment documented. Very early stage. cavemem / cavekit / cavegemma — all local. SQLite-backed memory, local agent orchestration, local fine-tuned model weights. No enterprise architecture. Can Function as Enterprise Gateway # headroom is the only one of the three that has a credible enterprise gateway path. It has multiple deployment modes that go beyond local:\nheadroom proxy --port 8787 — runs as a drop-in HTTP proxy. Any team member (or CI system) routes their LLM API traffic through it. The compression happens server-side before forwarding to Anthropic/OpenAI/Bedrock. Docker image — ghcr.io/chopratejas/headroom:latest — deployable on any container infra (ECS, GKE, Kubernetes). This is the gateway deployment path. ASGI middleware — app.add_middleware(CompressionMiddleware) — embeddable in a FastAPI/Starlette service, meaning the gateway can be part of an internal API layer that all developers hit. SharedContext — compressed context passing across multi-agent workflows, implying team-level shared state. ENTERPRISE.md — they explicitly document an enterprise offering, though the details aren\u0026rsquo;t public without contacting them. 
The architecture allows an enterprise to deploy one headroom instance, point all developer API keys through it, and apply compression + KV cache alignment + cross-agent memory at the team level — without any per-developer installation.\nCloud / Hosted (Not Local) # Compresr.ai and The Token Company — hosted APIs. You send your text to their endpoint, they compress it, you get tokens back. Zero local install, but your data leaves your environment. OpenAI Compaction — provider-native, runs on OpenAI\u0026rsquo;s infrastructure. No control or visibility, only covers conversation history. Deployment Summary # Tool Local only Self-hosted gateway Hosted/cloud rtk ✅ — — caveman ✅ — — lean-ctx ✅ — — cavemem / cavekit ✅ — — headroom ✅ ✅ — Compresr.ai — — ✅ Token Company — — ✅ OpenAI Compaction — — ✅ (provider) Bottom line: If you need central IT control, compliance visibility, or team-level deployment, headroom is the only open-source option in this space. rtk and caveman are fundamentally per-developer tools — you can distribute their configs via dotfiles or onboarding scripts, but there\u0026rsquo;s no central enforcement or observability layer. 
The gap between \u0026ldquo;headroom the local tool\u0026rdquo; and \u0026ldquo;headroom the enterprise gateway\u0026rdquo; is real and appears to be their primary monetization lever given the ENTERPRISE.md file.\nSources # juliusbrussee/caveman chopratejas/headroom rtk-ai/rtk yvgude/lean-ctx JuliusBrussee/cavekit Medium: The Ultimate Token-Saving Stack HyperAI: Caveman Open Source Story DEV.to: Headroom writeup arxiv: Brevity Constraints paper (2604.00025) Kompress-v2-base on HuggingFace ","date":"18 June 2026","externalUrl":null,"permalink":"/posts/ai/tokenomics/llm-token-optimization-ecosystem/","section":"Posts","summary":"","title":"The LLM Token Optimization Ecosystem: rtk, caveman, and headroom","type":"posts"},{"content":"","date":"20 March 2026","externalUrl":null,"permalink":"/tags/copilot/","section":"Tags","summary":"","title":"Copilot","type":"tags"},{"content":"","date":"20 March 2026","externalUrl":null,"permalink":"/tags/github/","section":"Tags","summary":"","title":"Github","type":"tags"},{"content":"","date":"20 March 2026","externalUrl":null,"permalink":"/tags/skills/","section":"Tags","summary":"","title":"Skills","type":"tags"},{"content":" Why Your AI Skills Should Be CLI-First: A Token Cost Analysis # Many AI skills are written so the model does all the heavy lifting: reading raw configuration files, running shell commands, and formatting output line by line. 
While this works, it comes at a cost: token consumption that scales poorly.\nThis post walks through a concrete analysis of a workspace setup skill that illustrates this problem and shows how shifting deterministic logic into a CLI binary can cut token consumption by over 95%.\nThe Problem: Verbose Skills are Expensive # A typical setup skill instructs the AI to:\nRead multiple raw configuration files Parse and summarize their contents for the user Place or update configuration files in the correct location In practice, this means the model is ingesting thousands of tokens just to present information it could have received pre-processed from a CLI command.\nSmall Scale vs Large Scale # 5 config files # Component Estimated tokens Instruction file + tool schemas 1,800 5 raw config files read into context 2,500 Parsing + reasoning + formatting 700 Response output 300 Total per turn 5,300 At five files, this already consumes more than five thousand input/output tokens for a single turn.\n100 config files # Component Estimated tokens Instruction file + tool schemas 1,800 100 raw config files read into context 50,000 Parsing + reasoning + formatting 7,000 Response output 600 Total per turn 59,400 The instruction overhead stays nearly constant, but raw file ingestion scales linearly. The model now spends most of its context window just reading data.\nCLI-first alternative (100 files) # Component Estimated tokens Instruction file + tool schemas 1,800 CLI summary output (pre-aggregated) 900 Light reasoning + response 600 Total per turn 3,300 That is roughly a 94% reduction compared with the 59,400-token baseline. In real projects, the reduction can exceed 95% when raw outputs are especially verbose and the CLI returns compact summaries.\nCost at Scale # Token counts are abstract until you put a dollar figure on them. 
Using Claude Sonnet as the baseline model (approximately $3.00 per 1M input tokens and $15.00 per 1M output tokens), here is what those numbers look like in practice.\nCost Per Turn # Approach Input tokens Output tokens Cost per turn Verbose skill (100 files) 58,800 600 ~$0.185 CLI-first skill (100 files) 2,700 600 ~$0.017 Calculation (verbose):\n(58,800 / 1,000,000) × $3.00 = $0.176 (input) ( 600 / 1,000,000) × $15.00 = $0.009 (output) ────── $0.185 per turn Calculation (CLI-first):\n(2,700 / 1,000,000) × $3.00 = $0.008 (input) ( 600 / 1,000,000) × $15.00 = $0.009 (output) ────── $0.017 per turn Cost Per Developer Per Month # Assuming a developer makes 10 skill invocations per day across 20 working days:\nApproach Turns/month Cost/dev/month Verbose skill 200 $37.00 CLI-first skill 200 $3.40 Savings $33.60 Cost at Team Scale # Team size Verbose (monthly) CLI-first (monthly) Annual savings 100 devs $3,700 $340 $40,320 500 devs $18,500 $1,700 $201,600 40,000 devs $1,480,000 $136,000 $16,128,000 These numbers assume a single skill used once per invocation. In practice, complex workflows involve multiple skill turns per task, which multiplies the gap further.\nThe Crossover Point # At what point does investing engineering time in a CLI wrapper pay off? A rough back-of-napkin calculation:\nEngineering cost to build and maintain a CLI helper: ~8 hours at $100/hr = $800 one-time Break-even for a team of 10: $800 / $336 = ~2.4 months Break-even for a team of 50: $800 / $1,680 = ~0.5 months Break-even for a team of 100: $800 / $3,360 = ~0.24 months Break-even for a company of 40,000: $800 / $1,344,000 = under 30 minutes of savings For any team larger than a handful of developers using the skill daily, the CLI investment pays for itself quickly. At enterprise scale — say 40,000 developers — the $800 engineering cost is recovered in under 30 minutes of production usage. 
The annual delta is over $16 million.\n","date":"20 March 2026","externalUrl":null,"permalink":"/posts/misc/understanding-llm-token-consumption/","section":"Posts","summary":"","title":"Token Consumption with GitHub Copilot Skills","type":"posts"},{"content":" Platform Engineer building production-grade infrastructure at home — because if it\u0026rsquo;s worth running once, it\u0026rsquo;s worth running right. Who I Am # I\u0026rsquo;m German Rivera, a Platform / DevOps engineer with 8+ years turning infrastructure into reliable, automated systems. By day I architect and operate distributed platforms. By night I run the same patterns on a homelab cluster just to see how far I can push it.\nThis blog is where I write about what I\u0026rsquo;m building, what broke, and what I learned from both.\nGitHub LinkedIn Certifications # Microsoft Certified Certification Code Issuer Azure Fundamentals AZ-900 Microsoft Skills # Infrastructure \u0026amp; Platform # Kubernetes / k3s ArgoCD / GitOps Azure Cloudflare Zero Trust Grafana · Prometheus OpenStack Languages # Go Python TypeScript Node.js Java Frameworks \u0026amp; Tools # React NestJS Docker Terraform AI \u0026amp; LLM Tooling # Microsoft AI Foundry AWS Bedrock GitHub Copilot Claude Code Reading List # Books that shaped how I think about systems and engineering culture.\nEngineering # The Phoenix Project — DevOps as a story; required reading The Unicorn Project — same universe, developer lens Accelerate — the science behind high-performing engineering orgs Google SRE Book — free, dense, worth it The DevOps Handbook — practical companion to Phoenix Project Clean Code — craft fundamentals Self Development # Range — why generalists win in a specialized world Ikigai — purpose-driven living The Now Habit — beating procrastination without guilt The 7 Habits of Highly Effective People — timeless framework ","date":"16 April 2021","externalUrl":null,"permalink":"/contact/","section":"","summary":"","title":"About 
Me","type":"page"},{"content":" Platform Engineer. I build and break distributed systems — at work and at home. Writing about Kubernetes, GitOps, AI tooling, and homelab adventures. If it runs on bare metal, I\u0026rsquo;ve probably overengineered it.\nKubernetes ArgoCD Azure LLMs \u0026amp; AI ","date":"16 April 2021","externalUrl":null,"permalink":"/","section":"","summary":"","title":"","type":"page"},{"content":"","date":"29 October 2016","externalUrl":null,"permalink":"/tags/engineering/","section":"Tags","summary":"","title":"Engineering","type":"tags"},{"content":" How to Ask Questions as a New Engineer # As a developer starting out in the industry, you will have a lot of questions. Many of them will not be technical; most will be about management and process. Those are very common, but as a developer you will have even more technical questions. For those, there needs to be a process for asking your colleagues. There are THREE steps to asking a technical question.\nStep 1: # GOOGLE IT. Surprisingly often, people with technical questions don’t consult the one thing that has almost all the answers. Researching a question should be your first reaction before asking a developer. Unless your question is about an architecture decision specific to your team, google it first. Otherwise, you might get a lmgtfy link.\nStep 2: # Spend time working on it. It does not matter if you are a senior developer or an intern, you will reach a roadblock. However, when you get stuck on a problem, do not be too quick to ask a question. Asking might give you the answer, but asking too early stops you from learning, which will hurt you in the long run. Plus, if you do not even attempt to solve a problem yourself, it defeats the purpose of you being a developer. 
By researching the problem, you will learn how to solve it and may even learn about some problems that you did not expect.\nStep 3: # Ask your question. Asking a question without the previous two steps is pointless because you will not have a good question. A question to another developer should come with some background knowledge, logs, reproduction steps, etc. Otherwise, the developer will have to guess what the issue is and run through the N scenarios that pop into their head. Depending on their experience, they might still recognize the problem, but it’s a long shot.\n","date":"29 October 2016","externalUrl":null,"permalink":"/posts/misc/how-to-ask-questions-as-new-engineer/","section":"Posts","summary":"","title":"How to Ask Questions as New Engineer","type":"posts"},{"content":"","date":"29 October 2016","externalUrl":null,"permalink":"/tags/softskill/","section":"Tags","summary":"","title":"Softskill","type":"tags"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"}]