Abstract
AI agent frameworks treat cost as a monitoring concern. This paper argues it should be the primary architectural constraint, and demonstrates that designing for extreme cost limits produces systems that are more resilient, more observable, and more production-ready than their unconstrained equivalents.
We present Veltrix, an autonomous agent that manages three businesses and aims to operate on a hard $2/day budget. We introduce Cost-First Agent Architecture: a pattern combining tiered model routing, progressive degradation, and local model scaffolding that reduced weekly operating costs by 82% while maintaining a 99.7% task success rate.
Over 18 days of production operation, Veltrix processed 1,562 API calls at a total cost of $50.43. It didn't start at $2/day. Week 1 averaged $4.42/day, including two catastrophic overspend days ($13.15 and $12.33) that exposed every gap in the cost control layer. Each failure drove a specific architectural fix. By Week 3, average daily cost had dropped to $1.46, approaching the target. The architecture uses a four-tier model routing hierarchy, from Claude Opus ($15/M input tokens) down to a local 14B parameter model running on consumer GPU hardware at zero marginal cost. 6.5% of all calls routed to local models with no quality degradation for appropriate task types. The system achieved budget adherence on 67% of operating days, with compliance improving as per-task budgets, loop detection, and rate limiting matured; overspend was concentrated in the early weeks.
The gap between agent research (unconstrained budgets, synthetic benchmarks) and production deployment (hard caps, real consequences) is the binding constraint on adoption. This paper shares the concrete mechanisms for bridging it.
Cost-First Agent Architecture in one formula:
Agent Cost = Σ (task_i → cheapest_model_that_succeeds_for_task_i)
Where "cheapest model that succeeds" is determined by task classification, historical quality scores, and budget state, not by trying each model in sequence (FrugalGPT cascade) or by a trained router (RouteLLM). The routing decision is made before the call, not after.
1. Introduction
A single runaway AutoGPT loop can burn $300 in an afternoon. Most agent frameworks don't consider this a bug. Cost is something you monitor after deployment, not something you design around.
This paper makes the opposite argument: cost should be the first constraint you design for, not the last thing you measure. And designing for it doesn't just save money. It forces architectural decisions that make the system more resilient, more observable, and more production-ready than it would be otherwise.
Veltrix is an autonomous AI agent managing operations across three businesses: an AI tools platform (Veltrix Collective), a camping gear e-commerce store (Madden Authentic), and personal administration. It aims to operate on a hard $2/day budget. It didn't start there. Week 1 cost $4.42/day. Two days hit $13. But each overspend exposed a gap, and each gap drove a fix. By Week 3, the system was averaging $1.46/day with the same workload.
The $2 target forces questions that unconstrained agents never ask. Which tasks deserve a frontier model at $15/M tokens? Which can a 14B local model handle for free? When should the system stop spending and escalate to its human operator? These aren't theoretical. They arise at 2am when the social post automation is competing with a customer email for the remaining budget.
We introduce Cost-First Agent Architecture, a pattern built on three mechanisms:
- Tiered model routing — four cost tiers from frontier ($15/M) to local ($0), with budget-aware selection that progressively downgrades as daily spend increases
- Progressive degradation — a state machine that reduces agent autonomy (fewer iterations, restricted tools, mandatory escalation) based on error rates, rather than failing entirely
- Local model scaffolding — a generate-score-repair pipeline that makes a 14B parameter model viable for production tasks that would otherwise require an API call
This paper validates these mechanisms with 18 days of production data (1,562 API calls, $50.43 total cost, 99.7% success rate) and shares the specific implementation details that made them work.
2. System Architecture
2.1 Overview
Veltrix runs as a systemd service on a WSL2 instance with an RTX 5060 Ti (16GB VRAM) and 48GB RAM. The system processes commands via Telegram, executes a ReAct agent loop against 20+ service integrations (GitHub, Notion, Zoho Mail, Vercel, Brevo, Supabase, Stripe, and others), and manages three business portfolios with distinct tool permissions, brand voices, and operational domains.
The core loop is straightforward: receive message, classify task complexity, route to the appropriate model tier, execute tool calls in a bounded iteration loop, score the result, and log everything to SQLite. What makes it interesting is the machinery surrounding this loop: the cost controls, degradation states, context management, and retry strategies that make it viable under extreme resource constraints.
2.2 Four-Tier Model Routing
The router implements a four-tier model hierarchy with cost-aware selection:
Tier 1 (Local, $0/call): Ollama running qwen2.5:14b on GPU. Handles classification, email triage, formatting, summarisation, note-taking, and internal agent communication. 101 calls over the observation period, zero API cost.
Tier 2 (Budget Cloud, ~$0.002/call): Arcee AI's Trinity-Large-Thinking via OpenRouter at $0.22/$0.85 per million tokens (input/output). Used for structured work requiring tool-calling that exceeds local model capabilities, and as the fallback tier when budget exceeds 50%. 10 calls observed, average quality score 3.6/5.
Tier 3 (Primary Cloud, ~$0.034/call): Claude Sonnet 4 at $3/$15 per million tokens. The workhorse model for medium-complexity tasks. Handles the bulk of tool-calling, multi-step reasoning, and content generation. 1,192 calls, 80.4% prompt cache hit rate.
Tier 4 (Frontier, ~$0.08/call): Claude Opus 4 at $15/$75 per million tokens. Reserved for hard-complexity tasks that require extended reasoning chains. Not observed in the production data, as the budget constraint rarely allows it.
The routing decision is made by `selectModel()`, which checks, in order:

1. Task type routing: Simple/internal tasks (classify, email, format, note, summarise) always route to the local model, regardless of budget state. These task types were identified empirically as ones where qwen2.5:14b produces acceptable output.
2. Structured work routing: Tool-calling and search tasks route to Trinity (Tier 2), which offers stronger reasoning than the local model at minimal cost.
3. Budget state check: For medium and hard tasks, the router queries today's cumulative spend against the $2 daily budget:
   - Under 50% ($0-$1): Use the best-performing model from routing history, defaulting to Sonnet
   - 50-100% ($1-$2): Force downgrade to Trinity for all tasks
   - Over 100%: Block Sonnet and Opus entirely; Trinity becomes the ceiling
4. Rate limiting: More than 30 API calls in a day triggers forced Trinity routing regardless of budget state. This catches loops that haven't yet burned through the budget but show pathological call patterns.
5. Learning router: For tasks without a hard routing rule, the system queries historical quality scores from `routing_history` to find the best-performing model for that task type, requiring at least 5 prior observations and an average quality score of 3 or above.
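The decision order above can be sketched as a single function. This is an illustrative reconstruction, not the production `selectModel()`: the tier names, thresholds, and prices come from the paper, while the function shape and parameter names are assumptions.

```python
# Hypothetical sketch of the routing decision order; not the production code.
LOCAL, TRINITY, SONNET = "qwen2.5:14b", "trinity-large-thinking", "claude-sonnet-4"

LOCAL_TASKS = {"simple", "internal", "note", "classify", "summarise",
               "self-knowledge", "email", "format"}
STRUCTURED_TASKS = {"tool", "search"}

DAILY_BUDGET = 2.00      # hard daily cap in dollars
DAILY_CALL_LIMIT = 30    # rate limit that catches pathological loops

def select_model(task_type, spend_today, calls_today, best_from_history=None):
    # 1. Task-type routing: simple/internal work is always local.
    if task_type in LOCAL_TASKS:
        return LOCAL
    # 2. Structured work (tool-calling, search) goes to the budget cloud tier.
    if task_type in STRUCTURED_TASKS:
        return TRINITY
    # 4. Rate limiting: high call volume forces the budget tier.
    if calls_today > DAILY_CALL_LIMIT:
        return TRINITY
    # 3. Budget state: at 50% of daily budget and above, Trinity is the ceiling.
    if spend_today / DAILY_BUDGET >= 0.5:
        return TRINITY
    # 5. Learning router: historical best if available, defaulting to Sonnet.
    return best_from_history or SONNET
```

The key property is that the decision happens before the call is made, using only cheap local state (task type, today's spend, today's call count).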
Task types routed locally (zero cost): `simple`, `internal`, `note`, `classify`, `summarise`, `self-knowledge`, `email`, `format`

Task types routed to Trinity (budget tier): `tool`, `search`

Task types using budget-aware selection: `medium` → Sonnet (default) or Trinity (budget); `hard` → Opus (default) or Trinity (budget)
2.3 Cost Estimation and Tracking
Every API call is priced using a static pricing table that maps model identifiers to per-million-token rates for input, output, and cache-read tokens. The cost formula accounts for prompt caching:
cost = (uncached_input * rate_input / 1M)
+ (cached_input * rate_cache_read / 1M)
+ (output * rate_output / 1M)
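The formula translates directly into a small pricing function. The Sonnet rates below are the ones quoted in this paper; the table structure and function name are illustrative.

```python
# Per-million-token rates (dollars); values for Claude Sonnet 4 from the text.
PRICING = {
    "claude-sonnet-4": {"input": 3.00, "cache_read": 0.30, "output": 15.00},
}

def estimate_cost(model, uncached_input, cached_input, output):
    """Price a single call, accounting for prompt-cache reads."""
    rates = PRICING[model]
    return (uncached_input * rates["input"]
            + cached_input * rates["cache_read"]
            + output * rates["output"]) / 1_000_000
```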
Prompt caching is significant. Anthropic models support ephemeral cache control on the system prompt, and the identity module (identity.ts) splits the prompt into a stable prefix (identity, rules, lessons, rarely changed) and a dynamic suffix (current context, date, portfolio state). The stable prefix gets the cache_control: { type: 'ephemeral' } annotation, enabling cross-call caching.
Production data shows an 80.4% cache hit rate on Claude Sonnet 4, with 25 million cache-hit tokens versus 6.1 million cache-miss tokens. At a 10x reduction in per-token cost for cached reads ($0.30 vs $3.00 per million), prompt caching reduced system prompt costs by roughly 72%.
2.4 The ReAct Agent Loop
The agent implements a standard Reason-Act-Observe loop with several production-specific adaptations:
Bounded iterations. The loop caps at 20 iterations (reduced under degradation). Each iteration: call the LLM, check for tool calls, execute tools, score the turn, and check budget thresholds.
Context compaction. When token usage hits 70% of the model's context window, the system compresses the conversation history. It extracts user requests, actions taken, successful outcomes, errors encountered, and tools used from older messages, replaces them with a structured snapshot, and retains only the 6 most recent messages. This preserves intent and state while reclaiming token budget.
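The compaction trigger and snapshot logic can be sketched as follows. The 70% threshold and 6-message retention come from the description above; the message shape and the `summarise` callable (which would extract requests, actions, outcomes, errors, and tools used, e.g. via a local-model call) are assumptions.

```python
COMPACT_THRESHOLD = 0.70   # compact at 70% of the model's context window
KEEP_RECENT = 6            # retain the 6 most recent messages verbatim

def maybe_compact(messages, tokens_used, context_window, summarise):
    """Replace older history with a structured snapshot when near the limit."""
    if tokens_used < COMPACT_THRESHOLD * context_window:
        return messages
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    snapshot = {"role": "system", "content": summarise(older)}
    return [snapshot] + recent
```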
Turn-level quality scoring. After each tool-execution turn, the system scores the turn based on tool call success rates: 5 = all tools succeeded, 1 = all failed. This score feeds back into the routing history for the learning router, and into the task budget tracker for loop detection.
Correction hints. Before executing tool calls, the system queries a structured lesson store for known pitfalls matching the tool and its arguments. If corrections are found (e.g. "Telegram image sends must use $TELEGRAM_BOT_TOKEN from env, never placeholders"), they're injected as a system message after the tool results, giving the model corrective context without breaking the tool-call message chain.
2.5 Task Budget Enforcement
Beyond the global daily budget, each task has its own budget enforced by a per-task tracker:
| Complexity | Max Calls | Max Cost |
|---|---|---|
| simple | 3 | $0.10 |
| internal | 3 | $0.10 |
| classify | 3 | $0.10 |
| | 5 | $0.20 |
| format | 5 | $0.20 |
| tool | 5 | $0.30 |
| search | 5 | $0.30 |
| medium | 15 | $0.50 |
| hard | 20 | $1.00 |
When a task exceeds its call or cost limit, the system blocks further API calls for that task and sends a Telegram escalation to the human operator. Three consecutive turns scoring 1/5 (all tool calls failed) triggers the same block, catching loops where the agent is trying the same failing approach repeatedly.
Task trackers reset hourly to prevent stale state from blocking legitimate work.
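A minimal sketch of the per-task tracker, combining the limits from the table with the three-consecutive-failures loop detector. Class and method names are illustrative, not the production tracker.

```python
# (max_calls, max_cost) per complexity; a subset of the table above.
TASK_LIMITS = {"simple": (3, 0.10), "tool": (5, 0.30),
               "medium": (15, 0.50), "hard": (20, 1.00)}

class TaskBudget:
    def __init__(self, complexity):
        self.max_calls, self.max_cost = TASK_LIMITS[complexity]
        self.calls = 0
        self.cost = 0.0
        self.consecutive_failures = 0

    def record(self, call_cost, quality_score):
        self.calls += 1
        self.cost += call_cost
        # Three consecutive 1/5 turns (all tool calls failed) looks like a loop.
        if quality_score == 1:
            self.consecutive_failures += 1
        else:
            self.consecutive_failures = 0

    def blocked(self):
        return (self.calls >= self.max_calls
                or self.cost >= self.max_cost
                or self.consecutive_failures >= 3)
```

In the real system, `blocked()` returning true also triggers the Telegram escalation to the operator.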
2.6 Degradation State Machine
The system continuously assesses its operational health by examining the last 20 tool-call outcomes from the action log. Error rate thresholds trigger progressive autonomy reduction:
| Level | Error Rate | Max Iterations | Escalation |
|---|---|---|---|
| full | < 30% | 20 | No |
| cautious | 30-50% | 10 | No |
| supervised | 50-75% | 5 | Yes |
| paused | > 75% | 0 | Yes |
The degradation state persists to SQLite, surviving restarts. Stale state (older than 1 hour) auto-resets to full, preventing a bad session from permanently degrading the system.
At the supervised level, tool permissions are restricted to a code-only profile, preventing the agent from making external API calls, sending emails, or modifying deployments. At paused, the agent returns a human-readable error message explaining the error rate and requesting manual review.
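The threshold table maps onto a simple pure function over the recent outcome window. The thresholds are from the table above; the return shape is an assumption.

```python
def degradation_level(recent_outcomes):
    """recent_outcomes: booleans for the last 20 tool calls (True = success).
    Returns (level, max_iterations, escalate)."""
    if not recent_outcomes:
        return ("full", 20, False)
    error_rate = 1 - sum(recent_outcomes) / len(recent_outcomes)
    if error_rate < 0.30:
        return ("full", 20, False)
    if error_rate < 0.50:
        return ("cautious", 10, False)
    if error_rate < 0.75:
        return ("supervised", 5, True)   # also restricts tools to code-only
    return ("paused", 0, True)
```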
2.7 Circuit Breakers
Individual tools have circuit breakers that track consecutive failures. After 3 consecutive failures, the circuit opens and blocks that tool for 5 minutes. After the recovery timeout, the circuit enters a half-open state, allowing a single probe call. If the probe succeeds, the circuit closes; if it fails, it reopens.
The heartbeat system (running every 30 minutes) actively probes known-fragile tools (like the Docker sandbox) and resets circuits when they recover.
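The breaker behaviour described above fits in a few lines. This sketch uses the paper's thresholds (3 failures, 5-minute recovery); the class is illustrative and takes an injectable clock so the heartbeat logic can probe it deterministically.

```python
import time

FAILURE_THRESHOLD = 3
RECOVERY_SECONDS = 300  # 5 minutes

class CircuitBreaker:
    def __init__(self, clock=time.monotonic):
        self.failures = 0
        self.opened_at = None
        self.clock = clock

    def allow(self):
        if self.opened_at is None:
            return True  # circuit closed
        # After the recovery timeout, permit a single half-open probe call.
        return self.clock() - self.opened_at >= RECOVERY_SECONDS

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None  # probe succeeded: close the circuit
        else:
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:
                self.opened_at = self.clock()  # open (or reopen) the circuit
```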
3. Three Cheap Passes Beat One Expensive Pass
3.1 The ATLAS-Inspired Pipeline
The smart_local.py module wraps the local Ollama model in a three-phase quality pipeline, inspired by the ATLAS framework for tool-augmented language models:
Phase 1 (Generate): Produce N candidates (default 3) for the given task. Each candidate receives a different style hint ("Be direct and concise", "Be creative and engaging", "Be thorough and detailed") to encourage diversity without changing temperature, which the Ollama API doesn't expose in a fine-grained way.
Phase 2 (Score): Each candidate is self-evaluated against task-specific criteria using a structured prompt that requests JSON output with per-criterion scores (1-10), a total, and identified issues. The module handles parse failures gracefully: it tries JSON extraction via regex, falls back to parsing "N/10" patterns from text, and defaults to a neutral score of 5 when parsing fails entirely.
Phase 3 (Repair): If the best candidate scores below the minimum threshold (default 6), and the scoring phase identified specific issues, the model gets one repair pass. The repair prompt includes the original content, the identified issues, and the quality criteria, asking for a targeted rewrite.
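The three phases compose into a short pipeline. This is a skeleton of the approach, not `smart_local.py` itself: `llm` stands in for a local Ollama call, and `score_fn` stands in for the structured self-evaluation (returning a 1-10 score plus identified issues).

```python
STYLE_HINTS = ["Be direct and concise",
               "Be creative and engaging",
               "Be thorough and detailed"]
MIN_SCORE = 6  # repair threshold, out of 10

def smart_generate(task, llm, score_fn, n=3):
    # Phase 1: N candidates with different style hints for diversity.
    candidates = [llm(f"{task}\n\nStyle: {hint}") for hint in STYLE_HINTS[:n]]
    # Phase 2: self-score each candidate; score_fn returns (score, issues).
    scored = [(score_fn(c), c) for c in candidates]
    (best_score, issues), best = max(scored, key=lambda s: s[0][0])
    # Phase 3: one targeted repair pass if quality is below threshold.
    if best_score < MIN_SCORE and issues:
        best = llm(f"Rewrite to fix these issues: {issues}\n\n{best}")
    return best
```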
The key insight is that three cheap local passes beat one expensive API pass for many production task types. A social media post generated by qwen2.5:14b with the generate-score-repair pipeline costs exactly $0 in API spend and produces output that passes the quality bar for social platforms. The same task routed to Sonnet would cost approximately $0.03 per post, or roughly $0.90/month for daily social content alone, about 1.5% of the $60 monthly budget.
3.2 Task Routing to Local Models
Eight task types route to the local model by default:
- simple: Basic questions and lookups
- internal: Agent-to-agent communication
- note: Memory updates and note-taking
- classify: Portfolio classification and message triage
- summarise: Summarising tool outputs for context
- self-knowledge: Questions about Veltrix's own state
- email: Email triage, classification, and draft generation
- format: Formatting and templating tasks
These types were selected based on two criteria: the local model produces acceptable output (validated through production quality scores averaging 2.9/5, comparable to cloud models for these specific tasks), and the tasks don't require tool-calling, which qwen2.5:14b handles unreliably.
When the cloud API fails for a text-only task with no tool-calling requirement, the router automatically falls back to the local model, providing resilience against API outages at zero additional cost.
3.3 Convenience Functions
The module provides task-specific wrappers: smart_social_post() (platform-aware with character limits and platform-specific criteria), smart_summary() (with reduced candidate count for efficiency), and smart_draft() (general-purpose). These encode domain knowledge about quality criteria into the pipeline, making it usable without the caller needing to specify criteria each time.
4. From $13/Day to $1.46: What the Production Data Shows
4.1 Daily Cost Distribution
Over 18 days of operation (23 March to 9 April 2026), the system processed 1,562 calls at a total cost of $50.43. The daily cost distribution tells the story of a system learning its constraints:
| Period | Days | Avg Daily Cost | Max Daily Cost | Budget Compliance |
|---|---|---|---|---|
| Week 1 (Mar 23-29) | 7 | $4.42 | $13.15 | 3/7 (43%) |
| Week 2 (Mar 30-Apr 5) | 7 | $1.52 | $6.97 | 5/7 (71%) |
| Week 3 (Apr 6-9) | 4 | $1.46 | $4.31 | 2/4 (50%) |
The $13.15 day (March 23) is worth examining in detail. The agent was asked to set up a new integration. It hit an API error, retried, hit it again, and entered a loop: 499 API calls in a single day, each one burning ~$0.03 against a $2 budget. Nobody was watching. The daily budget check existed but ran once at task start, not per-call. By the time the operator noticed, the system had spent 6.5x its daily budget on a single failed task.
That one day drove three architectural changes: per-task call limits (so one task can't consume the entire budget), per-call budget checks (not just per-task), and loop detection (three consecutive 1/5 quality scores blocks the task and sends a Telegram alert).
March 28 repeated the pattern: $12.33, 340 calls, a different integration but the same failure mode. The per-task limits caught it faster, but the limits were too generous. They were tightened.
By Week 2, average daily cost dropped 66% despite the system handling the same workloads. The $6.97 day on April 2 (182 calls) was caused by a long content generation session. The $4.31 day on April 6 exposed a subtler bug: the system's own cost control documentation had gone stale. It was operating on outdated assumptions about model pricing because CONTEXT.md (the file the agent reads every session to understand its own rules) hadn't been updated after the last round of changes. The system broke its own budget because it forgot its own rules.
4.2 Model Usage and Cost Efficiency
| Model | Calls | Total Cost | Avg Cost/Call | Avg Quality | Success Rate |
|---|---|---|---|---|---|
| Claude Sonnet 4 | 1,192 | $42.77 | $0.036 | 2.31/5 | 100% |
| GPT-4o | 247 | $7.46 | $0.030 | 2.56/5 | 98.8% |
| Local (qwen2.5) | 101 | $0.00 | $0.000 | 2.88/5 | 100% |
| Trinity-Thinking | 10 | $0.02 | $0.002 | 3.60/5 | 100% |
| Claude 3.5 Haiku | 10 | $0.19 | $0.019 | 3.00/5 | 100% |
Several observations deserve comment.
The quality scores are skewed by the scoring mechanism. A score of 1 means "all tool calls failed this turn", while 5 means "all succeeded". 603 turns scored 1/5 and 133 scored 5/5, with very few intermediate scores. This reflects the binary nature of tool execution (it either works or it doesn't) rather than a nuanced quality assessment. The low average for Sonnet (2.31) likely reflects its use for the hardest tasks with the most tool calls, where partial failures are more common. Trinity's higher average (3.60) may reflect its use for simpler structured tasks where individual tool calls are more likely to succeed.
The local models (qwen2.5:14b and qwen2.5:7b combined) maintained a 100% success rate across 101 calls, but these were exclusively text generation tasks without tool calling. Their 2.88 average quality score is comparable to cloud models, validating the routing decision to handle simple text tasks locally.
Prompt caching on Sonnet saved roughly $7.27 over the observation period (estimated from the 25M cached tokens at the $2.70/M delta between full and cached pricing).
4.3 Cost Per Week Trend
| Week | Calls | Total Cost |
|---|---|---|
| Week 1 (Mar 23-29) | 1,082 | $33.11 |
| Week 2 (Mar 30-Apr 5) | 334 | $11.48 |
| Week 3 (Apr 6-9) | 146 | $5.84 |
The 82% reduction in weekly cost from Week 1 to Week 3 reflects three things: maturation of cost controls (per-task budgets, loop detection), the introduction of Trinity as a budget tier replacing Haiku, and improved task classification routing more work to the local model. The system genuinely learned from its mistakes, both through the structured lesson store and through operator-driven rule updates.
5. Don't Build Middleware That Duplicates Your Agent
5.1 V1: The Middleman Architecture
The initial voice pipeline for processing gear reviews followed a common pattern in agent systems: build a specialised sub-pipeline.
Voice message arrives via Telegram. A Python subprocess invokes the OpenAI Whisper API for transcription. A separate GPT-4o-mini call classifies the transcription (is this a gear review or a general message?). State files in /tmp track the conversation flow. A dedicated handler processes the extracted review and saves it to Notion.
This architecture had every problem you'd predict from reading it:
- Silent failures when the Python subprocess crashed
- No conversation context (the classification model didn't know what the user had said previously)
- False positives on gear review detection (any mention of a product triggered the review flow)
- WSL2 network drops causing the Telegram file download to fail silently
- State files in `/tmp` that didn't survive service restarts
The git history tells the story through commit messages: "fix: voice messages now context-aware", "fix: voice transcription only creates review if product/rating detected", "fix: fully disable regex fallback for review detection". Each fix addressed a symptom, not the root cause.
5.2 V2: Let the Agent Handle It
The fix was to stop building middleware that duplicated the agent's existing capabilities.
V2 replaced the Whisper API with a local faster-whisper instance running on the GPU (large-v3 model, HTTP API on port 9876). Transcription became free and fast. The transcribed text feeds directly into the main agent as a normal user message. The agent already has conversation history, tool access, Notion integration, and the context to determine whether a message is a gear review or a general question.
The specialised classifier was removed entirely. The state files were removed. The separate handler was simplified to just the Telegram file download (which switched from Node.js fetch to Python requests for WSL2 reliability) and the local transcription call.
One commit replaced hundreds of lines: "fix: voice agent runs in background - doesn't block Telegram polling".
5.3 The Lesson
Don't build middleman pipelines that duplicate what the agent already does. If your agent has conversation history and tool access, feeding it raw text is almost always better than building a separate classification-and-routing layer. The specialised layer will be less capable (no history, no tools, no context), harder to debug (separate logs, separate state), and more fragile (more moving parts, more failure modes).
This principle extends beyond voice. Any time you're tempted to build a pre-processing pipeline that classifies, routes, or transforms input before the agent sees it, ask whether the agent itself could handle the raw input. In most cases, it can, and it'll do it better because it has the full context.
6. The Spec Lies: What We Learned Scanning 20 CMS Platforms
6.1 The Problem
Veltrix operates a website intelligence audit product that scans customer URLs and scores them across SEO, accessibility, performance, and security. The scanner extracts JSON-LD structured data from pages for schema markup analysis. Early production revealed that the scanner was silently dropping schemas from common CMS platforms.
6.2 Platform-Specific Bugs
WordPress with Yoast SEO uses @graph arrays inside a single JSON-LD block. The scanner was reading the top-level object but not expanding the @graph array, missing the individual schema entries (Organization, WebSite, WebPage, BreadcrumbList) nested inside.
WordPress also uses @type as an array for pages with multiple schema types: "@type": ["CollectionPage", "FAQPage"]. The scanner's type extraction code expected @type to be a string, silently dropping array-typed entries.
Next.js applications place multiple schema objects inside a single <script type="application/ld+json"> tag as a JSON array: [{...Organization}, {...WebSite}]. The scanner expected each <script> tag to contain a single JSON object, missing every schema after the first.
6.3 The Fix
Three changes to the scanning engine:
- When parsing a `<script type="application/ld+json">` tag, check if the parsed result is a list. If so, iterate and append each item individually.
- After parsing a JSON-LD object, check for an `@graph` key containing an array. If present, expand each graph item into the schema list.
- In the SEO auditor, handle `@type` as either string or array when extracting schema types.
```python
# Handle arrays of schemas (Next.js pattern): normalise every JSON-LD
# block to a flat list of dicts before appending.
if isinstance(parsed_ld, list):
    schemas = [item for item in parsed_ld if isinstance(item, dict)]
elif isinstance(parsed_ld, dict):
    schemas = [parsed_ld]
else:
    schemas = []

for schema in schemas:
    result['schema_json_ld'].append(schema)
    # Expand @graph arrays (WordPress/Yoast pattern), including
    # @graph keys nested inside array entries.
    graph = schema.get('@graph')
    if isinstance(graph, list):
        result['schema_json_ld'].extend(
            item for item in graph if isinstance(item, dict)
        )
```
6.4 Additional Detection Fixes
The JSON-LD parsing fix surfaced a pattern: the audit tool's detection logic was written against spec-compliant examples, not real CMS output. A systematic review uncovered five more detection gaps:
"About statement" check too narrow. The AEO auditor looked for "we are", "we provide", "our mission" in the first 500 characters. Most sites use "[Brand] is a..." or "[Product] is an..." instead. Adding " is a ", " is an ", "we help", "we build", "we offer" to the pattern list eliminated false negatives across CMS platforms.
Author signals only in text. The GEO auditor searched visible text for author names but didn't check JSON-LD for Person schema or author fields. WordPress, Ghost, and Shopify store author data in structured data, not visible text. Adding schema-based author detection fixed this.
Skip-navigation detection too simple. The accessibility check looked for the word "skip" in the first 3000 characters. Next.js uses href="#main" with sr-only class. WordPress themes use screen-reader-text class. Expanding to check #main, #content, #main-content, and screen-reader utility classes eliminated false positives across frameworks.
Colour contrast false positives. The original check flagged color:#fff as low contrast, but on dark-themed sites (common in SaaS and tech) white text is correct. Without knowing the background colour, hex-based contrast checking produces false positives. The check was narrowed to only flag provably bad patterns (rgba with alpha below 0.3, which is nearly invisible on any background).
Form label detection incorrect. Having an id attribute was counted as having an accessible label. An id alone provides no accessible name. The check was corrected to require an actual <label>, aria-label, or placeholder.
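The corrected form-label rule reduces to a small predicate. This is a sketch of the check's logic (the real auditor operates on parsed HTML); the function signature is an assumption.

```python
def has_accessible_label(input_attrs, label_for_ids):
    """True if an <input> has an accessible name.

    input_attrs: dict of the input element's attributes.
    label_for_ids: set of `for` values from <label> elements on the page.
    An `id` alone is NOT an accessible name; it only counts if a <label>
    actually points at it.
    """
    if input_attrs.get("aria-label") or input_attrs.get("placeholder"):
        return True
    return input_attrs.get("id") in label_for_ids
```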
6.5 Results
The cumulative effect of all detection fixes:
| Category | Before | After |
|---|---|---|
| SEO | 100 | 100 |
| AEO | 83 | 100 |
| GEO | 92 | 100 |
| Accessibility | 90 | 100 |
| Security | 95 | 95 |
| Performance | 95 | 100 |
| Overall | 92 | 99 |
The remaining 1 point comes from two architectural trade-offs in the Security category: unsafe-inline in the CSP script-src (required by Next.js for inline scripts) and a missing COEP header (would break third-party analytics and CDN resources).
6.6 The Broader Lesson
Tools that parse structured data from the web will encounter platform-specific patterns that break assumptions about data format. The fix isn't more tolerance for malformed data. It's testing against the actual output of popular platforms (WordPress, Next.js, Shopify, Wix) rather than the spec. The spec says JSON-LD is a single object. The real world says it can be an array, it can contain nested graph structures, and its type annotations can be strings or arrays.
Equally important: audit reports must include specific evidence, not just issue names. "Potential low contrast text found (1 instance)" is useless to a customer. "color:#fff found in inline style on <span> element" is actionable. Every detection fix was paired with a reporting fix that includes the actual match, surrounding context, and the element it was found in.
7. Running Production AI on WSL2: Every Bug We Hit
Running a production agent on WSL2 introduces a class of problems that bare-metal or container deployments don't have. These are worth documenting because WSL2 is an increasingly common development and deployment target.
7.1 Network Unreliability
Node.js fetch on WSL2 drops connections to external APIs at a rate that's unacceptable for production. The specific failure mode: connections succeed, partial data arrives, then the stream hangs or resets without an error event. The AbortSignal.timeout() API helps but doesn't solve all cases.
The pragmatic fix was to delegate network-intensive operations to Python's requests library, which handles WSL2's network stack more reliably. Telegram file downloads switched from Node.js fetch to a Python helper script (telegram_download.py) that handles the download and returns a JSON result. The download reliability improved immediately.
7.2 Startup Timing
WSL2's network stack isn't ready at boot time. The Telegram polling client would fail on the first connection attempt, and without retry logic, the service would start without Telegram connectivity.
The fix: a 10-second initial delay on Telegram startup, followed by 10 connection retries with the first 3 failures logged silently (to avoid false alarm noise during normal startup).
7.3 Message Queue
Even after the startup fix, transient network drops during operation caused Telegram sends to fail. The solution was a background message queue: failed sends go into a retry queue that processes every 5 seconds, with a maximum of 30 retries (~2.5 minutes) before giving up on a message.
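The queue mechanics can be sketched as follows. The 5-second cadence and 30-retry cap are from the text; the class shape and the injected `send` callable (returning `True` on success) are illustrative.

```python
from collections import deque

MAX_RETRIES = 30  # at one tick per 5 seconds, ~2.5 minutes of retries

class RetryQueue:
    def __init__(self, send):
        self.queue = deque()
        self.send = send  # returns True on successful delivery

    def enqueue(self, message):
        self.queue.append((message, 0))

    def tick(self):
        """Called every 5 seconds by a background timer."""
        for _ in range(len(self.queue)):
            message, attempts = self.queue.popleft()
            if self.send(message):
                continue                          # delivered
            if attempts + 1 < MAX_RETRIES:
                self.queue.append((message, attempts + 1))
            # else: give up on this message after ~2.5 minutes
```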
7.4 Long-Term Fix
The planned migration to Discord (tracked as CORE-009 in the internal ticket system) addresses the root cause. Discord's WebSocket-based connection model handles reconnection natively, eliminating the polling-and-retry pattern that makes Telegram fragile on WSL2.
8. Supporting Mechanisms
8.1 Task Profiles
Rather than named persistent agents, the system uses scoped task profiles loaded from a YAML configuration. Each profile defines:
- Allowed tools: Which tools this profile can access (e.g., the `admin` profile gets email and file tools but not deployment tools)
- Model tier: What complexity level to route at (controls cost)
- Lesson categories: Which lessons from the structured store to inject into the system prompt
- System prompt supplement: Profile-specific instructions and behavioural rules
Thirteen profiles cover the system's operational domains: code, research, content, ops, admin, review, finance, data, sales, customer, product, strategy, and procurement. Task classification uses keyword matching against a mapping table in the YAML configuration, with multi-word keywords scoring higher than single words to prefer specific matches.
This approach is cheaper and simpler than persistent agent instances. Profiles are stateless, the configuration is version-controlled, and spawning a sub-agent with a different profile just means changing which tools are available and which lessons are injected. No separate processes, no inter-agent communication protocols, no state synchronisation.
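The keyword classification could look like the sketch below. The mapping here is a made-up fragment for illustration, not the real YAML; the scoring rule (multi-word matches weigh more) follows the description above.

```python
# Hypothetical fragment of the profile keyword table.
PROFILE_KEYWORDS = {
    "code": ["deploy", "fix bug", "refactor"],
    "admin": ["email", "calendar invite"],
    "content": ["blog post", "social post"],
}

def classify_profile(message, default="admin"):
    text = message.lower()
    best, best_score = default, 0
    for profile, keywords in PROFILE_KEYWORDS.items():
        # Multi-word keywords score higher: weight = number of words matched.
        score = sum(len(kw.split()) for kw in keywords if kw in text)
        if score > best_score:
            best, best_score = profile, score
    return best
```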
8.2 Structured Lesson Store
Lessons are stored in SQLite with full-text search, categorised by domain (tool_usage, architecture, workflow, environment, security, etc.), tagged with severity, and tracked by occurrence count. The system prompt loader queries the top 10 lessons by relevance and injects them into the stable prefix (enabling prompt caching).
Lessons are also queried at tool-call time: before executing a tool, the system checks if there are known corrections for that tool-and-argument combination. If corrections exist, they're injected as a system message after the tool results, giving the model corrective context for the next iteration.
This creates a feedback loop: mistakes generate lessons, lessons modify future behaviour, and the occurrence count tracks whether a lesson is still relevant or has been superseded.
8.3 Prompt Caching Architecture
The system prompt is split into two parts:
- Stable prefix: Identity (SOUL.md), agent rules (AGENTS.md), tool documentation (TOOLS.md), top lessons. This changes rarely within a session and gets the `cache_control` annotation for Anthropic models.
- Dynamic suffix: Current context (CONTEXT.md), date, active portfolio state. This changes per call and is never cached.
The 80.4% cache hit rate on Sonnet validates this split. The stable prefix contains the bulk of the tokens (identity, rules, and documentation are collectively several thousand tokens), while the dynamic suffix is much smaller.
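The split can be sketched as follows; the file contents and prompt assembly are stand-ins, and only the cache_control placement follows Anthropic's documented message format:

```python
# Stable-prefix / dynamic-suffix split for Anthropic prompt caching.
# The stable block carries the cache_control annotation; the dynamic
# block changes per call and is deliberately left uncached.
def build_system_blocks(identity: str, rules: str, tools: str,
                        lessons: str, context: str, today: str) -> list[dict]:
    stable = "\n\n".join([identity, rules, tools, lessons])
    return [
        # Stable prefix: cached across calls within a session.
        {"type": "text", "text": stable,
         "cache_control": {"type": "ephemeral"}},
        # Dynamic suffix: per-call state, never cached.
        {"type": "text", "text": f"Date: {today}\n\n{context}"},
    ]
```

Keeping the lessons inside the stable block is what lets the lesson store grow without eroding the cache hit rate, as long as the top-10 selection changes infrequently within a session.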
9. Results Summary
9.1 Key Metrics
| Metric | Value |
|---|---|
| Operating period | 18 days (Mar 23 - Apr 9, 2026) |
| Total API calls | 1,562 |
| Total cost | $50.43 |
| Success rate | 99.7% (1,557/1,562) |
| Local model calls | 101 (6.5% of total) |
| Local model cost | $0.00 |
| Budget-compliant days | 12/18 (67%) |
| Prompt cache hit rate (Sonnet) | 80.4% |
| Estimated cache savings | ~$7.27 |
9.2 Cost Efficiency Over Time
Weekly cost decreased from $33.11 (Week 1) to $5.84 (Week 3, partial), driven by progressive implementation of cost controls, local model routing, and a budget-tier cloud model. The average cost per call across all models was $0.032, but this masks the bimodal distribution: cloud calls averaged $0.035 while local calls cost nothing.
9.3 Model Quality by Tier
Quality scores are limited by the binary nature of tool-call success/failure scoring. The most meaningful signal is the success rate: every model except a legacy Claude 3.5 Haiku model identifier (a misconfiguration, quickly fixed in commit c8354bd) achieved a success rate of 98.8% or higher. The local model's 100% success rate reflects its restricted routing to text-only tasks where failure modes are minimal.
9.4 Budget Adherence
The system exceeded its $2 budget on 6 of 18 days. Five of the six overspend days occurred in the first 10 days of operation, before per-task budgets and rate limiting were fully implemented. The most recent week showed 3 of 4 days under budget, with the single overspend day ($4.31) traced to stale cost control documentation that has since been updated.
10. What Worked, What Didn't, and What We'd Change
10.1 Cost vs Quality vs Latency
The four-tier model hierarchy creates a three-way trade-off. Local models are free and fast (typical latency under 5 seconds for a 14B model on GPU) but can't handle tool calling or complex reasoning. Budget cloud models add tool-calling capability at minimal cost but with weaker reasoning. Frontier models handle everything but at a cost that blows through the daily budget in a handful of calls.
The production evidence suggests this trade-off is navigable. The key is honest task classification. Simple text generation, classification, and formatting genuinely don't need a frontier model. The danger is in misclassifying a task as simple when it actually requires multi-step reasoning or tool calling, at which point the local model fails silently (produces plausible but wrong output) rather than explicitly.
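A pre-call routing sketch of this trade-off, with illustrative tier names, per-call cost estimates, and capability sets (the production hierarchy and prices differ):

```python
# Tiers ordered cheapest-first: (name, estimated $ per call, capabilities).
TIERS = [
    ("local-14b", 0.000, {"text"}),
    ("budget",    0.002, {"text", "tools"}),
    ("mid",       0.010, {"text", "tools", "reasoning"}),
    ("frontier",  0.080, {"text", "tools", "reasoning", "hard"}),
]

def route(required: set[str], budget_left: float) -> str:
    """Pick the cheapest tier whose capabilities cover the task's needs
    and whose estimated cost fits the remaining daily budget."""
    for name, cost, caps in TIERS:
        if required <= caps and cost <= budget_left:
            return name
    # Nothing affordable can do the job: defer rather than overspend.
    return "defer"
```

This is the formula from the abstract made concrete: the decision happens before the call, and the honest-classification risk lives entirely in how `required` is populated.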
10.2 When to Use Local Models
The production data suggests local models are viable for production when three conditions hold:
- The task doesn't require tool calling
- The output has a clear quality signal (character limits, format requirements, factual verification against a known source)
- The generate-score-repair pipeline can catch and correct the most common failure modes
For tasks that lack a clear quality signal (open-ended reasoning, strategy recommendations, code generation), routing to a cloud model is the safer choice. The cost of a wrong answer from a local model often exceeds the API cost of getting a correct answer from a cloud model.
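The generate-score-repair pipeline reduces to a small control loop; `generate` and `repair` here stand in for local model calls, and the scorer encodes the explicit quality signal (a character limit, in this hypothetical example) that makes local routing safe:

```python
def generate_score_repair(generate, score, repair, max_repairs: int = 1) -> str:
    """Generate a draft, check it against explicit quality rules, and run
    at most `max_repairs` repair passes (one, to limit latency)."""
    draft = generate()
    for _ in range(max_repairs):
        problems = score(draft)
        if not problems:
            break
        draft = repair(draft, problems)
    return draft
```

The three conditions above map directly onto the loop: no tool calls inside it, a scorer that returns concrete problems, and a repair step that can fix the common ones.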
10.3 The Importance of Model Monitoring
During the observation period, the system transitioned from Claude 3.5 Haiku as the budget tier to Arcee AI Trinity-Large-Thinking. The trigger was a model scout automation (monthly, 1st of each month) that evaluates emerging models against cost-quality benchmarks. Trinity offered stronger reasoning and tool calling at less than a third of Haiku's price ($0.22/$0.85 vs $0.80/$4.00 per million tokens).
This kind of continuous model evaluation is essential for cost-constrained systems. The frontier-model pricing environment changes rapidly, and today's budget model might be tomorrow's overpriced legacy option. The system's learning router helps: by tracking quality scores per model per task type, it can detect when a cheaper model is outperforming an expensive one and adjust routing accordingly.
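One way to sketch the learning router's quality tracking; the data structures, sample threshold, and tolerance are assumptions, not the production values:

```python
from collections import defaultdict

class QualityTracker:
    """Running mean quality score per (model, task_type) pair."""
    def __init__(self):
        self.stats = defaultdict(lambda: [0.0, 0])  # (model, task) -> [sum, n]

    def record(self, model: str, task: str, score: float) -> None:
        s = self.stats[(model, task)]
        s[0] += score
        s[1] += 1

    def mean(self, model: str, task: str) -> float:
        total, n = self.stats[(model, task)]
        return total / n if n else 0.0

def prefer_cheaper(tracker, task, expensive, cheap,
                   min_samples=5, tolerance=0.05):
    """Route to the cheap model once it has enough observations and its
    mean quality is within tolerance of the expensive model's."""
    cheap_n = tracker.stats[(cheap, task)][1]
    if cheap_n >= min_samples and \
       tracker.mean(cheap, task) >= tracker.mean(expensive, task) - tolerance:
        return cheap
    return expensive
```

The minimum-sample guard matters: without it, a single lucky response from a cheap model would flip routing before the quality estimate means anything.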
10.4 Limitations
Quality scoring is crude. The turn-level binary success/failure scoring doesn't capture output quality for text generation tasks. A social post that is technically successful (tool calls worked) but reads poorly still scores as a full success. Better scoring would require either human evaluation or a separate quality assessment model, both of which add cost or latency.
The observation period is short. Eighteen days isn't enough to draw statistically strong conclusions about cost trends or quality distributions. The early overspend days dominate the aggregate statistics.
WSL2 is not a typical production environment. The network reliability issues and workarounds described in Section 7 are specific to this deployment target. A bare-metal or container deployment would eliminate several failure modes.
The $2 budget is arbitrary relative to the workload. The system manages a relatively light operational load (averaging 87 API calls per day). A higher-traffic system would require either a proportionally larger budget or more aggressive local model routing.
Single operator bias. All quality judgements, lesson creation, and architectural decisions reflect a single operator's preferences and domain knowledge. A multi-operator system would need mechanisms for resolving conflicting quality assessments and operational rules.
11. Related Work
11.1 Agent Frameworks
AutoGPT (Significant Gravitas, 2023) demonstrated recursive self-improving agents but with no cost controls, leading to uncontrolled API spend in production deployments. BabyAGI (Nakajima, 2023) introduced task-driven autonomous agents with a simpler architecture but similarly unbounded resource usage. CrewAI (Moura, 2024) added multi-agent orchestration with role specialisation but treats cost as a monitoring concern rather than a routing constraint.
Veltrix differs from these frameworks in treating cost as a first-class architectural concern that directly influences model selection, task routing, and system behaviour. The degradation state machine has no equivalent in existing frameworks; most either run at full capability or fail entirely.
11.2 Cost-Aware LLM Systems
FrugalGPT (Chen et al., 2023) proposed LLM cascading, where queries route to progressively more expensive models only when cheaper models fail. Veltrix's routing is similar in principle but differs in implementation: routing decisions are made before the call (based on task classification and budget state) rather than after (based on output quality). This avoids the latency cost of cascading but requires accurate upfront task classification.
RouteLLM (Ong et al., 2024) trained a router model to predict which LLM would produce the best output for a given query. Veltrix's approach is simpler (rule-based with historical quality scores) but requires no training data and adapts to new models immediately through the learning router.
11.3 Local Model Quality
ATLAS (Izacard et al., 2023) demonstrated that retrieval-augmented generation could bring smaller models closer to frontier performance. The smart_local.py pipeline is inspired by ATLAS's approach of using multiple passes to improve output quality, adapted for a generate-score-repair workflow that doesn't require external retrieval.
Self-Refine (Madaan et al., 2023) showed that iterative self-feedback improves output quality across tasks. The repair phase of the local model pipeline is a simplified version of self-refine, constrained to a single repair pass to limit latency.
11.4 Agent Reliability
Voyager (Wang et al., 2023) introduced a skill library for Minecraft agents that persists learned behaviours. Veltrix's lesson store serves a similar function, persisting operational corrections that modify future behaviour through prompt injection and tool-call-time correction hints.
Reflexion (Shinn et al., 2023) used verbal reinforcement to improve agent task completion. Veltrix's quality scoring and lesson generation provide a production-oriented version of this concept, where reflections are stored structurally and queried contextually rather than maintained in-context.
12. Conclusion
From $33 in Week 1 to $6 in Week 3. Same workload. Same agent. Different architecture.
The mechanisms described here aren't complicated. Tiered routing matches task complexity to model cost. Progressive degradation reduces autonomy instead of failing. A generate-score-repair pipeline makes a 14B model viable for tasks that would otherwise cost $0.03 each. Persistent lessons prevent the same mistake from costing money twice.
None of these require a research lab. They require treating cost as an engineering constraint rather than a line item.
What would change with more budget? The system would route more tasks to frontier models. Quality on hard tasks would improve. But the architecture wouldn't change. Tiered routing, degradation management, local model scaffolding — these are the same patterns you'd use at $200/day. The $2 target just forced us to build them on day one instead of discovering we needed them on the day the invoice arrived.
Every agent framework in production will eventually need these patterns. The question is whether you build them before or after the $300 overnight surprise.
For every well-funded research lab running unconstrained agent loops, there are thousands of builders who need their agent to run sustainably on real money, not grant money. This paper is for them.
Cost-First Agent Architecture isn't a constraint. It's a design philosophy. And the agents it produces are better for it.
Appendix A: System Configuration
Hardware: Intel CPU (WSL2), 48GB RAM, NVIDIA RTX 5060 Ti (16GB VRAM)
Software stack: Node.js 22 (TypeScript), Python 3.11, SQLite (better-sqlite3), Ollama (qwen2.5:14b), faster-whisper (large-v3)
External services: OpenRouter (model routing), Telegram (user interface), GitHub, Notion, Zoho Mail, Vercel, Brevo, Supabase, Stripe, ElevenLabs
Codebase: ~12,800 lines of TypeScript, ~3,000 lines of Python automations
Database tables: routing_history (1,562 rows), action_log, entities, relationships, lessons, error_fingerprints, chat_sessions, heartbeat_history, oauth_tokens, degradation_state
References
Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176.
Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., & Grave, E. (2023). Atlas: Few-shot Learning with Retrieval Augmented Language Models. Journal of Machine Learning Research, 24(251), 1-43.
Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K.M., Welleck, S., Yazdanbakhsh, A., & Clark, P. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.
Moura, J. (2024). CrewAI: Framework for Orchestrating Role-Playing Autonomous AI Agents. GitHub repository.
Nakajima, Y. (2023). BabyAGI: Task-Driven Autonomous Agent. GitHub repository.
Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J.E., Kadous, M.W., & Stoica, I. (2024). RouteLLM: Learning to Route LLMs with Preference Data. arXiv:2406.18665.
Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
Significant Gravitas. (2023). AutoGPT: An Autonomous GPT-4 Experiment. GitHub repository.
Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. NeurIPS 2023.