Credit Efficiency Engine

ClaudeJr

The layer that sits in front of every Claude & GPT call and cuts your bill up to 90% — using prompt caching, cross-provider model routing, a self-verifying speculative cascade, the Batch API, response & semantic caching and prompt compression. All stacked automatically. Built to protect your output quality, not trade it for savings.

Download for VS Code & CLI ↓ Calculate my savings Source on GitHub ↗
● LIVE · running inside APxAI today Claude + GPT, one router 0 runtime dependencies
~90%
off repeated context (prompt cache)
50%
off async work (Batch API)
100%
off exact repeats (response cache)
0
quality traded away — same model, same output

Measured, not asserted.

Every figure here is reproducible. ClaudeJr ships with a deterministic benchmark that drives the real engine over a fixed 1,500-call workload, prices each call at published provider rates, and decomposes the saving per technique — gated by sixteen honesty invariants that fail the build if a single dollar can't be traced to provider pricing.

66%
measured cost cut on the standard path
49%
measured cut on the speculative cascade
±$0.01
per-technique attribution telescopes to the total
16/16
honesty invariants enforced in CI

$ npm run claudejr:bench  # reproduce every number above

the real, verbatim output — not a screenshot, not a claim
──────────────────────────────────────────────────────────────────────────
  ClaudeJr savings benchmark — MEASURED against real catalog pricing
──────────────────────────────────────────────────────────────────────────
  workload      : 1500 calls · seed 0x5eed1009 · reproducible
  engine sees   : 45.4% simple · 39.3% standard · 15.3% complex  (classifyComplexity)
  baseline      : claude-opus for EVERY call (flagship-for-everything)
  system prefix : 74 tokens, shared by all calls
──────────────────────────────────────────────────────────────────────────
  ENGINE BEHAVIOR (what the real engine actually did)
    cache hit rate     : 28.9%  (310 exact · 124 semantic)
    live model calls   : 1066 of 1500
    routed below opus  : 902 of 1066 live  (84.6%)
    input tokens saved : 4,168 via compression
    volatile bypasses  : 25 paraphrase probes, 0 served stale
──────────────────────────────────────────────────────────────────────────
  MEASURED COST (priced at published per-token rates)
    baseline spend     : $62.55
    ClaudeJr spend     : $21.16
    saved              : $41.39   →   66.2% reduction
──────────────────────────────────────────────────────────────────────────
  WHERE THE SAVING COMES FROM (telescopes exactly to total)
    response+sem cache         $17.77     28.4%
    smart routing              $21.69     34.7%
    prompt compression          $0.06      0.1%
    output right-sizing         $1.87      3.0%

    sum of buckets             $41.39     66.2%
──────────────────────────────────────────────────────────────────────────
  SPECULATIVE CASCADE  (runVerified over 400 live calls)
    draft accept rate  : 67.5%  (143/212 cheap drafts shipped)
    cascade spend      : $9.02 vs $17.88 baseline  →  49.5% reduction
──────────────────────────────────────────────────────────────────────────
  SINGLE-FLIGHT COALESCING  (burst of 50 identical concurrent calls)
    upstream calls     : 1 of 50  (49 duplicates coalesced for $0)
    saved on the burst : $1.99 vs $2.03 had every duplicate hit the provider
──────────────────────────────────────────────────────────────────────────
  VERIFIED GRAY-ZONE CACHE  (correctness-gated extra hits, opt-in verifier)
    no verifier        : near-match NOT served — the gray zone is never gambled
    rejecting verifier : wrong-region answer NOT served — a real call is made
    accepting verifier : paraphrase served for ~$0, netted of the verify cost
    → safely LOWER the match threshold for more hits with zero wrong answers
──────────────────────────────────────────────────────────────────────────
  MODELED (website ROI calculator on the equivalent workload)
    projected reduction: 89%   (measured run() path: 66.2%)
    note: measured path excludes Batch API + prompt-cache breakpoints +
          the cascade, so the measured number is deliberately CONSERVATIVE.
──────────────────────────────────────────────────────────────────────────
  HONESTY INVARIANTS
    [PASS] exact repeats cost $0 (cache served them)
    [PASS] compression NEVER inflated input tokens
    [PASS] routing NEVER dropped below the task quality floor
    [PASS] volatile asks NEVER served from the semantic cache (freshness)
    [PASS] semantic cache caught paraphrased repeats
    [PASS] a big shared system prompt never serves a wrong cached answer
    [PASS] per-technique buckets telescope to total saved (±$0.01)
    [PASS] measured reduction is a real fraction (0 < r < 1, no free lunch)
    [PASS] workload is realistic, not cache-saturated (live ≥ 55%)
    [PASS] the ROI model never projects past its 92% honesty cap
    [PASS] right-sizing never lowers an explicit caller cap
    [PASS] single-flight collapsed a concurrent identical burst to ONE upstream call
    [PASS] single-flight coalesced exactly N−1 duplicates, ledgered honestly
    [PASS] verified gray-zone: with NO verifier a near-match is never served (correctness-first)
    [PASS] verified gray-zone: a rejecting verifier never serves a wrong-region answer (LO≪HI)
    [PASS] verified gray-zone: an accepting verifier serves a paraphrase, NET of verify cost
──────────────────────────────────────────────────────────────────────────
  RESULT: ALL INVARIANTS HOLD — numbers are real and bounded.
──────────────────────────────────────────────────────────────────────────

Baseline = the flagship model (Claude Opus) on every call — a deliberately unfavorable yardstick, so savings against a sane setup run higher, not lower. The harness fails CI if routing ever drops below a task's quality floor, if a cached answer is ever wrong or stale, or if the modeled projection exceeds its 92% honesty cap.

Twelve real levers. Stacked on every call.

These aren't tricks — they're the actual pricing mechanics of Claude and GPT, applied automatically so you don't have to think about them. Every dollar traces back to published provider pricing.

01 · PROMPT CACHINGPrefix KV-cache reuse

Stop paying for the same context twice

Your system prompt, instructions and pinned files are identical on every call. ClaudeJr marks that stable prefix with a cache breakpoint, so the provider reads it at ~10% of input price instead of re-billing it in full each turn.

biggest single lever · up to 90% off input
02 · SMART MODEL ROUTINGLeast-cost routing · quality-floored

The cheapest engine that still clears the bar

A fast classifier scores each task and routes it — across Claude and GPT — to the cheapest model whose reasoning tier meets the task's quality floor. Trivial calls land on Haiku / GPT-4o-mini; hard ones escalate to Sonnet or Opus. Never below the floor.

routes down, never below quality
03 · RESPONSE CACHEContent-addressed memoization · SHA-256

Identical requests cost zero

A SHA-256 fingerprint of the normalized request keys a TTL + LRU cache. Ask the same thing twice and the second answer returns instantly for 0 credits — no network, no model.

100% off exact repeats
04 · BATCH APIAsynchronous batch dispatch

Half price for anything async

Work that isn't blocking a user — evals, backfills, agent fan-out — is dispatched through the Batch API at a flat 50% discount, then reconciled back transparently.

50% off non-urgent work
05 · PROMPT COMPRESSIONLossless token deduplication

Fewer input tokens, same meaning

Collapses dead whitespace, blank-line runs and verbatim-duplicate context blocks before billing. The model sees identical meaning — you pay for fewer tokens on the variable part of every prompt.

trims the part caching can't
06 · SEMANTIC CACHEApproximate-match cache · Jaccard n-gram · verifier-gated gray zone

Catch the paraphrased repeats too

"List the steps" and "what are the steps" are the same question. ClaudeJr matches on meaning (Jaccard similarity over text shingles), not just exact bytes — so near-duplicate prompts hit cache and cost $0. Only for deterministic calls, so variety is never lost. And a borderline match is never guessed: below the safe-serve threshold sits an optional verifier-gated gray zone — a near-match there is served only if a cheap check you supply confirms it genuinely answers the new prompt, otherwise it falls through to a real call. That lets you lower the threshold for more hits with provably zero wrong answers, and the verify cost is netted out of the reported saving.

more hits · never a wrong answer
07 · BUDGET GUARDRAILSSpend governance · hard ceiling

A hard cap so you can't get burned

Set a monthly ceiling. ClaudeJr tracks real spend and blocks calls before they blow the budget — the safety net every broke founder and every Fortune 500 finance team actually wants.

no surprise bills
08 · SAVINGS LEDGERCost-attribution telemetry

Proof, not promises

Every call records tokens and dollars saved, exact vs semantic hit rate, spend-to-date and routed model. You get a running total you can put in front of finance — or an investor.

auditable, per call
09 · SPECULATIVE CASCADEDraft-verify-escalate cascade

Draft cheap, verify, escalate only if needed

A fast model writes a draft; a verifier checks it for refusals, truncation and invalid output; only when the draft misses the bar does ClaudeJr escalate to a flagship. You get flagship quality on the hard minority, cheap-model price on the easy majority — and a draft-accept rate in the ledger that proves it.

flagship quality, cheap-model price
10 · STALENESS GUARDFreshness-aware invalidation

Never serve a stale answer to save a buck

Caching repeats is free money — except for time-sensitive asks ("today", "latest", "price of…"). ClaudeJr flags those and bypasses the paraphrase cache for them, so you save on safe repeats without ever returning a stale reply where freshness matters.

savings that can't go stale
11 · OUTPUT RIGHT-SIZINGAdaptive completion budgeting

A trivial ask can't run up a giant bill

Output tokens are the most expensive tokens. When a caller sets no completion limit, ClaudeJr fills in a sane per-task ceiling — small for a one-liner, generous for hard work. It never lowers a limit you set, so it can't truncate intended work.

caps the priciest tokens
12 · REQUEST COALESCINGSingle-flight in-flight de-duplication

A concurrent burst becomes one upstream call

The newest lever. When identical requests arrive at the same moment, the response cache can't help — none has finished yet, so each would hit the provider. ClaudeJr collapses the burst with single-flight: one call goes upstream and the duplicates ride along on its exact result for $0. On any failure each caller falls back to its own attempt, so it only ever removes duplicate work — never changes an answer.

one call for the whole burst

One router. Claude and GPT.

ClaudeJr is provider-agnostic. It knows each model's price and reasoning tier, then sends every task to the cheapest option that still meets the quality floor — so you're never locked in and never overpaying.

ModelProviderInput /1MOutput /1MCache readTier

Published list prices, USD per million tokens. Cache-read price is what you pay for the stable prefix after the first call — the heart of the savings.

See your savings

Match the sliders to your workload. The math below is exactly what ships in the engine (projectSavings()) — same pricing, same formulas.

$0
estimated saved / month
$0 / year
Baseline spend$0
With ClaudeJr$0
Reduction0%
By technique
Prompt caching$0
Model routing$0
Batch API$0
Response cache$0
Compression$0
Illustrative, capped at 92% (techniques overlap). Actual savings depend on your traffic and prompt shape. Quality is unaffected — routing enforces a per-task quality floor.

🛡️ The "never alter quality" design principle

Every lever removes waste, not intelligence. Caching returns the same bytes. Compression preserves meaning. Routing only ever moves a task down to a model that still clears its quality floor — and escalates the moment a task looks hard. The speculative cascade goes further: it verifies a cheap draft and falls back to a flagship the instant the draft misses the bar, so it can only ever improve an answer, never ship a worse one to save money. When in doubt, ClaudeJr spends more, not less. You can verify it: the ledger logs the model, tier and draft-accept rate for every single call.

Drop-in. One wrap.

ClaudeJr already runs inside APxAI's LLM path. Download the two files below, drop them in, and wrap the SDK client you already have — Claude or GPT — in one line. The adapter mirrors the exact translation APxAI runs in production.

// two files you download below — 0 runtime deps (node:crypto only)
import { claudeJr } from "./claudejr";
import { makeAnthropicExecutor } from "./claudejr-adapters";

// Wrap the SDK client you already have — that is the whole integration.
const exec = makeAnthropicExecutor(new Anthropic());   // or makeOpenAIExecutor(new OpenAI())

// Cheapest path: cache, compression, cross-provider routing, coalescing & the ledger — free
const result = await claudeJr.run(req, exec);

// Or cheap-first-then-verify: drafts on a fast model, auto-verifies, and
// only escalates to a flagship the instant the draft misses the bar
const best = await claudeJr.runVerified(req, exec);

console.log(result.content, result.savings.estimatedUsdSaved);  // real answer + per-call saving
console.log(claudeJr.savingsReport());                  // cumulative ledger + draftAcceptRate

Download & install

Free. Self-contained. Runs anywhere Node runs — your laptop, your CI, your VS Code. Drop it in once and every Claude / GPT call gets cheaper automatically.

📦

Engine + adapters + shield · the real thing

The actual engine that runs inside APxAI — zero runtime dependencies (only Node's built-in crypto) — plus the one-line SDK adapters. Drop both into any TypeScript or Node project.

import { claudeJr } from "./claudejr";
import { makeAnthropicExecutor } from "./claudejr-adapters";
Download claudejr.ts ↓ claudejr.js ↓
claudejr-adapters.ts ↓ claudejr-adapters.js ↓
🛡 claudejr-shield.ts ↓ claudejr-shield.js ↓

Privacy Shield — redacts your API keys, secrets & PII before any prompt reaches OpenAI / Anthropic / xAI, then restores them in the answer. Your keys never leave your box.

Verify integrity: SHA256SUMS.txtshasum -a 256 -c SHA256SUMS.txt

🧩

VS Code extension

Mission Control + ClaudeJr savings, right in your editor.

Bundled with the APxAI extension

Get the extension →

Proxy gateway · works with anything

The universal path. Start the proxy, then point any SDK's base URL at it — Python, Go, Ruby, anything — and every call gets cached, routed and compressed. Zero code changes.

# 1. build & start it from source (Node 20+)
git clone https://github.com/caiocarvalho93/APxAI && cd APxAI
npm install && npm run build
OPENAI_API_KEY=… ANTHROPIC_API_KEY=… npm run claudejr -- --port 8787
# 2. point your client at it
base_url="http://localhost:8787/v1"
Self-host the proxy ↗

Fully OpenAI-compatible: streaming (stream: true → SSE), /v1/models, and optional bearer-token auth — so existing clients work unchanged. Dependency-free at runtime (node:crypto only) — nothing phones home, your prompts never leave your machine. The proxy is the same engine APxAI runs in production: drop it in front of OpenAI, Anthropic, or both, in any language.