<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Sovereign Trust]]></title><description><![CDATA[Sovereign Trust]]></description><link>https://sovereign.geoclear.io</link><image><url>https://substackcdn.com/image/fetch/$s_!tBJk!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e7d3b5-ed8b-4a82-808d-a220ee99ed2d_512x512.png</url><title>Sovereign Trust</title><link>https://sovereign.geoclear.io</link></image><generator>Substack</generator><lastBuildDate>Fri, 01 May 2026 08:00:35 GMT</lastBuildDate><atom:link href="https://sovereign.geoclear.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Shailesh Bhujbal]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[shaileshjgd@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[shaileshjgd@substack.com]]></itunes:email><itunes:name><![CDATA[Shailesh Bhujbal]]></itunes:name></itunes:owner><itunes:author><![CDATA[Shailesh Bhujbal]]></itunes:author><googleplay:owner><![CDATA[shaileshjgd@substack.com]]></googleplay:owner><googleplay:email><![CDATA[shaileshjgd@substack.com]]></googleplay:email><googleplay:author><![CDATA[Shailesh Bhujbal]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Launching the GeoClear Notary: Verification over Information]]></title><description><![CDATA[Every JSON response now carries a cryptographic receipt. Here's why that changes everything.]]></description><link>https://sovereign.geoclear.io/p/launching-the-geoclear-notary-verification</link><guid isPermaLink="false">https://sovereign.geoclear.io/p/launching-the-geoclear-notary-verification</guid><dc:creator><![CDATA[Shailesh Bhujbal]]></dc:creator><pubDate>Fri, 01 May 2026 06:25:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tBJk!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e7d3b5-ed8b-4a82-808d-a220ee99ed2d_512x512.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Here's what GeoClear just notarized while you opened this post &#8212; request hash, response hash, endpoint, timestamp, all signed by an HSM-bound key. Verify it yourself in 30 seconds.</em></p><pre><code>{
  "iss": "https://geoclear.io",
  "iat": 1777612849,
  "endpoint": "/api/health",
  "req_hash": "sha256:6e849e1d1d0fd7a01ce7258340d6fdbf60ba68db987ba000f1fb87d2ed2f64f2",
  "resp_hash": "sha256:e2f20304ea4586a988d287a9b914b77be7d9425044a4299a9675ace73bb413bc",
  "status": 200,
  "kid": "geoclear-response-signing-2026"
}</code></pre><p>Two posts ago, we asked whether machines that move money should be allowed to operate without explaining themselves.</p><p>Last post, we ran the same coding task across 8 leading AI models &#8212; three times &#8212; and watched the same prompt produce visibly different answers, run after run. Probability is not a substitute for proof.</p><p>If you read those two posts and felt the discomfort, this post is the answer.</p><h2>The Notary, not the Log.</h2><p>A Transaction Notary is not a logger. A logger captures <em>what</em> happened. A notary captures <em>that it happened correctly</em> &#8212; and produces evidence that survives without the notary.</p><p>Every machine decision GeoClear makes now carries a cryptographic receipt that you can verify, archive, and present as evidence &#8212; without trusting our database, our uptime, or our continued existence.</p><h2>What's actually shipping right now.</h2><p>Every JSON response on <code>geoclear.io</code> &#8212; every API endpoint, every demo lookup, every MCP tool call &#8212; now carries two new HTTP headers:</p><ul><li><p><code>X-GeoClear-Receipt</code> &#8212; a JWS (JSON Web Signature) over the response body, signed with <code>ES384</code>.</p></li><li><p><code>X-GeoClear-Receipt-Kid</code> &#8212; the public key ID, so verifiers know which key signed it.</p></li></ul><p>The signing key is an <strong>HSM-bound ECDSA P-384 key in AWS KMS</strong>, on the FIPS 140-2 Level 3 validated path. The key cannot be exported. The HSM does the math; the host machine never sees the private bits.</p><p>Every emitted receipt is also written to an append-only audit table inside our database. <code>UPDATE</code> and <code>DELETE</code> privileges on that table are revoked at the database layer &#8212; even we cannot rewrite history.</p><p>The public verifier &#8212; <code>npm install @geoclear/verify-receipt</code> &#8212; is open source (MIT) and validates anything we've ever signed. The public key lives at <code>https://geoclear.io/.well-known/jwks.json</code>. You can verify a response offline, six months from now, with no network call to us.</p>
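<p><em>Here is a minimal sketch of those checks, assuming the header carries a compact JWS whose payload is the receipt JSON shown at the top of this post. The real envelope may differ, so treat the shape as illustrative and use <code>@geoclear/verify-receipt</code> in practice; this sketch uses the generic <code>jose</code> library instead.</em></p><pre><code>// Sketch only, not the official verifier. Assumes X-GeoClear-Receipt is a
// compact JWS whose payload is the receipt JSON shown above.
import { createLocalJWKSet, compactVerify } from "jose";
import { createHash } from "node:crypto";

// Fetch and cache the published JWKS once; every verification after this
// point works offline.
const jwks = createLocalJWKSet(
  await (await fetch("https://geoclear.io/.well-known/jwks.json")).json()
);

export async function verifyReceipt(receiptJws: string, responseBody: string) {
  // 1. Check the ES384 signature against the published public key.
  const { payload } = await compactVerify(receiptJws, jwks);
  const claims = JSON.parse(new TextDecoder().decode(payload));

  // 2. Check that the receipt is bound to the body you actually received.
  const bodyHash =
    "sha256:" + createHash("sha256").update(responseBody).digest("hex");
  if (claims.resp_hash !== bodyHash) {
    throw new Error("receipt does not match response body");
  }
  return claims; // iss, iat, endpoint, req_hash, resp_hash, status, kid
}</code></pre>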
<h2>Verification over Information.</h2><p>Information without a verifiable signature is a claim. A signed receipt is evidence.</p><p>When you make a decision based on a GeoClear response, you're not deciding on our word. You're deciding on a cryptographic artifact you can verify offline, replay six months from now, and present in court if you have to.</p><p>That's the difference between trusting an API and using a Notary.</p><h2>What's next.</h2><p>Today: Location Notarization. We started with location because it's the hardest class of decision to prove &#8212; and if a rooftop, an address, a flood polygon can be notarized, anything can.</p><p>We are currently opening early-access tiers for high-precision notarization: from <strong>Climate Risk</strong> and <strong>Flood Determination</strong> to <strong>Drone Deliverability</strong> and <strong>Sovereign Underwriting</strong>. If your machine needs to prove its work, we are building the envelope.</p><p>The roadmap is verifiable infrastructure for the machine economy. Not <em>Trust Me</em>. <em>Verify Me</em>.</p><h2>Try it.</h2><ul><li><p><strong>Live demo:</strong> https://geoclear.io/security#receipt-demo</p></li><li><p><strong>Verifier package:</strong> <code>npm install @geoclear/verify-receipt</code></p></li><li><p><strong>See a live receipt header in your terminal:</strong> <code>curl -sI https://geoclear.io/api/health | grep -i x-geoclear</code></p></li></ul><p><strong>Stop trusting black-box logs. Start holding the proof.</strong></p>]]></content:encoded></item><item><title><![CDATA[I Ran the Same Coding Task Across 8 AI Models &#8212; Then I Did It Again]]></title><description><![CDATA[I Ran the Same Coding Task Across 8 AI Models &#8212; Then I Did It Again]]></description><link>https://sovereign.geoclear.io/p/i-ran-the-same-coding-task-across-8-ai-models-then-i-did-it-again-392f4ddee5bb</link><guid isPermaLink="false">https://sovereign.geoclear.io/p/i-ran-the-same-coding-task-across-8-ai-models-then-i-did-it-again-392f4ddee5bb</guid><dc:creator><![CDATA[Shailesh Bhujbal]]></dc:creator><pubDate>Fri, 03 Apr 2026 09:34:34 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d71ea303-2718-4f5e-9208-8e4d41549f10_972x548.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!y9qw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F829fd6bb-3176-45ce-8c8a-c996cb97660d_972x548.png" alt="Results chart: 8 models, 1 task, 19 runs."><figcaption class="image-caption">8 models, 1 task, 19 runs&#8202;&#8212;&#8202;N-run averages and Redis atomicity stratification. Full data: github.com/shaileshjgd/FrontierModelEvals</figcaption></figure></div><h3><strong>I Ran the Same Coding Task Across 8 AI Models&#8202;&#8212;&#8202;Then I Did It&nbsp;Again</strong></h3><p><em>A free model reached the top tier. Then it didn&#8217;t hit it&nbsp;again.</em></p><p><em>All Redis code excerpts and timing data published for independent verification: <a href="https://github.com/shaileshjgd/FrontierModelEvals">github.com/shaileshjgd/FrontierModelEvals</a></em></p><p>I build large-scale ML and analytics systems, and I&#8217;ve spent the better part of the past year trying to answer one question honestly: which AI models actually hold up when the output needs to go to production, not just pass a demo? My <a href="https://medium.com/@shaileshjgd/the-case-for-machines-that-can-explain-themselves-a8aed2e1372b">previous piece</a> argued that fluency without justification is insufficient for production AI systems. This is the empirical follow-up&#8202;&#8212;&#8202;applied to the tools engineers use every&nbsp;day.</p><p>On April 2, 2026&#8202;&#8212;&#8202;the same day Google released Gemma 4&#8202;&#8212;&#8202;I ran a controlled benchmark across eight of the most capable AI models available. The task: write a production-ready sliding window rate limiter for Express.js, complete with Redis adapter, TypeScript types, and real unit tests. The kind of thing a senior engineer might spend half a day&nbsp;on.</p><p>Then I ran it again. And again. Because a single run isn&#8217;t a finding&#8202;&#8212;&#8202;it&#8217;s a data&nbsp;point.</p><p>The single-run result: <strong>Google&#8217;s Gemma 4 31B, accessible for free via NVIDIA&#8217;s inference gateway, scored 9.3/10 vs. Claude Opus 4.6&#8217;s 9.7&#8202;&#8212;&#8202;both in the top tier.</strong> That&#8217;s the headline many people would stop&nbsp;at.</p><p>The multi-run result is more interesting: <strong>Opus was consistent across both runs (9.7, 9.5&#8202;&#8212;&#8202;avg 9.6). Gemma was high-variance (9.3, 6.3, 5.7&#8202;&#8212;&#8202;avg 7.1).</strong> The top-tier ceiling is real. Hitting it consistently is&nbsp;not.</p><p>I want to be precise about what that means and what it doesn&#8217;t, because the details&nbsp;matter.</p><h3>Why This Task, Not HumanEval</h3><p>Before getting into results, I need to address the benchmark selection.
HumanEval, MBPP, and SWE-bench are all useful, but they don&#8217;t capture what separates a capable junior engineer from a strong senior one on real infrastructure tasks. This benchmark task was designed to require all of the following simultaneously:</p><p><strong>Algorithmic correctness</strong>&#8202;&#8212;&#8202;sliding window log, not the simpler fixed-window approximation</p><p><strong>Distributed systems reasoning</strong>&#8202;&#8212;&#8202;Redis atomicity, race condition avoidance</p><p><strong>Production operations thinking</strong>&#8202;&#8212;&#8202;fail-open error handling, memory leak prevention, connection lifecycle</p><p><strong>Test engineering</strong>&#8202;&#8212;&#8202;fake timers for deterministic sliding window validation, not just happy-path assertions</p><p><strong>API design discipline</strong>&#8202;&#8212;&#8202;RFC-compliant headers, pluggable storage abstraction</p><p>This is the kind of task where the difference between a technically correct and a production-safe implementation is invisible to a quick read-through&#8202;&#8212;&#8202;and where AI-generated code poses real operational risk if used without expert&nbsp;review.</p><h3>The Models, Honestly Described</h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!QSJB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe0d8c9-970e-4761-92a6-84a3eb6d451a_900x472.png" alt=""></figure></div><p>One immediate finding before a line of code was evaluated: <strong>o3 produced no output.</strong> OpenAI migrated o3 to require <code>max_completion_tokens</code> instead of <code>max_tokens</code>&#8202;&#8212;&#8202;a documented parameter change that the evaluation script hadn&#8217;t yet adopted. The API returned a 400 error in 1.1 seconds. It&#8217;s a migration, not instability&#8202;&#8212;&#8202;but it illustrates why AI API changelogs belong in your dependency monitoring process alongside npm and&nbsp;pip.</p>
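<p><em>For concreteness, here is the shape of that breaking change in a chat-completions call; the model name and token budget are illustrative:</em></p><pre><code>// Illustrative sketch of the o-series parameter migration.
// Reasoning models reject the legacy `max_tokens` field with a 400 error.
import OpenAI from "openai";

const client = new OpenAI();

// Before (accepted by GPT-4-era models; o3 returns 400):
//   { model: "o3", max_tokens: 4096, messages }
// After:
const completion = await client.chat.completions.create({
  model: "o3",
  max_completion_tokens: 4096, // replaces max_tokens for reasoning models
  messages: [{ role: "user", content: "Write a sliding window rate limiter." }],
});</code></pre>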
srcset="https://substackcdn.com/image/fetch/$s_!QSJB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe0d8c9-970e-4761-92a6-84a3eb6d451a_900x472.png 424w, https://substackcdn.com/image/fetch/$s_!QSJB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe0d8c9-970e-4761-92a6-84a3eb6d451a_900x472.png 848w, https://substackcdn.com/image/fetch/$s_!QSJB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe0d8c9-970e-4761-92a6-84a3eb6d451a_900x472.png 1272w, https://substackcdn.com/image/fetch/$s_!QSJB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe0d8c9-970e-4761-92a6-84a3eb6d451a_900x472.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>One immediate finding before a line of code was evaluated: <strong>o3 produced no output.</strong> OpenAI migrated o3 to require max_completion_tokens instead of max_tokens&#8202;&#8212;&#8202;a documented parameter change that the evaluation script hadn&#8217;t yet adopted. The API returned a 400 error in 1.1 seconds. It&#8217;s a migration, not instability&#8202;&#8212;&#8202;but it illustrates why AI API changelogs belong in your dependency monitoring process alongside npm and&nbsp;pip.</p><p><em>A note on model versions: </em>Claude Sonnet 4.6 was released in February 2026 but was not included in this run. The benchmark used the model versions in active production use in my tooling as of April 2. Sonnet 4.5 was the Sonnet-tier model in that configuration. Sonnet 4.6 will be included in the next&nbsp;eval.</p><h3>The Scoring&nbsp;Rubric</h3><p>I scored each response on six dimensions (1&#8211;10 each), averaged for a final&nbsp;score:</p><p><strong>1. Strategy &amp; Architecture</strong>&#8202;&#8212;&#8202;Is there a clear layered design with explicit reasoning?</p><p><strong>2. Code Correctness</strong>&#8202;&#8212;&#8202;Is the sliding window semantics right? Any&nbsp;bugs?</p><p><strong>3. Redis Implementation</strong>&#8202;&#8212;&#8202;What&#8217;s the atomicity level?</p><p><strong>4. Test Quality</strong>&#8202;&#8212;&#8202;Do tests actually validate the sliding window property?</p><p><strong>5. Production Readiness</strong>&#8202;&#8212;&#8202;Fail-open? Memory safe? Headers RFC-compliant?</p><p><strong>6. 
<h3>The Results</h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Uuab!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2013ead7-a01a-4255-a62e-b560fde491e4_870x560.png" alt=""></figure></div><h3>The Finding That Surprised Me Most: Redis is the Discriminator</h3><p>When I set out to evaluate these models, I expected the differentiation to appear in code correctness or architectural clarity. Instead, the single dimension with the widest variance&#8202;&#8212;&#8202;and the most direct production-risk implication&#8202;&#8212;&#8202;was Redis implementation quality.</p><p>Here&#8217;s the stratification:</p>
<p><strong>Lua scripting&#8202;&#8212;&#8202;Only Claude Opus&nbsp;4.6</strong></p><p>Opus implemented Redis rate limiting via a Lua script: ZREMRANGEBYSCORE + ZADD + ZCARD executed atomically as a single server-side script. This is the only approach that is provably correct under arbitrary concurrency. No race condition is possible because the script runs atomically on the Redis&nbsp;server.</p><p><strong>ZSET pipeline&#8202;&#8212;&#8202;Gemma 4 31B and&nbsp;GPT-4.1</strong></p><p>Both used Redis sorted sets with pipelined commands. The pipeline batches network round-trips but is not atomic&#8202;&#8212;&#8202;there&#8217;s a theoretical race window between the count check and the write. At typical API gateway throughput, this is acceptable. Both scored 8&#8211;9 on this dimension.</p><p><strong>SETEX with JSON blob&#8202;&#8212;&#8202;Claude Sonnet&nbsp;4.5</strong></p><p>This is the one that should concern practitioners. Sonnet stored timestamp arrays as JSON strings, using SETEX for TTL. The sequence is: GET &#8594; deserialize &#8594; filter &#8594; push &#8594; SET. Two concurrent requests can both read the same count, both decide &#8220;allowed,&#8221; and both write back&#8202;&#8212;&#8202;letting clients exceed the rate limit under concurrent load. This isn&#8217;t a theoretical edge case; it&#8217;s the expected behavior at any meaningful traffic&nbsp;volume.</p><p><strong>Not implemented / fixed window&#8202;&#8212;&#8202;GPT-4o</strong></p><p>GPT-4o&#8217;s run 1 described a Redis approach in prose but provided no code. Run 3 produced code&#8202;&#8212;&#8202;but the wrong algorithm: an INCR+PEXPIRE fixed-window counter, not a sliding window log. Both are unusable for distributed deployment as specified.</p><p><strong>Missing prune call&#8202;&#8212;&#8202;Qwen3 Coder&nbsp;480B</strong></p><p>Qwen3&#8217;s Redis code counted entries with ZRANGEBYSCORE but never called ZREMRANGEBYSCORE to remove expired ones, so the ZSET grows without bound under sustained traffic. The per-window count stays correct (the range query filters properly), but the key accumulates all historical entries&#8202;&#8212;&#8202;a memory leak that only surfaces in distributed deployment.</p>
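<p><em>To make the top of that stratification concrete, here is a minimal sketch of the atomic pattern (prune, count, conditionally add, all in one server-side Lua script), invoked from TypeScript via ioredis. The key naming and wiring are illustrative, not any model&#8217;s verbatim output:</em></p><pre><code>// Sketch of an atomic sliding-window check in the spirit of the Lua
// approach described above. Illustrative wiring, not a model's output.
import Redis from "ioredis";

const redis = new Redis();

// KEYS[1] = sorted set of request timestamps for one client
// ARGV    = [nowMs, windowMs, limit]
// The whole script executes atomically on the Redis server, so there is
// no race between the count check and the write.
const SLIDING_WINDOW = `
  redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, ARGV[1] - ARGV[2]) -- prune old
  local count = redis.call('ZCARD', KEYS[1])
  if count < tonumber(ARGV[3]) then
    -- production code should use a unique member (timestamp plus a nonce)
    -- so two hits in the same millisecond don't collapse into one entry
    redis.call('ZADD', KEYS[1], ARGV[1], ARGV[1])
    redis.call('PEXPIRE', KEYS[1], ARGV[2]) -- clean up idle keys
    return 1
  end
  return 0
`;

export async function allow(clientId: string, limit: number, windowMs: number) {
  const allowed = (await redis.eval(
    SLIDING_WINDOW, 1, `rl:${clientId}`, Date.now(), windowMs, limit
  )) as number;
  return allowed === 1;
}</code></pre><p><em>Note the check-before-add ordering: a rejected request never writes at all, so its timestamp never needs a follow-up cleanup.</em></p>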
<h3>What Made Gemma 4 31B Score 9.3 on Run 1&#8202;&#8212;&#8202;And Why It&nbsp;Dropped</h3><p>Let me be specific, because vague model praise is noise. Here&#8217;s what Gemma 4 produced in run 1 that justified a 10 in multiple categories:</p><p><strong>Architecture: </strong>It opened with an explicit component table mapping each layer (StorageAdapter, RateLimiter, RateLimitMiddleware, Config) to its responsibility and stated its complexity analysis unprompted: O(N) for memory store operations, O(log N) for Redis sorted set operations.</p><p><strong>Code Correctness: </strong>The sliding window logic was exact&#8202;&#8212;&#8202;count &lt; limit (not &lt;=, which is a common off-by-one), timestamp prune-before-count ordering, correct branching without penalizing the user by counting the rejected&nbsp;request.</p><p><strong>Tests: </strong>Three real Jest tests, including jest.useFakeTimers() for the sliding window expiry test. This is the test that actually validates the algorithm, not just happy-path&nbsp;behavior (a sketch of the pattern follows below).</p><p><strong>Production thinking: </strong>Fail-open try-catch with structured error handling, PEXPIRE for automatic Redis key cleanup on idle users, and Retry-After header calculation using the oldest timestamp in the window&#8202;&#8212;&#8202;the correct value, not a fixed&nbsp;offset.</p>
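<p><em>For readers who haven&#8217;t used that pattern: a sketch of a deterministic fake-timer expiry test. The limiter API (createLimiter, allow) is an illustrative stand-in for the generated middleware, assuming an in-memory store that reads Date.now():</em></p><pre><code>// Sketch of a deterministic sliding-window expiry test with Jest fake
// timers. Modern fake timers also mock Date.now(), so the window can be
// moved forward without real waiting. createLimiter is hypothetical.
import { createLimiter } from "./rate-limiter";

describe("sliding window expiry", () => {
  beforeEach(() => jest.useFakeTimers());
  afterEach(() => jest.useRealTimers());

  it("re-admits a client only after old hits leave the window", async () => {
    const limiter = createLimiter({ limit: 2, windowMs: 60_000 });

    expect(await limiter.allow("client-a")).toBe(true);
    expect(await limiter.allow("client-a")).toBe(true);
    expect(await limiter.allow("client-a")).toBe(false); // limit reached

    jest.advanceTimersByTime(30_000);
    expect(await limiter.allow("client-a")).toBe(false); // still in window

    jest.advanceTimersByTime(30_001);
    expect(await limiter.allow("client-a")).toBe(true); // old hits expired
  });
});</code></pre>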
<p>What Claude Opus 4.6 did better: Lua scripting for full atomicity, binary search for O(log N) timestamp pruning, picomatch glob routing, sweepInterval.unref() for Node.js process safety, and a shutdown() lifecycle hook.</p><p><strong>Why runs 2 and 3 dropped: </strong>One specific detail separated them from run 1. Run 1&#8217;s Redis implementation added entries speculatively, then did a follow-up ZREM to clean up entries for rejected requests&#8202;&#8212;&#8202;so a blocked user&#8217;s timestamps didn&#8217;t contaminate their next window. Runs 2 and 3 omitted the cleanup. It&#8217;s 3 lines of code. The model knows the concept&#8202;&#8212;&#8202;it demonstrated it in run 1&#8202;&#8212;&#8202;but didn&#8217;t produce it consistently. That&#8217;s the variance story in a single bug: 9.3, 6.3, 5.7, average&nbsp;7.1.</p><h3>The Speed&nbsp;Surprise</h3><p>I ran two timing probes across the NVIDIA free-tier models, and the results revealed something important about how free-tier inference works.</p><p><strong>Probe 1 (earlier in the&nbsp;day):</strong></p><ul><li><p>Qwen3 Coder 480B: 1,213ms for 100&nbsp;tokens</p></li><li><p>Gemma 4 31B: 54,984ms for 100&nbsp;tokens</p></li></ul><p><strong>Probe 2 (an hour&nbsp;later):</strong></p><ul><li><p>Gemma 4 31B: 2,661ms for 80 tokens &#8592; warm&nbsp;model</p></li><li><p>Qwen3 Coder 480B: 4,675ms for 80 tokens &#8592; queue depth increased</p></li><li><p>Nemotron 49B: 1,855ms for 80&nbsp;tokens</p></li></ul><p>The 55-second Gemma 4 31B result wasn&#8217;t the model being slow&#8202;&#8212;&#8202;it was a cold-start artifact. When the model is warm and queue depth is low, Gemma 4 31B responds in under 3 seconds, comparable to everything else. <strong>NVIDIA free-tier latency is primarily a function of queue state, not model architecture.</strong> It&#8217;s variable in ways that matter for interactive use but are invisible in isolated benchmarks.</p><p><strong>What this means in practice:</strong></p><p><strong>Architectural design sessions: </strong>Gemma 4 31B&#8202;&#8212;&#8202;occasionally slower, top-tier at its best, but high-variance, so review the output before you rely on&nbsp;it</p><p><strong>Confidential code: </strong>Claude Sonnet 4.5 via Anthropic API (enterprise data controls, consistent latency)</p><p><strong>Absolute highest stakes / production infrastructure: </strong>Claude Opus 4.6&#8202;&#8212;&#8202;Lua atomicity was consistent across both runs (9.7,&nbsp;9.5)</p><h3>What This&nbsp;Means</h3><p>On this class of task&#8202;&#8212;&#8202;distributed systems middleware with a concurrency-sensitive correctness requirement&#8202;&#8212;&#8202;the best free model landed in the same tier as the best paid model on a single run: 9.3 vs. 9.7, a 0.4-point gap. That result doesn&#8217;t generalize to all engineering tasks, but it&#8217;s a strong signal that the assumption &#8220;you need the most expensive model for serious work&#8221; deserves scrutiny, not blind acceptance. The tradeoffs worth evaluating are latency, data governance, and task fit&#8202;&#8212;&#8202;not a blanket capability gap that may no longer&nbsp;exist.</p><p><strong>This has practical implications:</strong></p><p><strong>1. </strong>Teams routing all engineering tasks to Opus/o3 without evaluating task fit are likely overspending for at least some of their use&nbsp;cases.</p><p><strong>2. </strong>In this benchmark, Redis implementation quality was the highest-signal discriminator of distributed-systems reasoning depth&#8202;&#8212;&#8202;it requires simultaneous understanding of concurrency, atomicity, and operational constraints, dimensions that standard coding benchmarks do not&nbsp;measure.</p><p><strong>3. </strong>NVIDIA&#8217;s free inference gateway is a serious tool for professional engineering, not just experimentation. The combination of Gemma 4 31B (architecture) and Qwen3 480B (implementation) covers most of the engineering workflow at zero&nbsp;cost.</p><p><strong>4. </strong>API changelog monitoring belongs in your dependency stack. o3&#8217;s parameter migration broke the evaluation before a single token was generated. Track AI API breaking changes the same way you track npm and pip&#8202;&#8212;&#8202;they are production dependencies with the same operational requirements.</p><h3>Caveats I Want to Be Explicit&nbsp;About</h3><p><strong>Multi-run data. </strong>I ran additional independent calls for six of the eight models (2&#8211;3 runs each). The N-run averages in the results table above reflect this. Broad structural patterns (Lua vs. GET-SET, Redis vs. not, fake timers) were consistent. Implementation quality within a pattern was not&#8202;&#8212;&#8202;Gemma 4 31B&#8217;s ZREM cleanup appeared in run 1 but not in runs 2 or 3. Models within 1.0 point of each other should be treated as approximately equivalent given single-scorer variability and LLM stochasticity.</p><p><strong>One task, one domain. </strong>These conclusions apply to distributed systems and middleware tasks. Generalizing to all software engineering would be overreach.</p><p><strong>Conflict of interest disclosed. </strong>I use NVIDIA&#8217;s free tier in my own tooling. I have a stake in free-tier models performing well. Scores were assigned before models were compared against each&nbsp;other.</p><p><strong>Anthropic timings are warm-timed. </strong>A short warm probe fired first; the full task fired immediately on its return. Sonnet 4.5 was confirmed at 33.8s; Opus 4.6 is estimated at ~35s from a prior&nbsp;run.</p><p><strong>All Redis code excerpts, scoring rationale, and timing data are published publicly</strong> for independent verification: github.com/shaileshjgd/FrontierModelEvals</p><p>On a single run, a free model reached the same tier as the best paid model (9.3 vs. 9.7). Across multiple runs, Opus held its ceiling consistently while Gemma showed real variance. Both findings matter. The ceiling result tells you what&#8217;s possible on the free tier. The variance result tells you not to ship code from a single free-tier generation into a distributed system without review. The consistency result tells you what you&#8217;re actually buying when you pay for&nbsp;Opus.</p><p>For this benchmark task class&#8202;&#8212;&#8202;and likely adjacent middleware and concurrency-sensitive tasks&#8202;&#8212;&#8202;the free tier gets you 70&#8211;90% of the ceiling in practice.
The remaining gap shows up in exactly the operational details you&#8217;d miss in a code review but feel in a production incident.</p><h2>Full methodology and raw outputs</h2><p><strong>Full methodology report</strong> (6-dimension rubric, per-model Redis analysis, timing methodology, limitations, all 19 raw outputs):</p><p><a href="https://github.com/shaileshjgd/FrontierModelEvals/blob/main/evals/2026-04-02-sliding-window-rate-limiter/REPORT.md">github.com/shaileshjgd/FrontierModelEvals &#8594; REPORT.md</a></p><p><strong>Raw outputs, scoring rationale, and Redis code excerpts</strong> from every model:</p><p><a href="https://github.com/shaileshjgd/FrontierModelEvals/tree/main/evals/2026-04-02-sliding-window-rate-limiter">github.com/shaileshjgd/FrontierModelEvals &#8594; 2026-04-02 evals folder</a></p><p><em>Tags: #AIEngineering #LLM #MachineLearning #SoftwareEngineering #Gemma4 #CloudAI #OpenSource #RateLimiting #Redis #TypeScript</em></p>]]></content:encoded></item><item><title><![CDATA[The Case for Machines That Can Explain Themselves]]></title><description><![CDATA[When fluency is no longer enough, justification must become the governing principle of computational reasoning.]]></description><link>https://sovereign.geoclear.io/p/the-case-for-machines-that-can-explain-themselves-a8aed2e1372b</link><guid isPermaLink="false">https://sovereign.geoclear.io/p/the-case-for-machines-that-can-explain-themselves-a8aed2e1372b</guid><dc:creator><![CDATA[Shailesh Bhujbal]]></dc:creator><pubDate>Thu, 18 Dec 2025 13:22:05 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a86145cf-b70f-4170-addc-650e67840786_1024x413.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!QIAc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff266e2d2-6803-4ed5-be78-d35e3e98d965_1024x413.png" alt="A conceptual diagram showing a cloud of text snippets labeled &#8220;Summary,&#8221; &#8220;Answer,&#8221; &#8220;Explanation,&#8221; and &#8220;Response&#8221; on the left, feeding into a central network labeled &#8220;Evidence,&#8221; &#8220;Constraints,&#8221; &#8220;Consistency,&#8221; and &#8220;Uncertainty,&#8221; which then outputs a single document with a blue checkmark, illustrating the shift from fluency to justification."><figcaption class="image-caption">From fluency to justification: unstructured answers pass through layers of evidence, constraints, consistency, and uncertainty before becoming conclusions that can withstand scrutiny.
Illustration created for this article to depict the transition from unconstrained output to evidence-based reasoning.</figcaption></figure></div><p><em>When fluency is no longer enough, justification must become the governing principle of computational reasoning.</em></p><p><em><strong>Editor&#8217;s note (Dec 19, 2025): </strong>An earlier version of this essay displayed non-public citation placeholders from my drafting workflow, which rendered as broken links. I&#8217;ve replaced them with a verifiable reference list (primary sources where possible) and added a short change log below. The argument is unchanged.</em></p><p>For a piece about justification, every claim here should be traceable to a checkable source.</p><p>Modern systems now write with a proficiency that would have been unthinkable a decade ago. Their fluency, however, has created a misplaced sense of security. Today, polished text reveals little about the soundness of the reasoning behind it. The question is no longer whether a system can produce an answer, but whether the answer can withstand examination.</p><p>This is not conjecture. Across medicine, finance, scientific inquiry, and legal analysis, empirical findings point to the same pattern. Clinical summarization systems introduce subtle&#8202;&#8212;&#8202;but consequential&#8202;&#8212;&#8202;distortions in patient information while sounding fully authoritative&#185;&#178;. Legal evaluations show arguments supported by citations to cases that do not exist&#179;. Factual audits reveal confident answers grounded in no evidence at all&#8308;&#8309;. Large-scale usage studies demonstrate that these issues surface not in dramatic failures but in quiet inconsistencies woven into everyday interactions&#8310;.</p><p>What emerges from this research is straightforward: these systems are skilled at extending a line of discourse and far less capable of interrogating it. They reproduce the surface of expertise while missing the internal scaffolding that makes expertise trustworthy.</p><p>This is a familiar problem for anyone who has built or governed large-scale analytical systems in environments where accountability is non-negotiable. In regulated domains, the most serious failures are not mathematical missteps but reasoning gaps&#8202;&#8212;&#8202;places where an output looks credible but lacks the evidentiary chain required for audit, compliance, or institutional trust.</p><p>Retrieval provides some grounding, but it rarely goes far enough. Even with the right documents in hand, models often misinterpret their contents, generalize too broadly, or ignore conflicting information&#8311;&#8312;. Having evidence is not the same as using it&nbsp;well.</p><p>Confidence adds a different kind of difficulty. Models frequently express certainty where none is warranted&#8309;. Users, in turn, tend to trust answers more when accompanied by explanations&#8202;&#8212;&#8202;even when those explanations are not supported by evidence&#8313;. Apparent transparency can, paradoxically, deepen the underlying opacity.</p><p>Much of this can be traced to architecture. Contemporary systems begin with generation; evaluation, if it occurs, follows. Inquiry and expression collapse into a single act. The result is an answer whose form may persuade, even when its foundations would&nbsp;not.</p><p>Older philosophical perspectives help make sense of this. Popper argued that ideas gain legitimacy only when they are exposed to the possibility of being wrong. 
Polanyi emphasized that expertise involves an intuitive sensitivity to when a conclusion extends beyond its evidence. These traditions value discipline&#8202;&#8212;&#8202;logical, evidentiary, and epistemic&#8202;&#8212;&#8202;over eloquence.</p><p>Encouragingly, emerging research suggests a different direction. Verifier-guided generation treats answers as hypotheses that must earn their validity&#185;&#8304;. Uncertainty-aware methods examine whether retrieved evidence truly supports a claim&#185;&#185;. Hybrid approaches incorporate symbolic checks to reduce high-risk errors in sensitive domains&#185;&#178;. Multi-stage pipelines separate extraction, evaluation, and narrative, and consistently outperform single-pass generation&#185;&#179;.</p><p>What ties these approaches together is a reversal of assumptions: generation becomes the final step in a chain that begins with assessment, evidence gathering, constraint application, and uncertainty evaluation.</p><p>This shift matters because these systems are being woven into decision processes across society&#8202;&#8212;&#8202;in underwriting, scientific synthesis, regulatory analysis, and enterprise operations. These domains do not rest on eloquence. They rest on justification, reproducibility, and the capacity to withstand challenge.</p><p>We once asked whether a machine could write. The more pressing question now is whether a machine can explain&#8202;&#8212;&#8202;and whether it can remain silent when explanation is not possible.</p><p>Only systems capable of justifying their conclusions will deserve the trust that fluency alone once seemed to&nbsp;promise.</p><h3>Endnotes</h3><p>&#185; <strong><a href="https://link.springer.com/article/10.1007/s00330-023-10213-1">ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports</a></strong><a href="https://link.springer.com/article/10.1007/s00330-023-10213-1">&#8202;</a>&#8212;&#8202;Jeblick et al., <em>European Radiology</em> (2023).</p><p>&#178; <strong><a href="https://arxiv.org/abs/2309.07430">Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization</a></strong>&#8202;&#8212;&#8202;Van Veen et al.&nbsp;(2023).</p><p>&#179; <strong><a href="https://arxiv.org/abs/2401.01301">Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models</a></strong><a href="https://arxiv.org/abs/2401.01301">&#8202;</a>&#8212;&#8202;Dahl, Magesh, Suzgun, Ho&nbsp;(2024).</p><p>&#8308; <strong><a href="https://arxiv.org/abs/2109.07958">TruthfulQA: Measuring How Models Mimic Human Falsehoods</a></strong><a href="https://arxiv.org/abs/2109.07958">&#8202;</a>&#8212;&#8202;Lin et al. 
(2021/2022).</p><p>&#8309; <strong><a href="https://arxiv.org/abs/2207.05221">Language Models (Mostly) Know What They Know</a></strong>&#8202;&#8212;&#8202;Kadavath et al.&nbsp;(2022).</p><p>&#8310; <strong><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC11880332/">Evaluating base and retrieval augmented LLMs with user-reported hallucinations in AI mobile apps reviews</a></strong>&#8202;&#8212;&#8202;Masanneck et al., <em>Scientific Reports</em>&nbsp;(2025).</p><p>&#8311; <strong><a href="https://arxiv.org/abs/2305.14627">Enabling Large Language Models to Generate Text with Citations</a></strong><a href="https://arxiv.org/abs/2305.14627">&#8202;</a>&#8212;&#8202;Gao et al., EMNLP&nbsp;(2023).</p><p>&#8312; <strong><a href="https://arxiv.org/abs/2406.13692">Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation</a></strong>&#8202;&#8212;&#8202;Wu et al., EMNLP&nbsp;(2024).</p><p>&#8313; <strong><a href="https://arxiv.org/abs/2305.04388">Language Models Don&#8217;t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting</a></strong>&#8202;&#8212;&#8202;Turpin et al., NeurIPS&nbsp;(2023).</p><p>&#185;&#8304; <strong><a href="https://arxiv.org/abs/2309.11495">Chain-of-Verification Reduces Hallucination in Large Language Models</a></strong><a href="https://arxiv.org/abs/2309.11495">&#8202;</a>&#8212;&#8202;Dhuliawala et al. (2023/2024).</p><p>&#185;&#185; <strong><a href="https://arxiv.org/abs/2505.21072">Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation</a></strong>&#8202;&#8212;&#8202;Fadeeva et al.&nbsp;(2025).</p><p>&#185;&#178; <strong><a href="https://arxiv.org/abs/2409.17270">Proof of Thought: Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning</a></strong><a href="https://arxiv.org/abs/2409.17270">&#8202;</a>&#8212;&#8202;Ganguly et al.&nbsp;(2024).</p><p>&#185;&#179; <strong><a href="https://arxiv.org/abs/2305.14251">FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation</a></strong><a href="https://arxiv.org/abs/2305.14251">&#8202;</a>&#8212;&#8202;Min et al., EMNLP&nbsp;(2023).</p><h3>Change log</h3><ul><li><p><strong>Dec 19, 2025:</strong> Corrected Endnotes (replaced broken/non-public citation placeholders with verified primary-source links).</p></li></ul>]]></content:encoded></item></channel></rss>