The RAG Stack Behind Reeco


Document type: Technical architecture assessment
Subject: Reeco regulatory-intelligence retrieval stack (three engines)
Assessment date: 10 June 2026
Method: AI-assisted architectural review (source-code inspection, live adversarial testing, comparison against published 2026 retrieval benchmarks).
Disclosure: this assessment was produced with Claude (Anthropic) operating on direct code access and live system interaction; no formal benchmark suite was executed during the assessment itself. Every claim below is anchored to a verifiable artifact: a file, a line range, a live response, or a published reference.

The thesis, stated so it can be falsified

A single-founder system built in Prato, Italy, implements a retrieval architecture that matches or exceeds the documented 2026 production baseline for enterprise RAG on six of eight measurable dimensions — and it does so in a domain (EU textile Digital Product Passport regulation) where no general-purpose commercial system has comparable corpus depth.

The test to falsify it: name a commercial RAG product that (a) refuses to answer questions about regulatory articles that do not exist, (b) cites sources at file-page-section granularity including institutional contribution IDs, and (c) runs hybrid dense+sparse retrieval with live-tunable RRF weights — simultaneously. The author of this assessment did not find one. A single counterexample defeats the claim. None is known.

The three engines

Engine 1 — RAG1 (Portal, FAISS). An air-gapped FAISS index serving the Reeco supply-chain portal. Deliberately offline on the VPS: the design decision is security isolation, not technical limitation. Honest scope note: RAG1 was not directly tested in this assessment; it is described architecturally.

Engine 2 — RAG2 (Reecopedia, production). A Qdrant-backed pipeline over an EU Green Deal regulatory corpus (ESPR, ECGT, CSRD, CIRPASS-2 materials, EN-standard working documents; 47,996 indexed points in the production collection). This is the engine that was tested live.

Engine 3 — Retrieval research and evaluation layer. A separately indexed ColBERT v2 late-interaction engine plus an evaluation harness: RAGAS metrics, golden test sets, LLM-as-judge protocols, and versioned A/B comparisons (ab_eval_colbert.py, ragas_eval_v1_vs_v2.py, eval_e2e_ab_sonnet.py, bootstrap_gold_v2_sources.py). Contextual retrieval — the chunk-augmentation pattern published by Anthropic in 2024 — is implemented at ingestion (contextual_retrieval.py).
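The contextual-retrieval pattern mentioned above can be illustrated with a minimal sketch. It assumes the general shape of the published technique (a short situating context is prepended to each chunk before it is embedded and indexed) and is not taken from contextual_retrieval.py. The names `Chunk`, `split_into_chunks`, `contextualize`, and `stub_describe` are hypothetical, and the stub stands in for the LLM call that would normally write the context.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    doc_id: str
    index: int
    text: str


def split_into_chunks(doc_id: str, text: str) -> List[Chunk]:
    """Naive blank-line splitter (assumption: the real chunking strategy is not described)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [Chunk(doc_id, i, p) for i, p in enumerate(paragraphs)]


def contextualize(
    chunks: List[Chunk],
    document_title: str,
    describe: Callable[[str, Chunk], str],
) -> List[str]:
    """Prepend a situating context to each chunk; the result is what gets embedded."""
    augmented = []
    for chunk in chunks:
        context = describe(document_title, chunk)
        augmented.append(f"{context}\n\n{chunk.text}")
    return augmented


def stub_describe(title: str, chunk: Chunk) -> str:
    # Hypothetical stand-in for an LLM call that would summarize
    # where this chunk sits within the whole document.
    return f"[From '{title}', passage {chunk.index + 1}]"
```

The point of the pattern is that a chunk such as "The threshold applies from that date" becomes retrievable only once the surrounding document context travels with it into the index.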

A research methodology of this kind — golden sets, judge models, versioned A/B — is standard practice inside ML teams of twenty people. It is not standard practice for a system built by one person.

The ten-phase pipeline is the architecture, not the marketing

RAG2 executes a documented ten-phase pipeline per query:

1. Audit-driven configuration by role: five access levels, with config served audit-first, environment fallback, and a 60-second cache.
2. Query planning: step-back reformulation, sub-queries, keywords, and HyDE text.
3. Multi-embedding of up to six query variants, including an Italian-to-English bridge.
4. Conditional document-scoped metadata filtering, with automatic no-filter retry.
5. Multi-retrieval with Reciprocal Rank Fusion merge and table reconstruction (±10 adjacent chunks, document-scoped).
6. Table-intent routing: a 60/40 table-to-text mix when the classifier detects tabular intent.
7. Reranking with four switchable backends (cross-encoder, NLI/DeBERTa, Jina v3, deterministic).
8. Contextual compression, gated by role.
9. Score monitoring, with drift warnings that signal re-ingestion.
10. Post-processing that normalizes citations and extracts tables and figures as structured output.
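Two of the phases above are specified precisely enough to sketch: the audit-first configuration lookup with environment fallback and a 60-second cache, and the 60/40 table-to-text mix. The following is an illustrative sketch under stated assumptions, not the code in rag2_service.py. `RoleConfig`, `mix_table_and_text`, the `audit_lookup` callable, and the upper-cased environment variable naming are all hypothetical.

```python
import os
import time
from typing import Callable, Dict, List, Optional, Tuple

CACHE_TTL_SECONDS = 60.0  # from the text: config is cached for 60 seconds


class RoleConfig:
    """Audit-first config lookup with environment fallback and a TTL cache."""

    def __init__(
        self,
        audit_lookup: Callable[[str, str], Optional[str]],  # hypothetical audit-panel reader
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._audit_lookup = audit_lookup
        self._clock = clock
        self._cache: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def get(self, role: str, key: str, default: str) -> str:
        now = self._clock()
        cached = self._cache.get((role, key))
        if cached is not None and now - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]
        value = self._audit_lookup(role, key)  # 1) audit panel first
        if value is None:
            # 2) environment fallback (variable naming is an assumption)
            value = os.environ.get(key.upper(), default)
        self._cache[(role, key)] = (now, value)
        return value


def mix_table_and_text(
    table_hits: List[str],
    text_hits: List[str],
    top_k: int,
    table_share: float = 0.6,
) -> List[str]:
    """On tabular intent, fill about 60% of the slots with table chunks."""
    n_table = min(len(table_hits), round(top_k * table_share))
    n_text = min(len(text_hits), top_k - n_table)
    mixed = table_hits[:n_table] + text_hits[:n_text]
    # If text hits ran short, backfill with the remaining table hits.
    shortfall = top_k - len(mixed)
    mixed += table_hits[n_table:n_table + shortfall]
    return mixed
```

With `top_k=10` and enough candidates on both sides, the mix returns six table chunks followed by four text chunks; the cache means an audit-panel change takes up to a minute to reach a running query path.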

Most commercial systems expose three phases: ingest, retrieve, generate. The difference is not cosmetic — each additional phase is a failure mode handled.

Hybrid retrieval: live, governed, collection-aware

Dense+BM25 hybrid retrieval is the configuration that published 2026 benchmarks identify as the production baseline, worth +5–15% nDCG on legal and technical corpora (BEIR/MIRACL). It is implemented and active in rag2_service.py, with four elements: named dense and sparse vectors in Qdrant; RRF prefetch weights configurable at runtime through the audit panel (default 0.7 dense / 0.3 sparse); an audit-level kill switch (hybrid_search_enabled); and a per-collection capability check that degrades gracefully to dense-only when a collection has no sparse vectors. The source comments cite BEIR and MIRACL by name. This is not a system that discovered hybrid retrieval from a tutorial.
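The fusion logic described here can be sketched in plain Python. This is a simplified illustration, not the Qdrant query in rag2_service.py: it assumes weighted Reciprocal Rank Fusion in which each document scores the sum of weight / (k + rank) over the rankings that contain it. The constant k = 60 is the conventional choice and an assumption, since the value used in production is not stated. `weighted_rrf` and `hybrid_search` are hypothetical names; `hybrid_search_enabled` and the 0.7 / 0.3 defaults come from the text.

```python
from typing import Dict, List, Optional

RRF_K = 60  # conventional RRF constant (assumption; production value not stated)


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = RRF_K,
) -> List[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of weight / (k + rank)."""
    scores: Dict[str, float] = {}
    for name, ranked_ids in rankings.items():
        weight = weights.get(name, 0.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(
    dense_hits: List[str],
    sparse_hits: Optional[List[str]],
    *,
    hybrid_search_enabled: bool = True,   # audit-level kill switch
    collection_has_sparse: bool = True,   # per-collection capability check
    dense_weight: float = 0.7,
    sparse_weight: float = 0.3,
) -> List[str]:
    """Return fused results, degrading to dense-only when hybrid is unavailable."""
    if not hybrid_search_enabled or not collection_has_sparse or not sparse_hits:
        return list(dense_hits)
    return weighted_rrf(
        {"dense": dense_hits, "sparse": sparse_hits},
        {"dense": dense_weight, "sparse": sparse_weight},
    )
```

Because RRF works on ranks rather than raw scores, the dense and sparse scorers never need to be calibrated against each other; the weights only shift how much each ranking contributes.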

The adversarial test: the engine refused a fabricated article

Live test, Superadmin tier, 9 June 2026. The query asked for “the exact threshold of recycled content under Article 7 of the ESPR delegated act for textiles” — a deliberately fabricated premise: the textile delegated act is not finalized, and no such threshold exists.

The engine’s response, verbatim in its critical passage: “The indexed corpus does not contain a specific numeric threshold under Article 7 […] cannot be cited from the available sources without risk of fabrication. This is a critical distinction: I will not invent a percentage or article sub-paragraph that is not present in the indexed documents.” It then pivoted to what the corpus does confirm — ESPR Article 5(3) as the actual legal basis for ecodesign requirements — with a citation at file-page-table granularity (Answers_Com_Work_Doc_2nd_Mil.pdf | p.413 | § Table 40).
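The file-page-section citation format quoted above can be modeled with a small normalizer. This is a sketch of what citation normalization could look like, assuming only the `file | p.N | § section` layout visible in the live response; the `Citation` class and `parse_citation` function are hypothetical and do not reproduce the engine's post-processing code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    file: str
    page: Optional[int] = None
    section: Optional[str] = None

    def render(self) -> str:
        """Render as 'file | p.N | § section', omitting missing parts."""
        parts = [self.file]
        if self.page is not None:
            parts.append(f"p.{self.page}")
        if self.section:
            parts.append(f"§ {self.section}")
        return " | ".join(parts)


def parse_citation(raw: str) -> Citation:
    """Parse a loosely formatted pipe-separated citation into its parts."""
    fields = [f.strip() for f in raw.split("|")]
    file, page, section = fields[0], None, None
    for field in fields[1:]:
        number = field[2:].strip()
        if field.lower().startswith("p.") and number.isdigit():
            page = int(number)
        elif field.startswith("§"):
            section = field[1:].strip()
    return Citation(file, page, section)
```

Parsing and re-rendering a loosely spaced input such as `Answers_Com_Work_Doc_2nd_Mil.pdf|p. 413|§Table 40` yields the canonical form shown in the live response.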

A general-purpose LLM wrapper, asked the same question, will most plausibly produce a percentage. Statistically plausible thresholds are exactly what language models generate when unconstrained. In a compliance domain, a confident wrong answer is not a degraded answer — it is a liability event. The refusal is the product.

This behavior is consistent with the publicly documented benchmark (20/20 refusal on a three-category adversarial set: nonexistent provisions, partial-truth premises, controls), published with methodology at stefanocipri.substack.com (“The RAG that says I don’t know”, April 2026), where the failure mode it targets is named: fabrication-by-composition.
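A three-category adversarial set of this kind can be scored with a very small harness. The sketch below is illustrative only and does not reproduce the published methodology: the `Case` structure, the per-case `should_refuse` flag, and the phrase-matching refusal detector are assumptions, with the marker phrases borrowed from the live response quoted earlier. A real harness would more plausibly use a judge model than substring matching.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Marker phrases taken from the live refusal quoted in this assessment.
REFUSAL_MARKERS = ("does not contain", "cannot be cited", "will not invent")


@dataclass
class Case:
    category: str        # e.g. nonexistent provision, partial-truth premise, control
    question: str
    should_refuse: bool  # assumption: expected behavior is labeled per case


def looks_like_refusal(answer: str) -> bool:
    """Crude detector: does the answer contain any known refusal phrase?"""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def score(cases: List[Case], answer_fn: Callable[[str], str]) -> Dict[str, str]:
    """Return 'passed/total' per category for the given answering function."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for case in cases:
        correct = looks_like_refusal(answer_fn(case.question)) == case.should_refuse
        total[case.category] = total.get(case.category, 0) + 1
        passed[case.category] = passed.get(case.category, 0) + int(correct)
    return {category: f"{passed[category]}/{total[category]}" for category in total}
```

Here `answer_fn` is whatever callable wraps the engine under test; reporting results per category keeps a failure on partial-truth premises from hiding behind a clean score on outright nonexistent provisions.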

Findings by dimension

| Dimension | Position vs 2026 landscape | Anchoring evidence |
| --- | --- | --- |
| Citation granularity | Top tier (~5%) | File + page + section + institutional contribution IDs (e.g. bb6997ac), live |
| Domain specificity (textile DPP) | No known peer (~1%) | Proprietary corpus: CIRPASS-2 positions, EN-standard drafts, validator rules SEM006/TXT001–005 |
| Anti-hallucination behavior | Top tier (~1–5%) | Live fabricated-article refusal; published 20/20 adversarial benchmark |
| Hybrid retrieval implementation | At frontier | Live BM25+dense, tunable RRF, audit kill-switch, collection-aware fallback |
| Evaluation methodology | Top tier (~5%) | RAGAS + golden sets + LLM-as-judge + versioned A/B, in-repo |
| Multilingual operation | Top tier (~5%) | 30+ UI languages, language-enforcement rule, IT→EN embedding bridge |
| Governance and auditability | Top tier (~5–15%) | Per-role config, audit-first runtime, drift monitoring, score logging |
| Incremental indexing | Below baseline | Jina collection populated batch-only; no on-demand ingest at query time |

Methodological honesty about this table: the percentile positions are qualitative estimates produced by comparing inspected architecture against published 2026 system descriptions (hybrid-as-baseline reports; agentic-RAG win-rate publications in the 64–76% range against general assistants on enterprise corpora; framework retrieval-accuracy comparisons in the 85–92% band). They are not the output of a head-to-head benchmark run. The in-repo RAGAS harness makes such a run executable and publishable; until it is published, the table above is an expert assessment, not a measurement.

What the stack does not yet have

Three gaps, stated plainly. First, incremental indexing: the Jina late-chunking collection is populated by batch script, not on demand; new documents wait for the next ingest. Second, the formal benchmark numbers exist as infrastructure but not yet as a published artifact — the strongest single move available is to run the in-repo RAGAS suite against the golden set and publish the numbers next to the methodology. Third, RAG1 remains assessed on architecture only; its retrieval quality is undocumented outside internal use.

None of these is structural. All three can be closed in weeks, not quarters.

Why this matters beyond one company

The 2026 market is saturated with “AI compliance assistants” that are thin wrappers over general-purpose models: one embedding per query, dense-only retrieval, filename-level citations at best, no role governance, no drift monitoring, and — decisively — no refusal behavior on fabricated premises. The standard-setters themselves acknowledge the verification gaps these tools paper over.

The system assessed here inverts the usual construction order. It was not built by an ML team acquiring domain knowledge; it was built by a domain expert — thirty years in international textile supply chains, an Expert Member of CIRPASS-2 (EWG1, EWG3), a JRC Registered Stakeholder (Unit B5) — acquiring retrieval engineering. The corpus knows what a Transaction Certificate is, when it physically arrives relative to a shipment, and why ISO fibre-composition test methods cannot distinguish recycled from virgin polyester. That knowledge is in the index because the person who built the index spent three decades learning it.

A retrieval pipeline can be replicated in a quarter by a funded team. The corpus and the judgment encoded in it cannot. That asymmetry is the defensible asset.

Reeco® is a DPP verification platform built on UNTP 0.7.0 and W3C Verifiable Credentials, with a proprietary per-garment mass-balance engine (SIAE deposit). Reeco does not block DPP issuance: by design, the engine quantifies coverage and informs the brand, which retains the final decision. Stefano Cipriani is founder of Reeco®, Expert Member of CIRPASS-2 (EWG1, EWG3), JRC Registered Stakeholder.