The Pipeline Approach That Beat Single-Model Document AI

Hooh co-founder Iaroslav Argunov on privacy-by-design and 3x efficiency gains

Published on:

23 Oct 2025, 12:00 pm

In 2025, 78% of organizations handling corporate data plan to implement privacy-by-design principles in their AI projects, according to PwC. The reason is simple: violating privacy standards doesn't just risk fines; it destroys client trust instantly.

Iaroslav Argunov learned this lesson across three vastly different environments. At Russia's Ministry of Economic Development, he faced chaos: decrees mixed with handwritten notes, scanned forms with seals alongside modern reports. His SQL-based repository and ML pipeline tripled manual workflow efficiency. At Yandex, he rebuilt anti-fraud systems, cutting operational costs by 37% while detecting 17% more threats. Now, as co-founder of Boston-based Hooh, he's applying these lessons to Document AI with a critical twist: privacy isn't added later, it's the foundation.

His approach challenges the industry's model-first mentality. While others showcase AI that 'answers beautifully once,' Argunov builds pipelines that handle millions of documents reproducibly. OCR with layout preservation, pre-index PII editing, field-level encryption, knowledge graphs with source verification: each component exists not for technical elegance but for business necessity. As he puts it: 'A model is the engine, but without a chassis, brakes, and traffic rules, you won't get far.'

The results speak to practical impact: search time reduced from half a day to minutes, zero PII leakage into embeddings, and every answer traceable to a specific paragraph in a source document. In a world drowning in unstructured data, this systematic approach might determine which companies can actually use their documents versus those that just store them.

A Pipeline Instead of Model Magic

Document AI is not just a model for extracting text from documents. It is a full pipeline in which documents pass through OCR and layout-aware parsing, normalization and structuring into schemas, sensitive-data processing, multi-layer storage, hybrid search, and source verification. "A model is the engine, but without a chassis, brakes, and traffic rules, you won’t get far. Business needs flow, and the flow creates the process," Argunov explains.

This systematic approach is especially crucial when handling personal and financial data: privacy must be embedded from the start, as fixing errors after deployment is both costly and risky.

Managed Flow

Iaroslav's first experience with document automation was at the Russian Ministry of Economic Development and the Moscow Analytical Center (MAC), where he oversaw macro-analytics projects. The flow of heterogeneous documents (decrees, letters, reports, scanned forms with seals, and handwritten notes) generated a huge amount of manual work. "The hardest part was turning a pile of papers into a controlled process: from document appearance to verifying that a final number or quote was indeed taken from a specific paragraph on a specific page," recalls Argunov.

At MAC, he implemented an SQL-based repository, integrated external data sources, and used ML/Computer Vision for city-level tasks, which tripled manual workflow efficiency and increased data flow performance by 61%.

He later applied this methodology at Yandex, a major Russian tech company providing search, advertising, and cloud services. There, Iaroslav rebuilt the anti-fraud and search moderation process, updated annotation rules, and redesigned the ML pipeline. The results: 1.5x faster responses to phishing ads, 17% more proactive detections, and 37% lower operational costs. "I have seen many demo models that answer beautifully once, but business needs flow — reproducible, explainable, and manageable by cost and latency," he notes.

Privacy-by-Design as a Principle

In 2025, Iaroslav co-founded Hooh in Boston, where the core product is a Document AI pipeline built around privacy-by-design principles. "Documents contain PII, finance, medical, and contractual information. Privacy cannot be added later," he emphasizes.

Argunov contributed proprietary NLP algorithms and a knowledge graph schema, defining both architecture and compliance requirements. Development is carried out by a team with valid work authorization in the U.S., ensuring legal compliance. "Pre-index editing, separate keys for each client, field-level encryption, prohibition of training on client data without written consent — all of this must be default."
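To make "pre-index editing" and "separate keys for each client" concrete, here is a minimal Python sketch. It is not Hooh's implementation: the two regex patterns, the function name `redact_before_indexing`, and the sample keys are all hypothetical, and keyed HMAC tokens stand in for field-level encryption, which would require a real cryptography library and key management.

```python
import hashlib
import hmac
import re

# Hypothetical patterns for illustration; a real system would cover far more PII types.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_before_indexing(text: str, client_key: bytes) -> str:
    """Replace PII with keyed tokens so raw values never reach an index or embedding."""

    def make_replacer(label: str):
        def replace(match: re.Match) -> str:
            # The token depends on the client's key, so the same value
            # yields different tokens for different clients.
            digest = hmac.new(client_key, match.group().encode("utf-8"), hashlib.sha256)
            return f"[{label}:{digest.hexdigest()[:12]}]"

        return replace

    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(make_replacer(label), text)
    return text


if __name__ == "__main__":
    raw = "Contact jane.doe@example.com, SSN 123-45-6789."
    print(redact_before_indexing(raw, client_key=b"client-a-secret"))
    print(redact_before_indexing(raw, client_key=b"client-b-secret"))
```

Because redaction runs before any indexing step, whatever is embedded or searched later only ever sees the tokens. The tokens stay stable within one client, so repeated mentions of the same value can still be linked without exposing it.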

The Hooh pipeline includes document intake and classification, layout-aware OCR with coordinate preservation, normalization according to ontologies and schemas, pre-index editing of personal data, multi-layer storage (from source to knowledge graph), hybrid search with access control, RAG orchestration with context limitation, hallucination filters, automated fact-checking, and embedded human-in-the-loop (HITL) review for disputed cases.
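One stage in that list, hybrid search with access control, can be illustrated with a short Python sketch. The article does not describe how Hooh scores or filters results, so everything here is an assumption: the `Chunk` type, the role-based filter, the simple term-overlap keyword score, and the `alpha` weight that blends keyword and vector similarity.

```python
import math
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Chunk:
    chunk_id: str
    text: str
    embedding: List[float]
    allowed_roles: Set[str] = field(default_factory=set)


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear in the text."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(t in words for t in terms) / len(terms) if terms else 0.0


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(
    query: str,
    query_embedding: List[float],
    chunks: List[Chunk],
    user_roles: Set[str],
    alpha: float = 0.5,
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    # Access control comes first: chunks the user may not see are never scored.
    visible = [c for c in chunks if c.allowed_roles & user_roles]
    scored = [
        (alpha * keyword_score(query, c.text)
         + (1 - alpha) * cosine(query_embedding, c.embedding), c)
        for c in visible
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(round(score, 3), c.chunk_id) for score, c in scored[:top_k]]
```

Filtering before scoring, rather than after, means a restricted chunk cannot influence ranking or leak through a score, which fits the privacy-first ordering the pipeline describes.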

This approach reduces the time to find relevant document fragments from hours to minutes, ensures zero leakage of personal data, and provides transparent source traceability — creating conditions for B2B pilots with strict privacy and compliance requirements.

From Pilot to Measurable Impact

Iaroslav highlights three key outcomes of his pipelines, each reflecting the system’s real-world value:

Every answer can be traced to a specific paragraph in a source document, turning conclusions into verifiable evidence.
Zero leakage of PII into indexes and embeddings.
Search and analysis time reduced from half a day to minutes with guaranteed reproducibility.
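The first of these outcomes, tracing an answer to a specific paragraph, can be sketched in Python as a claim that carries its own source reference and is checked against the stored text. This is an illustrative assumption rather than the system described in the article: the `SourceRef` and `Claim` types, the bounding-box field, and the substring check in `verify_claim` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SourceRef:
    doc_id: str
    page: int
    paragraph: int
    # Coordinates (x0, y0, x1, y1) as they might be kept by layout-aware OCR.
    bbox: Tuple[float, float, float, float]


@dataclass(frozen=True)
class Claim:
    quote: str
    source: SourceRef


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so OCR spacing does not break matching."""
    return " ".join(text.lower().split())


def verify_claim(claim: Claim, paragraphs: Dict[Tuple[str, int, int], str]) -> bool:
    """Return True only if the quoted text occurs in the cited paragraph."""
    key = (claim.source.doc_id, claim.source.page, claim.source.paragraph)
    paragraph = paragraphs.get(key)
    if paragraph is None:
        return False  # The cited location does not exist in the store.
    return normalize(claim.quote) in normalize(paragraph)
```

A claim that fails this check would be a natural candidate for the human review step rather than being returned to the user as-is.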

These results show that Document AI, built as a process rather than a single model, significantly improves document handling efficiency, mitigates risks, and makes results reproducible.

Argunov’s track record speaks for itself, from hackathon wins to judging seats at major tech competitions. His teams took first place at the Yandex Crowd Hackathon and twice at Russia’s National Projects contest. At DevHack, he’s now on the other side of the table, reviewing projects for what really matters: innovation, usability, and solid engineering.

“They usually bring me in when AI talk needs to turn into a business plan,” he says with a grin. “That kind of trust doesn’t come from theory, it comes from building things that actually work.”

Through these platforms, he keeps pushing the same message: Document AI must be a process, not a one-off model. And privacy-by-design isn’t just a buzzword; it’s how trust in tech is built from the ground up.

Moving to the Next Level

Today, Argunov continues to evolve Document AI as a reproducible process, from OCR and normalization to knowledge graphs and source verification. Emerging areas include multimodal data integration, secure embeddings, pre-flight checks for RAG, versioning, and document accountability graphs.

Challenges include legal responsibility for generative answers and their auditing, production-level multimodality, cost containment via smart caching and on-prem solutions, provenance standards, and PII-free secure embeddings. Growth ideas include full lineage graphs, a "privacy budget" metric, machine-readable access policies, and pre-analytics before RAG.

When asked why a process-oriented approach is essential, he answers with a metaphor: "It’s like a restaurant kitchen. You can buy the most expensive knife, but without recipes, hygiene, and portion control, the food will be both expensive and unsafe. Process-driven Document AI delivers three things: you quickly find what you need and know where it came from; your data is secure; quality and cost are under control."

His experience spans government agencies, major tech companies, and startups. That combination of engineering discipline, deep process understanding, and privacy-by-design principles allows him to build systems that meet real business and legal requirements while setting new standards in the Document AI industry.