Blog

The constraint trap: how deep research quietly cheats

Deep research systems are everywhere now. They can search the web, synthesize dozens of sources, and produce polished reports in minutes. On the surface, they already do everything you’d want: filter by date, restrict domains, plug in a time window, add a few trusted sources, and let the system do the rest.

It sounds almost trivial — until you need those constraints to actually hold.

The problem isn’t that these systems are broken. They’re doing what they’re designed to do: optimize for coherent, helpful synthesis. But when constraints matter, coherence isn’t the goal. Correctness under hard boundaries is.

We evaluated several existing deep-research approaches and saw a consistent pattern: each is strong, but each optimizes for a different failure mode.

Constraint-aware agent: a purpose-built research pipeline that we built to enforce explicit constraints end-to-end and fail closed when they can’t be satisfied.

Perplexity
Fast, soft constraints
OpenAI Deep Research
Polished, slow cadence
Constraint-aware agent
Hard constraints, fast runs

Optimizes for: Speed and breadth over strict correctness

When to use: Quick exploration, early signal scanning, idea generation

Trade-off: Soft constraints under pressure

Optimizes for: Coherent synthesis and readable narratives

When to use: One-off deep dives and background briefs

Trade-off: Slow cadence for repeated runs

Optimizes for: Hard domain and time guarantees, even at the cost of empty results

When to use: Recurring reports and constraint-sensitive research

Trade-off: Fails closed when constraints can’t be satisfied

INDEX

Analyzing the failures

Perplexity example: explicit date window and domain whitelist, yet out-of-bounds sources still appear in the result.

The failures we cared about weren’t subtle. They were operational: you ask for a strict time window and get out-of-window sources; you whitelist domains and unapproved sources slip in; citations look fine until you verify them.

These violations don’t show up as errors. They show up as quiet drift — wrapped in otherwise coherent output. That’s what makes them risky in practice.

We ran the same constraint-heavy requests through OpenAI Deep Research. The report quality was strong, but the runtime was too slow to be operational — often 20+ minutes. In contrast, our constraint-aware agent completes comparable runs in about 3 minutes, making recurring and iterative use feasible.

Deep research agents promise to scan the landscape, filter the noise, and surface what matters — repeatedly, and within clear boundaries. In practice, those boundaries are the most fragile part of the system.

We built this agent for a narrow role: act as a trend scout for internal research groups, generating recurring reports within a fixed historical range, from a trusted set of sources, in a standardized format — fast enough to run regularly.

OpenAI Deep Research run (often 20+ minutes)

Constraint-aware agent run (about 3 minutes)

Making deep research dependable

On paper, the problem looked straightforward. Search engines support date filters. Research tools allow domain restrictions. The hard part wasn’t specifying constraints — it was making them survive the whole pipeline: retrieval, metadata, fallbacks, and synthesis.

We inherited an early prototype that worked for demos, but its behavior was hard to reason about and hard to tighten. We rebuilt on top of GPT Researcher not to make the agent “smarter,” but to start from a clean, understandable baseline we could extend systematically.

The key shift was conceptual rather than technical: constraints stopped being something the model should “try” to respect and became product requirements enforced in code. Domain boundaries became explicit whitelists, time windows became hard start/end checks, and report structure became a programmatic contract rather than prompted prose.

Constraints aren’t “preferences.” They’re requirements.

The surprising part is how rarely constraint failures announce themselves. They don’t crash the system. They produce outputs that look correct. The failures only became obvious once we ran the agent repeatedly — across different domains, time windows, and source distributions — and validated results systematically.

High-level workflow: planning, parallel retrieval, constraint enforcement, and structured synthesis.

Where things started to fall apart

1. When the agent invents continuity under constraint pressure

One of the first signals showed up in citations. Some reports included references that appeared entirely legitimate: plausible titles, publication dates, and summaries that were clearly relevant to the topic. At a glance, everything checked out.

On closer inspection, it didn’t. Publication dates fell outside the selected time window, and URLs were often wrong or slightly corrupted. In many cases, the referenced article actually existed — but the cited link didn’t. You could usually find it by manually searching for the title.

Understanding why this was happening took time. The first instinct was to inspect retrieval, which turned out to be behaving correctly. What we eventually uncovered was a different failure mode: when the system encountered insufficient valid sources that satisfied all constraints, it didn’t fail or return an empty result. It filled the gap itself.

Under constraint pressure, the agent produced internally consistent-looking citations that violated both temporal and domain boundaries. Realizing that this behavior was caused by model hallucination under constraint pressure took significantly longer than we expected.

From that point on, “no valid sources found” became a first-class outcome, not an error to hide or work around.

2. Retrieval looks solved, until constraints interact

Even after addressing hallucination under constraint pressure, another fragility remained. Most deep research pipelines live or die by retrieval, and this is where constraint enforcement proved far more subtle than expected. On paper, retrieval looks solved: filter by domain, filter by date, rank by relevance. In practice, it’s the interaction between those filters that causes problems to emerge.

We initially built the system on top of Tavily. With standard settings and basic search depth, Tavily behaved predictably: when no sources matched the constraints, it often returned no results. Compared to other retrievers we tested — particularly those used in systems like Perplexity — Tavily discarded a much larger share of irrelevant content. The tradeoff was clear: higher precision at the cost of lower recall.

The problematic behavior only appeared once we pushed the system harder. In advanced search depth, and specifically when no valid sources existed for the specified domain and time window, Tavily sometimes returned semantically relevant content that violated domain or temporal constraints. Faced with an empty result set, the system preferred something plausible over nothing correct.

As we expanded the scope, we switched to Exa to increase coverage. The system returned more sources, but that increase in recall came with a familiar cost: a larger share of the results was less relevant.

While validating historical runs, we noticed that some sources returned by Exa were violating the specified time constraints. At first, we assumed this was a fault in our own pipeline. We inspected our filtering logic and then turned to the retriever metadata to understand what was happening.

That’s when the issue became visible. The publication_date metadata returned by Exa was often missing, incomplete, or clearly inaccurate, making the observed filtering behavior difficult to explain.

Only after inspecting example outputs in Exa’s own documentation did it become clear that this behavior wasn’t specific to our implementation. The same inconsistencies were visible there as well — yet the issue is not explicitly documented or called out as a limitation.

More importantly, retriever-side date filtering could not be reliably inferred from the returned metadata alone. Some sources were filtered out even though their metadata dates appeared to fall within the specified time window, while others passed through despite dates being clearly wrong.

3. Filtering for correctness isn’t enough: usefulness matters too

That limitation pushed us to experiment with an additional guardrail. We integrated Scrapegraph AI into the pipeline as a way to extract and reason over page content directly, rather than relying solely on retriever metadata. The initial motivation was temporal validation: confirming publication dates from the page itself when metadata could not be trusted.

Once in place, it became clear that the same mechanism could be applied more broadly. Beyond time filtering, we also faced a criteria filtering problem: there are sources that technically fall within the right domain and time window, but that are structurally not useful — content hidden behind paywalls, announcements without code or pricing, or articles that look relevant but cannot be acted upon.

Using Scrapegraph AI, we experimented with applying natural-language criteria directly to extracted page content before final inclusion. In practice, this worked surprisingly well as an additional filtering layer on top of retrieval and domain constraints.

The approach was promising, but it also introduced additional complexity and variability. For the MVP, we intentionally disabled this layer to keep the system predictable and easier to reason about. It will likely return as part of a more unified filtering layer that combines temporal validation, quality checks, and actionability criteria.

A related but more structural issue emerged once we started comparing outputs across systems. When academic repositories such as arXiv were included, they tended to dominate the final reports. We observed the same skew in our system, OpenAI Deep Research, and Perplexity under comparable scopes.

We don’t believe this dominance automatically indicates bias. A more plausible explanation is content reality: arXiv simply publishes vastly more material than most industry, product, or policy-focused sources. In many cases, this is genuinely where new ideas first appear.

For our MVP, we made coverage balance an explicit part of retrieval. We explored several strategies to prevent a single domain from dominating the results, including reordering combined result lists by domain or splitting retrieval budgets up front. In practice, those approaches either added complexity or proved brittle under edge cases.

Instead, we implemented a simple retry-based mechanism: when a single domain began to dominate the results, the system detected that skew and re-ran retrieval without the dominant source. This kept coverage balanced while remaining predictable and easy to reason about.

Dominant sources can still surface — but only if they earn their place rather than overwhelming the signal by sheer volume.

Retrieval output showing domain dominance: although multiple domains were allowed, results collapsed to a single source. If you inspect the publication dates, you can also see violations of the specified time window conditions in Tavily’s advanced search depth mentioned before.

The decisions that quietly made the system reliable

One of the fastest and most impactful quality improvements had nothing to do with models. We removed the free-text input field. Instead, users select a research group from a dropdown, and the system injects the domain description and constraints automatically.

This dramatically reduced input variance and improved consistency. Later, we reintroduced limited free text — but only for editing the research group title and description, not for redefining the task itself.

Several structural decisions proved just as important. Early versions relied on prompting the model to follow a template. The results looked good — until they didn’t. Small deviations accumulated, formatting drifted, and reports became harder to compare and validate over time. We replaced prompt-level formatting with structured LLM output, schema enforcement in code, and deterministic JSON → Markdown conversion. This removed an entire class of formatting errors and made reports easier to render, validate, and evolve. Streaming became harder, but for this use case reliability mattered more than immediacy.

To understand how and why the system behaved the way it did, we instrumented the agent using LangSmith. Tracing full runs, inspecting retrieval decisions and intermediate steps, and comparing behavior across configurations allowed us to identify failure modes that would have been invisible otherwise. The goal wasn’t just debugging, but learning — feeding those insights back into better defaults, stronger guardrails, and more predictable behavior over time.

Finally, we deliberately avoided optimizing for a single language model too early. Throughout development, we experimented with multiple models, treating the model as a replaceable component inside a constrained system rather than a fixed choice. This helped us understand how hallucination behavior, cost, latency, and reasoning depth change under strict constraints — and reinforced the idea that architecture and guardrails matter more than any single model choice.

Results so far

Under the same domain- and time-constrained requests, systems diverged in ways that matter operationally. Perplexity and OpenAI Deep Research can produce polished synthesis, but neither treats constraints as hard requirements — and violations tend to slip in quietly.

This agent behaves differently. Constraints are enforced end-to-end. Time windows and domain whitelists hold throughout the pipeline, report structure is deterministic, and “no valid sources found” is an intentional outcome rather than something to smooth over.

Latency reinforced the distinction. In our demo, a full run completes in about 3 minutes. OpenAI Deep Research often took 20+ minutes, and some runs failed to complete. For recurring reporting, that difference determines whether the system is usable at all.

The most important shift was qualitative: what began as a promising prototype now behaves like infrastructure — something that can be run repeatedly, reasoned about, and improved incrementally without surprising its users.

What’s next

Reintroduce a unified LLM-as-a-judge layer for quality and actionability
Strengthen temporal validation beyond retriever metadata
Expand curated domains for product, legal, and policy coverage
Use observability data to systematically optimize the system

CLOSING THOUGHT

Deep research agents don’t fail because they lack constraints. They fail because constraint enforcement breaks at the system boundaries — retrieval, metadata, and synthesis.

Once you recognize this, the focus shifts. The problem is no longer about crafting better prompts or picking a more capable model. It becomes an engineering problem: how constraints are represented, propagated, and enforced across the system.

When deep research starts behaving like infrastructure — predictable, inspectable, and dependable — that’s when it becomes genuinely useful.