Escaping LLM Collapse: Why AI Keeps Recommending the Same 3 Flights (and How We Broke It)
Model collapse happens when an LLM optimizes for probability instead of relevance: diversity dies, and only the statistically common answers survive. To break that behavior, we override the model’s default priors with structured context, domain rules, and explicit constraints. Without that, Otto would still be suggesting “United at 9 AM and the Marriott by the airport” for every trip.

The Core Problem: LLM Collapse in Travel Recommendations
The Collapse Phenomenon
- Large language models, trained on broad internet data, develop strong statistical priors toward high-frequency answers.
- In travel planning, this leads to collapse: the model repeatedly recommends the same 3–5 “safe” options - major airline hubs, global hotel chains, and peak-time itineraries - regardless of user profile or trip context.
- Instead of exploring diverse reasoning paths, decoding concentrates on the head of the distribution: the highest-probability next tokens win, and lower-probability yet more relevant alternatives never surface.
- Result: Recommendations converge to generic, low-variance outputs that fail to capture individual preferences, contextual constraints, or domain expertise.
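The collapse mechanic is easy to see on a toy next-token distribution: under greedy selection the statistically dominant option wins every single time, while even plain probability-proportional sampling keeps the tail alive. A minimal sketch (the airline options and probabilities are invented for illustration):

```python
import random
from collections import Counter

# Invented prior over flight recommendations, mimicking a
# popularity-skewed head: the "safe" option dominates.
prior = {"United 9am": 0.55, "Delta 7am": 0.25,
         "Alaska 11am": 0.12, "JetBlue 2pm": 0.08}

def greedy(dist):
    # Greedy decoding: always pick the highest-probability option.
    return max(dist, key=dist.get)

def sample(dist, rng):
    # Probability-proportional sampling preserves the tail.
    options, weights = zip(*dist.items())
    return rng.choices(options, weights=weights)[0]

rng = random.Random(0)
greedy_picks = Counter(greedy(prior) for _ in range(100))
sampled_picks = Counter(sample(prior, rng) for _ in range(100))

print(greedy_picks)   # 100x the same flight: total collapse
print(sampled_picks)  # the less common flights still appear
```

Context engineering attacks the same failure from the other direction: instead of changing how we sample, it reshapes the distribution itself so that the relevant option becomes the high-probability one.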
Why Traditional RAG Fails
- Retrieval-augmented generation enriches the prompt with external facts, but it doesn’t fundamentally alter the model’s internal prior.
- The LLM’s pre-training bias - “most travelers book X” - still dominates, drowning out retrieved, user-specific context.
- Without explicit context engineering to re-prioritize personalized data and decision heuristics, the model treats specialized travel knowledge as secondary evidence rather than the primary decision source.
Multi-Layered Context Engineering
Architecture Overview

We implement a three-tier context injection system that progressively narrows from broad domain knowledge to specific trip requirements, strategically overriding generic training data with specialized, hierarchical constraints:
Tier 1: Foundation Context (Broadest Layer)
The base layer establishes domain expertise and collective behavioral patterns that inform all recommendations.
Domain Knowledge: Expert Travel Industry Intelligence
- Inject expertise from executive travel arrangers and senior travel industry professionals directly
- Core competencies: comfort of route, seat, and room; upgrade strategies; airline alliance optimization; loyalty points; routing efficiency
- Examples of domain rules:
- Domestic flights: prioritize aisle seats for easier movement and uninterrupted productivity
- Avoid red-eye flights unless the traveler specifically requests one, since sleep loss hurts next-day performance
- Hotel location hierarchy: walking distance to offices or meeting venues > room size > facility amenities
- Layover optimization: avoid layovers on domestic routes; on international routes, a 2-3 hour connection is optimal
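Rules like these override generic priors most reliably when they are injected as explicit, machine-checkable constraints rather than free text. A minimal sketch of one way to encode them; the rule names and candidate fields (`is_redeye`, `layover_hours`, etc.) are illustrative assumptions, not our production schema:

```python
# Each domain rule is a named predicate over a candidate itinerary dict.
# Field names (e.g. "is_redeye", "layover_hours") are illustrative.
DOMAIN_RULES = [
    ("no_redeye", lambda c: not c.get("is_redeye")
        or c.get("traveler_requested_redeye", False)),
    ("aisle_on_domestic", lambda c: not c["domestic"]
        or c.get("seat") == "aisle"),
    ("layover_policy", lambda c:
        c.get("layover_hours", 0) == 0 if c["domestic"]
        else 2 <= c.get("layover_hours", 0) <= 3),
]

def violated_rules(candidate):
    """Return names of the domain rules this candidate breaks."""
    return [name for name, ok in DOMAIN_RULES if not ok(candidate)]

flight = {"domestic": True, "is_redeye": True,
          "seat": "middle", "layover_hours": 1}
print(violated_rules(flight))
# -> ['no_redeye', 'aisle_on_domestic', 'layover_policy']
```

Keeping the rules as data also makes it cheap to render them into the prompt as numbered axioms, which we found sticks better than prose.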
Collective Wisdom: Aggregated Beta User Patterns
- Statistical behavioral data from travel industry executives and road warriors (beta user cohort)
- Convert qualitative feedback into quantitative preference signals
- Examples:
- "Prefer changeable over refundable for cost-flexibility balance"
- "Avoid morning flights <7am (fatigue) and redeyes (next-day performance)"
- Key insight: Peer behavior from similar professional profiles breaks generic consumer patterns more effectively than abstract personalization
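Converting qualitative cohort feedback into quantitative signals can be as simple as tallying choices and promoting only strong majorities to explicit rules. A sketch under invented data (the observations and the 60% threshold are illustrative assumptions):

```python
from collections import Counter

# Hypothetical beta-cohort booking log: (attribute, choice) pairs.
observations = [
    ("fare_type", "changeable"), ("fare_type", "changeable"),
    ("fare_type", "refundable"), ("fare_type", "changeable"),
    ("departure", "morning"), ("departure", "morning"),
    ("departure", "redeye"), ("departure", "morning"),
]

def preference_signals(obs, min_share=0.6):
    """Turn raw cohort choices into quantified preference rules."""
    by_attr = {}
    for attr, choice in obs:
        by_attr.setdefault(attr, Counter())[choice] += 1
    rules = {}
    for attr, counts in by_attr.items():
        choice, n = counts.most_common(1)[0]
        share = n / sum(counts.values())
        if share >= min_share:  # only promote strong majorities
            rules[attr] = (choice, round(share, 2))
    return rules

print(preference_signals(observations))
# -> {'fare_type': ('changeable', 0.75), 'departure': ('morning', 0.75)}
```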
Purpose of Foundation Context: Overrides the model's training on general consumer travel patterns with work travel norms.
Tier 2: Specificity Context (Middle Layer)
This layer personalizes the foundation with company-specific policies, individual user preferences, and destination-specific intelligence.
User-Specific Priors: Individual Travel History
- Parse and structure historical booking patterns: airlines, hotel brands, booking windows, seat preferences, layover tolerance
- Build quantified preference profiles that override population-level defaults
- Examples of extracted patterns:
- "Last 12 flights: 10/12 Delta (83%), 11/12 Morning departure (92%), 0/12 Redeyes (0%)"
- "Hotel booking: 7/8 Marriott properties, avg 3.2 nights, high-floor preference (6/8)"
- Key insight: Making implicit preferences explicit and mandatory prevents collapse to population-level defaults, but policy still wins in conflicts
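Lines like "10/12 Delta (83%)" can be produced mechanically from a booking log, which keeps the profile honest and refreshable. A sketch with an invented 12-flight history chosen to mirror the patterns above:

```python
def profile_line(label, hits, total):
    """Format one history pattern as an explicit, quantified line."""
    return f"{label}: {hits}/{total} ({round(100 * hits / total)}%)"

# Invented 12-flight history for illustration.
flights = (
    [{"airline": "Delta", "depart_hour": 8, "redeye": False}] * 10
    + [{"airline": "United", "depart_hour": 9, "redeye": False},
       {"airline": "United", "depart_hour": 13, "redeye": False}]
)

n = len(flights)
lines = [
    profile_line("Delta", sum(f["airline"] == "Delta" for f in flights), n),
    profile_line("Morning departure",
                 sum(f["depart_hour"] < 12 for f in flights), n),
    profile_line("Redeyes", sum(f["redeye"] for f in flights), n),
]
print("; ".join(lines))
# -> Delta: 10/12 (83%); Morning departure: 11/12 (92%); Redeyes: 0/12 (0%)
```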
Destination-Specific Knowledge: Hyper-Local Intelligence
- Dynamic context loading for each destination: neighborhood characteristics, seasonal weather, traffic patterns, local events
- Real-time travel intelligence: seat/room availability, recent reviews (last 30 days), pricing trends, local advisories
- Examples:
- "NYC March: 40-55°F, 30% rain, midtown traffic peak 8-10am, 5-7pm"
- "350 Fifth Ave access: Best via JFK (longer) or LGA (traffic), avoid EWR morning commute"
- "Marriott Marquis: 0.3mi walk (6min), recent reviews note slow elevators during convention season"
- Key insight: Hyper-local context forces the model to reason about specific trade-offs (distance vs. traffic vs. cost) rather than generic "best practices"
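Dynamic loading here just means rendering the destination record that matches the current trip into a compact context block. A sketch; the record shape and every fact in it are illustrative assumptions drawn from the examples above:

```python
# Hypothetical destination record; all facts are illustrative.
nyc_march = {
    "city": "NYC", "month": "March",
    "weather": "40-55°F, 30% rain",
    "traffic_peaks": ["8-10am", "5-7pm"],
    "venue_notes": {
        "350 Fifth Ave": "Best via JFK (longer) or LGA (traffic); "
                         "avoid EWR morning commute",
    },
}

def destination_context(dest, venue):
    """Render hyper-local facts as a compact context block."""
    lines = [
        f"{dest['city']} {dest['month']}: {dest['weather']}, "
        f"traffic peaks {', '.join(dest['traffic_peaks'])}",
        f"{venue}: {dest['venue_notes'][venue]}",
    ]
    return "\n".join(lines)

print(destination_context(nyc_march, "350 Fifth Ave"))
```

Only the records relevant to the current destination and venue are loaded, which keeps the context window spent on trade-offs the model actually has to reason about.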
Company Policy
- Policy evaluation as a strong signal:
- Travel class restrictions (economy domestic, business international)
- Ticket flexibility mandates (changeable required, refundable not allowed)
- Advance booking windows (minimum 14 days for international)
- Other flexible rules (e.g., flight price no more than 1.5x the route's average price over the past 30 days)
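Policy checks are deterministic, so they can run outside the model and feed the result in as a strong signal. A sketch of how the rules above might be evaluated; the field names and example values are illustrative assumptions:

```python
from datetime import date

def policy_violations(trip, candidate, avg_price_30d):
    """Evaluate a candidate fare against the company policy rules."""
    violations = []
    # Travel class: economy on domestic, business allowed international.
    if trip["domestic"] and candidate["cabin"] != "economy":
        violations.append("cabin: economy required on domestic")
    # Flexibility mandate: changeable required, refundable not allowed.
    if candidate["fare_type"] != "changeable":
        violations.append("fare: changeable ticket required")
    # Advance window: minimum 14 days for international.
    lead_days = (trip["depart_date"] - trip["booking_date"]).days
    if not trip["domestic"] and lead_days < 14:
        violations.append("booking: <14 days before international departure")
    # Flexible price rule: no more than 1.5x the 30-day average.
    if candidate["price"] > 1.5 * avg_price_30d:
        violations.append("price: exceeds 1.5x 30-day average")
    return violations

trip = {"domestic": False, "booking_date": date(2025, 3, 3),
        "depart_date": date(2025, 3, 14)}
fare = {"cabin": "business", "fare_type": "refundable", "price": 2400}
print(policy_violations(trip, fare, avg_price_30d=1200))
```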
Purpose of Specificity Context: Creates the personalized operating parameters within policy boundaries—this is where the recommendation becomes tailored to the individual while respecting organizational constraints.
Tier 3: Trip Context (Narrowest, Highest Priority)
The top layer contains trip-specific constraints that override all lower layers when conflicts arise.
Trip-Specific Constraints: Per-Trip Requirements
- Extracted from current conversation: meeting schedules, event information, traveling companions, special needs
- Critical distinction: These are immediate, concrete requirements, not historical patterns or general preferences
- Examples:
- Meeting location and time: "350 Fifth Avenue, March 15 at 9:00am"
- Arrival deadlines: "Must arrive evening of March 14 (hotel check-in before meeting)"
- Special requirements: "Traveling with CEO—need adjacent hotel rooms"
- Event-driven constraints: "Conference badge pickup 7-8am, sessions start 8:30am"
- Companion needs: "Colleague has mobility issues—wheelchair accessible hotel required"
- Key insight: Trip context is the "right now" layer—it represents the specific problem to solve, not general tendencies
Purpose of Trip Context: Ensures the recommendation solves the actual current need, not an idealized or historical scenario.
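Because trip constraints are immediate and concrete, they behave like hard filters rather than scores: any candidate that fails one is dropped before ranking. A sketch under invented candidates and constraints:

```python
from datetime import datetime

# Illustrative trip constraints extracted from the conversation.
constraints = {
    "arrive_by": datetime(2025, 3, 14, 21, 0),   # evening of March 14
    "wheelchair_accessible_hotel": True,
}

def satisfies_trip_constraints(candidate, c):
    """Hard filter: a candidate failing any trip constraint is dropped."""
    if candidate["arrival"] > c["arrive_by"]:
        return False
    if c.get("wheelchair_accessible_hotel") and not candidate["hotel_ada"]:
        return False
    return True

candidates = [
    {"id": "A", "arrival": datetime(2025, 3, 14, 18, 30), "hotel_ada": True},
    {"id": "B", "arrival": datetime(2025, 3, 15, 7, 45), "hotel_ada": True},
    {"id": "C", "arrival": datetime(2025, 3, 14, 19, 10), "hotel_ada": False},
]
keep = [c["id"] for c in candidates
        if satisfies_trip_constraints(c, constraints)]
print(keep)  # -> ['A']
```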
Why This Structure Prevents Collapse:
- Specificity increases up the pyramid: Broad domain knowledge → Narrow trip requirements
- Override strength increases up the pyramid: Foundation provides defaults, Trip context provides overrides, and the ordering itself resolves conflicts
- Model's training data sits below all tiers: Generic internet patterns are systematically displaced by structured, relevant context
- The "funnel" prevents generic solutions: By the time all three tiers are applied, the solution space has been narrowed from "all possible flights" to "flights that satisfy this specific user's company policy for this particular trip"
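The funnel can be sketched as an ordered prompt assembler in which each tier is labeled with its override strength, so the model sees the priority explicitly instead of inferring it. The tier labels and example rules below are illustrative, not our exact prompt text:

```python
def assemble_context(foundation, specificity, trip):
    """Stack the three tiers; later tiers explicitly override earlier ones."""
    sections = [
        ("TIER 1 - FOUNDATION (defaults)", foundation),
        ("TIER 2 - SPECIFICITY (overrides Tier 1)", specificity),
        ("TIER 3 - TRIP (overrides everything; hard constraints)", trip),
    ]
    blocks = []
    for title, rules in sections:
        blocks.append(title + "\n" + "\n".join(f"- {r}" for r in rules))
    return "\n\n".join(blocks)

prompt_context = assemble_context(
    foundation=["Avoid red-eyes unless explicitly requested"],
    specificity=["User prefers Delta (10/12 recent flights)",
                 "Policy: changeable fares required"],
    trip=["Must arrive NYC by evening of March 14"],
)
print(prompt_context)
```

Putting the override order in the section titles themselves is one of the explicit priority markers discussed under Lessons Learned.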
Results: From Generic to Hyper-Personalized
Metric: percentage of trips in which the user booked one of the top 6 recommended flight fare options or the top 4 hotel/room options
Before context engineering: 43% of recommendations were picked by the user
After context engineering: 87% of recommendations were picked by the user
Lessons Learned
What Works
- Explicit priority markers in prompts are more effective than hoping the model infers importance
- Breaking recommendation generation into "generate candidates → evaluate holistically" stages prevents premature collapse
- User history formatted as rules ("ALWAYS prefers X") works better than examples ("User previously chose X")
- Domain expertise injected as axioms creates stronger guardrails than RAG-style retrieval
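The "generate candidates → evaluate holistically" split can be sketched as two stages with distinct roles. Both stages are stubbed with plain Python here; in production each would be a separate model call (hot, diversity-seeking sampling first, then strict scoring against all context tiers). The candidate fields and weights are invented:

```python
def generate_candidates(pool, n=4):
    """Stage 1: over-generate a diverse slate instead of one answer."""
    # Stub: keep the first n dissimilar options rather than top-n by prior.
    return pool[:n]

def evaluate(candidate, weights):
    """Stage 2: score each candidate against all context tiers at once."""
    return sum(weights[k] * candidate[k] for k in weights)

pool = [
    {"id": "Delta 8am", "fits_policy": 1, "matches_history": 1, "price_ok": 0},
    {"id": "United 9am", "fits_policy": 1, "matches_history": 0, "price_ok": 1},
    {"id": "JetBlue 2pm", "fits_policy": 1, "matches_history": 0, "price_ok": 1},
]
weights = {"fits_policy": 3, "matches_history": 2, "price_ok": 1}
slate = generate_candidates(pool)
best = max(slate, key=lambda c: evaluate(c, weights))
print(best["id"])  # -> Delta 8am
```

Because selection happens only after the full slate exists, the statistically common option cannot win by being emitted first.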
What Doesn't Work
- Simply adding more data to context without structure makes collapse worse (more noise)
- Hoping the model will "learn" user preferences from conversation alone without explicit profile
- Generic "be creative" or "think outside the box" prompts have near-zero effect on reducing collapse
Open Challenges
- Prompting the user to share more specific trip constraints
- Latency introduced by reasoning, reflection and complex context retrieval
- Balancing context length limits with comprehensive knowledge injection
References & Further Reading
- Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., & Anderson, R. (2024). AI models collapse when trained on recursively generated data. Nature, 631, 755-759. https://www.nature.com/articles/s41586-024-07566-y, Foundational paper on model collapse: how LLMs lose tail distributions and default to common patterns when trained on synthetic data
- Dohmatob, E., Feng, Y., & Kempe, J. (2024). Model Collapse Demystified: The Case of Regression. arXiv preprint arXiv:2402.07712. https://arxiv.org/abs/2402.07712, Theoretical analysis of model collapse mechanisms: finite sampling bias and peaked distributions
- Zhang, Y., et al. (2025). Outcome-based Exploration for LLM Reasoning. arXiv preprint arXiv:2509.06941. https://arxiv.org/abs/2509.06941, Addresses diversity collapse in RL-trained LLMs; proposes exploration bonuses to prevent concentration on common correct answers
- Model Collapse Explained: How Synthetic Training Data Breaks AI. TechTarget. https://www.techtarget.com/whatis/feature/Model-collapse-explained-How-synthetic-training-data-breaks-AI, Accessible overview of collapse phenomenon and practical implications for recommendation systems
- Mananghat, S. (2024). Is LLM Model Collapse Inevitable? Medium. https://sanoojm.medium.com/is-llm-model-collapse-inevitable-2cb068128207, Discussion of diversity loss in AI-generated content and mitigation strategies
- What Is Model Collapse? IBM Research. https://www.ibm.com/think/topics/model-collapse, Comprehensive overview: causes, impacts on LLMs, and solutions including data provenance tracking
- Troise, A. (2024). A Reflection on the Phenomenon of LLM Model Collapse Leading to the Decline in AI Quality. Medium. https://levysoft.medium.com/a-reflection-on-the-phenomenon-of-llm-model-collapse-leading-to-the-decline-in-ai-quality-a6993f86866c, Analysis of data pollution and "digital inbreeding" effects on model quality
- Yao, S., Yu, D., Zhao, J., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Prompt Engineering Guide. https://www.promptingguide.ai/techniques/tot, Practical guide to implementing structured reasoning with explicit context hierarchies
- Prompt Engineering Overview. Anthropic Documentation. https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview, Best practices for context structuring, priority markers, and instruction hierarchy
- Advanced Prompt Engineering Techniques. OpenAI Cookbook. https://cookbook.openai.com/, Strategies for structured prompts, few-shot learning, and context window management
- JSON Mode and Structured Outputs. OpenAI Documentation. https://platform.openai.com/docs/guides/structured-outputs, Technical implementation of JSON schema enforcement for reliable structured generation
- Constrained Decoding for Structured Generation. Hugging Face Blog. https://huggingface.co/blog/constrained-beam-search, How structured output formats improve logical reasoning and instruction-following
- Why JSON Improves LLM Reasoning. Anthropic Research. https://www.anthropic.com/research, Research on how structured formats trigger different reasoning pathways in language models
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Papers with Code. https://paperswithcode.com/method/rag, Foundation for dynamic context injection and knowledge retrieval strategies
- Advanced RAG Patterns. LlamaIndex Documentation. https://docs.llamaindex.ai/en/stable/, Hierarchical retrieval, context prioritization, and multi-source knowledge integration
- Context Window Management Strategies. Anthropic Blog. https://www.anthropic.com/index/claude-2-1-prompting, Techniques for managing long context windows and maintaining attention on critical information
- LLMs for Travel and Hospitality. Tourism Analysis. https://www.tandfonline.com/journals/rtxg20, Academic research on applying AI to personalized travel recommendation systems
