Escaping LLM Collapse: Why AI Keeps Recommending the Same 3 Flights (and How We Broke It)
Model collapse happens when an LLM optimizes for probability instead of relevance: diversity dies, and only the statistically common answers survive. To break that behavior, we override the model’s default priors with structured context, domain rules, and explicit constraints. Without that, Otto would still be suggesting “United at 9 AM and the Marriott by the airport” for every trip.

The Core Problem: LLM Collapse in Travel Recommendations
The Collapse Phenomenon
- Large language models, trained on broad internet data, develop strong statistical priors toward high-frequency answers.
- In travel planning, this leads to collapse: the model repeatedly recommends the same 3–5 “safe” options - major airline hubs, global hotel chains, and peak-time itineraries - regardless of user profile or trip context.
- Instead of exploring diverse reasoning paths, decoding concentrates on the head of the distribution: the highest-probability next tokens win, and lower-probability yet more relevant alternatives never surface.
- Result: Recommendations converge to generic, low-variance outputs that fail to capture individual preferences, contextual constraints, or domain expertise.
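The collapse mechanic is easy to see on a toy next-token distribution: under greedy selection the statistically dominant option wins every single time, while even plain probability-proportional sampling keeps the tail alive. A minimal sketch (the airline options and probabilities are invented for illustration):

```python
import random
from collections import Counter

# Invented prior over flight recommendations, mimicking a
# popularity-skewed head: the "safe" option dominates.
prior = {"United 9am": 0.55, "Delta 7am": 0.25,
         "Alaska 11am": 0.12, "JetBlue 2pm": 0.08}

def greedy(dist):
    # Greedy decoding: always pick the highest-probability option.
    return max(dist, key=dist.get)

def sample(dist, rng):
    # Probability-proportional sampling preserves the tail.
    options, weights = zip(*dist.items())
    return rng.choices(options, weights=weights)[0]

rng = random.Random(0)
greedy_picks = Counter(greedy(prior) for _ in range(100))
sampled_picks = Counter(sample(prior, rng) for _ in range(100))

print(greedy_picks)   # 100x the same flight: total collapse
print(sampled_picks)  # the less common flights still appear
```

Context engineering attacks the same failure from the other direction: instead of changing how we sample, it reshapes the distribution itself so that the relevant option becomes the high-probability one.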
Why Traditional RAG Fails
- Retrieval-augmented generation enriches the prompt with external facts, but it doesn’t fundamentally alter the model’s internal prior.
- The LLM’s pre-training bias - “most travelers book X” - still dominates, drowning out retrieved, user-specific context.
- Without explicit context engineering to re-prioritize personalized data and decision heuristics, the model treats specialized travel knowledge as secondary evidence rather than the primary decision source.
Multi-Layered Context Engineering
Architecture Overview

We implement a three-tier context injection system that progressively narrows from broad domain knowledge to specific trip requirements, strategically overriding generic training data with specialized, hierarchical constraints:
Tier 1: Foundation Context (Broadest Layer)
The base layer establishes domain expertise and collective behavioral patterns that inform all recommendations.
Domain Knowledge: Expert Travel Industry Intelligence
- Inject expertise from executive travel arrangers and senior travel industry professionals directly
- Core competencies: comfort of route, seat, and room; upgrade strategies; airline alliance optimization; loyalty points; routing efficiency
- Examples of domain rules:
- Domestic flights: prioritize aisle seats for easier movement and uninterrupted productivity
- Avoid red-eye flights unless the traveler specifically requests one, since sleep loss hurts next-day performance
- Hotel location hierarchy: walking distance to offices or meeting venues > room size > facility amenities
- Layover optimization: avoid layovers on domestic routes; on international routes, a 2-3 hour connection is optimal
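Rules like these override generic priors most reliably when they are injected as explicit, machine-checkable constraints rather than free text. A minimal sketch of one way to encode them; the rule names and candidate fields (`is_redeye`, `layover_hours`, etc.) are illustrative assumptions, not our production schema:

```python
# Each domain rule is a named predicate over a candidate itinerary dict.
# Field names (e.g. "is_redeye", "layover_hours") are illustrative.
DOMAIN_RULES = [
    ("no_redeye", lambda c: not c.get("is_redeye")
        or c.get("traveler_requested_redeye", False)),
    ("aisle_on_domestic", lambda c: not c["domestic"]
        or c.get("seat") == "aisle"),
    ("layover_policy", lambda c:
        c.get("layover_hours", 0) == 0 if c["domestic"]
        else 2 <= c.get("layover_hours", 0) <= 3),
]

def violated_rules(candidate):
    """Return names of the domain rules this candidate breaks."""
    return [name for name, ok in DOMAIN_RULES if not ok(candidate)]

flight = {"domestic": True, "is_redeye": True,
          "seat": "middle", "layover_hours": 1}
print(violated_rules(flight))
# -> ['no_redeye', 'aisle_on_domestic', 'layover_policy']
```

Keeping the rules as data also makes it cheap to render them into the prompt as numbered axioms, which we found sticks better than prose.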
Collective Wisdom: Aggregated Beta User Patterns
- Statistical behavioral data from travel industry executives and road warriors (beta user cohort)
- Convert qualitative feedback into quantitative preference signals
- Examples:
- "Prefer changeable over refundable for cost-flexibility balance"
- "Avoid morning flights <7am (fatigue) and redeyes (next-day performance)"
- Key insight: Peer behavior from similar professional profiles breaks generic consumer patterns more effectively than abstract personalization
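Converting qualitative cohort feedback into quantitative signals can be as simple as tallying choices and promoting only strong majorities to explicit rules. A sketch under invented data (the observations and the 60% threshold are illustrative assumptions):

```python
from collections import Counter

# Hypothetical beta-cohort booking log: (attribute, choice) pairs.
observations = [
    ("fare_type", "changeable"), ("fare_type", "changeable"),
    ("fare_type", "refundable"), ("fare_type", "changeable"),
    ("departure", "morning"), ("departure", "morning"),
    ("departure", "redeye"), ("departure", "morning"),
]

def preference_signals(obs, min_share=0.6):
    """Turn raw cohort choices into quantified preference rules."""
    by_attr = {}
    for attr, choice in obs:
        by_attr.setdefault(attr, Counter())[choice] += 1
    rules = {}
    for attr, counts in by_attr.items():
        choice, n = counts.most_common(1)[0]
        share = n / sum(counts.values())
        if share >= min_share:  # only promote strong majorities
            rules[attr] = (choice, round(share, 2))
    return rules

print(preference_signals(observations))
# -> {'fare_type': ('changeable', 0.75), 'departure': ('morning', 0.75)}
```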
Purpose of Foundation Context: Overrides the model's training on general consumer travel patterns with work travel norms.
Tier 2: Specificity Context (Middle Layer)
This layer personalizes the foundation with company-specific policies, individual user preferences, and destination-specific intelligence.
User-Specific Priors: Individual Travel History
- Parse and structure historical booking patterns: airlines, hotel brands, booking windows, seat preferences, layover tolerance
- Build quantified preference profiles that override population-level defaults
- Examples of extracted patterns:
- "Last 12 flights: 10/12 Delta (83%), 11/12 Morning departure (92%), 0/12 Redeyes (0%)"
- "Hotel booking: 7/8 Marriott properties, avg 3.2 nights, high-floor preference (6/8)"
- Key insight: Making implicit preferences explicit and mandatory prevents collapse to population-level defaults, but policy still wins in conflicts
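Lines like "10/12 Delta (83%)" can be produced mechanically from a booking log, which keeps the profile honest and refreshable. A sketch with an invented 12-flight history chosen to mirror the patterns above:

```python
def profile_line(label, hits, total):
    """Format one history pattern as an explicit, quantified line."""
    return f"{label}: {hits}/{total} ({round(100 * hits / total)}%)"

# Invented 12-flight history for illustration.
flights = (
    [{"airline": "Delta", "depart_hour": 8, "redeye": False}] * 10
    + [{"airline": "United", "depart_hour": 9, "redeye": False},
       {"airline": "United", "depart_hour": 13, "redeye": False}]
)

n = len(flights)
lines = [
    profile_line("Delta", sum(f["airline"] == "Delta" for f in flights), n),
    profile_line("Morning departure",
                 sum(f["depart_hour"] < 12 for f in flights), n),
    profile_line("Redeyes", sum(f["redeye"] for f in flights), n),
]
print("; ".join(lines))
# -> Delta: 10/12 (83%); Morning departure: 11/12 (92%); Redeyes: 0/12 (0%)
```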
Destination-Specific Knowledge: Hyper-Local Intelligence
- Dynamic context loading for each destination: neighborhood characteristics, seasonal weather, traffic patterns, local events
- Real-time travel intelligence: seat/room availability, recent reviews (last 30 days), pricing trends, local advisories
- Examples:
- "NYC March: 40-55°F, 30% rain, midtown traffic peak 8-10am, 5-7pm"
- "350 Fifth Ave access: Best via JFK (longer) or LGA (traffic), avoid EWR morning commute"
- "Marriott Marquis: 0.3mi walk (6min), recent reviews note slow elevators during convention season"
- Key insight: Hyper-local context forces the model to reason about specific trade-offs (distance vs. traffic vs. cost) rather than generic "best practices"
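Dynamic loading here just means rendering the destination record that matches the current trip into a compact context block. A sketch; the record shape and every fact in it are illustrative assumptions drawn from the examples above:

```python
# Hypothetical destination record; all facts are illustrative.
nyc_march = {
    "city": "NYC", "month": "March",
    "weather": "40-55°F, 30% rain",
    "traffic_peaks": ["8-10am", "5-7pm"],
    "venue_notes": {
        "350 Fifth Ave": "Best via JFK (longer) or LGA (traffic); "
                         "avoid EWR morning commute",
    },
}

def destination_context(dest, venue):
    """Render hyper-local facts as a compact context block."""
    lines = [
        f"{dest['city']} {dest['month']}: {dest['weather']}, "
        f"traffic peaks {', '.join(dest['traffic_peaks'])}",
        f"{venue}: {dest['venue_notes'][venue]}",
    ]
    return "\n".join(lines)

print(destination_context(nyc_march, "350 Fifth Ave"))
```

Only the records relevant to the current destination and venue are loaded, which keeps the context window spent on trade-offs the model actually has to reason about.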
Company Policy
- Policy evaluation as a strong signal:
- Travel class restrictions (economy domestic, business international)
- Ticket flexibility mandates (changeable required, refundable not allowed)
- Advance booking windows (minimum 14 days for international)
- Other flexible rules (e.g., flight price no more than 1.5x the route's average price over the past 30 days)
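Policy checks are deterministic, so they can run outside the model and feed the result in as a strong signal. A sketch of how the rules above might be evaluated; the field names and example values are illustrative assumptions:

```python
from datetime import date

def policy_violations(trip, candidate, avg_price_30d):
    """Evaluate a candidate fare against the company policy rules."""
    violations = []
    # Travel class: economy on domestic, business allowed international.
    if trip["domestic"] and candidate["cabin"] != "economy":
        violations.append("cabin: economy required on domestic")
    # Flexibility mandate: changeable required, refundable not allowed.
    if candidate["fare_type"] != "changeable":
        violations.append("fare: changeable ticket required")
    # Advance window: minimum 14 days for international.
    lead_days = (trip["depart_date"] - trip["booking_date"]).days
    if not trip["domestic"] and lead_days < 14:
        violations.append("booking: <14 days before international departure")
    # Flexible price rule: no more than 1.5x the 30-day average.
    if candidate["price"] > 1.5 * avg_price_30d:
        violations.append("price: exceeds 1.5x 30-day average")
    return violations

trip = {"domestic": False, "booking_date": date(2025, 3, 3),
        "depart_date": date(2025, 3, 14)}
fare = {"cabin": "business", "fare_type": "refundable", "price": 2400}
print(policy_violations(trip, fare, avg_price_30d=1200))
```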
Purpose of Specificity Context: Creates the personalized operating parameters within policy boundaries—this is where the recommendation becomes tailored to the individual while respecting organizational constraints.
Tier 3: Trip Context (Narrowest, Highest Priority)
The top layer contains trip-specific constraints that override all lower layers when conflicts arise.
Trip-Specific Constraints: Per-Trip Requirements
- Extracted from current conversation: meeting schedules, event information, traveling companions, special needs
- Critical distinction: These are immediate, concrete requirements, not historical patterns or general preferences
- Examples:
- Meeting location and time: "350 Fifth Avenue, March 15 at 9:00am"
- Arrival deadlines: "Must arrive evening of March 14 (hotel check-in before meeting)"
- Special requirements: "Traveling with CEO—need adjacent hotel rooms"
- Event-driven constraints: "Conference badge pickup 7-8am, sessions start 8:30am"
- Companion needs: "Colleague has mobility issues—wheelchair accessible hotel required"
- Key insight: Trip context is the "right now" layer—it represents the specific problem to solve, not general tendencies
Purpose of Trip Context: Ensures the recommendation solves the actual current need, not an idealized or historical scenario.
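Because trip constraints are immediate and concrete, they behave like hard filters rather than scores: any candidate that fails one is dropped before ranking. A sketch under invented candidates and constraints:

```python
from datetime import datetime

# Illustrative trip constraints extracted from the conversation.
constraints = {
    "arrive_by": datetime(2025, 3, 14, 21, 0),   # evening of March 14
    "wheelchair_accessible_hotel": True,
}

def satisfies_trip_constraints(candidate, c):
    """Hard filter: a candidate failing any trip constraint is dropped."""
    if candidate["arrival"] > c["arrive_by"]:
        return False
    if c.get("wheelchair_accessible_hotel") and not candidate["hotel_ada"]:
        return False
    return True

candidates = [
    {"id": "A", "arrival": datetime(2025, 3, 14, 18, 30), "hotel_ada": True},
    {"id": "B", "arrival": datetime(2025, 3, 15, 7, 45), "hotel_ada": True},
    {"id": "C", "arrival": datetime(2025, 3, 14, 19, 10), "hotel_ada": False},
]
keep = [c["id"] for c in candidates
        if satisfies_trip_constraints(c, constraints)]
print(keep)  # -> ['A']
```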
Why This Structure Prevents Collapse:
- Specificity increases up the pyramid: Broad domain knowledge → Narrow trip requirements
- Override strength increases up the pyramid: Foundation provides defaults, Trip context provides overrides, and the ordering itself resolves conflicts
- Model's training data sits below all tiers: Generic internet patterns are systematically displaced by structured, relevant context
- The "funnel" prevents generic solutions: By the time all three tiers are applied, the solution space has been narrowed from "all possible flights" to "flights that satisfy this specific user's company policy for this particular trip"
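The funnel can be sketched as an ordered prompt assembler in which each tier is labeled with its override strength, so the model sees the priority explicitly instead of inferring it. The tier labels and example rules below are illustrative, not our exact prompt text:

```python
def assemble_context(foundation, specificity, trip):
    """Stack the three tiers; later tiers explicitly override earlier ones."""
    sections = [
        ("TIER 1 - FOUNDATION (defaults)", foundation),
        ("TIER 2 - SPECIFICITY (overrides Tier 1)", specificity),
        ("TIER 3 - TRIP (overrides everything; hard constraints)", trip),
    ]
    blocks = []
    for title, rules in sections:
        blocks.append(title + "\n" + "\n".join(f"- {r}" for r in rules))
    return "\n\n".join(blocks)

prompt_context = assemble_context(
    foundation=["Avoid red-eyes unless explicitly requested"],
    specificity=["User prefers Delta (10/12 recent flights)",
                 "Policy: changeable fares required"],
    trip=["Must arrive NYC by evening of March 14"],
)
print(prompt_context)
```

Putting the override order in the section titles themselves is one of the explicit priority markers discussed under Lessons Learned.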
Results: From Generic to Hyper-Personalized
Metric: percentage of trips in which the user booked one of the top 6 recommended flight fare options or the top 4 hotel/room options
Before context engineering: 43% of recommendations were picked by the user
After context engineering: 87% of recommendations were picked by the user
Lessons Learned
What Works
- Explicit priority markers in prompts are more effective than hoping the model infers importance
- Breaking recommendation generation into "generate candidates → evaluate holistically" stages prevents premature collapse
- User history formatted as rules ("ALWAYS prefers X") works better than examples ("User previously chose X")
- Domain expertise injected as axioms creates stronger guardrails than RAG-style retrieval
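The "generate candidates → evaluate holistically" split can be sketched as two stages with distinct roles. Both stages are stubbed with plain Python here; in production each would be a separate model call (hot, diversity-seeking sampling first, then strict scoring against all context tiers). The candidate fields and weights are invented:

```python
def generate_candidates(pool, n=4):
    """Stage 1: over-generate a diverse slate instead of one answer."""
    # Stub: keep the first n dissimilar options rather than top-n by prior.
    return pool[:n]

def evaluate(candidate, weights):
    """Stage 2: score each candidate against all context tiers at once."""
    return sum(weights[k] * candidate[k] for k in weights)

pool = [
    {"id": "Delta 8am", "fits_policy": 1, "matches_history": 1, "price_ok": 0},
    {"id": "United 9am", "fits_policy": 1, "matches_history": 0, "price_ok": 1},
    {"id": "JetBlue 2pm", "fits_policy": 1, "matches_history": 0, "price_ok": 1},
]
weights = {"fits_policy": 3, "matches_history": 2, "price_ok": 1}
slate = generate_candidates(pool)
best = max(slate, key=lambda c: evaluate(c, weights))
print(best["id"])  # -> Delta 8am
```

Because selection happens only after the full slate exists, the statistically common option cannot win by being emitted first.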
What Doesn't Work
- Simply adding more data to context without structure makes collapse worse (more noise)
- Hoping the model will "learn" user preferences from conversation alone without explicit profile
- Generic "be creative" or "think outside the box" prompts have near-zero effect on reducing collapse
Open Challenges
- Prompting the user to share more specific trip constraints
- Latency introduced by reasoning, reflection and complex context retrieval
- Balancing context length limits with comprehensive knowledge injection
References & Further Reading
- Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., & Anderson, R. (2024). AI models collapse when trained on recursively generated data. Nature, 631, 755-759. https://www.nature.com/articles/s41586-024-07566-y, Foundational paper on model collapse: how LLMs lose tail distributions and default to common patterns when trained on synthetic data
- Dohmatob, E., Feng, Y., & Kempe, J. (2024). Model Collapse Demystified: The Case of Regression. arXiv preprint arXiv:2402.07712. https://arxiv.org/abs/2402.07712, Theoretical analysis of model collapse mechanisms: finite sampling bias and peaked distributions
- Zhang, Y., et al. (2025). Outcome-based Exploration for LLM Reasoning. arXiv preprint arXiv:2509.06941. https://arxiv.org/abs/2509.06941, Addresses diversity collapse in RL-trained LLMs; proposes exploration bonuses to prevent concentration on common correct answers
- Model Collapse Explained: How Synthetic Training Data Breaks AI. TechTarget. https://www.techtarget.com/whatis/feature/Model-collapse-explained-How-synthetic-training-data-breaks-AI, Accessible overview of collapse phenomenon and practical implications for recommendation systems
- Mananghat, S. (2024). Is LLM Model Collapse Inevitable? Medium. https://sanoojm.medium.com/is-llm-model-collapse-inevitable-2cb068128207, Discussion of diversity loss in AI-generated content and mitigation strategies
- What Is Model Collapse? IBM Research. https://www.ibm.com/think/topics/model-collapse, Comprehensive overview: causes, impacts on LLMs, and solutions including data provenance tracking
- Troise, A. (2024). A Reflection on the Phenomenon of LLM Model Collapse Leading to the Decline in AI Quality. Medium. https://levysoft.medium.com/a-reflection-on-the-phenomenon-of-llm-model-collapse-leading-to-the-decline-in-ai-quality-a6993f86866c, Analysis of data pollution and "digital inbreeding" effects on model quality
- Yao, S., Yu, D., Zhao, J., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Prompt Engineering Guide. https://www.promptingguide.ai/techniques/tot, Practical guide to implementing structured reasoning with explicit context hierarchies
- Prompt Engineering Overview. Anthropic Documentation. https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview, Best practices for context structuring, priority markers, and instruction hierarchy
- Advanced Prompt Engineering Techniques. OpenAI Cookbook. https://cookbook.openai.com/, Strategies for structured prompts, few-shot learning, and context window management
- JSON Mode and Structured Outputs. OpenAI Documentation. https://platform.openai.com/docs/guides/structured-outputs, Technical implementation of JSON schema enforcement for reliable structured generation
- Constrained Decoding for Structured Generation. Hugging Face Blog. https://huggingface.co/blog/constrained-beam-search, How structured output formats improve logical reasoning and instruction-following
- Why JSON Improves LLM Reasoning. Anthropic Research. https://www.anthropic.com/research, Research on how structured formats trigger different reasoning pathways in language models
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Papers with Code. https://paperswithcode.com/method/rag, Foundation for dynamic context injection and knowledge retrieval strategies
- Advanced RAG Patterns. LlamaIndex Documentation. https://docs.llamaindex.ai/en/stable/, Hierarchical retrieval, context prioritization, and multi-source knowledge integration
- Context Window Management Strategies. Anthropic Blog. https://www.anthropic.com/index/claude-2-1-prompting, Techniques for managing long context windows and maintaining attention on critical information
- LLMs for Travel and Hospitality. Tourism Analysis. https://www.tandfonline.com/journals/rtxg20, Academic research on applying AI to personalized travel recommendation systems
