When Could Your Agent Just Buy the Ticket Itself?

It’s Not About Consent. It’s About Calibration.
You’ve thought about this: at what point should your AI agent just act? Book the flight, pick the hotel, send the message - without asking first. The obvious answer is “when it has permission.” The right answer is more interesting. Agency isn’t a permission level. It’s the degree to which an agent has internalized what you actually care about. For decisions with competing tradeoffs, that calibration - not consent - is the real bottleneck. This piece is a problem statement, not a solution. It’s an argument that we’ve been designing for the wrong thing.
The hotel problem
Imagine you’re asking an AI agent to find you a hotel for a work trip to San Francisco next Thursday. You’ve got a flight already booked, a meeting at 9am in SoMa, and a travel policy that caps you at $350 a night. Sounds straightforward.
But here’s what the agent actually needs to figure out.
First, your flight is already booked - which means check-in and check-out dates are hard constraints, not preferences. But your meeting at 9am is different: it’s flexible enough that being five minutes late wouldn’t be catastrophic, whereas a 40-minute commute from the airport probably would be. The agent needs to reason about kinds of rigidity, not just schedule facts.
Second, it should know that the last three times you visited San Francisco, you stayed at the Marriott on 4th Street - not because it’s spectacular, but because it’s predictable, the beds are good, and you have Bonvoy status. It should know your company policy allows up to $350, but that you’ve historically booked at around $280, not because you were told to save money but because you don’t see the value in paying more for the same thing in a different building.
Third, the hotel’s own listing says “quiet rooms, excellent location, great gym.” Reviews on three different platforms say the gym equipment is outdated, the location is genuinely great, and rooms facing the street are loud. Which source does the agent trust, and on which dimensions?
Fourth - and this is where it gets interesting - sometimes these things trade off against each other in ways that don’t have a clean answer. The $320 room at the Marriott has a king bed. There’s a $290 room at a boutique hotel two blocks closer to your meeting: double bed, consistently better reviews on quietness, and breakfast included. What does the agent do?
This is, in the language I use internally, a TRAC-DP problem: Tradeoff-aware, Ranked, Argumentative, Contextual Decision Process. The tradeoffs are real and competing - not edge cases, but the core of the task. The ranking of those tradeoffs is personal and shifts with context: the same user who insists on walkability on a solo trip might prioritize budget rigidly on a team offsite. The answer isn’t retrieved; it’s argued - there is no lookup table that resolves “closer to meeting vs. better reviews vs. lower cost vs. brand reliability.” And it’s a process: not a one-time configuration but an understanding that should deepen with every trip, every choice, every signal the user gives.
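To make the ranking part concrete, here is a toy sketch of what a tradeoff-aware, context-ranked scorer might look like. Everything in it - the weight tables, the normalization against the $350 policy cap and the ~$280 historical rate, the review scores - is a hypothetical illustration, not a real booking system.

```python
# Toy TRAC-DP-style scorer. All weights, scores, and normalizations are
# hypothetical illustrations, not a real booking system.
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    price: float          # nightly rate in USD
    walk_minutes: int     # walk time to the 9am meeting
    review_score: float   # 0-10 aggregate across review platforms
    known_brand: bool     # user has history/status with this hotel

# The same dimensions carry different weights in different contexts:
# the ranking of the tradeoffs is personal and shifts with the trip.
WEIGHTS = {
    "solo_trip":    {"price": 0.15, "walk": 0.30, "reviews": 0.20, "brand": 0.35},
    "team_offsite": {"price": 0.70, "walk": 0.05, "reviews": 0.15, "brand": 0.10},
}

POLICY_CAP = 350      # company cap from the scenario
TYPICAL_RATE = 280    # the user's historical booking rate

def score(option: Option, context: str) -> float:
    w = WEIGHTS[context]
    savings = (POLICY_CAP - option.price) / (POLICY_CAP - TYPICAL_RATE)
    return (
        w["price"]     * savings
        + w["walk"]    * (1 - option.walk_minutes / 40)
        + w["reviews"] * option.review_score / 10
        + w["brand"]   * (1.0 if option.known_brand else 0.0)
    )

marriott = Option("Marriott 4th St", 320, 12, 7.8, True)
boutique = Option("Boutique two blocks closer", 290, 8, 8.6, False)

best_solo = max([marriott, boutique], key=lambda o: score(o, "solo_trip"))
best_team = max([marriott, boutique], key=lambda o: score(o, "team_offsite"))
```

The point of the sketch is that the winner flips with the weight table, not with any new information about the hotels - which is exactly why the ranking, not the retrieval, is the hard part.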
Most problems worth solving with AI agents have this shape. The question is whether our agents are actually built for it. And when I look at how most agent systems are designed today, I don’t think they are - not because they lack capability or speed, but because we’ve been optimizing for the wrong axis entirely.
The missing axis
When we talk about building better AI agents, the conversation usually collapses into two levers: make them more capable (better models, more tools, broader knowledge) or make them more efficient (smaller context, faster inference, cheaper tokens). Those are real levers. But there’s a third axis that barely gets talked about: agency.
Agency isn’t the same as capability. You can have a very capable agent with zero agency - one that asks for approval before every judgment call, that returns the decision to you whenever the path isn’t fully specified. A capable agent without agency is still essentially a search box with a nicer interface.
Agency also isn’t the same as efficiency. An agent can be extremely lean - compact context, no wasted computation - while still being completely deferential to the human at every decision point.
Agency is about something else: where the origin of action sits. Who - or what - is actually making the call? And for a TRAC-DP problem like finding the right hotel, that question isn’t a philosophical abstraction. It’s the central design question.
What Self-Determination Theory gets right
In the 1980s, Edward Deci and Richard Ryan developed Self-Determination Theory (SDT), a framework for understanding human motivation [1]. Their core insight was that motivation isn’t binary - it exists on a spectrum defined by locus of causality: where does the reason for your action originate?
At one end: external regulation. You do something because someone told you to, or because there’s a reward waiting. The action originates entirely outside you. Moving inward: introjected regulation, where you’ve absorbed an external rule but haven’t made it your own. Further in: identified and integrated regulation, where you act because you understand and endorse the value of what you’re doing. At the far end: intrinsic motivation, where you act because the thing itself is meaningful.
The reason this maps so cleanly to AI agents is that the SDT spectrum isn’t really about emotion - it’s about the internalization of values. An externally regulated agent does what it’s told. An integrated agent acts from an internalized model of what matters to you, why it matters, and how to weigh competing things against each other.
And here’s the key reframe: moving an agent up this spectrum isn’t about granting it more permissions. It’s about the agent earning the right to act by developing a genuine model of your values. A highly permissioned agent with no internalized values is dangerous. An agent with fewer formal permissions but a rich, calibrated model of what you actually care about can operate with much higher effective agency. Permission is a blunt instrument. Calibration is precise.
Why agency is hard: consent is dimensional
Consider a specific scenario: you previously booked a hotel at $320/night, king bed, no breakfast. Now the agent finds an opening at the same hotel, same price, king bed - but this time breakfast is included. Does it book without asking? Has it already received your consent?
The intuitive answer is “yes, obviously book it.” But the agent can’t know that without understanding which dimensions of your original decision were load-bearing.
When you made that first booking, you simultaneously encoded several things: a budget (maybe a ceiling, maybe just the going rate), a room type preference (maybe a genuine need for space, maybe just what was available), and a meal stance (maybe you prefer finding a local café in the morning, maybe breakfast was just more expensive elsewhere and you didn’t care). Not all of these were equally important to your decision. In TRAC-DP terms: the tradeoffs were present, but their ranking was invisible to the agent. And an agent that can’t distinguish load-bearing preferences from incidental ones will either over-ask - paralyzing itself with clarifying questions - or under-ask, silently resolving tradeoffs in ways you’d have disagreed with if consulted.
Neither failure is a capability problem. Both are calibration problems.
The deeper issue is that consent, as most agent systems currently model it, is binary: the user said yes or no to something at some point. But consent is actually dimensional. It has load-bearing dimensions and incidental ones. It has context-dependence: what you consented to on a solo trip is not the same consent on an international client trip. It decays with distance - the further a new situation is from the original decision, the less the prior consent applies. And it’s not static: your values shift, your circumstances shift, and an agent operating on a consent model from eighteen months ago is operating on stale data.
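One way to see what dimensional consent would mean mechanically is a toy applicability check: weight each dimension by how load-bearing it was, measure how far the new situation sits from the approved one, and decay the whole thing with time. The dimension weights, half-life, threshold, and hotel names below are invented for illustration - learning those weights is the actual problem.

```python
# Toy "dimensional consent" check. Dimension weights, the half-life, and the
# threshold are invented for illustration; learning them is the real problem.
from datetime import date

PRIOR = {  # the booking the user actually approved
    "hotel": "Marriott 4th St", "price": 320, "bed": "king", "breakfast": False,
}

# How load-bearing each dimension was to that approval (near 0 = incidental).
LOAD = {"hotel": 0.9, "price": 0.6, "bed": 0.7, "breakfast": 0.1}

def consent_distance(prior: dict, new: dict) -> float:
    """Weighted distance from the approved decision: incidental dimensions
    can change freely, load-bearing ones cannot."""
    d = 0.0
    for k, w in LOAD.items():
        if k == "price":
            d += w * abs(new[k] - prior[k]) / prior[k]
        elif new[k] != prior[k]:
            d += w
    return d

def consent_staleness(approved_on: date, today: date, half_life_days: int = 180) -> float:
    """Consent decays with time: 1.0 when fresh, 0.5 after one half-life."""
    return 0.5 ** ((today - approved_on).days / half_life_days)

def still_covered(prior: dict, new: dict, approved_on: date, today: date,
                  threshold: float = 0.3) -> bool:
    """Prior consent applies only if the staleness-inflated distance is small."""
    return consent_distance(prior, new) / consent_staleness(approved_on, today) < threshold

same_plus_breakfast = {"hotel": "Marriott 4th St", "price": 320,
                       "bed": "king", "breakfast": True}
different_hotel = {"hotel": "Boutique on Howard", "price": 290,
                   "bed": "double", "breakfast": True}
```

Run against the breakfast scenario, still_covered says yes - only an incidental dimension moved - and says no for the boutique, where several load-bearing dimensions moved at once.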
The click is a consent protocol
Traditional websites are built around clicks. Every click does two things simultaneously: it expresses intent, and it transfers responsibility. When you click “Book this hotel,” you - not the site - made the choice. The click is a consent mechanism so lightweight that most users don’t even notice it functioning that way.
Agents operating autonomously on your behalf don’t have natural consent checkpoints. They’re browsing, scraping, filtering, comparing - often doing dozens of micro-actions before surfacing anything to you. Each of those micro-actions is a judgment call: what to search for, what to filter out, which tradeoffs to evaluate, how to rank competing signals. None of it was explicitly clicked into [2].
This is what makes workflows feel like workflows and agents feel like agents. A workflow has predefined checkpoints - it does exactly what it was programmed to do and stops at specific gates. An agent makes judgment calls throughout, which is exactly what creates its value. But without a coherent consent model, that value creation becomes hard to trust.
The answer isn’t to reintroduce clicks - that collapses the agent back into a workflow. It’s to design proportional trust checkpoints: act confidently in well-calibrated territory, surface reasoning explicitly when crossing into unfamiliar tradeoff space. Book the Marriott king for $320 without asking, because you’ve done this dozens of times. But if the Marriott is sold out and the agent is considering a boutique hotel in a different neighborhood - that’s new territory, and it should show its work.
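Sketched as code, a proportional trust checkpoint is just a routing decision over how calibrated the territory is. The features and thresholds below are placeholder assumptions; in a real system they would come from the agent’s learned model of the user, not constants.

```python
# Toy proportional trust checkpoint. The features and thresholds are placeholder
# assumptions; a real system would derive them from a learned user model.
from dataclasses import dataclass

@dataclass
class Decision:
    description: str
    similar_past_choices: int  # times the user approved decisions like this one
    tradeoff_novelty: float    # 0.0 = familiar tradeoff space, 1.0 = brand new

def checkpoint(d: Decision) -> str:
    if d.similar_past_choices >= 5 and d.tradeoff_novelty < 0.3:
        return "act"             # well-calibrated territory: just book it
    if d.tradeoff_novelty < 0.7:
        return "act_and_notify"  # borderline: act, but surface the reasoning
    return "ask"                 # new tradeoff space: show your work first

usual = Decision("Marriott king, $320", similar_past_choices=12, tradeoff_novelty=0.1)
new_area = Decision("Boutique, new neighborhood", similar_past_choices=0, tradeoff_novelty=0.8)
```

The middle tier is the interesting one: it keeps the agent’s value creation (it still acts) while restoring the responsibility transfer that the click used to provide (you see the reasoning and can object).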
Anthropic’s recent engineering article on long-running agent applications [3] is instructive here. They found that separating the generator agent from the evaluator agent was more tractable than making the generator self-critical - external feedback gives the generator something concrete to iterate against. That’s a structural response to the consent problem: make the judgment-checking explicit and separate.
Calibration is building a theory of you
Most “personalization” systems learn preferences: you clicked on these hotels before, so you probably like these features. But preference learning is pattern matching. It tells you what someone chose, not why. And knowing why is the only thing that lets you generalize to situations that haven’t appeared in the data - like a sold-out Marriott, or a city the user hasn’t visited before, or a trip that mixes work and leisure in an unusual way.
Building a theory of a user means modeling their values, their relative weights, and how those weights shift by context. It means understanding that “no breakfast” in a past booking was a price artifact, not a genuine preference. It means knowing that this user bends the travel policy slightly on international trips but follows it strictly on domestic ones - not because they’re inconsistent, but because the contexts call for different norms.
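As a minimal illustration of the gap between logging a preference and inferring the why behind it, take the breakfast case. The inference rule and field names below are invented; the point is only that the same history supports different generalizations depending on why the choices happened.

```python
# Toy "why"-inference over booking history: was "no breakfast" a genuine
# preference or a price artifact? Rule and field names are illustrative.
def infer_breakfast_stance(history: list) -> str:
    """If the user skipped breakfast only when it cost extra, treat the
    stance as incidental; if they skipped it even when free, or paid
    extra for it, treat it as genuine."""
    skipped_free = [b for b in history
                    if not b["took_breakfast"] and b["breakfast_surcharge"] == 0]
    took_paid = [b for b in history
                 if b["took_breakfast"] and b["breakfast_surcharge"] > 0]
    if skipped_free:
        return "genuine: skips breakfast even when free"
    if took_paid:
        return "genuine: pays extra for breakfast"
    return "incidental: choices track price, not preference"

history = [
    {"took_breakfast": False, "breakfast_surcharge": 25},
    {"took_breakfast": False, "breakfast_surcharge": 30},
    {"took_breakfast": True,  "breakfast_surcharge": 0},
]
stance = infer_breakfast_stance(history)
```

A pattern matcher looking at the same rows would learn “usually skips breakfast” and mispredict the moment breakfast comes free - the why is what generalizes, not the what.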
Research on agent planning and evaluation [4] suggests that for complex tasks, the bottleneck is almost never raw execution capability. It’s the quality of the internal model of the problem: the decomposition, the estimation, the ability to recognize which decisions are load-bearing and which are incidental. The hotel agent’s limit isn’t its ability to call a booking API. It’s the quality of its model of you.
This is why “personalized” is almost too weak a word for what high-agency agents need to do. Personalization implies adapting to known preferences. What we’re describing is closer to understanding - a persistent, contextual, evolving model of what matters to you and why. The agent isn’t matching your history. It’s developing a theory that can survive contact with situations your history doesn’t cover.
What this changes
Most current discourse around AI agents focuses on capability benchmarks: can the agent use tools correctly, complete multi-step tasks, avoid hallucinations? These matter. But they’re not sufficient - and for TRAC-DP problems, they’re not even close to sufficient.
The agents that will matter most for genuinely complex problems - the ones where the decision space is large, the constraints are soft, and the tradeoffs are personal - are the ones that have developed something like integrated regulation. Not more permissions. Not more tools. A better model of the human they’re acting for.
That’s a different design problem, and it requires a different evaluation framework. Not “did the agent complete the task?” but “did it correctly identify which dimensions of the task required human input, and handle the rest with appropriate confidence?” Not “does the agent know the user’s preferences?” but “does it know which preferences are load-bearing in this situation?”
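That evaluation question can even be sketched as a metric - not a benchmark proposal, just an illustration of what it would measure. The “load-bearing” labels are assumed given (user annotation, post-hoc feedback); the arithmetic simply penalizes over-asking and under-asking symmetrically.

```python
# Toy agency-oriented eval. "Load-bearing" labels are assumed given (e.g. from
# user feedback); the metric penalizes over- and under-asking symmetrically.
def calibration_score(asked: set, load_bearing: set, all_dims: set) -> float:
    """1.0 when the agent asks about exactly the load-bearing dimensions."""
    incidental = all_dims - load_bearing
    under = len(load_bearing - asked)  # load-bearing dimensions decided silently
    over = len(asked & incidental)     # needless clarifying questions
    return 1.0 - (under + over) / len(all_dims)

dims = {"hotel", "price", "bed", "breakfast", "neighborhood"}
load = {"neighborhood", "price"}  # what actually needed human input this trip

perfect = calibration_score({"neighborhood", "price"}, load, dims)
over_asker = calibration_score(dims, load, dims)    # asks about everything
under_asker = calibration_score(set(), load, dims)  # asks about nothing
```

Note that the deferential agent and the over-eager one both score below the calibrated one: task completion alone would rate all three the same.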
The sprint construct story from Anthropic’s article is a small but concrete example. The sprint scaffolding - where generator and evaluator negotiated scope before each chunk of work - was external regulation imposed on the agent because it couldn’t maintain coherent intent over long tasks. When the model improved enough to hold a two-hour project together internally, the scaffolding became friction. The agent had developed enough integrated planning that it no longer needed someone else to structure its commitments.
That’s what moving up the agency axis looks like in practice. Not a permissions upgrade. An internalization upgrade.
We are nowhere near the end of this problem. But we’re at the point where naming it correctly starts to matter. TRAC-DP problems are not retrieval problems. They are not workflow problems. They are not permission problems. They are calibration problems - and the sooner we design for that, the sooner agents will feel less like an elaborate search interface and more like something that actually knows you.
References
[1] Ryan, R. M., & Deci, E. L. (2000). Self-determination theory and the facilitation of intrinsic motivation, social development, and well-being. American Psychologist, 55(1), 68–78. https://doi.org/10.1037/0003-066X.55.1.68
[2] The ReAct paper (Yao et al., 2022) showed that interleaving reasoning traces with actions makes the agent’s process visible - a partial structural answer to the consent gap, but one that doesn’t resolve responsibility transfer at the human-agent boundary. arxiv.org/abs/2210.03629. Reflexion (Shinn et al., 2023) extended this with verbal reinforcement: the agent reflects on its own outputs after task completion. arxiv.org/abs/2303.11366.
[3] Rajasekaran, P. (2025). Harness design for long-running application development. Anthropic Engineering. anthropic.com/engineering/harness-design-long-running-apps
[4] For the finding that planning and evaluation capabilities matter more than execution skills as task complexity grows, see: arxiv.org/abs/2602.11865


