Meta Bought a Browser Agent. Here’s Why the Real Future Is API-Native Agents.
Browser agents today feel like mobile web apps in 2009 - impressive demos, fragile foundations. The next generation of AI agents will look more like native apps: API-first, state-aware, and built for real job completion.

In late 2025, Meta quietly acquired Manus, a startup building browser-use agents - systems that watch web pages and complete tasks by clicking, scrolling, and typing, more or less the way a human would.
I wasn’t surprised.
Over the past year, browser agents have been having a moment. The demos are compelling, and recent academic work like Web World Models (Feng et al., arXiv:2512.23676, 2025) gives this direction a more formal framing by treating the web stack itself as a persistent environment for agents.
Browser agents are easy to understand. They feel powerful because you can literally watch them work. That visibility matters, especially early on.
But here’s where my view starts to diverge.
If what you’re optimizing for is a demo or a prototype, browser agents are great. If what you’re optimizing for is reliable job completion - something you can run every day, at scale, and actually trust - then browser-use agents, even when augmented with world models, run into fundamental, structural limits that no amount of clever prompting can fully remove.
That gap is what this post is really about.
1. The fundamental limitation of browser-use agents
These limits are not about better prompting or stronger models. They are architectural.
UI ≠ State
A browser exposes rendered views, not canonical system state.
The agent sees:
- DOM trees
- Text snippets
- Buttons and forms
What actually determines correctness:
- Backend state
- Business rules
- Authorization logic
- Transaction boundaries
This mismatch is structural. No world model layered on top of a browser can fully recover the ground truth of the system behind the UI.
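To make the mismatch concrete, here is a minimal sketch - every name in it is hypothetical - of how the agent’s rendered view and the system of record can disagree at the same moment in time:

```python
# Hypothetical sketch of the UI/state mismatch. The DOM tells the agent
# what was *rendered*; the backend holds what is *true*.

def agent_view_of_order(dom_text: str) -> str:
    # A browser agent infers state from rendered text: a heuristic, not a fact.
    return "confirmed" if "Thank you for your order" in dom_text else "unknown"

def backend_view_of_order(order: dict) -> str:
    # The system of record evaluates business rules the UI never exposes.
    if order["payment_status"] != "captured":
        return "pending_payment"
    if not order["inventory_reserved"]:
        return "backordered"
    return "confirmed"

# The same moment in time, two different answers:
dom = "<h1>Thank you for your order!</h1>"  # optimistic UI, rendered early
order = {"payment_status": "authorized", "inventory_reserved": False}

print(agent_view_of_order(dom))      # "confirmed"
print(backend_view_of_order(order))  # "pending_payment"
```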
The action space is adversarial
Browser actions are fragile by design.
Agents must contend with:
- Layout changes
- A/B experiments
- Lazy loading
- Rate limits and bot defenses
- CAPTCHAs and session timeouts
From the agent’s perspective, the environment is adversarial and unstable. Even when the intent is clear, execution remains probabilistic.
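A small illustration of the asymmetry, with hypothetical names throughout: the same intent expressed as a UI action and as a typed call.

```python
# Illustrative only: neither function is a real integration.

def click_checkout(page_html: str) -> bool:
    # The action is coupled to presentation. An A/B test that renames the
    # button, or a layout change that moves it, silently breaks this.
    return 'id="checkout-btn"' in page_html  # stand-in for locate-and-click

def submit_checkout(cart_id: str) -> dict:
    # The action is a contract. Front-end redesigns, experiments, and lazy
    # loading cannot touch it.
    return {"cart_id": cart_id, "status": "submitted"}  # stand-in for an API call
```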
Completion is not first-class
Most browser agents evaluate success implicitly:
“Did the page change in a way that looks right?”
There is rarely a definitive, machine-checkable signal that says:
“The job is complete, correctly and transactionally.”
This is why browser agents excel at demonstration but struggle with reliability.
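A deliberately simplified sketch of the difference - the names are illustrative, not from any real framework:

```python
from enum import Enum

class Completion(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    PENDING = "pending"

def browser_agent_done(before_html: str, after_html: str) -> bool:
    # Implicit: "did the page change in a way that looks right?"
    return before_html != after_html and "error" not in after_html.lower()

def api_agent_done(response: dict) -> Completion:
    # Explicit: the system of record reports a machine-checkable outcome.
    return Completion(response["status"])
```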
2. A familiar lesson from the history of mobile apps
There is a very direct historical parallel here - and it is not abstract.
In June 2007, at Apple’s WWDC keynote, Steve Jobs famously argued that the mobile browser was enough. The iPhone launched without an App Store, and Jobs’ position was explicit:
Modern web technologies (HTML, CSS, JavaScript) could deliver full-fledged applications directly through Safari.
(Apple WWDC keynote, June 11, 2007)
That philosophy shaped early mobile strategy across the industry.
In 2010–2011, Facebook went all-in on HTML5 for mobile, betting that a single web codebase could replace native apps across iOS and Android. This was not a side experiment - it was Facebook’s primary mobile strategy at the time.
By September 2012, the results were clear. Mark Zuckerberg publicly acknowledged the failure of that approach, stating:
“The biggest mistake we made as a company was betting too much on HTML5.”
(Facebook TechCrunch Disrupt interview, Sept 2012)
Facebook rapidly rebuilt its mobile experience as fully native apps. Performance, reliability, and user experience improved almost immediately.
Meanwhile, an entirely different class of companies was emerging - companies that could not exist on a mobile-web foundation.
Uber, founded in March 2009, scaled through the early 2010s by deeply integrating with native mobile capabilities: GPS, background location updates, push notifications, real-time networking, and low-latency state synchronization. A mobile-browser-only version of Uber would not have worked - not as a prototype, and certainly not at scale.
By 2013–2014, the industry consensus had flipped. Serious consumer products standardized on native apps, not because HTML5 was “bad,” but because interfaces are not systems of record, and real products need first-class access to state, actions, and lifecycle.
This wasn’t a cosmetic transition. It was an architectural correction.
With that context, the analogy is no longer rhetorical - it is structural:
browser agent : API agent :: mobile web app : native mobile app
Browser agents, like early mobile web apps, are compelling, portable, and demo-friendly.
API-native agents, like native mobile apps, are built for reliability, performance, and job completion at scale.
3. Three research lines converging on API world models
Before breaking this down, it helps to define what we mean by an API world model - loosely and pragmatically, not academically.
In this context, an API world model refers to a system where:
- Canonical state lives in APIs, not in UI or latent tokens
- Actions are typed, authorized API calls
- State transitions are deterministic and machine-verifiable
- Completion is explicit (success, failure, pending), not inferred
- LLMs operate at the planning and semantic layers, not as the execution engine
It is not a single product or paper. It is an architectural direction.
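As a loose sketch - assuming nothing beyond the five properties above, with every name invented for illustration - the direction looks roughly like this:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Protocol

class Completion(Enum):              # completion is explicit, not inferred
    SUCCESS = "success"
    FAILURE = "failure"
    PENDING = "pending"

@dataclass(frozen=True)
class Action:                        # actions are typed, authorized API calls
    endpoint: str
    payload: dict
    auth_scope: str

@dataclass(frozen=True)
class Transition:                    # state transitions are machine-verifiable
    prior_state: dict
    next_state: dict
    completion: Completion

class APIWorldModel(Protocol):
    def state(self) -> dict: ...                     # canonical state lives here
    def apply(self, action: Action) -> Transition: ...

# The LLM plans over this interface; it never "executes" by guessing at pixels.
```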
From that perspective, three previously separate research and product lines are quietly converging on the same idea.
1) Tool-augmented / API-grounded agents
This line includes:
- Tool calling via structured schemas
- ReAct-style planning
- Function-calling LLMs
The core idea is that tools are not helpers - they are the environment. However, much of this work still treats tools as stateless, rather than as a persistent world.
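For concreteness, here is the shape of a typical tool schema (OpenAI-style function calling; the tool itself is hypothetical):

```python
# A typed action declaration, in the shape used by function-calling LLMs.
get_booking_tool = {
    "type": "function",
    "function": {
        "name": "get_booking",
        "description": "Fetch a booking by its confirmation number.",
        "parameters": {
            "type": "object",
            "properties": {
                "confirmation_number": {"type": "string"},
            },
            "required": ["confirmation_number"],
        },
    },
}
```

The schema types the action space, but notice what it omits: nothing carries state from one call to the next. That gap is exactly what the world-model framing highlights.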
2) Programmatic environments as world models
A second line appears in environments where:
- State transitions are encoded in code
- Rules are explicit
- Success criteria are unambiguous
This includes software engineering agents, simulation-style benchmarks, and task-oriented environments. These systems trade open-endedness for correctness - and that trade-off is often exactly what real jobs require.
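A toy example in that spirit (entirely illustrative): the rules live in code, and success is a predicate, not an impression.

```python
class RefundEnvironment:
    """Toy environment; real money handling would use integer cents or Decimal."""

    def __init__(self, order_total: float):
        self.order_total = order_total
        self.refunded = 0.0

    def refund(self, amount: float) -> str:
        if amount <= 0 or self.refunded + amount > self.order_total:
            return "failure"                 # explicit rule, explicit outcome
        self.refunded += amount
        return "success"

    def done(self) -> bool:
        return self.refunded == self.order_total  # unambiguous success criterion
```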
3) Enterprise platforms (quietly the most important)
The most mature physics layers for API world models already exist in production:
- Payments: Stripe, Adyen
- Travel booking: Amadeus, Sabre, Spotnana, Travelport
- Cloud infrastructure: AWS, Azure, Google Cloud
- CRM / ERP: Salesforce, SAP, Workday
These systems already provide typed schemas, authorization boundaries, transaction semantics, and clear definitions of completion. LLMs are being attached to these worlds - not replacing them.
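Here is a hedged sketch of what attaching an agent to one of these worlds looks like at the execution layer. The endpoint, fields, and status vocabulary are hypothetical stand-ins, not any vendor’s actual API:

```python
import uuid
import requests

def book_flight(offer_id: str, traveler: dict) -> dict:
    response = requests.post(
        "https://api.example-travel.com/v1/bookings",  # hypothetical endpoint
        json={"offer_id": offer_id, "traveler": traveler},
        headers={
            "Authorization": "Bearer <token>",         # authorization boundary
            "Idempotency-Key": str(uuid.uuid4()),      # transaction semantics
        },
        timeout=30,
    )
    response.raise_for_status()
    booking = response.json()
    # Clear definition of completion: the platform says so, explicitly.
    assert booking["status"] in {"confirmed", "pending", "failed"}
    return booking
```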
4. How Web World Models fit into this picture
The Web World Models paper (Feng et al., 2025) is important because it clearly articulates a missing middle ground between:
- Fixed-schema web apps, and
- Fully generative world models
However, its scope remains centered on the web stack - HTML, HTTP, and content generation layered over deterministic code.
That makes WWM a meaningful step forward for persistence and controllability, but it does not remove the core limitations of browser-mediated action when the goal is job completion.
5. Why Meta buying a browser agent still makes sense
To be clear, I don’t think browser agents are a mistake.
They matter.
If you’re trying to deal with legacy systems, or explore workflows where APIs simply don’t exist, browser agents are often the only viable option. They are also incredibly effective as demos - you can watch the agent work, which makes the value intuitive in a way an API call never can.
From that perspective, Meta acquiring Manus makes sense. Browser agents are a powerful way to bootstrap capability, learn about real-world friction, and study how agents interact with messy human-facing systems.
But I don’t believe browser agents are the end state.
Once you care about operating at scale - about reliability, accountability, and correctness - the abstraction starts to break. UI is not state. Clicking is not a transaction. And “it looks done” is not the same as “it is done.”
Browser agents will continue to exist, and they will be useful. But the agents that people eventually trust with real work will not live in the DOM. They will live in systems where state, actions, and completion are first-class.
That’s the line I draw.
6. Otto’s mission: completion over clicks
This is ultimately why we’re building Otto the way we are.
Our mission is simple and explicit: job completion for people who value time and convenience. Once you take that seriously, a lot of architectural decisions stop being debatable.
We focus on API-first interaction because that’s where the real state lives. We engineer context around structured data, not scraped UI. We optimize entire flows end-to-end, not individual steps that happen to look good in a demo. And we treat completion as a contract - something that can be verified and trusted - not a heuristic.
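To make “completion as a contract” concrete - and to be clear, this is a sketch of the principle, not Otto’s implementation - the idea is that a job counts as done only when a verifiable postcondition holds:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class JobContract:
    description: str
    postcondition: Callable[[], bool]   # checked against canonical API state

def run_job(execute: Callable[[], None], contract: JobContract) -> bool:
    execute()
    done = contract.postcondition()     # verified, not inferred
    if not done:
        raise RuntimeError(f"Contract not satisfied: {contract.description}")
    return done
```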
Browser automation still has its place. It’s useful at the edges, and sometimes it’s the only way to bridge a gap. But it’s not the foundation you want if you’re building something people rely on every day.
In the end, users don’t care how an agent works. They care whether the job is done - correctly, quickly, and without friction.
That’s the future I’m betting on. And it’s a future where agents live in worlds where completion is first-class.


