Meta Bought a Browser Agent. Here’s Why the Real Future Is API-Native Agents.
Browser agents today feel like mobile web apps in 2009 - impressive demos, fragile foundations. The next generation of AI agents will look more like native apps: API-first, state-aware, and built for real job completion.

In late 2025, Meta quietly acquired Manus, a startup building browser-use agents - systems that watch web pages and complete tasks by clicking, scrolling, and typing, more or less the way a human would.
I wasn’t surprised.
Over the past year, browser agents have been having a moment. The demos are compelling, and recent academic work like Web World Models (Feng et al., arXiv:2512.23676, 2025) gives this direction a more formal framing by treating the web stack itself as a persistent environment for agents.
Browser agents are easy to understand. They feel powerful because you can literally watch them work. That visibility matters, especially early on.
But here’s where my view starts to diverge.
If what you’re optimizing for is a demo or a prototype, browser agents are great. If what you’re optimizing for is reliable job completion - something you can run every day, at scale, and actually trust - then browser-use agents, even when augmented with world models, run into fundamental, structural limits that no amount of clever prompting can fully remove.
That gap is what this post is really about.
1. The fundamental limitation of browser-use agents
These limits are not about better prompting or stronger models. They are architectural.
UI ≠ State
A browser exposes rendered views, not canonical system state.
The agent sees:
- DOM trees
- Text snippets
- Buttons and forms
What actually determines correctness:
- Backend state
- Business rules
- Authorization logic
- Transaction boundaries
This mismatch is structural. No world model layered on top of a browser can fully recover the ground truth of the system behind the UI.
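To make the mismatch concrete, here is a minimal sketch - every name in it is hypothetical - of how the agent’s rendered view and the system of record can disagree at the same moment in time:

```python
# Hypothetical sketch of the UI/state mismatch. The DOM tells the agent
# what was *rendered*; the backend holds what is *true*.

def agent_view_of_order(dom_text: str) -> str:
    # A browser agent infers state from rendered text: a heuristic, not a fact.
    return "confirmed" if "Thank you for your order" in dom_text else "unknown"

def backend_view_of_order(order: dict) -> str:
    # The system of record evaluates business rules the UI never exposes.
    if order["payment_status"] != "captured":
        return "pending_payment"
    if not order["inventory_reserved"]:
        return "backordered"
    return "confirmed"

# The same moment in time, two different answers:
dom = "<h1>Thank you for your order!</h1>"  # optimistic UI, rendered early
order = {"payment_status": "authorized", "inventory_reserved": False}

print(agent_view_of_order(dom))      # "confirmed"
print(backend_view_of_order(order))  # "pending_payment"
```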
The action space is adversarial
Browser actions are fragile by design.
Agents must contend with:
- Layout changes
- A/B experiments
- Lazy loading
- Rate limits and bot defenses
- CAPTCHAs and session timeouts
From the agent’s perspective, the environment is adversarial and unstable. Even when the intent is clear, execution remains probabilistic.
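A small illustration of the asymmetry, with hypothetical names throughout: the same intent expressed as a UI action and as a typed call.

```python
# Illustrative only: neither function is a real integration.

def click_checkout(page_html: str) -> bool:
    # The action is coupled to presentation. An A/B test that renames the
    # button, or a layout change that moves it, silently breaks this.
    return 'id="checkout-btn"' in page_html  # stand-in for locate-and-click

def submit_checkout(cart_id: str) -> dict:
    # The action is a contract. Front-end redesigns, experiments, and lazy
    # loading cannot touch it.
    return {"cart_id": cart_id, "status": "submitted"}  # stand-in for an API call
```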
Completion is not first-class
Most browser agents evaluate success implicitly:
“Did the page change in a way that looks right?”
There is rarely a definitive, machine-checkable signal that says:
“The job is complete, correctly and transactionally.”
This is why browser agents excel at demonstration but struggle with reliability.
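A deliberately simplified sketch of the difference - the names are illustrative, not from any real framework:

```python
from enum import Enum

class Completion(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    PENDING = "pending"

def browser_agent_done(before_html: str, after_html: str) -> bool:
    # Implicit: "did the page change in a way that looks right?"
    return before_html != after_html and "error" not in after_html.lower()

def api_agent_done(response: dict) -> Completion:
    # Explicit: the system of record reports a machine-checkable outcome.
    return Completion(response["status"])
```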
2. A familiar lesson from the history of mobile apps
There is a very direct historical parallel here - and it is not abstract.
In June 2007, at Apple’s WWDC keynote, Steve Jobs famously argued that the mobile browser was enough. The iPhone launched without an App Store, and Jobs’ position was explicit:
Modern web technologies (HTML, CSS, JavaScript) could deliver full-fledged applications directly through Safari.
(Apple WWDC keynote, June 11, 2007)
That philosophy shaped early mobile strategy across the industry.
In 2010–2011, Facebook went all-in on HTML5 for mobile, betting that a single web codebase could replace native apps across iOS and Android. This was not a side experiment - it was Facebook’s primary mobile strategy at the time.
By September 2012, the results were clear. Mark Zuckerberg publicly acknowledged the failure of that approach, stating:
“The biggest mistake we made as a company was betting too much on HTML5.”
(Facebook TechCrunch Disrupt interview, Sept 2012)
Facebook rapidly rebuilt its mobile experience as fully native apps. Performance, reliability, and user experience improved almost immediately.
Meanwhile, an entirely different class of companies was emerging - companies that could not exist on a mobile-web foundation.
Uber, founded in March 2009, scaled through the early 2010s by deeply integrating with native mobile capabilities: GPS, background location updates, push notifications, real-time networking, and low-latency state synchronization. A mobile-browser-only version of Uber would not have worked - not as a prototype, and certainly not at scale.
By 2013–2014, the industry consensus had flipped. Serious consumer products standardized on native apps, not because HTML5 was “bad,” but because interfaces are not systems of record, and real products need first-class access to state, actions, and lifecycle.
This wasn’t a cosmetic transition. It was an architectural correction.
With that context, the analogy is no longer rhetorical - it is structural:
browser agent : API agent :: mobile web app : native mobile app
Browser agents, like early mobile web apps, are compelling, portable, and demo-friendly.
API-native agents, like native mobile apps, are built for reliability, performance, and job completion at scale.
3. Three research lines converging on API world models
Before breaking this down, it helps to define what we mean by an API world model - loosely and pragmatically, not academically.
In this context, an API world model refers to a system where:
- Canonical state lives in APIs, not in UI or latent tokens
- Actions are typed, authorized API calls
- State transitions are deterministic and machine-verifiable
- Completion is explicit (success, failure, pending), not inferred
- LLMs operate at the planning and semantic layers, not as the execution engine
It is not a single product or paper. It is an architectural direction.
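As a loose sketch - assuming nothing beyond the five properties above, with every name invented for illustration - the direction looks roughly like this:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Protocol

class Completion(Enum):              # completion is explicit, not inferred
    SUCCESS = "success"
    FAILURE = "failure"
    PENDING = "pending"

@dataclass(frozen=True)
class Action:                        # actions are typed, authorized API calls
    endpoint: str
    payload: dict
    auth_scope: str

@dataclass(frozen=True)
class Transition:                    # state transitions are machine-verifiable
    prior_state: dict
    next_state: dict
    completion: Completion

class APIWorldModel(Protocol):
    def state(self) -> dict: ...                     # canonical state lives here
    def apply(self, action: Action) -> Transition: ...

# The LLM plans over this interface; it never "executes" by guessing at pixels.
```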
From that perspective, three previously separate research and product lines are quietly converging on the same idea.
1) Tool-augmented / API-grounded agents
This line includes:
- Tool calling via structured schemas
- ReAct-style planning
- Function-calling LLMs
The core idea is that tools are not helpers - they are the environment. However, much of this work still treats tools as stateless, rather than as a persistent world.
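For concreteness, here is the shape of a typical tool schema (OpenAI-style function calling; the tool itself is hypothetical):

```python
# A typed action declaration, in the shape used by function-calling LLMs.
get_booking_tool = {
    "type": "function",
    "function": {
        "name": "get_booking",
        "description": "Fetch a booking by its confirmation number.",
        "parameters": {
            "type": "object",
            "properties": {
                "confirmation_number": {"type": "string"},
            },
            "required": ["confirmation_number"],
        },
    },
}
```

The schema types the action space, but notice what it omits: nothing carries state from one call to the next. That gap is exactly what the world-model framing highlights.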
2) Programmatic environments as world models
A second line appears in environments where:
- State transitions are encoded in code
- Rules are explicit
- Success criteria are unambiguous
This includes software engineering agents, simulation-style benchmarks, and task-oriented environments. These systems trade open-endedness for correctness - and that trade-off is often exactly what real jobs require.
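A toy example in that spirit (entirely illustrative): the rules live in code, and success is a predicate, not an impression.

```python
class RefundEnvironment:
    """Toy environment; real money handling would use integer cents or Decimal."""

    def __init__(self, order_total: float):
        self.order_total = order_total
        self.refunded = 0.0

    def refund(self, amount: float) -> str:
        if amount <= 0 or self.refunded + amount > self.order_total:
            return "failure"                 # explicit rule, explicit outcome
        self.refunded += amount
        return "success"

    def done(self) -> bool:
        return self.refunded == self.order_total  # unambiguous success criterion
```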
3) Enterprise platforms (quietly the most important)
The most mature physics layers for API world models already exist in production:
- Payments: Stripe, Adyen
- Travel booking: Amadeus, Sabre, Spotnana, Travelport
- Cloud infrastructure: AWS, Azure, Google Cloud
- CRM / ERP: Salesforce, SAP, Workday
These systems already provide typed schemas, authorization boundaries, transaction semantics, and clear definitions of completion. LLMs are being attached to these worlds - not replacing them.
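Here is a hedged sketch of what attaching an agent to one of these worlds looks like at the execution layer. The endpoint, fields, and status vocabulary are hypothetical stand-ins, not any vendor’s actual API:

```python
import uuid
import requests

def book_flight(offer_id: str, traveler: dict) -> dict:
    response = requests.post(
        "https://api.example-travel.com/v1/bookings",  # hypothetical endpoint
        json={"offer_id": offer_id, "traveler": traveler},
        headers={
            "Authorization": "Bearer <token>",         # authorization boundary
            "Idempotency-Key": str(uuid.uuid4()),      # transaction semantics
        },
        timeout=30,
    )
    response.raise_for_status()
    booking = response.json()
    # Clear definition of completion: the platform says so, explicitly.
    assert booking["status"] in {"confirmed", "pending", "failed"}
    return booking
```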
4. How Web World Models fit into this picture
The Web World Models paper (Feng et al., 2025) is important because it clearly articulates a missing middle ground between:
- Fixed-schema web apps, and
- Fully generative world models
However, its scope remains centered on the web stack - HTML, HTTP, and content generation layered over deterministic code.
That makes WWM a meaningful step forward for persistence and controllability, but it does not remove the core limitations of browser-mediated action when the goal is job completion.
5. Why Meta buying a browser agent still makes sense
To be clear, I don’t think browser agents are a mistake.
They matter.
If you’re trying to deal with legacy systems, or explore workflows where APIs simply don’t exist, browser agents are often the only viable option. They are also incredibly effective as demos - you can watch the agent work, which makes the value intuitive in a way an API call never can.
From that perspective, Meta acquiring Manus makes sense. Browser agents are a powerful way to bootstrap capability, learn about real-world friction, and study how agents interact with messy human-facing systems.
But I don’t believe browser agents are the end state.
Once you care about operating at scale - about reliability, accountability, and correctness - the abstraction starts to break. UI is not state. Clicking is not a transaction. And “it looks done” is not the same as “it is done.”
Browser agents will continue to exist, and they will be useful. But the agents that people eventually trust with real work will not live in the DOM. They will live in systems where state, actions, and completion are first-class.
That’s the line I draw.
6. Otto’s mission: completion over clicks
This is ultimately why we’re building Otto the way we are.
Our mission is simple and explicit: job completion for people who value time and convenience. Once you take that seriously, a lot of architectural decisions stop being debatable.
We focus on API-first interaction because that’s where the real state lives. We engineer context around structured data, not scraped UI. We optimize entire flows end-to-end, not individual steps that happen to look good in a demo. And we treat completion as a contract - something that can be verified and trusted - not a heuristic.
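To make “completion as a contract” concrete - and to be clear, this is a sketch of the principle, not Otto’s implementation - the idea is that a job counts as done only when a verifiable postcondition holds:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class JobContract:
    description: str
    postcondition: Callable[[], bool]   # checked against canonical API state

def run_job(execute: Callable[[], None], contract: JobContract) -> bool:
    execute()
    done = contract.postcondition()     # verified, not inferred
    if not done:
        raise RuntimeError(f"Contract not satisfied: {contract.description}")
    return done
```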
Browser automation still has its place. It’s useful at the edges, and sometimes it’s the only way to bridge a gap. But it’s not the foundation you want if you’re building something people rely on every day.
In the end, users don’t care how an agent works. They care whether the job is done - correctly, quickly, and without friction.
That’s the future I’m betting on. And it’s a future where agents live in worlds where completion is first-class.


