
Posted by Alifatisk 5 hours ago

WebMCP Proposal (webmachinelearning.github.io)
117 points | 62 comments
nozzlegear 33 minutes ago|
The fact that the "Security and privacy considerations" and the "Accessibility considerations" sections are completely blank in this proposal is delightful meta commentary on the state of the AI hype cycle. I know it's just a draft so far, but it got a laugh out of me.
gavmor 3 hours ago||
This seems backwards, somehow. Like you're asking for an nth view and an nth API, and services are being asked to provide accessibility bridges redundant with our extant offerings.

Sites are now expected to duplicate effort by manually defining schemas for the same actions, like re-describing a button's purpose in JSON when it's already semantically marked up?

jauntywundrkind 5 minutes ago||
I see two totally different things from where we are today:

1. This is a contextual API built into each page. Historically, a site can offer an API, but that API is a parallel experience, a separate machine-to-machine channel that doesn't augment or extend the actual user session. The MCP API here is offered by the page (not the server/site), in a fully dynamic manner (what's offered can reflect the current state of the page), and it layers atop the user session. That's totally different.

2. This opens an expectation that sites have a standard means of control available. This has two subparts:

2a. There are dozens of different API systems to pick from when exposing your site. GitHub got halfway from REST to GraphQL, then turned back. Some sites use tRPC or capnweb or gRPC. There has never actually been one accepted way for machines to talk to your site; there's been a fractal maze of offerings on the web. This is one consistent offering, mirroring what everyone is already using now anyways.

2b. Offering APIs for your site has gone out of favor in general, and where an API is available it often has high walls and barriers. But the people putting their fingers in that leaky dam are now patently, clearly Not Going To Make It: the LLMs will script & control the browser if they have to, and it's much, much less pain to just lean in to what users want to do and expose a good WebMCP API that your users can use to be effective & get shit done, like they have wanted to do all along. If WebMCP takes off at all, it will reset expectations: the internet is for end users, and their agency & their ability to work your site as they please, via their preferred modalities, is king. WebMCP directs us towards an RFC 8890-compliant future by directly enabling user agency. https://datatracker.ietf.org/doc/html/rfc8890

foota 3 hours ago||
No, I don't think you're thinking about this right. It's more like Hacker News would expose an MCP server when you visit it, one that presents an alternative and parallel interface to the page, not "click button" tools.
cush 2 hours ago||
You're both right. The page can expose MCP tools via a form element, which is as simple as adding an attribute to an existing form and aligns completely with existing semantic HTML, e.g. submitting an HN comment. Additionally, the page can define tools in JavaScript that aren't backed by forms, e.g. YouTube could provide a transcript tool defined in JS that fetches the video's transcript.
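A rough sketch of the JS-defined case (the navigator.modelContext surface and registerTool name follow the draft explainer and may well change, and the endpoint is made up, so treat this as approximate):

    // Hypothetical WebMCP tool registration; the API names follow the
    // draft explainer and may change before anything ships.
    navigator.modelContext.registerTool({
      name: "get-transcript",
      description: "Fetch the transcript of the currently playing video",
      inputSchema: {
        type: "object",
        properties: { lang: { type: "string", description: "BCP 47 language tag" } },
      },
      async execute({ lang = "en" }) {
        // Runs in the page, so it rides the user's existing session.
        // /api/transcript is an illustrative endpoint, not a real one.
        const res = await fetch(`/api/transcript?lang=${encodeURIComponent(lang)}`);
        return { content: [{ type: "text", text: await res.text() }] };
      },
    });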

https://developer.chrome.com/blog/webmcp-epp

znpy 28 minutes ago||
I think REST and HTML could probably already be used for this purpose, but HTML is often littered with elements used for visual structure rather than semantics.

In an ideal world, HTML documents would be very simple and everything visual would be done via CSS, with JavaScript completely optional.

In such a world agents wouldn’t really need a dedicated protocol (and websites would be much faster to load and render, besides being much lighter on CPU and battery).

cadamsdotcom 4 hours ago||
Great to see people thinking about this. But it feels like a step on the road to something simpler.

For example, web accessibility has potential as a starting point for making actions automatable, with the advantage that the automatable things are visible to humans, so they're less likely to drift or break over time.

Any work happening in that space?

jauntywundrkind 1 minute ago||
Chris Shank & Orion Reed are doing some very nice work with accessibility trees. https://bsky.app/profile/chrisshank.com/post/3m3q23xpzkc2u

I tried to play along at home some, playing with the Rust accesskit crate. But man, I just could not get Orca or other basic tools to run, could not find a starting point. Highly discouraging. I thought for sure my browser would expose accessibility trees I could just look at & tweak! But I don't even know if that's true or not yet! Very sad personal experience with this.

jayd16 4 hours ago|||
In theory you could use a protocol like this, one where the tools are specified in the page, to build a human-readable but structured dashboard of functionality.

I'm not sure this is really all that much better than, say, a Swagger API. The JS interface is double-edged in that it has access to your cookies and such.

thevinter 3 hours ago|||
We're building an app that automatically generates machine- and human-readable JSON by parsing semantic HTML tags; a reverse proxy then serves that JSON to agents instead of the HTML.
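Roughly like this, as a minimal sketch (cheerio for parsing; the agent-detection heuristic and JSON shape here are illustrative, not our actual implementation):

    // Minimal sketch: extract a page's semantic forms into JSON for agents.
    import * as cheerio from "cheerio";

    function formsToJson(html) {
      const $ = cheerio.load(html);
      return $("form")
        .map((_, form) => ({
          action: $(form).attr("action"),
          method: ($(form).attr("method") || "GET").toUpperCase(),
          fields: $(form)
            .find("input[name], textarea[name], select[name]")
            .map((_, el) => ({
              name: $(el).attr("name"),
              type: $(el).attr("type") || "text",
            }))
            .get(),
        }))
        .get();
    }

    // In the reverse proxy: agents get the JSON, everyone else the HTML.
    function respond(userAgent, upstreamHtml) {
      const isAgent = /agent|bot|mcp/i.test(userAgent || "");
      return isAgent
        ? JSON.stringify({ forms: formsToJson(upstreamHtml) })
        : upstreamHtml;
    }
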
egeozcan 4 hours ago||
As someone heavily involved in a11y testing and improvement, the status quo, for better or worse, is to do it the other way around. Most people use automated, LLM-based tooling with Playwright to improve accessibility.
cadamsdotcom 3 hours ago||
I certainly do - it’s wonderful that making your site accessible is a single prompt away!
Flux159 4 hours ago||
This was announced in early preview a few days ago by Chrome as well: https://developer.chrome.com/blog/webmcp-epp

I think that the github repo's README may be more useful: https://github.com/webmachinelearning/webmcp?tab=readme-ov-f...

Also, the prior implementations may be useful to look at: https://github.com/MiguelsPizza/WebMCP and https://github.com/jasonjmcghee/WebMCP

politelemon 4 hours ago|
This GitHub readme was helpful in understanding their motivation, cheers for sharing it.

> Integrating agents into it prevents fragmentation of their service and allows them to keep ownership of their interface, branding and connection with their users

Looking at the contrived examples given, I just don't see how they're achieving this. In fact, it looks like creating MCP-specific tools will achieve exactly the opposite. There will immediately be two ways to accomplish a thing, and this will result in drift over time as developers need to account for two ways of interacting with a component on screen. There should be no difference, but there will be.

Having the LLM interpret and understand a page context would be much more in line with assistive technologies. It would require site owners to provide a more useful interface for people in need of assistance.

bastawhiz 3 hours ago||
> Having the LLM interpret and understand a page context

The problem is fundamentally that it's difficult to create structured data that's easily presentable to both humans and machines. Consider: ARIA doesn't really help LLMs. What you're suggesting is much more in line with microformats and schema.org, both of which were essentially complete failures.

LLMs can already read web pages, just not efficiently. It's not an understanding problem, it's a usability problem. You can give a computer a schema and ask it to make valid API calls and it'll do a pretty decent job. You can't tell a blind person or their screen reader to do that. It's a different problem space entirely.

root_axis 2 hours ago||
Hmmm... so are we imagining a future where every website has a vector to mainline prompt injection text directly from an otherwise benign looking web page?
jasonjmcghee 20 minutes ago|
In response to microphone or camera access proposals you could have said "so we're going to let every website have a vector to spy on us?"

This is what permissions are for.

rgarcia 3 hours ago||
This is great. I'm all for agents calling structured tools on sites instead of poking at DOM/screenshots.

But no MCP server today has tools that appear on page load, change with every SPA route, and die when you close the tab. Client support for this would have to be tightly coupled to whatever is controlling the browser.

What they really built is a browser-native tool API borrowing MCP's shape. If calling it "MCP" is what gets web developers to start exposing structured tools for agents, I'll take it.
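The lifecycle ends up looking less like a server and more like a DOM event listener. Hypothetically, something like this (I'm assuming registerTool() hands back an unregister function, and add-comment/submitComment are made up for illustration):

    // Hypothetical: tools scoped to the current SPA route. Assumes
    // registerTool() returns an unregister handle; the spec may differ.
    let activeTools = [];

    function onRouteChange(path) {
      // Tear down the previous route's tools...
      activeTools.forEach((unregister) => unregister());
      activeTools = [];
      // ...and register whatever makes sense for the new one.
      if (path.startsWith("/item")) {
        activeTools.push(
          navigator.modelContext.registerTool({
            name: "add-comment",
            description: "Post a comment on the current story",
            inputSchema: {
              type: "object",
              properties: { text: { type: "string" } },
              required: ["text"],
            },
            // submitComment is a stand-in for the page's own logic.
            async execute({ text }) {
              return await submitComment(text);
            },
          })
        );
      }
    }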

xg15 1 hour ago|
Yeah, this seems like a weird niche where an agent has to interact with an existing browser session.

That, or they expect that MCP clients should also be running a headless Chrome to detect JS-only MCP endpoints.

charcircuit 4 hours ago||
This is coming late, as skills have largely replaced MCP. Now your site can just host a SKILL.md telling agents how to use the site.
jasonjmcghee 17 minutes ago||
It's not meant to describe how to use the site; it should / can replace the need for Playwright and DOM inspection / manipulation entirely.

Think of it like "IDE actions". Done right, there's no need to ever use the GUI.

As opposed to just being documentation for how to use the IDE with desktop automation software.

hnlmorg 3 hours ago|||
The purpose of this appears to be for sites that cannot be controlled via prompt instructions alone.

I do like agent skills, but I’m really not convinced by the hype that they make MCP redundant.

dionian 1 hour ago||
Seems like skills are a better interface, but state still needs to be externally managed, even if MCP isn't used as the protocol.
fdefitte 2 hours ago|||
Skills are great for static stuff but they kinda fall apart when the agent needs to interact with live state. WebMCP actually fills a real gap there imo.
charcircuit 2 hours ago||
What prevents them from working with live state? Coding agents deal fine with the live state of evolving source code. So why can't they watch a web page or whatever update over time? This seems to be a micro-optimization that requires explicit work from the site developer to make work. Long term, I just don't see this taking off versus agents just using sites directly. A more viable long-term feature would be a way to let agents scroll the page or hover over menus without the user's own view being affected.
ATechGuy 4 hours ago||
Interesting. I'd appreciate an example. Thanks!
ednc 4 hours ago||
check out https://moltbook.com/skill.md
esafak 19 minutes ago|||
no workie
Spivak 3 hours ago|||
I really like how the shell and regular API calls have basically wholesale replaced tools. Real-life example of worse-is-better working in the real world.

Just give your AI agent a little Linux VM to play around in that it already knows how to use, rather than some specialized protocol that has to predict everything an agent might want to do.

nip 3 hours ago||
The web was initially meant to be browsed by desktop computers.

Then came mobile phones with their small screens and touch control which forced the web to adapt: responsive design.

Now it’s the turn of agents that need to see and interact with websites.

Sure, you could keep on feeding them HTML/JS and have them write logic to interact with the page, just like you can open a website in desktop mode and still navigate it: but it’s clunky.

Don’t get hung up on the name “MCP”, which has been debased: it’s much bigger than that.

vessenes 4 hours ago|
I’m just personally really excited about building CLI tools that are deployed with uvx. One line, instructions to add a skill, no faffing about with the MCP spec and server implementations. Feels like so much less dev friction.