We accidentally solved robotics by watching 1M hours of YouTube

Posted by alexcos 2 days ago

We accidentally solved robotics by watching 1M hours of YouTube(ksagar.bearblog.dev)

206 points | 165 comments

dchftcs 1 day ago|

Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.

For example, so that you don't crush a human when doing massage (but still need to press hard), or apply the right amount of force (and finesse?) to skin a fish fillet without cutting the skin itself.

Practically in the near term, it's hard to sample from failure examples with videos on Youtube, such as when food spills out of the pot accidentally. Studying simple tasks through the happy path makes it hard to get the robot to figure out how to do something until it succeeds, which can appear even in relatively simple jobs like shuffling garbage.

With that said, I suppose a robot can be made to practice in real life after learning something from vision.

carlosdp 1 day ago||

> Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.

I'm not sure that's necessarily true for a lot of tasks.

A good way to measure this in your head is this:

"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"

When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.

It therefore follows that robots should be able to learn with just RGB images too! Counterexamples would be things like grabbing an egg without crushing, perhaps. Though I suspect that could also be done with just vision.

moefh 1 day ago|||

> It therefore follows that robots should be able to learn with just RGB images too!

I don't see how that follows. Humans have trained by experimenting with actually manipulating things, not just by vision. It's not clear at all that someone who had gained intuition about the world exclusively by looking at it would have any success with mechanical arms.

amelius 1 day ago||

You'd use a two-step approach.

1. First create a model that can evaluate how well a task is going; the YT approach can be used here.

2. Then build a real-world robot, and train it by letting it do tasks, and use the first model to supervise it; here the robot can learn to rely on extra senses such as touch/pressure.

godelski 1 day ago||

You're agreeing with the parent btw. You've introduced a lot more than just vision. You introduced interventional experimentation. That's a lot more than just observation

amelius 1 day ago||

What I describe is an unsupervised system.

What you say ("interventional") sounds like it's human-supervised.

But maybe I'm interpreting it in the wrong way, so please correct me if so.

godelski 23 hours ago||

By "intervention" I mean interacting with the environment. Purpose a hypothesis, test, modify, test. You can frame RL this way though RL usually generates hypotheses that are far too naïve.

This looks like a good brief overview (I only skimmed it but wanted to give you more than "lol, google it") http://smithamilli.com/blog/causal-ladder/

amelius 9 hours ago||

Yes, you need to let the robot play (interact with the environment) to learn the vision-versus-touch correlations, but you can do so in an unsupervised way (as long as you choose the environment wisely).

jpc0 1 day ago||||

I think you vastly underestimate how difficult the task you are proposing would be without depth or pressure indication, even for a super intelligence like humans.

Simple concept, pick up a glass and pour its content into a vertical hole the approximate size of your mouth. Think of all the failure modes that can be triggered in the trivial example you do multiple times a day, to do the same from a single camera feed with no other indicators would take you hours to master and you already are a super intelligent being.

jrimbault 1 day ago|||

A routine gesture I've done everyday for almost all my life: getting a glass out of the shelves and into my left hand. It seems like a no brainer, I open the cabinet with my left hand, take the glass with my right hand, throw the glass from my right hand to the left hand while closing the cabinet with my shoulder. Put the glass under the faucet with left hand, open the faucet with the right hand.

I have done this 3 seconds gesture, and variations of it, my whole life basically, and never noticed I was throwing the glass from one hand to the other without any visual feedback.

gregmac 1 day ago||

And you're used to the weight of the glass, which you instantly recognize when you pick it up. If it was a different weight than you were expecting, you'd probably slow down and be more deliberate.

If you were to just do the exact same robotic "throw" action with a glass of unexpected weight you'd maybe not throw hard enough and miss, or throw too hard and possibly break it.

var_cw 1 day ago||||

The point is how much non-vision sensors vs pure vision, helps humans to be humans. Don't you think this point was proven by LLMs already that generalizability doesn't come from multi-modality but by scaling a single modality itself? And jepa is for sure designed to do a better job at that than an LLM. So no doubt about raw scaling + RL boost will kick-in highly predictable & specific robotic movements.

godelski 1 day ago|||

  > LLMs already that generalizability

This is not a proven statement. In fact, it's pretty clear that they don't. They have some generalization but that's not enough for what you're inferring. The best way to show this is to carefully talk to an LLM about anything you have a lot of domain expertise in. Be careful to not give it answers (information leakage can sneak in subtly) and specifically look for those small subtle details (that's why it needs to be a topic you have expertise in). "The smell" will be right but the information won't.

Also, LLMs these days aren't trained on just language

datameta 1 day ago|||

> generalizability doesn't come from multi-modality but by scaling a single modality itself

Could you expand on what you mean by this?

stavros 1 day ago|||

If I have to pour water into my mouth, you can bet it's going all over my shirt. That's not how we drink.

jpc0 1 day ago||

Except this is the absolutely most common thing humans do, and my argument is that that it will spill water all over but rather that it will shatter numerous glasses, knock them over etc all before it has picked up the glass.

The same process will be repeated many times trying to move the glass to its “face” and then when either variable changes, plastic vs glass, size, shape, location and all bets are off purely because there just plainly is the enough information

abenga 1 day ago||||

Humans did not accumulate that intuition just using images. In the example you gave, you subconsciously augment the image information with a lifetime of interacting with the world using all the other senses.

amelius 1 day ago||

Yes, without extra information, manipulating everyday objects is probably as intuitive to robots as manipulating quantum scale molecules is for humans.

godelski 1 day ago||||

  > because you as a human have really good intuition about the world.

This is the line that causes your logic to fail.

You introduced knowledge not obtained through observation. In fact, the knowledge you introduced is the whole chimichanga! It is an easy mistake to make, so don't feel embarrassed.

The claim is that one can learn a world model[0] through vision. The patent countered by saying "vision is not enough." Then you countered by saying "vision is enough if you already have a world model."

[0] I'll be more precise here. You can learn *A* world model, but it isn't the one we really care about and "a world" doesn't require being a self consistent world. We could say the same thing about "a physics", but let's be real, when we say "physics" we know which one is being discussed...

deadfoxygrandpa 1 day ago||||

counterpoint: think about all the tasks you could do with your hands and arms while your eyes are closed. i think its really a lot of stuff considering blind people can do the vast majority of things sighted people can do, and i suspect anything you could do with your eyes closed would be extremely difficult to do with a camera feed as the literal only sensory input

jaisio 1 day ago||||

> When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.

And where does this intuition come from? It was buily by also feeling other sensations in addition to vision. You learned how gravity pulls things down when you were a kid. How hot/cold feels, how hard/soft feels, how thing smell. Your mental model of the world is substantially informed by non-visual clues.

> It therefore follows that robots should be able to learn with just RGB images too!

That does not follow at all! It's not how you learned either.

Neither have you learned to think by consuming the entirety of all text produced on the internet. LLMs therefore don't think, they are just pretty good at faking the appearance of thinking.

corimaith 1 day ago||||

>"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"

There are an infinite number of scenes that can be matched to one 2d picture. And what is a scene really? The last time I checked, RGB was not a good way of input in Computer Vision and rather relied on increasing levels of gradients via CNNs to build a compositional scene. None of that is paticularly translatable to how a LM works with text.

suddenlybananas 1 day ago|||

Humans have innate knowledge that help them interact with the world and can learn from physical interaction for the rest. RGB images aren't enough.

whatever1 1 day ago||

Video games have shown that we can control pretty darn well characters in virtual worlds where we have not experienced their physics. We just look at a 2D monitor and using a joystick/keyboard we manage to figure it out.

deadfoxygrandpa 1 day ago|||

a game has very limited physics. like the buttons you press are pre-tuned to perform certain actions and you arent dealing with continuous nearly infinite possibilities with large ranges of motion, pressure, speed etc. like think about how difficult the game QWOP is because you mostly just have visual feedback

whatever1 1 day ago||

I beg to disagree. I got introduced to brand new (to me) physics of flying airplanes by MS flight simulator. None of the rules I knew in real life applied (gravity matters only sometimes, height can be traded for speed etc). Yet learned how to fly.

And when I took real classes in a real Cessna, this experience was transferable (aka the flying model I had in my brain was very similar to the one I experienced with my full body in the cockpit).

suddenlybananas 1 day ago|||

Yeah but we already have a conception of what physics should be prior to that that helps us enormously. It's not like game designers are coming up with stuff that intentionally breaks our naïve physics.

godelski 1 day ago||

I mean they do but we often have generalized (to some degree) world models. So when they do things like change gravity, flip things upside down, or even more egregious changes we can adapt. Because we have contractual counterfactual models. But yeah, they could change things so much that you'd really have to relearn and that could be very very difficult if not impossible (I wonder if anyone has created a playable game with a physics that's impossible for humans to learn, at least without "pen and paper". I think you could do this by putting the game in higher dimensions.)

namibj 1 day ago|||

If the robot already knows "how to" the happy path, the training difficulty falls severely at least if it can continue after a recovery.

dchftcs 1 day ago||

The tasks you do to recover from the failure is often different from the happy path. For example, the happy path of dumping garbage is carrying a garbage bag to a collection bin. The non-happy path is that the bin is overflowing and you have to put the bag on the ground, or if the bag leaks and you need to move to a new bag, or if the bag breaks entirely and you have to pick up the trash again.

But yeah, I think a better way to put it is that sampling the happy path would indeed make the failure case easier, but sampling just happy paths is far from sufficient from completing even some of the simplest human tasks with failure.

rocqua 1 day ago|||

On humans, you can generally see the force they apply by looking at strain.

dchftcs 1 day ago||

The error margins will be huge, and for small enough force (like the skinning part or handling fine mechanical stuff) there's basically almost zero signal.

godelski 1 day ago||

  > Pure vision will never be enough because it does not contain information

Say it louder for those in the back!

But actually there's more to this that makes the problem even harder! Lack of sensors is just the beginning. There's well known results in physics that:

  You cannot create causal models through observation alone.

This is a real pain point for these vision world models and most people I talk to (including a lot at the recent CVPR) just brush this off as "we're just care if it works." Guess what?! Everyone that is pointing this out also cares that it works! We need to stop these thought terminating cliches. We're fucking scientists.

Okay, so why isn't observation enough? It's because you can't differentiate alternative but valid hypotheses. You often have to intervene! We're all familiar with this part. You control variables and modify one or a limited set at a time. Experimental physics is no easy task, even for things that sound rather mundane. This is in fact why children and animals play (okay, I'm conjecturing here).

We need to mention chaos here, because it's the easiest way to understand this. There's many famous problems that fall into this category like the double pendulum, 3 Body Problem, or just fucking gas molecules moving around. Let's take the last one. Suppose you are observing some gas molecules moving inside a box. You measure their positions at t0 and at T. Can you predict their trajectories between those time points? Surprisingly, the answer is no. You can only do this statistically. There's probably paths but not deterministic (this same logic is what leads to multiverse theory btw). But now suppose I was watching the molecules too, but I was continuously recording between t0 and T. Can I predict the trajectories? Well, I don't need to, I just write it down.

Now I hear you, you're saying "Godelski, you observed!" But the problem with these set of problems is that if you don't observe the initial state you can't predict moving forwards and if you don't have very precise observation intervals you are hit with the same problem. I you turn around while I start a double pendulum you can have as much time as you want when you turn back around, you won't be able to model its trajectories.

But it gets worse still. There are confounding variables. There is coupling. Difficult to differentiate hypotheses via causal ordering. And so so much more. If you ever wonder why physicists do so much math it's because doing that is a fuck ton easier than doing the whole set of testing and then reverse engineering the equations from those observations. But in physics we care about counterfactual statements. In F=ma we can propose new masses and new accelerations and rederive the results. That's the what it is all about. Your brain does an amazing job at this too! You need counterfactual modeling to operate in real world environments. You have to be able to ask and answer "what happens if that kid runs into the street?"

I highly suggest people read The Relativity of Wrong [0]. Its a short essay by Isaac Asimov that can serve as a decent intro, though far from complete. I'm suggesting it because I don't want people to confuse "need counterfactual model" with "need the right answer." If you don't get into metaphysics, these results will be baffling.[1] It is also needed to answer any confusion you might have around the aforementioned distinction.

Tldr:

  if you could do it from observation alone, physics would have been solved a thousand years ago

There's a lot of complexity and depth that is easy to miss with the excitement, but it still matters.

I'm just touching the surface here too, and we're just talking about mechanics. No quantum needed, just information loss

[0] https://hermiene.net/essays-trans/relativity_of_wrong.html

[1] maybe this is why there are so few physicists working on the world modeling side of ML. At least, using that phrase...

liendolucas 1 day ago||

I didn't understand a single word about this post and what was supposed to be solved and had to stop reading.

Was this actually written by a human being? If so, the author(s) suffer from severe language communication problems. Doesn't seem to be grounded at least with reality and my personal experience with robotics. But here's my real world take:

Robotics is going to be partially solved when ROS/ROS2 becomes effectively exterminated and completely replaced by a sane robotics framework.

I seriously urge the authors to use ROS/ROS2. Show us, implementing your solution with ROS, pushing it to a repository and allow others to verify what you solved, maybe?. Suffer a bit with the framework and then write a real post about real robotics hands-on, and not just wander on fancy uncomprehensible stuff that probably no-one will ever do.

Then we can maybe start talking about robotics.

w4 1 day ago||

It is readily understandable if you are fluent in the jargon surrounding state of the art LLMs and deep learning. It’s completely inscrutable if you aren’t. The article is also very high level and disconnected from specifics. You can skip to FAIR’s paper and code (linked at the article’s end) for specifics: https://github.com/facebookresearch/vjepa2

If I had to guess, it seems likely that there will be a serious cultural disconnect as 20-something deep learning researchers increasingly move into robotics, not unlike the cultural disconnect that happened in natural language processing in the 2010s and early 20s. Probably lots of interesting developments, and also lots of youngsters excitedly reinventing things that were solved decades ago.

godelski 1 day ago||

  > if you are fluent in the jargon surrounding state of the art LLMs and deep learning

It is definitely not following that jargon. Maybe it follows the tech influencer blog post jargon but I can definitively say it doesn't follow jargon used in research. Which, they are summarizing a research paper. Consequently they misinterpret things and use weird phrases like "actionable physics," which is self referential. "A" physics model is necessarily actionable. It is required to be a counterfactual model. While I can understand the rephrasing to clarify to a more general audience that's a completely different thing than "being fluent in SOTA work." It's literally the opposite...

Also, it definitely doesn't help that they remove all capitalization except in nouns.

rage4774 1 day ago|||

I totally agree with you. On the other hand the theory behind it -to combine image recognition to predict the outcome based on specific physical impacts- does sound intriguing and like a somewhat newer idea.

But besides that, you‘re totally right. It’s too „loose“ since to realize that idea the process would have to be way different (and properly explained)

godelski 1 day ago|||

  > Doesn't seem to be grounded at least with reality and my personal experience with robotics.

It also doesn't match my personal experience with physics nor ML, and I have degrees in both.

You cannot develop accurate world models through observation alone, full stop.

You cannot verify accurate world models through benchmarks alone, full stop.

These have been pain points in physics for centuries and have been the major pain point even before the quantum revolution. I mean if it were possible, we'd have solved physics long ago. You can find plenty of people going back thousands of years boldly claiming "there is nothing new to be learned in physics," yet it was never true and still isn't true even if we exclude quantum and relativity.

Side note: really the paper is "fine" but I wish we didn't put so much hype in academic writing. Papers should be aimed at other academics and not be advertisements (use the paper to write advertisements like IFLS or Quanta Magazine, but don't degrade the already difficult researcher-to-researcher communication). So I'm saying the experiments are fine and the work represents progress but it is over sold and the conclusions do not necessarily follow

Btw, the paper makes these mistakes too. It makes a very bold assumption that counterfactual models (aka a "world model") are learned. This cannot be demonstrated through benchmarking, it must be proven through interpretability.

Unfortunately, the tail is long and heavy... you don't need black swan events to disrupt these models and boy does this annoying fact make it easy to "hack" these types of models. And frankly, I don't think we want robots operating in the wild (public spaces, as opposed to controlled spaces like a manufacturing floor) if I can make it think an iPhone is an Apple with just a stickynote. Sure, you can solve that precise example but it's not hard to come up with others. It's a cat and mouse game, but remember, Jerry always wins.

YeGoblynQueenne 1 day ago||

It's not a scholarly article but a blog post but you're still right to be frustrated at the very bad writing. I do get the jargon, despite myself, so I can translate: the authors of the blog post claim that machine learning for autonomous robotics is "solved" thanks to an instance of V-JEPA 2 trained on all videos on youtube. It isn't, of course, and the authors themselves point out the severe limitations of the otherwise promising approach (championed by Yan LeCun) when they say, in a notably more subdued manner:

>> the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.

>> in practice, this means you have to manually fiddle with camera positions until you find the sweet spot. very scientific. much engineering.

>> long-horizon drift

>> try to plan more than a few steps ahead and the model starts hallucinating.

That is to say, not quite ready for the real world, V-JEPA 2 is.

But for those who don't get the jargon there's a scholarly article linked at the end of the post that is rather more sober and down-to-earth:

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

https://arxiv.org/abs/2506.09985

In other words, some interesting results, some new SOTA, some incremental work. But lots of work for a big team of a couple dozen researchers so there's good stuff in there almost inevitably.

dimatura 1 day ago||

"why didn't we think of this sooner?", asks the article. Not sure who the "we" is supposed to be, but the robotics community has definitely thought of this before. https://robo-affordances.github.io/ from 2023 is one pretty relevant example that comes to mind, but I have recollections of similar ideas going back to at least 2016 or so (many of which are cited in the V-JEPA2 paper). If you think data-driven approaches are a good idea for manipulation, then the idea of trying to use Youtube as a source of data (an extremely popular data source in computer vision for the past decade) isn't exactly a huge leap. Of course, the "how" is the hard part, for all sorts of reasons. And the "how" is what makes this paper (and prior research in the area) interesting.

a_t48 1 day ago|

I definitely saw somebody at Actuate last year talking about supplementing training videos for VLA with Youtube, but I think they actually found that "any" video of the real world helped give a better physics "understanding" to the model.

hahaxdxd123 2 days ago||

Extremely oversold article.

> the core insight: predict in representation space, not pixels

We've been doing this since 2014? Not only that, others have been doing it at a similar scale. e.g. Nvidia's world foundation models (although those are generative).

> zero-shot generalization (aka the money shot)

This is easily beaten by flow-matching imitation learning models like what Pi has.

> accidentally solved robotics

They're doing 65% success on very simple tasks.

The research is good. This article however misses a lot of other work in the literature. I would recommend you don't read it as an authoritative source.

Voloskaya 2 days ago||

This article contains so many falsehoods and history rewrites that it's pretty painful to read.

imranq 2 days ago||

This was a bit hard to read. It would be good to have a narrative structure and more clear explanation of concepts.

Aurornis 1 day ago||

> This was a bit hard to read.

This writing style is prominent on Twitter and niche Discords. It's funny how much I've come to be able to cut right through it, but if you haven't seen much of it it's really hard to parse. That's by design, too. The vibe of this writing style is to project an air of confidence so strong that the author doesn't care if you get it or not. It's a sort of humblebrag where the writing is supposed to flex the author's understanding of the subject while also not caring if you get it or not.

As others have already covered, there's also some heavy stretching of the truth and rewriting of history going on in this post. That's also common of the extreme bravado in this style of semi-impenetrable writing: The vagueness and ambiguities allow the author to make grandiose claims but then wiggle out of them later if someone is astute enough to catch on.

For example: The blog post is written as “We…” but is the author part of the team? Or is he using “we” meaning society in general?

Pyxl101 1 day ago||

What's the point in writing something while "not caring" if the reader understands or not? Seems like a false confidence or false bravado to me; it reads like an attempt to project an impression, and not really an attempt to communicate.

Aurornis 1 day ago|||

Basically: If you understand the topic well, you’re not the target audience.

This is a type of information arbitrage where someone samples something intellectual without fully understanding it, then writes about it for a less technical audience. Their goal is to appear to be the expert on the topic, which translates into clout, social media follows, and eventually they hope job opportunities.

The primary goal of the writing isn’t to get you to understand the topic clearly, because that would diminish the sense that the author is more knowledgeable than you. The goal is to sound guru-like while making the topic feel impenetrably complex for you, while appearing playfully casual for the author.

dclowd9901 1 day ago||

I guess "bullshitting as a career" isn't going away any time soon.

dotancohen 1 day ago|||

This style of writing is very effective at convincing people in their impressionable years of a narrative or viewpoint, often one that is hard to defend with more traditional writing styles.

I hope I'm wrong, but this looks like an effort to normalize such writing style. As this happens, intelligent discourse and rhetoric become harder.

signal-intel 2 days ago|||

Very intentional. Their response would be: “if you need narrative structure and clear explanation of concepts, yngmi”.

YeGoblynQueenne 1 day ago||

And the answer to that would be: WNGTI.

https://www.youtube.com/watch?v=4xmckWVPRaI

Capitalia tantum.

dclowd9901 1 day ago||

It would also be good if the perspective of the article would stay put. This "we" and "they" thing was at best confusing and at worst possibly a way to get more clicks or pretend the author had something to do with the work.

october8140 1 day ago||

I was unable to make through the article (now we're talking).

contingencies 2 days ago||

This is interesting for generalized problems ("make me a sandwich") but not useful for most real world functions ("perform x within y space at z cost/speed"). I think the number of people on the humanoid bandwagon trying to implement generalized applications is staggering right now. The physics tells you they will never be as fast as purpose-built devices, nor as small, nor as cheap. That's not to say there's zero value there, but really we're - uh - grasping at straws...

dotancohen 1 day ago||

The value is in the generalisation.

For a single example, in any factory watch how humans are added as ad-hoc machines wherever a problem occurs. Machine N outputting faster than machine N+1 can accept? Have a human stack, and destack, the product between them. No matter the size, shape, it within reason the weight of the product. But most importantly: the process can begin within seconds of the problem occurring. No need for a programmer, developer, or maintenance worker to get involved. Just a clear order from the shift manager.

A general purpose robot with physical interfaces similar to a human would be very valuable for such environments. If it had the software to be as easy to instruct as a human.

contingencies 3 hours ago||

Your assumption set: conventional factory space, idle humans, traditional management, ad-hoc process with skilled managers. This is similar to the "job shop" mentality in (dying) manufacturing. You additionally assume general purpose magic hardware that can usefully do anything.

Reality: Most value is in shrinking things, excluding humans, automating management, carefully designed process, and specialist hardware that does a subset of things very well. Relying on human(oid)s is a sure-fire way to suck.

dotancohen 2 hours ago||

Correct, I'm talking about the 98% of factories in the world today and in the near future. Obviously the far future will see changes in manufacturing, just as manufacturing has seen changeds every decade since we've been manufacturing things at scale.

foobarian 2 days ago|||

I wonder if a generalized machine would have an advantage from scale, and then putting all the specialized stuff into software. We have seen this play out before.

jes5199 2 days ago|||

analogy: a CPU is more expensive, more complicated, more energy demanding than custom made circuitry, in most cases.

ahmedbaracat 2 days ago|||

Well, there’s a middle ground, kinda. Using more specialized hardware (ex: cobots) but deploy state-of-art Physical AI (ML/Computer Vision) on them. We’re building one such startup at ko-br (https://ko-br.com/) :))

contingencies 2 days ago||

Quite a few startups in your space. Many deployed with customers. Good luck finding a USP!

xyzzy123 1 day ago|||

As the vendor you can sell it with the promise that awesomeness is coming "just around the corner" with the next software update.

You can also seek investment without committing to an actual concrete business model.

jjangkke 2 days ago||

Very good point! This area faces a similar misalignment of goals in that it tries to be a generic fit-all solution that is rampant with today's LLMs.

We made a sandwich but it cost you 10x more than it would a human and slower might slowly become faster and more efficient but by the time you get really good at it, its simply not transferable unless the model is genuinely able to make the leap across into other domains that humans naturally do.

I'm afraid this is where the barrier of general intelligence and human intelligence lies and with enough of these geospatial motor skill database, we might get something that mimics humans very well but still run into problems at the edge, and this last mile problem really is a hinderance to so many domains where we come close but never complete.

I wonder if this will change with some sort of computing changes as well as how we interface with digital systems (without mouse or keyboard), then this might be able to close that 'last mile gap'.

esjeon 2 days ago||

Note that the username here is a Korean derogatory term for Chinese people.

jcrawfordor 1 day ago||

It's an interesting comment, it has the same "compliment the OP, elaborate, raise a further question" format I've seen used by apparently LLM-generated spam accounts on HN. But, the second paragraph is so incoherently structured that I have a hard time thinking an LLM produces it.

teleforce 15 hours ago||

It seems that in order for robotics and automation to work properly, AI models including LLMs, YOLO, RL and others need helps from their distant cousins namely logic, optimization and constraint programming that can be attributed as intelligent automation or namely IA [1],[2],[3],[4].

[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:

https://www.youtube.com/live/TknN8fCQvRk

[2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT (2011) [video]:

https://youtube.com/watch?v=HB5TrK7A4pI

[3] Google OR-Tools:

https://developers.google.com/optimization

[4] MiniZinc:

https://www.minizinc.org/

pr337h4m 2 days ago|

IMO, VideoMimic is a better proof-of-concept

https://www.videomimic.net/

https://www.videomimic.net/page1.html

Keyframe 2 days ago|

Looks like it was trained on Shaolin Drunken Fist videos. Does it look drunk because of the videos or because there's a discrepancy between videos and it not accounting for gravity and physics in general?

jdmichal 1 day ago||

My guess would be lack of actuators. For instance, this robot looks like it has an ankle that can only go up and down, but not roll like a human's. Also, I wonder if there's a center of gravity issue, as it almost always appears to be leaning backwards to even out.

I think it's still pretty impressive in its recoveries, even though there's an unnaturally large number of them necessary. About 8 seconds into the video on the homepage, it almost misses and ends up slipping off the second step. I've eaten shit at missing a couple inch curb, though I don't think "graceful" has ever been used as a descriptor for me. So the fact that it just recovers and keeps going without issue is impressive to me.

namibj 1 day ago||

> So the fact that it just recovers and keeps going without issue is impressive to me.

I'm pretty sure that's just a matter of reaction speed and it maintaining a constant focus/vigilance on it's movement that you'd usually not reserve outside of some sports and situations pre-identified as deserving the attention due to danger, like concentrating on balance and not getting into a position that overstresses your joints when you know it's icy.

More comments...