Even if machines could think, they wouldn’t be able to do anything about it

By Alberto Moel, Vice President Strategy and Partnerships, and Clara Vu, co-founder and VP Engineering, Veo Robotics

Welcome back, dear reader, to one more edition of our contribution to the automation debate. If you recall, in our last post we made the point that the dystopian buzzkill around the robopocalypse—where humans are made useless and redundant by our own robotic creations—is more or less, to paraphrase Henry Ford, bunk.

In that post, we outlined how humans and machines think differently, and how the kind of “thinking” that would allow machines to acquire some form of sentience or willful intelligence is not part of our current AI toolbox. It may very well take far longer to be realized than most people think, if it ever is.

But there is more. In order for the “thinking” part of a machine (or a human, for that matter) to have any value and functionality in applications that require interaction with the physical world, we need 1) a way to capture inputs to be processed by the “thinking” element, and 2) an output module to execute actions on the external world. The response of the external world to these output actions must then be captured by the input module and fed back to the processor, and so on.

Figure 1.

Basically, all computing systems (human or machine) require three basic modules: an input module that takes signals and data from the external environment; a processing module that stores data and uses algorithms in hardware or wetware to analyze and interpret the data; and an output module that generates signals from the system and sends them back out to the world.1 Every computing device (and, we assume, most humans) in the past, present, and future had, has, and will have this sense-process-output loop (Figure 1).
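The sense-process-output loop of Figure 1 can be sketched in a few lines of code. This is a minimal illustration only; the `run_loop`, `process`, and `act` names are our own placeholders, not any real robotics API, and the "world" is a toy dictionary:

```python
# Minimal sketch of the sense-process-output loop described above.
# The environment and the module internals are stand-ins for illustration.

def run_loop(environment, steps=3):
    state = environment["reading"]               # input module: capture a signal
    for _ in range(steps):
        command = process(state)                 # processing module: decide
        environment = act(environment, command)  # output module: act on the world
        state = environment["reading"]           # the world's response feeds back in
    return environment

def process(reading):
    # A trivial "algorithm": push the reading toward a setpoint of 10.
    return 1 if reading < 10 else -1

def act(environment, command):
    # The external world responds to the output action.
    return {"reading": environment["reading"] + command}

world = run_loop({"reading": 7})
```

The point of the sketch is the shape, not the contents: whatever happens inside `process`, it is useless without the sensing and acting steps that close the loop.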

We have discussed why machine and human computing are very different, and why machine compute is so far behind human compute that we don't have to worry about it catching up and taking all of our jobs. But on top of not matching our computing power, machines also fall spectacularly short of human perception and manipulation skills. Machine perception and actuation may lag even further behind human capabilities than machine compute does, so the path to competitively intelligent and perceptive machines is doubly fraught.

Roboticists have been successful in designing robots and automation systems capable of superhuman speed and precision. What's proven more difficult is inventing robots with human capabilities, in other words, machines that are able to inherently understand and adapt to their environments.

This reality is also known as Moravec’s paradox (after CMU robot expert Hans Moravec): “it is comparatively easy to make computers exhibit adult-level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.”

When it comes to work in the physical world, humans have a huge flexibility advantage over machines. Automating a single activity, like soldering a wire onto a circuit board or fastening two parts together with screws, is pretty easy, but that task must remain constant over time and take place in a consistent environment. The circuit board must show up in exactly the same orientation every time.

Each time the task changes—each time the locations of the screw holes move, for example—production must stop until the machinery is reprogrammed. Today’s factories, especially large ones in high-wage countries, are highly automated, but they’re not full of general-purpose flexible automated systems. They’re full of dedicated, specialized machinery that’s expensive to buy, configure, and reconfigure.

Humans, on the other hand, have hands. And eyes.2 Both are incredibly sophisticated sensing and actuation devices, orders of magnitude more flexible than those of any existing machine.

Pause for a second and look at your hand. What do you see? Five fingers, four of them with three joints and one with two (the thumb). The lateral motion of four of them is somewhat constrained, but the thumb can pivot freely, so the whole ensemble is multi-functional and has many degrees of freedom (up to 27, depending on how you count). The hand is an incredibly versatile system, capable of performing dozens of different mechanical actions.

And, as if that wasn't enough, the whole hand is covered with a protective skin that has thousands of sensitive nerve endings that can detect minute changes in pressure, temperature, and humidity and give and receive tactile feedback.

It is connected to a prodigious storage memory that allows the hand to identify objects purely by sense of touch. It can tell a saucer from a cup, a wooden desk from a metal grate, a screwdriver from a wrench, and a nut from a bolt. It can identify whether the object being touched or grasped is a cat, a dog, or a bird, and even whether it's dead or alive. And all of this without fast processor speeds: nerve signals and cognitive processing happen on millisecond timescales, orders of magnitude slower than even trailing-edge computing devices.

Couple the hand with the eye (an amazing stereoscopic imaging device) and you're now able to position the hand in space, identify objects for optimal manipulation, and further classify objects based on color and other attributes that can only be captured visually. And this is even if they're partially obscured, or if they're just representations of what you're looking for.3 And although training the human hand to become adept at manipulation can take years, it occurs naturally.4

You now have the elements for a biological feedback loop of unbelievable power, connecting the eye and the hand through the brain. This feedback loop is amply capable, with minimal training, of setting specific objective functions (e.g., insert tab A into slot B) and carrying out sets of computations and mechanical actions to reach those objectives. Additionally, the human feedback loop has the ability to learn and update its computational processes in response to input stimuli.


Consider the example of an action we are all familiar with but that most of us don’t give much thought to. Think of a shoe. Specifically, a shoe with shoelaces, old school. Tying a shoelace is not a major project for most able-bodied children and adults. It’s something done almost on autopilot, without conscious thought. But step back for a second and consider the complexity of such a simple action.

Let’s review what you need to do in order to tie a shoelace, from the very beginning. Let’s assume an initial condition where you’ve put your shoes on to go out for a jog, and one of your shoelaces has become untied:

  1. You sense that your shoelace is untied. That first sensing is haptic—you feel that your foot is loose inside the shoe. The sensors in your skin and muscles sense something is just slightly amiss, the enveloping pressure has changed, something is not quite “right.” You might not be able to consciously articulate what it is, but there are a number of flashing red lights, triggered sensors, and wetware PLCs sending signals all over the place.
  2. The sensors in your feet and lower leg muscles report something amiss and trigger a command for visual confirmation. You look down and see that your shoelace is untied.
  3. You trigger a subroutine to tie the shoelaces. You stop jogging, send signals to your muscles and skeletal system to lean down, kneel, sit, or bend over, and, using visual servoing, create a trajectory plan to move your 27-degree-of-freedom, sensor-loaded grippers (i.e., your hands) toward the shoelace area.
  4. As your hands approach the shoelace area, you adjust their trajectory via visual servoing to determine, in real time, the shoelace poses and grasp points, all the while accounting for the fact that the shoelaces may change pose arbitrarily and in random directions. Using simultaneous haptic and visual feedback fused into a single sensory process, you determine, in real time, the optimal trajectory to disentangle the shoelaces and perform a complex maneuver to re-tie them, with sub-millimeter accuracy and a high tolerance for ambiguity in execution.
  5. Switching back to foot muscle and skin feedback, you determine proper pressure and transfer that sensory information to eye-hand coordination and path planning to pull the shoelaces just so, with the right level of tension so as to conclude the subroutine.
  6. You signal your body to stand up, run quick QA tests to make sure the procedure has been completed properly, and repeat the sequence if necessary.
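As a thought experiment, the six steps above can be caricatured as a sense-plan-act loop. This is a cartoon under heavy simplification; every function name and sensor reading below is a made-up placeholder, not a real control system:

```python
# Cartoon of the shoelace-tying sequence as a sense-plan-act loop.
# All sensor readings and actions are simulated placeholders.

def tie_shoelace(sensors):
    if not sensors["foot_pressure_ok"]:             # step 1: haptic anomaly
        if not sensors["lace_tied_visual"]:         # step 2: visual confirmation
            plan = ["stop", "bend_down", "reach"]   # step 3: trajectory plan
            while not sensors["lace_tied_visual"]:  # steps 4-5: servo and re-tie
                plan.append("grasp_and_tie")
                sensors["lace_tied_visual"] = True  # pretend the maneuver worked
            sensors["foot_pressure_ok"] = True      # step 5: tension feels right
            plan.append("stand_up")                 # step 6: QA, then resume
            return plan
    return []  # nothing amiss: no subroutine triggered

actions = tie_shoelace({"foot_pressure_ok": False, "lace_tied_visual": False})
```

Of course, the hard part is everything this sketch hides: each placeholder line stands in for perception, pose estimation, and dexterous manipulation problems that remain unsolved in robotics.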

All this is happening while your computer brain is running a bunch of pumps, drives, shafts, and millions of sensors, and your SCADA system is sending signals to the MES and ERP to remind you to buy more milk before heading home.

Know of any man-made machine that can do this? No? We don’t either. Humans have judgment, dexterity, and creativity beyond the capability of any machine at this point, and that will continue to be the case into the far future.

Figure 2.

Machines, as they are currently configured, are very effective at highly specific and constrained tasks. It is exactly this narrowing of the objective function, and of the complexity of the sensing and actuating elements, that makes automation possible. While narrowly defined tasks are more easily automated, the complexity of automating tasks that require a broad combination of skills increases exponentially (Figure 2).
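A back-of-the-envelope way to see the exponential growth in Figure 2: if a task combines several independent skills, and each skill must cope with some number of situational variations, the automation has to cover roughly the product of all those variations. The numbers below are purely illustrative, not measurements:

```python
# Illustrative only: the combined task space grows exponentially with
# the number of independent skills a task requires.
def task_space(variations_per_skill, num_skills):
    return variations_per_skill ** num_skills

single_skill = task_space(10, 1)  # one narrow skill: 10 cases to handle
combined = task_space(10, 4)      # four skills combined: 10,000 cases
```

This is why a fixed soldering station is tractable while a general-purpose assembly cell is not: each added skill multiplies, rather than adds to, the situations the system must handle.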

The reality is, with a robot or any other machinery, automating a single step in a manufacturing process involves a team of engineers working for months or even years, up to millions of dollars of capital equipment purchased from many different vendors, and extensive custom engineering of mechanical parts. Often, it involves rethinking the task and changing the design of input parts, adjacent tasks, or the entire method of manufacturing.

Robotics is not just about what robots have to see or understand, but also about what robots have to do. We’ve barely started thinking about what we do with our hands and our bodies. How are we to even begin to take the human out of the loop when faced with replacing such a capable biological machine?

Even “trivial” manipulation problems aren’t exactly solvable without prodigious effort. Consider self-driving cars. Self-driving involves the extremely complicated perception problems of identifying in real time and at high speed everything in front of the vehicle and inferring the function and intent of all the objects the perception system has identified and classified.

But the “manipulation” required is actually relatively straightforward and a “2D problem”: slow down or stop, turn left, turn right, or some combination of those four operations. Tens of billions have been spent solving the self-driving sense-process-output loop and we’re still many years from sufficiently reliable systems. Now consider the arbitrary 3D manipulations required in manufacturing, and the complexity of the problem gets exponentially harder, even if the perception problem is more structured.
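One way to make the contrast concrete is to compare the size of the two command spaces. The data structures below are our own illustrative sketches, not any real vehicle or robot interface, and the joint counts are rough placeholders:

```python
from dataclasses import dataclass

@dataclass
class CarCommand:
    # Self-driving "manipulation" output: essentially a 2D control signal.
    steering: float = 0.0   # -1 (full left) to 1 (full right)
    throttle: float = 0.0   # -1 (full brake) to 1 (full acceleration)

@dataclass
class HandCommand:
    # Arbitrary 3D manipulation: a 6-DoF pose plus many finger joints.
    position: tuple = (0.0, 0.0, 0.0)      # x, y, z
    orientation: tuple = (0.0, 0.0, 0.0)   # roll, pitch, yaw
    finger_joints: tuple = (0.0,) * 21     # placeholder joint angles

car_dof = 2             # the "2D problem" described above
hand_dof = 3 + 3 + 21   # 27, echoing the hand's rough DoF count
```

Every extra degree of freedom multiplies the space of trajectories the planner must search, which is one way to see why the manufacturing manipulation problem is so much harder than the car control problem.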

In asking a machine to perceive and manipulate, we’re essentially trying to replicate the billions of years of evolution it took us, and other animals, to become adept at moving around, understanding, and interacting with our environments. Compare that to getting a computer to play a game of Go: humans only learned how to play it about 2,500 years ago, meaning the computer is replicating just an instant in evolutionary time. Same with language processing and generation—the first instance of written language is estimated to have been around 5,000 years ago, a blink of the eye in evolutionary time.

Mastering chess, Go, or poetry seems like a pretty sophisticated activity, but because we’re not even asking the machine to pick up and move the pieces, stones, or pen itself, it faces a relatively simple set of problems. Evolution has provided us with incredibly versatile and finely tuned sense-process-output loops that are complex and have been crucial to our survival. It shouldn’t be surprising that trying to replicate all of that learning and experience is going to be much more difficult than teaching a machine to play virtual games or recognize language. We are probably decades, if not a century, away from solving human-level perception and actuation for general applications.

1 We’re obviously not making any major revelatory discovery here. This idea has been around for centuries (going back to Charles Babbage and Ada Lovelace), but is best outlined by Alan Turing in his 1950 paper “Computing Machinery and Intelligence,” Mind 59.

2 And ears and noses, and heads, shoulders, knees, and toes. But we abstract from that complexity in order to make a point. Another human capability, abstraction.

3 Think about this the next time nature calls and you are trying to tell the female from the male restroom. The "representation" on the door could be a letter, a drawing, a stick figure, or some other indicator of gender, and it (usually) only takes us seconds to make the relevant discrimination. State of the art machinery would be hard-pressed to make such discriminations reliably. On the other hand, this exercise is pointless, as machines don't need to use the restroom. A periodic oil change and some bolt tightening, and they're happy as clams. Which is fitting, as clams don't have brains.

4 Although not without some potential bumps along the way. When Alberto was a tyke, he was infamous for disassembling devices into their entropic (i.e., irreversible) components, and at one point ended up taking apart the family phone. Back then, the phones were all landlines and the "handsets" (which were more like giant Bakelite boxes with rotary dials that could be used as weapons) were actually rented from the local phone company, which would conduct semi-regular spot checks. Alberto’s parents had to hide the disassembled phone in a closet and hope that the phone company's inventory management wouldn’t miss it. It’s probably still sitting there, in pieces, in a closet, just in case the phone company comes looking for it. Another time, he took the TV remote control (infrared based!) for a swim in the backyard pool, as it looked like it would make a pretty cool submarine. It ended up in one of the pool debris collection points and stayed there for decades.