How Project Natal Works

Project Natal has two input devices:

First is a 3D camera, which can measure both the colour and the distance of every pixel.

Second is a microphone array: several directional microphones arranged in a pattern. Together they can separate out different sound sources and determine where each sound came from, which lets them filter out much of the background noise.
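
To give a feel for how an array can work out where a sound came from, here is a minimal sketch of the textbook two-microphone case: a sound from off to one side reaches one microphone slightly before the other, and that delay gives the angle. This is not Natal’s actual audio processing, which hasn’t been published. The function name, the 20 cm spacing and the assumption that the source is far away are all mine.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in air at roughly 20 °C


def arrival_angle(delay_s, mic_spacing_m):
    """Estimate the bearing of a distant sound source, in degrees.

    delay_s is how much later the sound reached the second microphone
    than the first. 0 degrees means straight ahead; positive angles are
    on the first microphone's side. Hypothetical helper for illustration.
    """
    # Extra distance the sound travelled to reach the second microphone,
    # as a fraction of the spacing between the two microphones.
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    # Noisy measurements can push the ratio slightly past +/-1.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


# A sound that arrives 0.29 ms later at the second of two microphones 20 cm apart.
print(round(arrival_angle(0.00029, 0.20), 1))
```

With those made-up numbers the source comes out at roughly 30 degrees off centre. A real array would use more than two microphones and much cleverer filtering, but the underlying geometry is the same.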

The 3D camera is actually made of two cameras: one that senses distance using infra-red, and an ordinary one that senses colour. But it helps to think of it as a single camera that can also measure depth.

The 3D camera makes Project Natal much more powerful than a 2D camera like the EyeToy. It makes it dead easy to filter out background objects and find only the objects that you are looking for. And it doesn’t need to rely on colours to recognise things; it can just look at their 3D shape.
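
To show why depth makes background filtering so easy, here is a minimal sketch that keeps only the pixels within a chosen distance range. The frame layout (rows of distances in millimetres, with 0 meaning "no reading"), the function name and the numbers are assumptions for illustration, not Natal’s real data format.

```python
def foreground_mask(depth_mm, near_mm, far_mm):
    """Return True for each pixel whose depth lies between near_mm and far_mm.

    depth_mm is a list of rows of distances in millimetres; 0 stands for
    "no reading" here (an assumption of this sketch) and is always rejected.
    """
    return [[d != 0 and near_mm <= d <= far_mm for d in row] for row in depth_mm]


# A made-up 3x4 depth frame: a player about 2 m away in front of a wall at 4 m.
frame = [
    [4000, 4000, 4000, 4000],
    [4000, 2000, 2100, 4000],
    [4000, 1950,    0, 4000],
]

mask = foreground_mask(frame, near_mm=1000, far_mm=3000)
for row in mask:
    print("".join("#" if keep else "." for keep in row))
```

This prints a small picture in which only the three pixels belonging to the player are marked with `#`. A 2D camera would have to guess the same thing from colour and movement alone.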

It could operate in complete darkness and still see the 3D shape of everything, but knowing the colours would make it easier to recognise hands and faces and other things.

The 3D camera would also allow it to easily scan in any 3D object and convert it into a virtual object in the game. It has only been demonstrated with the 2D image on a skateboard, but there is no reason why Natal couldn’t scan in any object you showed it, if you showed it from a few different angles.

One downside of the 3D camera is that it relies on line of sight and can’t see behind things. So it can’t track your hand when it goes behind your back, or when someone stands in front of you. A Wii Remote or a magnetic 6DOF tracker, by contrast, could still track it.

Another downside is that it could get confused about the distance of transparent or reflective objects, since the IR light that comes back to the camera might really have come from something further away, seen through the object or reflected in it.

The microphone array is important for speech recognition, since it can filter out noise (especially as the TV will also be making noise) and could separately recognise several speakers talking at once. It could also let virtual characters keep looking at you when you move out of the camera’s sight but are still talking to them. And in a multiplayer game it helps Natal know which player is speaking, by matching the direction each sound came from to one of the bodies it can see.
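
As a rough illustration of matching a voice to a body, the sketch below takes the direction a sound came from and picks the player whose position, as seen by the camera, lies closest to that direction. The coordinate layout, the names and the player positions are invented for the example. How Natal really does this isn’t known.

```python
import math


def bearing_deg(x_m, z_m):
    """Horizontal angle of a point seen from the sensor, 0 = straight ahead."""
    return math.degrees(math.atan2(x_m, z_m))


def who_is_speaking(sound_angle_deg, players):
    """Pick the player whose body lies closest to the sound's direction.

    players maps a name to an (x, z) position in metres, as the camera
    might report it: x is sideways, z is distance from the sensor.
    """
    return min(
        players,
        key=lambda name: abs(bearing_deg(*players[name]) - sound_angle_deg),
    )


players = {"left player": (-0.8, 2.0), "right player": (0.9, 2.2)}
print(who_is_speaking(20.0, players))  # prints "right player"
```

Here the right-hand player stands at about 22 degrees, so a voice arriving from 20 degrees is credited to them.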

It doesn’t just recognise words though. In games like Milo and Kate it can recognise the emotion of the speaker, and can tell things like whether they are telling a joke. It could also be used for rhythm games by using anything that makes a sound, or by clapping. It’s not known whether the monster game used the microphone or facial expression tracking to control breathing fire. It would have to be a good quality microphone, because normally speech recognition requires a microphone a few cm from your mouth.

But the most important part of how Project Natal works is the software. Microsoft, with some help from Peter Molyneux, went around all the research projects they’d been working on for other purposes, collected their software technology, and put it all together.

First, there’s the speech recognition. Microsoft has been working on their own speech recognition engine and API for a long time. You can use the same speech recognition engine as Natal for free in gaming right now if you want. Download GlovePIE, and train speech recognition in the Speech control panel. If you only have Windows XP, you will need to first either install the speech recognition from Microsoft Office (best way) or download and install the SAPI 5 SDK with speech recognition.

Then there’s facial recognition. You might have seen this in other Microsoft products, such as Windows Live Photo Gallery, which you can download for free and which automatically recognises the faces in your photos.

Peter Molyneux also mentions handwriting recognition, although we haven’t seen it used yet. But I’m guessing we will in some games.

And there was no doubt a lot of other code.

Then there’s a huge amount of new software which Microsoft had to write. It has to find the shapes of people, and from there convert the surface data into 48 skeletal points for each player. It can do that for 4 people at once, 30 times per second. It can even identify individual fingers if they are close enough.
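
To put those figures in perspective, here is a small back-of-the-envelope sketch. The `SkeletonPoint` layout is hypothetical; only the 48 points, 4 players and 30 updates per second come from the figures above.

```python
from dataclasses import dataclass

POINTS_PER_PLAYER = 48   # skeletal points per player, as quoted above
MAX_PLAYERS = 4          # players tracked at once
FRAMES_PER_SECOND = 30   # skeleton updates per second


@dataclass
class SkeletonPoint:
    """One tracked point in 3D space (hypothetical layout, in metres)."""
    x: float
    y: float
    z: float


def points_per_second(players=MAX_PLAYERS):
    """Total skeletal points the software must produce each second."""
    return POINTS_PER_PLAYER * players * FRAMES_PER_SECOND


def frame_budget_ms():
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / FRAMES_PER_SECOND


# An empty skeleton for one player: 48 points, all at the origin.
skeleton = [SkeletonPoint(0.0, 0.0, 0.0) for _ in range(POINTS_PER_PLAYER)]

print(points_per_second())          # 48 * 4 * 30 = 5760
print(round(frame_budget_ms(), 1))  # 33.3
```

So with a full room the software has to find 5,760 skeletal points every second, and it has only about 33 milliseconds to deal with each frame.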

The need for all that software is why Project Natal could only really have been made by Microsoft, not Nintendo, Sony, or Sega.