Like Magic, Get Audio From Video-Only Recordings

This is pretty cool, though the fine print shows big limitations.


The two big constraints are:

  • Massive processing power: "Processing each [15 second] video typically took 2 to 3 hours using MATLAB on a machine with two 3.46GHz processors and 32GB of RAM."
  • The frame rates of the videos averaged 2000fps - that's a lot...

Also, it appears that the pixel density on the object needs to be quite high, as the FoV in the images is very tight around the object being monitored.

Surely there are espionage / military applications, but general use is far off.

For more, see the project's home page and its research paper.

This story is so last week. :)

The frame rates of the videos averaged 2000fps - that's a lot...

Though consumer-grade frame rates of 60 fps were also exploited to some degree...

In other experiments, however, they used an ordinary digital camera. Because of a quirk in the design of most cameras' sensors, the researchers were able to infer information about high-frequency vibrations even from video recorded at a standard 60 frames per second. While this audio reconstruction wasn't as faithful as it was with the high-speed camera, it may still be good enough to identify the gender of a speaker in a room; the number of speakers; and even, given accurate enough information about the acoustic properties of speakers' voices, their identities.

I think the 'quirk' here is the standard rolling shutter of CMOS sensors. Since a frame is not a snapshot of any particular moment in time, as it is with a CCD or global shutter, is it giving information on a pixel-by-pixel basis, possibly up to a max frequency of FPS x pixel count!?!

Yes, they mention a 60 fps video input and allude to an audio recording sample available on the Internet, but I could not find it on their project page.

Presumably, there is a nontrivial reduction in audio fidelity, or else they would not have focused on tests with roughly 30x the frame rate.

This is from the article:

The researchers also produced a variation on the algorithm for analyzing conventional video. The sensor of a digital camera consists of an array of photodetectors—millions of them, even in commodity devices. As it turns out, it's less expensive to design the sensor hardware so that it reads off the measurements of one row of photodetectors at a time. Ordinarily, that's not a problem, but with fast-moving objects, it can lead to odd visual artifacts. An object—say, the rotor of a helicopter—may actually move detectably between the reading of one row and the reading of the next. For Davis and his colleagues, this bug is a feature. Slight distortions of the edges of objects in conventional video, though invisible to the naked eye, contain information about the objects' high-frequency vibration. And that information is enough to yield a murky but potentially useful audio signal.

The audio vibrations of human speech (including formants) reach as high as 4000 Hz, which is way too fast for mere 60 fps (60 Hz) video. But 60 fps x 1080 rows = 64,800 Hz (about 65,000 Hz), which is still well above the 8000 Hz minimum sample rate (double the highest frequency, as demanded by the Nyquist theorem) and so sufficient to recover communication. Still a ways off as you say, but possible... So if someone demands a camera with the CMOS one-row read-out method, you'll know what they are up to...
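Here is that back-of-the-envelope arithmetic as a quick Python sketch. It is only a rough upper bound, not the researchers' method: it assumes the sensor reads its 1080 rows evenly across the whole frame period with no idle time between frames, which real cameras generally do not do. The function names are made up for illustration.

```python
def row_sample_rate_hz(fps: float, rows: int) -> float:
    """Rows read per second, ASSUMING rows are spread evenly over the
    full frame period (no blanking gap between frames)."""
    return fps * rows


def nyquist_min_rate_hz(max_signal_hz: float) -> float:
    """Minimum sample rate needed to capture a signal up to max_signal_hz."""
    return 2.0 * max_signal_hz


fps = 60              # consumer frame rate mentioned in the thread
rows = 1080           # rows in a 1080p frame
speech_max_hz = 4000  # upper end of speech content cited above

row_rate = row_sample_rate_hz(fps, rows)
needed = nyquist_min_rate_hz(speech_max_hz)
frame_only_limit = fps / 2  # highest frequency whole frames alone can capture

print(f"Row-level sample rate:     {row_rate:.0f} Hz")
print(f"Needed for speech:         {needed:.0f} Hz")
print(f"Frame-level Nyquist limit: {frame_only_limit:.0f} Hz")
print(f"Row rate sufficient?       {row_rate >= needed}")
```

Treating each row as one sample is the optimistic case; any dead time between frames leaves gaps in the signal, which may be part of why the article calls the 60 fps result "murky".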

And as Luke implies, take care when the camera is rolling, because the shutter might be too. ;)

Shhhhhhh! The plants are listening.

I can't believe this hasn't been the pivotal topic of a CSI episode yet.

Your intuition is correct:

In the episode, "Committed," Gil Grissom, the character who heads up a crew at the Las Vegas crime lab, brings equipment into a psychiatric institution to try to solve a murder there. ...Grissom hypothesizes that with a Doppler laser and optical transducer, he might derive sounds from the clay that could provide clues. With a computer hook-up, they hear a high-pitched noise and something that sounds like a word — the mother's pet name for her son.

Here of course we have the over-the-top addition of the sounds being in the past and the vibrations being encoded in the clay, but that's CSI. At least they got the Doppler part right.

Very interesting. This could be worth a PhD. I bet this has been well known by military researchers for years.

This is the exact opposite of what they did in "The Dark Knight". They used audio-only recordings to map out a video image.