This is from the phys.org article:
The researchers also produced a variation on the algorithm for analyzing conventional video. The sensor of a digital camera consists of an array of photodetectors—millions of them, even in commodity devices. As it turns out, it's less expensive to design the sensor hardware so that it reads off the measurements of one row of photodetectors at a time. Ordinarily, that's not a problem, but with fast-moving objects, it can lead to odd visual artifacts. An object—say, the rotor of a helicopter—may actually move detectably between the reading of one row and the reading of the next. For Davis and his colleagues, this bug is a feature. Slight distortions of the edges of objects in conventional video, though invisible to the naked eye, contain information about the objects' high-frequency vibration. And that information is enough to yield a murky but potentially useful audio signal.
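The row-by-row readout the article describes can be sketched with a toy simulation (hypothetical numbers, not the researchers' actual pipeline): a vertical edge wobbles horizontally while the sensor reads one row at a time, so the recorded edge position per row samples the vibration at the row rate rather than the frame rate.

```python
import math

# Toy rolling-shutter sketch (illustrative only, not the MIT algorithm).
# A vertical edge vibrates horizontally at F_VIB Hz; the sensor reads one
# row every T_ROW seconds, so each row samples the edge at a different time.

FPS = 60                      # frame rate (assumed)
ROWS = 1080                   # rows per frame (assumed 1080p sensor)
T_ROW = 1.0 / (FPS * ROWS)    # time between consecutive row readouts
F_VIB = 440.0                 # vibration frequency (an A4 tone, far above 60 Hz)

def edge_position(t):
    """Horizontal edge position (pixels) at time t: a 1-px sinusoidal wobble."""
    return 100.0 + 1.0 * math.sin(2 * math.pi * F_VIB * t)

# "Capture" two frames: each row is read at its own timestamp, so the
# recorded edge positions form a time series sampled at FPS * ROWS Hz.
samples = [edge_position(r * T_ROW) for r in range(2 * ROWS)]

# Subtract the mean edge position; the residual wobble is the
# (murky) audio-rate signal the researchers exploit.
mean = sum(samples) / len(samples)
signal = [s - mean for s in samples]
```

The point of the sketch is that `signal` oscillates at 440 Hz even though the camera only runs at 60 fps, because each of the 1080 rows is an independent sample in time.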
The audio vibrations of human speech (including formants) reach as high as 4000 Hz, which is way too fast for mere 60 fps (60 Hz) video. But 60 fps x 1080 rows = 64,800 Hz, which even against the doubled minimum sample rate of 8000 Hz demanded by the Nyquist theorem is still more than sufficient to recover speech. Still a ways off as you say, but possible... So if someone demands a camera with the CMOS one-row-at-a-time readout method, you'll know what they're up to...
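The back-of-envelope numbers above can be checked in a couple of lines (the frame rate, row count, and speech bandwidth are the same assumed figures as in the comment):

```python
# Rolling-shutter effective sampling rate: one sample per row readout.
FPS = 60                    # assumed frame rate
ROWS = 1080                 # assumed rows in a 1080p sensor
row_rate_hz = FPS * ROWS    # effective per-row sampling rate

SPEECH_MAX_HZ = 4000                    # rough upper bound for speech formants
nyquist_rate_hz = 2 * SPEECH_MAX_HZ     # minimum rate per the sampling theorem

print(row_rate_hz)        # 64800
print(nyquist_rate_hz)    # 8000
print(row_rate_hz >= nyquist_rate_hz)  # True
```

So the per-row rate clears the 8 kHz Nyquist requirement by about a factor of eight, which is why recovery is possible in principle even if the result is murky.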
And as Luke implies, take care when the camera is rolling, because the shutter might be too. ;)