Since they're delivering 720p video but using a higher-resolution chip, the best approach would be to get more samples from those extra pixels. You're right that in audio and some radio-frequency samplers, oversampling almost always means using a faster clock rate, which tends to be more expensive. But you can get the same benefit by fetching extra samples from other pixels in the focal plane array, instead of sampling fewer pixels faster. That way, you're able to use the A/Ds that are already integrated on the chip.
Optimistically, I'll call a frame of 720p video a megapixel. Although the camera is only delivering a megapixel per frame, the focal plane array could actually read out 4x the area, or 4 megapixels. The camera processor could then decimate those 4 megapixels down to one megapixel through 2x2 binning, which would also increase the dynamic range: summing the four samples in each block widens the output word by two bits relative to a single pixel's readout.
The cost driver in this approach is computational: the processor must have enough throughput to perform the 2x2 binning, which is a simple add function (three additions per output pixel).
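To make the binning step concrete, here's a minimal sketch in NumPy. The sensor dimensions (2560x1440, roughly "4 megapixels") and the 10-bit sample depth are illustrative assumptions, not the camera's actual parameters; the point is just that the decimation is nothing more than summing each 2x2 block.

```python
import numpy as np

def bin_2x2(frame: np.ndarray) -> np.ndarray:
    """Sum each 2x2 block of a (2H, 2W) frame into one output pixel."""
    h, w = frame.shape
    # Reshape so each 2x2 block occupies axes 1 and 3, then sum those axes.
    blocks = frame.reshape(h // 2, 2, w // 2, 2)
    # Sum into a wider dtype: four 10-bit values can reach 4 * 1023 = 4092,
    # which needs 12 bits -- two more bits of range than each input sample.
    return blocks.sum(axis=(1, 3), dtype=np.uint32)

# Illustrative "4 MP" frame of 10-bit samples.
raw = np.random.randint(0, 1024, size=(1440, 2560), dtype=np.uint16)
binned = bin_2x2(raw)
print(binned.shape)  # (720, 1280) -- roughly one megapixel of 720p video
```

The whole operation is reads and adds, which is why the computational cost, while real, is modest.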
Other design changes include adjusting biasing or exposure time to keep the chip out of saturation for a given light level, as well as a different lens focal length, because the effective focal plane area would grow from 720p's pixel count to 4x that many pixels. For the latter, I'm making an assumption that could be wrong: without spatial oversampling, I assumed they would have chosen to sample a contiguous megapixel, which would require a different focal length than reading out four megapixels for 2x2 binning. However, without binning, they might instead have chosen to sample one pixel out of every 2x2 block, which would image the same focal plane area and require no change in focal length when upgrading to higher dynamic range by reading out all the pixels in each 2x2 block. Either way, cost should be a wash.
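The two readout choices in that last scenario can be sketched side by side. Under the same illustrative 2560x1440 sensor assumption as before, picking one pixel per 2x2 block (no binning) and summing all four (binned) both cover the identical focal plane area and produce the same output geometry, which is why the field of view, and hence the lens focal length, would be unchanged.

```python
import numpy as np

# Illustrative "4 MP" frame of 10-bit samples.
raw = np.random.randint(0, 1024, size=(1440, 2560), dtype=np.uint16)

# Option A: sample one pixel out of every 2x2 block (no binning).
subsampled = raw[::2, ::2]

# Option B: read out all four pixels per block and sum them (2x2 binning).
binned = raw.reshape(720, 2, 1280, 2).sum(axis=(1, 3), dtype=np.uint32)

# Both options image the same focal plane area with the same output
# geometry; only the per-pixel dynamic range differs.
print(subsampled.shape == binned.shape)  # True
```

This is the sense in which upgrading from option A to option B costs nothing optically.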