The NVR that I am (fairly intimately) familiar with stores video and audio in two separate databases. Consider the following setup: 2 cameras in the same room, only one of them recording audio. If camera A has the microphone, and is set up to "record on motion", we might encounter situations where we have video from camera B, but no audio at all, simply because camera A saw no motion and therefore never recorded.
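To make the consequence concrete, here is a minimal sketch (in Python, with made-up store layouts and camera names - not the actual NVR schema) of how a playback query can come back with video but no matching audio:

```python
# Recorded segments as (start_seconds, end_seconds) tuples.
video_db = {
    "camera_a": [(0, 60)],             # camera A only recorded while it saw motion
    "camera_b": [(0, 60), (120, 180)], # camera B kept recording on its own triggers
}
audio_db = {
    "camera_a": [(0, 60)],             # audio only exists while camera A was recording
}

def has_coverage(segments, start, end):
    """True if any recorded segment overlaps the requested interval."""
    return any(s < end and e > start for (s, e) in segments)

# Playback request: camera B footage from 120 s to 180 s.
start, end = 120, 180
print(has_coverage(video_db["camera_b"], start, end))          # True  - video exists
print(has_coverage(audio_db.get("camera_a", []), start, end))  # False - no motion on A, so no audio
```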
As for H.264, it is a video format. As far as I know it does not have intrinsic support for audio. What happens is that the H.264 is saved in a file, along with one or more audio tracks (which could be AAC, MP3 or something else). The way the video and audio are laid out in the file depends on the "container" - e.g. a QuickTime file is a "container" file, just like an AVI file is a container file. A container may support multiple different video and audio codecs - so an AVI file might contain an H.264-encoded video stream and an MP3-encoded audio stream. But it might also contain video encoded with Cinepak, and thus you sometimes need to install a codec in order to open an AVI file; in the old days you'd download DivX or its dark sibling Xvid (notice the clever reversal of characters).
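If you want to see for yourself what a given container actually holds, something like the following works - a small sketch assuming ffprobe (part of FFmpeg) is installed and on the PATH, with a made-up filename:

```python
import json
import subprocess

def list_streams(path):
    """Return (codec_type, codec_name) for every stream in the container."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    info = json.loads(out.stdout)
    return [(s["codec_type"], s.get("codec_name", "?")) for s in info["streams"]]

# e.g. a modern AVI might report [('video', 'h264'), ('audio', 'mp3')],
# while an old clip could come back with Cinepak video and need a codec installed.
print(list_streams("example.avi"))
```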
A common way to place video and audio bits in the file is to interleave the two streams (guess what the "I" stands for in AVI). So you'd have 200 ms of video bits, then 200 ms of audio, then 200 ms of video and so on. In fact RTSP also supports this interleaving of data. Naturally, this makes synchronization of audio and video a little easier. If you start playing 50% into the file, the data you read is already synchronized in the file. Chris calls it "strands of DNA", which I think is a good way of looking at it. This interleaving will also work if the video is MJPEG. The RTP packet header actually contains a timestamp for the payload, and thus does not rely on the video format for synchronization.
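A toy illustration of the idea - the timestamps, chunk durations and labels are all made up, but it shows how a muxer weaves the two streams together in timestamp order:

```python
# Chunks of video and audio, each tagged with a timestamp in milliseconds.
video_chunks = [(t, "video") for t in range(0, 1000, 200)]   # one video chunk every 200 ms
audio_chunks = [(t, "audio") for t in range(0, 1000, 200)]   # one audio chunk every 200 ms

# A muxer simply merges the two lists by timestamp before writing to the container,
# so the streams end up woven together like the "strands of DNA" mentioned above.
interleaved = sorted(video_chunks + audio_chunks)

for timestamp_ms, kind in interleaved:
    print(f"{timestamp_ms:4d} ms  {kind}")

# Seeking 50% into the file just means jumping to the chunk nearest that timestamp;
# the video and audio around it are already neighbours in the file.
```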
The NVR might read the video and audio as one stream via RTSP, or it might read each stream individually, but either way some bytes belong in the video bin and others in the audio bin. So, depending on the camera driver, the NVR might tear those DNA strands apart.
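For what it's worth, here is a sketch of how a driver might tear them apart when the camera delivers everything over a single RTSP connection. In interleaved mode each RTP packet is framed as '$', a channel byte and a 2-byte length (RFC 2326); the channel numbers below are assumptions, since in reality they are negotiated during SETUP:

```python
import struct

VIDEO_CHANNEL, AUDIO_CHANNEL = 0, 2   # assumed channel mapping, normally set up via SETUP
video_bin, audio_bin = [], []

def demux(data: bytes):
    """Split an interleaved RTSP byte stream into video and audio packets."""
    pos = 0
    while pos + 4 <= len(data) and data[pos] == 0x24:        # 0x24 == '$'
        channel = data[pos + 1]
        length = struct.unpack(">H", data[pos + 2:pos + 4])[0]
        payload = data[pos + 4:pos + 4 + length]
        (video_bin if channel == VIDEO_CHANNEL else audio_bin).append(payload)
        pos += 4 + length

# Two fake frames: a 3-byte "video" packet on channel 0, a 2-byte "audio" packet on channel 2.
demux(b"\x24\x00\x00\x03abc" + b"\x24\x02\x00\x02xy")
print(len(video_bin), len(audio_bin))   # 1 1 - the strands are now in separate bins
```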
When playing from an NVR (again - the ones I know), things are a little different. You basically have 2 streams coming from the NVR. The client will look at the timestamps of the two streams and make sure they are synced up at all times. Although this seems trivial, it turns out that it can be quite a hassle. It's doable - but it takes a bit of tinkering. For whatever reason, it's one of the things that people tend to screw up - I've lost count of the number of times I just needed to "fix this little thing", which then turned out to throw the whole A/V sync out the window.
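A very stripped-down version of what the client ends up doing - the timestamps and names below are purely illustrative - looks like this: the audio clock drives playback, and a video frame is only shown once the audio has caught up to its timestamp.

```python
# Two independent streams arrive from the NVR, each with its own timestamps (seconds).
video_frames = [(0.00, "frame0"), (0.04, "frame1"), (0.08, "frame2")]   # 25 fps
audio_packets = [(0.00, "aud0"), (0.02, "aud1"), (0.04, "aud2"),
                 (0.06, "aud3"), (0.08, "aud4")]

shown = []
vi = 0
for audio_ts, _packet in audio_packets:          # audio playback drives the clock
    while vi < len(video_frames) and video_frames[vi][0] <= audio_ts:
        shown.append((audio_ts, video_frames[vi][1]))
        vi += 1

print(shown)   # each frame is released only when the audio clock reaches its timestamp
```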
LIVE is a slightly different matter. For video you can pretty much decode and show frames as you get them, but you just can't do that with audio. You can tolerate stuttering framerates, but audio is a whole different ballgame. You just cannot decode and play each packet as it arrives; it sounds TERRIBLE and it is useless. So what you need to do is to buffer a bit of the audio. But if you are buffering audio, you need to buffer video as well (to remain in sync). Once you do that, you are introducing latency. Some people hate that, especially for PTZ cameras.
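A minimal jitter-buffer sketch of that trade-off - the buffer depth (and therefore the added latency) is an assumption here; real players tune it, sometimes adaptively:

```python
from collections import deque

class JitterBuffer:
    def __init__(self, min_packets=10, packet_ms=20):
        self.queue = deque()
        self.min_packets = min_packets     # wait for ~200 ms of audio before starting
        self.packet_ms = packet_ms
        self.started = False

    def push(self, packet):
        self.queue.append(packet)
        if len(self.queue) >= self.min_packets:
            self.started = True

    def pop(self):
        """Return the next packet to play, or None while we are still buffering."""
        if self.started and self.queue:
            return self.queue.popleft()
        return None

    @property
    def latency_ms(self):
        # The video pipeline has to be delayed by roughly this much to stay in sync.
        return self.min_packets * self.packet_ms

buf = JitterBuffer()
for i in range(12):
    buf.push(f"audio-{i}")
print(buf.latency_ms, buf.pop())   # 200 ms of added latency, then packets start to flow
```

That 200 ms is barely noticeable when watching a recording, but it is exactly the kind of lag a PTZ operator will complain about when steering the camera live.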
Who knows what was said, and what was and wasn't done? Isn't it easy to just say that "oh.. that was a technical error" and blame it on the integrator? Why did the cameras have mics in the first place?