I think that in the current market a VMS is about video 100% of the time, and audio maybe 10% of the time, and in those cases the audio is always subservient to the video.
To answer your question, I would expect that any kind of recording only happens when there is video motion. Audio would then only be recorded when/if there was video motion to trigger the recording.
In most cases, motion-only recording is done to conserve bandwidth and storage. While audio streaming should have lower bitrates than video, it's not "free". Recording audio continuously would likely consume enough network and storage resources that the customer could have recorded a reduced resolution (~CIF to D1) stream at a low framerate/bitrate, which I feel they would often prefer over continuous audio recording if given the option.
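To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The 64 kbit/s audio figure is the standard G.711 rate; the video bitrate and the 10% motion duty cycle are purely illustrative assumptions, so plug in the values from your own cameras.

```python
def storage_mb_per_day(bitrate_kbps: float, duty_cycle: float = 1.0) -> float:
    """Decimal megabytes written per day for a stream at bitrate_kbps
    that is recorded for the given fraction (duty_cycle) of the day."""
    seconds_per_day = 24 * 60 * 60
    bits = bitrate_kbps * 1000 * seconds_per_day * duty_cycle
    return bits / 8 / 1_000_000


AUDIO_KBPS = 64     # G.711 audio; other codecs (e.g. AAC) will differ
VIDEO_KBPS = 256    # assumed low-framerate, reduced-resolution stream
MOTION_DUTY = 0.10  # assumed: motion present 10% of the time

continuous_audio = storage_mb_per_day(AUDIO_KBPS)
motion_only_audio = storage_mb_per_day(AUDIO_KBPS, MOTION_DUTY)
continuous_video = storage_mb_per_day(VIDEO_KBPS)

print(f"Continuous audio:     {continuous_audio:8.1f} MB/day")
print(f"Motion-only audio:    {motion_only_audio:8.1f} MB/day")
print(f"Continuous low video: {continuous_video:8.1f} MB/day")
```

Under these assumptions, continuous G.711 audio comes to roughly 0.7 GB per camera per day, and motion-only audio to about a tenth of that. Real figures depend on the codec, any container overhead, and how often motion actually occurs at the site.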
The only time I could see a customer reasonably expecting continuous audio recording is if audio were a major function of the security implementation. For example, a gunshot detection system might want continuous audio recording, so that if an event was missed the user could go back, review the audio, and possibly work with the manufacturer to determine why the event wasn't captured. However, in your case you describe video as the trigger component, and therefore presumably the primary use of the system.
Could you please share metrics for the bandwidth used when audio is on during video motion and when there is no video motion? The subjects may not be in the video, but their discussions or deliberations might give clues to an incident and be useful for investigations. In that case, what are the storage requirements for having audio on with motion detection only?
Normally, most modern cameras that have built-in audio and use H.264 send the audio together with the video as one combined stream.
That being said, how the VMS/NVR writes this data depends on the VMS/NVR. Some have options to record audio as a separate stream or as part of the channel.
Some cameras have "audio detection", which tells the VMS/NVR to record the same way motion detection does.
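As a rough illustration of how the two triggers could combine, here is a minimal sketch. It is not how any particular VMS/NVR works; the function name, the per-second flags, and the `post_roll` hold time are all hypothetical.

```python
def recording_flags(motion, audio, post_roll=2):
    """Decide, second by second, whether to record.

    motion, audio: equal-length sequences of per-second detection flags
                   (hypothetical inputs from the camera's detectors).
    post_roll:     seconds to keep recording after the last trigger.
    """
    flags = []
    hold = 0  # remaining post-roll seconds
    for m, a in zip(motion, audio):
        if m or a:
            # Either trigger starts (or extends) the recording.
            hold = post_roll
            flags.append(True)
        elif hold > 0:
            # No trigger this second, but still inside the post-roll window.
            hold -= 1
            flags.append(True)
        else:
            flags.append(False)
    return flags


motion = [0, 1, 0, 0, 0, 0, 0, 0]
audio = [0, 0, 0, 0, 0, 1, 0, 0]
print(recording_flags(motion, audio, post_roll=1))
```

In this example the sound at second 5 starts a recording even though nothing moved in the frame, which is the case where off-camera conversation would otherwise be lost.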
It really depends on the type of cameras and VMS/NVR you are using.