The dataset comprised 140 network cameras, each providing 24 hours of feed. Processing ran on 328 nodes; the cluster totals 47 TB of system RAM and 3 TB of GPU RAM. See page 3, section C, Computing Clusters, for details on the processing power they utilized.
Conclusion: YOLO struggles to consistently detect the same humans and cars as their positions change from one frame to the next; it also struggles to detect objects at night. The findings suggest that state-of-the-art vision solutions should be trained on data from network cameras with contextual information before they can be deployed in applications that demand highly consistent object detection.
I found this fascinating, as I'm not satisfied with the current paradigm of my simple home network using Reolink. The capture is satisfactory, but the monitoring and the tweaking of triggers and sensitivity settings seem primitive. I've been working on a system that would allow full capture followed by time-delayed processing, foregoing the real-time experience. My objective, in part, is to whittle down interesting events and use machines to produce an executive summary. This paper seems to be along the same lines; what it does demonstrate is the tremendous amount of processing power needed to achieve such a paradigm. Moreover, it confirms my assessment that varying outdoor lighting conditions play an important role in object detection, and that known stationary objects in a fixed camera's view might be excluded from model processing by way of a mask, if you will.
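To illustrate what I mean by masking, here is a minimal sketch (my own toy code, not from the paper) that zeroes out known-static regions of a stationary camera's frame before it is handed to a detector. It assumes NumPy arrays; the function name `apply_exclusion_mask` and the toy frame are hypothetical.

```python
import numpy as np

def apply_exclusion_mask(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out pixels where mask == 0 so a downstream detector ignores them.

    frame: HxW grayscale or HxWx3 color image.
    mask:  HxW binary array; 1 = keep, 0 = exclude.
    """
    if frame.ndim == 3:
        # Broadcast the 2-D mask across the color channels.
        return frame * mask[..., None]
    return frame * mask

# Toy example: a 4x4 all-white frame; exclude the left half of the view.
frame = np.full((4, 4), 255, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[:, 2:] = 1  # keep only the right half
masked = apply_exclusion_mask(frame, mask)
```

In a real pipeline the mask would be drawn once per camera (since the view is fixed) and applied to every frame before detection, so the model never spends cycles on regions known to be uninteresting.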
Lastly, I'm not sure whether computer vision is appropriate for this forum, so please speak up if this is too esoteric.