Hikvision Nvidia Supercomputing Partnership

Author: John Honovich, Published on Jul 18, 2016

Hikvision is gearing up its supercomputing efforts.

Partnering with Nvidia, one of the US' largest tech companies, doing $5 billion revenue annually, Hikvision is getting the first of Nvidia's new supercomputers inside China.

In this report, we examine what Hikvision is doing, what Nvidia is offering and what this means for the future of video surveillance.


Comments (37)

...the new DGX-1 is specified at 170 Teraflops/s, though still not cheap coming in at $129,000 price and also not coming with video decoding.

The Nvidia computer doesn't come with a video card? In case you want to use an AMD?

I added a note to qualify it as 'large scale'. I am not sure what video card it comes with or without, but it does not appear to be built for handling decoding of video like the Tegra does.

Wondering how they got around the supposed U.S. supercomputer export restrictions you always hear about.

Soon they might be restricting exports to us though:

I guess this means we will be seeing a Hikvision clone of the Nvidia box in 6 months or so.

By contrast, the new DGX-1 is specified at 170 Teraflops/s, though still not cheap coming in at $129,000 price and also not coming with video decoding.

My understanding is, yes, a DGX-1 probably could be used for video decoding, because GPUs are rather generic, but you probably wouldn't use it for that. Their purpose is to save time training neural networks: where the rest of us have access to hardware that means waiting 3 weeks for a network to train, those with a DGX-1 may only have to wait 1 day (for example). I think it is meant for accelerating the R&D side of deep learning. At a price of 129,000 USD, you'd want to keep it running 24/7 doing the most processor-intensive work in order to pay for itself, and you wouldn't bog it down with decoding video if there was another way.

I am also thinking, if it was a western company, e.g. a major VMS or camera manufacturer, if they purchased a DGX-1, then they'd probably keep it quiet.

I am also thinking, if it was a western company, e.g. a major VMS or camera manufacturer, if they purchased a DGX-1, then they'd probably keep it quiet.

Agreed. In fairness, I am pretty sure Hikvision wants to keep this quiet as well. The announcement was not on Hikvision's website (not even the Chinese-only version). However, it is on their government parent company's site.

And Nvidia declined comment.

Hmm... soon the Person of Interest show will not be science fiction anymore...

I don't know about that :-), but I think the main outcome of this type of technology, at least in the immediate future, is better tools for searching archived video in VMS/NVRs.

But how will you add this in to VMS / NVRs without a lot more cost / new equipment? Since it requires a lot more processing power, this does not fit into the typical model of either open VMS on COTS Intel servers or low cost low power NVRs. Yes/no?

My understanding is that it requires a huge amount of processing power to train a neural network, but not as much to apply an input and get an output once it is trained. Now, there are always going to be exotic solutions that train and update themselves in real time, but let's keep things simple. The deep learning frameworks out there seem to automatically scale with hardware, so the more GPUs you have installed on the machine, the better they perform, and if there are no GPUs then they just use the CPUs, in which case it is a lot slower. At what point the lack of processing power causes everything to become so slow as to be useless, I am not sure. But maybe that means you can only process one channel and every second I-frame - still useful in some circumstances. Keep in mind GPUs are still getting faster and cheaper, and their power requirements are decreasing.

My understanding is that it requires a huge amount of processing power to train a neural network, but not as much to apply an input and get an output once it is trained.

Hikvision VP says 2TFlops for surveillance video:

I am sure it takes far more for initial training and it must vary depending on what type of DNN one has developed but, from everything I have seen so far, it is still quite high compared to traditional analytics.

Hikvision VP says 2TFlops for surveillance video

But to put that into perspective, Nvidia's GTX 1080 cards are around 9 Teraflops, and they are meant for gamers. The Tesla M4 server GPU is around 2.2 teraflops.

NVIDIA Tesla M4

"7x more power-efficient processing than CPUs for deep learning at 20 images/sec/watt"

So it does look to me that the potential to improve VMS analytics is at least in the realm of possibility. If it proves too expensive to add a GPU to every VMS server, or too processor-intensive to analyse more than a single camera stream (say), then perhaps there is still the potential to upload archived data from the VMS and do searches on the client. That could still be useful in some situations.

As for another comparison, the Tegras being used by Hikvision have a max of ~1 teraflops.

I do not know why Hikvision chose the Tegras vs the GTX 1080, but that's surely due to my lack of understanding of the details.

So I agree with you that the trends (mid to long term) are favorable, but right now it looks like adding hundreds of dollars (minimally) in pure hardware costs to do this.
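Pulling the teraflops figures from the last few comments together, here is a quick back-of-envelope sketch of how many 2 TFlops streams each card could theoretically host. These are peak marketing numbers and the 2 TFlops/stream figure is only the quoted VP estimate, so real throughput would be lower:

```python
# Rough streams-per-GPU estimate from the figures cited in this thread.
# Peak TFlops only; actual sustained throughput would be lower.
tflops_per_stream = 2  # Hikvision VP's quoted figure for surveillance video

gpus = {"GTX 1080": 9.0, "Tesla M4": 2.2, "Tegra TX1": 1.0}

for name, tflops in gpus.items():
    streams = int(tflops // tflops_per_stream)
    print(f"{name}: ~{streams} stream(s)")
```

Notably this is consistent with the "2 TX1s per channel" claim later in the thread (2 x ~1 TFlops = 2 TFlops).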

Tegras are GPUs for mobile devices, aren't they? Therefore lower power than a desktop GPU like the GTX 1080 and better suited for server hardware. GTX 1080s are power-hungry desktop GPUs and I wouldn't put one in a VMS server. I was thinking, however, that a traditional gaming/desktop GPU could still be useful if you do offline analysis on the client machine, e.g. upload 10 hours of video from the VMS server to the client for 2 cameras after an incident, and do a search for all people in green shirts. Wouldn't be as useful as realtime analysis on every channel, but still useful and not really any additional cost.

Yes, this is a very interesting scenario. When you have thousands of cameras, it may be too expensive to analyze them all in real time. But when you are investigating a case, you know which cameras and which periods you want to analyze, so the ability of the VMS to transfer video from the recording server and generate metadata during this process is unique and valuable, I believe.

There is an efficient way of using a neural network in real time: analyze not the raw high-res 25fps stream, but only captured objects. I believe this technology will help dramatically reduce false alerts. After getting an object from a conventional motion detector, we can try to classify it with the neural network. So it is one piece of a frame every few seconds, not 25 full frames in one.
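As a rough illustration of that idea (a sketch, not anyone's actual implementation - the frame-difference detector and the classifier here are toy stand-ins), a cheap motion check can gate the expensive network call so it only runs when something changes:

```python
# Sketch: gate an expensive classifier behind a cheap frame-difference
# motion detector, so the network sees occasional crops, not 25 fps.
# Frames are modeled as 2D lists of pixel intensities; classify() is a stub.

def motion_regions(prev, curr, threshold=30):
    """Return (row, col) pixels that changed by more than `threshold`."""
    changed = []
    for r, (p_row, c_row) in enumerate(zip(prev, curr)):
        for c, (p, q) in enumerate(zip(p_row, c_row)):
            if abs(p - q) > threshold:
                changed.append((r, c))
    return changed

def classify(frame):
    """Stand-in for the expensive neural-network call."""
    return "object"

def process_stream(frames):
    calls = 0
    prev = frames[0]
    for curr in frames[1:]:
        if motion_regions(prev, curr):  # cheap check on every frame
            classify(curr)              # expensive call, only on motion
            calls += 1
        prev = curr
    return calls

# 10 frames: 5 static, then 5 with a bright object. The classifier runs
# once (at the static-to-moving transition) instead of 9 times.
static = [[10] * 4 for _ in range(4)]
moving = [[200] * 4 for _ in range(4)]
frames = [static] * 5 + [moving] * 5
print(process_stream(frames))  # 1
```

A real system would crop a bounding box around the motion region and batch the crops, but the load-reduction principle is the same.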

The biggest problem here is learning - not only the processing power it consumes, but also the process itself. We've tested AlexNet and it doesn't work well for our applications. 1000 classes of objects is too much for our case. And maybe the quality of the training photos was too good compared with typical surveillance pictures. Training the network is the biggest challenge here.

That is what I understand also: the image is broken down into bounding boxes, and ideally each bounding box contains only one object to be categorized by the deep learning network (DLN) - but there could be several per image. Presumably, therefore, the load on the DLN would depend on what the camera is viewing. A camera on a busy city street with lots of people continuously walking would probably have a high processor load. The opposite would be detecting the presence of a car in a car park overnight that would normally be empty - perhaps in this case the server may not need a GPU at all... ?

This is an interesting photo I took during IFSEC:

HikVision 1U server with 8 GPUs on board!

Interesting. The question I ask is, are these off-the-shelf Nvidia GPU boards that I am seeing here? ... or is Hikvision taking the Tegra processor directly and building their own circuit boards around it.

U2 - That's a 1RU rack case with 16x Tegra chips running. That's what's under all those heat sinks. It's a really innovative way to put ~14 Teraflops of processing power in a very low power consumption appliance: ~300w.

The picture is displaying sideways. Download it and rotate it clockwise 90° and it gives better context.

Rack in 40 of these with fast switches laced throughout the rack and you've got a half petaflop of distributable neural net trained AI-driven video analytics for a couple hundred thousand bucks and the power and cooling bills of a studio apartment.
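Checking that rack-level arithmetic with the commenter's own figures (~14 TFlops and ~300 W per 1RU appliance, 40 per rack - these are the thread's estimates, not official specs):

```python
# Rack-level arithmetic using the commenter's per-appliance figures.
units = 40
tflops_per_unit = 14
watts_per_unit = 300

total_tflops = units * tflops_per_unit      # 560 TFlops
total_pflops = total_tflops / 1000          # 0.56 - roughly half a petaflop
total_kw = units * watts_per_unit / 1000    # 12 kW for the whole rack

print(total_tflops, total_pflops, total_kw)
```

So the "half petaflop" claim checks out; 12 kW is modest by datacenter standards, though more than a studio apartment draws.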

Stuff like this is going to make for an interesting new generation of surveillance video.

Some more info here, from NVIDIA GTC China

AI City, 1 Billion cameras by 2020

From the video:

"with the enormous volume of data they (Hikvision) have collected..."

It is occurring to me that we might have been missing the point slightly when talking about hardware. The main thing you need to develop a deep learning product like what we have been discussing here, is not really the hardware, but the enormous amount of data you need to train the DLNs. This must surely be one area where Hikvision's relationship with the Chinese government gives them a significant advantage. In contrast, most VMS/camera companies, especially in the west, have no access to the data their products record, as this is obviously not owned by them.

2, thanks for sharing, very interesting video.

I've embedded the relevant section below:

To your point:

The main thing you need to develop a deep learning product like what we have been discussing here, is not really the hardware, but the enormous amount of data you need to train the DLNs.

I do agree that the data is critical, but even once you've trained the DLN, it still takes a ton of processing to run.

That video confirms you need 2 TX1s per video channel and ~8 per 1RU box consuming 300 watts.

I certainly believe that is far less than traditional approaches but with TX1 pricing ~$500 per unit and the cost of the box and power, it is still a lot of money per channel.
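Using just the numbers in this comment (2 TX1s per channel at ~$500 each, ~8 TX1s per 1RU box), the per-channel module cost works out as:

```python
# Per-channel cost check from the figures cited above.
# Excludes the box itself, power, and cooling.
tx1_price = 500        # ~USD per TX1 module
tx1_per_channel = 2    # per the video referenced above
tx1_per_box = 8        # per 1RU box

module_cost_per_channel = tx1_price * tx1_per_channel  # $1000 per channel
channels_per_box = tx1_per_box // tx1_per_channel      # 4 channels per 1RU

print(module_cost_per_channel, channels_per_box)
```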

At what resolution and frame rate is each stream? 1080, 25fps? What if you just apply it to a second stream at roughly 1/3 the width and height and only 5 fps? Right there you have reduced the hardware requirements to (1/3) * (1/3) * (1/5) = 1/45. Presumably there would be a compromise in quality, but I bet you could still get good results.
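That reduction factor is straightforward to verify, since per-stream pixel throughput scales linearly with width, height, and frame rate:

```python
# Per-stream throughput scaling: width x height x frame rate.
from fractions import Fraction

width_scale = Fraction(1, 3)
height_scale = Fraction(1, 3)
fps_scale = Fraction(1, 5)

reduction = width_scale * height_scale * fps_scale
print(reduction)  # 1/45
```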

Even if you do that, you still need a tonne of data. If you want 60,000 images of people as seen through the perspective of a security camera, where do you get it? You can use generic images of people, cars etc. from the internet, but ideally you'd use images taken from security cameras themselves, which tend to be higher up looking down, and, well, everything just looks different through a security camera... Where can you get data like that?

Anywhere on the internet that contains 1000s of hours of stock video surveillance footage, from 1000s of different cameras, indoors and outdoors?

At what resolution and frame rate is each stream? 1080, 25fps? What if you just apply it to a second stream at roughly 1/3 the width and height and only 5 fps? Right there you have reduced the hardware requirements to (1/3) * (1/3) * (1/5) = 1/45. Presumably there would be a compromise in quality, but I bet you could still get good results.

That's just speculation.

(1) the slide only lists resolution. You added in the frame rate.

(2) since this is a sales presentation, it is more reasonable to assume they are using optimistic numbers. You are assuming they are overstated by a factor of 45x. If they thought they could do it at 1/45 the combined pixel count and frame rate, they likely would have used that to make the channel count per TX1 / server far higher.

My point is still valid, which is you can always step down the resolution and frame rate, and therefore HW requirements, and still get useful results. Just assume they are talking 5 fps, then I would say step down to 1 fps and my maths would still work out the same. As Murat said above, even more optimization if you combine it with VMD.

Here at the office I have four outdoor cameras. All I want to know is if an image contains a person or a vehicle. Most of the time nothing is happening, and I would probably only need to process an image every 1 minute on average if combined with VMD. I wouldn't even need a GPU for that. But I would still need lots of data to train the network; that, I find, is the stumbling block.

No doubt, to get the full benefits of DL, e.g. recognizing a fight breaking out on a street, then you probably do need the sort of hardware they are talking about, and even more data...

Just assume they are talking 5 fps, then I would say step down to 1 fps and my maths would still work out the same.

But, as you know and acknowledge, 1fps would significantly reduce the performance.

Basically, you are assuming that Nvidia's marketing people are hurting their marketing claims by using unrealistically high stream requirements. I am simply saying that is an imprudent assumption to make.

Basically, you are assuming that Nvidia's marketing people are hurting their marketing claims by using unrealistically high stream requirements. I am simply saying that is an imprudent assumption to make.

Not really, they would know that developers are aware there are many different ways DL can be applied to video analysis, or indeed any task. Simple analysis to detect if 1 in 10 images contains a vehicle or person (my example above), is not the same thing as "recognise that someone has had an accident, recognise that a pet or child has been lost..." . The first is easy (at least conceptually) but I wouldn't have a clue how to do the second.

The Pascal architecture (Nvidia 10xx) is too new for them to have used, but the mobile chips are now available, so in a Moore's-law-like fashion they just got a massive boost in capability at the same or lower price & power.

If you build a 'server farm' that puts the processing architecture in the cloud and makes it a demand-based service available to any VMS installation, then we start to have a model that could be practical. No hardware to buy and you only pay for what you use, but when you use it, you have more capability available than would be practical to build into a box.

If I were HIK, that's what I'd do; then I'd build the capability into my VMS client to interact with it, so customers can do things like search video, set criteria and push video to that service.

But hey, that's just me.

If you build a 'server farm' that puts the processing architecture in the cloud and makes it a demand-based service available to any VMS installation, then we start to have a model that could be practical.

If I were HIK, that's what I'd do

So customers have to send their video to a cloud system run by the Chinese government?

I understand what you mean from a technical perspective, just strikes me as debatable from a security perspective for many customers.

And if someone in the U.S. licensed the technology, ran it in the U.S. and guaranteed that the data never left american soil (digital soil that is..) and that HIK or the Chinese government never had access?

So customers have to send their video to a cloud system run by the Chinese government?

ezviz cloud?

Probably comes down to cost. I don't know much about what's out there regarding online analytics services, but I know that the Microsoft one is pretty impressive:

Microsoft Cognitive API (be sure to upload your images to their demo)

Problem is the price: they charge something like $1.50 per 1000 images. So for a large city site of 1000 cameras at 1 fps, you'd be paying $1.50 per second! It adds up very quickly. So I assume that the Hikvision product exists because in many cases it is cheaper to buy your own 'server farm'.
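A quick check of that pricing arithmetic (using the ~$1.50 per 1000 images figure, which is this commenter's recollection rather than a confirmed price):

```python
# Cloud-analytics cost at scale: 1000 cameras x 1 fps = 1000 images/sec.
price_per_image = 1.50 / 1000   # ~$1.50 per 1000 images (commenter's figure)
cameras = 1000
fps = 1

images_per_second = cameras * fps
cost_per_second = images_per_second * price_per_image  # $1.50 per second
cost_per_day = cost_per_second * 86400                 # ~$129,600 per day

print(cost_per_second, round(cost_per_day))
```

At those rates a single day of continuous analysis would cost roughly as much as the DGX-1's entire $129,000 list price, which supports the point that large sites are better off owning the hardware.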

It certainly would if you were fully utilising it, in particular if you were using it for real time analysis. But, if the use case is forensic/event based and you only need the capability occasionally, then the scales would start to tip in the on-demand direction.

I like the business model though, make it demand based pricing, and easily accessible through the tools a customer uses all day...

The 2016 results for "ImageNet" (one of the most important computer vision competitions) came out fairly recently; HikVision entered for the first time, and did well:

  • http://image-net.org/challenges/LSVRC/2016/index

My (unaffiliated) summary of HikVision's results is:

  • Second in “Object Localisation” - detect, classify (from 1000 defined categories) and find bounding boxes for the five most "prominent" objects in each of a collection of photographs. I.e., what are the five most important things in the scene.
  • Second in “Object Detection” - detect, classify (from 200 defined categories) and place bounding boxes around all of the objects (in the 200 categories) in each of a collection of photographs. Useful for search and perhaps alarms.
  • Won “Scene Classification” - this is assigning the scene viewed by the camera to one of 365 categories (city street, shopping mall etc), so the camera can know which type of scene it's “looking” at and can apply appropriate analytics accordingly.
  • Middle ranking results in “Scene Parsing” - here the task is to label each pixel in an image with the category of object (from 150 object categories) which it is part of. This is (perhaps) the most difficult challenge but also has the most useful applications: it supports all the applications of the others and then some. Note that this task is new in 2016, in contrast to the others.

These are all on stills (photographs); ImageNet also has an “Object Detection in Video” task which is similar to “Object Detection” but uses 30 categories (simpler). HikVision do not seem to have entered this.

Many teams approach ImageNet as a numbers game: lots of powerful GPUs, lots of PhD students and/or postdocs trying all sorts of variations in a (one hopes) intelligent fashion. The description on HikVision's submission seems to indicate this strategy. Many of the entries involve "ensembles": a collection of networks processing the same image in parallel then taking the average (or some other combination) of the results. This improves the results by a few percent but increases the (test time) processing cost (and the training cost) by the number of networks in the ensemble.

Hik's (affiliated) summary of the competition's results here.

