Google’s Super Resolution Enhancement Examined

By Brian Karas, Published Feb 10, 2017, 10:10am EST

"Zoom in and enhance, I think there is a clear shot of his face in the reflection" has been the kind of statement CCTV users have wanted to make reality for years.

Hollywood has imagined this technology, but now engineers at Google have released a whitepaper on "Pixel Recursive Super Resolution", showing real-world examples of upsampling a highly pixelated image into one with details and recognizable features.

In this report we examine Google's super resolution whitepaper, and how this technology could impact the security industry.

Recursive ***** ********** ********

*** ********** ********* ** algorithm **** *** **** low ********** ***** ******, and ************ *********** ** output **** ********* ******/**********. When ******** ** *** actual ***** *** ***-********** input *** ******* ****, the *********** ****** **** many ************, ****** ** that * ***** ******** could ****** ********* *** person, ***** **** ***** not **** **** **** to **** *** ***-*** image.

** ******* ** *** downsamples ****** ***** **** the *********, *** ******, and *** ****** ****** image *** ***** *****:

*** ******* ** ********* as ***** ******* ** an ****** ******** * painting ** * **** from * ******, ** from * **********, ***** the ****** **** ******* to *** ******** ***** on ** ************* ** how *** ******* ****** be ******* **** ***** observations ** *** *******.

***** *******/****** *** ** reconstructed, ** ** *** limited **** ** *****, the ********** **** ***** examples ** ****** ** bedrooms **** **** ************* from ******* ***-*** ***** images.

Requires ***** ***** ******/********

** ***** ** ******* accurate ******, *** ********* needs ** ** ******* beforehand ** * ****** of ****** ********** ******. If *** * ***-*** image **** **** *** correlate ** ******/******* ** has ********** ******* *** learned ****, *** ******* will ****** ** ****** inaccurate.

** ********** *** ***** on *** **** *** initial training ******* *****, ****** the ********** **** ***** that * **** **** used ** *** ******** process, ********** **** ** is **** ******** *********.

Limitations ** ***** ********** ****** *** ********

******* *** ****** ****** are *********** ********-********* *******, the ****** ***** ****** be ************ *** ******** in ********. ******* ** police ********, * ****-***** application ** **** ********** in ******** ***** **** narrow **** * ******, but ***** ****** *** be ****** ** ******** in *** **** *** a **** ****-********** ********* would **.

Compared ** ****** ***********

****** **** ********* ****** images ** *** ******** set, *** **** ** not ******** ** ** a ****** *********** *******, and *** ************ ********* include ********* ***-*** ****** to **** **** **** photo-realistic ** ****** ****** based ** ******** ******** from ******* ******. *** algorithm ** *** ********** to *** ******* ******* two ******* ********** ******, it ** ********** ** create ******* ***** **** exist ** ** ***** image.

Applications ** ******** ********

**** ********** ** ******** to ***** *********-***** "**** in *** *******" ************* to *** ****** ******* of **** *****. *** reliance ** ***** ******** would **** ** ******** the ********* *** ********, as ********* *** ***** looking *** ******* ** persons ** ******* **** have *** ****** **** seen ** *** ******* previously. ***** ***** ** applications ** *** ****-*** images ** ******* **** one ****** ** *********** identify * ****** ** object ** *** ********** of *** ***** **** another ******, ****** *** overhead ** ** ************ processing ****** *** ******** the ********* ***** ****** make **** *********** ***** the ******** ************ *** significantly *******.



Comments (21)

Just wondering how this might benefit LPR tech...even if it gave me a better guess to work off of, it could be very beneficial to me/the police.

This is somewhat different than LPR tech, though a lot of the core learning frameworks are the same.

In LPR, you are trying to find an exact match for a set of characters. This makes parts of the problem easier because you are generally limited to a very small set of possibilities (A-Z, 0-9, etc.). Also, if you can figure out (or guess) the state, you may be able to limit it further if the state uses particular sequence formats.

For this Google application, they are really attempting to "paint" an image that looks convincing to a human. It is less about precise accuracy, and more about filling in details that make the objects recognizable. For this reason, it would not be ideal for LPR: the algorithm might be able to take a blob of pixels and turn it into an image that looks very much like what you would expect a license plate to look like, but it would probably be less concerned with differentiating between an "O" and a "Q" (for example). Or, the training data could cause it to make other errors: if none of the input plates had "Q"s, but lots of them had "O"s, it might take a pixel blob of a plate that was "QQQ" and paint it as "OOO". (This is my interpretation from reading the whitepaper; there might be more to it.)


Counterpoint on your last bit --

It is very simple to write a quick filtering algorithm that would replace any 0, O, or Q with a wildcard that would search across all 3. So, if you're telling me that it could take a blob of pixels and potentially output to something that would give me a range of possible numbers, that would be immensely helpful in a real, serious investigation. If you're able to hand the police a list of even as few as 500 license plate numbers for them to search across, and something like the color of the car, they can narrow a list down incredibly quickly.
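The wildcard filter described above could be sketched roughly as follows. This is only an illustration: the function names, the sample plates, and the idea of storing plates as plain strings are assumptions, and only the 0/O/Q group comes from the comment itself.

```python
import re

# Characters treated as interchangeable. Only the 0/O/Q group comes from the
# comment above; further groups could be added, but that would be a guess.
CONFUSABLE_GROUPS = ["0OQ"]

def plate_pattern(guess):
    """Build a regex matching the guess and its look-alike variants."""
    parts = []
    for ch in guess.upper():
        group = next((g for g in CONFUSABLE_GROUPS if ch in g), None)
        parts.append(f"[{group}]" if group else re.escape(ch))
    return re.compile("".join(parts))

def search_plates(guess, known_plates):
    """Return every known plate consistent with the (possibly wrong) guess."""
    pattern = plate_pattern(guess)
    return [p for p in known_plates if pattern.fullmatch(p.upper())]

# Hypothetical registry entries, for illustration only.
registry = ["ABQ123", "AB0123", "ABC123", "ABO1234"]
print(search_plates("ABO123", registry))
```

With the sample registry, a guess of "ABO123" returns both "ABQ123" and "AB0123", which is the kind of short candidate list the comment has in mind.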

The ability to create a reasonable guess out of something totally unrecognizable would be game-changing for both the security and investigations industry and law enforcement and police work overall. The ability of law enforcement to take one tiny piece of evidence -- even an educated guess -- and start to narrow the field is extremely impressive. And, to your point, you could very easily input millions of pieces of reference data to compare against, so it could definitely "learn" from a database of plate photos from every state, etc.

I said it in another thread a while back, but by far the most interesting thing happening in the electronic security industry right now is the utilization of "big data" or data analytics to provide either relevant historical or actionable information to end-users. There are some exciting things happening in that regard already -- I think the next few years are going to see an explosion of it, specifically as the 1 or 2 companies that are currently doing it start to grow it more and more.

Very interesting... but nuts to court-admissibility.

Cool stuff, but sounds like a court admissibility nightmare...

How often does video evidence get used in trial, I wonder?

As opposed to forcing a plea...

Also, if the video evidence can just identify the perps to police, then a case can often be made without video evidence.


Although a good defense attorney would ask, "Why did you focus on my client in the first place? Lead me through the steps you took, please."

"Why did you focus on my client in the first place? Lead me through the steps you took, please."

"We took the grainy, poor quality video evidence and ran it thru the Google enhancer.  This enhancer uses various assumptions about various facial characteristics to construct an image which may or may not help in identifying the subject.  In this case, the artificially enhanced image prompted several employees to suggest that the recreation looked like the suspect.   The suspect was investigated and their alibi checked.  The suspect was also found to be in possession of the stolen merchandise."

Considering Apple's facial recognition capabilities in iPhone 6 and up, were a guy to integrate into their databases (fat chance), parse their catalogs and compare users' iCloud photos with video/images captured via surveillance, you'd be on to something.

I mean... who's to say we haven't all already agreed to this in Apple's 56-page TOS agreement we're so quick to scroll through and click 'Agree'?


Matt -

I thought about the social media integration angle. Most pics on Facebook/etc. tend to be high resolution, so they could make a good source library for input images.

However, there can be challenges for an algorithm like this that is trying to "paint", not trying to "match". You could end up with it creating an output image that is a compilation of the facial features of several people. For the purposes of creating a "painting", the image would be highly representative of the reality (a person with a round face, short hair and small nose, or a person with a square face, long hair and big ears), but it might not be accurate in terms of identifying a specific person.

If you are using a smaller dataset, say just the employees from a building, or just the students on a campus, you will likely have outputs that better match the specific individual than if you use a dataset of "all people".

They do attempt to tackle part of these problems. It comes down to figuring out which parameters can be "normalized" and which ones make up unique qualities. In the whitepaper they use an example of sampling cars: if you average the paint colors of all cars together, you would wind up with a very drab "average" color that would likely not be applicable to any vehicle. But you could average minor differences among reds or blues or greys to produce output images that were still representative.
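A small numeric illustration of the car color point, using made-up RGB values (the colors and the grouping into "red" and "blue" families are assumptions for the example, not data from the whitepaper):

```python
# Made-up RGB paint colors, grouped by color family (illustrative only).
cars = {
    "red": [(200, 30, 30), (210, 40, 35), (190, 25, 40)],
    "blue": [(30, 40, 200), (36, 50, 210)],
}

def mean_color(colors):
    """Average a list of RGB tuples channel by channel."""
    n = len(colors)
    return tuple(round(sum(c[i] for c in colors) / n) for i in range(3))

# Averaging every car together gives a color that matches no actual car.
all_colors = [c for group in cars.values() for c in group]
print("all cars:", mean_color(all_colors))

# Averaging within a family keeps the result representative of that family.
for family, colors in cars.items():
    print(family, mean_color(colors))
```

The overall average comes out as a muddy purple (133, 37, 103) that no car in the sample has, while the per-family averages (200, 32, 35) and (33, 45, 205) still look like a red car and a blue car.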

Taken to an extreme, this technology could be useful to create a next generation smart CODEC. If you can compress the background of a scene down to "there is a red truck, a blue roadster and a brown delivery vehicle", that could be less data to store than recording the actual pixels of those objects. When playing back video, the algorithm would create representations of those objects on the fly. Or you could store a few reference frames of the vehicles, along with movement paths, and then dynamically draw the objects in along the way, using learned data about appearances to create frames that realistically represent how the vehicles would look as they moved or turned, even if you did not have that actual data recorded.

Further, if the recreation can be done fast enough, and with an "average" processor, it could be done client-side, allowing the server to record and stream very low bitrate data, while still allowing a client to piece together a more detailed image.

I do not think we will see applications of this in the near-term for surveillance, but opening up this level of image processing may lead to some amazing capabilities.

Don't rule it out so quickly. Maybe not in terms of enhancement. But be on the lookout for SM integration with surveillance, AI & analytics in the very near future. 

Do you think it will be possible to pass the footage from any camera to this algorithm? Because if that is the case you could use post analytics on all your cameras on site and cross compare the "guesses" from each camera to get real close to having the actual plate. 
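One simple way to cross compare guesses from several cameras is a per-character majority vote. The sketch below assumes each camera's guess is already a string of the same length, which the comment does not specify, and the function name is made up for the example:

```python
from collections import Counter

def consensus_plate(guesses):
    """Majority vote per character position across equal-length guesses."""
    if not guesses:
        return ""
    length = len(guesses[0])
    if any(len(g) != length for g in guesses):
        raise ValueError("all guesses must have the same length")
    # zip(*guesses) yields the characters seen at each position, one per camera.
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*guesses)
    )

# Hypothetical guesses from three cameras looking at the same plate.
print(consensus_plate(["ABO123", "ABQ123", "ABQ128"]))
```

Voting like this only helps if the cameras make different mistakes. If every camera feeds the same trained model a similar blob of pixels, they may all agree on the same wrong character.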



"Do you think it will be possible to pass the footage from any camera to this algorithm?"

In theory, yes; the question is at what computational cost. The whitepaper references using 8 GPUs to handle the input from a series of static images. I did not see an exact number for the quantity of input images, but even 5 cameras running at 5 fps would generate 25 images per second, potentially requiring a LOT of GPU horsepower to process.
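The arithmetic above is easy to extend to other camera counts. In the sketch below, the per-GPU throughput is a purely hypothetical parameter, since no such figure is quoted here:

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

def total_fps(cameras, fps_per_camera):
    """Images per second produced by all cameras combined."""
    return cameras * fps_per_camera

def frames_per_day(cameras, fps_per_camera):
    """Images produced over 24 hours of continuous recording."""
    return total_fps(cameras, fps_per_camera) * SECONDS_PER_DAY

def gpus_needed(cameras, fps_per_camera, frames_per_gpu_per_second):
    """GPUs required to keep up in real time.

    frames_per_gpu_per_second is an assumed throughput, not a measured one.
    """
    return math.ceil(total_fps(cameras, fps_per_camera) / frames_per_gpu_per_second)

print(total_fps(5, 5))         # 25 images per second
print(frames_per_day(5, 5))    # 2160000 images per day
print(gpus_needed(5, 5, 2.0))  # 13, if one GPU handled 2 images per second
```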

From what I have seen, and conversations I have had with other image-processing experts, I think that we are at the stage in technology where things are very possible when it comes to advanced image processing, but still very expensive in terms of compute requirements (or even expensive in terms of hardware cost). We will likely need another couple of generations of GPU advancements before this becomes the kind of thing that is just part of an average to high end security system.



This is interesting  ...

For the most part we think in terms of local processing power for whatever we do... local processing power seems to be becoming less relevant as the cloud is becoming more pervasive. If this is important, a search could be performed by a swarm of servers in the cloud. The results could later be sent to a lowly smartphone if need be...

Speculation? Perhaps. I believe we're getting there. 


Yes to all but.. the way that DHS and other federal agencies are throwing $$ around it might come faster than estimated. Who but big brother has pockets deep enough to get this done and that in itself, at least for me, is the scary part. When evidence can be "created" anyone can be made guilty.

This is real scary stuff.

A captured image has to be proven to be unaltered for it to be acceptable as legal evidence, yet we should trust Google to decide what the final extrapolated image really is based on what a computer says it should be... if they can do this then they can also make that image appear to be anything they want it to be and it does not meet the evidence requirement. Changing the evidentiary requirement would be a huge can of worms and dangerous to those Google or the gov doesn't like. Think of the movie "Enemy of the State" or similar. No thanks, Google. Design better cameras and lenses.

I don't know that Google's software is really changing anything here.

Whether you use a completely automated algorithm (from Google or from anyone else), a semi-automated algorithm (in which a person clicks a button to apply an enhancement but doesn't necessarily know what the button does), or a completely manual alteration, you still have the legal need to show the original image and show how you arrived at the modified image.

The only problem would occur if Google (or anyone else) was able to create an altered image that could NOT be identified as altered.

I am not an image expert, but it appears that this will become more and more challenging.

...if they can do this then they can also make that image appear to be anything they want it to be and it does not meet the evidence requirement...

Neither does a sketch from a police artist.  Even so, both may be useful in identification.


Interesting discussion but I am skeptical.  More research and especially better training needs to be done.  Now I do love technology and I especially like AI neural networks and cognitive processing...

The easiest part for the algorithm to process should be those parts of the image with the greatest contrast. For example, the dark pupil and iris set against each other, and the iris set against the white portions of the eye, should be much more accurate. When I look at the middle image, the algorithm did not get the direction of the eyes correct, and when I look at the bottom image, the makeup the woman is wearing appears to fool the algorithm again. In looking at the contrast between the red lips and white teeth in the top and bottom pictures, this seems to be a little better. The shapes of the lips and mouth are more consistent between the algorithm and actual image. What I expect to see in all three pictures (or at least the top two) is better definition between the cheeks and the dark background beyond. I do not see this, especially in the top image, where the algorithm missed the jaw line shape badly. Missing this badly changes the perception of the face substantially... The top image looks like two different people, with the actual image giving the impression of being much younger in age than the algorithm's guess. All three people in the pictures appear to be looking in different directions in the algorithm's output than in the actual images, although, as mentioned, the lower image is much less so than the others... I would like to know more about how this algorithm works, I can guess but...

Another issue is that the results of the algorithm can be greatly affected by the initial scanning resolution of the picture (as opposed to the initial resolution of a low resolution camera). If color assignment is done inaccurately during the quantization (A to D conversion) process, it seems to me that would throw the algorithm off as well.

From experience, I know that most FR algorithms have difficulty with skin tones as well as with age (difficulty with datasets of younger individuals). It would certainly be interesting to run this Google Super Resolution algorithm against faces of African origin and analyze the results; as per Tim's comment, the results would probably not be very encouraging. Having said that, this is still very interesting and positive research in FR and in the right direction. I can see a lot of potential but, just like everyone else has said, it will take a while to get there.

To quote from the conclusion:

"As in many image transformation tasks, the central problem of super resolution is in hallucinating sharp details by choosing a mode of the output distribution."

It is unlikely that hallucinations would be accepted as evidence!

These systems (both this approach and previous attempts) ingest a database of hi-resolution reference images which are then used to generate hi-resolution images "matching" low resolution input. For example it might be that the system has been fed a khollection of Khardasian images (all varieties of that species).
Now suppose such systems are presented with a low-res image that is consistent with Khardasian-ness:
  • In previous approaches, the likely result is a sort of average of the various Khardasians, not matching any one in particular (a Khardasian Khimera)
  • The novelty of this approach is that it will make a definitive choice to match one of the people in its database, rather than a blend / average (so it will be Kim or Kourtney or Khloe and will not be a Khimeric blend of the three). It does not seem to be highly probable that it will pick the correct Khardasian, however: it will make a definitive choice, but perhaps the wrong choice.

Hence there is no evidentiary application, but there may well be intelligence or investigation applications, since a trivial modification could output all plausible candidates for a given low-res image. For example if an investigator somehow knows that the people in an area at a given time are members of a small set of known individuals he may be able to use this to guide further investigation (via different modalities) as to which of that small set is likely in a given low-res image.
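To illustrate what "output all plausible candidates" could look like, here is a toy sketch that is not the whitepaper's method: it downsamples each known high-res image by block averaging and lists every one consistent with the low-res input. The gallery, the names, and the tolerance are all invented for the example, and image dimensions are assumed to be divisible by the factor.

```python
def downsample(image, factor):
    """Average each factor x factor block of a 2D grayscale image (list of rows)."""
    height, width = len(image), len(image[0])
    return [
        [
            sum(image[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            / factor ** 2
            for x in range(0, width, factor)
        ]
        for y in range(0, height, factor)
    ]

def plausible_candidates(low_res, gallery, factor, tolerance):
    """Names of gallery images whose downsampled version matches low_res."""
    matches = []
    for name, high_res in gallery.items():
        small = downsample(high_res, factor)
        # Largest per-pixel difference between the two low-res images.
        error = max(
            abs(a - b)
            for row_s, row_l in zip(small, low_res)
            for a, b in zip(row_s, row_l)
        )
        if error <= tolerance:
            matches.append(name)
    return matches

# Hypothetical 2x2 "high-res" images that shrink to a single pixel.
gallery = {
    "person_a": [[0, 100], [100, 0]],
    "person_b": [[100, 0], [0, 100]],
    "person_c": [[200, 200], [200, 200]],
}
print(plausible_candidates([[50]], gallery, factor=2, tolerance=1))
```

Here person_a and person_b look different at full resolution but collapse to the same low-res pixel, so both come back as candidates. The low-res image alone cannot say which one is correct.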

