Pixel Requirements For Text Legibility?

I met with a client this week that has some specific requirements for a cashiering function. They would like the cameras mounted above the cashiers to capture text on certain documents, not just bill denomination. They specifically asked for the ability to forensically read 9pt font.

Setting aside the need to manage expectations around privacy and who would have access to the recorded video, this raises a good technical design question for me: what pixel density (pixels per foot) equates to text legibility for a given font size? And how exactly do we define legibility (assuming, of course, 20/20 vision in the person looking at the image)?

I don't yet have the measurements of the target table surface or the ceiling-to-table distance for this project, but for the sake of establishing a baseline, let's say that the target surface area is 2' x 2' 6", and the distance from the camera to the target surface is 4' 6".

What resolution and lens is required to achieve legibility of 9pt font at that distance within the entire target area? What about 10 or 12pt font? Any simple calculation we can use to scale up or down based on a different target area and camera distance?

Thanks in advance for the input!


Chris, very interesting question! We are going to do a test of just this in the next week and publish full results.

One question / comment - you say, "let's say that the target surface area is 2' x 2' 6"."

Are you OK with the FoV being that tiny? Most people want much wider coverage areas. You are going to get 400+ PPF with an HD camera in this case, which I bet will be good enough.
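As a rough check on that PPF figure, here is a minimal sketch. It assumes the 2' x 2' 6" area exactly fills the frame with the long axis of the sensor along the 2' 6" side, and it uses 1280x720 and 1920x1080 as the "HD" resolutions; those resolutions are my assumption, not something stated in the thread. Because a 16:9 frame is wider than the target area, the short axis ends up being the limiting one.

```python
# Limiting pixels per foot (PPF) when a target area must fit inside the frame.
# Assumption: the frame is sized so the whole target area is just covered.

def limiting_ppf(res_w: int, res_h: int, fov_w_ft: float, fov_h_ft: float) -> float:
    """Return the lower of the horizontal and vertical pixel densities."""
    return min(res_w / fov_w_ft, res_h / fov_h_ft)

FOV_W_FT = 2.5  # 2' 6" axis from the original post
FOV_H_FT = 2.0  # 2' axis from the original post

# Hypothetical "HD" resolutions chosen for illustration.
for name, (w, h) in {"720p": (1280, 720), "1080p": (1920, 1080)}.items():
    print(f"{name}: {limiting_ppf(w, h, FOV_W_FT, FOV_H_FT):.0f} PPF")
```

Under these assumptions 1080p clears 400 PPF and 720p falls a little short of it.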

Like I said, we are going to test and see how wide we can go and still get 9pt font.

That's great! I really look forward to seeing the results.

I'm absolutely fine with increasing the FoV for a test - I just honestly didn't think I would be able to do it without going above 3MP. The FoV I provided was an example of what I thought may be the "transaction" area (where checks/cash/paperwork might be placed) on the surface of a typical bank teller station (the type of station for my present project), but I know that a typical desk or workstation would probably be larger in a retail sales environment. It would be great if we could get a larger FoV without having to scale up the resolution tremendously.

I'm hoping that based on the test, we can have a PPF basis to scale the FoV up or down and still know what resolution is needed to hit the legibility requirement for various sizes of text.

I also wonder if text color plays a noticeable part in legibility as well? Perhaps something else for your test.

Thanks again!


Per Wikipedia, a point is 1/72 of an inch. It's not uncommon to see literature utilizing 8x6 pixel character maps (8 pixels high by 6 wide), so one might reasonably expect to be able to resolve a character of 48 (8x6) pixels. A 9 point font is 9/72 = 1/8" high, so each pixel in a 9 point, 8x6 character would be 1/64 of an inch. If that is oriented along the 2 1/2 foot axis, that suggests that 1920 pixels (30 inches x 64 pixels per inch) should be adequate. Along the 2 foot axis, 1536 pixels (24 inches x 64) should be adequate.

This could go poorly if my assumption, that a pixel of text is equal to a pixel of video, is bad. One could imagine that video pixels which are exactly 50% offset from the text pixels would require up to 2x greater resolution to resolve. If this is the case, still, one would expect that a 10 MP camera should be adequate.
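The arithmetic in the two paragraphs above can be sketched as a small calculator, which also makes it easy to try 10 or 12 point text or a different target area. This is only a sketch of that back-of-envelope estimate: the function name and the `margin` parameter are made up for illustration, and it treats the nominal point size as the character height, as the estimate above does (real glyphs, lowercase especially, are usually smaller than the nominal size).

```python
import math
from fractions import Fraction

POINTS_PER_INCH = 72  # 1 point = 1/72 inch

def required_pixels(font_pt: int, span_inches: int,
                    rows_per_char: int = 8, margin: int = 1) -> int:
    """Pixels needed across span_inches so a character of font_pt points
    is rows_per_char pixels high, scaled by a sampling margin."""
    char_height_in = Fraction(font_pt, POINTS_PER_INCH)  # 9 pt -> 1/8 inch
    pixels_per_inch = rows_per_char / char_height_in     # 9 pt -> 64 px/inch
    return math.ceil(pixels_per_inch * span_inches * margin)

LONG_AXIS_IN = 30   # 2' 6"
SHORT_AXIS_IN = 24  # 2'

for pt in (9, 10, 12):
    for margin in (1, 2):  # 2 = allowance for worst-case pixel offset
        w = required_pixels(pt, LONG_AXIS_IN, margin=margin)
        h = required_pixels(pt, SHORT_AXIS_IN, margin=margin)
        print(f"{pt} pt, margin {margin}x: {w} x {h} = {w * h / 1e6:.1f} MP")
```

For 9 point text with no margin this reproduces 1920 x 1536. With a full 2x margin in both axes it gives 3840 x 3072, about 11.8 MP, so a little above the 10 MP figure.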

Interestingly, document scanning for automated OCR suggests settings of 300 dpi, which would require 9,000 pixels across 2 1/2 feet (30 inches x 300 dpi): not reasonably obtainable with today's affordable cameras. I don't recall the IPVM discussion that inspired this effort, but I attempted to OCR generalized scenes with embedded text such as shop signs and writing on vehicles. I was surprised to discover that standard OCR software such as Adobe Acrobat and ABBYY FineReader could not resolve text from a cluttered scene, even though humans had no difficulty doing so. By this I do not mean that the text was embedded in extraneous garbage from the rest of the scene (this was what I had expected to discover). Instead I mean that there was no semblance of the images' text recognizable anywhere in the resultant OCR text. This surprised me a great deal. It seems that standard OCR software expects very high contrast with few distractors, so I would expect automated OCR to be a bridge too far for this application.

The foregoing does not consider issues such as lighting, document motion, or skew based upon oblique presentation of printed material to the camera, which could further challenge video frame rates and resolution required for adequate recognition.

Summarizing, if well lit printed material is perpendicular to the camera, while 1920x1536 pixels may provide legible text, I would want to start with 10 MP (roughly 2x in each dimension) to account for skew between source vs video pixels. I would expect that automated OCR will likely exceed reasonably accessible camera resolutions, even if background suppression or text isolation were tenable.

It will be interesting to see how IPVM's definitive real-world testing aligns with these predictions.

Good research there. OCR is a whole lot tougher than just being legible to a human though, I reckon. I think the OP is talking about regular folk reading from the image. It sounds like it's only needed on an exception basis. Also, the documents would probably contain a rather limited vocabulary, so that could make it easier. As for the pixel offset issue, I'm wondering how often that one could bite you in the a**, only because if it's video you're going to get multiple frames of the same document at random offsets when it's put down and picked up, as long as you make sure the shutter is quick enough not to blur.

Agreed, great research Horace. Jim is correct though; in this particular application I'm not so much concerned about software recognition of the text as I am about a human being able to review recorded video and make out the words or numbers on a document within the target area with a fairly high degree of accuracy. Not sure what effect a cluttered, angled, or poorly lit scene would have on legibility, but I figure that we should start with the requirements for an ideal scene and go from there...

Another unconfirmed source references an 8 vertical pixel requirement for basic legibility of 10pt font with the human eye. With that standard, I think we should be able to achieve forensic legibility (within the set parameters I listed in the original post) with a 5MP camera in the neighborhood of 1000 PPF.
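To turn that 8-pixel rule of thumb into a PPF target that scales with the FoV, here is a minimal sketch. It assumes a 5MP sensor of 2592x1944, which is a common 5MP resolution but my assumption, and it treats the nominal point size as the character height. The function names are made up for illustration.

```python
from fractions import Fraction

def required_ppf(font_pt: int, rows_per_char: int = 8) -> Fraction:
    """PPF needed so a font_pt character spans rows_per_char pixels.
    pixels/inch = rows / (font_pt / 72); multiply by 12 for pixels/foot."""
    return Fraction(rows_per_char * 72 * 12, font_pt)

def max_fov_feet(pixels: int, ppf: Fraction) -> Fraction:
    """Widest field of view (feet) a given pixel count can cover at that PPF."""
    return Fraction(pixels) / ppf

SENSOR_W_PX = 2592  # assumed 5MP sensor width

for pt in (9, 10, 12):
    ppf = required_ppf(pt)
    fov = max_fov_feet(SENSOR_W_PX, ppf)
    print(f"{pt} pt: {float(ppf):.1f} PPF needed, "
          f"max FoV width about {float(fov):.2f} ft at {SENSOR_W_PX} px")
```

By this rule 10pt text needs roughly 691 PPF and 9pt needs 768 PPF, so the roughly 1000 PPF from a 5MP camera over 2' 6" leaves some headroom. Whether 8 pixels is really enough is exactly what a real-world test has to settle.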

At the end of the day though - I have no idea without a real world test! Definitely looking forward to seeing test results.


Chris, all, we have released the test results here.