Projects / Optical Punchcard Reader

I recently helped design a puzzle wherein solvers are required to program a "robot" to solve a maze. The only input to the robot is punch cards that provide a few basic movement instructions. To help with the realism and theming I tried my hand at some basic computer vision: input was done by punching holes in actual paper and presenting it to a webcam.

Before starting on this project I knew absolutely nothing about computer vision so most of the techniques are very crude but they got the job done. Much of the original strategy came from a blog post by the author of Sudoku Grab. Since his post was so helpful to me I thought I'd return the favor to the Internet. Hopefully this is useful to someone. (If it is useful to you, I'd love to hear about it!)

Important Note: This documentation does not imply any permission to use/license/whatever any patents that I or my employer may have.

The Algorithm

Input Image

This picture shows what the webcam sees and therefore what the software needs to understand. In an effort to keep from distracting the solver the punch card is as unobtrusive as possible: there aren't any special registration marks on the corners nor are there any computer-digestible hints for the rows or columns beyond the grid provided for the humans. I would have been willing to add these had they been necessary, but it turned out they were not.

Identify the Card

Threshold Output

The first step is to identify the card (or whether there is even a card at all). That is done by thresholding a grayscale version of the image to identify pixels that are not paper. Thresholding is a fancy way of saying: for each pixel, look at the surrounding pixels, and if this pixel is darker than 90% of their average then it is ink. (Confusingly, I consider inked pixels to be "on" and show them as white, while non-inked pixels are "off" and shown as black.) Through experimentation I found that an average of the four surrounding pixels worked well.
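The per-pixel rule above can be sketched as follows. This is a Python illustration, not the original C# code, and the handling of image edges (clamping the neighborhood) is an assumption:

```python
def adaptive_threshold(gray, ratio=0.90):
    """Mark a pixel as ink when it is darker than 90% of the average
    of its four orthogonal neighbors (up, down, left, right)."""
    h, w = len(gray), len(gray[0])
    ink = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Average the four orthogonal neighbors, clamped at the edges.
            neighbors = []
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    neighbors.append(gray[ny][nx])
            avg = sum(neighbors) / len(neighbors)
            ink[y][x] = gray[y][x] < ratio * avg
    return ink
```

Because the comparison is against the local neighborhood rather than a fixed cutoff, this adapts to uneven webcam lighting across the card.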

Blob with the most pixels.

Next I run a blob extraction algorithm on the thresholded image to find any blobs in this image. A blob is simply a contiguous set of filled pixels. To find the blobs, I iterate over every inked pixel in the image. If any neighboring inked pixels have already been assigned to a blob, then all of those blobs are merged together and this pixel is also assigned into that blob. Otherwise a new blob is created containing only this pixel.
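The merging scan described above is essentially connected-component labeling. A minimal Python sketch using union-find follows; the choice of 4-connectivity is an assumption, and the original C# surely differs in detail:

```python
def extract_blobs(ink):
    """Group contiguous inked pixels into blobs via union-find."""
    h, w = len(ink), len(ink[0])
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for y in range(h):
        for x in range(w):
            if not ink[y][x]:
                continue
            parent[(y, x)] = (y, x)
            # Merge with already-visited inked neighbors (up and left).
            for ny, nx in ((y - 1, x), (y, x - 1)):
                if ny >= 0 and nx >= 0 and ink[ny][nx]:
                    union((y, x), (ny, nx))
    # Group pixels by their root; each group is one blob.
    blobs = {}
    for pixel in parent:
        blobs.setdefault(find(pixel), []).append(pixel)
    return list(blobs.values())
```

Sorting the returned blobs by pixel count then directly supports the "blob with the most pixels" heuristic used next.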

The blob we really want is the blob formed by the card's grid and, potentially, some of the punches in the grid. Heuristically this blob is the one with the most pixels.

But what if that heuristic fails? Looking at the threshold output a human can (still) easily see the card with its circles. But you can also see the two potential outlines that are not the card: the card paper's outline and the red target square's outline. In Chris's Sudoku software he didn't worry about such things: he simply required/assumed the user would fill the image with the puzzle they wanted to input by moving the camera. For my use, however, the camera is at a fixed position so I decided to handle it more robustly in a later step.

Points selected by the Jarvis march are shown in red.

Now that I have a blob, I need to find its corners so that I can treat it as a quadrilateral. To reduce the number of points I need to consider I select only the left-most and right-most point on each row and only the top-most and bottom-most point on each column. Any interior points are clearly uninteresting because they can't form a point on the bounding shape I will wrap around the blob. The Jarvis march algorithm efficiently wraps this cloud of points into a bounding shape by selecting only the minimal points needed to contain the entire cloud.
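For reference, a bare-bones Jarvis march (gift wrapping) looks like the following Python sketch; it walks the hull one point at a time, always picking the most counter-clockwise candidate:

```python
def convex_hull(points):
    """Jarvis march: return the convex hull points in counter-clockwise order."""
    def cross(o, a, b):
        # Positive when o->a->b turns counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    start = min(points)  # leftmost (then lowest) point is always on the hull
    hull = [start]
    while True:
        candidate = points[0]
        for p in points[1:]:
            # Take p if the current candidate is invalid or p is more clockwise.
            if candidate == hull[-1] or cross(hull[-1], candidate, p) < 0:
                candidate = p
        if candidate == start:
            break  # wrapped all the way around
        hull.append(candidate)
    return hull
```

Each hull point costs a full scan of the cloud, which is why pre-filtering to only the extreme point of each row and column matters.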

Combined lines are shown in red, yellow, blue, and green.

Then I combine lines that have less than a 5-degree difference in their angles. This usually results in exactly the four lines that are desired, but sometimes there may be a disconnection in the blob and a corner of the card is cut off, resulting in five lines.

I designed the cards so they are always taller than they are wide. So sorting the lines by their length leaves the left and right edges at the top of the list.

Once the left and right edges are identified, the shortest lines between them that contain the remaining line segments are chosen as the top and the bottom. This gives the four corners of the box!

As mentioned earlier, once the box is found an additional check is made to increase the robustness of the blob/box algorithm: is there at least one inked pixel inside the box (with a generous margin to ignore the outline itself)? An actual card will have a portion of the grid in the blob, while most not-cards are merely an outline, as seen above. Since the largest blobs detected are most commonly the card grid, the card's paper, and the red rectangle, the blob extraction and box identification portion of the algorithm checks the top three blobs.

Transform the Card

Transforming. Left and right line progress shown in green. Line segment between shown in blue. Sampling point shown in yellow.

A skewed, perspective-distorted quadrilateral isn't a convenient platform for building further image processing on, so the next step is to transform from those coordinates into a straight-on representation. This is effectively the inverse of the standard transform any graphics card would perform when rendering a texture onto a 3D polygon. The algorithm I settled on is really simple. It steps down the left and right line segments for each row in the target image. At each step it forms a line segment between the left and right point. It walks along this segment, sampling points from the source image and assigning each onto the corresponding pixel in the target image. There isn't a 1:1 mapping between these images, so to preserve quality and ensure features aren't lost the four best matching pixels in the source contribute to a bilinear interpolation sampling function.
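The edge-walking transform and the bilinear sampling it relies on can be sketched like this in Python; the corner ordering and the linear stepping along the edges are assumptions about the original:

```python
def lerp(p, q, t):
    """Linearly interpolate between points p and q, t in [0, 1]."""
    return (p[0] + (q[0] - p[0]) * t, p[1] + (q[1] - p[1]) * t)

def bilinear(src, x, y):
    """Blend the four nearest source pixels, weighted by proximity."""
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(src[0]) - 1)
    y1 = min(y0 + 1, len(src) - 1)
    fx, fy = x - x0, y - y0
    top = src[y0][x0] * (1 - fx) + src[y0][x1] * fx
    bot = src[y1][x0] * (1 - fx) + src[y1][x1] * fx
    return top * (1 - fy) + bot * fy

def unwarp(src, corners, out_w, out_h):
    """Walk down the left and right edges of the quadrilateral; for each
    output row, walk across the segment between them, sampling the source.
    corners = (top_left, top_right, bottom_right, bottom_left) in (x, y)."""
    tl, tr, br, bl = corners
    out = [[0.0] * out_w for _ in range(out_h)]
    for row in range(out_h):
        v = row / (out_h - 1)
        left, right = lerp(tl, bl, v), lerp(tr, br, v)
        for col in range(out_w):
            u = col / (out_w - 1)
            x, y = lerp(left, right, u)
            out[row][col] = bilinear(src, x, y)
    return out
```

This is the inverse-mapping approach: iterate over destination pixels and pull from the source, so every output pixel is guaranteed a value.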

Features of the Grid

So now I have a card as if it had been scanned in flat, in an expected area, and with no rotation. Is the hard part over? No. We still need to identify the grid's rows and columns. And then within each cell we need to identify whether there is a punch.

Transformed, Thresholded

To do all of these it would be nice to classify pixels as black, white, or red. So that is exactly what we will do. Start by thresholding the transformed image. I chose to threshold again instead of transforming the already thresholded image because it gave me a higher quality result.

Red (Punched) Pixels

This clearly identifies the white pixels, but the black/red pixels remain to be identified. I was reasonably successful with a heuristic I invented that examined the HSB properties of each pixel to do this classification. But towards the end I grew impatient with its inaccuracy and changed to a neural network. The neural network sample posted by John Wakefield provided the training algorithm. The network takes in a 5x5 sample of RGB values to determine the red-ness of each pixel. The wide sampling lets it fill in the black-ish shadow of each punch circle very nicely and consistently which will help in the upcoming punch identification step of the algorithm.

Black Pixels

Finally, identifying the black portions is simple: exclude the red pixels from the thresholded pixels.

Identify The Grid's Rows and Columns

Histogram. Green indicates a selected row. Red indicates an excluded row because a nearby row had already been selected.

Identifying the grid's rows is done by examining how many black pixels each image row has. The image rows are sorted by that count and gridlines are selected. Each selected gridline excludes its neighboring image rows from future consideration. I experimented with more official techniques like actual line detection with a Hough transform but my implementations were buggy and/or I didn't understand the algorithm very well.
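The greedy row selection described above might look like this in Python; the exclusion distance is an assumed parameter, not the original tuning value:

```python
def select_gridlines(black, num_lines, exclusion=3):
    """Pick the `num_lines` image rows with the most black pixels,
    skipping any row within `exclusion` rows of an already-chosen one."""
    counts = [sum(row) for row in black]  # black pixels per image row
    order = sorted(range(len(counts)), key=lambda r: counts[r], reverse=True)
    chosen = []
    for r in order:
        if len(chosen) == num_lines:
            break
        if all(abs(r - c) > exclusion for c in chosen):
            chosen.append(r)
    return sorted(chosen)
```

The exclusion zone is what keeps a thick or slightly blurred gridline from being counted twice as two adjacent rows.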

Identifying the grid's columns is extremely simple: the image is divided into four equally sized columns. With only four columns there isn't much drift or distortion to worry about and so nothing fancy was necessary like it was for the rows.


Identify the Punches

Blob extraction of each cell. Black indicates no inked pixels at all, red indicates a punch, and white indicates no punch was found.

The final step is to process each cell individually. Within a cell, the blob extraction algorithm is run on the red pixels. The only significant expected blob in a cell is a punch, but there will also be some random garbage too. The sample image I've used here unfortunately has very little garbage.

A quick identify circle heuristic is fed into a neural network to make the final is-punch determination. The heuristic is a bit like a Hough transform: the center point of the blob is found and then each red pixel votes on the radius of the circle. A reasonable compromise is selected as the radius based on that voting. The neural network uses the circleís center distance from the cellís center, the radius of the circle, the percent of the blob filling the circle, and the percent of the circle filling the cell to choose a final result.
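The radius-voting part of the heuristic can be sketched as follows in Python. Using the blob's centroid as its center and a one-pixel vote granularity are both assumptions:

```python
import math
from collections import Counter

def vote_radius(blob):
    """Estimate a circle for a blob of (x, y) pixels: each pixel votes
    for its (rounded) distance from the blob's centroid, and the most
    popular distance wins."""
    cx = sum(x for x, _ in blob) / len(blob)
    cy = sum(y for _, y in blob) / len(blob)
    votes = Counter()
    for x, y in blob:
        votes[round(math.hypot(x - cx, y - cy))] += 1
    radius, _ = votes.most_common(1)[0]
    return (cx, cy), radius
```

A ring-shaped punch outline produces a sharp peak at one radius, while random garbage spreads its votes thinly, which is what makes the downstream is-punch features discriminative.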

Multiple Passes

To provide confirmation of successful recognition to the user the results can be overlaid onto the captured image.

Even with all of that complexity the algorithm can sometimes produce inaccurate output. Most often this will only happen with either low light or a low quality web cam (especially once the neural network was given control of the key steps above!). But even so the results of a series of image recognitions are combined together. There are some heuristic thresholds that must be met before the values are considered accurate: about two seconds worth of successful recognitions, only a handful of failed recognitions in that time, and a high agreement of each punch cell across all recent recognitions.
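A per-cell agreement check along these lines might look like the following Python sketch; the window size and agreement threshold are assumed values, not the original tuning:

```python
from collections import Counter

def consensus(recognitions, min_passes=20, min_agreement=0.9):
    """Each recognition is a tuple of booleans, one per punch cell,
    covering the recent window. A cell's value is accepted only if at
    least `min_agreement` of the passes agree on it; otherwise the
    whole result is rejected (returns None)."""
    if len(recognitions) < min_passes:
        return None  # not enough successful recognitions yet
    cells = []
    for column in zip(*recognitions):  # all votes for one cell
        value, count = Counter(column).most_common(1)[0]
        if count / len(column) < min_agreement:
            return None  # this cell flickered too much; keep waiting
        cells.append(value)
    return tuple(cells)
```

Requiring every cell to agree, rather than just whole-grid majority, means a single flickering punch holds back the result until the camera settles.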


Optimization

I could write a whole article on the optimizations done to make the code perform well. Everything I described about the algorithm could be written in any language or architecture. This section, however, will focus on exactly where I found myself. I implemented this entirely in completely portable C# with no unsafe code in the algorithm implementation. Why did I do that? I'm most familiar with C/C++ and C#, and I didn't want to spend a whole bunch of time dealing with the low-level image loading/saving or web cam capture that I would need with C/C++. So I chose C# and ran with it.

I initially started in WinForms using the frameworkís Bitmap class. That was slow. It took upwards of ten seconds to run the first step of the algorithm. And the profiler told me 99% of the time was spent inside the Bitmap class. So within all of the algorithm code I switched to a FastBitmap class of my own design and did a one time conversion to and from the slow WinForms Bitmap. (Bitmap can be fast if you always access it within unsafe code and access the raw data, but that is painful.)

Using my own bitmap class also gave me a secondary advantage: it was portable across all UI frameworks once I also implemented my own support classes (like Color and Point): WinForms, XNA on Windows Phone, Silverlight on the PC, and Silverlight on Windows Phone. And at one time or another I had the code running on each of them. Ultimately only the WinForms and Silverlight on PC ports were needed.

I made 35 separate checkins with various micro-optimizations (directed by the Visual Studio profiler!) and higher level algorithmic improvements. Given the time I spent optimizing the code to run on a general purpose CPU you will probably wonder why I didnít invest that time in porting the many operations that are obviously well suited for a massively parallel GPU into a shader computation. That is simple: The puzzle UI and the web cam capture were both powered by Silverlight which does not support custom shaders that run on the GPU.

Eleven clients stress testing the cloud by continuously submitting recognition requests.

So quite a lot of time was spent making the code as optimal as possible within that constraint, but ultimately I could have offloaded to the GPU because at the last minute The Cloud saved the day. The low-end netbooks available to run the actual puzzle event spent an unexpected and unexplainable amount of time on merely capturing frames from the web cam and left only about 30% of their very meager CPU for running image processing. On their own the netbooks could manage just under two frames per second. So instead of processing everything within Silverlight, with no GPU acceleration possible, for a few seconds after a punch card is detected the processing is offloaded over the network to a state-of-the-art PC (capable of about sixteen frames per second). This more than doubles the recognition rate even with network latency and multiple netbooks sharing the PC's capacity.

In hindsight reducing the number of required successful recognitions to a more manageable number and not performing any offload probably would have been simpler and just as effective.

Copyright 2011 Jon Caruana. Although I work at Microsoft, everything you read here is my personal opinion or creation. That's why it is on and not