Ben Scheirman


My Favorite Production Software Bug

When I first graduated from college I worked for a small company doing custom development work in .NET 1.1.

Our largest client (coincidentally where our offices were) had a print shop and a web site where financial agents could set up and send mailings. The mailings invited people to a dinner and told them about the latest & greatest annuities that they should invest all their money in.

The system was pretty interesting.  With a batch job they’d print out letters, a bio card that showed the agent’s photo on it, and other inserts, such as tickets to the dinner.  These would be collated, folded, and stuffed into an envelope that would be licked, sealed, and affixed with a real stamp. (People are 20 times more likely to open a letter if it has a real stamp – and yes I just made that number up).  It was very impressive to watch it all work.

The website we built let the agents place these orders online (with optional inserts) and mail them to a set of folks matching a given demographic.

Oftentimes the agents would purchase an upgrade to have a reminder card sent to each person a week before the event.  These cards were special, and even though we had a room full of expensive printers, we didn’t have the ability to print them.  So we’d have to outsource the job to another print shop across town.

The process went something like this:

  • We’d compile all the info, along with a TIF of the agent’s photo and FTP it over to the other company
  • They’d print them all and drive them to the post office for mailing
  • They would charge us money

All of this just worked, and I never had to see the internals of this system.  That is, until my boss went on vacation to Mexico (at the time it was just me and him).

You see, an agent had sent a card to himself and a couple of his friends.  He never received them.  Since he had paid for the upgrade, he was understandably upset.  They asked me to look into it.

I was slightly familiar with the tables, and so I went looking.  There was a table along the lines of ResponseCardQueue.  It contained columns such as agent_id, recipient, address, city, state, zip, and date_sent.

There were tens of thousands of these records.  I issued this query:

SELECT * FROM ResponseCardQueue WHERE date_sent IS NULL

It returned about 2100 records.  For some reason these weren’t being processed.

I finally found the code that read this table, and it looked something like this:

public void ProcessCards(Card[] cards)
{
  try
  {
    foreach (Card c in cards)
    {
      string tifFilename = @"\\SOME\NETWORK\PATH\" + c.AgentId + ".TIF";
      //copy details + tif image to some folder
    }
    //zip up folder

    //FTP the file to the other print shop
    //mark date_sent to
  }
  catch
  {
  }
}

There are two things to notice.  The first is that the filename is built from a column in the database.  The second is the empty catch block, which lets errors go completely unnoticed.
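One way to keep a single bad record from silently blocking a whole batch is to handle failures per card and report them. The sketch below is not the original system's code: the `Card` class, the `photoFolder` parameter, and the returned list of failed cards are all assumptions made for illustration, and it uses newer C# features than .NET 1.1 offered.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for the Card type in the post.
public class Card
{
    public string AgentId;
}

public static class CardProcessor
{
    // Processes each card on its own and returns the cards that failed,
    // so the caller can report them instead of never finding out.
    public static List<Card> ProcessCards(Card[] cards, string photoFolder)
    {
        List<Card> failed = new List<Card>();
        foreach (Card c in cards)
        {
            try
            {
                string tifFilename = Path.Combine(photoFolder, c.AgentId + ".TIF");
                if (!File.Exists(tifFilename))
                {
                    throw new FileNotFoundException("Agent photo not found", tifFilename);
                }
                // copy details + TIF image to the outgoing folder (omitted here)
            }
            catch (IOException ex)
            {
                // Log the failure and keep going with the remaining cards.
                Console.Error.WriteLine("Card for agent " + c.AgentId + " failed: " + ex.Message);
                failed.Add(c);
            }
        }
        return failed;
    }

    public static void Main()
    {
        // Assumes no file with this made-up name exists in the temp folder.
        Card[] cards = { new Card { AgentId = "no-such-agent-photo" } };
        List<Card> failed = ProcessCards(cards, Path.GetTempPath());
        Console.WriteLine(failed.Count + " card(s) failed");
    }
}
```

With something like this, the first missing photo would have shown up in a log on day one instead of in a customer complaint months later.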

In this system an agent id was an identity column in another table, so the numbers were incrementing by 1 with each new account.  After much searching, I realized that the column type for the agent id in this table was defined as a char(4).  So as soon as we had our 10000th record in the system, it started looking for filenames that didn’t exist on the network share.

It would be something like this:

Agent id 10200 would get truncated to 1020, which didn’t exist in our system (most of the numbers started in the 4000s).  So the file didn’t exist (and it’s probably better that it errored out here rather than choosing the wrong picture for the card!).  The code threw an exception and stopped processing future records.
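To make the truncation concrete, here is a minimal C# sketch. `StoreAsChar4` is a hypothetical helper, not part of the original system; it simply assumes the value was cut down to its first four characters, as described above. How the database actually performed the truncation isn't something I can reconstruct here.

```csharp
using System;

public static class AgentIdTruncation
{
    // Hypothetical helper: mimics what ends up in a char(4) column
    // if an integer id is stored as text and cut to four characters.
    public static string StoreAsChar4(int agentId)
    {
        string text = agentId.ToString();
        return text.Length > 4 ? text.Substring(0, 4) : text;
    }

    public static void Main()
    {
        foreach (int id in new int[] { 4321, 9999, 10000, 10200 })
        {
            Console.WriteLine(id + " -> " + StoreAsChar4(id) + ".TIF");
        }
        // Prints:
        // 4321 -> 4321.TIF
        // 9999 -> 9999.TIF
        // 10000 -> 1000.TIF
        // 10200 -> 1020.TIF
    }
}
```

Every id below 10000 round-trips fine, which is why nothing looked wrong until the 10000th account came along.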

And so the unsent records piled up.  For 4 months.

So I diligently made the column type int and updated the records that were below that threshold to correct their agent id numbers.  So guess what happened?  I fixed the clog and with one big TWOOOSH all of the records were processed.

I felt mighty proud.




A few hours later I realized that the cards would actually now be mailed!  How embarrassing would it be to remind someone of an event that took place 3 months ago?

I explained all of this as fast as I could, and someone jumped in their car and got to the post office just in time to grab the entire batch before it was mailed.

We were still charged for the printing & postage of those cards, but we saved ourselves the embarrassment of explaining to all of our customers that we screwed up big time.

I learned a valuable lesson: a simple oversight can cost a company a ton of money (and in this case… reputation).

So what’s your favorite production software bug?