When you’re trying to automate the process of sharing resources, every little detail is important.
Flexible work hours require flexible programs and procedures. There was a research project at Microsoft that required a long and complicated calculation. The research team figured they could efficiently process the calculation by harnessing the computing power of the product team’s CPUs when the machines were left unattended for the evening.
They developed a program to install on the computers they wished to use. When it kicked in, it asked the server for some work. It cranked away at the number-crunching and then uploaded the results. Basically, that research team had invented something similar to Folding@home or SETI@home years before those projects existed.
You had to configure the resource-sharing program by specifying the time of day you wanted it to become available for calculations and the time of day you wanted it to stop. That way, it wouldn’t interrupt you while you were still at work reviewing a design document. Nor would it start crunching its numbers and using precious memory and CPU cycles while you were anxiously waiting for your build to complete.
The people on the product team dutifully agreed to make their machines available. However, they found the “specify the period of day during which you want the program to be available for computation” feature wasn’t working. The program was constantly waking up and interrupting them. Worse yet, it was making their systems run really slowly right in the middle of the work day.
Eventually, the research team found the source of the problem. The code that determined when to run the calculations went like this:
time = GetTimeOfDay();
if (time < StopTime || time >= StartTime)
In other words, it ran the calculations if the current time was before the stop time or at or after the start time.
The members of the research team set their stop time to 09:00, or 9 a.m. That’s when they all typically arrived for work every morning. They also set their start time to 17:00, 5 p.m. That’s when they left work to go home.
The people on the product team also set their stop time to 09:00, or 9 a.m. However, they set their start time to 01:00, or 1 a.m., because they sometimes worked late into the night and didn’t want the number-crunching routine to interrupt them.
As you can see from the code, if you set your StartTime to midnight or later, but still before your StopTime, every time of day passes at least one of the two tests, and the number-cruncher ends up running all day. Apparently, the people on the research team never work late.
Ironically, it took the research team a few iterations before they found the correct algorithm to determine when to run the number-cruncher. Sometimes even geniuses have trouble balancing their checkbooks.
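For what it’s worth, here is a minimal sketch in C of one check that copes with both kinds of schedule. It is not the research team’s eventual fix, which the story doesn’t record. The function names and the minutes-since-midnight representation are assumptions made purely for illustration. The idea is to ask first whether the window crosses midnight, and only then pick the comparison.

```c
#include <stdbool.h>

/* All times are minutes since midnight (0..1439). This representation and
   both function names are hypothetical, chosen only for this sketch. */

/* The check from the story: correct only when the window crosses midnight. */
static bool original_check(int now, int start, int stop)
{
    return now < stop || now >= start;
}

/* One possible repair: decide whether the window crosses midnight first. */
static bool in_run_window(int now, int start, int stop)
{
    if (start <= stop) {
        /* Same-day window, e.g. 01:00 to 09:00. If start == stop the
           window is treated as empty and this never returns true. */
        return now >= start && now < stop;
    }
    /* Window crosses midnight, e.g. 17:00 to 09:00. */
    return now >= start || now < stop;
}
```

With a start of 01:00 and a stop of 09:00, original_check(12 * 60, 1 * 60, 9 * 60) says yes at noon, while in_run_window says no. With a start of 17:00 and a stop of 09:00, the two functions agree at every time of day.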
I was reminded of this story when I ran into a similar situation not too long ago. There was a batch file that kicked off some tools to analyze data. There were sporadic reports that the batch file would stop working and spit out the message, “Internal error, please contact the XYZ support team.”
If you sent an e-mail to the XYZ support team to report the error, they would write back, “We can’t reproduce the error. We’ve turned on diagnostics on the server. Please try again, and we’ll study the log files.” The second time you ran the batch file, it always succeeded.
They ultimately identified the reason for the problem: The batch file would try to build a log file name from the current date and time. It did so by extracting substrings from the %TIME% and %DATE% environment variables. If you ran the program before 10 a.m., the extracted time had a leading space, and that stray space threw off the file name.
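To make the failure concrete, here is a small sketch in C of the kind of normalization that would have sidestepped it. It assumes a time string laid out like " 9:05:03.12", with the hour padded to two characters by a space, which is what the story describes. The real format of %TIME% depends on regional settings, and the function name and the "HHMM" output are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Copies the hour and minute digits out of a time string such as
   " 9:05:03.12" or "10:30:00.00" and writes them to out as "HHMM",
   turning a leading space into '0'. Returns 0 on success, or -1 if
   the input is too short or the buffer is too small.
   The input layout is an assumption, not a documented guarantee. */
static int hhmm_for_filename(const char *time_str, char *out, size_t out_size)
{
    if (time_str == NULL || out == NULL || out_size < 5 ||
        strlen(time_str) < 5) {
        return -1;
    }
    out[0] = (time_str[0] == ' ') ? '0' : time_str[0]; /* pad the hour */
    out[1] = time_str[1];
    out[2] = time_str[3]; /* skip the ':' at index 2 */
    out[3] = time_str[4];
    out[4] = '\0';
    return 0;
}
```

Given " 9:05:03.12" this writes "0905", and given "10:30:00.00" it writes "1030", so the piece of the file name is the same width whether or not you’ve had your coffee yet.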
Because the people who maintained the batch file didn’t settle into work before 10 a.m. or so, by the time they wrote back to say, “Try it again now,” it was already 10:30 a.m. The bug no longer occurred. Just by taking a little longer than usual to finish their morning cup of coffee, the members of that support team managed to avoid a bug hiding in their own tool.