Stunning Quality
Back in the 90s I was involved with about 100 other people in a project to develop a new voice mail system – software, hardware and firmware. The hardware was 100% new design, and the software was about 70% new. Along the way we stumbled into something that improved our end quality in a way that can reasonably be described as stunning.
A few months away from our first controlled introduction (CI), or beta site, things didn't really feel right. Much of the system was best described as squishy – that is, not really solid. Components passed unit tests, and the system tests would run continually for days or weeks without failures, but then something would fail. These squishy components included key subsystems like voice record/play – vital to the system. For a voice mail system, that is the equivalent of a database that only usually stores its critical data. Not good enough.
About this time, our second-level manager (2LM) declared that everyone would work 60 hours a week on site until our CI. As it turns out, my job had been to lead the device driver team (we had about 18 new UNIX device drivers – and I was the only experienced UNIX device driver writer) – and all of these were quite solid, for reasons worthy of another article. But returning to this article... As was my personal style, I had been winding down and celebrating no longer being able to break anything no matter how hard I tried by spending time in what is sometimes called Beneficial Scholastic (BS) discussions – avoiding anything that looked like work for a few days. So, when our 2LM declared that we were all to work 60 hours a week, I wondered what I could possibly do with 60 hours a week, and I asked her what we were supposed to do with them. Her reply? Do the same as you are doing – only more so. Some of you may know me, and know I have a certain fondness for BS discussions – but I was certain that not even I could spend 60 hours a week in BS – not to mention that it would be a bit on the unproductive side.
Squishy Code
I knew there was a good bit of squishy code out there – you could feel it in the air – in the hallway and water-cooler talk. If I was going to spend 60 hours a week on-site I wanted it to be for something that mattered. I wanted to get the voice play/record code fixed – now that would matter. But it wasn't my code – and deciding to start fixing or writing tests for someone else's code is definitely not the kind of thing to endear you to others. I also knew that this wasn't the only squishy code out there. So I decided to try another approach.
In parts of this company (like many) you can occasionally run into a shoot-the-messenger mentality hiding not very far down. I had an idea on how to redirect our effort in a much more profitable way – but I didn't want to get shot as the bearer of bad news. So I asked our 2LM - “If I knew how we should be spending our time most effectively, would you want to know?”. To her credit, she replied “Of course!”.
What Needs Doing?
So I went off to carry out my idea. My idea was simple – I interviewed most of the people in the organization and set the emotional stage before asking them a few questions, wrote down the answers and put them into a short (5-page) report.
For the purposes of this posting, imagine March 23 was the cut date for our first customer. My interviews with the project engineers went something like this:
Imagine it's the morning of March 23. You come in to work. Most of the offices are dark. It's very quiet. Those people who are here, just came to print copies of their resumes. We failed. Why did we fail? Then I asked – What should we do to prevent this? Followed by the important question – Can I use your name?
Interestingly enough every single person said yes to the last question. By far the most common answers to the first question were of the form “My software caused us to fail”. The most common answers to the second question were of the form “We need to test it unreasonably – beat the excrement out of it”.
This was an interesting set of results – indicative, in my opinion, of a great organization. People understood the gravity of what they were trying to do, took responsibility for their own bugs, knew what to do to fix them, and very much wanted to.
I wrote up the memo and added a graph to it – since my management (at all levels) loved graphs. Now, it isn't a graph that any mathematician would love – it was more of an illustration – but that's OK: I was showing it to management, not to mathematicians. The original graph is lost to antiquity – but I've drawn an approximation of it below.
This graph represents the universal truth that “some things are better than other things” - and also the common truth that some things are good enough, but some things are not. There are basically two lines on the graph separated by a small margin. This would be the result of applying uniform effort across the project. Note that few things rise up to cross the “good enough” line. My proposal then, was to apply our efforts only in the areas that were below the “good enough” line, and fill in only the areas below the line – indicated by light blue in the illustration. Seems perfectly obvious. In fact, it is perfectly obvious. What wasn't obvious was which things weren't good enough.
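The "fill in below the line" idea can be sketched in a few lines of code. The subsystem names, scores, and threshold below are invented for illustration – on the real project the judgments came from the interviews, not from any numeric metric:

```python
# A minimal sketch of effort allocation against a "good enough" line.
# All names and numbers here are hypothetical.

GOOD_ENOUGH = 7  # the imaginary "good enough" line

quality = {
    "voice_play_record": 4,
    "device_drivers": 9,
    "database": 5,
    "admin_ui": 8,
    "telephony_io": 6,
}

# Redirect effort only to subsystems below the line, proportional to the gap.
needs = {name: GOOD_ENOUGH - score
         for name, score in quality.items()
         if score < GOOD_ENOUGH}
total_gap = sum(needs.values())

for name, gap in sorted(needs.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gap}/{total_gap} of the redirected effort")
```

Subsystems already above the line (like the device drivers were) get no additional effort at all – their staff is freed up to fill in the gaps elsewhere.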
Since I was still wary of becoming a victim of "kill the messenger", I went back to my 2LM and asked her if she still wanted to know what we should be doing – she said "Of course". I handed her my memo, and went home to hide where I would be hard to find. By the next day, I was dying of curiosity and went to see how my memo had been taken. I looked and looked but could find no managers at all. I finally got up the nerve to ask my 2LM's assistant where everyone was – she said they were all off-site discussing the "Robertson memo". I was completely floored. To my shock, and the everlasting credit of my 2LM, she reorganized the entire project around the recommendations I had collected in my memo.
If you take staff off the good-enough things and put them on the things that aren't good enough, then lots of the normal rules on who does what break down. People are, by definition, going to be working on areas that aren't "theirs". And because it's being done over the whole project, it's by definition not a condemnation of anyone in particular, and is thereby socially acceptable.
More Testing – The Right Kind in the Right Places
Not all the suggested things were testing, but most were. The project had good unit tests. It had good system tests. Many of the recommendations for new tests turned out to be automated and merciless subsystem tests – which were subsequently nicknamed Bamm-Bamm tests. Since I had done the interviews and written the memo, I claimed first choice of which area to work in. I chose to work on the voice play/record testing. I wrote a set of unreasonable, merciless automated tests, which set up the hardware in cross-connected mode to allow for monitoring, grabbed the voice APIs at a low level, and exercised them randomly. Play, skip ahead, skip back, speed up, slow down, record, play a touch tone, etc. – all randomly, and without any kind of rhyme, rhythm, or restraint. This setup could also use more voice channels than were visible on the real system – and it didn't have to be connected to a switch – meaning it could run without tying up so much expensive test hardware, and it could produce more load than could ever possibly exist in the real system. The first bug it found reproduced reliably in 5 minutes – a bug that had taken at least a week to reproduce in the system test environment.
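A merciless random-exercise test of this kind might look something like the sketch below. The `VoiceChannel` class and its operations are invented stand-ins – the real tests drove the actual low-level voice APIs on cross-connected hardware:

```python
import random

# Hypothetical sketch of a "Bamm-Bamm" style randomized subsystem test.
# VoiceChannel and its operations are invented for illustration only.

class VoiceChannel:
    """Stand-in for one low-level voice channel."""
    def __init__(self, channel_id):
        self.channel_id = channel_id
        self.state = "idle"

    def play(self):        self.state = "playing"
    def record(self):      self.state = "recording"
    def skip_ahead(self):  pass
    def skip_back(self):   pass
    def speed_up(self):    pass
    def slow_down(self):   pass
    def touch_tone(self):  pass
    def stop(self):        self.state = "idle"

def hammer(channels, operations, seed, steps=10_000):
    """Apply random operations to random channels – no rhyme, rhythm,
    or restraint.  A fixed seed makes any failure replayable."""
    rng = random.Random(seed)
    for step in range(steps):
        chan = rng.choice(channels)
        op = rng.choice(operations)
        try:
            op(chan)
        except Exception as exc:
            # Report the seed and step so the failure reproduces in minutes.
            return (seed, step, chan.channel_id, exc)
    return None

# More channels than the real system exposes, to exceed any real-world load.
channels = [VoiceChannel(i) for i in range(64)]
ops = [VoiceChannel.play, VoiceChannel.record, VoiceChannel.skip_ahead,
       VoiceChannel.skip_back, VoiceChannel.speed_up, VoiceChannel.slow_down,
       VoiceChannel.touch_tone, VoiceChannel.stop]
result = hammer(channels, ops, seed=42)
```

The key design point is the seeded random sequence: when a crash does occur, rerunning with the same seed replays the exact operation history, which is what turns a once-a-week failure into a five-minute reproduction.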
Once the voice subsystem could run these tests for an hour without a failure, no one could ever find any more bugs in it. This from a subsystem where previously it had taken weeks to reproduce a problem even once. Like the first bug, the subsequent bugs it found reproduced in a few minutes. This is a huge difference. You could reproduce a problem a few times, create a fix, and try it out, all in a morning – instead of a few months.
This kind of result was certainly dramatic, but there were many others performing similar work on other weak subsystems. This all sounds wonderful, but what was the actual result? Our 2LM decided to delay our CI by a week, to let us finish our testing and fixing. Our CI site was already a heavy user of our voice mail systems – and we were going to replace their current system with the new one – which would not look any different to them. So, they already had a culture of significant use of voice mail, and they would hit the system hard on the first Monday after the cutover.
And Then A Miracle Occurs...
We installed the system and migrated their data over to it, and then everyone sat around and watched – waiting to respond to that first crash – which we assumed would come by 10:30 their local time. But that crash didn't come – and it didn't come – and it still didn't come. We were all pretty shocked. We assumed that they must have had a company holiday – so we pulled the traffic logs and compared them to their previous weeks of traffic – they were doing what they had always done – and it was just working. It was fully 6 weeks later before this heavy user of voice mail had their first complaint of any kind. And it was 6 months before our first crash in the field. Given how much new hardware and software there was, this was more than a little surprising. This reliability continued on and certainly appeared to be better than any product put out by this company (one known for reliable products), but more interesting confirmation of this came years later.
For a variety of internal political reasons, there were really only two major feature releases of this product – and after that it went mostly into maintenance mode – except for recording new languages and getting certified in new countries. However, a number of years later, some of the chips on the board weren't going to be available any more. Given the political issues surrounding the project, I assumed that would be the end of it – after all, the corporate hierarchy had tried to kill it numerous times, and it was a niche product in a small part of our portfolio. As it turns out, the product was incredibly profitable – and in spite of the efforts to minimize it and kill it, had become responsible for a very significant portion of the profit of our division – and politics or no, somehow no one was willing to leave all that money on the table. Of course, because of the political issues, all the development staff had fled to places that weren't going to kill their careers.
The Results Over Time
To keep this revenue stream going, the company had to gather up some of the original staff to create an updated version of the product. We had a new VP who didn't know about all the political issues of the past, and he got the new team together in a meeting to learn about the project. He asked "Who all is working on the product now?" One person raised his hand, then someone said, "Yeah, but he's so good, he only works on it half time". Of course, everyone laughed. But it was the truth – this 100-person software/hardware/firmware project, out there in the field making tons of money, had been maintained entirely by one person, half-time, for the last 5 years. It simply worked – all the time, almost without fail. The only known bug was due to a hardware design issue relating to where some signal paths ran. About once a year one particular DSP would crash because of crosstalk in this set of signal paths. No one would let him redesign the board to fix it – so he put in a simple workaround to catch the crash quickly and restart the DSP. This was really the only outstanding issue of any substance in the product. The updated design also incorporated further cost reductions made possible by 10 years of hardware evolution, so the product became even more profitable than it had been before.
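That workaround – catch the crashed DSP quickly and restart it – is essentially a watchdog. A minimal sketch, with `dsp_is_healthy` and `restart_dsp` as invented placeholders for whatever the real firmware hooks were:

```python
import time

# Hypothetical watchdog sketch.  The two functions below are stand-ins;
# a real implementation would poll a heartbeat register and reload the
# DSP firmware.

def dsp_is_healthy(dsp_id):
    return True  # placeholder: would poll the DSP's heartbeat

def restart_dsp(dsp_id):
    print(f"restarting DSP {dsp_id}")  # placeholder: would reload firmware

def watchdog(dsp_ids, poll_seconds=1.0, max_polls=None):
    """Poll each DSP; restart any that stop responding."""
    polls = 0
    while max_polls is None or polls < max_polls:
        for dsp_id in dsp_ids:
            if not dsp_is_healthy(dsp_id):
                restart_dsp(dsp_id)
        polls += 1
        time.sleep(poll_seconds)
```

A once-a-year crash detected and recovered in one poll interval is, for practical purposes, invisible to users – which is why this was an acceptable substitute for the board redesign no one would approve.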
When I saw the stunned look on that VP's face as he realized that, despite the laughter, this wasn't an elaborate in-joke – I really grasped the magnitude of what we had accomplished. It was written all over his face that he didn't believe it was possible for a popular and profitable product of this size to work so well that it almost never needed fixing – for more than five years. It isn't exactly a common occurrence – "unheard of" probably isn't too strong a term.
Questions and Reflections
A series of interesting questions comes out of this – Which things that we did made the difference? Why did they make such a difference? Why were they necessary? If this was a great organization, why didn't everyone already know all these things and act accordingly? How much of the "what to do about it" answers were influenced by my personal assumptions?
I don't have definitive answers to these questions – but I have thought about them a lot since this happened – since I was as stunned by the outcome as anyone. Although we all recognized how good the quality seemed to be at this first release, the political implications of it all for our own personal careers took first priority – and most of us forgot about it and got caught up in surviving and starting on our next things.
There are several things you can observe about this which seem interesting – first of all, this process helped identify what really needed doing – and reminded people of what they mostly already knew – which things worked, and which ones didn't.
Secondly, because the information came from the staff, it has inherent credibility – when someone says their own software isn't too good, it has more credibility than if someone is complaining about someone else's software. These complaints came from expert sources.
Thirdly, each of the areas has information from experts on what they thought would most likely result in success.
Fourthly, taking all these recommendations seriously resulted in a radical reallocation of staff – breaking down normal social barriers like "this is my code, stay out", and replacing them with more of a community approach, where the areas with troubles got more help. This last point is worth elaborating. A normal project staffing profile looks a bit like a normal distribution – a small amount of staff at the beginning, a large amount in the middle, with staff tapering off at the end. The truth is, though, that at the beginning and end of that curve there is a good bit of underutilized staff – that is, people who don't really have enough to do. I was personally a case in point – I had finished my code, but I was still on the project – part of the staffing curve, but not being utilized. Naturally, I wasn't about to tell anyone – I had put in incredible effort and hours over a period of years to get to this point. One of the things this technique does is locate that underutilized staff and put them to work where their help is most needed – in a way that causes minimal turf issues.
To be honest, I won't say there were no turf issues – but they were minimal and quickly overcome because the entire management structure supported these changes. The person I gave my testing tool (and bug reports) to wasn't completely happy that I'd found these issues – but he understood the benefits and, after a bit, was OK with it all – especially since in the process I'd given him a tool to reproduce and fix problems very quickly, in the end making his life much easier.
Where Did The Staff (Money) Come From?
One of the more interesting realizations about this technique is that it cost very little additional money over what was already being spent for the project. How can that be, when clearly a lot more testing was getting done?
The answer is pretty simple: we redirected staff from those parts of the project that were above the line to places that were below it. Note that these people weren't writing code in those other areas; they were writing test code, which requires a good bit less knowledge of the particular subsystem under test. The staff already knew the project, how it worked, and had a basic understanding of most or all of the subsystems – so they became productive quickly.
The fact that there was unused staff was an interesting thing – and I suspect is reasonably common towards the end of a project. Most of the people on my subteam were in a similar situation to mine – they had worked hard and gotten their code done right, and were now waiting for someone to bring them more bugs. I'm sure we weren't alone – people whose subsystems were above the imaginary “good enough” line were often in this situation. Below is a graph which I believe illustrates this situation reasonably well.
This represents the basic idea that staff needs rarely match staff availability. This particular graph represents the happy situation that eventually peak staff needs are met. The key realization is that in the latter phases of the project, there comes a situation where staff available to the project actually exceeds current needs. When individuals find themselves in this situation, they rarely call attention to it, but just try and keep busy and hope no one notices. In the case of our project, we chose to redeploy this staff to create new test mechanisms and solve other problems as identified by the informal survey I took.
Trying This Out For Yourself
If one were to try and reproduce this as an experiment, here are the things which seem to me to be most relevant to such an attempt:
- Wait until the project is near enough to completion that there is a possibility of having some underutilized staff who can be redirected, and a probability that the development staff has a good idea where the important problem areas are.
- Perform interviews asking people what they think needs to happen. The person doing these interviews needs to understand development and be seen as an honest broker – either an outside development consultant or a respected developer with a reputation for honesty and the respect of their peers. The emotional content of the method I used may have helped break people out of normal ways of thinking, but the exact formulation is probably not that important. Having the interviewer offer guidance toward solutions is not ideal interview technique, but it may be desirable in terms of getting good results.
- Follow the recommendations that are gathered, taking them seriously. In my experience, reorganizing the project to move staff around will likely be necessary.
Needless to say, if you decide to repeat this experiment, I'd be interested in hearing about it.