
October 29, 2007

Bad application design => Bad availability (more Rockies ticket debacle)

One final quick note on the Rockies ticket sales debacle - following up on my previous posting[1] on the subject.  This note discusses how including the humans in your system design can improve both your perceived availability and your customer satisfaction while cutting your costs.

Paciolan did tweak their application to be a little more friendly to their infrastructure and perhaps fixed some networking issues, so that people did eventually get tickets to the Rockies/Red Sox games in Denver.  But many, many people were told they were going to get tickets, only to have the system, still slow and under heavy load, time out on many or even most of them before they could pay.  So even the second try was highly unsatisfactory, and it increased infrastructure costs.  Although you could argue that the system didn't crash, having it be so slow as to be unusable certainly creates the perception of unavailability in the minds of many users.

If you want to sell tickets to a hot event, there are many ways to do it.  For any system you come up with, you can expect the people who use it to try to game it somehow to increase their odds.

What are the goals of selling tickets to events?

  • sell all the tickets
  • fairness
    • discourage professional ticket scalpers
    • limit number of tickets sold to any one party
    • make "gaming" the system harder
  • minimize infrastructure costs
  • minimize customer frustration

The way Paciolan did it was to allow people to queue up on the web site and then sell the tickets first-come-first-served beginning at a certain time.  Although this might sound reasonable because it's modeled after how tickets are sold at box offices in the real physical world, it's a bad idea on the Internet.  It encourages exactly the behavior that took their systems down, and it maximized both customer frustration and infrastructure costs while still giving an edge to professional scalpers.  In their system, if you want tickets, you obviously open as many browser windows as you can and hit the system as hard as you can at exactly the same time to increase your chances of getting in line first.  This creates a situation similar to what operating system people sometimes call thundering herd[2] [3] behavior.  The difference here is that the thundering herd loop included people and browsers.  It makes kind of a comical sight if you think about it...

Another method which I seem to recall hearing that other vendors use is a lottery system.  Such a system might work like this (details may need tweaking):

  1. Announce a registration period of 8-18 hours where people can register for which games they want to buy tickets for, and which price ranges they're willing to buy.
  2. Anyone registering during this time can register once, and give an email address where they can be reached, and which can be accessed from this computer.  Warn people not to do this on shared computers.
  3. Each computer gets one cookie and one email address for registering.  If you try and re-register the same browser or use the email address multiple times, you're rejected.
  4. At the end of the registration period, winners are randomly selected, and notified by email including a random cookie which is correlated to the cookie given to their browser earlier.  (you have to have both the cookie and the email to purchase a ticket).
  5. They have a specified period of time to purchase the tickets, and any given credit card number can't be used more than once per game.  Any tickets not paid for by that time get sold to other applicants.

This means that there is no advantage to registering early or late.  As long as you register, you have the same chance as anyone else.  The peak load on such a lottery system is probably one or two orders of magnitude below that of the Paciolan thundering herd design.
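The registration and drawing steps above can be sketched roughly as follows.  This is only a toy model under stated assumptions: the class name TicketLottery and its methods register, draw and can_purchase are made up for illustration, the email delivery and the one-credit-card-per-game check from step 5 are left out, and a real system would need persistent storage and much better defenses against gaming.

```python
import random
import secrets


class TicketLottery:
    """Toy model of the register-then-draw scheme (all names are hypothetical)."""

    def __init__(self, tickets_available, rng=None):
        self.tickets_available = tickets_available
        self.rng = rng or random.Random()
        self.by_email = {}        # email -> browser cookie issued at registration
        self.cookies = set()      # browser cookies that have already registered
        self.purchase_codes = {}  # winner email -> (browser cookie, emailed code)

    def register(self, email, browser_cookie=None):
        """Allow one registration per email and per browser; return a cookie or None."""
        email = email.strip().lower()
        if email in self.by_email or browser_cookie in self.cookies:
            return None           # duplicate email or browser: rejected
        cookie = secrets.token_hex(16)
        self.by_email[email] = cookie
        self.cookies.add(cookie)
        return cookie

    def draw(self):
        """Pick winners uniformly at random; registration order plays no part."""
        entrants = sorted(self.by_email)
        count = min(self.tickets_available, len(entrants))
        winners = self.rng.sample(entrants, count)
        # Each winner gets an emailed code paired with the cookie issued earlier.
        self.purchase_codes = {
            email: (self.by_email[email], secrets.token_hex(16))
            for email in winners
        }
        return winners

    def can_purchase(self, email, browser_cookie, emailed_code):
        """Buying requires both the browser cookie and the emailed code."""
        expected = self.purchase_codes.get(email.strip().lower())
        return expected == (browser_cookie, emailed_code)
```

Because draw() samples from the full set of entrants, someone who registers in the last minute has exactly the same odds as someone who registers in the first, which is what removes the incentive to stampede the site at opening time.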

The result of this is:

  • System availability goes up.
  • Customer satisfaction goes up.
  • Infrastructure costs go down.
  • The possibility of gaming the system still exists, but is probably no worse than with the original Paciolan system.

All in all, a big win.  To be fair, I didn't think of this idea myself, but before the Rockies/Paciolan debacle, I'd never put much thought into ticket sales either.  Because I'm not an expert in this area, there are no doubt many improvements that an expert would make to my proposal to make it harder to game.  However, from an availability perspective, this application design is much more robust than the original Paciolan system because it takes into account the motivations of the humans that are part of the computer system.

References

[1] http://techthoughts.typepad.com/managing_computers/2007/10/the-cost-of-un-.html
[2] http://catb.org/jargon/html/T/thundering-herd-problem.html
[3] http://en.wikipedia.org/wiki/Thundering_herd

October 22, 2007

The cost of un-availability - and the value of a bad example

Today, the good people of Major League Baseball suffered what looked like a denial-of-service attack which kept them from selling tickets to (at least) the World Series games in Denver (at Coors Field).  This "attack" started at the same time as ticket sales began - 1000 MDT.

Amusingly enough, this apparent denial-of-service attack was probably caused by customers.  This year's World Series promises to be a good one, between the venerable Boston Red Sox[1] and the unbelievably hot and exceptional Colorado Rockies[2] (Go Rockies!), who are in the World Series for the first time ever, and who have played some absolutely amazing baseball in recent weeks.

As a result of this first-of-a-kind opportunity, many Coloradans stayed home from work, or took a "break" from work to order tickets all at once.  The server infrastructure couldn't stand up to the load, and no one got enough packets through to be able to order any tickets.

Since I was one of those trying to order tickets, and this was clearly a lack of availability in a critical time, I did a tiny bit of investigation.  It appears that they had about 15 servers in their mix.  The ticket sales are being managed by Paciolan[3].

In an event like this, it is vital that they have both a load balancing methodology and a load-shedding methodology.  It appears that they had both.  However, the symptoms suggest that their load-shedding methodology was insufficient for this incredible load.  Some of us Coloradans may be fickle Rockies fans, but we're sure loyal when they're hot - especially when our other teams aren't going anywhere!  Unfortunately, the Rockies were too hot and their fans too loyal for Paciolan's load shedding infrastructure.

Here's what I can see about their infrastructure from the outside:

  • They have external web servers which put you in a holding pattern, and try to get you through to the inner sanctum of web servers once a minute.
  • If you get in to the inner web servers, you can then order tickets, presumably without a heavy overload, since the inner infrastructure limits the number of simultaneous users.
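As a rough illustration of the two-tier arrangement just described, here is a minimal sketch of an admission gate that caps the number of simultaneous users on the inner servers.  It is my guess at the general shape, not Paciolan's actual implementation; the class AdmissionGate, its methods and the user names are all hypothetical.

```python
class AdmissionGate:
    """Outer tier: admit at most max_active users to the inner servers."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.active = set()   # users currently inside the inner tier

    def try_enter(self, user_id):
        """Called on each retry (once a minute in the design above)."""
        if user_id in self.active:
            return True       # already admitted
        if len(self.active) >= self.max_active:
            return False      # shed load: user stays in the holding pattern
        self.active.add(user_id)
        return True

    def leave(self, user_id):
        """Free a slot when a user finishes ordering or times out."""
        self.active.discard(user_id)


gate = AdmissionGate(max_active=2)
results = [gate.try_enter(u) for u in ("ann", "bob", "cat")]
print(results)                # [True, True, False]
gate.leave("ann")
retry = gate.try_enter("cat")
print(retry)                  # True
```

A gate like this only protects the inner tier.  Every rejected retry is still a request that has to reach the outer servers.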

However, there is an Achilles' heel here - which both the loyal and the fickle Rockies' fans ran immediately into.  You have to have enough network bandwidth to allow people access to the outer infrastructure.  If you don't, various bad things can happen - your load balancers can crash, your routers can crash.  I'm not a networking expert, but if the offered load is an order of magnitude or two higher than the incoming infrastructure can support, most packets won't get through.  If most don't get through, then it doesn't matter how good the load shedding methods are, or how robust your servers are.  Customers can't buy your product.  OOPS!

Paciolan claimed that they were the victims of a real DDoS attack, and that they measured 8.5 million hits in an hour.  Quite honestly, that doesn't sound that high to me.  Personally, I had 4 browsers going at once.  8.5 million hits an hour is only 2361 hits/second, which is less than 142K hits/minute.  In my judgment, between people like me, school kids, scalpers, etc., 142K people isn't very many.  If they were all running 3 or 4 browsers, it would only take about 35K people - which is basically nothing.  Also, when you look at network bandwidth, ethernet links, no matter how fast they are, can only support a relatively small number of packets/second maximum because of minimum times required between packets.  IIRC, that number is something like 1000 packets/sec.  If they only had a single gigabit link to their infrastructure, 142K people would be 2 orders of magnitude larger than that - which would put us in the right ball park (pun intended) for the behavior that was observed.
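For anyone who wants to check the arithmetic on the hit counts, here it is spelled out.  The only inputs are the 8.5 million hits/hour figure and my own 4 browsers; the assumption (mine) is that each browser in the holding pattern accounts for roughly one hit per minute.

```python
hits_per_hour = 8_500_000        # figure reported for the first hour
browsers_per_person = 4          # what I was running myself

hits_per_second = hits_per_hour / 3600
hits_per_minute = hits_per_hour / 60
# Assumption: one hit per browser per minute (the holding-pattern retry rate).
people = hits_per_minute / browsers_per_person

print(f"{hits_per_second:.0f} hits/second")   # 2361 hits/second
print(f"{hits_per_minute:.0f} hits/minute")   # 141667 hits/minute
print(f"{people:.0f} people")                 # 35417 people
```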

One can look at the fact that they almost certainly had only a single site to take this load as a single point of failure.  The networking infrastructure to that site failed.  Exactly how it failed, I can't say.  The fact that it failed is indisputable.

The good news for the Rockies is that because it's the World Series, eventually, somehow, those tickets will get sold.  But in the meantime, they appear to have dumped Paciolan for this event[4].  This is unfortunate from my perspective, since I'm in the UK at the moment, and can't exactly run down to Coors Field to wait in line to buy tickets.  Oh well.  I guess it's not really all about me, eh?  :-D

But this will take longer, and has already aggravated customers a great deal.  The Rockies will probably survive this error - after all, that's not their specialty - they're baseball players.  But it will take Paciolan a while to live this down.  Maybe they underestimated the load, maybe they bid too low for selling the tickets, maybe it was done with too little lead time, whatever.  In the end, this is their failure.  This will have cost them a good bit of reputation.  If they had succeeded, they would probably have the World Series business and maybe other sports for several years to come.  Under the circumstances, this will no doubt be a tremendous opportunity cost - the cost of lost opportunities.

Although I can't quantify the size of their opportunity loss, it is a clear illustration of how lack of availability translates directly into the bottom line of companies - even if it's future revenues.  To be fair to Paciolan, I believe that this is the first time anyone has attempted to sell World Series tickets only by the web.  Hopefully, it won't be the last.

IBM has done some very high-profile sports web sites in the past, and I can tell you from the people I've worked with who worked on those that they required lots of money, incredible planning, and always at least three geographic sites with separate networking infrastructure.  Fortunately, so far, IBM has not suffered any embarrassing failures of this magnitude.

Maybe you're saying "I don't sell World Series tickets, so this doesn't apply to me".  Probably you don't sell World Series tickets.  But you probably do something vital for your company's future health.  If you take this catastrophe to heart, maybe you can avoid the same embarrassing and expensive fate as Paciolan.

If anyone from Paciolan reads this article, I'm sure our readership would love to hear what actually went wrong, and how, in your opinion, it might have been avoided.  Of course, feel free to correct the things I guessed at as well!

Even more than usual, I commend my disclaimer page to you.  I don't speak for anyone but myself.  Not my employer, the Boston Red Sox, the outstanding Colorado Rockies, nor the unlucky/unfortunate Paciolan.

See also my follow-up posting [5] on this subject.

References

[1] http://boston.redsox.mlb.com/index.jsp?c_id=bos
[2] http://colorado.rockies.mlb.com/index.jsp?c_id=col
[3] http://www.paciolan.com/
[4] http://colorado.rockies.mlb.com/news/press_releases/press_release.jsp?ymd=20071022&content_id=2276226&vkey=pr_col&fext=.jsp&c_id=col
[5] http://techthoughts.typepad.com/managing_computers/2007/10/bad-application.html