Java News from Friday, July 27, 2007

Friday kicks off with an information-packed keynote from two of eBay's architects, Randy Shoup and Dan Pritchett. I can't type fast enough to keep up with them. I cheated and talked to Randy earlier in the Speaker's lounge. They have a few petabytes of data going back to the site's founding, split across a couple of dozen Oracle databases. They use a custom object-relational mapping layer to manage queries across independent databases. (Please open source that. I need it.)

Dan starts with the database side:

They're on version 3 of the architecture. eBay started with commodity Fry's hardware, Perl, and open source software, with every item stored in a separate file, which limited them to 50,000 files. Now they use 4-way Solaris 2900 boxes, Java, and Oracle. Power and cooling are the biggest current issues.

Version 2 was C++ in IIS. The database was moved to Oracle, and Microsoft Index Server was introduced for search. They used Resonate for load balancing and spread horizontally. Still a single database, though; a failover database was introduced here. In November 1999 they were down for a day and on CNN. This motivated splitting the databases.

In December 2002 the current DB architecture was introduced, with many different databases. The application is out of control though. It's a two-tiered architecture: 2 million+ lines of code in a single ISAPI DLL, with hundreds of developers working on it. Integration nightmare. This motivates a move to J2EE and Java. Presentation logic was moved out of C++ into XSL! They also introduced a developer API.

Version 3 is Java.

The horizontal splitting of the databases is transparent to the application developers; they don't need to know how many physical servers they're talking to. However, with 15,000 app servers at four DB connections per server, connection management is a problem. Oracle can't handle 60,000 connections, so they have to cheat to handle this.
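To make the transparency concrete: a routing layer like their custom O/R mapper presumably maps each ID deterministically to a logical shard, so application code never names a physical server. This is a minimal sketch under my own assumptions (class name, modulo-hash scheme), not eBay's actual code.

```java
// Hypothetical sharding router: maps an item or user ID to one of N
// logical shards. The deterministic mapping is what lets application
// developers ignore how many physical databases exist.
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    /** Same ID always lands on the same shard. */
    public int shardFor(long id) {
        // Math.floorMod keeps the result non-negative even if the hash is negative.
        return Math.floorMod(Long.hashCode(id), shardCount);
    }
}
```

A real scheme would likely use a lookup table or consistent hashing so shards can be rebalanced without remapping every row, but the principle is the same: the shard choice lives in one layer, not in application code.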

Different rows have very different characteristics; e.g. a user with one feedback and a user with half a million feedbacks.

Amazon has the exact same rules. Brewer's CAP theorem: pick two out of three of consistency, availability, and partition tolerance. They and Amazon discovered this empirically.

Now Randy starts talking about scaling the application tier (as opposed to the database tier).

They'd like to put more than 4K of data in a cookie, and they believe they can't do multipage flows without cookies. (They're wrong about that one, by the way.) They're in the process of inventing server-side cookies to get around this. If they just threw out the cookies and stored the ID of the server-side record in the URL, they'd be RESTful. They did have some scalability issues where the 4K cookies came back with each of 70 requests for 70 images on one page; they had to reorganize to avoid that. They do try to avoid session state, but they haven't achieved stateless nirvana yet (my opinion, not theirs).
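Here's a sketch of what I mean by putting the server-side record's ID in the URL instead of in a cookie. All the names and the URL shape here are my invention, not eBay's design: the flow state lives server-side, and each page simply links to the next with the state's ID as a query parameter.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration of cookie-free multipage flows: state is stored
// server-side and only its ID travels in the URL, so every request
// is self-describing and any app server can handle the next step.
public class FlowStore {
    private final Map<String, Map<String, String>> flows = new ConcurrentHashMap<>();

    /** Persist the flow state and return the ID to embed in links. */
    public String save(Map<String, String> state) {
        String id = UUID.randomUUID().toString();
        flows.put(id, state);
        return id;
    }

    public Map<String, String> load(String id) {
        return flows.get(id);
    }

    /** The next page in the flow is addressed by URL, not by cookie. */
    public String nextPageUrl(String id) {
        return "/checkout/step2?flow=" + id;
    }
}
```

Since the ID rides in the URL rather than a cookie, it also isn't resent with every one of those 70 image requests.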

The 4K limit in cookies is a surprise. I didn't know about that. URLs can hold more than 4K. I'm about to run out of battery power. I think I got most of the critical details down anyway.


I wanted to hear Granville Miller talk about Agile Architecture but that room is just too full. If the fire marshal shows up, we're in trouble. The conference has five classrooms running simultaneously, all about the same size. That has often meant the most popular session or two in any one time slot has people sitting in the aisles, while the others feel relatively empty.


First afternoon session: Anthony Mattei talks about "When Every Click Counts: Building Mission Critical Enterprise Systems". "Fault tolerance is not a new concept." He cites fault-tolerant architectures going back to ENIAC in 1946, with Tandem et al. available in the 1980s.

An SLA is an outward-facing document telling the customer how your system will behave; however, it also has internal value to the engineers. Distinguish between planned and unplanned downtime. Avoid single points of failure. Publish/subscribe has pretty much supplanted point-to-point messaging.
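The pub/sub point is worth a tiny illustration. In a point-to-point queue the sender must know its one receiver; with publish/subscribe the publisher is decoupled from however many consumers exist, so consumers can be added or fail independently. This is an in-process toy of my own, not Mattei's example or any real messaging stack like JMS.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Minimal publish/subscribe topic: the publisher never names its
// consumers, so adding a subscriber requires no publisher changes
// and no single consumer is a point of failure for delivery.
public class Topic<T> {
    private final List<Consumer<T>> subscribers = new CopyOnWriteArrayList<>();

    public void subscribe(Consumer<T> subscriber) {
        subscribers.add(subscriber);
    }

    /** Every current subscriber receives every published message. */
    public void publish(T message) {
        for (Consumer<T> s : subscribers) {
            s.accept(message);
        }
    }
}
```

A production system would add durable subscriptions and delivery guarantees, which is where the planned-vs-unplanned-downtime distinction in the SLA starts to matter.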