I was just reading an article linked from Slashdot, an interview with one of the top guys at MySQL.
The thing that jumped out at me was his point about how the sudden rise of multi-core CPUs and solid-state disk storage affects software design.
A database program lets you store incredibly large amounts of data in a way that allows you to search it, manipulate it, and retrieve information from it, in theory as fast as lightning. If you think about it, all the complexity and the different tuning parameters in a database program come from a set of facts about computers in general.
Hard disks, the only reasonable way to actually store the data, are incredibly slow compared to every other subsystem in the computer. The speed disadvantage of hard drives is multiplied when the data is accessed randomly. There have been a few short-lived periods when common hard drive technology managed to eclipse the common bus and network technologies of the day, but faster versions of those were always just around the corner, and when they went mainstream they once again left hard drives in the dust.
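To put rough numbers on that, here's a back-of-envelope sketch in C. The latencies are assumptions, not measurements: something like ten milliseconds per random read for a spinning disk, and on the order of a hundred nanoseconds for a RAM access.

```c
#include <stdio.h>

int main(void) {
    /* Assumed, order-of-magnitude latencies -- not measured values. */
    double disk_random_s = 0.010;      /* ~10 ms seek + rotation per random read */
    double ram_access_s  = 0.0000001;  /* ~100 ns per RAM access                 */

    printf("random reads per second from disk: ~%.0f\n", 1.0 / disk_random_s);
    printf("random accesses per second to RAM: ~%.0f\n", 1.0 / ram_access_s);
    printf("disk is roughly %.0fx slower per random access\n",
           disk_random_s / ram_access_s);
    return 0;
}
```

With those assumed numbers you get a handful of milliseconds per random disk read versus millions of RAM accesses in the same time, which is why so much of a database's design revolves around avoiding that trip to the platter.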
RAM is lightning fast in comparison to the disk. The CPU contains a largish piece of memory called the level 2 cache, which is immensely faster than main RAM, and a very small piece called the level 1 cache, which makes the level 2 cache look like molasses.
The idea in computer and software design is to make each of these levels of storage big enough that it can fully contain one or more levels of software and data abstraction. If you can keep the hottest, lowest-level code and data in the level 1 cache, work at that level never has to touch the comparatively glacial level 2 cache. If you can keep all of the code that makes up the currently executing loop in your program inside the level 2 cache, it will not need to slow down very often to access main system RAM. And if you can keep all your programs and the majority of the commonly accessed data cached in RAM, you will not need to board the three-week luxury cruise it takes to read a chunk of data off the hard drive. The catch is that buying enough RAM for perfection at all levels with a large database is prohibitively expensive for just about anyone.
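To see how much that layering matters in practice, here's a minimal sketch in C. It does the same set of additions twice over a big array: once walking memory in order, so each cache line gets used fully before it's evicted, and once jumping a whole row ahead on every step, so most cache lines are thrown out before they're reused. The array size is arbitrary, and the exact speed gap depends on the machine.

```c
#include <stdio.h>
#include <time.h>

#define N 4096                      /* 4096 x 4096 doubles, about 128 MB */

static double grid[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            grid[i][j] = 1.0;

    double sum = 0.0;
    clock_t t;

    t = clock();                    /* row-major: walks memory sequentially */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += grid[i][j];
    printf("row-major:    %.2fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();                    /* column-major: jumps N*8 bytes per step */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += grid[i][j];
    printf("column-major: %.2fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    printf("checksum: %.0f\n", sum);
    return 0;
}
```

The same work, done in an order the caches can't help with, takes noticeably longer. Scaled up a few levels, that's exactly why a database tries so hard to keep its hot data in RAM instead of on disk.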
Multi-core CPUs are now the norm, and the cache sizes on those CPUs are getting insane. Level 2 caches in the 8-16MB range are becoming common. If that seems small when you think about computers having gigabytes of RAM, consider this: it's only recently that CPUs hit 1MB of L2 cache. I believe the first mainstream Intel processor to do so was the Pentium 4 Prescott core in 2004.
Solid-state drives offer the greatest potential for seriously speeding up large systems like databases. At this point the bulk transfer rate of flash-based disks is not much faster than most hard drives, and in some cases it's actually slower. It's in random access that they really shine: with no moving parts, retrieving scattered data is just as fast as retrieving sequential data. The problem is that solid-state storage currently costs about ten times as much as equivalent hard drive storage, which greatly increases the cost of a computer. They are appearing in high-end laptops anyway, because they offer awesome speed and low power draw.
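If you want to see that difference on your own machine, here's a rough sketch that reads a couple of thousand 4 KB blocks from a file, first sequentially and then at random offsets. The block size and read count are arbitrary, the timing call is POSIX (so Unix-ish systems), and the OS page cache will blur the result unless the file is much bigger than your RAM. On a spinning disk the random pass should be dramatically slower; on flash the two should come out close.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

enum { BLOCK = 4096, READS = 2000 };

static double now(void) {                 /* wall-clock seconds */
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <large file>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    if (size < (long)BLOCK * READS) {
        fprintf(stderr, "pick a bigger file\n");
        return 1;
    }

    char buf[BLOCK];
    double t;

    t = now();                            /* sequential pass */
    fseek(f, 0, SEEK_SET);
    for (int i = 0; i < READS; i++)
        fread(buf, 1, BLOCK, f);
    printf("sequential: %.3fs\n", now() - t);

    srand(42);
    t = now();                            /* random pass */
    for (int i = 0; i < READS; i++) {
        long off = ((long)rand() % (size / BLOCK)) * BLOCK;
        fseek(f, off, SEEK_SET);
        fread(buf, 1, BLOCK, f);
    }
    printf("random:     %.3fs\n", now() - t);

    fclose(f);
    return 0;
}
```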
Now we just have to wait for compiler technology and software-writing practices to catch up with all these advancements, especially 64-bit multi-core CPUs. Because multi-core has only recently become so common, there's a boatload of legacy software that doesn't use multiple CPUs efficiently, and 64-bit versions are rare.
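For a sense of what that catching-up looks like at the code level, here's a minimal sketch using POSIX threads: instead of one loop summing an array on one core, the range is split across a handful of threads so several cores share the work. The thread count and array size are arbitrary, and real database engines divide up work in far more sophisticated ways; this is just the shape of the change.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                         /* arbitrary; ideally one per core */
#define LEN (8 * 1000 * 1000)

static double data[LEN];

struct chunk { int start, end; double sum; };

static void *partial_sum(void *arg) {      /* each thread sums its own slice */
    struct chunk *c = arg;
    c->sum = 0.0;
    for (int i = c->start; i < c->end; i++)
        c->sum += data[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < LEN; i++)
        data[i] = 1.0;

    pthread_t threads[NTHREADS];
    struct chunk chunks[NTHREADS];
    int per = LEN / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {   /* hand each thread a slice of the array */
        chunks[t].start = t * per;
        chunks[t].end   = (t == NTHREADS - 1) ? LEN : (t + 1) * per;
        pthread_create(&threads[t], NULL, partial_sum, &chunks[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {   /* wait, then combine the partial sums */
        pthread_join(threads[t], NULL);
        total += chunks[t].sum;
    }
    printf("total = %.0f\n", total);       /* should print 8000000 */
    return 0;
}
```

Build it with something like cc -pthread sum.c. Getting old single-threaded code restructured into independent chunks like this is exactly the part that takes time.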