Historically, scale-up has been the model for Microsoft data warehouses. Running a large, multi-terabyte data warehouse meant buying a lot of hardware for a single server and hoping it would be enough once the warehouse was fully loaded and in use. If the hardware wasn’t sized properly, you could be looking at the substantial cost of purchasing a new server with more memory, disk, and CPU capacity.
Over the past several months, though, there have been a number of announcements in the SQL Server space that change that. We now have the option of scaling our warehouses up or out. Project “Madison”, which integrates the massively parallel processing (MPP) technology from the DATAllegro acquisition, promises to allow SQL Server 2008 to scale out to hundreds of terabytes by distributing processing among multiple commodity servers. Although it hasn’t been officially released yet, I’ve seen several demos of the functionality, and it looks promising. The advantage of this approach is that as you need additional capacity, you simply add more servers.
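To make the scale-out idea concrete, here’s a minimal Python sketch of the general MPP pattern: hash-distribute rows across a set of nodes, let each node aggregate its own slice, then merge the partial results. To be clear, this is just an illustration of the concept, not how “Madison” actually works under the hood, and the node count, table, and column names are all made up for the example.

```python
# A toy sketch of the MPP pattern behind scale-out warehousing --
# not "Madison"'s actual implementation, just the general idea of
# hash-distributing work across commodity nodes and recombining it.
from collections import defaultdict

NODE_COUNT = 4  # hypothetical; adding capacity means adding nodes


def distribute(rows, node_count=NODE_COUNT):
    """Hash-distribute rows across nodes by a distribution key."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[hash(row["customer_id"]) % node_count].append(row)
    return nodes


def node_aggregate(rows):
    """Each node computes a partial aggregate over its own slice."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return totals


def mpp_query(rows):
    """Scatter rows, compute partials per node, gather and merge."""
    partials = [node_aggregate(slice_) for slice_ in distribute(rows)]
    merged = defaultdict(float)
    for partial in partials:
        for region, total in partial.items():
            merged[region] += total
    return dict(merged)


sales = [
    {"customer_id": 1, "region": "East", "amount": 100.0},
    {"customer_id": 2, "region": "West", "amount": 250.0},
    {"customer_id": 3, "region": "East", "amount": 75.0},
]
print(mpp_query(sales))  # {'East': 175.0, 'West': 250.0}
```

The point of the pattern is that the per-node work shrinks as you add nodes, which is why capacity planning becomes a matter of adding servers rather than replacing one big one.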
On the scale-up front, last week Microsoft announced “SQL Server Fast Track Data Warehouse”, a set of reference architectures for symmetric multiprocessing (SMP) data warehousing. These are single-server configurations that are optimized for data warehousing workloads and have been tested and validated, which takes much of the guesswork out of sizing your data warehouse server. You still have to provide good estimates of query volume and size, though, to use the reference architectures effectively.
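Here’s the sort of back-of-envelope arithmetic those estimates feed into. The formula and the workload numbers below are my own illustrative assumptions, not figures from the Fast Track documentation, but they show why query volume and scan size matter when matching a workload to a configuration.

```python
# Back-of-envelope sizing arithmetic of the kind the Fast Track
# reference architectures ask you to bring to the table. The formula
# and all numbers here are illustrative assumptions, not published
# Fast Track figures.

def required_scan_throughput_mb_s(concurrent_queries, avg_scan_gb,
                                  target_response_s):
    """Rough sustained I/O throughput needed if each query scans
    avg_scan_gb of data and must finish in target_response_s seconds."""
    return concurrent_queries * (avg_scan_gb * 1024) / target_response_s


# Hypothetical workload: 10 concurrent queries, each scanning ~50 GB,
# with a 60-second response-time target.
needed = required_scan_throughput_mb_s(10, 50, 60)
print(f"Need roughly {needed:,.0f} MB/s of sustained scan throughput")
# -> Need roughly 8,533 MB/s of sustained scan throughput
```

Get those workload estimates wrong and even a validated configuration can come up short, which is why the reference architectures reduce, but don’t eliminate, the sizing work.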
So now the question becomes: should you target a scale-up or a scale-out approach for your data warehouse? One of the deciding factors is going to be your data volume. The Fast Track reference architectures are currently targeted at warehouses of 4 to 32 terabytes. Given current hardware limitations, that’s the practical ceiling for a single server, though that number is expected to rise as hardware continues to improve. “Madison”, on the other hand, can scale well past 32 terabytes. So if your current data needs are greater than 32 terabytes, I’d be looking closely at “Madison”.
What if your current needs are less than 32 terabytes, but you expect to grow past that point over the next couple of years? Fortunately, the Fast Track reference architectures are designed to offer an easy transition to “Madison” when your needs grow to that point. And if you expect your data volumes to stay below the 32-terabyte mark, the Fast Track reference architectures offer a greater degree of confidence that you are getting the appropriate configuration for your warehouse.
It’s always nice to have options, and improving the scalability of SQL Server should certainly help Microsoft in the large data warehouse marketplace. However, the roadmap for how this might apply to the Analysis Services component of SQL Server hasn’t been directly addressed yet. It would seem logical to offer the same sort of solutions in that space, and it will be interesting to see which direction Microsoft takes.