The Stonebraker Uncertainty Principle

Wednesday, October 19, 2011 at 04:04PM

Today at the XLDB conference at the Stanford Linear Accelerator Center, Mike Stonebraker gave a very nice talk on shared-disk vs. shared-nothing architectures. I won't rehash the basics here - Mike has been writing about this for a long time.

Several folks sitting around me commented on an inconsistency in his talk:

Hadoop is a shared file system plus a computation dispatch mechanism (one that can route work based on data location, which is very nice for certain problems). Without a distributed file system, there is no Hadoop, right? What is unique about HDFS is how cheap it is compared with the alternatives. It's certainly not shared-nothing - the NameNode is a single shared point that every client depends on. And the data awareness that is so nice in Hadoop goes away if the storage itself is virtualized.

He claimed that all of the major web shops (Facebook, Google, LinkedIn, Zynga, eBay) had built around shared-nothing architectures.

Is this a conflict? Or just a change in perception depending on how you are looking at the problem?

For these very simple databases, where the file system ends and the database begins is quite confusing. Oracle's Big Data Appliance virtualizes storage and then puts HDFS on top of that. Or you can use some other storage layer entirely (for example, Hadoop can use S3 instead of HDFS).
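To make the data-awareness point concrete, here is a minimal sketch of locality-aware dispatch. It is not Hadoop's actual scheduler; the function name `assign_tasks` and the input shapes are made up for illustration. The idea is only that a task goes to a node holding a replica of its block when one is known, and to the least-loaded node otherwise. When the storage layer is virtualized and reports no replica locations, every assignment falls into the second case.

```python
def assign_tasks(block_locations, nodes):
    """Greedy locality-aware assignment (illustrative only).

    block_locations: dict mapping block id -> list of nodes holding a replica
                     (an empty list models virtualized storage with no
                     location information).
    nodes: list of compute node names.
    Returns (assignment, local_count).
    """
    load = {n: 0 for n in nodes}
    assignment = {}
    local_count = 0
    for block, replicas in block_locations.items():
        # Prefer nodes that already hold the data.
        candidates = [n for n in replicas if n in load]
        is_local = bool(candidates)
        pool = candidates if is_local else nodes
        # Among the allowed nodes, pick the one with the fewest tasks so far.
        chosen = min(pool, key=lambda n: load[n])
        load[chosen] += 1
        assignment[block] = chosen
        local_count += int(is_local)
    return assignment, local_count


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3"]
    with_locations = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n3"]}
    virtualized = {b: [] for b in with_locations}
    print(assign_tasks(with_locations, nodes))
    print(assign_tasks(virtualized, nodes))
```

With location information every task can run next to its data; with none, the dispatcher can only balance load, and every read crosses the network.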

Right now none of the NoSQL players deal with data skew or rebalance automatically, and nary a query planner is in sight, so the distinction doesn't matter much yet. But a statistics-based optimizer for rebalancing files and compute across these systems is in order. And the result will look as much like a distributed file system as a database...
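As a rough illustration of what statistics-driven rebalancing could mean, the sketch below takes per-node shard sizes, measures skew against the mean, and greedily proposes shard moves from the fullest node to the emptiest one. Everything here is an assumption for the sake of example: the name `plan_rebalance`, the `tolerance` parameter, and the use of shard size as the only statistic. A real optimizer would also weigh access frequency and compute load.

```python
def plan_rebalance(shards, tolerance=1.2):
    """Propose shard moves to reduce storage skew (illustrative only).

    shards: dict mapping node -> dict of shard id -> size.
    tolerance: stop once the fullest node holds at most tolerance * mean.
    Returns (moves, totals) where moves is a list of (shard, src, dst).
    """
    sizes = {n: dict(s) for n, s in shards.items()}  # work on a copy
    moves = []
    while True:
        totals = {n: sum(s.values()) for n, s in sizes.items()}
        mean = sum(totals.values()) / len(totals)
        src = max(totals, key=totals.get)
        dst = min(totals, key=totals.get)
        if mean == 0 or totals[src] <= tolerance * mean:
            break  # skew is within tolerance
        gap = totals[src] - totals[dst]
        # Only a shard smaller than the gap narrows the imbalance between
        # src and dst, which also guarantees the loop terminates.
        movable = {k: v for k, v in sizes[src].items() if 0 < v < gap}
        if not movable:
            break  # no single move helps; give up
        # Pick the shard that brings the pair closest to even.
        shard = min(movable, key=lambda k: abs(movable[k] - gap / 2))
        sizes[dst][shard] = sizes[src].pop(shard)
        moves.append((shard, src, dst))
    return moves, totals


if __name__ == "__main__":
    cluster = {
        "n1": {"a": 50, "b": 30, "c": 20},
        "n2": {"d": 10},
        "n3": {"e": 10},
    }
    moves, totals = plan_rebalance(cluster)
    print(moves)
    print(totals)
```

Note that the planner only ever looks at where bytes live and how many there are, which is exactly the kind of bookkeeping a distributed file system already does.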

Aaron Rosenbaum