

Tuesday, October 4, 2011

Oracle Big Data Strategy

Oracle's Andy Mendelsohn, SVP of the Server group, laid out his vision for Big Data at Oracle yesterday. In Oracle's vision, the use of Big Data is to enhance decision support. This contrasts sharply with some of the common uses of big data elsewhere, such as supporting social networking or personalization.

First, some definitions... commonly used words don't necessarily mean the same thing to Oracle as they do to many others.

Big Data at Oracle means data that sits outside the existing data managed by the warehouse. The data is kept outside the warehouse because it doesn't or can't adhere to the warehouse's schema, is of low value without aggregation or analysis, or is too plentiful to warrant inclusion in the warehouse.

NoSQL at Oracle refers specifically to key/value datastore approaches like BigTable, Dynamo, Cassandra, Voldemort, MongoDB, etc. Simple data can be stored as key/value pairs; complex data structures belong in relational systems.
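
To make that definition concrete, here's a minimal sketch (mine, not Oracle's) of roughly the entire contract a pure K/V store offers - values are opaque blobs addressed by a primary key, with no joins, scans, or secondary predicates:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: roughly the whole query surface of a pure K/V store.
interface KeyValueStore {
    void put(String key, byte[] value);   // upsert by primary key
    byte[] get(String key);               // lookup by primary key only
    void delete(String key);
}

// A toy in-memory implementation, just to show the shape of the API.
class ToyStore implements KeyValueStore {
    private final Map<String, byte[]> data = new HashMap<String, byte[]>();
    public void put(String key, byte[] value) { data.put(key, value); }
    public byte[] get(String key)             { return data.get(key); }
    public void delete(String key)            { data.remove(key); }
}
```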

The Oracle Big Data Life Cycle

Big Data fits into a BI context at Oracle - Acquire, Organize, Analyze, Decide. These roughly correspond to ETL, Aggregate, Analytics, Presentation/Dashboard. Nothing much in the existing stack has been displaced or changed. CEP is unchanged (probably the biggest shock to me), dashboards are largely unchanged, and analytics has one significant addition - support for R. Most of the new strategy is around data acquisition, with a sharp focus on ETL.

Acquire

BerkeleyDB has found new life as Oracle NoSQL. A new distributed hashing approach has taken BerkeleyDB JE HA and made it scalable - a real server, rather than just a library. Oracle made a compelling case for its scalability. Having a robust K/V store from Oracle for rapid, agile web app development will be a welcome addition to many shops. It's pretty bare bones, and it seems like the Oracle team will have to walk a line to avoid stepping on the RDBMS space (no joins! simple data only!). But they will have their hands full integrating it with Oracle's security model, enterprise management tools, etc.
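
For flavor, here's roughly what talking to it looks like from the Java driver. The class names below (oracle.kv.KVStore and friends) are from the initial Oracle NoSQL client as I understand it; treat the exact signatures, store name, and host:port as assumptions:

```java
import java.util.Arrays;

import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

public class NoSqlHello {
    public static void main(String[] args) {
        // "kvstore" and the helper host:port are placeholders for a real deployment.
        KVStore store = KVStoreFactory.getStore(
                new KVStoreConfig("kvstore", "localhost:5000"));

        // Keys are paths of components; values are opaque bytes. Simple data only.
        Key key = Key.createKey(Arrays.asList("user", "1234"));
        store.put(key, Value.createValue("{\"name\":\"aaron\"}".getBytes()));

        ValueVersion vv = store.get(key);   // primary-key lookup; no joins
        System.out.println(new String(vv.getValue().getValue()));
        store.close();
    }
}
```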

All other big data acquisition has been delegated to "HDFS". This doesn't seem like a real strategy, just an acknowledgement of HDFS as a cold storage layer. Why not something like ZFS? Just because of Hadoop M/R access? It seems like a placeholder. But if it's not, there are several pretty decent implications. Having Oracle support for Hadoop on their OSes would be new. Plenty of people run it, but Cloudera doesn't offer support (do others support Solaris + Oracle Linux?). However, there was lots of vagueness and wiggle room when Oracle execs were asked what they mean by Apache Hadoop. Hadoop v0.20.2, binary only? Without HDFS? They can't mean that. But Sqoop and Flume?
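
To ground what "acquisition into HDFS" means at the API level, here's a minimal sketch using the standard Hadoop FileSystem client - my illustration, not anything Oracle showed, with a placeholder namenode URI and path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsColdStore {
    public static void main(String[] args) throws Exception {
        // fs.default.name was the 0.20-era key for the namenode URI.
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Land raw, schemaless log data as-is; schema gets applied later,
        // at what Oracle calls the "Organize" step.
        FSDataOutputStream out = fs.create(new Path("/raw/weblogs/2011-10-04.log"));
        out.writeBytes("127.0.0.1 - - [04/Oct/2011:10:00:00] \"GET / HTTP/1.1\" 200\n");
        out.close();
        fs.close();
    }
}
```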

Organize

Oracle has announced their own Hadoop connector software. There are two pieces: a connector that links Oracle Data Integrator and Hadoop so ODI can author M/R jobs (through Hive or something else - it was unclear), and tools, also accessed through ODI I assume, to read from HDFS. This area was pretty vague. The example shown on stage used Hadoop to analyze log files; the use case was session aggregation across web logs. Can't ODI do this already? If Hadoop was used, was it writing SQL through Hive, talking to Flume, or replacing Flume? I asked some others in a briefing after the session, and the answers were vague. That said, to the extent the semantics of the transform can be articulated in the relational calculus, there isn't any particular reason why ODI shouldn't be a great place to author M/R functions. I just don't know what benefit Hadoop would have over the existing parallelism built into ODI.
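
For reference, the session-aggregation use case is a classic hand-written M/R job, and presumably the kind of thing ODI would generate or wrap. This sketch is mine, under the assumption (for illustration only) that the session id is the third whitespace-separated field of each log line:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SessionCount {
    // Emit (sessionId, 1) for each log line.
    public static class SessionMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text session = new Text();

        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length > 2) {
                session.set(fields[2]);   // assumed position of the session id
                ctx.write(session, ONE);
            }
        }
    }

    // Sum hits per session id.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text session, Iterable<IntWritable> hits, Context ctx)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable h : hits) total += h.get();
            ctx.write(session, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "session-count");
        job.setJarByClass(SessionCount.class);
        job.setMapperClass(SessionMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/raw/weblogs"));
        FileOutputFormat.setOutputPath(job, new Path("/reduced/sessions"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```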

This is the most confusing part to me. Oracle already has DB-based M/R technology. Haven't they been looking at Aster? If you want to do distributed M/R with ODI, why not use a distributed SQL engine? Maybe in the future there will be an ODI link to run M/R over an Oracle NoSQL cluster, using the transforms you wrote in ODI and previously ran over Hadoop...

Oracle Loader for Hadoop

Reading from an HDFS datastore into ODI is for reduced data: after the magic code has been written to apply schema to the schemaless and aggregate the low-value data into high-value data, it's brought, nice and tidy, into the warehouse. Sqoop will be fine for some, but others will like having support from Oracle. This seems like a winner, even if it's less functional than Sqoop, just for the ODI integration...
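
I haven't seen the loader's interface, so here's just the generic shape of that final load step in plain JDBC - a sketch under the assumption that the M/R output is tab-separated session counts, with placeholder connection details and table name (the real product would presumably do this in parallel):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class LoadReducedData {
    public static void main(String[] args) throws Exception {
        // Connection URL, credentials, and table are placeholders;
        // requires the Oracle JDBC driver on the classpath.
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@dwhost:1521:orcl", "etl", "secret");
        conn.setAutoCommit(false);

        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO session_hits (session_id, hits) VALUES (?, ?)");

        // Each line of the M/R output is assumed to be "sessionId<TAB>count".
        BufferedReader in = new BufferedReader(new FileReader("part-r-00000"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.split("\t");
            ps.setString(1, f[0]);
            ps.setInt(2, Integer.parseInt(f[1]));
            ps.addBatch();
        }
        ps.executeBatch();
        conn.commit();

        in.close();
        ps.close();
        conn.close();
    }
}
```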

Analyze

In-database release of R. This was understandable: writing R against the data directly, instead of exporting XDF files and importing results back in, makes sense. Integrating with the scheduler in ODI and the dashboard was particularly cool. But in their use case, the data to be analyzed was in logs in HDFS. Why wouldn't you just do the whole thing in RHadoop? Why two-step it (or was it 3 or 4 steps) with supposedly terabytes of data? The line between native M/R jobs, ODI-authored M/R jobs, and R scripting was unclear. Still, for basic t-tests, correlations, pretty candle charts, etc., the Oracle R piece demoed very well.

Decide

Nothing new was demoed. No mention of existing apps taking advantage of this new stuff...

Oh, and Oracle would like you to buy their hardware to run all this on... unclear why...

Implications

This is good news for the Hadoop community, but I think bad news for Cloudera. Not sure if it's good or bad for the K/V folks. It's great for Oracle to acknowledge that some data shouldn't be in the RDBMS, but it'll be a drain on revenue from the enterprise. This seems only good for the rest of the Big Data ecosystem, big and small. Quest has always had products parallel to Oracle's - why should a loader be any different? Same for Informatica - a statement of direction for Hadoop and ODI lets their customers move forward with PowerCenter's more complete vision. All the text analytics folks, particularly those with a Hadoop strategy, seem in great shape - all text analytics was essentially delegated to the magical Java programmers in Oracle's vision.
