Friday, March 2, 2012

Two sides of big data: Better Answers vs. Cheaper Answers

One of the most frequently uttered thoughts at Strata 2012 was "This is nothing new - folks have been doing this for years." Yet something new and creative is going on. There is certainly a lot of fluff, but I see three major legitimate trends that overlap, in a fairly confusing manner:

  • Lowering enterprise data warehouse operation costs (cheaper answers)
  • Mathematical insights into data (better answers)
  • Closed Loop Insight with machine data (better answers)

“How can I use Hadoop + NoSQL instead of Teradata?” aka "Saving $"

When speaking with vendors and users, it's not typically Oracle or SAP that folks are looking to cut costs on, but Teradata. Attacking new problems without buying new Teradata hardware and licenses seems to be a large motivation driving many "Big Data" projects. There is a lot of overlap with the ecommerce and advertising crowd because many of them were also big Teradata customers; now they need to grow with lower costs. Six different vendors, including two larger than Teradata, brought up replacing Teradata as a key use case driving Hadoop adoption, with cost, not Teradata's capabilities, being the leading reason.

The use cases driving this replacement are mostly the same:

  • Sandboxing complex queries off the real-time system: typically queries that don't leverage the indexes, take a long time, and are not used operationally.
  • Low-budget departmental projects. Two vendors gave me examples of the $50K ETL/EDW project: the department can easily get Linux servers from IT, but not 20TB of Teradata space.
  • Targeted applications where storage costs are disproportionate to value. Teradata can't/doesn't charge less for 1TB of click-stream data vs. 1TB of stock-trade data, but the values are quite different.

This is about taking advantage of new cost structures. This story was told over and over again, but I liked VMware's slide the best:

 

50X more IOPS, 800X more bandwidth, 40X more storage!!!

This is so huge that organizations that don't take advantage of it risk becoming obsolete. What would happen if an upstart bank could have 1/10th the IT costs of the established banks? Any information-driven business has to be looking at this. This is the disrupter: Hadoop means one can run poor, brute-force code and still end up with useful results cheaply. This disruption also means that hosting providers have cycles to spare.
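To make the "poor, brute-force code" point concrete, here is a minimal sketch of the kind of job that wins on cheap infrastructure: a Hadoop Streaming mapper/reducer pair in plain Python that full-scans click-stream logs and counts events per page. The log layout (tab-separated lines with the URL in the third field) and the invocation paths are assumptions for illustration, not anything a vendor showed at Strata.

    # clickcount.py: a deliberately brute-force Hadoop Streaming job in Python.
    # Assumes tab-separated log lines with the page URL in the third field
    # (a made-up layout, purely for illustration).
    #
    # Invocation sketch (paths and streaming jar name are assumptions):
    #   hadoop jar hadoop-streaming.jar \
    #     -input /logs/clicks -output /out/clickcount \
    #     -mapper "python clickcount.py map" -reducer "python clickcount.py reduce"
    import sys

    def mapper():
        # Full scan: emit (url, 1) for every record; no indexes, no cleverness.
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                print("%s\t1" % fields[2])

    def reducer():
        # Hadoop delivers keys to the reducer in sorted order, so a running
        # total per key is all that is needed.
        current, count = None, 0
        for line in sys.stdin:
            url, value = line.rstrip("\n").split("\t", 1)
            if url != current:
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = url, 0
            count += int(value)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

Nothing about this is efficient, and that is the point: when IOPS, bandwidth and storage are this cheap, throwing machines at naive code is often the rational choice.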

Vendor activities?

Tier-one consulting firms are looking to build those targeted applications and move the storage and traffic to the cloud, replacing CapEx with services. I think it's really pure services right now; I didn't see any market leadership around centers of expertise from any particular firm, so I think they are playing catch-up. (Accenture had lots of people at Strata, but it was not an SI-focused show and no Tier-1 SI had a booth.)

Adjacent tier-one vendors are offering to pick off each other's proprietary solutions with their new "commodity" solutions (EMC/Greenplum HD and the Oracle Big Data Appliance are two examples; IBM didn't have a crisp hardware story, and Cisco also has a reasonable story here). But does non-whitebox really have a chance?

 

                 | Exadata X2-2 HC                       | Oracle Big Data Appliance | Hyve Intel OCP-V2 (1/3 of a 3-rack unit)
Storage Capacity | 504TB                                 | 648TB                     | 144TB
IOPS             | 16K                                   | 21K                       | 32K
CPU              | 262 cores                             | 216 cores                 | 576 cores
Memory           | 768GB                                 | 864GB                     | 3.5TB
Cost             | ~$2M after discounts plus maintenance | $450,000 plus maintenance | ~$200,000
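Normalizing the raw acquisition figures in the table makes the comparison easier to reason about. A quick back-of-the-envelope script, using only the rough prices quoted above and ignoring maintenance, software, and support (which is exactly where the next paragraph argues the real difference lies):

    # Back-of-the-envelope normalization of the table above; prices are the
    # rough/list figures quoted there, before maintenance.
    systems = {
        "Exadata X2-2 HC":           {"cost": 2000000, "tb": 504, "iops": 16000, "cores": 262},
        "Oracle Big Data Appliance": {"cost":  450000, "tb": 648, "iops": 21000, "cores": 216},
        "Hyve Intel OCP-V2 (1/3)":   {"cost":  200000, "tb": 144, "iops": 32000, "cores": 576},
    }

    for name, s in systems.items():
        print("%-28s $%6.0f/TB   $%5.1f/IOPS   $%6.0f/core"
              % (name, s["cost"] / s["tb"], s["cost"] / s["iops"], s["cost"] / s["cores"]))

Acquisition cost alone doesn't settle the whitebox question: the Hyve box wins decisively on dollars per core and per IOPS, while the packaged appliances still look good on raw dollars per terabyte. The maintenance argument below is what tips it.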

 

Teradata and Netezza compete in this same cost structure. I think maintenance costs are going to be the death of this model. While the value of EMC or Oracle today is clear vs. whitebox, the maintenance models of the big players cannot be sustained: with hardware price/performance doubling each year, the same money buys roughly 8X the power after three years, so by year three a 12% annual hardware maintenance bill roughly equals the cost of replacing the box with equivalent-performance new hardware.
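A tiny sketch of that arithmetic, assuming the premise that hardware price/performance doubles every year and using the 12% maintenance rate quoted above:

    # Year-by-year: 12% maintenance on the original purchase vs. the cost of
    # buying equivalent-performance hardware new, assuming price/performance
    # doubles every year (so the same capability costs half as much each year).
    original_price = 1.0        # normalize the original hardware purchase to 1.0
    maintenance_rate = 0.12     # 12% of the original price, billed every year

    for year in range(1, 4):
        replacement = original_price / (2 ** year)      # same performance, newer gear
        maintenance = maintenance_rate * original_price
        print("year %d: maintenance %.2f vs. equivalent replacement %.3f"
              % (year, maintenance, replacement))
    # year 3: maintenance 0.12 vs. replacement 0.125; one year of maintenance
    # now buys a whole new box of the same capability.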

"Fire your domain experts, hire mathmeticians"

In a very interesting panel discussion off the floor, the question was asked, "Does Big Data analytics replace domain experts?" Jeremy Howard, who runs Kaggle, a crowd-sourced analytics company, put forth the argument that math trumps domain expertise. He referred to a very interesting competition they ran for Allstate. The goal of the challenge was to predict bodily injury liability based solely on the characteristics of the insured vehicle.

 

  • Allstate has been underwriting auto insurance for 80 years.
  • Car model has been collected as a driver of rates for a very, very long time.
  • Allstate has hundreds of statisticians and mathematicians on staff; they are a leading employer of actuaries.
  • 107 teams entered the competition.
  • The winner improved predictive performance by 340% over Allstate's existing model.
  • The winner was also an actuary, but without any domain knowledge of auto insurance.

 

The work he and the other contestants did was completely independent of the domain: he didn't need to know that the algorithm was mapping car models to injury claims; it might as well have been mapping demographics to restaurant sites.
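That domain independence is easy to see in code. The sketch below is not the winner's method or Allstate's (neither was disclosed in the session); it's just a generic supervised-learning loop, here using scikit-learn's gradient boosting on synthetic data, that never looks at what the columns mean, only at how well they predict the target.

    # A domain-blind modeling loop: the columns could be car attributes and
    # injury payouts, or demographics and restaurant revenue; the code neither
    # knows nor cares. scikit-learn is an assumption; any fit/predict library works.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    def fit_and_score(X, y):
        """Fit on anonymous features X and target y, return holdout error."""
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
        model = GradientBoostingRegressor(random_state=0)
        model.fit(X_tr, y_tr)
        return mean_absolute_error(y_te, model.predict(X_te))

    # Synthetic stand-in data: ten anonymous numeric features, one numeric target.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 10))
    y = 3 * X[:, 0] + X[:, 3] ** 2 + rng.normal(scale=0.5, size=5000)
    print("holdout MAE: %.3f" % fit_and_score(X, y))

No feature names appear anywhere; all the algorithm ever sees is a matrix and a target column.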

Kaggle is currently running a contest with a $3M prize to the team that predicts how many days a patient will stay in the hospital in the next year.  None of the top 5 on the leader board have a health care background.

“How do I get people to buy my goods or click on my ads?” aka "Closed Loop Insight with Machine Data"

This was the first use case for Big Data and it remains the primary one. Gaining insight and programmatically learning from user behavior is what drew most of the non-vendors I talked with to the conference. They may or may not have Big Data now, but they are universally envious of the learning that those with closed-loop environments enjoy versus traditional businesses. This has been talked to death and, while interesting, doesn't need further explanation.


 
