About Ambleside

Ambleside Logic is led by Aaron Rosenbaum. Father of 3, Programming since 7, DevOps since 11 (hacking RSTS), exIngres, exCTP, exCohera. Sold two companies to Oracle, one to HP. Research + Strategy for NoSQL/BigData ecosystem implementors, vendors and investors.

Friday, Jul 29, 2011

A taxonomy of Big Data

Every big data presentation I've seen starts with a discussion of how there are huge mountains of unanalyzed valuable data and how so much of the data produced is unstructured.  Not all big data, however, is created equal.

Log Data (structured but big)

System logs such as web logs and error logs are fairly structured data.  They are most likely un-normalized and may have some time sync errors, but as data goes, machine-generated data is pretty structured.  At large organizations, though, the volume can be quite large, and while traditional data management can deal with it, NoSQL is sometimes cheaper.  Splunk and Flume are some of the leaders here.
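To make "structured but big" concrete, here is a minimal sketch of turning one web log line into typed fields. It assumes the Common Log Format used by many web servers; the sample line and the parse_line helper are made up for illustration and are not tied to Splunk or Flume.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status size
# (assumed format; real logs often carry extra fields)
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    m = CLF.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

# Hypothetical sample line
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_line(sample)
```

Every line yields the same handful of fields, which is why the hard part of log data is volume rather than structure.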

Big Graphs

Social graphs and other highly recursive referential data aren't handled that efficiently in an RDBMS, but Hadoop doesn't do well either.  Daniel Abadi wrote a nice post about this a couple of weeks ago.
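A small sketch of why graph data is awkward for both: answering "who is within N hops of this person?" is a traversal in which each step depends on the previous one, which in a relational store typically means one more self-join per hop. The follows graph and the within_hops function below are hypothetical, in-memory stand-ins.

```python
from collections import deque

def within_hops(graph, start, max_hops):
    """Breadth-first search; returns {node: hop distance} for nodes reachable
    from start in at most max_hops steps, excluding start itself."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    del seen[start]
    return seen

# Made-up "who follows whom" data
follows = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan", "eve"],
    "dan": ["fay"],
    "eve": [],
    "fay": [],
}
reach = within_hops(follows, "ann", 2)
```

On a single machine this is trivial; the trouble starts when the adjacency lists no longer fit in memory and every hop becomes a join or a full pass over the data.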

Semi-structured Text

Email messages, Twitter posts, document data stores and the like have some items that carry more significance than others and can be recognized by specific fields (to, from, mentions, references), but they also have completely unstructured pieces.  MarkLogic excels at dealing with this sort of data.
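As a rough illustration of that split, the sketch below pulls the recognizable fields out of an email with Python's standard email module and leaves the body as free text, from which mentions are scraped with a simple regular expression. The message and the @mention convention are invented for the example; this says nothing about how MarkLogic does it.

```python
import re
from email import message_from_string

# Made-up message: structured headers, then an unstructured body
raw = (
    "From: ann@example.com\n"
    "To: bob@example.com\n"
    "Subject: Q3 numbers\n"
    "\n"
    "Hi Bob, can you loop in @cat on the Q3 numbers? Thanks!\n"
)

msg = message_from_string(raw)

# The structured part: fields with known names
structured = {
    "from": msg["From"],
    "to": msg["To"],
    "subject": msg["Subject"],
}

# The unstructured part: free text that has to be mined
body = msg.get_payload()
mentions = re.findall(r"@(\w+)", body)
```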

Transactional Streams

Large amounts of similar transactional data: call records, stock trades, cash register data.  This data is the bread and butter of OLAP and MOLAP, but it may be cheaper to process using Big Data approaches, sometimes cheaper than the maintenance costs on your old infrastructure.  There are also continuous stream query platforms for dealing with staggeringly large amounts of data or extremely fast query response; these tools, however, have not shown a lot of market traction vs. data warehousing.
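The idea behind a continuous stream query can be sketched as an aggregate that is updated as each record arrives instead of being recomputed over stored data. The example below keeps a rolling traded volume over the last few seconds; the trades list, its (timestamp, quantity) layout, and the rolling_volume function are assumptions for illustration, not any product's API.

```python
from collections import deque

def rolling_volume(trades, window_seconds):
    """For each (timestamp, quantity) trade, in timestamp order, yield
    (timestamp, total quantity traded in the last window_seconds)."""
    window = deque()
    total = 0
    for ts, qty in trades:
        window.append((ts, qty))
        total += qty
        # Evict trades that have fallen out of the window (ts - window, ts]
        while window[0][0] <= ts - window_seconds:
            _, old_qty = window.popleft()
            total -= old_qty
        yield ts, total

# Made-up trades: (seconds since open, shares)
trades = [(0, 100), (1, 50), (4, 25), (6, 10)]
result = list(rolling_volume(trades, 5))
```

Only the records inside the window are held in memory, which is what lets this style of query keep up with very high arrival rates.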


 

 
