Sensor Data Collection and Processing: Applying Web Scale to Sensor Data
Today's speaker: Josh Patterson (josh@cloudera.com, Twitter: @jpatanooga). Master's thesis: self-organizing mesh networks. Published in IAAI-09: "TinyTermite: A Secure Routing Algorithm". Conceived, built, and led Hadoop integration for the openPDC project at TVA (smart grid work). Led a small team that designed classification techniques for time series and MapReduce. Now: Solutions Architect at Cloudera. Open source work at http://openpdc.codeplex.com
NERC Sensor Data Collection: openPDC PMU data collection, circa 2009. 120 sensors, 30 samples/second, 4.3B samples/day, housed in Hadoop.
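The ingest figures above can be sanity-checked with a little arithmetic. The sketch below uses only the numbers on the slide; the function names are hypothetical. Note that 120 sensors at 30 samples/second gives fewer readings per day than the stated 4.3B. One possible reading, which the slide does not confirm, is that each sensor reading carries several measured values.

```python
# Back-of-envelope check of the slide's ingest figures.
# All constants come from the slide; the "values per reading"
# interpretation is an assumption, not something the slide states.

SENSORS = 120
SAMPLES_PER_SECOND = 30
STATED_SAMPLES_PER_DAY = 4.3e9
SECONDS_PER_DAY = 24 * 60 * 60


def readings_per_day() -> int:
    """Sensor readings per day if each sensor emits one reading per sample tick."""
    return SENSORS * SAMPLES_PER_SECOND * SECONDS_PER_DAY


def stated_rate_per_second() -> float:
    """Average ingest rate implied by the stated 4.3B samples/day."""
    return STATED_SAMPLES_PER_DAY / SECONDS_PER_DAY


def implied_values_per_reading() -> float:
    """Ratio of the stated daily total to the raw reading count."""
    return STATED_SAMPLES_PER_DAY / readings_per_day()


if __name__ == "__main__":
    print(f"readings/day:        {readings_per_day():,}")
    print(f"stated samples/sec:  {stated_rate_per_second():,.0f}")
    print(f"implied values/read: {implied_values_per_reading():.1f}")
```

Under these assumptions the raw reading count is about 311 million per day, so the stated total works out to roughly 14 values per reading and an average of about 50,000 samples per second.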
Major Themes From openPDC. How much is coming in? Too much to make SAN storage cost-effective! Planned for ½ petabyte of data storage. OK, so where can this data live? Not at Amazon: regulations, etc. Also, for fun, price ½ petabyte of storage at Amazon. Enter Hadoop: linear scaling of storage in both space and cost. It also had that handy MapReduce thing included.
Apache Hadoop: an open source distributed storage and processing engine, made up of MapReduce and the Hadoop Distributed File System (HDFS). Consolidates mixed data: move complex and relational data into a single repository. Stores inexpensively: keep raw data always available; use industry-standard hardware. Processes at the source: eliminate ETL bottlenecks; mine data first, govern later.
What Hadoop Does. Networks industry-standard hardware nodes together to combine compute and storage into a scalable distributed system. Scales to petabytes without modification. Manages fault tolerance and data replication automatically. Processes semi-structured and unstructured data easily. Supports MapReduce natively to analyze data in parallel.
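To make the MapReduce idea concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps, computing the average value per sensor. It is plain Python, not the Hadoop API: the record layout `(sensor_id, value)` and all function names are assumptions made for illustration, and a real cluster would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout: (sensor_id, measured_value).
Record = Tuple[str, float]
Partial = Tuple[float, int]  # (sum of values, count of values)


def map_phase(record: Record) -> Iterator[Tuple[str, Partial]]:
    """Emit one (key, partial aggregate) pair per input record."""
    sensor_id, value = record
    yield sensor_id, (value, 1)


def shuffle(pairs: Iterable[Tuple[str, Partial]]) -> Dict[str, List[Partial]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[Partial]] = defaultdict(list)
    for key, partial in pairs:
        groups[key].append(partial)
    return groups


def reduce_phase(key: str, partials: List[Partial]) -> Tuple[str, float]:
    """Combine the partial aggregates for one key into an average."""
    total = sum(value_sum for value_sum, _ in partials)
    count = sum(n for _, n in partials)
    return key, total / count


def run_job(records: Iterable[Record]) -> Dict[str, float]:
    """Run map, shuffle, and reduce sequentially in a single process."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, partials) for key, partials in grouped.items())


if __name__ == "__main__":
    sample = [("pmu-1", 60.0), ("pmu-1", 59.9), ("pmu-2", 60.1)]
    print(run_job(sample))
```

Emitting `(sum, count)` pairs rather than raw values keeps the reduce step associative, which is what lets partial results be combined on each node before they cross the network.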
It's About More Than Just Collection. Scenario: 1 million sensors, each collecting one sample every 5 minutes; 5-year retention policy; storage needs of 15 TB. Reliability and availability? Processing on a single machine: 15 TB takes 2.2 days to scan, and we'd like to do a lot more than simple scans! Hadoop @ 20 nodes: the same task takes 11 minutes, and we can also use the MapReduce parallel programming model.
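The scenario's numbers can be worked through as a back-of-envelope calculation. The sketch below uses only the figures from the slide, assumes decimal terabytes (1 TB = 10^12 bytes) and 365-day years, and uses hypothetical function names. The slide gives no hardware details, so the per-node figure is only what the stated times imply.

```python
# Back-of-envelope arithmetic for the 1-million-sensor scenario.
# Assumptions (not stated on the slide): decimal TB, 365-day years.

SENSORS = 1_000_000
SAMPLE_INTERVAL_MINUTES = 5
RETENTION_YEARS = 5
TOTAL_BYTES = 15e12                    # 15 TB
SINGLE_SCAN_SECONDS = 2.2 * 24 * 3600  # 2.2 days on one machine
CLUSTER_NODES = 20
CLUSTER_SCAN_SECONDS = 11 * 60         # 11 minutes on 20 nodes


def total_samples() -> int:
    """Samples retained across all sensors over the retention period."""
    per_sensor_per_day = (24 * 60) // SAMPLE_INTERVAL_MINUTES
    return SENSORS * per_sensor_per_day * 365 * RETENTION_YEARS


def bytes_per_sample() -> float:
    """Average storage per sample implied by the 15 TB figure."""
    return TOTAL_BYTES / total_samples()


def single_machine_mb_per_second() -> float:
    """Scan throughput implied by a 2.2-day single-machine scan."""
    return TOTAL_BYTES / SINGLE_SCAN_SECONDS / 1e6


def cluster_speedup() -> float:
    """How many times faster the 20-node scan is than the single machine."""
    return SINGLE_SCAN_SECONDS / CLUSTER_SCAN_SECONDS


def speedup_per_node() -> float:
    """Speedup contributed by each node, relative to the single machine."""
    return cluster_speedup() / CLUSTER_NODES


if __name__ == "__main__":
    print(f"total samples:      {total_samples():,}")
    print(f"bytes per sample:   {bytes_per_sample():.1f}")
    print(f"single machine:     {single_machine_mb_per_second():.1f} MB/s")
    print(f"cluster speedup:    {cluster_speedup():.0f}x")
    print(f"speedup per node:   {speedup_per_node():.1f}x")
```

Under these assumptions the scenario holds about 526 billion samples at roughly 28.5 bytes each, the single machine scans at about 79 MB/s, and the cluster is about 288 times faster. That is around 14 times the single machine's throughput per node, which suggests each node reads from several disks at once, though the slide does not say so.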
Unstructured Data Explosion. [Chart: complex, unstructured data versus relational data.] 2,500 exabytes of new information in 2012, with the Internet as the primary driver. The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes this year. Source: IDC white paper, sponsored by EMC, "As the Economy Contracts, the Digital Universe Expands," May 2009.
The Cloud. The legend: everything just works in the cloud. The myth: cloud computing is a new technology. The reality: cloud computing is just more advanced network-based applications. Not all cloud services are equal; caveat emptor.
Scientific American on Cloud Computing: Much of what makes cloud computing tick (internet, mobile computers, networked data storage, ...) has been available since the beginning of the dot-com era more than a decade ago. What is new, or at least more recent, is the greater variety of content that can be delivered online to a wider variety of gadgets.
As It Turns Out, the Cloud Is Just Some Place in Northern Virginia. Business Insider, "Lessons Learned From AMZN Failure": Amazon is not infallible, and the cloud is not magic. Amazon is not the only IaaS provider, and your application should be able to run on more than one. Cloud deployments must be automated and should take cloud server reliability characteristics into account. Read more: http://www.businessinsider.com/learning-the-right-lessons-from-the-amazon-outage-2011-4#ixzz1l4gczcsu
Things to Think About. Can I really afford to be locked into a proprietary cloud technology long term? Open source is coming of age in the enterprise. The market for data analysis is exploding. Can I use my technology to process this data at scale, and process it fast? Reliable storage is a serious cost consideration: What does a terabyte cost on this platform? What does a petabyte cost on this platform?
Hadoop Adoption
Takeaways. Not all data can go into the cloud: smart grid data is sensitive and needs a private cloud. Caveat emptor: you can't just move everything to the cloud, and not all cloud tech is equally reliable. Consider speed at scale as the killer app, along with cost at scale and cost of lock-in.
Questions? Cloudera's Distribution including Apache Hadoop (CDH): http://www.cloudera.com
Resources:
http://www.slideshare.net/cloudera/hadoop-as-the-platform-for-the-smartgrid-at-tva
http://www.tva.gov/news/releases/octdec09/data_collection_software.htm
http://gigaom.com/cleantech/the-google-android-of-the-smart-grid-openpdc/
http://news.cnet.com/8301-13846_3-10393259-62.html
http://gigaom.com/cleantech/how-to-use-open-source-hadoop-for-the-smartgrid/
http://openpdc.codeplex.com/
Time series blog article: http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/