Big Data Pragmaticalities: Experiences from Time Series Remote Sensing

Big Data Pragmaticalities: Experiences from Time Series Remote Sensing. Edward King, Remote Sensing & Software Team Leader, CSIRO Marine & Atmospheric Research. 3 September 2013.

Overview: remote sensing (RS) and RS time series (types of processing and scale); opportunities for parallelism; compute versus data; scientific programming versus software engineering; some handy techniques; where next.

Automated data collection.

Presto! Big Data(sets).

More detail: the processing chain runs from L0 (raw sensor data) through L1B (calibrated) and L2 (derived quantities) to remapped products and composites. Examples: 1 km imagery at 3000 scenes/year x 500 MB/scene x 10 years = 15 TB; 500 m imagery is 4x that, i.e. 60 TB.

Recap (big-picture view): these archives are large; they are often stored only in raw format; we usually need to do a significant amount of processing to extract the geophysical variable(s) of interest; and we often need to process the whole archive to achieve consistency in the data. For scientists without a background in high performance computing and data-intensive science this is a daunting prospect, but there are things that can make it easier.

Output types: individual scenes delivered to users, and composites in which several scenes are combined into a best-pixels product for users. [Slide figure: several scenes added together to form a composite.]

Things to notice: some operations are done over and over again on data from different times; for example, processing Monday's data and Tuesday's data are independent. This is an opportunity to do things in parallel (i.e. all at the same time). Operations on one place in the data are also completely independent of operations on other places; for example, processing data from WA doesn't depend on data from Tas. This is another opportunity to do things in parallel.
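
Since each scene and each spatial tile can be processed independently, the simplest way to exploit this is a pool of worker processes, one scene per task. A minimal Python sketch, assuming a hypothetical process_scene() function and a directory of scene files (not code from the talk):

    from glob import glob
    from multiprocessing import Pool

    def process_scene(path):
        # calibrate / remap / run your algorithm on one scene (placeholder)
        return path

    if __name__ == "__main__":
        scenes = sorted(glob("archive/*.nc"))    # every scene is independent
        with Pool(processes=8) as pool:          # roughly one worker per CPU
            for done in pool.imap_unordered(process_scene, scenes):
                print("finished", done)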

Note: this general pattern is often referred to as a Hadoop or map-reduce system, and there are software frameworks that formalise it; for example, it lies behind Google's search indexing. (Disclaimer: I've never used one.)

So what? Our previous example: 10 yrs x 3000 scenes/yr at 10 mins/scene = 5000 hrs, or about 30 weeks. Give me 200 CPUs and it becomes 25 hours. But what about the data flux? 15 TB over 30 weeks is about 3 GB/hour; 15 TB over 25 hours is about 600 GB/hour (each scene is ~0.5 GB). The problem is transformed from compute bound to I/O bound.
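
The numbers on this slide are easy to check with back-of-envelope arithmetic; a quick sketch in plain Python, using the figures from the slides above:

    scenes = 10 * 3000                      # 10 years x 3000 scenes/year
    hours_serial = scenes * 10 / 60         # 10 minutes per scene -> 5000 hours
    weeks_serial = hours_serial / (24 * 7)  # ~30 weeks on one CPU
    hours_parallel = hours_serial / 200     # ~25 hours on 200 CPUs

    archive_gb = 15 * 1024                  # 15 TB archive
    print(archive_gb / hours_serial)        # ~3 GB/hour to feed one CPU
    print(archive_gb / hours_parallel)      # ~600 GB/hour to feed 200 CPUs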

Key tradeoff #1: can you supply data fast enough to make the most of your computing? How much effort you put into this depends on how big your data set is, how much computing you have available, how many times you have to do it, and how soon you need your result. Figuring out how to balance data organisation and supply against time spent computing is key to getting the best results. Unless you have an extraordinarily computationally intensive algorithm, you're usually better off focusing on steps to speed up the data.

Computing clusters: a workstation with 2 CPUs would take 15 weeks; my first (and last) cluster (2002), with 20 CPUs, about 1.5 weeks; the NCI machine (now obsolete), with 20,000 CPUs, about 20 minutes.

Plumbing & software: somehow we have to connect data to operations. Operations (atmospheric correction, remapping, calibration, mycleveralgorithm) might be pre-existing packages or your own special code (Fortran, C, Python, Matlab, IDL). Connecting means providing the right data to the right operation and collecting the results. Usually you will use a scripting language, since you need to work with the operating system, run programs, analyse file names, and maybe read log files to see whether something went wrong. Software for us is like glassware in a chemistry lab: a specialised setup for our experiments; you can get components off the shelf, but only you know how you want to connect them together. Bottom line: you're going to be doing some programming of some sort.
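
In practice the glue is usually a short script that runs each operation and checks the result. A sketch in Python, assuming a hypothetical external command called remap and an illustrative directory layout:

    import subprocess
    from glob import glob

    for fname in sorted(glob("l2/*.nc")):
        out = fname.replace("l2/", "remapped/")
        result = subprocess.run(["remap", fname, out],   # hypothetical command
                                capture_output=True, text=True)
        if result.returncode != 0:
            print("FAILED:", fname, result.stderr.strip())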

Scientific programming versus software engineering (key tradeoff #2): do you want to do this processing only once, or many times? Which parts of your workflow are repeated and which are one-off (e.g. base processing run many times, followed by one-off analysis experiments)? How does the cost of your time spent programming compare with the availability of computing and the time spent running your workflow? Why spend a week making something twice as fast if it already runs in two days? (Maybe because you need to do it many times.) And will you need to understand it later?

The proprietary fly in the ointment (#1): if you use licensed software (IDL, Matlab, etc.) you need a licence for each CPU you want to run on, which may mean you can't use anything like as much computing as you otherwise could. These languages are good for prototyping and testing, but to really make the most of modern computing you need to escape the licensing encumbrance, i.e. migrate to free software. (PS: Windows is licensed software.) Example: we have complex IDL code that we run on a big data set at the NCI, but only 4 licences, so it runs in a week (6 days); with 50 licences it would run in about 12 hours. We can live with that, since porting it to Python would mean weeks and weeks of coding and testing.

How to do it

Maximise performance by:
1. Minimising the amount of programming you do: exploit existing tools (e.g. standard processing packages, operating system commands); write things you can re-use (data access, logging tools); choose file names that make it easy to figure out what to do; use the file system as your database.
2. Maximising your ability to use multiple CPUs: eliminate unnecessary differences (e.g. in data formats and standards); look for opportunities to parallelise; avoid licensing (e.g. proprietary data formats, libraries, languages).
3. Seeking data movement efficiency everywhere: data layout, compression, RAM disks.
4. Minimising the number of times you have to run your workflow: log everything, so there is no uncertainty about whether you did what you think you did.

RAM disks: tapes are slow, disks are less slow, memory is even less slow, and cache is fast but small (the hierarchy runs TAPE, DISK, RAM, CPU cache). Most modern systems have multiple GB of RAM per CPU, which you can assign to working memory or use as a virtual disk. If you have multiple processing steps that need intermediate file storage, use a RAM disk; you can get a factor of 10 improvement.
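
A sketch of the idea, assuming a Linux system where /dev/shm is a RAM-backed tmpfs (file names are illustrative):

    import os
    import tempfile

    # fall back to the normal temp directory if there is no RAM disk
    ramdisk = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
    intermediate = os.path.join(ramdisk, "scene_calibrated.tmp")

    with open(intermediate, "wb") as f:
        f.write(b"...intermediate results...")   # step 1 writes at RAM speed
    # step 2 reads the intermediate back, also at RAM speed, then clean up
    os.remove(intermediate)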

Compression: data that is half the size takes half as long to move (you then have to uncompress it, but CPUs are faster than disks). Zip and gzip will usually get you a factor of 2-4 compression; bzip2 is often 10-15% better, but it is much slower (a factor of 5). Don't store spurious precision (3.14 compresses better than 3.1415926). Avoid recompressing: treat the compressed archive as read-only, i.e. copy-uncompress-use-delete, NOT move-uncompress-use-recompress-move back. (The data path is: the remote disk holds file.gz, the file is decompressed into RAM, and the CPU works on it there.)
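
A sketch of the copy-uncompress-use-delete pattern using Python's standard gzip and shutil modules (file names are illustrative):

    import gzip
    import os
    import shutil

    archive_copy = "work/scene_20130812.nc.gz"   # copied from the read-only archive
    working_file = "work/scene_20130812.nc"

    with gzip.open(archive_copy, "rb") as src, open(working_file, "wb") as dst:
        shutil.copyfileobj(src, dst)             # uncompress into working space

    # ... process working_file ...

    os.remove(working_file)                      # delete the uncompressed copy
    os.remove(archive_copy)                      # never recompress back into the archive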

Data layout: look at your data access patterns and organise your code and data to match. E.g. 1: if your analysis uses multiple files repeatedly, reorganise the data to reduce the number of open and close operations. E.g. 2: big files tend to end up as contiguous blocks on disk, so try to localise access to the data rather than jumping around, which means waiting for the disk. [Slide figure: access by row versus access by column.]
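
The row-versus-column effect is easy to see with NumPy, whose arrays are row-major (C order) by default; a small sketch (the exact ratio depends on the machine):

    import time
    import numpy as np

    a = np.zeros((5000, 5000), dtype=np.float32)   # ~100 MB, row-major layout

    t0 = time.perf_counter()
    total = sum(a[i, :].sum() for i in range(a.shape[0]))   # contiguous rows
    t1 = time.perf_counter()
    total = sum(a[:, j].sum() for j in range(a.shape[1]))   # strided columns
    t2 = time.perf_counter()
    print("by row: %.3fs  by column: %.3fs" % (t1 - t0, t2 - t1))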

Data formats (and metadata): this is still a religious subject. Factors to consider: proprietary formats (which may need licences or libraries for undocumented formats) versus open formats that are publicly documented; self-contained (header/metadata and data kept together); self-documenting (the structure can be decoded using only information already in the file); architectural independence (will work on different computers); storage efficiency (binary versus ASCII); access efficiency and flexibility (support for different layouts); interoperability (openness and standards conformance = reuse); conventions around metadata for consistency; automated metadata harvesting (for indexing/cataloguing); and longevity (and migration). Answer: use netCDF or HDF (or maybe FITS in astronomy).
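
As a sketch of what self-documenting means in practice, here is a minimal write with the netCDF4 Python library (assuming it is installed; variable names and attributes are illustrative only):

    import numpy as np
    from netCDF4 import Dataset

    with Dataset("sst_20130812.nc", "w") as ds:
        ds.title = "Sea surface temperature composite"   # global metadata
        ds.history = "created by compose.py v1.2"        # provenance
        ds.createDimension("lat", 180)
        ds.createDimension("lon", 360)
        sst = ds.createVariable("sst", "f4", ("lat", "lon"), zlib=True)
        sst.units = "kelvin"                             # per-variable metadata
        sst[:] = np.full((180, 360), 290.0, dtype="f4")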

The file system is my database: in multi-step processing of thousands of files you will often want to use a database to keep track of things. DON'T! Every time you do something you have to update the DB; it doesn't usually take long before inconsistencies arise (e.g. someone deletes a file by hand); and databases are a pain to work with by hand (SQL syntax, forgettable rules). Use the file system (folders, filenames) to keep track instead. Examples: once file.nc has been processed, rename it to file.nc.done and have your processing look only for files matching *.nc (rename it back to file.nc to run it again; use ls or dir to see where things are up to, and rm to get rid of things that didn't work). Create zero-size files as breadcrumbs: touch file.nc.fail.step2, then ls *.fail.* to see how many failures there were and at which step. Use directories to group data that need to be grouped, for example all the files for a particular composite.
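
The same bookkeeping can be scripted; a sketch using only the Python standard library (process() and the file names are hypothetical):

    import os
    from glob import glob

    def process(path):
        # hypothetical per-file operation (calibrate, remap, composite, ...)
        pass

    for fname in glob("work/*.nc"):                # *.nc = to do; *.nc.done = finished
        try:
            process(fname)
            os.rename(fname, fname + ".done")      # mark success in the filename
        except Exception:
            open(fname + ".fail.step1", "w").close()   # zero-size breadcrumb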

Filenames are really important: they are a good place to store metadata relevant to the processing workflow, because they are easy to access without opening the file and you can use file-system tools to select data. Use YYYYMMDD (or YYYYddd) for dates in filenames so that files automatically sort into time order (cf. DDMMYY or DDmonYYYY). Make it easy to get metadata out of file names: use fixed-width numerical fields (F1A.dat, F10B.dat, F100C.dat is harder for a program to interpret than F001A.dat, F010B.dat, F100C.dat) and structured names, but don't go overboard! E.g. with names like D-20130812.G-1455.P-aqua.C-20130812172816.T-d000000n274862.S-n.pds you can do ls *.G-1[234]* to choose files at a particular time of day.
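
Structured names like the one above can be pulled apart without opening the file. A Python sketch (the D and G fields are the date and time of day used in the ls example; treating P as the platform is my assumption):

    import os
    from datetime import datetime

    name = "D-20130812.G-1455.P-aqua.C-20130812172816.T-d000000n274862.S-n.pds"

    # split "KEY-value" fields out of the dot-separated name
    fields = dict(part.split("-", 1)
                  for part in os.path.splitext(name)[0].split("."))

    acquired = datetime.strptime(fields["D"] + fields["G"], "%Y%m%d%H%M")
    print(fields["P"], acquired)    # -> aqua 2013-08-12 14:55:00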

Logging and provenance: every time you do something (move data, feed it to a program, put it somewhere), write a time-stamped message to a log file. Write a function that automatically prepends a timestamp to any piece of text you give it. Timestamps are really useful for profiling (identifying where the bottlenecks are) and for figuring out whether something has gone wrong. Huge log files are a tiny marginal overhead; make them easy to read by program (e.g. with grep). Make your processing code report a version (number or description) and its inputs to the log file, and write the log file into the output data file as a final step. This lets you understand what you did months later (so you don't do it again) and keeps the relevant log file with the data (so you don't lose it or mix it up).
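
A sketch of the timestamp-prepending helper described here (names and messages are illustrative; Python's standard logging module is another way to get the same effect):

    import sys
    from datetime import datetime, timezone

    def log(message, stream=sys.stdout):
        """Write message to stream, prefixed with a UTC timestamp."""
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        stream.write("%s %s\n" % (stamp, message))

    log("remap v1.3 started: input scene_20130812.nc")
    log("remap v1.3 finished: wrote remapped/scene_20130812.nc")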

Final thoughts: most of this is applicable to other data-intensive parallel processing tasks, e.g. spatio-temporal model output grids, though the advantages may vary with file size. Data organisation has many subtleties, and a little work in understanding it can offer great returns in performance. Keep an eye on file format capabilities. More CPUs is a double-edged sword, and data efficiency will only become more important. I haven't really touched on spatial metadata (very important for ease of end use and analysis, but tedious, which means automatable). Get your data into a self-documenting, machine-readable, open file format and you'll never have to reformat by hand again. These are things we now do out of habit because they work for us; perhaps they'll work for you?

Thank you. Edward King, Team Leader: Remote Sensing & Software, CSIRO Marine & Atmospheric Research. t +61 3 6232 5334, e edward.king@csiro.au, w www.csiro.au/cmar