Big Data Pragmaticalities: Experiences from Time Series Remote Sensing
Edward King, Remote Sensing & Software Team Leader
3 September 2013
MARINE & ATMOSPHERIC RESEARCH
Overview
Remote sensing (RS) and RS time series (type of processing & scale)
Opportunities for parallelism
Compute versus data
Scientific programming versus software engineering
Some handy techniques
Where next
Automated data collection.
Presto! Big Data(sets).
More Detail
Processing levels: L0 (raw sensor) -> L1B (calibrated) -> L2 (derived quantity) -> remapped -> composites
Examples:
1 km imagery: 3000 scenes/year x 500 MB/scene x 10 years = 15 TB
500 m imagery: x 4 = 60 TB
Recap - Big Picture View
These archives are large.
They are often stored only in raw format.
We usually need to do a significant amount of processing to extract the geophysical variable(s) of interest.
We often need to process the whole archive to achieve consistency in the data.
As a scientist, unless you have a background in high-performance computing and data-intensive science, this is a daunting prospect. There are things that can make it easier.
Output types
Scenes: individual images delivered to the user.
Composites: the best pixels from several scenes combined into a single image for the user.
Things to notice
Some operations are done over and over again on data from different times. For example, processing Monday's data and Tuesday's data are independent tasks. This is an opportunity to do things in parallel (i.e. all at the same time).
Operations on one place in the data are completely independent of operations on other places. For example, processing data from WA doesn't depend on data from Tas. This is another opportunity to do things in parallel.
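That independence across times and places is what makes the work parallelisable. A minimal sketch in Python (the scene names and the per-scene function are invented placeholders for the real processing):

```python
from concurrent.futures import ThreadPoolExecutor

def process_scene(scene):
    # Placeholder for the real per-scene work (calibrate, remap, ...).
    # Each call is independent of every other, so they can run concurrently.
    return f"{scene}:done"

scenes = [f"scene_{day:03d}" for day in range(1, 8)]  # one scene per day

with ThreadPoolExecutor(max_workers=4) as pool:       # 4 concurrent workers
    results = list(pool.map(process_scene, scenes))
```

For genuinely CPU-bound science code you would swap ThreadPoolExecutor for ProcessPoolExecutor (or a batch queue on a cluster); the structure of the driver is the same.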
12th ARSPC - Fremantle
Note: this general pattern is often referred to as a Hadoop or map-reduce system, and there are software frameworks that formalise it, e.g. it lies behind Google search indexing. (Disclaimer: I've never used one.)
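The essence of the map-reduce pattern can be shown in a few lines. A toy illustration (the dates and pixel counts here are invented): independent "map" steps emit (key, value) pairs, and a "reduce" step combines values that share a key.

```python
from collections import defaultdict

# Invented inputs: (date, cloud-free pixel count) per scene.
scenes = [("2013-01-01", 5), ("2013-01-01", 3), ("2013-01-02", 7)]

def map_scene(scene):
    # "Map": each scene is processed independently, emitting (key, value).
    date, cloud_free_pixels = scene
    return (date, cloud_free_pixels)

# "Reduce": combine all values that share a key (here, sum per date).
totals = defaultdict(int)
for key, value in map(map_scene, scenes):
    totals[key] += value
```

Frameworks like Hadoop add the machinery to run the map calls on many machines and shuffle the pairs to the reducers, but the logical shape is just this.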
So what?
Our previous example: 10 yrs x 3000 scenes/yr @ 10 mins/scene = 5000 hrs = 30 weeks.
Give me 200 CPUs = 25 hours.
But what about the data flux (scenes are ~0.5 GB each)?
15 TB / 30 weeks = 3 GB/hour
15 TB / 25 hours = 600 GB/hour
The problem is transformed from compute-bound to I/O-bound.
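The arithmetic behind those figures is worth spelling out, since it is the same back-of-envelope calculation you should do for any archive before committing to a processing run:

```python
scenes = 10 * 3000                 # 10 years at 3000 scenes/year
minutes_per_scene = 10

total_hours = scenes * minutes_per_scene / 60        # serial compute time
wall_hours_200cpu = total_hours / 200                # perfectly parallel on 200 CPUs

archive_tb = 15
gb_per_hour_serial = archive_tb * 1000 / total_hours         # data flux, serial
gb_per_hour_parallel = archive_tb * 1000 / wall_hours_200cpu # data flux, parallel
```

Parallelism divides the wall-clock time by 200, but it multiplies the required data rate by the same factor, which is exactly why the bottleneck moves from CPU to I/O.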
Key tradeoff #1: can you supply data fast enough to make the most of your computing?
How much effort you put into this depends on:
How big your data set is
How much computing you have available
How many times you have to do it
How soon you need your result
Figuring out how to balance data organisation and supply against time spent computing is key to getting the best results. Unless you have an extraordinarily computationally intensive algorithm, you're (usually) better off focusing on steps to speed up data.
Computing Clusters
Workstation: 2 CPUs (15 weeks)
My first (& last) cluster (2002): 20 CPUs (1.5 weeks)
NCI (now obsolete): 20000 CPUs (20 mins)
Plumbing & Software
Somehow we have to connect data to operations.
Operations: atmosphere correction, remap, calibrate, mycleveralgorithm. These might be pre-existing packages or your own special code (Fortran, C, Python, Matlab, IDL).
Connect: provide the right data to the right operation and collect the results. Usually you will use a scripting language, since you need to work with the operating system, run programs, analyse file names, and maybe read log files to see if something went wrong.
Software for us is like glassware in a chem lab: a specialised setup for our experiments. You can get components off the shelf, but only you know how you want to connect them together.
Bottom line: you're going to be doing some programming of some sort.
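The "glue" can be very simple. A sketch of a driver that chains operations together, with stub functions standing in for the real packages or programs (in practice each step would run an external program and check its log):

```python
def calibrate(data):
    # Stand-in for an existing calibration package.
    return f"cal({data})"

def remap(data):
    # Stand-in for a reprojection/remapping tool.
    return f"map({data})"

def my_clever_algorithm(data):
    # Stand-in for your own special code.
    return f"alg({data})"

# The "glassware": your particular arrangement of off-the-shelf parts.
PIPELINE = [calibrate, remap, my_clever_algorithm]

def run_pipeline(raw):
    data = raw
    for step in PIPELINE:
        data = step(data)   # real glue would invoke a program and collect output
    return data

result = run_pipeline("L0_scene")
```

The point of writing it this way is that the pipeline definition is one list: reordering steps, dropping one, or inserting a new one is a one-line change.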
Scientific Programming versus Software Engineering (Key Tradeoff #2)
Do you want to do this processing only once, or many times?
Which parts of your workflow are repeated, and which are one-off? E.g. base processing many times, followed by one-off analysis experiments.
How does the cost of your time spent programming compare with the availability of computing and time spent running your workflow? Why spend a week making something twice as fast if it already runs in two days? (Maybe because you need to do it many times?)
Will you need to understand it later?
Proprietary fly in the ointment (#1)
If you use licensed software (IDL, Matlab etc.) you need licences for each CPU you want to run on. This may mean you can't use anything like as much computing as you otherwise could.
These languages are good for prototyping and testing, but to really make the most of modern computing you need to escape the licensing encumbrance, i.e. migrate to free software.
PS: Windows is licensed software.
Example: we have complex IDL code that we run on a big data set at the NCI. We have only 4 licences, so it runs in a week (6 days). With 50 licences it would run in 12 hours. We can live with that, since there would be weeks and weeks of coding and testing to port it to Python.
How to do it
Maximise performance
1. Minimise the amount of programming you do
Exploit existing tools (e.g. standard processing packages, operating system commands)
Write things you can re-use (data access, logging tools)
Choose file names that make it easy to figure out what to do
Use the file-system as your database
2. Maximise your ability to use multiple CPUs
Eliminate unnecessary differences (e.g. data formats, standards)
Look for opportunities to parallelise
Avoid licensing (e.g. proprietary data formats, libraries, languages)
3. Seek data movement efficiency everywhere
Data layout
Compression
RAM disks
4. Minimise the number of times you have to run your workflow
Log everything (so there is no uncertainty about whether you did what you think you did)
RAM disks
Tapes are slow; disks are less slow; memory is even less slow; cache is fast but small.
TAPE -> DISK -> RAM -> CPU cache
Most modern systems have multiple GB of RAM for each CPU, which you can assign to working memory and as a virtual disk.
If you have multiple processing steps which need intermediate file storage, use a RAM disk. You can get a factor of 10 improvement.
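On most Linux systems /dev/shm is already a RAM-backed (tmpfs) mount, so intermediate files can go there with no setup at all. A sketch, falling back to the ordinary temp directory where /dev/shm doesn't exist:

```python
import os
import tempfile

# /dev/shm is a tmpfs (RAM-backed) mount on most Linux systems; writing
# intermediate products there avoids a round trip through the disk.
scratch = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()

path = os.path.join(scratch, "intermediate.dat")
with open(path, "wb") as f:
    f.write(b"\x00" * 1024)       # stand-in for an intermediate product

size = os.path.getsize(path)
os.remove(path)                   # clean up promptly: RAM-disk space is scarce
```

The code that produces and consumes the intermediate file needs no changes at all; only the directory name differs, which is why this is such a cheap optimisation.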
Compression
Data that is half the size takes half as long to move (but then you have to uncompress it; still, CPUs are faster than disks).
Zip and gzip will usually get you a factor of 2-4 compression. Bzip2 is often 10-15% better, BUT it is much slower (factor of 5).
Don't store random precision (3.14 compresses more than 3.1415926).
Avoid recompressing: treat the compressed archive as read-only, i.e. copy-uncompress-use-delete, DO NOT move-uncompress-use-recompress-move-back.
Remote disk (file.gz) -> RAM (file) -> CPU (decompression)
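The "don't store random precision" point is easy to demonstrate: the same numbers written at 2 decimal places compress far better than at 7, because the low-order digits are effectively random. A small experiment (the sine-wave data is invented for illustration):

```python
import gzip
import math

values = [math.sin(i / 100) for i in range(10000)]

# Same data, stored at two different precisions, as ASCII text.
full = "\n".join(f"{v:.7f}" for v in values).encode()   # 7 decimal places
short = "\n".join(f"{v:.2f}" for v in values).encode()  # 2 decimal places

full_gz = len(gzip.compress(full))
short_gz = len(gzip.compress(short))
```

The rounded version starts smaller and compresses much further, because gzip finds long runs of repeated lines; the 7-decimal version carries digits that look like noise to the compressor.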
Data Layout
Look at your data access patterns and organise your code/data to match.
E.g. 1: if your analysis uses multiple files repeatedly, reorganise the data so you reduce the number of open and close operations.
E.g. 2: big files tend to end up as contiguous blocks on a disk, so try to localise access to the data, rather than jumping around, which means waiting for the disk.
[Figure: access by row versus access by column]
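The row-versus-column picture comes down to how a 2-D grid is flattened into the 1-D stream of bytes on disk. A sketch with a small row-major grid (the grid values are invented):

```python
# A 2-D grid flattened row-major into one list, as it would sit on disk.
ROWS, COLS = 4, 6
flat = list(range(ROWS * COLS))

def read_row(r):
    # One contiguous slice: a single sequential read if this lived on disk.
    return flat[r * COLS:(r + 1) * COLS]

def read_col(c):
    # A strided gather: one seek per row if this lived on disk.
    return flat[c::COLS]
```

Both functions return the same amount of data, but in a row-major file the column read touches ROWS separate locations. If your analysis is column-wise, store the data column-major (or in chunks) instead; match the layout to the access pattern, not the other way around.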
Data Formats (and metadata)
This is still a religious subject; factors to consider:
Avoid proprietary formats (which may need licences, or libraries for undocumented formats) in favour of open formats that are publicly documented
Self-contained: keep header (metadata) and data together
Self-documenting: formats with structure that can be decoded using only information already in the file
Architectural independence: will work on different computers
Storage efficiency: binary versus ASCII
Access efficiency and flexibility: support for different layouts
Interoperability: openness and standards conformance = reuse
Need some conventions around metadata for consistency
Automated metadata harvest (for indexing/cataloguing)
Longevity (& migration)
Answer: use netCDF or HDF (or maybe FITS in astronomy).
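The binary-versus-ASCII storage-efficiency point is concrete: a double is 8 bytes in binary, but its text form is typically 9-11 characters plus a separator. A quick comparison using only the standard library (the values are invented):

```python
import struct

values = [i * 0.1 for i in range(10000)]

# ASCII: one formatted number per line.
ascii_bytes = "\n".join(f"{v:.6f}" for v in values).encode()

# Binary: IEEE 754 doubles, 8 bytes each ("d" format, native byte order).
binary_bytes = struct.pack(f"{len(values)}d", *values)
```

Real formats like netCDF and HDF store binary but add the self-describing header, so you keep the compactness without losing the metadata.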
The file-system is my database
Often in your multi-step processing of 1000s of files you will want to use a database to keep track of things. DON'T!
Every time you do something, you have to update the DB.
It doesn't usually take long before inconsistencies arise (e.g. someone deletes a file by hand).
Databases are a pain to work with by hand (SQL syntax, forgettable rules).
Use the file-system (folders, filenames) to keep track. E.g.:
Once file.nc has been processed, rename it to file.nc.done, and just have your processing look for files matching *.nc. (Rename it back to file.nc to run it again; use ls or dir to see where things are up to, and rm to get rid of things that didn't work.)
Create zero-size files as breadcrumbs: touch file.nc.fail.step2, then ls *.fail.* to see how many failures there were and at what step.
Use directories to group data that need to be grouped, for example all files for a particular composite.
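The same rename-and-breadcrumb bookkeeping works identically from a script. A self-contained sketch using a throwaway directory (the file names are invented):

```python
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

# Simulate an archive of input files.
for day in ("20130101", "20130102", "20130103"):
    (workdir / f"scene.{day}.nc").touch()

# "Process" the first file: rename it so it no longer matches *.nc.
done = workdir / "scene.20130101.nc"
done.rename(workdir / (done.name + ".done"))

# Record a failure at step 2 with a zero-size breadcrumb file.
(workdir / "scene.20130102.nc.fail.step2").touch()

# The file-system now answers "what is left to do?" and "what failed, where?"
todo = sorted(p.name for p in workdir.glob("scene.*.nc"))
failures = sorted(p.name for p in workdir.glob("*.fail.*"))
```

No database, no schema, and the state survives crashes for free: it *is* the files.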
Filenames are really important
Filenames are a good place to store metadata relevant to the processing workflow:
They're easy to access without opening the file.
You can use file-system tools to select data.
Use YYYYMMDD (or YYYYddd) for dates in filenames; then they will automatically sort into time order (cf. DDMMYY or DDmonYYYY, which won't).
Make it easy to get metadata out of file names:
Fixed-width numerical fields (F1A.dat, F10B.dat, F100C.dat is harder to interpret by program than F001A.dat, F010B.dat, F100C.dat)
Structured names, but don't go overboard!
D-20130812.G-1455.P-aqua.C-20130812172816.T-d000000n274862.S-n.pds
E.g. ls *.G-1[234]* to choose files at a particular time of day.
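Both points, sorting and extraction, are one-liners once the naming convention is fixed. A sketch with hypothetical filenames following a simple date.time pattern (the names and pattern are invented for illustration, not the real convention above):

```python
import re

# Hypothetical filenames with YYYYMMDD dates and HHMM times.
names = [
    "obs.20130812.1455.nc",
    "obs.20120101.0300.nc",
    "obs.20130101.2200.nc",
]

# Because the date field is YYYYMMDD, a plain string sort is a time sort.
in_time_order = sorted(names)

# Fixed-width, structured fields pull out cleanly with a regular expression.
pattern = re.compile(r"obs\.(?P<date>\d{8})\.(?P<time>\d{4})\.nc")
meta = pattern.match(names[0]).groupdict()
```

Had the dates been DDMMYY, the sort would have interleaved years and the regex would still work, but every consumer would need date re-parsing logic; YYYYMMDD gives you ordering for free.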
Logging and Provenance
Every time you do something (move data, feed it to a program, put it somewhere), write a time-stamped message to a log file.
Write a function that automatically prepends a timestamp to a piece of text you give it.
Time-stamps are really useful for profiling: identifying where the bottlenecks are, or figuring out if something has gone wrong.
Huge log files are a tiny marginal overhead.
Make them easy to read by program (e.g. grep).
Make your processing code report a version (number or description), and its inputs, to the log file.
Write the log file into the output data file as a final step.
This lets you understand what you did months later (so you don't do it again), and keeps the relevant log file with the data (so you don't lose it, or mix it up).
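The timestamp-prepending function mentioned above is a few lines; one possible shape (the fixed-width ISO timestamp keeps the log easy to grep and to sort):

```python
from datetime import datetime, timezone

def log(message, logfile=None):
    """Prepend a UTC timestamp to message; optionally append it to logfile."""
    stamped = f"{datetime.now(timezone.utc):%Y-%m-%dT%H:%M:%SZ} {message}"
    if logfile is not None:
        with open(logfile, "a") as f:   # append-only: never clobber history
            f.write(stamped + "\n")
    return stamped

line = log("remap: started scene.20130812.nc")
```

Because every line starts with a fixed-width timestamp, grep selects by step name and the gaps between consecutive timestamps profile the workflow for free.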
Final Thoughts
Most of this is applicable to other data-intensive parallel processing tasks, e.g. spatio-temporal model output grids. The advantages may vary depending on file size.
Data organisation has many subtleties; a little work in understanding it can offer great returns in performance.
Keep an eye on file format capabilities.
More CPUs is a double-edged sword: data efficiency will only become more important.
Haven't really touched on spatial metadata (very important for ease of end-use/analysis, but tedious (= automatable)).
Get your data into a self-documenting, machine-readable, open file format and you'll never have to reformat by hand again.
These are things we now do out of habit because they work for us. Perhaps they'll work for you?
Thank you
Edward King
Team Leader: Remote Sensing & Software, Marine & Atmospheric Research
t +61 3 6232 5334
e edward.king@csiro.au
w www.csiro.au/cmar