Big measurements and data analysis, challenges and some ideas


1 Big measurements and data analysis, challenges and some ideas. University of Vaasa, March 9th, 2017

2 What is big data? Big data is data sets that are too large and complex to be worked on with commonly available tools [Snijders et al., 2012].

3 Is big data useful?
1. Soccer World Championships
2. Brexit
3. US presidential election
4. Climate change research
5. WindSoMe project (wind turbine noise)

4 Social network analysis

5 Characteristics of big data

6 Data driven science

7 WindSoMe research project [Diagram labels: HMS, ambient conditions, disturbance response, noise control, objective feedback, subjective feedback]

8 Driver for Big Data: Machine Learning

9 What does the data know
The data can be used for predicting a variable y by learning a function f:

$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}
= f\left(
\begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & x_{m3} & \cdots & x_{mn}
\end{bmatrix}
\right) \qquad (1)
$$

where X is the data matrix with n variables and m samples, and y is a result column vector with m values. If y is discrete, the function f is a classifier; when y is continuous, f is a regressor. If y is known, the problem is a supervised learning case; otherwise it is unsupervised. Sometimes it is useful to explore X by itself.

10 Algorithms
1. PCA: removes the correlation between the variables and reduces the dimensionality, making further processing more manageable.
2. PLS: a multivariate regression method.
3. Neural networks: for multivariate classification and regression.
4. SVM (SVC for classification, SVR for regression): a multivariate classification and regression tool.
5. NBC (naive Bayes classifier): very fast to train and predict. It supports training with partial data, which is very useful when the whole data set cannot fit in memory at the same time. Available in scikit-learn.
6. Decision trees (white box): can be combined into ensemble classifiers with boosting or bagging (AdaBoost, random forest); see the scikit-learn manual. Can be used for feature selection as well.
7. L1 regularization to simplify the models: Lasso, LARS.
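Training a naive Bayes classifier with partial data can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the talk: the class name `IncrementalGaussianNB`, the Gaussian likelihood, the toy batches and the labels "quiet" and "noisy" are all hypothetical. In practice, scikit-learn's `GaussianNB.partial_fit` offers the same kind of batch-wise training.

```python
import math
from collections import defaultdict


class IncrementalGaussianNB:
    """Gaussian naive Bayes whose statistics are updated one batch at a time."""

    def __init__(self):
        # Per class: sample count, per-feature running mean, and per-feature
        # running sum of squared deviations (Welford's algorithm).
        self.count = defaultdict(int)
        self.mean = {}
        self.m2 = {}

    def partial_fit(self, X, y):
        """Update the model with one batch; earlier batches are not needed."""
        for row, label in zip(X, y):
            if label not in self.mean:
                self.mean[label] = [0.0] * len(row)
                self.m2[label] = [0.0] * len(row)
            self.count[label] += 1
            n = self.count[label]
            for j, value in enumerate(row):
                delta = value - self.mean[label][j]
                self.mean[label][j] += delta / n
                self.m2[label][j] += delta * (value - self.mean[label][j])
        return self

    def predict(self, X):
        total = sum(self.count.values())
        predictions = []
        for row in X:
            best_label, best_score = None, -math.inf
            for label, n in self.count.items():
                # Log prior plus the sum of per-feature Gaussian log-likelihoods.
                score = math.log(n / total)
                for j, value in enumerate(row):
                    var = self.m2[label][j] / n + 1e-9  # avoids division by zero
                    score += -0.5 * math.log(2 * math.pi * var)
                    score += -((value - self.mean[label][j]) ** 2) / (2 * var)
                if score > best_score:
                    best_label, best_score = label, score
            predictions.append(best_label)
        return predictions


# Two hypothetical batches that never need to be in memory at the same time.
batch_1 = ([[1.0, 2.1], [0.9, 1.9], [5.0, 8.1]], ["quiet", "quiet", "noisy"])
batch_2 = ([[1.1, 2.0], [5.2, 7.9], [4.8, 8.0]], ["quiet", "noisy", "noisy"])

model = IncrementalGaussianNB()
model.partial_fit(*batch_1)
model.partial_fit(*batch_2)
predicted = model.predict([[1.0, 2.0], [5.0, 8.0]])
print(predicted)  # ['quiet', 'noisy']
```

Only running counts, means and squared deviations are kept per class, so the memory needed depends on the number of classes and features rather than on the number of samples.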

11 Tools for machine learning
1. Python + Scientific Python + scikit-learn + pandas
   - A genuinely object-oriented, versatile programming language that is easy to learn
   - Good support for numerical computation
   - Many well-documented libraries for machine learning
   - Can work with Hadoop MapReduce
2. R
   - New algorithms are usually available as libraries
   - Strongest in statistics
   - Can work with Hadoop
3. MATLAB / Octave
   - Well suited to numerical computation
   - MATLAB can work with Hadoop and Spark
The interface to all of them is a programming language with support for interactive operation. Hadoop integration should be studied further.

12 Tools of Big data

13 Challenges of big data
1. Storage
2. Data transfer
3. CPU capacity: the amount of stored data increases much faster than CPU capacity, which still follows Moore's law [Kambatla et al., 2014].
4. Preprocessing, transformation and feature extraction: how to make processing faster by calculating and selecting features smartly.
5. Energy consumption: energy-saving technologies develop much more slowly than data processing needs grow. In the future, energy saving may become an important issue in algorithm development.

14 Solutions
1. Speed: simple algorithms, dimensionality reduction, feature selection, sample selection.
2. Transfer: distributed processing near the data (Hadoop).
3. CPU capacity: computer farms, CSC, cloud computing, supercomputers.
4. Data parallelism: if the same operation is applied to many data items, the task can simply be distributed across the computer cluster. Data storage or transfer may still be the bottleneck.

15 MapReduce
A common method for distributing computation over the network in big data systems:
1. The central node is ordered to compute a task.
2. The central node maps the task to the client nodes (Map).
3. The client nodes may in turn distribute the task further, hierarchically.
4. The results (key-value pairs) are returned to the parent, which combines them (Reduce).
MapReduce suits textual data best and is less often applied in data-intensive science, possibly because scientific data sets are complex and high-dimensional [Buck et al., 2011].
Hadoop uses MapReduce to enable distributed computing on large data sets in their existing locations. It also includes many other tools, such as the resource manager YARN and the distributed storage system HDFS.
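The Map and Reduce steps described above can be illustrated with a word count, the classic textual example. This is a single-process sketch that only simulates the distribution; the function names `map_phase`, `reduce_phase` and `map_reduce` and the sample chunks are hypothetical and are not part of the Hadoop API.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_phase(pairs):
    """Reduce: combine the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def map_reduce(chunks):
    # In a real cluster each chunk would be mapped on the node that stores it;
    # here the "nodes" are simulated by a plain loop.
    all_pairs = []
    for chunk in chunks:
        all_pairs.extend(map_phase(chunk))
    return reduce_phase(all_pairs)


# Hypothetical text chunks, one per simulated client node.
chunks = ["wind turbine noise", "turbine noise level", "noise"]
word_counts = map_reduce(chunks)
print(word_counts)  # {'wind': 1, 'turbine': 2, 'noise': 3, 'level': 1}
```

Because each chunk is mapped independently, only the small key-value pairs have to travel over the network, not the raw data.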

16 Storage
- Hard disks, DAS/NAS: the cheapest option (WD Blue WD40EZRZ 4 TB hard disk, 165 €, about 41 €/TB). Reliability and volume requirements can be met with RAID and logical volume solutions.
- Solid-state memory: the fastest option, at roughly 10x the cost and 10x the speed.
- Cloud storage: flexible, but access is slower and it can become more expensive.
- CSC: workspace quota of 5120 GB, no backups!
- Distributed file system: HDFS (still needs physical devices).
Ideal storage system? 1) Tolerance of network partitioning, 2) consistency, 3) high availability. According to Brewer's CAP theorem, you can have only two of the three [Brewer, 2000].

17 NoSQL storage Centralized databases cannot handle really big data sets -> use distributed databases, such as NoSQL. Source: Apache Hadoop

18 CPU capacity: Other solutions
Solution and unit price:
- HPE ProLiant tower, 1x Xeon E: 250 €/core
- Rack server computer, 2x Xeon E: 187 €/core
- Amazon on-demand m4.large: 1.4 €/coreday
- Amazon on-demand m4.4xlarge: 1.4 €/coreday
- Amazon on-demand m4.16xlarge: 1.4 €/coreday
- Amazon Hadoop cluster, 10 cores, > 0.15 €/h: 0.36 €/coreday
If you can use more than about 150 coredays for each core of your server during its lifetime, owning it will be cheaper than leasing from the cloud. Of course, a server needs service and energy, and is not as flexible as on-demand cloud capacity. It is better for the organization to purchase CPU power for shared use than for a project to purchase it for private use.
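The break-even figure of about 150 coredays can be checked from the unit prices quoted on this slide (250 and 187 euros per purchased core, 1.4 euros per on-demand coreday). The sketch below is only that arithmetic; the function name is hypothetical, and service and energy costs are deliberately left out, so the real break-even point is later.

```python
def break_even_coredays(purchase_price_per_core, cloud_price_per_coreday):
    """Coredays of use after which an owned core is cheaper than a leased one."""
    return purchase_price_per_core / cloud_price_per_coreday


cloud_price = 1.4  # EUR per coreday, on-demand figure from the slide
tower = break_even_coredays(250, cloud_price)  # HPE ProLiant tower
rack = break_even_coredays(187, cloud_price)   # rack server

print(f"tower: {tower:.0f} coredays, rack: {rack:.0f} coredays")
# tower: 179 coredays, rack: 134 coredays
```

The two results bracket the "about 150 coredays" quoted on the slide.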

19 CSC CSC: CPU cores + GPU cluster

20 Cloud based big data systems [Hashem et al., 2015]

21 Cases

22 Case: Climate change research
MERRA Analytic Services: Meeting the Big Data challenges of climate science through cloud-enabled Climate Analytics-as-a-Service [Schnase et al., 2017].
- NASA's climate change repositories alone are projected to grow to 350 petabytes by 2013.
- The data sets themselves cannot be moved; instead, analytical operations need to migrate to where the data reside.
- The ability to respond quickly to customer demands for new and often unanticipated uses of climate data requires greater agility in building and deploying applications.
- The data is stored in the Hadoop HDFS distributed file system.
- The data is originally stored in NetCDF format and converted for storage in HDFS.
- The interface to other users is provided through RESTful web services.

23 WindSoMe research project [Diagram labels: HMS, ambient conditions, disturbance response, noise control, objective feedback, subjective feedback]

24 Purpose of the project
- Measure sound emission and immission and weather conditions, and collect dweller feedback, over a whole year.
- Develop suitable noise metrics to evaluate both the level and the characteristics of the sound.
- Assess dweller attitudes and opinions with questionnaires.
- Emission, immission and weather measurements are used to calculate sound propagation in all weather conditions.
- Real-time noise feedback is used to find out why noise is annoying: level, AM, tonality, or other reasons.
- A survey is made to study background reasons and general opinions.
- Tonality, amplitude modulation and other characteristics of the sound will be studied.
- The measurement, identification and prevention of mechanical wind turbine noise will be studied.
- Disseminate research results to the public.

25 Sound data measurement infrastructure

26 Data storage infrastructure
- The data is stored in HDF5 files in a filesystem, together with the metadata related to data acquisition: sampling rate, sampling time, sampling location, etc.
- HDF5 can be used easily from Python, R, and MATLAB/Octave. It supports transparent compression of data.
- The metadata and calculated features are stored in an SQL database.
- SQLite files are used for exchanging structured data.
- Some of the data and calculated features are accessible from the WSAA web system, which also provides a RESTful API.
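Keeping acquisition metadata and calculated features in an SQL database, next to the HDF5 files that hold the raw samples, could look roughly like the sketch below. The project's actual schema is not given in the slides, so the table names, column names, file path and values are all hypothetical; only Python's standard `sqlite3` module is used.

```python
import sqlite3

# In-memory database for the sketch; a file path would be used in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE recording (
        id INTEGER PRIMARY KEY,
        hdf5_path TEXT NOT NULL,        -- where the raw samples live
        location TEXT NOT NULL,
        start_time TEXT NOT NULL,       -- ISO 8601 timestamp
        sampling_rate_hz REAL NOT NULL
    )
    """
)
conn.execute(
    """
    CREATE TABLE feature (
        recording_id INTEGER REFERENCES recording(id),
        name TEXT NOT NULL,
        value REAL NOT NULL
    )
    """
)

# Hypothetical metadata for one recording.
cur = conn.execute(
    "INSERT INTO recording (hdf5_path, location, start_time, sampling_rate_hz) "
    "VALUES (?, ?, ?, ?)",
    ("station_a/2017-03-09.h5", "station_a", "2017-03-09T00:00:00", 44100.0),
)
recording_id = cur.lastrowid

# Hypothetical calculated feature attached to that recording.
conn.execute(
    "INSERT INTO feature (recording_id, name, value) VALUES (?, ?, ?)",
    (recording_id, "mean_level_dba", 41.5),
)
conn.commit()

rows = conn.execute(
    "SELECT r.location, r.sampling_rate_hz, f.name, f.value "
    "FROM recording AS r JOIN feature AS f ON f.recording_id = r.id"
).fetchall()
print(rows)  # [('station_a', 44100.0, 'mean_level_dba', 41.5)]
```

With this split, features can be queried and exchanged as small SQLite files without opening the much larger HDF5 files.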

27 Installation of measurement stations

28 WindSoMe machine learning case
Processing chain: Acquisition -> Transfer -> Preprocessing (scaling + FFT) -> Feature extraction (PCA) -> Prediction (mean shift clustering)
- FFT provides features for recognizing sound sources.
- Replace PCA with incremental PCA.
- Large-window data analysis is needed to capture slow processes -> additional features, such as an AM modulation index and a wind shield noise index.
- Can feature selection eliminate the need for a full FFT and get rid of PCA?
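How a spectrum yields features for recognizing sound sources can be illustrated with a synthetic tone. The sketch below uses a naive O(N^2) discrete Fourier transform in plain Python as a stand-in for the FFT, which computes the same values faster. The sampling rate, tone frequency and function name are hypothetical and far from real measurement settings.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude spectrum via a naive O(N^2) DFT, non-negative frequencies only."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(total) / n)
    return magnitudes


sampling_rate = 64   # Hz, hypothetical; real recordings use far higher rates
n_samples = 64       # one second of signal
tone_hz = 8          # hypothetical tonal component
signal = [
    math.sin(2 * math.pi * tone_hz * t / sampling_rate) for t in range(n_samples)
]

spectrum = dft_magnitudes(signal)
# The strongest bin is a simple tonality feature.
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * sampling_rate / n_samples
print(peak_hz)  # 8.0
```

The full magnitude spectrum has one value per frequency bin, which is why a dimensionality reduction step such as PCA, or a selection of a few bins, follows it in the chain.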

29 Sound pressure signal (x) [Figure: sound pressure (Pa) versus time (s)]

30 Potential features [Figure panels; axes: time (s), amplitude (Pa), amplitude (dB), angle, delay (s)]

31 Spectrogram (X)

32 Automated noise analysis (Clustering)

33 Result: Automated noise analysis [Figure: sound pressure level (dBA) versus time of day; series: original average, fixed average]

34 Conclusions
- Digitalization provides big data sets.
- Possibilities for data-driven research.
- Distributed storage and computation are needed.
- Fast algorithms and careful data selection help.
- Big data requires serious programming.
- In the WindSoMe case, analysis of big measurement data could tell which features of noise annoy people, and in which weather conditions they appear. Furthermore, it can automate environmental sound measurement and analysis, making it affordable.

35 References
Brewer, E. A. (2000). Towards robust distributed systems. In PODC, volume 7.
Buck, J. B., Watkins, N., LeFevre, J., Ioannidou, K., Maltzahn, C., Polyzotis, N., and Brandt, S. (2011). SciHadoop: Array-based query processing in Hadoop. In 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., and Ullah Khan, S. (2015). The rise of big data on cloud computing: Review and open research issues. Information Systems, 47.
Kambatla, K., Kollias, G., Kumar, V., and Grama, A. (2014). Trends in big data analytics. Journal of Parallel and Distributed Computing, 74(7).
Schnase, J. L., Duffy, D. Q., Tamkin, G. S., Nadeau, D., Thompson, J. H., Grieg, C. M., McInerney, M. A., and Webster, W. P. (2017). MERRA Analytic Services: Meeting the Big Data challenges of climate science through cloud-enabled Climate Analytics-as-a-Service. Computers, Environment and Urban Systems, 61, Part B.
Snijders, C., Matzat, U., and Reips, U.-D. (2012). "Big Data": Big Gaps of Knowledge in the Field of Internet Science. International Journal of Internet Science, 7(1):1-5.
