Comparing R and Python for PCA PyData Boston 2013


1 Vipin Sachdeva Senior Engineer, IBM Research Comparing R and Python for PCA PyData Boston 2013

2 Comparison of R and Python for Principal Component Analysis
R and Python are popular choices for data analysis. How do they compare in terms of programmer productivity and performance?
Use a common task for both R and Python: Principal Component Analysis (PCA). PCA is a very commonly used technique for dimension reduction.
Dataframes are an essential part of languages supporting data analysis. R provides the data frame along with numerous statistical packages. Python has numpy (arrays) and pandas (dataframe) for data handling, which we use.
Both languages have rich development environments: RStudio for R, IPython for Python.
Both languages have many features that help in data analysis. In this talk we compare those features with some code examples to solve our problem.
This talk is not so much about principal component analysis as about programming and performance of Python and R.
Let's get started.

3 PCA: Short Introduction
PCA is a standard tool in modern data analysis: a simple method to extract information from confusing datasets by reducing a complex dataset to a lower dimension.
PCA projects the data along the directions where the data varies the most. These directions are given by the eigenvectors corresponding to the largest eigenvalues.

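4 PCA: Mathematical approaches
Find the eigenvalues of the standardized covariance matrix.
Choose eigenvalues with sum exceeding a threshold:
\[ \frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{N} \lambda_i} > \text{Threshold} \quad (\text{e.g., } 0.9 \text{ or } 0.95) \]
Reduction in dimension from N to K: create the data with the subset of eigenvalues (whose sum exceeds that threshold):
\[ \hat{x} - \bar{x} = \sum_{i=1}^{K} b_i u_i \quad \text{or} \quad \hat{x} = \sum_{i=1}^{K} b_i u_i + \bar{x} \]
Here \lambda_i are the eigenvalues in decreasing order, u_i the corresponding eigenvectors, b_i the projection coefficients, and \bar{x} the mean.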
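The eigenvalue approach on this slide can be sketched in a few lines of NumPy. This is a minimal illustration rather than the talk's code: the data is synthetic, the names `threshold`, `U_K` and `X_hat` are assumptions made for the example, and the data is only centered (not standardized) to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (illustrative only): 500 observations of 6 correlated variables
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(500, 6))

x_bar = X.mean(axis=0)
cov = np.cov(X - x_bar, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to descending
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Smallest K whose leading eigenvalues exceed the threshold share of the total
threshold = 0.95
ratio = np.cumsum(eigvals) / eigvals.sum()
K = int(np.argmax(ratio > threshold)) + 1

U_K = eigvecs[:, :K]          # leading K eigenvectors u_i
b = (X - x_bar) @ U_K         # coefficients b_i for every observation
X_hat = b @ U_K.T + x_bar     # x_hat = sum_i b_i u_i + x_bar
```

The residual sum of squares of `X_hat` is bounded by the share of variance left out, which is why a threshold of 0.9 or 0.95 gives a close reconstruction.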

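5 PCA using Singular Value Decomposition (SVD)
A more general approach for performing PCA.
Decompose X = U D V^T. The squared singular values D*D give the eigenvalues of the covariance matrix (up to a constant factor of 1/(n-1) when X is centered).
Reconstruct the data by zeroing out the trailing regions of U, D and V beyond the first q components (the slide illustrates this with a figure).
Choose q as before.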
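The link between the SVD and the covariance eigenvalues can be checked numerically. The sketch below uses random data and assumes the columns are centered first; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)      # center each column
n = Xc.shape[0]

# Singular values come back sorted in descending order
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
eig_from_svd = d ** 2 / (n - 1)

# Eigenvalues of the sample covariance matrix, sorted to match
eig_cov = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]

same = np.allclose(eig_from_svd, eig_cov)
```

Because the factor 1/(n-1) is the same for every eigenvalue, it cancels in the proportion-of-total criterion, so d*d can be used directly when choosing q.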

6 PCA: What data to use?
How about PCA on data for the current S&P 500 stocks over a period of time?
Download the symbols from the S&P 500 website and create a vector. Use this vector to download the symbols' data from 1970 to 2012 into a dataframe (if possible).
R and Python have various packages for financial data download: quantmod (R), pandas.io.data.DataReader (Python).
We need a package that provides a single dataframe as output from a single call.
[Table: sample dataframe with a Dates column followed by one column per symbol (MMM, ABT, ...); cells are NA where data is not available.]

7 Data Download (R only)
I am a C/C++/Fortran HPC programmer, and I do use for loops in R and Python. for loops are slow in R.
Can any package return data for the S&P stocks as a single dataframe? Use the fImport package of R to download daily data:
    stocksdata <- yahooSeries(symbols_nospaces, from=" ", to=" ")  # symbols_nospaces holds the S&P stock symbols
Extract the columns with closing prices. Write them to a csv file for repeated runs (the download takes a long time).
    # Snippet of code to get closing data (runs inside a loop over i)
    colname <- paste(symbols_nospaces[i], ".Close", sep="")
    print(colname)
    stockdata_df[, i+1] <- get(colname)
    colnames(stockdata_df)[i+1] <- symbols_nospaces[i]

8 Combined Data Preparation (R and Python)
Read the file in R/Python to get the data: read.table in R creates an R data frame; pandas read_table creates a pandas dataframe.
Many symbols have NAs for dates where data is not available, so work with a subset of the data. How about 200 stocks for a quarter of a century (1988-2012)?
Find the first occurrence of 1988 in the dataframe's Dates column: str.contains("1988") in Python, agrep(..., fixed=TRUE) in R.
Extract the stock columns which do not have NA on the first trading day of 1988: !is.na in R, not math.isnan in Python.
Get 200 stocks which satisfy the above requirement. Result: combined data for 200 stocks from 1988 to 2012 in R/Python dataframes.
Drop rows with NA for any stock: na.omit() in R, dropna() in pandas. This leaves 6162 entries for 200 stocks in total.
R and Python are remarkably similar for this step.
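The preparation steps above can also be written without explicit loops. The following pandas sketch runs on a tiny made-up table; the column names (`dates`, `AAA`, `BBB`, `CCC`) and values are assumptions for illustration and are not the S&P data used in the talk.

```python
import numpy as np
import pandas as pd

# Tiny made-up table standing in for the downloaded stock data
df = pd.DataFrame({
    "dates": ["1987-12-31", "1988-01-04", "1988-01-05", "1988-01-06"],
    "AAA": [10.0, 10.5, 10.4, 10.9],
    "BBB": [np.nan, np.nan, 5.0, 5.1],    # no data on the first trading day of 1988
    "CCC": [20.0, 20.2, np.nan, 20.1],    # gap inside the period
})
lowyear, numstocks = 1988, 2

# Row label of the first date in the start year, then everything from there on
start = df["dates"].str.startswith(str(lowyear)).idxmax()
period = df.loc[start:]

# Keep the first `numstocks` columns that have data on that first trading day
stock_cols = period.columns.drop("dates")
has_data = period.loc[start, stock_cols].notna()
keep = stock_cols[has_data.to_numpy()][:numstocks]

# Drop any remaining rows where a selected stock is missing
subset = period[["dates", *keep]].dropna()
```

In this toy table `BBB` is rejected because it has no value on the first 1988 row, and the row with the `CCC` gap is removed by `dropna()`.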

9 Code in R and Python for data preparation
Python code
    import math
    import pandas as pd

    def extractdata(filename, lowyear, numstocks):
        # Whitespace-separated file; the slides address the date column as '"dates"' (quotes included)
        stocksreturns = pd.read_table(filename, sep=r'\s+')
        x = stocksreturns['"dates"'].str.contains(str(lowyear))
        # Find the first row whose date contains lowyear
        for i in range(len(x)):
            if x[i]:
                break
        startindex = i
        colnames = stocksreturns.columns
        stocksreturns_short = pd.DataFrame(stocksreturns['"dates"'][startindex:])
        stockindex = 0
        for i in range(1, 501):
            if stockindex < numstocks:
                # Keep only stocks that have data on the first trading day of lowyear
                if not math.isnan(stocksreturns[colnames[i]][startindex]):
                    stocksreturns_short[colnames[i]] = stocksreturns[colnames[i]][startindex:]
                    stockindex = stockindex + 1
        return stocksreturns_short
R code
    extractdata <- function(filename, yearrange, numstocks) {
      stocksreturns <- read.table(filename, header=TRUE)
      yrpattern <- paste(yearrange[1], "-*", sep="")
      stockdates <- stocksreturns$dates
      sindex <- agrep(yrpattern, stockdates, fixed=TRUE)[1]  # first row of the start year
      eindex <- length(stockdates)
      colnames <- colnames(stocksreturns)
      stocksreturns_short <- data.frame(stockdates[sindex:eindex])
      colnames(stocksreturns_short)[1] <- "dates"
      j <- 2
      stockindex <- 0
      for (i in 2:501) {
        if (stockindex < numstocks) {
          # Keep only stocks that have data on the first trading day
          if (!is.na(stocksreturns[, i][sindex])) {
            stocksreturns_short[, j] <- stocksreturns[, i][sindex:eindex]
            colnames(stocksreturns_short)[j] <- colnames[i]
            j <- j + 1
            stockindex <- stockindex + 1
          }
        }
      }
      stocksreturns_short
    }

10 R/Python packages for PCA
The size of the data is only about 23 MB in txt format, so there are no memory-bound issues running on my laptop.
R has many choices (one too many) for PCA: prcomp/princomp/pca/dudi.pca/acp. prcomp can scale and center the data (very convenient):
    prcomp(stocksreturns_short, scale=TRUE, center=TRUE, retx=TRUE)
Reconstruct the data with the predict function. prcomp uses svd beneath the covers.
Python has several choices for PCA as well: matplotlib.mlab.PCA, MDP (Modular toolkit for Data Processing) PCA, and numpy.linalg.eig/scipy.linalg.eig for the covariance matrix/eigenvalues/eigenvectors approach.
Both languages seem to have adequate support for PCA in multiple ways.
Our approach: use SVD in both R and Python, do the same operations and compare runtimes.
svd in numpy returns transpose(V), while R returns V. Both R and Python return d as a vector; it is trivial to make a diagonal matrix for reconstruction of the data.
Indexing starts from 0 in Python and from 1 in R :)
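The NumPy conventions mentioned on this slide (a 1-D vector of singular values and an already transposed V) can be seen in a short sketch. The matrix here is random and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))

# numpy returns the singular values as a 1-D array and V already transposed
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Build the diagonal matrix explicitly, as one would with diag(d) in R
X_rebuilt = U @ np.diag(d) @ Vt

# Equivalent without forming the diagonal matrix: scale the columns of U by d
X_rebuilt_scaled = (U * d) @ Vt
```

In R the same product would need `t(v)`, since `svd()` there returns V rather than its transpose.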

11 PCA on combined data using SVD
PCA is an SVD operation: X is the stocks data (6162x200), and D*D gives the eigenvalues (p=200).
Reconstruct the data by zeroing out the regions of U, D and V beyond the first q components (the slide illustrates this with a figure).
Choose q (explained ahead).

12 PCA on combined data: Approach
Perform an SVD of the stock returns data.
Find the number of eigenvalues q comprising 50%, 75%, 90% and 100% of the sum of all the eigenvalues (eigenvalues = d*d from the SVD).
Zero out the remaining eigenvectors/eigenvalues. In Python, use copy.copy for copying the eigenvalues/vectors from the SVD (plain assignment only creates another reference to the same array).
Reconstruct the data with matrix-multiply operations: X_reconstructed = U*D*t(V).
Measure std_dev(data_reconstructed - original_data).
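The approach above can be condensed into one small function. This is a sketch on synthetic data, not the talk's implementation: it slices the first q components instead of copying and zeroing the full matrices (the products are mathematically the same), and the helper name `reconstruct` is hypothetical.

```python
import numpy as np

def reconstruct(X, proportion):
    """Rebuild X from the fewest leading components reaching `proportion` of sum(d*d)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    eigvals = d * d
    cum = np.cumsum(eigvals) / eigvals.sum()
    # First index where the cumulative share reaches the target; clamp for rounding at 100%
    q = min(int(np.searchsorted(cum, proportion)) + 1, len(d))
    X_q = (U[:, :q] * d[:q]) @ Vt[:q, :]
    return q, X_q

rng = np.random.default_rng(3)
# Synthetic data whose columns have decreasing spread
X = rng.normal(size=(200, 10)) * np.linspace(3.0, 0.1, 10)

summary = []
for p in (0.5, 0.75, 0.90, 1.0):
    q, X_q = reconstruct(X, p)
    summary.append((p, q, float(np.std(X_q - X))))   # std of the reconstruction error
```

Each entry of `summary` holds the proportion, the number of components kept, and the standard deviation of the difference between the reconstructed and original data, which is the error measure named on the slide.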

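13 Code in Python for PCA on combined data
    import copy
    import numpy as np

    eigvalspercent = [0.5, 0.75, 0.90, 1.0]  # proportions listed on the approach slide
    colnames = stocksreturns_short.columns
    # One column of day-to-day differences per stock (column 0 of the dataframe holds the dates)
    diff_data_combined = np.zeros(shape=(stocksreturns_short.shape[0]-1, numstocks))
    for i in range(0, numstocks):
        diff_data_combined[:, i] = np.diff(stocksreturns_short[colnames[i+1]])
    u_original, d_original, v_original = np.linalg.svd(diff_data_combined, full_matrices=False)
    d_diag_original = np.diag(d_original)
    eigvals_combined = d_original * d_original
    totalsumeigvals = np.sum(eigvals_combined)
    nvals = diff_data_combined.shape[1]
    for percent in eigvalspercent:
        sumeigvals = 0
        neigvals = nvals  # fall back to all eigenvalues if rounding keeps the sum just below the target
        for i in range(0, nvals):
            sumeigvals = sumeigvals + eigvals_combined[i]
            if sumeigvals >= (percent * totalsumeigvals):
                neigvals = i + 1
                break
        # Copy the SVD results so the originals stay intact for the next proportion
        u = copy.copy(u_original)
        d_diag = copy.copy(d_diag_original)
        v = copy.copy(v_original)  # numpy returns transpose(V)
        # Zero out everything beyond the first neigvals components
        u[:, neigvals:nvals] = 0
        d_diag[neigvals:nvals, neigvals:nvals] = 0
        v[neigvals:nvals, :] = 0
        dproduct = np.dot(u, np.dot(d_diag, v))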

14 Code in R for PCA on combined data
    data.combined <- diff_stockyr
    svd.combined <- svd(data.combined)  # find SVD
    eigvalues.combined <- svd.combined$d * svd.combined$d
    totalsum <- sum(eigvalues.combined)
    proportionrange <- c(0.5, 0.75, 0.90, 1)
    for (proportion in proportionrange) {
      runsum <- 0
      neigvalues <- 0
      for (i in 1:numstocks) {
        runsum <- runsum + eigvalues.combined[i]
        neigvalues <- neigvalues + 1
        if ((runsum / totalsum) >= proportion) {
          cat(sprintf("number of eigenvalues for combined data for proportion %f = %d\n", proportion, neigvalues))
          break
        }
      }
      nvals <- dim(data.combined)[2]
      # Copy SVD matrices
      u <- svd.combined$u
      d <- diag(svd.combined$d)
      v <- svd.combined$v
      # Keep the first neigvalues components and zero out the rest
      if (neigvalues < nvals) {
        u[, (neigvalues+1):nvals] <- 0
        d[(neigvalues+1):nvals, (neigvalues+1):nvals] <- 0
        v[, (neigvalues+1):nvals] <- 0
      }
      stock.data <- u %*% d %*% t(v)  # do a matrix multiply to get the data
    }

15 PCA on combined data: results
138 components out of 200 account for 90% of the sum of all the eigenvalues.
Reconstructing the data with these 138 components gives negligible error (about 10^-5).

16 Yearly PCA
Instead of doing a PCA on the combined data from 1988 to 2012, how about a yearly PCA?
Separate the combined dataframe into yearly dataframes (one for each year). The number of observations varies from year to year.
Calculate the number of eigenvalues accounting for 50%, 75%, 90% and 100% of the sum of all eigenvalues (same operation as PCA on combined data).
Do a reconstruction for each proportion and each year (same operation as PCA on combined data).
25 separate PCAs, 100 reconstructions in total.
[Figure: the combined dataframe split into Dataframe 1 ... Dataframe 25, each with a Dates column followed by one column per symbol (MMM, ABT, ...).]
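Splitting by year and repeating the same eigenvalue count maps naturally onto a pandas group-by. The sketch below uses two years of synthetic random-walk prices with made-up column names; the helper `n_components` is hypothetical and only counts components, without the reconstruction step.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# Two years of synthetic business-day prices for five made-up symbols
dates = pd.bdate_range("2007-01-01", "2008-12-31")
prices = pd.DataFrame(
    100 + rng.normal(size=(len(dates), 5)).cumsum(axis=0),
    index=dates, columns=list("ABCDE"))

def n_components(block, proportion):
    """Number of leading components whose d*d reach `proportion` of the total."""
    diffs = np.diff(block.to_numpy(), axis=0)        # day-to-day differences
    d = np.linalg.svd(diffs, compute_uv=False)       # singular values only
    cum = np.cumsum(d * d) / np.sum(d * d)
    return min(int(np.searchsorted(cum, proportion)) + 1, len(d))

# One PCA per calendar year
yearly = {int(year): n_components(block, 0.5)
          for year, block in prices.groupby(prices.index.year)}
```

Each yearly block has its own number of observations, as noted on the slide, and the per-year counts can then be compared across years.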

17 Code in Python and R for extracting yearly dataframes
Python code
    colnames = stocksreturns_short.columns
    x = stocksreturns_short['"dates"'].str.contains(str(years))
    stocksreturns_yr = stocksreturns_short[x]  # rows belonging to the requested year
    shape0, shape1 = np.shape(stocksreturns_yr)
    data_yr = np.zeros((shape0, numstocks))
    diff_data_yr = np.zeros((shape0-1, numstocks))
    for i in range(0, numstocks):
        data_yr[:, i] = stocksreturns_yr[colnames[i+1]]
        diff_data_yr[:, i] = np.diff(data_yr[:, i])
R code
    yrpattern <- paste(years, "-*", sep="")
    yrindices <- agrep(yrpattern, stocksreturns_short$Dates, fixed=TRUE)
    val_stockyr <- data.frame(stocksreturns_short$Dates[yrindices[1]:tail(yrindices, n=1)])
    colnames(val_stockyr)[1] <- "dates"
    for (i in 2:(numstocks+1)) {
      val_stockyr[, i] <- stocksreturns_short[, i][yrindices[1]:tail(yrindices, n=1)]
      colnames(val_stockyr)[i] <- colnames(stocksreturns_short)[i]
    }
    # log_stockyr is not defined on the slide; it is assumed to be prepared elsewhere in the talk's code
    diff_stockyr <- data.frame(matrix(NA, nrow=(dim(log_stockyr)[1]-1), ncol=numstocks))
    for (j in 1:200) diff_stockyr[, j] <- diff(log_stockyr[, j])

18 Analysis of yearly PCA data Number of eigenvalues with 50% of total sum drops to 1 in 2008 Stock movement is highly correlated due to macro-economic trends.

19 PCA on yearly data
Being a C/C++/Fortran HPC programmer, I use for loops in R/Python. This is not efficient for R (the for loop is an object; assignment in R is a copy operation). Python's assignment is done with references, so it works better with for loops, with less function overhead.
Development environment: ipython-2.7 with pandas and numpy installed through the ports package; RStudio 0.97 with the R binary downloaded for Mac. No attempt was made to optimize the build for either R or Python.
The total code takes more than 20 seconds in R versus about 11.9 seconds in Python on my MacBook Pro.
Timings may change with less reliance on for loops in the code.
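As an illustration of "less reliance on for loops", the per-column differencing used in the talk's code can be replaced by a single NumPy call. The array below is synthetic, and no timing claim is made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
prices = rng.normal(size=(1000, 200)).cumsum(axis=0)   # synthetic price paths

# Loop version: difference one column at a time
looped = np.zeros((prices.shape[0] - 1, prices.shape[1]))
for i in range(prices.shape[1]):
    looped[:, i] = np.diff(prices[:, i])

# Vectorized version: difference every column in one call along the row axis
vectorized = np.diff(prices, axis=0)

identical = np.allclose(looped, vectorized)
```

Both versions produce the same matrix; the vectorized form moves the loop into compiled code, which is the kind of change the last bullet of the slide refers to.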

20 Parallelizing yearly PCA
Can we use parallelism in R and Python productively? Both R and Python provide several ways to parallelize: multiple cores, and distributed parallelism using MPI or sockets.
Use coarse-grained parallelism to speed up our computations. Look into how both languages allow the use of multiple cores on modern processors.
It is very easy to apply coarse-grained parallelism to the yearly PCA: divide the years amongst threads/processes.
For R, use the doMC/foreach packages, which work on multiple cores.
Python threads do not work well due to the global interpreter lock (GIL). Use the IPython ipcluster parallelization framework instead.
Further evaluation using MPI on distributed clusters is needed.

21 Parallelizing yearly PCA in R
foreach depends on a backend for execution. We register doMC (multiple cores) as the backend for the yearly PCA in this case. MPI can also be used as a backend for distributed clusters. The snow package is another option (higher level, for distributed clusters).
Not just limited to for loops: use mclapply for a multi-core lapply, etc.
    # Sequential code
    for (years in 1988:2012) {
      for (proportion in c(0.5, 0.75, 0.90, 1)) {
        # Code to extract yearly data, do PCA and then reconstruct
      }
    }
    # Parallel code
    registerDoMC(4)  # register multicore as backend with 4 cores
    foreach(years = 1988:2012) %dopar% {
      for (proportion in c(0.5, 0.75, 0.90, 1)) {
        # Code to extract yearly data, do PCA and then reconstruct
      }
    }

22 Timing Results
[Figure: timing results by number of threads on an Intel Core i7 MacBook Pro (4 cores, 8 hyper-threading threads).]

23 Parallelizing yearly PCA in Python
Using IPython's Direct Interface. Start the backend of IPython:
    % ipcluster-2.7 start -n 4    (4 is the number of processes)
Rewrite the pcadata function so that it can be used with the map API: pcadata(stocksreturns_short, year). pcadata extracts the data for year from stocksreturns_short, performs an SVD and then the reconstructions with the eigenvalue percentages as before.
Processes (unlike threads in R) make us reimport all the modules inside the function. They have a higher memory footprint and are more heavyweight than threads.
Create a list for each process's function arguments. Parallelize across years as in R. Each process computes a subset of the SVDs and reuses a single SVD for 4 reconstructions.
    # code for map_sync
    # map_sync is a method of an IPython parallel view; the client setup is not shown on the slide
    from datetime import datetime
    x = []
    for i in range(0, 25):
        x.append(stocksreturns_short)
    starttime = datetime.now()
    map_sync(pcadata, x, range(1988, 2013))
    print(datetime.now() - starttime)
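The "one task per year through a map call" pattern can be tried without an ipcluster backend. The sketch below is a stand-in using the standard library's `concurrent.futures`, not the IPython setup from the talk: the data is random, `pcadata` here takes a single tuple argument, and a thread pool is used only to keep the example self-contained (a process pool exposes the same `map` interface).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def pcadata(args):
    """One SVD per year, reused for the four reconstructions."""
    data_by_year, year = args
    X = data_by_year[year]
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    eig = d * d
    cum = np.cumsum(eig) / eig.sum()
    errors = {}
    for p in (0.5, 0.75, 0.90, 1.0):
        q = min(int(np.searchsorted(cum, p)) + 1, len(d))
        X_q = (U[:, :q] * d[:q]) @ Vt[:q, :]
        errors[p] = float(np.std(X_q - X))
    return year, errors

rng = np.random.default_rng(6)
years = range(1988, 2013)
# Random stand-in for the yearly difference matrices (250 days x 20 stocks)
data_by_year = {y: rng.normal(size=(250, 20)) for y in years}

# Divide the years among 4 workers; map returns results in input order
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(pcadata, [(data_by_year, y) for y in years]))
```

`results` maps each year to its reconstruction errors per proportion, mirroring the 25 SVDs and 100 reconstructions described for the yearly PCA.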

24 Timing results in Python
[Figure: timing results by number of threads.]
Starting ipcluster with 8 processes leads to processes hanging. IPython with multiple processes led to some memory issues.
The scalability of Python shows a similar trend to R.

25 Summary
Both R and Python offer good choices for PCA. R has many packages for tasks such as downloading financial data, PCA, etc. Python has good support as well.
R offers a cohesive framework: installing packages is pain-free, and parallelization in R is very simple.
R seems to be slower, as the assignment operator requires copy operations, which is a lot of overhead (together with my use of for loops).
Python is more forgiving of the use of for loops, and seems to require fewer statements to do the same work.
pandas/numpy add dataframe capabilities to Python's native string handling capabilities to provide a strong platform for data analysis.

26 Future Work
Profiling of the code at statement level, etc.
How do R and Python work for memory-bound/compute-bound problems?
Work with distributed matrices (DistNumPy for Python, r-pbd for R).
Use MPI as a backend for parallelization on a cluster.
Make interpreted code faster for both R and Python through compilers (cmpfun for R, Cython for Python).

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders CSC 411: Lecture 14: Principal Components Analysis & Autoencoders Raquel Urtasun & Rich Zemel University of Toronto Nov 4, 2015 Urtasun & Zemel (UofT) CSC 411: 14-PCA & Autoencoders Nov 4, 2015 1 / 18

More information

R on BioHPC. Rstudio, Parallel R and BioconductoR. Updated for

R on BioHPC. Rstudio, Parallel R and BioconductoR. Updated for R on BioHPC Rstudio, Parallel R and BioconductoR 1 Updated for 2015-07-15 2 Today we ll be looking at Why R? The dominant statistics environment in academia Large number of packages to do a lot of different

More information

Feature selection. Term 2011/2012 LSI - FIB. Javier Béjar cbea (LSI - FIB) Feature selection Term 2011/ / 22

Feature selection. Term 2011/2012 LSI - FIB. Javier Béjar cbea (LSI - FIB) Feature selection Term 2011/ / 22 Feature selection Javier Béjar cbea LSI - FIB Term 2011/2012 Javier Béjar cbea (LSI - FIB) Feature selection Term 2011/2012 1 / 22 Outline 1 Dimensionality reduction 2 Projections 3 Attribute selection

More information

Singular Value Decomposition, and Application to Recommender Systems

Singular Value Decomposition, and Application to Recommender Systems Singular Value Decomposition, and Application to Recommender Systems CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Recommendation

More information

PCOMP http://127.0.0.1:55825/help/topic/com.rsi.idl.doc.core/pcomp... IDL API Reference Guides > IDL Reference Guide > Part I: IDL Command Reference > Routines: P PCOMP Syntax Return Value Arguments Keywords

More information

LECTURE 7: STUDENT REQUESTED TOPICS

LECTURE 7: STUDENT REQUESTED TOPICS 1 LECTURE 7: STUDENT REQUESTED TOPICS Introduction to Scientific Python, CME 193 Feb. 20, 2014 Please download today s exercises from: web.stanford.edu/~ermartin/teaching/cme193-winter15 Eileen Martin

More information

Distributed Data Structures, Parallel Computing and IPython

Distributed Data Structures, Parallel Computing and IPython Distributed Data Structures, Parallel Computing and IPython Brian Granger, Cal Poly San Luis Obispo Fernando Perez, UC Berkeley Funded in part by NASA Motivation Compiled Languages C/C++/Fortran are FAST

More information

ARTIFICIAL INTELLIGENCE AND PYTHON

ARTIFICIAL INTELLIGENCE AND PYTHON ARTIFICIAL INTELLIGENCE AND PYTHON DAY 1 STANLEY LIANG, LASSONDE SCHOOL OF ENGINEERING, YORK UNIVERSITY WHAT IS PYTHON An interpreted high-level programming language for general-purpose programming. Python

More information

Dimension Reduction CS534

Dimension Reduction CS534 Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of

More information

Science Cookbook. Practical Data. open source community experience distilled. Benjamin Bengfort. science projects in R and Python.

Science Cookbook. Practical Data. open source community experience distilled. Benjamin Bengfort. science projects in R and Python. Practical Data Science Cookbook 89 hands-on recipes to help you complete real-world data science projects in R and Python Tony Ojeda Sean Patrick Murphy Benjamin Bengfort Abhijit Dasgupta PUBLISHING open

More information

Linear Methods for Regression and Shrinkage Methods

Linear Methods for Regression and Shrinkage Methods Linear Methods for Regression and Shrinkage Methods Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Linear Regression Models Least Squares Input vectors

More information

General Instructions. Questions

General Instructions. Questions CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These

More information

Automatic Singular Spectrum Analysis for Time-Series Decomposition

Automatic Singular Spectrum Analysis for Time-Series Decomposition Automatic Singular Spectrum Analysis for Time-Series Decomposition A.M. Álvarez-Meza and C.D. Acosta-Medina and G. Castellanos-Domínguez Universidad Nacional de Colombia, Signal Processing and Recognition

More information

Parallel Architecture & Programing Models for Face Recognition

Parallel Architecture & Programing Models for Face Recognition Parallel Architecture & Programing Models for Face Recognition Submitted by Sagar Kukreja Computer Engineering Department Rochester Institute of Technology Agenda Introduction to face recognition Feature

More information

Collaborative Filtering for Netflix

Collaborative Filtering for Netflix Collaborative Filtering for Netflix Michael Percy Dec 10, 2009 Abstract The Netflix movie-recommendation problem was investigated and the incremental Singular Value Decomposition (SVD) algorithm was implemented

More information

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders CSC 411: Lecture 14: Principal Components Analysis & Autoencoders Richard Zemel, Raquel Urtasun and Sanja Fidler University of Toronto Zemel, Urtasun, Fidler (UofT) CSC 411: 14-PCA & Autoencoders 1 / 18

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1 Clustering images Set of Images [Goldberger et al.] Carlos Guestrin 2005-2014

More information

Recommender System. What is it? How to build it? Challenges. R package: recommenderlab

Recommender System. What is it? How to build it? Challenges. R package: recommenderlab Recommender System What is it? How to build it? Challenges R package: recommenderlab 1 What is a recommender system Wiki definition: A recommender system or a recommendation system (sometimes replacing

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

HW Assignment 3 (Due by 9:00am on Mar 6)

HW Assignment 3 (Due by 9:00am on Mar 6) HW Assignment 3 (Due by 9:00am on Mar 6) 1 Theory (150 points) 1. [Tied Weights, 50 points] Write down the gradient computation for a (non-linear) auto-encoder with tied weights i.e., W (2) = (W (1) )

More information

Linear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre

Linear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre Linear Algebra libraries in Debian Who I am? Core developer of Scilab (daily job) Debian Developer Involved in Debian mainly in Science and Java aspects sylvestre.ledru@scilab.org / sylvestre@debian.org

More information

A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)

A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) Eman Abdu eha90@aol.com Graduate Center The City University of New York Douglas Salane dsalane@jjay.cuny.edu Center

More information

Modelling and Visualization of High Dimensional Data. Sample Examination Paper

Modelling and Visualization of High Dimensional Data. Sample Examination Paper Duration not specified UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE Modelling and Visualization of High Dimensional Data Sample Examination Paper Examination date not specified Time: Examination

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Work 2. Case-based reasoning exercise

Work 2. Case-based reasoning exercise Work 2. Case-based reasoning exercise Marc Albert Garcia Gonzalo, Miquel Perelló Nieto November 19, 2012 1 Introduction In this exercise we have implemented a case-based reasoning system, specifically

More information

Analysis and Latent Semantic Indexing

Analysis and Latent Semantic Indexing 18 Principal Component Analysis and Latent Semantic Indexing Understand the basics of principal component analysis and latent semantic index- Lab Objective: ing. Principal Component Analysis Understanding

More information

RDAV and Nautilus

RDAV and Nautilus http://rdav.nics.tennessee.edu/ RDAV and Nautilus Parallel Processing with R Amy F. Szczepa!ski Remote Data Analysis and Visualization Center University of Tennessee, Knoxville aszczepa@utk.edu Any opinions,

More information

Introduction to Machine Learning Prof. Anirban Santara Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Introduction to Machine Learning Prof. Anirban Santara Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Introduction to Machine Learning Prof. Anirban Santara Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture 14 Python Exercise on knn and PCA Hello everyone,

More information

Spatial Distributions of Precipitation Events from Regional Climate Models

Spatial Distributions of Precipitation Events from Regional Climate Models Spatial Distributions of Precipitation Events from Regional Climate Models N. Lenssen September 2, 2010 1 Scientific Reason The Institute of Mathematics Applied to Geosciences (IMAGe) and the National

More information

ACHIEVEMENTS FROM TRAINING

ACHIEVEMENTS FROM TRAINING LEARN WELL TECHNOCRAFT DATA SCIENCE/ MACHINE LEARNING SYLLABUS 8TH YEAR OF ACCOMPLISHMENTS AUTHORIZED GLOBAL CERTIFICATION CENTER FOR MICROSOFT, ORACLE, IBM, AWS AND MANY MORE. 8411002339/7709292162 WWW.DW-LEARNWELL.COM

More information

Introduction to Programming

Introduction to Programming Introduction to Programming G. Bakalli March 8, 2017 G. Bakalli Introduction to Programming March 8, 2017 1 / 33 Outline 1 Programming in Finance 2 Types of Languages Interpreters Compilers 3 Programming

More information

Intel Distribution For Python*

Intel Distribution For Python* Intel Distribution For Python* Intel Distribution for Python* 2017 Advancing Python performance closer to native speeds Easy, out-of-the-box access to high performance Python High performance with multiple

More information

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies.

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies. CSE 547: Machine Learning for Big Data Spring 2019 Problem Set 2 Please read the homework submission policies. 1 Principal Component Analysis and Reconstruction (25 points) Let s do PCA and reconstruct

More information

Motion Interpretation and Synthesis by ICA

Motion Interpretation and Synthesis by ICA Motion Interpretation and Synthesis by ICA Renqiang Min Department of Computer Science, University of Toronto, 1 King s College Road, Toronto, ON M5S3G4, Canada Abstract. It is known that high-dimensional

More information

Biology Project 1

Biology Project 1 Biology 6317 Project 1 Data and illustrations courtesy of Professor Tony Frankino, Department of Biology/Biochemistry 1. Background The data set www.math.uh.edu/~charles/wing_xy.dat has measurements related

More information

UNIVERSITY OF OSLO. Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO. Faculty of Mathematics and Natural Sciences UNIVERSITY OF OSLO Faculty of Mathematics and Natural Sciences Exam: INF 4300 / INF 9305 Digital image analysis Date: Thursday December 21, 2017 Exam hours: 09.00-13.00 (4 hours) Number of pages: 8 pages

More information

Eliminating Global Interpreter Locks in Ruby through Hardware Transactional Memory

Eliminating Global Interpreter Locks in Ruby through Hardware Transactional Memory Eliminating Global Interpreter Locks in Ruby through Hardware Transactional Memory Rei Odaira, Jose G. Castanos and Hisanobu Tomari IBM Research and University of Tokyo April 8, 2014 Rei Odaira, Jose G.

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Using Existing Numerical Libraries on Spark

Using Existing Numerical Libraries on Spark Using Existing Numerical Libraries on Spark Brian Spector Chicago Spark Users Meetup June 24 th, 2015 Experts in numerical algorithms and HPC services How to use existing libraries on Spark Call algorithm

More information

Parallel Computing with R. Le Yan LSU

Parallel Computing with R. Le Yan LSU Parallel Computing with R Le Yan HPC @ LSU 3/22/2017 HPC training series Spring 2017 Outline Parallel computing primers Parallel computing with R Implicit parallelism Explicit parallelism R with GPU 3/22/2017

More information

Package PCADSC. April 19, 2017

Package PCADSC. April 19, 2017 Type Package Package PCADSC April 19, 2017 Title Tools for Principal Component Analysis-Based Data Structure Comparisons Version 0.8.0 A suite of non-parametric, visual tools for assessing differences

More information

GPU Based Face Recognition System for Authentication

GPU Based Face Recognition System for Authentication GPU Based Face Recognition System for Authentication Bhumika Agrawal, Chelsi Gupta, Meghna Mandloi, Divya Dwivedi, Jayesh Surana Information Technology, SVITS Gram Baroli, Sanwer road, Indore, MP, India

More information

Python for Data Analysis

Python for Data Analysis Python for Data Analysis Wes McKinney O'REILLY 8 Beijing Cambridge Farnham Kb'ln Sebastopol Tokyo Table of Contents Preface xi 1. Preliminaries " 1 What Is This Book About? 1 Why Python for Data Analysis?

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information
