Comparing R and Python for PCA PyData Boston 2013

1 Vipin Sachdeva Senior Engineer, IBM Research Comparing R and Python for PCA PyData Boston 2013

2 Comparison of R and Python for Principal Component Analysis
- R and Python are popular choices for data analysis. How do they compare in terms of programmer productivity and performance?
- Use a common task for both R and Python: Principal Component Analysis (PCA). PCA is a very commonly used technique for dimension reduction.
- Dataframes are an essential part of languages that support data analysis.
  - R provides a data frame along with numerous statistical packages.
  - Python has NumPy (arrays) and pandas (dataframes) for data handling, which we use.
- Both languages have rich development environments: RStudio for R, IPython for Python.
- Both languages have many features that help in data analysis. In this talk we compare those features, with some code examples, to solve our problem.
- This talk is not so much about principal component analysis as about programming and performance in Python and R.
- Let's get started.

3 PCA: Short Introduction
- PCA is a standard tool in modern data analysis.
- A simple method to extract information from confusing datasets.
- Reduces a complex dataset to a lower dimension.
- PCA projects the data along the directions in which the data varies the most.
- These directions are given by the eigenvectors corresponding to the largest eigenvalues.

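4 PCA: Mathematical approaches
- Find the eigenvalues \lambda_i of the standardized covariance matrix.
- Choose the leading eigenvalues whose sum exceeds a threshold.
- Reduction in dimension from N to K: create data with the subset of eigenvalues whose sum exceeds that threshold.

$$ \frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{N} \lambda_i} > \text{Threshold} \quad (\text{e.g., } 0.9 \text{ or } 0.95) $$

$$ \hat{x} - \bar{x} = \sum_{i=1}^{K} b_i u_i \quad \text{or} \quad \hat{x} = \sum_{i=1}^{K} b_i u_i + \bar{x} $$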
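
The threshold rule and the reconstruction formula on this slide can be sketched in NumPy as follows. This is a minimal illustration on synthetic data, not the talk's code: the names `choose_k` and `pca_reconstruct` are hypothetical, the data is only centered (not standardized), and the comparison uses `>=` rather than the strict `>` of the slide.

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.9):
    """Smallest K whose largest eigenvalues reach `threshold` of the total sum."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # largest first
    fraction = np.cumsum(lam) / lam.sum()
    # The small tolerance keeps threshold=1.0 from being missed by rounding.
    k = int(np.searchsorted(fraction, threshold - 1e-12)) + 1
    return min(k, lam.size)

def pca_reconstruct(X, threshold=0.9):
    """Project X onto the K leading eigenvectors and map it back."""
    x_bar = X.mean(axis=0)
    Z = X - x_bar                                   # centered data
    lam, U = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(lam)[::-1]                   # eigh returns ascending order
    lam, U = lam[order], U[:, order]
    K = choose_k(lam, threshold)
    B = Z @ U[:, :K]                                # coefficients b_i for each row
    return x_bar + B @ U[:, :K].T, K

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
X_hat, K = pca_reconstruct(X, threshold=0.9)
print(K, np.abs(X_hat - X).max())
```

With `threshold=1.0` all components are kept and the reconstruction matches the input up to rounding.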

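5 PCA using Singular Value Decomposition (SVD)
- A more generalized approach for performing PCA.
- Decompose X = U D V^T.
- D*D gives the eigenvalues of the covariance matrix (up to a constant scale factor when X is centered).
- Reconstruct the data by zeroing out regions of the decomposition. [Figure not recovered: zeroed-out regions of the SVD factors.]
- Choose q as before.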
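
The link between the singular values and the covariance eigenvalues can be checked numerically. The sketch below uses synthetic data and assumes the columns are centered; `np.cov` divides by n - 1, which is the constant factor between the two sets of values. The proportions used for choosing q do not depend on that factor.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)                       # center each column

# Route 1: eigenvalues of the covariance matrix, largest first
cov_eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# Route 2: squared singular values of the centered data
d = np.linalg.svd(Xc, compute_uv=False)
svd_eigvals = d * d / (Xc.shape[0] - 1)       # np.cov divides by n - 1

print(np.allclose(cov_eigvals, svd_eigvals))
print(np.allclose(cov_eigvals / cov_eigvals.sum(), d * d / np.sum(d * d)))
```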

6 PCA: What data to use?
- How about PCA on the current S&P 500 stocks over a period of time?
- Download the symbols from the S&P 500 website and create a vector.
- Use this vector to download the symbols' data from 1970 to 2012 into a dataframe (if possible).
- R and Python have various packages for financial data download: quantmod (R), pandas.io.data.DataReader (Python).
- Need a package that provides a single dataframe as output from a single call.
- [Table not recovered: example dataframe with a Dates column and one column per symbol (MMM, ABT, ...), with NA where data is unavailable.]

7 Data Download: R only
- I am a C/C++/Fortran HPC programmer, and I do use for loops in R and Python. for loops are slow in R.
- Can any package return data for the S&P stocks as a single dataframe?
- Use the fImport package of R to download daily data.
    stocksdata <- yahooSeries(symbols_nospaces, from=" ", to=" ")  # symbols_nospaces holds the S&P stock symbols
- Extract the columns with closing prices.
- Write to a csv file for repeated runs (the download takes a long time).
    # Snippet of code to get closing data
    colname <- paste(symbols_nospaces[i], ".Close", sep="")
    print(colname)
    stockdata_df[, i+1] <- get(colname)
    colnames(stockdata_df)[i+1] <- symbols_nospaces[i]

8 Combined Data Preparation: R and Python
- Read the file in R/Python to get the data.
  - read.table in R creates an R dataframe.
  - pandas read_table creates a pandas dataframe.
- Many symbols have NAs for dates where data is not available.
- Work with a subset of the data: how about 200 stocks for a quarter of a century (1988 to 2012)?
- Find the first occurrence of 1988 in the dataframe's Dates column: str.contains('1988') in Python, agrep(..., fixed=TRUE) in R.
- Extract the stock columns that do not have NA on the first trading day of 1988: !is.na in R, not math.isnan in Python.
- Take the first 200 stocks that satisfy this requirement.
- Result: combined data for 200 stocks from 1988 to 2012 in R/Python dataframes.
- Drop rows with NA for any stock: na.omit() in R, dropna() in pandas.
- 6162 entries for 200 stocks in total.
- R and Python are remarkably similar for this step.
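
The preparation steps on this slide can be sketched with current pandas as below. This is an illustration under assumptions, not the speaker's code: the function name `prepare`, the column name `Dates`, and the `YYYY-MM-DD` date format are all assumed, and `str.startswith` is used in place of `str.contains` so that the year only matches at the start of the date.

```python
import numpy as np
import pandas as pd

def prepare(prices, start_year, numstocks):
    """Trim to start_year onward, keep stocks priced on the first day, drop NA rows."""
    dates = prices["Dates"].astype(str)
    in_year = dates.str.startswith(str(start_year)).to_numpy()
    if not in_year.any():
        raise ValueError("start_year not found in Dates")
    start = int(np.argmax(in_year))                 # first row of start_year
    window = prices.iloc[start:].reset_index(drop=True)
    stocks = window.drop(columns="Dates")
    # Columns with a price on the first trading day of start_year
    has_first_day = stocks.columns[stocks.iloc[0].notna().to_numpy()]
    keep = list(has_first_day[:numstocks])
    return window[["Dates"] + keep].dropna().reset_index(drop=True)

toy = pd.DataFrame({
    "Dates": ["1987-12-31", "1988-01-04", "1988-01-05", "1988-01-06"],
    "MMM": [1.0, 2.0, 3.0, 4.0],
    "ABT": [np.nan, np.nan, 5.0, 6.0],   # no price on the first day of 1988
    "AAA": [1.0, 2.0, np.nan, 4.0],      # one missing day inside the window
})
result = prepare(toy, 1988, 2)
print(result)
```

In the toy frame, ABT is excluded because it has no price on the first trading day, and the row with a missing AAA price is dropped.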

9 Code in R and Python for data preparation

Python code

    import math
    import numpy as np
    import pandas as pd

    def extractdata(filename, lowyear, numstocks):
        # Read the whitespace-separated file written after the download step
        stocksreturns = pd.read_table(filename, sep=r'\s+')
        # Boolean mask of the rows whose date contains the starting year
        x = stocksreturns['"dates"'].str.contains(str(lowyear))
        # Index of the first row belonging to lowyear
        for i in range(np.size(x)):
            if x[i]:
                break
        startindex = i
        colnames = stocksreturns.columns
        stocksreturns_short = pd.DataFrame(stocksreturns.iloc[startindex:np.size(x)]['"dates"'])
        stockindex = 0
        for i in range(1, 501):
            if stockindex < numstocks:
                # Keep a stock only if it has a price on the first day
                if not math.isnan(stocksreturns[colnames[i]][startindex]):
                    stocksreturns_short[colnames[i]] = stocksreturns[colnames[i]][startindex:len(stocksreturns.index)]
                    stockindex = stockindex + 1
        return stocksreturns_short

R code

    extractdata <- function(filename, yearrange, numstocks) {
      stocksreturns <- read.table(filename, header=TRUE)
      yrpattern <- paste(yearrange[1], "-*", sep="")
      stockdates <- stocksreturns$dates
      sindex <- agrep(yrpattern, stockdates, fixed=TRUE)[1]   # first row of the starting year
      eindex <- length(stockdates)
      colnames <- colnames(stocksreturns)
      stocksreturns_short <- data.frame(stockdates[sindex:eindex])
      colnames(stocksreturns_short)[1] <- "dates"
      j <- 2
      stockindex <- 0
      for (i in 2:501) {
        if (stockindex < numstocks) {
          # Keep a stock only if it has a price on the first day
          if (!is.na(stocksreturns[, i][sindex])) {
            stocksreturns_short[, j] <- stocksreturns[, i][sindex:eindex]
            colnames(stocksreturns_short)[j] <- colnames[i]
            j <- j + 1
            stockindex <- stockindex + 1
          }
        }
      }
      stocksreturns_short
    }

10 R/Python packages for PCA
- The data is only about 23 MB in txt format. No memory-bound issues running on my laptop.
- R has many choices (one too many) for PCA: prcomp/princomp/pca/dudi.pca/acp
  - prcomp scales and centers the data (very convenient): prcomp(stocksreturns_short, scale=TRUE, center=TRUE, retx=TRUE)
  - Reconstruct the data with the predict function.
  - prcomp uses svd beneath the covers.
- Python seems to have several choices for PCA as well:
  - matplotlib.mlab.PCA
  - MDP (Modular toolkit for Data Processing) PCA
  - numpy.linalg.eig / scipy.linalg.eig etc. (covariance matrix/eigenvalues/eigenvectors approach)
- Both languages seem to have adequate support for PCA in multiple ways.
- Our approach: use SVD in both R and Python, do the same operations, and compare runtimes.
  - svd in numpy returns transpose(V), while R returns V.
  - Both R and Python return d as a vector; it is trivial to make a diagonal matrix for reconstruction of the data.
  - Indexing starts from 0 in Python and from 1 in R.
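
The NumPy conventions mentioned on this slide (transposed V, singular values as a vector) can be seen in a few lines. This is a small sketch on random data; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))

# NumPy returns U, the singular values as a 1-D vector, and transpose(V)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, d.shape, Vt.shape)

D = np.diag(d)            # build the diagonal matrix from the vector
X_rebuilt = U @ D @ Vt    # no extra transpose needed in NumPy

V = Vt.T                  # the orientation that the slide attributes to R's svd
print(np.allclose(X_rebuilt, X), np.allclose(U @ D @ V.T, X))
```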

11 PCA on combined data using SVD
- PCA is an SVD operation: X is the stocks data (6162x200).
- D*D gives the eigenvalues (p=200).
- Reconstruct the data by zeroing out regions of the decomposition. [Figure not recovered: zeroed-out regions of the SVD factors.]
- Choose q (explained ahead).

12 PCA on combined data: Approach
- Perform an SVD of the stock returns data.
- Find the number of eigenvalues q that make up 50%, 75%, 90% and 100% of the sum of all the eigenvalues.
  - Eigenvalues = d*d from the SVD.
- Zero out the remaining eigenvectors/eigenvalues.
  - In Python, use copy.copy to copy the eigenvalues/vectors from the SVD (plain assignment only creates a reference).
- Reconstruct the data with matrix-multiply operations: X_reconstructed = U*D*t(V)
- Measure std_dev(data_reconstructed - original_data).
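
The approach on this slide can be sketched compactly in NumPy as follows. This is an illustration on synthetic data rather than the talk's code: the function name `truncated_reconstruction` is hypothetical, and zeroing the trailing singular values is used as an equivalent shortcut for zeroing the matching columns of U and rows of transpose(V). The first lines also show why a copy is needed before zeroing.

```python
import copy
import numpy as np

# Plain assignment aliases the array; copy.copy makes an independent array.
a = np.array([1.0, 2.0, 3.0])
alias = a
snapshot = copy.copy(a)
alias[0] = 0.0            # also changes a, but not snapshot

def truncated_reconstruction(X, proportion):
    """Keep the q leading components that reach `proportion` of sum(d*d)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    eigvals = d * d
    fraction = np.cumsum(eigvals) / eigvals.sum()
    q = min(int(np.searchsorted(fraction, proportion - 1e-12)) + 1, d.size)
    d_q = copy.copy(d)    # keep the original singular values intact
    d_q[q:] = 0.0         # zero out the remaining components
    X_q = (U * d_q) @ Vt  # same as U @ diag(d_q) @ Vt
    return q, X_q, float(np.std(X_q - X))

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10)) * np.linspace(3.0, 0.5, 10)
qs, errs = [], []
for proportion in (0.5, 0.75, 0.90, 1.0):
    q, _, err = truncated_reconstruction(X, proportion)
    qs.append(q)
    errs.append(err)
    print(proportion, q, err)
```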

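13 Code in Python for PCA on combined data

    import copy
    import numpy as np

    eigvalspercent = [0.5, 0.75, 0.90, 1.0]
    colnames = stocksreturns_short.columns
    # Day-to-day differences, one column per stock (column 0 of the dataframe holds the dates)
    diff_data_combined = np.zeros(shape=(stocksreturns_short.shape[0] - 1, numstocks))
    for i in range(0, numstocks):
        diff_data_combined[:, i] = np.diff(stocksreturns_short[colnames[i + 1]].values)
    u_original, d_original, v_original = np.linalg.svd(diff_data_combined, full_matrices=False)
    d_diag_original = np.diag(d_original)
    eigvals_combined = d_original * d_original
    totalsumeigvals = np.sum(eigvals_combined)
    for percent in eigvalspercent:
        sumeigvals = 0
        neigvals = numstocks                      # fallback if rounding keeps the sum just below 100%
        for i in range(0, numstocks):
            sumeigvals = sumeigvals + eigvals_combined[i]
            if sumeigvals >= (percent * totalsumeigvals):
                neigvals = i + 1
                break
        # Copy, because plain assignment would only create another reference
        u = copy.copy(u_original)
        d_diag = copy.copy(d_diag_original)
        v = copy.copy(v_original)                 # v holds transpose(V)
        nvals = np.shape(diff_data_combined)[1]
        u[:, neigvals:nvals] = 0                  # zero the trailing columns of U
        d_diag[neigvals:nvals, neigvals:nvals] = 0
        v[neigvals:nvals, :] = 0                  # zero the trailing rows of transpose(V)
        dproduct = np.dot(u, np.dot(d_diag, v))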

14 Code in R for PCA on combined data

    data.combined <- diff_stockyr
    svd.combined <- svd(data.combined)                        # find SVD
    eigvalues.combined <- svd.combined$d * svd.combined$d
    totalsum <- sum(eigvalues.combined)
    proportionrange <- c(0.5, 0.75, 0.90, 1)
    for (proportion in proportionrange) {
      runsum <- 0
      neigvalues <- 0
      for (i in 1:numstocks) {
        runsum <- runsum + eigvalues.combined[i]
        neigvalues <- neigvalues + 1
        if ((runsum / totalsum) >= proportion) {
          cat(sprintf("number of eigenvalues for combined data for proportion %f = %d\n", proportion, neigvalues))
          break
        }
      }
      nvals <- dim(data.combined)[2]
      u <- svd.combined$u                                     # copy SVD matrices
      d <- diag(svd.combined$d)
      v <- svd.combined$v
      # Keep components 1..neigvalues and zero the rest (R indexes from 1)
      if (neigvalues < nvals) {
        drop <- (neigvalues + 1):nvals
        u[, drop] <- 0
        d[drop, drop] <- 0
        v[, drop] <- 0
      }
      stock.data <- u %*% d %*% t(v)                          # matrix multiply to get the data back
    }

15 PCA on combined data: results
- 138 of the 200 eigenvalues account for 90% of the sum of all the eigenvalues.
- Reconstructing the data with these 138 components gives negligible error (10^-5).

16 Yearly PCA
- Instead of doing a PCA on the combined data from 1988 to 2012, how about a yearly PCA?
- PCA on yearly data:
  - Separate the combined dataframe into yearly dataframes (one for each year). The number of observations varies from year to year.
  - Calculate the number of eigenvalues accounting for 50%, 75%, 90% and 100% of the sum of all eigenvalues (same operation as PCA on combined data).
  - Do a reconstruction for each proportion and each year (same operation as PCA on combined data).
- 25 separate PCAs, 100 reconstructions in total.
- [Figure not recovered: the combined dataframe split into Dataframe 1 ... Dataframe 25, each with a Dates column and one column per symbol (MMM, ABT, ...).]

17 Code in Python and R for extracting yearly dataframes

Python code

    colnames = stocksreturns_short.columns
    x = stocksreturns_short['"dates"'].str.contains(str(years))
    stocksreturns_yr = stocksreturns_short.loc[x]             # rows of this year only
    shape0 = np.shape(stocksreturns_yr)[0]
    data_yr = np.zeros((shape0, numstocks))
    diff_data_yr = np.zeros((shape0 - 1, numstocks))
    for i in range(0, numstocks):
        data_yr[:, i] = stocksreturns_yr[colnames[i + 1]]
        diff_data_yr[:, i] = np.diff(data_yr[:, i])

R code

    yrpattern <- paste(years, "-*", sep="")
    yrindices <- agrep(yrpattern, stocksreturns_short$Dates, fixed=TRUE)
    val_stockyr <- data.frame(stocksreturns_short$Dates[yrindices[1]:tail(yrindices, n=1)])
    colnames(val_stockyr)[1] <- "dates"
    for (i in 2:(numstocks+1)) {
      val_stockyr[, i] <- stocksreturns_short[, i][yrindices[1]:tail(yrindices, n=1)]
      colnames(val_stockyr)[i] <- colnames(stocksreturns_short)[i]
    }
    # The slide differences a log_stockyr frame that the snippet never defines;
    # val_stockyr is used here, which matches the plain differences in the Python code.
    diff_stockyr <- data.frame(matrix(NA, nrow=(dim(val_stockyr)[1]-1), ncol=numstocks))
    for (j in 1:numstocks) diff_stockyr[, j] <- diff(val_stockyr[, j+1])
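
As an alternative sketch of the same step, pandas `groupby` can split the combined frame into yearly pieces without a per-column loop. The function name `yearly_differences`, the column name `Dates`, and the `YYYY-MM-DD` format are assumptions made for this illustration; it is not the code used in the talk.

```python
import numpy as np
import pandas as pd

def yearly_differences(prices):
    """Return {year: array of day-to-day differences} for a frame with a 'Dates' column."""
    years = prices["Dates"].astype(str).str.slice(0, 4).astype(int)
    out = {}
    for year, frame in prices.drop(columns="Dates").groupby(years):
        # np.diff along axis 0 differences every stock column at once
        out[int(year)] = np.diff(frame.to_numpy(dtype=float), axis=0)
    return out

toy = pd.DataFrame({
    "Dates": ["1988-01-04", "1988-01-05", "1988-01-06", "1989-01-03", "1989-01-04"],
    "MMM": [1.0, 2.0, 4.0, 7.0, 11.0],
    "ABT": [10.0, 10.0, 11.0, 13.0, 16.0],
})
by_year = yearly_differences(toy)
for year, diffs in by_year.items():
    print(year, diffs.shape)
```

Differences are taken within each year, so no difference spans a year boundary, as in the per-year code on the slide.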

18 Analysis of yearly PCA data
- The number of eigenvalues accounting for 50% of the total sum drops to 1 in 2008.
- Stock movement was highly correlated that year because of macroeconomic trends.
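
The connection between correlation and the eigenvalue count can be illustrated with synthetic data. In the sketch below, the shared "market" factor, its strength of 3.0, and the matrix sizes are invented for the example and are not taken from the stock data.

```python
import numpy as np

def n_for_half(X):
    """Number of leading eigenvalues (d*d) needed to reach 50% of their sum."""
    d = np.linalg.svd(X, compute_uv=False)
    fraction = np.cumsum(d * d) / np.sum(d * d)
    return int(np.searchsorted(fraction, 0.5)) + 1

rng = np.random.default_rng(3)
noise = rng.normal(size=(250, 50))      # 250 days, 50 hypothetical stocks
market = rng.normal(size=(250, 1))      # one factor shared by every stock

independent = noise
correlated = 3.0 * market + noise       # every column moves with the market

print(n_for_half(independent), n_for_half(correlated))
```

When one common factor dominates, a single eigenvalue carries more than half of the total, which is the pattern the slide reports for 2008.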

19 PCA on yearly data
- Being a C/C++/Fortran HPC programmer, I use for loops in R/Python.
  - Not efficient in R (the for loop is an object; assignment in R is a copy operation).
  - Python's assignment is done with references, so it works better with for loops and has less function-call overhead.
- Development environment:
  - ipython-2.7 with pandas and numpy installed through the ports package.
  - RStudio 0.97 with the R binary downloaded for Mac.
  - No attempt to optimize the build for either R or Python.
- The total code takes above 20 seconds in R versus about 11.9 seconds in Python on my MacBook Pro.
- Timings may change with less reliance on for loops in the code.
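
To illustrate what less reliance on for loops could look like on the Python side, the sketch below pairs loop versions of two steps from the earlier slides with vectorized NumPy equivalents. The function names are hypothetical, and no timing claim is made here; the point is only that both forms give the same result.

```python
import numpy as np

def diff_loop(values):
    """Column-by-column differences, as in the loop on the earlier slides."""
    out = np.zeros((values.shape[0] - 1, values.shape[1]))
    for i in range(values.shape[1]):
        out[:, i] = np.diff(values[:, i])
    return out

def diff_vectorized(values):
    """All columns differenced in a single call."""
    return np.diff(values, axis=0)

def count_loop(eigvals, proportion):
    """Count eigenvalues until the running sum reaches proportion * total."""
    total = eigvals.sum()
    running = 0.0
    for i in range(eigvals.size):
        running += eigvals[i]
        if running >= proportion * total:
            return i + 1
    return eigvals.size

def count_vectorized(eigvals, proportion):
    """Same count using a cumulative sum and a binary search."""
    running = np.cumsum(eigvals)
    k = int(np.searchsorted(running, proportion * running[-1])) + 1
    return min(k, eigvals.size)

prices = np.array([[1.0, 10.0], [2.0, 10.0], [4.0, 11.0], [7.0, 13.0]])
eigvals = np.array([5.0, 3.0, 1.0, 1.0])
print(np.array_equal(diff_loop(prices), diff_vectorized(prices)))
print([count_vectorized(eigvals, p) for p in (0.5, 0.75, 0.90, 1.0)])
```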

20 Parallelizing yearly PCA
- Can we use parallelism in R and Python productively?
- Both R and Python provide several ways to parallelize:
  - Multiple cores.
  - Distributed parallelism using MPI or sockets.
- Use coarse-grained parallelism to speed up our computations.
  - Look into how both languages allow the use of multiple cores on modern processors.
- It is very easy to apply coarse-grained parallelism to yearly PCA: divide the years amongst threads/processes.
  - For R, use the doMC/foreach packages, which work on multiple cores.
  - Python threads do not work well because of the global interpreter lock (GIL). Use the IPython ipcluster parallelization framework.
- Further evaluation using MPI on distributed clusters is needed.

21 Parallelizing yearly PCA in R
- foreach depends on a backend for execution.
  - We register doMC (multiple cores) as the backend for the yearly PCA in this case.
  - MPI can also be used as a backend for distributed clusters.
- The snow package is another option (higher level, for distributed clusters).
- Not just limited to for loops: use mclapply for a multi-core lapply, etc.

    # Sequential code
    for (years in 1988:2012) {
      for (proportion in c(0.5, 0.75, 0.90, 1)) {
        # Code to extract yearly data, do PCA and then reconstruct
      }
    }

    # Parallel code
    library(doMC)        # also loads foreach
    registerDoMC(4)      # register multicore as backend with 4 cores
    foreach(years = 1988:2012) %dopar% {
      for (proportion in c(0.5, 0.75, 0.90, 1)) {
        # Code to extract yearly data, do PCA and then reconstruct
      }
    }

22 Timing Results
[Figure not recovered: R timing results versus number of threads on an Intel Core i7 MacBook Pro (4 cores, 8 hyper-threading threads).]

23 Parallelizing yearly PCA in Python
- Using IPython's Direct Interface.
- Start the IPython backend: % ipcluster-2.7 start -n 4 (4 is the number of processes)
- Rewrite the pcadata function so that it can be used with the map API: pcadata(stocksreturns_short, year)
  - pcadata extracts the data for year from stocksreturns_short, performs an SVD, and then does the reconstructions with the same eigenvalue percentages as before.
- Processes (unlike threads in R) make us re-import all the modules inside the function.
  - Higher memory footprint.
  - More heavyweight compared to threads.
- Create a list for each process's function arguments.
- Parallelize across years as in R: each process computes a subset of the SVDs and reuses a single SVD for 4 reconstructions.

    # code for map_sync
    from datetime import datetime
    from IPython.parallel import Client   # packaged as ipyparallel in later releases
    dview = Client()[:]                   # direct view on all engines (setup not shown on the slide)
    x = []
    for i in range(0, 25):
        x.append(stocksreturns_short)     # one copy of the argument per year
    starttime = datetime.now()
    dview.map_sync(pcadata, x, range(1988, 2013))
    print(datetime.now() - starttime)
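
The same divide-the-years pattern can be sketched with the standard library's `concurrent.futures.ProcessPoolExecutor`. This is a different mechanism from the ipcluster setup used in the talk and is swapped in only so the example stays self-contained. The names `pca_year`, `run_serial` and `run_parallel` and the synthetic yearly data are assumptions for illustration.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

PROPORTIONS = (0.5, 0.75, 0.90, 1.0)

def pca_year(diff_data):
    """One SVD per year, reused for all four reconstructions."""
    U, d, Vt = np.linalg.svd(diff_data, full_matrices=False)
    fraction = np.cumsum(d * d) / np.sum(d * d)
    results = []
    for p in PROPORTIONS:
        q = min(int(np.searchsorted(fraction, p - 1e-12)) + 1, d.size)
        d_q = d.copy()
        d_q[q:] = 0.0                                  # drop the trailing components
        error = float(np.std((U * d_q) @ Vt - diff_data))
        results.append((p, q, error))
    return results

def run_serial(yearly):
    """Process the years one after another."""
    return {year: pca_year(yearly[year]) for year in sorted(yearly)}

def run_parallel(yearly, workers=4):
    """Divide the years amongst worker processes."""
    years = sorted(yearly)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(pca_year, [yearly[y] for y in years]))
    return dict(zip(years, results))

rng = np.random.default_rng(5)
yearly = {year: rng.normal(size=(250, 20)) for year in range(1988, 1992)}
serial = run_serial(yearly)
print(serial[1988])
```

`run_parallel(yearly)` returns the same dictionary; on platforms that spawn worker processes it should be called from a script's `if __name__ == "__main__":` block.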

24 Timing results in Python
[Figure not recovered: Python timing results versus number of processes.]
- Starting ipcluster with 8 processes led to processes hanging.
- IPython with multiple processes led to some memory issues.
- The scalability of Python shows a similar trend to R.

25 Summary
- Both R and Python offer good choices for PCA.
- R has many packages for tasks such as downloading financial data, PCA, etc. Python has good support as well.
- R offers a cohesive framework:
  - Installing packages is pain-free.
  - Parallelization in R is very simple.
- R seems to be slower because the assignment operator requires copy operations, which is a lot of overhead (along with my use of for loops).
- Python is more forgiving of for loops, and seems to require fewer statements to do the same work.
- pandas/NumPy add dataframe capabilities to Python's native string handling capabilities to provide a strong platform for data analysis.

26 Future Work
- Profiling of the code at statement level, etc.
- How do R and Python work for memory-bound/compute-bound problems?
- Work with distributed matrices (DistNumPy for Python, pbdR for R).
- Use MPI as the backend for parallelization on a cluster.
- Make interpreted code faster for both R and Python through compilers (cmpfun for R, Cython for Python).
