Comparing R and Python for PCA PyData Boston 2013
1 Vipin Sachdeva Senior Engineer, IBM Research Comparing R and Python for PCA PyData Boston 2013
2 Comparison of R and Python for Principal Component Analysis
- R and Python are popular choices for data analysis. How do they compare in terms of programmer productivity and performance?
- Use a common task for both R and Python: Principal Component Analysis (PCA). PCA is a very commonly used technique for dimension reduction.
- Dataframes are an essential part of languages supporting data analysis. R provides a data frame along with numerous statistical packages. Python has numpy (arrays) and Pandas (dataframes) for data handling, which we use.
- Both languages have rich development environments: RStudio for R, IPython for Python.
- Both languages have many features that help in data analysis. In this talk we compare those features with some code examples that solve our problem.
- This talk is not as much about principal component analysis as about programming and performance of Python and R.
- Let's get started.
3 PCA Short Introduction
- PCA is a standard tool in modern data analysis.
- A simple method to extract information from confusing datasets.
- Reduces a complex dataset to a lower dimension.
- PCA projects the data along the directions in which the data varies the most.
- These directions are given by the eigenvectors corresponding to the largest eigenvalues.
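4 PCA Mathematical approaches
- Find the eigenvalues $\lambda_i$ (sorted from largest to smallest) and eigenvectors $u_i$ of the standardized covariance matrix.
- Choose the K largest eigenvalues whose sum, as a fraction of the sum of all N eigenvalues, exceeds a threshold:

$$\frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{N} \lambda_i} > \text{Threshold} \quad (\text{e.g., } 0.9 \text{ or } 0.95)$$

- Reduction in dimension from N to K: create data with that subset of eigenvalues, where $b_i$ is the coefficient of the data along $u_i$ and $\bar{x}$ is the mean:

$$\hat{x} - \bar{x} = \sum_{i=1}^{K} b_i u_i \quad \text{or} \quad \hat{x} = \sum_{i=1}^{K} b_i u_i + \bar{x}$$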
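The eigenvalue approach on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the talk: the function name `pca_eig` and its arguments are hypothetical, and the threshold test uses "reaches" (>=) rather than a strict inequality.

```python
import numpy as np

def pca_eig(x, threshold=0.9):
    """Return (k, x_hat): number of kept components and the rank-k reconstruction.

    x is a 2-D array with one row per observation and one column per variable.
    """
    mean = x.mean(axis=0)
    std = x.std(axis=0, ddof=1)
    z = (x - mean) / std                       # standardize each column
    cov = np.cov(z, rowvar=False)              # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    # Smallest k whose cumulative share reaches the threshold (capped at N)
    k = min(int(np.searchsorted(ratio, threshold)) + 1, len(eigvals))
    b = z @ eigvecs[:, :k]                     # coefficients b_i along each u_i
    x_hat = (b @ eigvecs[:, :k].T) * std + mean  # undo the standardization
    return k, x_hat
```

Calling `pca_eig(x, threshold=0.95)` on a days-by-stocks array would return how many components are kept and the data rebuilt from only those components.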
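5 PCA using Singular Value Decomposition (SVD)
- A more general approach for performing PCA.
- Decompose $X = U D V^T$.
- The squared singular values $D^2$ are the eigenvalues of $X^T X$, that is, of the covariance matrix up to a factor of $1/(n-1)$ when the columns of X are centered.
- Reconstruct the data by zeroing out everything beyond the first q components.
- Choose q (as before).
[Figure: the factors U, D and V^T with the regions beyond the first q components zeroed out, q ≤ p.]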
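The link between the SVD and the covariance eigenvalues can be checked numerically. The sketch below uses synthetic random data (an assumption, standing in for the stock data) and compares the scaled squared singular values with the eigenvalues of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 4))          # synthetic stand-in for the real data
xc = x - x.mean(axis=0)                # center each column

# Economy-size SVD: xc = u @ diag(d) @ vt
u, d, vt = np.linalg.svd(xc, full_matrices=False)

# Eigenvalues of the sample covariance matrix, largest first
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(xc, rowvar=False)))[::-1]

# Squared singular values, scaled by 1/(n-1), give the same numbers
svd_eigvals = d ** 2 / (xc.shape[0] - 1)
```

Because the scaling factor is the same for every eigenvalue, the proportions used to pick q are identical whether they are computed from `d ** 2` or from the covariance eigenvalues.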
6 PCA: What data to use?
- How about PCA on the data of the current S&P 500 stocks for a period of time?
- Download the symbols from the S&P 500 website and create a vector.
- Use this vector to download the symbols' data from 1970 to 2012 into a dataframe (if possible).
- R and Python have various packages for financial data download: quantmod (R), pandas.io.data.DataReader (Python).
- Need a package that provides a single dataframe as output from a single call.
[Table: example dataframe with a Dates column followed by one column per symbol (MMM, ABT, ...); cells without data are NA.]
7 Data Download R only
- I am a C/C++/Fortran HPC programmer, and I do use for loops in R and Python.
- for loops are slow in R. Can any package return the data for the S&P stocks as a single dataframe?
- Use the fImport package of R to download daily data.
    stocksdata <- yahooSeries(symbols_nospaces, from=" ", to=" ")  # symbols_nospaces is the S&P stock symbols
- Extract the columns with closing prices.
    # Snippet of code to get closing data
    colname <- paste(symbols_nospaces[i], ".close", sep="")
    print(colname)
    stockdata_df[, i+1] <- get(colname)
    colnames(stockdata_df)[i+1] <- symbols_nospaces[i]
- Write to a csv file for repeated runs (the download takes a long time).
8 Combined Data Preparation R and Python
- Read the file in R/Python to get the data: read.table in R creates an R dataframe, Pandas read_table creates a Pandas dataframe.
- Many symbols have NAs for dates where data is not available.
- Work with a subset of the data. How about 200 stocks for a quarter of a century (1988-2012)?
- Find the first occurrence of 1988 in the dataframe's Dates column: str.contains("1988") in Python, agrep(..., fixed=TRUE) in R.
- Extract the stock columns which do not have NA on the first trading day of 1988: !is.na in R, not math.isnan in Python.
- Get 200 stocks which satisfy the above requirement.
- Result: combined data for 200 stocks from 1988 to 2012 in R/Python dataframes.
- Drop rows with NA for any stock: na.omit() in R, dropna() in Pandas.
- 6162 entries for 200 stocks in total.
- R and Python are remarkably similar for this step.
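The selection rules above can be sketched on a plain NumPy array, with NaN playing the role of NA. This is only an illustration under that assumption: the function name `select_complete_stocks` is hypothetical, and the talk itself works on dataframes rather than bare arrays.

```python
import numpy as np

def select_complete_stocks(prices, numstocks):
    """Pick the first `numstocks` columns that have data on the first day,
    then drop every row that still contains a missing value.

    prices: 2-D float array, rows = trading days, columns = stocks, NaN = missing.
    """
    has_start = ~np.isnan(prices[0])                  # traded on the first day?
    keep_cols = np.flatnonzero(has_start)[:numstocks]
    subset = prices[:, keep_cols]
    complete_rows = ~np.isnan(subset).any(axis=1)     # same idea as na.omit() / dropna()
    return keep_cols, subset[complete_rows]
```

The function returns the indices of the kept columns together with the cleaned array, so the matching stock symbols can still be looked up afterwards.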
9 Code in R and Python for data preparation
Python code
    import math
    import pandas as pd

    def extractdata(filename, lowyear, numstocks):
        # Whitespace-separated file written from R; the header keeps its
        # quotes, hence the column name '"dates"'.
        stocksreturns = pd.read_table(filename, sep=r'\s+')
        dates = stocksreturns['"dates"']
        # Position of the first row whose date contains the starting year
        x = dates.str.contains(str(lowyear))
        for i in range(len(x)):
            if x.iloc[i]:
                break
        startindex = i
        colnames = stocksreturns.columns
        stocksreturns_short = pd.DataFrame(dates.iloc[startindex:])
        stockindex = 0
        for i in range(1, 501):
            if stockindex < numstocks:
                column = stocksreturns[colnames[i]]
                # Keep only stocks with data on the first trading day
                if not math.isnan(column.iloc[startindex]):
                    stocksreturns_short[colnames[i]] = column.iloc[startindex:]
                    stockindex = stockindex + 1
        return stocksreturns_short
R code
    extractdata <- function(filename, yearrange, numstocks) {
      stocksreturns <- read.table(filename, header = TRUE)
      yrpattern <- paste(yearrange[1], "-*", sep = "")
      stockdates <- stocksreturns$dates
      # First row of the starting year and last row of the data
      sindex <- agrep(yrpattern, stockdates, fixed = TRUE)[1]
      eindex <- length(stockdates)
      colnames <- colnames(stocksreturns)
      stocksreturns_short <- data.frame(stockdates[sindex:eindex])
      colnames(stocksreturns_short)[1] <- "dates"
      j <- 2
      stockindex <- 0
      for (i in 2:501) {
        if (stockindex < numstocks) {
          # Keep only stocks with data on the first trading day
          if (!is.na(stocksreturns[, i][sindex])) {
            stocksreturns_short[, j] <- stocksreturns[, i][sindex:eindex]
            colnames(stocksreturns_short)[j] <- colnames[i]
            j <- j + 1
            stockindex <- stockindex + 1
          }
        }
      }
      stocksreturns_short
    }
10 R/Python packages for PCA
- Size of the data is only about 23 MB in txt format. No memory-bound issues running on my laptop.
- R has many choices (one too many) for PCA: prcomp/princomp/pca/dudi.pca/acp.
- prcomp scales and centers the data (very convenient):
    prcomp(stocksreturns_short, scale=TRUE, center=TRUE, retx=TRUE)
- Reconstruct the data with the predict function. prcomp uses svd beneath the covers.
- Python seems to have several choices for PCA as well: matplotlib.mlab.PCA, MDP (Modular toolkit for Data Processing) PCA, numpy.linalg.eig/scipy.linalg.eig etc. for the covariance matrix/eigenvalues/eigenvectors approach.
- Both languages seem to have adequate support for PCA in multiple ways.
- Our approach: use SVD in both R and Python, do the same operations and compare runtimes.
- svd in numpy returns transpose(V), while R returns V.
- Both R and Python return d as a vector; it is trivial to make a diagonal matrix for reconstruction of the data.
- Indices start from 0 in Python and from 1 in R.
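The NumPy conventions mentioned on this slide are easy to see on a small synthetic matrix (the data below is made up for illustration): `np.linalg.svd` hands back the singular values as a 1-D vector and V already transposed, and 0-based, end-exclusive slices select the first q components.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 3))            # synthetic data, 6 observations x 3 variables

u, d, vt = np.linalg.svd(x, full_matrices=False)
# d has shape (3,): a vector, not a matrix
# vt is V transposed: its rows are the right singular vectors

d_diag = np.diag(d)                    # build the diagonal matrix explicitly
x_rebuilt = u @ d_diag @ vt            # no extra transpose needed in NumPy

# Keep the first q components: indices 0 .. q-1
q = 2
x_rank_q = u[:, :q] @ d_diag[:q, :q] @ vt[:q, :]
```

In R the equivalent reconstruction needs `t(v)` because `svd()` returns V itself, and the first q components are indexed `1:q`.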
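11 PCA on combined data using SVD
- PCA is an SVD operation: X is the stocks data (6162x200).
- D*D gives the eigenvalues (p=200).
- Reconstruct the data by zeroing out everything beyond the first q components.
- Choose q ≤ p (explained ahead).
[Figure: the factors U, D and V^T with the regions beyond the first q components zeroed out.]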
12 PCA on combined data Approach
- Perform an SVD of the stock returns data.
- Find the number of eigenvalues q comprising 50%, 75%, 90% and 100% of the sum of all the eigenvalues. Eigenvalues = d*d from the SVD.
- Zero out the remaining eigenvectors/eigenvalues.
- In Python, use copy.copy for copying the eigenvalues/vectors from the SVD (assignment is done using references).
- Reconstruct the data with matrix-multiply operations: X_reconstructed = U*D*t(V).
- Measure std_dev(data_reconstructed - original_data).
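The steps of this approach can be put together as one small function. This is a hedged sketch rather than the talk's implementation: the name `reconstruct` is hypothetical, and the search for q is vectorized with `np.cumsum` instead of an explicit loop.

```python
import numpy as np

def reconstruct(x, proportion):
    """Return (q, error_std) for a reconstruction of x that keeps the first q
    components, where q is the smallest count whose eigenvalues reach
    `proportion` of the total eigenvalue sum."""
    u, d, vt = np.linalg.svd(x, full_matrices=False)
    eigvals = d * d
    share = np.cumsum(eigvals) / eigvals.sum()
    q = min(int(np.searchsorted(share, proportion)) + 1, len(d))
    d_kept = d.copy()                  # explicit copy, plain assignment would alias d
    d_kept[q:] = 0.0                   # zero out the remaining singular values
    x_hat = u @ np.diag(d_kept) @ vt
    return q, float(np.std(x_hat - x))
```

Looping over `(0.5, 0.75, 0.90, 1.0)` and calling `reconstruct(x, p)` would give the component count and the reconstruction error for each proportion; zeroing the singular values alone is enough, since the corresponding columns of U and rows of V^T are then multiplied by zero.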
13 Code in Python for PCA on combined data
    import copy
    import numpy as np

    colnames = stocksreturns_short.columns
    # Daily differences per stock; column 0 of the dataframe holds the dates
    diff_data_combined = np.zeros(shape=(stocksreturns_short.shape[0] - 1, numstocks))
    for i in range(0, numstocks):
        diff_data_combined[:, i] = np.diff(stocksreturns_short.iloc[:, i + 1])
    u_original, d_original, v_original = np.linalg.svd(diff_data_combined, full_matrices=False)
    d_diag_original = np.diag(d_original)
    eigvals_combined = d_original * d_original
    totalsumeigvals = sum(eigvals_combined)
    nvals = np.shape(diff_data_combined)[1]
    for percent in eigvalspercent:
        sumeigvals = 0
        neigvals = nvals  # fallback if rounding keeps the running sum below the target
        for i in range(0, nvals):
            sumeigvals = sumeigvals + eigvals_combined[i]
            if sumeigvals >= (percent * totalsumeigvals):
                neigvals = i + 1
                break
        # Copy explicitly: plain assignment would only create another reference
        u = copy.copy(u_original)
        d_diag = copy.copy(d_diag_original)
        v = copy.copy(v_original)  # numpy returns V transposed
        # Zero everything beyond the first neigvals components
        u[:, neigvals:] = 0
        d_diag[neigvals:, neigvals:] = 0
        v[neigvals:, :] = 0
        dproduct = np.dot(u, np.dot(d_diag, v))
14 Code in R for PCA on combined data
    data.combined <- diff_stockyr
    svd.combined <- svd(data.combined)  # find SVD
    eigvalues.combined <- svd.combined$d * svd.combined$d
    totalsum <- sum(eigvalues.combined)
    proportionrange <- c(0.5, 0.75, 0.90, 1)
    nvals <- dim(data.combined)[2]
    for (proportion in proportionrange) {
      sum <- 0
      neigvalues <- 0
      for (i in 1:numstocks) {
        sum <- sum + eigvalues.combined[i]
        neigvalues <- neigvalues + 1
        if ((sum / totalsum) >= proportion) {
          cat(sprintf("number of eigenvalues for combined data for proportion %f = %d\n", proportion, neigvalues))
          break
        }
      }
      # Copy SVD matrices
      u <- svd.combined$u
      d <- diag(svd.combined$d)
      v <- svd.combined$v
      # Zero the components after the first neigvalues (R indices start at 1)
      if (neigvalues < nvals) {
        dropidx <- (neigvalues + 1):nvals
        u[, dropidx] <- 0
        d[dropidx, dropidx] <- 0
        v[, dropidx] <- 0
      }
      stock.data <- u %*% d %*% t(v)  # Do a matrix multiply to get data
    }
15 PCA on combined data results
- 138 of the 200 eigenvalues account for 90% of the sum of all the eigenvalues.
- Reconstructing the data with 138 components has negligible error (10^-5).
16 Yearly PCA
- Instead of doing a PCA on the combined data from 1988 to 2012, how about a yearly PCA?
- PCA on yearly data: separate the combined dataframe into yearly dataframes (1 for each year). The number of observations varies from year to year.
- Calculate the number of eigenvalues accounting for 50%, 75%, 90% and 100% of the sum of all eigenvalues (same operation as PCA on the combined data).
- Do a reconstruction for each proportion and each year (same operation as PCA on the combined data).
- 25 separate PCAs, 100 reconstructions in total.
[Table: Dataframe 1 ... Dataframe 25, each with a Dates column followed by one column per symbol (MMM, ABT, ...).]
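Splitting the combined data into one block per year can be sketched as follows. The function name `split_by_year` is hypothetical, and the sketch assumes the dates are strings that begin with a four-digit year (for example `1988-01-04`), which matches the `"1988-*"` pattern used in the talk's code.

```python
import numpy as np

def split_by_year(dates, values):
    """Group the rows of `values` by the year found at the start of each date.

    dates: sequence of strings beginning with a four-digit year.
    values: 2-D array with one row per date.
    Returns a dict mapping year -> array holding that year's rows.
    """
    years = np.array([int(d[:4]) for d in dates])
    return {int(y): values[years == y] for y in np.unique(years)}
```

Each entry of the returned dictionary can then be differenced and decomposed on its own, which gives the 25 separate PCAs described above.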
17 Code in Python and R for extracting yearly dataframes
Python code
    colnames = stocksreturns_short.columns
    # Rows of the combined dataframe that belong to the given year
    x = stocksreturns_short['"dates"'].str.contains(str(years))
    stocksreturns_yr = stocksreturns_short[x]
    shape0, shape1 = np.shape(stocksreturns_yr)
    data_yr = np.zeros((shape0, numstocks))
    diff_data_yr = np.zeros((shape0 - 1, numstocks))
    for i in range(0, numstocks):
        data_yr[:, i] = stocksreturns_yr[colnames[i + 1]]
        diff_data_yr[:, i] = np.diff(data_yr[:, i])
R code
    yrpattern <- paste(years, "-*", sep = "")
    yrindices <- agrep(yrpattern, stocksreturns_short$Dates, fixed = TRUE)
    yrrows <- yrindices[1]:tail(yrindices, n = 1)
    val_stockyr <- data.frame(stocksreturns_short$Dates[yrrows])
    colnames(val_stockyr)[1] <- "dates"
    for (i in 2:(numstocks + 1)) {
      val_stockyr[, i] <- stocksreturns_short[, i][yrrows]
      colnames(val_stockyr)[i] <- colnames(stocksreturns_short)[i]
    }
    # log_stockyr is derived from val_stockyr in a step not shown on the slide
    diff_stockyr <- data.frame(matrix(NA, nrow = (dim(log_stockyr)[1] - 1), ncol = numstocks))
    for (j in 1:200) diff_stockyr[, j] <- diff(log_stockyr[, j])
18 Analysis of yearly PCA data
- The number of eigenvalues accounting for 50% of the total sum drops to 1 in 2008.
- Stock movement is highly correlated due to macro-economic trends.
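Why strong correlation shows up as a single dominant eigenvalue can be illustrated with synthetic data. Everything below is made up for the illustration (the sizes, the factor strength of 5.0 and the name `components_for`); it is not the stock data from the talk.

```python
import numpy as np

def components_for(x, proportion):
    """Smallest number of components whose eigenvalues reach `proportion` of the total."""
    d = np.linalg.svd(x, compute_uv=False)        # singular values, largest first
    share = np.cumsum(d * d) / np.sum(d * d)
    return min(int(np.searchsorted(share, proportion)) + 1, len(d))

rng = np.random.default_rng(0)
n_days, n_stocks = 250, 20
independent = rng.normal(size=(n_days, n_stocks))  # every stock moves on its own
market = rng.normal(size=(n_days, 1))              # one shared "macro" factor
correlated = 5.0 * market + independent            # every stock follows the factor

q_independent = components_for(independent, 0.5)
q_correlated = components_for(correlated, 0.5)
```

With the shared factor dominating, one component is enough to reach 50% of the eigenvalue sum, while the independent series need several.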
19 PCA on yearly data
- Being a C/C++/Fortran HPC programmer, I use for loops in R/Python.
- Not efficient for R (for loop is an object; assignment in R is a copy operation).
- Python's assignment is done with references, so it works better with for loops and has less function overhead.
- Development environment: ipython-2.7 with pandas and numpy installed through the ports package; RStudio 0.97 with the R binary downloaded for Mac.
- No attempt to optimize the build for either R or Python.
- The total code takes above 20 seconds in R versus about 11.9 seconds in Python on my MacBook Pro.
- Timings may change with less reliance on for loops in the code.
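The point about Python assignment working with references can be shown on a tiny NumPy array. This is a minimal sketch with made-up values: binding a second name copies no data, so changes through either name are visible through both, while `copy.copy` produces an independent array.

```python
import copy
import numpy as np

a = np.array([3.0, 2.0, 1.0])
alias = a                     # no data is copied: both names refer to one array
alias[2] = 0.0                # the change is visible through `a` as well

b = np.array([3.0, 2.0, 1.0])
independent = copy.copy(b)    # for NumPy arrays this has the same effect as b.copy()
independent[2] = 0.0          # `b` keeps its original values
```

This is why the reconstruction code copies the SVD factors before zeroing parts of them: without the copy, the first proportion would overwrite the factors needed for the next one.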
20 Parallelizing yearly PCA
- Can we use parallelism in R and Python productively?
- Both R and Python provide several ways to parallelize: multiple cores, and distributed parallelism using MPI or sockets.
- Use coarse-grained parallelism to speed up our computations. Look into how both languages allow use of the multiple cores on modern processors.
- It is very easy to apply coarse-grained parallelism to the yearly PCA: divide the years amongst threads/processes.
- For R, use the doMC/foreach packages, which work on multiple cores.
- Python threads do not work well due to the global interpreter lock (GIL). Use the IPython ipcluster parallelization framework.
- Further evaluation using MPI on distributed clusters is needed.
21 Parallelizing yearly PCA in R
- foreach depends on a backend for execution.
- We register doMC (multiple cores) as the backend for the yearly PCA in this case.
- MPI can also be used as a backend for distributed clusters.
- The snow package is another option (higher level, for distributed clusters).
- Not just limited to for loops: use mclapply for a multi-core lapply, etc.
    # Sequential code
    for (years in 1988:2012) {
      for (proportion in c(0.5, 0.75, 0.90, 1)) {
        # Code to extract yearly data, do PCA and then reconstruct
      }
    }
    # Parallel code
    library(doMC)    # also loads foreach
    registerDoMC(4)  # Register multicore as backend with 4 cores
    foreach(years = 1988:2012) %dopar% {
      for (proportion in c(0.5, 0.75, 0.90, 1)) {
        # Code to extract yearly data, do PCA and then reconstruct
      }
    }
22 Timing Results
[Figure: timing results by number of threads on an Intel Core i7 MacBook Pro (4 cores, 8 hyper-threading threads).]
23 Parallelizing yearly PCA in Python
- Using IPython's Direct Interface.
- Start the IPython backend:
    % ipcluster-2.7 start -n 4    (4 is the number of processes)
- Rewrite the pcadata function so that it can be used with the map API: pcadata(stocksreturns_short, year).
- pcadata extracts the data for year from stocksreturns_short, performs an SVD and then reconstructs with the eigenvalue percentages as before.
- Processes (unlike threads in R) make us re-import all the modules inside the function. Higher memory footprint; more heavyweight compared to threads.
- Create a list for each process's function arguments.
- Parallelize across years as in R. Each process computes a subset of the SVDs and reuses a single SVD for 4 reconstructions.
    # code for map_sync
    from datetime import datetime
    from IPython.parallel import Client

    dview = Client()[:]  # direct view on all running engines
    x = []
    for i in range(0, 25):
        x.append(stocksreturns_short)
    starttime = datetime.now()
    dview.map_sync(pcadata, x, range(1988, 2013))
    print(datetime.now() - starttime)
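The same divide-the-years idea can also be sketched with the standard library instead of ipcluster. Note the swap: this uses `concurrent.futures`, not the IPython parallel API from the talk, and the names `pca_year` and `pca_all_years` are hypothetical. Each task performs one SVD and reuses it for all four reconstructions.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

PROPORTIONS = (0.5, 0.75, 0.90, 1.0)

def pca_year(task):
    """One SVD per year, reused for every reconstruction proportion."""
    year, x = task
    u, d, vt = np.linalg.svd(x, full_matrices=False)
    share = np.cumsum(d * d) / np.sum(d * d)
    errors = {}
    for p in PROPORTIONS:
        q = min(int(np.searchsorted(share, p)) + 1, len(d))
        x_hat = (u[:, :q] * d[:q]) @ vt[:q]      # rank-q reconstruction
        errors[p] = float(np.std(x_hat - x))
    return year, errors

def pca_all_years(data_by_year, workers=4, executor_cls=ProcessPoolExecutor):
    """Map pca_year over {year: array} with a pool of workers.

    Pass executor_cls=ThreadPoolExecutor to stay inside one process.
    """
    tasks = sorted(data_by_year.items())
    with executor_cls(max_workers=workers) as pool:
        return dict(pool.map(pca_year, tasks))
```

When run as a script with the default process pool, the call to `pca_all_years` belongs under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.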
24 Timing results in Python
[Figure: timing results by number of threads.]
- Starting ipcluster with 8 processes leads to processes hanging.
- IPython with multiple processes led to some memory issues.
- The scalability of Python shows a similar trend to R.
25 Summary
- Both R and Python offer good choices for PCA.
- R has many packages for tasks such as downloading financial data, PCA etc. Python has good support as well.
- R offers a cohesive framework. Installing packages is pain-free.
- Parallelization in R is very simple.
- R seems to be slower, as the assignment operator requires copy operations, which is a lot of overhead (and so does my use of for loops).
- Python is more forgiving of the use of for loops, and seems to require fewer statements to do the same work.
- Pandas/Numpy add dataframe capabilities to Python's native string handling capabilities to provide a strong platform for data analysis.
26 Future Work
- Profiling of the code at statement level etc.
- How do R and Python work for memory-bound/compute-bound problems?
- Work with distributed matrices (DistNumPy for Python, pbdR for R).
- Use MPI as the backend for parallelization on a cluster.
- Make interpreted code faster for both R and Python through compilers (cmpfun for R, Cython for Python).
More informationHigh Performance Computing. Introduction to Parallel Computing
High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials
More informationStatistical Methods and Optimization in Data Mining
Statistical Methods and Optimization in Data Mining Eloísa Macedo 1, Adelaide Freitas 2 1 University of Aveiro, Aveiro, Portugal; macedo@ua.pt 2 University of Aveiro, Aveiro, Portugal; adelaide@ua.pt The
More informationAn Approximate Singular Value Decomposition of Large Matrices in Julia
An Approximate Singular Value Decomposition of Large Matrices in Julia Alexander J. Turner 1, 1 Harvard University, School of Engineering and Applied Sciences, Cambridge, MA, USA. In this project, I implement
More informationSSE Vectorization of the EM Algorithm. for Mixture of Gaussians Density Estimation
Tyler Karrels ECE 734 Spring 2010 1. Introduction SSE Vectorization of the EM Algorithm for Mixture of Gaussians Density Estimation The Expectation-Maximization (EM) algorithm is a popular tool for determining
More informationA Faster Parallel Algorithm for Analyzing Drug-Drug Interaction from MEDLINE Database
A Faster Parallel Algorithm for Analyzing Drug-Drug Interaction from MEDLINE Database Sulav Malla, Kartik Anil Reddy, Song Yang Department of Computer Science and Engineering University of South Florida
More informationVisual Representations for Machine Learning
Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering
More informationLecture Topic Projects
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, basic tasks, data types 3 Introduction to D3, basic vis techniques for non-spatial data Project #1 out 4 Data
More informationOptimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*
Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating
More informationPrincipal Component Analysis for Distributed Data
Principal Component Analysis for Distributed Data David Woodruff IBM Almaden Based on works with Ken Clarkson, Ravi Kannan, and Santosh Vempala Outline 1. What is low rank approximation? 2. How do we solve
More informationExploring Parallelism in. Joseph Pantoga Jon Simington
Exploring Parallelism in Joseph Pantoga Jon Simington Why bring parallelism to Python? - We love Python (and you should, too!) - Interacts very well with C / C++ via python.h and CPython - Rapid development
More informationSolving Large-Scale Energy System Models
Solving Large-Scale Energy System Models Frederik Fiand Operations Research Analyst GAMS Software GmbH GAMS Development Corp. GAMS Software GmbH www.gams.com Agenda 1. GAMS System Overview 2. BEAM-ME Background
More informationIntroduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines
Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi
More informationLSRN: A Parallel Iterative Solver for Strongly Over- or Under-Determined Systems
LSRN: A Parallel Iterative Solver for Strongly Over- or Under-Determined Systems Xiangrui Meng Joint with Michael A. Saunders and Michael W. Mahoney Stanford University June 19, 2012 Meng, Saunders, Mahoney
More informationCSE 258 Lecture 5. Web Mining and Recommender Systems. Dimensionality Reduction
CSE 258 Lecture 5 Web Mining and Recommender Systems Dimensionality Reduction This week How can we build low dimensional representations of high dimensional data? e.g. how might we (compactly!) represent
More informationHigh Performance Computing with Python
High Performance Computing with Python Pawel Pomorski SHARCNET University of Waterloo ppomorsk@sharcnet.ca April 29,2015 Outline Speeding up Python code with NumPy Speeding up Python code with Cython Using
More informationGetting the most out of your CPUs Parallel computing strategies in R
Getting the most out of your CPUs Parallel computing strategies in R Stefan Theussl Department of Statistics and Mathematics Wirtschaftsuniversität Wien July 2, 2008 Outline Introduction Parallel Computing
More informationCISC 322 Software Architecture
CISC 322 Software Architecture Lecture 04: Non Functional Requirements (NFR) Quality Attributes Emad Shihab Adapted from Ahmed E. Hassan and Ian Gorton Last Class - Recap Lot of ambiguity within stakeholders
More informationProgramming Exercise 7: K-means Clustering and Principal Component Analysis
Programming Exercise 7: K-means Clustering and Principal Component Analysis Machine Learning May 13, 2012 Introduction In this exercise, you will implement the K-means clustering algorithm and apply it
More informationCommunication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures
Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Rolf Rabenseifner rabenseifner@hlrs.de Gerhard Wellein gerhard.wellein@rrze.uni-erlangen.de University of Stuttgart
More informationGPU Technology Conference 2015 Silicon Valley
GPU Technology Conference 2015 Silicon Valley Big Data in Real Time: An Approach to Predictive Analytics for Alpha Generation and Risk Management Yigal Jhirad and Blay Tarnoff March 19, 2015 Table of Contents
More informationData Analytics and Machine Learning: From Node to Cluster
Data Analytics and Machine Learning: From Node to Cluster Presented by Viswanath Puttagunta Ganesh Raju Understanding use cases to optimize on ARM Ecosystem Date BKK16-404B March 10th, 2016 Event Linaro
More informationA Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu University of Illinois
A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs Li-Wen Chang, Wen-mei Hwu University of Illinois A Scalable, Numerically Stable, High- How to Build a gtsv for Performance
More informationCSCI-580 Advanced High Performance Computing
CSCI-580 Advanced High Performance Computing Performance Hacking: Matrix Multiplication Bo Wu Colorado School of Mines Most content of the slides is from: Saman Amarasinghe (MIT) Square-Matrix Multiplication!2
More information2. Data Preprocessing
2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459
More informationClustering and Dimensionality Reduction
Clustering and Dimensionality Reduction Some material on these is slides borrowed from Andrew Moore's excellent machine learning tutorials located at: Data Mining Automatically extracting meaning from
More informationDATA MINING TEST 2 INSTRUCTIONS: this test consists of 4 questions you may attempt all questions. maximum marks = 100 bonus marks available = 10
COMP717, Data Mining with R, Test Two, Tuesday the 28 th of May, 2013, 8h30-11h30 1 DATA MINING TEST 2 INSTRUCTIONS: this test consists of 4 questions you may attempt all questions. maximum marks = 100
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationParallel Programming. Presentation to Linux Users of Victoria, Inc. November 4th, 2015
Parallel Programming Presentation to Linux Users of Victoria, Inc. November 4th, 2015 http://levlafayette.com 1.0 What Is Parallel Programming? 1.1 Historically, software has been written for serial computation
More informationGetting Started with doparallel and foreach
Steve Weston and Rich Calaway doc@revolutionanalytics.com September 19, 2017 1 Introduction The doparallel package is a parallel backend for the foreach package. It provides a mechanism needed to execute
More informationCHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS
CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS 8.1 Introduction The recognition systems developed so far were for simple characters comprising of consonants and vowels. But there is one
More information