Data Analytics and Dynamic Languages. Lee E. Edlefsen, Ph.D. VP of Engineering

Similar documents
MATLAB is a multi-paradigm numerical computing environment fourth-generation programming language. A proprietary programming language developed by

JAVA An overview for C++ programmers

The name of our class will be Yo. Type that in where it says Class Name. Don t hit the OK button yet.

DCS/100: Procedural Programming

Pointers. A pointer is simply a reference to a variable/object. Compilers automatically generate code to store/retrieve variables from memory

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

High Performance Computing and Programming, Lecture 3

Computer Vision. Matlab

18-642: Code Style for Compilers

How To Test Your Code A CS 1371 Homework Guide

How to program with Matlab (PART 1/3)

MatLab Project # 1 Due IN TUTORIAL Wednesday October 30

Fractional. Design of Experiments. Overview. Scenario

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Rules and syntax for inheritance. The boring stuff

Matching Logic A New Program Verification Approach

SML: Specialized Matrix Language White Paper

Virtual machines (e.g., VMware)

(Refer Slide Time 01:41 min)

Guide - The limitations in screen layout using the Item Placement Tool

What run-time services could help scientific programming?

A Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co-

Introduction to Concurrency (Processes, Threads, Interrupts, etc.)

Programming Style and Optimisations - An Overview

Operator overloading

Handout: Handy Computer Tools

Week - 01 Lecture - 04 Downloading and installing Python

It was a dark and stormy night. Seriously. There was a rain storm in Wisconsin, and the line noise dialing into the Unix machines was bad enough to

Common Lisp. In Practical Usage

How to Break Software by James Whittaker

How Rust is Tilde s Competitive Advantage

Slide Set 1. for ENCM 339 Fall Steve Norman, PhD, PEng. Electrical & Computer Engineering Schulich School of Engineering University of Calgary

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

Math 250A (Fall 2009) - Lab I: Estimate Integrals Numerically with Matlab. Due Date: Monday, September 21, INSTRUCTIONS

CS143 Handout 05 Summer 2011 June 22, 2011 Programming Project 1: Lexical Analysis

Advances in Memory Management and Symbol Lookup in pqr

Advances in Programming Languages: Exam preparation and exam questions

Sending an to Your Chapter Through Neon

Heap Arrays and Linked Lists. Steven R. Bagley

Chapter 6 Control Flow. June 9, 2015

Matrex Table of Contents

Risk Management. I.T. Mock Interview Correction

Business and Scientific Applications of the Java Programming Language

An OCaml-based automated theorem-proving textbook

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

Writing MATLAB Programs

Centrality Book. cohesion.

TOOLS AND TECHNIQUES FOR TEST-DRIVEN LEARNING IN CS1

Advanced Operations Research Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras

PIXEL SPREADSHEET USER MANUAL April 2011

Computer Science 2500 Computer Organization Rensselaer Polytechnic Institute Spring Topic Notes: C and Unix Overview

LECTURE 11. Memory Hierarchy

Reflections on Dynamic Languages and Parallelism. David Padua University of Illinois at Urbana-Champaign

Heap Arrays. Steven R. Bagley

8. The C++ language, 1. Programming and Algorithms II Degree in Bioinformatics Fall 2017

Matlab for FMRI Module 1: the basics Instructor: Luis Hernandez-Garcia

18-642: Code Style for Compilers

Computer Science 322 Operating Systems Mount Holyoke College Spring Topic Notes: C and Unix Overview

Advanced Compiler Design. CSE 231 Instructor: Sorin Lerner

BENCHMARKING OCTAVE, R AND PYTHON PLATFORMS FOR CODE PROTOTYPING IN DATA ANALYTICS AND MACHINE LEARNING

Concepts of Programming Languages

Copyright 2013 Thomas W. Doeppner. IX 1

Forth Meets Smalltalk. A Presentation to SVFIG October 23, 2010 by Douglas B. Hoffman

CS252 Advanced Programming Language Principles. Prof. Tom Austin San José State University Fall 2013

Math 2250 Lab #3: Landing on Target

Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Scripted Components: Problem. Scripted Components. Problems with Components. Single-Language Assumption. Dr. James A. Bednar

Scripted Components Dr. James A. Bednar

CS153: Compilers Lecture 1: Introduction

Getting To Know Matlab

C introduction: part 1

Computational Methods CMSC/AMSC/MAPL 460. Vectors, Matrices, Linear Systems, LU Decomposition, Ramani Duraiswami, Dept. of Computer Science

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

Principles of Programming Languages. Lecture Outline

Handout: Handy Computer Tools

Effective Threat Modeling using TAM

1 Motivation for Improving Matrix Multiplication

Seminar report Java Submitted in partial fulfillment of the requirement for the award of degree Of CSE

Exercise 4: Loops, Arrays and Files

Chamberlin and Boyce - SEQUEL: A Structured English Query Language

GNU Octave JIT Compilation - Report

Search Engine Optimization (SEO) using HTML Meta-Tags

Subscripts and sizes should be signed

Overview. Why Pointers?

CS 390 Chapter 2 Homework Solutions

An Introduction to Balder An OpenMP Run-time Library for Clusters of SMPs

Non 8-bit byte support in Clang and LLVM Ed Jones, Simon Cook


MATLAB. Devon Cormack and James Staley

************ THIS PROGRAM IS NOT ELIGIBLE FOR LATE SUBMISSION. ALL SUBMISSIONS MUST BE RECEIVED BY THE DUE DATE/TIME INDICATED ABOVE HERE

2.7 Numerical Linear Algebra Software

Advanced Computer Architecture Lab 3 Scalability of the Gauss-Seidel Algorithm

Alan J. Perlis - Epigrams on Programming

Understanding simhwrt Output

These are notes for the third lecture; if statements and loops.

(Refer Slide Time: 1:27)

Late binding Ch 15.3

int $0x32 // call interrupt number 50

Visual Basic Primer A. A. Cousins

Assignment 3 Functions, Graphics, and Decomposition

Transcription:

Data Analytics and Dynamic Languages Lee E. Edlefsen, Ph.D. VP of Engineering 1

Overview This is my perspective on the use of dynamic languages (interpreters) for data analytics (statistics) I am a long-time user and commercial developer of dynamic languages for data analysis, but I am also a hard-core C++ programmer 2

Outline A bit about my experience Good/bad things about dynamic languages for data analysis, with special focus on R Why statistics/data analytics requires its own language The importance of parallel and distributed computing for data analysis What I think would be the ideal language for statistics Revolution R Enterprise 3

Dynamic languages I have used and/or developed APL tried using to teach econometrics with it in early 1980 s MATLAB tried using a very early version to teach econometrics in the early 80 s Gauss designed, developed, and commercialized this matrix-oriented statistical system in mid-80 s Revolution R Enterprise 4

Dynamic languages I have used and/or developed -- 2 Axum designed and developed this commercial technical graphics package, and wrote a C-like interpreter for doing data transformations S-PLUS was VP of Development for S- PLUS (at MathSoft) in late 90 s. S-PLUS is the commercial version of S. Revolution R Enterprise 5

Dynamic languages I have used and/or developed -- 3 ExaStat designed and developed this C++ based data analysis system that can be both interpreted (using CINT) and compiled Threading is built in for most array and matrix computations Includes a framework for automatically parallelizing and distributing a broad class of statistical algorithms R -- currently VP of Engineering at Revolution Analytics, which focuses on R Revolution R Enterprise 6

Features that make dynamic languages so nice for data analysis Command line for quick experimentation Ability to work directly with arrays, sets of arrays, and matrices Environment for inspection of objects Availability of functions for doing common operations Natural syntax that corresponds to subject matter don t have to be a hard core programmer to get things done Revolution R Enterprise 7

More nice features Reduced need to write loops Don t have to worry about specifying data types No compile/link/load time No worries about memory allocation, pointers, headers, linking errors, loading problems Revolution R Enterprise 8

Problems with most dynamic languages for data analysis Speed, especially compared with the best compiled code Loops tend to be especially slow Problems scaling with data size Often requires translation into compiled code before use in production environment Insufficient range of data types Lack of support for parallel and distributed computing Revolution R Enterprise 9

Especially nice features of R Common language used by practitioners around the world Much of new statistics over the past 10-15 years has been implemented in R Easy ability to download and install new packages; thousands are available Excellent interface to compiled languages Functions available for doing almost anything data-related Revolution R Enterprise 10

Particular problems with R Slow, especially for loops over data (about 100,000 times slower than C++ for simple loops) Memory hog; multiple copies of read-only data can be made during a simple analysis Not enough data types Encourages bad coding practices especially use of globals Not thread-safe Revolution R Enterprise 11

Why statistics needs its own language(s) A natural and familiar syntax is important It is necessary to have data set objects with column and row names, data descriptions, mixed data types Easy, fast I/O of these objects Missing value handling is crucial and must be built-in to almost all functionality Proper handling of categorical data is also critical Revolution R Enterprise 12

The importance of parallel and distributed computing in data analysis Our ability to collect and store data is rapidly and greatly outpacing our ability to analyze that data To analyze all this data we must use multiple cores and multiple hard drives This means we need software that distributes computations across cores and computers and puts the results back together as if the work had been done on one core Revolution R Enterprise 13

My view of the ideal statistical language Based on the R syntax, but fixing the main problems Implemented using LLVM or something similar Extended range of data types Allow passing objects by reference Perhaps allow type specifications to enable increased speed of loops over data Have built-in support for parallel and distributed computations Revolution R Enterprise 14