pandas: Rich Data Analysis Tools for Quant Finance

Similar documents
Python for Data Analysis

Data Science with Python Course Catalog

Certified Data Science with Python Professional VS-1442

Python Quant Platform

Pandas and Friends. Austin Godber Mail: Source:

Python for Data Analysis. Prof.Sushila Aghav-Palwe Assistant Professor MIT

Command Line and Python Introduction. Jennifer Helsby, Eric Potash Computation for Public Policy Lecture 2: January 7, 2016

Python With Data Science

About Intellipaat. About the Course. Why Take This Course?

Pandas UDF Scalable Analysis with Python and PySpark. Li Jin, Two Sigma Investments

HANDS ON DATA MINING. By Amit Somech. Workshop in Data-science, March 2016

Problem Based Learning 2018

David J. Pine. Introduction to Python for Science & Engineering

Data Science Bootcamp Curriculum. NYC Data Science Academy

Monthly SEO Report. Example Client 16 November 2012 Scott Lawson. Date. Prepared by

Scientific Computing with Python. Quick Introduction

This report is based on sampled data. Jun 1 Jul 6 Aug 10 Sep 14 Oct 19 Nov 23 Dec 28 Feb 1 Mar 8 Apr 12 May 17 Ju

SQL Server 2017: Data Science with Python or R?

Python for Quant Finance

DATA STRUCTURE AND ALGORITHM USING PYTHON

Python Scripting for Computational Science

Matplotlib Python Plotting

Introduction to Programming

Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core)

Python Scripting for Computational Science

Introducing Python Pandas

ARTIFICIAL INTELLIGENCE AND PYTHON

ICT PROFESSIONAL MICROSOFT OFFICE SCHEDULE MIDRAND

DSC 201: Data Analysis & Visualization

Python & Spark PTT18/19

Python for Scientists and Engineers

SQL Server Machine Learning Marek Chmel & Vladimir Muzny

Scientific computing platforms at PGI / JCNS

Webinar Series. Introduction To Python For Data Analysis March 19, With Interactive Brokers

COURSE LISTING. Courses Listed. Training for Database & Technology with Modeling in SAP HANA. 20 November 2017 (12:10 GMT) Beginner.

Python for Data Analysis

The Python interpreter

Introduction to Scientific Computing with Python, part two.

Programming for Data Science Syllabus

Python Certification Training

What you learned so far. Loops & Arrays efficiency for statements while statements. Assignment Plan. Required Reading. Objective 2/3/2018

COURSE LISTING. Courses Listed. with HANA Programming. 13 February 2018 (04:51 GMT) HA100 - SAP HANA

DATA SCIENCE INTRODUCTION QSHORE TECHNOLOGIES. About the Course:

Instructor : Dr. Sunnie Chung. Independent Study Spring Pentaho. 1 P a g e

Open Data Standards for Administrative Data Processing

Ch.1 Introduction. Why Machine Learning (ML)?

Kenora Public Library. Computer Training. Introduction to Excel

Programming for Engineers in Python

Scientific Computing using Python

tutorial : modeling synaptic plasticity

DATA FORMATS FOR DATA SCIENCE Remastered

SOFTWARE BOOTCAMP. January 10-12, Financial Services Center, Hanlon Lab 4 th Floor, Babbio Center PROGRAM

EVM & Project Controls Applied to Manufacturing

Why I Use Python for Academic Research

Lecture 10: Boolean Expressions

SciPy. scipy [ and links on course web page] scipy arrays

Webgurukul Programming Language Course

Introduction to Computer Vision Laboratories

DSC 201: Data Analysis & Visualization

Integrating MATLAB Analytics into Business-Critical Applications Marta Wilczkowiak Senior Applications Engineer MathWorks

Sand Pit Utilization

Intermediate/Advanced Python. Michael Weinstein (Day 2)

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

LECTURE 19. Numerical and Scientific Packages

Python Training. Complete Practical & Real-time Trainings. A Unit of SequelGate Innovative Technologies Pvt. Ltd.

SCIENCE. An Introduction to Python Brief History Why Python Where to use

Python: Its Past, Present, and Future in Meteorology

Definition: Data Type A data type is a collection of values and the definition of one or more operations on those values.

MO101: Python for Engineering Vladimir Paun ENSTA ParisTech

Asks for clarification of whether a GOP must communicate to a TOP that a generator is in manual mode (no AVR) during start up or shut down.

Introduction to Python. Didzis Gosko

Effective Programming Practices for Economists. 10. Some scientific tools for Python

Quant Investment Analytics on Google Sheets (Cloud Analytics) VBA to Google Apps Script Cloud based financial engineering solutions

LECTURE 22. Numerical and Scientific Packages

An Introduction To Matplotlib School Of Geosciences

COURSE LISTING. Courses Listed. Training for Database & Technology with Modeling in SAP HANA. Last updated on: 30 Nov 2018.

Data Analytics Job Guarantee Program

Introduction to ufit

Google Data Studio. Toronto, Ontario May 31, 2017

OREKIT IN PYTHON ACCESS THE PYTHON SCIENTIFIC ECOSYSTEM. Petrus Hyvönen

INFORMATION TECHNOLOGY SPREADSHEETS. Part 1

DSC 201: Data Analysis & Visualization

Data Analytics Training Program using

AIMMS Function Reference - Date Time Related Identifiers

New Concept for Article 36 Networking and Management of the List

Python, SageMath/Cloud, R and Open-Source

Case study: accessing financial data

DATE OF BIRTH SORTING (DBSORT)

1 What s New in Instant Archive Viewer for OCS

MSRS Roadmap. As of January 15, PJM 2019

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer

CSC Advanced Scientific Computing, Fall Numpy

DSC 201: Data Analysis & Visualization

Decision Making Information from Your Mobile Device with Today's Rockwell Software

Data Wrangling with Python and Pandas

CS Programming I: Arrays

PYTABLES & Family. Analyzing and Sharing HDF5 Data with Python. Francesc Altet. HDF Workshop November 30, December 2, Cárabos Coop. V.

Connecting ArcGIS with R and Conda. Shaun Walbridge

Ch.1 Introduction. Why Machine Learning (ML)? manual designing of rules requires knowing how humans do it.

Guillimin HPC Users Meeting December 14, 2017

Transcription:

pandas: Rich Data Analysis Tools for Quant Finance Wes McKinney April 24, 2012, QWAFAFEW Boston

about me MIT 07 AQR Capital: 2007-2010 Global Macro and Credit Research WES MCKINNEY pandas: 2008 - Present wes (at) lambdafoundry.com Twitter: @wesmckinn 2

Upcoming book Python language essentials Core scientific libraries pandas Visualization Case studies Look out for Rough Cuts version on oreilly.com 3

Lambda Foundry http://lambdafoundry.com Incorporated January 2012 Mission: Better solutions to data-driven business problems RapidQuant: Financial analytics libraries and research environment Open Source Development and Support Training and Consulting 4

RapidQuant Platform Integrated, interactive research environment and workbench Analytics libraries, standard data algorithms, and transforms Frameworks for backtesting, risk modeling, portfolio management Vendor data integration Optimized data interfaces Caching and data management Testing tools Distributed computing 5

Outline Why Python for Quants? Scientific Python Foundation Pandas essentials Time series Group-wise data manipulation Plotting and Visualization More examples 6

Research vs. Production Research needs Production needs Rapid iteration, exploration Interactive Data Analysis Statistical modeling tools Backtesting framework Rich Visualization Reporting, Excel integration High perf computations Model and data controls, versioning Rigorous testing, robustness Large-scale process management Modularity, extensibility Productive system dev tools Strong interoperability High perf computations 7

Python: one-language solution to the two-domain problem?

Python Easy to learn, but richly featured Concise, but highly readable Python gets out of my way Multi-paradigm: object-oriented, functional, procedural Easy integration with C / C++ / Fortran Mature scientific libraries and large, active community 9

Python for Systems Strong library support Excellent maintainability Debugging, profiling, static code analysis tools Numerous testing frameworks Deployment, packaging tools GUI toolkits, web development, network applications One of Google s main languages 10

Core financial stack NumPy: multidimensional arrays, linear algebra pandas: data manipulation toolkit IPython: rich interactive environment SciPy: like MATLAB toolboxes statsmodels: statistics and econometrics Visualization: matplotlib 11

NumPy: Numerical Python Fast array processing library implemented in C ndarray: multidimensional array object Linear algebra operations Random number generation Efficient binary IO Other stuff: FFT, f2py, masked arrays,... 12

IPython Rich, interactive shell environment Terminal version + Rich Qt-based Shell with inline plotting Web Notebook Format Tab completion, introspection %run command Debugging and profiling tools 13

pandas Powerful data handling tool built on NumPy Started building in 2008 at AQR Capital, open sourced Has become widely used in quant finance Mature, well-tested codebase Powerful time series capabilities Widely used in production in the quant finance industry Upcoming 0.8.0 release: major time series improvements http://pandas.pydata.org 14

pandas Series, DataFrame, Panel GroupBy, Pivoting Indexing, Data alignment Time Series IO Merging / Joining Summary Stats Plotting Regression Sparse Indexing 15

Series 1D labeled array for cross-sections, time series Array of data, any type Array of labels, the index index A B values 5 6 - Orderedness not required C 12 Index can be integers, time, assets, or any other identifiers D E -5 6.7 16

DataFrame Table of Series objects columns foo bar baz qux Columns can be different types index Shared row index A 0 x 2.7 True Dict-like column insertion/deletion Select/slice data using row and / or column labels B C D 4 8-12 y z w 6 10 NA True False False E 16 a 18 False 17

pandas input / output Read from / write to a variety of formats CSV, clipboard, fixed-width, generalized table Excel 2003, 2007 DataFrame save and load via Pickling Managed storage solutions - - HDF5: pandas.io.pytables SQL: pandas.io.sql Web based API's like Yahoo! Finance, Fama-French, FRED 18

Data alignment Arithmetic auto-aligns data on label (ticker, timestamp,...) DataFrame aligns on row and column labels A 0 A NA B 1 B 1 B 2 C 2 + C 2 = C 4 D 3 D 3 D 6 E 4 E NA 19

Indexing and selection Select rows/columns by position or label Slice chunks of objects without copying Insert and delete DataFrame columns Hierarchical indexing: multiple levels of keys on a single axis Many time series conveniences 20

Time series Time series representations Fixed frequency and irregular data handling Date arithmetic Time zone handling Resampling: high to low, low to high Interpolating missing values Moving window functions 21

Date and time types Timestamp: specific moment in time Period: span of time - e.g. 2010, June 2007, 1997Q3 Interval: defined by 2 timestamps Timedelta or Duration: a length of time - e.g. 3 days; 30 minutes; 2 hours 22

Period arithmetic Period('Jun-2011', 'M') In [7]: p = Period('2011') Start End In [8]: p.asfreq('m', 'start') Out[8]: Period('Jan-2011', 'M') Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec In [9]: p.asfreq('m', 'end') Out[9]: Period('Dec-2011', 'M') Period('2011', 'A-DEC') 23

Fixed frequency time series Irregular by default, but can have a frequency Used for: shifting, frequency conversion, date arithmetic Upcoming changes in 0.8.0 24

Visualization matplotlib: general purpose plotting IPython integrates with matplotlib Plot windows or inline plotting Many convenience functions functions in pandas Complex plots may take effort 25

Group By Split Apply Combine A 0 B 5 C 10 A A A 0 5 10 sum A 5 B 5 sum A 15 Key B 10 B 10 B 30 C 15 B 15 C 45 A B C 10 15 20 C C C 10 15 20 sum 26

pandas roadmap Pandas for Big Data Integration with JavaScript visualization, e.g. d3 More integration with statsmodels (econometrics) and scikit-learn (machine learning) ggplot2-like plotting interface Better text file processing capabilities 27

pandas vs. R More time series features, higher performance than zoo, xts, fts, its, etc. DataFrame merge performance 5-30x faster Better performance than plyr / reshape2 for reshaping and groupby operations Symmetric treatment of row- and column-oriented operations No ggplot2 equivalent; weak area for Python, have plans to work on this summer 28