Analytics Research Internship at Hewlett Packard Labs

Stefanie Deo
Mentor: Mehran Kafai
September 12, 2016

First, another opportunity that came my way but didn't pan out: a Data Science Internship at Intuit
- I joined local Meetup groups for women who code / women in data science
- Through PyLadies SF, I heard about an intro-to-data-science class at Intuit (Oct 1 - Nov 12)
- I applied for the class and was accepted
- An Intuit recruiter got in touch with me after my final project presentation
- I got a phone interview, but wasn't selected (and that's okay!)

How I got a Research Internship at HPE's Analytics Lab
- Old-fashioned way: I made a list of all the big tech companies and searched for jobs directly on each company's website
- I started looking really early; I sent out my first resume in October
- I checked around every week, and then found this internship on HPE's job postings page (I applied for it on January 14)
- In March, I got a request from HPE for a phone interview
- By that time, I had applied for over 50 internships (I had to keep a spreadsheet to organize my search!)
- One interview question was about prior statistical work (for that, I talked about my Math 261A project)
- The hardest question was about programming, but I was able to answer it on the second try because of what I learned in CS21A (Python Programming at Foothill College) and Math 167 (SAS Programming at SJSU)

A little bit about the company
- HPE became a separate company after Hewlett-Packard split into two companies in November 2015
- As the name states, HPE focuses on enterprise products (servers, storage, networking, consulting, and software for big-data analysis and security)
- Hewlett Packard Labs (in Palo Alto) is the R&D division that comes up with new products
- HP Inc. makes printers, laptops, and other consumer goods

What I worked on: A machine learning classification tool

Predictions from classification models can help us solve all sorts of problems
- AD TARGETING: Which ad is a user most likely to click on?
- SPEECH RECOGNITION: Did the person say "Call Bobby" or "Call Barbie"?
- MEDICAL DIAGNOSIS: Does a patient have Disease X?

With streaming data, however, a model trained on stale data usually isn't ideal. Taking a closer look at a search engine trying to predict ad clicks:
- Streaming data (e.g., browsing history) arrives from the user
- Use that data to update the classification model
- Use the new model to predict the best ads to display
...but retraining a model with the most recent data can be challenging when we have large datasets that are rapidly evolving.
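The update step above can be sketched as a minimal online logistic-regression classifier that takes one stochastic-gradient step per incoming example instead of retraining from scratch. This is an illustrative sketch in plain Python, not the tool described later; the class and toy "clicked" rule are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class OnlineLogistic:
    """Logistic-regression classifier updated one example at a time."""
    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return sigmoid(z)

    def partial_fit(self, x, y):
        # One SGD step on the log-loss: the model absorbs the new
        # example without revisiting the rest of the stream.
        err = y - self.predict_proba(x)
        self.b += self.lr * err
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]

model = OnlineLogistic(dim=2)
random.seed(0)
for _ in range(2000):                     # simulated stream of (features, click) events
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    y = 1 if x[0] + x[1] > 0 else 0       # hypothetical "user clicked" rule
    model.partial_fit(x, y)
```

After consuming the stream, the model's predictions track the underlying click rule without any batch retraining.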

Using the most recent data requires the shortest possible training time. This is a tough problem, due to a historical trade-off between the two classification methods:
- Linear classification: fast, but only suits data that is already linearly separable
- Kernel classification: can handle non-linearly separable data, but is computationally expensive (and therefore slow)

Labs' solution is CROlinear, a tool that is the best of both worlds:
- CRO feature map: a fast kernel transformation developed at Hewlett Packard Labs, able to take data that is not linearly separable and quickly make it linearly separable in a higher dimension
- Accelerated linear classification: faster optimization via incremental and multicore algorithms
- CRO feature map + accelerated linear classification = CROlinear: fast, and able to handle data that is not linearly separable
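The core idea of an explicit feature map can be illustrated on toy XOR data. Note this sketch uses a simple hand-picked product feature, not the actual CRO map (which is described in the paper linked at the end); it only shows why lifting data into a higher dimension lets a plain linear rule succeed.

```python
# XOR-style data: no line in the original 2-D space separates the classes.
points = [((+1, +1), -1), ((+1, -1), +1), ((-1, +1), +1), ((-1, -1), -1)]

def lift(x):
    # Illustrative explicit feature map (NOT the CRO map): append x1*x2.
    # In the lifted 3-D space the classes become linearly separable.
    x1, x2 = x
    return (x1, x2, x1 * x2)

# A linear classifier in the lifted space: predict +1 when w . z > 0.
w = (0.0, 0.0, -1.0)

def linear_predict(z):
    return +1 if sum(wi * zi for wi, zi in zip(w, z)) > 0 else -1

predictions = [linear_predict(lift(x)) for x, _ in points]
```

A single linear rule in the lifted space classifies all four XOR points correctly, which no linear rule in the original 2-D space can do.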

CROlinear Experiments (using logistic regression as the solver)

Datasets
- a9a (Adult census data): training set 32,561 instances; 2 classes
- real-sim (real vs. simulated document classification): training set 72,309 instances; 2 classes

Datasets (continued)
- MNIST (handwritten digits): training set 1M instances; testing set 100K instances; 10 classes
- TIMIT (American English speech corpus): training set 1,385,426 instances; testing set 506,113 instances; 39 classes
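Benchmark datasets like a9a and real-sim are commonly distributed in the sparse text format used by LIBSVM and Liblinear (one `label index:value ...` line per instance). A minimal parser sketch for that format; the example line is made up but follows the a9a style of binary features:

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value index:value ...' line of the
    sparse format used by LIBSVM/Liblinear datasets such as a9a."""
    tokens = line.split()
    label = int(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features

# Hypothetical line in the style of the a9a dataset (binary features):
label, features = parse_libsvm_line("+1 5:1 18:1 39:1 124:1")
```

Only the non-zero feature indices appear on each line, which keeps very high-dimensional datasets like real-sim compact on disk.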

Incremental training produces an updated model much faster than training from scratch

No surprise: training on more cores requires less processing time (Dataset: MNIST1M)

Optimization algorithms are just different ways of solving the same classification problem

[Chart: TRON vs. L-BFGS iterations and time to convergence]

Some algorithms are faster at solving certain problems; there is no one-size-fits-all when it comes to algorithms and datasets

[Chart: TRON vs. L-BFGS iterations and time to convergence]
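The point can be seen even with much simpler stand-ins than TRON and L-BFGS: on the same tiny logistic-regression problem, plain gradient descent and Newton's method both reach the same optimum but need very different numbers of iterations. This is an illustrative sketch only; the real solvers are far more sophisticated.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny 1-D logistic-regression problem: x = 1 for every point,
# labels (+1, +1, -1); the minimizer of the log-loss is w* = ln 2.
data = [(1.0, +1), (1.0, +1), (1.0, -1)]

def grad(w):
    # Derivative of sum_i log(1 + exp(-y_i * w * x_i))
    return sum(-y * x * sigmoid(-y * w * x) for x, y in data)

def hess(w):
    return sum(x * x * sigmoid(w * x) * (1.0 - sigmoid(w * x)) for x, _ in data)

def gradient_descent(w=0.0, lr=1.0, tol=1e-8):
    for it in range(1, 1000):
        g = grad(w)
        if abs(g) < tol:
            return w, it
        w -= lr * g              # first-order step: cheap, many iterations
    return w, it

def newton(w=0.0, tol=1e-8):
    for it in range(1, 1000):
        g = grad(w)
        if abs(g) < tol:
            return w, it
        w -= g / hess(w)         # second-order step: costlier, few iterations
    return w, it

w_gd, iters_gd = gradient_descent()
w_nt, iters_nt = newton()
```

Both solvers land on the same answer; the trade-off is per-iteration cost versus iteration count, and which side wins depends on the dataset, just as with TRON and L-BFGS above.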

CROlinear is more accurate than a linear classifier, and much faster than a kernel classifier

MNIST 1M (single core):

                                 Linear only   Kernel only*   CROlinear
  Training time (mins)                8.79          754           31.45
  Prediction time (µs/instance)      126          1,200          121
  Accuracy                           86.4%         96.3%         99.2%

  *process auto-killed

What I accomplished on this internship
- I read scientific papers (lots of them)
- I studied logistic regression in greater depth, since we only covered a little bit of it at the end of Math 261A
- I studied the inner workings of Liblinear, an open-source software library for large-scale linear classification (Liblinear was the basis for the linear classifier of CROlinear)
- I learned Linux (necessary, since it's hard to compile Liblinear on Windows)
- I learned how to access a remote Linux server for running large jobs
- I wrote Python scripts to run experiments with Liblinear
- I learned a bit of C/C++ in order to make modifications to Liblinear, including:
  - Modifying the Python wrapper for a version of Liblinear that includes the two extensions for incremental training and multicore training
  - Incorporating the L-BFGS algorithm into Liblinear and creating a command-line option to use it
- I learned Git in order to share code with my team on HPE's internal GitHub repository

For more information on the CRO feature map, see the paper by Kave Eshghi and Mehran Kafai: http://alumni.cs.ucr.edu/~mkafai/papers/paper_icde16.pdf

Thank you