Implementing SVM in an RDBMS: Improved Scalability and Usability. Joseph Yarmus, Boriana Milenova, Marcos M. Campos Data Mining Technologies Oracle

Similar documents
SVM in Oracle Database 10g: Removing the Barriers to Widespread Adoption of Support Vector Machines

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

High Speed ETL on Low Budget

OLAP Introduction and Overview

Oracle Database: Introduction to SQL/PLSQL Accelerated

Outrun Your Competition With SAS In-Memory Analytics Sascha Schubert Global Technology Practice, SAS

New Features in Oracle Data Miner 4.2. The new features in Oracle Data Miner 4.2 include: The new Oracle Data Mining features include:

ENTERPRISE MINER: 1 DATA EXPLORATION AND VISUALISATION

Fault Detection using Advanced Analytics at CERN's Large Hadron Collider: Too Hot or Too Cold BIWA Summit 2016

Introduction to SQL/PLSQL Accelerated Ed 2

Oracle Database 11g: Performance Tuning DBA Release 2

ORACLE TRAINING CURRICULUM. Relational Databases and Relational Database Management Systems

After completing this course, participants will be able to:

SAS Enterprise Miner 7.1

Learn Well Technocraft

Configuring the Oracle Network Environment. Copyright 2009, Oracle. All rights reserved.

to-end Solution Using OWB and JDeveloper to Analyze Your Data Warehouse

EE 8591 Homework 4 (10 pts) Fall 2018 SOLUTIONS Topic: SVM classification and regression GRADING: Problems 1,2,4 3pts each, Problem 3 1 point.

Oracle Hyperion Profitability and Cost Management

Types of Data Mining

Course 40045A: Microsoft SQL Server for Oracle DBAs

Oracle Big Data Science

Developing Applications with Business Intelligence Beans and Oracle9i JDeveloper: Our Experience. IOUG 2003 Paper 406

Oracle Database 11g: Performance Tuning DBA Release 2

Oracle Big Data Science IOUG Collaborate 16

Oracle Database: SQL and PL/SQL Fundamentals NEW

Eine für Alle - Oracle DB für Big Data, In-memory und Exadata Dr.-Ing. Holger Friedrich

MICROSOFT BUSINESS INTELLIGENCE

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

Introducing Oracle R Enterprise 1.4 -

Oracle 1Z0-515 Exam Questions & Answers

Oracle Database 11g: SQL and PL/SQL Fundamentals

Principles of Machine Learning

Oracle 1Z Oracle Database 11g: New Features for Administrators.

Oracle 9i release 1. Administration. Database Outsourcing Experts

AO3 - Version: 2. Oracle Database 11g SQL

Oracle Database 11g : Performance Tuning DBA Release2

Table of Contents 1 Introduction A Declarative Approach to Entity Resolution... 17

Architettura Database Oracle

Course Description. Audience. Prerequisites. At Course Completion. : Course 40074A : Microsoft SQL Server 2014 for Oracle DBAs

ORACLE TRAINING. ORACLE Training Course syllabus ORACLE SQL ORACLE PLSQL. Oracle SQL Training Syllabus

Learning Objectives : This chapter provides an introduction to performance tuning scenarios and its tools.

Oracle9i for e-business: Business Intelligence. An Oracle Technical White Paper June 2001

SQL+PL/SQL. Introduction to SQL

Oracle Database: SQL and PL/SQL Fundamentals NEW

ITExamDownload. Provide the latest exam dumps for you. Download the free reference for study

Data Preprocessing. Slides by: Shree Jaswal

SAS Factory Miner 14.2: User s Guide

Oracle Database: SQL and PL/SQL Fundamentals Ed 2

CHAPTER. Oracle Database 11g Architecture Options

Oracle Database 11g: SQL Tuning Workshop. Student Guide

"Charting the Course... Oracle18c SQL (5 Day) Course Summary

Oracle 1Z0-054 Exam Questions and Answers (PDF) Oracle 1Z0-054 Exam Questions 1Z0-054 BrainDumps

Enterprise Miner Software: Changes and Enhancements, Release 4.1

Introducing SAS Model Manager 15.1 for SAS Viya

Prediction of Dialysis Length. Adrian Loy, Antje Schubotz 2 February 2017

Oracle Big Data Connectors

Enterprise Miner Tutorial Notes 2 1

Understanding the latent value in all content

Oracle Database: Introduction to SQL

CHAPTER 8: ONLINE ANALYTICAL PROCESSING(OLAP)

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Chapter 7. Introduction to Structured Query Language (SQL) Database Systems: Design, Implementation, and Management, Seventh Edition, Rob and Coronel

CASE STUDY Application Migration and optimization on AWS

My name is Brian Pottle. I will be your guide for the next 45 minutes of interactive lectures and review on this lesson.

Oracle Database 12c Performance Management and Tuning

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

April Copyright 2013 Cloudera Inc. All rights reserved.

Netezza The Analytics Appliance

CONCEPTUAL DESIGN FOR SOFTWARE PRODUCTS: SERVICE REQUEST PORTAL. Tyler Munger Subhas Desa

Oracle9i Data Mining. Data Sheet August 2002

Oracle Database 12c: Performance Management and Tuning

SPM Users Guide. This guide elaborates on powerful ways to combine the TreeNet and GPS engines to achieve model compression and more.

Technical Sheet NITRODB Time-Series Database

ETL TESTING TRAINING

Gain Insight and Improve Performance with Data Mining

PART I Core Ideas and Elements of PL/SQL Performance Tuning

Leng Leng Tan Vice President Server Manageability and Diagnosability Oracle Corporation. Arvind Gidwani Kothandapani Subramaniyam

Was ist dran an einer spezialisierten Data Warehousing platform?

IBM SPSS Statistics and open source: A powerful combination. Let s go

In-Memory Data Management

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Informatica Power Center 10.1 Developer Training

Oracle s Machine Learning and Advanced Analytics

Call: JSP Spring Hibernate Webservice Course Content:35-40hours Course Outline

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

App Engine: Datastore Introduction

An Interactive GUI Front-End for a Credit Scoring Modeling System by Jeffrey Morrison, Futian Shi, and Timothy Lee

3. Data Preprocessing. 3.1 Introduction

2. Data Preprocessing

Oracle 1Z0-071 Exam Questions and Answers (PDF) Oracle 1Z0-071 Exam Questions 1Z0-071 BrainDumps

Data Mining Overview. CHAPTER 1 Introduction to SAS Enterprise Miner Software

Database Development (H4)

Oracle Data Integrator 12c: Integration and Administration

The Consequences of Poor Data Quality on Model Accuracy

Independent consultant. Oracle ACE Director. Member of OakTable Network. Available for consulting In-house workshops. Performance Troubleshooting

Course Modules for MCSA: SQL Server 2016 Database Development Training & Certification Course:

Oracle Database 11g: SQL Fundamentals I

EZY Intellect Pte. Ltd., #1 Changi North Street 1, Singapore

Spotfire Data Science with Hadoop Using Spotfire Data Science to Operationalize Data Science in the Age of Big Data

Transcription:

Implementing SVM in an RDBMS: Improved Scalability and Usability Joseph Yarmus, Boriana Milenova, Marcos M. Campos Data Mining Technologies Oracle

Overview Oracle RDBMS resources leveraged by data mining Oracle Data Mining (ODM) infrastructure Oracle SVM scalability solutions Oracle SVM ease of use solutions

Data Mining in RDBMS Integration of data storage and analytic processing No data movement Data integrity and security Leverage of decades of RDBMS technology Query scalability Transparent distributed processing Resource management benefits

Database Infrastructure Model build implemented as a row source Processing node that takes in flows of rows (records) and produces a flow of rows Model is a first class (native) database object Model score implemented as a SQL operator Extensibility framework user-defined data types, operators, index types, optimizer functions, aggregate functions, pipelined parallel table functions

Oracle Database Architecture SGA Client Oracle [pga] Shared Pool Buffer Cache Large Pool P000 [pga] P001 [pga] Pxxx [pga] Database data files, temp space

SVM in the Database Oracle Data Mining (ODM) Commercial SVM implementation in the database supporting classification, regression, and one-class Product targets application developers and data mining practitioners Focuses on ease of use and efficiency Challenges Good scalability large quantities of data, low memory requirements, fast response time Good out-of-the-box accuracy

SVM and the ODM Infrastructure Data sources: views, tables, multi-relational data (e.g., star schema) Complex data types: sparse data, XML, text, images Data preparation Model build/test Model deployment Tools for measuring accuracy Model description

SVM Data Preparation Support Automatic data preparation Outlier treatment Missing value treatment Scaling Categorical to numeric attribute recoding

SVM Scalability Issues Build scalability Scoring scalability Large model sizes (non-linear kernels) make online scoring infeasible and batch scoring impractical

Implemented Scalability Techniques Popular build scalability techniques Decomposition Working set selection Kernel caching Shrinking Sparse data encoding Specialized linear model representation However, these established techniques were not sufficient for our needs

Additional Implemented Scalability Techniques Improved working set selection Stratified sampling mechanisms Small model generation

Working Set Selection Smooth (non-oscillating) iterative process overlap across working sets retain non-bounded support vectors Computationally efficient choice of violators add large violators without sorting

Who to Retain? /* Examine previous working set */ if (non-bounded sv < 50% of WS size) { retain all non-bounded sv; add other randomly selected up to 50% of WS size; } else randomly select non-bounded sv;

Who to Add? create violator list; if (violators < 50% of WS size) add all violators else { /* Scan I - pick large violators */ while (added violators < 50% of WS size) if (violation > avg_violation) add to WS; /* Scan II - pick other violators */ while (added violators < 50% of WS size) add randomly selected violators to WS; }

Stratified Sampling Classification and regression Prevent rare target values (classification) or ranges (regression) from being excluded Single scan of the data Target cardinality and distribution unknown at outset Dynamically adjust the sampling rates based on the observed target statistics

Small Model Generation Linear kernel models Represented as coefficients Non-linear kernel models (active learning) 1. Construct a small initial model 2. Select additional influential training records 3. Retrain on the augmented training sample 4. Exit when the maximum allowed model size is reached

Build Scalability Results

Scoring Scalability Results

SVM Scoring as a SQL Operator Easy integration DML statements (select, insert, delete, update), subqueries, functional indexes, triggers Parallelism Small memory footprint Model cached in shared memory Pipelined operation SELECT id, PREDICTION(svm_model_1 USING *) FROM user_data WHERE PREDICTION_PROBABILITY(svm_model_2, 'target_val USING *) > 0.5

Functional Indexes CREATE INDEX anninc_idx on customers (PREDICTION (svmr_model USING *)) ETL missing value imputation (data enrichment) Order-by elimination SELECT * from customers ORDER BY PREDICTION (svmr_model USING *); Index-driven filtering SELECT cust_name FROM customers WHERE PREDICTION (svmr_model USING *) > 150000;

Ease of Use Issues Complex methodology for novice users Data preparation Parameter (model) selection High computational cost for finding appropriate parameters Accuracy Fast build

On-the-Fly SVM Parameter Estimation Data-driven Low computational cost Ensure good generalization

Classification Accuracy Default Params Astroparticle 0.67 Default Params + Scaling Grid Search + Xval Oracle 0.96 0.97 0.97 Bioinformatics 0.57 0.79 0.85 0.84 Vehicle 0.02 0.12 0.88 0.71 Hsu, C. et al, A Practical Guide to Support Vector Classification

Regression Accuracy Grid search RMSE Boston housing 6.26 Computer activity Pumadyn 0.33 0.02 Oracle RMSE 6.57 0.35 0.02

Classification Capacity Estimate Goal: Allocate sufficient capacity to separate typical examples 1. Pick m random examples per class 2. Compute f i assuming α = C 2m f = i = Cy j j K( x j, xi ) 1 3. Exclude noise (incorrect sign) 4. Scale C, f i = ±1 (non bounded sv) C ( i) = 2m j = 1 sign( f ) / y K( x, x 5. Order ascending 6. Select 90 th percentile i j j i )

Classification Standard Deviation Estimate Goal: Estimate distance between classes 1. Pick random pairs from opposite classes 2. Measure distances 3. Order descending 4. Select 90 th percentile

Regression Epsilon Estimate Goal: estimate noise by fitting preliminary models 1. Pick small training and heldaside sets 2. Train SVM model with 3. Compute residuals on heldaside data 4. Update 5. Retrain new ε = + old ( ε µ )/ 2 r ε = 0.01* µ y

Conclusions Oracle s SVM implementation Handles large quantities of data Has a small memory footprint Has fast response time Allows database users with little data mining expertise to achieve reasonable out-of-the-box results Corroborated by independent evaluations by the University of Rhode Island and the University of Genoa

Final Note SVM is available in Oracle 10g database Implementation details described here refer to Oracle 10g Release 2 JAVA (J2EE) and PL/SQL APIs Oracle Data Miner GUI Oracle s SVM has been integrated by ISVs SPSS (Clementine) InforSense KDE Oracle Edition

Q U E S T I O N S A N S W E R S