Fuzzy Matching in Fraud Analytics. Grant Brodie, President, Arbutus Software

1 Fuzzy Matching in Fraud Analytics Grant Brodie, President, Arbutus Software

2 Outline: What Is Fuzzy?; Causes; Effective Implementation; Application to Specific Products; Demonstration; Q&A

3 Why Is Fuzzy Important? Big data; Too many transactions; User-entered data (web sites); E-Commerce; Less manual oversight

4 What Is Fuzzy? Subset of duplicates testing; Find specific keywords in text (FCPA, PCard); Close, but not the same; Two reasonable definitions: Proximity, Looks similar

5 Proximity: Sorts close together; Characters: Albert vs. Albertson; Numbers: 123, vs. 123,; Dates: Jan 19, 2014 vs. Jan 20,

6 Looks Similar: Characters: Microsoft vs. Wicrosoft; Numbers: 127, vs. 12,; Dates: Jan 13, 2014 vs. Jan 31,

7 Traditional Approach to Close: Pronunciation based (Soundex, NYSIIS); Designed for names; Many false positives; Not useful for numbers or dates

8 Fuzzy Today: Based on physical string matching: Levenshtein (ACL), Damerau-Levenshtein (Arbutus), N-Gram, Jaro-Winkler, and many more; Differences expressed as a distance or percentage

9 Quick Lesson: Damerau-Levenshtein: Minimum number of changes to make one string into another; Changes are insert, delete, replace, transpose; 123 Main Street vs. 123 Main St = 4; Two adjacent characters transposed = 1 (Levenshtein: 2); Rob vs. Robert = 3; Gary vs. Mary = 1; Gary vs. gary = 1
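The distances on this slide can be reproduced with a short dynamic-programming routine. The sketch below is an illustration only, not the code used by ACL or Arbutus: it implements the restricted ("optimal string alignment") form of Damerau-Levenshtein, and the function name `edit_distance` and the transposed pair `12345`/`12435` are assumptions chosen for the example.

```python
def edit_distance(a: str, b: str, transpositions: bool = True) -> int:
    """Minimum number of single-character edits turning a into b.

    With transpositions=True, swapping two adjacent characters counts as one
    edit (restricted Damerau-Levenshtein); with False it is plain Levenshtein.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or keep)
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[-1][-1]


print(edit_distance("123 Main Street", "123 Main St"))        # 4
print(edit_distance("Rob", "Robert"))                         # 3
print(edit_distance("Gary", "gary"))                          # 1
print(edit_distance("12345", "12435"))                        # 1
print(edit_distance("12345", "12435", transpositions=False))  # 2
```

The comparison is case sensitive, which is why Gary vs. gary scores the same as Gary vs. Mary.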

10 Problems with String Matching: Very literal; Doesn't apply any context; John Smith vs. John  Smith, with an extra blank (1); Smith John vs. Smith, John (1); John Smith vs. john smith (2); México vs. Mexico (1); John Smith vs. john smith scores the same as John Smith vs. John Hmitz (2)

11 What Do You Use? Whatever your tool offers; Almost impossible to implement manually; VERY compute intensive

12 Causes: Accidental errors; Carelessness/mistyping; Transpositions; Blurry source; Punctuation; Extra blanks; 1 vs. I, 0 vs. O (particularly with OCR)

13 Errors vs. Fraud: All of the causes were likely errors; Fraud uses intentional errors to mask activity: obscure duplicates, obscure relationships, trick through similarity; Disparate systems make comparison even harder

14 Practical Issues: Generally hard to target fuzzy tests; Forced to use broad tests; Most findings will be errors; Even so, the finding is still valuable; Need a process to address errors found

15 Our System Catches Duplicates: Exact matches only; Strict application (i.e. company, vendor, invoice); May only warn; Not all duplicates are payments; Most only test document numbers

16 Types of Duplicates: Names (personal, corporate); Addresses; Document numbers (e.g., invoice); Contact information (phone numbers, emails)

17 Issues: Very compute intensive (wait times); Quadratic relationship, because every record is compared with every other: 1000x data = 1,000,000x more work; False positives; Ease of use
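The "1000x data = 1,000,000x more work" figure follows from counting pairs: an all-against-all comparison of n records needs n(n-1)/2 distance calculations. A minimal sketch, with record counts chosen only for illustration:

```python
def comparisons(n: int) -> int:
    """Number of distinct record pairs in an all-against-all comparison."""
    return n * (n - 1) // 2


small, large = 1_000, 1_000_000  # illustrative volumes, 1000x apart
for n in (small, large):
    print(f"{n:>9,} records -> {comparisons(n):>15,} comparisons")

# 1000x more records means roughly 1,000,000x more comparisons.
print(comparisons(large) // comparisons(small))
```

This growth is why restricting comparisons to records that share some exact context matters so much in practice.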

18 False Positives: Easily the most challenging aspect; Any time spent on a false positive is wasted; Can easily outnumber the true positives by 10, 100, or 1000 to 1; If too many, can remove any cost effectiveness; How does this happen? Only one way to get an exact match, virtually unlimited ways to get close

19 False Positive Examples: Matching to 12345 with a single difference: Missing (1245): 5; Transposition (12435): 4; Incorrect (12745): min 45 (175 if alpha, 1,000+ if any char); Extra (123345): min 60 (200+ if alpha, 1,000+ if any char); Hundreds/thousands of ways that differ by just 1; Not just errors, all close values; Exponentially more with a distance of 2; A bad actor relies on being a needle in a haystack
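The counts on this slide can be checked by enumerating every value one edit away from a target. The sketch below assumes the target is 12345, the value that the slide's examples (1245, 12435, 12745, 123345) appear to derive from; the function name `neighbors` is hypothetical. It counts distinct strings, so the "extra" figure comes out slightly below the slide's raw position-times-character count of 60, because inserting a 3 before or after the existing 3 yields the same string.

```python
import string


def neighbors(value: str, alphabet: str = string.digits) -> dict[str, set[str]]:
    """Distinct strings exactly one edit away from value, grouped by edit type."""
    n = len(value)
    return {
        "missing": {value[:i] + value[i + 1:] for i in range(n)},
        "transposed": {
            value[:i] + value[i + 1] + value[i] + value[i + 2:]
            for i in range(n - 1)
            if value[i] != value[i + 1]
        },
        "incorrect": {
            value[:i] + c + value[i + 1:]
            for i in range(n)
            for c in alphabet
            if c != value[i]
        },
        "extra": {
            value[:i] + c + value[i:]
            for i in range(n + 1)
            for c in alphabet
        },
    }


digits_only = {kind: len(found) for kind, found in neighbors("12345").items()}
alphanumeric = {
    kind: len(found)
    for kind, found in neighbors("12345", string.digits + string.ascii_uppercase).items()
}
print(digits_only)   # {'missing': 5, 'transposed': 4, 'incorrect': 45, 'extra': 55}
print(alphanumeric)  # {'missing': 5, 'transposed': 4, 'incorrect': 175, 'extra': 211}
```

Even for a five-digit number, more than a hundred distinct values sit at distance 1, and every one of them that exists in the data is a candidate match.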

20 How to Address the Issues: Data preparation; Utilize context; Use tight specifications; Choose software that meets needs; Rank your results

21 Choose Your Software: Has the capabilities you need; Can process your data volumes; Easy to implement; Easy to automate; ACL, Arbutus, IDEA, fraud-specific, non-audit tools

22 Data Preparation: Remove immaterial differences first (i.e., normalization); Text manipulation: upper case, punctuation, extra blanks, foreign characters (México vs. Mexico, Québec vs. Quebec)

23 Data Preparation (Cont.): Eliminate noise words; Different by type of data: Address: Suite, Unit; Corporate name: Company, Co, Inc; Personal name: Mr, Ms, Dr, Prof

24 Data Preparation (Cont.): Common misspellings/typos; Common vocabulary (chair vs. silla); Different by data type: Avenue: Av, Ave, Aven, Avenu; First vs. 1st; West vs. W; Richard: Rick, Dick, Ricky, Rich

25 Data Preparation (Cont.): Word order: 123 W Main St. vs. 123 Main St. W

26 Data Preparation: Result: Well-implemented data prep minimizes the need for fuzzy; Consider the two addresses: # Main Street West; 1234 W MAIN ST, Suite 200; Levenshtein distance is 20; Applying data prep can make both strings identical: W ST MAIN
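The data preparation steps described on the preceding slides (upper case, punctuation, extra blanks, foreign characters, noise words, common vocabulary, word order) can be chained into one function. The following is a minimal sketch under stated assumptions, not the ACL or Arbutus implementation: the `NOISE` and `VOCAB` tables are tiny illustrative stand-ins for a real vocabulary, and the sample addresses are made up for the example.

```python
import re
import unicodedata

# Illustrative tables only; a production vocabulary would be far larger.
NOISE = {"SUITE", "UNIT"}
VOCAB = {"STREET": "ST", "WEST": "W", "AVENUE": "AVE", "ROAD": "RD"}


def normalize(text: str, sort_words: bool = False) -> str:
    """Remove immaterial differences so that exact or tight fuzzy tests work."""
    # Foreign characters: split accented letters and drop the accent marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Upper case, then turn punctuation and runs of blanks into single blanks.
    text = re.sub(r"[^A-Z0-9]+", " ", text.upper())
    # Noise words are dropped and vocabulary is replaced, whole words only.
    words = [VOCAB.get(word, word) for word in text.split() if word not in NOISE]
    if sort_words:
        words.sort()  # removes word-order differences
    return " ".join(words)


print(normalize("México"))                                  # MEXICO
print(normalize("123 W. Main St.", sort_words=True))        # 123 MAIN ST W
print(normalize("123 Main Street West", sort_words=True))   # 123 MAIN ST W
```

Once both addresses reduce to the same string, an exact duplicates test finds the pair and no fuzzy comparison is needed.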

27 Text Manipulation: ACL: Create a computed field; Upper case: Upper(field) (FUZZYDUP ignores case, but data prep is simpler); Punctuation: Include(field, "ABCDEFGHIJKLMNOPQRSTUVWXYZ "), but Extra blanks (replace 2 with 1): Replace(Replace(field, "  ", " "), "  ", " "); Foreign characters: Replace(Replace(field, "É", "E"), "Á", "A"); Combined: Replace(Replace(Replace(Replace(Include(Upper(field), "ABCDEFGHIJKLMNOPQRSTUVWXYZ "), "  ", " "), "  ", " "), "  ", " "), "É", "E"); In practice, many more replace calls; May break up into multiple fields for clarity

28 Text Manipulation: Arbutus: Create a computed field; Upper case: Upper(field); Punctuation: Include(field, "0~9A~Z "), but Extra blanks: Compact(field); Foreign characters: Replace(field, "É", "E", "Á", "A", ); Combined: Replace(Compact(Include(Upper(field), "0~9A~Z ")), "É", "E"); May break up into multiple fields for clarity; Only for unusual situations (use Normalize function)

29 Eliminate Noise Words: ACL: Use whole words: Omit(field+, INCORPORATED,INC,LIMITED,LTD, F), but Omit(field, INC ): CINCH INDUSTRIES becomes CH INDUSTRIES; Problem is, many noise words to eliminate; Two solutions: (1) Long list: Alltrim(Omit(field+, INCORPORATED,INC,LIMITED,LTD,CORPORATION, CORP, )); (2) Sequential omits of a variable in a group: v_field=omit(field v_field=omit(v_field

30 Common Vocabulary: ACL: Similar to noise words, only Replace instead of Omit; Use whole words: Replace(field+, ROAD, RD ), otherwise BROADWAY becomes BRDWAY; Don't omit, as Peachtree Lane is not the same as Peachtree Court; Problem is, MANY vocabulary words to potentially normalize: USPS 400 street terms, 500+ male names, 700+ female names; Nested functions (with Replace instead of Omit); Sequential replaces of a variable in a group
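The whole-word requirement is easy to demonstrate outside ACL. The sketch below uses Python's regular-expression word boundary (`\b`) as a stand-in for the whole-word handling the slides describe; the helper names are hypothetical.

```python
import re


def replace_substring(text: str, old: str, new: str) -> str:
    """Naive replacement: matches anywhere, including inside longer words."""
    return text.replace(old, new)


def replace_whole_word(text: str, old: str, new: str) -> str:
    """Replace old only where it stands alone as a complete word."""
    return re.sub(rf"\b{re.escape(old)}\b", new, text)


print(replace_substring("BROADWAY", "ROAD", "RD"))            # BRDWAY
print(replace_whole_word("BROADWAY", "ROAD", "RD"))           # BROADWAY
print(replace_whole_word("12 PEACHTREE ROAD", "ROAD", "RD"))  # 12 PEACHTREE RD
print(replace_substring("CINCH INDUSTRIES", "INC", ""))       # CH INDUSTRIES
```

The same boundary test applies to noise words: removing INC as a whole word leaves CINCH INDUSTRIES intact.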

31 Word Order: ACL: No practical way to address this

32 Noise Words and Common Vocabulary: Arbutus: If you choose, ACL syntax all works; Instead, use Normalize() or SortNormalize(); Automatically implements ALL of the data prep described (upper case, punctuation, blanks, foreign, noise, vocabulary); Normalize(address, addr.txt ); Norm( Suite Main Street West, addr.txt ) = MAIN ST W; SortNormalize has the same syntax, but = W ST MAIN; Normalize can use a separate vocabulary file (addr.txt); Replaces or omits any word, on a whole word basis; User configurable and selectable, by data type

33 Noise Words and Common Vocabulary: Arbutus: Substitution file (addr.txt, for example): FIRST -> 1ST; SEVENTH -> 7TH; AV, AVENU, AVENUE, AVN -> AVE; PARKWAY, PARKWY, PKWAY, PKY -> PKWY; SUITE, UNIT (no replacement listed)

34 False Positive Reduction: Utilize Context: Data elements always have a context; Names or addresses: location (e.g., city, state, ZIP, country); Documents: vendor, employee, etc.; Reference the similarities to minimize the ambiguity; Same state, city, similar address: 123 Main St., Springfield, IL/MA; Same vendor, date, amount, similar invoice number

35 Utilize Context: Application: ACL FUZZYDUP: only supports one key field; concatenate fields into a single expression/computed field (State+City+Address); other data types require conversion: vendor+date(dt)+str(amount, 16)+invno. Arbutus DUPLICATES: supports multiple key fields; specify each key separately; last key can be fuzzy
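The idea of exact keys followed by one fuzzy key can be sketched in a general-purpose language: group records on the fields that must match exactly, then apply the fuzzy test only within each group. This is an illustration of the approach, not how FUZZYDUP or DUPLICATES is implemented; the record layout, field names, and sample invoices are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_duplicates(records, exact_keys, fuzzy_key, max_distance=1):
    """Compare fuzzy_key only between records that agree on every exact key."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[key] for key in exact_keys)].append(record)
    hits = []
    for group in groups.values():
        for left, right in combinations(group, 2):
            distance = osa_distance(left[fuzzy_key], right[fuzzy_key])
            if distance <= max_distance:
                hits.append((left[fuzzy_key], right[fuzzy_key], distance))
    return hits


# Hypothetical invoices: same vendor, date and amount, similar invoice number.
invoices = [
    {"vendor": "V100", "date": "2014-01-19", "amount": 1250.00, "invno": "INV10234"},
    {"vendor": "V100", "date": "2014-01-19", "amount": 1250.00, "invno": "INV10243"},
    {"vendor": "V200", "date": "2014-01-19", "amount": 1250.00, "invno": "INV10235"},
    {"vendor": "V100", "date": "2014-01-20", "amount": 99.00, "invno": "INV10233"},
]
hits = fuzzy_duplicates(invoices, ["vendor", "date", "amount"], "invno")
print(hits)  # [('INV10234', 'INV10243', 1)]
```

Grouping first also cuts the number of comparisons, since the fuzzy test never runs across groups. INV10235 is one character away from INV10234 but belongs to a different vendor, so it is never reported.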

36 False Positive Reduction: Use Tight Specs: Levenshtein distance 1, or 2 max; Looser specifications = more false positives; Avoid Soundex and similar approaches; There is no substitute for good data prep

37 False Positives: Rank Your Results: Order based on exposure: size of item, degree of inherent risk (cash); Order based on degree of similarity: distance (1 vs. 2), number of matching same elements
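One way to combine the two orderings is a compound sort key. The sketch below is an assumption about how the ranking might be applied, not a method prescribed by the slides: it puts the closest matches first and, within the same distance, the largest amounts first. The match records are invented for the example.

```python
# Hypothetical fuzzy-match results awaiting review.
matches = [
    {"id": "A", "distance": 2, "amount": 9000.00},
    {"id": "B", "distance": 1, "amount": 150.00},
    {"id": "C", "distance": 1, "amount": 4200.00},
]

# Smaller distance first (more similar), then larger amount first (more exposure).
ranked = sorted(matches, key=lambda m: (m["distance"], -m["amount"]))
print([m["id"] for m in ranked])  # ['C', 'B', 'A']
```

Reversing the key order would rank by exposure first, which may suit reviews where item size matters more than similarity.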

38 Execution: ACL: Separate menu item (Analyze/fuzzy duplicates); Choose your (concatenated) key; Choose diff. threshold (1 or 2); Select other fields to use in investigation; Select the output table name; Be patient

39 Execution: Arbutus: Included with duplicates testing (Analyze/duplicates); Choose your key fields (any type); Choose either near or similar processing; Choose max. difference (0, 1, or 2); Select other fields to use in investigation; Select output location and name

40 Similar Processing: Arbutus: Specifically designed to work with document IDs; Uses Damerau-Levenshtein, but automatically pre-processes: removes all blanks and punctuation, upper cases, matches similar characters (O=0, I=1, 5=S, etc.); Works on all data types: 127, vs. 12, (diff. 1); I vs (diff 0); Particularly useful with OCR
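The pre-processing described here can be approximated by building a canonical key before any distance test. The sketch below is a hedged illustration rather than the Arbutus implementation: the `LOOKALIKES` table covers only the three pairs named on the slide, the direction of each mapping (letter to digit) is an arbitrary choice, and the sample document IDs are invented.

```python
import re

# Look-alike pairs named on the slide; mapping letters to digits is arbitrary.
LOOKALIKES = str.maketrans({"O": "0", "I": "1", "S": "5"})


def similar_key(value) -> str:
    """Canonical form of a document ID: no blanks or punctuation, upper case,
    look-alike characters collapsed to one representative."""
    text = re.sub(r"[^A-Z0-9]", "", str(value).upper())
    return text.translate(LOOKALIKES)


print(similar_key("inv-1O5 I"))  # 1NV1051
print(similar_key("INV 1051"))   # 1NV1051
print(similar_key(127456))       # 127456
```

Two IDs that differ only by an OCR-style confusion reduce to the same key, so they match with a difference of 0. A distance function can then be applied to the keys to catch remaining differences.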

41 Similar Processing: ACL: Not explicitly supported; Pre-process the data to create a computed field: upper case; include only numbers and letters (no blanks, punctuation); convert numbers and dates to strings (date or string); Use the FUZZYDUP command as in the past

42 Manual Duplicates Testing: ACL: Data prep is still important; LevDist(string1, string2 <, case sensitive>): case sensitive by default; Filter: LevDist(name1, name2, F) < 3; IsFuzzyDup(string1, string2, distance <, diff%> ): automatically case insensitive; Filter: IsFuzzyDup(name1, name2, 2); Can also be used as a join test

43 Manual Duplicates Testing: Arbutus: All case sensitive, by default (assumes normalized inputs); Difference(string1, string2 <, case sensitive>); Filter: difference(name1, name2, F) < 3; Near(field1, field2, difference); Filter: near(name1, name2, 2); Near applies to all data types (Char: Damerau-Levenshtein; numbers and dates: proximity, 4799 vs 4803); Similar(field, field2, difference) applies to all data types and always uses Damerau-Levenshtein (Char: prepared data; numbers and dates: 123,456 vs. 12,456)

44 Find Specific Keywords in Text: ACL: Very common for purchase card reviews, FCPA; Use the Find function: Filter: IF Find("Exotic", desc); Multiple words: IF Find("Exotic", desc) OR Find("IPad", desc); Not case sensitive, not whole word; Create a Logical computed field (say "Exception"): T IF Find("Exotic", desc); T IF Find("IPad", desc); F; Filter: IF Exception

45 Find Specific Keywords in Text: Arbutus: Find function works the same as ACL; Use the ListFind function instead: Filter: IF ListFind("exceptions.txt", desc); Simple text file; Easily maintained in Notepad; Unlimited entries; Supports an external reference file or an internal array; Like Find function, not case sensitive, not whole word
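A keyword-list test of this kind is straightforward to sketch in Python. The code below is an analogue written for illustration, not the Find or ListFind implementation; the function names, the one-keyword-per-line file layout, and the sample descriptions are assumptions. Like the functions described on the slides, it is not case sensitive and does not require whole-word matches.

```python
def load_keywords(path: str) -> list[str]:
    """Read one keyword per line from a plain text file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def list_find(keywords: list[str], text: str) -> bool:
    """True if any keyword occurs anywhere in text, ignoring case."""
    haystack = text.casefold()
    return any(keyword.casefold() in haystack for keyword in keywords)


keywords = ["Exotic", "IPad"]  # could instead come from load_keywords(...)
descriptions = ["ipad mini 16GB", "Office chairs", "EXOTICA LOUNGE"]
flagged = [text for text in descriptions if list_find(keywords, text)]
print(flagged)  # ['ipad mini 16GB', 'EXOTICA LOUNGE']
```

Keeping the keywords in a separate file means the list can grow without touching the test itself.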

46 Continuous Monitoring: Mostly errors; Test vs. control: ownership of the process, may relate to frequency; Detective vs. preventative: entire presentation is detective; opportunity to run against documents before committing; preventative is almost certainly a control

47 Fuzzy Testing in Action: Demonstration


More information

Microsoft Access 2007 Module 1

Microsoft Access 2007 Module 1 Microsoft Access 007 Module http://citt.hccfl.edu Microsoft Access 007: Module August 007 007 Hillsborough Community College - CITT Faculty Professional Development Hillsborough Community College - CITT

More information

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,... Data Ingestion ETL, Distcp, Kafka, OpenRefine, Query & Exploration SQL, Search, Cypher, Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

More information

Best Practices. Contents. Meridian Technologies 5210 Belfort Rd, Suite 400 Jacksonville, FL Meridiantechnologies.net

Best Practices. Contents. Meridian Technologies 5210 Belfort Rd, Suite 400 Jacksonville, FL Meridiantechnologies.net Meridian Technologies 5210 Belfort Rd, Suite 400 Jacksonville, FL 32257 Meridiantechnologies.net Contents Overview... 2 A Word on Data Profiling... 2 Extract... 2 De- Identification... 3 PHI... 3 Subsets...

More information

Linux Systems Security. Security Design NETS Fall 2016

Linux Systems Security. Security Design NETS Fall 2016 Linux Systems Security Security Design NETS1028 - Fall 2016 Designing a Security Approach Physical access Boot control Service availability and control User access Change control Data protection and backup

More information

CHOIR APPLICATION FORM

CHOIR APPLICATION FORM P a g e 1 CHOIR APPLICATION FORM BARCELONA: July 17-27, 2017 Please have one representative register interest for the entire choir with this form. Name of Choir/Ensemble: Title (please circle): Mr. Ms.

More information

The Power Of An Integrated Search Strategy

The Power Of An Integrated Search Strategy The Power Of An Integrated Search Strategy Chad Hallert Director of Digital Strategy Noble Studios A Quick Introduction About Me About Noble Studios 15 Years in Digital Marketing 2015 Direct Marketing

More information

Proceedings of the Eighth International Conference on Information Quality (ICIQ-03)

Proceedings of the Eighth International Conference on Information Quality (ICIQ-03) Record for a Large Master Client Index at the New York City Health Department Andrew Borthwick ChoiceMaker Technologies andrew.borthwick@choicemaker.com Executive Summary/Abstract: The New York City Department

More information

Oracle Syllabus Course code-r10605 SQL

Oracle Syllabus Course code-r10605 SQL Oracle Syllabus Course code-r10605 SQL Writing Basic SQL SELECT Statements Basic SELECT Statement Selecting All Columns Selecting Specific Columns Writing SQL Statements Column Heading Defaults Arithmetic

More information

Lesson 2. Data Manipulation Language

Lesson 2. Data Manipulation Language Lesson 2 Data Manipulation Language IN THIS LESSON YOU WILL LEARN To add data to the database. To remove data. To update existing data. To retrieve the information from the database that fulfil the stablished

More information

AAL 217: DATA STRUCTURES

AAL 217: DATA STRUCTURES Chapter # 4: Hashing AAL 217: DATA STRUCTURES The implementation of hash tables is frequently called hashing. Hashing is a technique used for performing insertions, deletions, and finds in constant average

More information

Silberschatz, Korth and Sudarshan See for conditions on re-use

Silberschatz, Korth and Sudarshan See   for conditions on re-use Chapter 3: SQL Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 3: SQL Data Definition Basic Query Structure Set Operations Aggregate Functions Null Values Nested

More information

Table of Contents. PDF created with FinePrint pdffactory Pro trial version

Table of Contents. PDF created with FinePrint pdffactory Pro trial version Table of Contents Course Description The SQL Course covers relational database principles and Oracle concepts, writing basic SQL statements, restricting and sorting data, and using single-row functions.

More information

Database 2: Slicing and Dicing Data in CF and SQL

Database 2: Slicing and Dicing Data in CF and SQL Database 2: Slicing and Dicing Data in CF and SQL Charlie Arehart Founder/CTO Systemanage carehart@systemanage.com SysteManage: Agenda Slicing and Dicing Data in Many Ways Handling Distinct Column Values

More information

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #3: SQL and Rela2onal Algebra- - - Part 1

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #3: SQL and Rela2onal Algebra- - - Part 1 CS 4604: Introduc0on to Database Management Systems B. Aditya Prakash Lecture #3: SQL and Rela2onal Algebra- - - Part 1 Reminder: Rela0onal Algebra Rela2onal algebra is a nota2on for specifying queries

More information

01 Transaction Pro Importer version 6.0

01 Transaction Pro Importer version 6.0 01 Transaction Pro Importer version 6.0 PLEASE READ: This help file gives an introduction to the basics of using the product. For more detailed instructions including frequently asked questions (FAQ's)

More information

CA Mahesh Bhatki Mumbai, 28/December/2013. Agenda

CA Mahesh Bhatki Mumbai, 28/December/2013. Agenda Data Analytics and Use of CAATTs Seminar On Investigation and Forensic Accounting & Audit CA Mahesh Bhatki Mumbai, 28/December/2013 Agenda Overview of CAATTs (Tools and Techniques) Some Useful Techniques

More information

PEGASUS DISTRIBUTOR S GUIDE

PEGASUS DISTRIBUTOR S GUIDE PEGASUS DISTRIBUTOR S GUIDE GPS /GPRS SOLUTION FOR YOUR FLEET Web Based Tracking System Tel: +44 (0)1509 808168 E- Mail: info@naxertech.com. www.naxertech.co.uk www.naxertech.com Revision History Note:

More information

Quality Control of Clinical Data Listings with Proc Compare

Quality Control of Clinical Data Listings with Proc Compare ABSTRACT Quality Control of Clinical Data Listings with Proc Compare Robert Bikwemu, Pharmapace, Inc., San Diego, CA Nicole Wallstedt, Pharmapace, Inc., San Diego, CA Checking clinical data listings with

More information

Tags, Categories and Keywords

Tags, Categories and Keywords Tags, Categories and Keywords Document Management Tip Sheet As more and more content gets added to your repository, it will become harder to find what you need. Documents may become buried in multi-level

More information

Chapter 3: Introduction to SQL

Chapter 3: Introduction to SQL Chapter 3: Introduction to SQL Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 3: Introduction to SQL Overview of the SQL Query Language Data Definition Basic Query

More information

CASS Cycle L ( ) Certification: Frequently Asked Questions

CASS Cycle L ( ) Certification: Frequently Asked Questions CASS Cycle L (2007-2008) Certification: Frequently Asked Questions Q. What is CASS Cycle L? A. CASS Cycle L is the next regularly scheduled update of address-matching software. The USPS requires address-matching

More information

Formulas, LookUp Tables and PivotTables Prepared for Aero Controlex

Formulas, LookUp Tables and PivotTables Prepared for Aero Controlex Basic Topics: Formulas, LookUp Tables and PivotTables Prepared for Aero Controlex Review ribbon terminology such as tabs, groups and commands Navigate a worksheet, workbook, and multiple workbooks Prepare

More information

3/3/2008. Announcements. A Table with a View (continued) Fields (Attributes) and Primary Keys. Video. Keys Primary & Foreign Primary/Foreign Key

3/3/2008. Announcements. A Table with a View (continued) Fields (Attributes) and Primary Keys. Video. Keys Primary & Foreign Primary/Foreign Key Announcements Quiz will cover chapter 16 in Fluency Nothing in QuickStart Read Chapter 17 for Wednesday Project 3 3A due Friday before 11pm 3B due Monday, March 17 before 11pm A Table with a View (continued)

More information

Searching Guide. November 17, Version 9.5

Searching Guide. November 17, Version 9.5 Searching Guide November 17, 2017 - Version 9.5 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

More information

Why Use OSU Printing & Mailing Services?

Why Use OSU Printing & Mailing Services? Why Use OSU Printing & Mailing Services? Printing & Mailing Numbers Bulk Mail Pieces Reducing Costs For Our Clients Without proper mail preparation, you could be paying a significant amount more in production

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

COVER LETTER UNIT 1 LESSON 3

COVER LETTER UNIT 1 LESSON 3 1 COVER LETTER Naviance Family Connection http://connection.naviance.com/cascadehs http://connection.naviance.com/everetths http://connection.naviance.com/henrymjhs http://connection.naviance.com/sequoiahs

More information

Walt Brainerd s Fortran 90 programming tips

Walt Brainerd s Fortran 90 programming tips Walt Brainerd s Fortran 90 programming tips I WORKETA - March, 2004 Summary by Margarete Domingues (www.cleanscape.net/products/fortranlint/fortran-programming tips.html) Fortran tips WORKETA - 2004 p.1/??

More information

Chapter 3: Introduction to SQL. Chapter 3: Introduction to SQL

Chapter 3: Introduction to SQL. Chapter 3: Introduction to SQL Chapter 3: Introduction to SQL Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 3: Introduction to SQL Overview of The SQL Query Language Data Definition Basic Query

More information

Implementation of Lexical Analysis

Implementation of Lexical Analysis Written ssignments W assigned today Implementation of Lexical nalysis Lecture 4 Due in one week y 5pm Turn in In class In box outside 4 Gates Electronically Prof. iken CS 43 Lecture 4 Prof. iken CS 43

More information

Implementation of Lexical Analysis

Implementation of Lexical Analysis Written ssignments W assigned today Implementation of Lexical nalysis Lecture 4 Due in one week :59pm Electronic hand-in Prof. iken CS 43 Lecture 4 Prof. iken CS 43 Lecture 4 2 Tips on uilding Large Systems

More information

Importacular for The Raiser s Edge

Importacular for The Raiser s Edge Importacular for The Raiser s Edge development@zeidman.info www.zeidman.info UK: 020 3637 0080 US: (646) 570 1131 Table of Contents Overview... 4 Installation for Self-Hosted Users (on premise)... 4 Hosted

More information