The Wildcard Side of SAS

The Wildcard Side of SAS: A 10 Minute SAS 9 Tech Tip
Presented by: Winfried Jakob, Senior Analyst, Data Quality, Canadian Institute for Health Information
OASUS Spring Meeting - 2006-05-03

Alternate Title: Pattern Matching in SAS 9 using Perl Regular Expressions

Agenda
- Definition of Pattern Matching and Regular Expressions
- The Mechanics of Pattern Matching in SAS 9 by Example
- Regular Expression Basics
- The Regex Coach - A Learning Tool
- A More Advanced Example
- How to Get Started
- Final Words of Wisdom...

Definition 1: Pattern Matching
Pattern matching enables you to search for and extract multiple matching patterns from a character string in one step, as well as to make several substitutions in a string in one step. ¹

Definition 2a: Regular Expressions
Regular expressions are a pattern language which provides fast tools for parsing large amounts of text. Regular expressions are composed of characters and special characters that are called metacharacters. ¹

Definition 2b: Regular Expressions
Perl's 'regular expressions' give you a simple language to search for patterns, extract patterns from strings, and even change text that matches your patterns. This is invaluable for all manner of text manipulation, including validation, text replacement, and string testing. With the advent of SAS 9, the power of Perl's regular expressions is now available in the DATA step. ²

The Mechanics of Pattern Matching
Example 1.1 String Validation
Logical Steps:
1. Define a search pattern/regular expression
2. Compile the regular expression
3. Use the regular expression to find matches

Example 1.1 String Validation: Listing of Example 1 and 2 Data

Obs  Text
  1  Ms. Winifred Jacob
  2  Winfred Jakob
  3  Wyn Fried Jakob
  4  Winfired Jakob
  5  Winfried Jakob
  6  Winfried Jakob
  7  Winfried Jakob
  8  Winfried Jakob
  9  Winfried Jakob
 10  WinfriedJakob

Note: observations 5 to 9 evidently differ only in their whitespace (leading blanks or the gap between the two names), which is not visible in this listing.

Example 1.1 String Validation: Code

    data _null_;
       set Example1;                           * Read Example data *;
       Pattern = "/winfried jakob/i";          * Define PRX *;
       PRX = prxparse(pattern);                * Compile PRX *;
       Match_Result = prxmatch(prx,text);      * Search for PRX *;
       if Match_Result eq 0 then Message="No Match: ";
       if Match_Result gt 0 then Message="MATCH in: ";
       if _N_ eq 1 then do;
          putlog "The PRX pattern is: " Pattern /;
       end;
       putlog "Obs " _n_ 2.0 " " Message " " Text $char20. /;
    run;

Example 1.1 String Validation: Log Output

The PRX pattern is: /winfried jakob/i

Obs  1 No Match: Ms. Winifred Jacob
Obs  2 No Match: Winfred Jakob
Obs  3 No Match: Wyn Fried Jakob
Obs  4 No Match: Winfired Jakob
Obs  5 MATCH in: Winfried Jakob
Obs  6 MATCH in: Winfried Jakob
Obs  7 No Match: Winfried Jakob
Obs  8 No Match: Winfried Jakob
Obs  9 No Match: Winfried Jakob
Obs 10 No Match: WinfriedJakob
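The three logical steps (define, compile, match) can also be arranged so that the pattern is compiled only on the first observation and its id is retained for the rest. The following is a minimal sketch of that common idiom, not the presenter's code; the small Example1 data set built here is a hypothetical stand-in for the real one.

```sas
/* Hypothetical stand-in for the Example1 data set */
data Example1;
   input Text $char20.;
   datalines;
Winfred Jakob
Winfried Jakob
WINFRIED JAKOB
;

data _null_;
   set Example1;
   retain PRX;                               /* keep the pattern id across observations */
   if _N_ eq 1 then
      PRX = prxparse("/winfried jakob/i");   /* steps 1 and 2: define and compile once */
   if prxmatch(PRX, Text) gt 0               /* step 3: search; 0 means no match */
      then putlog "MATCH in: " Text;
      else putlog "No Match: " Text;
run;
```

Because the pattern carries the i modifier, the upper-case test row should be reported as a match as well.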

Example 1.2 String Validation: Code

    data _null_;
       ... same code as before ...
       Pattern = "/winfried\s*jakob/i";        * Define PRX *;
       PRX = prxparse(pattern);                * Compile PRX *;
       Match_Result = prxmatch(prx,text);      * Search for PRX *;
       ... same code as before ...
    run;

Example 1.2 String Validation: Log Output

The PRX pattern is: /winfried\s*jakob/i

Obs  1 No Match: Ms. Winifred Jacob
Obs  2 No Match: Winfred Jakob
Obs  3 No Match: Wyn Fried Jakob
Obs  4 No Match: Winfired Jakob
Obs  5 MATCH in: Winfried Jakob
Obs  6 MATCH in: Winfried Jakob
Obs  7 MATCH in: Winfried Jakob
Obs  8 MATCH in: Winfried Jakob
Obs  9 MATCH in: Winfried Jakob
Obs 10 MATCH in: WinfriedJakob

Example 2.1 String Replacement: Code

    data _null_;
       ... same code as before ...
       Pattern = "s/\s*winfried\s*jakob/Winfried Jakob/i";
       PRX = prxparse(pattern);
       Match_Result = prxmatch(prx,text);
       call prxchange(prx,-1,text);            * -1 replaces every match *;
       ... same code as before ...
    run;

Example 2.1 String Replacement: Log Output

PRX: s/\s*winfried\s*jakob/Winfried Jakob/i

Obs  1 No Match: Ms. Winifred Jacob
Obs  2 No Match: Winfred Jakob
Obs  3 No Match: Wyn Fried Jakob
Obs  4 No Match: Winfired Jakob
Obs  5 REPLACED: Winfried Jakob
Obs  6 REPLACED: Winfried Jakob
Obs  7 REPLACED: Winfried Jakob
Obs  8 REPLACED: Winfried Jakob
Obs  9 REPLACED: Winfried Jakob
Obs 10 REPLACED: Winfried Jakob
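When the original value should be kept, the PRXCHANGE function (as opposed to the CALL PRXCHANGE routine used above) returns the changed string instead of overwriting its argument. A minimal sketch, assuming a hypothetical variable Cleaned and a made-up test value:

```sas
data _null_;
   length Text Cleaned $ 30;
   Text = "  winfried     jakob";            /* made-up value with untidy spacing */
   /* -1 asks for every match to be replaced; Text itself is left unchanged */
   Cleaned = prxchange("s/\s*winfried\s*jakob/Winfried Jakob/i", -1, Text);
   putlog Text= Cleaned=;
run;
```

Passing the pattern as a string constant lets the function handle the compile step itself, which keeps short one-off substitutions compact.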

Example 3.1 String Extraction: Listing of Example 3 Data

Obs  Text
  1  153 First Street
  2  4000 Kanata Avenue North
  3  6789 64th Ave
  4  4 Moritz Road
  5  99 Borderline St.
  6  7493 Wilkes Place
  7  70 Sherring Court
  8  5 Goldridge Crescent
  9  7020 Yonge Street
 10  1 Buenaventura Pl. East

Example 3.1 String Extraction: Code

    data _null_;
       ... same code as before ...
       Pattern1 = "/\bavenue|\bave|\bcourt|\bcrt|";
       Pattern2 = "\bdrive|\bdr|\broad|\brd|\bstreet|\bst/i";
       PRX = prxparse(Pattern1 !! Pattern2);
       call prxsubstr(prx, Text, position, length);
       if position ne 0 then do;
          match = substr(text, position, length);
          putlog "Obs " _n_ 2.0 " " match:$quote. "found in " Text:$QUOTE. /;
       end;
       else if position eq 0 then do;
          putlog "Obs " _n_ 2.0 " NO MATCH found in " Text:$QUOTE. /;
       end;
    run;

Example 3.1 String Extraction: Log Output

PRX: /\bavenue|\bave|\bcourt|\bcrt|\bdrive|\bdr|\broad|\brd|\bstreet|\bst/i

Obs  1 "Street" found in "153 First Street"
Obs  2 "Avenue" found in "4000 Kanata Avenue North"
Obs  3 "Ave" found in "6789 64th Ave"
Obs  4 "Road" found in "4 Moritz Road"
Obs  5 "St" found in "99 Borderline St."
Obs  6 NO MATCH found in "7493 Wilkes Place"
Obs  7 "Court" found in "70 Sherring Court"
Obs  8 NO MATCH found in "5 Goldridge Crescent"
Obs  9 "Street" found in "7020 Yonge Street"
Obs 10 NO MATCH found in "1 Buenaventura Pl. East"
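Extraction can also be done with capture buffers: parentheses in the pattern mark the pieces of interest, and PRXPOSN returns the text of a given buffer after a successful PRXMATCH. The sketch below is an illustrative variation rather than the presenter's method; the pattern, the shortened list of street types, and the variables Number and Type are assumptions made for the example.

```sas
data _null_;
   length Text $ 40 Number $ 8 Type $ 10;
   /* buffer 1: leading street number; buffer 2: street type */
   PRX = prxparse("/^(\d+)\s.*\b(avenue|ave|street|st)\b/i");
   do Text = "4000 Kanata Avenue North", "99 Borderline St.", "7493 Wilkes Place";
      if prxmatch(PRX, Text) gt 0 then do;
         Number = prxposn(PRX, 1, Text);     /* text captured by the first group  */
         Type   = prxposn(PRX, 2, Text);     /* text captured by the second group */
         putlog Number= Type= Text=;
      end;
      else putlog "NO MATCH found in " Text;
   end;
run;
```

Compared with CALL PRXSUBSTR plus SUBSTR, this form can pull several pieces out of one string with a single match.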

Regular Expression Basics: Concatenation
The simplest form of a regular expression is simply a string of characters.
Examples:
    /Regular Expressions/
    /Yonge Street/
    /500 Dollars/
    /xyz/
    /1000/

Regular Expression Basics: Metacharacters
Some characters have a special meaning:
    { } [ ] ( ) ^ $ . | * + ? \
To find a string that contains any of those, you must prefix them with a backslash (\).
Example: To find the string "What is it?" you must specify:
    /What is it\?/
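The effect of the backslash can be checked directly. In this small sketch (the test strings and the variable names Escaped and Unescaped are made up for illustration), the unescaped question mark acts as a quantifier that makes the preceding t optional, so it no longer requires a literal question mark in the text:

```sas
data _null_;
   length Text $ 12;
   do Text = "What is it?", "What is it.";
      Escaped   = prxmatch("/What is it\?/", Text);  /* needs a literal ? in the text      */
      Unescaped = prxmatch("/What is it?/",  Text);  /* ? only makes the final t optional  */
      putlog Text= Escaped= Unescaped=;
   end;
run;
```

The escaped pattern should return 0 for the second string, while the unescaped one finds a match in both.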

Regular Expression Basics: More Metacharacters
    .    the period matches any one character
    \w   a 'word'-like character, matches a-z, A-Z, 0-9, _
    \d   a 'digit' character, matches the digits 0 to 9
    \s   a 'space'-like character, matches any whitespace character, including the space and the tab
    \b   a word boundary, the position between a 'word' character and a 'non-word' character
Examples:
    /\d\d:\d\d:\d\d/   matches the hh:mm:ss time format
    /\bave/            does NOT match Buenaventura
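As a quick check of \d and \b together, the following sketch looks for an hh:mm:ss time in a few made-up strings. The added \b on either side is an assumption of this example, used so that longer digit runs are not accepted.

```sas
data _null_;
   length Text $ 30;
   PRX = prxparse("/\b\d\d:\d\d:\d\d\b/");
   do Text = "Started at 09:15:00", "Started at 9:15", "Buenaventura";
      Position = prxmatch(PRX, Text);        /* start of the match, or 0 if none */
      putlog Text= Position=;
   end;
run;
```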

Regular Expression Basics: Iterators
    *       matches 0 or more occurrences of preceding pattern
    +       matches 1 or more occurrences of preceding pattern
    ?       matches 0 or 1 occurrences of preceding pattern
    {k}     matches k occurrences of preceding pattern
    {n,m}   matches at least n and at most m occurrences of preceding pattern
Example:
    /a{1,4}/   matches a, aa, aaa, aaaa
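Iterators are greedy by default, which CALL PRXSUBSTR makes visible because it reports the position and length of the match. A minimal sketch with made-up test strings:

```sas
data _null_;
   length Text $ 12;
   PRX = prxparse("/a{1,4}/");
   do Text = "cat", "caaat", "caaaaaat", "dog";
      /* Position and Length are both 0 when nothing matches */
      call prxsubstr(PRX, Text, Position, Length);
      putlog Text= Position= Length=;
   end;
run;
```

For "caaaaaat" the match should start at position 2 and stop after four characters, since {1,4} takes as many a's as it is allowed to and no more.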

Regular Expression Basics: Alternation and Grouping
A vertical bar separates alternatives.
Example:
    /Avenue|Ave|Boulevard|Blvd/
Parentheses are used to define the scope and precedence of the operators.
Example:
    /H(ä|ae?)ndel/   matches Handel, Händel, and Haendel

Regular Expression Basics: Character Classes
You can express single-character alternatives by using a character class. Character classes are denoted by square brackets, with the set of characters to be possibly matched inside.
Example:
    /[abc]ar/   matches aar, bar, and car

The Regex Coach ³
- A great learning tool
- Free download
- Free use
- Highly recommended

A More Advanced Example: Problem
Check if a 6 character Postal Code follows these 3 rules:
1. Postal Code must be in the format ANANAN. The first character must not be D, F, I, O, Q, U or W.
2. Postal Code can also be a right-aligned mini postal code in the format '    AA' (4 leading spaces).
3. Postal Code can be 999999 for 'Unknown'.

A More Advanced Example: Solution
Construct 3 patterns and combine them as alternatives using the vertical bar:

    Pattern1 = "[ABCEGHJ-NPRSTVXYZ]\d[A-Z]\d[A-Z]\d";
    Pattern2 = "    [A-Z][A-Z]";
    Pattern3 = "999999";
    PRX = prxparse("/" !! Pattern1 !! "|" !! Pattern2 !! "|" !! Pattern3 !! "/");

A More Advanced Example: Result

Obs  1 MATCH in: K2K2T1
Obs  2 No Match: KKK2T1
Obs  3 No Match: K2K211
Obs  4 No Match: K+K2T1
Obs  5 No Match: FL
Obs  6 MATCH in: FL
Obs  7 No Match: F
Obs  8 No Match: 99999
Obs  9 No Match: 999 99
Obs 10 MATCH in: 999999

Note: the blanks in observations 5 to 7 are significant (observation 6 is the right-aligned form with 4 leading spaces) but are not visible in this listing.
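Put together as a complete step, the solution might look like the sketch below. The Postal data set and its four rows are hypothetical test data, and the ^( ... )$ anchors are an addition of this example (not part of the slide's pattern) so that the whole 6-character field has to match. The $char6. informat is used so that leading blanks are preserved on input.

```sas
/* Hypothetical test data; $char6. keeps the leading blanks */
data Postal;
   input Postal_Code $char6.;
   datalines;
K2K2T1
KKK2T1
    FL
999999
;

data _null_;
   set Postal;
   retain PRX;
   if _N_ eq 1 then do;
      Pattern1 = "[ABCEGHJ-NPRSTVXYZ]\d[A-Z]\d[A-Z]\d";   /* rule 1: ANANAN       */
      Pattern2 = "    [A-Z][A-Z]";                        /* rule 2: 4 blanks, AA */
      Pattern3 = "999999";                                /* rule 3: unknown      */
      /* anchors added for this sketch: the full field must match */
      PRX = prxparse("/^(" !! Pattern1 !! "|" !! Pattern2 !! "|" !! Pattern3 !! ")$/");
   end;
   if prxmatch(PRX, Postal_Code) gt 0
      then putlog "MATCH in: " Postal_Code $char6.;
      else putlog "No Match: " Postal_Code $char6.;
run;
```

Keeping the three rules in separate variables, as the slide does, makes each alternative easy to read and to change on its own.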

How to Get Started
- Download this slide show from the OASUS website ⁴
- Get the SUGI 29 article from the Internet: David L. Cassell, Design Pathways, Corvallis, OR: The Perks of PRX... ²
- Download and install The Regex Coach ³
- Work through the article and try out your own ideas
- Learn by doing and expect surprises!

Final Words of Wisdom...
Regular Expressions are extremely powerful, but they are not the best solution for every problem. Learn enough to know when they are appropriate, and when they will simply cause more problems than they solve.

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."
Jamie Zawinski, in comp.lang.emacs

References
1) Technical Support article: Pattern Matching Using SAS Regular Expressions (RX) and Perl Regular Expressions (PRX), http://support.sas.com/91doc/getdoc/lrdict.hlp/a002288677.htm
2) Cassell, David L. (2004), The Perks of PRX..., SUGI 29 Proceedings, http://www2.sas.com/proceedings/sugi29/129-29.pdf
3) The Regex Coach is available at: http://www.weitz.de/regex-coach/
4) Ottawa Area SAS Users Society website: http://www.oasus.ca

Acknowledgements
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. The Perl language was designed by Larry Wall, and is available for public use under both the GNU GPL and the Perl Copyleft. Other brand and product names are registered trademarks or trademarks of their respective companies.

About the presenter
Winfried Jakob
Senior Analyst, Data Quality
Canadian Institute for Health Information
495 Richmond Road, Suite 600
Ottawa, Ontario K2A 4H6, Canada
Phone: +1 (613) 694-6993
Fax: +1 (613) 241-8120
Email: wjakob@cihi.ca
http://www.cihi.ca