Reading in Data Directly from Microsoft Word Questionnaire Forms


Paper 1401-2014
Sijian Zhang, VA Pittsburgh Healthcare System

ABSTRACT

If someone comes to you with hundreds of questionnaire forms in Microsoft (MS) Word format and asks you to extract the data from the forms into a SAS dataset, there are several ways to handle the request. For a SAS programmer, however, the shortcut is to write a SAS program that reads the data directly from the Word files into a SAS dataset. This paper shows how it can be done with simple SAS programming techniques, such as FILENAME with the PIPE option, DDE, and the CALL EXECUTE routine.

INTRODUCTION

The MS Word file is one of the most popular file formats in the digital world. One study at our center (Center for Health Equity Research and Promotion, VA Pittsburgh Healthcare System) set up a blank questionnaire template as a Word file and used it to collect data from each study subject. The task requested was to get the answers on the questionnaire forms into a SAS dataset; see Figure 1.

Figure 1: From raw data on MS Word questionnaire form to data in a SAS dataset (panels: MS Word File, SAS Dataset)

Specifically, for the form above, the checkbox information needs to be collected as numeric values in a SAS dataset. However, the data in the Word file are not organized in a table, and there are no recognizable bookmarks or labels in the file. I first searched online for related publications and could not find any that fit this scenario. I then took a close look at the file and experimented with it, and it turned out that the job could be done with a few simple SAS programming techniques.

It takes three steps to move the raw information from MS Word files to a SAS dataset:

1. Copy all the Word files into a folder.
2. Convert the Word files to RTF files.
3. Extract data from the RTF files into a SAS dataset.

The reason for converting Word files to RTF files is that SAS cannot read and interpret the source code of Word files. If you find the code behind an RTF file difficult to read, you will agree that the code behind a Word file makes no sense at all to a SAS programmer. Figure 2 shows Notepad views of a Word file and an RTF file that each contain only one sentence, "Hello, World!". In the Notepad view of the Word file you cannot even find the sentence with the Search function. RTF source code, however, can be read into SAS as character values, just like any text data file. Once the file has been read into SAS, it is like a package received: the next step is to unwrap the package and take out what we ordered.

Figure 2: Notepad views of a MS Word and a RTF file (panels: Notepad view of a MS Word file, Notepad view of a RTF file)

A text file is much simpler than an RTF file and easier for SAS to process, so why not save the MS Word file as a text file directly? This is actually what I tried first. It does not work, because the conversion discards all formatting and objects (including checkboxes), and those are exactly the indicators needed to identify the values to read.

The coding environment for this paper: Windows XP 2002 Service Pack 3, Microsoft Office Word 2007, and SAS 9.3.
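The point that RTF source can be read like any other text file can be checked with a minimal sketch such as the following. The path C:\temp\hello.rtf and the names demo, rtf_lines, lineno, and found are hypothetical; the file is assumed to be an RTF file saved from Word that contains the sentence "Hello, World!".

```sas
/* Hypothetical path: an RTF file saved from Word with one sentence. */
filename demo "C:\temp\hello.rtf";

data rtf_lines;
  infile demo lrecl=5000 truncover;
  input rawtxt $char500.;                         /* first 500 characters of each source line */
  lineno = _n_;                                   /* position of the line in the file */
  found  = (index(rawtxt, "Hello, World!") > 0);  /* 1 if the sentence is on this line */
run;

/* List only the lines on which the sentence was found. */
proc print data=rtf_lines(where=(found=1));
  var lineno rawtxt;
run;

filename demo clear;
```

Word may interleave RTF control words with the visible text, so a plain INDEX search is not guaranteed to find every sentence; for a short sentence typed in one run it usually does.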

PROGRAMS

1. Get the list of all Word files in the folder

    *****************
    * Get file list *
    *****************;
    libname output "C:\Studies\IRB Survey\Word2SAS\SAS";
    %let wordpath=c:\studies\irb Survey\Word2SAS\Word\;  /* folder with all Word questionnaire forms */
    %let rtfpath=c:\studies\irb Survey\Word2SAS\rtf\;
    %let filetype=.docx;                                  /* or .doc */

    filename files pipe "dir ""&wordpath*&filetype""";    /* ❶ */

    data filelist;
      infile files lrecl=300 truncover;
      input line $200.;
      retain fileid;                                      /* ❷ */
      if not index(line,"&filetype") then delete;
      else fileid+1;
      /* Length argument omitted: SUBSTR then returns the rest of LINE. */
      /* The original call, substr(line,39,199), asks for more          */
      /* characters than LINE holds and triggers an invalid-argument    */
      /* note in the log.                                                */
      filename = strip(substr(line,39));
      wordpathname = "&wordpath" || left(filename);
      rtfpathname  = "&rtfpath"  || left(tranwrd(filename, "&filetype", ".rtf"));
      keep fileid filename wordpathname rtfpathname;
    run;

The statements before the FILENAME statement specify the folder locations and the input file extension used by the rest of the program. The file extension before Word 2007 is .doc.

❶ The PIPE option in the FILENAME statement runs the DIR command with the specified file extension. DIR returns the attribute information of all matching files, together with some information about the folder itself, and each line of that listing is read into the variable LINE. The content of each line looks just like what we see in Windows Explorer, including file name, size, date, and so on. The function SUBSTR( ) then extracts the piece that we need, the file name, which starts at column 39 of the DIR listing in this environment.

❷ Four new variables are created from the variable LINE in the dataset filelist for later data manipulation (see Figure 3):

1. fileid: used to identify the individual output SAS datasets
2. filename: the file name, including the extension
3. wordpathname: the path name of the input Word file
4. rtfpathname: the path name of the converted RTF file

Figure 3: SAS dataset filelist
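The starting column passed to SUBSTR depends on how DIR formats dates and file sizes on a given machine. As a variation, and not the method used in this paper, the sketch below asks DIR for bare file names with its /b switch, so no column position is needed. It assumes the macro variables wordpath, rtfpath, and filetype defined in step 1; the names files_b and filelist_b are hypothetical.

```sas
/* Variation on step 1 (not the paper's code): DIR /b prints one bare */
/* file name per line, so no SUBSTR column offset is needed.          */
filename files_b pipe "dir /b ""&wordpath*&filetype""";

data filelist_b;                       /* hypothetical dataset name */
  length filename $ 200 wordpathname rtfpathname $ 260;
  infile files_b lrecl=300 truncover;
  input filename $char200.;
  if filename = "" then delete;        /* skip blank lines, if any */
  fileid + 1;                          /* sum statement: running file counter */
  wordpathname = "&wordpath" || strip(filename);
  rtfpathname  = "&rtfpath"  || strip(tranwrd(filename, "&filetype", ".rtf"));
run;

filename files_b clear;
```

TRANWRD is case-sensitive, so a file saved with an upper-case extension such as .DOCX would keep that extension in rtfpathname; a real program would need to handle that case.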

2. Convert Word files to RTF files

    *****************************
    * Convert Word to RTF files *
    *****************************;
    options noxwait noxsync;
    x call "C:\Program Files\Microsoft Office\Office12\winword.exe";

    data _null_;
      wait=sleep(3);                                      /* ❶ */
    run;

    filename wordlink dde 'WinWord|System';

    %macro word2rtf(inpathname,outpathname);
      data _null_;
        file wordlink;
        put '[FileOpen .Name = "' "&inpathname" '"]';
        put '[FileSaveAs "' "&outpathname" '",6]';        /* ❷ */
        put '[FileClose]';
      run;
    %mend word2rtf;

    data _null_;
      set filelist;
      call execute('%word2rtf(' || wordpathname || ', ' || rtfpathname || ')');  /* ❸ */
    run;

    data _null_;
      file wordlink;
      put '[FileExit]';
    run;

    filename wordlink clear;

A DDE (dynamic data exchange) connection between SAS and MS Word is established at the beginning of this section. It is used to send WordBasic statements from SAS to Word, which Word then carries out.

❶ After starting Word, the program pauses for 3 seconds to give Word time to open before the FILENAME statement sets up the connection. Otherwise, the DATA steps that follow may fail with errors.

❷ The WordBasic statement FileSaveAs with option 6 specifies RTF as the file format to be saved.

❸ The CALL EXECUTE routine runs the macro %word2rtf for every Word file listed in the dataset filelist. After this step, the RTF folder is filled with the converted RTF files.
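Because Word does the conversion outside of SAS, it can be useful to confirm that every expected RTF file exists before moving on. The following is a minimal sketch, not part of the original program; it assumes the dataset filelist from step 1, and the names rtf_check and converted are hypothetical.

```sas
/* Sketch: flag Word files whose RTF counterpart is missing after step 2. */
data rtf_check;                                  /* hypothetical dataset name */
  set filelist;
  converted = fileexist(strip(rtfpathname));     /* 1 = file exists, 0 = not found */
  if converted = 0 then
    put "WARNING: no RTF file found for " wordpathname;
run;
```

If the check runs immediately after the conversion, Word may still be writing the last files, so a short pause before the check may be needed.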

3. Extract data from RTF files

    *******************************
    * Read in data from RTF files *
    *******************************;
    %macro rtf2sas(rtffile,filename,surveynum);
      filename inrtf "&rtffile";

      data questions;
        infile inrtf lrecl=5000 truncover;
        input rawtxt $ 1-500;
        retain Criterion Keep;
        if index(rawtxt, " 1: ") then do; Criterion=1; Keep=1; end;   /* ❶a */
        if index(rawtxt, " 2: ") then Criterion=2;
        if index(rawtxt, " 3: ") then Criterion=3;
        if index(rawtxt, " 4: ") then Criterion=4;
        if index(rawtxt, " 5: ") then Criterion=5;
        if index(rawtxt, " 6: ") then Criterion=6;
        if index(rawtxt, " 7: ") then Criterion=7;
        if index(rawtxt, " 8: ") then Criterion=8;
        if Keep=1;
        if index(rawtxt, " 1: ") & lag(criterion)=8 then stop;
        drop Keep;                                                    /* ❶b */
      run;

      data q_and_a;
        set questions;
        by Criterion;
        if first.criterion then checkbox=0;
        if index(rawtxt,"0000140000000") then do;                     /* ❷ */
          checkbox+1;
          if index(rawtxt,"00010000000000000000000000") then checkboxvalue=1;  /* ❸ */
          else checkboxvalue=0;
          output;
        end;
      run;

      data selected;
        set q_and_a;
        if checkboxvalue=1;
      run;

      proc transpose data=selected out=record(drop=_name_) prefix=criterion;
        id Criterion;
        var checkbox;
      run;

      data survey_&surveynum;
        length Survey $ 50;
        set record;
        Survey="&fileName";
      run;
    %mend rtf2sas;

    data _null_;
      set filelist;
      call execute('%rtf2sas(' || rtfpathname || ',' || filename || ',' || fileid || ')');  /* ❹ */
    run;

    ****************************
    * Set all surveys together *
    ****************************;
    data output.survey_all;
      set survey_:;                                                   /* ❺ */
    run;

From ❶a to ❶b, the code assigns values to the variable Criterion according to the unique criterion identification numbers. At the same time, the rawtxt values before and after the question-and-answer segment are dropped. The variable Keep is used to discard all RTF content before the first question, and the IF ... THEN STOP statement just before ❶b discards all RTF content after the last question by detecting the next " 1: " (the start of the next section, which is not part of the data needed).

For every checkbox, the RTF source contains a value portion such as {\*\datafield 650000001400000007436865636b353500010000000000000000000000}. In this value, 0000140000000 ❷ identifies a checkbox; 00010000000000000000000000 ❸ indicates that the box is checked, and 00000000000000000000000000 indicates that the box is blank. The observations with checkboxvalue = 1 are then kept and transposed, with the prefix criterion plus the criterion number as variable names, which changes the data layout from vertical to horizontal.

To execute the macro, the CALL EXECUTE routine ❹ runs %rtf2sas for every questionnaire file in the RTF folder. Finally, all individual datasets are set together ❺ to generate the final output SAS dataset shown in Figure 1.

COMMENTS

The way of collecting study data described above is a little unconventional. However, if you have no control over the data collection and what you are asked to do is to get the data from Word files into a SAS dataset, using a SAS program to read the data directly from the raw Word files can save a lot of intermediate effort, such as building a database, entering the data, and then importing the data into SAS. In terms of efficiency, this SAS programming approach is a shortcut. It also avoids some of the human errors that occur during data entry.

The programming in this paper is not complicated. The key part is to find the unique structural character strings in the RTF file that identify the data to be extracted. In other similar situations the specifics can be quite different, and the programming always needs to be tailored to the situation at hand. However, the basic idea for importing data without a table structure should be similar: first, convert the raw files to a SAS-readable format if needed; then read them into SAS and extract the data according to the file structure specifications.

REFERENCES

1. Let SAS Tell Microsoft Word to Collate, http://www2.sas.com/proceedings/sugi29/034-29.pdf

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Sijian Zhang, MD, MS, MBA
Research Health Science Specialist
Biostatistics and Informatics Core
Center for Health Equity Research and Promotion
VA Pittsburgh Healthcare System
7180 Highland Drive (151C-H)
Pittsburgh, PA 15206
sijian.zhang@va.gov

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
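For readers who want to try the checkbox rule from step 3 in isolation, the sketch below applies the same two INDEX tests to a few in-line records instead of a real RTF file. The first record is the datafield value quoted in step 3. The second record is a hypothetical blank checkbox, built by joining the same leading bytes to the blank pattern, and the third is hypothetical ordinary text. The names checkbox_demo, is_checkbox, and checked are also hypothetical.

```sas
/* Sketch: test the checkbox patterns from step 3 on in-line sample records. */
data checkbox_demo;                    /* hypothetical dataset name */
  length rawtxt $ 200;
  infile datalines truncover;
  input rawtxt $char200.;
  /* Pattern 2 from the paper: marks a checkbox field. */
  is_checkbox = (index(rawtxt, "0000140000000") > 0);
  /* Pattern 3 from the paper: marks a checked box; evaluated only for checkboxes. */
  if is_checkbox then
    checked = (index(rawtxt, "00010000000000000000000000") > 0);
  datalines;
{\*\datafield 650000001400000007436865636b353500010000000000000000000000}
{\*\datafield 650000001400000007436865636b353600000000000000000000000000}
\par Some ordinary text without a form field
;
run;

proc print data=checkbox_demo;
  var is_checkbox checked rawtxt;
run;
```

In the output, checked stays missing for records that are not checkboxes, which mirrors the way the q_and_a step writes out only the checkbox lines. The pattern strings are specific to the Word version and form design described in this paper, so they should be confirmed against the RTF source of any other form before being reused.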