Using a Fillable PDF together with SAS for Questionnaire Data
Donald Evans, US Department of the Treasury

Introduction

The objective of this paper is to demonstrate how to use a fillable PDF to collect data and how to use SAS to import and summarize the answers from the questionnaires. When a respondent answers a question in the fillable PDF, the PDF stores the information as XML in the background. This information can be read and converted into a database format with the built-in SAS XML engine. An intermediate step is required in which additional information is added to the xml to identify the start and end of a table and the start and end of each record. This technique is especially effective with questionnaires containing more than 256 fields, which is the column limit in Excel 2003.

Create a Document

The first step is to create a document that will serve as the foundation for collecting all of the information of interest. This can be accomplished in two ways: create the form directly in Acrobat LiveCycle Designer, or create it in an editor like Word. If the latter is chosen, create a rough version of the form that has the overall structure but no controls (radio buttons, text fields, etc.), create a PDF from the document, and then overlay the fillable components of the form directly in Acrobat.

Exhibit 1: Example of a Form Started in Word
Exhibit 2: Example of Form with Overlaid Controls

Controls

The appropriate controls (like radio buttons or check boxes) need to be added to capture user input. To add these in Acrobat, go to Tools > Forms and select the control of choice.

The naming convention used for the controls affects later steps in the process. Acrobat automatically assigns a default name to each control; double-click the control to see the name under the General tab. Choosing a name that represents the question it relates to makes it easier to understand where the data is coming from. For example, Q1Gender is easier to track than RadioButton2. Make sure to follow the rules for SAS variable names: do not start with a number, do not use more than 32 characters, and do not include special characters (other than _). Control names with spaces cause additional information to be added to the xml, and this extra information has to be removed before SAS can read the xml.

Exhibit 3: Properties of a Control Name

Exhibit 4: xml Generated from Variables with and without a Space in the Name

Control Name    xml output
Check Box4      <CheckBox4 xfdf:original="Check Box4">$75-$100k</CheckBox4>
CheckBox4       <CheckBox4>$75-$100k</CheckBox4>
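The SAS naming rules listed under Controls can be checked before the form is distributed. The following is a minimal sketch rather than part of the original workflow: the dataset name name_check and the sample control names are made up for illustration, and the check relies on the SAS nvalid function with the V7 naming rules.

```sas
/* Sketch only: name_check and the sample names below are for illustration. */
data name_check;
  length control_name $40;
  infile datalines truncover;
  input control_name $char40.;
  /* nvalid returns 1 when the string is a valid SAS variable name under the
     V7 rules (letters, digits and underscore only, no leading digit,
     32 characters or fewer) and 0 otherwise. */
  valid = nvalid(strip(control_name), 'V7');
  datalines;
Q1Gender
RadioButton2
Check Box4
;
run;

/* List each candidate control name with its validity flag. */
proc print data=name_check;
run;
```

Any control name flagged with valid = 0 is worth renaming in Acrobat before the questionnaire goes out, since renaming afterwards means editing the xml by hand.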
Each control can operate together with other controls or independently. If, for example, two radio buttons have been added to capture the sex of the individual as either male or female, the buttons can be tied together so that selecting one prevents the other from being selected. Giving both controls the same name achieves this. In order to s has a name.

Extracting Responses from Completed PDFs

Once completed questionnaires have been received, save them to a central location (a single folder). The responses can be extracted en masse through Acrobat. Go to Forms > Manage Form Data > Merge Data Files into Spreadsheet (Export Data can be used for a single form). Select the files to extract the data from and click Export. You will be prompted to save the file and designate a location; make sure to change the file type to xml. You will then be prompted to view the file or close the dialog box. Viewing the file opens it in your web browser.

Exhibit 5: Extracting Response Data Stored in Multiple Files

Exhibit 6: Choose the Files that Contain Response Data
Exhibit 7: Viewing the xml Output File will Open Your Web Browser

You might be tempted at this point to use the csv file type instead. However, large data files are very difficult to view because they are too wide: I could only get Notepad to display the data fully, and it wrapped the lines. The exhibit below shows what the file looks like in WordPad and Notepad. WordPad did not fully display the file (it stopped scrolling right). Notepad displays the data, but it's easy to get lost quickly.

Exhibit 8: Reading Large Files in WordPad and Notepad

Adjusting the xml Structure for SAS

Some minor revisions need to be made to the xml to put it into the structure that SAS needs. Any text editor can be used, but I prefer Notepad. The xml needs to tell SAS that it is reading a table, where each record starts and ends, and where the table ends. A table in this case corresponds to a SAS dataset that SAS will generate. The example below shows the basic structure of the xml. If a field is left blank, it includes only the end-field marker. The word RECORD can be changed to a name of your choice; this will be the name of the SAS dataset. All other information that Acrobat automatically includes needs to be removed, such as <DATA>, </FIELD>, and comments outside of the record information.
Exhibit 9: Basic xml Structure that Enables SAS to Read an xml Data File

<?xml version="1.0" encoding="utf-8"?>
<TABLE>
  <RECORD>
    <field1>data from field stored here</field1>
    <field2>data from field stored here</field2>
    </field3>
  </RECORD>
  <RECORD>
    <field1>data from field stored here</field1>
    <field2>data from field stored here</field2>
    </field3>
  </RECORD>
</TABLE>

Exhibit 10: Data File with Acceptable xml Structure
Exhibit 11: Data File with Acceptable xml Structure

Raw xml with Acceptable Structure Markers

<?xml version="1.0" encoding="utf-8"?>
<TABLE>
<DEMO>
<CheckBox4 >$25-$50K</CheckBox4 ><RadioButton14 >Yes</RadioButton14 ><RadioButton2 >Female</RadioButton2 ><Text12 >Great questionnaire!!!</Text12 >
</DEMO>
<DEMO>
<CheckBox4 >$100-$125k</CheckBox4 ><RadioButton14 >No</RadioButton14 ><RadioButton2 >Male</RadioButton2 ><Text12 >Sample questionnaire is too short</Text12 >
</DEMO>
</TABLE>

Pretty (Easy-To-Read) View of xml

<?xml version="1.0" encoding="utf-8"?>
<TABLE>
  <DEMO>
    <CheckBox4>$25-$50K</CheckBox4>
    <RadioButton14>Yes</RadioButton14>
    <RadioButton2>Female</RadioButton2>
    <Text12>Great questionnaire!!!</Text12>
  </DEMO>
  <DEMO>
    <CheckBox4>$100-$125k</CheckBox4>
    <RadioButton14>No</RadioButton14>
    <RadioButton2>Male</RadioButton2>
    <Text12>Sample questionnaire is too short</Text12>
  </DEMO>
</TABLE>

Exhibit 12: SAS Error if the xml is Not in the Appropriate Structure

Working in SAS

Now that all of the prep work is done, it's finally time to tell SAS what to do. Two libname statements and a simple data step are all that is needed to read the data. The key to the code is the xml keyword on the first libname statement, which tells SAS that the file assigned to trans is xml and should be read with the XML engine. The variable names are read directly from the xml tags, and the data are placed in the correct columns.

Exhibit 13: SAS Code to Read an xml File

/* trans points to the edited xml file and is read with the xml engine */
libname trans xml 'D:\don work files\sas Master Files\report.xml';
/* myfiles is the folder where the new SAS dataset is saved */
libname myfiles 'D:\don work files\sas Master Files';

/* copy the DEMO records from the xml file into a SAS dataset */
data myfiles.demo;
  set trans.demo;
run;
Exhibit 14: What's Going On

trans      points to the raw data file
xml        invokes the SAS xml engine
myfiles    where the new data file will be saved
datastep   copies the records from trans.demo into myfiles.demo

Exhibit 15: Success! Data File Read Successfully

SAS Errors

The error descriptions that SAS produces when reading an xml file are not especially helpful. Here's what I would check:

a. Are any of your variable names longer than 32 characters?
b. Did you add the record markers correctly?
c. Did you delete the default markers that Acrobat adds?
d. Did any of the respondents include an & in their text that was not converted to &amp; in the xml?
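Checks (a) and (d) above can be partly automated by scanning the xml file as plain text before handing it to the XML engine. The following is a rough sketch under several assumptions: the file path is the one used in Exhibit 13 and should be changed to your own, no line in the file is longer than 32,767 characters, and the names rawxml, xml_problems, lineno, and problem are made up for illustration. It does not check the record markers or leftover Acrobat markers.

```sas
/* Sketch only: rawxml, xml_problems, lineno and problem are hypothetical names. */
filename rawxml 'D:\don work files\sas Master Files\report.xml';

data xml_problems;
  infile rawxml lrecl=32767 truncover;
  input line $char32767.;
  length problem $40;
  lineno = _n_;

  /* Check (d): an ampersand that does not start a standard xml entity
     or a numeric character reference. */
  if prxmatch('/&(?!(amp|lt|gt|quot|apos|#\d+);)/', line) then do;
    problem = 'unescaped ampersand';
    output;
  end;

  /* Check (a): an opening tag whose name runs past 32 characters. */
  if prxmatch('/<[A-Za-z_][A-Za-z0-9_]{32,}/', line) then do;
    problem = 'tag name over 32 characters';
    output;
  end;

  keep lineno problem;
run;

/* Each row gives the line number in the xml file and the problem found there. */
proc print data=xml_problems;
run;
```

An empty xml_problems dataset does not prove that the file will load, but any row in it points to a line that is worth fixing in the text editor first.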
Fine-Tuning the SAS Dataset

The variables in the dataset that SAS created from the xml will be in reverse order. This is because SAS creates the variable names directly from the data as it reads it. To flip the variables, list them in the desired order in a format statement placed before the set statement in a new datastep.

Exhibit 16: SAS Dataset Created from Reading the xml File

Exhibit 17: SAS Code to Flip the Order of the Variables

data demo;
  /* listing the variables before the set statement fixes their order */
  format checkbox4 radiobutton14 radiobutton2 text12;
  set myfiles.demo;
run;

Shortcut: Use proc contents with the varnum option together with an ods statement to output an Excel file. In Excel, sort the file in descending order and copy the listed variables into your datastep. This assumes that all of the controls were added in order from the start of the form to the end.

Handling Numeric Variables

If a text box is used to capture a number, the variable will be stored as numeric unless the first record that is read is blank (empty). In that case it will be created as a character variable and will need to be converted to numeric. Create a new variable from the original by multiplying it by one, and drop the original variable. Then, in a new datastep, copy the temporary variable back to the original name and drop the temporary variable.

Exhibit 18: Datasteps to Convert a Character to Numeric

/* step 1: create a numeric copy and drop the character original */
data demo;
  set demo;
  text12_temp = text12 * 1;
  drop text12;
run;

/* step 2: copy the numeric value back to the original name */
data demo;
  set demo;
  text12 = text12_temp;
  drop text12_temp;
run;
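The Excel shortcut described under Fine-Tuning the SAS Dataset can also be approximated entirely in SAS. The following is a minimal sketch, not the method used in Exhibit 17: it assumes that the dataset myfiles.demo from Exhibit 13 exists, the names varlist and varorder are made up for illustration, and it uses a retain statement in place of the format statement, which is another common way to set variable order.

```sas
/* Sketch only: varlist and varorder are hypothetical names. */

/* Write one row per variable, with its position (varnum), to a dataset. */
proc contents data=myfiles.demo out=work.varlist(keep=name varnum) noprint;
run;

/* Build a space-separated list of the names, last variable first. */
proc sql noprint;
  select name into :varorder separated by ' '
    from work.varlist
    order by varnum desc;
quit;

/* Naming the variables before the set statement fixes their order. */
data demo;
  retain &varorder;
  set myfiles.demo;
run;
```

As with the Excel version, this only restores the order of the form if the controls were added in order from the start of the form to the end.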
Trademark

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

Acknowledgement

To Paul Showalter, for helping me work out some issues that I was having with my sample PDF.

Contact Information

Your comments and questions are valued and encouraged. Please contact the author at:

donevans@hotmail.com
202/283-9521