
Graph Theory for Modelling a Survey Questionnaire

Pierpaolo Massoli, ISTAT
via Adolfo Ravà 150, 00142 Roma, Italy
e-mail: pimassol@istat.it

1. Introduction

In the questionnaires used to collect survey data, questions are usually asked following specific patterns. Depending on the complexity of the questionnaire, several different questionnaire routes can emerge, because whether a question is asked often depends on the answers given to other questions. It is therefore important to establish which responses are missing and which are simply not applicable. Questionnaire routes are also called skip patterns (Fagan and Greenberg, 1988). When an adequate mathematical model is lacking, only a partial set of these patterns is usually extracted by inspecting the questionnaire. This practice can lead to ineffective data checking and to the creation of more software than necessary, which in turn causes errors because the software becomes harder to maintain.

It is, however, quite common to collect information about the questionnaire itself, such as the number of variables, the type of each variable, the range of each quantitative variable, the items of each qualitative variable and so on. This information is called metadata and is commonly used to manage data processing in its entirety. Moreover, the literature reports several examples of the use of graph theory (Diestel, 2005) for modelling the various steps of survey data processing. The mathematical structure widely adopted to describe a survey questionnaire completely is the directed acyclic graph. This paper proposes a possible use of metadata to obtain the graph related to a typical survey questionnaire. Skip patterns are necessary to assess the feasibility of each record in the questionnaire dataset, which collects the responses to all the interviews. A good model must distinguish the questions that should have been answered but are missing from those that are not applicable.
The approach proposed here requires less matrix algebra than others. This work proposes the construction of a feasibility check algorithm that can be adopted in any survey. Basic definitions and terminology of graph theory are reported in Section 2. Section 3 introduces the metadata needed to deal with various classic questionnaire structures. Section 4 describes how the directed graph is used to check the feasibility of the questionnaire data records. A brief discussion of the results and of further developments closes the paper.

2. Basic definitions

A graph is a pair of sets $G = (V, E)$ such that $E \subseteq V \times V$ and $V \cap E = \emptyset$. The elements of the finite set $V$ are called vertices $v_i$ (with $i = 1, 2, 3, \ldots, n$, where $n$ is the cardinality of $V$), and the elements $e_k = (v_i, v_j)$ of $E$ are unordered pairs of elements of $V$ called edges (with $k = 1, 2, 3, \ldots, m$, where $m$ is the cardinality of $E$). A graph whose edges are unordered pairs of vertices is called undirected, because its edges have no direction. A vertex $v_i$ is incident with an edge $e_j$ if $v_i \in e_j$, meaning that the vertex $v_i$ belongs to the edge $e_j$. Two vertices $v_i$ and $v_j$ are adjacent if and only if there exists an edge joining them. The two vertices incident with an edge are called its ends. In a directed graph the edges are ordered pairs of vertices $(v_i, v_j)$; here the label $j$ of the vertex $v_j$ is greater than the label $i$ of the vertex $v_i$. A path from $v_i$ to $v_k$ ($k \le n$) is a sequence of vertices and edges in which all vertices are distinct. If the vertices are not all distinct, the sequence is called a chain. The number of edges in a path or in a chain is called its length.
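
As a minimal illustration of these definitions, the sketch below stores a small graph as a set of ordered vertex pairs and classifies a vertex sequence as a path or a chain. The example graphs and the helper names (`is_walk`, `classify`, `length`) are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]  # ordered pair (i, j), read as v_i -> v_j


def is_walk(edges: Set[Edge], seq: List[int]) -> bool:
    """True if every consecutive pair of vertices in seq is joined by an edge."""
    return all((a, b) in edges for a, b in zip(seq, seq[1:]))


def classify(edges: Set[Edge], seq: List[int]) -> str:
    """Return 'path' if all vertices are distinct, 'chain' otherwise."""
    if not is_walk(edges, seq):
        return "not a sequence of adjacent vertices"
    return "path" if len(set(seq)) == len(seq) else "chain"


def length(seq: List[int]) -> int:
    """Number of edges traversed by the sequence."""
    return max(len(seq) - 1, 0)


# Hypothetical graph on v1..v4 in which every edge goes from a lower to a higher label.
EDGES: Set[Edge] = {(1, 2), (1, 3), (2, 3), (3, 4)}
# Hypothetical graph containing a cycle, used only to show a chain.
CYCLIC: Set[Edge] = {(1, 2), (2, 3), (3, 1)}

print(classify(EDGES, [1, 2, 3, 4]), length([1, 2, 3, 4]))
print(classify(CYCLIC, [1, 2, 3, 1]))
```

When every edge goes from a lower to a higher label, as in the first example, no vertex can be visited twice, so only paths can occur.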

If $v_i$ and $v_j$ are vertices of a graph and there exists a path with initial vertex $v_i$ and terminal vertex $v_j$, then there is a path from $v_i$ to $v_j$; in this case $v_i$ precedes $v_j$ and $v_j$ succeeds $v_i$. If there is an edge incident with both $v_i$ and $v_j$, then $v_i$ is an immediate predecessor of $v_j$ and $v_j$ is an immediate successor of $v_i$. A graph in which there are only paths is called acyclic, because it contains no cycles. A path that is not contained in any other path is called maximal. It is always possible to find (or, if necessary, to create) a unique vertex from which one or more edges start but at which none ends (the source) and a unique vertex at which one or more edges end but from which none starts (the sink), such that each maximal path of the graph has the source as its initial vertex and the sink as its terminal vertex.

An important matrix for dealing with a graph $G = (V, E)$ with vertices $V = \{v_1, v_2, \ldots, v_n\}$ and edges $E = \{e_1, e_2, \ldots, e_m\}$ is the incidence matrix (of a directed graph) $B = \{b_{ij}\}$, of dimension $n \times m$, defined as follows:

$$ b_{ij} = \begin{cases} +1 & \text{if } v_i \in e_j \text{ and } e_j \text{ is outgoing from } v_i \\ -1 & \text{if } v_i \in e_j \text{ and } e_j \text{ is incoming in } v_i \\ 0 & \text{otherwise} \end{cases} \qquad (1) $$

In the case of an undirected graph this becomes $b_{ij} = 1$ if $v_i \in e_j$, regardless of the direction of the edge $e_j$. As is well known in the literature, in this undirected case the product $B B^T$ equals the sum of two matrices, $A + D$. Matrix $A = \{a_{ij}\}$ is an $n \times n$ matrix called the adjacency matrix, with 0/1 elements, where $a_{ij} = 1$ if $v_i$ and $v_j$ are adjacent. This matrix is symmetric for an undirected graph, while for a directed graph it becomes triangular and is usually called the source-target matrix. Matrix $D$ is an $n \times n$ diagonal matrix whose elements $d_{ii}$ are equal to the degree¹ of vertex $v_i$ ($i = 1, 2, \ldots, n$).

3. Questionnaire metadata

All the interviews conducted using the same questionnaire are usually collected in the same dataset. Each interview constitutes a data record of the dataset.
The number of variables $n_v$ in each data record depends on the questionnaire's responses, each of which can be stored in one or more variables. There are questions for which a single response is appropriate and others with a multiple choice response. Classic variable types include the numeric type and the character type, used for example to store, respectively, a response concerning one's personal income and a string giving the profession of an individual. The domain and the range of a variable are used for qualitative and quantitative variables respectively. All the information concerning the order of the variables, the variable types and all the leaps from one question to another is stored in an ASCII file in comma-separated values format. This file is called the routing metadata file, and its record layout is described in Table 1. Since there is a marked analogy between vertices and variables, as well as between edges and items, this file must contain all the structural information needed to construct the graph that underlies the questionnaire.

¹ The degree of a vertex is the total number of edges incident with it.

Table 1 Routing file record layout

Column        Description
1             Variable name - character
2             Number of items - integer number [0, +∞)
3             Minimum variable value - integer number (-∞, +∞)
4             Maximum variable value - integer number (-∞, +∞)
5             Group's label - integer number (-∞, +∞)
6             Number of the leaps - integer number [-1, 98]
7             First variable item that realizes the leap - integer number (-∞, +∞)
8             Variable to visit because of the first leap - character
9             Second variable item that realizes the leap - integer number (-∞, +∞)
10            Variable to visit because of the second leap - character
11, 12, ...   ...

Each variable is represented by a vertex, and each item of a variable corresponds to an edge. Variables are sorted in the routing file according to the order of the questions in the questionnaire. The variable types considered in this paper are the numeric type, for all those questions whose responses are coded using a progressive integer number, and the character type, for all those questions that accept an open response (a word or a phrase). In order to distinguish a character variable from a numeric one, for the former the number of items $n_{items}$ equals 0, the minimum value $v_{min}$ is a null value and the maximum value $v_{max}$ equals 1 (variable filled). As mentioned above, a numeric variable can be quantitative or qualitative. For quantitative variables, the number of items must be equal to 1, and the minimum and maximum values give the range of the variable. For qualitative variables, the number of items is greater than 0, and the minimum and maximum values give the domain of the variable, taking into account that consecutive items of the variable are equidistant², with a distance equal to $(v_{max} - v_{min}) / (n_{items} - 1)$.

Moreover, the interviewer commonly checks additional information regarding the interviewee or is asked to recall a previously given answer.
Even though this practice should be avoided because it is often a cause of errors, this information determines which skip pattern has to be followed to continue the interview, and it complicates or, at worst, makes it impossible to derive a directed acyclic graph from the questionnaire. Therefore, in order to ensure that the graph can be constructed, it may be necessary to insert auxiliary variables in the routing file as well as in the questionnaire dataset. These variables are not directly related to any question and are created in accordance with specific information contained in another ASCII file, called the auxiliary file. A description of this auxiliary metadata file is given in Table 2. In this paper, variables that are directly related to the questionnaire are called base variables. By using this technique, it is always possible to traverse the graph correctly, starting from its source (initial variable) and arriving at its sink (terminal variable). The construction of an auxiliary variable can involve base variables of the same dataset, base variables that belong to another dataset, or both. An auxiliary variable is always qualitative with a numeric code, and its items are valued in accordance with the base variables even if the base variables are quantitative. Each base variable is related to a block made up of one or more rows. Each row in a block refers to an item of the base variable. Blocks are read in the order of their block labels (third column of the auxiliary file). There is no theoretical limit to the number of base variable blocks that one can write in an auxiliary file. As shown in Table 2, a flag is used to decide whether the base variable must be dropped from the questionnaire dataset or not.

² A qualitative variable whose items are not equidistant can be treated by constructing an auxiliary variable to be inserted in the routing file as an immediate successor of that variable.

Table 2 Auxiliary file record layout

Column   Description
1        Auxiliary variable's name - character
2        Auxiliary variable's item - integer number (-∞, +∞)
3        Label of the base variable's block - integer number [1, +∞)
4        Name of the dataset containing the first base variable - character
5        First base variable name - character
6        Key variable to merge with the dataset of the interviews - character
7        Item of the first base variable - integer number (-∞, +∞)
8        Minimum value of the first base variable - real number (-∞, +∞)
9        Maximum value of the first base variable - real number (-∞, +∞)
10       Flag that indicates if the base variables are to be dropped (0/1)

Each row of the auxiliary file that refers to the same auxiliary variable and to the same base variable block is a statement in logical OR with the other rows of that block. Different base variable blocks concerning the same auxiliary variable constitute statements in logical AND relation. More than one base variable can be used to evaluate an item of the related auxiliary variable.

Two typical features of questionnaire structures are multiple choice responses and responses that always involve two different and mutually exclusive variables. In order to model these two classic kinds of question, the concept of a group (of variables) has been introduced. A multiple choice response is related to a group in logical OR relation. Mutually exclusive responses correspond, for example, to a question asking the amount of a certain expense when the subject being interviewed may not have that expense at all. These cases are modelled by means of a group of two variables in logical XOR relation.
From the graph theory point of view, each of these two kinds of group is treated as if it were a single vertex, so that an edge incident with this vertex can be incident with one or more variables of the group if it is an OR group, or with exactly one of the two variables if it is a XOR group. The fifth column of the routing metadata file describes these structures: if the variable is single (single response), the integer written in the column is 0, while it is a positive integer for OR groups and a negative integer for XOR groups. All the variables of a group are labelled with the same integer number.

The remaining columns are required to draw the questionnaire routes completely. The sixth column indicates how many leaps ahead ($n_{routes}$) can be made starting from the variable $k$ written in the first column of the same row. The number of leaps must be a positive integer less than or equal to the number of items of the variable $k$, apart from three special values. A value of 0 means that there are no leaps and that the variable $k+1$ immediately below is adjacent, for every item, to variable $k$. A value of -1 means that variables $k$ and $k+1$ are not adjacent (there is no edge joining them). A value of 98 is a special code indicating that there is a leap to the variable $h > k$ regardless of the item value of variable $k$. The next columns in the routing file are an alternating sequence of an item of the variable $k$ and the name of the variable $h > k$ whose value should be present if the variable $k$ assumes the item value indicated in the previous column. The length of this sequence depends on the number of leaps indicated by $n_{routes}$.
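
The record layout of the routing file can be made concrete with a short sketch that reads comma-separated rows into a small data structure. This is only an illustration under stated assumptions: the class `RoutingRow`, the helper functions, the variable names Q1 to Q4 and the numeric codes in the sample are all hypothetical, and empty fields are assumed to encode null values.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RoutingRow:
    name: str
    n_items: int                 # 0 marks a character variable
    v_min: Optional[int]
    v_max: Optional[int]
    group: int                   # 0 = single, > 0 = OR group, < 0 = XOR group
    n_routes: int                # -1, 0, 98 or the number of leaps
    leaps: List[Tuple[Optional[int], str]]  # (item, variable to visit) pairs


def parse_int(text: str) -> Optional[int]:
    """Convert a field to int, treating an empty field as a null value."""
    text = text.strip()
    return int(text) if text else None


def parse_row(cells: List[str]) -> RoutingRow:
    """Read the six fixed columns, then the alternating item/variable pairs."""
    rest = [c.strip() for c in cells[6:]]
    leaps = [(parse_int(rest[i]), rest[i + 1])
             for i in range(0, len(rest) - 1, 2) if rest[i + 1]]
    return RoutingRow(cells[0].strip(), int(cells[1]), parse_int(cells[2]),
                      parse_int(cells[3]), int(cells[4]), int(cells[5]), leaps)


def group_kind(row: RoutingRow) -> str:
    """Interpret the sign of the group label (fifth column)."""
    if row.group == 0:
        return "single"
    return "OR" if row.group > 0 else "XOR"


def item_values(row: RoutingRow) -> List[float]:
    """Equidistant item values with step (v_max - v_min) / (n_items - 1)."""
    if row.n_items == 0 or row.v_min is None or row.v_max is None:
        return []
    if row.n_items == 1:  # assumption: a single item sits at v_min
        return [float(row.v_min)]
    step = (row.v_max - row.v_min) / (row.n_items - 1)
    return [row.v_min + k * step for k in range(row.n_items)]


# Hypothetical routing file content, one variable per row.
SAMPLE = """Q1,2,1,2,0,1,1,Q4
Q2,3,10,30,0,0
Q3,0,,1,0,0
Q4,1,1,1,1,0
"""
ROWS = [parse_row(cells) for cells in csv.reader(io.StringIO(SAMPLE)) if cells]
```

In this sample, Q1 leaps to Q4 when its item 1 is chosen, Q2 has three equidistant items, Q3 is a character variable and Q4 belongs to an OR group.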

Figure 1 Acyclic graph underlying a questionnaire

Figure 1 shows an example of a directed acyclic graph with a source ($v_1$) and a sink ($v_9$). The graph refers to a questionnaire with 3 single response questions, 1 multiple choice response question and 1 question with mutually exclusive responses. Nine variables are necessary to store the responses, as can be seen from the routing file in Table 3.

Table 3 Routing file example

Name   n_items   v_min   v_max   l_group   n_routes   item_1   leap_1   item_2
v1     2         e1      e2      0         1          e1       v6
v2     1         e3      e3      0         0
v3     1         e4      e4      -1        1          e4       v9
v4     1         e5      e5      -1        0
v5     2         e6      e7      0         1          e6       v9
v6     1         e8      e8      1         0
v7     1         e9      e9      1         0
v8     1         e10     e10     1         0
v9     .         .

The vertex set $V$ of the graph has 9 elements, while the edge set $E$ has 10 elements. Variable $v_1$ corresponds to the source vertex and $v_9$ to the sink vertex of the graph. The transposed incidence matrix of the graph in Figure 1 is reported below (Equation 2) and can be derived directly from the metadata in Table 3.
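
The sign convention of Equation (1) can be illustrated with a minimal sketch on a hypothetical three-vertex graph in which every edge joins exactly two vertices (no variable groups). The function names and the example graph are assumptions made for illustration only. The sketch also checks numerically that the product $B B^T$ equals $D + A$ for the unsigned matrix and $D - A$ for the signed one, where $A$ is taken here as the symmetric adjacency matrix of the underlying undirected graph.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def incidence(n: int, edges: List[Tuple[int, int]], signed: bool = True) -> Matrix:
    """n x m incidence matrix; edges are (tail, head) pairs of 0-based indices."""
    b = [[0] * len(edges) for _ in range(n)]
    for j, (tail, head) in enumerate(edges):
        b[tail][j] = 1                     # e_j is outgoing from the tail vertex
        b[head][j] = -1 if signed else 1   # e_j is incoming in the head vertex
    return b


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def degree_and_adjacency(n: int, edges: List[Tuple[int, int]]) -> Tuple[Matrix, Matrix]:
    """Diagonal degree matrix D and symmetric 0/1 adjacency matrix A."""
    d = [[0] * n for _ in range(n)]
    a = [[0] * n for _ in range(n)]
    for tail, head in edges:
        d[tail][tail] += 1
        d[head][head] += 1
        a[tail][head] = a[head][tail] = 1
    return d, a


def combine(d: Matrix, a: Matrix, sign: int) -> Matrix:
    """Element-wise D + sign * A."""
    return [[x + sign * y for x, y in zip(rd, ra)] for rd, ra in zip(d, a)]


# Hypothetical graph: e0 = v0 -> v1, e1 = v0 -> v2, e2 = v1 -> v2.
EDGES = [(0, 1), (0, 2), (1, 2)]
B = incidence(3, EDGES)
BT = transpose(B)          # one row per edge, one column per vertex
D, A = degree_and_adjacency(3, EDGES)
```

Each row of `BT` holds a +1 in the column of the vertex the edge leaves and a -1 in the column of the vertex it enters, which is the layout used by the feasibility check.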

[Equation (2): the $10 \times 9$ transposed incidence matrix $B^T$ of the graph in Figure 1, with one row per edge and one column per vertex; its entries did not survive text extraction.]

Because of the presence of variable groups, there are more than two vertices incident with the same edge. This is a very common matrix structure when one deals with a typical survey questionnaire.

4. Questionnaire data record feasibility check

An algorithm organized in three steps is proposed in order to check the feasibility of each record in the questionnaire dataset $X$ ($n_r \times n_v$), with $n_r$ records and $n_v$ variables. The dataset contains base variables and, possibly, auxiliary variables. In the first step, a copy $X^*$ ($n_r \times n_{items}$) of the questionnaire dataset is transformed so that each variable in a data record is related to as many dummy variables as the number of its items. In the second step, the transposed incidence matrix $B^T$ ($n_{items} \times n_v$) of the graph underlying the questionnaire is evaluated by reading the routing metadata file. In the third step, the matrix $B^T$ is used to detect, for each record in $X^*$, values that must be missing and missing values that are not allowed. The pseudo code of the third step is shown in Figure 2.

Figure 2 Feasibility check

for each record in X
    find all the variables (vertices) not null in X and create a vector (1 x n_v) with 0/1 elements (0 = variable empty, 1 = variable filled) --> datapath
    find all the items (edges) not null in X* --> items_list
    for each item i in items_list
        find in the i-th row of the matrix B^T the vertex with value = +1 --> initial_vertex
        find in the i-th row of the matrix B^T the vertex with value = -1 --> terminal_vertex
        for k from initial_vertex to terminal_vertex
            match the k-th element of datapath with the absolute value of the k-th element of the i-th row of matrix B^T --> test
            if (test > 0) then "value must be missing"
            if (test < 0) then "missing value not allowed"
        end for
    end for
end for
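
The third step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the paper: each row of $B^T$ is assumed to hold exactly one +1 and one -1 (no variable groups), the initial vertex is assumed to precede the terminal one, and `test` is taken to be the difference between the datapath element and the absolute value of the matrix element, which is one reading of the word "match" in Figure 2. The function name and the example graph are hypothetical.

```python
from typing import List, Tuple

Error = Tuple[int, int, str]  # (item index, variable index, message)


def check_record(bt: List[List[int]], datapath: List[int],
                 items_list: List[int]) -> List[Error]:
    """Check one record against the transposed incidence matrix bt.

    datapath[k] is 1 if variable k is filled and 0 if it is empty;
    items_list holds the row indices of bt for the items present in the record.
    """
    errors: List[Error] = []
    for i in items_list:
        row = bt[i]
        initial_vertex = row.index(1)     # variable the item belongs to
        terminal_vertex = row.index(-1)   # variable the item leads to
        for k in range(initial_vertex, terminal_vertex + 1):
            test = datapath[k] - abs(row[k])
            if test > 0:
                errors.append((i, k, "value must be missing"))
            elif test < 0:
                errors.append((i, k, "missing value not allowed"))
    return errors


# Hypothetical questionnaire with three variables and three items:
# item 0: v0 -> v1, item 1: v0 -> v2 (skips v1), item 2: v1 -> v2.
BT = [[1, -1, 0],
      [1, 0, -1],
      [0, 1, -1]]

# v1 is filled although the answer to v0 skips it.
skipped_but_filled = check_record(BT, [1, 1, 1], [1, 2])
# v1 is empty although the answer to v0 leads to it.
required_but_empty = check_record(BT, [1, 0, 1], [0])
# A record that follows the route v0 -> v1 -> v2.
feasible = check_record(BT, [1, 1, 1], [0, 2])
```

Variables lying strictly between the initial and the terminal vertex have a zero entry in the row, so any filled value there is reported as one that must be missing, while an empty terminal variable is reported as a missing value that is not allowed.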

Because of the presence of variable groups, the algorithm reported in Figure 2 would have to be changed, and both its complexity and its computational effort would increase. Therefore, in order to lighten the computational effort, the dimensions of the matrices must be kept as small as possible. Thus, each group of vertices is coarsened to a new vertex, as if it were related to a single variable. This task is accomplished during the first step of the proposed algorithm by inserting in the questionnaire dataset an appropriate auxiliary variable that substitutes the related group of variables. This new variable is created by using the information written in the auxiliary file. Substituting the variable groups reduces the questionnaire dataset and the related incidence matrix (fewer variables).

Figure 3 Graph after vertices coarsening

As shown in Figure 3, two auxiliary variables ($v_{aux1}$, $v_{aux2}$) substitute the variable groups ($v_3$, $v_4$) and ($v_6$, $v_7$, $v_8$), so that the coarsened graph reduces to 6 variables. In this case, the number of items of the coarsened graph remains the same as in the original one.

Table 4 Auxiliary file for XOR and OR variables groups

Name     item_aux   block   basedset   basename   key   item_base   (v_base)min   (v_base)max   Drop
v_aux1   1          1                  v3               e4                                      1
v_aux1   2          1                  v4                           (e5)min       (e5)max       1
v_aux2              1                  v6               e8                                      1
v_aux2              1                  v7               e9                                      1
v_aux2              1                  v8               e10                                     1
...

This file is needed to create the statements reported in Figure 4 and Figure 5.
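
The coarsening of a group into a single auxiliary variable can be sketched with two small functions that mirror the statements of Figures 4 and 5. This is a hedged illustration: the function names and the numeric codes in the examples are hypothetical, `None` stands for the empty value, and the OR case covers only base variables with a single item (filled or empty), not the two-item branch of Figure 5.

```python
from typing import List, Optional


def xor_aux(first: Optional[int], second: Optional[float], first_item: int,
            second_min: float, second_max: float) -> Optional[int]:
    """Auxiliary variable for a XOR group of two base variables.

    Returns 1 if the first variable holds its item, 2 if the second variable
    lies within its range, and None (empty value) otherwise.
    """
    if first == first_item:
        return 1
    if second is not None and second_min <= second <= second_max:
        return 2
    return None


def or_aux(values: List[Optional[int]]) -> Optional[int]:
    """Auxiliary variable for an OR group of single-item base variables.

    The k-th variable (1-based) contributes 2**(k-1) when it is filled, so the
    result S records which variables of the group are filled. S is None when
    every variable of the group is empty.
    """
    if all(v is None for v in values):
        return None
    return sum(2 ** k for k, v in enumerate(values) if v is not None)


# Hypothetical codes: the first XOR variable has item 4, the second a range 10..20.
print(xor_aux(4, None, 4, 10, 20))
print(xor_aux(None, 15, 4, 10, 20))
# First and third variables of a three-variable OR group are filled.
print(or_aux([8, None, 10]))
```

Because each variable contributes a distinct power of two, every combination of filled variables in an OR group maps to a different value of the auxiliary variable.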

Figure 4 Statement for a XOR group

v_aux1 = <empty value>
if v3 = e4 then v_aux1 = 1
else if (e5)min <= v4 <= (e5)max then v_aux1 = 2

Figure 5 Statement for a OR group

create an array with the names of the variables of the group: L = {v6, v7, v8}
create an array with the number of items: n_items = {1, 1, 1}
create an array with the minimum values: e_min = {e8, e9, e10}
S = 0
missing_values = 0
for each element k in L
    if L(k) is not equal to <empty value> then
        if n_items(k) = 2 then S = S + 2^(k-1) * (L(k) - e_min(k)) / 2
        else S = S + 2^(k-1)
    else missing_values = missing_values + 1
end for
if missing_values = <length of L> then S = <empty value>
v_aux2 = S

A group of base variables in logical OR relation is substituted by a numeric quantitative variable (the $n_{items}$ field of the auxiliary variable must be equal to 1) whose value $S$ indicates which base variables are filled or left empty (when the $n_{items}$ field of the base variable is equal to 1) as well as which variables received a positive or a negative response (when the $n_{items}$ field of the base variable is equal to 2). To complete the algorithm for the construction of auxiliary variables, the case of a generic auxiliary variable ($n_{routes}$ equal to 0) must be treated. An example of an auxiliary variable related to three base variables is reported in Tables 5-7.

Table 5 Auxiliary file block for creating a generic auxiliary variable (block 1)

Name    item_aux   block   basedset   basename   key   item_base   (v_base)min   (v_base)max   Drop
v_aux   1          1                  v_base1          x1                                      0
v_aux   2          1                  v_base1          x2                                      0

Table 6 Auxiliary file block for creating a generic auxiliary variable (block 2)

Name    item_aux   block   basedset   basename   key      item_base   (v_base)min   (v_base)max   Drop
v_aux              2       dset2      v_base2    v_key2   y1                                      0
v_aux              2       dset2      v_base2    v_key2   y2                                      0

Table 6 Auxiliary file block for creating a generic auxiliary variable (block 3)

Name    item_aux   block   basedset   basename   key      item_base   (v_base)min   (v_base)max   Drop
v_aux   1          3       dset3      v_base3    v_key3               u_min         u_max         0
v_aux   2          3       dset3      v_base3    v_key3               w_min         w_max         0

These three blocks generate the statement reported in Figure 6, which is applied to the questionnaire dataset that has to be checked.

Figure 6 Statement for creating a generic auxiliary variable

for each record in the (temporary) dataset in which the auxiliary variable is to be inserted
    if it is the first occurrence of the key variable then v_aux = <empty value>
    if (v_aux is not equal to μ) and (v_base1 = x1) and (v_base2 = y1) and (u_min <= v_base3 <= u_max) then v_aux = 1
    if (v_aux is not equal to μ) and (v_base1 = x1) and (v_base2 = y2) and (u_min <= v_base3 <= u_max) then v_aux = 1
    if (v_aux is not equal to μ) and (v_base1 = x2) and (v_base2 = y1) and (w_min <= v_base3 <= w_max) then v_aux = 2
    if (v_aux is not equal to μ) and (v_base1 = x2) and (v_base2 = y2) and (w_min <= v_base3 <= w_max) then v_aux = 2
    if it is the last occurrence of the key variable then save the record with the inserted v_aux
end for

Base variables that belong to other datasets (base datasets) are handled by first merging these datasets, one by one, with the questionnaire dataset using the related key variables. The number of records in the resulting temporary dataset can be larger than the number of records of the original questionnaire dataset. This depends on the type of the base datasets (for example, in a social survey, a household level questionnaire dataset merged with an individual level base dataset) as well as on the variable used to link the base dataset to the questionnaire dataset (the key variable).
Because the auxiliary variable is related to the questionnaire dataset, the number of records after its insertion must remain the same as before. Furthermore, the value μ, which equals the lowest item of the auxiliary variable, is used so that an auxiliary variable can be constructed even in those cases that require the interviewer to check whether the interviewee has answered "yes" at least once to a question related to a group of variables in which each variable has two items (yes/no) and the variables are not in any logical relation to one another.
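
The block logic behind Figure 6 (rows of a block in logical OR, blocks in logical AND, and the role of μ across the records that share a key) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the helper names, the numeric codes chosen for the items and the ranges are hypothetical, and `None` stands for the empty value.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, float]
Condition = Callable[[Row], bool]
Rules = Dict[int, List[List[Condition]]]  # item -> blocks -> OR-ed conditions


def equals(name: str, value: float) -> Condition:
    """Condition 'base variable equals a given item'."""
    return lambda row: row.get(name) == value


def between(name: str, low: float, high: float) -> Condition:
    """Condition 'base variable lies within [low, high]'."""
    def cond(row: Row) -> bool:
        v = row.get(name)
        return v is not None and low <= v <= high
    return cond


def aux_value(rows: List[Row], rules: Rules, mu: int) -> Optional[int]:
    """Auxiliary variable for all merged records that share one key.

    An item is assigned when every block holds (AND) and, within a block, at
    least one condition holds (OR). Once the value equals mu it is kept.
    """
    v_aux: Optional[int] = None           # first occurrence of the key
    for row in rows:
        for item, blocks in rules.items():
            if v_aux != mu and all(any(c(row) for c in block) for block in blocks):
                v_aux = item
    return v_aux                          # saved at the last occurrence


# Hypothetical codes: x1 = 1, x2 = 2, y1 = 1, y2 = 2, u in [0, 10], w in [11, 20].
either_y = [equals("v_base2", 1), equals("v_base2", 2)]
RULES: Rules = {
    1: [[equals("v_base1", 1)], either_y, [between("v_base3", 0, 10)]],
    2: [[equals("v_base1", 2)], either_y, [between("v_base3", 11, 20)]],
}

merged = [
    {"v_base1": 2, "v_base2": 1, "v_base3": 15},  # matches item 2
    {"v_base1": 1, "v_base2": 2, "v_base3": 5},   # matches item 1 (= mu)
    {"v_base1": 2, "v_base2": 2, "v_base3": 12},  # ignored: value already mu
]
result = aux_value(merged, RULES, mu=1)
```

Collapsing all the merged records of a key into one value is what keeps the number of records of the questionnaire dataset unchanged after the auxiliary variable is inserted.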

5. Conclusions

The feasibility check has been implemented using SAS System V9.1 on a Windows XP platform. The software package developed here is generalized and requires the following inputs: a questionnaire dataset, a routing metadata file and an auxiliary metadata file (if necessary). The related outputs are a dataset containing all the errors encountered and a report file. The latter includes a detailed description of the errors encountered and also various statistics, such as the number of errors per record and the number of errors per variable, in order to give an exhaustive analysis of the feasibility of the records of the questionnaire dataset. For testing purposes, this software has been used to assess the feasibility of the data records of the household register, the household questionnaire and the personal questionnaire³ administered in the European Survey on Income and Living Conditions (EU-SILC). As an example, Table 7 reports some results of the check executed on the household questionnaire.

Table 7 Summary of the feasibility check

Input dataset                               HOUSEHOLD_QUESTIONNAIRE
Number of records                           20982
Number of base variables                    356
Number of character variables               18
Number of auxiliary variables               9
Number of variables in routing file         365
Number of XOR groups                        10
Number of OR groups                         22
Graph's incidence matrix dimensions         265 x 598
Overall duration of the feasibility check   02 h 07 m 23.38 s
% of feasible records                       60.2

Because this approach is general, the related software could be applied to any survey that uses a questionnaire as its method of data collection. The execution time of the feasibility check should be as low as possible to improve efficiency. Leaving aside matters of machine performance, execution time is strictly related to the dimensions of the matrices of the graph underlying the questionnaire and to code optimization. For these reasons, the software implementation is still in progress and is continually updated.
The considerable advantage of modelling a survey questionnaire using graph theory lies in obtaining a very compact and effective tool to be used in data production; this tool helps to manage every change in the questionnaire form simply by using the related metadata files. In addition, once the graph has been constructed, various methods become available, for example graph partitioning to divide the questionnaire into different sections in order to treat them separately, minimum path searching, and a complete listing of the skip patterns in order to measure questionnaire complexity.

³ For more details see: http://www.istat.it/strumenti/rispondenti/indagini/famiglia_societa/eusilc/2007/silc_reg_2007.pdf, http://www.istat.it/strumenti/rispondenti/indagini/famiglia_societa/eusilc/2007/silc_fam_2007.pdf, http://www.istat.it/strumenti/rispondenti/indagini/famiglia_societa/eusilc/2007/silc_ind_2007.pdf.

References

Diestel, R. (2005), Graph Theory, Springer-Verlag, Heidelberg, New York.

Fagan, J. and Greenberg, B. Y. (1988), Using Graph Theory to Analyze Skip Patterns in Questionnaires, Bureau of the Census Statistical Research Division Report Series, SRD Research Report Number Census/SRD/RR-88106, Statistical Research Division, Bureau of the Census, Washington, D.C. 20233.