Document: PM-2003-05/EN
Original: English

"Transport Statistics"
WORKING GROUP ON PASSENGER MOBILITY STATISTICS
Luxembourg, 24-25 April 2003
Jean Monnet Building, Room M5
Beginning 0:00 am

Database and Data Retrieval - Data Analysis Software ELMIS
Item 5.5 of the agenda
Design and Application of a Travel Survey for European Long-distance Trips Based on an International Network of Expertise
5th Framework Programme - Competitive and Sustainable Growth
European Commission DG TREN
University of Maribor, SLOVENIA
1 INTRODUCTION

The database brings together a consistent set of long-distance passenger travel information collected throughout the European Union and Switzerland. Information coded from paper questionnaires and CATI systems forms the basis of this database. Apart from coded information, the database also contains other valuable information such as weighting and projection factors.

The results are available through the European Long-distance Mobility Information System (ELMIS), a web-based application supported by the NESSTAR statistical engine. ELMIS delivers information about the survey concepts as well as the collected data and tools for data analysis.

2 DATABASE, CODING AND DATA RETRIEVAL

2.1 DATABASE

The database provides information about long-distance mobility in the European Union. Its contents describe long-distance mobility at several levels of detail. The database holds information at household and person levels, and travel behaviour information such as journeys and trips is linked to both levels. Travel information itself is structured into two levels of detail: journey level and trip level. Journeys provide general information about travel behaviour, whilst trips give a deeper insight into the journeys of particular interest to the project. In addition to journeys and trips, the database also contains information about commuting journeys.

In order to avoid data redundancy and to preserve all the relationships, the database is structured as a set of normalised relational tables. Figure 1 in the Annex presents an entity-relationship diagram of the database structure.

The survey design allows different survey methods to be applied, and the choice of method affects the data content. In telephone surveys, travel information is in most cases gathered on just one person from the household. In postal surveys, households are asked to provide information for all members of the household.
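The household, person, journey and trip levels described in section 2.1 can be illustrated with a small relational sketch. It is only an illustration: all table and column names are hypothetical, it uses an in-memory SQLite database rather than the project's own database system, and it omits commuting journeys, excursions and participants.

```python
import sqlite3

# All table and column names below are hypothetical placeholders; the real
# layout is given by the entity-relationship diagram in the Annex.
SCHEMA = """
CREATE TABLE households (
    hh_id   INTEGER PRIMARY KEY,
    country TEXT NOT NULL
);
CREATE TABLE persons (
    person_id INTEGER PRIMARY KEY,
    hh_id     INTEGER NOT NULL REFERENCES households(hh_id)
);
CREATE TABLE journeys (
    journey_id INTEGER PRIMARY KEY,
    person_id  INTEGER NOT NULL REFERENCES persons(person_id),
    purpose    TEXT
);
CREATE TABLE trips (
    trip_id    INTEGER PRIMARY KEY,
    journey_id INTEGER NOT NULL REFERENCES journeys(journey_id),
    mode       TEXT
);
"""


def build_demo_db():
    """Create an in-memory database holding a few made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO households VALUES (1, 'SI')")
    conn.executemany("INSERT INTO persons VALUES (?, ?)", [(10, 1), (11, 1)])
    conn.executemany(
        "INSERT INTO journeys VALUES (?, ?, ?)",
        [(100, 10, "holiday"), (101, 11, "business")],
    )
    conn.executemany(
        "INSERT INTO trips VALUES (?, ?, ?)",
        [(1000, 100, "car"), (1001, 100, "car"), (1002, 101, "train")],
    )
    conn.commit()
    return conn


def trips_of_household(conn, hh_id):
    """Follow the household -> person -> journey -> trip links."""
    return conn.execute(
        """
        SELECT t.trip_id, p.person_id, j.journey_id, t.mode
        FROM trips t
        JOIN journeys j ON j.journey_id = t.journey_id
        JOIN persons p ON p.person_id = j.person_id
        WHERE p.hh_id = ?
        ORDER BY t.trip_id
        """,
        (hh_id,),
    ).fetchall()
```

Because each level is stored once and linked by keys, household attributes such as the country do not have to be repeated on every trip record.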
The second special feature of the survey is a two-phase data collection process. As mentioned above, journeys, trips and excursions are collected. Information on journeys is collected during the first phase. Later, journeys are divided into trips and excursions; this information was collected during the second phase. However, not all journeys have information about trips and excursions: in the main survey this information was only collected for journeys of particular interest, according to a defined selection rule. The database also contains journeys outside the agreed reporting period in cases where no journeys within the reporting period were discovered. Along with this information, the database contains information from exploration and non-response surveys.

The database can be accessed and downloaded from the ELMIS web site at http://cgi.fg.uni-mb.si/elmis. It can also be obtained on CD-ROM free of charge from:

University of Maribor
Faculty of Civil Engineering
Construction IT Centre
Smetanova 7
SI 2000 Maribor
Email: elmis@uni-mb.si

The database includes data from the 15 European Union Member States and Switzerland. It is possible to acquire either subsets of the database by country or the complete database. The database is available as a set of comma-separated ASCII files. These files are exports from the relational tables of the MS SQL database system, with each relational table extracted as a separate file. All files are put together, compressed and made available as a self-extracting executable file.

2.2 DATABASE BUILDING PROCESS

The database is the result of work carried out in the project by many partners from various European countries. The process of database building started concurrently with the data collection process, which began with the coding of the survey questionnaires. Survey organisations were responsible for this task. The coding process required the digitising of information provided by respondents and also the geocoding of places used in the questionnaires.
On a regular basis and according to a predefined schedule, site administrators sent their coded data to the data centre at the University of Maribor. At the data centre, all partial databases were integrated into one database, which was checked for consistency and errors. Error reports were generated for each survey organisation and sent back for correction.

When all the errors had been corrected, several updates were performed on the database. Calculated values such as journey distance, number of journeys per household and person, NUTS codes, etc. were derived from existing information. The complete database was then sent for data analysis, in which two main tasks were carried out: first, the implementation of weighting and projection; and second, the derivation of the main long-distance mobility indicators.

2.3 CODING

Survey organisations were free to use any software tool for the coding. Nevertheless, the project provided all survey organisations, free of charge, with software tools to support coding, geocoding, error checking and the transfer of information to a central location. Where survey organisations preferred to use their own coding software, the project defined a procedure for data preparation: these organisations prepared comma-delimited ASCII files for sending to the data centre. The European Coding Book defined the file structure, contents, code lists and other required characteristics of the resulting data files.

A particular problem for organisations that had not used the project software was the geocoding. For this purpose the database of place names was extracted to a text file in which all places were listed. If mapping support was needed, a survey organisation had to provide its own. Some survey organisations that used their own software for coding took advantage of the Geocode It application to support the geocoding process.

Data integration started quite early and in parallel with the data collection and coding procedures.
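The consistency checking of comma-delimited data packets described above might look roughly like the following sketch. The field names (`hh_id`, `person_id`, `age`) and the two checks are assumptions made for illustration; the actual file structure and rules were defined by the European Coding Book and are not reproduced here.

```python
import csv
import io


def read_table(text):
    """Parse one comma-delimited ASCII export that starts with a header row."""
    return list(csv.DictReader(io.StringIO(text)))


def check_persons(households, persons):
    """Return (record key, message) pairs describing problems in one packet.

    Two hypothetical checks are applied: every person must refer to a known
    household, and the age field must not be empty.
    """
    known_households = {row["hh_id"] for row in households}
    errors = []
    for row in persons:
        key = row["person_id"]
        if row["hh_id"] not in known_households:
            errors.append((key, "person refers to unknown household"))
        if not row["age"].strip():
            errors.append((key, "missing age"))
    return errors


# Made-up example packet: person 11 points to a household that does not
# exist, and person 12 has no age.
households = read_table("hh_id,country\n1,SI\n2,AT\n")
persons = read_table("person_id,hh_id,age\n10,1,34\n11,3,51\n12,2,\n")
error_report = check_persons(households, persons)
```

A list of this kind, grouped by survey organisation, corresponds to the error reports that were sent back for correction.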
Survey organisations received a schedule for sending the data, which contained four sending intervals. Each interval consisted of two stages. In the first stage, survey organisations sent the data coded so far. At the data centre the error-checking procedure was performed, and the results, in the form of error lists and suggestions about possible causes, were sent back to the survey organisation. In the second stage, the survey organisation tried to correct all the errors that had been discovered and, after three weeks, sent a new, corrected dataset back to the data centre. According to the schedule, all questionnaires were to be coded after four such iteration cycles. The data sending schedule was tailored to the needs of each country and survey organisation, because survey work did not start at the same time in all countries.

2.4 DATA INTEGRATION AND PROCESSING

After the data reached the data centre, they were processed and checked for plausibility, and feedback was provided to the survey organisations. Figure 2 in the Annex presents an overview of data processing.

The process of coding and error correction was followed by data preparation. The aim of this task was to prepare the database for weighting and projection, data analysis, finalisation of the database and the creation of exports. Data preparation included the calculation of derived values: in addition to the raw data, new variables were defined and calculated to support the processes that follow data collection and coding, such as weighting and statistical analyses.

Data preparation also included data imputation. Trip information was generated for each one-day journey. For such journeys trip information was not coded; instead, variables from the journey record were copied to trip records. For each one-day journey two trip records were created: the outward and the return trip.

2.5 GEO-CODING

The project has implemented a mobility study; on the one hand, therefore, the results give answers to questions such as why, when and how someone travels. On the other hand, the information collected reveals another important dimension of mobility: the spatial distribution of passenger activities. For every geocoding activity applied after the actual journey, a reference list of geocoded places was needed.
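The trip imputation for one-day journeys described in section 2.4 can be sketched as follows. The record keys (`nights`, `origin`, `destination` and so on) are hypothetical, as is the rule that a one-day journey is one with zero nights away. Swapping origin and destination for the return trip is an assumption too; the text only states that journey variables were copied to the two trip records.

```python
def impute_one_day_trips(journey):
    """Build an outward and a return trip record from one journey record."""
    shared = {
        "journey_id": journey["journey_id"],
        "date": journey["date"],
        "mode": journey["mode"],
        "imputed": True,  # marks the record as generated, not coded
    }
    outward = dict(
        shared,
        direction="outward",
        origin=journey["origin"],
        destination=journey["destination"],
    )
    # Assumption: the return trip mirrors the outward trip.
    back = dict(
        shared,
        direction="return",
        origin=journey["destination"],
        destination=journey["origin"],
    )
    return [outward, back]


def impute_trips(journeys):
    """Generate trip records for every one-day journey in the list."""
    trips = []
    for journey in journeys:
        if journey["nights"] == 0:  # hypothetical one-day criterion
            trips.extend(impute_one_day_trips(journey))
    return trips


# Made-up example: only the first journey is a one-day journey.
example_journeys = [
    {"journey_id": 1, "date": "2001-10-05", "mode": "car",
     "origin": "Maribor", "destination": "Graz", "nights": 0},
    {"journey_id": 2, "date": "2001-11-12", "mode": "plane",
     "origin": "Maribor", "destination": "London", "nights": 4},
]
imputed = impute_trips(example_journeys)
```

Flagging generated records, as the `imputed` key does here, lets an analyst separate coded trips from imputed ones later.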
Since the objective is to present the data at a regional level, using the NUTS administrative classification, it was decided to use the city (town, village) as the smallest unit for collecting geographical information. This place name geocoding was also recommended by other projects, e.g. MEST and TEST, for several reasons: (a) people are not keen on giving out exact address information, (b) they easily forget more detailed information, or (c) they do not know the exact address.

Two databases were selected for compiling the database of places. The main source for place data was the GEOnet Names Server (GNS). The GNS is a US system that provides access to the National Imagery and Mapping Agency's (NIMA) database of foreign geographic feature names. This database is the official repository for foreign place name decisions approved by the US Board on Geographic Names (US BGN). Since the GNS only includes places outside the US, the Geographic Names Information System (GNIS), developed by the USGS in co-operation with the US BGN, was used to cover places inside the US.

Survey organisations were responsible for coding and also for geocoding places. This proved to be a good solution, since the majority of places to be geocoded were in areas familiar to the coders. In addition, the coders speak the same language as the respondents, which is important with respect to different spellings of places: most survey respondents use the spelling the coder understands.

To support geocoding, the project provided software tools to the survey organisations. This support was integrated into the Collect It coding application, so that geocoding was an integral part of the whole coding process. The application also supported mapping functionality. Survey organisations that decided to use their own software for coding could use a standalone geocoding tool provided by the project or rely on their own implementation of geocoding. For this purpose, the project provided a list of all place names.

One important result of geocoding was distance calculation.
During error checking, journeys were tested to see whether the calculated distance from the journey origin to the journey destination was greater than 100 km. In the end, all short journeys were excluded from the final database.

Note: MEST and TEST are projects funded by the EC in the 4th Framework Programme dealing with methods and technologies for long-distance mobility studies.
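The distance test applied during error checking can be illustrated with a short sketch. The text does not say how the distance between two geocoded places was computed; the great-circle (haversine) formula used here is an assumption, as are the function names and the 100 km default threshold.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_long_distance(origin, destination, threshold_km=100.0):
    """True if origin and destination (lat, lon) are further apart than the
    threshold. The default threshold is an assumption of this sketch."""
    return great_circle_km(*origin, *destination) > threshold_km


# Approximate coordinates, for illustration only.
maribor = (46.5547, 15.6459)
vienna = (48.2082, 16.3738)
keep_journey = is_long_distance(maribor, vienna)
```

Journeys for which the check returns False would be the "short journeys" that were excluded from the final database.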
In addition to distance calculation, the second purpose of collecting geographic information was the preparation of origin-destination matrices (O-D matrices). The project delivers O-D matrices at the regional level, consistent with the NUTS classification. For matrix building, it was necessary to create aggregations of geocoded places. Therefore, besides the existing information, each place received a code for the region in which it is located. The project has the objective of analysing the data at the NUTS level. Despite this, more detailed NUTS 3 codes have been assigned to the places, since the maps obtained allow for that level of detail. For matrix building, the NUTS classification was used; however, users can utilise the more detailed codes for their own purposes.

3 EUROPEAN LONG-DISTANCE MOBILITY INFORMATION SYSTEM

3.1 OVERVIEW

The European Long-distance Mobility Information System (ELMIS) is a result of the project and is generally known as Deliverable 9 (D9). The aim of ELMIS is to make the data collected in the survey available to the general public. Since the project intends to deliver the results to a broad audience using contemporary media, the decision was made to develop an Internet-based data retrieval and analysis system. The system is available at http://cgi.fg.uni-mb.si/elmis.

ELMIS combines all efforts of the project and allows the outside community to explore the outcomes. In this sense, ELMIS should be considered a medium that gives access to expertise on long-distance mobility and surveys, and to a source of information and experience that results from the great amount of work carried out in data collection and processing.

ELMIS is a web site supported by applications that deliver the survey results in a highly interactive way.
To allow ELMIS users to browse through the database and perform statistical analyses, the system integrates the NESSTAR server application, which contains a statistical engine. When using ELMIS, the user interacts with the NESSTAR statistical engine through the NESSTAR Light Client (NLC). For the purposes of ELMIS, the client has been adapted slightly from the standard client. More information about NESSTAR can be found at www.nesstar.com.
The second application behind the system is the O-D matrices browser. This application is based on an O-D matrix viewer developed by the Peter Davidson Consultancy in England. The original matrix viewer is a Windows application and has been upgraded with a web interface to meet the needs of ELMIS.

To give the user a quick overview of basic results, a number of main indicators constructed from the survey data have been prepared and can be easily accessed online. For an extended analysis of the results, ELMIS provides the original database with all necessary descriptions. This database can be downloaded from the ELMIS site (web address will be announced).

As a supplement to the results, further information about the survey is given via Internet links relevant to the project (e.g. location of deliverables) and through additional blocks of background information. ELMIS also includes fundamental project information about dissemination events, involved partners and other related activities.

3.2 TABULATIONS AND CHARTS

The core functionality of ELMIS is found in the NESSTAR Light Client. This is a set of web pages that form a user interface for statistical analysis and the construction of charts. The client allows users to search for, locate, browse, analyse and download a wide variety of statistical and background information from the NESSTAR server within a web browser. No specialised software is needed to view the data stored on the NESSTAR server. The server also gives access to metadata (descriptions of data tables and variables) and contains basic information about the survey itself with links to other parts of ELMIS.

The metadata is structured according to the Data Documentation Initiative (DDI). The DDI is an internationally recognised standard for the creation, presentation and preservation of metadata. More information about the DDI can be found at http://www.icpsr.umich.edu/ddi/index.html
During data preparation, weighting variables were appropriately marked, allowing the NESSTAR system to recognise them. Users can select the weight variables they want to use when preparing a desired set of statistics.

The data can be exported in several formats compatible with common statistical analysis tools. Through the NESSTAR Light Client, ELMIS offers the following export formats:

- SPSS system file
- SPSS portable file
- NSDstat
- Statistica
- Stata
- Data Interchange Format (DIF) (suitable for use in Excel)
- Dbase 3
- SAS

3.3 O-D MATRICES

Since the project is concerned with long-distance mobility, one of the key project reports is about O-D matrices. For the purpose of the project, it was decided that for all 15 EU countries the O-D matrices should be constructed following the regional differentiation of NUTS. If a destination lies outside the EU, a defined system of superimposed zones applies. Most, but not all, of these zones correspond to a country. During the development of the matrices, the modal split was taken into account as a third feature: the user is able to explore journeys undertaken by plane, train, car or some other mode of transport. Naturally, matrices showing all modes of transport are also available.

Through ELMIS, the user can view the O-D matrices in tabular or graphical form. In both cases, origins by regions and destinations can be selected from respective lists offered through the matrix page. In addition to places, modes of transport can also be chosen.
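As a rough illustration of how an O-D matrix with a modal split can be built from geocoded journeys, the sketch below sums journey weights per pair of origin and destination regions. The record keys, the region and zone codes and the lookup table are all hypothetical; the project's own aggregation rules and weighting procedure are not reproduced here.

```python
from collections import defaultdict


def build_od_matrix(journeys, place_to_region, mode=None):
    """Sum journey weights per (origin region, destination region) cell.

    If `mode` is given, only journeys made by that mode are counted;
    otherwise all modes are included.
    """
    matrix = defaultdict(float)
    for journey in journeys:
        if mode is not None and journey["mode"] != mode:
            continue
        cell = (
            place_to_region[journey["origin"]],
            place_to_region[journey["destination"]],
        )
        matrix[cell] += journey["weight"]
    return dict(matrix)


# Hypothetical lookup from place to region or zone code.
place_to_region = {"A-town": "R1", "B-city": "R2", "C-ville": "R2", "Z-burg": "Z1"}

# Made-up weighted journey records.
journeys = [
    {"origin": "A-town", "destination": "B-city", "mode": "car", "weight": 1.5},
    {"origin": "A-town", "destination": "C-ville", "mode": "train", "weight": 2.0},
    {"origin": "A-town", "destination": "Z-burg", "mode": "plane", "weight": 0.5},
]

all_modes = build_od_matrix(journeys, place_to_region)
car_only = build_od_matrix(journeys, place_to_region, mode="car")
```

Keeping the mode as a filter rather than a separate matrix per mode means the all-modes matrix and the per-mode matrices come from the same records and stay consistent with each other.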
ANNEX

Figure 1: Entity-relationship diagram of the database structure
[Diagram not reproduced. Entities shown: HH_State, Households, Persons, Commuting, Journeys, Participants, Trips, Excursions.]
Figure 2: Data integration flowchart
[Flowchart not reproduced; the order of elements could not be recovered. Elements shown: take one country's data from the FTP server; check whether coded files are in ASCII format; convert ASCII to Paradox and check whether the conversion is successful; check for errors, warnings, geo-errors and item non-response (error reports by received data packet); comment on errors and advise on future coding (comments, customised error reports); correct or exclude data packets with errors; check whether all data packets are prepared; join partial databases and check whether the join is successful; import data to the MS SQL database; assign geographic coordinates to geocoded places (list of places with coordinates); import or generate the "interviewed person" flag (lists of interviewed persons); import or calculate the number of household participants for journeys (household group size files); calculate derived variables; prepare exports for weighting (data exports by country).]