Mapping and Terminology
English Speaking CDISC User Group Meeting on 13-Mar-08

Statement of the Problem
- GSK has a large drug portfolio, and therefore many drug project teams
- GSK has standards:
  - 8,200 variables for core data and validated rating scales, with 860 codelists
  - 12,000 other variables (therapeutic standards and study-specific variables), with another 1,500 codelists
- For SDTM submissions in the near term, all data will need to be mapped
- Later submissions will be a combination of mapped studies and SDTM-designed studies

What do we mean by Standards
- A combination of definition and implementation
- Definition:
  - The data items and associated codelists (the variable catalogue)
  - The associations between data items, i.e. which set of items constitutes a single piece of information, e.g. severe headache for subject 1 on 13-Mar-08 with action = drug stopped (the data records)
  - The data checks we want
  - The summaries/displays we want to produce
- Implementation:
  - The paper CRFs/eCRFs
  - The dataset specifications
  - The programming to create checks, datasets and displays

Impact of CDISC Implementation
- Let's pick a point in time
- Implementation doesn't mean changing what we want to collect or what we want to report
- But it does mean changing our datasets, our software, our processes and possibly our paper CRFs/eCRFs
- And we may need to change some of our codelists (more on these later)

Our mapping
- Our expectation is that GSK will always have to do some mapping
- We expect the experience gained from mapping our standards to help us design studies with SDTM in mind
- This will be true for most companies with in-house standards

One Hand Grenade
- We have 2,300 codelists
- CDISC has developed around 50 in two years
- Even if this rate is accelerated by orders of magnitude, all companies will have to determine how to handle this new terminology as it comes on-stream over the years (particularly mandatory terminology)

General Rules for Starting Mapping
- Find a place for all our variables
- If we don't capture data, don't generate it
- Map from GSK standard to SDTM, not the other direction
- Don't map anything purely derived for analysis purposes
- If there is no logical place for a variable, assume SUPPQUAL

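The rules above can be read as a small decision procedure. The following is a minimal sketch only, not the actual GSK tooling: the `SourceVariable` class, its fields and the `triage` function are hypothetical names introduced for illustration, and the reading of "if we don't capture data, don't generate it" as "skip variables that were never collected" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceVariable:
    """Hypothetical description of one in-house variable awaiting mapping."""
    name: str
    collected: bool                    # was it captured on the CRF/eCRF?
    analysis_derived: bool             # derived purely for analysis purposes?
    sdtm_target: Optional[str] = None  # e.g. "VS.VSORRES"; None if no logical home


def triage(var: SourceVariable) -> str:
    """Return where a variable should go, following the slide's rules."""
    if var.analysis_derived:
        return "NOT MAPPED"    # don't map anything purely derived for analysis
    if not var.collected:
        return "NOT MAPPED"    # if we don't capture it, don't generate it
    if var.sdtm_target is not None:
        return var.sdtm_target  # a logical SDTM home exists
    return "SUPPQUAL"          # no logical place: assume SUPPQUAL


# Illustrative inputs (variable names other than SYSBP are made up).
examples = [
    SourceVariable("SYSBP", collected=True, analysis_derived=False,
                   sdtm_target="VS.VSORRES"),
    SourceVariable("CHGBASE", collected=False, analysis_derived=True),
    SourceVariable("READQUAL", collected=True, analysis_derived=False),
]
results = {v.name: triage(v) for v in examples}
```

Mapping always runs from the in-house variable to SDTM, which is why the function takes the source variable as input rather than starting from an SDTM variable and searching for a source.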
Our Mapping Experiences
- When our variables have a simple match to SDTM variables (which might include a transposition):
  - Easy, just apply algorithms
  - All about how to do these mappings efficiently and with no errors
  - Can be done by anyone
- When our variables don't have a simple match, particularly when a choice is involved:
  - Hard, with a risk of multiple and inconsistent approaches
  - Dependent on guidelines to eliminate errors and inconsistencies
  - Need the right people to interrogate the data

Our Approach
- Excel based (everyone has access, familiar software)
- Spreadsheet automatically pulls in our standards metadata, which is used in dropdowns to avoid typos
- Minimise the amount of manual effort
- Make it as easy as possible
- Staged approach: a structured template guiding the mapper through the process
- Don't over-complicate the process to fit every eventuality
- Team effort rather than purely individual
- Committee to address issues that crop up

Tabs in the Spreadsheet
- GSK variable catalogue
- The SDTM Domain
- Our mapping sheet
- SUPPQUAL variables
- An easy way of looking at our GSK codelists
- Value level metadata
Tabs have been placed in the order in which they are to be completed

Key Feature of the Mapping Sheet

Column headings: SDTM Variable Name; SDTM Variable; SDTM+ Var Desc; Data Type; Include in SDTM+ dataset; Source Type; Source Variable Name; Source Variable Description; Default Value; Req'd

Rows (empty cells omitted):
STUDYID | Y | Unique identifier for the study | Char | Y | Req
STUDYID | N | Unique identifier for the study | Text | Y | SourceVariable | STUDYID | Unique identifier for the study | Y
DOMAIN | Y | Domain abbreviation | Char | Y | Req
DOMAIN | N | Domain abbreviation | Text | Y | DefaultValue | DS | Y
USUBJID | Y | Unique subject identifier | Char | Y | Req
DOMAIN | N | Unique subject identifier | Text | Y | SourceVariable | USUBJID | Unique subject ID | N
USUBJID | N | Unique identifier for the study | Text | Y | SourceVariable | STUDYID | Unique identifier for the study | N
USUBJID | N | Subject identifier | Fixed | Y | SourceVariable | SUBJID | Subject ID | N
DSSEQ | Y | Sequence number | Num | Req
DSSEQ | N | Start value for generated sequence number | Fixed | SourceVariable | Y
DSDTC | Y | Date/time of assessment | Char | Exp
DSDTC | N | Datetime of assessment | Datetime | SourceVariable | N
DSDTC | N | Actual date of assessment | Date | SourceVariable | N
DSDTC | N | Actual time of assessment | Time | SourceVariable | N
DSDTC | N | Actual date of assessment (char dup) | Date | SourceVariable | N
DSDUR | Y | Duration | Char
DSDUR | N | Duration in years | Float | SourceVariable | N
DSDUR | N | Duration in months | Fixed | SourceVariable | N
DSDUR | N | Duration in weeks | Fixed | SourceVariable | N
DSDUR | N | Duration in days | Fixed | SourceVariable | N
DSDUR | N | Duration in hours | Fixed | SourceVariable | N
DSDUR | N | Duration in minutes | Fixed | SourceVariable | N
DSDUR | N | Duration in seconds | Fixed | SourceVariable | N
DSDUR | N | Duration | Fixed | SourceVariable | N
DSDUR | N | Duration units | Text | SourceVariable | N
DSDUR | N | Duration code | Text | SourceVariable | N
DSDUR | N | Duration decode | Text | SourceVariable | N

A home for all the variables that map easily, even if there are multiple variables that go to make just one SDTM variable: no need for mappers to add rows!

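As an illustration of several source variables feeding one SDTM variable, here is a minimal sketch of how a separate date and time could be combined into DSDTC, and a duration value plus units into DSDUR, both in ISO 8601 form. The function names `to_dtc` and `to_dur` and the unit spellings are assumptions made for this example; the slides do not show the actual algorithms.

```python
from datetime import date, time
from typing import Optional, Union

# ISO 8601 duration designators; the unit spellings are assumed for the example.
_DATE_UNITS = {"years": "Y", "months": "M", "weeks": "W", "days": "D"}
_TIME_UNITS = {"hours": "H", "minutes": "M", "seconds": "S"}


def to_dtc(d: Optional[date], t: Optional[time] = None) -> str:
    """Combine a date and an optional time into an ISO 8601 string."""
    if d is None:
        return ""                      # nothing collected, nothing generated
    if t is None:
        return d.isoformat()           # date only, e.g. 2008-03-13
    return f"{d.isoformat()}T{t.isoformat(timespec='minutes')}"


def to_dur(value: Union[int, float], unit: str) -> str:
    """Turn a numeric duration and its unit into an ISO 8601 duration."""
    number = str(int(value)) if float(value).is_integer() else str(value)
    if unit in _DATE_UNITS:
        return f"P{number}{_DATE_UNITS[unit]}"
    if unit in _TIME_UNITS:
        return f"PT{number}{_TIME_UNITS[unit]}"
    raise ValueError(f"unknown duration unit: {unit!r}")


dsdtc = to_dtc(date(2008, 3, 13), time(9, 30))  # "2008-03-13T09:30"
dsdur = to_dur(3, "days")                       # "P3D"
```

Because every candidate source variable already has its own row in the sheet, the mapper only has to say which ones a study uses; a routine along these lines can then apply the combination.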
Key Feature of the Value Level Metadata Sheet

Column headings: SDTM Value; TESTCD; SDTM+ Description; Data Type; Include in SDTM+ dataset; Source Type; Source Variable Name; Variable Description; Default Value

Rows (empty cells omitted):
Y | SYSBP | Systolic blood pressure | Y
N | SYSBP | Original numeric result | Float | Y | SourceVariable | SYSBP | Systolic blood pressure (mmhg)
N | SYSBP | Original units decode | Text
N | SYSBP | Original units code | Text
N | SYSBP | Original units (intuitive code or free text) | Text | Y | DefaultValue | MMHG
N | SYSBP | Original result score code | Fixed
N | SYSBP | Original result code | Text
N | SYSBP | Original result (decode) | Text
N | SYSBP | Original result specify | Text
N | SYSBP | Original result text | Text
N | SYSBP | Original result date | Date
N | SYSBP | Original result date (char dup) | Text
N | SYSBP | Original result time | Time
N | SYSBP | Original result datetime | Datetime
N | SYSBP | Assessor code | Text
N | SYSBP | Assessor decode | Text
N | SYSBP | Location used for the measurement decode | Text
N | SYSBP | Location used for the measurement code | Text
N | SYSBP | Method of test or examination code | Text
N | SYSBP | Method of test or examination decode | Text
N | SYSBP | Severity decode | Text
N | SYSBP | Toxicity grade decode | Text
N | SYSBP | Subject position code | Text | Y | SourceVariable | VSPOSCD | Subject position code
N | SYSBP | Subject position decode | Text | Y | SourceVariable | VSPOS | Subject position
N | SYSBP | Baseline flag | Text

- This sheet handles the mapping of topic variables (parameters, questions etc.) and results when the source dataset is horizontal/non-normalised
- Also need to handle provision of value level metadata when the source dataset is normalised

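To make the horizontal-to-vertical step concrete, here is a minimal sketch of a transposition driven by value level metadata like the SYSBP rows above. The dictionary layout, the `transpose` function and the DIABP entry are hypothetical and added only for illustration; only the SYSBP, VSPOS and MMHG values come from the sheet.

```python
from typing import Any, Dict, List

# Value level metadata keyed by TESTCD. SYSBP mirrors the sheet;
# DIABP is a made-up second test to show more than one output record.
VALUE_LEVEL: Dict[str, Dict[str, str]] = {
    "SYSBP": {"test": "Systolic blood pressure", "result_var": "SYSBP",
              "unit_default": "MMHG", "position_var": "VSPOS"},
    "DIABP": {"test": "Diastolic blood pressure", "result_var": "DIABP",
              "unit_default": "MMHG", "position_var": "VSPOS"},
}


def transpose(source_row: Dict[str, Any],
              metadata: Dict[str, Dict[str, str]]) -> List[Dict[str, Any]]:
    """Turn one horizontal source record into one vertical record per test."""
    records: List[Dict[str, Any]] = []
    for testcd, meta in metadata.items():
        value = source_row.get(meta["result_var"])
        if value is None:
            continue  # nothing collected for this test, so no record
        records.append({
            "USUBJID": source_row["USUBJID"],
            "VSSEQ": len(records) + 1,  # numbered within this source row only
            "VSTESTCD": testcd,
            "VSTEST": meta["test"],
            "VSORRES": str(value),
            "VSORRESU": meta["unit_default"],
            "VSPOS": source_row.get(meta["position_var"], ""),
        })
    return records


horizontal = {"USUBJID": "STUDY1-001", "SYSBP": 120, "DIABP": 80,
              "VSPOS": "SITTING"}
vertical = transpose(horizontal, VALUE_LEVEL)
```

In a real dataset the sequence number would run across all of a subject's records rather than restarting for each source row; the sketch keeps it local for brevity.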
Key Feature of the SUPPQUAL Sheet

Column headings: SDTM Variable Name; SDTM Variable; SDTM+ Var Desc; Data Type; Include in SDTM+ dataset; Source Type; Source Variable Name; Source Variable Description; Default Value; Req'd

Rows (empty cells omitted):
STUDYID | Y | Unique identifier for the study | Char | Y | Req
STUDYID | N | Unique identifier for the study | Text | Y | SourceVariable | STUDYID | Unique identifier for the study | Y
RDOMAIN | Y | Related domain abbreviation | Char | Y | Exp
RDOMAIN | N | Related domain abbreviation | Text | Y | DefaultValue | VS | Y
USUBJID | Y | Unique subject identifier | Char | Y | Req
RDOMAIN | N | Unique subject identifier | Text | Y | SourceVariable | USUBJID | Unique subject ID | N
USUBJID | N | Unique identifier for the study | Text | Y | SourceVariable | STUDYID | Unique identifier for the study | N
USUBJID | N | Subject identifier | Fixed | Y | SourceVariable | SUBJID | Subject ID | N
IDVAR | Y | Identifying variable | Char | Y | Exp
IDVAR | N | Identifying variable | Text | Y | ValueLevel | Y
IDVARVAL | Y | Identifying variable value | Char | Y | Exp
IDVARVAL | N | Identifying variable value | Text | Y | ValueLevel | Y
QNAM | Y | Qualifier variable name | Char | Y | Req
QNAM | N | Qualifier variable name | Text | Y | ValueLevel | Y
QLABEL | Y | Qualifier variable label | Char | Y | Req
QLABEL | N | Qualifier variable label | Text | Y | ValueLevel | Y
QVAL | Y | Data value | Char | Y | Req
QVAL | N | Numeric data value | Float | ValueLevel | N
QVAL | N | Text data value | Text | ValueLevel | N
QVAL | N | Date data value | Date | ValueLevel | N
QVAL | N | Date data value (char dup) | Text | ValueLevel | N
QVAL | N | Time data value | Time | ValueLevel | N
QVAL | N | Datetime data value | Datetime | ValueLevel | N
QVAL | N | Data value code | Text | ValueLevel | N
QVAL | N | Data value score code | Fixed | ValueLevel | N
QVAL | N | Data value decode | Text | ValueLevel | N
QVAL | N | Data value specify | Text | ValueLevel | N
QORIG | Y | Origin | Char | Req
QORIG | N | Origin | Text | ValueLevel | Y
QEVAL | Y | Evaluator | Char | Exp
QEVAL | N | Evaluator text | Text | ValueLevel | N
QEVAL | N | Evaluator code | Text | ValueLevel | N
QEVAL | N | Evaluator decode | Text | ValueLevel | N
QEVAL | N | Evaluator specify | Text | ValueLevel | N

This sheet contains all the SUPPQUAL variables and is populated according to the information entered on the SUPPQUAL Metadata sheet

The easy interface to enter SUPPQUAL information

Column headings: SDTM Qualifier; QNAM; SDTM+ Description; Data Type; Include in SDTM+; Source Type; Source Variable Name; Variable Description; Default Value

Rows (empty cells omitted):
Y | VSQUAL | Reading qualifier | Y
N | VSQUAL | Identifying variable | Text | Y | DefaultValue | VSSEQ
N | VSQUAL | Identifying variable value | Text | Y | SDTMVariable | VSSEQ | Sequence number
N | VSQUAL | Numeric data value | Float
N | VSQUAL | Text data value | Text
N | VSQUAL | Date data value | Date
N | VSQUAL | Date data value (char dup) | Text
N | VSQUAL | Time data value | Time
N | VSQUAL | Datetime data value | Datetime
N | VSQUAL | Data value code | Text | Y | SourceVariable | VSQUALCD | Reading qualifier code
N | VSQUAL | Data value score code | Fixed
N | VSQUAL | Data value decode | Text | Y | SourceVariable | VSQUAL | Reading qualifier
N | VSQUAL | Data value specify | Text
N | VSQUAL | Evaluator text | Text
N | VSQUAL | Evaluator code | Text
N | VSQUAL | Evaluator decode | Text
N | VSQUAL | Evaluator specify | Text
N | VSQUAL | Origin | Text | Y | DefaultValue | CRF

All the information needed for each SUPPQUAL variable is entered through this sheet (no limitation on the number of SUPPQUAL records)

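The VSQUAL entries above carry enough information to build a supplemental qualifier record. The following is a minimal sketch under stated assumptions: the `VSQUAL_META` layout and the `make_suppqual` function are hypothetical, the sample qualifier value is invented, and only the decode (source variable VSQUAL) is written to QVAL.

```python
from typing import Any, Dict, Optional

# Metadata for one qualifier, taken from the VSQUAL rows of the sheet;
# RDOMAIN = VS is the default shown on the SUPPQUAL sheet.
VSQUAL_META = {
    "rdomain": "VS",
    "qnam": "VSQUAL",
    "qlabel": "Reading qualifier",
    "idvar": "VSSEQ",        # default value for IDVAR
    "qval_source": "VSQUAL", # source variable holding the decode
    "qorig": "CRF",          # default value for QORIG
}


def make_suppqual(parent: Dict[str, Any], source_row: Dict[str, Any],
                  meta: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Build one SUPPQUAL record linked to a parent domain record."""
    value = source_row.get(meta["qval_source"])
    if value in (None, ""):
        return None  # no qualifier collected, so no SUPPQUAL record
    return {
        "STUDYID": parent["STUDYID"],
        "RDOMAIN": meta["rdomain"],
        "USUBJID": parent["USUBJID"],
        "IDVAR": meta["idvar"],
        "IDVARVAL": str(parent[meta["idvar"]]),  # links back to the parent row
        "QNAM": meta["qnam"],
        "QLABEL": meta["qlabel"],
        "QVAL": str(value),
        "QORIG": meta["qorig"],
    }


parent_vs = {"STUDYID": "STUDY1", "USUBJID": "STUDY1-001", "VSSEQ": 1}
source = {"VSQUAL": "Estimated"}  # invented sample value
supp = make_suppqual(parent_vs, source, VSQUAL_META)
```

Since each qualifier is described by its own block of metadata rows, adding another SUPPQUAL variable means adding another metadata entry rather than changing the routine.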
Learnings
- Knowledge and understanding of the context of the data is critical for successful mapping:
  - at the dataset level, e.g. vitals
  - at the study level (it is not always obvious what is in a variable)
- It is possible to define algorithms to automate the physical mapping
- There are multiple options for mapping a variable (wiggle-room)
- Not everything goes into SUPPQUAL (SUPPQUAL should be a last resort!)
- Pre-processing is sometimes needed to make things mappable
- Some of our codelists contain multiple sets of information; in other cases we have multiple codelists covering a single set of information. These take extra effort.
- We've created principles and adapted the template to reflect our learnings from mapping the 50 or so core standards
- On rare occasions, it isn't possible to use the template, e.g. our genetics sample information (all the data needs to be pre-processed)
- It takes time to agree on a home for tricky variables