XML in the biopharmaceutical sector

XML holds out the opportunity to integrate data across both the enterprise and the network of biopharmaceutical alliances - with little technological dislocation and at a fraction of the cost of other integration solutions.

Dr Lara Marks and Emmett Power, Silico Research Limited

Make no mistake, XML is coming, and it will change the way that biopharmaceutical companies and their partners and suppliers integrate their information systems. At Silico Research, we recently conducted an analysis of the deployment of XML technology in the biopharmaceutical sector. From that research, we concluded that by 2004 XML will be one of the most important technologies used to achieve data integration across pharmaceutical, biotechnology and genomic research. Biopharmaceutical, genomic and technology companies are advised to address the issue of XML and formulate an organisation-wide XML strategy without delay.

XML technologies go to the heart of data integration. Over the next five years the pharmaceutical industry will shift from a wet, bench-driven research model to one driven by information and computers. As that shift takes place, data integration will become a major source of competitive advantage or disadvantage.

The Human Genome Project and newly emerging technologies - such as bioinformatics, cheminformatics, pharmacogenomics, simulation and modelling - have created a massive growth in the amount of data that pharmaceutical companies have to confront in order to bring drugs to market. Data integration and analysis promise deeper insights into the biology and chemistry underlying drug actions and target diseases and, at the same time, radically shorter development times. This has led biopharmaceutical companies to invest heavily in emerging big-ticket technologies designed to integrate data across the enterprise; these include bioinformatic platform technologies, data warehouses and enterprise information portals.
Such efforts have met with mixed results. At times, efforts to integrate have been derailed by overoptimistic expectations of the technology and poor implementation, combined with dramatically shifting patterns of information creation and usage. There are reasons to be optimistic that XML, the latest integration technology, could avoid many of the problems that have beset earlier integration technologies.

What is XML?

The Extensible Markup Language, XML for short, is a mark-up language - much like HTML, which is widely used to present data on the Internet and other networks. Unlike HTML, however, which was designed to describe the presentation of data, XML has been designed to describe the structure and content of data and to facilitate the transfer of that data between applications and over networks, including the Internet. XML achieves this by allowing data to be stored within an XML document in a structured, tagged format; this enables the data to be interpreted and modified by any XML-compatible application.

XML is a derivative of the Standard Generalised Mark-up Language (SGML), which is the international standard for defining the structure and content of electronic documents. HTML is an application of SGML, but with a very limited scope.

XML is extensible. This represents one of its great strengths: in HTML the tags are fixed and invariable, whereas in XML the tags used to structure the data can be extended to suit the needs of the user or the organisation using the format. This is achieved in XML by writing a specific Document Type Definition (DTD) that maps the structure of the data. The DTD is shared with other users of the document, enabling them to read, map and process the data contained in it. Using DTDs, developers can create self-defining text tags to identify a piece of data for use in other applications. So if, for example, the deploying company decides that it needs a tag in the document representing Sequence, that tag can be added to the DTD and shared with all applications using the DTD.

Taking this principle to its logical conclusion, developers and users are writing their own customised DTDs to suit the needs of the specific data they are working with. Several specialised DTDs have been written for the pharmaceutical, biotechnology and genomic sectors. One example of a specialist DTD is the widely used Bioinformatic Sequence Markup Language (BSML), developed by Visual Genomics (now part of LabBooks) for comparing genetic data from multiple sources and platforms. Using BSML, scientists can compare genetic data, run applications across that data and display it in a special BSML browser written for the purpose.

Table 1. Key biochemical and pharmaceutical DTDs.

Another strength of XML is that it can be used to model data, which opens the door to richer data representation. This is possible because XML can define the structure of elements and data, and the inter-relationship of data and data elements. Developers can therefore create 2D representations of chemical and biological structures using data from a number of sources, or represent genomic data in a graphical format. This creates powerful opportunities for developers in the pharmaceutical, biotechnology and genomic sectors, where complex data is increasingly interpreted visually. The data representation function of XML is fully captured in specialist browsers like the BIOML (BIOpolymer Markup Language) browser.

XML also facilitates data integration. Most database APIs (application programming interfaces) are defined in a particular programming language, so that results are returned in that language's native data types. While Structured Query Language (SQL) is a widely accepted method for specifying the database question, there is not yet a language-independent and database-neutral specification for the response. XML affords an opportunity to provide one. XML offers an open, vendor-neutral format for moving data between programming languages. This has proved highly attractive in a sector like pharmaceuticals, characterised by a large number of heterogeneous applications using a number of programming languages to derive complex data in a number of formats.

XML core technologies

Four technologies are central to XML:

Document type definitions (DTDs)

A DTD is a file containing structured declarations that define the tags used in an XML document. DTDs define the data schema and structure of the document.
For example, the BIOpolymer Markup Language (BIOML) DTD uses tags like organism label, chromosome label, gene label and DNA label to define and structure the biological data. This allows any XML-enabled application to use the BIOML DTD to access and manipulate the data. An XML document can refer to a DTD held in an external document, for example on a website, or contain the DTD itself. In order for two applications to share data through XML, they must be using the same tags and labels, through the adoption of an agreed DTD or a common parser.

XML parsers

The system receiving XML data needs a parser that can break an XML document down into its data elements. The parser includes a facility that maps elements from the DTD to data structures in the receiving system. The parser extracts the actual data from the textual representation and creates either events or new data structures from it. Parsers also check whether documents conform to the XML standard and have a correct structure. This is essential for the automatic processing of XML documents. A validating parser checks a document not only for conformance to the general XML rules, but also against a particular DTD, checking whether all necessary elements are present and whether their order is as specified in the DTD.

Namespaces

A namespace is a collection of names identified by a Uniform Resource Identifier (URI) reference. Namespaces are designed to avoid name collisions and to promote industry-standard DTDs. The URI acts as a unique identifier for a vocabulary rather than necessarily as the location of a DTD, so that remote applications can exchange data and remote users can read XML pages without their tag names clashing.

XSL

Stylesheets are documents that, when processed with an XSLT transformation engine, can turn an XML document into another form of mark-up (XML or HTML). This is useful for piping XML documents from one version of a DTD to another. A number of biological stylesheets are hosted by bioxml.org.
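The parser and namespace behaviour described above can be sketched with Python's standard library. This is a minimal illustration, not a recommended implementation: the element names, the record values and the example.org namespace URIs are invented for the purpose, and the internal DTD is hypothetical rather than taken from BSML or BIOML. Note also that the standard-library parser used here is non-validating: it checks that a document is well formed but does not enforce the DTD, so a validating parser would be needed for the stricter checks described in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical document carrying its own (internal) DTD. All names and
# values are invented for illustration.
DOC = """<!DOCTYPE records [
  <!ELEMENT records (record+)>
  <!ELEMENT record (organism, gene, Sequence)>
  <!ATTLIST record id CDATA #REQUIRED>
  <!ELEMENT organism (#PCDATA)>
  <!ELEMENT gene (#PCDATA)>
  <!ELEMENT Sequence (#PCDATA)>
]>
<records>
  <record id="r1">
    <organism>Homo sapiens</organism>
    <gene>exampleGene</gene>
    <Sequence>ATGGCCATTG</Sequence>
  </record>
</records>"""

# Two made-up vocabularies that both define a "label" element.
NS = {
    "seq": "http://example.org/sequence",
    "lab": "http://example.org/labnotes",
}
NS_DOC = """<report xmlns:seq="http://example.org/sequence"
        xmlns:lab="http://example.org/labnotes">
  <seq:label>exampleGene exon 2</seq:label>
  <lab:label>Freezer 4, rack B</lab:label>
</report>"""


def is_well_formed(text):
    """Return True if the text obeys the general XML rules.

    This does NOT validate against the DTD; ElementTree is non-validating.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def parse_records(text):
    """Map each <record> element to a plain dictionary in the receiving system."""
    root = ET.fromstring(text)
    return [
        {
            "id": rec.get("id"),
            "organism": rec.findtext("organism"),
            "gene": rec.findtext("gene"),
            "sequence": rec.findtext("Sequence"),
        }
        for rec in root.iter("record")
    ]


def labels_by_namespace(text):
    """Fetch the two 'label' elements separately; the URIs keep them apart."""
    root = ET.fromstring(text)
    return {prefix: root.findtext(f"{prefix}:label", namespaces=NS) for prefix in NS}


records = parse_records(DOC)
labels = labels_by_namespace(NS_DOC)
```

Here `records` holds one dictionary per record, and `labels` shows that two elements with the same local name stay distinct because each belongs to a different namespace URI.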
XML in the biopharmaceutical sector

XML deployment is widespread in the pharmaceutical, biotechnology and genomic sectors. Of the executives surveyed by Silico Research, 75% said that they are currently deploying XML as part of their R&D infrastructure or product range. Virtually all those who are not deploying XML today expect to be doing so by 2003.

But most deployments are trial deployments. Typically, pharmaceutical and biotechnology companies are trialling the technology in a few sites and across a few applications before implementing full-scale deployment. This is accounted for by two factors. The first is the newness of the technology. XML is in the early stages of development and this is acting as an inhibitor to its full-scale adoption. Both users and vendors are standing back to watch how the technology develops. The second factor is the rapid pace of development in XML. This is making companies adopt a wait-and-see attitude before making a commitment to the technology.

The highest deployment of XML is in the discovery stages of the drug development process. There are three reasons for this. First, discovery and development have high integration needs; compared with other parts of the research pipeline, early stage research processes such as discovery have a strong need to integrate and manipulate complex data sets. Second, early stage teams in drug discovery processes typically have better IT skills than later stage teams, for example those in clinical trials. This makes it easier for early stage teams to experiment with new technologies. Finally, early stage discovery and development teams are culturally more open to experimenting with new technologies than those involved in later stage functions.

Pharmaceutical applications of XML

Today, most companies rely upon internally generated DTDs designed to achieve a particular objective or to link specific data sources.
Of the publicly available DTDs, three show significant usage: Bioinformatic Sequence Markup Language (BSML) for genetic sequences, BIOpolymer Markup Language (BIOML) for the annotation of biopolymer sequence information, and Genome Annotation Markup Elements (GAME) for annotating biosequence features.

Open issues and constraints

A number of open issues and constraints need to be considered with respect to XML-based technologies.

The newness of XML makes it difficult to assess its usefulness in the long term.

As XML becomes more widely adopted, we expect to see the number of DTDs explode from 50 today to at least 500. This will create DTD name clashes and make it difficult to determine which DTD to adopt and when to update.

Within the enterprise, XML is - and will continue to be - simply one of many data formats. Many pharmaceutical companies require the complex splitting and merging of data to and from multiple sources, and the combination of dependent data from relational databases. It is not clear that XML will help in this.

XML builds on text-based documents - but text-based documents make poor use of bandwidth. As the throughput of document processing increases, many companies will run into network bandwidth utilisation problems with XML.
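The bandwidth concern can be made concrete with a rough sketch that compares the size of a small XML record with the size of the data it actually carries. The record, its tag names and the helper `markup_overhead` are hypothetical, invented only to illustrate the point; real figures will depend on the DTD and the data involved.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; tag names and values are invented for illustration.
RECORD = (
    '<record id="r1">'
    "<organism>Homo sapiens</organism>"
    "<gene>exampleGene</gene>"
    "<Sequence>ATGGCCATTG</Sequence>"
    "</record>"
)


def markup_overhead(xml_text):
    """Return (total_bytes, payload_bytes) for an XML string.

    The payload is counted as element text plus attribute values;
    everything else (tags, attribute names, quotes) is markup.
    """
    root = ET.fromstring(xml_text)
    payload = "".join(text.strip() for text in root.itertext())
    payload += "".join(value for el in root.iter() for value in el.attrib.values())
    return len(xml_text.encode("utf-8")), len(payload.encode("utf-8"))


total_bytes, payload_bytes = markup_overhead(RECORD)
overhead_ratio = total_bytes / payload_bytes
```

An `overhead_ratio` well above 1 indicates that most of the bytes sent over the network are tags rather than data, which is the effect the bandwidth point describes.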
Conclusion

XML holds out the opportunity to integrate data across the enterprise and across the network of partnerships and alliances that biopharmaceutical companies are building. Moreover, it holds out the possibility of doing so with little technological dislocation and at a fraction of the cost of other integration solutions. The key to realising these opportunities lies in planning and execution - as always.

Dr Lara Marks is a Visiting Senior Research Associate at the CGHPSS Unit at Cambridge University and an honorary Senior Lecturer at the London School of Hygiene & Tropical Medicine. She is undertaking research into the impact of genomics on the pharmaceutical industry. Dr Marks is the author of a number of books and papers on pharmaceutical and healthcare issues, including a recently published and widely reviewed book for Yale University Press on the discovery and development of the oral contraceptive.

Emmett Power is a leading European analyst of new information technologies focusing on bioinformatics, data warehousing and other infrastructure and analytical technologies. He advises leading pharmaceutical and technology companies, and is the author of a number of reports covering pharmaceutical information technologies.

Note: Silico Research Limited is a pharmaceutical technology-based analysis organisation located in London and Cambridge in the UK. The company can be contacted via its website: www.silico-Research.com.

Innovations in Pharmaceutical Technology 47