BioNumerics THE UNIVERSAL PLATFORM FOR DATABASING AND ANALYSIS OF ALL BIOLOGICAL DATA.

Overview diagram: data types, database, and analysis functions

Input
- Organisms, samples: animals, plants, microbial strains or communities, fungi, tissue, samples, etc.
- 1-D fingerprints: electrophoresis gels, densitometric records, HPLC, spectrophotometry, etc.
- Character sets: phenotypic test panels, antibiotic resistance profiles, microarrays, etc.
- Sequences: DNA, RNA, and protein sequences
- 2-D gels: two-dimensional protein gels

BioNumerics database
- Import, conversion, image acquisition, normalization of gels, assembly of sequences, etc.
- Different experiments and descriptive information linked to database entries

Analysis and output
- Identification: quick similarity search, construction of libraries, neural networks
- Clustering: dendrograms from individual experiments and composite data sets, phylogenetic tree construction
- Dimensioning: principal components analysis, discriminant analysis, self-organising maps
- Statistical tools: cluster and group significance, group validation techniques, multivariate analysis, congruence of techniques
- Export: professional printing of reports and analyses, export as enhanced metafiles or bitmaps, customized reporting using scripts
- Database sharing: peer-to-peer exchange, client-server projects

INTRODUCTION

The advent of high-throughput sequencers, microarrays, MALDI, and numerous other fast and automated molecular typing techniques has made it possible to produce millions of data points for each sample under study with little effort. As easy as it has become to generate massive amounts of data, it has become just as difficult to manage the data and extract a meaningful signal from it. Even more challenging is the consensus analysis of data from different experimental techniques in order to obtain more conclusive answers.

The BioNumerics software platform addresses these challenges through four fundamental achievements:

1. Import and, where appropriate, automated batch processing of any kind of biodata, from 1-D and 2-D electrophoresis gels or spectrometry profiles to sequences, microarrays, or phenotype characters.
2. A relational multi-user database environment for lab-wide storage and retrieval of experimental and descriptive information.
3. Powerful querying, data mining and exploration, analysis, comparison, and visualization.
4. Integrated networking, data exchange, and Internet connectivity in a peer-to-peer or client-server environment.

BioNumerics is the most complete and powerful solution for databasing and comparative analysis of biodata. The software has gained worldwide recognition through daily use at many research sites, including universities, hospitals and public health centers, the food, drug, and pharmaceutical industries, and a wide range of federal and private laboratories involved in typing, quality control, screening, testing, breeding, etc.

CONCEPTS

The uniqueness of BioNumerics lies in the combination of a rich databasing platform with analysis tools for all existing biological data types. The software combines a powerful, multi-user database environment specifically designed for biodata with the most advanced tools for the analysis of patterns and fingerprints, character arrays, sequences, curves, and 2-D gels.

The biological entities of the database, i.e. the database entries, can be any biological sample under study, including bacterial or viral strains, animals, plants, fungi, tissue, or any other organic samples for which experimental data can be obtained. The database concept allows various experiments of different kinds to be entered for the same sample or strain (entry). As a result, multiple experimental data can be explored and compared among the entries studied, and groupings or identifications can be obtained for any combination of database entries and experiments available.

The experimental data can be subdivided into six classes, which include all possible experiment types employed to express relationships in biology: (1) 1-D densitometric patterns, called fingerprint types, (2) character-based data, called character types, (3) DNA, RNA, and protein sequences, called sequence types, (4) 2-D electrophoresis gels, called 2-D gel types, (5) curve readings that express the evolution (trend) of one parameter as a function of another (e.g. kinetic readings), called trend data types, and (6) matrix types, including similarity and distance matrices. Within each of these generic experimental classes, the user can create custom experiment types with particular settings. The six experiment classes are the basis of the modular subdivision of the BioNumerics software.

Fingerprint types

Any densitometric record seen as a profile of peaks or bands can be considered a fingerprint type.
Examples are electrophoresis patterns, gas chromatography or HPLC profiles, spectrophotometric curves, MALDI, SELDI, etc. Through its easy and powerful script language, BioNumerics can import and process virtually any type of fingerprint data from any manufacturer.

Electrophoresis is an important component in studying relationships in biology; therefore, comprehensive tools for preprocessing electrophoresis fingerprints, both from slab gels and capillary sequencers, are incorporated into BioNumerics. These tools include reading graphical and densitometric file formats from image files and automated sequencers, lane finding, normalization (alignment of patterns), band finding and quantification, band matching, etc. The quality and completeness of electrophoresis fingerprint analysis in BioNumerics is illustrated by the fact that the well-known GelCompar II software, with all its functions and possibilities, is entirely contained in the BioNumerics Fingerprint types application.

In addition to the range of advanced functions for gel analysis, BioNumerics provides comprehensive tools for automated preprocessing and analysis of other fingerprint data such as MALDI, dHPLC, and chromatogram files from automated sequencers. The software also offers a number of specialist plugins for electrophoresis-based applications such as VNTR or MLVA, HDA or CSCE, spa-typing, AFLP-based breeding, etc.

Character types

Using the character types, it is possible to define any array of named characters, binary or continuous, with fixed or undefined length. The size of a character type in BioNumerics can range from one single character (e.g. a morphological feature) to microarray experiments of many thousands of gene expression values. Further adjustable features include the range of the characters, the number of digits, the color scale, the similarity coefficients used for comparison, etc.

Examples of character types include fatty acid profiles, metabolic assimilation or enzyme activity test panels such as API, Biolog, and Vitek, antibiotic resistance profiles, morphological and biochemical features, microarrays and gene chips, etc. Character sets can also be the result of processed data from other sources, for example, copy numbers in electrophoresis-based MLVA or allele numbers in sequence-based MLST. BioNumerics' powerful script language and ODBC functions allow direct import of data from external databases, text-formatted files, or Excel spreadsheets.

Figure: The BioNumerics main window
- Clear icons in menus and buttons aid quick access to frequently used tools. Button bars can be docked or removed as desired, and individual buttons can be added or removed.
- All database components are described by useful information fields that can be shown or hidden, queried, sorted, etc. The user can define custom fields.
- Database panels can be docked conveniently or removed if not used. Panels can also be tabbed with other panels to optimize display usage.
- Creating Levels in a database allows for a richer database structure and increased flexibility in data organisation.

Sequence types

Within the sequence types, the user can enter sequences of nucleic acids (DNA and RNA) and amino acids. BioNumerics recognizes widely used sequence file formats such as EMBL, GenBank, and FASTA, with the possibility to import user-selected header tags as information fields. In addition, BioNumerics' powerful sequence assembler tool allows direct import of raw chromatogram files from automated sequencers. The assembler has both an excellent alignment engine and a smart, user-friendly interface. The program is fully scriptable, allowing automated batch processing in high-throughput sequencing projects such as typing and surveillance. Complete gene assembly projects with aligned chromatograms can be saved as projects and opened with a single mouse-click on a sequence stored in the BioNumerics database.

Each sequence type in BioNumerics can be stored with its own reference sequence, and with specific alignment and clustering settings. BioNumerics offers probably the finest and most comprehensive sequence alignment and clustering tools that currently exist for PCs. It combines clustering of thousands of nucleotide or protein sequences of almost unlimited length with multiple alignment and display of homology matrices. The versatile user interface allows sequences in multiple alignments to be displayed as raw chromatogram files as well as translated protein sequences, and direct editing is possible in any visualization. Multiple alignments associated with dendrograms can be edited manually in drag-and-drop mode, and a multistep undo/redo function makes editing even more convenient. In addition to well-established alignment algorithms described in the literature, the software contains extremely fast and reliable algorithms developed at Applied Maths.

The BioNumerics sequence alignment application is an invaluable tool for SNP and mutation analysis. SNPs or mutations can be screened for across many thousands of aligned sequences, and the software statistically calculates the probability of each SNP based upon the quality of the base assignments and the curves in the chromatogram files. Various filters allow SNPs to be screened with specific thresholds or other features such as the type of mutation they induce. By looking at sequencer chromatograms directly, the user has excellent control over the probability of each potential SNP locus. Complete alignments, including all SNP and subsequence search listings, can be saved and re-edited at any time. Additional sequences can be added to existing projects.

The software offers a wide range of phylogenetic clustering techniques and various tools for estimating the significance and reliability of clusters, which are discussed below under the Analysis and comparison functions.

2-D gel types

The 2-D gel analysis module in BioNumerics is a fully featured application for complete analysis and databasing of 2-D gels. Applied Maths' experience in image analysis has been fully exploited to achieve more reliable automatic gel alignments than were previously obtainable. A project-based interface allows

for the automated batch processing of multiple gels, including experiments with repeats or multiplexed gels such as DIGE. In addition, interactive overlay images with gels shown in different colors allow the user to manually correct normalizations and detect unique and common spots at a glance. BioNumerics allows protein spots from 2-D gels to be identified and stored in the database. As such, 2-D gel information can be analyzed in an unparalleled way, using all the querying, clustering, identification, and ordination techniques available in BioNumerics.

Trend data types

Trend data types include all types of sequential readings that express the evolution of one parameter as a function of another. Unlike character data, the measurements are not independent but together form a curve, through which a function can be fitted. The most prevalent trend data experiments are kinetic readings, i.e. the measurement of a parameter, e.g. the concentration of a product, as a function of time. Examples are enzymatic activity measurements, real-time PCR, growth curves, etc. BioNumerics offers a large number of curve fit models, ranging from linear, logarithmic, and Gaussian functions to more complex models such as logistic growth and Gompertz. In addition, a number of curve-derived parameters can be calculated to compare curve data in a sensible way. These parameters can be calculated dynamically as character values and used for clustering, identification, and statistics purposes.

Matrix types

With matrix types, it is possible to import externally generated similarity or distance matrices, which provide similarities between entries revealed directly by the technique or calculated by other software. These matrices can be linked to the database entries in BioNumerics, and they are used in conjunction with other information to obtain classifications and identifications. A typical example of a native matrix type is a table of DNA homology values.

Composite data sets

Each single database entry, whether a strain, organism, or sample, can have several experiments of different types linked to it. For example, a bacterial strain in the database could be characterized by a PFGE pattern, a 16S rDNA sequence, an antibiotic resistance profile, and seven housekeeping genes. A plant cultivar or variety could have an AFLP pattern and a microarray experiment linked to it. There is virtually no limit to the number and variety of experiments that can be linked to a single object under study.

When more than one experiment is available for a set of entries, it may be interesting to generate an overall table of characters, which includes all the characters of the available experiments, or a selection made by the user. The result is a seventh class of experiments, the so-called composite data set. A composite data set may include character types, sequences, 1-D or 2-D gels, and curve parameter data, and can be used for clustering, identification, or statistical analysis just like a single experiment type.

Analysis of 1-D fingerprints

Over more than 15 years, Applied Maths has built unparalleled experience and leadership in electrophoresis typing and analysis. With the Fingerprint types module, BioNumerics offers the most comprehensive and reliable platform that exists for the analysis of 1-D profiles, including electrophoresis fingerprints, MALDI and SELDI profiles, chromatography, spectrophotometry, HPLC, and virtually every type of densitometric record that can be used for comparison purposes. The software handles 8-bit, 12-bit, and 16-bit TIFF files as well as densitometric curves from capillary sequencers, scanners, and spectrophotometers. Convenient wizards enable the user to define new fingerprint types and choose optimal settings for normalization, resolution, background subtraction, smoothing, band finding, etc.

The whole process of analyzing a run or gel, starting with track preprocessing, normalization, and band or peak finding, and ending with quantification, is contained in a powerful tab-based window, allowing the user to re-edit the processing at any stage without losing any editing done in another step. In addition, the software can optionally record history files, keeping track of any changes made.

Reference peaks or bands used for alignment can be given a name or a size value (e.g., molecular weight or length in base pairs), which is used by the software to calculate the size regression.
The full information on the reference peaks or bands used for the normalization of a specific type of electrophoresis is called a reference system. The concept of reference systems also makes it possible to automatically and reliably remap experiments run under different conditions or with different reference markers onto any other reference system. This important feature makes it possible for different labs to exchange and compare electrophoresis data obtained with different conditions or setups.

Reliable quantification of bands or peaks is often a requirement in molecular research, in genetic breeding, and for quantitative comparisons. BioNumerics calculates best-fitting Gaussian curves for 1-D peaks, and 2-D images can even be quantified by determining the contours of the bands. A regression of known calibration bands can be calculated, resulting in a reliable estimate of concentration.

Defining bands or peaks on patterns can often be a critical and time-consuming task. BioNumerics offers accurate band/peak search algorithms that can be adapted to all types of patterns through a number of adjustable parameters. The software allows bands/peaks within certain intensity thresholds to be marked as uncertain, in which case they are considered neither a match nor a mismatch in comparisons. For techniques where the band/peak intensity varies as a function of the size (e.g. EtBr-stained gels such as PFGE), a peak intensity regression can be created based upon processed database patterns. The software uses the obtained regression to define peaks with much higher accuracy.

Zoom sliders in all images, convenient buttons, tool tips, floating menus, and multilevel undo/redo features make the processing of gels easy to follow and give the user quick access to the wealth of advanced features available in BioNumerics.
Numerous other features, such as spot removal, 2-D and 1-D background subtraction, filtering, spectral analysis, alignment distortion bars, and optimization and tolerance statistics, have made BioNumerics the absolute standard for fingerprint analysis in environments where speed, volume, and reliability are critical issues.

As a last important feature, the script language in BioNumerics allows any action involved in gel processing to be executed from a script, which makes it possible to introduce various levels of automation in the gel analysis procedure. In environments where large numbers of standardized gels are run, this feature forms an invaluable basis for low-cost, high-throughput routine analysis.
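To illustrate how bands marked as uncertain can be kept out of a comparison, the sketch below scores two band patterns with the Dice coefficient, a common choice for band-based fingerprints. It is a minimal Python sketch, not the BioNumerics implementation: the function name, the greedy nearest-band pairing, the position tolerance, and the (position, certain) band representation are all assumptions made for this example.

```python
def dice_with_uncertain_bands(pattern_a, pattern_b, tolerance=1.0):
    """Dice similarity between two band patterns.

    Each pattern is a list of (position, certain) tuples (hypothetical format).
    A pair of bands closer than `tolerance` is a match when both are certain
    and is ignored when either one is uncertain. Unpaired certain bands are
    mismatches; unpaired uncertain bands are ignored.
    """
    bands_a = sorted(pattern_a)
    bands_b = sorted(pattern_b)
    used_b = set()
    matches = 0
    mismatches = 0

    for pos_a, certain_a in bands_a:
        # Find the closest still-unpaired band of pattern B within the tolerance.
        best_j, best_dist = None, None
        for j, (pos_b, _) in enumerate(bands_b):
            dist = abs(pos_a - pos_b)
            if j not in used_b and dist <= tolerance:
                if best_dist is None or dist < best_dist:
                    best_j, best_dist = j, dist
        if best_j is None:
            # Unpaired band: a mismatch only if it is a certain band.
            if certain_a:
                mismatches += 1
        else:
            used_b.add(best_j)
            # The pair counts as a match only when both bands are certain.
            if certain_a and bands_b[best_j][1]:
                matches += 1

    # Certain bands of pattern B that found no partner are mismatches too.
    for j, (_, certain_b) in enumerate(bands_b):
        if j not in used_b and certain_b:
            mismatches += 1

    denominator = 2 * matches + mismatches
    # With no countable bands the coefficient is undefined; 0.0 is returned here.
    return 2 * matches / denominator if denominator else 0.0
```

For example, comparing `[(10, True), (20, True), (30, False)]` with `[(10.5, True), (30.2, True), (40, True)]` at a tolerance of 1.0 gives one match, two mismatches, and one ignored pair, so the score is 0.5.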

DATABASING

The backbone of BioNumerics is a powerful relational database, specifically designed for storing and retrieving biological data. By default, the software creates Microsoft Access databases, which are suitable for most purposes and for occasional multi-user access. For high-volume databasing, lab-wide access, permission control, automatic backup, etc., BioNumerics can also manage a number of professional database engines such as Oracle, SQL Server, PostgreSQL, MySQL, and DB2.

The rich and flexible database structure allows information to be added at numerous levels. For example, an organism can have its own descriptive information fields (up to 150), and can also have a number of attachments associated with it. These include text files, images, Word, Excel, and PDF files, and HTML/XML files or URLs. Experiments linked to the organism, for example the gel pattern and the gel file in which the pattern occurs, can have their own descriptive information fields. Even comparisons, subsets, libraries, and other objects can have associated information fields.

To add even more flexibility, multiple Levels can be defined within a database. As an example in clinical diagnostics, one level could hold the patients, a second level the samples that were taken from these patients, and a third level the experimental data obtained from these samples. Each level can have its own associated descriptive information fields, and levels are related to each other through Relations.

One of the most appreciated database features of BioNumerics is its advanced querying tools. Query components can be created based upon database fields, ranges of fields, availability of experiments, presence of bands or characters, character values, subsequences, etc. These components can be combined using logical operators such as AND, NOT, OR, and XOR, giving rise to complex queries that are represented in a smart interactive diagram. Virtually no search query is too complex to be realized in BioNumerics. Queries can be saved to be reused or modified at any time. For full control, experienced users can also enter SQL query statements.

Every biological experiment, including gel patterns, densitometric curves, carbohydrate assimilation panels, antibiotic resistance profiles, blots, microarrays, 2-D gels, and sequences, can be visualized instantly with a single mouse-click, and comparisons between experiments can be shown. Character-type experiments can be visualized in table format or graphically, to resemble commercial test kits, for example.
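The three-level clinical example (patients, samples, experimental data) and the combination of query components with AND can be sketched with an in-memory SQLite database. This is only an illustration in Python: the table names, column names, and sample rows are hypothetical and do not reflect the actual BioNumerics schema, and SQLite stands in for whichever database engine is used.

```python
import sqlite3


def build_demo_database():
    """Create a small three-level database: patients -> samples -> experiments."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE patients (
            patient_id INTEGER PRIMARY KEY,
            ward       TEXT
        );
        CREATE TABLE samples (
            sample_id  INTEGER PRIMARY KEY,
            patient_id INTEGER REFERENCES patients(patient_id),
            source     TEXT
        );
        CREATE TABLE experiments (
            experiment_id INTEGER PRIMARY KEY,
            sample_id     INTEGER REFERENCES samples(sample_id),
            kind          TEXT
        );
        """
    )
    con.executemany("INSERT INTO patients VALUES (?, ?)",
                    [(1, "ICU"), (2, "Surgery")])
    con.executemany("INSERT INTO samples VALUES (?, ?, ?)",
                    [(10, 1, "blood"), (11, 1, "urine"), (12, 2, "blood")])
    con.executemany("INSERT INTO experiments VALUES (?, ?, ?)",
                    [(100, 10, "PFGE"), (101, 10, "16S rDNA"), (102, 12, "PFGE")])
    return con


def samples_with(con, ward, kind):
    """Samples from patients on `ward` AND having an experiment of type `kind`."""
    rows = con.execute(
        """
        SELECT DISTINCT s.sample_id
        FROM samples AS s
        JOIN patients AS p ON p.patient_id = s.patient_id
        JOIN experiments AS e ON e.sample_id = s.sample_id
        WHERE p.ward = ? AND e.kind = ?
        ORDER BY s.sample_id
        """,
        (ward, kind),
    ).fetchall()
    return [row[0] for row in rows]
```

Each level keeps its own descriptive fields (here `ward`, `source`, and `kind`), and the foreign keys play the role of the relations between levels. A query that combines a field on the patient level with the availability of an experiment then becomes a join across the levels.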

ANALYSIS AND COMPARISON

In addition to the six experiment type modules, BioNumerics offers three comparison type modules: (i) Cluster Analysis and phylogeny, (ii) Non-hierarchic grouping techniques and statistics, and (iii) Identification and decision networks. Each of these modules is very comprehensive in terms of functionality and possibilities, so only their most important features can be highlighted in the following paragraphs.

Cluster Analysis and Phylogeny

Ever since computers became available to biologists, cluster analysis, also called unsupervised learning, has been a fundamental tool in bioinformatics. Combining the concepts of a relational database, the contribution of multiple techniques, and a range of powerful clustering algorithms has resulted in a clustering module with unique capabilities in BioNumerics.

The Comparison window. This crucial window in BioNumerics presents a comprehensive overview of all available experiments for a selection of entries and enables the user to show and compare any combination of experiments. Similarity or distance matrices and dendrograms can be calculated for any selected experiment, and the obtained groupings can be compared with patterns or characters obtained from other experiments. A variety of similarity and distance coefficients and clustering methods are available, in order to provide the most appropriate clustering for all data types.

Composite cluster analysis. Composite clusterings can be generated from selected combinations of experiments, and various methods can be used to obtain a combined dendrogram. Similarities can be taken from the individual experiments and averaged using user-defined weights, or weights determined by the program based upon the number of characters available in each experiment. Alternatively, all characters from the individual experiments can be pooled to form one global data set, which can be clustered. Advanced mathematical algorithms allow the calculation of a consensus similarity matrix and dendrogram based upon individual matrices from different experiments.

Dendrogram functions. BioNumerics offers a comprehensive set of features for clustering and mining of complex data sets. Numerous viewing modes and editing tools, such as two-way zoom sliders, swapping and abridging of branches, rerooting of trees, and displaying data (characters, patterns, curves, or sequences) in various modes, make the interpretation of large cluster analyses easier.

Incremental clustering. The incremental clustering algorithm allows batches of entries to be added to or deleted from existing dendrograms without having to recalculate the entire similarity matrix. BioNumerics automatically updates the existing matrix and rebuilds the dendrogram accordingly, so that adding or deleting entries becomes a matter of seconds instead of minutes or hours. Special attention has been paid to the incremental construction of multiple alignments of sequences. In order to maintain alignments edited by the user, new sequences can be aligned while preserving the existing alignment.

Dendrogram significance tools. Several statistical methods are available for evaluating the confidence level of a global tree, and of each individual branch. These methods include the standard deviation and cophenetic correlation at each branching level and the root, bootstrap analysis at each branching level of a rooted or unrooted tree, and the jackknife method. BioNumerics can also search for and show all degeneracies in a dendrogram and display a consensus dendrogram that encompasses all degeneracies. In a similar way, consensus dendrograms can be calculated from different techniques. Partitioning methods provide an alternative way of discovering group structures in complex data sets.

Clustering of characters. Not only can entries be clustered based upon the characters they share or differ in, but characters can also be clustered simultaneously based upon the transposed data matrix. This approach results in a transversal clustering or two-way clustering, a combined view in which both database entries and characters are clustered, and which allows the user to easily reveal the characters that determine and distinguish groups of related entries.
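The weighted averaging of similarities described under composite cluster analysis can be sketched in a few lines of Python. The function name and the plain list-of-lists matrix format are assumptions for this example, and the sketch covers only the averaging step, not the consensus algorithms mentioned above.

```python
def composite_similarity(matrices, weights):
    """Weighted average of several similarity matrices over the same entries.

    `matrices` is a list of square matrices (lists of lists), one per
    experiment; `weights` holds one weight per experiment, for example
    user-defined values or the number of characters in each experiment.
    """
    if len(matrices) != len(weights):
        raise ValueError("one weight is needed per similarity matrix")
    total = sum(weights)
    size = len(matrices[0])
    return [
        [
            sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
            for j in range(size)
        ]
        for i in range(size)
    ]
```

With a 20-character experiment giving 80% similarity between two entries and a 10-character experiment giving 50%, weighting by the number of characters yields (20 × 80 + 10 × 50) / 30 = 70%. The resulting matrix can then be passed to any clustering method that accepts a similarity matrix.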

Phylogenetic inference. In addition to pairwise clustering techniques such as UPGMA, Ward, Single Linkage, Complete Linkage, and Neighbor Joining, BioNumerics offers true phylogenetic clustering algorithms based upon evolutionary optimization criteria. These include the Generalized Maximum Parsimony method and the Maximum Likelihood algorithm. Parsimony can be combined with bootstrap analysis, whereas maximum likelihood offers the Likelihood Ratio Test. Both methods result in an unrooted "seaweed" dendrogram, which can be converted into a pseudo-rooted tree after assignment of a root. To correct the phylogenetic distance scaling, the Jukes and Cantor or Kimura 2-parameter correction can be chosen.

Dimensioning techniques and statistics

Under dimensioning techniques, we classify all techniques that place the entries in a space of two or more dimensions, rather than imposing a hierarchical, bifurcating structure like a dendrogram.

Principal Components Analysis (PCA). This technique starts directly from a character table to obtain groupings in a multidimensional space. Any combination of axes can be displayed in two or three dimensions.

Multi-Dimensional Scaling (MDS). Rather than starting from the data set, MDS uses the similarity matrix as input, which has the advantage over PCA that it can be applied directly to banding patterns. The MDS algorithm iteratively optimizes the distances between the entries in the MDS space according to the similarity values of the matrix.

The advanced presentation modes of both PCA and MDS produce three-dimensional graphs in an X-Y-Z coordinate system, which can be rotated in real time to enhance the perception of the spatial structures. All dimensioning techniques in BioNumerics provide interactive features that make it possible to select, add, or remove entries directly on the plot, display additional database information as colors or labels, relate groupings directly to discriminatory characters, etc.
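The Jukes and Cantor and Kimura 2-parameter corrections mentioned under phylogenetic inference convert an observed proportion of differing sites into an estimated number of substitutions per site. The Python sketch below uses the standard textbook formulas; the function names are chosen for this example and nothing here describes how BioNumerics implements the corrections internally.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap ('-') are skipped.
    """
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - (4/3) * p)."""
    if p >= 0.75:
        raise ValueError("the correction is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(transitions, transversions):
    """Kimura 2-parameter distance from the proportions of transitions (P)
    and transversions (Q): d = -(1/2) * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    a = 1.0 - 2.0 * transitions - transversions
    b = 1.0 - 2.0 * transversions
    if a <= 0.0 or b <= 0.0:
        raise ValueError("the correction is undefined for these proportions")
    return -0.5 * math.log(a * math.sqrt(b))
```

Both corrections leave identical sequences at distance zero and stretch larger observed differences, because multiple substitutions at the same site hide part of the true divergence. When transversions are exactly twice as frequent as transitions, the Kimura formula reduces to the Jukes-Cantor one.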
Minimum Spanning Trees. Whereas parsimony and maximum likelihood techniques are suitable for inferring deeper phylogenetic relationships, the Minimum Spanning Tree (MST) algorithm allows short-term divergence and micro-evolution in populations to be reconstructed based upon sampled data. The MST technique as implemented in BioNumerics is an excellent tool for analyzing genetic subtyping data such as those derived from MLST, MLVA, and other allele-comparison techniques. The MST interface interacts closely with the database and other techniques and is an ideal platform for plotting epidemic divergence against other factors such as geographical distribution, date of sampling, serotypes, etc.
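As a rough illustration of the idea, a minimum spanning tree over allele profiles can be built with Prim's algorithm, using the number of differing loci as the distance. This Python sketch is generic: the function names and the profile format are assumptions, ties are broken arbitrarily by name, and it does not reproduce any priority rules the BioNumerics implementation may apply.

```python
def allele_distance(profile_a, profile_b):
    """Number of loci at which two allele profiles differ."""
    return sum(x != y for x, y in zip(profile_a, profile_b))


def minimum_spanning_tree(profiles):
    """Prim's algorithm on a non-empty dict {name: allele profile}.

    Returns a list of (from_name, to_name, distance) edges.
    """
    names = list(profiles)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge that connects the current tree to a new entry.
        dist, u, v = min(
            (allele_distance(profiles[u], profiles[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, dist))
        in_tree.add(v)
    return edges
```

For four hypothetical sequence types with profiles (1, 1, 1, 1), (1, 1, 1, 2), (1, 1, 2, 2), and (3, 3, 3, 3), the first three are chained by single-locus differences and the fourth attaches with a four-locus edge, for a total tree length of 6.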

Self-Organizing Maps (SOM). A SOM is a type of neural network that can place many thousands of entries in a two-dimensional representation, a map, according to overall relatedness. For complex data sets with large numbers of entries, SOM analysis is to be preferred over traditional clustering. An interesting option of a SOM is that unknown entries can be placed in an existing map with very little computing time, which offers a quick and easy-to-interpret identification tool. BioNumerics was the first software to apply this exciting technique to the study of biological relatedness and to identification.

Discriminant Analysis and MANOVA. These very useful statistical analysis methods allow the relation between groups of entries and characters to be discovered, and the significance of such groups to be determined. The groups can be clusters derived from a dendrogram, or any user-defined selections of entries (e.g., by origin, species, or serotype).

Statistical tests and charts. An easy and intuitive tool performs a number of parametric and non-parametric statistical tests (Chi-square test, T-test, Wilcoxon signed-ranks test, Kruskal-Wallis test, ANOVA, Pearson correlation test, Spearman rank-order test, and others). For each input data type, the software displays the suitable tests and the available plot types.

Libraries and identification

Identification, also called supervised learning or classification, is no doubt one of the most important techniques in bioinformatics. The possibility of identifying unknown organisms based upon several available experiment data sets at once is also a big step forward realized in BioNumerics, leading to more faithful consensus identifications. The same range of similarity and distance coefficients available for cluster analysis can be used for identification.

Identification libraries. Identification can be as quick and simple as sorting a large list of database entries according to similarity with an unknown entry.
However, the use of libraries can make the identification between complex groups much more reliable. An identification library is a collection of units, each of which consists of one or more entries of the same group (taxon, subtype, variant, ecotype, etc.). The identification of unknown samples depends on the similarity to the available library units. A clear, easily surveyed identification report lists the identifications obtained by all individual data sets. The number of closest matches shown can be expanded or reduced, and full detailed information on the identification of a specific entry is shown instantly with a simple click. Mathematical and statistical methods allow the reliability and relevance of each identification case to be estimated.
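The library principle can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the BioNumerics scoring method: it assumes binary character profiles, uses the simple matching coefficient as the similarity measure, and scores each unit by the mean similarity of its entries to the unknown. The unit names, profiles, and function names are hypothetical.

```python
def simple_matching(a, b):
    """Percentage of characters on which two binary profiles agree."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of characters")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def identify(unknown, library):
    """Rank library units by the mean similarity of their entries to `unknown`.

    `library` maps a unit name to a list of entry profiles. The result is a
    list of (unit, mean similarity in %) sorted from best to worst match.
    """
    scores = []
    for unit, entries in library.items():
        mean = sum(simple_matching(unknown, entry) for entry in entries) / len(entries)
        scores.append((unit, mean))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical library: two units, each holding two phenotypic test profiles.
LIBRARY = {
    "Taxon A": [(1, 1, 0, 0, 1, 0), (1, 1, 0, 1, 1, 0)],
    "Taxon B": [(0, 0, 1, 1, 0, 1), (0, 1, 1, 1, 0, 1)],
}
UNKNOWN = (1, 1, 0, 0, 1, 1)

if __name__ == "__main__":
    for unit, score in identify(UNKNOWN, LIBRARY):
        print(f"{unit}: {score:.1f}% mean similarity")
```

Averaging over all entries of a unit makes the result less sensitive to a single atypical entry than a plain nearest-entry search, which is one way to read the reliability gain described above.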


A detailed pairwise comparison can be obtained between any two entries from the database, which lists all the experiments that both entries share, together with the percentage similarity. With a simple mouse-click on the experiment type, the gel strips, character sets, aligned sequences, or whatever other data were entered for both entries are shown together.

As an interesting alternative to classical similarity-based identification, BioNumerics allows neural networks to be generated for each experiment type. For large databases containing groups that are difficult to distinguish, neural networks can be the quickest and most reliable identification tool.

Decision networks. Decision networks are one of the most versatile and powerful tools in BioNumerics, allowing the user to build automated workflows to make decisions, predict features, perform queries, fill in fields, create graphs and plots, and much more. A decision network is an operational workflow that carries out one or more (logical) operations and/or actions on the database. The network is built of Operators as building blocks that form the Nodes of the network: Input operators, which retrieve specific, usually experimental, data; String, Value and Sequence operators, which perform a manipulation on data types; Boolean operators, which combine one or more binary states into a new binary state; and Output actions, which perform a specific action on the database, for example writing a field. The operators of a decision network together form an easy-to-use construction kit that allows one to build automated decision or action workflows, with endless possibilities.

Analysis of congruence between techniques

When comparisons are made between groupings based upon different techniques, the question arises to what extent there exists any congruence between these different techniques.
Another interesting aim is to find which technique is the closest to the consensus classification, since this technique will in general be the most reliable for identifying the organisms or samples under study. This is another analytical tool offered by BioNumerics: similarity matrices obtained from different techniques are compared in a pairwise manner by comparing corresponding similarity values using either Kendall's tau coefficient or the product-moment correlation. This results in a congruence matrix, expressing the global similarity or congruence between different techniques. This matrix in turn can be clustered into a dendrogram, now grouping techniques according to congruence. Pairwise comparisons between any two techniques are obtained by plotting the corresponding similarity values in an X-Y diagram.
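The construction of a congruence matrix can be sketched as follows. The Python code is a minimal illustration, not the BioNumerics implementation: it takes the values above the diagonal of each similarity matrix and compares them with the product-moment (Pearson) correlation. The technique labels, similarity values, and function names are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Similarity values above the diagonal, read row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Product-moment correlation between two equally long value lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def congruence_matrix(matrices):
    """Correlate the similarity values of every pair of techniques."""
    values = {name: upper_triangle(m) for name, m in matrices.items()}
    return {a: {b: pearson(values[a], values[b]) for b in values} for a in values}


# Hypothetical similarity matrices (%) for the same four entries.
MATRICES = {
    "PFGE": [
        [100, 90, 40, 35],
        [90, 100, 45, 30],
        [40, 45, 100, 85],
        [35, 30, 85, 100],
    ],
    "AFLP": [
        [100, 95, 70, 67.5],
        [95, 100, 72.5, 65],
        [70, 72.5, 100, 92.5],
        [67.5, 65, 92.5, 100],
    ],
    "Phenotype": [
        [100, 60, 55, 70],
        [60, 100, 65, 50],
        [55, 65, 100, 62],
        [70, 50, 62, 100],
    ],
}

if __name__ == "__main__":
    congruence = congruence_matrix(MATRICES)
    for a, row in congruence.items():
        print(a, {b: round(r, 3) for b, r in row.items()})
```

In this toy data the "AFLP" values are a linear rescaling of the "PFGE" values, so the two techniques are fully congruent even though their absolute similarity levels differ, whereas "Phenotype" groups the entries differently. The paired lists returned by `upper_triangle` are also the values one would plot against each other in an X-Y diagram.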


Such X-Y plots are very useful to reveal the taxonomic level or depth of one technique compared to another: they show whether one technique is discriminative at a lower or higher level than another technique and provide insight into the limitations and benefits of each technique in building identification strategies.

Database sharing

Today, the exchange of information among different laboratories is of the utmost importance in the life sciences. The need to exchange biodata has become particularly urgent in clinical and epidemiological research and surveillance networks. BioNumerics offers a powerful solution to this important issue with its integrated Database Sharing Tools, available as a separate module.

Peer-to-peer data exchange. The Database Sharing Tools allow BioNumerics users to exchange information at a peer-to-peer level by simply making a selection of database entries and clicking the information fields and experiment data to be exported in XML format. Received XML files can be imported and directly analyzed together with other database entries. BioNumerics automatically recognizes which experiments are compatible. XML exchange files can optionally be compressed and encrypted.

Client-Server approach. BioNumerics' advanced client-server system is the perfect solution for collaborative research projects, networks, and private initiatives of any size, where central databases are made available to a restricted or unrestricted number of client users. Each BioNumerics software package that contains the Database Sharing Tools comes as a client version, which can connect and communicate with a BioNumerics Server using TCP/IP. A direct connection is established between the Server and the Client, allowing uploading and downloading of database entries, interactive querying, and automatic identification of profiles uploaded by the client. Using the script language on both the client and the server side, the most sophisticated implementations can be designed.
Examples include automatic creation and broadcasting of reports and notices, or automatic alerts of members in surveillance networks.

Geographical mapping. In many research projects, especially epidemiological ones, biological data are closely linked to geographical data. The BioNumerics Database Sharing Tools enable a Geo plugin to be installed, offering a simple yet powerful way to map the results from queries, comparisons, identifications, etc. on geographical maps. Geographical information with database entries can be provided as city names, postal or zip codes, or geographical coordinates. Entries can be plotted individually or as stacked bar graphs or pie charts, using different colors according to groups defined in the database. The powerful geographic tools of Google Maps and the versatile search, select, and query interface of BioNumerics together make the Geo plugin a very useful and interactive asset.

Plugin tools

Although BioNumerics is a versatile and comprehensive platform for the analysis and databasing of any type of biological data, a number of applications are too specific to be provided in a generic environment. This is the case for import and export tools, but also for a number of cutting-edge techniques that require continuous updating of the analysis tools to keep up with the latest developments. Therefore, most technique-oriented functionality has been enabled as Plugin applications. These plugins are well documented in separate manuals and are officially supported by Applied Maths. Free specialist plugins are available for antibiotic susceptibility analysis, HIV drug resistance analysis, MLST analysis, spa typing, MLVA-VNTR analysis, project-based automated 2-D gel analysis, etc. BioNumerics also offers free plugins for import and export, XML-based exchange, automated batch sequence assembly, geographical mapping, and a wide variety of extra functionality for the database, dendrograms and reporting. A few highly advanced plugin-based modules are also available as separate software licenses, e.g. the HDA plugin for CSCE-based heteroduplex analysis (HDA), and the Band Scoring plugin for electrophoresis-based codominant band scoring and recurrent parent analysis.

Modular structure

BioNumerics consists of 10 different modules in total, of which 6 modules are related to the different experimental applications that can be analyzed (application modules), and 4 modules constitute the different analysis tools that the software contains (analysis modules).
6 application modules: Fingerprint types, Character types, Sequence types, Trend data types, 2-D gel types, and Matrix types.
4 analysis modules: Cluster analysis, Identification & Libraries, Dimensioning techniques & Statistics, and Database sharing tools.
The full BioNumerics functionality is physically contained in the same program unit, which guarantees perfect integration of the modules and easy co-evaluation of different data sets and analyses. For example, a selection of entries highlighted on a dendrogram (Cluster analysis module) also becomes highlighted on a PCA (Dimensioning techniques module) and in the database. Any or all of the application and analysis modules can be combined with each other. At least one application module is required to operate the software.

Compatibility

Import of fingerprints: Accepts uncompressed gel images as 8-bit, 12-bit, and 16-bit TIFF files generated by any imaging system. Direct import and processing of multichannel chromatogram files from automated sequencers (Applied Biosystems, Beckman, Amersham).
Import of absorbance and densitometry profiles from a variety of scanners, sequencers and automated systems (electrophoresis, spectrophotometry, HPLC, mass spectrometry, MALDI, SELDI, etc.). Import of processed densitometric data as peak MW, RF, height and/or surface tables. Scriptable import of any densitometric or peak table record available in text format.

Import of character data: Easy wizard-driven import of character data from text tables, Excel spreadsheets or databases. Plugin tools available for import of most common phenotypic test panels and automated identification systems such as fatty acid profiling. Scriptable import of any character array available in text format or contained in a database or spreadsheet. RGB channel quantification of grid-based character data scanned as 8-bit to 24-bit TIFF images, such as microplates, phenotypic test panels, DNA arrays, etc.

Import of sequences: Processing and contig construction in BioNumerics Assembler of multichannel chromatogram files from automated sequencers (Applied Biosystems, Beckman, Amersham) and text SCF or binary files. Compatible with EMBL, GenBank, and FASTA sequence formats for import of annotated sequences (header descriptions, features and qualifiers). Import of aligned sequences possible. Script-based import of less common file formats.

Import of database information: Import of information fields from any text file type, spreadsheet or database using the import plugin. Direct link with SQL and ODBC compatible databases.

Printing and export: Professional print reports in color or grayscale. Each graphical or text-oriented print job can be copied to the clipboard for import into other Windows software, or can be saved as a bitmap file with adjustable resolution. Creation of custom graphics or text reports possible using scripts.
Script language: Powerful script language for tasks such as importing data from files or databases, automated import and processing of fingerprints, automated sequence assembly, exporting data, creating customized graphics and text reports, manipulation of database fields, manipulation of experimental data, performing complex queries, creating specific analysis tools, etc.

www.applied-maths.com
Keistraat 120, B-9830 Sint-Martens-Latem, Belgium. Phone +32 9 2222100, Fax +32 9 2222102
13809 Research Blvd., Suite 645, Austin, Texas 78750, USA

BioNumerics is a trademark of Applied Maths NV. All other trademarks are the properties of their respective owners. The information in this brochure is subject to change without prior notice. Copyright Applied Maths NV. All rights reserved.

GelCompar II TODAY'S FOREMOST SOFTWARE FOR THE ANALYSIS OF BANDING PATTERNS AND FINGERPRINTS.

www.applied-maths.com Today's foremost software for the analysis of banding patterns and fingerprints Ever since