Sarah Cohen-Boulakia. Laboratoire de Recherche en Informatique CNRS UMR 8623 Université Paris-Sud, Université Paris-Saclay


2 Understanding Life Sciences
- Progress in multiple domains: biology, chemistry, maths, computer science
- Emergence of new technologies: next-generation sequencing, ...
- Increasing volumes of raw data, all stored in Web data sources
- Raw data are not sufficient
- Data annotated by experts
- Bioinformatics analysis of data
- New data sources
Concrete example: querying NCBI Entrez (search «Gquery NCBI» on Google)

3 The National Center for Biotechnology Information (NCBI) provides access to biomedical and genomic information through its Web portal
- 33 databases
- Queries, for example: breast cancer genes

4 Content of a biological database = a set of pages (a book!)
- Each database is focused on one biological entity (here, Gene)
- Each page describes one instance (i.e., one gene)

5 Various quality features
- Last update (freshness)
- Status of the sequence (reliability)
- «Volume of information» (completeness)
- Plus: reputation of the data source, number of incoming links (paths) to reach the data

6 To determine the importance of results (obtained in several ways) and to get complete sets of answers
- Mammalian carcinoma: 771 genes, not all included in the previous result!

7 NCBI ranks the answers for each keyword by relevance
- Number of occurrences of the keyword (breast cancer) in the files
Other approaches have been tested
- Combining various criteria (weighted functions)
- Freshness first, then reputation of the source, then ...
- Following PageRank or ObjectRank-like solutions (Biozon, Biorank)
How to deal with several sets of answers? How to aggregate several rankings?
- Answers obtained by several keywords should have higher priority
Wanted: a ranking solution exploiting many criteria
Problem: how to combine all such criteria?
Alternative: consensus rankings?

8 Outline
- Need for consensus rankings to order biological data
- Definition of the problem
- Experimental study of existing approaches
- ConQuR-Bio
- Conclusion

9 Input: a set of rankings (A to D are four genes)
Ranking 1 = [A, D, C, B]
Ranking 2 = [B, A, D, C]
Ranking 3 = [D, A, B, C]
Output: one consensus ranking
Consensus = [A, D, B, C]
A consensus ranking makes the most of several rankings, each obtained by a given ranking method
- It puts the emphasis on their common points
- It does not give too much importance to data ranked highly by only one or a few ranking methods
The optimal consensus is the ranking that is closest to the input rankings, according to a distance

12 The Kendall τ distance D(π,σ) counts the number of pairs of elements that are inverted (i.e., in the opposite order) between two rankings π and σ
Kemeny score: the sum of the distances from a candidate ranking to each input ranking
Optimal consensus (median): a ranking with minimal Kemeny score
Complexity [Dwork et al. 2001, Biedl et al. 2009]: NP-hard for an even number of permutations ≥ 4
Approximation and heuristics
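
As a minimal sketch (not the speaker's implementation), the Kendall τ distance and the Kemeny score can be computed directly, and an optimal consensus found by brute force over all permutations. This is only feasible for a handful of elements, which is why approximations and heuristics are needed. The function names are illustrative assumptions; the rankings are those of slide 9.

```python
from itertools import combinations, permutations

def kendall_tau(pi, sigma):
    """Number of element pairs ordered differently in pi and sigma."""
    pos_pi = {e: i for i, e in enumerate(pi)}
    pos_sigma = {e: i for i, e in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        if (pos_pi[a] - pos_pi[b]) * (pos_sigma[a] - pos_sigma[b]) < 0
    )

def kemeny_score(candidate, rankings):
    """Sum of Kendall tau distances from candidate to every input ranking."""
    return sum(kendall_tau(candidate, r) for r in rankings)

def optimal_consensus(rankings):
    """Exhaustive search: tries all n! permutations, so tiny inputs only."""
    elements = sorted(rankings[0])
    return min(permutations(elements), key=lambda c: kemeny_score(c, rankings))

rankings = [
    ["A", "D", "C", "B"],  # Ranking 1
    ["B", "A", "D", "C"],  # Ranking 2
    ["D", "A", "B", "C"],  # Ranking 3
]
best = optimal_consensus(rankings)
print(list(best), kemeny_score(best, rankings))
```

On these three rankings the search returns [A, D, B, C] with a Kemeny score of 4, matching the consensus shown on slide 9.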

13 Real data sets
- Have equalities: several items tied (ex aequo)
- May be incomplete: the rankings are not all over the same set of elements
Unification process
- One unifying bucket, holding the missing elements, is added at the end of each ranking
- Elements in the unifying bucket are equally unimportant
Generalized Kendall τ distance G(r,s): counts the number of pairs of elements that are either inverted between two rankings r and s, or tied in only one of the two rankings
R1=[A, {B,C}, D, {E}] R2=[{A,D}, E, {B,C}] R3=[{C,E}, A, B, {D}]
Complexity: finding an optimal consensus with ties is at least as difficult as in the permutation case [VLDB15]
Optimal consensus implemented [VLDB15]: provides exact solutions up to n=40 elements
Approximation and heuristics
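
The unification process and the generalized distance can be sketched as follows. This is an illustration under stated assumptions, not the [VLDB15] code: a ranking with ties is represented as a list of buckets (lists), every disagreeing pair costs 1 whether it is inverted or tied in only one ranking, and the function names are made up for the example.

```python
from itertools import combinations

def unify(rankings):
    """Append to each ranking one final bucket with the elements it lacks."""
    universe = set().union(*({e for b in r for e in b} for r in rankings))
    unified = []
    for r in rankings:
        present = {e for b in r for e in b}
        missing = sorted(universe - present)
        unified.append([list(b) for b in r] + ([missing] if missing else []))
    return unified

def bucket_index(ranking):
    """Map each element to the index of the bucket that contains it."""
    return {e: i for i, bucket in enumerate(ranking) for e in bucket}

def generalized_kendall_tau(r, s):
    """Count pairs inverted between r and s, or tied in exactly one of them."""
    pr, ps = bucket_index(r), bucket_index(s)
    dist = 0
    for a, b in combinations(sorted(pr), 2):
        dr, ds = pr[a] - pr[b], ps[a] - ps[b]
        if dr * ds < 0:                  # opposite order in r and s
            dist += 1
        elif (dr == 0) != (ds == 0):     # tied in only one of the two
            dist += 1
    return dist

# R2 is taken from the slide; the first ranking is R1 without its final
# bucket {E}, an assumed pre-unification form used only for illustration.
R2 = [["A", "D"], ["E"], ["B", "C"]]
R1, _ = unify([[["A"], ["B", "C"], ["D"]], R2])
print(R1, generalized_kendall_tau(R1, R2))
```

Here unification restores R1=[A, {B,C}, D, {E}], and G(R1, R2) evaluates to 5: four inverted pairs (B-D, B-E, C-D, C-E) plus the pair A-D, which is tied only in R2.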

14 Classification of consensus ranking algorithms
Families: generalized Kendall-τ distance; Kendall-τ distance (permutations); positional approaches
- BioConsert [Cohen-Boulakia et al. 2011]
- FaginDyn [Fagin et al. 2004]
- Ailon3/2 [Fagin et al. 2004]
- RepeatChoice [Ailon et al. 2010]
- Pick-A-Perm [Ailon et al. 2008]
- PNE (exact) [Conitzer et al. 2006]
- KwikSort [Ailon et al. 2008]
- B&B [Ali et al. 2012]
- Chanas [Chanas et al. 1996]
- MEDRank [Fagin et al. 2003]
- BordaCount [Borda 1781]
- CopelandMethod [Copeland et al. 1951]
- MC4 [Dwork et al. 2001]
- ChanasBoth [Coleman et al. 2009]

15 Compatibility with ties
- Ability to consider the cost of (un)tying [VLDB15]

16 Optimal solution can be found up to n~40
Comparison of 12 algorithms
- Re-implemented and adapted to ties
- Evaluated on quality (gap to the optimal consensus c*) and on time
Real datasets
- Extracted from previous publications
- 7 datasets: EachMovie, F1, GiantSlalom, SkiCross/Jumping, WebCommunities, WebSearch, BioMedical [VLDB15]
Synthetic datasets
1. Uniformly generated with MuPAD-Combinat
2. With increasing levels of similarity
3. Similarity and unification process
Precise understanding of the impact of the similarity and unification process
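
The slide names the gap without giving its formula. One common way to define it, stated here as an assumption rather than as the definition used in the study, compares the Kemeny score K of a consensus c with that of the optimal consensus c* over the set of input rankings R, using the generalized Kendall τ distance G:

```latex
K(c, \mathcal{R}) = \sum_{r \in \mathcal{R}} G(c, r),
\qquad
\mathrm{gap}(c) = \frac{K(c, \mathcal{R})}{K(c^{*}, \mathcal{R})} - 1
```

Under this definition a gap of 0 means that c is optimal, and the ratio is only defined when the optimal score is non-zero.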

17 BioConsert can be used in a very large majority of the cases
For very large data sets (>30.… elements) KwikSort can be preferred
If there is a need to speed up, then
- In case of few equalities, use BordaCount
- Otherwise use MEDRank
- Alternatively: use both algorithms and pick the best
[Figure: algorithms placed by quality (gap), from high to low, against time] [VLDB15]
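
To make the positional idea behind BordaCount concrete, here is a small sketch. How ties are handled is an assumption made for this example (an element's position is the number of elements in strictly earlier buckets, and elements with equal totals share an output bucket); it is not necessarily the adaptation evaluated in [VLDB15].

```python
from collections import defaultdict

def borda_count(rankings):
    """Order elements by the sum of their positions over all rankings.

    Each ranking is a list of buckets; all rankings are assumed to be
    unified, i.e. defined over the same set of elements.
    """
    score = defaultdict(int)
    for ranking in rankings:
        before = 0  # number of elements placed strictly before this bucket
        for bucket in ranking:
            for e in bucket:
                score[e] += before
            before += len(bucket)
    by_score = defaultdict(list)
    for e, s in score.items():
        by_score[s].append(e)
    # Lower total position means better rank; equal totals are tied.
    return [sorted(by_score[s]) for s in sorted(by_score)]

# The three permutations of slide 9, written as single-element buckets.
perms = [
    [["A"], ["D"], ["C"], ["B"]],
    [["B"], ["A"], ["D"], ["C"]],
    [["D"], ["A"], ["B"], ["C"]],
]
print(borda_count(perms))
```

On the permutations of slide 9 this yields [A, D, B, C], the same order as the consensus shown there. On the three rankings with ties of slide 13 it yields [A, C, E, {B,D}]. The function makes a single pass over the input, which is consistent with positional methods being suggested when speed matters.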

18 Outline
- Need for consensus rankings to order biological data
- Definition of the problem
- Experimental study of existing approaches
- ConQuR-Bio
- Conclusion

19 Which genes are associated with a given disease?
- Synonyms: breast cancer vs mammalian carcinoma (14 772 vs 771 genes, not all included)
- Abbreviations: attention deficit hyperactivity disorders vs ADHD (109 vs 144 genes, 74 in common)
- Linguistic variations: tumour vs tumor (& breast cancer): 681 vs 291 genes
- More precise reformulations: colorectal cancer vs Lynch syndrome (+6 new genes) [DILS14]
Finding all synonyms is time-consuming
Querying with all synonyms yields a huge number of result sets, which have to be ranked

22 I) Reformulations (synonyms) can be automatically generated
- Input: the user's keyword
- Automatic search of synonyms in major biomedical terminologies
- Output: set of synonyms of the input keyword
II) DB querying is automated (based on keywords)
- Input: set of synonyms obtained in I)
- Querying gene databases with each synonym (#queries = #synonyms)
- Output: set of rankings (ranked genes), one ranking per synonym
III) Aggregating using a series of consensus algorithms with a variant of the generalized Kendall τ distance [DILS14]
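
The three steps above can be sketched as a small pipeline. Everything named here is hypothetical: `find_synonyms`, `query_genes` and `aggregate` are placeholders for the terminology lookup, the gene database query and the consensus algorithm, and `sum_of_positions` is a toy stand-in for step III, not the aggregation used by ConQuR-Bio.

```python
from typing import Callable, Iterable, List

Ranking = List[str]

def consensus_for_keyword(
    keyword: str,
    find_synonyms: Callable[[str], Iterable[str]],
    query_genes: Callable[[str], Ranking],
    aggregate: Callable[[List[Ranking]], Ranking],
) -> Ranking:
    # I) Reformulation: collect the synonyms of the user's keyword,
    #    dropping duplicates while keeping their order.
    synonyms = list(dict.fromkeys(find_synonyms(keyword)))
    # II) Querying: one query per synonym, hence one ranking per synonym.
    rankings = [query_genes(s) for s in synonyms]
    # III) Aggregation: build a single consensus ranking.
    return aggregate(rankings)

def sum_of_positions(rankings: List[Ranking]) -> Ranking:
    """Toy aggregator for rankings without ties over the same elements."""
    elements = sorted(rankings[0])
    return sorted(elements, key=lambda g: sum(r.index(g) for r in rankings))
```

Passing the three stages as functions keeps the sketch independent of any particular terminology service or database client; a real implementation would also need the unification step, since the rankings returned for different synonyms rarely cover the same genes.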

23 Highly accessed by members of APHP (Hospitals), Institut Curie, Institut Pasteur

24 Ranking biological data is a complex task
- Many quality criteria, difficult to combine
Consensus ranking approaches
- Classification of consensus ranking algorithms
- Rankings with ties, incomplete data sets
- Study of distances and normalization techniques
- Complexity results, new distances considered, new heuristics
- Guide in the choice of distances and algorithms
- Rank-n-ties platform available
Concrete application to real biological data
- Consensus of query reformulations
- Currently under evaluation based on real datasets (Leukemia)
- Increasing number of reformulations + increasing number of answers: ranking big data sets

25 Bryan Brancotte
Mastodon project QualibioConsensus: LRI (Paris-Sud), LIGM (Marne-la-Vallée), IFB (Institut Français de Bioinformatique)
Other partners: Univ. Montréal, Institut Curie, APHP Georges Pompidou, APHP Paul Brousse

26 Rank aggregation with ties: Experiments and Analysis. B Brancotte, B Yang, G Blin, S Cohen-Boulakia, A Denise, S Hamel. Proceedings of the VLDB Endowment 8 (11)
Interrogation de bases de données biologiques publiques par reformulation de requêtes et classement des résultats avec ConQuR-Bio. B Brancotte, B Rance, A Denise, S Cohen-Boulakia. JOBIM (Journées Ouvertes Biologie Informatique Mathématiques) 2015
Conqur-bio: Consensus ranking with query reformulation for biological data. B Brancotte, B Rance, A Denise, S Cohen-Boulakia. International Conference on Data Integration in the Life Sciences (DILS)
Using medians to generate consensus rankings for biological data. S Cohen-Boulakia, A Denise, S Hamel. International Conference on Scientific and Statistical Database Management (SSDBM)
