Improving the local alignment of LocARNA through automated parameter optimization

1 Improving the local alignment of LocARNA through automated parameter optimization. Bled. Teresa Müller

2 Introduction: non-coding RNA; high-performing RNA alignment tools enable correct classification; LocARNA: global and local alignment program, a heuristic variant of the Sankoff algorithm [Sankoff, 1985].

3 SMAC [Hutter, Hoos, Leyton-Brown, 2011]: Sequential Model-Based Algorithm Configuration. A black-box tool whose task is to find high-performance parameter settings. It uses a random forest model to choose each new parameter setting and can optimize categorical parameters.

4 Set-up

5 Local alignment. Figure panels: global alignment, local alignment. Accurate local sequence-structure alignment tools are lacking. Challenges of sequence-structure local alignment: find the correct boundaries; find the correct alignment edges.

6 Construct local benchmark set from BRAliBase: BRAliBase ncRNA (green); extract the genomic context (red) from the European Nucleotide Archive; specify a context size L; extract context parts (blue); shuffled context areas.

7 Set-up

8 Sum of Pairs Score (SPS) [Thompson et al., 1999]

$$\mathrm{SPS} = \frac{\text{correct predicted edges}}{\text{number of reference edges}}$$

9 maxsps example

$$\mathrm{maxsps} = \frac{\text{correct predicted edges}}{\max(\text{reference length}, \text{predicted length})}$$
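The two quality measures can be sketched in a few lines of Python. This is only an illustration, not the evaluation code behind the slides: it assumes an alignment is given as a set of edges `(i, k)` (position `i` of sequence A aligned to position `k` of sequence B), and that the "length" of an alignment is passed in explicitly by the caller. The function names and the example edges are hypothetical.

```python
def count_correct_edges(reference_edges, predicted_edges):
    """Number of predicted edges that also occur in the reference alignment."""
    return len(set(reference_edges) & set(predicted_edges))


def sps(reference_edges, predicted_edges):
    """Sum of Pairs Score: correct predicted edges / number of reference edges."""
    return count_correct_edges(reference_edges, predicted_edges) / len(set(reference_edges))


def maxsps(reference_edges, predicted_edges, reference_length, predicted_length):
    """maxsps: correct predicted edges / max(reference length, predicted length).

    Assumption: both lengths are supplied by the caller, because the slides
    do not spell out how alignment length is measured.
    """
    correct = count_correct_edges(reference_edges, predicted_edges)
    return correct / max(reference_length, predicted_length)


# Hypothetical example: the prediction recovers 3 of 4 reference edges
# but extends the alignment by 3 extra edges into the context.
reference = {(1, 1), (2, 2), (3, 3), (4, 4)}
predicted = {(2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7)}

print(sps(reference, predicted))
print(maxsps(reference, predicted, len(reference), len(predicted)))
```

In this example the edge count is used as the alignment length (an assumption). Because maxsps divides by the larger of the two lengths, a prediction that runs past the true boundaries scores lower than its SPS alone would suggest.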

10 Default vs. optimized maxsps. Default parameter setting: low SI gives low maxsps quality; at high SI the alignment edges are easier to find. Optimized parameter setting: improvement for SI > 40; for low SI (40 or below) no change (fewer data points).

11 Validation of best run. Table: default parameters vs. optimized parameters on the validation sets Dataset context 20 and Dataset context 200. Improvement: 27%.

12 Position penalty. Observation: background bonus; conserved structures can be found in the context. Solution: position penalty λ; each position of the local alignment is penalized by λ.
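A small numeric sketch may help show why a per-position penalty counteracts the background bonus. It assumes the penalty is simply subtracted once per aligned position from the raw local alignment score; the function name and all numbers are hypothetical and chosen only for illustration.

```python
def penalized_score(raw_score, n_positions, lam):
    """Raw local alignment score minus lam for every position of the alignment.

    Assumption: the penalty is linear in the number of aligned positions.
    """
    return raw_score - lam * n_positions


# Hypothetical core alignment covering only the ncRNA.
core_score, core_positions = 1000, 50

# Hypothetical extension into the context: 20 more positions, each of
# which adds a small background bonus of 3 to the raw score.
extended_score, extended_positions = core_score + 20 * 3, core_positions + 20

for lam in (0, 5):
    core = penalized_score(core_score, core_positions, lam)
    extended = penalized_score(extended_score, extended_positions, lam)
    print(lam, core, extended, "core wins" if core > extended else "extension wins")
```

Without a penalty the extended alignment scores higher, so the local alignment grows into the context. With λ = 5 each extra position costs more than its background bonus, and the shorter core alignment is preferred.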

13 Position penalty 5 optimization. Figure panels: default parameter setting; optimized parameter setting using position penalty 5. Parameter optimization based on dataset SI. Improvement even without optimization.

14 Summary: novel local benchmark set; new local quality measure maxsps; learning improves maxsps (27%); the position penalty alone improves maxsps; additional improvement by learning. Outlook: more parameters, position penalty validation, additional benchmark set. Table columns: parameter, gap, gap opening, structure weight, tau factor; rows: default, first optimized, penalty 5 optimized.

15 Acknowledgement: Prof. Dr. Rolf Backofen, Dr. Frank Hutter, Dr. Sebastian Will, Milad Miladi, Christina Otto. Thanks for your attention.

16 Outlook: optimization over the exhaustive set of parameters; validate the position penalty; use a different validation set; investigate failed alignments.

17 Position penalty 5. Figure panels: default parameter setting; default with position penalty 5.

18 LocARNA scoring function [Will et al., 2007]

$$
sw \cdot \sum_{(ij;kl) \in S} \left( \Psi^{A}_{ij} + \Psi^{B}_{kl} \right)
+ tf \cdot \sum_{(ij;kl) \in S} \left( \sigma(A_i, B_k) + \sigma(A_j, B_l) \right)
+ \sum_{(i,k) \in \mathcal{A}_S} \sigma(A_i, B_k)
- N_{gap}\,\gamma - N_{ogap}\,\beta
$$

Ψ_ij: base pair weight; σ(A_i, B_k): (mis-)match score; γ: gap penalty; N_gap: no. of gaps; β: gap opening penalty; N_ogap: no. of gap openings; sw: structure weight; tf: tau factor.

Parameter optimization: algorithm configuration.
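To make the role of the four tuned parameters concrete, the scoring function can be evaluated directly for a given alignment. The sketch below is not LocARNA code. It assumes that the two gap terms are subtracted (they are penalties), that sw and tf act as plain multiplicative factors, and that matched arcs are given as tuples `(i, j, k, l)` with 0-based indices. All names and values are hypothetical.

```python
def alignment_score(seq_a, seq_b, arc_matches, base_matches,
                    psi_a, psi_b, sigma, n_gap, n_ogap,
                    gamma, beta, sw, tf):
    """Evaluate the slide's scoring function for one fixed alignment.

    arc_matches:  iterable of (i, j, k, l), arc (i, j) of A matched to arc (k, l) of B
    base_matches: iterable of (i, k), remaining matched positions (the set A_S)
    psi_a, psi_b: dicts mapping an arc to its base pair weight
    sigma:        function returning the (mis-)match score of two bases
    """
    arc_matches = list(arc_matches)
    structure = sum(psi_a[(i, j)] + psi_b[(k, l)] for i, j, k, l in arc_matches)
    arc_sequence = sum(sigma(seq_a[i], seq_b[k]) + sigma(seq_a[j], seq_b[l])
                       for i, j, k, l in arc_matches)
    sequence = sum(sigma(seq_a[i], seq_b[k]) for i, k in base_matches)
    # Assumption: gap terms enter with a negative sign.
    return (sw * structure + tf * arc_sequence + sequence
            - n_gap * gamma - n_ogap * beta)


def simple_sigma(x, y):
    """Hypothetical (mis-)match score: +1 for identical bases, -1 otherwise."""
    return 1.0 if x == y else -1.0


# Hypothetical toy alignment of GAAAC with GAAC: one matched arc,
# two matched unpaired bases, and one gapped position in A.
score = alignment_score(
    seq_a="GAAAC", seq_b="GAAC",
    arc_matches=[(0, 4, 0, 3)],
    base_matches=[(1, 1), (2, 2)],
    psi_a={(0, 4): 2.0}, psi_b={(0, 3): 1.5},
    sigma=simple_sigma,
    n_gap=1, n_ogap=1, gamma=0.5, beta=1.0, sw=1.0, tf=0.5,
)
print(score)
```

Raising sw shifts weight toward matched base pairs, tf controls how much the sequence identity of matched arcs counts, and γ and β control how costly gaps are. These are the four values the configurator tunes.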

19 SMAC algorithm [Hutter, Hoos, Leyton-Brown, 2012]. Specify the parameter configuration space Θ. Π: instance space. θinc: best parameter setting seen so far. R: tracks parameter settings and observed performance. Initialization: set the first incumbent θinc and R.

20 Loop iterations [Hutter, Hoos, Leyton-Brown, 2012]. 1. FitModel: built using R. 2. SelectConfiguration: the model finds promising configurations. 3. Intensify: compare promising configurations against the incumbent.
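The three loop steps can be illustrated with a heavily simplified sequential model-based optimization loop. This is a sketch under stated assumptions, not SMAC itself. The random forest is swapped for a k-nearest-neighbour mean as the surrogate model, configurations are selected by lowest predicted cost among random candidates, and "intensify" is reduced to comparing mean cost over all instances. The cost function `toy_cost`, the parameter bounds, and the instance list are hypothetical.

```python
import random


def toy_cost(theta, instance):
    """Hypothetical cost of configuration theta on one instance (lower is better)."""
    gap, gap_opening = theta
    return (gap - 100.0) ** 2 + (gap_opening - 400.0) ** 2 + instance


def fit_model(history, k=3):
    """FitModel: build a surrogate from R (here a k-NN mean, NOT a random forest)."""
    def predict(theta):
        nearest = sorted(
            history,
            key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], theta)),
        )[:k]
        return sum(cost for _, cost in nearest) / len(nearest)
    return predict


def select_configuration(predict, bounds, rng, n_candidates=50):
    """SelectConfiguration: pick the random candidate with the best predicted cost."""
    candidates = [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
                  for _ in range(n_candidates)]
    return min(candidates, key=predict)


def mean_cost(theta, instances, cost):
    """Average cost of theta over all instances (the instance space Pi)."""
    return sum(cost(theta, pi) for pi in instances) / len(instances)


def smbo(cost, bounds, instances, n_iterations=30, seed=0):
    rng = random.Random(seed)
    # Initialization: first incumbent and run history R.
    incumbent = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
    incumbent_cost = mean_cost(incumbent, instances, cost)
    history = [(incumbent, incumbent_cost)]
    for _ in range(n_iterations):
        predict = fit_model(history)                              # 1. FitModel
        challenger = select_configuration(predict, bounds, rng)   # 2. SelectConfiguration
        challenger_cost = mean_cost(challenger, instances, cost)  # 3. Intensify (simplified)
        history.append((challenger, challenger_cost))
        if challenger_cost < incumbent_cost:
            incumbent, incumbent_cost = challenger, challenger_cost
    return incumbent, incumbent_cost, history


best, best_cost, run_history = smbo(toy_cost, [(0.0, 500.0), (0.0, 1000.0)], [0.0, 1.0, 2.0])
print(best, best_cost)
```

The incumbent is replaced only when a challenger is strictly better on the measured cost, so the best observed quality never gets worse from one iteration to the next.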

22 Local optimization results (context 100). Table columns: gap, gap opening, structure weight, tau, train quality.

23 ncRNA sensitivity (RS) and context specificity (CS). Measure the aligned areas for each sequence; calculate the mean of both values.

                                         | Alignment edge in reference alignment | No alignment edge in reference alignment
Alignment edge in predicted alignment    | True positive (TP)                    | False positive (FP)
No alignment edge in predicted alignment | False negative (FN)                   | True negative (TN)

ncRNA sensitivity (RS):

$$RS_A = \frac{TP_A}{TP_A + FN_A}$$

Context specificity (CS):

$$CS_A = \frac{TN_A}{TN_A + FP_A}$$
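The two measures can be computed per sequence and then averaged, as in the sketch below. It assumes the counts are taken per sequence position: a position counts as "having an alignment edge" if it is aligned to some position of the other sequence. That reading, the function names, and the example positions are assumptions for illustration only.

```python
def sensitivity_specificity(reference_positions, predicted_positions, sequence_length):
    """Return (RS, CS) for one sequence.

    reference_positions: positions with an alignment edge in the reference alignment
    predicted_positions: positions with an alignment edge in the predicted alignment
    sequence_length:     total number of positions (ncRNA plus context)
    """
    ref = set(reference_positions)
    pred = set(predicted_positions)
    tp = len(ref & pred)                     # edge in both alignments
    fn = len(ref - pred)                     # edge only in the reference
    fp = len(pred - ref)                     # edge only in the prediction
    tn = sequence_length - len(ref | pred)   # edge in neither alignment
    rs = tp / (tp + fn) if tp + fn else 0.0
    cs = tn / (tn + fp) if tn + fp else 0.0
    return rs, cs


def mean_of_both(values_a, values_b):
    """Mean of the (RS, CS) pairs of the two sequences."""
    return tuple((a + b) / 2 for a, b in zip(values_a, values_b))


# Hypothetical example: the prediction for A is shifted into the context,
# the prediction for B matches the reference exactly.
rs_cs_a = sensitivity_specificity({3, 4, 5, 6}, {4, 5, 6, 7, 8}, 10)
rs_cs_b = sensitivity_specificity({2, 3, 4, 5}, {2, 3, 4, 5}, 8)
print(rs_cs_a, rs_cs_b, mean_of_both(rs_cs_a, rs_cs_b))
```

RS drops when parts of the ncRNA are missed, while CS drops when the predicted alignment spills into the context, so the pair separates the two kinds of boundary error.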

24 Optimization based on uniform k2-bralibase. Table columns: quality, train, default, difference, improvement (%); rows: SPS, SPS MCC.
Default parameter setting: -gap 350 -gap-opening 500 -struct-weight 200 -tau 0
Final parameter setting: -gap 68 -gap-opening 807 -struct-weight 210 -tau 72
Figure labels: Begin, End, Default; gap, gap-opening, struct-weight, tau.

25 Distribution of refsps and maxsps (context 100)

26 Heatmap: default compared to optimized for sensitivity/specificity

27 Parameter distribution of uniform k2-bralibase (SPS), global. Table columns: gap, gap opening, structure weight, tau, train quality.

28 K-fold parameters (SI 50-70). Table columns: gap, gap opening, structure weight, tau, train quality.

29 K-fold parameters (SI 71-90). Table columns: gap, gap opening, structure weight, tau, train quality.

30 Distribution of ncRNA sensitivity and context specificity (context 20)

31 Distribution of refsps and maxsps (context 20)

32 refsps and maxsps

$$\mathrm{refsps} = \frac{\text{no. of correct edges}}{\text{reference length}}$$

$$\mathrm{maxsps} = \frac{\text{no. of correct edges}}{\max(\text{reference length}, \text{predicted length})}$$

33 Dataset size

Full dataset:
Dataset             | Size
Full_Global_Dataset | 2090
Full_Local_Dataset  | 1370

SI dataset:
Dataset | Training size | Validation size
IS_     |               |
IS_     |               |

34 K-fold validation and default quality. Panels: (Dataset: SI 50-70), (Dataset: SI 71-90). Table columns: dataset, mean difference, standard deviation.

35 Average difference. Table columns: dataset, mean difference, standard deviation. Row labels: swapped, swapped, (mcc), (mcc).

36 Random Forest: the data at each node is divided through a split criterion; decisions can be based on parameters with continuous (real) values; each leaf specifies the runtime; take the mean runtime. [Hutter et al., Sequential Model-based Optimization for General Algorithm Configuration, LION 5, Rome, January 18, 2011]

37 Global alignment dataset. Dataset: BRAliBase [Wilm et al., 2006]. BRAliBase: Benchmark RNA Alignment database. Equal number of instances per family. K-fold cross-validation showed no overfitting.

38 Optimized parameters and quality. Table columns: gap penalty, gap opening penalty, structure weight, tau, train quality (1 - SPS MCC).

39 Optimized parameter evaluation. Table columns: quality, train, default, difference, improvement (%); rows: 1 - SPS, 1 - SPS MCC.
