Bioinformatics Programming. EE, NCKU Tien-Hao Chang (Darby Chang)



Regular Expression


Text patterns and matches
- A regular expression, or regex for short, is a pattern describing a certain amount of text.
  - In these slides, regular expressions are highlighted, as in regex.
  - The most basic pattern is a literal one: regex simply matches the literal text "regex".
- I will use the term "string" for the text that the regular expression is applied to; strings are highlighted as well.

Literal characters
- The most basic regular expression consists of a single literal character, e.g. a.
  - It matches the first occurrence of that character in the string.
  - Applied to "Jack is a boy", it matches the a in "Jack", not the later a.
  - In these slides I will sometimes use a shorter notation, a: Jack is a boy.
- Eleven characters have special meanings: [ \ ^ $ . | ? * + ( )
  - These are the metacharacters.
  - Escape a metacharacter with a backslash to match it literally.
  - Use 1\+1=2 to match 1+1=2.
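The literal-match and escaping rules above can be tried out directly. The sketch below uses Python's re module, which is an assumption on my part (the slides describe regexes generically and point to Perl); the variable names are illustrative only.

```python
import re

text = "Jack is a boy"

# A literal pattern matches the first occurrence only: the 'a' in "Jack".
first_a_index = re.search(r"a", text).start()

# Unescaped, + is a quantifier, so 1+1=2 means "one or more 1s, then 1=2".
unescaped = re.search(r"1+1=2", "1+1=2")

# Escaped with a backslash, + is matched literally.
escaped = re.search(r"1\+1=2", "1+1=2")

# re.escape adds the backslashes for you when the text comes from elsewhere.
auto_escaped = re.escape("1+1=2")

print(first_a_index, unescaped, escaped.group(), auto_escaped)
```

Using re.escape is usually safer than escaping by hand when the literal text is not known in advance.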

Character classes/sets
- A character class matches only one out of several characters.
  - To match an a or an e, use [ae].
  - gr[ae]y matches gray or grey.
  - A character class matches only a single character: gr[ae]y will not match graay or graey.
  - The order of the characters inside the class does not matter.
- Use a hyphen to specify a range of characters.
  - [0-9] matches a single digit between 0 and 9.
  - Combine ranges: [0-9a-fA-F] matches a single hexadecimal digit.
  - Combine ranges and single characters: [0-9a-fxA-FX] matches a hexadecimal digit or the letter x.
- A caret after the opening square bracket negates the class.
  - q[^x] matches qu in question.
  - It does not match Iraq, since there is no character after the q for the negated character class to match.
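A short sketch of the character-class examples, again assuming Python's re module; the test strings are made up for illustration.

```python
import re

# [ae] matches exactly one character, so graay and graey are skipped.
gray_or_grey = re.findall(r"gr[ae]y", "gray grey graay graey")

# Ranges can be combined inside one class: any single hexadecimal digit.
hex_digits = re.findall(r"[0-9a-fA-F]", "x3Fg")

# A negated class still has to consume one character.
negated_question = re.search(r"q[^x]", "question")   # matches "qu"
negated_iraq = re.search(r"q[^x]", "Iraq")           # nothing follows the q

print(gray_or_grey, hex_digits, negated_question.group(), negated_iraq)
```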

Shorthand character classes
- \d matches a single character that is a digit.
- \w matches a word character: alphanumeric characters plus underscore.
- \s matches a whitespace character, including tabs and line breaks.
- \S is the negation of \s: it matches any non-whitespace character.
- The actual characters matched by the shorthands depend on the software you are using.
    $ man perlre
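The shorthand classes can be checked on a small sample. This is a sketch in Python's re module (an assumption, since the slide refers to perlre); the sample string is invented. It also illustrates the point that the matched characters depend on the software: in Python 3, \d on text strings accepts any Unicode decimal digit unless the re.ASCII flag is given.

```python
import re

sample = "ID_7 costs 42\tUSD\n"

digits = re.findall(r"\d", sample)       # single digit characters
words = re.findall(r"\w+", sample)       # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)       # space, tab and line feed
non_spaces = re.findall(r"\S+", sample)  # runs of anything that is not whitespace

# Engine-specific behaviour: an Arabic-Indic digit counts as \d by default,
# but not when matching is restricted to ASCII.
unicode_digit = re.search(r"\d", "\u0663")
ascii_only_digit = re.search(r"\d", "\u0663", flags=re.ASCII)

print(digits, words, spaces, non_spaces)
```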

Non-printable characters
- Use special character sequences to put non-printable characters in a regex.
  - \t for tab (ASCII 0x09)
  - \r for carriage return (0x0D)
  - \n for line feed (0x0A)
  - Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n.
- Use \xFF to match a specific character by its hexadecimal index in the character set.
  - \xA9 matches the copyright symbol (in Latin-1).
- Use \uFFFF for a Unicode character (if supported).
  - \u20AC matches the euro currency sign.
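As a small illustration of these escape sequences, the sketch below (Python's re module, assumed here; the sample strings are invented) converts Windows line endings to UNIX ones and matches the copyright symbol by its hexadecimal code.

```python
import re

windows_text = "line one\r\nline two\r\n"

# Replace each \r\n pair with a bare \n.
unix_text = re.sub(r"\r\n", "\n", windows_text)

# \t inside the pattern matches a tab character.
fields = re.split(r"\t", "name\tlength\tscore")

# \xA9 is the copyright symbol, addressed by its hexadecimal code.
copyright_match = re.search(r"\xA9", "\u00a9 all rights reserved")

print(repr(unix_text), fields, copyright_match.start())
```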

The dot
- The dot, ., matches (almost) any character.
  - It matches a single character, except line break characters.
  - It is short for [^\n].
  - gr.y matches gray, grey, gr%y, etc.
- Most regex engines have a "dot matches all" or "single line" mode that makes the dot match any single character, including line breaks.
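A quick sketch of the dot and of "dot matches all" mode. Python's re module is assumed here, where that mode is spelled re.DOTALL; other engines use different names for the same switch.

```python
import re

# The dot stands for any single character except a line break.
dot_matches = re.findall(r"gr.y", "gray grey gr%y gr\ny")

# By default the dot refuses to match the line feed between a and b.
default_dot = re.search(r"a.b", "a\nb")

# With DOTALL the dot also matches line breaks.
dotall_dot = re.search(r"a.b", "a\nb", flags=re.DOTALL)

print(dot_matches, default_dot, dotall_dot is not None)
```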

Anchors
- Anchors do not match any characters; they match a position.
  - ^ matches at the start of the string.
  - $ matches at the end of the string.
  - Most regex engines have a "multi-line" mode that makes ^ match after any line break, and $ before any line break.
  - b$ matches only the final b in bob.
- \b matches at a word boundary.
  - A word boundary is a position between a character that can be matched by \w and a character that cannot be matched by \w.
  - \b also matches at the start and/or end of the string if the first and/or last characters in the string are word characters.
  - \B matches at every position where \b cannot match.
  - \bis\b: This island is beautiful (only the standalone word "is" matches).
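The anchor examples can be reproduced as follows. This is a sketch assuming Python's re module, where multi-line mode is the re.MULTILINE flag; the two-line sample string is invented.

```python
import re

# b$ can only match the b that sits right before the end of the string.
last_b_index = re.search(r"b$", "bob").start()

lines = "first\nsecond"

# Default mode: ^ matches only at the very start of the string.
single_line_starts = re.findall(r"^\w+", lines)

# Multi-line mode: ^ also matches right after each line break.
multi_line_starts = re.findall(r"^\w+", lines, flags=re.MULTILINE)

# \b keeps "is" from matching inside "This" and "island".
is_positions = [m.start() for m in re.finditer(r"\bis\b", "This island is beautiful")]

print(last_b_index, single_line_starts, multi_line_starts, is_positions)
```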

Alternation
- Alternation is the regular expression equivalent of "or"; the alternatives are separated by a vertical bar.
  - cat|dog: About cats and dogs
- You can add as many alternatives as you want: cat|dog|mouse|fish
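A minimal sketch of alternation, assuming Python's re module; the second test sentence is invented for illustration.

```python
import re

# Each alternative is tried at every position, left to right.
two_way = re.findall(r"cat|dog", "About cats and dogs")

# Any number of alternatives can be chained with |.
four_way = re.findall(r"cat|dog|mouse|fish", "a fish chased a mouse")

print(two_way, four_way)
```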

Repetition
- ? makes the preceding token in the regular expression optional.
  - colou?r matches colour or color.
- * matches the preceding token zero or more times.
- + matches the preceding token once or more.
  - <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes.
  - <[A-Za-z0-9]+> is easier to write but also matches invalid tags such as <1>.
- {} specifies a specific amount of repetition.
  - \b[1-9][0-9]{3}\b matches a number from 1000 to 9999.
  - \b[1-9][0-9]{2,4}\b matches a number from 100 to 99999.
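The quantifier examples can be checked against a few boundary cases. The sketch assumes Python's re module, and the test strings are chosen only to probe the edges of each pattern.

```python
import re

# ? makes the u optional; a doubled u is not accepted.
spellings = re.findall(r"colou?r", "color colour colouur")

tags = "<b> <H1> <1>"
# A tag name must start with a letter, then any number of letters or digits.
strict_tags = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", tags)
# The shorter pattern also lets the invalid <1> through.
loose_tags = re.findall(r"<[A-Za-z0-9]+>", tags)

# {3} means exactly three more digits, so only four-digit numbers match.
four_digit = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000")
# {2,4} means two to four more digits: three- to five-digit numbers.
three_to_five_digit = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

print(spellings, strict_tags, loose_tags, four_digit, three_to_five_digit)
```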

Greedy and lazy repetition
- The repetition operators, or quantifiers, are greedy.
  - They expand the match as far as they can, and only give back if they must to satisfy the remainder of the regex.
  - <.+>: This is a <EM>first</EM> test (matches <EM>first</EM>, not just <EM>)
- Place a question mark after the quantifier to make it lazy, i.e., stop matching as soon as possible.
  - <.+?>: This is a <EM>first</EM> test (matches <EM>)
- A better solution is to use <[^<>]+> to quickly match an HTML tag without regard to attributes.
  - The negated character class is more specific than the dot, which helps the regex engine find matches quickly.
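The difference between the three patterns shows up clearly on the slide's own sentence. A minimal sketch, assuming Python's re module:

```python
import re

html = "This is a <EM>first</EM> test"

# Greedy: .+ runs to the end of the line, then backs off to the last >.
greedy = re.search(r"<.+>", html).group()

# Lazy: .+? stops at the first > it can reach.
lazy = re.search(r"<.+?>", html).group()

# Negated class: never crosses a < or >, so no backtracking over the text.
negated = re.findall(r"<[^<>]+>", html)

print(greedy, lazy, negated)
```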

Grouping and backreferences
- Place round brackets, (), around multiple tokens to group them together.
  - You can then apply a quantifier to the group.
  - Set(Value)? matches Set or SetValue.
- Round brackets create a capturing group.
  - The above example has one group.
  - How to access the group's contents depends on the software or programming language you are using.
  - Group zero always contains the entire regex match.
  - Set(Value)?: SetValue, then $0 = SetValue, $1 = Value
  - Set(Value)?: Set, then $0 = Set, $1 is empty
- Use the special syntax Set(?:Value)? to group tokens without creating a capturing group.
  - This is more efficient if you do not need the group's contents.
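How the groups are accessed depends on the language. As one example, the sketch below assumes Python's re module, where group(0) plays the role of $0 and group(1) the role of $1; an optional group that did not take part in the match is reported as None.

```python
import re

with_value = re.fullmatch(r"Set(Value)?", "SetValue")
whole_match = with_value.group(0)   # the entire match
first_group = with_value.group(1)   # contents of the first capturing group

without_value = re.fullmatch(r"Set(Value)?", "Set")
missing_group = without_value.group(1)   # the group did not participate

# (?:...) groups the tokens but captures nothing.
non_capturing = re.fullmatch(r"Set(?:Value)?", "SetValue")
captured = non_capturing.groups()

print(whole_match, first_group, missing_group, captured)
```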

Look-around
- Look-around is a special kind of group.
  - The tokens inside the group are matched normally, but then the regex engine makes the group give up its match and keeps only the result (match or no match).
  - Look-around matches a position, just like anchors.
- q(?=u) matches the q in question, but not Iraq (positive look-ahead).
  - (?=u) matches at each position in the string before a u.
  - The u is not part of the overall regex match.
- q(?!u) matches the q in Iraq, but not question (negative look-ahead).
- (?<=a)b matches the b in abc (positive look-behind).
- (?<!a)b fails to match abc (negative look-behind).
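All four look-around forms can be exercised with the slide's own examples. The sketch assumes Python's re module, which supports look-ahead and fixed-width look-behind.

```python
import re

# Positive look-ahead: the u must follow, but is not part of the match.
ahead_question = re.search(r"q(?=u)", "question")
ahead_iraq = re.search(r"q(?=u)", "Iraq")

# Negative look-ahead: the q must not be followed by a u.
not_ahead_iraq = re.search(r"q(?!u)", "Iraq")
not_ahead_question = re.search(r"q(?!u)", "question")

# Positive and negative look-behind on the character before b.
behind = re.search(r"(?<=a)b", "abc")
not_behind = re.search(r"(?<!a)b", "abc")

print(ahead_question.group(), ahead_iraq, not_ahead_iraq.group(),
      not_ahead_question, behind.start(), not_behind)
```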

Reference
- Regular Expression Quick Start: http://www.regularexpressions.info/quickstart.html

We have done a lot of exercises

Now, let's talk about bioinformatics programming in real cases

Sequence alignment
- We have learnt/implemented it twice.
  - dynamic programming
  - longest common sub-string/sub-sequence
  - sequence alignment
  - DNA/protein sequence
  - residue substitution
- We know that
  - the time complexity is O(n^2)
  - backtracking, alternative alignments
- That's all?
  - Theoretical
  - Applicative
- No! There are always better algorithms. That's why we always have new papers to read.
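As a reminder of where the O(n^2) comes from, here is a minimal sketch of the dynamic-programming table for a global alignment score. The scoring scheme (match +1, mismatch -1, gap -1) and the function name are assumptions made for illustration; the slides do not fix them, and backtracking to recover the alignment itself is left out.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Fill a (len(a)+1) x (len(b)+1) table; each cell costs O(1) work."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: align a prefix against nothing but gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("AGC", "AC"))
```

For two sequences of length n the two nested loops visit n^2 cells, which is the cost that the faster heuristics try to avoid.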

Sequence alignment
- Some advanced ideas
  - Band alignment
  - Arbitrary region
- When is this point never considered?

Seq = AGATCGAT
      12345678

3-mer   Positions
AAA
AAC
...
AGA     1
...
ATC     3
...
CGA     5
...
GAT     2, 6
TCG     4
...
TTT

The state-of-the-art solutions: seeding and extension.
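The 3-mer table for AGATCGAT can be built with a few lines of code. The sketch below is in Python, and the function names build_kmer_index and find_seeds are hypothetical; it covers only the seeding step (looking up exact k-mer hits), not the extension step, and the query TCGAT is an invented example.

```python
from collections import defaultdict


def build_kmer_index(seq, k=3):
    """Map every k-mer in seq to the list of its 1-based start positions."""
    index = defaultdict(list)
    for start in range(len(seq) - k + 1):
        index[seq[start:start + k]].append(start + 1)
    return dict(index)


def find_seeds(query, index, k=3):
    """Return (query position, sequence position) pairs of exact k-mer hits."""
    seeds = []
    for start in range(len(query) - k + 1):
        for hit in index.get(query[start:start + k], []):
            seeds.append((start + 1, hit))
    return seeds


index = build_kmer_index("AGATCGAT")
seeds = find_seeds("TCGAT", index)
print(index)
print(seeds)
```

Each seed is a candidate starting point; a real tool would then extend the seeds into longer alignments.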

This is not Bioinformatics algorithm

Protein clustering
- In: a FASTA file and an integer k
- Out: k clusters of proteins
- Requirement
  - invoke BLAST
  - complexity/teamwork report
  - using Perl would be the best
- Bonus
  - k-means algorithm
  - invoke clustering package

Deadline
- 2010/5/4 23:59
- Zip your code, a step-by-step README, complexity analyses and anything worthy of extra credit.
- Email to darby@ee.ncku.edu.tw.

BLAST
- Download protein sequences from UniProt
    $ wget -O ytf.fa 'http://www.uniprot.org/uniprot/?query=saccharomyces+cerevisiae+transcription+factor+and+reviewed%3ayes&force=yes&format=fasta'
- A Unix tip using grep and regular expressions
    $ grep '^>' ytf.fa | wc -l    # how many sequences
    $ grep -c '^>' ytf.fa         # a better version
- Download BLAST from NCBI
  - http://blast.ncbi.nlm.nih.gov/
  - I prefer this version: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/latest/blast-2.2.23-ia32-linux.tar.gz
- Execution
    $ formatdb -i ytf.fa                                      # building indices
    $ blastall -d ytf.fa -i ytf.fa -p blastp > ytf.bo         # default output
    $ blastall -d ytf.fa -i ytf.fa -m 8 -p blastp > ytf.bo    # tabular output
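For the scripting side, the two helpers below sketch how the files from these commands might be read. They are written in Python as an illustration (the assignment itself recommends Perl), the function names are hypothetical, and the sample records are invented. The parser assumes the 12-column tab-separated layout of legacy blastall tabular output (-m 8), in which the last two columns are the e-value and the bit score.

```python
import re


def count_fasta_records(lines):
    """Same idea as grep -c '^>': count the header lines."""
    return sum(1 for line in lines if re.match(r"^>", line))


def parse_blast_tabular(lines):
    """Turn tabular BLAST hits into dictionaries, skipping blanks and comments."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        hits.append({
            "query": fields[0],
            "subject": fields[1],
            "identity": float(fields[2]),
            "evalue": float(fields[10]),
            "bitscore": float(fields[11]),
        })
    return hits


# Invented sample data, for illustration only.
sample_fasta = [">P1 example protein", "MKV", ">P2 another protein", "MRL"]
sample_hits = ["P1\tP2\t45.50\t200\t100\t3\t1\t200\t5\t204\t1e-50\t180.0"]

record_count = count_fasta_records(sample_fasta)
hits = parse_blast_tabular(sample_hits)
print(record_count, hits)
```

With real data, the lines would come from open("ytf.fa") and open("ytf.bo"), and the bit scores or e-values could serve as the pairwise similarities for the clustering step.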