UNIX, GNU/Linux and simple tools for data manipulation

Dr Jean-Baka DOMELEVO ENTFELLNER
BecA-ILRI Hub
Basic Bioinformatics Training Workshop @ILRI Addis Ababa
Wednesday, December 13th, 2017

Outline

1 UNIX & GNU/Linux: brief history and introduction
2 Using the Bash shell
    Your first commands
    Filesystems and permissions
    Bash special characters and features
    Quoting in Bash
3 So many tools
    You CANNOT live without your man
    Data manipulation commandline tools

UNIX & GNU/Linux: introduction

GNU/Linux is an operating system (OS). It belongs to a broad family of OSes, the UNIX family.

Operating system: definition
- the unique interface between the computer (hardware) and the different programs (software) users run on it
- allows different programs and different users to use the same machine concurrently
- implements a filesystem, a console environment, a graphical environment, drivers for keyboard and mouse, etc.
- examples of operating systems: Windows (Microsoft), Mac OS X (Apple), Android (Google), GNU/Linux, FreeBSD, etc.

Linux is only the kernel of GNU/Linux systems, responsible for granting access to the resources on the host and for time-sharing between processes.

UNIX & GNU/Linux systems: timeline

GNU/Linux: a fairly recent member of an old and huge family (see http://www.levenez.com/unix/)

1969: UNICS
1971: UNIX Time-Sharing System V1
1982: SunOS 1.0
1983: UNIX System V; GNU project announced
1991: Linux 0.01
1994: Linux 1.0
1999: Darwin 0.1; Mac OS X Server 1.0
2008: Android 1.0 (derived from Linux 2.6.23)
2013: Linux 3.9

Linux distributions: different flavours of the same OS

The GNU/Linux operating system comes in different distributions. Three distributions have long been true beacons and have spawned many offspring:

1 Debian (1993), which gave rise to Ubuntu (2004) and Linux Mint (2006)
2 Slackware (1993), itself derived from SLS (1992), which gave rise to SuSE (1998)
3 RedHat (late 1994), which gave rise to CentOS and Fedora (both 2003)

For a full account, see http://futurist.se/gldt

What makes UNIX systems superior to the Windows family

- UNIX gives you more control over your computer (no hidden actions, no undesired pieces of software).
- UNIX environments are far less exposed to viruses, although no system is fully immune.
- UNIX enables you to harness the full computational power of your machine.
- UNIX systems have been designed from their origin to be massively multi-user and multi-process systems.
- UNIX systems are generally considered more secure than Windows, in part thanks to their strict separation of user privileges.

Take-home message
The true power of UNIX (and so of GNU/Linux) lies in its commandline interface.

Bash: a shell environment

Bash is the most popular shell environment on GNU/Linux systems. It stands for "Bourne-Again SHell".

Shell environments are designed to:
- interact with the host filesystem (browse and create directories, see the content of files, etc.),
- interact with the installed software (install, run, etc.),
- log in to remote hosts (telnet, ssh),
- perform all of the above through automated processes (scripts).

Shells are at the same time commandline environments (run one command at a time) and scripting environments (write and run scripts).

On most GNU/Linux distributions, Bash is accessible through the "Terminal" icon.

Standard structure of a UNIX command

Synopsis of a command
    <command> <options> <objects>

For example:
    ls                   (only the command)
    ls -l                (command plus an option)
    ls -l -h h3a*        (command, two options and one object)
    ls -lh h3a*          (single-letter options can be concatenated)
    cp one two           (command and two objects)
    man head             (command and one object)
    head -n 2 one        (an option with a value)
    head --lines=2 one   (same command, GNU-style long option)

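To see these forms side by side, here is a minimal sketch run in a scratch directory. The file names one and two follow the slide, but their content is invented for the illustration, and the long option assumes GNU head.

```shell
# Work in a throwaway directory so no existing file is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical five-line file named "one", as in the slide's examples.
printf 'line1\nline2\nline3\nline4\nline5\n' > one

# An option with a value, then the same request as a GNU long option.
short_form=$(head -n 2 one)
long_form=$(head --lines=2 one)
echo "$short_form"

# Command and two objects: copy "one" to "two".
cp one two

# Concatenated single-letter options, with wildcards as the objects.
ls -lh o* t*
```

Both head invocations should print the same two lines; only the spelling of the option differs.
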
UNIX filesystems

Filesystems are hierarchies. The filesystem layout of a UNIX machine is standardized. Under the root (/) are:

    /bin     essential command binaries
    /boot    static files of the boot loader
    /dev     device files (special files to access your devices)
    /etc     host-specific system configuration files
    /home    user home directories (e.g. /home/peter, /home/sarah, etc.)
    /lib     essential shared libraries and kernel modules
    /media   mount point for removable media (e.g. CD-ROMs & flash disks)
    /mnt     old-style mount point for any media
    /tmp     system-wide temporary folder, writable by anyone

File permissions

Three (four) types of rights:
- right to read from a file (r)
- right to write to it (w)
- right to execute a binary file or a script (x)
- right to traverse a directory (x)

Three types of people:
- the owner of a file (u)
- the members of the group the file belongs to (g)
- the rest of the world, the others (o)

Typical line of output from ls -l (here in a French locale, where juil. stands for July)
    -rw-r--r-- 1 jbde jbde 171104 juil. 6 12:48 awk.dvi

File permissions explained [figure]

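As a small illustration of how the permission string changes, the sketch below creates a hypothetical file run_me.sh in a scratch directory and grants the execute right to its owner. The file name and the 644 starting mode are assumptions chosen for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical script file; 644 means rw- for the owner, r-- for group and others.
touch run_me.sh
chmod 644 run_me.sh
before=$(ls -l run_me.sh | cut -c 1-10)

# Grant the execute right (x) to the owner (u) only.
chmod u+x run_me.sh
after=$(ls -l run_me.sh | cut -c 1-10)

echo "before: $before"
echo "after:  $after"
```

The first character of the string is the file type (- for a regular file, d for a directory); the next three triplets are the rights of the owner, the group and the others, in that order.
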
Why it is often necessary to quote strings or escape chars

Some characters have a special meaning for the tools you use, e.g. the commandline interpreter Bash:
- spaces or tabs are logical separators between elements on the commandline: cd /tmp
- a dollar sign introduces Bash variables: echo $PATH
- a star means all the files (wildcard): cat *
- the greater-than sign is interpreted as a redirection: cat * > listing.txt
- the vertical bar pipes the output of some command into the input of another: grep h3a long_course.htm | wc -l
- ...

Escaping or quoting prevents these characters from being interpreted by the shell.

Escaping a single character

In Unix, prepending a backslash (\) escapes the character following the backslash.

    > echo $PATH
    /home/jbde/bin:/usr/local/bin:/usr/bin:/bin
    > echo \$PATH
    $PATH

And if a filename contains spaces, e.g. "named with spaces.txt":

    > cat named with spaces.txt
    cat: named: No such file or directory
    cat: with: No such file or directory
    cat: spaces.txt: No such file or directory
    > cat named\ with\ spaces.txt
    <produces the content of the file>

Strong quoting with single quotes

You can also quote a string to prevent the spaces it contains from being interpreted:

    > cat 'named with spaces.txt'
    <produces the content of the file>

Generally speaking, single quotes do not allow any kind of interpretation/substitution/expansion.

    > echo 'Your PATH variable contains $PATH'
    Your PATH variable contains $PATH

Weak quoting with double quotes

While preventing the spaces they contain from being interpreted, double quotes allow the expansion of Bash variables:

    > cat "named with spaces.txt"
    <produces the content of the file>
    > echo "Your PATH variable contains $PATH"
    Your PATH variable contains /home/jbde/bin:/usr/local/bin:/usr/b

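The three mechanisms (backslash, single quotes, double quotes) can be compared on the same variable. This is a minimal sketch; the variable sample and its value are hypothetical and only serve the demonstration.

```shell
# Hypothetical variable used only for this comparison.
sample="BecA"

escaped=$(echo \$sample)    # backslash: the dollar sign stays literal
strong=$(echo '$sample')    # single quotes: no expansion at all
weak=$(echo "$sample")      # double quotes: the variable is expanded

echo "escaped: $escaped"
echo "strong:  $strong"
echo "weak:    $weak"

# Quoting also protects spaces in a file name.
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/named with spaces.txt"
content=$(cat "$workdir/named with spaces.txt")
echo "$content"
```
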
Using Bash every day

Bash has nice features you should use to work efficiently:
- the history of previous commands (browse with the up and down arrow keys, search with Ctrl+R)
- autocompletion with the <TAB> key everywhere you can (commands, filenames, etc.)
- wildcards and regexps
- appropriate quoting
- pipes to feed the output of one command into another (|)
- output redirection (> erases the previous file, >> appends to it)

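The difference between the two redirections is easy to check on a scratch file. The sketch below uses a hypothetical file results.txt and counts its lines through a pipe after each step.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "first run"  > results.txt     # > creates the file (or erases it)
echo "second run" >> results.txt    # >> appends a line
appended=$(cat results.txt | wc -l)

echo "fresh start" > results.txt    # > again: the previous content is gone
overwritten=$(cat results.txt | wc -l)

echo "after >>: $appended lines; after >: $overwritten line"
```
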
Asking for help on a command: man

This is the absolute basic command, to learn first!
    man ls

To browse within the manpage:
- <Space>: next page
- b: previous page
- G: go to the bottom
- g: go to the beginning
- /: search for an expression (type a pattern or string and press <Enter>)
- q: quit and return to the commandline

Sectioning of a manpage

Manpages are all written using the same format/sectioning:

1 NAME: the name of the command
2 SYNOPSIS: the syntax of the command (sometimes several lines to describe several ways of using the command)
    - square brackets ([...]) indicate optional components
    - pipes (|) within a construct separate alternatives
    - an ellipsis (...) usually indicates that the previous object is repeatable
3 DESCRIPTION and OPTIONS: meaning and behaviour of the different options and objects to give on the commandline
4 EXAMPLES: the most useful section, provides real-world examples along with some explanation of what they do
5 EXIT STATUS: useful in scripts, to monitor automatically whether the command execution produced an error
6 SEE ALSO: also useful when you don't know the exact name of a command but know a similar/sister one (e.g. uniq and join are cross-referenced)

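To illustrate why the EXIT STATUS section matters in scripts, here is a short sketch using grep, whose exit status is 0 when a match is found and 1 when none is. The file sample.txt and the patterns are made up for the example.

```shell
workdir=$(mktemp -d)
printf 'h3a_intro\nother_line\n' > "$workdir/sample.txt"

# grep -q prints nothing; only its exit status tells whether a match was found.
if grep -q 'h3a' "$workdir/sample.txt"; then
    found="yes"
else
    found="no"
fi

if grep -q 'absent_pattern' "$workdir/sample.txt"; then
    missing="yes"
else
    missing="no"
fi

echo "h3a found: $found; absent_pattern found: $missing"
```
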
Reading files: cat and less

cat
- produces the full content of file(s) to the standard output
- can concatenate several files: cat FILE1 FILE2 > FILE3
- is non-interactive: prints everything and quits

less is a pager
- produces the full content of file(s) to the standard output, one page at a time
- several files are processed one after the other: less FILE1 FILE2, and then :n (next) and :p (previous) to browse
- is fully interactive: <Space> for the next page, b for the previous one, / to search, q to quit, etc.
- useful option: -S, so that long lines are not automatically wrapped (preserves column alignment on long lines)

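A quick way to convince yourself of the concatenation behaviour of cat, using two hypothetical one-line files in a scratch directory:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two small input files with made-up content.
printf 'sample_A\n' > FILE1
printf 'sample_B\n' > FILE2

# Concatenate them, in the order given, into a third file.
cat FILE1 FILE2 > FILE3
merged=$(cat FILE3)
echo "$merged"
```
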
Count the numbers of chars, words or lines: wc

wc stands for "word count"

    wc -l FILE    number of lines
    wc -c FILE    number of bytes (roughly chars; -m counts actual characters)
    wc -w FILE    number of words
    wc -L FILE    length of the longest line in the file

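The counts are easy to verify by hand on a tiny file. In this sketch the file sample.txt and its two lines are invented; reading from standard input (with <) keeps the file name out of the output, and -L assumes GNU wc.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two lines: "one two" (7 chars + newline) and "three" (5 chars + newline).
printf 'one two\nthree\n' > sample.txt

lines=$(wc -l < sample.txt)     # 2 newline characters
words=$(wc -w < sample.txt)     # one, two, three
bytes=$(wc -c < sample.txt)     # 8 + 6 bytes
longest=$(wc -L < sample.txt)   # "one two" is the longest line

echo "lines=$lines words=$words bytes=$bytes longest=$longest"
```
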
Select columns from a file: cut

Simplified syntax
    cut -f <fields> -d <delimiter> FILE

- be sure you quote the delimiter, e.g. ';'
- <fields> can be a comma-separated list (ranges indicated with hyphens)

Example: select fields 2 and 5 from a semicolon-separated file
    cut -f 2,5 -d ';' cut_example.csv

Example: specify the output delimiter (GNU cut)
    cut -f 1-3 -d ';' --output-delimiter=$'\t' cut_example.csv

Example: extract only the first three characters of each line
    cut -c 1-3 cut_example.csv

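The sketch below runs the three kinds of selection on a small semicolon-separated file. The file name cut_example.csv comes from the slide, but its content here is entirely made up, and --output-delimiter assumes GNU cut. The TAB variable is a portable way of writing Bash's $'\t'.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
TAB=$(printf '\t')   # same character as $'\t' in Bash

# Hypothetical five-column file.
printf 'id;name;city;age;score\n1;Ada;Nairobi;30;9.5\n2;Bob;Addis Ababa;25;7.0\n' > cut_example.csv

# Fields 2 and 5, still separated by semicolons in the output.
picked=$(cut -f 2,5 -d ';' cut_example.csv)

# Fields 1 to 3, written out with tabs instead of semicolons.
tabbed=$(cut -f 1-3 -d ';' --output-delimiter="$TAB" cut_example.csv)

# First three characters of each line, whatever the delimiter.
prefix=$(cut -c 1-3 cut_example.csv)

echo "$picked"
echo "$tabbed"
echo "$prefix"
```
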
Sort a file according to some rules: sort

sort sorts text files according to the content of some fields, called keys.

Example: sorting lines alphabetically
    sort cut_example.csv

But it is usually a bad idea to leave the way sort sorts uncontrolled.

Example: sort according to the 2nd and then the 3rd field (semicolon-separated fields; the key spans fields 2 to 3)
    sort -t ';' -k 2,3 cut_example.csv

Example: sort numerically (-n) according to the 9th field only
    sort -t ';' -n -k 9,9 cut_example.csv
    # to check results:
    sort -t ';' -n -k 9,9 cut_example.csv | cut -f 9 -d ';'

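Why controlling the sort matters can be seen on a tiny example. The file scores.csv below is hypothetical; the locale is forced to C so that the alphabetical comparison is byte-based and reproducible.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # byte-wise comparisons, independent of the user's locale

# Hypothetical two-column file: a label and a number.
printf 'c;10\na;9\nb;100\n' > scores.csv

# Alphabetical sort on field 2: "10" < "100" < "9".
lexical=$(sort -t ';' -k 2,2 scores.csv | cut -d ';' -f 1 | tr -d '\n')

# Numerical sort on field 2: 9 < 10 < 100.
numeric=$(sort -t ';' -n -k 2,2 scores.csv | cut -d ';' -f 1 | tr -d '\n')

echo "alphabetical order of labels: $lexical"
echo "numerical order of labels:    $numeric"
```
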
sort, continued

-g option to sort numerical fields containing scientific notation:
    sort -k 2,2 -n with_sci_notation   # unexpected result
    sort -k 2,2 -g with_sci_notation   # GOOD!

WARNING! sort relies heavily on your locale setting! Try:
    LC_ALL=fr_FR.utf8 sort -k 2,2 -g with_sci_notation
(in a French locale the decimal separator is a comma, so numbers written with a dot are parsed differently)

One-letter sorting options can be used as flags attached to a key, and several keys can be specified:

Ascending order on the 5th field, descending on the 6th and then alphabetically on the 1st field
    sort -k 5,5g -k 6,6nr -k 1,1 hmmsearch_raw_output | less -S

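The difference between -n and -g shows up as soon as a value carries an exponent. In this sketch the file name with_sci_notation is taken from the slide, while its three lines are invented; the C locale is forced so that the dot is the decimal separator.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C

# Hypothetical values: 1e3 = 1000, 2e-2 = 0.02, and a plain 5.
printf 'x 1e3\ny 2e-2\nz 5\n' > with_sci_notation

# -n stops reading at the "e": it compares 1, 2 and 5.
plain=$(sort -k 2,2 -n with_sci_notation | cut -d ' ' -f 1 | tr -d '\n')

# -g understands the exponent: it compares 0.02, 5 and 1000.
general=$(sort -k 2,2 -g with_sci_notation | cut -d ' ' -f 1 | tr -d '\n')

echo "order with -n: $plain"
echo "order with -g: $general"
```
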
sort, some caveats

WARNING! By default, sort separates fields on blank to non-blank transitions, so be careful with empty fields: consecutive delimiters are merged and the following fields shift. One should specify the delimiter explicitly.

A precise delimiter to prevent sort from merging delimiters
    sort -k 11,11 -t $'\t' CDS_top_100.txt | cut -f 11 | less

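The empty-field pitfall can be reproduced with two tab-separated lines, one of which has an empty second field. The file table.tsv and its content are hypothetical; TAB is a portable spelling of Bash's $'\t'.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C
TAB=$(printf '\t')

# r1 has an EMPTY 2nd field and 5 as 3rd field; r2 has 3 as 2nd field.
printf 'r1\t\t5\nr2\t3\t1\n' > table.tsv

# Default splitting: the two tabs of r1 are merged, so its "2nd field" is 5.
merged=$(sort -n -k 2,2 table.tsv | cut -f 1 | tr -d '\n')

# Explicit delimiter: the 2nd field of r1 is really empty (it counts as 0).
explicit=$(sort -n -k 2,2 -t "$TAB" table.tsv | cut -f 1 | tr -d '\n')

echo "default field splitting: $merged"
echo "explicit tab delimiter:  $explicit"
```

With the default splitting, r2 (3) comes before r1 (5); with the explicit delimiter, r1 (empty field) comes first.
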
join lines of two files sharing a common field

join allows you to perform the relational join operation on two files.

Example: I want to select the lines of FILE2 whose 11th field corresponds to an entry in FILE1.
    join -1 1 -2 11 -t $'\t' dg_top_100.txt CDS_top_100.txt

WARNING! join operates on files already sorted on the join field!

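Here is a reduced version of that workflow, including the preliminary sorting step. All file names and contents (ids.txt, table.tsv) are hypothetical, and the join field of the second file is its 2nd column rather than the 11th to keep the example small.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C            # sort and join must agree on the ordering
TAB=$(printf '\t')         # same character as $'\t' in Bash

# File 1: identifiers of interest. File 2: a table whose 2nd column holds identifiers.
printf 'g2\ng1\n' > ids.txt
printf 'x\tg3\ny\tg1\nz\tg2\n' > table.tsv

# Both inputs must be sorted on their join field first.
sort ids.txt > ids.sorted
sort -t "$TAB" -k 2,2 table.tsv > table.sorted

# Join field 1 of the first file with field 2 of the second file.
joined=$(join -1 1 -2 2 -t "$TAB" ids.sorted table.sorted)
echo "$joined"
```

Each output line starts with the join field, followed by the remaining fields of the first file and then those of the second; the row for g3 is dropped because it has no partner in ids.txt.
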
Produce only the last n lines of a file: tail

Convenient to cut out parts you are not interested in, for instance because:
- the final lines of a log file contain the error that matters to you
- the header (first few lines) of the file is of no interest for the next tool in the pipeline
- the file is sorted and the last lines contain the samples of interest: you set a cutoff

Produce the last 30 lines of a file
    tail -n 30 input_file
or simply:
    tail -30 input_file

Produce all the lines starting from the 30th
    tail -n +30 input_file

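The header-skipping use case is probably the most common one in pipelines. A minimal sketch on a hypothetical four-line file table.txt:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical file: one header line followed by three data rows.
printf 'header\nrow1\nrow2\nrow3\n' > table.txt

# The last two lines only.
last_two=$(tail -n 2 table.txt)

# Everything from line 2 onwards, i.e. the file without its header.
no_header=$(tail -n +2 table.txt)

echo "$last_two"
echo "$no_header"
```
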
Symmetrical to tail: head

Produce the first 30 lines of a file
    head -n 30 input_file
or simply:
    head -30 input_file

Produce all but the last 30 lines (GNU head)
    head -n -30 input_file

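Combining head and tail in a pipe extracts an arbitrary block of lines. The sketch below works on a hypothetical file numbers.txt holding the integers 1 to 10; the negative count assumes GNU head.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical ten-line file, one number per line.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > numbers.txt

# The first three lines.
first_three=$(head -n 3 numbers.txt)

# All but the last three lines (lines 1 to 7).
all_but_last_three=$(head -n -3 numbers.txt)

# Lines 4 to 6: keep the first six, then the last three of those.
lines_4_to_6=$(head -n 6 numbers.txt | tail -n 3)

echo "$lines_4_to_6"
```
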
Translate chars with tr

tr helps you change any occurrence of a character into another, or delete it:

Translating Windows end-of-lines into UNIX ones
    cat Win_formatted_file | tr -d '\r' > UNIX_formatted_file
(Windows lines end with \r\n, so deleting the \r leaves the UNIX \n; translating \r into \n would insert an empty line after every line)

Warning! tr only processes its standard input, hence the pipe from cat!

But tr also comes in handy to change separators in a CSV file:

Translating semicolons into tabulations
    cat example_mj.txt | tr ';' '\t'

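Both uses of tr can be chained in a single pipeline. This sketch assumes a hypothetical Windows-formatted file win.txt whose lines end with a carriage return followed by a newline; the carriage returns are deleted with tr -d and the semicolons are then turned into tabs.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical Windows-style file: each line ends with \r\n.
printf 'a;b\r\nc;d\r\n' > win.txt

# tr reads only its standard input, so the file is piped (or redirected) into it.
cat win.txt | tr -d '\r' | tr ';' '\t' > unix.tsv

converted=$(cat unix.tsv)
size=$(wc -c < unix.tsv)
echo "$converted"
echo "size in bytes: $size"
```

The input redirection `tr -d '\r' < win.txt` would work equally well and saves one process.
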