CS 307: UNIX PROGRAMMING ENVIRONMENT
WORKING WITH FILES AND COLLECTIONS OF FILES
Prof. Michael J. Reale
Fall 2014

Credit Where Credit Is Due
Prof. Nick Merante's notes: http://web.cs.sunyit.edu/~merantn/cs307/
Indiana University Tutorial on Tar: https://kb.iu.edu/d/acfi
File Compression
File Compression
There are three main file compression utilities:
    gzip: Standard Unix compression algorithm
    bzip2: Slower but better compression
    xz: Very slow but best compression
gzip
gzip command: the Unix standard for compression
Uses Lempel-Ziv coding (LZ77)
Compresses one or more files:
    gzip mightybigfile
    gzip file1 file2 file3
The compressed file keeps the same permissions, access times, etc.
The original file is replaced with the compressed file, which has .gz appended to its name
E.g., gzip File1: File1 is gone, replaced with File1.gz
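A minimal sketch of the replace-in-place behavior described above, run in a throwaway directory. The file name sample.txt and its generated contents are illustrative assumptions, not part of the original notes.

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample file: 5000 identical lines compress very well.
awk 'BEGIN { for (i = 0; i < 5000; i++) print "the quick brown fox" }' > sample.txt
orig_size=$(wc -c < sample.txt)

gzip sample.txt                    # sample.txt is replaced by sample.txt.gz
gz_size=$(wc -c < sample.txt.gz)

ls                                 # lists only sample.txt.gz
echo "original: $orig_size bytes, compressed: $gz_size bytes"
```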
gzip Compression Quality
You can also specify how good you want the compression to be:
    -1 (--fast): Fastest, but worst compression
    -9 (--best): Slowest, but best compression (at least with the approach used)
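One way to see the trade-off is to compress two copies of the same file at both extremes and compare sizes. This is only a sketch: the file names and generated text are assumptions, and the size difference depends heavily on the input data.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical input: 20000 numbered lines of text.
awk 'BEGIN { for (i = 1; i <= 20000; i++) print "line", i, "of some moderately repetitive text" }' > fast.txt
cp fast.txt best.txt

gzip -1 fast.txt     # fastest setting
gzip -9 best.txt     # best-compression setting

fast_size=$(wc -c < fast.txt.gz)
best_size=$(wc -c < best.txt.gz)
echo "-1: $fast_size bytes, -9: $best_size bytes"
```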
gunzip
gunzip command: decompresses (restores) one or more files
    E.g., gunzip file1.gz
Note: ignores files without a recognized suffix such as .gz or .tgz
Can override the suffix with the -S option
    E.g., gunzip -S .waffle test2.waffle
The compressed file is replaced with the original (decompressed) file
    E.g., gunzip file1.gz: file1.gz is replaced with file1
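A short round trip with a custom suffix, as a sketch. It assumes GNU gzip, where -S takes the full suffix including the leading dot; the file name test2 and its contents are made up for the example.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

printf 'breakfast\n' > test2

gzip -S .waffle test2             # creates test2.waffle instead of test2.gz
ls                                # test2 is gone; test2.waffle is there

gunzip -S .waffle test2.waffle    # restores test2 and removes test2.waffle
restored=$(cat test2)
echo "restored contents: $restored"
```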
gzip/gunzip: Leaving Files Intact
By default, the original file is replaced with the compressed file (and vice versa with decompression)
To keep the existing files, use the -c option to write to STDOUT
    gzip -c test > test.gz
        Compresses test and writes the result to test.gz; the file test is still there
    gunzip -c test.gz
        Decompresses test.gz and writes to the terminal (STDOUT); the file test.gz is still there
    gunzip -c test.gz > newtest
        Decompresses test.gz and writes it to newtest; the file test.gz is still there
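The three commands above can be tried together in a scratch directory. This sketch reuses the file names from the slide; the file contents are an assumption.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

printf 'hello, gzip\n' > test

gzip -c test > test.gz          # compress to STDOUT; test is kept
gunzip -c test.gz               # decompress to the terminal; test.gz is kept
gunzip -c test.gz > newtest     # decompress into a new file; test.gz is kept

ls                              # newtest, test, and test.gz all exist
cmp test newtest                # silent when the two files are identical
```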
gzip/gunzip as Filters
gzip/gunzip read data from STDIN and write to STDOUT if no files are specified
    (CNT=0; while [ $CNT -lt 1000 ]; do CNT=`expr $CNT + 1`; /usr/games/fortune; done) | gzip > test.gz
Prints a thousand fortunes, pipes them to gzip, and writes the compressed data to test.gz (which didn't exist before)
gunzip can decompress something right to the terminal:
    cat test.gz | gunzip
gzip, on the other hand, will NOT write to the terminal UNLESS you use the -f (force) option
Not a great idea anyway, but it's good to know
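The same filter pattern can be tried without fortune installed. In this sketch a plain echo stands in for /usr/games/fortune (an assumption made for portability), and the line count confirms that the data survives the round trip through the pipe.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Generate 1000 lines and pipe them straight into gzip (no input file).
( i=0
  while [ "$i" -lt 1000 ]; do
      i=$((i + 1))
      echo "fortune number $i"
  done ) | gzip > test.gz

# gunzip reading from STDIN and writing to STDOUT.
count=$(gunzip < test.gz | wc -l)
first=$(cat test.gz | gunzip | sed -n '1p')

echo "lines recovered: $count"
echo "first line: $first"
```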
gzcat
gzcat command
Decompresses a file and prints it to the terminal (a la cat)
Same as doing gunzip -c
On many Linux systems the same command is installed as zcat
bzip2/bunzip2
bzip2/bunzip2 commands
Better compression, but takes longer
Uses the Burrows-Wheeler block-sorting text compression algorithm and Huffman coding
Very similar to gzip in options and usage (but not identical)
Uses the .bz2, .bz, .tbz2, or .tbz extensions
Can read from STDIN and write to STDOUT (if no filenames are specified or when using the -c option)
bzcat: same function as gzcat
xz/unxz
xz/unxz commands
Very slow, but gives the best compression results
Again, very similar to gzip and bzip2 in options and usage
Also has the same STDIN/STDOUT behavior
Uses the .xz format
Can also handle the legacy .lzma format
Also has xzcat
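Because the three tools share the -c and -d options, one loop can compare them on the same input. This is a sketch under stated assumptions: the sample file is generated, bzip2 and xz may not be installed everywhere (so each tool is checked first), and the relative sizes will vary with the data.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical input file.
awk 'BEGIN { for (i = 1; i <= 20000; i++) print "record", i, "some repetitive payload text" }' > sample.txt
echo "original: $(wc -c < sample.txt) bytes"

# Each entry is tool:extension.
for pair in gzip:gz bzip2:bz2 xz:xz; do
    tool=${pair%%:*}
    ext=${pair##*:}
    if command -v "$tool" >/dev/null 2>&1; then
        "$tool" -c sample.txt > "sample.txt.$ext"
        # -d decompresses, -c writes to STDOUT; cmp confirms a lossless round trip.
        "$tool" -dc "sample.txt.$ext" | cmp - sample.txt
        echo "$tool: $(wc -c < "sample.txt.$ext") bytes"
    else
        echo "$tool: not installed, skipping"
    fi
done
```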
Tar
Introduction
So far, we're able to compress single, regular files
What if we want to compress multiple files and/or a whole directory as one big file?
Have to somehow turn all the files (or the contents of the directory) into one file
Tape Drives In days of yore (and to a MUCH lesser extent even now), tape drives were used to store/archive data VERY slow, but high capacity The tar utility was originally written to archive data to tape drives Now, we use it to archive files/directories
Tar: Tape Archiver
tar command
Concatenates file contents (each separated by header information)
Preserves owner, permissions, timestamp information, etc.
    -f: Specify the tar file (either as input or output)
    -v: Verbose output; lists the file names it is reading from/writing to the archive
Creating an Archive To create an archive from files and directories, use the cf option: tar cf myarchive.tar file1 file2 file3 Puts file1, file2, and file3 into the archive myarchive.tar
Unpacking an Archive To unpack the archive, use the xf option tar xf myarchive.tar
Listing the Contents To list the contents of a tar file without unpacking it, use the tf option tar -tf myarchive.tar
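The create, list, and unpack steps can be run end to end as follows. This is a sketch: the file contents and the unpacked directory are assumptions added so the extracted copies can be compared with the originals.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

printf 'one\n'   > file1
printf 'two\n'   > file2
printf 'three\n' > file3

tar cf myarchive.tar file1 file2 file3    # create the archive

listing=$(tar tf myarchive.tar)           # list contents without unpacking
echo "$listing"

# Unpack into a separate directory so the originals are not overwritten.
mkdir unpacked
( cd unpacked && tar xf ../myarchive.tar )

cmp file1 unpacked/file1                  # silent when the copies match
```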
Compressing AND Tarring
If you're using GNU tar, you can compress and tar files at the same time:
    -z: Use gzip compression
    -j: Use bzip2 compression (-y on some other tar implementations, such as FreeBSD's)
Examples:
    tar cvzf myarchive.tgz file1 file2 file3
        Creates a gzip-compressed archive
    tar xvjf myarchive.tbz2
        Unpacks a bzip2-compressed archive
.tgz = .tar.gz and .tbz2 = .tar.bz2
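A sketch of archiving and compressing a whole directory in one step with -z. The directory name project and its files are hypothetical; only gzip compression is shown, since bzip2 may not be installed on every system.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical directory to archive.
mkdir project
printf 'meeting notes\n' > project/notes.txt
printf 'buy more tape\n' > project/todo.txt

tar czf project.tgz project          # archive and gzip in one step

listing=$(tar tzf project.tgz)       # list the compressed archive
echo "$listing"

mkdir restore
( cd restore && tar xzf ../project.tgz )
cmp project/notes.txt restore/project/notes.txt
```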
Assorted Useful Utilities
srm: Secure Removal srm command Securely deletes files Overwrites file data and then deletes file (unlinks hard link) May not be available on all systems (or may be named something else)
split
split command: splits a file into pieces
Useful when you have a VERY large .tar file
Syntax:
    split -b byte_count[K|k|M|m|G|g] [-a suffix_length] [file [prefix]]
    -a: suffix length; determines how many letters to use for each part's suffix
Example:
    split -b 650m -a 1 big_tarball.tar
Default prefix: x
Pieces become xa, xb, xc, ...
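Splitting is only useful if the pieces can be put back together; cat does that when the pieces are listed in order. The sketch below uses a small generated file and a 1 KB piece size (both assumptions chosen to keep the example quick) in place of a real tarball.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical stand-in for a large tarball: 100 lines of 28 bytes each.
awk 'BEGIN { for (i = 0; i < 100; i++) printf "line %03d of the sample file\n", i }' > big.txt

split -b 1k -a 1 big.txt        # default prefix x: pieces are xa, xb, ...
pieces=$(ls x? | wc -l)
echo "number of pieces: $pieces"

# The shell expands x? in sorted order, so cat restores the original byte order.
cat x? > rejoined.txt
cmp big.txt rejoined.txt        # silent when the files are identical
```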
Message Digests
Say you want to give a file to a friend; when they download it, how do they know for sure that the file data is the original file data?
Could have been altered or corrupted (either unintentionally or intentionally)
One way to handle this:
    Generate a message digest for the file
    Your friend downloads the file and the digest
    They generate a digest from the file they received
    If their digest matches your digest, life is good
Message digest = kind of like a fingerprint for the data
Believed to be computationally infeasible to have two different files generate the same digest
Generating Message Digests
The name of the command depends on the kind of digest you want to generate
Under FreeBSD, the command is usually just the name of the digest:
    md5: generates an MD5 digest
    sha1: generates a SHA-1 digest
Under Linux, it is the name + sum:
    md5sum
    sha1sum
    sha256sum
MD5 collisions are easy to produce (and collisions are possible with SHA-1 as well), so it's recommended you use SHA-256 (or higher) if security is a concern
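The generate-then-verify workflow can be sketched with sha256sum. This assumes a Linux system with GNU coreutils, where sha256sum -c re-checks files against a saved digest list; the file name report.txt and its contents are hypothetical.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

printf 'important data\n' > report.txt

# Sender: generate the digest and publish it next to the file.
sha256sum report.txt > report.txt.sha256
cat report.txt.sha256

# Receiver: recompute the digest and compare it with the published one.
if sha256sum -c report.txt.sha256 >/dev/null 2>&1; then
    intact=yes
else
    intact=no
fi
echo "unmodified file verifies: $intact"

# Simulate corruption; the digest no longer matches.
printf 'tampered\n' >> report.txt
if sha256sum -c report.txt.sha256 >/dev/null 2>&1; then
    tampered_ok=yes
else
    tampered_ok=no
fi
echo "modified file verifies: $tampered_ok"
```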