Summary of Digital Information (so far):


CS/MA 109, Fall 2016
Wayne Snyder, Computer Science Department, Boston University

Today: Audio concluded; MP3; Data Compression introduced
Next Time: Data Compression concluded; Error Detection and Error Correction

Summary of Digital Information (so far):

Digital information is represented by symbols:
o Bits: 0 or 1
o Binary numbers: 000, 001, 010, 011, 100, 101, 110, 111
o All symbols can be represented by binary numbers.

Data is digitized by sampling and quantizing:
o Images: 24 bits per pixel, 8 bits for each color
o Audio: 16 bits per sample, 44,100 samples per second
o Text: ASCII (8 bits per letter), Unicode (24 bits per letter)

Punchline: Digitizing represents a (potentially) infinite amount of information by symbols, approximating reality.

Question: When is the approximation good enough?

Manipulating Digital Information

Digital information can be manipulated by many different algorithms to improve its properties; we will look at two different aspects of this in the next week or so.

This week: Data Compression (lossy and lossless). Compressed data takes less storage space and less time to transmit; lossless data compression does this without any loss of information. We will look at one lossy method (MP3 music files) and two different lossless methods:
o Run-Length Encoding
o Huffman Encoding

Next week: Error-detecting and error-correcting codes. Data can be encoded in such a way that errors (scratched disks, errors in transmission, computer glitches) can be detected or even corrected.

Audio Redux: Compressing Music...

Audio formats include CDA, WAV, AU, AIFF, VQF, and MP3. MP3 (MPEG Audio Layer 3) is the most popular: the music signal is simplified using psychoacoustic principles, and the resulting sequence of bits is then compressed (we start on this next lecture!).

MP3: Psychoacoustics (figure)

MP3: Auditory Masking (figure)

MP3 Encoding Principles

o Break the file into small frames of a fraction of a second each;
o Analyze each frame in terms of the frequencies present;
o Eliminate frequencies which would be masked anyway;
o Recalculate the samples; and
o Put the frames together into a new (smaller) music file.

Encoding music files in MP3 is a lossy process; you lose some information: you could not recreate the original music file exactly from the MP3 version!

MP3: Variable-Rate Encoding

MP3 compression rates are based on how much bandwidth the final file will use to play music in real time: 128 kbps (~128,000 bits per second) is CD quality. Much smaller rates can be used for voice, e.g., news broadcasts.

Compare to CD audio: 10,600,000 bytes per minute = 1,413,000 bits per second. MP3 encodes the same quality in 9% of the space!
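To make that arithmetic concrete, here is a short Python sketch (our own illustration, not part of the original slides) that recomputes the CD bitrate and the MP3 compression ratio from the sampling parameters given earlier:

```python
# Recompute the CD-audio numbers quoted above from first principles.
SAMPLE_RATE = 44_100       # samples per second
BITS_PER_SAMPLE = 16
CHANNELS = 2               # stereo

cd_bps = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS   # bits per second
cd_bytes_per_min = cd_bps * 60 // 8                 # bytes per minute

mp3_bps = 128_000                                   # "CD quality" MP3 bitrate

print(f"CD audio: {cd_bps:,} bits/second")             # 1,411,200 (~1,413,000)
print(f"CD audio: {cd_bytes_per_min:,} bytes/minute")  # 10,584,000 (~10,600,000)
print(f"MP3 uses {mp3_bps / cd_bps:.0%} of the space") # ~9%
```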

Lossless Data Compression: It's Theoretically Impossible!

First of all, let's examine the mathematical problem, and show that it is impossible to solve in general!

Suppose we have the following eight letters: A, B, D, E, F, M, N, and _ (blank). These 8 letters would take 3 bits each to encode, e.g.:

A 000    F 100
B 001    M 101
D 010    N 110
E 011    _ 111

So we could encode the message BE_A_MAN as

001 011 111 000 111 101 000 110
 B   E   _   A   _   M   A   N

That's 8 characters and 3 * 8 = 24 bits. How many possible messages of 24 bits could we have? That is, how many 24-bit sequences are there?

Lossless Data Compression: It's Theoretically Impossible! (continued)

How many 24-bit sequences are there? 2^24 = 16,777,216. Any such sequence is a possible message, e.g.:

000000000000000000000000 = AAAAAAAA
000000000000000000000001 = AAAAAAAB
000000000000000000000010 = AAAAAAAD
...
010101010101010101010101 = BMBMBMBM
...
010111011010111101000101 = D_ED_MAM (a random series of bits)
...
111111111111111111111110 = _______N (7 blanks and N)
111111111111111111111111 = ________ (8 blanks)

To encode 2^24 random possibilities in bits takes a minimum of 24 bits! If we used only 23 bits, we would have only 2^23 possibilities, and some messages would be left out.

PUNCHLINE: You can't compress ALL possible messages.
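A short Python sketch (our own, using the 3-bit code above) makes both the encoding and the counting argument concrete:

```python
# Fixed-length 3-bit code for the 8-letter alphabet above.
CODE = {'A': '000', 'B': '001', 'D': '010', 'E': '011',
        'F': '100', 'M': '101', 'N': '110', '_': '111'}

def encode(message: str) -> str:
    """Concatenate the 3-bit codeword for each letter."""
    return ''.join(CODE[c] for c in message)

bits = encode('BE_A_MAN')
print(bits)          # 001011111000111101000110
print(len(bits))     # 24 bits for 8 letters

# The pigeonhole argument: 24 bits give 2**24 distinct messages,
# but 23 bits give only 2**23 -- half the messages would be left out.
print(2**24, 2**23)  # 16777216 8388608
```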

Lossless Data Compression: It's Practically Possible!

BUT, we don't have to represent all possible messages, just the ones we are interested in. The number of possible meaningful messages with the code above is far less than 2^24.

Basic idea of all compression algorithms: Interesting data has redundancies and is not random (because it is information); we can take advantage of the patterns to reduce the total number of bits. Essentially ALL modern data storage uses compression in some way. (Think of a music file full of silence: 0 0 0 0 0 0...)

Lossless Data Compression: Run-Length Encoding

The simplest compression method is called RLE: Run-Length Encoding. The idea is to count the number of repeated symbols, and alternate counts and symbols. Instead of

AAAAAABBAADDDFFFFF (18 letters, which would require 54 bits)

we would write

6A2B2A3D5F

We can alternate binary numbers for the counts and binary codewords for the symbols:

110 000 010 001 010 000 011 010 101 100
 6   A   2   B   2   A   3   D   5   F

(30 bits, or 44% smaller: an average of about 1.67 bits per letter.)
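Here is a minimal Python sketch of the idea (our own illustration); it produces the alternating count/symbol pairs shown above:

```python
from itertools import groupby

def rle(text: str) -> str:
    """Run-length encode: replace each run of a repeated
    symbol with its count followed by the symbol."""
    return ''.join(f"{len(list(run))}{symbol}"
                   for symbol, run in groupby(text))

print(rle("AAAAAABBAADDDFFFFF"))   # 6A2B2A3D5F
```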

Lossless Data Compression: Run-Length Encoding (continued)

One problem is that the counts cannot be bigger than 3 bits, i.e., 7. We could just live with this, e.g.,

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...

is encoded as 7A7A7A7A..., or we could make a rule that whenever the count field is 000 (not useful in general, since a count of 0 never occurs), the next 6 bits are the actual count; this gives us counts as high as 2^6 - 1 = 63.

Suppose we have 50 As and 3 Bs: AAAAA...AAAABBB

000 110010 000 011 001
esc   50    A   3   B

This seems like a good idea, so let's use it in the homework for next week: we'll call this method RLE109.

The fundamental characteristic of data is that it is non-random, and compression relies on finding patterns and encoding them more compactly. We can use the principles of Probability Theory to design other kinds of lossless compression algorithms. Huffman Encoding is one such approach, and it relies on the probability distribution of the symbols in the data. Let's focus on text files, and look at four different files, with different probabilities for the occurrence of the letters...
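A sketch of this escape-count scheme in Python (our own rendering of the rule; splitting runs longer than 63 into chunks is our assumption, and the symbol codes are the 3-bit table above):

```python
from itertools import groupby

CODE = {'A': '000', 'B': '001', 'D': '010', 'E': '011',
        'F': '100', 'M': '101', 'N': '110', '_': '111'}

def rle109(text: str) -> str:
    """RLE with 3-bit counts; a count field of 000 escapes to a
    6-bit count (runs up to 63). Longer runs are split (assumed)."""
    out = []
    for symbol, run in groupby(text):
        n = len(list(run))
        while n > 0:
            chunk = min(n, 63)
            if chunk <= 7:
                out.append(f"{chunk:03b}")           # small count: 3 bits
            else:
                out.append("000" + f"{chunk:06b}")   # escape + 6-bit count
            out.append(CODE[symbol])
            n -= chunk
    return ''.join(out)

print(rle109("A" * 50 + "BBB"))   # 000110010000011001
```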

File 1: A random sequence of the letters A...Z:

YBJOJNQTTCYTAAHCJKZMKUOEKACDVWNEDAXGZMJAFSDTOKIMRZFT
CZLBIXOLGIEWACBJJFMMJLOHOYGNGXBEPZZUOIAFAFTNQGCSFTWISF
KYAXMFOQJLYJEXSVZYECFNLJHYCCEZIYCMVZECVBJVJGFYHYJXGQVJ
JDKVGMONVBHSVQLJCALCJHWRPCZNCNFXGCMDVQMKRISBKHNYMM
DMALJDLBOUKRDKJBQBJUYURDZQWPQUNENOWFNUWODVKVNSVLMM
.....

What is the probability that any particular letter is A? Since all letters occur with equal probability, and there are 26 letters, the probability of A is 1/26 = 0.03846... Each of the letters has the same probability.

File 2: A non-random sequence of letters:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...

What is the probability that any particular letter is A? Um... 1.0 = 100%!

File 3: Another non-random sequence of letters, this one containing no As at all.

What is the probability that any particular letter is A? Um... 0%.

File 4: Another non-random sequence of letters:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way-- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only. There were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. In both countries it was clearer than crystal to the lords of the State preserves of loaves...

What is the probability that any particular letter is A? We could count the letters in A Tale of Two Cities, but for a good estimate we can use statistics about the distribution of letters in the English language: the probability of A in English text is 8.167%.
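Counting letter frequencies is easy to do directly; here is a small Python sketch (our own) that estimates the probability of each letter from a sample of text:

```python
from collections import Counter

def letter_probabilities(text: str) -> dict:
    """Estimate each letter's probability from its relative
    frequency in the text (ignoring case and non-letters)."""
    letters = [c for c in text.upper() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {letter: n / total for letter, n in counts.most_common()}

sample = "It was the best of times, it was the worst of times"
probs = letter_probabilities(sample)
print(f"P(T) = {probs['T']:.3f}")   # T is very common in this sample
```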

Why does this matter?

If all symbols occur with equal frequency, then we have little choice but to figure out how many possibilities there are (= the number of letters) and invent a binary code with enough bits to cover all of them. For the 26 English letters:

2^4 = 16   Not enough!
2^5 = 32   Enough, with room to spare (maybe you want blanks!)

So we need a minimum of 5 bits per letter to store random sequences of letters. When you have random data, this is the best you can do...

But what if the probabilities are not equal? How many bits does it take to indicate a letter if the text file contains only As and Bs?

AAAAAAABBBBBBBBBBAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA...

ONLY ONE! With A = 1 and B = 0:

111111100000000001111111111111111111111111
1111111111111111111111111111111111111111111
111111111111111...

This is the absolute minimum number of bits for a code!
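Both bounds are easy to check; here is a quick Python sketch (our own illustration) of the rule that n equally likely symbols need ceil(log2(n)) bits each:

```python
import math

def fixed_length_bits(num_symbols: int) -> int:
    """Minimum bits per symbol for a fixed-length code
    covering num_symbols equally likely symbols."""
    return math.ceil(math.log2(num_symbols))

print(fixed_length_bits(26))   # 5  (2**5 = 32 >= 26)
print(fixed_length_bits(8))    # 3  (the A..._ alphabet above)
print(fixed_length_bits(2))    # 1  (just A and B)
```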

OK, that example was silly, so let's consider a more realistic one. Suppose we have three letters: A, B, and C. Normally, how many bits would you need to encode messages with these three letters?

Two bits, since 2^2 = 4 >= 3:

A 00
B 01
C 10

BUT: suppose we know that the probabilities of the letters are not equal, e.g., some letters are rare:

A occurs with probability 0.8
B occurs with probability 0.1
C occurs with probability 0.1

AABACAAAAAAABAAACAAAAABACAAAAAA
ABAAACAAAAABAAAAAACAABAAACAAAAA
AACAAABAAABAAACAAA...

00 00 01 00 10 ...

Then the following code (called a Huffman code) is more efficient:

A 0
B 10
C 11

Let's try the message AABACAAAAA with the two codes:

Basic     00 00 01 00 10 00 00 00 00 00   (20 bits)
Huffman   0 0 10 0 11 0 0 0 0 0           (12 bits)

The Huffman version is only 60% as long, a 40% saving. Instead of each letter taking 2 bits, each letter on average uses

0.8 * 1 + 0.1 * 2 + 0.1 * 2 = 1.2 bits

This is the basic idea of Huffman Encoding: use variable-length codes, and encode more frequent letters with shorter codes. This is actually an old idea: it was used in designing Morse Code, so that telegraph operators could transmit messages as quickly as possible. It was also used, more or less, to figure out the point values for letters in Scrabble.
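To close the loop, here is a small Python sketch (ours) that encodes the example message with both codes and verifies the expected-bits calculation:

```python
BASIC   = {'A': '00', 'B': '01', 'C': '10'}
HUFFMAN = {'A': '0',  'B': '10', 'C': '11'}
PROBS   = {'A': 0.8,  'B': 0.1,  'C': 0.1}

def encode(message: str, code: dict) -> str:
    """Concatenate the codeword for each letter of the message."""
    return ''.join(code[c] for c in message)

msg = "AABACAAAAA"
print(len(encode(msg, BASIC)))     # 20 bits
print(len(encode(msg, HUFFMAN)))   # 12 bits

# Expected bits per letter under the Huffman code:
avg = sum(p * len(HUFFMAN[c]) for c, p in PROBS.items())
print(avg)                         # 1.2 bits per letter
```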