Source coding and compression


Computer Mathematics
Week 5: Source coding and compression
College of Information Science and Engineering
Ritsumeikan University

last week
binary representations of signed numbers
  sign-magnitude, biased
  one's complement, two's complement
signed binary arithmetic
  negation
  addition, subtraction
  signed overflow detection
  multiplication, division
width conversion
  sign extension
floating-point numbers
[figure: block diagram of a computer: Central Processing Unit (CU, ALU, registers IR, PC, PSR, DR, AR), address bus and data bus, Random Access Memory, Input / Output Controller with Universal Serial Bus (mouse, keyboard, HDD) and PCI Bus (GPU, audio, SSD)]

this week
coding theory
  source coding
information theory concept
  information content
binary codes
  numbers
  text
variable-length codes
  UTF-8
compression
  Huffman's algorithm

coding theory
coding theory studies the encoding (representation) of information as numbers, and how to make encodings more
  efficient (source coding)
  reliable (channel coding)
  secure (cryptography)

binary codes
a binary code assigns one or more bits to represent some piece of information
  number
  digit, character, or other written symbol
  colour, pixel value
  audio sample, frequency, amplitude
  etc.
codes can be
  arbitrary, or designed to have desirable properties
  fixed length, or variable length
  static, or generated dynamically for specific data

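a code for numbers
why a code for numbers?
  binary is ambiguous between values
  e.g., an electro-mechanical sensor moves from position 7 (0111) to position 8 (1000)
    all four bits change, and there is no guarantee that they change simultaneously
    a reading taken during the change can be any mixture of old and new bits
a better code would have a one-bit difference between adjacent positions
  also known as minimum-change codes
[figure: encoded position plotted against movement]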
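The problem with plain binary can be made concrete by counting how many bits change between adjacent positions. The following is a minimal sketch; the helper name `changed_bits` is an assumption made for illustration and does not come from the slides.

```python
def changed_bits(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    # XOR leaves a 1 in every position where the two values disagree.
    return bin(a ^ b).count("1")


# Number of bits that change at each step 0->1, 1->2, ..., 14->15
# when positions are encoded in plain binary.
steps = [changed_bits(n, n + 1) for n in range(15)]
print(steps)               # [1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1]
print(changed_bits(7, 8))  # 4: every bit of 0111 changes to give 1000
```

Only half of the steps change a single bit, which is why a sensor read in mid-movement can report a position far from the true one.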

reflected binary code
begin with a single bit encoding two positions (0 and 1)
  this has the property we need: only one bit changes between adjacent positions
to double the size of the code...
  repeat the bits by reflecting them
    the two central positions now have the same code
  prefix the first half with 0, and the second half with 1
    this still has the property we need: only the newly-added first bit changes between the two central positions
[figure: the 1-bit code reflected to give the 2-bit code, then reflected again to give the 3-bit code]
this is Gray code, named after its inventor
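The reflect-and-prefix construction described on this slide can be sketched in a few lines of Python. The function name `reflected_code` is hypothetical, and the sketch assumes `n_bits` is at least 1.

```python
def reflected_code(n_bits: int) -> list[str]:
    """Build the n-bit reflected binary (Gray) code by repeated reflection."""
    codes = ["0", "1"]  # the 1-bit code: two positions
    for _ in range(n_bits - 1):
        # First half: the existing codes prefixed with 0.
        # Second half: the same codes in reverse (reflected), prefixed with 1.
        codes = ["0" + c for c in codes] + ["1" + c for c in reversed(codes)]
    return codes


print(reflected_code(2))  # ['00', '01', '11', '10']
print(reflected_code(3))
# ['000', '001', '011', '010', '110', '111', '101', '100']
```

Each doubling keeps the one-bit-change property inside both halves, and at the join only the new prefix bit differs.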

Gray code and binary
conversion from binary to Gray code
  add (modulo 2) each input digit to the input digit on its left
  (there is an implicit 0 on the far left)
conversion from Gray code to binary
  add (modulo 2) each input digit to the output digit on its left
  (there is an implicit 0 on the far left)
[figure: a worked example of each conversion, showing the implicit 0, the input digits and the output digits]
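The two conversion rules differ only in which digit sits "on the left": the previous input digit, or the previous output digit. A minimal sketch, working on strings of binary digits (the function names are assumptions made for illustration):

```python
def binary_to_gray(bits: str) -> str:
    """Add (modulo 2) each input digit to the INPUT digit on its left."""
    out = []
    left = 0  # implicit 0 on the far left
    for ch in bits:
        digit = int(ch)
        out.append(str(digit ^ left))  # XOR is addition modulo 2
        left = digit                   # remember the input digit
    return "".join(out)


def gray_to_binary(bits: str) -> str:
    """Add (modulo 2) each input digit to the OUTPUT digit on its left."""
    out = []
    left = 0  # implicit 0 on the far left
    for ch in bits:
        digit = int(ch) ^ left
        out.append(str(digit))
        left = digit                   # remember the output digit
    return "".join(out)


print(binary_to_gray("0111"))  # 0100  (position 7)
print(binary_to_gray("1000"))  # 1100  (position 8: one bit differs from 0100)
print(gray_to_binary("1100"))  # 1000
```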

binary and Gray code rotary encoders

high-resolution Gray code rotary encoder

fixed length codes for text
Baudot code (5 bits)
International Telegraph Alphabet (ITA) No. 2 (5 bits)
ASCII (7+ bits)

fixed length codes for text: ASCII
ASCII (American Standard Code for Information Interchange)
  latin alphabet, arabic digits
  common western punctuation and mathematical symbols
the ASCII codes, in hexadecimal (rows: most significant digit; columns: least significant digit)

     0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   NL   VT   NP   CR   SO   SI
1    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3    0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4    @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5    P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6    `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7    p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL

text:          H   e   l   l   o   SP  t   h   e   r   e   !
hex encoding:  48  65  6C  6C  6F  20  74  68  65  72  65  21
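The hex encoding of a short text can be checked directly in Python, since `ord` returns a character's code number. This is a small sketch; the function name `ascii_hex` is an assumption.

```python
def ascii_hex(text: str) -> str:
    """Return the code of each character as two hexadecimal digits."""
    # ord() gives the character's number; for ASCII text this is its ASCII code.
    return " ".join(format(ord(ch), "02X") for ch in text)


print(ascii_hex("Hello there!"))
# 48 65 6C 6C 6F 20 74 68 65 72 65 21
```

The most significant hex digit selects the row of the table and the least significant digit selects the column, so `6C` is row 6, column C: the letter l.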

fixed length codes for text: UTF-32
Unicode is a mapping of all written symbols to numbers
  modern languages, ancient languages, math/music/art symbols (and far too many useless emoticons)
  21-bit numbering, usually written U+hexadecimal-code
  first 128 Unicode characters (U+0000 to U+007F) are the same as ASCII

name           symbol  Unicode   ASCII (hexadecimal)
asterisk       *       U+002A    2A
star           ★       U+2605    n/a
shooting star  🌠      U+1F320   n/a

UTF (Unicode Transformation Format) is an encoding of Unicode in binary
UTF-32 is a fixed length encoding
  each character's number is represented directly as a 32-bit integer
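Python's built-in `utf-32-be` codec (big-endian, no byte-order mark) makes the "number stored directly as a 32-bit integer" idea visible. A minimal sketch; the helper name `utf32_hex` is an assumption.

```python
def utf32_hex(ch: str) -> str:
    """Return the four UTF-32 bytes of one character, in hexadecimal."""
    # "utf-32-be" stores the code point as a big-endian 32-bit integer.
    return ch.encode("utf-32-be").hex(" ").upper()


for ch in ["*", "\u2605", "\U0001F320"]:
    print(f"U+{ord(ch):04X}", "->", utf32_hex(ch))
# U+002A -> 00 00 00 2A
# U+2605 -> 00 00 26 05
# U+1F320 -> 00 01 F3 20
```

Every character takes four bytes, even though the asterisk needs only seven bits.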

variable length codes for text: UTF-8
UTF-32 is wasteful
  most western text uses the 7-bit ASCII subset U+20 to U+7F
  most other modern text uses characters with numbers < $2^{16}$
    kana are in the range U+3000 to U+30FF
    kanji are in the range U+4E00 to U+9FAF
UTF-8 is a variable length encoding of Unicode
  characters are between 1 and 4 bytes long
  first byte encodes the length of the character
  most common Unicode characters are encoded as one, two or three UTF-8 bytes

number of bits      code points (hex)   byte 1     byte 2     byte 3     byte 4    (bytes in binary)
7                   00 - 7F             0xxxxxxx
11 (5 + 6)          80 - 7FF            110xxxxx   10xxxxxx
16 (4 + 6 + 6)      800 - FFFF          1110xxxx   10xxxxxx   10xxxxxx
21 (3 + 6 + 6 + 6)  10000 - 10FFFF      11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

number of leading 1s encodes the length in bytes
U+2605 (★) = 0010 0110 0000 0101 (binary) = 11100010 10011000 10000101 (UTF-8, binary) = E2 98 85 (hex)
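The table translates almost line for line into code. The sketch below is an illustration rather than a complete encoder: the function name `utf8_encode` is an assumption, and it does not reject the surrogate range U+D800 to U+DFFF, which a strict encoder would refuse.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as 1 to 4 UTF-8 bytes."""
    if code_point < 0x80:        # 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:     # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x110000:    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("code point out of Unicode range")


print(utf8_encode(0x2605).hex(" ").upper())  # E2 98 85
```

The result can be compared against Python's own encoder, for example `"\u2605".encode("utf-8")`.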

information theory
how much information is carried by a message?
information is conveyed when you learn something new
if the message is predictable (e.g., a series of identical bits)
  minimum information content
  zero bits of information per bit transmitted
if the message is unpredictable (e.g., a series of random bits)
  maximum information content
  one bit of information per bit transmitted
if the probability of receiving a given bit (or symbol, or message) x is $P_x$, then
  information content of x = $\log_2 \left( \frac{1}{P_x} \right)$ bits

variable length encoding of known messages
messages often use only a few symbols from a code
e.g., in typical English text
  12.702% of letters are e
  0.074% of letters are z
considering their probability in English text
  e conveys few bits of information ($\log_2(1/0.12702) = 2.98$ bits)
  z conveys many bits of information ($\log_2(1/0.00074) = 10.4$ bits)
and yet they are both encoded using 8 bits of UTF-8 (or ASCII)
if we know the message in advance we could construct an encoding optimised for it
  make high-probability (low information) symbols use few bits
  make low-probability (high information) symbols use more bits
this is the goal of source coding, or compression
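The information content of a symbol follows directly from its probability. A minimal sketch, using the letter frequencies quoted on this slide (the function name `information_content` is an assumption):

```python
from math import log2


def information_content(probability: float) -> float:
    """Information content, in bits, of an event with the given probability."""
    return log2(1 / probability)


print(round(information_content(0.12702), 2))  # 2.98  (letter e)
print(round(information_content(0.00074), 1))  # 10.4  (letter z)
print(information_content(0.5))                # 1.0   (one fair random bit)
```

A fixed 8-bit code therefore spends about five bits too many on every e and about two bits too few on every z, relative to the information each one carries.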

binary encodings as decision trees
follow the path from the root to the symbol in the decision tree
  a 0 means go left and a 1 means go right
[figure: decision tree for 8-bit codes, starting at the root, with subtrees for digits and punctuation (00-3F), capital letters (40-5F) and UTF-8 multibyte characters (80-F7); the lower-case subtree shows 60-63, then leaves ... d e f g ... x y z { ..., then 7C-7F, with e at 65 (hex) and z at 7A (hex)]
for English, we really want e to have a shorter path (be closer to the root) than z
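In a fixed-length 8-bit code the path to every symbol is simply its eight bits, read from the most significant end. The sketch below spells the path out; the function name `path_to` is an assumption, and it only handles single-byte (ASCII) symbols.

```python
def path_to(symbol: str) -> list[str]:
    """List the left/right decisions from the root to an ASCII symbol."""
    bits = format(ord(symbol), "08b")  # the 8-bit code, most significant bit first
    # A 0 means go left and a 1 means go right.
    return ["left" if bit == "0" else "right" for bit in bits]


print(format(ord("e"), "08b"), path_to("e"))
# 01100101 ['left', 'right', 'right', 'left', 'left', 'right', 'left', 'right']
print(len(path_to("e")), len(path_to("z")))  # 8 8
```

Both letters sit eight decisions below the root, regardless of how often each occurs.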

Huffman coding
Huffman codes are variable-length codes that are optimal for a known message
  the message is analysed to find the frequency (probability) of each symbol
  a binary encoding is then constructed, from least to most probable symbols
create a list of subtrees containing just each symbol, labeled with its probability
repeatedly
  remove the two lowest-probability subtrees s, t
  combine them into a new subtree p
  label p with the sum of the probabilities of s and t
  insert p into the list
until only one subtree remains
the final remaining subtree in the list is an optimal binary encoding for the message
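The procedure above can be sketched with a priority queue standing in for the list of subtrees. Several details here are assumptions made for illustration: the function name `huffman_codes`, the use of symbol counts instead of probabilities (the ordering is the same), and the way ties are broken. Different tie-breaking can give different, equally short, code tables.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(message: str) -> dict[str, str]:
    """Build a Huffman code table (symbol -> bit string) for a known message."""
    tie_breaker = count()  # keeps heap entries comparable when weights are equal
    # Each entry is (weight, tie, tree); a tree is a symbol or a (left, right) pair.
    heap = [(w, next(tie_breaker), s) for s, w in Counter(message).items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:     # a single distinct symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, s = heapq.heappop(heap)  # the two lowest-weight subtrees
        w2, _, t = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie_breaker), (s, t)))

    codes: dict[str, str] = {}

    def walk(tree, prefix: str) -> None:
        if isinstance(tree, tuple):     # internal node: 0 = left, 1 = right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                           # leaf: record the path as the code
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


table = huffman_codes("googling")
encoded = "".join(table[ch] for ch in "googling")
print(table)
print(len(encoded), "bits")  # 18 bits
```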

Huffman coding example
known message: googling

f (frequency)  symbol   P(s)  information
1              l, i, n  1/8   3.0 bits
2              o        2/8   2.0 bits
3              g        3/8   1.42 bits

repeatedly combine the two lowest-probability subtrees
[figure: the list of subtrees after each of steps 1 to 5, with weights 1 1 1 2 3, then 1 2 2 3, then 2 3 3, then 3 5, then 8; followed by the code table for n, o, l, i, g and the encoded message]
message length: 64 bits (UTF-8) vs. 18 bits (encoded)
compression ratio: 64/18 = 3.6
actual information: $3 \times 3.0 + 2 \times 2.0 + 3 \times 1.42 = 17.26$ bits
encoding optimal? $\lceil 17.26 \rceil = 18$
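The figures in this example can be checked numerically. The sketch below recomputes the total information content of the message from its symbol counts; the function name `total_information` is an assumption. Without rounding each symbol's information to two decimals the total comes out near 17.25 rather than 17.26, which does not change the conclusion.

```python
from collections import Counter
from math import ceil, log2


def total_information(message: str) -> float:
    """Sum of the information content, in bits, of every symbol in the message."""
    counts = Counter(message)
    n = len(message)
    # A symbol seen c times has probability c/n and carries log2(n/c) bits,
    # and it contributes that amount once for each of its c occurrences.
    return sum(c * log2(n / c) for c in counts.values())


message = "googling"
info = total_information(message)
utf8_bits = 8 * len(message.encode("utf-8"))

print(utf8_bits)                # 64
print(ceil(info))               # 18: no whole number of bits below this can suffice
print(round(utf8_bits / 18, 1)) # 3.6
```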

common source coding algorithms
zip
  Huffman coding with a dynamic encoding tree
  tree updated to keep most recent common byte sequences efficiently encoded
video
  MPEG-2 (from the Moving Picture Experts Group), for DVDs
  mp4, or MPEG-4, for web video
  avi (Audio Video Interleave), wmv (Windows Media Video), flv (Flash Video)
lossy audio (decoded audio has degraded quality; the discarded detail is permanently lost)
  mp3, or MPEG-1 Audio Layer III
  AAC (Advanced Audio Coding), wma (Windows Media Audio)
lossless audio (decoded audio is identical to the original)
  ALAC (Apple Lossless Audio Codec)
  FLAC (Free Lossless Audio Codec)
(CODEC = coder-decoder, a program that both encodes and decodes data)
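Lossless compression can be tried out with Python's standard `zlib` module, which implements DEFLATE, the method used in most zip files (it combines Huffman coding with matching of repeated byte sequences). The sample data below is an arbitrary choice made for illustration.

```python
import zlib

# Highly repetitive input: predictable, so it carries little information.
data = ("googling " * 100).encode("utf-8")

packed = zlib.compress(data)
restored = zlib.decompress(packed)

print(len(data), "bytes before,", len(packed), "bytes after")
print(restored == data)  # True: lossless, the original is recovered exactly
```

Replacing `data` with random bytes (for example from `os.urandom`) should give almost no reduction, since unpredictable data has close to one bit of information per bit.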

next week
coding theory
  channel coding
information theory concept
  Hamming distance
error detection
  motivation
  parity
  cyclic redundancy checks
error correction
  motivation
  block parity
  Hamming codes

homework
practice converting between binary and Gray code
  write a Python program to do it
practice encoding and decoding some UTF-8 characters
  write a Python program to do it
practice constructing Huffman codes
  write a Python program to do it
ask about anything you do not understand
  it will be too late for you to try to catch up later!
  I am always happy to explain things differently and practice examples with you

glossary
ASCII: American Standard Code for Information Interchange. A 7-bit encoding of the western alphabet, digits, and common punctuation symbols. Almost always stored one character per 8-bit byte.
binary code: a code that assigns a pattern of binary digits to represent each distinct piece of information, such as each symbol in an alphabet.
channel coding: the theory of information encoding for the purpose of protecting it against damage.
compression: an efficient encoding of information that reduces its size when stored or transmitted.
cryptography: an encoding of information for the purpose of protecting it from unauthorised access.
encoding: a mapping of symbols or messages onto patterns of bits.
fixed length: an encoding in which every symbol is encoded using the same number of bits.
Gray code: an encoding of numbers with the property that adjacent values differ by only one bit.
information content: the theoretical number of bits of information associated with a symbol or message.
minimum-change: an encoding of symbols in which adjacent codes differ by only one bit.

path: a route through a tree from the root to some given node.
root: the topmost node in a tree.
source coding: the theory of information encoding for the purpose of reducing its size.
tree: a hierarchical structure representing parent-child relationships between nodes.
Unicode: a numbering of all symbols used in modern and ancient written languages, and other written systems of representation such as music.
UTF-32: a fixed length encoding of Unicode in which each character's number is encoded directly in a 32-bit integer.
UTF-8: a variable length encoding of Unicode in which each character's number is encoded in one, two, three or four bytes.
UTF: Unicode Transformation Format. A family of binary encodings for Unicode characters.
variable length: an encoding in which different symbols are encoded using different numbers of bits.