CPSC 301: Computing in the Life Sciences Lecture Notes 16: Data Representation


George Tsiknis
University of British Columbia, Department of Computer Science
Winter Term 2, 2015-2016
Last updated: 04/04/2016 12:04 PM
Original slides by Ian M. Mitchell; revisions/updates by George Tsiknis

Outline
  Representations for simple data
    Binary representations of numbers
    ASCII and Unicode representations for characters
  Other representations
    XML
    Documents
    Compression
    Images
  Python pickling: storing data in files

Objectives
At the end of this section, you will be able to:
  Explain the binary representation of positive integers
  Recognize that similar representations are used for negative integers, memory addresses, floating point numbers
  Explain the ASCII / UTF-8 representation of characters
  Explain how strings are compared by Python
  Explain what compression is and the difference between lossy and lossless compression
  List some examples of lossless and lossy compression formats
  List some examples of document formats
  List some examples of image formats, and explain situations where one type of format might be preferred over another
  Explain the meaning of "pickling" in Python

Representing Numbers: Integers
  Any information stored in a computer is represented by streams of 0s and 1s called bits
    Information is stored as voltage, light, magnetic field, etc.
    For instance, 0 could mean high voltage and 1 low voltage
  Bits are grouped in groups of 8, called bytes
  A number is represented by its binary value
    Similar to our standard counting system (decimal or base 10), but now each digit is either 0 or 1 (binary or base 2)
    The decimal system uses a base of 10 and ten digits (0, 1, 2, ..., 9)
    The binary system uses a base of 2 and two digits (0 and 1)
  In the decimal system, the value of each position is a power of 10
    e.g. 105 in decimal is $1 \times 10^2 + 0 \times 10^1 + 5 \times 10^0$
  In the binary system, the value of each position is a power of 2
    105 in binary is 1101001: $1 \times 2^6 + 1 \times 2^5 + 0 \times 2^4 + 1 \times 2^3 + 0 \times 2^2 + 0 \times 2^1 + 1 \times 2^0$

Representing Numbers (cont'd)
  To convert the decimal 105 to binary, keep dividing by 2 and record the remainders:
    105 / 2 = 52 rem 1
     52 / 2 = 26 rem 0
     26 / 2 = 13 rem 0
     13 / 2 =  6 rem 1
      6 / 2 =  3 rem 0
      3 / 2 =  1 rem 1
      1 / 2 =  0 rem 1
  Now read the remainders from the bottom up and add 0s in front to reach the length you want
    For instance, 105 in 8 bits is 01101001
  A regular Python 2 integer is at least 32 bits (4 bytes)
    A 32-bit integer can hold the numbers from -2147483648 to 2147483647
  Python is happy to store an integer with larger magnitude
    In Python 2, these were called "long integers"
    In Python 3, the distinction is hidden from the user
    In other languages, "overflow" often causes bugs in integer arithmetic
  Negative numbers are represented with two's complement
    For instance, the negative number -k is represented in n bits by the number $2^n - k$
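The repeated-division procedure and the two's-complement rule can be sketched in a few lines of Python. The function names to_binary and twos_complement are made up for this illustration and are not part of the course material; Python's built-in format(n, "08b") does the same conversion.

```python
def to_binary(n, width=8):
    """Binary digits of a non-negative integer, padded with leading 0s to `width`."""
    if n < 0:
        raise ValueError("n must be non-negative")
    remainders = []
    while n > 0:
        n, rem = divmod(n, 2)       # keep dividing by 2 and record the remainder
        remainders.append(str(rem))
    digits = "".join(reversed(remainders)) or "0"   # read remainders bottom-up
    return digits.zfill(width)


def twos_complement(k, n_bits=8):
    """n-bit two's-complement pattern for -k, i.e. the binary form of 2**n_bits - k."""
    if not 0 < k <= 2 ** (n_bits - 1):
        raise ValueError("-k does not fit in n_bits")
    return to_binary(2 ** n_bits - k, n_bits)


print(to_binary(105))        # 01101001
print(twos_complement(105))  # 10010111
```

Inverting every bit of 01101001 and adding 1 gives the same pattern, 10010111, which is the usual shortcut for two's complement.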

Approximating Real Numbers
  Most non-integer numbers cannot be exactly represented by a computer
  The most common approach to representing such numbers is called floating point, which is similar to scientific notation $\pm c \times 10^q$
    Written as ±ceq, where typically -10 < c < +10 and q is an integer
    For example: 1.496e11 denotes $1.496 \times 10^{11}$
  Stored internally using the IEEE floating point standard
    Either 32 (single precision) or 64 (double precision) bits divided into three parts: a sign, a mantissa or coefficient (the part before "e") and an exponent (the part after "e")
    Python float corresponds to IEEE double precision, which can represent magnitudes in roughly the range $10^{-308}$ to $10^{308}$, with either sign, and about 16 digits of precision
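A short sketch of what "cannot be exactly represented" means in practice. It assumes a standard CPython build where float is an IEEE double; the variable names are chosen only for this example.

```python
import math
import struct
import sys

distance = 1.496e11                  # scientific notation: 1.496 x 10^11
packed = struct.pack(">d", distance) # raw IEEE double: 8 bytes = 64 bits

total = 0.1 + 0.2
exact_match = (total == 0.3)         # False: 0.1, 0.2 and 0.3 are all approximations
close = math.isclose(total, 0.3)     # True: compare floats with a tolerance instead

print(len(packed) * 8)               # 64
print(exact_match, close)            # False True
print(sys.float_info.max)            # largest float, roughly 1.8e308
print(sys.float_info.dig)            # decimal digits that always survive a round trip
```

Comparing floats with math.isclose rather than == is the usual way to live with the rounding error.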

Memory Addresses
  Just a positive integer
  Always of fixed size on a given machine (either 4 or 8 bytes on current machines)
  Often displayed in hexadecimal (base 16) notation to distinguish them from regular integers
    Hexadecimal numbers are usually displayed with a "0x" at the beginning to distinguish them from integers or strings; for example, 0x025D90F0
  Not an explicit data type in Python, but implicit in (almost) every reference to an object
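Hexadecimal is only a different way of writing the same integer, as this small sketch shows. The last two lines rely on a CPython implementation detail (id() returning the object's memory address), which other Python implementations are not required to follow.

```python
address = int("0x025D90F0", 16)        # parse the hexadecimal text into an ordinary integer
shown = format(address, "#010x")       # back to hex: "0x" prefix, zero-padded to 8 hex digits

print(address == 0x025D90F0)           # True: hex literals are plain integers
print(shown)                           # 0x025d90f0

obj = [1, 2, 3]
obj_id = id(obj)                       # in CPython this is the object's memory address
print(hex(obj_id))                     # value differs from run to run
```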

Representing Characters
  Computers can only store numbers
    Characters are represented by assigning a number to each character
    There are a number of coding schemes
  American Standard Code for Information Interchange (ASCII) is the oldest coding scheme for English
    Developed in the 1960s for teletype machines
    Uses one byte per character, but only 7 of the bits are used for the character
    Numbers 0 to 31 and 127 are control characters such as line feed (\n), tab (\t) and many obsolete codes
    Numbers 32 to 126 are the standard characters used in English (letters, numbers, punctuation)

Representing Characters (cont'd)
  128 different characters can be represented using ASCII
  Examples of ASCII characters:
    0 ===> 00110000 (48)
    9 ===> 00111001 (57)
    A ===> 01000001 (65)
    Z ===> 01011010 (90)
    a ===> 01100001 (97)
    z ===> 01111010 (122)
  Appendix I has a table with the ASCII codes
  Example: The string "CPSC 301!" will be represented with 9 bytes containing the numbers:
    67 80 83 67 32 51 48 49 33
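Python's built-in ord() and chr() expose these character codes directly. The sketch below also shows the consequence for string comparison: Python compares strings code by code, so every uppercase letter sorts before every lowercase letter. The sample words are arbitrary.

```python
text = "CPSC 301!"
codes = [ord(ch) for ch in text]          # character -> number
encoded = text.encode("ascii")            # the 9 bytes that would be stored

print(codes)                              # [67, 80, 83, 67, 32, 51, 48, 49, 33]
print(len(encoded))                       # 9
print(format(ord("z"), "08b"))            # 01111010
print(chr(65))                            # A

# Strings are compared one character code at a time
uppercase_first = "Zebra" < "apple"       # True, because ord("Z") = 90 < ord("a") = 97
print(uppercase_first)
print(sorted(["pear", "Apple", "apple"])) # ['Apple', 'apple', 'pear']
```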

Other Encodings: Unicode, UTF-8, UTF-16
  Lack of a standard encoding for all languages seriously complicates development of internationally used software
  The newer Unicode standard includes more than 107,000 characters in 90 scripts
    Includes a variety of different encodings: UTF-8, UTF-16, ...
  UTF-8 (Unicode Transformation Format) has widely replaced ASCII
    Uses 1 to 4 bytes for each character
    The first 128 characters of UTF-8 are exactly the same as US-ASCII, and it is hence backward-compatible
  In Python 3, all strings are Unicode strings
  In Python 2
    Regular strings use the computer's local encoding
    Unicode strings (preceded by the letter u) permit access to any Unicode character: u'äöü'
    Functions are available to translate between different encodings
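A brief Python 3 sketch of the "1 to 4 bytes" rule and of ASCII compatibility, using str.encode and bytes.decode. The sample characters are arbitrary choices for illustration.

```python
samples = ["A", "ä", "€", "\U0001F600"]   # Latin letter, umlaut, euro sign, an emoji
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)                       # [1, 2, 3, 4]

# Pure ASCII text produces identical bytes under both encodings
same_bytes = "CPSC 301!".encode("utf-8") == "CPSC 301!".encode("ascii")
print(same_bytes)                         # True

# Translating between encodings: decode the bytes, then encode again
utf8_bytes = "äöü".encode("utf-8")        # 6 bytes, 2 per character
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")
print(len(utf8_bytes), len(utf16_bytes))  # 6 6
```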

Stupid Newline!
  ASCII was defined for teletypes and has two control characters related to line control
    Line feed (LF, \n or ASCII code 10), which told the paper controller to move the paper down one line
    Carriage return (CR, \r or ASCII code 13), which told the printer head to move back to the left side of the page
  Unfortunately, computer operating system designers chose different ways of representing the end of a line in a file
    Linux, Mac OS X and later, Unix and others chose LF
    Windows and others chose CR+LF
    Mac OS 9 and earlier and others chose CR
  So when you transfer files between different operating systems, the line breaks may not appear correctly
  Python (mostly) adopts \n internally, and converts to the local OS representation when text files are written
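The conversion Python performs can be observed by writing raw bytes with mixed line endings and reading them back in text mode. This is a sketch for Python 3; the file name is a placeholder, and the newline="" argument is the documented way to switch the translation off.

```python
import os
import tempfile

raw = b"line one\r\nline two\rline three\n"   # CR+LF, CR and LF endings mixed together

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mixed_newlines.txt")   # placeholder file name
    with open(path, "wb") as f:                      # binary mode: bytes written as-is
        f.write(raw)
    with open(path, "r", encoding="ascii") as f:     # text mode: every ending becomes \n
        translated = f.read()
    with open(path, "r", encoding="ascii", newline="") as f:  # translation disabled
        untranslated = f.read()

print(repr(translated))     # 'line one\nline two\nline three\n'
print(repr(untranslated))   # 'line one\r\nline two\rline three\n'
```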

Document Formats
  Plain text
  Portable Document Format (PDF)
    Developed by Adobe Systems, now an ISO standard
    Has features to handle fonts, pictures and other content and potentially compress the result
  Office Open XML
    Developed by Microsoft, now an ISO standard
    An XML-based representation including both word processing and spreadsheet data, and compression
  Other formats: Rich Text Format (RTF), MS binary formats (Word 1997-2003), TeX & LaTeX, OpenDocument, PostScript

Extensible Markup Language (XML)
  A standard set of rules for encoding documents
    A document is a string of Unicode characters
    A document contains markup and content
    Markup data are strings of the form <...> (called tags) or &...;
    Content is surrounded by start and end tags that identify the meaning of that element of the content
  XML is a very general standard
    For specific data storage tasks, a schema is defined which specifies the types and meaning of markup tags
    Often given in Document Type Definitions (DTDs), XML Schema Definitions (XSDs) or some other languages
    Example: the BLAST results from NCBI are stored in an XML format that NCBI has defined
  XML is often used for data exchange between devices (e.g. a mobile application and a server)
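As a small illustration of tags, content and &...; markup, the sketch below parses an XML string with Python's standard xml.etree.ElementTree module. The tag names (hits, hit, organism, score) and the values are invented for this example and are not NCBI's actual BLAST schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tag names and values are invented for illustration
document = """<hits>
  <hit id="1"><organism>Homo sapiens</organism><score>98.5</score></hit>
  <hit id="2"><organism>Mus musculus &amp; others</organism><score>87.0</score></hit>
</hits>"""

root = ET.fromstring(document)          # parse the markup into a tree of elements
records = [
    (hit.get("id"), hit.findtext("organism"), float(hit.findtext("score")))
    for hit in root.findall("hit")      # each <hit> ... </hit> element
]

for record in records:
    print(record)
# ('1', 'Homo sapiens', 98.5)
# ('2', 'Mus musculus & others', 87.0)
```

The parser turns the &amp; markup back into a plain "&" character in the content.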

Data Compression
  Store data using fewer bits, typically by taking advantage of repetition in the data
    For example: compress the string "Here are some letter j: jjjjjjjjj. Here are some letter k: kkkkkkkkk."
    Advantages: fewer bits to store / transmit
    Disadvantages: extra computational cost; removing redundancy may increase sensitivity to errors
  Lossless compression requires that the original data be recreated exactly
    Used for generic computer data, such as documents and programs, where every bit matters
    Examples: zip, gzip, bzip2, 7-zip, compress, png, gif, tiff, ...
  Lossy compression permits small errors in order to achieve higher compression ratios
    Used for pictures, movies & audio where the perceived quality of the result suffers little from the error
    Examples: jpeg, mpeg, mp3, ...
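Lossless compression can be tried out with Python's standard zlib module, which uses the same algorithm family as zip and gzip. In this sketch the example string is repeated 20 times, an assumption made only so that the savings are easy to see.

```python
import zlib

sentence = "Here are some letter j: jjjjjjjjj. Here are some letter k: kkkkkkkkk. "
original = (sentence * 20).encode("ascii")   # repeated to make the redundancy obvious

compressed = zlib.compress(original)         # fewer bytes to store / transmit
restored = zlib.decompress(compressed)       # extra computation to get the data back

print(len(original), len(compressed))        # compressed size is far smaller
print(restored == original)                  # True: lossless means an exact round trip
```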

Image Formats
  Vector formats
    Image is composed of geometric primitives (points, lines, curves, circles, etc.) with mathematical descriptions which can be examined at arbitrary resolution
    Supported by the Scalable Vector Graphics (svg) format, as well as inside the pdf document format and for modern fonts (such as TrueType and PostScript)
  Raster formats
    Image is composed of an array of pixels and has fixed resolution
    Pixels typically have an integer colour (or a tuple of integer colours)
    Supported by bmp, tiff, gif, jpeg, png, ...
    Lossy compression: jpeg
    Lossless compression: gif, png
  There are situations where vector is better, and situations where raster is better. Basic rules:
    For natural images (e.g. photographs) use jpeg
    For synthesized images (e.g. plots) use png (or possibly svg)

Representing More Complex Data
  Representations of the built-in complex data types:
    Conceptually, everything can be represented as a dictionary
    A dictionary acts like a table of key-value pairs, but it is very efficiently implemented
    In practice, Python has more efficient implementations of commonly used data structures like lists and files
    Python's built-in types are easy to use because they hide the implementation details from the user
  Representation of user-defined classes:
    A well-designed class (or hierarchy of classes) likewise hides the data's representation and uses methods to access the data
    For example: you do not need to know that Biopython's Seq class stores the sequence data as a string; you just need to know how to call the methods transcribe() and translate()

Storing Complex Data in Files
  Programmers often need to store complex data in a file and retrieve it later
    We call these data persistent data because they can exist after the program terminates and can be used again by the same or other programs
  Files store bytes. All you need to do is define what those bytes mean by defining a file format or storage protocol
    Specifies how characters / bytes in the file should be interpreted
    Examples: the FASTA and GenBank formats we have seen
  Python has a simple but powerful procedure to convert an object into a series of bytes and store it in a file
    This process is called object serialization (or pickling in Python's jargon)
    Another procedure can read this series of bytes from the file and recreate the original object (unpickling)
    The Python module that performs serialization is called pickle

Python's pickle Module
  Converts Python objects into byte strings which can then be written to and read from files
    Some restrictions on user-defined objects
    Only accessible through Python
  To save objects to a file we can do the following:
      import pickle
      object1 = ...                 # create one or more objects
      object2 = ...
      f = open(filename, "wb")      # open the file for writing (binary)
      pickle.dump(object1, f)       # use pickle's dump function to write
      pickle.dump(object2, f)       # objects (one at a time) to the file
      f.close()                     # close the file

Python's pickle Module (cont'd)
  To get back the objects we have stored in a file using pickle we can do the following:
      import pickle
      f = open(filename, "rb")      # open the file for reading (binary)
      # use pickle's load to get the objects back in the same order
      object1 = pickle.load(f)
      object2 = pickle.load(f)
      ...
      f.close()                     # close the file
  For an example see files pickle_towns.py and unpickle_towns.py
  This process is also called serialization since it turns the data into a long serial sequence of bytes
  Another way to serialize data is by using an XML format
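The two steps can be combined into one self-contained round trip. This is a minimal sketch, not the course's pickle_towns.py example: the sample objects and the file name objects.pkl are invented for illustration, and with blocks are used so the files are closed automatically.

```python
import os
import pickle
import tempfile

# Hypothetical sample objects, invented for this illustration
object1 = {"name": "sample", "values": [1, 2, 3]}
object2 = ("ACGT", 4)

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "objects.pkl")   # placeholder file name

    with open(filename, "wb") as f:               # write (binary)
        pickle.dump(object1, f)                   # note the order: object first, then file
        pickle.dump(object2, f)

    with open(filename, "rb") as f:               # read (binary)
        restored1 = pickle.load(f)                # objects come back in the order written
        restored2 = pickle.load(f)

print(restored1 == object1, restored2 == object2)   # True True
print(restored1 is object1)                          # False: an equal copy, not the same object
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.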

Summary
  All data in a computer is represented in binary by bits
    Different types of data interpret those bits in different ways
    Common examples of basic types are integers, memory addresses, floating point, and characters
    Incompatible representations of similar types have caused widespread frustration and failure
  Files / documents can be stored in many different formats
    Some formats are character based and some are binary
  Compression can considerably shrink the size of a file
    Can be used to offset the file size disadvantages of using character based formats
    Comes in lossy or lossless varieties
  Python has many ways of storing data
    Reading and writing character or binary files is built in
    Modules are available for most standard document formats
    Complex data types can be saved in binary format by pickling

Appendix I: ASCII Codes for Characters

000 NUL  001 SOH  002 STX  003 ETX  004 EOT  005 ENQ  006 ACK  007 BEL
008 BS   009 HT   010 NL   011 VT   012 NP   013 CR   014 SO   015 SI
016 DLE  017 DC1  018 DC2  019 DC3  020 DC4  021 NAK  022 SYN  023 ETB
024 CAN  025 EM   026 SUB  027 ESC  028 FS   029 GS   030 RS   031 US
032 SP   033 !    034 "    035 #    036 $    037 %    038 &    039 '
040 (    041 )    042 *    043 +    044 ,    045 -    046 .    047 /
048 0    049 1    050 2    051 3    052 4    053 5    054 6    055 7
056 8    057 9    058 :    059 ;    060 <    061 =    062 >    063 ?
064 @    065 A    066 B    067 C    068 D    069 E    070 F    071 G
072 H    073 I    074 J    075 K    076 L    077 M    078 N    079 O
080 P    081 Q    082 R    083 S    084 T    085 U    086 V    087 W
088 X    089 Y    090 Z    091 [    092 \    093 ]    094 ^    095 _
096 `    097 a    098 b    099 c    100 d    101 e    102 f    103 g
104 h    105 i    106 j    107 k    108 l    109 m    110 n    111 o
112 p    113 q    114 r    115 s    116 t    117 u    118 v    119 w
120 x    121 y    122 z    123 {    124 |    125 }    126 ~    127 DEL