Introduction to Computer Networking Chapter 1

1 Chapter 1 1 Introduction to Computer Networking Chapter 1 Prof. Jean-Yves Le Boudec ICA, EPFL CH-1015 Ecublens Leboudec@epfl.ch

2 Chapter 1 Goal of the introduction
Understand TCP/IP and networking terminology:
- layered model; transport protocol; TCP; UDP; IP; MAC
- IP addresses, MAC addresses, machine names, DNS, NetBIOS names, port numbers
- routers, bridges, servers
- client-server architecture; HTTP, FTP, SMTP
- multiplexing; protocol; connection
- a simple transport layer: UDP
- basic elements of transmission
Textbook: [Stevens] TCP/IP Illustrated, Volume I: The Protocols, W. Richard Stevens, Addison-Wesley (very detailed, experimental, hands-on description of TCP/IP; see also Volume III for HTTP).
This lecture requires that you buy a textbook on TCP/IP. The book I recommend as the best complement to these notes is mentioned on the slide. However, there are many good books on TCP/IP: feel free to choose another one.

3 Chapter 1 Network Services
Examples of network services: distributed database, Web (3), file transfer, remote login, news, talk, remote processing, resource sharing (file servers, printers, modems), network time, name service (2).
[Figure: (1) the user clicks; (2) a query goes to the name server, which answers with an IP address; (3) a GET is sent to the Web server, which returns data (an HTML page).]
In this lecture we study computer networks. We use a top-down approach, starting with socket programming. In this chapter we study the global picture, which will enable you to get started with writing your first programs. In the following chapters, we will study the various components (called layers) one by one. What are computer networks used for? Computer networks allow people and machines to communicate, using a number of services. The slide shows a small subset of services.

4 Chapter 1 Network Infrastructure
A computer network is made of:
- distributed applications: provide service to users on other machines, or to other machines; reside in computers
- network infrastructure: supports transport of data between the computers where distributed applications reside; is in computers (Ethernet card, modem + software) and in special network devices (bridges, routers, concentrators, switches)
Focus of this lecture = network infrastructure.
A computer network is made of two distinct subsets of components:
- Distributed applications are programs running on interconnected computers; a web server, a remote login server, an exchanger are examples. This is the visible part of what people call the Internet. In this lecture we will study the simplest aspects of distributed applications. More sophisticated aspects are the object of lectures called Distributed Systems and Information Systems.
- The network infrastructure is the collection of systems which are required for the interconnection of computers running the distributed applications. It is the main focus of this lecture.
The network infrastructure problem has itself two aspects:
- distance: interconnect systems that are too far apart for a direct cable to be possible
- meshing: interconnect systems together; even when systems are located close to each other, it is not possible in non-trivial cases to draw cables from all systems to all systems (combinatorial explosion, cable salad management problems).
The distance problem is solved by using a network, such as the telephone network with modems (see later). The meshing problem was originally solved easily because the terminals were not able to communicate with each other, but always had to go through a main computer. The mesh in such cases is reduced to a star network. Today this is solved by a complex set of bridges and routers.

5 Chapter 1 Physical Layer, Data Link Layer
[Figure: terminals T1, T2 and T3 connected to a mainframe computer by point-to-point cables; (1) "to T3: Hello" is sent by T1, (2) "From T1: Hello" is delivered to T3.]
- physical transmission = Physical function: bits <-> electrical / optical signals; transmit individual bits over the cable: modulation, encoding
- packet transmission = Data Link function: bits <-> frames; bit error detection; packet boundaries; in some cases: error correction by retransmission
- examples: modems, Ethernets
The objective of this and the following slides is to introduce the concept of layers. Like any complex computer system, a network is decomposed into functions. This decomposition is, to a large extent, stable: computer networking people have agreed on a reasonable way to divide the set of functions into what is called layers. The decomposition always assumes that the different components can be ordered such that one component interfaces only with two adjacent components. We call these components layers. We start with the simplest, and oldest, network example: a mainframe connected to terminals. In that case, there are mainly two functions:
- physical layer: translates bits into electromagnetic waves;
- data link layer: translates packets into bits.
These two functions are implemented on cables or on radio links. The physical layer has to do with signal processing and coding; it is the object of the lecture called Telecommunication. The data link layer has to do with bits and bytes; we will study the data link layer in this lecture.

6 Chapter 1 A Network
Network layer:
- set of functions required to transport packets end-to-end
- examples: IP, AppleTalk, IPX
- an intermediate system forwards data not destined to itself
[Figure: terminals T1 to T4 and machines M1, M2 interconnected through packet switches; a packet is labelled srce=T2, dest=M2, "to T3: hello".]
Modern networks have more than the physical and data link layers. The network layer is the set of mechanisms that can be used to send packets from one computer to another in the world. There are two types of networks:
- With packet switching, data packets can be carried together on the same link. They are differentiated by addressing information. Packet switching is the basis for all data networks today, including the Internet, public data networks such as Frame Relay or X.25, and even ATM.
- Circuit switching is the way telephone networks operate. A circuit emulates the physical signals of a direct end-to-end cable. When computers are connected by a circuit-switched network, they establish a direct data link over the circuit. This is used today for modem access to a data network. Modern circuit switches are based on byte multiplexing and are thus similar to packet switches, with the main difference that they perform non-statistical multiplexing (see what this means later in this chapter).
A network has intermediate systems (ISs): systems that send data to the next IS or to the destination. Using interconnected ISs saves cable and bandwidth. Intermediate systems are known under various terms depending on the context: routers (TCP/IP, AppleTalk), switches (X.25, Frame Relay, ATM, telephone), communication controllers (SNA), network nodes (APPN).

7 Chapter 1 Transport Layer
Why a transport layer?
- transport layer = makes the network service available to programs
- is end-to-end only, not in routers
- in TCP/IP there are two transport protocols:
  - UDP (User Datagram Protocol): unreliable; offers a datagram service to the application (unit of information is a message)
  - TCP (Transmission Control Protocol): reliable; offers a stream service (unit of information is a byte)
- an application uses UDP or TCP; it is the designer's choice
- use, for example, the socket API: a library of C functions
- socket also means (IP address, port number)
Physical, data link and network layers are sufficient to build a packet transport system between computers. However, this is not enough for the programmer. When you write a low-level program which uses the network (as we will do in this lecture), you do not handle packets, but data. The primary goal of the transport layer is to provide the programmer with an interface to the network. Second, the transport layer uses the concept of port. A port is a number which is used locally (on one machine) and identifies the source and destination of the packet inside the machine. We will come back to the concept of ports later in this chapter. The transport layer exists in two varieties: unreliable and reliable. The unreliable variety simply sends packets, and does not attempt to guarantee any delivery. The reliable variety, in contrast, makes sure that data does reach the destination, even if some packets may be lost from time to time.

8 Chapter 1 Protocol, service and other fancy definitions
- Peer entities: two (or more) instances of the same layer
- Protocol and PDU: the protocol is the rules of the game observed by peer entities; the data exchanged is called PDU (protocol data unit); there is one protocol (or more) at every layer
- Service and SDU: the service is the interface between a layer and the layer above; the interface data is called SDU (service data unit)
- Connection: a protocol is connection oriented if the peer entities must be synchronized before exchanging useful data; otherwise it is connectionless
A protocol is the formal definition of external behaviour for communicating entities. It defines:
- message formats
- expected actions (message sent, data delivered, abort)
Examples of protocols are TCP, UDP, IP and Ethernet. Protocols are connection oriented or connectionless. A connection exists if the communication requires some synchronization of all involved parties before communication can take place. The telephone system is connection oriented: before A can send some information to B, A has to call B (or vice versa) and say hello. The postal (mail) system is connectionless: if A wants to send some information to B, A can write a letter and mail it, even if B is not ready to read it. Networking functions are ordered in a layered model:
- layer n communicates with other layer n entities using the layer n protocol; the data units exchanged are called layer n PDUs (protocol data units)
- layer n uses the service of layer n-1 and offers a service to layer n+1
- entities at the same layer are said to be peer entities.

9 Chapter 1 Example: name resolution
[Figure: the user clicks (1); the name resolver on Host A sends a query to the name server on Host B and receives an answer (2). Protocol stacks: Host A has TCP/UDP, IP (network layer), data link (modem) and physical (twisted pair); router R1 has IP over a modem data link and a CSMA/CD data link; Host B has TCP/UDP, IP, data link (CSMA/CD) and physical (thin coax). P1 and P2 are observation points on either side of R1.]
Flow 2 illustrates the query/response protocol of the Domain Name System (DNS). The name resolver and the name server are two application programs, probably C programs using sockets. These programs use UDP, which is the unreliable transport protocol used in the Internet. Let us now apply the terminology to this example.
- The name resolver uses the UDP service: it creates a request to send data to the name server. The name server is identified by its IP address (for example: ). The name resolver also knows that the name server can be reached by means of port 53 (a well-known convention, used in the Internet). The SDU is the request, with the data.
- The transport PDU is called a datagram. It contains the data, the address and the port numbers. It is shown by 2 on the figure.
- UDP creates a request to IP to send data to the name server machine identified by its IP address. The network PDU is called an IP packet. It contains the UDP datagram plus the IP addressing information (and some other information, see later).
- IP creates a request to send a data frame over the modem. The modem card creates a data link PDU, called a modem frame. The frame contains the IP packet, maybe compressed.
- Then the data link layer requests transmission of the frame; the physical layer SDU is a bit. The physical layer PDU is an electromagnetic signal.
- At the router the data frame is received and understood as an IP packet. IP reads the IP destination address ( ) and decides to forward it over its Ethernet interface. IP creates a request to send the data frame over the Ethernet. An Ethernet frame is created.

10 Chapter 1 An example with TCP
[Figure: the Web browser on Host A asks TCP to open a connection to :80; the Web server on Host B has done a passive open on port 80. The TCP entities exchange open (SYN), connect (SYN ACK) and connect ack (ACK), then the browser sends GET. The protocol stacks of Host A, router R1 and Host B, and the observation points P1 and P2, are the same as in the name resolution example.]
Here is a second example. A web browser always uses TCP for communication with a web server. The web browser starts by requesting from the transport layer the opening of a connection for reliable data transport. TCP opens a connection to the peer entity at the web server machine by exchanging 3 messages (a "3-way handshake"). If the connection can successfully be opened, then data can flow between the web client and server. TCP monitors missing packets and retransmits them as appropriate. The web browser and server can thus assume that they have a reliable data pipe between them, which transports data in sequence and without errors, at least as long as the TCP layer does not close the connection. TCP is connection oriented. What is shown is the connection setup phase. TCP uses IP, which is connectionless. UDP is connectionless. An observer at P1 or P2 would see the beginning of the message between web client and server only in the third data frame.
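
The passive open / active open pattern described above can be tried on a single machine. The following is a minimal sketch in Python rather than the C socket API mentioned in the notes; the function names `serve_once` and `fetch`, the loopback address and the reply text are illustrative assumptions, not part of the lecture.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Server side: accept one connection, read a request, reply, close."""
    conn, _addr = listener.accept()
    with conn:
        request = conn.recv(1024)
        conn.sendall(b"reply to: " + request)


def fetch(request: bytes) -> bytes:
    """Run a one-shot TCP server on loopback and query it as a client."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        # Passive open: bind to a port chosen by the OS and wait for clients.
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(listener,))
        worker.start()

        # Active open: connecting makes TCP perform the 3-way handshake.
        chunks = []
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(request)
            client.shutdown(socket.SHUT_WR)  # no more data from the client
            while True:
                chunk = client.recv(1024)
                if not chunk:  # the server closed the connection
                    break
                chunks.append(chunk)
        worker.join()
    return b"".join(chunks)


if __name__ == "__main__":
    print(fetch(b"GET /"))
```

The application code never sees SYN, SYN ACK or ACK: the handshake happens inside the transport layer, which is the point of the service interface.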

11 Chapter 1 What is Client Server?
- distributed applications use the client-server model
- server = program that is awaiting data to be sent to it
- clients send data to servers
[Figure: (1) the user clicks; (2) a query is sent over the Internet to the name server, which answers with an IP address; (3) a GET is sent to the Web server, which returns data (an HTML page).]
We use the terms client and server in the following sense. When two entities, say A and B, want to communicate, there is a bootstrap problem: how can you initialize both A and B such that the communication can take place? One solution is to manually start A, then B, but this defeats the purpose of networking. The only way we have found so far is to request that one of the two, say B, is started manually; then it immediately puts itself in a listening position. We say that B is a server. A system, such as A, which talks to B, is said to be a client. Being a server or a client is relative to a given protocol. For example, consider the application-level protocol called FTP (File Transfer Protocol). The FTP server is a machine which waits for other machines to send requests for logging in. When an FTP client has contacted an FTP server, there is normally a dialogue (change directory, etc.). Then the FTP client requests that a file be transferred from or to the server. FTP has been designed such that the FTP client then has to wait for the FTP server to open a connection back to the client (try it!). In that interaction, the FTP client is a TCP server, namely, a machine which waits for some other machine to open a TCP connection. In everyday life, most people use the term server to designate a machine whose main function is to be a server for some protocol: a name server, a file server, a news server...

12 Chapter 1 The TCP/IP Architecture
[Figure: protocol stacks with OSI layer numbers. Each host (= end-system) has the application layer (layers 5-7), transport layer (4), network layer (3), data link layer (2) and physical layer (1); the router (= intermediate system) has only the network, data link and physical layers.]
An architecture is a set of external behaviour specifications for a complete communication system. It describes the protocols, but not how to implement them. The TCP/IP architecture, or Internet architecture, is described by the collection of Internet standards, published in documents called RFCs (Requests for Comments), available (for example) from ftp://ftp.switch.ch/standard. The picture shows all the layers of the Internet architecture. There exists, inside every layer, a number of protocols which we will discover in this course. There exist other architectures, each of them having a different set of layers and names for layers. There are:
- proprietary architectures: SNA (IBM), DECnet (Digital), AppleTalk (Apple), XNS (Xerox), UUCP (Unix internal protocols), etc.
- the OSI architecture
- the ITU architecture, which defines public networks for telephony, telex, fax, data networks (X.25, Frame Relay, mail and directory services) and ATM
- the IEEE LAN architecture, which defines layers 1 and 2 for local area networks. We will see some details later.
Having several architectures is a nuisance; everything would be simpler if there were only one. Today, the TCP/IP architecture has become dominant, so this is the only one we will study in detail. The ITU architecture (Frame Relay and ATM) also plays an important role and we will study it at the end of the lecture.

13 Chapter 1 OSI Architecture
[Figure: the seven OSI layers. End-to-end layers: application, presentation, session, transport. Global layer: network. Local layers: data link (2) and physical (1).]
The OSI architecture defines protocols and service specifications. It is the official standard, similar to the TCP/IP architecture, but is not much implemented. However, the OSI model is used most frequently to describe all systems, including TCP/IP. Architectures do not interoperate by themselves at the protocol level. For example, the OSI transport protocols are not compatible with TCP or UDP. Worse, there is no compatibility at the service level, so it is not possible to take layer n of one architecture and put it on top of layer n-1 of some other architecture. There are fortunately exceptions to this statement. Layer interfaces where service compatibility is often implemented are:
- the data link layer
- the transport layer.
For example, it is possible to use various protocol families over the same local area network (LAN). The OSI presentation layer is in charge of hiding specific data representation formats. It defines ASN.1, an abstract, universal means for coding all types of data structures. ASN.1 has also become part of the TCP/IP architecture, in the application layer. The OSI session layer synchronizes events between end-systems, for example in order to support failure recovery. In TCP/IP, its functions are implemented across a number of application layer protocols and TCP.

14 Chapter 1 UDP: User Datagram Protocol
UDP uses port numbers.
[Figure: Host A (IP addr = A) runs processes pa, qa, ra, sa; Host B (IP addr = B) runs processes sb, rb, qb, pb; both hosts run UDP and TCP over IP and are connected by an IP network. A process at A uses UDP port 1267 and a process at B uses UDP port 53. The IP datagram consists of the IP header (SA=A, DA=B, prot=UDP) followed by the UDP datagram: UDP source port (1267), UDP destination port (53), UDP message length, UDP checksum, data.]
Let us have a closer look at UDP, the unreliable transport protocol used in the Internet. Two processes (= application programs), pa and pb, are communicating. Each of them is associated locally with a port, as shown on the figure. In addition, every machine (in reality: every communication adapter) has a network layer address called IP address (coded on 32 bits). The example shows a packet sent by the name resolver process at host A to the name server process at host B. The UDP header contains the source and destination ports. The destination port number is used to contact the name server process at B; the source port is not used directly; it will be used in the response from B to A. The UDP header also contains a checksum which verifies the UDP data plus the IP addresses and packet length. It is not performed by all systems.
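
The header layout listed above (source port, destination port, message length, checksum, each 16 bits) can be sketched with Python's `struct` module. The function names are illustrative assumptions; the checksum is simply left at 0 here.

```python
import struct

# Four unsigned 16-bit fields in network (big-endian) byte order.
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes,
                       checksum: int = 0) -> bytes:
    """Prefix the payload with a UDP header; length covers header + data."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def parse_udp_datagram(datagram: bytes):
    """Return (src_port, dst_port, length, checksum, payload)."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    return src_port, dst_port, length, checksum, datagram[UDP_HEADER.size:length]


if __name__ == "__main__":
    datagram = build_udp_datagram(1267, 53, b"query")
    print(parse_udp_datagram(datagram))
```

The ports 1267 and 53 are the ones shown on the slide; the payload `b"query"` is a placeholder, not a real DNS message.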

15 Chapter 1 Port Assignment
- Multiplexing is based on source and destination numbers called port numbers. Example: DNS query, source port = , dest port =
- Some ports are statically defined (well-known ports), e.g. DNS server port = 53
- Other ports are allocated on request from the application program (ephemeral ports), e.g. the client port for DNS queries
- The application-level protocol specifies the use of ports
- Examples of assigned ports: echo 7/UDP, discard 9/UDP, domain 53/UDP, talk 517/UDP, snmp 161/UDP, snmp-trap 162/UDP
Ports are 16-bit unsigned integers. They are defined statically or dynamically. Typically, a server uses a port number defined statically. Standard services use well-known ports; for example, all DNS servers use port 53. Ports that are allocated dynamically are called ephemeral. They are usually allocated from a range above the well-known ports. If you write your own client-server application on a multiprogramming machine, you need to define your own server port number and code it in your application (we will see how later).
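
Dynamic allocation can be observed directly: with the socket API, binding to port 0 asks the operating system to pick an ephemeral port. A minimal sketch, assuming a loopback address; the function name is hypothetical.

```python
import socket


def allocate_ephemeral_port() -> int:
    """Bind a UDP socket to port 0 and return the port the OS chose."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 = "any free port"
        return sock.getsockname()[1]


if __name__ == "__main__":
    print("ephemeral port:", allocate_ephemeral_port())
```

The value printed depends on the operating system and changes from run to run, which is why a server normally binds to a statically defined port instead.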

16 Chapter 1 The UDP service
- UDP service interface: one message, up to 64K; destination address, destination port; source address, source port
- UDP service is message oriented: delivers exactly the message or nothing; several messages might not be delivered in order
- UDP is used when TCP does not fit: short interactions; real time, multimedia; multicast
The UDP service interface uses the concept of message. If a machine A sends one or several messages to a machine B, then the messages are delivered exactly as they were submitted, without alteration, if they are delivered. Since UDP is unreliable and since some packet losses do occur, messages may silently be discarded. No one is informed; it is up to the application program to handle this. For example, the query/response protocol of DNS specifies that queries are sent k times until a response is received or a timer expires. If the timer expires, the name resolver will try another domain name server, if it is correctly configured. The UDP service does not go beyond one message. If two messages, M1 and M2, are sent one after the other from A to B, then in most cases B will receive M1 before M2 (if no message is lost). However, it is possible, though infrequent, that B receives first M2, then M1. UDP is used mainly for short transactions, or for real-time multimedia applications:
- DNS queries
- Network File System
- Remote Procedure Call (RPC)
- any distributed program that requires raw, unreliable transport
- IP telephony
It is also used by all applications which use multicast, since TCP does not support it.
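
Since UDP leaves loss handling to the application, a client typically resends its query until an answer arrives or it gives up. The sketch below illustrates that idea on loopback; it is not the actual DNS resolver algorithm. The server that ignores its first datagram simulates a loss, and all names, the timeout and the value of k are assumptions.

```python
import socket
import threading


def lossy_server(sock: socket.socket, drop_first: int) -> None:
    """Answer one query, after silently ignoring `drop_first` datagrams."""
    seen = 0
    while True:
        data, addr = sock.recvfrom(2048)
        seen += 1
        if seen > drop_first:
            sock.sendto(b"answer:" + data, addr)
            return


def query_with_retries(server_addr, query: bytes, k: int = 3,
                       timeout: float = 0.2):
    """Send the query up to k times; return (reply, attempts) or (None, k)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout)
        for attempt in range(1, k + 1):
            client.sendto(query, server_addr)
            try:
                reply, _addr = client.recvfrom(2048)
                return reply, attempt
            except socket.timeout:
                continue  # no answer in time: assume loss and resend
    return None, k


def demo(drop_first: int = 1):
    """Run the lossy server on loopback and query it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server:
        server.bind(("127.0.0.1", 0))
        worker = threading.Thread(target=lossy_server,
                                  args=(server, drop_first), daemon=True)
        worker.start()
        result = query_with_retries(server.getsockname(), b"name?")
        worker.join(timeout=1)
        return result


if __name__ == "__main__":
    print(demo(drop_first=1))
```

With one simulated loss, the answer arrives on the second attempt; neither side is told by UDP that the first datagram was ignored.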

17 Chapter 1 The UDP protocol
A checksum is used to verify:
- source and destination addresses
- port numbers
- the data
It is optional with IPv4, mandatory with IPv6.
It is based on the pseudo-header method:
- the checksum is computed over: UDP datagram + pseudo-header
- the checksum is put in the UDP header; the pseudo-header is not transmitted
[Figure: the pseudo-header (IP srce addr, IP dest addr, 0, prot, length) is followed by the entire UDP datagram, whose header contains the checksum field.]
Computation:
- set checksum = 0
- sum all 16-bit words (one's complement sum)
- checksum = 1's complement of the sum
- length = length of the UDP datagram
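
The pseudo-header method can be sketched as follows for IPv4. The protocol number 17 for UDP and the rule that a computed value of 0 is transmitted as 0xFFFF come from the UDP specification (RFC 768), not from the slide; the function names and the example addresses are illustrative assumptions.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """Sum 16-bit words with end-around carry (one's complement addition)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with one zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """IP source addr, IP dest addr, zero byte, protocol (17 = UDP), length."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 17, udp_length))


def udp_checksum(src_ip: str, dst_ip: str, udp_datagram: bytes) -> int:
    """Checksum of a datagram whose checksum field is currently 0."""
    data = pseudo_header(src_ip, dst_ip, len(udp_datagram)) + udp_datagram
    checksum = ~ones_complement_sum(data) & 0xFFFF
    return checksum or 0xFFFF  # 0 is reserved for "no checksum computed"


if __name__ == "__main__":
    datagram = struct.pack("!HHHH", 1267, 53, 8 + 5, 0) + b"query"
    print(hex(udp_checksum("10.0.0.1", "10.0.0.2", datagram)))
```

A receiver can check a datagram by summing the pseudo-header and the received datagram, checksum field included: the one's complement sum is 0xFFFF when nothing was altered.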

18 Chapter 1 Physical Layer 18 Layer 1 function is to transmit / receive a sequence of bits on electrical or optical system Bits are modulated into analog signal; examples of direct modulation used on Ethernet NRZ NRZI Manchester Differential Manchester more sophisticated modulations: pulse mode (optical signal) carrier modulation (modem) We see here some rudiments of transmission. The diagram shows some very primitive channel coding methods. They are used on short distances, for example with Ethernet and Token Ring.

19 Chapter 1 Bit Rates
Bit rate (French: débit binaire, German: Bitrate) of a transmission system = number of bits transmitted per time unit.
Units: b/s, kb/s = 1000 b/s, Mb/s = 1e+06 b/s, Gb/s = 1e+09 b/s.
Shannon-Hartley law: Cmax = B log2(1 + S/N), with B = bandwidth (Hz), S/N = signal-to-noise ratio. Example: telephone circuit: B = 3 kHz, S/N = 30 dB, Cmax ≈ 30 kb/s.
Practical bit rates:
- modem: 2.4 kb/s to 28.8 kb/s
- ISDN line: 144 kb/s to 2 Mb/s
- Ethernet: 10 Mb/s
- Token Ring: 4 Mb/s, 16 Mb/s
- FDDI: 100 Mb/s
- ATM: 2 Mb/s to 622 Mb/s
- SDH: 155 Mb/s to 2.4 Gb/s
Transmission time = time to send x bits at a given bit rate. Example: time to send 1 MB at 10 kb/s = ?
The bit rate of a channel is the number of bits per second. The bandwidth is the width of the frequency range which can be transmitted over the channel. Bandwidth is translated into bit rate by means of a given code. For example, NRZ and NRZI obtain 10 Mb/s with a bandwidth of 10 MHz. Manchester coding obtains 5 Mb/s with a bandwidth of 10 MHz. In general, information theory gives the maximum bit rate available under some modelling assumptions. The Shannon-Hartley law gives the maximum bit rate, for a given bandwidth, assuming the channel is a white noise channel. Many people confuse bandwidth and bit rate, but you should keep the distinction. The bit rate defines the transmission time.
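
The two formulas on this slide are easy to check numerically. This sketch assumes that S/N is given in decibels and converted to a linear ratio before applying the Shannon-Hartley law, and that 1 MB means 10^6 bytes; the function names are illustrative.

```python
import math


def shannon_capacity(bandwidth_hz: float, snr_db: float) -> float:
    """Maximum bit rate in b/s: B * log2(1 + S/N), with S/N given in dB."""
    snr_linear = 10 ** (snr_db / 10)  # 30 dB corresponds to a ratio of 1000
    return bandwidth_hz * math.log2(1 + snr_linear)


def transmission_time(n_bits: float, bit_rate: float) -> float:
    """Time in seconds needed to send n_bits at bit_rate (b/s)."""
    return n_bits / bit_rate


if __name__ == "__main__":
    # Telephone circuit from the slide: B = 3 kHz, S/N = 30 dB.
    print(f"Cmax = {shannon_capacity(3000, 30) / 1000:.1f} kb/s")
    # 1 MB (assumed 8e6 bits) at 10 kb/s.
    print(f"transmission time = {transmission_time(8_000_000, 10_000):.0f} s")
```

Under these assumptions the telephone circuit comes out just under 30 kb/s, and sending 1 MB at 10 kb/s takes 800 s.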

20 Chapter 1 Statistical and Non-statistical Multiplexing
- Multiplexing: several sources use the same link
- Statistical multiplexing: the bit rate is less than the sum of the incoming bit rates; may produce packet loss; requires congestion control
[Figure: sources T1 to T3 feed a packet switch over inputs 1 to 3; the switch has one output link, 4.]
Multiplexing means putting several sources on the same link. On a packet switch, the bit rate of the output (4) is often less than the sum of the bit rates of all inputs (1 to 3). There is a queue at the output; if several packets arrive at the same time, then only one of them is transmitted while the others have to wait. If nothing special is done, then once in a while the queue may overflow and packets are lost. This happens every day in the Internet. Special mechanisms, called congestion control, are required to prevent packet losses from happening too frequently. Congestion control is the object of the advanced lecture on networking. In contrast, with circuit switching, the bit rate of the outgoing circuit (4 on the picture) is at least equal to the sum of the bit rates of the incoming circuits (1 to 3). There is no loss of data. What is the value of statistical multiplexing? Well, economy. Most of the time, sources are not active, so circuit switching tends to be a waste.

21 Chapter 1 Propagation
Propagation between A and B = time for the head of the signal to travel from A to B.
[Figure: time diagram; A sends at times t0, t1, ..., tn and B receives at times s0, s1, ..., sn.]
- si - ti = D (propagation delay)
- D = d / c, where d = cable length, c = signal celerity
- copper: c = 2.3e+08 m/s; glass: c = 2e+08 m/s
- example: earth round trip in fiber: D = 0.2 s
- time through circuits also adds to propagation delays
- Lausanne - Brest over an acoustic channel: D = ???
Propagation is the time taken by the front of a signal to reach the destination. It is independent of the bit rate. An electromagnetic signal propagates at the speed (also called celerity) of light, which depends on the wavelength and on the medium in which the signal is propagating. Acoustic waves move at ca. 300 m/s. What is the propagation time if we use an acoustic phone system between two cities which are 1000 km apart?
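
The relation D = d / c can be evaluated directly. The sketch below uses the celerity values quoted on the slide and assumes an earth circumference of about 40 000 km for the fiber example; the function name is illustrative.

```python
def propagation_delay(distance_m: float, celerity_m_per_s: float) -> float:
    """Propagation delay D = d / c, in seconds."""
    return distance_m / celerity_m_per_s


if __name__ == "__main__":
    # Earth round trip in fiber (assumed 40 000 km, c = 2e8 m/s).
    print(propagation_delay(40_000e3, 2e8), "s")
    # 1000 km over an acoustic channel at about 300 m/s.
    print(propagation_delay(1_000e3, 300), "s")
```

The acoustic case gives roughly 3333 s, close to an hour, against 0.2 s for the trip around the earth in fiber.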

22 Chapter 1 Examples
At time 0, computer A sends a packet of size 1000 bytes to B; at what time is the packet received by B in each of the following cases?
distance: 20 km | km | 2 km | 20 m
bit rate: 10 kb/s | 1 Mb/s | 10 Mb/s | 1 Gb/s
1-way propagation: ?
transmission: ?
reception time: ?
Compute the values for these examples and try to find scenarios where they apply. Reflect on the results.
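
One way to work through these cases is to add the one-way propagation delay to the transmission time. The sketch assumes c = 2e8 m/s (the value given for glass), no processing time, and pairs each distance with the bit rate listed under it, which is an assumption about how the table reads; only cases whose distance is legible are shown.

```python
def reception_time(distance_m: float, bit_rate: float,
                   packet_bytes: int = 1000, celerity: float = 2e8):
    """Return (propagation, transmission, reception time) in seconds."""
    propagation = distance_m / celerity
    transmission = packet_bytes * 8 / bit_rate
    # The last bit leaves A at `transmission` and needs `propagation` to arrive.
    return propagation, transmission, propagation + transmission


if __name__ == "__main__":
    cases = [(20_000, 10e3), (2_000, 10e6), (20, 1e9)]
    for distance, rate in cases:
        prop, trans, total = reception_time(distance, rate)
        print(f"{distance} m at {rate:.0e} b/s: "
              f"prop={prop:.2e} s, trans={trans:.2e} s, total={total:.2e} s")
```

Comparing the two terms shows which one dominates in each case: transmission at low bit rates, propagation when the bit rate is high relative to the distance.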

23 Chapter 1 Throughput
Throughput (American: thruput, French: débit utile, German: Durchsatz) for a transmission system or a communication flow = number of useful data bits / time unit.
Units: b/s, kb/s, Mb/s.
Example 1: PCM voice (8 kHz, 8 bits per sample -> 64 kb/s): throughput = 64 kb/s.
Example 2: stop and go protocol.
The throughput defines how much data can be moved per time unit. It is equal to the bit rate if there is no protocol (example 1). However, in most practical cases, the throughput is less than the bit rate for two reasons:
- protocol overhead: protocols like UDP use some bytes to transmit protocol information. This reduces the throughput. If you send one-byte messages with UDP, then for every byte you create an Ethernet packet of size = 53 bytes; thus the maximum throughput you could ever get at the UDP service interface if you use a 64 kb/s channel would be 1.2 kb/s.
- protocol waiting times: some protocols may force you to wait for some event, as we show on the next page.
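
The overhead figure quoted above follows from scaling the bit rate by the fraction of each packet that is useful data. A minimal sketch, with an illustrative function name:

```python
def useful_throughput(bit_rate: float, payload_bytes: int,
                      packet_bytes: int) -> float:
    """Throughput in b/s when only payload_bytes of each packet are useful."""
    return bit_rate * payload_bytes / packet_bytes


if __name__ == "__main__":
    # One useful byte per 53-byte packet on a 64 kb/s channel.
    print(f"{useful_throughput(64_000, 1, 53) / 1000:.1f} kb/s")
```

This reproduces the 1.2 kb/s of the example: 64 kb/s divided by 53.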

24 Chapter 1 A Simple Protocol: Stop and Go
Packets may be lost during transmission: bit errors due to channel imperfections, various noises.
- Computer A sends packets to B
- B returns an acknowledgement packet immediately to confirm that B has received the packet
- A waits for the acknowledgement before sending a new packet
- if no acknowledgement comes after a delay T1, then A retransmits
Question: what is the maximum throughput, assuming that there are no losses?
Notation: packet length = L, constant (in bits); acknowledgement length = L, constant; channel bit rate = b; propagation delay = D; processing time = 0.
This example is a simple protocol, often used, for repairing packet or message losses. The idea is simple:
- identify all packets with some number or some other means
- when you send one packet, wait until you receive a confirmation
- after some time, if no confirmation arrives, consider that the packet has been lost and retransmit.
Compute the maximum throughput of this protocol, assuming the source has an infinite supply of packets to send, the destination generates the confirmation instantly, and the bit rate of the channel is constant.
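
One way to approach the question is to compute the duration of one send-and-acknowledge cycle: the packet is transmitted, propagates to B, the acknowledgement is transmitted, and it propagates back. The sketch below makes that assumption explicit and treats the acknowledgement length as a separate parameter, here called `ack_bits`, so that it may differ from the packet length; the names and the sample values are illustrative.

```python
def stop_and_go_throughput(packet_bits: float, ack_bits: float,
                           bit_rate: float, propagation_delay: float) -> float:
    """Maximum throughput in b/s with no losses and zero processing time."""
    cycle = (packet_bits / bit_rate      # transmit the packet
             + propagation_delay         # packet travels from A to B
             + ack_bits / bit_rate       # transmit the acknowledgement
             + propagation_delay)        # acknowledgement travels back
    return packet_bits / cycle           # one packet delivered per cycle


if __name__ == "__main__":
    # Assumed values: 8000-bit packets, 400-bit acks, 1 Mb/s, D = 10 ms.
    print(f"{stop_and_go_throughput(8000, 400, 1e6, 0.01):.0f} b/s")
```

With these sample values the throughput is well below the 1 Mb/s bit rate, because most of each cycle is spent waiting for the acknowledgement.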

25 Chapter 1 "Bandwidth"-Delay Product
Consider the scenario:
[Figure: time diagram between A and B; B says "stop"; the last bit sent by A arrives at B some time later.]
- β = 2Db
- β = maximum number of bits B can receive after saying stop
- large β means: delayed feedback
As an illustration of the effect of propagation, consider the scenario above. The number β is called the "bandwidth"-delay product (why with quotation marks?). It expresses the latency of a channel. We will find it important in the rest of the lecture.
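
The product β = 2Db is straightforward to evaluate. In this sketch the function name and the sample link (1 Mb/s with a 10 ms one-way propagation delay) are assumptions chosen for illustration.

```python
def bandwidth_delay_product(propagation_delay: float, bit_rate: float) -> float:
    """beta = 2 * D * b: bits that can still arrive after B says stop."""
    return 2 * propagation_delay * bit_rate


if __name__ == "__main__":
    # Assumed link: D = 10 ms one way, b = 1 Mb/s.
    print(bandwidth_delay_product(0.01, 1e6), "bits")
```

For this link, about 20 000 bits are already committed by the time the stop message takes effect, which is what "delayed feedback" means in practice.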

26 Chapter 1 Facts to Remember (this Chapter)
- Computer networks are organized using a layered model
- There is one layered model per architecture (e.g. TCP/IP, AppleTalk, Novell NetWare, OSI), but the numbering is standard (1 to 7)
- Layers 1 and 2 correspond to cables (or wireless channels)
- Layer 3 = the network layer; has mainly routers
- Layer 4 = transport; is in end systems only
- UDP provides the simplest access to network services
- Layers 5-7 form the application layer (web, etc.)
Concepts you should know:
- protocol, peer entities, PDU, service
- transmission time versus propagation time
- bandwidth-delay product

27 Chapter 3: The MAC layer 1 Local Area Networks: Ethernet Prof. Jean-Yves Le Boudec ICA, EPFL CH-1015 Ecublens Leboudec@epfl.ch

28 Chapter 3: The MAC layer Objective 2 Understand shared medium access methods of Ethernet; Describe network aspects for an Ethernet network; PART A: The CSMA/CD Access Method PART B: Network Aspects The access method is the protocol by which it is possible to share a given medium (cable, wireless link). Ethernet is built on a shared medium access method called CSMA/CD. The network aspects explain how a local area network is built today. We will see that the resulting network is very far away from the original design.

29 Chapter 3: The MAC layer Part A: Motivation for LANs
- goal: connect computers in the same site (building, small campus)
- experience from host-centric networks: bursty traffic
- basic idea: share a cable, no complex software in the end system
- alternatives? switch-based LANs, connection oriented: ATM; switch-based LANs, connectionless: Switched Ethernet
If you want to understand something in the world of local area networks, you should keep in mind the design requirements. Today, they are: (1) interconnect many pieces of equipment without complex cabling, inside a limited geographical area, and inside one organization; (2a) be easy to manage, in particular, detect cable faults easily. When Ethernet was first conceived, the requirements were a little different. The second requirement was replaced by: (2b) use one shared cable for the entire network. Today most people would agree that this is not necessarily a good idea, because fault isolation is difficult on a shared cable. Originally, it was believed to be good because it would reduce the amount of cabling, and because traffic is bursty. Burstiness means that, most of the time, sources are idle; once in a while, they send a large amount of traffic. The response time is better with a shared medium system than if you allocated a fixed share to all (see exercise).

30 Chapter 3: The MAC layer Access Method
multiaccess communication = share a communication medium
examples: radio channel, cellular networks, satellite links; machine bus; local area cable
multiaccess communication (= shared medium) requires an Access Method
- deterministic: Time Division Multiple Access (TDMA); Token Passing (Token Ring, Token Bus, FDDI)
- non-deterministic: Aloha; CSMA/CD

The purpose of the access method is to control access to the channel. If all stations talk at the same time, then no data can be understood by receivers (collision). Compare to a CB channel.

Deterministic access methods require that stations talk only when they are authorized by the access protocol. With TDMA, time is divided into periodic slots; station i can use the time intervals [(i-1)δ, iδ[, [T+(i-1)δ, T+iδ[, [2T+(i-1)δ, 2T+iδ[, ..., where T is the period and δ the slot duration. With n stations, only 1/n of the channel time is usable by one station. The scheme requires a global synchronization; it does not support bursty traffic well (why?) but is simple to control. It is used in cellular and satellite systems.

With a token passing scheme, there exists one global token, which is circulated among stations; in order to talk, a station must have the token; while talking, the station keeps the token, which it has to release after a maximum token holding time. Token passing schemes allow a very high utilization even with sporadic traffic, as long as the bandwidth-delay product is not too large (time is wasted while passing the token from one station to the other).

Non-deterministic (= collision based) schemes take an optimistic approach. Collisions are avoided if possible, but they may occur, and the schemes operate in such a way that they can be recovered from. Aloha is a primitive scheme which evolved to CSMA/CD, the access method of Ethernet. These schemes are simpler to implement than token passing schemes, but do not support as high a utilization (time is wasted during collisions and during collision recovery times). Collision based schemes do not work well if the bandwidth-delay product is high.
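The TDMA slot assignment described above can be sketched in a few lines of Python. This is only an illustration: the function name `tdma_slots`, the numbering of stations from 1 to n, and the choice T = nδ (one slot per station per period, no guard time) are assumptions made for the example and do not come from the slides.

```python
def tdma_slots(i, delta, n, periods):
    """Return the first `periods` intervals [start, end[ owned by station i.

    Assumption for this sketch: stations are numbered 1..n and the period
    is T = n * delta (exactly one slot per station in each period).
    """
    if not 1 <= i <= n:
        raise ValueError("station index must be in 1..n")
    T = n * delta
    return [(k * T + (i - 1) * delta, k * T + i * delta) for k in range(periods)]


# Station 2 out of 4, slot duration of 1 time unit, first two periods.
print(tdma_slots(2, 1.0, 4, 2))
```

Whatever the traffic, each station owns the same fixed fraction 1/n of the channel time, which is the property the notes point at when they ask why bursty traffic is poorly served.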

31 Chapter 3: The MAC layer Access Method Topology
Logical topology:
- bus: all bits sent by one station are propagated to all stations; data die at the end of the bus; all stations see all frames; used by Ethernet, Token Bus, LocalTalk, wireless systems
- ring: all bits are passed from one station to the next station, then to the next one's neighbour, etc.; bits eventually return to the originating station, which has to remove them; all stations see all frames; used by Token Ring and FDDI
Cabling topology = layout of cables = star in most cases

CSMA/CD uses a bus logical topology, whereas token passing schemes such as the Token Ring and FDDI use ring topologies. The cabling topology is in general different from the logical topology. A simple network today uses a star topology: all cables go from a central point (the hub) to all end-systems. A more complex network uses a tree of stars. It is the Token Ring network which first introduced a star based cabling topology, because the designers of the Token Ring took requirement (2a) seriously. With the first Token Rings, a hub contained electro-magnetic relays which would automatically bypass a station that does not function correctly (or is powered off).

32 Chapter 3: The MAC layer ALOHA
[Figure: stations send data to a central host, which returns an ack]
transmission procedure:
    i = 1
    while (i <= maxattempts) do
        send packet
        wait for acknowledgement or timeout
        if ack received then leave
        wait for random time
        increment i
    end do

ALOHA is the basis of all non-deterministic access methods. The ALOHA protocol was originally developed for communications between islands (University of Hawaii) that use radio channels at low bit rates. The ALOHA protocol requires acknowledgements and timers. Collisions occur, and if a packet is lost, then the source has to retransmit; the retransmission strategy is not specified here; many possibilities exist. We will see the one used for CSMA/CD. There is no feedback to the source in case of collision (it was too complex to implement at that time). The picture shows a radio transmission scenario; Aloha can also be used on a cable (bus). It is used nowadays in cases where simplicity is more important than performance (for example: ATM metasignalling).

The maximum utilization can be proven to be 18% (see below). This is assuming an ideal retransmission policy that avoids unnecessary repetitions of collisions.

33 Chapter 3: The MAC layer Maximum Utilization of Aloha
For a given total transmission attempt rate µ, the utilization is
    S = µs exp(-2µT) T / s = µT exp(-2µT)
µT = G is the normalized total transmission attempt rate.
S is maximum, equal to 1/(2e), for G = 0.5.
[Figure: S as a function of G]

The maximum utilization is difficult to obtain and depends on a large number of parameters. We provide an upper bound. We observe packet arrivals at one point on the medium. We assume that packet arrivals (fresh + retransmissions) are Poisson, and call µ the parameter. This assumption is not obvious. It has been shown to be valid if fresh traffic is Poisson and if the retransmission policy is optimal. See also the exercises for an experimental verification. Other retransmission policies lead to worse utilizations, or even to unstable systems (see below).

We assume that the packet transmission time is constant, equal to T. Consider a packet arriving at time t. The packet will be transmitted without collision iff no other packet arrives during the time interval [t-T, t+T]. The probability of this happening is exp(-2µT). Over a long time interval of duration s, the total number of packet arrivals is close to µs and the fraction of packets transmitted without collision is close to exp(-2µT); therefore the maximum utilization is
    µs exp(-2µT) T / s = µT exp(-2µT)
µ is unknown and depends on the retransmission policy. However, we can compute the maximum value of the utilization over all possible values of µ. The function is maximum for 2µT = 1, and the value of the maximum is 1/(2e) = ca. 0.18.
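The location and value of the maximum of S = G exp(-2G) can be checked numerically. The sketch below is not part of the original notes; the function name `aloha_utilization` and the grid resolution are arbitrary choices made for the illustration.

```python
import math


def aloha_utilization(G):
    """Pure Aloha utilization S = G * exp(-2G), where G = mu * T."""
    return G * math.exp(-2.0 * G)


# Scan the normalized attempt rate G on a fine grid and keep the best point.
grid = [k / 1000.0 for k in range(1, 3001)]
G_best = max(grid, key=aloha_utilization)
S_best = aloha_utilization(G_best)

# Compare the numerical maximum with the closed form 1/(2e).
print(G_best, S_best, 1.0 / (2.0 * math.e))
```

The scan agrees with the analytical result: setting the derivative exp(-2G)(1 - 2G) to zero gives G = 0.5.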

34 Chapter 3: The MAC layer Detailed Analysis: Slotted Aloha
The analysis is simpler for slotted ALOHA.
Assume that transmissions are synchronized to start at the beginning of a time slot, and last for exactly one time slot. Then:
- the probability of no collision becomes exp(-µT);
- the throughput becomes approximately G exp(-G);
- the maximum utilization is bounded by 1/e ≈ 0.37.
Let us examine a simple model (finite number of stations).
Slotted Aloha, finite number of stations:
- a backlogged station retransmits with probability qr
- fresh arrival with probability qa per unbacklogged station, 0 otherwise
- m stations; qa = 1/m^2 to 1/m; qr = qa to 4 qa

In this and the following slides we do a more detailed analysis. The analysis is considerably simpler for slotted Aloha, which we assume in the rest of this chapter.
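As a rough illustration of where G exp(-G) comes from, the sketch below compares it with a deliberately simplified finite model: m stations that each transmit in a slot independently with probability G/m. This simplified model is an assumption made for the example; it ignores the backlog state and is not the Markov chain model used in these notes. The function names are hypothetical.

```python
import math


def slotted_aloha_poisson(G):
    """Slotted Aloha throughput G * exp(-G) for Poisson attempts."""
    return G * math.exp(-G)


def slotted_aloha_finite(G, m):
    """Probability that exactly one of m stations transmits in a slot,
    when each station transmits independently with probability G / m.
    Simplified model: the backlog state is ignored."""
    p = G / m
    return m * p * (1.0 - p) ** (m - 1)


# At G = 1 the Poisson formula reaches its maximum 1/e.
for m in (10, 50):
    print(m, slotted_aloha_finite(1.0, m), slotted_aloha_poisson(1.0))
```

In this simplified setting the finite value approaches G exp(-G) as m grows, since (1 - G/m)^(m-1) tends to exp(-G).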

35 Chapter 3: The MAC layer Numerical Examples
[Figure: two panels, m = 10 stations and m = 50 stations]
The figure illustrates that the relation throughput ≈ G exp(-G) holds well for large m.

The figure shows results of the Markov chain analysis. We have considered a number of possible values for the parameters m, qa and qr. For a given value of m, we vary qa and qr as explained above. Every value of (m, qa, qr) gives one point on one curve. A point is defined by
    x = G = offered load
    y = achieved throughput
The dots represent the exact values for our model. The curve is the ideal relation y = x exp(-x).

36 Chapter 3: The MAC layer Instability of ALOHA
The previous examples indicate that the system may run into an operation mode where G is high while the throughput is small.
Indeed, for some values of the parameters, the total throughput decreases as the offered load increases. This heavily depends on the retransmission probability.
[Figure: throughput versus offered load, for m = 10, qr = 1.25 qa and for m = 10, qr = 2.50 qa]
This is a sign of instability.

37 Chapter 3: The MAC layer Stability of ALOHA with Infinite Population Model
In order to understand the stability, we first study the limiting case with an infinite population. We show two results:
(1) With a fixed retransmission probability (state independent retransmission), the system is unstable.
(2) It is possible to compute an optimal distribution of probabilities which makes the system stable for any utilization less than 1/e.
Our model is as follows:
- time is slotted; one packet takes exactly one slot to be transmitted
- arrivals in different time slots are independent and are independent of the state of the system. Call a(n) the probability of n arrivals in one time slot
- immediate feedback: when a collision occurs, this is detected in the same time slot. All collided packets become backlogged
- retransmissions are also independent of each other. A backlogged packet attempts to retransmit in one time slot with probability qr. We assume first that qr is fixed
- call n the number of backlogged packets. We assume all backlogged packets are possible candidates for retransmission, and that n is unbounded

The infinite population model states in a compact way what can be found numerically on finite examples. Aloha is not stable, and it bears the risk of congestion collapse: when many retransmissions occur, they reduce the throughput, thus causing congestion.

38 Chapter 3: The MAC layer CSMA
Improvement 1: listen before you talk: Carrier Sense Multiple Access
    i = 1
    while (i <= maxattempts) do
        listen until channel idle
        transmit immediately
        wait for acknowledgement or timeout
        if ack received then leave
        wait random time   /* collision */
        increment i
    end do

CSMA improves on Aloha by requiring that stations listen before transmitting (compare to CB radio). Some collisions can be avoided, but not all of them. This is because of propagation delays. Two or more stations may sense that the medium (= the channel) is free and start transmitting at time instants that are close enough for a collision to occur. Assume that the propagation time between A and B is 2 ms and that all stations are silent until time 0. At time 0, station A starts transmitting for 10 ms; at time 1 ms, station B has not received any signal from A yet, so it can start transmitting. At time 2 ms, station B senses the collision but it is too late according to the protocol.

The CSMA protocol requires that stations be able to monitor whether the channel is idle or busy (there is no requirement to detect collisions). It is a simple improvement to Aloha, at the expense of implementing the monitoring hardware.

The effect of the CSMA protocol can be expressed in the following way. Call T the maximum propagation time from station A to any other station; if no collision occurs during a time interval of duration T after A started transmitting, then A has seized the channel (no other station can send).

CSMA works well only if the transmission time is much larger than the propagation time, namely bandwidth-delay product << frame size. It has the same stability problems as Aloha.

In order to avoid repeated collisions, it is required to wait for a random delay before retransmitting. If all stations choose the random delays independently, and if the value of the delay has good chances of being larger than T, then there is a high probability that only one of the retransmitting stations seizes the channel.

39 Chapter 3: The MAC layer CSMA / CD
improvement 3: detect collisions as soon as they occur: Carrier Sense Multiple Access / Collision Detection
    i = 1
    while (i <= maxattempts) do
        listen until channel is idle
        transmit and listen
        wait until (end of transmission) or (collision detected)
        if collision detected then
            stop transmitting   /* after 32 bits (jam) */
        else
            wait for interframe delay
            leave
        wait random time
        increment i
    end do
improvement 4: acknowledgments replaced by CD
This is Ethernet (IEEE 802.3, the standard conformant version of Ethernet)

CSMA/CD is the protocol used by Ethernet. In addition to CSMA, it requires that a sending station monitors the channel and detects a collision. The benefit is that a collision is detected within a propagation round trip time. Collisions may still occur.

40 Chapter 3: The MAC layer CSMA / CD Time Diagram 1
[Figure: time diagram for stations A and B, time axis from 0 to T]
- A senses idle channel, starts transmitting
- shortly before T, B senses idle channel, starts transmitting

41 Chapter 3: The MAC layer CSMA / CD Time Diagram 2
[Figure: time diagram for stations A and B, time axis from 0 to T, with t2 marked]
- A senses collision, continues to transmit 32 bits (jam)
- B senses collision, continues to transmit 32 bits (jam)

Jam bits are simply there to make sure the collision is long enough to be detected by the hardware.

42 Chapter 3: The MAC layer CSMA / CD Time Diagram 3
[Figure: time diagram for stations A and B, time axis from 0 to T, with t1 and t2 marked]
- A waits random time t1
- B waits random time t2
- B senses channel idle and transmits
- A senses channel busy and defers to B
- A now waits until channel is idle

CSMA/CD improves on CSMA by requiring that stations detect collisions and stop transmitting (after 32 bits, called jam bits, in order to ensure that all circuits properly recognize the presence of collisions). CSMA/CD has a better performance than Aloha or CSMA but suffers from the same stability problems.

After a collision is detected, stations will re-attempt to transmit after a random time. Acknowledgements are not necessary because absence of collision means that the frame could be transmitted (see Minimum Frame Size). The interframe delay (gap) is 9.6 µs. It is used to avoid blind times, during which adapters are filtering typical noise at transmission ends.

The random time before retransmission is chosen in such a way that if repeated collisions occur, then the time increases exponentially. The effect is that in case of congestion (too many collisions) the access to the channel is slowed down.

43 Chapter 3: The MAC layer Exponential Backoff
The random time before retransmission is given by:
    k = min(10, AttemptNb)
    r = random(0, 2^k - 1) * slottime
AttemptNb is the number of the retransmission attempt that will be made after the random time (k = 1 for the first retransmission); random returns an integer, uniformly distributed between the two bounds given as arguments.
Examples:
- first retransmission attempt: k = 1; r = 0 or r = slottime
- second retransmission attempt (if the preceding one failed): k = 2; r = 0, 1, 2 or 3 * slottime
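The backoff rule above translates almost directly into code. The following Python sketch is only an illustration: the function names and the `rng` parameter are assumptions, and the slot time of 51.2 µs is the 10 Mb/s value quoted elsewhere in these notes.

```python
import random

SLOT_TIME_S = 51.2e-6  # slot time at 10 Mb/s, as quoted in these notes


def backoff_slots(attempt_nb, rng=random):
    """Number of slot times to wait before retransmission attempt `attempt_nb`.

    k = min(10, attempt_nb); the result is an integer drawn uniformly
    from 0 .. 2^k - 1 (both bounds included).
    """
    k = min(10, attempt_nb)
    return rng.randint(0, 2 ** k - 1)


def backoff_delay(attempt_nb, rng=random):
    """Same as backoff_slots, expressed in seconds."""
    return backoff_slots(attempt_nb, rng) * SLOT_TIME_S


rng = random.Random(1)  # fixed seed, only so that the demo is repeatable
for attempt in (1, 2, 3, 12):
    print(attempt, backoff_slots(attempt, rng))
```

The range of possible waits doubles with each failed attempt up to the tenth, which is what slows down access to the channel when collisions repeat.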

44 Chapter 3: The MAC layer Minimum Frame Size
[Figure: four snapshots of the cable between A and B]
- t = 0: A begins transmission
- t = 1-ε: B begins transmission
- t = 1: B detects collision, stops transmitting
- t = 2-ε: A detects collision

45 Chapter 3: The MAC layer Minimum Frame Size
A minimum frame size equal to the number of bits transmitted during one round trip is required to detect all collisions.
beta = number of bits transmitted by a source during the maximum round trip time for any Ethernet network
beta = bandwidth-delay product + jam time + safety margin = 512 bits (corresponding to 51.2 µs at 10 Mb/s)
- includes propagation time in repeaters + margin: 4 repeaters + 5 segments + 2 stations = 2*21.2 µs + 2*1 µs = 44.4 µs
rule: in Ethernet, all frames must be at least as large as beta
properties:
- P1: all collisions are detected by sources while transmitting
- P2: collided frames are shorter than beta
Proof:
- P1: see previous slide
- P2: because collided frames are aborted by the source at the latest after slottime, including jam bits

beta is called slottime in the IEEE standards. We prefer to use some other name, because it is not a time, but a number of bits.
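The arithmetic behind these figures is easy to replay. The short Python sketch below uses integer nanoseconds to avoid rounding issues; the variable names are chosen for the example, and how the bits left over are split between jam and safety margin is not stated in the notes, so only their total is computed.

```python
BIT_TIME_NS = 100   # one bit lasts 100 ns at 10 Mb/s
BETA_BITS = 512     # minimum frame size from the slide

# Time needed to send beta bits at 10 Mb/s.
slot_time_ns = BETA_BITS * BIT_TIME_NS

# Worst-case round trip from the slide: 2 * 21.2 us + 2 * 1 us.
round_trip_ns = 2 * 21_200 + 2 * 1_000
round_trip_bits = round_trip_ns // BIT_TIME_NS

# Bits left for jam time and safety margin together.
remaining_bits = BETA_BITS - round_trip_bits

print(slot_time_ns / 1000, "us slot time")
print(round_trip_ns / 1000, "us round trip =", round_trip_bits, "bits")
print(remaining_bits, "bits left for jam and margin")
print(BETA_BITS // 8, "bytes minimum frame size")
```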

46 Chapter 3: The MAC layer Ethernet at 10, 100 and 1000 Mb/s
Ethernet exists at 10 Mb/s, 100 Mb/s and 1 Gb/s.
- Beta is 512 bits at 10 Mb/s and 100 Mb/s. This means that the network size is 2 km at 10 Mb/s, and ca. 0.2 km at 100 Mb/s.
- At 1 Gb/s, beta is 512 bytes. The network size is ca. the same as at 100 Mb/s. What does it imply?
See also network aspects: is there CSMA/CD in Gigabit Ethernet?

This implies that the minimum packet size is larger. This can be achieved by grouping several small packets together; otherwise, by padding. Padding means that bit rate is wasted.

47 Chapter 3: The MAC layer CSMA / CD performance
Maximum utilization of Ethernet is difficult to determine analytically.
Approximation: θ ≈ 1 / (1 + C α), where
- α = β / L = 2 * propagation delay / transmission time
- L = frame size, β = bandwidth-delay product
- C is a constant: C = 3.1 is a pessimistic value; C = 2.5 is an approximate value based on simulations
For a large network, β is close to 60 bytes; for traffic with small frames (L = 64 bytes), the utilization is less than 30%. For large frames (1500 bytes), it is around 90%.
Key for high utilization is: bandwidth-delay product << frame size

The formula with C = 3.1 is proven in the next slide. It is a pessimistic estimate.
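The two numerical claims (less than 30% for 64-byte frames, around 90% for 1500-byte frames) can be reproduced from the approximation. The Python sketch below is illustrative; the function name and the default β of 60 bytes (the value quoted for a large network) are choices made for the example.

```python
def csma_cd_utilization(frame_bytes, beta_bytes=60.0, C=3.1):
    """Approximate maximum utilization 1 / (1 + C * alpha),
    with alpha = beta / L (both expressed in bytes here)."""
    alpha = beta_bytes / frame_bytes
    return 1.0 / (1.0 + C * alpha)


# Small and large frames, pessimistic and simulation-based constants.
for L in (64, 1500):
    for C in (3.1, 2.5):
        print(L, C, round(csma_cd_utilization(L, C=C), 3))
```

Both values of C give the same qualitative picture: utilization is driven by the ratio of the bandwidth-delay product to the frame size.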

48 Chapter 3: The MAC layer Proof
We prove the formula with C = 3.1. The bound is derived as follows. We assume that all frames have a constant length T. We call arrival a transmission or retransmission submitted when the channel is sensed idle. Also call R the maximum propagation delay. Lastly, we assume that stations are trying to saturate the network with as much traffic as can be transmitted. We obtain a pessimistic bound by making the following worst case assumption: arrivals are always at alternate ends of the network, namely, separated by the maximum propagation delay.

We consider cycles starting with the end of a successful or aborted transmission. Call:
- x1: time until all stations know the channel is idle
- x2: time from then until the next arrival
- x3: time until the transmission completes or is aborted due to collision.
We have:
    x1 = R
    x2 = 1/µ on average
    E(x3 | collision occurred) = 2R; Proba(collision occurred) = 1 - exp(-Rµ)
    E(x3 | successful transmission) = T; Proba(successful transmission) = exp(-Rµ)
The last formula is because collisions can occur only if an arrival occurs during the propagation time R, because of collision avoidance. The average cycle time is thus, for this worst case scenario:
    τ = R + 1/µ + 2R (1 - exp(-Rµ)) + T exp(-Rµ)
and the corresponding utilization:
    θmax = average useful time per cycle / average cycle duration = T exp(-Rµ) / τ
Computing the maximum of θmax with respect to x = Rµ gives the formula (maximum obtained for x = 0.43). Note that α = 2R / T.
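The last step of the proof can be checked numerically. Dividing τ by T exp(-Rµ) and writing x = Rµ gives 1/θmax = 1 + (2R/T) C(x) with C(x) = (exp(x) (3 + 1/x) - 2) / 2, so maximizing θmax amounts to minimizing C(x). The Python sketch below performs that minimization on a grid; the function names and the grid bounds are assumptions made for the illustration.

```python
import math


def theta(R, T, x):
    """Utilization T*exp(-x) / tau for the worst-case cycle, with x = R * mu."""
    mu = x / R
    tau = R + 1.0 / mu + 2.0 * R * (1.0 - math.exp(-x)) + T * math.exp(-x)
    return T * math.exp(-x) / tau


def c_of_x(x):
    """Constant C(x) such that 1/theta = 1 + C(x) * alpha, alpha = 2R/T."""
    return (math.exp(x) * (3.0 + 1.0 / x) - 2.0) / 2.0


def best_x(lo=0.01, hi=2.0, steps=20000):
    """Grid search for the x that minimizes C(x), i.e. maximizes theta."""
    xs = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(xs, key=c_of_x)


x_star = best_x()
C = c_of_x(x_star)
print(round(x_star, 3), round(C, 3))
```

The grid search lands near x = 0.43 and yields a constant close to 3.1, in line with the statement of the proof.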

49 Chapter 3: The MAC layer Part B: Ethernet / IEEE
Ethernet = CSMA/CD with exponential backoff as shown in part A
- originally over a coaxial cable
- 10 Mb/s to 1 Gb/s
- local area only (<= 0.2 to 2 km)
Ethernet history: 1980 : Ethernet V1.0 (Digital, Intel, Xerox) 1982 : Ethernet V : IEEE standard small differences in both specifications; adapters today support both 1995 : IEEE Mb/s standard
[Figure: the two frame formats]
- standard frame: preamble SFD DA SA Length SNAP data pad FCS
- Ethernet V.2 frame: preamble SFD DA SA Type data FCS
- field sizes as extracted: 7 B 1 B = B 6 B 2 B <= 1500 B 4 B
DA = destination address, SA = source address

The preamble is used for the receivers to synchronize (alternating 0 and 1, terminated by 0). With Ethernet, transmission starts asynchronously (stations start independently), and between transmissions, the channel is idle. SFD (start frame delimiter) is used to validate the beginning of a frame.

The length field is used to indicate the total length before padding. Padding is required if the minimum frame size of 512 bits = 64 bytes is not reached. With the Ethernet proprietary (= non-standard) format, this field is not present. It is up to the layer using Ethernet to know that frames have to be at least 512 bits, and perform the padding. The maximum size of the data part is 1500 bytes (limitation imposed by buffer size considerations in adapters).

The type field indicates the type of upper layer that uses the protocol (for example: IP or AppleTalk). With 802.3, this field is absent; it is replaced by an intermediate layer, called LLC, which provides mainly this multiplexing function. LLC is not needed with the non-standard Ethernet. Type values are larger than the maximum size, so both formats can exist on the same network (even on the same station).

The FCS (frame check sequence) is a 32-bit cyclic redundancy check. It can detect all single, double, triple errors, all error bursts of length <= 32, most double bursts of length up to 17. The probability that a random collection of bit errors is undetected is 2e-10.

Ethernet works for a local area only. This is because the CSMA/CD protocol has poor utilization as the bandwidth-delay product becomes large compared to the frame sizes. AppleTalk's first network was CSMA/CA (collision avoidance) at kb/s.

50 Chapter 3: The MAC layer Addressing
- MAC address: 48 bits (16 bits) = adapter name
- sender puts destination MAC address in frame
- all stations read all frames; keep only if destination address matches
- connectionless network operation
- all 1 address (FF:FF:FF:FF:FF:FF) = broadcast MAC address
[Figure: stations A, B, C, D with MAC addresses; labels shown: 08:00:20:71:0d:d4, 00:00:c0:3f:6c:a4, 01:00:5e:02:a6:cf (group address)]

Ethernet addresses are known as MAC addresses. Every Ethernet interface has its own MAC address, which is in fact the serial number of the adapter, put by the manufacturer. MAC addresses are 48 bits long. The first address bit is the individual/group bit, used to differentiate normal addresses from group addresses. The second bit indicates whether the address is globally administered (the normal case, burnt-in) or locally administered. Group addresses are always locally administered.

When A sends a data frame to B, A creates a MAC frame with source addr = A, dest addr = B. The frame is sent on the network and recognized by the destination.

Some systems like DEC networks require that MAC addresses be configured by software; those are so-called locally administered MAC addresses. This is avoided whenever possible in order to simplify network management.

Data on Ethernet is transmitted least significant bit of first octet first (a bug dictated by Intel processors). Canonical representation thus inverts the order of bits inside a byte (the first bit of the address is the least significant bit of the first byte).

Examples of addresses:
- 01:00:5e:02:a6:cf (a group address)
- 08:00:20:71:0d:d4 (a SUN machine)
- 00:00:c0:3f:6c:a4 (a PC)
- 00:00:0c:02:78:36 (a CISCO router)
- FF:FF:FF:FF:FF:FF the broadcast address
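Because the first address bit is the least significant bit of the first byte in canonical representation, the individual/group bit and the global/local bit can be read with simple masks. The Python sketch below illustrates this on the example addresses; the function names are hypothetical and the masks 0x01 and 0x02 follow from the bit ordering described above.

```python
def parse_mac(text):
    """Convert 'aa:bb:cc:dd:ee:ff' into a list of six integers."""
    octets = [int(part, 16) for part in text.split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError("expected six hexadecimal octets")
    return octets


def is_group(text):
    """First address bit (least significant bit of the first byte) set."""
    return bool(parse_mac(text)[0] & 0x01)


def is_locally_administered(text):
    """Second address bit (mask 0x02 on the first byte) set."""
    return bool(parse_mac(text)[0] & 0x02)


def is_broadcast(text):
    """All 48 bits set to 1."""
    return all(o == 0xFF for o in parse_mac(text))


for addr in ("01:00:5e:02:a6:cf", "08:00:20:71:0d:d4", "FF:FF:FF:FF:FF:FF"):
    print(addr, is_group(addr), is_broadcast(addr))
```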

51 Chapter 3: The MAC layer Ethernet Cabling 25 Ethernet cabling is originally shared cable Today: mainly point to point UTP How is that possible? repeaters bridges Thick Coax Thin Coax UTP Contrary to the original design point, Ethernet cabling is today mainly point to point. Why do network managers prefer point to point cabling? - because fault isolation is simpler - because configuration management is simpler How is point to point cabling possible with a shared medium protocol? - using repeaters (shown on the figure) - or using bridges (called Ethernet Switches)

52 Chapter 3: The MAC layer Repeaters
Repeaters extend the network beyond the cable length limit.
Function of a simple (2-port) repeater:
- repeat bits received on one port to the other port
- if a collision is sensed on one port, repeat random bits on the other port
One network with repeaters = one collision domain
Even with repeaters, the network is limited:
- propagation time: the 51.2 µs slottime includes repeaters
- at most 4 repeaters in one path
Repeaters perform physical layer functions only (bit repeaters)
[Figure: repeater]

From ethernet.faq: There are limitations on the number of repeaters and cable segments allowed between any two stations on the network. There are two different ways of looking at the same rules:
1. The Ethernet way: A remote repeater pair (with an intermediate point-to-point link) is counted as a single repeater (IEEE calls it two repeaters). You cannot put any stations on the point-to-point link (by definition!), and there can be two repeaters in the path between any pair of stations. This seems simpler to me than the IEEE terminology, and is equivalent.
2. The IEEE way: There may be no more than five (5) repeated segments, nor more than four (4) repeaters between any two Ethernet stations; and of the five cable segments, only three (3) may be populated. This is referred to as the "5-4-3" rule (5 segments, 4 repeaters, 3 populated segments).

From 3Com, for 100 Mb/s Ethernet: The 100BASE-T standard defines two classes of repeaters, called Class I and Class II repeaters. A collision domain can include at most one Class I or two Class II repeaters. Key topology rules are as follows:
- Using two Class II repeaters, the maximum diameter of the collision domain is 205 meters (typically 100m + 5m + 100m).
- With just a single Class II repeater in the collision domain, the diameter can be extended to 309 meters using fiber (typically 100m UTP + 209m fiber downlink).
- With a single Class I repeater in the collision domain, the diameter can be extended to 261 meters using fiber (typically 100m UTP + 161m fiber downlink).
- Connecting from MAC to MAC (switch to switch, or end-station to switch) using half-duplex 100BASE-FX, a 412-meter fiber run is allowed.
- For very long distance runs, a nonstandard, full-duplex version of 100BASE-FX can be used to connect two devices over a 2-kilometer distance. The IEEE is currently working on a standard for full duplex, but at this time all full-duplex solutions are proprietary.

53 Chapter 3: The MAC layer From Repeaters to Hubs
- Multiport repeater (n ports): logically equivalent to n simple repeaters connected to one internal Ethernet segment
- Multi-port repeaters make it possible to use point-to-point segments (Ethernet in the box)
- Value of point-to-point cabling? Ease of management, fault isolation
[Figure: Ethernet hub = multiport repeater, with stations S1, S2, S3 on UTP segments and a link to another hub]

Repeaters are the first building block which made it possible to have point-to-point, star based cabling.

54 Chapter 3: The MAC layer From Bus to Star and Tree 28 Ethernet today = active concentrators allow star wiring UTP on point-to-point configurations only remote network management How many frames can be transmitted in parallel in this network? UTP fiber Intermediate Hub NMA Head hub NMA NM Application coax Intermediate Hub NMA transceiver cable console coax Intermediate Hub NMA The figure shows the tree of stars topology which is now typical for a large shared medium Ethernet. However, we see on the next slides that large shared medium Ethernet are not frequent anymore, due to the introduction of switching, or bridging.

55 Chapter 3: The MAC layer One word on Bridges
[Figure: a bridge with A on port 1, a repeater with B and D on port 2, and C on port 3]
Forwarding Table
    Dest MAC addr   Port Nb
    A               1
    B               2
    C               3
    D               2
- Bridges are intermediate systems, or switches, that forward MAC frames to destinations based on MAC addresses
- Bridges perform connectionless data forwarding
- Bridges separate collision domains
- a bridged LAN may be much larger than a repeated LAN
- there may be several frames transmitted in parallel in a bridged LAN

A bridge is an intermediate system for the MAC layer. It receives MAC frames and forwards them further.
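The forwarding table of the slide can be pictured as a simple lookup from destination MAC address to port number. The Python sketch below is a minimal illustration, not a description of a real bridge: the function name is hypothetical, station names stand in for MAC addresses, not forwarding a frame whose destination sits on the arrival port is an assumption of the sketch, and unknown destinations are simply reported as `None` because their handling is not covered here.

```python
# Forwarding table from the slide: destination MAC address -> port number.
FORWARDING_TABLE = {"A": 1, "B": 2, "C": 3, "D": 2}


def output_port(dest, in_port, table=FORWARDING_TABLE):
    """Return the port on which to forward a frame, or None.

    None is returned when the destination is on the port the frame came
    from (assumed to need no forwarding) or when the destination is not
    in the table (not handled in this sketch).
    """
    port = table.get(dest)
    if port is None or port == in_port:
        return None
    return port


print(output_port("C", in_port=1))  # frame from A to C
print(output_port("D", in_port=2))  # frame from B to D, same port
```

Since a frame from A to C only occupies ports 1 and 3, a frame between B and D can be on port 2 at the same time, which is the sense in which several frames can be transmitted in parallel.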

56 Chapter 3: The MAC layer Repeaters and Bridges in OSI Model
[Figure: protocol stacks of End System, Repeater, Bridge, End System. End systems: Application, Presentation, Session (layers 5 to 7), Transport, Network, LLC, MAC, Physical. Repeater: Physical. Bridge: MAC, Physical. L2 PDUs (MAC frames) are exchanged at the MAC level.]
- Bridges are layer 2 intermediate systems
- Repeaters are layer 1 intermediate systems
- There also exist layer 3 intermediate systems (IP routers) -> module M3

57 Chapter 3: The MAC layer Switched Ethernet 31 Switched Ethernet = Bridge in the box Total bandwidth is not shared: parallel frame transmission An Ethernet Switch = Multiport Bridge is a connectionless data switch Ethernet used as a point-to-point mechanism! Frame Switching Hub Frame Switching Hub B1 Bridge B2 Bridge A B C D U V W X

58 Chapter 3: The MAC layer Today's Concentrators
concentrators (= hubs) combine frame switching and port switching
- frame switching = bridging
- port switching = assign repeater ports to collision domains
How many Ethernet segments (= collision domains) on the picture?
[Figure: two frame switching hubs with bridges B1 and B2 and a repeater; stations A, B, C, D, U, V, W, X]

LAN concentrators perform both bridging and repeating. They can be configured by a network management application.

59 Chapter 3: The MAC layer Virtual LANs
- several bridged LANs consolidated on one physical layer
- uses ATM or proprietary methods
[Figure: virtual LAN concentrators X1, X2 and X3, with stations A, B, C, D, L, M, N, P, U, V]

The picture shows two virtual LANs: (ACLNV) and (BDMPU). For each of the virtual LANs, there exist one or more collision domains per concentrator, plus one per inter-concentrator link. The concentrators perform bridging between the different collision domains of the same virtual LAN. Between X1 and X2, the two virtual LANs use the same physical link. If ATM is used, there is one VCC per virtual LAN.

The advantage is that physical location becomes independent of LANs. For example, all servers and routers can be concentrated in the same rooms (ex: U and V). There is no communication between the different virtual LANs at layer 2.

60 Chapter 3: The MAC layer Full duplex Ethernet
- A shared medium Ethernet cable is half duplex
- Full duplex Ethernet = a point-to-point cable, used in both directions; no access method, no CSMA/CD
- 100 Mb/s and Gigabit Ethernet use full duplex links to avoid distance limitations

61 Chapter 3: The MAC layer Congestion Control
A network of buffers requires some form of congestion control; otherwise congestion collapse may occur.
Known forms of congestion control are:
- reservations (ex: ATM)
- end-to-end (ex: TCP)
- hop by hop (ex: machine bus)
Ethernet concentrators use hop-by-hop flow control:
- the STOP signal can be simulated by collisions on half duplex links
- on full duplex links: PAUSE(n) frames, where n is the duration of the required stopping time
[Figure: hop-by-hop flow control example with packets P0 to P3 and STOP / GO signals]

62 Chapter 3: The MAC layer Architecture versus Products
architecture = set of protocols and functions
- defined by standards or proprietary books (SNA, DECnet, AppleTalk)
- examples: MAC layer, Ethernet Physical Layer; Bridge, Repeater
products = implementation of various architecture components
- examples: a concentrator that performs repeating, bridging; an adapter that performs MAC + PHY
- frame switching is performed either store and forward or cut through

Bridging is a well defined architecture concept. Switching is a commercial name, with different meanings depending on the context. In a LAN context, a switching Ethernet concentrator is simply a bridge.

63 Chapter 3: The MAC layer Facts to Remember
- Computers communicate in a local area network using Ethernet and MAC addresses
- A MAC address is the serial number of the Ethernet adapter
- Original Ethernet is shared medium: one collision domain per LAN
- Using bridging we can have several collision domains per LAN
- An Ethernet switch uses bridging
- Repeaters are bit-forwarding devices inside one Ethernet segment
- Bridges are connectionless intermediate systems that separate Ethernet segments
Concepts you should know: Aloha CSMA/CD shared medium access protocol
Further recommended reading:
- Ethernet: [Walrand Varaiya]chapter [Halsall] chapters and [BG] chapter 4; Big-LAN FAQ; Ethernet FAQ
- Token Bus, Token Ring, FDDI, 100VG, Wireless LANs, other LANs: [Walrand Varaiya]chapter [Halsall] chapters 6 and 7 [BG] chapters and; Token Ring Network Architecture reference (IBM doc number SC )

65 Chapter 4 : IP 1 The Connectionless Network layer : IPv4 and IPv6 Prof. Jean-Yves Le Boudec ICA, EPFL CH-1015 Ecublens Leboudec@epfl.ch

66 Chapter 4 : IP Contents 2 A.The Internet Network Layer: Introduction: The Network layer The Internet IP addresses IP packet forwarding ARP fragmentation ICMP Multicast: IGMP, MBone Routers, bridges and switches Non routable protocols B. IPv6

67 Chapter 4 : IP Why a network layer?
MAC addresses and bridging are not sufficient:
- bridging does not scale well to large networks
- MAC addresses have no topological structure
Solution: connectionless network layer (ex: Internet Protocol, IP):
- every host receives a network layer address (IP address)
- intermediate systems forward packets based on the destination address
Forwarding tables in bridges contain the list of all the MAC addresses which are reachable in the LAN. It is not possible to aggregate MAC addresses because they are not structured in a way that would reflect the topology of the network. The Internet Protocol solves this problem.

68 Chapter 4 : IP The Internet
- demonstration of ARPANET: 50 sites, 20 routers; based on a CO network layer (IMP) and NCP (transport)
- initial work on TCP, then TCP/IP
- TCP/IP protocols in ARPANET
- 1980: UNIX BSD 4.1 includes TCP/IP
- 1980: construction of an Internet using ARPANET as a backbone
- 1983: TCP/IP standard for ARPANET
- 1983: ARPANET split into military and civil networks
- 1987: worldwide Internet
- 1989: Switch (CH) connected to Internet
- 1992: WWW
- 1994: Netscape, Mosaic

69 Chapter 4 : IP The Internet Organization
The Internet organization coordinates the development of Internet standards (= TCP/IP standards):
- Internet Society
- Internet Architecture Board (IAB)
- Internet Research Task Force (IRTF), with the IR Steering Committee (IRSG)
- Internet Engineering Task Force (IETF), with the IE Steering Committee (IESG); organized in areas (area 1, area 2, ..., area n)
RFCs = official Internet documentation: standards and other documents; maintained by INTERNIC, shadows at ftp.switch.ch
INTERNIC manages IP addresses and domain names
IANA manages constant names (eg: port 53 for DNS)
Internet standardization process: initial -> (1) proposed -> (2) draft -> (3) standard; other states: experimental, historic

70 Chapter 4 : IP Internet and intranet
- an intranet: a collection of end and intermediate systems interconnected using the TCP/IP architecture, normally inside one organization
- the Internet: the global collection of all end and intermediate systems interconnected using the TCP/IP architecture, with coordinated allocation of addresses and implementation requirements by the Internet Society
- intranets are often connected to the Internet by firewalls: hosts that act as application level relays
- an internet can use its own addresses
- Internet addresses are managed worldwide
There is no global Internet organization: as for telephony, the Internet service is provided by a collection of competing Internet Service Providers (ISPs). Only addresses and standards are managed worldwide.

71 Chapter 4 : IP Connectionless Network Layer
Connectionless network layer = without connection
[Figure: hosts A.H1, B.C.H and B.D.H2 interconnected by routers R1 to R4; each router holds a table that maps destination prefixes (A.x, B.x, B.C.x, B.D.x) to output ports]
The connectionless network layer is similar to the postal system; every packet behaves like a postcard:
- the destination address is present in every packet
- intermediate systems (called routers) use routing tables to forward the packets
- changes in routing tables are not synchronized with end systems. This is why we say that this type of operation is connectionless.
Contrary to MAC addresses, network layer addresses have a topological (= geographical) structure. For example, all IP addresses of the form x.x belong to EPFL. This enables aggregation of tables in routers. Routers in the Internet need only to know in which direction x.x is; they do not need a list of all addresses in use at EPFL.

72 Chapter 4 : IP Intermediate Systems
- unicast vs multicast forwarding
- forwarding method + control method
- IP routers are the intermediate systems for IP
- implementation: stand alone box (CISCO, Wellfleet, IBM, ...) or Unix host software
[Figure: a simplified intermediate system with input ports x1 to x5, a forwarding method (individual PDU forwarding) driven by a control method, and output ports y1 to y5]
The picture shows unicast and multicast flows for a simplified intermediate system.

73 Chapter 4 : IP Network Example with IP Addresses
[Figure: example network with the ETHZ backbone (Komsys, ezci7-ethz-switch), Switch, a modem + PPP access through sic500cs, and the EPFL backbone; systems shown include ed0-ext, ed0-swi, ed2-in, ed2-el, in-inr, in-inj, stisun, lrcsuns, lrcpc, lrcmac and disun on the LRC, DI, LEMA and "Anneau SIDI SUN" (SIDI SUN ring) networks, each labelled with its IP and MAC address]

74 Chapter 4 : IP IP Addresses
- An IP address is 32 bits, written in dotted decimal notation
- An IP address has a prefix and a host part: prefix:host
- The subnet mask identifies the prefix by a bitwise & (AND) operation
Examples
- subnet mask at EPFL =
- question: net:subnet and host parts of : lrcsuns.lrc.epfl.ch?
- answer: address is prefix is host part is
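The prefix extraction by bitwise & can be sketched in C as follows. This is a minimal illustration, not code from the lecture: the helper names `ip4`, `ip_prefix` and `ip_host_part` are hypothetical, and the address 192.0.2.77 with mask 255.255.255.0 used in the usage note is an assumed example, not one of the EPFL addresses of the slide.

```c
#include <stdint.h>

/* Build a 32-bit IPv4 address from its four dotted-decimal bytes
   (hypothetical helper, introduced for this sketch only). */
static uint32_t ip4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8)  |  (uint32_t)d;
}

/* Prefix = address AND subnet mask. */
static uint32_t ip_prefix(uint32_t addr, uint32_t mask)
{
    return addr & mask;
}

/* Host part = the bits that the subnet mask leaves out. */
static uint32_t ip_host_part(uint32_t addr, uint32_t mask)
{
    return addr & ~mask;
}
```

With the assumed address 192.0.2.77 and mask 255.255.255.0, `ip_prefix` yields 192.0.2.0 and `ip_host_part` yields 77.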

75 Chapter 4 : IP IP Address Hierarchies
The prefix of an IP address is itself structured in order to support aggregation. For example:
- x.y represents an EPFL host
- /24 represents the LRC subnet at EPFL
- /16 represents EPFL
Prefixes are used between routers by routing algorithms.
This scheme is called classless and was first introduced in inter-domain routing under the name of CIDR (classless interdomain routing).
Notation: /16 means : the prefix made of the first 16 bits of the string
In the past, an older model was used: class based addresses, with networks of class A, B or C; now only the distinction between class D and non-class D is relevant.
The IP address changes when a host moves to another subnet (ex: Ethernet split into 2); compare to bridging.
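The /n notation and the aggregation it allows can be illustrated with a short C sketch. The function names `prefix_mask` and `in_prefix` are hypothetical, and the documentation addresses 198.51.100.x used in the usage note are assumptions chosen for the example.

```c
#include <stdint.h>

/* Mask with the first `len` bits set, e.g. len = 16 gives 0xFFFF0000.
   len = 0 and len >= 32 are handled separately because shifting a
   32-bit value by 32 is undefined in C. */
static uint32_t prefix_mask(unsigned len)
{
    if (len == 0)
        return 0;
    if (len >= 32)
        return 0xFFFFFFFFu;
    return (uint32_t)(0xFFFFFFFFu << (32 - len));
}

/* True if `addr` belongs to the prefix `prefix`/`len`. */
static int in_prefix(uint32_t addr, uint32_t prefix, unsigned len)
{
    uint32_t m = prefix_mask(len);
    return (addr & m) == (prefix & m);
}
```

For instance, 198.51.101.7 is inside 198.51.0.0/16 but outside 198.51.100.0/24: a router far away only needs the /16 entry, while a router inside the site distinguishes the /24 subnets.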

76 Chapter 4 : IP IP Address Classes
Address formats (leading bits, then fields):
- class A: 0, Net Id, Subnet Id, Host Id
- class B: 10, Net Id, Subnet Id, Host Id
- class C: 110, Net Id, Host Id
- class D: 1110, multicast address
- class E: reserved
Examples: x.x = EPFL host; x.x = ETHZ host; 9.x.x.x = IBM host; 18.x.x.x = MIT host
Class A B C D E Range to to to to to
Class B addresses are close to exhausted; new addresses are taken from class C, allocated as contiguous blocks.
Originally, the prefix of an IP address was defined in a very rigid way. For class A addresses, the prefix was 8 bits. For class B, 16 bits. For class C, 24 bits. The interest of that scheme was that by simply analyzing the address you could find out what the prefix was. It was soon recognized that this form was too rigid. Then subnets were added. It was no longer possible to recognize from the address alone where the subnet prefix ends and where the host identifier starts. For example, the host part at EPFL is 8 bits; it is 6 bits at ETHZ. Therefore, an additional piece of information, called the subnet mask, is necessary. Class C addresses were meant to be allocated one per network. Today they are allocated in contiguous blocks.
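The remark that the class could be found "by simply analyzing the address" can be made concrete with a small C sketch that inspects the leading bits. The function name `ip_class` is hypothetical; addresses are assumed to be held in host byte order.

```c
#include <stdint.h>

/* Return the historical class ('A' to 'E') of an IPv4 address,
   determined from its leading bits only. */
static char ip_class(uint32_t addr)
{
    if ((addr & 0x80000000u) == 0x00000000u) return 'A'; /* 0...    */
    if ((addr & 0xC0000000u) == 0x80000000u) return 'B'; /* 10...   */
    if ((addr & 0xE0000000u) == 0xC0000000u) return 'C'; /* 110...  */
    if ((addr & 0xF0000000u) == 0xE0000000u) return 'D'; /* 1110... */
    return 'E';                                          /* 1111... */
}
```

With this test, 9.x.x.x and 18.x.x.x fall in class A, and any address that starts with 1110 is a class D (multicast) address, which is the only distinction that remains relevant today.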

77 Chapter 4 : IP IP Addresses (examples -1) 13 subnet mask at ETHZ = = xff:ff:ff:b question: net:subnet and host parts of spr13.tik.ee.ethz.ch? answer: address is net = net:subnet =? address = = :b net:subnet = b = host = b = IP address changes when host moves to another subnet (ex: Ethernet split into 2); compare to bridging

78 Chapter 4 : IP IP Addresses (examples -2) /24 Java Business Solutions AG /23 Tango SA /16 Internet Service Provider SovKom Sovkom has received IP addresses to Java Business Solutions AG has received IP addresses to Tango SA has received IP addresses to

79 Chapter 4 : IP Special Case IP Addresses
1. this host
2. hostId: specified host on this net
3. limited broadcast (not forwarded by routers)
4. netid.all 1's: broadcast on this net
5. netid.subnetid.all 1's: broadcast on this subnet
6. x.x.x: loopback ( /8)
7. 10/8, / , /16: reserved networks for internal use
Example: : : broadcast to all LRC net : LRC net : tik-sprach
hostid = 0 designates the network
1,2: source IP@ only; 3,4,5: destination IP@ only
The following address blocks are reserved and cannot be used in the Internet; they are typically used in experimental or closed environments: (10/8) (172.16/12) ( /16)

80 Chapter 4 : IP IP Principles
Homogeneous addressing:
- an IP address is unique across the whole network (= the world in general)
- an IP address is the address of an interface
- communication between IP hosts requires knowledge of IP addresses
Routers between subnetworks only:
- a subnetwork = a collection of systems with a common prefix
- inside a subnetwork: hosts communicate directly without routers
- between subnetworks: one or several routers are used
Terminology: host = end system; router = intermediate system; subnetwork = one collection of hosts that can communicate directly without routers

81 Chapter 4 : IP The IP Packet Forwarding Algorithm
Rule for sending packets (hosts, routers):
- if the destination IP address has the same prefix as one of self's interfaces, send directly to that interface
- otherwise send to a router as given by the IP routing table
Example of IP routing tables:
- at lrcsuns: Next Hop Table (destination@, subnetmask, nexthop, with a DEFAULT entry) and Physical Interface Table (IP addr, subnetmask)
- at in-inj: Next Hop Table (destination@, subnetmask, nexthop, with a DEFAULT entry) and Physical Interface Table (IP addr, subnetmask)
The IP packet forwarding algorithm is the core of the TCP/IP architecture. It defines what a system should do with a packet it has to send or to forward. The rule is the simple one stated above. It uses the IP routing table; the table can be checked with a command such as netstat on Unix or route on Windows NT.

82 Chapter 4 : IP IP Unicast Packet Forwarding Algorithm
Read destaddr = destination IP address /* assume it is unicast */
Case 1: a host route exists for destaddr
    for every entry in routing table
        if (destinationaddr = destaddr) then send to nexthop IPaddr; leave
Case 2: destaddr is on a directly connected network (= on-link)
    for every physical interface IP address A and subnet mask sm
        if (A & sm = destaddr & sm) then send directly to destaddr; leave
Case 3: a network route exists for destaddr
    for every entry in routing table
        if (destinationaddr & subnetmask = destaddr & subnetmask) then send to nexthop IPaddr; leave
Case 4: use default route
    for every entry in routing table
        if (destinationaddr = default) then send to nexthop IPaddr; leave
In reality there are exceptions to the rule. The complete algorithm is as above; the cases should be tested in that order (it is a nested if then else statement). Remember that the above is the packet forwarding algorithm. The tables are written by the control method (the routing algorithms).
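The four cases can be sketched in C as follows. This is only an illustration under stated assumptions: the types `struct route` and `struct iface` and the function `forward` are hypothetical, a host route is assumed to be stored as an entry whose mask is all ones, the default route is marked by a flag, and network routes are matched in table order (first match), as in the slide.

```c
#include <stddef.h>
#include <stdint.h>

/* One routing table entry (hypothetical layout for this sketch). */
struct route {
    uint32_t dest;       /* destination address or prefix          */
    uint32_t mask;       /* subnet mask; all ones for a host route */
    uint32_t next_hop;   /* IP address of the next hop router      */
    int      is_default; /* non-zero for the default route         */
};

/* One physical interface: its IP address and subnet mask. */
struct iface {
    uint32_t addr;
    uint32_t mask;
};

/* Returns the case number (1 to 4) that applied, or 0 if there is no
   route. *hop receives the IP address to send to on the link: the next
   hop router, or the destination itself for direct forwarding. */
static int forward(uint32_t dst,
                   const struct iface *ifs, size_t nif,
                   const struct route *rt, size_t nrt,
                   uint32_t *hop)
{
    size_t i;

    /* Case 1: a host route exists for dst. */
    for (i = 0; i < nrt; i++)
        if (!rt[i].is_default && rt[i].mask == 0xFFFFFFFFu &&
            rt[i].dest == dst) {
            *hop = rt[i].next_hop;
            return 1;
        }

    /* Case 2: dst is on a directly connected network (on-link). */
    for (i = 0; i < nif; i++)
        if ((ifs[i].addr & ifs[i].mask) == (dst & ifs[i].mask)) {
            *hop = dst;
            return 2;
        }

    /* Case 3: a network route exists for dst. */
    for (i = 0; i < nrt; i++)
        if (!rt[i].is_default && rt[i].mask != 0xFFFFFFFFu &&
            (rt[i].dest & rt[i].mask) == (dst & rt[i].mask)) {
            *hop = rt[i].next_hop;
            return 3;
        }

    /* Case 4: use the default route. */
    for (i = 0; i < nrt; i++)
        if (rt[i].is_default) {
            *hop = rt[i].next_hop;
            return 4;
        }

    return 0; /* no route to destination */
}
```

The returned case number corresponds to the "case number" column of the exercise table; a production implementation would typically use longest prefix match instead of table order for case 3.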

83 Chapter 4 : IP Example 20 Fill in the table if an IP packet has to be sent from lrcsuns final destination next hop case number Fill in the table if an IP packet has to be sent from in-inj On UNIX: netstat -nr

84 Chapter 4 : IP Direct Packet Forwarding: ARP
Sending to a system on the same subnet = direct packet forwarding; it does not use a router.
It requires knowledge of the (next-hop) MAC address on LANs (called physical address):
- solution 1: configuration (ex: with arp utility on UNIX)
- solution 2: algorithmic mapping (encode MAC address in IP address)
- solution 3: directory service in server (used for ATM LANs, OSI CLNS)
- solution 4: Address Resolution Protocol: automatic configuration, used on Ethernet, Token Ring, FDDI
ARP maps a 32 bit IP address to a 48 bit MAC address.
A system A (host or router) decides to send an IP packet directly to the destination when the IP address of the destination has the same prefix as A. Otherwise, A sends to the next hop router. In most cases (namely all cases except point to point links such as modems), this requires the knowledge of the MAC address of the destination or of the next hop router. There are four types of solutions for that; all exist in some form or another. Solution 1 can always be implemented manually on Unix or Windows NT using the arp command. Solution 2 requires that the MAC address fits in the IP address; it is used with IPv6 but not with the current version of IP. Solution 3 is used with ATM. Solution 4 is used with all other LANs, in particular with Ethernet, Token Ring or FDDI.

85 Chapter 4 : IP ARP Protocol (Ethernet, FDDI) (1)
[Figure: lrcsuns, lrcpc1, lrcpc2 and in-inr on the same LAN, with their IP and MAC addresses]
1: lrcsuns has a packet to send to IP address (lrcpc1)
- this address is on the same subnet
- lrcsuns sends an ARP request to all systems on the subnet (Ethernet broadcast) with target IP address =
- the ARP request is received by all IP systems on the local network; it is not forwarded by routers

86 Chapter 4 : IP ARP Protocol (Ethernet, FDDI) (1)
[Figure: lrcsuns, lrcpc1, lrcpc2 and in-inr on the same LAN, with their IP and MAC addresses]
2: lrcpc1 has recognized its IP address
- it sends an ARP reply packet to the requesting system with its IP and MAC addresses

87 Chapter 4 : IP ARP Protocol (Ethernet, FDDI) (1)
[Figure: lrcsuns, lrcpc1, lrcpc2 and in-inr on the same LAN, with their IP and MAC addresses]
3: lrcsuns reads the ARP reply, stores it in its cache and sends the IP packet to lrcpc1
Systems learn from ARP-REQUESTs. At the end of flow 1, all systems have learnt the mapping IP <-> MAC addr for the source of the ARP-REQUEST, namely, they have updated the following entry in their ARP table: IP addr: hw addr: 08:00:20:71:0D:D4. As a result, lrcpc1 will not send an ARP-REQUEST to communicate back with lrcsuns.
Gratuitous ARP consists in sending an ARP-REQUEST for self's own address. This is used at bootstrap to test the presence of a duplicate IP address. It is also used to force ARP cache entries to be changed after an address change (because systems learn from the ARP-REQUEST).
As flow 2 shows, the ARP-REPLY is not broadcast but sent directly to the system that issued the request. The arp command on Unix can be used to see or modify the ARP table.
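The way systems learn from ARP-REQUESTs can be sketched with a tiny ARP cache in C. This is an illustrative sketch only: the structure layout, the fixed cache size and the names `arp_learn` and `arp_lookup` are assumptions, and entry ageing is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARP_CACHE_SIZE 8   /* arbitrary size chosen for the sketch */

struct arp_entry {
    uint32_t ip;
    uint8_t  mac[6];
    int      used;
};

struct arp_cache {
    struct arp_entry e[ARP_CACHE_SIZE];
};

/* Record or refresh the mapping ip -> mac, as a system does when it
   sees the source fields of an ARP-REQUEST or an ARP-REPLY.
   Returns 0 on success, -1 if the cache is full. */
static int arp_learn(struct arp_cache *c, uint32_t ip, const uint8_t mac[6])
{
    int i, free_slot = -1;

    for (i = 0; i < ARP_CACHE_SIZE; i++) {
        if (c->e[i].used && c->e[i].ip == ip) {
            memcpy(c->e[i].mac, mac, 6);   /* overwrite the old mapping */
            return 0;
        }
        if (!c->e[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;
    c->e[free_slot].ip = ip;
    memcpy(c->e[free_slot].mac, mac, 6);
    c->e[free_slot].used = 1;
    return 0;
}

/* Return the cached MAC address for ip, or NULL if an ARP-REQUEST
   would have to be sent first. */
static const uint8_t *arp_lookup(const struct arp_cache *c, uint32_t ip)
{
    int i;

    for (i = 0; i < ARP_CACHE_SIZE; i++)
        if (c->e[i].used && c->e[i].ip == ip)
            return c->e[i].mac;
    return NULL;
}
```

Calling `arp_learn` a second time for the same IP address replaces the stored MAC address, which is the effect a gratuitous ARP has on the caches of the other systems.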

88 Chapter 4 : IP Example
On UNIX: arp -a
- What are the MAC and IP addresses at points 1 and 2 for packets sent by M1 or M4 to M3?
- What must the router do when it receives a packet to M2 for the first time?
[Figure: subnet p and subnet q, each built on an Ethernet concentrator and joined by a router (M9 = p.1, M8 = q.1); systems M1 = p.h1, M2 = p.h2, M3 = q.h1, M4 = q.h3; observation points 1 and 2]

89 Chapter 4 : IP Proxy ARP
- Proxy ARP = a system answers ARP requests on behalf of others; example: sic500cs for a PPP connected computer
- manual configuration
- works well for stub networks only
[Figure: a computer connected by modem + PPP to sic500cs; ed0-ext, stisun and ed2-in on the EPFL backbone]
Proxy ARP is a trick used in special situations, typically:
- on modem lines
- when you want to interconnect 2 subnets while using only one subnet prefix.
It requires a manual configuration and causes a single point of failure.
Reverse ARP has nothing to do with ARP; the purpose of Reverse ARP is to find the IP address corresponding to a MAC address. It is now superseded by protocols like DHCP.

90 Chapter 4 : IP IP Packet Format
Header fields, in order of transmission:
- Ver = 4, IHL, ToS, Total Length
- Identification, Flag, Fragment Offset
- TTL, Protocol, Header Checksum
- Source Address
- Destination Address
- Options (if present)
followed by the data.
- Current IP = version 4; next generation IP = version 6; other versions are experimental (ex: STII) or obsolete
- IHL = IP header length = 20 + size of option fields (IPv4); the IHL field itself counts 32-bit words
- Flag = 0; Don't Fragment; More Fragments
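As an illustration of the header layout, the following C sketch reads the fixed fields from a raw byte buffer. The names `struct ipv4_info` and `parse_ipv4` are hypothetical; the sketch ignores options and does not verify the header checksum.

```c
#include <stddef.h>
#include <stdint.h>

/* Decoded view of the fixed part of an IPv4 header (sketch only). */
struct ipv4_info {
    unsigned version;        /* 4 for IPv4                         */
    unsigned header_len;     /* header length in bytes (IHL * 4)   */
    unsigned total_len;      /* header + data, in bytes            */
    unsigned id;             /* Identification                     */
    int      dont_fragment;  /* Don't Fragment flag                */
    int      more_fragments; /* More Fragments flag                */
    unsigned frag_offset;    /* Fragment Offset, in 8-byte units   */
    unsigned ttl;
    unsigned protocol;       /* 1=ICMP, 2=IGMP, 6=TCP, 17=UDP      */
};

/* Parse the header found at p (len bytes available).
   Returns 0 on success, -1 if this is not a plausible IPv4 header. */
static int parse_ipv4(const uint8_t *p, size_t len, struct ipv4_info *out)
{
    if (len < 20)
        return -1;
    out->version    = p[0] >> 4;
    out->header_len = (p[0] & 0x0Fu) * 4u;
    if (out->version != 4 || out->header_len < 20 || out->header_len > len)
        return -1;
    out->total_len      = ((unsigned)p[2] << 8) | p[3];
    out->id             = ((unsigned)p[4] << 8) | p[5];
    out->dont_fragment  = (p[6] & 0x40) != 0;
    out->more_fragments = (p[6] & 0x20) != 0;
    out->frag_offset    = ((unsigned)(p[6] & 0x1F) << 8) | p[7];
    out->ttl            = p[8];
    out->protocol       = p[9];
    return 0;
}
```

Reading byte by byte, rather than overlaying a C struct with bit-fields on the buffer, keeps the sketch independent of the byte order and alignment rules of the host.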

91 Chapter 4 : IP IP Header fields
Type of Service: precedence (3 bits), ToS (4 bits), unused (1 bit)
- ToS values: 0000 normal; 1000 minimize delay; 0100 maximize throughput; 0010 maximize reliability; 0001 minimize $ cost
- use being standardized by the diff-serv group
Time To Live (TTL):
- TTL = 1 + maximum lifetime in number of hops (1 hop = 1 router); avoids infinite loops
- every router must decrement it by 1 (used as hop count)
- a router that receives a packet with TTL = 1 which is not destined to itself discards the packet and (except for multicast addresses) generates an error message (ICMP) to the source; the packet is accepted if its destination address is one of the router's own addresses
- a host should not send datagrams with TTL = 0
Protocol Type: identifies the user of the IP protocol: 1 = ICMP, 2 = IGMP, 6 = TCP, 17 = UDP, 89 = OSPF

92 Chapter 4 : IP MTU
- physical networks have different maximum packet lengths
- MTU (maximum transmission unit) = maximum packet size usable for an IP packet
- question: what is the value of a short MTU? of a long MTU?
[Table: MTU per network type, for Ethernet, Ethernet with LLC/SNAP, Token Ring 4 Mb/s and 16 Mb/s, FDDI, X.25, Frame Relay, ATM with AAL5, Hyperchannel and PPP (to 1500)]
lrcsuns:/export/home1/leboudec$ ifconfig -a
lo0: flags=849<up,loopback,running,multicast> mtu 8232
        inet netmask ff
le0: flags=863<up,broadcast,notrailers,running,multicast> mtu 1500
        inet netmask ffffff00 broadcast
        ether 8:0:20:71:d:d4
- modem link: short MTU; 1000 B at 9600 b/s = 530 ms, too large for interactive traffic
- large MTU = higher throughput: less overhead (TCP + IP = 40 bytes header overhead); no fragmentation loss avalanche effect

93 Chapter 4 : IP IP Fragmentation
- IP hosts or routers may have IP datagrams larger than the MTU
- fragmentation is performed when an IP datagram is too large
- re-assembly is only at the destination
- fragmentation is in principle avoided with TCP
[Figure: a datagram with 1400 bytes of data (1) crosses three links with MTU = 1500, MTU = 620 and MTU = 1500 through routers R1 and R2; on the second link it travels as fragments 2a (600 B), 2b (600 B) and 2c (200 B), each with its own IP header; the same fragments continue on the third link as 3a, 3b and 3c]

94 Chapter 4 : IP IP Fragmentation (2)
- an IP datagram is fragmented if the MTU of the interface < datagram total length
- all fragments are self-contained IP packets
- fragmentation is controlled by the fields Identification, Flag and Fragment Offset
- IP datagram = original; IP packet = fragment or complete datagram
[Table: Length, Identification, More Fragments flag, Offset and 8 * Offset for packets 1, 2a, 2b and 2c]
- the fragment data size (here 600) is always a multiple of 8, except possibly for the last fragment
- the Identification is given by the source

95 Chapter 4 : IP Fragmentation Algorithm
- Repeated fragmentations may occur
- The Don't Fragment flag prevents fragmentation
Fragmentation algorithm:
procedure sendipp(P0):
    if P0.totalLength > MTU then
        data1length = (MTU - P0.HLEN) rounded down to a multiple of 8
        data1 = first data1length bytes of P0 data part
        data2 = remainder of P0 data part
        header1 = P0.header with More bit set and totalLength = P0.HLEN + data1length
        P1 = new(IPPacket; header1; data1)
        send P1 on data link layer
        header2 = P0.header with totalLength = P0.totalLength - data1length and fragmentOffset += data1length/8
        P2 = new(IPPacket; header2; data2)
        sendipp(P2)
    else
        send P0 on data link layer
Attention: option field processing is not included (see exercise)
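The arithmetic of the fragmentation algorithm can be checked with a short C sketch that only computes the fragment sizes and offsets. The names `struct frag` and `fragment` are hypothetical; the sketch assumes that the input is an unfragmented datagram (offset 0, More bit clear) and, like the slide, ignores options.

```c
#include <stddef.h>

/* Description of one fragment (sketch only, no payload copied). */
struct frag {
    unsigned offset8;  /* Fragment Offset field, in units of 8 bytes */
    unsigned data_len; /* bytes of data carried by this fragment     */
    int      more;     /* More Fragments bit                         */
};

/* Split a datagram with header length hlen and data_len bytes of data
   for a link with the given MTU. Writes at most max_out entries to out
   and returns the number of fragments, or 0 if the MTU is too small or
   out is too short. */
static size_t fragment(unsigned hlen, unsigned data_len, unsigned mtu,
                       struct frag *out, size_t max_out)
{
    size_t n = 0;
    unsigned offset8 = 0;
    unsigned chunk;

    if (mtu < hlen + 8)
        return 0;
    chunk = ((mtu - hlen) / 8) * 8;  /* round down to a multiple of 8 */

    while (hlen + data_len > mtu) {  /* does not fit: cut one fragment */
        if (n == max_out)
            return 0;
        out[n].offset8  = offset8;
        out[n].data_len = chunk;
        out[n].more     = 1;
        n++;
        offset8  += chunk / 8;
        data_len -= chunk;
    }

    if (n == max_out)
        return 0;
    out[n].offset8  = offset8;       /* last (or only) piece */
    out[n].data_len = data_len;
    out[n].more     = 0;
    return n + 1;
}
```

For a 20 byte header, 1400 bytes of data and MTU = 620, this gives three fragments carrying 600, 600 and 200 bytes at offsets 0, 75 and 150, which matches the 600 B / 600 B / 200 B figure of the IP Fragmentation slide.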

96 Chapter 4 : IP Fragment Re-Assembly
- Re-assembly is performed at the final destination only, never at intermediate points
- Re-assembly issues: packet misordering; packet loss; other?
Other = deadlock

97 Chapter 4 : IP
IP packets are sorted in fragment lists:
- one fragment list per (Identification, source address)
- sorted by increasing Fragment Offset
Fragments F1 and F2 are contiguous iff
- F1.moreBit = 1
- F1.fragmentOffset + F1.dataLength/8 = F2.fragmentOffset
Fragment list F0 ... Fn is complete iff
- F0.fragmentOffset = 0
- Fi and Fi+1 are contiguous for i = 0 ... (n-1)
- Fn.moreBit = 0
IP packet arrival (P0) /* and packet is not a complete datagram */ ->
    if P0.(identification, source address) is new then
        if new(fragmentlist, P0.(identification, source address), fl) then
            insert P0 in fl
            start reassemblytimer(fl)
    else
        fl = fragmentlist(P0.(identification, source address))
        insert(fl, P0)
        if fl is complete then deliver IP datagram
        else start reassemblytimer(fl)
reassemblytimer(fl) expires ->
    send ICMP error message to source
    delete(fl)
Comments:
- new(fragment list) may fail if there is no buffer left; in that case the datagram is lost and it is better not to retry for a given time (max TTL)
- insert may fail; if insert fails, then the fragment is discarded
- timer value (RFC 1122) = 60s to 120s (corresponds to the maximum lifetime of a packet in the Internet)
Note that RFC 815 (Clark's algorithm) gives a complete version, with care for saving space and time, which is not the case for this algorithm.
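The two definitions (contiguous fragments, complete fragment list) translate almost literally into C. In this sketch the names `struct fragment`, `contiguous` and `list_complete` are hypothetical, and the list is assumed to be an array already sorted by increasing Fragment Offset.

```c
#include <stddef.h>

/* The fields of a received fragment that matter for re-assembly. */
struct fragment {
    unsigned offset8;  /* Fragment Offset, in units of 8 bytes */
    unsigned data_len; /* bytes of data in this fragment       */
    int      more;     /* More Fragments bit                   */
};

/* F1 and F2 are contiguous iff F1 has its More bit set and F2 starts
   exactly where the data of F1 ends. */
static int contiguous(const struct fragment *f1, const struct fragment *f2)
{
    return f1->more && f1->offset8 + f1->data_len / 8 == f2->offset8;
}

/* A sorted list f[0..n-1] is complete iff it starts at offset 0, all
   neighbours are contiguous and the last fragment has More bit = 0. */
static int list_complete(const struct fragment *f, size_t n)
{
    size_t i;

    if (n == 0 || f[0].offset8 != 0)
        return 0;
    for (i = 0; i + 1 < n; i++)
        if (!contiguous(&f[i], &f[i + 1]))
            return 0;
    return !f[n - 1].more;
}
```

A list holding fragments of 600, 600 and 200 bytes at offsets 0, 75 and 150 is complete; the same list without its middle or last element is not, so the datagram stays in the buffer until the missing fragment arrives or the re-assembly timer expires.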

98 Chapter 4 : IP Fragmentation Problems
Fragmentation requires re-assembly:
- deadlocks
- identification wrapping problem
- unit of loss is smaller than unit of re-transmission
Solution = avoid fragmentation:
- Path MTU = minimum MTU for all links of one path
- discovery of path MTU by heuristics: local -> 1500; other: 576 (subnetsarelocal variable)
- Path MTU discovery avoids fragmentation
UNIX (BSD) heuristics: subnets are assumed local, hence PMTU = 1500, except if variable subnetsarelocal is false.
On a bridged LAN with different frame lengths (for example, a bridged LAN with both FDDI and Ethernet segments), there is no simple mechanism to handle the difference in frame size. Thus, such configurations should be avoided. There exists another form of bridging, called source routing bridging, which avoids the problem.

99 Chapter 4 : IP Path MTU Discovery
Method for Path MTU (PMTU) discovery:
1. the host sets the Don't Fragment bit on all datagrams and sets its PMTU estimate to the local MTU
2. routers send an ICMP message: destination unreachable / fragmentation needed
3. the host reduces its PMTU estimate to the next smallest value
4. after a timeout, the host increases its PMTU estimate
Route changes may cause step 2 to occur.
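Step 3 ("reduce the PMTU estimate to the next smallest value") is often implemented with a table of common MTU values. The following C sketch assumes such a table; the values are modelled on the plateau table of RFC 1191 and should be read as an illustrative assumption, and the function name `next_lower_plateau` is hypothetical.

```c
#include <stddef.h>

/* Common MTU values in decreasing order (assumed plateau table,
   modelled on the one in RFC 1191). */
static const unsigned plateau[] = {
    65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68
};

/* Return the largest plateau strictly below the current estimate,
   or the minimum value 68 if there is none. */
static unsigned next_lower_plateau(unsigned current)
{
    size_t i;

    for (i = 0; i < sizeof plateau / sizeof plateau[0]; i++)
        if (plateau[i] < current)
            return plateau[i];
    return 68;
}
```

A host that starts with an estimate of 1500 and receives a "fragmentation needed" message would, under these assumptions, retry with 1492, then 1006, and so on. When the ICMP message itself carries the MTU of the next hop, that value can be used directly instead of the table.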

100 Chapter 4 : IP TCP, UDP and Fragmentation
The UDP service interface accepts a datagram up to 64 KB:
- the UDP datagram is passed to the IP service interface as one SDU
- it is fragmented at the source if the resulting IP datagram is too large
The TCP service interface is stream oriented:
- packetization is done by TCP
- several calls to the TCP service interface may be grouped into one TCP segment (many small pieces)
- or one call may cause several segments to be created (one large piece)
- TCP always creates a segment that fits in one IP packet: no fragmentation at the source
- fragmentation may occur in a router, if IPv4 is used and if PMTU discovery is not implemented

101 Chapter 4 : IP Other IP Header Fields
- Protocol Type: identifies the user of the IP protocol: 1 = ICMP, 2 = IGMP, 6 = TCP, 17 = UDP, 89 = OSPF
- Header checksum
- Options: a number of options can be in one datagram: strict source routing (with IP addresses of next hop routers); loose source routing (next router need not be next hop); record route; timestamp

102 Chapter 4 : IP ICMP : Internet Control Message Protocol
- used by a router or host to send error or control messages to other hosts or routers
- error or control messages relate to layer 3 only
- carried in IP datagrams (protocol type = 1)
ICMP message types:
- echo request (reply) -> used by ping
- destination unreachable
- time exceeded -> used for traceroute responses
- address mask request/reply
- source quench
- redirect
- router discovery
- timestamps
ICMP messages are never sent in response to:
- an ICMP error message
- a datagram sent to a multicast or broadcast IP or layer 2 address
- a fragment other than the first
Notes on the message types:
- echo contains a data part to be returned
- destination unreachable = routing error or administrative error (sent if possible); no message is sent for buffer overflow
- TTL exceeded is not used on multicast addresses
- address mask request is sent to the router (to the broadcast address if the router is not known)
- source quench is not used
- timestamp messages have fields for time sent, time received and time responded; the source computes the transit delay -> real time (icmptime)
NTP (protocol, RFC 1305) has a precision of the order of 1 ms.
Broadcast storms: a message is sent to the broadcast Ethernet address with a unicast IP address; all systems send a host unreachable or redirect (see exercise).

103 Chapter 4 : IP ICMP Redirect Example
[Figure: systems in-inr, lrcsuns, ed2-in, inr-el, ed2-el and lemas]
Packets (dest IP addr, srce IP addr, prot, data part):
1: udp xxxxxxx
2: udp xxxxxxx
3: icmp type=redir code=host cksum xxxxxxx (28 bytes of 1)
4: udp...

104 Chapter 4 : IP ICMP Redirect Example (cont'd)
After 4:
lrcsuns:/export/home1/leboudec$ netstat -nr
Routing Table:
Destination Gateway Flags Ref Use Interface
UH lo UGHD U le U 3 0 le0 default UG

105 Chapter 4 : IP ICMP Redirect
- sent by a router to the source host to inform the source that the destination is directly connected
- the host updates its routing table
- question: can ICMP redirect be used to update a router table (ex: in-inj route to LRC)?
ICMP Redirect format:
- IP datagram header (prot = ICMP)
- Type = 5, code, checksum
- Router IP address that should be preferred
- IP header plus 8 bytes of original datagram data
General routing principle of the TCP/IP architecture:
- hosts have minimal routing information and learn host routes from ICMP redirects
- routers have extensive knowledge of routes

106 Chapter 4 : IP Routing Table maintenance
At hosts:
- configuration
- ICMP redirect
- ICMP router discovery messages
At routers:
- all routers participate in routing protocols: distribute addresses and routes
- autonomous systems (ASs): stub or multihomed (ex: EPFL); transit (ex: Switch)
- between ASs: EGP and BGP
- inside an AS: RIP, OSPF (standard), IGRP (Cisco)
- example: OSPF routers exchange topology and addressing information -> topology database; routes are computed with Dijkstra's SPF algorithm
Router discovery:
- routers answer with a preference level, set up by the administrator
- ICMP message type = 9 (router advertisement), 10 (router solicitation)
- sent over multicast addresses
- advertisements are randomized, every 9 to 10 mn
- a host solicits 3 times, 3 seconds apart
- router solicitation is for a host to discover its default router only
EGP is used between stub/multihomed and transit ASs; BGP is used between transit nets and supports policy routing. BGP lets all addresses of all nets be known to all BGP routers.

107 Chapter 4 : IP ICMP : Internet Control Message Protocol
- used by a router or host to send error or control messages to other hosts or routers
- error or control messages relate to layer 3 only
- carried in IP datagrams (protocol type = 1)
ICMP message types:
- echo request (reply) -> used by ping
- destination unreachable
- time exceeded -> used by traceroute
- address mask request/reply
- source quench
- redirect
- router discovery
- timestamps

108 Chapter 4 : IP Broadcasting, Multicasting
- Broadcast = send to all: sent to all hosts on one net/subnet; used by NetBIOS for discovery
- Multicast = send to a group
- IP multicast address = class D = to ; = all multicast capable systems on subnet; = all multicast capable routers on subnet
- used for: conferencing, radio distribution, ...
- IP uses the open group paradigm
- multicast IP addresses are logical (= non topological)
- for receiving data sent to multicast address m, a host must subscribe to m
- for sending to multicast address m, a host simply writes m in the dest addr field
Multicast addresses are not allocated on a geographical basis. A global allocation scheme is under discussion at the IETF. Today, global scope addresses are allocated using the sd tool on Unix.
IPv6 SCOP   RFC 1884 Description        IPv4 Prefix
==================================================================
0           reserved
1           node-local scope
2           link-local scope            /24
3           (unassigned)                /16
4           (unassigned)
5           site-local scope
6           (unassigned)
7           (unassigned)
8           organization-local scope    /14
A           (unassigned)
B           (unassigned)
C           (unassigned)
D           (unassigned)
E           global scope
F           reserved

109 Chapter 4 : IP IP Multicast Principles
[Figure: source S, hosts A and B, and routers R1 to R5; A issues an IGMP join for m at R1, B joins at R4, multicast routing builds the tree and S sends to m; the numbers 1 to 5 refer to the steps below]
- hosts subscribe via IGMP join messages sent to a router
- routers build a distribution tree via multicast routing
- sources do not know who the destinations are
- packet multiplication is done by routers
Steps:
1. S sends data to multicast address m; there is no member, so the data is simply lost at the router
2. A joins the multicast address m
3. R1 informs the rest of the network that m has a member at R1; the multicast routing protocol builds a tree. Data sent by S now reaches A
4. B joins the multicast address m
5. R4 informs the rest of the network that m has a member at R4; the multicast routing protocol adds branches to the tree. Data sent by S now reaches both A and B

110 Chapter 4 : IP Multicast Address Scopes
IPv6 SCOP   RFC 1884 Description        IPv4 Prefix
==================================================================
0           reserved
1           node-local scope
2           link-local scope            /24
3           (unassigned)                /16
4           (unassigned)
5           site-local scope
6           (unassigned)
7           (unassigned)
8           organization-local scope    /14
A           (unassigned)
B           (unassigned)
C           (unassigned)
D           (unassigned)
E           global scope
F           reserved

111 Chapter 4 : IP IP Multicast Forwarding Algorithm
Packet forwarding (host, router):
    Read MA = destination IP@ /* assume it is multicast */
    for every physical interface PI
        if MA is enabled on PI then send directly to PI
Send directly (Ethernet, FDDI):
    send directly(MA, MAC@):
        map last bits of MA to last bits of MAC address
        send MAC frame with DA = E-xx-xx-xx, SA = own i/f address
Systems have to know which groups they belong to:
- hosts: application processes register to IP
- routers: learn if members are present with IGMP
Direct send to link layer: algorithmic mapping of the last 23 bits; ex: -> E-02-A6-CF
At lrcsuns: Physical Interface Table (IP addr, subnetmask)
The mapping from IP to MAC for multicast addresses is not unique. Ethernet hosts must filter up to 32 IP addresses for one MAC multicast address.
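The algorithmic mapping of the last 23 bits can be sketched in C as follows. The sketch relies on the standard Ethernet multicast prefix 01-00-5E for IPv4; the function name `mcast_ip_to_mac` is hypothetical, and the group 224.2.166.207 used in the usage note is an assumed address, chosen because it produces the MAC suffix 02-A6-CF shown on the slide.

```c
#include <stdint.h>

/* Map an IPv4 multicast address (host byte order) to its Ethernet
   multicast MAC address: fixed prefix 01-00-5E, a 0 bit, then the
   last 23 bits of the IP address. */
static void mcast_ip_to_mac(uint32_t group, uint8_t mac[6])
{
    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5E;
    mac[3] = (uint8_t)((group >> 16) & 0x7F); /* top bit is dropped */
    mac[4] = (uint8_t)((group >> 8) & 0xFF);
    mac[5] = (uint8_t)(group & 0xFF);
}
```

Under these assumptions, 224.2.166.207 maps to 01-00-5E-02-A6-CF, and so does 225.130.166.207: the 5 bits between the class D prefix and the last 23 bits are lost, which is why 32 IP multicast addresses share one MAC address and hosts must filter at the IP level.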

112 Chapter 4 : IP IGMP: Internet Group Management Protocol
Purpose: manage group membership inside one subnet
- routers: know if a group is present on an interface; know whether to forward locally or not
- hosts: know if a multicast address is already in use locally
[Figure: lrcsuns, lrcpc1, lrcpc2 and MCrouter on one subnet; lrcpc2 is configured not to use multicast]
1: IGMP query, TTL = 1, IGMP = 0; dest IP@ = ; source IP@ =
2: IGMP report, TTL = 1, IGMP = ; dest IP@ = ; source IP@ =
3: IGMP report, TTL = 1, IGMP = ; dest IP@ = ; source IP@ =
TTL = 1 is used in order to avoid broadcast propagation.

113 Chapter 4 : IP IGMP Host Implementation
Host implementation goal: avoid avalanche effects: one router originated query might cause a burst of reports.
Solution = the synchronization avoidance protocol:
1. hosts delay responses randomly
2. hosts listen to responses; only the first one answers
Host IGMP finite state machine (event: action), with states "Multicast Address not used", "Timer Active" and "Member":
- join group: (1) send response
- leave group
- response read
- timer expires: send response
- query read: (2)
(1): a first response is sent spontaneously and a short timer (10s) is set; another response is sent after expiration (because of possible loss)
(2): a random timer is chosen
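One possible reading of the host state machine is sketched below in C. The arrows of the original diagram are not recoverable from the text, so the transitions are an assumption reconstructed from notes (1) and (2); the names `igmp_state`, `igmp_event` and `igmp_step` are hypothetical, and timer handling is reduced to events.

```c
/* States of the host IGMP state machine for one multicast address. */
enum igmp_state { NOT_USED, TIMER_ACTIVE, MEMBER };

/* Events seen by the state machine. */
enum igmp_event {
    JOIN_GROUP, LEAVE_GROUP, QUERY_READ, RESPONSE_READ, TIMER_EXPIRES
};

/* Apply one event. Returns the next state; *send_response is set to 1
   when the host has to send an IGMP response (report). */
static enum igmp_state igmp_step(enum igmp_state s, enum igmp_event ev,
                                 int *send_response)
{
    *send_response = 0;
    switch (s) {
    case NOT_USED:
        if (ev == JOIN_GROUP) {      /* (1): respond at once, start timer */
            *send_response = 1;
            return TIMER_ACTIVE;
        }
        break;
    case TIMER_ACTIVE:
        if (ev == RESPONSE_READ)     /* someone else answered first */
            return MEMBER;
        if (ev == TIMER_EXPIRES) {   /* nobody answered: we respond */
            *send_response = 1;
            return MEMBER;
        }
        if (ev == LEAVE_GROUP)
            return NOT_USED;
        break;
    case MEMBER:
        if (ev == QUERY_READ)        /* (2): start a random timer */
            return TIMER_ACTIVE;
        if (ev == LEAVE_GROUP)
            return NOT_USED;
        break;
    }
    return s;                        /* all other events are ignored */
}
```

In this reading, a query moves every member to "Timer Active"; the first host whose timer expires responds, and the hosts that read that response return to "Member" without sending anything, which is how the burst of reports is avoided.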

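114 Chapter 4 : IP MBone (1)
[Figure: MBone routers interconnected through the Internet, with packets labelled 1a to 4a and 2b to 4b; for each packet the figure shows dest IP addr, srce IP addr, prot and data part: the multicast packet (UDP data) is carried inside another IP packet (IP, UDP, data) between MBone routers (2a, 3a) and is delivered as the original packet (4a)]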

115 Chapter 4 : IP MBone (2)
Global multicast is not available: no stable routing protocol is implemented in all routers of the Internet.
MBone = a network of routers supporting multicast:
- tunneling is used to build virtual links; protocol = 4 in the IP header
- this is an example of the use of a network layer as a layer 2 by another network; other examples: IPv6 over IPv4, IP over Frame Relay, over ATM, AppleTalk over IP, etc.
MBone hacks:
- limitation of multicast is enforced by MBone routers on the TTL field
- multicast routing with DVMRP: each router computes the SPT from each source using a distance vector algorithm; reverse path forwarding (RPF)
At EPFL, IP multicast is supported as follows:
- inside EPFL, CISCO routers support multicast IP
- scoping is by use of TTL: TTL < 8: inside LRC; TTL < 16: inside DI; TTL < 32: inside Ecublens; TTL < 64: inside EPFL; TTL = 127: worldwide
- routing inside EPFL with PIM, outside via MBone
- to know more about multicast in general:
- at EPFL

116 Chapter 4 : IP Multicast Sockets
- uses only UDP
- many servers in principle
- the server has to join explicitly
- supported by a socket option; in in.h:

    struct ip_mreq {
        struct in_addr imr_multiaddr; /* IP multicast address of group */
        struct in_addr imr_interface; /* local IP address of interface */
    };

    struct ip_mreq mreq;
    rc = setsockopt(sd, IPPROTO_IP, IP_ADD_MEMBERSHIP, (void *) &mreq, sizeof(mreq));

- IN_MULTICAST(a) tests whether a is a multicast address
- set ttl appropriately

117 Chapter 4 : IP Multicast Client and Servers
Test examples in these notes:
% ./mcastclient <destaddr> bonjour les amis
%
% ./mcastserv <address> &
%

118 Chapter 4 : IP
/*********************************************************/
/* mcastclient.c                                         */
/* multicast test client                                 */
/*********************************************************/
#include "inet.h"

int main(int nbargplusun, char *mot[]) {
    int sd, rc, i;          // socket descriptor and return code
    unsigned char ttl = 1;  // send multicast with ttl = 1!
    struct sockaddr_in cliaddr, servaddr;
    struct hostent *h;

    // check command line arguments
    if (nbargplusun < 3) {
        printf("usage: %s <server> <data1>...<datan>\n", mot[0]);
        exit(1);
    }

    // resolve server name, print result and populate server
    // address and port
    h = gethostbyname(mot[1]);
    if (h == NULL) {
        printf("%s: unknown host %s \n", mot[0], mot[1]);
        exit(1);
    }
    printf("%s: trying to send data to host %s at address: %s \n",
           mot[0], h->h_name,
           inet_ntoa(*(struct in_addr *) h->h_addr_list[0]));
    servaddr.sin_family = h->h_addrtype;
    memcpy((char *) &servaddr.sin_addr.s_addr, h->h_addr_list[0], h->h_length);
    servaddr.sin_port = htons(SERVER_PORT);

119 Chapter 4 : IP
  // check dest addr is multicast
  if (!IN_MULTICAST(ntohl(servaddr.sin_addr.s_addr))){
    printf("%s: dest addr %s is not multicast \n", mot[0], inet_ntoa(servaddr.sin_addr));
    exit(1);
  }

  // create socket
  sd = socket(AF_INET, SOCK_DGRAM, 0);
  if (sd < 0) {
    printf("%s: cannot open socket \n", mot[0]);
    exit(1);
  }

  // bind any port number
  cliaddr.sin_family = AF_INET;
  cliaddr.sin_addr.s_addr = htonl(INADDR_ANY);
  cliaddr.sin_port = htons(0);
  rc = bind(sd, (struct sockaddr *) &cliaddr, sizeof(cliaddr));
  if (rc < 0) {
    printf("%s cannot bind \n", mot[0]);
    exit(1);
  }

  // set ttl on the socket
  rc = setsockopt(sd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));
  if (rc < 0) {
    printf("%s cannot set ttl = %d IPPROTO_IP, IP_MULTICAST_TTL \n", mot[0], ttl);
    exit(1);
  }

120 Chapter 4 : IP
  // send data: one datagram per command line argument
  for (i = 2; i < nbargplusun; i++){
    rc = sendto(sd, mot[i], strlen(mot[i])+1, 0,
                (struct sockaddr *) &servaddr, sizeof(servaddr));
    if (rc < 0){
      printf("%s: cannot send data %d\n", mot[0], i-1);
      close(sd);
      exit(1);
    }
  } // end for

  // close socket and exit
  close(sd);
  exit(0);
}

121 Chapter 4 : IP
/*******************************************************/
/* mcastserv.c                                         */
/*                                                     */
/* multicast test server                               */
/*******************************************************/
#include "inet.h"

int main(int nbargplusun, char *mot[]){
  int sd, rc, i, n;
  socklen_t clilen;               // length argument expected by recvfrom
  struct ip_mreq mreq;            // req block for mcast address
  struct sockaddr_in cliaddr, servaddr;
  struct in_addr mcastaddr;
  struct hostent *h;
  char msg[MAX_MSG];

  // check command line arguments
  if (nbargplusun != 2) {
    printf("usage: %s <mcast address>\n", mot[0]);
    exit(1);
  }

  // get multicast address for server to listen to
  h = gethostbyname(mot[1]);
  if (h == NULL){
    printf("%s: unknown group %s \n", mot[0], mot[1]);
    exit(1);
  }
  memcpy(&mcastaddr, h->h_addr_list[0], h->h_length);

  // check dest addr is multicast
  if (!IN_MULTICAST(ntohl(mcastaddr.s_addr))){
    printf("%s: dest addr %s is not multicast \n", mot[0], inet_ntoa(mcastaddr));
    exit(1);
  }
  printf("%s: server ready to listen to %s\n", mot[0], mot[1]);

122 Chapter 4 : IP
  // create socket
  sd = socket(AF_INET, SOCK_DGRAM, 0);
  if (sd < 0) {
    printf("%s: cannot open socket \n", mot[0]);
    exit(1);
  }

  // bind server port
  servaddr.sin_family = AF_INET;
  servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
  servaddr.sin_port = htons(SERVER_PORT);
  rc = bind(sd, (struct sockaddr *) &servaddr, sizeof(servaddr));
  if (rc < 0) {
    printf("%s cannot bind port number %d \n", mot[0], SERVER_PORT);
    exit(1);
  }

  // join multicast group
  mreq.imr_multiaddr.s_addr = mcastaddr.s_addr;
  mreq.imr_interface.s_addr = htonl(INADDR_ANY);
  rc = setsockopt(sd, IPPROTO_IP, IP_ADD_MEMBERSHIP, (void *) &mreq, sizeof(mreq));
  if (rc < 0) {
    printf("%s cannot join multicast address %s \n", mot[0], inet_ntoa(mcastaddr));
    exit(1);
  }
  else
    printf("%s now listening to multicast address %s \n", mot[0], inet_ntoa(mcastaddr));

123 Chapter 4 : IP
  // server infinite loop
  while(1){
    // receive; one byte is kept free so that the message can always be terminated
    clilen = sizeof(cliaddr);
    n = recvfrom(sd, msg, MAX_MSG - 1, 0, (struct sockaddr *) &cliaddr, &clilen);
    if (n < 0){
      printf("%s: cannot receive data \n", mot[0]);
      continue;
    }
    msg[n] = '\0';   // make sure the received data is a valid C string

    // print message received
    printf("%s: from %s on address %s: %s\n",
           mot[0], inet_ntoa(cliaddr.sin_addr), mot[1], msg);
  } // end of infinite while
  // never reach this line
}

124 Chapter 4 : IP Routing 61
Packet forwarding
- for every packet
- real time
Routing
- computation of routing tables or data structures for unicast and multicast
- normally only between routers
- non-real time: latency up to 2 minutes
- uses protocols such as RIP, OSPF, EIGRP (Cisco) for unicast and DVMRP, M-OSPF, PIM for multicast

125 Chapter 4 : IP Router Definitions 62
Definition: IP router
- a system that forwards packets based on IP addresses
- performs packet forwarding + control methods: routing, configuration management, DHCP relay, IPv6 router advertisements
- implementation: any UNIX machine can be configured as an IP router; normally, a dedicated packet forwarder called "router"
Multiprotocol router
- a system that forwards packets based on layer 3 addresses for various protocol architectures (ex: IP, Appletalk); CISCO, IBM, etc.
- most multiprotocol routers perform both bridging and routing; architecture: bridge + router; implementation: one CISCO
IP router boxes also perform other functions: port filtering, DHCP relay, etc.
If you ever read commercial literature, you have to be aware of the difference between architecture names and implementation names. The word "router" (like most words) is unfortunately used in both contexts.
- From an architecture viewpoint, a router is any system which forwards packets based on layer 3 information. "Router" in that context is a function.
- The router function can be implemented by a piece of software on Unix or Windows NT, or by a complex dedicated machine (a Cisco, IBM, Bay Networks or Flextel box for example). Most boxes called "routers" perform a set of additional functions that have nothing to do with packet forwarding using layer 3 addresses. For example, they can be used as bridges or application level relays.

126 Chapter 4 : IP Routers and Bridges 63
- Routers overcome the scale limitations of bridges
- But bridges are plug and play and are simpler to manage
- "Intelligent" products combine advantages of both; example: "switching router": knows the MAC addresses of directly attached hosts
[Figure: hosts H1 and H2 attached to switching routers 1 and 2, which are interconnected through a router; interfaces are labelled with MAC addresses (M1, M2, M3, M4, M8, M9) and IP addresses (p.h1, p.h2, p.1, q.1, q.h1, q.h3)]
The words "switches" and "routers" are used in many different ways. For us, a switch is an intermediate system for connection oriented network layers such as ATM or Frame Relay. In the commercial literature, it usually means a fast packet forwarder, usually implemented in hardware. In reality, routers can be implemented in exactly the same way and with the same performance as switches. The main difference is for multiprotocol routers, which need to understand not just one network layer, but many. In such cases, only software implementations are available. In contrast, IP-only routers are emerging with a performance similar to that of switches. The "switching router" concept is an example of marketing exaggeration. It is a router function, placed in an Ethernet concentrator. Since the router is in the concentrator, it can know (for example by learning, or by configuration) the MAC addresses of directly attached systems. Thus, the ARP broadcasts are avoided.

127 Chapter 4 : IP Protocols Other than TCP/IP 64
- Some other protocol families (ex: Appletalk, IPX) are not compatible with TCP/IP: routers must be multiprotocol
- The MAC interface is standard -> bridges are not aware of higher layer protocols (they are "multiprotocol")
[Figure: protocol stacks of three hosts interconnected by a bridge: A (TCP/IP and Appletalk over LLC, MAC, PHY), B (Appletalk over LLC, MAC, PHY), C (TCP, IP, MAC, PHY)]
B (an old Macintosh file server) runs only Appletalk. Only applications using the Appletalk protocols can be used (MacOS file sharing, printing). TCP/IP applications such as the web cannot be used on B. C (a modern PC) runs only TCP/IP. All TCP/IP applications can be used, but not MacOS file sharing. A (a Windows NT server) runs both in parallel. It can talk to both C and B. A bridge can be used to interconnect A, B and C; there is nothing special to do. If a router is used instead, it must run Appletalk and IP in parallel. The protocol stacks shown are all implemented in software. They use the standard Ethernet adapters.

128 Chapter 4 : IP NetBIOS 65
NetBIOS was originally developed to work only in one bridged LAN
- uses LLC-2, similar to TCP but located in layer 2 (also called NETBEUI)
- in that form, it is not "routable": it can only be bridged
[Figure: protocol stacks App / NetBIOS / LLC2 / MAC / PHY on two hosts, interconnected at layer 2 by a bridge; labels R1, R2]
NetBIOS today is offered as a TCP/IP application
- uses the NBT reserved port
- Windows machines at EPFL use TCP/IP only
NetBIOS is an interface for distributed applications which is commonly used with IBM and Microsoft systems. Originally, NetBIOS used the LLC-2 protocol, a link layer protocol which does packet retransmissions, much as TCP does. Only MAC addresses are used. In addition, NetBIOS offers a naming service. This version of NetBIOS works only in a bridged environment.

129 Chapter 4 : IP IPv6 66
- The current IP is IPv4
- The IPv4 address space is too small (32 bits): it will be exhausted some day
- IPv6 is the new version of IP: addresses are 128 bits long; draft standard
- address format:
    3b  | 45b             | 16b    | 64b
    010 | prefix by prov. | subnet | interface Id
    (first fields allocated by org / provider, remaining fields allocated by customer)
- IPv6 is incompatible with IPv4
IPv6 is primarily IP with a larger address space. However, a number of details are different; in particular, the IPv6 header is easier to process (but is also longer). Many features which were originally designed for IPv6 are now part of IPv4 (security and mobility). In 1992, many believed that IPv6 would be necessary very soon. Now it is more questionable when the transition will take place. IPv6 is incompatible with IPv4; this is to avoid IBM's SNA syndrome (a monster of complexity, because the last version is compatible with all details of all previous versions). Interworking between the two will use the dual stack approach, as shown for interworking between Appletalk and IP.

130 Chapter 4 : IP Plug and Play and DHCP 67
- An IPv6 address is allocated automatically by negotiation with routers: "stateless allocation"
- Alternatively, DHCP can be used
- DHCP can be used with IPv4 also
  - the DHCP server on the LAN has a list of IP addresses that can be allocated dynamically
  - the MAC address is used to identify a host to the DHCP server
  - renumbering is possible
  - more complex to use than IPv6 stateless allocation

131 Chapter 4 : IP Facts to Remember 68
- IP addresses are 32 bit numbers
- One IP address per interface
- A unicast IP address has a topological meaning
- Routers scale well because they can aggregate routes
- Multicast addresses are logical
- IP is connectionless
- Hosts on the Internet exchange packets with IP addresses
- Non-routable protocols use only MAC addresses


133 Chapter 5: TCP 1 Transport Layer : TCP and UDP Prof. Jean-Yves Le Boudec ICA, EPFL CH-1015 Ecublens Leboudec@epfl.ch

134 Chapter 5: TCP Objective 2
Know different concepts of elements of procedure:
- ARQ
- sliding window
- flow control
- SRP
- Go Back n
- acknowledgements
Understand how TCP works:
- connection finite state machine
- TCP algorithms
- flow control
Contents
A: Elements of Procedure
B: TCP

135 Chapter 5: TCP Part A: Elements of Procedure ARQ 3 Errors in data transmission do occur : corrupted data: noise, crosstalk, intersymbol interference lost data: buffer overflow, control information error misinserted data: control information error Automatic Repeat Request (ARQ) one method to handle data errors principle is: 1. detection of error or loss 2. retransmission

136 Chapter 5: TCP ARQ Protocols and Services 4
There is not one ARQ protocol, but a family of protocols. Examples:
- TCP (transport layer of the Internet)
- HDLC or MNP (data link layer over modems)
- LLC-2 (data link layer over LANs, used with SNA)
- Stop and Go (interface between a CPU and a peripheral)
The service offered by all ARQ protocols is ordered, loss-free, error-free delivery of packets.
ARQ protocols differ in their design point: does the lower layer preserve sequence?
- yes: HDLC, LLC-2
- no: TCP
In the next slide we start with a simple example of an ARQ protocol. Then we will study a systematic description of the concepts used by ARQ protocols in general.

137 Chapter 5: TCP Example of ARQ: Stop and Go 5
Definition: ARQ method where a packet is sent only after the previous one is acknowledged.
[Figure: time diagram with labels S, T1, L; Packet 1, Ack 1, Packet 2, Ack 2, Packet 2]
- Idle RQ, loss detection at sender by timeout
- utilization is low in general: $r = \frac{1}{1 + \omega + \beta/L}$, with $\omega = L'/L$ = ratio of overhead and $\beta = 2Db$ = bandwidth-delay product
- used on half duplex or duplex links ("Idle RQ")
Examples (assume processing time + ack transmit times are negligible, 1000 bits per packet):
              1 km     200 km    50000 km
    1 kb/s    100%     99.8%     67%
    1 Mb/s    99%      33%       0.2%
In the presence of errors (rate of lost cycles = p), see exercise.
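The table of examples can be recomputed from the utilization formula. The sketch below is only an illustration: the function name is hypothetical, and it assumes a propagation delay of 5 microseconds per km, a value that is not stated in the notes but is consistent with the figures in the table.

```c
#include <assert.h>

/* Stop and Go utilization r = 1 / (1 + omega + beta / L).
 * L     : packet length in bits
 * omega : overhead ratio (0 when ack and processing times are neglected)
 * rate  : link bit rate b, in bits per second
 * km    : one-way distance in kilometres
 * Assumption (not from the notes): propagation takes 5 microseconds per km. */
double stop_and_go_utilization(double L, double omega, double rate, double km)
{
    assert(L > 0.0 && rate > 0.0 && km >= 0.0);
    double one_way_delay = km * 5e-6;            /* D, in seconds */
    double beta = 2.0 * one_way_delay * rate;    /* bandwidth-delay product, in bits */
    return 1.0 / (1.0 + omega + beta / L);
}
```

For example, `stop_and_go_utilization(1000, 0, 1e6, 200)` corresponds to the 1 Mb/s, 200 km entry: the bandwidth-delay product is 2000 bits, so the link is used one third of the time.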

138 Chapter 5: TCP Multiple Transmissions 6
- Objective: increase the throughput of Stop and Go by allowing multiple transmissions
- Errors and delivery in order require an unlimited buffer unless flow control is exercised
- Solution: use a sliding window
[Figure: sender transmits P0, P1, P2, ..., Pn, then P0 again and Pn+1; acknowledgements A1, A2; the receive buffer holds P1, then P1 P2, ..., then P1 P2 ... Pn, then P1 P2 ... Pn+1]
The limitation of Stop and Go is its low efficiency, caused by idle periods when the bandwidth-delay product is not negligible. A solution is to allow multiple transmissions, which in turn requires a sliding window in order to avoid an unlimited buffer at the destination.

139 Chapter 5: TCP The Principle of Sliding Window (W=4)
[Figure: time diagram of packets P = 0 ... P = 10 and acknowledgements A = 0 ... A = 7. Legend: Maximum Send Window = Offered Window (= 4 here); Usable Window]
On the example, packets are numbered 0, 1, 2, ... The sliding window principle works as follows:
- a window size W is defined. On this example it is fixed. In general, it can vary based on messages sent by the receiver. The sliding window principle requires that, at any time: number of packets sent but not yet acknowledged <= W
- the maximum send window, also called offered window, is the set of packet numbers for packets that either have been sent but are not (yet) acknowledged or have not been sent but may be sent.
- the usable window is the set of packet numbers for packets that may be sent for the first time. The usable window is always contained in the maximum send window.
- the lower bound of the maximum send window is the smallest packet number that has been sent and not acknowledged
- the maximum window slides (moves to the right) if the acknowledgement for the packet with the lowest number in the window is received
A sliding window protocol is a protocol that uses the sliding window principle. With a sliding window protocol, W is the maximum number of packets that the receiver needs to buffer in the resequencing (= receive) buffer. If there are no losses, a sliding window protocol can have a throughput of 100% of the link rate (if overhead is not accounted for) if the window size satisfies W ≥ b / L, where b is the bandwidth-delay product and L the packet size. Counted in bytes, this means that the minimum window size for 100% utilization is the bandwidth-delay product.
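The rules above can be sketched as a small piece of sender-side bookkeeping. This is a minimal illustration with a fixed window size, not the code of any particular protocol; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Sender-side view of a sliding window with fixed size W. */
struct send_window {
    unsigned W;     /* window size, in packets */
    unsigned low;   /* lower bound: smallest packet number sent and not acknowledged */
    unsigned next;  /* next packet number to be sent for the first time */
};

void window_init(struct send_window *w, unsigned W)
{
    assert(W > 0);
    w->W = W;
    w->low = 0;
    w->next = 0;
}

/* Size of the usable window: packets that may be sent for the first time. */
unsigned usable_window(const struct send_window *w)
{
    return w->low + w->W - w->next;
}

/* Send the next new packet if the window allows it; returns false if blocked. */
bool send_next(struct send_window *w)
{
    if (usable_window(w) == 0)
        return false;
    w->next++;
    return true;
}

/* The packet with the lowest number in the window is acknowledged:
 * the window slides to the right by one. */
void ack_lowest(struct send_window *w)
{
    assert(w->low < w->next);
    w->low++;
}
```

With W = 4, four calls to `send_next` succeed, the fifth is refused, and one call to `ack_lowest` frees exactly one slot.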

140 Chapter 5: TCP Elements of ARQ 8
The elements of an ARQ protocol are:
- Sliding window: used by all protocols
- Loss detection: at sender versus at receiver; timeout versus gap detection
- Retransmission strategy: Selective Repeat; Go Back n; others
- Acknowledgements: cumulative versus selective; positive versus negative
All ARQ protocols we will use are based on the principle of sliding window. Loss detection can be performed by a timer at the source (see the Stop and Go example). Another method is gap detection at the destination (see the Go Back n example and TCP's fast retransmit heuristic). The retransmission strategy can be Go Back n, Selective Repeat, or any combination. Go Back n and Selective Repeat are explained in the following slides. Acknowledgements can be cumulative: acknowledging n means all data numbered up to n has been received. Selective acknowledgements mean that only a given packet, or an explicit list of packets, is acknowledged. A positive acknowledgement indicates that the data has been received. A negative acknowledgement indicates that the data has not been received; it is a request for retransmission, issued by the destination.

141 Chapter 5: TCP Selective Repeat 9
[Figure: time diagram showing, at the sender, the upper bound of the maximum send window and the retransmission buffer and, at the receiver, the resequencing buffer and the lowest expected packet number, as packets P=0 ... P=6 and acknowledgements are exchanged; P0 is lost and retransmitted after a timeout, and the receiver delivers P0 ... P3, then P4, then P5. Window size = 4; loss detection by timeout at sender; selective, positive acknowledgements]
Selective Repeat (SRP) is a family of protocols for which the only packets that are retransmitted are the packets assumed to be missing. On the example, packets are numbered 0, 1, 2, ...
At the sender:
- a copy of sent packets is kept in the send buffer until they are acknowledged
- the sequence numbers of unacknowledged packets differ by less than W (window size)
- a new packet may be sent only if the previous rule is satisfied
The picture shows a variable ("upper bound of maximum send window") which is equal to: the smallest unacknowledged packet number + W - 1. Only packets with numbers <= this variable may be sent. The variable is updated when acknowledgements are received.
At the receiver:
- received packets are stored until they can be delivered in order
- the variable "lowest expected packet number" is used to determine if received packets should be delivered or not. Packets are delivered when a packet with number equal to this variable is received. The variable is then increased to the number of the last packet delivered + 1.
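The receiver rules can be illustrated with a small resequencing buffer. The sketch below uses absolute packet numbers and a fixed window of 4, as in the example; the names are hypothetical and payload storage is omitted to keep the logic visible.

```c
#include <assert.h>
#include <stdbool.h>

#define SRP_W 4   /* window size, as in the example */

struct srp_receiver {
    unsigned lowest_expected;  /* lowest expected packet number */
    bool stored[SRP_W];        /* stored[k]: packet lowest_expected + k is buffered */
};

void srp_init(struct srp_receiver *r)
{
    r->lowest_expected = 0;
    for (unsigned k = 0; k < SRP_W; k++)
        r->stored[k] = false;
}

/* Handle reception of packet number p (absolute number).
 * Returns the number of packets delivered in order as a result. */
unsigned srp_receive(struct srp_receiver *r, unsigned p)
{
    if (p < r->lowest_expected || p >= r->lowest_expected + SRP_W)
        return 0;                              /* duplicate or outside the window */
    r->stored[p - r->lowest_expected] = true;

    unsigned delivered = 0;
    while (r->stored[0]) {                     /* head of the buffer is present */
        for (unsigned k = 0; k + 1 < SRP_W; k++)
            r->stored[k] = r->stored[k + 1];   /* shift the buffer left by one */
        r->stored[SRP_W - 1] = false;
        r->lowest_expected++;
        delivered++;
    }
    return delivered;
}
```

Feeding it P1, P2, P3 and then the retransmitted P0 reproduces the behaviour of the picture: nothing is delivered until P0 arrives, then P0 to P3 are delivered at once.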

142 Chapter 5: TCP SRP: packet numbering 10
- packet numbering modulo N: packets are numbered 0, 1, 2, ..., N-1, 0, 1, 2, ...
- packet numbering modulo N requires: N ≥ 2W
- with numbering modulo 128, the maximum window size is 64
[Figure: two time diagrams with sequence numbers P=0 ... P=6, P=0 and acknowledgements A=0 ... A=3; in the second one P0 is sent again after a timeout. Packet numbering modulo 7 with window size = 4; find the bug (after Walrand)]
You can skip this slide at first reading. Proof that N = 2W is enough (written for W = 4). Consider that at time t, the receiver has just received packet number n (absolute numbers) and delivered it. (1) The last packet transmitted by the sender at time t has a number <= n+4 (indeed, the receiver has not acked packet n+1, so this follows from the window rule at the sender). (2) The sender has certainly received all acks until n-4, since it sent n. Therefore, the receiver can receive only packets n-3 to n+4, at most 8 values. For example, if n = 570, W = 4 and N = 8: only packets 567, 568, 569, 570 (old packets) and 571, 572, 573, 574 (new) can be received, corresponding to sequence numbers 7, 0, 1, 2 (old) and 3, 4, 5, 6 (new). The same holds for acknowledgement numbers.
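The condition N ≥ 2W can be checked by brute force. The sketch below relies on the argument of the proof, namely that after delivering packet n the receiver may still receive the 2W absolute numbers n-W+1 to n+W; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if, with window size W and numbering modulo N, two different
 * packets that the receiver may still receive share the same sequence number.
 * The candidates are the absolute numbers n-W+1 .. n+W (2W values). */
bool numbering_is_ambiguous(unsigned N, unsigned W)
{
    assert(N > 0 && W > 0);
    unsigned n = 10 * N + W;   /* arbitrary absolute number, large enough that n-W+1 >= 0 */
    for (unsigned a = n - W + 1; a <= n + W; a++)
        for (unsigned b = a + 1; b <= n + W; b++)
            if (a % N == b % N)
                return true;   /* an old and a new packet cannot be told apart */
    return false;
}
```

With W = 4, numbering modulo 7 is ambiguous (this is the bug in the figure), whereas numbering modulo 8 is not.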

143 Chapter 5: TCP SRP efficiency 11
In the absence of errors, SRP throughput is controlled by the window:
$\text{max throughput} = \min\left(c, \frac{WL}{\tau}\right), \quad \tau = d + \frac{L}{c}$
with c = bit rate, d = 2-way propagation delay, L = packet size, W = window size in packets.
The window is limiting if and only if (W-1)L <= b.
In the presence of errors, SRP is very efficient since only lost PDUs are retransmitted. We study only two special cases: (1) infinite window or (2) rare losses.
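The condition on the window follows directly from the throughput formula. The short derivation below assumes that b denotes the bandwidth-delay product c d, which the slide does not state explicitly.

```latex
\begin{align*}
\frac{WL}{\tau} \le c
  &\iff WL \le c\left(d + \frac{L}{c}\right) = cd + L \\
  &\iff (W-1)\,L \le cd = b .
\end{align*}
```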

144 Chapter 5: TCP SRP Efficiency (2) 12
Special case 1: infinite window size
- call r the proportion of (packet, ack) which is lost
- assume there is one ack per packet
- transmission time is constant
- every packet is eventually acknowledged
Then, with SRP, the utilization of a channel with errors is given by $U_1 = 1 - r$.
Proof: Every lost packet or lost ack causes a retransmission when it is detected. The source always has a packet to transmit. Call N1 the number of fresh packets over a long time interval and N2 the number of retransmissions. We have N2 = (N1 + N2) * r. Over a long time period, the utilization is N1 / (N1 + N2).
This is an important information theoretic result. It shows that, at least if you can afford infinite buffers and unlimited waiting times, the capacity of a channel with bit rate C which loses a fraction r of packets is exactly C (1-r).
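The last step of the proof can be written out explicitly, using only the two relations given in the proof:

```latex
\begin{align*}
N_2 = r\,(N_1 + N_2)
  &\implies N_2 = \frac{r}{1-r}\,N_1 , \\
U_1 = \frac{N_1}{N_1 + N_2}
  &= \frac{N_1}{N_1 + \frac{r}{1-r}\,N_1}
   = \frac{1-r}{(1-r) + r}
   = 1 - r .
\end{align*}
```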

145 Chapter 5: TCP Special Case 2: rare losses, optimal window size and optimal timeout 13
We assume
- optimal window: W = c τ / L
- optimal timeout: T1 = WL/c = τ
- rare losses: whenever a packet or an ack is lost, there will be no loss (ack or packet) in the next period of 2τ seconds
- call r the proportion of (packet, ack) which is lost
- assume there is one ack per packet
- transmission time is constant
- every packet is eventually acknowledged
Then, with SRP, the utilization of a channel with errors is given by
$U_2 = \frac{1}{1 + \frac{r}{1-r}\,W} = 1 - rW + o(r)$
You can skip this and the next two slides at first reading. Proof: call LS the lower side of the maximum send window, namely the last acknowledged packet number plus one. Consider first the following example, with W=3 packets, and assume that a loss occurs for packet 1. We see that the loss of a single ack or packet causes a waste of time equal to τ. The same is true if the loss occurs, for example, for packet 10.

146 Chapter 5: TCP Proof 14
[Figure: packet number versus time t, showing LS just after receiving each ack]
Lemma: at every time t, except maybe during a period of 2τ starting with a loss, the following is true: (P1) (a) At time t, a packet has just been sent for the first time. (b) Call n its number, and call LS the lower bound of the maximum send window just after receiving any ack that might arrive at time t. All packets with numbers in {LS, LS+1, ..., n-1} have already been sent and will be acknowledged without loss.
For example, you can see on the above example that P1 is true at all times except times in [2, 6] and [14, 18]. The proof is by recursion over the number of loss occurrences, by inspecting all possible cases. We see on the figure that the effect of the loss event disappears after a time interval of 2τ. We can now apply the above. Call N2 the number of losses and N1 the number of first transmissions over a long interval of time. We have N2 = r (N1 + N2) and the utilization is N1 / (N1 + N2 τ), with τ counted in packet transmission times (τ = W for the optimal window).

147 Chapter 5: TCP SRP Efficiency (3) 15
For small loss probabilities, case 2 is a bad case; therefore the utilization given for case 2 should be a slightly pessimistic estimate for rare losses. However, it is not the worst case. Case 1 is an upper bound on the utilization.
Example:
- loss ratio = 1%, W = 10: U1 = 99%, U2 = 90%
- loss ratio = 5%, W = 10: U1 = 95%, U2 = 65%
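The figures of the example can be recomputed as follows. The sketch assumes that the case 2 utilization reads U2 = 1 / (1 + W r / (1 - r)), which is the form consistent with the proof and with the numbers on this slide; the function names are hypothetical.

```c
#include <assert.h>

/* Case 1: infinite window, U1 = 1 - r. */
double srp_u1(double r)
{
    assert(r >= 0.0 && r < 1.0);
    return 1.0 - r;
}

/* Case 2: rare losses, optimal window W and optimal timeout.
 * Assumed form: U2 = 1 / (1 + W * r / (1 - r)). */
double srp_u2(double r, double W)
{
    assert(r >= 0.0 && r < 1.0 && W > 0.0);
    return 1.0 / (1.0 + W * r / (1.0 - r));
}
```

For r = 1% and W = 10 this gives U1 = 0.99 and U2 of about 0.91; for r = 5% it gives U1 = 0.95 and U2 of about 0.66, in line with the rounded values of the slide.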

148 Chapter 5: TCP Go Back n 16
[Figure: time diagram showing, at the sender, V(A) (lowest unacknowledged packet number), V(S) (next sequence number for sending) and the retransmission buffer and, at the receiver, V(R) (next expected packet number), as packets P=0, P=1, ... and acknowledgements A=0, A=1, ... are exchanged; some packets are discarded by the receiver. Example: window size = 4; loss detection by timeout at sender; positive, cumulative acknowledgements]
Go Back N (GBN) is a family of sliding window protocols that is simpler to implement than SRP, and possibly requires less buffering at the receiver. The principle of GBN is:
- if packet numbered n is lost, then all packets starting from n are retransmitted;
- packets out of order need not be accepted by the receiver.
On the example, packets are numbered 0, 1, 2, ...
At the sender:
- a copy of sent packets is kept in the send buffer until they are acknowledged
- (R1) the sequence numbers of unacknowledged packets differ by less than W (window size)
The picture shows two variables: V(S) ("Next Sequence Number for Sending"), which is the number of the next packet that may be sent, and V(A) ("Lowest Unacknowledged Number"), which is equal to the number of the last acknowledged packet + 1. A packet may be sent only if: (1) its number is equal to V(S) and (2) V(S) <= V(A) + W - 1. The latter condition is the translation of rule R1. V(S) is incremented after every packet transmission. It is set (decreased) to V(A) whenever a retransmission request is activated (here, by timeout at the sender). V(A) is increased whenever an acknowledgement is received. Acknowledgements are cumulative: when acknowledgement number n is received, this means that all packets until n are acknowledged. Acknowledged packets are removed from the retransmission buffer.
at any point in time we have: V(A) <= V(S) <= V(A) + W
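The sender rules for V(S) and V(A) can be sketched as follows. This is an illustrative fragment with absolute packet numbers, not the code of a real protocol; the names are hypothetical, and raising V(S) when a late acknowledgement overtakes it is an assumption made here to preserve the invariant V(A) <= V(S) <= V(A) + W.

```c
#include <assert.h>
#include <stdbool.h>

struct gbn_sender {
    unsigned W;    /* window size */
    unsigned VA;   /* V(A): lowest unacknowledged packet number */
    unsigned VS;   /* V(S): next sequence number for sending */
};

/* Rule R1: a packet may be sent only if V(S) <= V(A) + W - 1. */
bool gbn_can_send(const struct gbn_sender *s)
{
    return s->VS <= s->VA + s->W - 1;
}

/* Send packet number V(S); V(S) is incremented after every transmission. */
unsigned gbn_send(struct gbn_sender *s)
{
    assert(gbn_can_send(s));
    return s->VS++;
}

/* Cumulative acknowledgement n: all packets until n are acknowledged. */
void gbn_ack(struct gbn_sender *s, unsigned n)
{
    if (n + 1 > s->VA)
        s->VA = n + 1;
    if (s->VS < s->VA)     /* assumption: never resend what is already acknowledged */
        s->VS = s->VA;
}

/* Timeout at the sender: go back to the lowest unacknowledged packet. */
void gbn_timeout(struct gbn_sender *s)
{
    s->VS = s->VA;
}
```

With W = 4, packets 0 to 3 can be sent in a row; after acknowledgement 1 and a timeout, the next packet sent is number 2.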

149 Chapter 5: TCP Go Back N: storing out of sequence packets or not 17
Implementations of Go Back N may or may not keep out of sequence packets.
- On a channel which preserves packet sequence (HDLC, LLC-2): an out-of-sequence packet can be interpreted as a loss, and it is reasonable not to save it, since the Go Back n principle will cause it to be retransmitted.
- On a channel which does not preserve packet sequence (TCP): out-of-sequence packets may simply be due to packet misordering, and it is reasonable to save them, hoping that packets with lower numbers are simply delayed (and not lost).
At the receiver: the variable V(R) ("lowest expected packet number") is used to determine if a received packet should be delivered or not. A received packet is accepted, and immediately delivered, if and only if its number is equal to V(R). On this example, the receiver rejects packets that are not received in sequence. V(R) is incremented for every packet received in sequence. Packets received in sequence are acknowledged, either immediately or in a grouped (cumulative) acknowledgement. When the receiver sends acknowledgement n, this always means that all packets until n have been received in sequence and delivered. The picture shows acknowledgements that are lost, and some that are delayed. This occurs for example if there are intermediate systems in the data path that are congested (buffers full, or almost full). Such intermediate systems could be IP routers (the protocol would then be TCP) or bridges (the protocol would be LLC-2).
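The receiver rule for the variant that does not store out-of-sequence packets is very short. The sketch below only shows that variant, with hypothetical names and absolute packet numbers.

```c
#include <assert.h>
#include <stdbool.h>

struct gbn_receiver {
    unsigned VR;   /* V(R): lowest expected packet number */
};

/* A received packet is accepted and delivered if and only if its number
 * equals V(R); any other packet is discarded. Returns true if delivered. */
bool gbn_receive(struct gbn_receiver *r, unsigned p)
{
    assert(r != 0);
    if (p != r->VR)
        return false;   /* out of sequence: discard, the sender will go back */
    r->VR++;            /* in sequence: deliver and expect the next one */
    return true;
}
```

After a delivery, the cumulative acknowledgement that the receiver may send is V(R) - 1, meaning that all packets until that number have been received in sequence.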

150 Chapter 5: TCP Go Back N (continued 2) 18
Principle of Go Back N
- if packet n is lost, all packets from n are retransmitted
- out of order packets need not be accepted by the receiver
- less efficient than SRP, but good efficiency with lower complexity
- cumulative acknowledgements
On a sequence preserving channel (ex: layer 2):
- packet numbering is usually done modulo N (typically 8 or 128): packets are numbered 0, 1, 2, ..., N-1, 0, 1, 2, ...
- (R2) packet numbering modulo N requires: N >= W + 1
- with numbering modulo 128, the maximum window size is 127
With TCP (non-sequence preserving channel: IP):
- bytes are numbered modulo 2^32
- the maximum window size is set to 2^16 (except with the window scale option)
- byte numbers can be considered absolute numbers (except with the window scale option, where the window size is ca. 2^30 and a time stamp is added)
You can skip this slide at first reading. Proof of R2: Lemma 1. At time t1, if the resequencing buffer at the receiver is not empty, then the absolute numbers x for packets that can be received at any time satisfy: u-W+1 <= x <= u+W-2, where u is the largest absolute packet number in the resequencing buffer. Proof: Call s1 the time at which the packet labelled u was sent. (1) LW(s1) >= u-W+1, where LW is the lower bound of the send window (at the sender!), by definition of the send window. Now there exists a packet labelled y, with y <= u-1, that has not been acknowledged by the receiver, since otherwise packet u would have been delivered. The acknowledgement for packet y has thus certainly not been received at time t1, hence: (2) UW(t1) <= u-1+W-1, where UW is the upper bound of the send window. Now UW and LW are non-decreasing with time (the window slides to the right); it follows from (1) and (2) that for all s1 <= s <= t1: u-W+1 <= LW(s) <= UW(s) <= u+W-2. Since the channel preserves sequence, packets arriving at time t1 have been sent after s1 (and before t1); this ends the proof of Lemma 1.
Lemma 2: At time t1, if the resequencing buffer is empty, then the absolute numbers x for packets that can be received at any time satisfy: v-W <= x <= v+W-1, where v is the lower bound of the receive window. Proof: If no packet was ever delivered in sequence, then no packet has been received at all (the resequencing buffer is assumed to be empty), so no packet has been acknowledged, and thus only packets 0 to W-1 can be received, which proves the formula in that case since v=0. Otherwise, call t2 the last time at which a packet was delivered in sequence by the receiving side; packet number v-1 was received and acknowledged at a time t3, with t3 < t2 < t1. It was sent at a time s3, and therefore LW(s3) >= v-1-W+1. Now packet v has not been acknowledged by the destination at time t1, therefore UW(t1) <= v+W-1. Thus v-W <= LW(s) <= v+W-1 for all s: s3 <= s <= t1.

151 Chapter 5: TCP Go Back N with Negative Acks 19
[Figure: time diagram showing V(A), V(S) and the retransmission buffer at the sender and V(R) at the receiver, as packets P=0 ... P=4 are sent; the receiver answers out-of-sequence packets with NACK, A=0 and discards them until P=1 is retransmitted, then delivers P1 and P2. Example: window size = 4; loss detection by gap detection at receiver; negative acknowledgements]
Negative acknowledgements are an alternative to timeouts at the sender for triggering retransmissions. Negative acknowledgements are generated by the receiver, upon reception of an out-of-sequence packet. NACK, A=n means: all packets until n have been received in sequence and delivered, and packets after n should be retransmitted. Negative acknowledgements are implemented mainly on sequence preserving channels (for example, with HDLC, LLC-2, etc.). They are not defined with TCP. Negative acknowledgements increase the retransmission efficiency since losses can be detected earlier. Since NACKs may be lost, it is still necessary to implement timeouts, typically at the sender. However, the timeout value may be set with much less accuracy, since it is used only in those cases where both the packet and the NACK are lost. In contrast, if there are no NACKs, the timeout value has to be only slightly larger than the maximum response time (round trip plus processing, plus ack withholding times).

152 Chapter 5: TCP ARQ Protocols 20
HDLC protocols (used on modems)
- SRP with selective acknowledgements
- Go Back N on old equipment
LLC-2
- old, used with SNA over bridged networks
- Go Back n with negative acknowledgements
TCP
- a hybrid
- originally designed as Go Back n with cumulative acks
- today, modern implementations have both selective acks and cumulative acks
Example of timeout for loss detection at receiver: in NETBLT, the sender tells in advance which PDUs it will send. SSCOP is a protocol used in signalling networks for telephony: periodic poll by the sender with the list of sent PDUs; response (solicited stat) by the receiver with missing PDUs + credit.

153 Chapter 5: TCP Error Correction Alternatives 21 ARQ suited for traditional point to point data variable delay due to retransmissions correct data is guaranteed Alternative 1: forward error correction principle add redundancy (Reed Solomon codes) and use it to recover errors and losses for real time traffic (voice, video, circuit emulation) or on data on links with very large delays (satellites) Alternative 2 : use of parities code n blocks of data into k parities (RS codes) any n out of n+k blocks can be used to recover the data if some data loss is detected, then send one or several parities used for multicast
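The parity idea of Alternative 2 can be illustrated with its simplest instance: a single parity block (k = 1) computed as the byte-wise XOR of the n data blocks. This is a deliberately simpler technique than the Reed Solomon codes mentioned on the slide, and it can only repair one lost block; the function names and the flat buffer layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* data holds n blocks of len bytes each, stored one after the other.
 * Computes one parity block as the byte-wise XOR of the n data blocks. */
void xor_parity(const unsigned char *data, size_t n, size_t len,
                unsigned char *parity)
{
    assert(n > 0 && len > 0);
    for (size_t j = 0; j < len; j++) {
        unsigned char p = 0;
        for (size_t i = 0; i < n; i++)
            p ^= data[i * len + j];
        parity[j] = p;
    }
}

/* Rebuilds the block at index `missing` from the n-1 other data blocks
 * and the parity block: any n out of the n+1 blocks are sufficient. */
void xor_recover(const unsigned char *data, size_t n, size_t len,
                 size_t missing, const unsigned char *parity,
                 unsigned char *out)
{
    assert(missing < n);
    for (size_t j = 0; j < len; j++) {
        unsigned char p = parity[j];
        for (size_t i = 0; i < n; i++)
            if (i != missing)
                p ^= data[i * len + j];
        out[j] = p;
    }
}
```

In a multicast setting, the source would send the parity block only when some receiver reports a loss; each receiver can then repair a different missing block with the same parity.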

154 Chapter 5: TCP Flow Control 22 Purpose: prevent buffer overflow at receiver receiver not ready (software not ready) many senders to same receiver (focussed overload on receiver) receiver slower than sender Solutions: Backpressure, Sliding Window, Credit Flow Control Congestion control (inside the network)

155 Chapter 5: TCP Backpressure Flow Control 23
- Stop and Go principle
- implemented in: X-ON / X-OFF; RTS / CTS; ATM backpressure (Go / No Go) (IBM ATM LAN); Pause packets in 802.3x
- can be combined with ARQ acknowledgement schemes or not
- receiver requires storage for the maximum round trip per sender
- requires a low bandwidth-delay product
[Figure: the sender transmits P=0 ... P=7; the receiver sends STOP, STOP, then GO]

156 Chapter 5: TCP Sliding Window Flow Control 24
- Number of packets sent but unacknowledged <= W
- Included in SRP and Go Back N protocols, assuming acknowledgements are sent when the receive buffer is freed for packets received in order
- Receiver requires storage for at most W packets per sender
Sliding window protocols have an automatic flow control effect: the source can send new data only if it has received enough acknowledgements. Thus, a destination can slow down a source by delaying, or withholding, acknowledgements. This can be used in simple devices; in complex systems (for example: a computer) this does not solve the problem of a destination where the data are not consumed by the application (because the application is slow). This is why such environments usually implement another form, called credit based flow control.
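The window rule "packets sent but unacknowledged <= W" can be sketched as follows. The struct and function names are hypothetical, packets are counted rather than bytes, and acknowledgements are assumed to be cumulative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sender-side state, counted in packets. */
struct sw_sender {
    uint32_t base;    /* oldest unacknowledged packet number */
    uint32_t next;    /* next packet number to send */
    uint32_t window;  /* W: maximum number of packets sent but unacknowledged */
};

/* True if one more packet may be sent without exceeding W. */
bool sw_can_send(const struct sw_sender *s)
{
    assert(s != NULL);
    return s->next - s->base < s->window;
}

/* Record that packet number s->next has just been sent. */
void sw_on_send(struct sw_sender *s)
{
    assert(s != NULL && sw_can_send(s));
    s->next++;
}

/* Cumulative ack: all packets up to and including n are acknowledged. */
void sw_on_ack(struct sw_sender *s, uint32_t n)
{
    uint32_t advance;

    assert(s != NULL);
    advance = n + 1 - s->base;          /* unsigned arithmetic handles wrap-around */
    if (advance <= s->next - s->base)   /* ignore old acks and acks for unsent data */
        s->base = n + 1;
}
```

With W = 2, the sender may send two packets, is then blocked, and is unblocked as soon as the first acknowledgement arrives; a receiver that withholds acknowledgements therefore slows the sender down.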

157 Chapter 5: TCP Credit Based Flow Control
[Figure: packets P = 0 to P = 7 from the sender; acknowledgements with credits from the receiver, starting with A = -1, credit = 2. The maximum send window has a red border, the usable window is a pink box.]
With a credit scheme, the receiver informs the sender about how much data it is willing to receive (and has buffer for). Credits may be the basis for a stand-alone protocol (Gigaswitch protocol of DEC, similar in objective to ATM backpressure) or, as shown here, be part of an ARQ protocol. Credits are used by TCP, under the name of window advertisement. Credit schemes allow a receiver to share buffer between several connections, and also to send acknowledgements before packets are consumed by the receiving upper layer (packets received in sequence may be ready to be delivered, but the application program may take some time to actually read them).
The picture shows the maximum send window (called offered window in TCP) (red border) and the usable window (pink box). On the picture, like with TCP, credits (= window advertisements) are sent together with acknowledgements. The acknowledgements on the picture are cumulative.
Credits are used to move the right edge of the maximum send window. (Remember that acknowledgements are used to move the left edge of the maximum send window.)
By acknowledging all packets up to number n and sending a credit of k, the receiver commits to have enough buffer to receive all packets from n+1 to n+k. In principle, the receiver (who sends acks and credits) should make sure that n+k is non-decreasing, namely, that the right edge of the maximum send window does not move to the left (because packets may have been sent already by the time the sender receives the credit).
A sender is blocked from sending if it receives credit = 0, or more generally, if the received credit is equal to the number of unacknowledged packets. By the rule above, the received credits should never be less than the number of unacknowledged packets.
With TCP, a sender may always send one byte of data even if there is no credit (window probe, triggered by the persist timer) and test the receiver's advertised window, in order to avoid deadlocks (lost credits).
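The relation between acknowledgement, credit and usable window can be written down directly. The sketch below uses packet numbers as on the slide; the names are hypothetical, and ignoring an (A, credit) pair that would move an edge of the window to the left is a design choice of this sketch, not something the notes prescribe for the sender.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sender view, in packet numbers as on the slide. */
struct credit_sender {
    long acked;      /* A: all packets up to this number are acknowledged (-1 = none) */
    long credit;     /* k: last credit received */
    long last_sent;  /* highest packet number sent so far (-1 = none) */
};

/* Number of new packets that may be sent now: packets last_sent+1 .. A+k. */
long usable_window(const struct credit_sender *s)
{
    long right_edge;

    assert(s != NULL && s->last_sent >= s->acked);
    right_edge = s->acked + s->credit;
    return right_edge > s->last_sent ? right_edge - s->last_sent : 0;
}

/* Apply a received (A, credit) pair unless it would move an edge to the left. */
void on_ack_credit(struct credit_sender *s, long a, long k)
{
    assert(s != NULL);
    if (a < s->acked || a + k < s->acked + s->credit)
        return;
    s->acked = a;
    s->credit = k;
}
```

Starting from A = -1, credit = 2, the sender may send packets 0 and 1 and is then blocked; A = 0, credit = 2 opens room for one more packet, and A = 0, credit = 4 for three.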

158 Chapter 5: TCP Credit Based Flow Control: Receive Buffer
[Figure: same flows as on the previous picture, packets P = 0 to P = 7, with the state of the receive buffer next to each acknowledgement: A = -1, credit = 2; A = 0, credit = 2; A = 0, credit = 4; A = 2, credit = 4; A = 4, credit = 2; A = 6, credit = 0. Legend: free buffer, or unacked data; data acked but not yet read.]
The figure shows the relation between buffer occupancy and the credits sent to the source. It is an ideal representation. Typical TCP implementations differ because of misunderstandings by the implementers.
The picture shows how credits are triggered by the status of the receive buffer. The flows are the same as on the previous picture. The receiver has a buffer space of 4 data packets (assumed here to be of constant size for simplicity). Data packets may be stored in the buffer either because they are received out of sequence (not shown here; some ARQ protocols such as LLC-2 simply reject packets received out of sequence), or because the receiving application, or upper layer, has not yet read them.
The receiver sends window updates (= credits) in every acknowledgement. The credit is equal to the available buffer space.
Loss conditions are not shown on the picture. If losses occur, there may be packets stored in the receive buffer that cannot be read by the application (received out of sequence). In all cases, the credit sent to the source is equal to the buffer size, minus the number of packets that have been received in sequence. This is because the sender is expected to move its window based only on the smallest ack number received. See also exercises.

159 Chapter 5: TCP Connection Control 29
- Reliable communication with ARQ must be connection oriented
  - sequence numbers must be synchronized
  - thus, reliable transport protocols always have three phases: setup, data transfer, release
- Connection Control = connection setup and connection release
- see the TCP part for the connection control states of TCP

160 Chapter 5: TCP Elements of Procedure 30
- "Elements of procedure" usually means the set of following functions
  - ARQ
  - Flow Control
  - Connection Control
- Elements of Procedure transform a raw, unreliable frame transport mechanism into a reliable data pipe
  - ordered delivery of packets
  - loss-free as long as the connection is alive
  - no buffer overflow
- EoP exist in
  - reliable transport (TCP for the TCP/IP architecture) (layer 4)
  - also: in reliable data link (ex: over radio channels, over modems) (layer 2), called HDLC and variants (LLC2, SDLC, LAPB, LAPD, LAPM); the connection here is called data link
Applications that use UDP must implement some form of equivalent control, if the application cares about not losing data. UDP applications (like NFS, DNS) typically send data blocks and expect a response, which can be used as an application layer acknowledgement.

161 Chapter 5: TCP Real Time Transport Protocol (RTP) 31
- Streaming multimedia applications do not use TCP
  - repetition of lost packets is not adequate
  - UDP is used
- RTP
  - uses UDP
  - defines format of additional information required by the application (sequence number, time stamps)
  - uses a special set of messages (RTCP) to exchange periodic reports
- see [STA]

162 Chapter 5: TCP B : TCP : Transmission Control Protocol 32
- Provides a reliable transport service
  - first a connection is opened between two hosts
  - then TCP guarantees that all data is delivered in sequence and without loss, unless the connection is broken
  - at the end the connection is closed
- Uses port numbers like UDP
  - ex: TCP port 53 is also used for DNS
  - TCP connection is identified by: srce IP addr, srce port, dest IP addr, dest port
- TCP does not work with multicast IP addresses, UDP does
- TCP uses connections, UDP is connectionless

163 Chapter 5: TCP TCP is an ARQ protocol 33
- Basic operation:
  - sliding window
  - loss detection by timeout at sender
  - retransmission is a hybrid of go back n and selective repeat
  - acknowledgements are cumulative
- Supplementary elements
  - fast retransmit
  - selective acknowledgements
- Flow control is by credit
- Congestion control
TCP also implements congestion control functions, which have nothing to do with ARQ (see later). Do not confuse flow control and congestion control.

164 Chapter 5: TCP Segments and Bytes 34
[Figure: IP packet with prot=tcp; IP hdr followed by IP data = TCP segment = TCP hdr + TCP data]
- TCP views data as a stream of bytes
  - bytes put in packets called TCP segments
  - bytes accumulated in buffer until sending TCP decides to create a segment
- MSS = maximum segment size (maximum data part size)
  - "B sends MSS = 236" means that segments, without header, sent to B should not exceed 236 bytes
  - 536 bytes by default (576 bytes IP packet)
- sequence numbers based on byte counts, not packet counts
- TCP builds segments independent of how application data is broken
  - unlike UDP
- TCP segments never fragmented at source
  - possibly at intermediate points with IPv4
  - where are fragments re-assembled?

165 Chapter 5: TCP TCP Basic Operation
[Figure: exchange of segments between A and B, lines 1 to 10. A starts with 8001:8501(500) ack 101 win 6000; B delivers bytes as they arrive in sequence; a timeout at line 8 resets the timers for packets 4, 5, 6 and causes the retransmission of 8501:9001(500).]
TCP uses a sliding window protocol with automatic repetition of lost packets (in other words, TCP is an ARQ protocol). The picture shows a sample exchange of messages. Every packet carries the sequence number for the bytes in the packet; in the reverse direction, packets contain the acknowledgements for the bytes already received in sequence. The connection is bidirectional, with acknowledgements and sequence numbers for each direction. Acknowledgements are not sent in separate packets ("piggybacking"), but are in the TCP header. Every segment thus contains a sequence number (for itself), plus an ack number (for the reverse direction). The following notation is used:
firstbyte : lastbyte+1 ( segmentdatalength ) ack acknumber+1 win offeredwindowsize
Note the +1 with ack and lastbyte numbers.
- At line 8, a retransmission timer expires, causing the retransmission of data starting with byte number 8501 (Go Back n principle). Note however that after segment 9 is received, transmission continues with new data instead of repeating the later segments. This is because the receiver stores segments received out of order.
- The window field (win) gives to the sender the size of the window. Only byte numbers that are in the window may be sent. This makes sure the destination is not flooded with data it cannot handle.
- Note that numbers on the figure are rounded for simplicity. Real examples use non-round numbers between 0 and 2^32-1. The initial sequence number is not 0, but is chosen at random using a 4 µsec clock.
The figure shows the implementation of TCP known as TCP Reno, which is the basis for current implementations.
An earlier implementation ("TCP Tahoe") did not reset the pending timers after a timeout; thus, it implemented a true Go Back n protocol; the drawback was that packets were retransmitted unnecessarily, which contributed to congestion collapses.
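As a small illustration of the notation and of sequence numbers that live between 0 and 2^32-1, the sketch below computes the "lastbyte+1" value of a segment, compares two sequence numbers across a wrap-around, and prints a segment in the notation used above. The function names are hypothetical; the comparison relies on the usual two's complement conversion, which is an assumption about the platform.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Sequence number of the byte that follows the segment ("lastbyte+1"); wraps modulo 2^32. */
uint32_t seq_end(uint32_t first_byte, uint32_t length)
{
    return first_byte + length;
}

/* Non-zero if a comes before b in sequence space, allowing for wrap-around
   (assumes the two numbers are less than 2^31 apart). */
int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Write "first:last+1(len) ack N win W" into buf; returns what snprintf returns. */
int format_segment(char *buf, size_t n, uint32_t first, uint32_t len,
                   uint32_t ack, uint16_t win)
{
    assert(buf != NULL && n > 0);
    return snprintf(buf, n, "%lu:%lu(%lu) ack %lu win %u",
                    (unsigned long)first, (unsigned long)seq_end(first, len),
                    (unsigned long)len, (unsigned long)ack, (unsigned)win);
}
```

For the first segment of the figure, `format_segment(buf, sizeof buf, 8001, 500, 101, 6000)` produces `8001:8501(500) ack 101 win 6000`.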

166 Chapter 5: TCP Fast Retransmit 36
- Issue: retransmission timeout in practice often very large
  - earlier retransmission would increase efficiency
  - add SRP behaviour
- Fast retransmit heuristics
  - if 3 duplicate acks for the same bytes are received before retransmission timeout, then retransmit
  - implemented in all modern versions of TCP; is an IETF standard
[Figure: sender transmits P1 to P7; receiver returns A1, A2, A2, A2, A2; the sender retransmits P3; the last ack is shown as A?]
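The heuristic amounts to counting repeated acknowledgement numbers. A minimal sketch follows, with hypothetical names; real implementations apply further conditions before counting an ack as a duplicate (for example that it carries no data), which are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DUP_ACK_THRESHOLD 3

/* Hypothetical per-connection state at the sender. */
struct dupack_state {
    uint32_t last_ack;  /* most recent cumulative ack number seen */
    int dup_count;      /* how many times it has been repeated since */
};

/* Feed one received ack number. Returns true exactly when the third
   duplicate arrives, i.e. when the data starting at last_ack should be
   retransmitted without waiting for the retransmission timer. */
bool on_ack(struct dupack_state *s, uint32_t ack)
{
    assert(s != NULL);
    if (ack == s->last_ack) {
        s->dup_count++;
        return s->dup_count == DUP_ACK_THRESHOLD;
    }
    s->last_ack = ack;   /* new data acknowledged: start counting again */
    s->dup_count = 0;
    return false;
}
```

With the acks of the figure (A1, A2, A2, A2, A2), the function returns true on the fourth A2, which is the third duplicate.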

167 Chapter 5: TCP Selective Acknowledgements 37
- Newest TCP versions implement selective acknowledgements ("TCP-SACK")
  - up to 3 SACK blocks are in TCP option, on the return path
  - a SACK block is a positive ack for an interval of bytes
  - first block is most recently received
  - allows a source to detect a loss by gap in received acknowledgement
- TCP-SACK (Fall and Floyd, 1996):
  - when a loss is detected at source by means of SACK, fast retransmit is entered
  - when all gaps are repaired, fast retransmit is exited
For the (complex) details, see:
- Kevin Fall and Sally Floyd, Simulation-based Comparisons of Tahoe, Reno, and SACK TCP, IEEE ToN, 1996
- Matthew Mathis and Jamshid Mahdavi, Forward Acknowledgement: Refining TCP Congestion Control, SIGCOMM, Aug. 1996

168 Chapter 5: TCP TCP Connection Phases 38
[Figure: the three phases Connection Setup, Data Transfer, Connection Release between an active side and a passive side.
- states on the active side: application active open, syn_sent, established, active close, fin_wait_1, fin_wait_2, time_wait
- states on the passive side: passive open, listen, syn_rcvd, established, close_wait, application close, last_ack, closed
- setup segments: SYN, seq=x; SYN seq=y, ack=x+1; ack=y+1
- release segments: FIN, seq=u; ack=u+1; FIN seq=v; ack=v+1]
Before data transfer takes place, the TCP connection is opened using SYN packets. The effect is to synchronize the counters on both sides. The initial sequence number is a random number.
The connection can be closed in a number of ways. The picture shows a graceful release where both sides of the connection are closed in turn.
Remember that TCP connections involve only two hosts; routers in between are not involved.

169 Chapter 5: TCP TCP Finite State Machine 39
[Figure: TCP finite state machine with numbered transitions between the states CLOSED, LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT, CLOSE_WAIT and LAST_ACK. Footnote (1): if previous state was LISTEN.]
If the application issues a half-close (ex: shutdown(1)) then data can be received in states FIN_WAIT_1 and FIN_WAIT_2.
TIME-WAIT represents waiting for enough time to pass to be sure the remote TCP received the acknowledgment of its connection termination request (RFC 793). The connection stays in that state for a time of 2*MSL, where MSL = maximum segment lifetime (typically 2*2 min). This also has the effect that the connection cannot be reused during that time.
Entering the FIN_WAIT_2 state on a full close (not on a half-close) causes the FIN_WAIT_2 timer to be set (ex: to 10 min). If it expires, then it is set again (ex: 75 sec) and if it expires again, then the connection is closed. This is to avoid connections staying in the half-close state for ever if the remote end disconnected.
Transitions due to RESET segments, except the 2nd case, are not shown on the diagram.
There is a maximum number of retransmissions allowed for any segment. After R1 retransmissions, reachability tests should be performed by the IP layer. After unsuccessful transmission lasting for at least R2 seconds, the connection is aborted. Typically, R1 is 3; R2 can be set by the application and is typically a few minutes. Transitions due to those timeouts are not shown.
The values are usually set differently for a SYN packet. With BSD TCP, if the connection setup does not succeed after 75 sec (= connectionestablishmenttimer), then the connection is aborted.
The diagram does not show looping transitions; for example, from the TIME-WAIT state, reception of a FIN packet causes an ACK to be sent and a loop into the TIME-WAIT state itself.
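The main transitions of the state machine can be captured in a single function. The sketch below follows the state diagram of RFC 793 for connection setup and graceful release only; RESET segments, retransmission timeouts and the combined FIN+ACK case are left out, and the event names are made up for the illustration. The segments to be sent are only indicated in comments.

```c
#include <assert.h>

enum tcp_state { CLOSED, LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED,
                 FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT,
                 CLOSE_WAIT, LAST_ACK, TCP_STATE_COUNT };

/* Hypothetical event names: application calls, received segments, timer. */
enum tcp_event { EV_ACTIVE_OPEN, EV_PASSIVE_OPEN, EV_CLOSE,
                 EV_RCV_SYN, EV_RCV_SYN_ACK, EV_RCV_ACK, EV_RCV_FIN,
                 EV_TIMEOUT_2MSL };

/* Returns the next state; an event not handled here leaves the state unchanged. */
enum tcp_state tcp_next_state(enum tcp_state s, enum tcp_event e)
{
    assert((int)s >= 0 && (int)s < TCP_STATE_COUNT);
    switch (s) {
    case CLOSED:
        if (e == EV_ACTIVE_OPEN)  return SYN_SENT;    /* send SYN */
        if (e == EV_PASSIVE_OPEN) return LISTEN;
        break;
    case LISTEN:
        if (e == EV_RCV_SYN)      return SYN_RCVD;    /* send SYN ACK */
        if (e == EV_CLOSE)        return CLOSED;
        break;
    case SYN_SENT:
        if (e == EV_RCV_SYN_ACK)  return ESTABLISHED; /* send ACK */
        if (e == EV_RCV_SYN)      return SYN_RCVD;    /* simultaneous open */
        if (e == EV_CLOSE)        return CLOSED;
        break;
    case SYN_RCVD:
        if (e == EV_RCV_ACK)      return ESTABLISHED;
        if (e == EV_CLOSE)        return FIN_WAIT_1;  /* send FIN */
        break;
    case ESTABLISHED:
        if (e == EV_CLOSE)        return FIN_WAIT_1;  /* send FIN */
        if (e == EV_RCV_FIN)      return CLOSE_WAIT;  /* send ACK */
        break;
    case FIN_WAIT_1:
        if (e == EV_RCV_ACK)      return FIN_WAIT_2;
        if (e == EV_RCV_FIN)      return CLOSING;     /* send ACK */
        break;
    case FIN_WAIT_2:
        if (e == EV_RCV_FIN)      return TIME_WAIT;   /* send ACK */
        break;
    case CLOSING:
        if (e == EV_RCV_ACK)      return TIME_WAIT;
        break;
    case TIME_WAIT:
        if (e == EV_TIMEOUT_2MSL) return CLOSED;
        break;
    case CLOSE_WAIT:
        if (e == EV_CLOSE)        return LAST_ACK;    /* send FIN */
        break;
    case LAST_ACK:
        if (e == EV_RCV_ACK)      return CLOSED;
        break;
    default:
        break;
    }
    return s;
}
```

Walking a client through active open, SYN ACK, close, ACK, FIN and the 2 MSL timeout visits SYN_SENT, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, TIME_WAIT and ends in CLOSED; a FIN received in TIME_WAIT leaves the state unchanged, which is the looping transition mentioned above.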

170 Chapter 5: TCP Resetting a TCP connection 40
Here is an example of use of the RESET segment:
    TCP A                                         TCP B
1.  (CRASH)                                       (send 300,receive 100)
2.  CLOSED                                        ESTABLISHED
3.  SYN-SENT --> <SEQ=400><CTL=SYN>           --> (??)
4.  (!!)     <-- <SEQ=300><ACK=100><CTL=ACK>  <-- ESTABLISHED
5.  SYN-SENT --> <SEQ=100><CTL=RST>           --> (Abort!!)
6.  SYN-SENT                                      CLOSED
7.  SYN-SENT --> <SEQ=400><CTL=SYN>           -->
Reset segments are used to abort a connection. See RFCs 793 and 1122 for an exact description. In short, they are sent:
- when a connection opening is attempted to a port where no passive open was performed;
- while in the SYN_SENT state, when a segment with invalid ack number arrives;
- when an application performs a "half-close": the application calls close and then (by the definition of the half close system call) cannot read incoming data anymore. In that case, incoming data is discarded and responded to by a Reset segment.
On reception of a Reset segment, it is checked for validity (seq number) and if it is valid:
- in the LISTEN state, it is ignored
- in the SYN_RCVD state, the state is moved to LISTEN provided it was the state from which SYN_RCVD was entered
- otherwise, the connection is aborted (move to CLOSED state)

171 Chapter 5: TCP TCP Segment Format 42
[Figure: IP header (20 B + options), followed by the TCP header (20 bytes + options) with the fields srce port, dest port, sequence number, ack number, hlen, rsvd, code bits, window, checksum, urgent pointer, options (if any), padding; then segment data (if any), <= MSS bytes]
code bit  meaning
urg       urgent ptr is valid
ack       ack field is valid
psh       this seg requests a push
rst       reset the connection
syn       connection setup
fin       sender has reached end of byte stream
- The push bit can be used by the upper layer using TCP; it forces TCP on the sending side to create a segment immediately. If it is not set, TCP may pack together several SDUs (= data passed to TCP by the upper layer) into one PDU (= segment). On the receiving side, the push bit forces TCP to deliver the data immediately. If it is not set, TCP may pack together several PDUs into one SDU. This is because of the stream orientation of TCP. TCP accepts and delivers contiguous sets of bytes, without any structure visible to TCP. The push bit is used by Telnet after every end of line.
- The urgent bit indicates that there is urgent data, pointed to by the urgent pointer (the urgent data need not be in the segment). The receiving TCP must inform the application that there is urgent data. Otherwise, the segments do not receive any special treatment. This is used by Telnet to send interrupt type commands.
- RST is used to indicate a RESET command. Its reception causes the connection to be aborted.
- SYN and FIN are used to indicate connection setup and close. They each consume one sequence number.
- The sequence number is that of the first byte in the data. The ack number is the next expected sequence number.
- Options contain for example the Maximum Segment Size (MSS), normally in SYN segments (negotiation of the maximum size for the connection results in the smallest value being selected).
- The checksum is mandatory.
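To see how the fields of the fixed 20-byte header are laid out, here is a sketch that extracts them from a raw segment. The struct and function names are hypothetical; the offsets follow the header layout of RFC 793 (ports at bytes 0 to 3, sequence number at 4, ack number at 8, header length in the upper 4 bits of byte 12, code bits in byte 13, window at 14). Fields are read byte by byte so that the code does not depend on the byte order of the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code bits, as bit masks in byte 13 of the header. */
enum { TCP_FIN = 0x01, TCP_SYN = 0x02, TCP_RST = 0x04,
       TCP_PSH = 0x08, TCP_ACK = 0x10, TCP_URG = 0x20 };

/* Hypothetical container for the decoded fields. */
struct tcp_fields {
    uint16_t src_port, dst_port;
    uint32_t seq, ack;
    unsigned hlen_bytes;  /* header length in bytes (hlen field * 4) */
    uint8_t  flags;       /* code bits */
    uint16_t window;
};

static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t rd32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode the fixed part of a TCP header. Returns 0 on success,
   -1 if the buffer is too short or the header length is invalid. */
int parse_tcp_header(const uint8_t *seg, size_t len, struct tcp_fields *out)
{
    assert(seg != NULL && out != NULL);
    if (len < 20)
        return -1;
    out->src_port   = rd16(seg);
    out->dst_port   = rd16(seg + 2);
    out->seq        = rd32(seg + 4);
    out->ack        = rd32(seg + 8);
    out->hlen_bytes = (unsigned)(seg[12] >> 4) * 4;
    out->flags      = (uint8_t)(seg[13] & 0x3F);
    out->window     = rd16(seg + 14);
    if (out->hlen_bytes < 20 || out->hlen_bytes > len)
        return -1;
    return 0;
}
```

A caller can then test individual code bits, for example `(f.flags & TCP_SYN) != 0`, and find the segment data at offset `f.hlen_bytes`.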

172 Chapter 5: TCP TCP Additional Algorithms 43
- TCP specifies a number of additional rules
  - Nagle's algorithm specifies when to create a segment
  - Silly window syndrome avoidance specifies when to update the credit
  - Round trip estimation is an adaptive algorithm to compute the retransmission timer
  - Fast Retransmit
- Example: when to send ACKs
  - delaying acks reduces processing and network load
  - but also decreases performance of the ARQ method
  - recommendation: delay at most 0.5 s; in a stream of full size segments, ack at least every second segment
There are various methods, rules and algorithms that are part of the TCP specification. Some of those appeared as more experience was gained with the early implementations of TCP. They address two kinds of issues:
- specify parts of the protocol for which there is freedom of implementation. For example: when to send ACKs, when to send data, when to send window advertisements (updates of the offered window)
- address bugs that were discovered in the field: silly window syndrome avoidance; improve algorithms for round trip estimation and setting retransmission timer values; improve performance by allowing early retransmissions
When to send ACKs is an issue that is not fully specified. However, RFC 1122 gives implementation guidance. When receiving a data segment, a TCP receiver may send an acknowledgement immediately, or may wait until there is data to send ("piggybacking"), or until other segments are received (cumulative ack). Delaying ACKs reduces processing at both sender and receiver, and may reduce the number of IP packets in the network. However, if ACKs are delayed too long, then senders do not get early feedback and the performance of the ARQ scheme decreases. Delaying ACKs also delays new information about the window size.
RFC 1122 recommends delaying ACKs, but for less than 0.5 s. In addition, in a stream of full size segments, there should be at least one ACK for every other segment.
Note that a receiving TCP should send ACKs (possibly delayed ACKs) even if the received segment is out of order. In that case, the ACK number points to the last byte received in sequence + 1.

173 Chapter 5: TCP TCP and Congestion Control 44
- TCP is used to avoid congestion in the Internet
  - in addition to what was shown: a TCP source adjusts its window to the congestion status of the Internet (slow start, congestion avoidance)
  - this avoids congestion collapse and ensures some fairness
- TCP sources interpret losses as a negative feedback
  - used to reduce the sending rate
- UDP sources are a problem for the Internet
  - use for long lived sessions (ex: RealAudio) is a threat: congestion collapse
  - UDP sources should imitate TCP: "TCP friendly"
In the Internet, all TCP sources sense the status of the network and adjust their window sizes accordingly. Reducing the window size reduces the traffic. The principle is to build a feedback system, with:
- a packet loss is a negative feedback; a new acknowledgement is a positive feedback;
- additive increase, multiplicative decrease: when a TCP source senses no loss, it increases its window size linearly; when a loss is detected, it reduces the window size by 50%.
It can be shown that, if all sources have the same round trip time, then all sources converge towards using a fair share of the network. In general, though, sources with large round trip times have a smaller share of the network bandwidth.
If some part of the Internet has a large volume of traffic not controlled by TCP, then there is a risk of congestion collapse. This is in particular caused by Internet telephony applications. In the future, all applications, even those which use UDP, will have to imitate the behaviour of TCP, at least as far as sending rates are concerned.
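The additive increase, multiplicative decrease rule can be sketched as a single update applied once per round trip. The function name, the increment of one segment per round trip and the floor of one segment are assumptions of this sketch; slow start and the other details of TCP congestion control are not modelled.

```c
#include <assert.h>

/* One update per round trip; window is counted in segments.
   loss_detected != 0 means at least one loss was seen in that round trip. */
double aimd_update(double window, int loss_detected)
{
    assert(window >= 1.0);
    if (loss_detected) {
        window *= 0.5;           /* multiplicative decrease: reduce by 50% */
        if (window < 1.0)
            window = 1.0;        /* assumed floor: never below one segment */
    } else {
        window += 1.0;           /* additive increase: linear growth */
    }
    return window;
}
```

Starting from a window of 10 segments, a round trip without loss gives 11, and a loss in the next one gives 5.5.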

174 Chapter 5: TCP TCP Socket Calls 45
- server
  - tell OS to receive and queue SYN packets
    int listen(int sd, int queuelength);
  - accept connection and create new socket
    int accept(int sd, struct sockaddr* adrdest, int* longueur);
    returns the new socket descriptor
- client
  - establish connection to server
    int connect(int sd, struct sockaddr* adrdest, int longueur);

175 Chapter 5: TCP TCP Socket Calls (2) 46
- client or server
  - send or receive for TCP (also for UDP, see exercise)
    int send(int sd, char* buf, int nbytes, int flags);
    int recv(int newsd, char* buf, int nbytes, int flags);
  - recv returns the number of bytes received; 0 means the connection was closed by the other end
  - flags is normally 0
- example in this chapter:
  client side:
    % ./tcpclient <destaddr> bonjour les amis
    %
  server side:
    % ./tcpserv &
    %
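Because TCP is a byte stream, `send` may accept fewer bytes than requested, and `recv` may return only part of what the other end sent in one call. A common way to deal with the sending side is a small loop, sketched below; `send_all` is a hypothetical helper, not part of the socket API, and it assumes a blocking, connected socket.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Send exactly nbytes from buf on the connected socket sd.
   Returns 0 on success, -1 on error (errno is set by send). */
int send_all(int sd, const char *buf, size_t nbytes)
{
    size_t done = 0;

    assert(buf != NULL || nbytes == 0);
    while (done < nbytes) {
        ssize_t n = send(sd, buf + done, nbytes - done, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: try again */
            return -1;
        }
        done += (size_t)n;       /* n may be smaller than what was asked for */
    }
    return 0;
}
```

The receiving side needs a similar loop when it expects a message of known length, stopping early when `recv` returns 0 (connection closed by the other end).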

176 Chapter 5: TCP Example 47
- client: socket(); bind(); connect(); send(); close();
- server: socket(); bind(); listen(); accept(); receive(); close();

177 Chapter 5: TCP
/*********************************************************/
/* tcpclient.c                                           */
/*********************************************************/
#include "inet.h"

int main(int nbargplusun, char *mot[]) {
    int sd, i;   // socket descriptor, loop index
    int rc;      // REXXish return code
    struct sockaddr_in cliaddr, servaddr;
    struct hostent *h;

    // check command line arguments
    if (nbargplusun < 3) {
        printf("usage: %s <server> <data1>...<datan>\n", mot[0]);
        exit(1);
    }

    // resolve server name, print result
    // and populate server address and port
    h = gethostbyname(mot[1]);
    if (h == NULL) {
        printf("%s: unknown host %s \n", mot[0], mot[1]);
        exit(1);
    }
    printf("%s: now preparing to send data to host %s \nat address: %s \n",
           mot[0], h->h_name,
           inet_ntoa(*(struct in_addr *) h->h_addr_list[0]));
    servaddr.sin_family = h->h_addrtype;
    memcpy((char *) &servaddr.sin_addr.s_addr, h->h_addr_list[0], h->h_length);
    servaddr.sin_port = htons(SERVER_PORT);

178 Chapter 5: TCP
    // create socket
    sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0) {
        printf("%s: cannot open socket \n", mot[0]);
        exit(1);
    }

    // bind any port number
    cliaddr.sin_family = AF_INET;
    cliaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    cliaddr.sin_port = htons(0);
    rc = bind(sd, (struct sockaddr *) &cliaddr, sizeof(cliaddr));
    if (rc < 0) {
        printf("%s cannot bind \n", mot[0]);
        exit(1);
    }

    // connect to server
    rc = connect(sd, (struct sockaddr *) &servaddr, sizeof(servaddr));
    if (rc < 0) {
        printf("%s: cannot connect \n", mot[0]);
        close(sd);
        exit(1);
    }
    printf("%s: connecting... \n", mot[0]);

    // send arguments one by one (each with its terminating '\0')
    for (i = 2; i < nbargplusun; i++) {
        // send data
        rc = send(sd, mot[i], strlen(mot[i]) + 1, 0);
        if (rc < 0) {
            printf("%s: cannot send data%d\n", mot[0], i - 1);
            close(sd);
            exit(1);
        }
        printf("%s: sent data%d: %s \n", mot[0], i - 1, mot[i]);
    } // end for

    // close socket and exit
    close(sd);
    exit(0);
}

179 Chapter 5: TCP
/***************************************************/
/* tcpserver.c                                     */
/*                                                 */
/* simple sequential test server                   */
/* connection closed by client                     */
/***************************************************/
#include "inet.h"

int main(int nbargplusun, char *mot[]) {
    int sd, newsd, rc, i, n;   // socket descriptors and return code
    socklen_t clilen;          // length of the client address, for accept()
    struct sockaddr_in cliaddr, servaddr;
    char msg[MAX_MSG];

    // create socket
    sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0) {
        printf("%s: cannot open socket \n", mot[0]);
        exit(1);
    }

    // bind server port
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    servaddr.sin_port = htons(SERVER_PORT);
    rc = bind(sd, (struct sockaddr *) &servaddr, sizeof(servaddr));
    if (rc < 0) {
        printf("%s cannot bind port number %d \n", mot[0], SERVER_PORT);
        exit(1);
    }

    // tell OS to receive SYN packets on sd
    // sd is an unconnected socket (associated with local host and port only)
    listen(sd, 5);

180 Chapter 5: TCP
    // server infinite loop
    while (1) {
        // accept one connection from the queue if any
        // create a socket newsd for that connection
        // newsd is connected: associated to source and destination
        clilen = sizeof(cliaddr);
        newsd = accept(sd, (struct sockaddr *) &cliaddr, &clilen);
        if (newsd < 0) {
            printf("%s: cannot accept connections \n", mot[0]);
            continue;
        }

        // receive segments
        while (1) {
            n = recv(newsd, msg, MAX_MSG, 0);
            if (n < 0) {
                // error: give up on this connection instead of printing stale data
                printf("%s: cannot receive data \n", mot[0]);
                close(newsd);
                break;
            } else if (n == 0) {
                printf("%s: connection closed by client \n", mot[0]);
                close(newsd);
                break;
            }
            printf("%s: from %s, received %d bytes : %s\n",
                   mot[0], inet_ntoa(cliaddr.sin_addr), n, msg);
        } // end of receive segments
    } // end of infinite loop
    // never reach this line
}

181 Chapter 5: TCP Facts to remember 52
- Applications use TCP or UDP
- TCP is connection oriented, reliable, byte oriented; it is complex
- UDP is connectionless; message oriented; just adds port numbers to IP packets
- usually: port numbers are well known for servers
- TCP also provides congestion avoidance for the Internet


183 Chapter 6: Application Layer 1 Chapter 6 : Application Layer Prof. Jean-Yves Le Boudec ICA, EPFL CH-1015 Ecublens Leboudec@epfl.ch

184 Chapter 6: Application Layer Application Layer: DNS; Web; 2
- Application programs (ex. netscape) use a set of well defined application layer protocols (ex. HTTP) and formats (ex: HTML)
- A given Application Layer protocol uses TCP or UDP
- Application layer runs on hosts; it does not involve routers
[Figure: protocol stack with HTTP, FTP, Telnet, SMTP, POP, NNTP, TFTP, RealAudio and RTP on top of TCP or UDP; a Web client and a Web server exchange HTTP across an IP network (Internet, intranet)]

185 Chapter 6: Application Layer Example: E-mail 3
- e-mail address: identifies a human user
  - format: user@domainname
  - domainname is a name according to DNS (real host or normally a virtual one)
- electronic mail application elements
  - user agent (UA): mail, elm, Netscape, Eudora,...
  - mail transfer agent (MTA): sendmail, Eudora,...
- Typical scenario
[Figure: UA and MTA on mkksun34.mycorp.com send a mail to: al@di.epfl.ch through the MTA on sicmail.epfl.ch to the MTA and UA on lrcsuns.epfl.ch; the arrows are numbered 1 to 5]
  1. user creates mail with UA; UA triggers MTA to send it
  2. MTA sends to destination or mail exchanger, using SMTP (simple mail transfer protocol)
  3. mail exchanger sends to destination MTA using SMTP
  4. destination MTA delivers to user mailbox
  5. user reads mailbox with UA

186 Chapter 6: Application Layer SMTP Example
client                          server
open TCP connection
                                <server Name>
HELO <client Name>
                                250 OK
MAIL FROM: <sender>
                                250 OK
RCPT TO: <rcv1>
                                250 OK
DATA
                                354 Start mail input
line 1
line 2
line n
                                250 OK
QUIT
                                221 Service closing...
close TCP connection

187 Chapter 6: Application Layer SMTP Session Example 5
- use telnet <destmachine> <serverport> to communicate manually with a server
- example
lrcsuns:/export/home1/leboudec$ telnet localhost 25
Trying
Connected to localhost.
Escape character is ^].
220-lrcsuns.epfl.ch Sendmail/LRC ready at Mon, 23 Jun :47: ESMTP spoken here
HELO lrcmac45.epfl.ch
250 lrcsuns.epfl.ch Hello localhost [ ], pleased to meet you
MAIL FROM: leconcombremasque
250 leconcombremasque... Sender ok
RCPT TO: leboudec@di.epfl.ch
250 leboudec@di.epfl.ch... Recipient ok
DATA
354 Enter mail, end with "." on a line by itself
ceci est un essaiiiii
.
250 QAA15185 Message accepted for delivery
QUIT
221 lrcsuns.epfl.ch closing connection
Connection closed by foreign host.

188 Chapter 6: Application Layer MIME 6
- MIME = multipurpose internet mail extensions, RFC 1521
  - defines formats for mail content beyond 7 bit ASCII
  - examples: accented characters, images, sounds
- format of message is defined by Content-Type: in message header
  - used by e-mail, web, etc
- examples of content types
  - text/plain
  - image/jpeg
  - multipart/mixed; Boundary = Next Part
- example of message header
Received: by alf.fe.up.pt; id AA13614; Fri, Apr 03:27:04 GMT
Message-Id: < B.FE44D428@tom.fe.up.pt>
Date: Fri, Apr 03:22:35
From: Rui Prior <deec3@tom.fe.up.pt>
To: linux-atm@lrc.di.epfl.ch
Subject: Nicstar Mbps
Content-Type: text/plain; charset=us-ascii

189 Chapter 6: Application Layer File Transfer Protocol: FTP 7
- uses two TCP connections; ports 20 and 21 are reserved ("active mode")
  [Figure: exchange between A: FTP client and S: FTP server: open TCP connection; PORT; OK; open TCP connection; OK; <...>]
- passive-mode FTP is a new version, does not use port 20
  [Figure: exchange between A: FTP client and S: FTP server: open TCP connection; PASV; OK; open TCP connection; OK; <...>]
FTP uses two TCP connections: one for exchanging commands, one for data. In active mode, the second connection is set up by the FTP server. From a TCP point of view, the server in that case is the FTP client!
Many firewalls forbid or limit incoming TCP connections. An extension of FTP (passive mode) has been defined which avoids potential problems originating there.

190 Chapter 6: Application Layer World Wide Web (WWW) 8
- three components
  - file transfer protocol: HTTP (hyper text transfer protocol)
  - format for documents with links ("hyperdocuments"): HTML (hyper text markup language)
  - URLs (uniform resource locators)
[Figure: a user and two Web servers, S1 and S2. 1. user clicks; 2. transfer of one or several documents; 3. user clicks on link in new document; 4. transfer of one or several documents]

191 Chapter 6: Application Layer URLs 9
- identify documents to be transferred and application layer protocol to use
  - protocol to be used
  - target host
  - path for document on target host
- examples
  - ftp://lrcftp.epfl.ch/meinix.ps.gz
  - news://comp.infosystems.www
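The three parts of a URL listed above can be separated with a few string operations. The sketch below handles only the simple form protocol://host/path; ports, user names and query strings are ignored, and the struct, the function name and the buffer sizes are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the three parts named on the slide. */
struct url_parts {
    char scheme[16];   /* protocol to be used */
    char host[128];    /* target host */
    char path[256];    /* path for document on target host (may be empty) */
};

/* Split "scheme://host/path". Returns 0 on success,
   -1 if the URL is malformed or a part does not fit. */
int split_url(const char *url, struct url_parts *out)
{
    const char *sep, *host, *slash;
    size_t slen, hlen, plen;

    assert(url != NULL && out != NULL);
    sep = strstr(url, "://");
    if (sep == NULL)
        return -1;
    slen  = (size_t)(sep - url);
    host  = sep + 3;
    slash = strchr(host, '/');
    hlen  = slash ? (size_t)(slash - host) : strlen(host);
    plen  = slash ? strlen(slash) : 0;
    if (slen == 0 || hlen == 0 || slen >= sizeof out->scheme ||
        hlen >= sizeof out->host || plen >= sizeof out->path)
        return -1;
    memcpy(out->scheme, url, slen);
    out->scheme[slen] = '\0';
    memcpy(out->host, host, hlen);
    out->host[hlen] = '\0';
    memcpy(out->path, slash ? slash : "", plen);
    out->path[plen] = '\0';
    return 0;
}
```

Applied to the first example of the slide, it yields the protocol `ftp`, the host `lrcftp.epfl.ch` and the path `/meinix.ps.gz`.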

192 Chapter 6: Application Layer HTML 10
- HTML is a document format
  - specifies how source files (in ASCII) are coded
- Two functions
  1. markup language: originates from IBM's GML (see Script, Bookmaster), similar to TeX, LaTeX
     - idea: specify document structure, not layout
     - contains references to other objects (images, sounds, etc...) to be included for the display
  2. links to other documents (hyperdocument feature)
- Web browser (= web client) interprets the document to display it

193 Chapter 6: Application Layer essai.html 11
<html>
<head>
<body>
<title>first Attempt</title>
<h1>what is HTML?</h1>
A text markup language
<ul>
<li> HTML is very simple
<li> less powerful than Latex
<li> and very similar to Bookmaster
<li> defined by the <A HREF=" Web Consortium</A>
</ul>

The TCP/IP Architecture Jean Yves Le Boudec 2014

Objective Understand Layered Model of Communication Systems Know what MAC, IP addresses and DNS names are Chapter 2: Introduction Textbook 2 TCP/IP is a