TCIP Internet Protocol Transmission Control, Thesis of Network Analysis

Internet TCPIP Internet Protocol Transmission Control

Typology: Thesis

2017/2018

Uploaded on 07/06/2018

deevi_dhananjay
Programming with TCP/IP – Best Practices
Matt Muggeridge
TCP/IP for OpenVMS Engineering
"Be liberal in what you accept, and
conservative in what you send"
Source: RFC 1122, section 1.2.2 [Braden, 1989a]
Overview
A seasoned network programmer appreciates the many complexities and pitfalls associated with
meeting the requirements of robustness, scalability, performance, portability, and simplicity for an
application that may be deployed in a heterogeneous environment and a wide range of network
configurations. The TCP/IP programmer controls only the end-points of the network connection,
but must provide for all contingencies, both predictable and unpredictable. Therefore, an
extensive knowledge base is required. The TCP/IP programmer must understand the
relationship among network API calls, protocol exchange, performance, system and network
configuration, and security.
This article is intended to help the intermediate TCP/IP programmer who has a basic knowledge
of network APIs in the design and implementation of a TCP/IP application in an OpenVMS
environment. Special attention is given to writing programs that support configurations where
multiple NICs are in use on a single host, known as a multihomed configuration, and to using
contemporary APIs that support both IPv4 and IPv6 in a protocol-independent manner. Key
differences between UDP and TCP applications are identified and code examples are provided.
This article is not the most definitive source of information for TCP/IP programmers. There are
many more topics that could be covered, as is evident by the number of expansive text books,
web-sites, newsgroups, RFCs, and so on.
The information in this article is organized according to the structure of a network program. First,
the general program structure is introduced. Subsequent sections describe each of the phases:
Establish Local Context, Connection Establishment, Data Transfer, and Connection Shutdown.
The final section is dedicated to important general topics.

Structure of a Network Program

All network programs are structured in a similar way, regardless of the complexity of the service they provide. They consist of two peer applications: one is designated as the server and the other is the client (see Figure 1). Each application creates a local end-point ( socket ), and associates ( binds ) a local name with it that is identified by the three-tuple: protocol, local IP address, and port number. The named end-point can be referenced by a peer application to form a connection that is uniquely identified in terms of the named end-points. Once connected, data transfer occurs. Finally, the connection is shut down.

[Figure 1 shows the server side and the client side passing through four phases.
Establish local context phase (create and name an end-point): both sides call socket, then bind (optional for clients).
Connection phase: the server calls listen and accept (TCP only); the client calls connect (optional for UDP). For UDP clients, connect is optional and it affects local context only. UDP servers do not have a connection phase.
Data transfer phase: the server calls recv/send; the client calls send/recv.
Close connection phase: shutdown (TCP only).]

Figure 1 Structure of a Network Program

You have to take special measures to support multihomed configurations for UDP applications. In addition, by using the modern APIs, network programs can readily support both IPv4 and IPv6. Writing applications that are largely independent of the version of the Internet Protocol (IPv4 or IPv6) requires the use of simple address conversion APIs.

The User Datagram Protocol (UDP) is connectionless, and supports broadcasting and multicasting of datagrams. UDP uses the datagram socket service, which is not reliable; therefore, datagrams may be lost. Also, datagrams may be delivered out of sequence or duplicated. However, record boundaries are preserved; a recvfrom() call will result in the same unit of data that was sent using the corresponding sendto().

In a UDP application, it is the responsibility of the programmer to ensure reliability, sequencing, and detection of duplicate datagrams. The UDP broadcast and multicast services are not well suited to a WAN environment, because routers will often block broadcast and multicast traffic. Also, because WANs are generally less reliable, a UDP application in a WAN environment may suffer from the greater processing overhead required to cope with data loss, which may in turn flood the WAN with retransmissions. UDP is particularly suited to applications that rely on short request-reply communications in a LAN environment, such as DNS (the Domain Name System) or applications that use a polling mechanism such as the OpenVMS Load Broker and Metric Server.

The Transmission Control Protocol (TCP) is connection-oriented; provides reliable, sequenced service; and transfers data as a stream of bytes. Because TCP is connection-oriented, it has the additional overhead associated with connection setup and tear-down. For applications that transfer large amounts of data, the cost of connection overhead is negligible. However, for short-lived connections that transfer small amounts of data, the connection overhead can be considerable and can lead to performance bottlenecks. Examples of TCP applications that are long-lived or transfer large amounts of data include Telnet and FTP.

Providing for UDP Behaviors

UDP is designed to be an inherently unreliable and simple protocol. Do not expect errors to be returned when datagrams are lost, dropped, or duplicated, or when they arrive out of sequence. You may find it necessary to overcome these behaviors. It is the responsibility of the UDP application to detect these conditions, and it must take the appropriate action according to the application’s needs. At some point, you may be duplicating the behavior of the TCP protocol in the application, in which case you should reconsider your choice of protocol.
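One common way for a UDP application to detect these conditions is to carry a sequence number in every datagram. The following is a minimal sketch of the receiving side under that assumption; the names seq_tracker and seq_classify are hypothetical, and the sketch ignores 32-bit wraparound and does not attempt retransmission.

```c
#include <stdint.h>

/* Hypothetical classification of an arriving datagram. */
enum seq_status { SEQ_IN_ORDER, SEQ_DUPLICATE_OR_OLD, SEQ_GAP };

/* Receiver state: the next sequence number expected from the peer. */
struct seq_tracker { uint32_t next_expected; };

/* Classify a received sequence number and advance the tracker.
 * A gap means at least one earlier datagram was lost or is still in flight.
 * Wraparound of the 32-bit counter is deliberately not handled here. */
enum seq_status seq_classify(struct seq_tracker *t, uint32_t seq)
{
    if (seq == t->next_expected) {
        t->next_expected = seq + 1;
        return SEQ_IN_ORDER;
    }
    if (seq < t->next_expected)
        return SEQ_DUPLICATE_OR_OLD;   /* already seen, or arrived late */
    t->next_expected = seq + 1;        /* skip ahead past the gap */
    return SEQ_GAP;
}
```

What the application does with each result (discard, reorder, or request a resend) depends on its own needs.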

Creating an End-Point

In a BSD-based system, the socket defines an end-point. It is a local entity and is used to establish local context only. The end-point is created using the BSD socket() function. See Example 1.

int socket(int domain, int type, int protocol);

Where the arguments are:
domain – may be either AF_INET (IP version 4) or AF_INET6 (IP version 6)
type – may be either SOCK_STREAM (TCP) or SOCK_DGRAM (UDP)
protocol – set to zero, because the protocol is implied by the type argument

Example 1 Creating an End-Point
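As an illustration of the call above, the following minimal sketch wraps socket() with an error report. The helper name open_endpoint is hypothetical and is not part of any library.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Create an end-point of the given domain (AF_INET or AF_INET6) and type
 * (SOCK_STREAM or SOCK_DGRAM). Returns the socket descriptor, or -1 after
 * reporting the error. */
int open_endpoint(int domain, int type)
{
    int sd = socket(domain, type, 0);  /* protocol 0: implied by type */
    if (sd < 0)
        perror("socket");
    return sd;
}
```

For example, open_endpoint(AF_INET6, SOCK_STREAM) requests a TCP end-point over IPv6, and open_endpoint(AF_INET, SOCK_DGRAM) requests a UDP end-point over IPv4.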

Naming an Endpoint

An end-point is uniquely identified by its name. The name is defined by the protocol, local IP address, and local port number using the bind() function, as shown in Example 2.

The server-side application must bind() a name to its socket so that clients can reference the service. It is not recommended for the client-side application to call bind(). When a client does not explicitly call bind(), the kernel will implicitly bind a name of its choosing when the application calls either connect() for TCP, or sendto() for UDP.

int bind(int socket, struct sockaddr *address, int address_len);

Where the arguments are:
socket – value returned by calling the socket() function
address – socket address structure
address_len – length of socket address structure

Note that the “address” structure and “address_len” value are best initialized by calling getaddrinfo().

Example 2 Naming an Endpoint

Servers Explicitly Bind to a Local End-Point

A service is identified by its protocol, local IP address, and local port number. The application advertises its service as either TCP or UDP on a specific local port number. (When a service binds to a local port number below 1024, the process requires one of the following privileges: SYSPRV, BYPASS, or OPER.) A server application should be capable of accepting connections on all IP addresses configured on the host, including IPv4 and IPv6 addresses.

Binding to all addresses is easiest to achieve by binding to the special address known as INADDR_ANY (IPv4) or IN6ADDR_ANY_INIT (IPv6). However, by using the protocol-independent APIs, the differences between IPv6 and IPv4 become less relevant. The server can readily be programmed to accept incoming TCP connections (or UDP datagrams) sent to any interface configured with an IPv4 or IPv6 address.

Use getaddrinfo() to return the list of all available socket addresses (see Example 3). This function accepts hostnames as alias names, or in numeric format as IPv4 or IPv6 strings. In a multihomed environment, this may be a long list. For instance, a system configured with IPv4 and IPv6 addresses will return a socket address for each of the following protocol combinations: TCP/IPv6, UDP/IPv6, TCP/IP and UDP/IP. Example 4 demonstrates the method for establishing local context for each of the socket addresses configured on a system, independent of IPv6 or IPv4.

Clients Implicitly Bind to Their End-Point

Whereas a server must explicitly bind() to its local end-point so that its service may be accessible, a client does not advertise a service. Hence, a client is able to use any local IP address and local port number. This is achieved by the client skipping the bind() call (see the client path in Figure 1). Instead, when a TCP client issues connect(), or a UDP client issues sendto(), an implicit binding is made. The bound IP address is determined from the routing table and the order of addresses configured on an interface. The local port number is dynamically assigned and is referred to as an ephemeral port.

Note that the ephemeral port numbers are selected from a range specified by the following sysconfig inet attributes [Hewlett-Packard Company, 2003c]:

ipport_userreserved (specifies the maximum ephemeral port number)
ipport_userreserved_min (specifies the minimum ephemeral port number)
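The implicit binding can be observed with getsockname(). The following minimal sketch connects a UDP socket to the IPv4 loopback address without calling bind() and then asks the kernel which local port it assigned. The helper name implicit_local_port and the choice of destination port 9 are assumptions made for illustration only.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Connect a UDP socket to the loopback address without calling bind(), then
 * ask the kernel which local port it chose. Returns the port in host byte
 * order, or -1 on failure. Connecting a UDP socket sends no packets. */
int implicit_local_port(void)
{
    struct sockaddr_in peer, local;
    socklen_t len = sizeof(local);
    int port = -1;
    int sd = socket(AF_INET, SOCK_DGRAM, 0);

    if (sd < 0)
        return -1;

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(9);                       /* arbitrary port */
    peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (connect(sd, (struct sockaddr *)&peer, sizeof(peer)) == 0 &&
        getsockname(sd, (struct sockaddr *)&local, &len) == 0)
        port = ntohs(local.sin_port);               /* the ephemeral port */

    close(sd);
    return port;
}
```

The value returned should fall inside the ephemeral range configured on the system.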

int sd[MAX_SOCKS];  /* one per TCP/IPv6, UDP/IPv6, TCP/IP, UDP/IP */
int i, err;
char *port, *addr = NULL;
struct addrinfo *res, *aip, hints;

port = argv[1];      /* port number as a string – must not be NULL */
if(argc == 3)
    addr = argv[2];  /* hostname – NULL implies ANY address */

memset(&hints, '\0', sizeof(hints));
hints.ai_flags = AI_PASSIVE;  /* if addr is NULL, sets sockaddr to ANY */

err = getaddrinfo(addr, port, &hints, &res);
if(err) {
    if(err == EAI_SYSTEM) perror("getaddrinfo");
    else printf("getaddrinfo error %d - %s", err, gai_strerror(err));
    return 1;
}

i = 0;
for(aip = res; aip; aip = aip->ai_next) {
    if(aip->ai_family != AF_INET && aip->ai_family != AF_INET6)
        continue;

    /* check for room before using the next element of sd[] */
    if(i == NUM_ELT(sd)) {printf("Insufficient socket elements\n"); break;}

    /* create a socket for this protocol */
    sd[i] = socket(aip->ai_family, aip->ai_socktype, aip->ai_protocol);
    if(sd[i] < 0) {perror("socket"); return sd[i];}

    err = socket_options(sd[i], aip);  /* set SO_REUSEADDR, SO_REUSEPORT etc. */
    if(err == -1) {perror("socket_options"); return 1;}

    /* bind to the address of the current list entry, not the list head */
    err = bind(sd[i], aip->ai_addr, aip->ai_addrlen);
    if(err == -1) {perror("bind"); return 1;}

    /** perform other per-socket work here – e.g. maybe create threads etc **/

    i++;
}
freeaddrinfo(res);

Example 4 Server Establishes Context for All Addresses

To overcome these issues, you must modify the server’s socket to allow it to rebind to the same address and port number multiple times and without delay. This is implemented as a call to setsockopt(), as shown in Example 5.

int on = 1;

/* allow server to reuse address when binding */
err = setsockopt(sd, SOL_SOCKET, SO_REUSEADDR, (char *)&on, sizeof(on));
if(err < 0) {perror("setsockopt SO_REUSEADDR"); return err;}

/* allows UDP and TCP to reuse port (and address) when binding */
err = setsockopt(sd, SOL_SOCKET, SO_REUSEPORT, (char *)&on, sizeof(on));
if(err < 0) {perror("setsockopt SO_REUSEPORT"); return err;}

Example 5 Setting Socket Options to Reuse Port and Address

UDP Servers Enable Ancillary Data

In a multihomed environment, a UDP server requires special care when replying to a request. It may reply to a client using any appropriate interface, setting the outgoing source address to that interface. That is, the reply source address does not have to match the request’s destination address.

This creates problems in environments protected by a firewall that monitors source and destination addresses. If a packet has a reply source address that does not match the request’s destination address, the firewall interprets this as address spoofing and drops the packet. Also, a client using a connected UDP socket will only receive a datagram with a source/destination address pair matching what it specified in the connect() call. Therefore, the server socket must be enabled to receive the request’s destination address, and the server must reply using that address as the reply source address. Sample code that enables a socket to receive the destination address information is shown in Example 6. This is an area where there are differences between IPv6 and IPv4, so they must be treated individually.

int err, on = 1, len = sizeof(on);

/* UDP should reply using dst address of the request */
if(ai->ai_protocol == IPPROTO_UDP) {
    if(ai->ai_family == AF_INET) {  /* must be IPv4 - enable recvdstaddr */
        err = setsockopt(sd, IPPROTO_IP, IP_RECVDSTADDR, (char *)&on, len);
        if(err < 0) {perror("setsockopt IP_RECVDSTADDR"); return err;}
    }
    else {  /* must be IPv6 - enable recvpktinfo and pktinfo */
        err = setsockopt(sd, IPPROTO_IPV6, IPV6_RECVPKTINFO, (char *)&on, len);
        if(err < 0) {perror("setsockopt IPV6_RECVPKTINFO"); return err;}

        err = setsockopt(sd, IPPROTO_IPV6, IPV6_PKTINFO, (char *)&on, len);
        if(err < 0) {perror("setsockopt IPV6_PKTINFO"); return err;}
    }
}

Example 6 UDP Servers Enable Ancillary Data

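[Figure 2 shows the exchange between the two peers. The TCP server application calls listen() and accept(); the TCP client application calls connect(). The client sends SYN and enters the SYN_SENT state. The server replies with SYN ACK and enters the SYN_RCVD state. The client enters the ESTABLISHED state and sends ACK; the server enters the ESTABLISHED state when it receives that ACK.]

Figure 2 TCP Connection Establishment - "Three-Way Handshake"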

Before a server can receive a connection, it must first issue listen() and accept(). These are illustrated in Example 7.

int listen(int socket, int backlog);

Where the arguments are:
socket – value returned by calling the socket() function
backlog – maximum number of outstanding connection requests

int accept(int socket, struct sockaddr *fromaddr, int *fromlen);

Where the arguments are:
socket – value returned by calling the socket() function
fromaddr – socket address of the peer we’re accepting the connection from
fromlen – length of the socket address

The return value is a new socket descriptor that can be used for data transfer between the client and server.

Note that if the peer’s address details need to be recorded, they are best decoded with getnameinfo().

Example 7 TCP Server Connection Phase

Once the server is ready to accept incoming connections, the TCP client initiates the connection by calling connect():

int connect(int socket, struct sockaddr *toaddr, int tolen);

Where the arguments are:
socket – value returned by calling the socket() function
toaddr – socket address describing the peer to connect to
tolen – length of the socket address

Note that the toaddr and tolen fields may be initialized by calling getaddrinfo().

Example 8 TCP or UDP Client Connect Phase

Optional UDP Connection Phase

Unlike the TCP client where the connect() call is required, a UDP client may optionally call connect(). The API for connecting a UDP socket is the same as that used to connect a TCP socket, as shown in Example 8. During a UDP connect() the destination address is bound to the socket; therefore, when sending a UDP datagram, it is an error to specify the destination address with each datagram. To transmit data over the connected UDP socket, use the sendto() function with a NULL destination address, or use the send() function. A connected UDP socket does not change the behavior of UDP as a connectionless protocol. There is no protocol exchange when a UDP socket is connected or shut down, and there is no

char *srv_addr, *srv_port;
struct addrinfo *srv_res, *ai, hints;

srv_addr = argv[1];  /* server address */
srv_port = argv[2];  /* server port */

memset(&hints, '\0', sizeof(hints));
hints.ai_family = usrreq.family;
hints.ai_socktype = usrreq.type;

/* get remote address info */
err = getaddrinfo(srv_addr, srv_port, &hints, &srv_res);
if(err) {
    if(err == EAI_SYSTEM) perror("getaddrinfo");
    else printf("getaddrinfo error %d - %s", err, gai_strerror(err));
    return 1;
}
for(ai = srv_res; ai; ai = ai->ai_next) {
    /* AF_INET and AF_INET6 only */
    if(ai->ai_family != AF_INET && ai->ai_family != AF_INET6)
        continue;

    sd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
    if(sd < 0) {perror("socket"); continue;}  /* try next socket */

    err = socket_options(sd, ai);
    if(err) {perror("sockopt"); close(sd); continue;}  /* try next socket */

    /* use connected UDP sockets if usrreq.connected is set */
    if(ai->ai_protocol == IPPROTO_TCP || usrreq.connected) {
        err = connect(sd, ai->ai_addr, ai->ai_addrlen);
        if(err == -1) {perror("connect"); close(sd); continue;}  /* try next socket */
    }
    break;  /* use first successful connection */
}
/* the loop ran off the end of the list: every address failed */
if(ai == NULL) {printf("No connection"); return 1;}

/** data transfer phase **/

Example 9 Client Connects to Each Server Address Until Success

TCP is a wonderfully robust protocol that can recover from lengthy network outages, but this can result in zombie connections on the server. A zombie connection is one that is maintained by just one of the peers after the other peer has exited. For example, a cable modem may be powered off before the client application shuts down the connection. Because the modem has been powered off, the TCP client cannot notify the server that it is shutting down the connection. As a result, the server-side application unwittingly maintains the context of the connection. This is not unusual with home networks. Without any notification of the client being disconnected, the TCP server will maintain its connection indefinitely.

There are a number of ways to solve this problem. At the application level, a keepalive message can be transferred between peers. When a peer stops responding for a configurable number of keepalives, the connection should be closed. Alternatively, the system manager can enable a system-wide keepalive mechanism that will affect all TCP connections. This is controlled using the following sysconfig inet attributes: tcp_keepidle, tcp_keepcnt, and tcp_keepintvl [Hewlett-Packard Company, 2003c].

These system configuration parameters are useful when an application does not provide a mechanism for closing zombie connections. An application must be restarted to pick up changes in the keepalive sysconfig attributes.

Because UDP provides no way to determine the availability of its peer, you can implement a keepalive mechanism at the application level for this purpose.
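The bookkeeping for an application-level keepalive can be quite small. The following is a minimal sketch of the accounting only, assuming the application sends one keepalive per interval and treats any reply as proof of life; the structure and function names are hypothetical, and the message format and timer are left to the application.

```c
/* Hypothetical per-peer liveness state for an application-level keepalive. */
struct keepalive {
    int missed;          /* consecutive keepalives that went unanswered */
    int max_missed;      /* configurable limit before the peer is given up */
    int awaiting_reply;  /* nonzero if the last keepalive is still unanswered */
};

/* Call once per keepalive interval, just before sending the next keepalive.
 * Returns 1 if the peer has failed to answer max_missed keepalives in a row
 * and the connection should be closed, 0 otherwise. */
int keepalive_on_interval(struct keepalive *k)
{
    if (k->awaiting_reply)
        k->missed++;               /* the previous keepalive got no answer */
    k->awaiting_reply = 1;
    return k->missed >= k->max_missed;
}

/* Call whenever a reply arrives from the peer. */
void keepalive_on_reply(struct keepalive *k)
{
    k->awaiting_reply = 0;
    k->missed = 0;
}
```

The same accounting applies to a TCP or a UDP peer; only the transport used to carry the keepalive message differs.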

TCP Server’s Listen Backlog

The TCP server’s listen backlog is a queue of connection requests for connections that have not been accepted by the application. When the connection has been accepted by the application, the request is removed from the backlog queue. The length of the backlog is set by the listen() call. If this backlog queue becomes full, new connection requests are silently ignored, which may lead to clients suffering from timeouts on their new connection attempts.

When a TCP application issues a successful connect() request, it results in a “three-way handshake” (shown in Figure 2). When the connect() request is unsuccessful, you have to provide for all potential failures.

Note from Figure 2 that the client and server enter the ESTABLISHED state at different times. Therefore, if the final ACK is lost, it is possible for the client to believe it has established a connection, while the server remains in SYN_RCVD state. The server must receive the final ACK before it believes the connection is established.

Consider the impact that this protocol exchange has on the peer applications. For example, after the first SYN is received, the TCP server tracks the connection request by adding the peer details to an internal socket queue (so_q0). When the final ACK is received, the peer details are moved from so_q0 to another internal queue (so_q). The connection state is freed from so_q only when the application’s accept() call completes. (For more details see [Wright, 1995].)

The socket queues will grow under any of the following conditions:

  • The rate of incoming SYN packets (connection requests) is greater than the completion rate of accept().
  • The final ACK is slow in arriving, or an acknowledgement (SYN ACK or ACK) is lost.
  • The final ACK never arrives (as in the case of a SYN flood attack).

If the condition persists, the socket queues will eventually become full. Subsequent SYN packets will be silently dropped (that is, the TCP server does not respond with SYN ACK segments). Eventually, the client-side application will time out. The client timeout will occur after approximately tcp_keepinit/2 seconds (75 seconds by default).

The length of this socket queue is controlled by several attributes. You can specify the queue length in the listen() call. This can be overridden with the sysconfig attributes sominconn and somaxconn, but the server application must be restarted to use these system configuration changes. Restarting a busy server may not be practical, so it is better to treat this as a sizing concern, and ensure that the accept() call is able to complete in a timely fashion, given the rate of requests and the length of the listen queue.
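The sizing concern can be approximated with simple arithmetic. The following rough sketch assumes steady arrival and accept rates and a queue that starts empty, which real traffic rarely satisfies; the function name is hypothetical and the result is only a first estimate.

```c
/* Rough estimate of how long a listen queue survives sustained load.
 * backlog         - queue length requested in listen()
 * syn_per_sec     - rate of incoming connection requests
 * accepts_per_sec - rate at which accept() completes
 * Returns seconds until the queue is full, or -1.0 if accept() keeps up. */
double seconds_until_backlog_full(int backlog, double syn_per_sec,
                                  double accepts_per_sec)
{
    double growth = syn_per_sec - accepts_per_sec;  /* net queue growth rate */
    if (growth <= 0.0)
        return -1.0;
    return backlog / growth;
}
```

For instance, with a backlog of 128, 164 requests per second arriving, and 100 accepts per second completing, the queue grows by 64 entries per second and fills in about 2 seconds.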

Note that for sendto(), the peer’s destination address is specified by the arguments dstaddr and dstlen, which should be initialized using getaddrinfo(). For recvfrom(), the peer’s address is available in the fromaddr argument and can be resolved with getnameinfo().

int recv(int socket, char *buffer, int length, int flags);

int send(int socket, char *message, int length, int flags);

Where the arguments are:
socket – value returned by calling the accept() function
buffer – buffer that receives the incoming data
message – buffer containing data to be sent
length – length of the message to send
flags – sender may control transmission of the message.

Example 10 Connected Socket Data Transfer APIs

int sendto(int socket, char *message, int length, int flags, struct sockaddr *dstaddr, int dstlen);

int recvfrom(int socket, char *message, int length, int flags, struct sockaddr *fromaddr, int *fromlen);

Where the arguments are:
socket – socket descriptor for data transfer
message – buffer containing data to be sent
length – length of the message to send
flags – sender may control transmission of the message
dstaddr – socket address of destination
dstlen – length of dstaddr
fromaddr – socket address of peer from where the data was sent
fromlen – length of fromaddr

Example 11 Unconnected UDP Sockets Data Transfer APIs

For UDP sockets (regardless of whether they are connected or unconnected), the sendmsg() and recvmsg() routines may be used. These routines provide a special interface for sending and receiving ancillary data^1, as shown in Example 12. A socket may be enabled to receive ancillary data with setsockopt(). A special use of the received ancillary data is shown in Example 17 through Example 21.

Note that the sendmsg() and recvmsg() functions are particularly important in a UDP server application in a multihomed configuration; see section “UDP and Multihoming” on page 21.

(^1) UDP over IPv4 ignores ancillary data for sendmsg().

int sendmsg(int socket, const struct msghdr *message, int flags);

int recvmsg(int socket, struct msghdr *message, int flags);

Where the arguments are:
socket – value returned by calling the accept() function
message – describes the data to be sent and ancillary data
flags – sender may control transmission of the message.

Example 12 UDP Data Transfer

The msghdr structure contains a field, msg_control, for ancillary data. IPv4 currently ignores this field for transmission. IPv6 uses the ancillary data as described in RFC 3542 [Stevens et al., 2003].

The recv(), recvfrom(), and recvmsg() APIs support a flags argument that can be used to control message reception. One of the options is MSG_PEEK, which allows the application to peek at the incoming message without removing it from the socket’s receive buffer. Another option is MSG_OOB, which supports the processing of out-of-band data. These features are not recommended, as explained in the next two sections.

Avoid MSG_PEEK When Receiving Data

With today’s modern networks and high-performing large memory systems, MSG_PEEK is an unnecessary receive option. The MSG_PEEK function looks into the socket receive buffer, but does not remove any data from it. Keep in mind that the objective of any network application is to keep data flowing through the network. Because the MSG_PEEK function does not remove data from the receive buffer, it will cause the receive window to start closing, which applies back pressure on the sender and can result in an inefficient use of the network. In any case, after a MSG_PEEK, it is still necessary to read the data from the socket. So an application may as well have done that in the first place and “peeked” inside its own buffer.

Avoid Out-Of-Band Data

Out-of-band (OOB) data provides a mechanism for urgently delivering a single byte of data to the peer. The receiving application is notified of OOB data and it may be read out of sequence. It is typically used for signaling. However, out-of-band data cannot be relied upon and is dependent on the implementation of the protocol stack. In many BSD implementations, if the out-of-band data is not read by the application before new out-of-band data arrives, then the new OOB data overwrites the unread OOB data. Instead of using OOB data for signaling, a better approach is to create a dedicated connection for signaling.

TCP Data Transfer – Stream of Bytes

Possibly the most common oversight of TCP/IP programmers is that they fail to realize the significance of TCP sending a stream of bytes. In essence, this means that TCP guarantees to deliver no more than one byte at a time, no matter how much data the user application sends. The amount of data that TCP transmits is affected by a wide variety of protocol events such as: send window size, slow start, congestion control, Nagle algorithm, delayed ACKs, timeout events and so on [Stevens, 1994], [Snader, 2000]. In other words, data is delivered to the peer in differently sized chunks that are independent of the amount of data that is written to TCP with each send() call. For a TCP application, this means that, when receiving data, the algorithm must loop on a recv() call. See Example 13.
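The receive loop can be sketched as follows. This is an illustrative helper rather than the article's own example; the name recv_all is hypothetical, and it assumes the application already knows how many bytes to expect (for instance from a fixed-size header).

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Read exactly `want` bytes from a stream socket, looping because each
 * recv() may return fewer bytes than requested. Returns the number of bytes
 * read (less than `want` only if the peer finished sending early), or -1 on
 * error. */
ssize_t recv_all(int sd, char *buf, size_t want)
{
    size_t got = 0;

    while (got < want) {
        ssize_t n = recv(sd, buf + got, want - got, 0);
        if (n == 0)
            break;              /* peer has finished sending */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```

A caller that sends "hello" and "world" in two separate send() calls can still retrieve all ten bytes with one recv_all(sd, buf, 10), regardless of how TCP chose to split or merge them in transit.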

The maximum size of a socket buffer can be controlled by system-wide variables. These can be viewed and modified with the sysconfig utility. For example, to view these values you can use the following command [Hewlett-Packard Company, 2003c]:

$ sysconfig -q inet udp_sendspace udp_recvspace

To set the udp_recvspace buffer to 9216 bytes, use:

$ sysconfig -r inet udp_recvspace 9216

Changing these attribute settings will override programs that use the setsockopt() function to modify the size of their respective socket buffers.
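For comparison, a program adjusts the buffer of a single socket with the SO_RCVBUF (or SO_SNDBUF) socket option. The following minimal sketch requests a receive buffer size and reads back what the kernel granted; the helper name set_recv_buffer is hypothetical, and the granted value may differ from the request because system-wide limits apply.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Request a receive buffer size for one socket and return the size the
 * kernel reports afterwards, or -1 on error. */
int set_recv_buffer(int sd, int bytes)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

    if (setsockopt(sd, SOL_SOCKET, SO_RCVBUF, (char *)&bytes, sizeof(bytes)) < 0)
        return -1;
    if (getsockopt(sd, SOL_SOCKET, SO_RCVBUF, (char *)&granted, &len) < 0)
        return -1;
    return granted;
}
```

Reading the value back is worthwhile because a request that exceeds the system limits may be adjusted without an error being returned.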

Understand Buffering

A TCP or UDP application writes data to the kernel. The kernel stores the data in the socket send buffer. A successful write operation means that the data has been successfully written to the socket send buffer; it does not indicate that the kernel has sent it yet.

The procedure then involves two steps:

  1. The data from the send buffer is transmitted and arrives at the peer’s socket receive buffer. In the case of TCP, the delivery of data to the peer’s socket receive buffer is guaranteed by the protocol, because TCP acknowledges that it has received the data. UDP provides no such guarantees and silently discards data if necessary. At this point, the data has not yet been delivered to the peer application.
  2. The receiving application is notified that data is ready in its socket’s receive buffer and the application reads the data from the buffer.

Because of this buffering, data can be lost if the receiving application exits (or the node crashes) while data remains in the receive socket buffer. It is up to the application to guarantee that data has arrived successfully at the peer application. TCP has completed its responsibility when it notifies the application that the data is ready.

Management of Data Transfer Phase

The following sysconfig attributes of the socket subsystem affect the data transfer phase: tcp_sendspace, tcp_recvspace, udp_sendspace, udp_recvspace, tcp_nodelack [Hewlett-Packard Company, 2003c].

Connection Shutdown Phase

A connection is bidirectional; consequently, each side of the connection may be shut down independently. The following topics are described for the connection shutdown phase:

  • TCP Orderly Release
  • Management of Connection Shutdown

The shutdown() function is shown in Example 16.

int shutdown(int socket, int how);

Where the arguments are:
socket – socket created for data transfer
how – describes the direction to shutdown

Example 16 Connection Shutdown API

TCP Orderly Release

An application may not know how much data it will receive from a peer; therefore, each peer should signal when it has finished sending data, as in a telephone conversation in which both parties say “goodbye” to indicate they have nothing more to say. Similarly, a receiving application should not exit before it has received the signal indicating the last of the data. TCP applications can make use of the half-close to signal that a peer has finished sending data. When both peers have signaled this, the socket may be closed and the application can exit.

When a TCP application issues a shutdown() on the sending side of the socket, it results in the protocol exchange as shown in Figure 3. The TCP FIN packet is queued behind the last data. Because a connection is bidirectional, it requires a total of four packets to shut down both directions of a connection. The side that first issues the shutdown() on the sending side of the socket performs an active close. The side that receives the FIN performs a passive close. The difference between these is important, because the side issuing an active close must also wait in the TIME_WAIT state for 2 x MSL^2. Some socket resources persist during the TIME_WAIT state. Because it is more critical to conserve server resources than client resources (see page 28), it is better practice to ensure the client issues the active close. See Servers Reuse Port and Address, on page 6, which discusses avoidance of the TIME_WAIT delay.

It is possible to shutdown() the receive side of a socket, but this is of little use, because shutting down the receive side does not result in a protocol exchange. In practice, the send direction of the socket is the more appropriate to shut down. Further attempts to send data on that socket will return an error. The peer reads the data until there is no more data in the receive socket buffer. When TCP processes the FIN packet, it closes that side of the connection and a subsequent recv() will return zero bytes (end-of-file) instead of data. This signals the receiver that the peer has no more data to send.
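The half-close described above can be sketched as follows. This is a minimal illustration, not the article's own code: the helper name half_close_and_drain is hypothetical, the received data is simply discarded, and with the BSD sockets API the end of the peer's data is recognized by recv() returning 0.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Signal "no more data from me" with a half-close, then keep reading until
 * the peer signals the same. Returns the number of bytes received after the
 * half-close, or -1 on error. Received data is discarded in this sketch. */
long half_close_and_drain(int sd)
{
    char buf[512];
    long total = 0;

    if (shutdown(sd, SHUT_WR) < 0)   /* on TCP, queues a FIN behind the data */
        return -1;

    for (;;) {
        ssize_t n = recv(sd, buf, sizeof(buf), 0);
        if (n < 0)
            return -1;
        if (n == 0)                  /* peer has also finished sending */
            break;
        total += n;
    }
    return total;
}
```

Once this function returns successfully, both directions have been signaled as finished and the socket can be closed.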

UDP applications do not provide for any protocol exchange when shutdown() is called. Instead, the programmer must design a message exchange that signals the end of transmission.

(^2) MSL = Maximum Segment Lifetime