The 7 tests of highly reliable server applications

Why does the server application process randomly terminate? Why does it hang every Tuesday at 2:00 AM? Why is it that, after a successful pilot test at headquarters, the people in the field offices all claim it is not usable? And why is it that the weekly price schedules sent to the Miami office always seem to be incomplete?

The problem is that testing during the development cycle, QA tests, and pilot studies are almost always conducted under ideal network conditions. You need to test under real world conditions. Unfortunately, it can be difficult to generate those conditions in a controlled manner suitable for testing.

This article describes 7 tests and how to reliably create the conditions for each. A server application that can handle all of them will go a long way toward eliminating unexplained terminations and hangs as well as usability and data validity problems.

The tests are designed to answer the question "how does the server application handle":

A failed connection attempt due to receiving a reset or a timeout
Not receiving any data after a successful connection
An immediate connection close after a successful connection establishment
Receiving data that does not conform to the application protocol at the start of the session or in the middle of a session
Terminating due to a TCP layer retransmission timeout
Receiving data that is a combination of multiple application messages or a fragment of an application message
Communication over a link with high packet loss, high latency or both

The answer to every one of these questions needs to be "without any problems".

A failed connection attempt due to receiving a reset or timeout

When a server's host system receives a connection request from a client it puts the connection in a half-open state and sends back a response. The normal result is an acknowledgment from the client's host system, and the connection moves to the established state. If the server's host system gets back a reset instead of an acknowledgment, or times out because it failed to get back any response, the results are less clear. Many operating systems will not even tell the server application process that a connection request was made. On those systems this test is an automatic pass. On other operating systems the server application process may get notified of a connection and then get an error when it tries to send or receive data. You need to be sure that the application handles the error, hopefully logging it, and continues to handle other connections, instead of continuing to try to use the socket or faulting in some manner.

What could cause these conditions? The timeout I have seen many times, always caused by a routing failure. The server's host system did not have a route back to the client host system. So far I have seen the reset only once. It was due to a NAT box that incorrectly responded with a reset instead of allowing the server host system's response through to the client host system. However, with more and more middleboxes (things like NAT, firewalls and WAN accelerators) being deployed I expect to see more of these types of errors.

How can you generate these errors? The network shown in figure 1 can be used to create both the reset and timeout scenario. Client 1 and client 2 both have the same IP address (10.1.1.3). Each is connected to a device that can act as a router. Client 2 is connected to a PC with software that allows it to act as a router (more on that later). Client 1 can be connected to a "real" router or a PC with the same routing software as router 2. Since each router has the same set of routes it is important that the routers do not broadcast their routes to each other. The routers' other interfaces are connected to a hub along with the server.

To generate a reset condition, set up a host route on the server that sends packets for 10.1.1.3 to router 2. Then start the client application on client 1 and have it connect to the server. The server will send its response to router 2, which will forward it to client 2. Since client 2 did not try to establish a connection it will respond with a reset.

To generate a timeout condition, have the routing software on router 2 drop 100% of the packets going between the server and client 2. See the section on "Communication over a link with high packet loss, high latency or both" for the details on how to set that up.

Figure 1 - network setup

This may seem like a lot of effort for a test of a not very likely scenario and it is. However, if the problem does occur the effort to diagnose it will far exceed the effort to set up and conduct this test.
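On the server side, the behavior this test is looking for (log the error, drop the one socket, keep serving everyone else) can be sketched roughly as follows. The helper names and the particular set of errno values treated as "this connection only" are my own assumptions for illustration, not taken from any specific server; adjust them for your platform.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the errno value reported by accept,
   recv or send concerns only the one connection, 0 for anything else. */
int is_per_connection_error (int iErr)
{
switch (iErr)
   {
   case ECONNRESET:	/* peer (or a middlebox) sent a reset */
   case ECONNABORTED:	/* connection died before accept returned it */
   case ETIMEDOUT:	/* no response ever came back */
   case EPIPE:		/* write on a connection that is already gone */
   case ENOTCONN:	/* socket is no longer connected */
      return 1;
   default:
      return 0;
   }
}

/* Hypothetical helper: log the error, then tell the caller what to do.
   Returns 1 = close this socket and keep serving other connections,
           0 = not a per-connection problem, caller decides. */
int handle_socket_error (const char *sWhere, int iErr)
{
assert (sWhere != NULL);
fprintf (stderr, "%s: %s\n", sWhere, strerror (iErr));
return is_per_connection_error (iErr);
}
```

A server loop would call handle_socket_error ("recv", errno) when recv returns -1 and, on a return of 1, close that one socket and go back to servicing the others.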

An immediate connection close after a successful connection establishment

The most likely cause of this condition is a port scan. This can be the prelude to an attack or your network security staff could be looking for unauthorized services running on the server. Unfortunately I have seen many server applications terminate when they try to read from the new socket and get an error.

The simplest way to duplicate this is with your own port scanning software. This can be downloaded from the Internet; just google "port scanning software". Alternatively you can modify the client application to close the connection as soon as it returns from the connect call.
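If you would rather not modify the real client, a stand-in test client that connects and closes at once takes only a few lines. This is a minimal sketch; the function name connect_and_close is my own, and it assumes an IPv4 address in dotted form.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Connect to sServerIP:iPort and close immediately without sending
   a single byte. Returns 0 if the connection was established and
   closed, -1 if the address was invalid or the connect failed. */
int connect_and_close (const char *sServerIP, int iPort)
{
int			sd;
struct sockaddr_in	addrServer;

assert (sServerIP != NULL);
memset (&addrServer, 0, sizeof (addrServer));
addrServer.sin_family	= AF_INET;
addrServer.sin_port	= htons ((unsigned short) iPort);
if (inet_pton (AF_INET, sServerIP, &addrServer.sin_addr) != 1)
   return -1;

if ((sd = socket (AF_INET, SOCK_STREAM, 0)) < 0)
   return -1;

if (connect (sd, (struct sockaddr *) &addrServer, sizeof (addrServer)) < 0)
   {
   close (sd);
   return -1;
   }

/* The whole point of the test: close right after connect returns */
close (sd);
return 0;
}
```

Calling connect_and_close ("192.168.1.1", 7777) in a loop while real clients are connected shows quickly whether the server keeps running.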

Not receiving any data after a successful connection

After the TCP layer connection is established either the client or server will send some data. Which one will do so depends on the application protocol. Assuming it's the client, you need to test what happens to the server application when the expected data is not received. The difference between this test and the previous one is that the connection is not closed. Unfortunately, I've seen cases where the server application does a blocking read immediately after accepting the connection. The result was that the server application stopped handling existing or new requests.

The most common cause of this is the wrong client making the connection. The client was either given the wrong IP address or port number and the application protocol that that client follows has the client waiting for the server to send first.

The easiest way to test this is to create a special client that doesn't send the expected data; instead it just hangs or loops. You want to make sure that it doesn't terminate, since that will close the connection.
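On the server side, one way to avoid the blocking-read hang described above is to wait for data with a time limit before reading. The following is only a sketch using poll; the function name, the RECV_TIMED_OUT value and the choice of timeout are my assumptions, and a real server would more likely fold this into its main event loop.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

#define RECV_TIMED_OUT (-2)	/* hypothetical return code, not an errno */

/* Wait up to iTimeoutMs milliseconds for data on sd, then read at
   most iLen bytes. Returns the number of bytes read, 0 if the peer
   closed the connection, -1 on error (see errno), or RECV_TIMED_OUT
   if nothing arrived in time. */
ssize_t recv_with_timeout (int sd, void *pBuf, size_t iLen, int iTimeoutMs)
{
struct pollfd	pfd;
int		iReady;

assert (pBuf != NULL);
pfd.fd		= sd;
pfd.events	= POLLIN;
pfd.revents	= 0;

/* Retry if a signal interrupts the wait. For simplicity the full
   timeout starts over after each interruption. */
do
   iReady = poll (&pfd, 1, iTimeoutMs);
while ((iReady < 0) && (errno == EINTR));

if (iReady < 0)
   return -1;
if (iReady == 0)
   return RECV_TIMED_OUT;

return recv (sd, pBuf, iLen, 0);
}
```

A server that gets RECV_TIMED_OUT on the first read of a new connection can log it, close that socket and carry on with its other clients.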

Receiving data that does not conform to the application protocol at the start of the session or in the middle of a session

This is a variation of the above scenario; instead of no data, the wrong data is sent. Again, the most common cause is the wrong client application connecting to the server. Once connected it sends something which is not what the server application is expecting. I've also seen version 1 of the client application connecting to version 2 of the server application, or was it the other way around.

To test this you definitely need a specialized client that sends the wrong data. You should program the client to send both ASCII and binary data and, of course, long strings to test for buffer overflow conditions. Each location in the server application that reads data from the network needs to be tested.
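The payload generator for such a client can be very small. Here is a sketch; the names fill_junk and junk_kind are made up for this example, and rand is good enough here because the goal is garbage, not cryptographic randomness.

```c
#include <assert.h>
#include <stdlib.h>

enum junk_kind { JUNK_ASCII, JUNK_BINARY };

/* Fill pBuf with iLen bytes that are unlikely to match any application
   protocol. JUNK_ASCII produces printable characters (0x20 to 0x7E),
   JUNK_BINARY produces arbitrary byte values including 0. */
void fill_junk (unsigned char *pBuf, size_t iLen, enum junk_kind kind)
{
size_t	i;

assert (pBuf != NULL);
for (i = 0; i < iLen; i++)
   {
   if (kind == JUNK_ASCII)
      pBuf [i] = (unsigned char) (0x20 + (rand () % 95));
   else
      pBuf [i] = (unsigned char) (rand () % 256);
   }
}
```

To cover the long string case, fill a buffer several times larger than the biggest legitimate message and send all of it, both as the first thing on a new connection and in the middle of an otherwise valid session.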

Terminating due to a TCP layer retransmission timeout

This will happen if the sending host sends a packet and never gets an acknowledgement. Eventually the TCP stack times out waiting for the acknowledgement and sends an indication back to the sending application process that it has terminated the connection. The most common cause of this is a link with a high packet loss. Other causes are the route between sender and receiver actually failing, the receiving host crashing or the firewall between the sender and receiver dropping the connection mapping.

The most common effect of this is that the data that the sending application is writing is incomplete. I can hear someone say that TCP guarantees delivery of data. Actually it doesn't. What it guarantees is that the data will be delivered or the sender will be notified that it could not be confirmed that the data was delivered. Notice that I said "confirmed". If the data packet made it to the receiver but the receiver's acknowledgments all failed to arrive, all the sender knows is that no acknowledgments were received. If the sender resends the data in a new connection the receiver can end up with duplicate data.

To test this you need to set up the connection between the server and client 2 (see figure 1) and then, at the appropriate times, modify the routing software on router 2 to drop 100% of the packets going between the server and client 2. See the section on "Communication over a link with high packet loss, high latency or both" for the details on how to set that up.
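What you are checking for in the sending application is that it notices the failure and knows how much it managed to hand off. A sketch of a send loop that reports both follows; the name send_all and its interface are my own invention. Keep in mind the point made above: the returned count is what the local TCP stack accepted, not what the receiver acknowledged.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Write all iLen bytes to a blocking socket. Returns the number of
   bytes the local TCP stack accepted. If that is less than iLen,
   *piErr holds the errno value that stopped the transfer (for example
   ETIMEDOUT after a retransmission timeout); otherwise *piErr is 0. */
size_t send_all (int sd, const void *pData, size_t iLen, int *piErr)
{
const char	*p = pData;
size_t		iDone = 0;
ssize_t		n;

assert ((pData != NULL) && (piErr != NULL));
*piErr = 0;
while (iDone < iLen)
   {
   n = send (sd, p + iDone, iLen - iDone, 0);
   if (n < 0)
      {
      if (errno == EINTR)
         continue;		/* interrupted by a signal, try again */
      *piErr = errno;		/* connection level failure, stop here */
      break;
      }
   iDone += (size_t) n;
   }
return iDone;
}
```

When send_all returns less than iLen the application should log the error and the count, and treat the transfer as incomplete rather than assume the data arrived.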

Receiving data that is a combination of multiple application messages or a fragment of an application message

TCP is a byte stream protocol; there is no such thing as a "message". When an application makes a call to send 1000 bytes and then another call to send another 1000 bytes the TCP stack can send that data in several ways. It can send one TCP segment of 1000 bytes followed by another TCP segment of 1000 bytes. Or it can send 1 segment of 1460 bytes (assuming that 1460 is the maximum segment size) followed by another segment of 540 bytes. If the maximum segment size is set to 536 bytes it can send the first 536 bytes of the first 1000 bytes, followed by a segment holding the last 464 bytes of the first 1000 plus the first 72 bytes of the second 1000 bytes. The third segment will then be bytes 73 through 608 of the second 1000 bytes and the fourth segment will be the last 392 bytes. Other variations are also possible. The point is that what is actually sent in a TCP segment by the TCP stack is totally outside of the application's control. Any application that assumes that when it does a read it will get a complete application message, and only 1 message, will fail at some point. Figuring out why the application failed at that point is almost impossible.
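The arithmetic above is easy to check with a few lines of code. This sketch models only the simplest case, where everything the application wrote is already queued and the stack fills each segment to the maximum segment size; real stacks may split the data differently. The function name is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Split iTotal queued bytes into segments of at most iMss bytes.
   Stores up to iMax segment lengths in aiSeg and returns the number
   of segments needed (which may be larger than iMax). */
size_t segment_lengths (size_t iTotal, size_t iMss, size_t *aiSeg, size_t iMax)
{
size_t	iCount = 0;
size_t	iLen;

assert ((iMss > 0) && (aiSeg != NULL));
while (iTotal > 0)
   {
   iLen = (iTotal < iMss) ? iTotal : iMss;
   if (iCount < iMax)
      aiSeg [iCount] = iLen;
   iCount++;
   iTotal -= iLen;
   }
return iCount;
}
```

For the two 1000 byte writes (2000 bytes in total) a maximum segment size of 1460 gives segments of 1460 and 540 bytes, and a maximum segment size of 536 gives 536, 536, 536 and 392 bytes. In neither case does a segment boundary line up with the boundary between the two application messages.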

I have seen many applications designed to read 2 bytes, treat them as a length and then read that number of bytes. Everything works fine until the length is split up by TCP and only 1 byte is read for the length. I have also seen applications where the receive buffer length is always set to the maximum message length. The assumption is made that no more than 1 message will be read at a time. Again things work fine until the first half of the "next" message is tacked onto the end of a short message. That first half of the "next" message is then discarded when the application processes only the current message and assumes that the entirety of the next message will be read by the next receive call. The first 2 bytes of the second half of that message are then treated as the length of the next message.
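One way to avoid both mistakes is to append whatever recv returns to a buffer and only pull out a message once all of it is there. The sketch below assumes a hypothetical protocol with a 2 byte length in network byte order followed by the message body, and a maximum body size of MAX_MSG; all of the names are invented for the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_MSG		1024			/* assumed largest legal body */
#define FRAME_BUF	(2 * (MAX_MSG + 2))

struct framer
   {
   unsigned char	abBuf [FRAME_BUF];
   size_t		iLen;			/* bytes currently buffered */
   };

/* Append bytes exactly as recv returned them.
   Returns 0 on success, -1 if they do not fit in the buffer. */
int framer_feed (struct framer *pF, const void *pData, size_t iLen)
{
assert ((pF != NULL) && (pData != NULL));
if (iLen > (FRAME_BUF - pF->iLen))
   return -1;
memcpy (&pF->abBuf [pF->iLen], pData, iLen);
pF->iLen += iLen;
return 0;
}

/* Extract one complete message into pMsg (at least MAX_MSG bytes).
   Returns 1 if a message was extracted (*piMsgLen is its length),
   0 if more bytes are needed, -1 if the length field is too large. */
int framer_next (struct framer *pF, unsigned char *pMsg, size_t *piMsgLen)
{
size_t	iBody;

assert ((pF != NULL) && (pMsg != NULL) && (piMsgLen != NULL));
if (pF->iLen < 2)
   return 0;				/* the length itself is split */
iBody = ((size_t) pF->abBuf [0] << 8) | pF->abBuf [1];
if (iBody > MAX_MSG)
   return -1;				/* not our protocol */
if (pF->iLen < (2 + iBody))
   return 0;				/* body is not complete yet */

memcpy (pMsg, &pF->abBuf [2], iBody);
*piMsgLen = iBody;

/* Keep whatever follows; it is the start of the next message */
pF->iLen -= (2 + iBody);
memmove (pF->abBuf, &pF->abBuf [2 + iBody], pF->iLen);
return 1;
}
```

After each recv the application calls framer_feed once and then framer_next in a loop until it returns 0, so a read that delivers half a length field or one and a half messages is handled the same way as a read that delivers exactly one message.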

Unfortunately there is no easy way to chop up and combine messages in the application. However, with the aid of a proxy you can at least increase the probability that the problem will occur. The proxy maintains 2 connections, one with the client host and one with the server host. Bytes sent by one get passed to the other. However, the bytes are put into a buffer and a random number of bytes, from 1 to the entire buffer, are sent at any one time. New data that arrives before the buffer is empty just gets appended to the end of the buffer.

Listing 1 provides a simple program to do this. This program was written for FreeBSD, the same OS that the router software discussed in the next section is written for. It should work on just about any UNIX or Linux system. Run it on the router 2 system with the command line:

mm PortNumber 192.168.1.1

The program will listen on the designated PortNumber and when a connection comes in it will then create a connection to 192.168.1.1 on the same port number. It is of course subject to the same TCP layer buffering discussed above but there is a good chance that the bytes received by the server application will be subsets and combinations of the original application messages.


/* This software is provided on an "AS IS" basis, WITHOUT ANY WARRANTY OR ANY SUPPORT OF ANY KIND. 
   The AUTHOR SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY
   PARTICULAR PURPOSE. This disclaimer applies, despite any verbal representations of any kind provided
   by the author or anyone else.
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>	/* bzero */
#include <errno.h>	/* declares errno, do not declare it yourself */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>

int main (int argc, char *argv [])
{
int			iPort;
char 			sServerIP [16];

int			sdListen, sdClient, sdServer;
struct sockaddr_in	addrListen, addrClient, addrServer;
socklen_t 		iClientLen;
int			iNonBlocking = 1;
int			iClientRecv, iClientSend, iServerRecv, iServerSend;
#define BUFLEN 10000
char			sClientBuffer [BUFLEN], sServerBuffer [BUFLEN];
int			iClientBufferLen, iServerBufferLen;
int			iClientReady, iServerReady;
int			iRandom;
float			fRandom, fRAND_MAX;
int			iClientCount, iServerCount;

fRAND_MAX = RAND_MAX;

/* Sanity check arguments. There should only be 3 arguments
   (the command, port and server IP), any more or less
   triggers the display of a usage message. */
if (argc != 3)
   {
   printf ("Usage\n\t%s PORT SERVER_IP_ADDRESS\n\n", argv [0]);
   exit (0);
   }

/* The second argument is the server IP address (not name). If
   it is longer than 15 characters it can't be an IP address,
   so report a problem */
if (strlen (argv[2]) > 15)
   {
   printf ("Length %zu of argument 2 (%s) is longer than maximum IP address\n",
	strlen (argv [2]), argv [2]);
   exit (0);
   }

/* Extract out the port number and server IP address from the
   arguments. Reprint the command line just as a sanity check. */
iPort = atoi (argv [1]);
strncpy (sServerIP, argv [2], strlen (argv[2]) + 1);
printf ("%s %d %s\n", argv [0], iPort, sServerIP);

/* Create a socket that will listen for the client's connection */
if ((sdListen = socket (AF_INET, SOCK_STREAM, 0)) < 0)
   {
   perror ("Error calling socket for sdListen");
   exit (errno);
   }

/* bind the client socket to the specified port on all interfaces
   (INADDR_ANY) */
bzero ((char *) &addrListen, sizeof (addrListen));
addrListen.sin_family		= AF_INET;
addrListen.sin_addr.s_addr	= htonl (INADDR_ANY);
addrListen.sin_port		= htons (iPort);
if (bind (sdListen, (struct sockaddr *) &addrListen, sizeof (addrListen)) < 0)
   {
   perror ("Error calling bind for addrListen");
   exit (errno);
   }

listen (sdListen, 5);

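/* Block until a connection request comes in. accept requires that
   iClientLen holds the size of addrClient on entry */
iClientLen = sizeof (addrClient);
if ((sdClient = accept (sdListen, (struct sockaddr *) &addrClient, &iClientLen)) < 0)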
   {
   perror ("Error calling accept for sdListen");
   exit (errno);
   }

/* Create a socket to be used to connect to the server */
if ((sdServer = socket (AF_INET, SOCK_STREAM, 0)) < 0)
   {
   perror ("Error calling socket for sdServer");
   exit (errno);
   }

/* Build a sockaddr structure containing the server's IP address
   and the port number we will be connecting to */
bzero ((char *) &addrServer, sizeof (addrServer));
addrServer.sin_family		= AF_INET;
addrServer.sin_port		= htons (iPort);
if (inet_aton (sServerIP, &addrServer.sin_addr) == 0)
   {
   printf ("Error calling inet_aton for sServerIP == %s\n", sServerIP);
   exit (0);
   }

/* Connect to the server */
if (connect (sdServer, (struct sockaddr *) &addrServer, sizeof (addrServer)) < 0)
   {
   perror ("Error calling connect for sdServer");
   exit (errno);
   }

/* Set the client and server sockets to non-blocking mode */
if (ioctl (sdClient, FIONBIO, &iNonBlocking) < 0)
   {
   perror ("Error calling ioctl for sdClient");
   exit (errno);
   }

if (ioctl (sdServer, FIONBIO, &iNonBlocking) < 0)
   {
   perror ("Error calling ioctl for sdServer");
   exit (errno);
   }

iClientBufferLen = iServerBufferLen = 0;
iClientReady = iServerReady = 1;
iClientCount = iServerCount = 0;

/* Loop until done.
   Done means that the client and server receive buffers are empty
   ((iClientBufferLen + iServerBufferLen) == 0)
   and either the client or the server has shut down the connection
   ((iClientReady * iServerReady) == 0)
*/
while (1)
   {
   if ((iClientBufferLen + iServerBufferLen + (iClientReady * iServerReady)) == 0)
      exit (0);

   /* Enter this block while the client is still connected or while data
      received from it is still waiting to be sent to the server */
   if (iClientReady || (iClientBufferLen > 0))
      {
      iClientRecv = 0;

      /* Receive from the client only if it is still connected and the
         buffer has room (a zero length recv would return 0 and be
         mistaken for a close). Put any data at the end of the buffer */
      if (iClientReady && (iClientBufferLen < BUFLEN))
         {
         iClientRecv = recv (sdClient, &sClientBuffer[iClientBufferLen], (BUFLEN - iClientBufferLen), 0);

         /* If recv returns a 0 it means the client has closed the connection */
         if (iClientRecv == 0)
            {
            printf ("Client terminated connection will exit when done sending to Server\n");
            iClientReady = 0;
            }

         /* If recv returns a -1 and errno is not EWOULDBLOCK some unexpected
            error has happened so terminate. If errno was EWOULDBLOCK it means
            that no data was available, just set the number of bytes
            received (iClientRecv) to 0 */
         if (iClientRecv < 0)
            {
            if (errno != EWOULDBLOCK)
               {
               perror ("Error calling recv for sdClient");
               exit (errno);
               }
            iClientRecv = 0;
            }
         }

      /* Calculate the new buffer length (iClientBufferLen) based on the
         old length (iClientBufferLen) and the number of bytes received
         (iClientRecv). Then calculate a random number of bytes
         to send to the server (iRandom). It is possible that the random
         number is 0. We count the number of 0 loops (iClientCount) and
         after 1000 0 loops we set the random number to 1. */
      iClientBufferLen += iClientRecv;
      if (iClientBufferLen > 0)
         {
         fRandom = random ();
         fRandom = fRandom / fRAND_MAX;
         iRandom = iClientBufferLen * fRandom;
         if ((iRandom == 0) && (++iClientCount > 1000))
            iRandom = 1;

         /* Set the loop counter back to 0 since we are going to send some
            characters. Send the characters and check for errors
            (iServerSend < 0). */
         if (iRandom > 0) 
            {
            iClientCount = 0;
            if ((iServerSend = send (sdServer, sClientBuffer, iRandom, 0)) < 0)
               {
               /* On a non-blocking socket EWOULDBLOCK only means that the
                  send buffer is full. Treat it as 0 bytes sent and try
                  again on a later pass. */
               if (errno != EWOULDBLOCK)
                  {
                  perror ("Error calling send for sdServer");
                  exit (errno);
                  }
               iServerSend = 0;
               }

            /* Shift the buffer by the number of characters that were actually
               sent (iServerSend). This number may be less than the number of
               characters that we requested to send (iRandom). It may also be
               0, which is not an error, just an indication that the send
               buffer is full. memmove is used because the source and
               destination regions overlap.
            */
            if (iServerSend > 0)
               {
               memmove (sClientBuffer, &sClientBuffer [iServerSend], iClientBufferLen - iServerSend);
               iClientBufferLen -= iServerSend;
               }
            }
         }
      }
 
   /* There are no comments in this part of the code. This is exactly like the
      previous block of code except that we are reading from the server and
      sending to the client.
   */ 
   if (iServerReady || (iServerBufferLen > 0))
      {
      iServerRecv = 0;
      if (iServerReady && (iServerBufferLen < BUFLEN))
         {
         iServerRecv = recv (sdServer, &sServerBuffer [iServerBufferLen], (BUFLEN - iServerBufferLen), 0);
         if (iServerRecv == 0)
            {
            printf ("Server terminated connection will exit when done sending to Client\n");
            iServerReady = 0;
            }
         if (iServerRecv < 0)
            {
            if (errno != EWOULDBLOCK)
               {
               perror ("Error calling recv for sdServer");
               exit (errno);
               }
            iServerRecv = 0;
            }
         }
      iServerBufferLen += iServerRecv;
      if (iServerBufferLen > 0)
         {
         fRandom = random ();
         fRandom = fRandom / fRAND_MAX;
         iRandom = iServerBufferLen * fRandom;
         if ((iRandom == 0) && (++iServerCount > 1000))
            iRandom = 1;
         if (iRandom > 0)
            {
            iServerCount = 0;
            if ((iClientSend = send (sdClient, sServerBuffer, iRandom, 0)) < 0)
               {
               if (errno != EWOULDBLOCK)
                  {
                  perror ("Error calling send for sdClient");
                  exit (errno);
                  }
               iClientSend = 0;
               }
            if (iClientSend > 0)
               {
               memmove (sServerBuffer, &sServerBuffer [iClientSend], iServerBufferLen - iClientSend);
               iServerBufferLen -= iClientSend;
               }
            }
         }
      }
   }
}

Listing 1 - mm.c

Figure 2 is a highly edited trace of packets received and sent by router 2 (10.1.1.2) running the mm program. The client (10.1.1.3) sent 5 application messages of 100 bytes each. The mm application sent them on to the server (192.168.1.1) as TCP segments of lengths 84, 116, 11, 189, 89, and 11 bytes. Packets 130 and 141, with lengths greater than 100, must have combined the last part of one application message and the first part of another. Packets containing just acknowledgments have been deleted to make the packets containing data and their lengths more obvious.

 No. Source     Destination       Port               Info
 121 10.1.1.3   10.1.1.2      4459 > 7777   Seq=1   Ack=1 Len=100
 122 10.1.1.2   192.168.1.1   52060 > 7777  Seq=1   Ack=1 Len=84
 127 10.1.1.3   10.1.1.2      4459 > 7777   Seq=101 Ack=1 Len=100
 130 10.1.1.2   192.168.1.1   52060 > 7777  Seq=85  Ack=1 Len=116
 134 10.1.1.3   10.1.1.2      4459 > 7777   Seq=201 Ack=1 Len=100
 135 10.1.1.2   192.168.1.1   52060 > 7777  Seq=201 Ack=1 Len=11
 137 10.1.1.3   10.1.1.2      4459 > 7777   Seq=301 Ack=1 Len=100
 141 10.1.1.2   192.168.1.1   52060 > 7777  Seq=212 Ack=1 Len=189
 144 10.1.1.3   10.1.1.2      4459 > 7777   Seq=401 Ack=1 Len=100
 145 10.1.1.2   192.168.1.1   52060 > 7777  Seq=401 Ack=1 Len=89
 150 10.1.1.2   192.168.1.1   52060 > 7777  Seq=490 Ack=1 Len=11

Figure 2 - A trace showing the length of TCP segments

Communication over a link with high packet loss, high latency or both

Many applications that work well in a LAN environment fail in a WAN environment. The problem is LANs have low latency, low rates of packet loss, and high bandwidth while WANs typically have much higher latency and packet loss and much lower bandwidth. Application protocols that rely on multiple exchanges between client and server fare the worst.

To test this you either have to test over the actual WANs your users will be using or employ some kind of WAN simulator. These can range in price from thousands of dollars to free. Dummynet, which is part of the FreeBSD UNIX distribution, is one of the free ones.

In figure 1, router 2 is running FreeBSD. This can be a standard distribution that is loaded on the PC's hard disk or the PC can be booted using a CD with a bootable version of FreeBSD (see www.freesbie.org). Once loaded the following commands can be used to set IP forwarding, load dummynet, and configure rules to set a delay of 250ms, 10% packet loss and a bandwidth equivalent to a T1. More details about dummynet can be found at info.iet.unipi.it/~luigi/ip_dummynet.

sysctl net.inet.ip.forwarding=1
kldload dummynet
ipfw flush
ipfw add 3000 pipe 1 ip from 10.1.1.3 to 192.168.1.1 in
ipfw pipe 1 config delay 250ms plr .1 bw 1544Kbits/s

To set 100% packet loss change the "plr .1" to "plr 1".

Summary

Application problems caused by network conditions are some of the hardest to diagnose and fix. The conditions are almost always unpredictable, 20 minutes of problems followed by 3 weeks of no problems or 5 minutes every day, but every day it's a different 5 minutes. Simulating these conditions in a test environment and fixing any observed problems will greatly improve the reliability of your application.

This page was last modified on 10-11-26

Send comments and suggestions
to ndav1@cox.net

Note
This article was originally supposed to be published by Dr. Dobb's Journal. At least I signed a contract in April of 2006 indicating that they would publish it and pay for the rights. However, they never published it or paid for the rights, so after 6 months of trying to get them to tell me their plans for the article I have decided to publish it here. I understand that they have the right to change their mind about the article, but I think it is very unprofessional of them that they failed to communicate their decision to me or respond to my repeated attempts to communicate with them.