What’s Wrong with the Internet and How to Fix It
—Interview with Internet Pioneer John Day
Forty years after the first two-network communications test between Stanford and University College London using Transmission Control Protocol/Internet Protocol (TCP/IP), the wholesale adoption of broadband technology and the propitious language of ‘Web 2.0’ collectively paint an image of Internet-related technologies as well advanced on their ‘beta versions’. Recently, however, Internet pioneer John Day has argued that ‘the Internet is an unfinished demo’ (Patterns), suggesting that the protocol suite on which the Internet is built is fundamentally flawed. Just as ‘protocol’ commonly refers to conventions for acceptable behavior, so too, in the context of networks, does protocol refer to a standardized set of rules defining how to encode and decode 1s and 0s as they move from network to network. Simply put, if it were not for the protocol suite TCP/IP, we would not have the Internet as we know it today. In view of Day’s critique of that protocol suite, we are left to wonder how it is that we have become blind not only to the flaws of the Internet but also to how and why it works the way that it does.
While it is becoming more common for us to stop and ask where the Internet is and how it works in terms of its physical infrastructure, the question of the implications of the protocol standards and how they ‘determine our situation’, as German media theorist Friedrich Kittler (Gramophone xxxix) would put it, remains relatively unexplored in contemporary media studies. Outside of work by media studies scholar Alexander Galloway, a select few policy advocates and Internet governance scholars such as Tim Wu and Laura DeNardis, and computer historians such as Andrew L. Russell, there is little critical reflection on the protocol standards that make the Internet possible in the first place.
In the course of laying groundwork for further in-depth and broad-reaching media studies work on some of the biases underlying TCP/IP, then, I approached John Day about the problems inherent in that Internet protocol suite. In the email interview that follows, Day details what he identifies as five specific flaws in the TCP/IP model that are still entrenched in our contemporary Internet architecture, and indicates the ways in which a different structure (like the one proposed by the French CYCLADES group) for handling network congestion would have made the current issue of net neutrality beside the point.
We’ve become so used to the usual narrative about how the Internet is an American invention and (sometimes, therefore) one that is inherently ‘free’, ‘open’, and ‘empowering’ that we are immune to seeing how this network of networks is working on us rather than us on it. Analysing the technical specifications underlying the Internet is vital, then, to understanding how we are unwittingly living out the legacy of the power/knowledge structures that produced TCP/IP. How, for instance, are we now extensions of the specific Internet architecture that we have inherited from the 1970s? Beyond these ‘theoretical’ concerns, understanding the specifics of TCP/IP matters for more pragmatic reasons, too, because, as it happens—and as Day explains below—the Internet does not currently work particularly well, and it will likely function even worse in the coming years as we struggle to keep up with demands for address space and try to control network congestion.
***
Emerson: You’ve written quite vigorously about the flaws of the TCP/IP model that go all the way back to the 1970s and about how our contemporary Internet is living out the legacy of those flaws. Particularly, you’ve pointed out repeatedly over the years how the problems with TCP were carried over not from the American ARPANET (the early packet switching network run by the U.S. Department of Defense's Advanced Research Projects Agency) but from an attempt to create a transport protocol that was different from the one proposed by the French CYCLADES group. First, could you explain to readers what CYCLADES did that TCP should have done?
Day: There were several fundamental properties of networks the Internet group overlooked from the early 1972 insights of CYCLADES as well as Richard Watson’s subsequent insights around synchronisation in data transfer. At stake here were the following:
1) the nature of layers
2) why the layers they had were there
3) the fact that congestion could occur in datagram networks
4) a complete naming and addressing model
5) the fundamental conditions for synchronisation.
Failure to appreciate these properties led the Internet group to take a raft of missteps, in addition to some which were unrelated to these issues.
First and probably foremost was the problem of layers. Computer scientists use layers to structure and organise complex pieces of software. Think of a layer as a black box that does something, but the internal mechanism is hidden from the user of the box. One example is a black box that calculates the 24-hour weather forecast. We put in data about temperature, pressure and wind speed and out pops a 24-hour weather forecast. We don’t have to understand how the black box did it. We don’t have to interact with all the different aspects it went through to do that. The black box hides the complexity so we can concentrate on other complicated problems for which the output of the black box is input. The operating system of your laptop is a black box. It does incredibly complex things but you don’t see what it is doing.
Similarly, the layers of a network are organised that way. For the ARPANET group, Bolt, Beranek and Newman (BBN) was responsible for building the network of IMPs (Interface Message Processors) and switches (or, in today’s terminology, routers) while each site (UCLA, SRI, Stanford, Illinois, Utah, MIT, BBN, MITRE, etc.) was responsible for the hosts. For them, the network of IMPs was a black box that delivered packets. Consequently, for developers of host software the concept of layers focused on the black boxes in the hosts where their primary purpose was modularity. The layers in the ARPANET hosts were the Physical Layer or the wire; the IMP-Host Protocol; the Network Control Program (NCP), which managed the flows between applications; and the applications, such as Telnet, which is a terminal device driver protocol, and maybe File Transfer Protocol (FTP). For the Internet, they were the physical layer or wire; Ethernet; and IP and TCP, which manages the flows between applications such as Telnet or HTTP, etc. It is important to remember that the ARPANET was built to be a production network to lower the cost of doing research on a variety of scientific and engineering problems.
The CYCLADES group, on the other hand, was building a network to do research on the nature of networks. They were looking at the whole system to understand how it was supposed to work. They saw that layers were more than just local modularity but also a set of cooperating processes in different systems. Most importantly, they realised that different layers had different scopes, i.e. each layer had a number of different elements within it. This concept of the scope of a layer is the most important property of layers. Those who worked on the Internet never understood the importance of the scope of a layer.
The layers that the CYCLADES group came up with in 1972 were (1) the Physical Layer (the wires that go between boxes) and (2) the Data Link Layer, which operates over physical media, detects errors on the wire, and in some cases keeps the sender from overrunning the receiver. But most physical media have limitations on how far they can be used. The further data is transmitted on them, the more likely there are errors. So these wires may be short. To go longer distances, a higher layer with greater scope to relay the data exists over the Data Link Layer. This is traditionally called (3) the Network Layer.
Of course, the transmission of data is not just done in straight lines but as a network, with the consequence that there are alternate paths. We can show from queuing theory that regardless of how lightly loaded a network is it can become congested, with too many packets trying to get through the same router at the same time. If the congestion lasts too long, it will get worse and worse and eventually the network will collapse. It can be shown that no amount of memory in the router is enough, so when congestion happens packets must be discarded. To recover from this, we need (4) a Transport Layer protocol, mostly to recover packets lost due to congestion. The CYCLADES group realised this, which is why there is a Transport Layer in their model. They started doing research on congestion around 1972. By 1979, there had been enough research that a conference was held near Paris. DEC and others in the US were doing research on it too. Those working on the Internet didn’t understand that a collapse due to congestion could happen until 1986, when it happened to the Internet. So much for seeing problems before they occur.
Emerson: Before we go on, can you expand more on how and why the Internet collapsed in 1986?
Day: There are situations where too many packets arrive at a router and a queue forms, like everyone showing up at the cash register at the same time, even though the store isn’t crowded. The network (or store) isn’t really overloaded but it is experiencing congestion. However, in the Transport Layer of the network, the TCP sender is waiting to get an acknowledgement (known as an ‘ack’) from the destination that indicates the destination got the packet(s) it sent. If the sender does not get an ack in a certain amount of time, the sender assumes that the packet and possibly others were lost or damaged, and re-transmits everything it sent since it sent the packet that timed out. If the reason the ack didn’t arrive is that it was delayed too long at an intervening router and the router has not been able to clear its queue of packets to forward before this happens, the retransmissions will just make the queue at that router even longer. Now remember, this isn’t the only TCP connection whose packets are going through this router. Many others are too. And as the day progresses, there is more and more load on the network with more connections doing the same thing. They are all seeing the same thing contributing to the length of the queue. So while the router is sending packets as fast as it can, its queue is getting longer and longer. In fact, it can get so long and delay packets so much, that the TCP sender’s timers will expire again and it will re-transmit again, making the problem even worse. Eventually, the throughput drops to a trickle.
As you can see, this is not a problem of not enough memory in the router; it is a problem of not being able to get through the queue fast enough. (Once there are more packets in the queue than the router can send before retransmissions are triggered, collapse is assured.) Of course delays like that at one router will cause similar delays at other routers. The only thing to do is discard packets.
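The feedback loop Day describes can be sketched as a toy simulation. This is an illustration under simplifying assumptions rather than a model of the 1986 event: a single FIFO router with an unbounded queue, instantaneous acks, a fixed retransmission timeout, and a brief burst that pushes queueing delay past that timeout. The function name, parameters and numbers are all hypothetical.

```python
from collections import deque

def simulate(ticks, capacity, load, timeout, burst=(0, 0, 0)):
    """Return unique packets delivered per tick (goodput) for one FIFO router.

    burst = (start_tick, end_tick, extra_packets_per_tick) is a hypothetical
    transient overload used to trigger the retransmission feedback loop.
    """
    queue = deque()      # packet ids waiting at the router (may hold duplicates)
    last_sent = {}       # undelivered packet id -> tick of its most recent (re)send
    next_id = 0
    goodput = []
    for t in range(ticks):
        # Senders retransmit anything that has gone unacknowledged for `timeout` ticks.
        for pid, sent in list(last_sent.items()):
            if t - sent >= timeout:
                queue.append(pid)
                last_sent[pid] = t
        # New traffic arrives.
        start, end, extra = burst
        arrivals = load + (extra if start <= t < end else 0)
        for _ in range(arrivals):
            queue.append(next_id)
            last_sent[next_id] = t
            next_id += 1
        # The router forwards as fast as it can.
        delivered = 0
        for _ in range(min(capacity, len(queue))):
            pid = queue.popleft()
            if pid in last_sent:       # first copy to get through; the ack stops the timer
                del last_sent[pid]
                delivered += 1         # later copies of this packet are wasted work
        goodput.append(delivered)
    return goodput

# Same steady load in both runs; only the second sees a five-tick burst.
steady = simulate(ticks=200, capacity=10, load=8, timeout=5)
collapsed = simulate(ticks=200, capacity=10, load=8, timeout=5, burst=(10, 15, 40))
print(sum(steady[-50:]), sum(collapsed[-50:]))
```

In the steady run the router keeps up and every packet is delivered in the tick it arrives. In the second run the offered load returns to normal after the burst, yet the queue never drains: the router stays busy forwarding duplicates, so useful throughput stays low even though nothing ran out of memory.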
What you see in terms of the throughput of the network vs load is that throughput will climb very nicely, then it begins to flatten out as the capacity of the network is reached. As congestion takes hold and the queues get longer, throughput starts to go down until it is just a trickle. The network has collapsed. The group of engineers under the ARPA contract did not see this coming. John Nagle of Ford Motor Company warned them in 1984 of the congestion problems they were seeing in their network, but the Internet engineers under the DARPA contract ignored it. They were the Internet; what did someone from Ford Motor Company know? It was a bit like the Frank Zappa song: ‘It can’t happen here’. They will say (and have said) that because the ARPANET handled congestion control, they never noticed it could be a problem. As more and more IP routers were added to the Internet, the ARPANET became a smaller and smaller part of the Internet as a whole and it no longer had sufficient influence to hold the congestion problem at bay.
This is an amazing admission. They shouldn’t have needed to see it happen to know that it could. Everyone else knew about it and had for well over a decade. CYCLADES had been doing research on the problem since the early 1970s. The Internet’s inability to see problems before they occur is not unusual. So far we have been lucky and Moore’s Law has bailed us out each time.
Emerson: Thank you. It’s incredible to hear the Internet could have been structured otherwise if those working on it had not (willfully?) overlooked what the French had already figured out about layers and congestion. Please continue on about what CYCLADES did that TCP should have done.
Day: The other thing that CYCLADES noticed about layers in networks was that they weren’t just modules. They realised this because they were looking at the whole network. They realised that layers in networks were more general because they used protocols to coordinate their actions in different computers. Layers were distributed shared state with different scopes. Scope? Think of it as building with bricks. At the bottom, we use short bricks to set a foundation, protocols that go a short distance. On top of that are longer bricks, and on top of that longer yet. So what we have is the Physical and Data Link Layer with one scope; and the Network and Transport Layers with a larger scope over multiple Data Link Layers. Around 1972, researchers started to think about networks of networks. The CYCLADES group realised that the Internet Transport Layer was a layer of greater scope yet, since it operated over multiple networks. So by the mid-1970s, they were looking at a model that consisted of Physical and Data Link Layers of one small scope that is used to create networks with a Network Layer of greater scope, and an Internet Layer over multiple networks of greater scope yet. The Internet today has the model I described above for a network architecture of two scopes, not an internet of three scopes.
Why is this a problem? Because congestion control goes in that middle scope. Without that scope, the Internet group put congestion control in TCP, which is about the worst place to put it. That choice thwarts any attempt to provide Quality of Service for voice and video, which must be done in the Network Layer, and it ultimately precipitated a completely unnecessary debate over net neutrality.
Emerson: Do you mean that a more sensible structure to handle network congestion would have made the issue of net neutrality beside the point? Can you say anything more about this? I’m assuming others besides you have pointed this out before?
Day: Yes, this is my point, and I am not sure that anyone else has pointed it out, at least not clearly. It is a little hard to see clearly when you’re ‘inside the Internet’. There are several points of confusion in the net neutrality issue. One is that most non-technical people think that bandwidth is a measure of speed when it is more a measure of capacity. Bits move at the speed of light (or close to it) and they don’t go any faster or slower. The only aspect of speed in bandwidth is how long it takes to move a fixed number of bits, and whatever that is consumes the capacity of a link. If a link has a capacity of 100Mb/sec and I send a movie at 50Mb/sec, I only have another 50Mb/sec I can use for other traffic. So to some extent, talk of a ‘fast lane’ doesn’t make any sense. Again, bandwidth is a measure of capacity.
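The capacity arithmetic in this answer can be written out as a small sketch. The function names are hypothetical, and the 8,000-megabit file size is an assumed figure chosen only to make the division easy to follow.

```python
def remaining_capacity_mbps(link_mbps, flows_mbps):
    """Capacity left on a link once the listed flows have taken their share."""
    used = sum(flows_mbps)
    if used > link_mbps:
        raise ValueError("flows exceed the capacity of the link")
    return link_mbps - used

def transfer_seconds(size_megabits, rate_mbps):
    """How long a fixed number of bits occupies the link at a given rate."""
    return size_megabits / rate_mbps

# Day's example: a 50 Mb/sec movie on a 100 Mb/sec link leaves 50 Mb/sec for everything else.
print(remaining_capacity_mbps(100, [50]))

# An assumed 8,000-megabit file at 50 Mb/sec ties up that capacity for 160 seconds.
print(transfer_seconds(8000, 50))
```

Nothing in either function makes a bit travel faster. A bigger allocation only shortens the time a transfer occupies the link, which is the sense in which bandwidth measures capacity.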
For example, you have probably heard the argument that Internet providers like Comcast and Verizon want ‘poor little’ Netflix to pay for a higher speed, to pay for a faster lane. In fact, Comcast and Verizon are asking Netflix to pay for more capacity! Netflix uses the rhetoric of speed to wrap themselves in the flag of net neutrality for their own profit and to bank on the fact that most people don’t understand that bandwidth is capacity. Netflix is playing on people’s ignorance.
From the earliest days of the Net, providers have had an agreement that as long as the amount of traffic going between them is about the same in both directions they don’t charge each other, the idea being that it would all come out in the wash. But if the traffic became lop-sided, if one was sending much more traffic into the other than the other was sending back, then they would charge each other. This is just fair. Because movies consume a lot of capacity, Netflix is suddenly generating considerable load that wasn’t there before. This isn’t about blocking a single Verizon customer from getting his or her movie; this is about the thousands of Verizon customers all downloading movies at the same time, with all of that capacity being consumed at a point between Netflix’s network provider and Verizon. It is even likely they didn’t have lines with that much capacity, so new ones had to be installed. That is very expensive. Verizon wants to charge Netflix or Netflix’s provider because the traffic moving from them to Verizon is now lop-sided by a lot. This request is perfectly reasonable and it has nothing to do with the Internet being neutral. Here’s an analogy: imagine your neighbor suddenly installed an aluminium smelter in his home and was going to use 10,000 times more electricity than he used to. He then tells the electric company they have to install much higher capacity power lines to his house and provide all of that electricity and his monthly electric bill should not go up. I doubt the electric company would be convinced.
Net neutrality basically confuses two things: traffic engineering and discriminating against certain sources of traffic. The confusion arises from flaws introduced fairly early and from what those flaws forced the makers of Internet equipment to do to work around them. Internet applications don’t tell the network what kind of service they need from the Net. So when customers started to demand better quality for voice and video traffic, the providers had two basic choices: over-provision their networks to run at about 20% efficiency (you can imagine how well that went over) or push the manufacturers of routers to provide better traffic engineering. Because of the problems in the Internet, about the only option open to manufacturers was for them to look deeper into the packet rather than just making sure they routed the packet to its destination. However, looking deeper into a packet also means being able to tell who sent it. (If applications start encrypting everything, this will no longer work.) This of course not only makes it possible to know which traffic needs special handling, but makes it tempting to slow down a competitor’s traffic. Had the Net been properly structured to begin with (and in ways we knew about at the time), then these two things would be completely distinct: one would have been able to determine what kind of packet was being relayed without also learning who was sending it, and net neutrality would only be about discriminating between different sources of data, so that traffic engineering would not be part of the problem at all.
Of course, Comcast shouldn’t be allowed to slow down Skype traffic because it is in competition with Comcast’s phone service—or Netflix traffic that is in competition with its on-demand video service. But if Skype and Netflix are using more than ordinary amounts of capacity, then of course they should have to pay for it.
Emerson: That takes care of three of the five issues related to TCP: the nature of layers, the reasoning behind the development of specific layers, and the fact that network congestion can occur. What about the next two?
Day: The next two are somewhat hard to explain to a lay audience but let me try. A Transport Protocol like TCP has two major functions: 1) to make sure that all of the messages are received and put in order, and 2) to prevent the sender from sending so fast that the receiver has no place to put the data. Both of these require the sender and receiver to coordinate their behavior. This is often called feedback, where the receiver is feeding back information to the sender about what it should be doing. We could do this by having the sender send a message and the receiver send back a special message that indicates it was received (the ‘ack’ we mentioned earlier) and to send another. However, this process is not very efficient. Instead, we like to have as many messages as possible ‘in flight’ between them, so they must be loosely synchronised. However, if an ack is lost, then the sender may conclude the messages were lost and re-transmit data unnecessarily. Or worse, the message telling the sender how much it can send might get lost. The sender is waiting to be told it can send more, while the receiver thinks it told the sender it could send more. This is called deadlock.
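The deadlock Day describes can be illustrated with a toy credit scheme. This is a sketch under stated assumptions, not TCP's actual mechanism: the sender may have at most `window` messages in flight, the receiver grants fresh credit in a single update message after each batch, and `lose_update_number` is a hypothetical knob that drops one of those updates. There are deliberately no timers in the model.

```python
def transfer(total_messages, window, lose_update_number=None):
    """Credit-based flow control without timers.

    Returns (messages_delivered, outcome), where outcome is "done" or "deadlock".
    """
    credit = window          # the sender's view of how much it may still send
    delivered = 0
    updates_sent = 0
    while delivered < total_messages:
        if credit == 0:
            # The sender waits for credit; the receiver believes it already granted it.
            return delivered, "deadlock"
        # Send as many messages as credit allows, all 'in flight' together.
        batch = min(credit, total_messages - delivered)
        credit -= batch
        delivered += batch
        # The receiver frees its buffers and grants the credit back in one message.
        updates_sent += 1
        if updates_sent != lose_update_number:
            credit += batch
    return delivered, "done"

print(transfer(10, 3))                          # every credit update arrives
print(transfer(10, 3, lose_update_number=2))    # the second update is lost in transit
```

With every update delivered, all ten messages get through. Losing a single update leaves both sides waiting on each other, which is why real protocols pair this kind of feedback with timers.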
In the early days of protocol development a lot of work was done to figure out what sequence of messages was necessary to achieve synchronisation. Engineers working on TCP decided that a three-way exchange of messages (or a three-way handshake) could be used at the beginning of a connection. This is what is currently taught in all of the textbooks.
However, in 1978 Richard Watson made a startling discovery: the message exchange was not what achieved the synchronisation. Rather, synchronisation was achieved by explicit bounding of three timers that occurred during data transfer.1 The messages are basically irrelevant to the problem. I can’t impress on you enough what an astounding result this is. It is an amazingly deep, fundamental result—Nobel Prize level! It not only yields a simpler protocol, but one that is more robust and more secure than TCP. Other protocols, notably the OSI (Open Systems Interconnection) Class 4 Transport Protocol, incorporate Watson’s result but TCP only partially does and not the parts that improve security. We have also found that this implies the bounds of what is networking. If an exchange of messages bounds Maximum Packet Lifetime, it is networking or interprocess communication. If it isn’t bounded, then it is merely a remote file transfer. Needless to say, simplicity, robustness, and security are all hard to get too much of.
Addressing is even more subtle and its ramifications even greater. The simple view is that if we are to deliver a message in a network, we need to say where the message is going. It needs an address, just like when you mail a letter. While that is the basic problem to be solved, it gets a bit more complicated with computers. In the early days of telephones and even data communications, addressing was not a big deal. The telephones or terminals were merely assigned the names of the wire that connected them to the network. (This is sometimes referred to as ‘naming the interface’.) Until fairly recently, the last 4 digits of your phone number were the name of the wire between your phone and the telephone office (or exchange) where the wire came from. In data networks, this often was simply assigning numbers in the order the terminals were installed.
But addressing for a computer network is more like the problem in a computer operating system than in a telephone network. We first saw this difference in 1972. The ARPANET did addressing just like other early networks. IMP addresses were simply numbered in the order they were installed. A host address was an IMP port number, or the wire from the IMP to the host.2 In 1972, Tinker Air Force Base joined the Net and was to have two connections to the ARPANET for redundancy. When my boss related this to me, I first said, ‘Great! Good ide . . .’ I didn’t finish the thought, and instead said, ‘Oh, crap! That won’t work!’ (It was a head slap moment!) And a half second after that said, ‘Oh, not a big deal, we are operating system guys, we have seen this before. We need to name the node.’
Why wouldn’t it work? If Tinker had two connections to the network, each one would have a different address because they connected to different IMPs. The host knows it can send on either interface, but the network doesn’t know it can deliver on either one. To the network, it looks like two different hosts. The network couldn’t know those two interfaces went to the same place. But, as I said, the solution is simple: the address should name the node, not the interface.3
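The Tinker problem can be sketched in a few lines. The tables below are hypothetical stand-ins, not the ARPANET's actual data structures: an interface address is modelled as an (IMP, port) pair, and the node name `"tinker"` is an illustrative label.

```python
# Hypothetical attachment points for one dual-homed host: (IMP name, port number).
TINKER_A = ("imp1", 0)
TINKER_B = ("imp2", 3)

def deliver_by_interface(dest_interface, links_up):
    """Interface addressing: the address IS the wire, so there is no alternative."""
    return dest_interface if links_up.get(dest_interface, False) else None

def deliver_by_node(node, attachments, links_up):
    """Node addressing: the network knows every wire that reaches the node."""
    for point in attachments.get(node, []):
        if links_up.get(point, False):
            return point
    return None

attachments = {"tinker": [TINKER_A, TINKER_B]}
links_up = {TINKER_A: False, TINKER_B: True}    # the first connection has failed

# Traffic addressed to the failed interface is undeliverable, although the host is reachable.
print(deliver_by_interface(TINKER_A, links_up))
# Traffic addressed to the node falls back to the surviving connection.
print(deliver_by_node("tinker", attachments, links_up))
```

With interface addresses the redundancy is invisible to the network, because the two wires look like two unrelated hosts. Naming the node is what lets the second connection do its job.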
Just getting to the node is not enough. We need to get to an application on the node. So we need to name the applications we want to talk to as well. Moreover, we don’t want the name of the application to be tied to the computer it is on. We want to be able to move the application and still use the same name. In 1976, John Shoch at Xerox PARC put it thus: application names indicate what you want to talk to; network addresses indicate where it is; and routes tell you how to get there.
The Internet still only has interface addresses. They have tried various work-arounds to solve not having two-thirds of what is necessary. But like many kludges, they only kind of work, as long as there aren’t too many hosts that need it. They don’t really scale. But worse, none of them achieve the huge simplification that naming the node does. These problems are as big a threat to the future of the Internet as the congestion control and security problems. And before you ask, no, IPv6 that you have heard so much about does nothing to solve them. Actually from our work, the problem IPv6 solves is a non-problem, if you have a well-formed architecture to begin with.
The biggest problem is router table size. Each router has to know where next to send a packet. For that it uses the address. However, for years the Internet continued to assign addresses in order. So unlike a letter, where your local post office can look at the State or Country and know which direction to send it, the Internet addresses didn’t have that property. Hence, routers in the core of the Net needed to know where every address went. As the Internet boom took off that table was growing exponentially and was exceeding 100K routes. (This table has to be searched on every packet.) Finally in the early 90s, they took steps to make IP addresses more like postal addresses. However, since they were interface addresses, they were structured to reflect what provider’s network they were associated with, i.e. the ISP becomes the State part of the address. If one has two interfaces on different providers, the problem above is not fixed. Actually, it needs a provider-independent address, which also has to be in the router table. Since even modest sized businesses want multiple connections to the Net, there are a lot of places with this problem and router table size keeps getting bigger and bigger. It is now around 500K routes; 512K is an upper bound that we can go beyond, but doing so impairs adoption of IPv6. In the early 90s, there was a push to name the node rather than the interface, a practice already deployed and widely used in the routers. But the IETF (Internet Engineering Task Force) refused to consider breaking with tradition. Had they done that it would have reduced router table size by a factor of between 3 and 4, so router table size would be closer to 150K. In addition, naming only the interface makes supporting mobile access to networks a complex mess.
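The effect of provider-based, postal-style addresses on table size can be sketched with Python's standard `ipaddress` module. The prefixes are taken from the ranges reserved for documentation, and the split into four customers plus one multihomed site is an assumption made for illustration.

```python
import ipaddress

# Four hypothetical customers numbered out of one provider's block.
provider_customers = [ipaddress.ip_network(p) for p in (
    "203.0.113.0/26", "203.0.113.64/26", "203.0.113.128/26", "203.0.113.192/26")]

# A multihomed site needs a provider-independent prefix that cannot be folded in.
provider_independent = [ipaddress.ip_network("198.51.100.0/25")]

def core_routes(prefixes):
    """Merge prefixes that share a covering prefix, as a core router table would."""
    return list(ipaddress.collapse_addresses(prefixes))

def lookup(table, address):
    """Longest-prefix match: the most specific route that contains the address."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    return max(matches, key=lambda net: net.prefixlen) if matches else None

table = core_routes(provider_customers + provider_independent)
print(len(provider_customers + provider_independent), "prefixes ->", len(table), "routes")
print(lookup(table, "203.0.113.77"))
```

The four customer prefixes collapse into a single route for the provider, while the provider-independent prefix stays in the table as its own entry. Every multihomed site adds one such entry, which is the growth Day is pointing at.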
Emerson: I see—so every new ‘fix’ to make the Internet work more quickly and efficiently is only masking the fundamental underlying problems with the architecture itself. What is the last flaw in TCP you’d like to touch on before we wrap up?
Day: Well, I wouldn’t say ‘more quickly and efficiently’. We have been throwing Moore’s Law at these problems: processors and memories have been getting faster and cheaper faster than the Internet problems have been growing, but that solution is becoming less effective. Actually, the Internet has been getting more complex and inefficient for decades; it is just that Moore’s Law hides it well.
But as to your last question, another flaw with TCP is that it has a single message type rather than separating control and data. This not only leads to a more complex protocol but greater overhead. Those who are committed to simply improving TCP/IP rather than creating an alternative protocol argue that being able to send acknowledgements with the data in return messages saves a lot of bandwidth. And they are right. It saved about 35% of the bandwidth when using the most prevalent machine on the Net in the 1970s, but that behavior hasn’t been prevalent for 25 years. Today the savings are minuscule. Splitting IP from TCP required putting packet fragmentation in IP, which doesn’t work. But if they had merely separated control and data it would still work. TCP delivers an undifferentiated stream of bytes, which means that applications have to figure out what is meaningful, rather than delivering to a destination the same amount the sender asked TCP to send. The latter turns out to be what most applications want. Also, TCP sequence numbers (to put the packets in order) are in units of bytes, not messages. Not only does this mean they ‘roll-over’ quickly, either putting an upper bound on TCP speed or forcing the use of an extended sequence number option which is more overhead, but it also greatly complicates reassembling messages, since there is no requirement to re-transmit lost packets starting with the same sequence number.
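The roll-over point can be made concrete with a short calculation. TCP's sequence number field is 32 bits wide and counts bytes; the link rates and the 1,500-byte message size used for comparison are assumptions chosen for illustration.

```python
SEQ_BITS = 32    # width of the TCP sequence number field

def wrap_seconds(rate_bits_per_sec, unit_bytes=1):
    """Seconds until the sequence space wraps when each number covers unit_bytes bytes."""
    bits_per_wrap = (2 ** SEQ_BITS) * unit_bytes * 8
    return bits_per_wrap / rate_bits_per_sec

for label, rate in (("100 Mb/s", 100e6), ("1 Gb/s", 1e9), ("10 Gb/s", 10e9)):
    per_byte = wrap_seconds(rate)
    per_message = wrap_seconds(rate, unit_bytes=1500)   # assumed message size
    print(f"{label}: wraps in {per_byte:.1f} s by bytes, {per_message:.0f} s by messages")
```

At 1 Gb/s a byte-numbered space wraps in roughly 34 seconds, and ten times faster at 10 Gb/s. Numbering messages instead would stretch the same 32 bits by the size of a message.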
Of the 4 protocols we could have chosen in the late 70s, TCP was (and remains) the worst choice, but they were spending many times more money than everyone else combined. As you know, he with the most money to spend wins. And the best part was that it wasn’t even their money.
Emerson: Finally, I wondered if you could briefly talk about the Recursive InterNetwork Architecture (RINA) as one example of an alternative protocol and how it could or should fix some of the flaws of TCP you discuss above? Pragmatically speaking, is it fairly unlikely that we’ll adopt RINA, even though it’s a more elegant and more efficient protocol than TCP/IP?
Day: Basically RINA picks up where we left off in the mid-70s and extends what we were seeing then but hadn’t quite recognised. What RINA has found is that all layers have the same functions; they just are focused on different ranges of the problem space. So in our model there is one layer that repeats over different scopes. This by itself solves many of the existing problems with the current Internet, including those described here. But in addition, it is more secure and multihoming and mobility are inherent in the structure. No special concepts or protocols are required. No foreign routers, home routers or tunnels. It solves the router table problem because the repeating structure allows the architecture to scale, etc.
I wish I had a dollar for every time someone has said (in effect), ‘gosh, you can’t replace the whole Internet.’ There must be something in the water these days. They told us that we would never replace the public phone system or IBM, but it didn’t stop us and we did. Of course it won’t happen all at once, but it might not take as long as one thinks. IPv6 has taken forever, because it offered no benefit to those who have to pay for the adoption. A change that radically lowers cost and complexity to those who have to pay for it will likely have a different adoption rate.
Notes
1. ‘Bounding a timer’ implies that a certain event is known to have occurred within this time interval. In this case the three timers are: Maximum Packet Lifetime (MPL), the interval after which a packet, once sent, will no longer be in the network; A, the interval within which an acknowledgement will be sent once a packet is received; and R, the interval within which any required retries will be exhausted.
2. Had BBN given a lot of thought to addressing? Not really. After all this was an experimental network. It was a big enough question to ask whether it would work at all, let alone whether it could do fancy things! Just getting a computer that had never been intended to talk to another computer to do that was a big job. Everyone knew that addressing issues were important and difficult to get right, and so we felt a little experience first would be good before we tackled them. Heck, the maximum number of hosts was only 64 in those days!
3. It would be tempting to say ‘host’ here rather than ‘node’, but one might have more than one node on a host. This is especially true today with Virtual Machines so popular: each one is a node. Actually, by the early 80s we had realised that naming the host was irrelevant to the problem.
References
Day, John. Patterns in Network Architecture: A Return to Fundamentals. Boston: Pearson, 2008.
Kittler, Friedrich. Gramophone, Film, Typewriter. Trans. Geoffrey Winthrop-Young and Michael Wutz. Stanford: Stanford University Press, 1999.
Ctrl-Z: New Media Philosophy
ISSN 2200-8616