Department of Electrical Engineering and Computer Science Doctor of Philosophy in Computer Science and Engineering August 2002 August 30, 2002
M. Frans KaashoekProfessor of Computer Science and Engineering
Arthur C. SmithChairman, Department Committee on Graduate Students
John Jannotti
Many people have contributed substantially to this thesis -- technically, emotionally, or both.
I would like to thank my thesis committee, Hari Balakrishnan, Ion Stoica, and my advisor, Frans Kaashoek. Their interest and insight has proven invaluable. I would especially like to thank Frans for his encouragement throughout my years at MIT. His faith in me may be misplaced, but it has encouraged me nonetheless.
The PDOS group have shaped my views considerably, and I am richer for it. I would like to thank Robert Morris, Dawson Engler, Eddie Kohler, and David Mazières in particular. I've learned more from them than I can ever repay.
I would like to thank Cisco Systems. As a company, Cisco has been extremely understanding of my commitment to my work at MIT while I have also worked for there. As individuals, they have been be friends, and technically helpful. I would especially like to thank Jim O'Toole and Suchitra Raman for discussions related to this thesis.
I would like to thank Ann Hintzman. Although she served as proofreader for this thesis, to describe her contribution in such limited terms would be a profound understatment. She has supported me emotionally more than I have deserved, and I remain eternally grateful.
I would like to thank my family. They have loved me, supported me, and encouraged me. I hope I have made them proud.
Finally, I would like to thank all of the people that have distracted me, making this thesis take longer to write, but making my life brighter for their presence.
Distributed applications would be well served by richer semantics than the network layer supplied by the Internet. Today's distributed applications have only one primitive from which they may build services: Internet Protocol (IP) Unicast, the best-effort, single source and destination delivery of datagrams. Unicast delivery allows a single network node to ask for a single packet of data to routed through the Internet to a single destination node. Additional semantics have been built on top of this single primitive. For example, Transmission Control Protocol (TCP) provides reliability and flow control.
Many applications, however, would benefit from additional services that are more difficult to build on top of IP Unicast delivery. Mission critical applications would like control over the way their packets are routed -- perhaps trading off resource usage for reliability by using multi-path routing [35]. Teleconferencing applications [20], chat rooms, and Internet broadcasting systems would benefit from efficient group communication. A stock ticker application might like to perform latency measurements over many paths to find a low latency path undetected by normal IP routing. A content distribution network would like to distribute and store data throughout the network [21,16].
One approach to addressing these needs is to build new network services into routers across the Internet. Generally this approach has two drawbacks. First, it may be inappropriate to add the necessary functions to routers that should remain fast and simple to ensure their continued availability as an important shared resource. Second, adding an important network service to routers is likely to support only those applications envisioned during the design of the service. For example, IP Multicast provides a single service model that is inappropriate for a number of multicasting applications. Efforts to revamp IP Multicast for reliability or for secure admission control require yet more modifications to routers.
Overlay networks completely sidestep these two drawbacks. Overlay networks avoid the issue of ``dumb'' IP routers by performing packet routing and duplication in edge nodes. Many examples are described in Section 2.2. In these systems, cooperating servers throughout the Internet act as routers in an overlay network.
Figure 1-1 demonstrates an overlay network. Just as a physical network has a topology consisting of the nodes of the network and the links between them, an overlay network has a virtual topology, which exists by the agreement of the overlay nodes. Packets are transmitted only along the virtual links between the overlay nodes using the underlying unicast mechanism provided by IP.
![]() |
In contrast to the Internet, in which routers are a shared resource that cannot be specialized for a particular purpose, the members of an overlay network may provide specialized services specific to the application at hand. An overlay-based multicast system can duplicate packets in the servers, a content distribution network can cache gigabytes of data, RON provides resilient routing by constant performance measurements among participating nodes.
We will use three metrics to evaluate how efficiently an overlay network is operating. Stress indicates the number of times that a semantically identical packet traverses a given link. In IP Multicast, stress never exceeds one. On the other hand, overlay networks could not hope to achieve such efficiencies because packets being forwarded by an edge node will traverse (at least) the node's local link twice. Stretch indicates the ratio of latency in an overlay network compared to some baseline, generally IP unicast or multicast. Misfires measures the number of times that duplicate packets are mistakenly sent to the same destination. These metrics are used to evaluate overlay performance with the primitives proposed in this thesis in Chapter 7.
Second, it can be difficult to build virtual topologies that resemble the topology of the underlying network. It is beneficial for the virtual links of an overlay network to connect nodes that are well-connected in the underlying network. Choosing well-connected virtual links is akin to supplying a physical network with a higher bandwidth link-layer. It is also common to prefer virtual links that share as few underlying links as possible with other virtual links. This property leads to independent failures, and less duplicate traffic on underlying links. Unfortunately, it is very difficult to determine these characteristics today. Overlays have fallen back on wasteful and error-prone techniques such as continual bandwidth probes to learn about the underlying network.
The goal of this thesis is to propose simple extensions to the network layer that would allow the construction of simple, efficient overlay networks. The extensions should be beneficial to the large variety of research projects and service providers that are investigating systems based on overlay networks.
A number of design criteria can be drawn from the diverse needs of existing overlay systems. First, and most directly, overlay networks would benefit from pushing work toward the core of the network. End System Multicast exhibits this need most acutely because its participants are expected to be desktop machines -- no nodes are expected to be located at strategic network points. Yet every overlay system would benefit if routing and duplication could be directed by the end hosts, but performed by the appropriate routers, thereby allowing their packets to follow a more optimal path through the underlying network.
Second, all of these systems have difficulties constructing overlay topologies. RMX uses hand configuration. End System Multicast lacks scalability due, in part, to its topology generation algorithm. X-Bone and Yoid need help pruning an initially quadratic number of possible links down to a manageable size. Overcast consumes bandwidth to conduct constant network measurements.
A third, more subtle conclusion, is that the overlay nodes must retain complete control of forwarding when necessary. While it is important to provide a mechanism that allows overlay networks to obtain efficient forwarding, that mechanism must be fine-grained enough to allow the overlay to handle a particular link in a specialized way when necessary. For example, systems such as RMX would like to take advantage of automatic forwarding when possible, but when a link is congested they must retain the ability to perform their own forwarding so that they may transcode to a thinner format. Similarly, Overcast must be able to interpose its nodes in the forwarding mesh so that they may cache the forwarded data and replay transmission for new nodes.
In addition to these criteria, there are lessons to be drawn from the challenges that have faced other proposals for router modification, particularly IP Multicast. It is critical that new functions are incrementally deployable -- there must be benefits to deployment even when only a small portion of routers have been modified to support the new functionality. In addition, the benefits from deployment should be concentrated around the portion of the network in which deployment occurred. These benefits offer an economic incentive to early adopters.
In order to provide for incremental deployment, this thesis proposes mechanisms that are for optimization only. Overlay networks must be prepared to operate as if the primitives do not exist. When the primitives are available, the network will provide explicit signaling to the application, allowing it to avoid work that has been performed in the network.
Finally, it is important to keep proposed enhancements small so that future modifications to their behavior are unlikely to be required, and generally useful so that greater utility might someday be obtained by using then in novel combinations.
The contributions of this dissertation are:
Related work can broadly be divided into three areas. First, IP Multicast and systems built on IP Multicast. IP Multicast provides a group communication primitive for IP, and a number of systems have attempted to build additional semantics on top of IP Multicast. The efforts have seen limited success, in part because of IP Multicast's limited deployment, and in part because IP Multicast provides a difficult base upon which to build -- IP Multicast is a monolithic primitive combining all needed mechanisms to provide a single high-level abstraction.
A second set of related projects are overlay networks. These systems seek to avoid the deployment issues of IP Multicast by providing network abstractions purely in end hosts. This thesis complements these systems by providing protocols to allow these systems to approach the efficiency of router based network services.
The most closely related work is an approach to network-layer extension that is quite similar to the approach advocated in this thesis. In Section 2.3 we examine a active networks based system that builds multicast from unicast forwarding and ephemeral state. That approach yields a primitive that resembles reflection, as well as a few other primitives that may also be of use in general overlay networks.
IP Multicast, taken as a whole, suffers from a number of drawbacks that have hampered its widespread acceptance. First, it requires router support. The spanning tree for a group cannot cross regions of the network that are not running IPM enabled routers. This hurdle creates little incentive to be ``the first on the block'' to roll out IPM support because IPM's utility is greatly diminished without widespread adoption.
Second, IP Multicast's admittance control and, by extension, its security, are weak. Any Internet node (assuming deployment of IPM routers) can send or receive packets for any IP Multicast group. Proponents argue for end-to-end encryption and authentication, but such a solution would do little to avoid denial of service attacks that rely only on the volume of data, not its authenticity. Even if unauthenticated packets were disposed of at end-hosts, IP Multicast's willingness to send copies of ``junk'' packets to all members of a group presents a ripe target for Internet trouble makers. In fact, recent experience with the Ramen Worm [41] shows that IP Multicast is so susceptible to bandwidth wasting attacks that it can be taken advantage of by mistake. The Raman Worm probed IP addresses at random, some of which were IP Multicast addresses. The Worm was first detected by the enormous growth in Multicast traffic it accidentally consumed as it spread.
Third, and most subtly, IP Multicast is an awkward primitive on which to build other network services. For example, attempts have been made for years to describe a simple, scalable reliable multicast protocol on top of IP Multicast [15,24,25,26,29], yet there is no consensus that any single protocol is appropriate for deployment in the world's routers. One problem plaguing such attempts is that there is no single agreed upon semantics for reliable multicast. One application might wish to use flow control to slow the data stream down to the bandwidth of the slowest link. Another might want to stream data as quickly as possible to nodes that have fast connections, while trickling in to those that do not. Yet another might wish to relax ordering constraints in favor of receiving most of the data in near real-time and the rest later. Attempting to support all of these semantics in router based software calls for a consensus that each is important enough to standardize upon, and risks destabilizing core routing functionality as increasingly complex software is added. Instead, more complex semantics should be provided by end-hosts, just as TCP provides flow-control and reliability.
Besides this simplification, these single-source approaches have additional benefits compared to standard IP Multicast. First, the namespace of possible group names is vastly expanded and managed more easily. In standard IP Multicast, 28 bits of address space have been set aside to be used as Multicast group addresses. Management of that space, however, is a difficult problem. In single-source approaches, the IP address of the source is a portion of group identifier, which allows each Internet host to manage its own namespace.
In many applications, the limited power of a single-source approach is also a benefit. Applications that are fundamentally one-to-many benefit from the simpler admission control and security of a hard-wired single source. Only the source of the group may send packets to the group, so the flooding attacks of standard IP Multicast are impossible. Additionally, with only one source, simple cryptographic schemes for admission control are possible. The sender encrypts and signs all packets. Only those nodes possessing the group key may decrypt and authenticate the packets.
Finally, as we will see in the next section, a single-source models lends itself more easily to reliability extensions.
All three schemes support reliability by allowing ``downstream'' receivers to contact ``upstream'' receivers for lost packets. It remains the responsibility of the application residing at the receivers to buffer and retransmit the lost packets. At each router, a particular interface is specially designated to receive request packets when they are sent ``upstream''. This interface is the replier link. In a router that is not acting as a branch point, the replier link is the upstream link. In a router that is a branch point, one of the downstream interfaces is chosen. When an end-host sends a request packet upstream, it will be forwarded along these replier links (a packet coming from a replier link is sent upstream). As a result, all upstream packets will flow through a hierarchy of repliers.
In this way, multiple acknowledgments, sent upstream, may be coalesced when they arrive at a designated replier. That replier may respond to the entire set of receivers using a directed multicast down the distribution tree.
These schemes have the usual drawbacks of network-layer multicast schemes in comparison to the primitives presented in this thesis -- they are not incrementally deployable nor may they be used outside of a multicast context.
REUNITE accomplishes a number of things that the primitives proposed in this thesis also hope to achieve. It is a fairly simple protocol, it is incrementally deployable, and state requirements can be manages explicitly by overloaded routers. REUNITE, however, is aimed strictly at supporting multicast. Path Painting and Packet Reflection are intended to support a much broader range of applications.
Receiver-driven Layered Multicast (RLM) [27] presents one solution to the problem and Thin Streams [46] refines that idea. In RLM, senders encode transmissions in a number of layers. The lowest layer is a low-fidelity encoding of the transmission. Higher level layers contain extra data which allows a more faithful reproduction of the stream at the receiver. Sends transmit each layer in a separate IP Multicast group, allowing receivers to independently subscribe to each layer.
RLM receivers determine the appropriate number of layers to subscribe to by conducting join experiments. In a join experiment a receiver subscribes to the next higher layer and determines whether its subscription creates congestion by monitoring loss rate over a short period called the decision time. If it does, it unsubscribes. Failed join experiments contribute to network congestion though, so RLM scales back the frequency of an individual node's experiments with the size of the multicast group. To allow nodes to learn the appropriate level of subscription rapidly in spite of less frequent experiments, nodes learn by observing the network during other nodes' experiments. If the network becomes congested during another nodes experiment on say, level three, of a presentation, then the observing node can act as though it conducted a failed level three experiment itself.
Thin Streams offers a number of improvements over RLM while retaining the same basic idea of allowing receivers to determine the appropriate level of traffic for a given presentation by subscribing to the correct number of streams in a presentation. First, presentations of Thin Streams consist of many more layers than those advocated by RLM. In an effort to prevent packet loss during join experiments, Thin Streams pares down both the bandwidth of individual layers and the length of the decision time. This design allows the network to buffer an entire join experiment if it causes congestion. Join experiments can now be run without causing packet loss in unrelated traffic.
Obviously, packet loss can no longer be used as metric to measure the success of join experiments. Instead, Thin Streams uses a technique developed in TCP Vegas [8]. A node compares the expected bandwidth of a new stream to the actual bandwidth being observed. If the new layer is arriving slowly it reflects growing buffering in the network. The join experiment can be judged a failure before packet loss occurs.
RLM and Thin Streams have a few significant drawbacks. First, they implicitly limit the multi-source IP Multicast model to a single source. They would not allow two groups of researchers, at MIT and Berkeley, to teleconference in a way that allows MIT researchers to see each other in high-fidelity and Berkeley researchers to see each other in high-fidelity. Such scenarios would require a set of IP Multicast Groups for each source, which would greatly compound the remaining problems.
Second, they require encoder research to develop codecs that decompose well into a layered architecture. Third, they use extra IP Multicast groups. Using extra groups contributes to an existing problem area for IP Multicast, the large size of the IP Multicast routing tables that must be present in every router. Large tables require large amount of expensive fast memory and slow all lookups when using a hash-based lookup scheme. In addition, IP Multicast's small address space is strained even more by using additional groups for each presentation.
Video Gateways [1] are designed to solve the same problems as layered multicast, while avoiding the problems of the layered approaches. What makes Video Gateways even more interesting though, is that they address an additional problem. On the Internet today, and for the foreseeable future, IP Multicast cannot be assumed to reach all hosts. Some networks have deployed IP Multicast, some have not. This situation has existed for some time and it is difficult to predict when it might change.
Video Gateways are application-level proxies that connect two separate IP Multicast groups using unicast. Each pool of nodes contains a Video Gateway that subscribes to the local IP Multicast group. The Video Gateways then communicate using IP unicast to bridge the content of the two groups. Because such bridging is under application control it may perform additional functions if desired. For example, a Video Gateway connecting a group using Motion-JPEG with a group using H.261 by transforming data between the two formats has been demonstrated. The Video Gateways perform the necessary transcoding in each direction so that all members may source and sink data with all other nodes in both groups.
Video Gateways address all of the problems of layered IP Multicast, and even allow some new capabilities that layered systems do not even attempt to provide (such as the ability to transcode data formats and bridge unicast networks). They have a downside, however, that is likely to prevent widespread use of Video Gateways for bridging large groups with many segregated IP Multicast sessions. Video Gateways are statically configured entities. Each pair of gateways connects exactly two IP Multicast groups, in exactly the way that a human operator has configured the two gateways. This manual setup becomes untenable as the desire to connect more groups, on an ad-hoc basis, is considered; the human effort involved in configuring tens of gateways in order to video conference would be considerable. Nonetheless, the fundamental architecture is appealing. A network of forwarding nodes, similar to Video Gateways, that organize themselves to bridge between segregated multicast groups would be a logical extension.
A number of research groups and service providers are investigating services that could be described as a network of Video Gateways. Often referred to as ``Overlay networks'', these services consists of many application-level nodes using unicast to move data to multiple recipients.
Some overlays are extremely generic, such as RON [3], the Dynabone [40], and Yoid [16]. These systems exists solely to provide an overlay network with ``better'' properties than the underlying network, such as lower latency or greater resistance to DOS attacks. In the following sections, more specialized overlays are examined.
RMX focuses on real-time reliable multicast. As such, its focus is on reconciling the heterogeneous capabilities and network connections of various clients with the need for reliability. Therefore RMX focuses on semantic rather than data reliability. For instance, RMX can change high resolution images into progressive JPEGs before transmittance to underprovisioned clients. In that sense, RMX is the closest relative to Video Gateways. Transcoding is an excellent example of function that is best performed by application-level servers, providing direct evidence that an architecture that involves external servers may be preferable to enhancing the network layer.
End System Multicast provides small-scale multicast groups for teleconferencing applications using only the group members to duplicate packets. End System Multicast's chief drawback is its limitation to small scale groups. The primitives proposed here could provide a significant benefit to ESM because reflection can duplicate packets in the core of the network, rather than relying on the receivers alone.
Overcast is an ALM system designed to support high-bandwidth, single-source applications. Overcast's contribution is two-fold. First, it demonstrates the benefits of an ALM system by providing features that IP Multicast cannot support, such as on-demand access to content. Second, simulated evaluations of Overcast's tree building technique provide evidence that reasonable trees can be built under application-level control. Inefficiencies remain, however, because packet duplication cannot occur in routers and online measurements must be used extensively and continuously in order build and maintain the distribution tree.
Painting and Reflection are complementary to these systems, though some P2P systems will be unable to derive benefits from both primitives. For example, in systems such as Chord, important properties of the system are based on the random path a lookup request takes before reaching its destination. Further, once the lookup is completed, no more packets will be expected to take that path. In such cases, Packet Reflection cannot be used to optimize the lookup.
On the other hand, Gnutella might take advantage of Packet Reflection when sending responses to requests. These responses may be quite large, and they are pipelined through the set of nodes through which the request came. The nodes along that path could use Packet Reflection to speed the delivery of the responses.
Therefore, a detailed comparison to this approach is appropriate. The Kentucky approach presents a primitive dup that closely resembles packet reflection, and a number of primitives (in an active networks framework) that accomplish goals similar to packet reflection.
A strength of their approach is that it handles asymmetric routes better than painting and reflection. However, this appears to come at the cost of some scalability - part of their join mechanism involves an echo packet reaching a central rendezvous point before being returned to the sender. Very large groups could overwhelm the rendezvous point's network.
A strength of the primitives presented here is that they have been designed to operate correctly in the face of route changes in the underlying network. dup is insufficient to handle these cases. For example, if a route change causes the router maintaining node A's dup request to stop receiving the group's packets (because they are now taking a different path), A will lose the packets. With reflection, duplications are explicitly confirmed to a node's parent, using success tags. If the duplication point is bypassed, the parent knows it. It can then perform the duplication on its own and make a new reflection request.
A final difference is largely philosophical. Dup and friends are presented as a few of the infinity of possible primitives that could be constructed with an active networks approach. While this approach brings the promise of arbitrary programmability, it carries the baggage of complexity and difficult to address security problems. The primitives presented in this thesis are simple, have enough power, and the security implications can be well understood.
In an overlay network each node carries out explicit unicast communication with its neighbors in the topology. When one overlay node forwards packets between two other nodes, that packet is transmitted on the same link multiple times as it reaches the overlay router and is re-emitted toward the final destination. The links near the overlay router will have a stress of two, and the stretch of the packet will exceed one as time is wasted while the packet approaches and then leaves the overlay router.
When multicasting on an overlay network, the stress problem is exacerbated.
The forwarding node duplicates packets, forcing semantically equivalent
packets to be transmitted on the same link, in the same direction, multiple
times. In such cases, some links will have a stress equal to the number of
packet duplications plus one (one packet arrives, each duplicate leaves).
For example, Figure 3-1 shows a simple application-level
multicasting tree in one link,
, has a stress of three another
link,
, has a stress of two.
Stretch is also a problem in Figure 3-1.
receives
packets only after they have traversed eight links, rather than the four of
a direct unicast.
must wait for six traversals instead of four.
Assuming unit latencies, they would experience a stretch of 2.0 and 1.5
respectively.
In the remainder of this chapter, we will see how packet reflection can reduce stress and stretch in these situations. Most examples will describe a multicasting overlay system both because multicast is a common application of overlay networks and because unicasting overlay networks can be seen as a degenerate case of a multicasting system.
![]() |
IP routers perform a simple operation on most packets: when a packet arrives, lookup the destination IP address in a routing table, and use the resulting entry to choose an interface on which to emit the packet. An overlay node, acting as an overlay router, performs a similar operation. Overlay routers determine the overlay address and forward the packet, using IP unicast, to the next overlay node. The next node is, again, determined by consulting a routing table. In order to perform multicasts, these routing tables may contain multiple next hops for a single overlay destination address.
The stylized nature of this operation suggests that it could be succinctly
described so that another node could perform the operation instead. An end
host may ask ``the network'' to perform packet reflection on its
behalf. In Figure 3-2 end host
directs a
reflection request toward
, which takes it to router
. This
optimization alleviates the excess stress on link
. In addition
to performing requested reflections, routers continue to forward packets
using their normal forwarding rules. Thus,
will continue to receive
all packets addressed to it.
![]() |
The format of a reflection request is denoted
. Such a request will be addressed to
,
and rely on routing symmetry to direct it to routers that can fulfill the
request. This notation should be read as, ``When a reflectable (IP
Protocol = REFLECT) packet arrives matching the inbound flow identifier
, duplicate it once for each outbound flow identifier
. Rewrite the source and destination in each duplicate
and emit each, tagged with the associated
. Emit the original packet
tagged with
.'' Tags are used to ensure that nodes know when their
reflection requests have been honored and are described in detail in
Section 3.3. The operation of a router receiving a reflection
request and handling packets that match the request are formalized in
Rules 1 and 2.
![]() |
As seen so far, packet reflection allows end hosts to avoid wasted packets
on the link between themselves and the nearest router to them. Although
this optimization is useful, greater utility is achieved when routers
themselves make reflection requests. A router that has been asked to
reflect a packet out the same interface on which it is received may pass on
a similar reflection request. In Figure 3-3, router
takes advantage of packet reflection by propagating part of its
responsibility to reflect packets. By pushing a request similar to
's
original request on to
,
avoids work and (more importantly)
alleviates the stress on link
.
Of course, if
performs the reflection to
,
should not.
Tags allow
to know whether a given packet has already been reflected
by
. This mechanism is formalized in Rule 3 and
described in more detail in Section 3.3.
All reflection packets have the structure depicted in Figure 3-4. Many fields have the same meaning in all packets. For example, the Source Address/Port and Destination Address/Port always refer to IP address and ports of the packets that will match the reflection request. The Success Tag is always the value to write into packets that match the reflection request after fulfilling the request. Tags are described in detail in Section 3.3. The Copy Count is the number of five-tuples to follow. Those tuples describe the copies that should be created in response to observing a packet that matches the Source and Destination fields. The tuples contain new source and destinations for the copies as well as a tag to write into those copies.
The Nonce is unused in an ASK. As seen in the following sections, the Nonce is a challenge that is created by the router forming an OFFER.
In an ASK the copy information is speculative. It contains all copies that the asker would like handled for it. The offerer will decide which of those requests it will handle.
The IP Source and Destination in an OFFER indicate the IP address of the router making the offer, and the IP address from the Destination Address field. An OFFER packet is not necessarily addressed (at the IP level) to the asker. Instead, the asker must intercept the OFFER on its return path. This complication is introduced in order to increase the security value of the Nonce.
The Nonce field contains a cryptographically generated integer. The nonce will be echoed back in the DEMAND packet, confirming that the demander controls the IP address in question. The nonce should be the result of a one-way hash function run on the match criteria of the request and a router secret. This technique, based on the idea behind SYN cookies [7], will allow the router to confirm a nonce without storing it, preventing a denial of service attack. The nonce prevents the use of reflection to intercept communications that could not be intercepted by other means. That is, an accurate nonce in a forged DEMAND implies that the attacker can already intercept packets addressed to the victim.
The copy information in an OFFER lists the subset of copies from the ASK that the router is willing to service. The router might determine this subset in any way it chooses, though Rule 4 is a guideline. Generally, a router should be willing to make a copy if, when consulting its own IP routing table, it determines that the copy would not be emitted on the same interface as a packet that meets the matching criteria of the ASK. Rule 4 means that a router would make a copy if doing so would decrease stress on one of its own links.
Alternatively, Rule 5 is a recursive approach to determining what copies a router should offer. In this formulation, all ASK packets would recursively propagate to the source, then a series of OFFER packets would propagate back to original asker. Finally, DEMAND packets would proceed toward the sender. The recursive approach will push requests further into the network at the cost of more network traffic during setup.
The propagation of route reflection requests toward the source of the match criteria assumes symmetry in IP routing. Under this assumption, the request, wherever it ends up, will lie on the path from the source to the destination of the inbound flow identifier. This assumption may not hold, however, in some situations. In those cases, asymmetric routing paths exist between two hosts. This asymmetry could allow a packet to arrive at an end-host without passing through the router that would be expected to perform reflection on it.
A similar concern is that routes in the underlying network change as a result of broken links or configuration changes. A reflection request may have propagated to a router that, after a route change, no longer sees the packets that are to be reflected. Figure 3-5 demonstrates this problem.
For correctness in these scenarios, packet reflectors must signal the original destination node when a packet has been successfully reflected. In the absence of such confirmation, the requesting node would perform the reflection on its own, as if packet reflection had not been requested. In addition, the requester might try to reestablish the reflection request. If the problem was caused by a new route, a router on the new path might accept the request.
![]() |
To implement this signaling mechanism, packet reflection requests contain a tag, as do all packets forwarded by the reflection mechanism. When a router performs a reflection, it writes the value of the tag for that reflection request to the original packet, which continues on its way to the original destination. If a packet is received without the appropriate tag, it is clear that duplication did not occur, so the receiver performs the duplication as if it had never made the reflection request.
There are two important phases in the use of success tags in reflection requests. In the first phase, a requester must choose appropriate success tags for reflection requests. We will see that as a request propagates from router to router the success tag must be changed to avoid ambiguity. The second phase begins once reflection requests have been established. When a router receives a packet for which it has agreed to perform reflection, it must either perform the entire reflection itself or determine that some part of it has already been done, and perform the rest.
We begin by assuming that edge hosts use a success tag of one when
initiating a request, codified in Rule 6. For example,
in Figure 3-2, the first request,
, requests that success be indicated by
writing a 1 to the original
packet. Zero is reserved
to indicate that no reflection has occurred yet. Reflectable packets begin
with their success tag set to zero. The important question is how routers
should choose success tags when propagating their reflecting
responsibility.
Changing success tags is sometimes a necessity. In
Figure 3-3,
has requested that two copies be made
whenever it receives a packet from
.
agreed to perform those
copies, but went on to request that
, in fact, should make one of the
copies. The two requests must be:
The second request must change the success tag in case the
packet emitted by
ever makes it to
without passing through
(due to a route change or a new asymmetric path).
is ensuring
that
will not confuse
with a claim that is not true. A success
tag of 1 would indicate that both copies have been sent, so
must use
a different success tag after making only one copy.
![]() |
Changing success tags, however, is not always a requirement. Compare
Figure 3-6 to Figure 3-3. In
Figure 3-6
has agreed to the same request from
, but was able to go on to request that
perform both copies on
its behalf. In this case, there is no chance of confusion. If the
packet emitted by
ever makes it to
without
passing through
with a success tag of 1,
will not be confused.
The claim is accurate: both copies requested by
have been made.
The difference between these two cases is that in
Figure 3-6,
has agreed to perform the exact same
reflection request that
had previously been assigned. Under these
circumstances, it is reasonable for
to go a step further and ask that
perform
's tagging operation as well. If
had used a new
tag, then when such a packet arrived, its only required operations would
have been to rewrite the success tag to its own value. By reusing its own
value in the new request,
can simply forward the packet.
has
arranged matters so that its operation under Rule 3 has
led to a degenerate case: the incoming packet's tag is the same as the
expected tag, and no copy entries are NORMAL, so IP forwarding of the
original packet is all that remains to be done.
There is a significant advantage to avoiding unnecessary tag changes,
beyond the ease with which
may now forward packets. If a new link
were brought up connecting
directly to
, reflection would
proceed without any difficulty. The packet would be tagged at
in
exactly the way that
expects, so
would correctly detect that
its request had been fulfilled. The fact that
, rather than
,
performed the duplications is irrelevant. If, instead, routers always
changed success tags when propagating requests, then
's presence would
be required between
and
so that the tag could be transformed.
By the same reasoning that allows packets to skip
, we can also
conclude that
may safely throw out the reflection table entry
associated with the request. This optional space optimization is codified
in Rule 8. Routers should not eliminate this
state without cause, however. If, for example, a new route should be added
to the network that skips
but not
, it would be beneficial for
to still have enough information to perform the necessary
duplications. If the state has been eliminated, the packet will not be
duplicated until it reaches the application in
.
Route asymmetry, however, presents the same difficulty in a more persistent
form. For example, if the previously described new link were a one-way
link toward
, then
's next request would still go through
.
Only by keeping the success tag unchanged can the network avoid an extra
reflection. Of course, sometimes route asymmetry will occur around a
router that was forced to change the success tag in its request because it
was unable to forward the entire reflection. In these cases, missing the
router in question causes a misfire. Although the same duplication may
already have been performed, when the packet arrives at the next router the
tag will not be correct so the entire reflection must be performed. This
problem is described in more depth in Section 3.5.
Once a router has agreed to service a reflection request, it is expected to make the appropriate copies or ensure that another router has done so, and then write its success tag to the original packet. In the simplest case the router simply performs the reflection and writes the success tag. The router will always perform the reflection when it has not made a reflection request to any other router. The packet will arrive with no success tag (the field will be set to zero), the router will make the copies, and then forward the original packet to the next hop with the success tag filled as previously requested.
If a router has made a reflection request, then it might face a second case. A packet may arrive with its success tag correctly set to the value requested by the router in its reflection request. In this case, the local router knows that its reflection request has been honored, and certain copies (those labeled DEMANDED by Rule 3) have already been made by another router. In some cases, this knowledge will allow the local router to avoid making any copies. No copies will be required when the router was able to make a reflection request to offload its entire responsibility. Other times, the next router will have offered to perform only a portion of the local router's work. In those cases, the local router will still need to make some number of copies (those marked NORMAL). Once the router makes any remaining copies, it tags and forwards the original packet.
A final case is possible. A non-zero success tag appears in a packet, but the success tag does not correspond to the value of the router's previous reflection request. For example, the router has requested that a reflection be performed and that 11 be written to the forwarded packet, but the packet arrives with a 13. As we will see later, this mismatch can happen when further routers have made requests to yet more routers, but some router that is expected to perform a reflection has been skipped by the path of the original packet. The value 13 indicates that some reflection request was performed on the packet, but the local router cannot know what has occurred, so the entire reflection must be performed, followed by writing the success tag. The local router always writes the same success tag because it is confirming that its own reflection request has been completed.
In summary, Rule 9 refines Rule 2 to account for tags.
In addition to the tag associated with the request as a whole, reflection
requests also contain a tag associated with each outbound flow identifier.
This feature is necessitated by the interaction of multiple reflection
requests. Suppose that a router has accepted a request:
. Now, when a router
observes a packet going from
, it sends a copy from
and 1 will be written to the
packet. Now
suppose that the router receives another request:
. The router now determines that when it
receives a packet from
, it must send two packets in
addition to the original:
and
. The
router tags the
packet with a 5 so that it is clear to the
originator of the second request that its request was honored.
Finally, suppose that the router emits
packets on the same
interface that it receives
packets. In that case, it
makes a request to its upstream router using the tag associated with the
flow identifier:
. By doing so, it insures that it is
upholding its contract to write a 5 into
packets. The
need to associate success tags with copy entries is codified in
Rule 10.
Suppose that a router, R, has agreed to two reflection requests:
From these two requests, R uses Rule 10 to
determine that it must make three copies when it sees an
packet. First, it must send a
copy.
Next, it must send
and
copies in response
to the
packet it just generated. Furthermore, the
packet must be tagged with a 1 to confirm that the two
extra copies were sent (as per the second reflect request).
Now R decides to save itself work by propagating a reflection request toward B. It intends to request:
R builds an ASK for this request. If the OFFER comes back
indicating that the offerer is unwilling to make the
copy,
R cannot go ahead with the obvious DEMAND that includes the other two
copies. If
is demanded, those packets will go out
with a success tag equal to 1. That tag would erroneously indicate that
was generated.
R has two choices to assure correctness. It may demand only the
copy, but ask that it be sent out tagged with a 0, not 1.
In this case R should remove the
copy from its request.
If the
is not removed, it will only cause a misfire when a
later router creates the same copy because no success tag has indicated
that it has already been done. Alternatively, R may simply drop the
request for
completely. In this case, that would mean
dropping the request completely, but a larger request could still have
valid copies to be made. The key requirement is to drop the copy whose tag
would erroneously indicate that additional copies were also made. This
requirement is summarized as Rule 11.
Under these circumstance, one might expect a chain of reflection requests that cause copies to be made as close as possible to the source. This expectation is not always correct, however, due to ``local maxima''. A copy may stop at router because the next router in the chain refuses to perform the copy, even though yet another router, closer to the source, would be willing to make the copy.
Imagine that a router,
, has been asked to reflect a packet to end-host,
. The router has three interfaces, and
lies on a separate interface
from both the source and destination of the original packet.
will
attempt to pass the reflection request further up toward the source, but
the next router may consider
to be the next hop for
. In such a
case, the next router would refuse the request, and
would be left
performing the copy. This situation is a reasonable because stress is
reduced, but it is certainly possible that there is a ``higher''
branchpoint, closer to the source that would be preferable in some way,
because, for example, it decreases stretch. In that case, the recursive
Rule 5 may be preferable to Rule 4.
When these requests meet at a single router, the router should set up its reflection tables to act as though it received the following two requests:
The important rule to keep in mind is that reflection is always an optimization of an existing overlay relay -- two unicast transmissions. The interactions of separate requests will never change the character of those transmissions. This property simplifies predictions about the final result of multiple reflection requests. For example, if there are no cycles in the overlay network, no cycles can be induced by using reflection.
Consider that an acyclic overlay network can be seen as a tree, rooted at
the origin,
, of a packet. In each reflection request made by a node,
, in the tree, the match criteria will name a source node that is closer
to the origin than the destination of the copies. For example, consider:
Each destination (
,
, ...), is guaranteed to be further from
in the overlay topology than
or
. Every time that a copy is
made, its new destination is further from the origin in the overlay
topology. The tree must have finite depth, so the number of copies is
bounded. No cycles are possible.
On the other hand, malicious overlays can also be ``optimized''. For example, three nodes might make the following requests:
These requests would create a cycle among reflecting routers, just as there is a cycle among the three nodes in the overlay. The duplicate suppression techniques described in Section 3.6 would be expected to mitigate the effects of cycles, however.
![]() |
In the case of network asymmetry changing success tags in the core of the
network, the tag mechanism cannot prevent misfires, but it can address the
missed reflection request inside the network. Stress and stretch will still
be reduced at the point the asymmetry ``comes back together''. For
example, in Figure 3-7
is a one-way link toward
. The routes between
and
are now asymmetric. Although
the reflect request in
will never be used in this scenario,
will detect that it should continue to perform both copies on
's
behalf because packets will arrive without the success tag that
asked
to use. Once
performs the reflection, it will write its
success tag as usual. From the standpoint of
, it does not matter who
performed the reflection.
's success tag assures
that no more
work is required.
Two possible designs would allow duplicate detection. In the first, packets contain a sequence number. When a packet is reflected, its sequence number is associated with its reflection table entry. Further packets with equal of lower sequence numbers are duplicates. As usual in designs with sequence numbers of finite size, ``greater than'' is defined to be circular comparison. This technique faces difficulty in the face of network reorderings, however. If a higher numbered packet arrives before lower numbered packet, the lower numbered packet would not be reflected. Although this is not a problem for correctness because the packet would not receive its success tag and the end-host could perform the reflection, it would result in a loss of efficiency.
For the sake of foiling denial of service attacks, a different approach is possible. Instead of using sequence numbers, each packet could be tagged with a large random number. A router would associate a small cache of recently seen packets with each entry in its reflection table. Again, duplicates would not be reflected. This scheme also faces problems in the face of reorderings, if the number of packets reordered exceeds the size of the cache. Again, this is not a correctness issue, but for a different reason. In this scheme, when there is a cache miss on a duplicate packet, it will be reflected a second time. This addition duplication is unnecessary, but not a problem for correctness.
The random number scheme has an additional benefit compared to the sequence number scheme. It permits a strengthening of design choice that states that duplicates should not be reflected. Instead, packets that have been detected as duplicates may be dropped. The random number scheme produces false negatives, which would allow safe dropping. The sequence number scheme permits occasional false positives which should not be dropped.
In addition, the Nonce field ensures that a request is being made by a node
that, at the least, already has access to packets destined for the
destination. A malicious node can reflect
's packets only if it can
intercept
's packets. In such a case, the malicious node already has
access to
's data, so reflection would merely optimize the theft.
Further encouraging deployment, a router is likely to experience less total load compared to a purely end host based multicast system. If overlay networks become more common, network operators will want to support reflection for their own benefit (cheaper provisioning), not just for their customers'. Instead of receiving multiple packets, performing multiple route lookups, and transmitting multiple packets, a reflecting router receives one packet, performs one lookup, and transmits multiple copies of the packet. Additionally, the lookups performed for packet reflection may be faster than a normal routing lookup, as they are exact matches rather than longest-prefix matches.
Packet reflection requests are normal IP datagrams, so requests pass through legacy routers unchanged. If, for example, only the border router of a large organization's network is capable of packet reflection, then all reflection requests for flows originating outside of the organization would make their way to the border router. The effect is that all such flows are short-circuited at the border router, saving the organization from internal resource usage. This design decision simplifies initial deployment.
To build overlays that resemble the underlying network, nodes would like to aggregate locally into small clusters. Then small clusters might further aggregate with nearby small clusters to form larger clusters, and so on. In this way, nodes that are nearby in the underlying network will be nearby in the overlay network. To allow for this aggregation, path painting takes advantage of the fact that, in general, the Internet is organized so that nearby nodes share most of their paths to far away nodes. If nodes could determine which other nodes they share paths with, they could determine which nodes are near them.
This property works at many scales. Locally, the computers of a single university dorm share almost all of their paths, beginning with their LAN. All computers of the university also share most of their paths, though not necessarily the first few hops. Beyond that, all customers of the university's ISP share paths once they reach the ISP, and so on.
To use path painting, all end hosts send paint requests toward an agreed upon rendezvous point. As the requests moves toward the rendezvous point, there are two basic possibilities at each router. First, the request may arrive at a router that has no ``color'' associated with the requests's rendezvous. In that case, the router becomes colored by the request -- the source address and port are noted -- and the packet is forwarded as usual. Alternatively, a request may arrive at a router that is already ``colored'' for the rendezvous of the request. In that case, a notification is sent to two nodes -- the source of the current request and the source of the router's current color. The notification contains information about each node, allowing them to make an application level decision about aggregation. In addition, the request is dropped, or ``quashed'', allowing many nodes to paint toward a given rendezvous without overwhelming it.
Figure 4-1 illustrates the interactions of three paint requests. The first painter emits a paint request which paints all routers on the way to its destination. After that two more painters emit paint requests. Those requests proceed only until they reach a router that has already been painted.
![]() |
A paint request is sent to elicit information about other nodes that share an interest in a given rendezvous. Responses take the form of NOTIFY packets containing the addresses of those nodes.
Once a router has determined the painters that should appear in the NOTIFY, it sends a separate copy of the NOTIFY to each of them. Rule 13 summarizes.
Under normal circumstances paint requests are quashed if they match a request previously made at the same router. Only one ``paint color'' continues on from an intersection point. Without application hints, the propagated request is arbitrarily chosen to be the first observed color at the router at which the paint requests meet.
To allow applications to choose propagation of a particular request, paint requests may contain an IP address and port in a concede field, which indicates that the request should not proceed past a node with a request that originated from concede. Rules 14 and 15 provide details.
In Figure 4-3,
painted first and therefore would be
expected to color all routers on the path to
. Instead, after
sent notifications to
and
, they agreed that
's paint
should continue.
's subsequent paint requests contained
in
their concede fields. Such a paint request is forwarded until it
arrives at a node that is not colored by the requester. In this
case, the first such request would proceed to
, allowing
to
``win'' at
and
as well. Further such requests would stop at
.
![]() |
On the other hand, to avoid a denial of service attack from a node that
might paint to a rendezvous point but refuse to participate with other
nodes painting to the same rendezvous, a request may also contain any
number of ignore addresses. The paint request should continue even
if it encounters a router colored by one of the ignore nodes. An
overlay node would use an ignore list to allow its paint request to
continue past an uncooperative node that has colored a router on the path
to the rendezvous. In Figure 4-4,
is a malicious node
that was the first to paint toward
. Without ignore,
and
would be unable to rendezvous. Their paint requests would be
quashed, leaving them without knowledge of any non-malicious nodes.
When a node is uncooperative, other participants should use ignore to
allow their paint requests to skip the uncooperative node. In the previous
situation,
and
are expected to detect the difficulty (perhaps
because
is unable to participate in a cryptographic join mechanism)
and add
to their paint requests as an ignored color. Subsequent
paint requests would operate independently of the malicious node's paint.
![]() |
Routers must maintain a list of colors, rather than a single color, in
order to support ignore. For example, in Figure 4-4,
must maintain two colors:
. Normally,
acts as
though it is colored by
. When a paint request arrives that ignores
(as
's request does), the router acts as though it is colored by
. Rule 16 details this operation.
Only when there is no support for painting between a node and a rendezvous point must the node fall back on previous work in building overlay topologies. Alternatively, the rendezvous itself could act like a painting router in this degenerate case. Of course, when there are few paint-capable routers, it may be beneficial to employ some traditional overlay construction techniques. For example, if 300 nodes all come together at a single router, it would be a good idea to perform some traditional topology building techniques to avoid a single node with 300 children [21,20,6].
Two design variations that take advantage of that flexibility are considered below. These variations are not explored in depth, however.
As an optimization, a router could keep track of a set of all painters and send a single NOTIFY containing all of their addresses periodically. This choice would require more local state at the router, but that state can be managed if resources are limited.
The router could eliminate the extra state requirement entirely by falling back on unbatched notifications, or it could set a finite limit to the number of addresses stored, sending a batched NOTIFY when the limit is reached and flushing its store.
A malicious node that has painted a router can be bypassed by adding the node to the ignore list of the next paint request. However, when multiple malicious nodes paint a router, this can be a tedious process. It may take several tries, each involving a paint request, a failed attempt to connect to the return node, and a new paint request with an expanded ignore list.
If, instead, a NOTIFY message mentioned all painters that have made requests to the router, then an intelligent painter could attempt to contact each painter until an agreeable node is found. That node could report the address of a node that the true group participants have agreed upon as the color of the router. Finally, the new painter would send a second paint request with the appropriate concede and ignore fields.
This chapter explores the implementation of the proposed primitives. In the first section, the implementation used for evaluation is described. This implementation is for the ns [30] network simulator, but also represents the approach that might be taken in a generic, software router. The second section examines how the primitives might be implemented in hardware for very fast routers, comparing closely to the approach used to implement IP Multicast in similar circumstances.
Ns is an object oriented simulator, written in C++ and OTcl. OTcl is an interpreted language that acts as a front end for the underlying compiled code. New features can be added in C++, OTcl, or a combination of the two languages. After creating a new object type in C++, it may be manipulated using OTcl. The implementation of reflection and paint consists of three new OTcl objects:
The ReflectAgent responds to three commands at the OTcl level: request, send, and clear. Scripts use request
separately for each copy in a reflection request. For example, to prepare
to send:
request 23:1 36:3 36:5 42:5
request 23:1 36:3 36:6 47:6
Following this preparation, the request may be sent by issuing: send with no additional arguments.
clear may be used to clear the set of copies that will be requested.
In addition, PaintAgents support a set of commands for extracting information from the NOTIFY packets that are returned to the Agent. OTcl scripts may use parent to determine the color of the router that has blocked the further progression of node's paint request. They may also call children to obtain a list of the nodes whose paint requests have been blocked by the Agent's own request. These names derive from the common use of this information: a node will treat the node whose color blocked its own color as its parent in a multicast distribution tree.
The data-path of a high speed router should consist of steps that can be implemented in hardware. For the purposes of reflection and painting, this concerns only reflection processing. Path painting and the reflection handshake are control-path operations, and can be assumed to be handled by the routers slow path without degrading overall performance.
For the most part, the steps involved in processing reflections are quite similar to the steps needed to perform IP Multicast. However, some details differ, allowing IP Multicast to be implemented somewhat more efficiently in hardware. A complete comparison is summarized in Figure 5-1.
The first two steps differ only in the size of the fields that must be checked. Step one must match against the 8 bit IP Protocol field in order to determine that a packet may be reflected, instead of the first 4 bits of the IP Destination Address that determine that a packet is an IP Multicast packet. In step two, reflection must perform a lookup over the 128 bits comprised of 32 bits for each of the IP source and destination addresses combined with 32 bit source and destination ports. IP Multicast uses only the 28 bits remaining in the IP destination to perform its lookups. When this exact match fails in IP Multicast, the packet is dropped. When a reflectable packet does not match a reflection table entry, it is forwarded as a normal IP packet.
Step three is the most significant difference between IP Multicast and packet reflection. In IP Multicast, the number of copies is upper-bounded by the number of output interfaces (minus one). No IP Multicast packet will be emitted on the same interface twice. In contrast, a packet may be copied any number of times on a given output interface during reflection. This difference is quite significant to hardware implementations.
A fast IP Multicast router may perform steps 3-6 at line speed on its output crossbar switch. No packet will be duplicated on the same output interface, so there will be no need for buffering. Furthermore, an IP Multicast router does not actually need to perform steps 4 and 5. All emitted packets are identical, as addressing information is contained solely in the IP Destination.
A reflecting router, on the other hand, must allow for the possibility that packets will be duplicated, but destined for the same output interface. To mitigate this effect, a router may choose to bound the number of copies that will be emitted through the same interface. During the reflection handshake, such a router will perform route lookups on each copy and refuse to offer to reflect more than a bounded number of copies to the same interface. At the same time, the reflecting router may choose to cache the results of the route lookups (the output interfaces obtained) in order to avoid step five when forwarding packets.
We conclude that a fully functional reflecting router can approach, but not meet, the speed of an IP Multicast router. However, by compromising some functionality (rejecting certain requests), a reflecting router can more closely compete with an IP Multicast router. In the case of extremely fast routers, however, it may be difficult to support reflection or IP Multicast. If that is the case, reflection has an important advantage -- it continues to function. Again, because reflection is incrementally deployable, some routers may ignore reflection requests without severely affecting performance. Section 7.2.2 explores this situation in depth, concluding that support in routers just outside the core of the network (border routers) is as effective as support in the core (transit routers).
The previous chapters proposed two primitives that overlay networks can take advantage of in order to increase efficiency. These primitives are intended to be flexible, supporting overlay networks of all kinds. To demonstrate that this is the case, this chapter presents various uses of the primitives in real world applications. The first sections explore their use in various ALM systems beginning with a system that emulates the characteristics of IP Multicast and continuing through numerous extensions and modifications.
Packet reflection and path painting were originally conceived with application-level multicast in mind. It is no surprise then, that they are suitable for such applications. Nonetheless, their flexibility and suitability as primitives upon which interesting systems can be built can be assessed by looking at ALM. We will find that, when using the primitives, it is easy to extend a simple IPM-like system to handle heterogeneity and reliability. This flexibility is in stark contrast to IP Multicast, in which support for heterogeneity and reliability represent significant design efforts.
First, we describe a system that mimics the semantics of IP Multicast. In fact, the proposed system will provide a number of improvements over IP Multicast while maintaining the ability to provide the same service model. Second, a heterogeneous multicasting system will be described that has more functionality than similar systems built on IP Multicast (RLM [27], Thin Streams [46]). The versatility and simplicity of these protocols demonstrates the constructive power of the proposed primitives. Finally, we present a simple, reliable multicast protocol. Its virtue, in fact, is its simplicity. By decomposing and exposing the constituent parts of IP Multicast, we have created two abstractions that, taken together, are more powerful than the monolithic ``primitive'' of IP Multicast.
After exploring application-level multicast, the final sections demonstrate that the primitives are useful outside of multicast by showing their utility in two other common overlay tasks: two-hop routing and locating nearby nodes.
| Feature | IP Multicast | Emulation |
| Group address | Class E IP address | (IP address, port) |
| Rendezvous | Core router | End host |
| Join request | Graft message | Paint request |
| Data Path | Routing table | Reflection state |
As in IP Multicast, a join message is sent to the rendezvous point by new group members. In emulation, the join message is a paint request. If the paint request encounters no router that is already painted on its way to the rendezvous, then no action is required; the new node is the only member of the group. If the paint request encounters an already painted router, that router notifies the joining node and the previous painter.
One of these two nodes must become the parent of the other. Various rules are possible, but one rule that appears promising is to set the node nearer to the router at which the collision occurred to be the parent. The nodes can determine their nearness from the TTL field in the collision reports. In case of a tie, any tie breaker is sufficient, such as an ordering on the nodes' IP addresses. More simply, they can rely on the router to select a paint color, which will be the first painter.
Finally, determining the port numbers over which the nodes will converse cements the parent/child relationship. Once the nodes have decided who will be the parent, the child begins sending paint request with concede set to the address of the parent. On the other hand, if the other node is uncooperative, the emulation adds the node to the paint request's ignore list.
In Figure 6-1, three nodes have joined an emulated
multicast group using
as a rendezvous. Here we assume that the
end-hosts sent paint requests in their natural order (
,
, then
), and that the topology induced by that ordering was acceptable.
's paint would have reached
, which would have notified
and
. The nodes would then set up communication by an application-level
protocol, which would include choosing a pair of ports over which they would
communicate.
's paint would reach
, leading to a similar
exchange between
and
.
In Figure 6-1, the rendezvous,
, is an active
participant in the multicast group. Thus
itself acts as a painting
router. That is, when
's first paint arrives at
, a NOTIFY
is sent to
, informing it that it has reached a router that is colored
by
.
and
carry out the same protocol to establish a pair
of ports to communicate over as
and its children did.
The rendezvous does not need to be an active member of the multicast group,
however. Suppose
were eliminated for Figure 6-1
(but IP routing entries still exist for it in its current location). In
such a case,
paint would have elicited only the ``empty'' NOTIFY packets from
,
, and
that tell
that it has
colored each of those routers. When
and
sent their paint
requests, the group would be formed in exactly the same way as before. The
group would consist only of
, with its children
and
.
Any member of the multicast group can send packets to the group. To do so,
it sends packets to each of its neighbors. For example,
would need
to send only one packet, addressed to
(on the port they have
chosen).
would need to send three packets, one to each of its
children and to its parent,
.
The first important observation is that only one group member in a stub network will have a neighbor in the distribution tree outside of the stub network. A stub network is a subnetwork that is connected to the portion of the network containing the rendezvous by a single link. By definition, a portion of the network that contains the rendezvous is not a stub network.
As the paint requests of the members in the stub network travel toward the rendezvous, they must all traverse their shared, single link, and thus the same router. We will refer to the router on the stub end of the link as a border router. Only one group member may color the router, and only that group member's paint color will be seen outside of the stub network.
Our second observation is that tags will accurately inform requesters when their reflection requests have been performed. We know that tags has been designed to work if the path of the packet being reflected follows the reverse path of the reflection request propagation. This property exactly describes a symmetric network.
Our next observation is that stress will never exceed one on the link connecting a stub network to the rest of the network. The only possible reasons for a packet to traverse such links are to get to the node, X, that has colored the border router or from X to one of its (possibly many) children. However, reflection will always be able to eliminate the return packets by pushing the copies to the external router connected to the border router. Consider the propagation of the reflection request. The request must be of the form (tags elided):
As the reflection propagates toward the border router, every router will offer to perform the copies that name destinations outside of the stub network (foreign nodes). A router would only refuse to perform a copy if its destination is in the same direction a X, which is never the case for foreign nodes as the reflection request propagates away from X toward the border router. Their output interface is always toward the border, away from X.
The fact that a stub network's link has a stress of one is surprisingly powerful because our definition of a stub network is quite broad. For example, if the entire network is a tree (it contains no cycles), then every subtree of the network which excludes the rendezvous is a stub network. This property implies that in such cases, there will be no link with a stress greater than one. Similarly, if a stub network is a tree, stress will never exceed one anywhere in the stub (because each subtree of the stub is a stub). This result is particularly nice when one considers that trees are a common network architecture for networks of small to medium scale. All such networks can expect stress equal to one.
We now explain why a portion of the network that is not a tree can have stress greater than one. We first observe that if the copy entries of a reflection request propagated exactly to the router in which the paint color of the destination of the copy collided with the paint color of the requester, the network would, again, see a stress of exactly one. The distribution tree created would consist of all paths from the receivers to the rendezvous, superimposed on one another. Viewed another way, packets would follow the paint trails from the rendezvous back to the group members. Whenever paints collided, packets would be copied.
Figure 6-2 illustrates why copy entries do not always
propagate exactly to the location of the paint collision of the requester
and the copy destination. Assume that routing is by shortest path and that
each link has unit weight. After being the first to paint to
,
finds itself the parent of
and
. It formulates the following
reflection request (tags elided):
Following Rule 4
offers to perform both copies.
However, when
attempts to pass the request further,
will not
offer to perform either copy. This choice is because
would emit the
copies on the
link, which is the same link on which it will emit
the original (
) packet. The request has stalled.
Furthermore, because it has stalled, the
link has a stress of two
because it must carry a packet for each of
and
.
![]() |
We now consider briefly how relaxing our assumptions affects the conclusions we have drawn. First, we discover that the assumption that the primitives have been implemented at every router is stronger than necessary. Only a router that is located at the collision of two paint requests needs to implement the primitives. All other routers will forward all requests unchanged, which is equivalent to a legacy IP router. Of course, in actual deployment it will be impossible to deploy enabled routers at precisely the locations that they might someday be needed, so it is also interesting to ask how well the emulated multicast system will behave in those cases. This question is examined throughout Chapter 7 with simulations on generated topologies.
If we relax our assumption of asymmetry, we lose our observation that the tag system always informs the requester that a reflection has occurred. As such, we lose the observation that the link connecting a stub network to the rest of the network will have a stress equal to one. However, we may continue to conclude that stress will not exceed one on a stub network's link if the stub contains no asymmetry, even if there is asymmetry elsewhere in the network. This observation is justified because we know that the border router will perform any needed reflections, and we know that the requester will receive accurate notification of that fact through the tags system because the notification will be passed through a symmetric subnetwork.
Having built an IP Multicast emulation layer in the previous section, one possible way to handle heterogeneous receivers is to build RLM on top of the IP Multicast emulation. However, a simpler and more featureful system can be built directly by using the proposed primitives.
RLM is implicitly single source. To see why, imagine each source
broadcasting over a number of multicast groups. If the groups are the same
for each node, then a receiver node will be unable to subscribe to a
high-quality stream from one source and a low-quality stream from another.
Yet if each node uses a separate set of groups then a receiver node will
need to subscribe to (at least) one multicast group for each node in the
group. The number of join experiments in the network would grow as
O
, an untenable situation for large groups.
Using the proposed primitives, an overlay can be set up that uses the IP routing infrastructure to do most of its work, but, when necessary, can fall back to explicit forwarding with transcoding over slow links. Using path painting, nodes arrange themselves into an efficient distribution tree as in IP Multicast emulation. Each node makes reflection requests to forward all traffic among its neighbors in the overlay network. Each node also exchanges congestion information with each of its neighbors [4,2]. If an overlay link is found to be suffering from congestion, then the use of reflection requests to forward along that path is suspended. Instead the stream is thinned at each end of the overlay link and forwarded explicitly. The thinning may take the form of a transcoding to an entirely different lower-fidelity format (as in Video Gateways) or by dropping selected packets that are known to be less important to the reconstruction of the data stream (as in layered multicast approaches).
Allowing stream thinning to occur wherever appropriate creates a system that provides all participants with as much bandwidth as possible to all other participants. No single node or set of nodes has been singled out to receive the best connectivity. When two separate pools of well-connected users are joined by a low-bandwidth connection, the users on each pool will experience high-fidelity contact with the users in their own pool.
The application-level distribution tree is set up as for IP Multicast emulation. Like heterogeneous multicast, out of band communication between parents and children determine the level of transmission that is possible without creating congestion. When possible, reflection is used to increase the efficiency of the transmission, but, as for heterogeneous multicast, pairs of nodes may choose to avoid reflection so that packets may be dropped by the application before the congested link. Unlike heterogeneous multicast, the only possible ``thinning'' strategy is to drop packets. The intent is to transmit a bitwise correct data stream, so transcoding is not an option.
Those dropped packets will have to be retransmitted, and there are a number of possible strategies, depending on the application. First, if links are not congested, but a packet is lost anyway, it is appropriate for the node that detects the loss to immediately request a retransmission from its parent.
On the other hand, when a link is congested and the packet has been dropped explicitly by the parent, such a strategy is self-defeating. The best strategy is to wait until the congestion has subsided before asking for retransmissions.
Upon receiving a retransmission request, a parent has the opportunity to apply an optimization. If the parent receives multiple retransmission requests from its children, it may be most efficient to request the packet from its own parent, even if it possesses the packet itself. Such a request will cause the packet to be efficiently reflected to all children. The alternative is for the parent to employ iterative unicast to each child.
Reliable multicast systems built on IP Multicast are considerably more complex than the protocol sketched here. Two factors contribute to this complexity. First, IP Multicast hides details, so nodes don't know who their parent is. Reliable IP Multicast protocols are greatly complicated by this fact. Because receivers know only what group they in, their only option to ask for rebroadcast is to ask the entire group. A node doesn't know who else to ask. Second, because an application-level system is being built for an express purpose, it is possible to design the application-level nodes for the task at hand. Nodes can contain large disks to support retransmissions long after the original transmission. The designers of IP Multicast could not ask routers to buffers large amounts of data just in case someone might want it later.
i3 could take advantage of packet reflection to decrease latency for its
two hop routes. If i3 is routing packets from X to Y by going through Z, Z
would emit a reflect request:
. At the very least, this would remove last-hop latency and resource
use. Often the request would be pushed well into the network, eliminating
much of the overhead associated with the indirection. The deployment of i3
would be greatly simplified because the choice of location for i3 nodes
would become far less important. A node behind a 56k line would become a
perfectly suitable candidate if routers at the node's ISP supported packet
reflection. Results in Section 7.4.2 indicate that using
reflection in this way would allow i3 to eliminate nearly all of the
associated stretch of a two-hop route in networks that support reflection.
Even at low deployment levels, overhead drops considerably.
Two-hop routing is a common feature of overlay networks. RON uses it to route around delays and broken links. Gnutella uses multi-hop routing to return requested data pseudo-anonymously.
Chord is a distributed hash lookup primitive upon which many peer-to-peer services can be constructed (for example, i3 is built on top of Chord). Chord's lookup algorithm involves the cooperation of numerous nodes in an overlay network. At each node the lookup algorithm picks a new node to forward the lookup request to. After each hop, the request arrives at a node that is ``closer'' to the final destination in a logical address space. Eventually, the request arrives at the intended destination and the lookup completes.
Each overlay node that handles a request may know of several potential next hops: nodes whose logical addresses are closer to the intended target than the current node's. However, each node will only know about a limited number of other nodes. For efficient operation, it is preferred that nodes know about nodes that are physically nearby. In this way, lookups proceed quickly through the logical address space because they travel from node to node among physically well-connected nodes.
Chord then, has the following requirement: given a set of nodes, potentially numbers in the thousands, find a subset of those nodes that are physically nearby. Path painting can provide this functionality. Each node in the network will paint toward a small set of agreed upon addresses. Each resulting notification will contain information about a nearby node. In addition, nodes that appear multiple times are expected to be closer than those that appear only once. Once a few nearby nodes are located, those nodes may share information about other nodes that have been located in the same way, until a sufficiently large subset of nearby nodes is discovered. If necessary, that subset may be pruned by conducting measurements to each member of the pool, but path painting has allowed the nodes to find a useful set to begin those measurements upon.
This chapter illustrates the effectiveness of the proposed primitives. The most important measures of effectiveness are decreased stress and stretch, as defined in Chapter 1. The proposed primitives should decrease stress and stretch in all situations, though they can be expected to be most effective when widely deployed.
Resource utilization is an important concern when proposing a network-layer enhancement. The primitives are intended to have modest space requirements at routers. In addition, the space required should scale slowly, if at all, with the size of the group. Again, we can expect full deployments to meet these goals more easily than sparse deployments.
A final metric, ``misfires'' is of use mainly to assess how well the primitives deal with route asymmetry. Misfires are packets that are reflected more than necessary, leading an edge host to receive duplicates. Misfires, like stress, indicate wasted network resources. However, stress measurements may not account for this waste because of the different paths taken by the extraneous packets.
The simulations presented here use highly-connected transit-stub topologies generated by The Georgia Tech Internetwork Topology Models (GT-ITM) [48] and run on the ns-2 [30] network simulator. All simulations were conducted over ten different 100 router graphs. The parameters used to generate these topologies are from the sample graphs in the GT-ITM distribution; no published work describes parameters that might better model common Internet topologies. Figure 7-1 demonstrates the look of a small transit-stub topology.
![]() |
Each of the nodes in the GT-ITM graphs models an Internet router. To model end hosts in the simulation, an extra node is added at each router.
In the first experiment, the number of participating edge nodes is varied while the number of routers supporting the primitives is held constant. Next, to model incremental deployment, group size is held constant as the number of enabled routers is increased. In the incremental deployment tests, a second set of tests compares the performance of random incremental deployment to an intelligent deployment model.
Finally, a separate facet of incremental deployment is tested. In these tests, the issue of local benefit of deployment is examined. In order to encourage deployment, it is important not only that benefits grow with deployment, but that the benefit is conveyed disproportionately to the areas of the network in which deployment occurs
The primitives should accomplish two things with respect to stress in an ALM system. First, they should reduce stress. Second, and more subtly, they should allow the application to scale better. Larger group sizes should see slower stress growth.
To test these hypotheses, a simple single-source ALM system, as described in Section 6.1 is simulated. Simulations are conducted at increasing levels of deployment and with multiple group sizes. The routers that support the primitives are chosen at random, as are the members of the multicast groups.
Our first hypothesis is that increased deployment of the primitives can reduce stress in ALM systems. Figure 7-2 demonstrates that link stress is decreased as deployment is increased. Furthermore, it shows that stress reduction is most rapid at the lowest deployment levels. This fact is encouraging as it indicates that early adopters will be well rewarded. Finally, it shows that stress reduction occurs for various group sizes. Large group sizes are affected the most, as their stress levels have the most room to improve, but even small groups show approximately a 30% reduction in stress after 20% deployment.
![]() |
Our second hypothesis is that the primitives will allow an ALM system to scale more gracefully. As more routers support the primitives, stress should grow more slowly with group size. Figure 7-3 demonstrates this claim. Average link stress is plotted against group size for various levels of deployment. The slope of these plots is clearly decreasing as more routers support reflection and paint, indicating that stress growth is slower as the primitives become more widespread.
![]() |
To understand the effects of intelligent deployment, we compare network stress under four different deployment scenarios. Figure 7-4 shows the results. It is clear that the enabling of certain routers is far more effective than others. Enabling transit routers, the ``core'' of the simulated network is extremely effective at reducing stress even though they represent only 4% of routers. They are in an excellent location to be effective as rendezvous points for painting, and then to duplicate packets with reflection.
Border routers are nearly as effective as transit routers. When all are enabled, they lie on all of the same paths as the transit routers, thus they can eliminate inefficiencies in nearly all of the same cases. They are slightly less effective for small group sizes because of stresses among the transit routers that cannot be relieved without enabled routers in the core. With larger groups, the border routers become more effective. To understand why, consider the ``zone of responsibility'' of a transit or border router. In each case, the router effectively isolates a portion of the network allowing a single packet arriving at the router to be duplicated to service all members in its subtree. From that point on, however, there is, essentially, iterated unicast to those members from the router. Small groups have few members in each ``zone of responsibility'', so stress remains low. As groups size increases, so does the size of each isolated subnetwork, leading to higher stress. Border routers split the network into smaller subnetworks and avoid stressing the border-transit link, thus lowering stress more effectively for larger groups.
![]() |
The other two deployment scenarios are equally interesting. These scenarios compare two deployment scenarios that involve some randomness, and do not turn on all routers of the given type. In one, half of all border routers are enabled. In the other, 20 stub routers out of a possible 84, are enabled. Despite the considerably larger number of stub routers, the border router strategy is somewhat more effective (though less consistent as shown by the fact that averaging over 10 runs was insufficient to smooth the performance of the strategy). There are two reasons that the stub router strategy is ineffective. First, nothing can be done to ease stress in the core, including the ``expanded core'' consisting of transit routers and border routers. Second, the stub routers will often be ineffective because no group members happen to be located behind them.
![]() |
Figure 7-5 shows that the advantages of deployment are gained in the areas of deployment. While this result is hardly surprising, it is important nonetheless. Local advantage is a critical condition for incremental deployment to make economic sense. Investment by network operators in upgrades will be returned to those investors in the form of decreased local waste.
Although reflection and painting are expected to be used together when appropriate, it is interesting to see how effective they are in isolation. In previous ALM experiments, nodes built distribution trees by using paint to learn about nearby nodes that would become parents or children of the painting node. Then reflection was used to iron out the remaining inefficiencies. In the following experiments, we explore the effectiveness ALM systems that use only one of the primitives.
![]() |
Figure 7-6 shows the effects of deploying routers that support reflection, but not paint, on the same ALM system described earlier. The only difference from previous experiments is that the overlay topology was built at random. Members were added to the tree sequentially. At the time of the join, the members selected a random existing member of the tree to be its parent.
Reflection is clearly beneficial even in the absence of paint. As more routers are enabled stress is lessened. For small groups, stress can be nearly eliminated using only reflection. For larger groups, however, reflection alone is unable to decrease average stress below approximately 2. Reflection can only alleviate stress when it is caused by a packet traveling into, and then back out, of a portion of the network. When stress is cause by the need to transmit a packet multiple times into a stub network, nothing can be done.
![]() |
Figure 7-6 shows the effects of deploying routers that support paint, but not reflection. Again, the same basic ALM experiment was conducted. Like reflection, paint appears useful in isolation. In interesting artifact of overlay topologies built with paint is a tendency toward a stress of two. This stress level can be attributed to the fact that building distribution trees with paint makes no effort to eliminate the ``in and out'' packet paths that reflection would normally be used to eliminate.
We examine the effect of the primitives on latency in a repeat of the experiment of Section 7.2.1. All end-hosts participate in an application-level multicast. A single packet is emitted from the source and arrival times are noted for all group members. For each member, these latencies are compared to the time for IP unicast to transmit from the source directly to the same member. Average stretch is calculated as by computing a stretch for each receiver (latency in ALM divided by latency with unicast) and averaging. The results are graphed in Figure 7-8.
![]() |
Figure 7-8 shows that, as expected, stretch decreases with increased levels of deployment. At 30% deployment, the primitives have eliminated approximately 60% of the latency overhead. At complete deployment, they have eliminated nearly all added latency.
Reflection can reduce the latency of the two-hop route. The waypoint sends
a reflection request that causes the network to route the packet more
efficiently:
In the following experiment, three nodes are chosen at random to act as the endpoints of communication and the waypoint. Two packet transmissions are timed to set baselines before testing reflection. First, normal IP unicast between the source and destination; then, a two-hop unicast route that uses the waypoint to route packets. These represent the two extremes of possible performance. Finally, after the waypoint sends its reflection request, a third time is measured that represents how well reflection has reduced the two-hop situation to IP unicast.
As usual, the experiment is conducted over the 10 transit-stub topologies at various levels of deployment. In each experiment, 100 triples are randomly selected for measurement. Two stretches are calculated for each triple using the IP unicast time as a baseline. Figure 7-9 graphs the result.
As deployment increases the performance of the system moves smoothly from that of two-hop unicast case toward that of single unicast. Again, it is encouraging that improvement is fastest at the early stages of deployment. At 30% deployment, the primitives have eliminated well over half of the overhead.
![]() |
![]() |
Figure 7-10 shows that the average state requirements decrease as deployment increases. This result is unsurprising. Consider the difference between two networks, one of which has one extra enabled router. In general, that router will receive some amount of state from its downstream neighbors. The amount of state associated with a single reflection request decreases as it is propagated (because some routers see that they cannot fulfill portions of the request). Therefore the new node is likely to receive less state that its downstream neighbors had, bringing the average down. Furthermore, in small groups, it will be common for added routers to find themselves completely unused, lowering the average state requirements.
The maximum state required in any router can be as important as average state requirements. If the state requirements are extremely unbalanced, one router may be forced to carry too much information and refuse requests or experience degraded performance. Although the primitives are designed to degrade gracefully under such circumstances, they will surely perform better if they avoid it. In the same experiment as the previous section, we now examine the sizes of the largest reflection table in the network, rather than the average.
![]() |
Figure 7-11 show two interesting facts. First, for smaller group sizes, the maximum state held in any one router is fairly constant, regardless of deployment levels. Second, large groups disproportionately load single routers at low deployment levels, but the maximum state held in any one router decreases as deployment increases. This decrease indicates that the work is more smoothly shared when more routers are enabled.
By comparing Figure 7-11 to Figure 7-10, we can draw one more conclusion. For a given group size, the shapes of the state requirements are similar in each graph. This property indicates that maximum state size requirements tend to move with average state size requirements. This fact gives some encouragement to the notion that in very large networks, such as the Internet, maximum state requirements will remain manageable if deployment is sufficiently high to keep average state size requirements down. This conclusion is speculative, however. The next section explores what can be done when a router finds that its state requirements are excessive.
Section 3.5 described the occurrence of ``misfires'', instances in which packets are duplicated extraneously. These misfires occur when route asymmetry occurs on the portion of a path following duplication, causing the original packet to travel to the reflect requester in a way that prevents success tags from being written correctly.
To study the likelihood of misfires, networks are constructed with intentional asymmetry. Beginning with the same topologies as used previous experiments, links are chosen at random to have their latency doubled in one direction only. These asymmetric links cause asymmetric paths.
![]() |
Figure 7-12 demonstrates that misfires occur more frequently in asymmetric networks. However, the high variability of the results (despite averaging over ten runs for each data point), implies that this experiment may not be perfectly designed to demonstrate the conditions under which misfires will occur more frequently. The characterization of the asymmetry in networks is difficult, however, and as such, a subject of further study.
Overlay networks are an important way for applications to obtain network behavior that would otherwise require widespread router modifications. By their very nature, it is possible to deploy overlay networks with no additional support. Yet doing so creates inefficiencies. Path painting and packet reflection address those inefficiencies with simple router extensions that can be used in creative ways to perform packet routing and duplication at appropriate locations in the network.
Both primitives are incrementally deployable. Overlay networks can be built without them, but in portions of the network in which they are deployed, these systems will be considerably more efficient.
Furthermore, the focus on incremental deployment has created numerous subsidiary benefits. Routers may choose to ignore requests for any reason, ranging from administrative policies, security concerns, or resource exhaustion. All of these cases are handled gracefully because they are functionally identical to routers that do not support the primitives.
In depth analysis of the behavior of the primitives in large configurations has been difficult. A number of factors make analysis and understanding more difficult than the equivalent analysis of IP Multicast protocols.
First, routers have much greater flexibility in reacting to reflection requests. For example, Rule 4 and Rule 5 are both reasonable, but they have the opposite effect on the distance that reflection requests propagate.
Second, the flexibility introduced by the decoupling of paint (IPM's join) and reflection (IPM's forwarding) makes it difficult to create trees exactly like those of IP Multicast. In IP Multicast, branchpoints occur where join messages graft to the distribution tree. Using reflection, they occur where the combination of IP routing and individual router decisions push them. The next section outlines an approach that could give reflection requests enough power to control their eventual location based on information gathered during painting.
Finally, incremental deployment (and similar effects, such as resource based rejection of requests) vastly complicates the possible outcomes compared to IP Multicast. Not only must the analysis now account for two types of network nodes, it must also provide a reasonable model of deployment in order to choose deployed nodes. Deployment in the core of the network, for example, has different effects from deployment at the edges. Yet the deployment models we have considered are entirely guesswork.
More generally, a node might wish to scope its reflection requests, providing some minimum and maximum distance for them to be propagated. This feature could be used to ensure that a reflection request propagates exactly to the router that responded to the node's paint request thereby reducing the uncertainty involved in the current scheme.
Second, both primitives might benefit from a mechanism which would allow a node to ask for requests to be sent to it rather than from it. The basic idea behind this notion is to eliminate difficulties due to asymmetric routes by causing requests (reflection and paint) to follow the same paths as the packets they are intended to affect.
In addition, reflection may be used to amplify existing denial of service attacks. Although reflection is acting ``properly'' by optimizing an operation that the end-hosts could legitimately perform (duplicating and sending packets), ISPs will no doubt wish to limit such activity. Some ISPs may wish to restrict the use of reflection to trusted end-hosts, others may wish to use a dynamic approach that limits the activity of a single reflection entry, or entries submitted from a single end-host. These strategies have not yet been explored.