Lightweight Network Support for Scalable End-to-End Services Kenneth L. Calvert, James Griffioen, Su Wen University of Kentucky http://www.acm.org/sigcomm/sigcomm2002/papers/esp.html Summary for network reading group 2002-Oct-18-Fri Adam M. Costello http://www.nicemice.net/amc/ Motivation: Building end-to-end network services on top of the IP forwarding service can in some cases be difficult, inefficient, and/or unscalable. In some cases, moving some functionality into the routers can make the service simpler, more efficient, and more scalable. Examples: * detecting partial losses in a multicast session * recovering from those losses * aggregating feedback from many nodes (like finding the average/min/max value) * discovering branch points in the topology for placing overlay nodes and links Two classes of approaches: * Targeted simple solutions for particular problems, like: - EXPRESS, SSM (single-source multicast) for multicast tree building & forwarding - parentcast, subcast, randomcast, breadcrumb for multicast loss recovery * General powerful solution (active networks) routers run code contained in packets challenges in security, performance, scalability IP is successful because it is simple, efficient, and scalable. Goal: A simple, efficient, scalable building block that could be added to routers that would benefit a broad range of services. Proposal: ESP = Ephemeral State Processing A packet can contain an ESP instruction, which operates on a small amount of state in the router and in the packet. Ephemeral state: Once created, will be deleted a constant well-known amount of time later (perhaps 10 seconds), regardless of how often it is altered in the meantime. This enables simple implementations with performance guarantees. Contrast with soft state, which can be refreshed. Paper claims these are fundamentally different, but if you implement a refresh by copying the state and invalidating the old copy, then it looks the same to me. I think the only effective difference is that people usually imagine soft state timeouts to be minutes, rather than seconds. Requirements: * packets can leave information in routers for other packets to read or modify * the space-time storage product per packet is bounded * the processing per packet per node is comparable to IP forwarding * anonymous (no need for end hosts and routers to authenticate or track each other) * no need to modify or duplicate the existing network layer Architecture: A packet may contain one ESP instruction (as an option, a shim header, or a payload). Routers that don't support ESP don't see the instruction, and forward the packet normally. Routers that support ESP have one ESP context for each input port and each output port, plus one central context. The instruction contains three flags indicating which context(s) to should execute the instruction. Each context can be parallelized (implemented as multiple independent contexts), so an instruction contains an opaque computation ID to make sure all packets involved in the same computation get hashed to the same context. Each context contains an ephemeral state store--an associative array of tag/value pairs. The store supports two primitives: put(tag,value) and get(tag). Operands of an instruction can be stored pairs (referenced by their tags), or well-known parameters of the router (like MIB variables), or fields in the packet itself. (Do in-packet operands need to be in the ESP instruction, or can they be other IP header fields, like source and destination addresses?) [I assume that for a given instruction, the number of operands of each time is constant or at least bounded, so that the execution time can be bounded.] By using large tags (64 bits in the example implementation) chosen randomly by end hosts, the probability of collisions is made small. Example applications: * Topology discovery - find branch points: Send a mark instruction to one destination that leaves a trail of markers, then send a query instruction to another destination that reads marker locations, and ends up with the location of the last marker. * Controlling packet flow - feedback thinning (COMPARE) Each member sends its value, and ESP nodes record the minimum value seen so far, dropping packets that don't alter the value. Each link will carry on average O(log n) packets if order of arriving values is random. - multicast loss recovery (like breadcrumb) NACKs toward the source leave a trail of markers (or increment a trail of counters), and stop when they hit a marker (or when a counter exceeds a threshold). This avoids implosion, and allows for redundancy (at most k NACKs traverse each link). Multicast data retransmissions contain an ESP instruction that drops packets that would stray from the trail, so only members who request the data receive it. Assumes symmetric paths. More complex mechanism is needed if paths are not symmetric. * Aggregation (COUNT, COLLECT) Two phases: All members first send COUNT, so each node gets a count of the number of child nodes below it. Then each member sends a COLLECT instruction containing its value. The ESP node combines the incoming value with the stored value using a commutative and associative operator (like plus) and decrements the counter. Practical Considerations: Incremental deployment is possible. To start computations, need multicast trigger, or loosely synchronized clocks (to within about a maximal round-trip time). ESP packets can be lost. If instructions need to be retransmitted, they need to be idempotent. For redundancy, idempotent instructions can be sent k times rather than once, and the instruction can cause routers to forward at most k instructions rather than at most one. If an instruction isn't inherently idempotent, it can be made so by doing duplicate detection using Bloom filters. [This seems to be getting overly complex.] Security: Anyone who can eavesdrop on packets can also interfere with computations by injecting ESP packets that use the same tags and computation IDs. Routes need to be stable for the lifetime of the state (about 10 seconds). Implementation and performance: Because the lifetime of a pair is constant, pairs are deleted in the same order they are created. Tags and values are stored in two parallel circular queues. There is a hash table for quickly finding a tag. Using the programmable Intel IXP1200 network processor, hundreds of thousands of ESP packets could be processed per second. A small packet has around 1000 bits, so this corresponds to a bit rate in the hundreds of Mbps. There is room for more parallelism and faster hardware. Evolution: Standardize a few fairly complex instructions (which can be added slowly over time)? Or define a limited microcode that can be used to construct new instructions on demand? Other applications (my ideas, not in the paper): * Building hierarchies Hierarchies of end hosts that are correlated with the network topology could be used for efficient aggregation that is more complex and/or more stateful than what could be done in ESP routers, and could be used for a breadcrumb-like service (for multicasting and multicast loss recover) without any packet duplication service in the routers. If the value of an ESP tag/value pair is large enough to hold a network address, and if ESP instructions can modify not only the ESP field of the packet but also the IP destination field, then a forwarding service similar to parentcast can be implemented using the following ESP instruction: PARENT: let v = get(tag) case v not found: put(tag, source address) source address: do nothing anything else: let destination address = v and change opcode to no-op This is inherently idempotent. Each member can send a few of these toward the root, and any that don't get dropped will all go to the same member, which is the sender's parent in a newly-constructed hierarchy (an overlay tree). Parents can optionally reply to their children, so that subsequent communication need not use ESP, but can simply use regular unicast. The hierarchy will be well correlated with the topology, and the overlay links of the hierarchy will have low latency if all packets were sent at the same time, or in response to a synchronized trigger (see below). Hierarchies can be ephemeral, to implement a breadcrumb-like service. For example, if a subset of nodes want a particular data item, they can all sent a request for it using the PARENT instruction, and after they get the data from their parent, relay it to whomever they got requests from. Pitfall: Some members might have very many children, depending on the fanout in the underlying topology. * Synchronized triggers Sometimes we'd like to use a multicast packet as a trigger to cause members to do something. This is exactly the sort of trigger we'd like to use for sending PARENT packets to build a hierarchy, so that the end host closest (in terms of round-trip time) to a branch point will win the race and become an ancestor of everyone in that subtree. But without multicast in the routers, can we fake this trigger? Yes, if we have an age(tag) primitive that returns a high-resolution measure of the time that a tag has existed in the store. [Shouldn't be hard, the example implementation already stored a low-resolution measure for each tag, so it's just a matter of spending a few more bits.] We also need the ability to read/write the source and destination addresses of the packet, and to tell the difference between a time value and a network address (either by using an extra bit, or encoding time values in an unused portion of the address space). SYNC: let p = value in packet let v = get(tag) case type(p) address: case type(v) not found: put(tag,p) let p = router interface address address: drop packet time: let destination address = p let p = v - age(tag) endcase time: case type(v) not found: drop packet time: drop packet address: let destination address = v put(tag,p) endcase Members send SYNC repeatedly toward a root until a time value comes back, then set a timer that counts down from the value in the packet. The root acts as if it were an ESP node that had stored a time value v0 at time t0, before any SYNC packets had arrived. The timers at the members will expire at the same time that a multicast packet from the root would have arrived if it had been sent at time t0+v0. v0 needs to be greater than the amount of time it takes this whole process to complete, otherwise members will get back negative times (meaning the trigger is already in the past).