Lightweight Network Support for Scalable End-to-End Services
Kenneth L. Calvert, James Griffioen, Su Wen
University of Kentucky
http://www.acm.org/sigcomm/sigcomm2002/papers/esp.html

Summary for network reading group 2002-Oct-18-Fri
Adam M. Costello
http://www.nicemice.net/amc/


Motivation: Building end-to-end network services on top of the IP
forwarding service can in some cases be difficult, inefficient, and/or
unscalable.  In some cases, moving some functionality into the routers
can make the service simpler, more efficient, and more scalable.

Examples:
  * detecting partial losses in a multicast session
  * recovering from those losses
  * aggregating feedback from many nodes
    (like finding the average/min/max value)
  * discovering branch points in the topology for placing
    overlay nodes and links

Two classes of approaches:
  * Targeted simple solutions for particular problems, like:
      - EXPRESS, SSM (single-source multicast)
        for multicast tree building & forwarding
      - parentcast, subcast, randomcast, breadcrumb
        for multicast loss recovery
  * General powerful solution (active networks)
    routers run code contained in packets
    challenges in security, performance, scalability

IP is successful because it is simple, efficient, and scalable.

Goal: A simple, efficient, scalable building block that could be added
to routers that would benefit a broad range of services.

Proposal: ESP = Ephemeral State Processing

A packet can contain an ESP instruction, which operates on a small
amount of state in the router and in the packet.

Ephemeral state:  Once created, will be deleted a constant well-known
amount of time later (perhaps 10 seconds), regardless of how often it
is altered in the meantime.  This enables simple implementations with
performance guarantees.

Contrast with soft state, which can be refreshed.  Paper claims these
are fundamentally different, but if you implement a refresh by copying
the state and invalidating the old copy, then it looks the same to me.
I think the only effective difference is that people usually imagine
soft state timeouts to be minutes, rather than seconds.

Requirements:
  * packets can leave information in routers for other packets to
    read or modify
  * the space-time storage product per packet is bounded
  * the processing per packet per node is comparable to IP forwarding
  * anonymous (no need for end hosts and routers to authenticate or
    track each other)
  * no need to modify or duplicate the existing network layer

Architecture:

A packet may contain one ESP instruction (as an option, a shim
header, or a payload).  Routers that don't support ESP don't see the
instruction, and forward the packet normally.

Routers that support ESP have one ESP context for each input port
and each output port, plus one central context.  The instruction
contains three flags indicating which context(s) to should execute the
instruction.

Each context can be parallelized (implemented as multiple independent
contexts), so an instruction contains an opaque computation ID to make
sure all packets involved in the same computation get hashed to the same
context.

Each context contains an ephemeral state store--an associative array of
tag/value pairs.  The store supports two primitives: put(tag,value) and
get(tag).

Operands of an instruction can be stored pairs (referenced by their
tags), or well-known parameters of the router (like MIB variables), or
fields in the packet itself.  (Do in-packet operands need to be in the
ESP instruction, or can they be other IP header fields, like source and
destination addresses?)  [I assume that for a given instruction, the
number of operands of each time is constant or at least bounded, so that
the execution time can be bounded.]  By using large tags (64 bits in the
example implementation) chosen randomly by end hosts, the probability of
collisions is made small.

Example applications:

* Topology discovery
    - find branch points:
      Send a mark instruction to one destination that leaves a trail of
      markers, then send a query instruction to another destination that
      reads marker locations, and ends up with the location of the last
      marker.

* Controlling packet flow
    - feedback thinning (COMPARE)
      Each member sends its value, and ESP nodes record the minimum
      value seen so far, dropping packets that don't alter the value.
      Each link will carry on average O(log n) packets if order of
      arriving values is random.

    - multicast loss recovery (like breadcrumb)
      NACKs toward the source leave a trail of markers (or increment a
      trail of counters), and stop when they hit a marker (or when a
      counter exceeds a threshold).  This avoids implosion, and allows
      for redundancy (at most k NACKs traverse each link).  Multicast
      data retransmissions contain an ESP instruction that drops packets
      that would stray from the trail, so only members who request the
      data receive it.  Assumes symmetric paths.  More complex mechanism
      is needed if paths are not symmetric.

* Aggregation (COUNT, COLLECT)
    Two phases:  All members first send COUNT, so each node gets a
    count of the number of child nodes below it.  Then each member sends
    a COLLECT instruction containing its value.  The ESP node combines the
    incoming value with the stored value using a commutative and
    associative operator (like plus) and decrements the counter.

Practical Considerations:

  Incremental deployment is possible.

  To start computations, need multicast trigger, or loosely synchronized
  clocks (to within about a maximal round-trip time).

  ESP packets can be lost.  If instructions need to be retransmitted,
  they need to be idempotent.  For redundancy, idempotent instructions
  can be sent k times rather than once, and the instruction can cause
  routers to forward at most k instructions rather than at most one.  If
  an instruction isn't inherently idempotent, it can be made so by doing
  duplicate detection using Bloom filters.  [This seems to be getting
  overly complex.]

  Security:  Anyone who can eavesdrop on packets can also interfere
  with computations by injecting ESP packets that use the same tags and
  computation IDs.

  Routes need to be stable for the lifetime of the state (about 10
  seconds).

Implementation and performance:

  Because the lifetime of a pair is constant, pairs are deleted in
  the same order they are created.  Tags and values are stored in two
  parallel circular queues.  There is a hash table for quickly finding a
  tag.

  Using the programmable Intel IXP1200 network processor, hundreds of
  thousands of ESP packets could be processed per second.  A small
  packet has around 1000 bits, so this corresponds to a bit rate in the
  hundreds of Mbps.

  There is room for more parallelism and faster hardware.

Evolution:

  Standardize a few fairly complex instructions (which can be added
  slowly over time)?  Or define a limited microcode that can be used to
  construct new instructions on demand?

Other applications (my ideas, not in the paper):

* Building hierarchies

  Hierarchies of end hosts that are correlated with the network topology
  could be used for efficient aggregation that is more complex and/or
  more stateful than what could be done in ESP routers, and could be
  used for a breadcrumb-like service (for multicasting and multicast
  loss recover) without any packet duplication service in the routers.

  If the value of an ESP tag/value pair is large enough to hold a
  network address, and if ESP instructions can modify not only the
  ESP field of the packet but also the IP destination field, then a
  forwarding service similar to parentcast can be implemented using the
  following ESP instruction:

  PARENT:
      let v = get(tag)
      case v
          not found:       put(tag, source address)
          source address:  do nothing
          anything else:   let destination address = v
                           and change opcode to no-op

  This is inherently idempotent.  Each member can send a few of these
  toward the root, and any that don't get dropped will all go to the
  same member, which is the sender's parent in a newly-constructed
  hierarchy (an overlay tree).  Parents can optionally reply to their
  children, so that subsequent communication need not use ESP, but can
  simply use regular unicast.  The hierarchy will be well correlated
  with the topology, and the overlay links of the hierarchy will have
  low latency if all packets were sent at the same time, or in response
  to a synchronized trigger (see below).

  Hierarchies can be ephemeral, to implement a breadcrumb-like service.
  For example, if a subset of nodes want a particular data item, they
  can all sent a request for it using the PARENT instruction, and after
  they get the data from their parent, relay it to whomever they got
  requests from.

  Pitfall:  Some members might have very many children, depending on the
  fanout in the underlying topology.

* Synchronized triggers

  Sometimes we'd like to use a multicast packet as a trigger to cause
  members to do something.  This is exactly the sort of trigger we'd
  like to use for sending PARENT packets to build a hierarchy, so that
  the end host closest (in terms of round-trip time) to a branch point
  will win the race and become an ancestor of everyone in that subtree.
  But without multicast in the routers, can we fake this trigger?

  Yes, if we have an age(tag) primitive that returns a high-resolution
  measure of the time that a tag has existed in the store.  [Shouldn't
  be hard, the example implementation already stored a low-resolution
  measure for each tag, so it's just a matter of spending a few more
  bits.]

  We also need the ability to read/write the source and destination
  addresses of the packet, and to tell the difference between a time
  value and a network address (either by using an extra bit, or encoding
  time values in an unused portion of the address space).

  SYNC:
      let p = value in packet
      let v = get(tag)
      case type(p)
          address: case type(v)
                       not found: put(tag,p)
                                  let p = router interface address

                       address:   drop packet

                       time:      let destination address = p
                                  let p = v - age(tag)
                   endcase

          time:    case type(v)
                       not found: drop packet
                       time:      drop packet

                       address:   let destination address = v
                                  put(tag,p)
      endcase

  Members send SYNC repeatedly toward a root until a time value comes
  back, then set a timer that counts down from the value in the packet.

  The root acts as if it were an ESP node that had stored a time value
  v0 at time t0, before any SYNC packets had arrived.  The timers at the
  members will expire at the same time that a multicast packet from the
  root would have arrived if it had been sent at time t0+v0. v0 needs
  to be greater than the amount of time it takes this whole process to
  complete, otherwise members will get back negative times (meaning the
  trigger is already in the past).