Monday 20 March 2017

Trading Timeliness and Accuracy in Geo-Distributed Streaming Analytics


Summary:



Most geo-distributed streaming applications ingest data from geo-distributed sources for analytics. Typically, such settings follow a hub-and-spoke model: numerous distributed data sources send their streams to edge servers, which perform partial computation before forwarding results to a central location responsible for performing any remaining computation, storing the results, and serving analytics to users. In this setting, we must trade off staleness against error in the data, because the WAN bandwidth available to forward traffic from the edge servers to the central location is limited.

The crucial questions are how best to utilize both the edge and the central server, how much computation to offload to the central server, and when to send these partial computations. The paper answers these questions in the context of windowed grouped aggregation, a key primitive for streaming analytics in which data streams are logically subdivided into non-overlapping time windows, records within each window are grouped by common attributes, and a summary is computed for each group. The records of a data stream take the form (k, v), where k is a key and v is a value. Records sharing the same key are grouped together and combined by the aggregation function.
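To make this primitive concrete, here is a minimal sketch of windowed grouped aggregation in Python (the function and variable names are illustrative, not from the paper):

```python
from collections import defaultdict

def windowed_group_aggregate(records, window_start, window_len, aggregate):
    """Group (timestamp, key, value) records falling in [T, T+W) by key
    and reduce each group with the aggregation function."""
    groups = defaultdict(list)
    for ts, k, v in records:
        if window_start <= ts < window_start + window_len:
            groups[k].append(v)
    return {k: aggregate(vs) for k, vs in groups.items()}

# Example: Sum aggregation over the window [0, 10)
records = [(1, "a", 2), (3, "b", 5), (7, "a", 4), (12, "a", 9)]
print(windowed_group_aggregate(records, 0, 10, sum))  # {'a': 6, 'b': 5}
```

The record at timestamp 12 falls outside the window [0, 10) and is excluded; `sum` here stands in for any distributive aggregate such as Sum or Max.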

An aggregation algorithm runs on the edge and takes as input the sequence of data records arriving in each time window [T, T+W). The algorithm produces as output a sequence of updates that are sent to the center, and its goal is either to minimize staleness given an error constraint or to minimize error given a staleness constraint.

Staleness arises both from limited network bandwidth and from delaying transmissions until the end of the window. Error is caused by updates that never reach the center, either because the edge sent only partial aggregates or because it omitted some aggregates altogether. We can therefore reduce staleness by omitting some updates entirely and by sending updates earlier during the window; unfortunately, both options directly introduce error.
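As a rough illustration of why delaying transmissions increases staleness, consider a toy model (an assumption for illustration, not the paper's formulation) in which updates are transmitted one at a time over a constrained link:

```python
def staleness(flush_times, window_end, bandwidth):
    """Toy model: updates are transmitted sequentially at `bandwidth`
    updates/sec, in flush order. Staleness is how long after the window
    ends the final update arrives at the center."""
    finish = 0.0
    for t in sorted(flush_times):
        # Transmission of an update starts once the link is free and
        # the update has been flushed by the edge.
        finish = max(finish, t) + 1.0 / bandwidth
    return max(0.0, finish - window_end)

# Flushing all four updates at window close (pure batching): high staleness.
print(staleness([10, 10, 10, 10], window_end=10, bandwidth=2))  # 2.0
# Spreading flushes through the window lowers staleness, but any key
# updated after its flush contributes error instead.
print(staleness([4, 6, 8, 10], window_end=10, bandwidth=2))  # 0.5
```

This captures the core tension: early flushes drain the link before the window closes, at the cost of emitting aggregates that may still change.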

A study of optimal offline algorithms, for minimizing error under a staleness constraint and minimizing staleness under an error constraint, serves as a reference for designing efficient online algorithms that effectively trade off timeliness and accuracy under bandwidth limitations.


Key Points:

  • Discussion of the challenges in optimizing the error-staleness tradeoff for streaming, batch aggregation, and random sampling approaches.
  • Discussion of the optimal offline algorithms for the two optimization problems, minimizing staleness under an error constraint (EPE) and minimizing error under a staleness constraint (SPEF), which are then used to design and evaluate the online algorithms.
  • Derivation of an alternative offline algorithm, SPEF with Early Omissions (SPEF-EO), from the optimal offline algorithm SPEF (Smallest Potential Error First).
  • The optimal offline algorithms provide two high-level lessons: they flush each key at most once, thereby avoiding wasting scarce network resources, and they make the best possible use of the available network bandwidth.
  • The edge is viewed as a two-level cache abstraction, and online algorithms implement the partition policy that defines the boundary between the two caches.
  • The online error-bound algorithm uses a value-based cache partitioning policy to emulate the offline optimal EPE algorithm; it uses LRU as its cache replacement policy.
  • The online staleness-bound algorithm uses a dynamic sizing-based cache partitioning policy to emulate the offline optimal SPEF-EO algorithm; it also uses LRU for cache replacement.
  • The workload used for the experiments comes from a web analytics service offered by Akamai, a large commercial CDN.
  • Evaluation uses trace-driven simulations as well as experiments with an Apache Storm-based implementation deployed on PlanetLab.
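The cache view of the edge can be sketched roughly as follows. This sketch collapses the two levels into a single LRU cache whose evictions trigger early flushes; the paper's actual algorithms additionally partition the cache and choose the boundary between the levels dynamically, which is omitted here (all names are illustrative):

```python
from collections import OrderedDict

class EdgeAggregationCache:
    """Simplified sketch of an edge cache of per-key partial aggregates.
    Evicting a key flushes its partial aggregate to the center early
    (a source of error if the key is updated again later), while keys
    still cached when the window closes are flushed then (a source of
    staleness)."""

    def __init__(self, capacity, aggregate):
        self.capacity = capacity    # max number of cached keys
        self.aggregate = aggregate  # e.g. lambda old, new: old + new
        self.cache = OrderedDict()  # key -> partial aggregate, LRU order
        self.flushed = []           # (key, value) updates sent early

    def update(self, key, value):
        if key in self.cache:
            value = self.aggregate(self.cache.pop(key), value)
        self.cache[key] = value     # most recently used goes to the end
        if len(self.cache) > self.capacity:
            # Evict the least recently used key; flush it to the center.
            self.flushed.append(self.cache.popitem(last=False))

    def close_window(self):
        # Flush everything remaining at the end of the window.
        remaining = list(self.cache.items())
        self.cache.clear()
        return remaining
```

For example, with capacity 2 and Sum aggregation, updating keys a, b, a, c in order evicts and early-flushes b, while a and c are held until the window closes.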

Strengths:


  • The paper presents windowed grouped aggregation algorithms that trade off staleness and error and that remain practical under dynamic workloads and bandwidth conditions, across a range of error and staleness constraints.
  • Presents a novel two-level cache implementation for practical windowed grouped aggregation algorithms.
  • The techniques described in the paper can be applied across a diverse set of aggregates, from distributive and algebraic aggregates such as Sum and Max to holistic aggregates such as unique count.
  • The authors evaluate their implementation on large query workloads and against several aggregation algorithms, based on streaming, batching, batching with random early updates, and the optimal offline algorithms, making for a thorough evaluation of their design.

Negative Points/Discussion:

  • The approach relies on grouped aggregation of data, so it is not applicable to applications that cannot be expressed as grouped aggregation.
  • The authors assume the computational overhead of the aggregation operator is a small constant relative to the network overhead of transmitting an update; for very compute-intensive aggregation functions, this computational overhead would need to be taken into account.
  • Replication/fault tolerance mechanisms and interconnectivity between edge nodes have not been discussed.

1 comment:

  1. Good job. As far as discussion point #1, can you think of an example? Discussion point #2 is valid, though if computation is added, the problem is completely different. Many applications are modeled by this assumption (mostly aggregation and minimal computation), but not all, of course.

