Monday 20 March 2017

SpanEdge: Towards Unifying Stream Processing over Central and Near-the-Edge Data Centers

Summary:


The authors note several downsides of typical stream processing: 1) communication between the network edge and a stream-processing application hosted in a central data center happens over WAN links, whose bandwidth is expensive and can be scarce; 2) network traffic has increased dramatically because the number of connected devices at the network edge has grown; 3) latency-sensitive applications cannot tolerate the delay introduced by WAN links; and 4) data-movement restrictions arise from data-privacy concerns. To address these problems, the authors introduce SpanEdge, a customized stream processing framework that unifies stream processing across geo-distributed data centers and leverages state-of-the-art solutions for hosting applications close to the network edge. SpanEdge categorizes data centers into two tiers: central data centers in the first tier and near-the-edge data centers in the second tier. By placing stream processing applications close to the network edge, SpanEdge reduces the latency incurred by WAN links and speeds up analysis.


The architecture of SpanEdge closely resembles that of Apache Storm. Unlike Storm, however, SpanEdge has two types of workers: hub-workers, which reside in first-tier (central) data centers, and spoke-workers, which reside in second-tier data centers near the edge.
SpanEdge also enables programmers to group the operators of a stream processing graph as either a local-task or a global-task, where a local-task comprises operators that need to be close to the streaming data sources and a global-task comprises operators that process the results of a group of local-tasks. It also introduces a new scheduler, which converts the stream processing graph into an execution graph by assigning local-tasks to spoke-workers and global-tasks to hub-workers, much like what Storm's scheduler does.
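The grouping-then-placement idea can be illustrated with a minimal sketch. This is my own illustration, not the paper's actual API: the operator names, worker lists, and round-robin assignment are all hypothetical placeholders for whatever the real scheduler does.

```python
# Hypothetical sketch (not the paper's actual API) of how a SpanEdge-style
# scheduler might map grouped operators onto hub- and spoke-workers.

def schedule(operators, spoke_workers, hub_workers):
    """Assign each operator to a worker based on its grouping.

    `operators` is a list of (name, group) pairs, where group is
    "local" or "global"; worker lists are illustrative placeholders.
    """
    assignment = {}
    for i, (name, group) in enumerate(operators):
        if group == "local":
            # local-tasks run near the data sources, on spoke-workers
            assignment[name] = spoke_workers[i % len(spoke_workers)]
        else:
            # global-tasks aggregate local results on central hub-workers
            assignment[name] = hub_workers[i % len(hub_workers)]
    return assignment

ops = [("parse", "local"), ("filter", "local"), ("aggregate", "global")]
print(schedule(ops, ["spoke-1", "spoke-2"], ["hub-1"]))
```

The key point is only the two-way split: operators tagged local land near the edge, operators tagged global land in a central data center.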
  


Strengths:


Fault-tolerance and reliability are realized using Storm's heartbeats and ack messages; SpanEdge likewise inherits all the other advantages of Storm.


The notion of operator grouping provides a general and extensible model for developing arbitrary operations. This enables programmers to implement algorithms on a geo-distributed infrastructure using a standard programming language and stream processing graph.


Programmers can develop a stream processing application, regardless of the number of data sources and their geographical distributions.


The paper retains most of the semantics of existing stream processing frameworks while introducing the novel notion of hub-workers and spoke-workers, which differ in their geographical location relative to the edge and provide low-latency computation.


Since Storm is a pure stream processing framework, it provides low-latency guarantees, which SpanEdge inherits.


The explanations in the paper are presented in a simple and straightforward manner. The evaluation results are impressive and compelling.


Weaknesses / Discussion Points:


Storm is not the state-of-the-art stream processing framework: it has relatively low throughput and provides only at-least-once delivery guarantees. The authors might have obtained better results had they used other stream processing frameworks such as Flink or Spark Streaming.


The output data generated by local tasks could be aggregated/batched before being sent to global tasks.
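A minimal sketch of this batching idea, assuming a hypothetical `send` callback standing in for the WAN transfer to a global task:

```python
# Illustrative sketch (mine, not from the paper): batching local-task
# output before shipping it over the WAN to a global task.

def batch_and_send(records, batch_size, send):
    """Group records into fixed-size batches and hand each batch to `send`,
    cutting the number of WAN messages from len(records) to
    roughly len(records) / batch_size."""
    batch = []
    for r in records:
        batch.append(r)
        if len(batch) == batch_size:
            send(batch)
            batch = []
    if batch:  # flush the final partial batch
        send(batch)

sent = []
batch_and_send(range(7), 3, sent.append)
print(sent)  # three WAN messages instead of seven
```

In practice one would also flush on a timeout so a slow source does not hold records back indefinitely.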


At times I felt that crucial details were missing from the paper. Entities such as the "source discovery service", which provides spoke-workers with a map of streaming data sources, were not discussed at all.


Adaptive filtering, wherein only a subset of the output is passed along the execution graph, could have been incorporated.
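One possible shape for such a filter, purely as an assumption on my part (the score, target rate, and step size are all made up): forward only records scoring above a threshold, and nudge the threshold to keep the forwarded fraction near a target.

```python
# Hypothetical sketch of adaptive filtering: only records whose score
# exceeds a threshold are forwarded, and the threshold drifts to keep
# the forwarded fraction near target_rate.

def adaptive_filter(records, target_rate=0.5, step=0.1):
    """`records` is an iterable of (score, record) pairs."""
    threshold = 0.0
    forwarded = []
    for score, record in records:
        if score >= threshold:
            forwarded.append(record)
            threshold += step * (1 - target_rate)  # forwarded one: raise the bar
        else:
            threshold -= step * target_rate        # dropped one: lower the bar
    return forwarded
```

Pushing such a filter into the spoke-workers would shrink the data crossing the WAN, at the price of lossy results downstream.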


What if the environment is highly dynamic? How do the scheduling algorithms adapt when there is network congestion or when resources are highly mobile?

Caching, acks and flow-control could be adopted between central and near-the-edge data centers.

2 comments:

  1. Nice analysis. DP #2 may make sense -- doing some in-network aggregation before sending to the hub. DP #3 -- by dropping data they are emulating filtering. DP #4 -- how would you modify scheduling to take network dynamics into account?

  2. One way is to use dynamic heuristics: calculate a metric based on factors such as the cost of latency, bandwidth consumption, and the latency reduction had the data been assigned to a task in another local data center. If such a move looks profitable, take it; otherwise, do not. The model can also be made more robust by using past data to predict future moves.
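    A rough sketch of that heuristic (the factor names and weights are purely illustrative assumptions, not from the paper or the comment thread):

    ```python
    # Hypothetical cost/benefit heuristic for moving a task to another
    # data center; weights are illustrative, not tuned values.

    def move_benefit(latency_saved_ms, extra_bandwidth_mb,
                     latency_weight=1.0, bandwidth_cost_per_mb=0.2):
        """Positive result means the move looks profitable."""
        return (latency_weight * latency_saved_ms
                - bandwidth_cost_per_mb * extra_bandwidth_mb)

    def should_move(latency_saved_ms, extra_bandwidth_mb):
        return move_benefit(latency_saved_ms, extra_bandwidth_mb) > 0

    print(should_move(50, 10))   # large latency win, small bandwidth cost
    print(should_move(1, 100))   # tiny win, large bandwidth cost
    ```

    Historical observations could then be used to tune the weights over time, as the comment suggests.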

