Tuesday, 7 March 2017

Nebula: Distributed Edge Cloud for Data Intensive Computing

Summary
------------
Many big data applications rely on geographically distributed data. With existing frameworks built on centralized computational resources, a non-trivial portion of the execution time and budget for processing such data is consumed by data upload. Another problem with existing approaches is the high overhead of instantiating virtualized cloud resources. To address these problems, the paper proposes Nebula, a distributed cloud infrastructure that uses volunteer nodes as edge clouds for data-intensive computing. Large amounts of data processing can be done at the edge, close to the data, resulting in significant data compression and thus reducing the time and cost of transporting the data for centralized processing.
The Nebula system architecture consists of the following components:
Nebula Central : A front-end web-based portal through which volunteer nodes join the Nebula system and application developers inject code into it.
DataStore : A simple per-application storage service consisting of the volunteer nodes that store the data, and the DataStore Master, which manages how and where data is stored on the volunteer nodes.
ComputePool : A per-application compute service consisting of the volunteer compute nodes, and the ComputePool Master, which schedules and coordinates execution on the volunteer compute nodes.
Nebula Monitor : A central system that monitors the performance of the volunteer nodes and the network characteristics. The DataStore Master and the ComputePool Master use it for data placement and scheduling.


Strengths
------------
  • The paper is well-detailed and clearly explains and analyzes all the design decisions and performance results.
  • Location-aware data placement and scheduling, together with replication, help make Nebula highly efficient, scalable and fault tolerant for data-intensive computations.
  • Provides performance comparisons against the existing volunteer-based computing models: Central Source Central Intermediate Data, and Central Source Distributed Intermediate Data.
  • Proper care is taken to protect volunteer nodes from malicious code.
  • The design decisions have been well thought out, and also consider specific boundary cases:
    • The locality-aware scheduler limits the number of tasks per node in each scheduler iteration, to avoid assigning many concurrent tasks to high-speed nodes.
    • To avoid resource wastage, the timeout value (used to label a compute node as unresponsive) is set large enough to give a node the chance to make progress if it becomes responsive again quickly.
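
The two boundary cases above can be illustrated with a minimal sketch. This is not Nebula's actual scheduler; the greedy strategy, the function names (`schedule_iteration`, `is_unresponsive`), the `max_tasks_per_node` parameter and the cost table are all assumptions made for illustration.

```python
from collections import defaultdict


def schedule_iteration(tasks, nodes, transfer_cost, max_tasks_per_node=2):
    """Greedy locality-aware assignment for ONE scheduler iteration.

    transfer_cost maps (task, node) -> estimated cost of fetching the
    task's input to that node (hypothetical numbers, e.g. seconds).
    Tasks that cannot be placed are left for a later iteration.
    """
    assigned = {}
    load = defaultdict(int)
    # Consider the cheapest (task, node) pairs first.
    pairs = sorted((transfer_cost[(t, n)], t, n) for t in tasks for n in nodes)
    for _cost, task, node in pairs:
        if task in assigned or load[node] >= max_tasks_per_node:
            continue  # task already placed, or node reached its cap
        assigned[task] = node
        load[node] += 1
    return assigned


def is_unresponsive(last_heartbeat, now, timeout):
    """A node is written off only after a generous timeout has elapsed."""
    return now - last_heartbeat > timeout


# The "fast" node is cheapest for every task, but the cap spreads the work.
tasks = ["t1", "t2", "t3"]
nodes = ["fast", "slow"]
cost = {(t, "fast"): 1.0 for t in tasks}
cost.update({(t, "slow"): 5.0 for t in tasks})
plan = schedule_iteration(tasks, nodes, cost, max_tasks_per_node=2)
```

Without the cap, all three tasks would land on the fast node; with it, the third task goes to the slower node instead of queueing behind the other two.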
  
Weaknesses
---------------
  • The paper assumes that the data has already been stored in Nebula, and that decomposition of input files is not needed because the number of files is much larger than the number of tasks.
  • Nebula Monitor, DataStore Master, and ComputePool Master may act as single points of failure.
  • Edits or appends on files stored on the DataStore are not supported.
  • The scalability result obtained with 30 volunteer nodes could have been validated further by running an experiment with at least hundreds of nodes, since thousands of nodes can be expected to join a volunteer-based system.

Discussion Points
----------------------
  • The paper does not discuss the privacy of data/code from the perspective of the person initiating the computation. Should the data/code be encrypted?
  • Usually, there is a tradeoff between replication and performance when there are no failures. However, the results in the paper show that the runtime with replication is lower even in the no-failure case, because with more replicas the ComputePool Master has more choices for assigning tasks to compute nodes closer to the data. It would be interesting to see at what replication factor performance starts to degrade due to replication.
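
The reasoning in the last point (more replicas give the scheduler more nearby choices) can be shown with a small sketch. The node names, the cost table and the helper `best_fetch_cost` are hypothetical and are not taken from the paper.

```python
def best_fetch_cost(replica_nodes, compute_nodes, cost):
    """Lowest estimated cost of moving one input file from any replica
    to any compute node, given cost[(storage_node, compute_node)]."""
    return min(cost[(r, c)] for r in replica_nodes for c in compute_nodes)


# Hypothetical transfer costs between storage nodes s* and compute nodes c*.
cost = {
    ("s1", "c1"): 8.0, ("s1", "c2"): 6.0,
    ("s2", "c1"): 2.0, ("s2", "c2"): 9.0,
    ("s3", "c1"): 7.0, ("s3", "c2"): 1.0,
}
compute_nodes = ["c1", "c2"]

one_replica = best_fetch_cost(["s1"], compute_nodes, cost)
two_replicas = best_fetch_cost(["s1", "s2"], compute_nodes, cost)
three_replicas = best_fetch_cost(["s1", "s2", "s3"], compute_nodes, cost)
```

Adding a replica can only lower (or keep) the best fetch cost, because the minimum is taken over a larger set. The sketch deliberately ignores the cost of creating the replicas, which is where the tradeoff raised in this discussion point would come from.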


For additional information related to Nebula -
[1] http://dcsg.cs.umn.edu/Projects/Nebula/

1 comment:

  1. Good job. For the last discussion point -- there are certainly tradeoffs -- as workload increases replication takes compute cycles away from other tasks, e.g.
