Thursday 23 March 2017

QuiltView: a Crowd-Sourced Video Response System



Gist of the paper:

QuiltView leverages Google Glass micro-interactions to give users a first-person video viewpoint, letting them send real-time queries to other subscribed users, who reply with short video snippets. Building on this, the paper considers creating a real-time social network that capitalizes on how much richer video is than text. A few more use cases are mentioned, such as finding a missing child, traffic emergencies, and a free-food finder.
The system also caches results and uses geolocation and query-similarity detection to limit the number of queries each user receives.

Architecture: 
A global catalog of users and their preferences, queries, and responses, with YouTube URLs of the videos uploaded by responding users, is maintained (both the uploading and viewing of videos use standard YouTube mechanisms wrapped inside the QuiltView query and response software).
The QuiltView catalog is used as a result cache that short-circuits query processing.
Queries are posed using a web interface to Google Maps (the paper includes a screenshot of this interface).




Implementation Details:
  1. Glass Client: Implemented with the Glass Development Kit (GDK) and Google Cloud Messaging as the service to talk to the Glass devices from the central cloud.
  2. QuiltView Service: QuiltView is implemented as a web-based service in a single virtual machine at Amazon EC2 East. Standard load-balancing and scaling mechanisms for web-based services could be used in the future to cope with increased load. They assume one device belongs to only one user.
  3. Query Workflow:
    1) The user zooms into the area they want responses from; a location-share link is created on this map.
    2) They then type the text of the query, upload any associated image (such as a picture of a missing child or pet), and add details such as the reward offered and the desired timeliness of responses.
    3) The query optimizer then checks whether cached results already exist for a similar query.
  4. Query Similarity Detection: They use Gensim, an open-source framework for unsupervised semantic modeling, with a Latent Dirichlet Allocation (LDA) model for topical inference (a minimal sketch of this similarity check appears after this list). They describe three scenarios: when the similarity detection is correct, when it gives a false positive, and when it gives a false negative. A false negative is, as they note, merely an 'opportunity lost', but false positives can be costly.
  5. Synthetic Users: Because not enough real Glass devices were available, most of the Glass devices were simulated. These synthetic users do not send out queries; they act only as responders.
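The paper does not include code for the similarity check, so the following is only a minimal, hypothetical sketch of how an LDA-based cache lookup might work with Gensim. The cached queries, threshold, and function names are illustrative assumptions, not taken from the paper (which trains its model on a Wikipedia corpus).

# Hypothetical sketch of QuiltView-style query similarity detection with Gensim LDA.
# All names (cached_queries, SIMILARITY_THRESHOLD, lookup) are illustrative.
from gensim import corpora, models, similarities

cached_queries = [
    "where is free food on campus today",
    "how long is the line at the coffee shop",
    "is there parking near the stadium",
]
SIMILARITY_THRESHOLD = 0.8  # above this, reuse the cached video responses

tokenized = [q.lower().split() for q in cached_queries]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Small LDA model for topical inference (the paper uses a much larger corpus).
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10)
index = similarities.MatrixSimilarity(lda[corpus])

def lookup(new_query):
    """Return the index of a sufficiently similar cached query, or None."""
    bow = dictionary.doc2bow(new_query.lower().split())
    sims = index[lda[bow]]
    best_idx, best_score = max(enumerate(sims), key=lambda pair: pair[1])
    return best_idx if best_score >= SIMILARITY_THRESHOLD else None

hit = lookup("any free food around campus right now")
print("cache hit:" if hit is not None else "no hit; dispatch to responders",
      cached_queries[hit] if hit is not None else "")

A hit would return the cached YouTube URLs from the catalog instead of dispatching the query to responders; a miss would fall through to the normal query workflow.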

Key Assumptions:
  1. User distraction is the major cost (higher than network bandwidth, storage, CPU, and power utilization). [Thus the cost of uploading videos to YouTube is not discussed further.]
  2. Raw data vs. refined data: they note that transmitting raw data such as video conveys more than delivering refined data.
  3. Requests for videos are relatively rare (a few an hour, perhaps, for a typical user).

From a distributed-systems point of view:


Security aspects
  1. Glass clients use SSL to communicate with this service. 
  2. They assume one device belongs to only one user, and establish decentralized authentication based on the BrowserID protocol, which allows a user to verify their identity via a participating email provider's OpenID or OAuth gateway. No new password creation is involved. While that helps reduce user interaction, it does not amount to strong privacy protection.
Fault Tolerance: Not much is said about this in the paper, which may itself be a negative. The reliance on Django could also be a point of contention: in the reviewer's view, a minor update to a dependency can break the code, making the stack fragile.

Power Utilization: The paper does not discuss it, but does rank it below user distraction. Participating devices currently poll the cloud, which is a drawback on this front.

Load Balancing: Since user distraction is their costliest metric, load balancing is framed around it (a small sketch of the responder-selection step follows). Users can specify preferences for not receiving queries at certain times or in certain geographic locations, and a cap on the number of queries they are willing to receive in a given period. Moreover, result caching with query-similarity detection means multiple redundant videos are not uploaded, and some cached results can be returned to the user directly.
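The paper describes responder selection as preference filtering followed by a random draw. Below is a minimal sketch of that step; the record layout, field names, and thresholds are assumptions made for illustration, not QuiltView's actual data model.

# Illustrative responder filtering for a QuiltView-like service.
import random
from datetime import datetime

# Field names (quiet_hours, max_queries_per_hour, ...) are assumptions, not from the paper.
responders = [
    {"id": "u1", "quiet_hours": range(22, 24), "max_queries_per_hour": 3, "recent_queries": 1, "in_blocked_region": False},
    {"id": "u2", "quiet_hours": range(0, 7),  "max_queries_per_hour": 2, "recent_queries": 2, "in_blocked_region": False},
    {"id": "u3", "quiet_hours": range(9, 17), "max_queries_per_hour": 5, "recent_queries": 0, "in_blocked_region": True},
]

def eligible(responder, hour):
    """Respect the user's preferences: time windows, location opt-outs, and rate caps."""
    return (hour not in responder["quiet_hours"]
            and not responder["in_blocked_region"]
            and responder["recent_queries"] < responder["max_queries_per_hour"])

def select_responders(responders, desired, hour=None):
    hour = datetime.now().hour if hour is None else hour
    candidates = [r for r in responders if eligible(r, hour)]
    # QuiltView then randomly chooses the desired number of users from the eligible set.
    return random.sample(candidates, min(desired, len(candidates)))

print([r["id"] for r in select_responders(responders, desired=2, hour=12)])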

CPU Utilization: The paper, again, does not address this directly.
Network Latency: The paper does not discuss this either.
ML Processing Latency: The paper does not discuss this either.


Disadvantages:
Absence of concrete metrics to judge the working system. Many issues have also not been addressed, some of which are discussed below.
1) “Glass users may be spread over a large area such as a city or a county” - this means they only consider geographical spans of a city/county, and given how few Glass devices are in people's hands anyway, that may be too few users to serve any actual use case.

2) For the Missing Child Use Case: “Many video responses are immediately received by the police, and they are soon able to apprehend the suspect and rescue the child”
In many use cases a single query calls for multiple responses, and the user has no way of regulating how many responses they get. Many of these may not be relevant to the user at all.

3) QuiltView is implemented as a web-based service in a single virtual machine at Amazon EC2 East, so scaling has not been tested on this front.
If it were to scale, note that devices poll the central cloud and are also mobile; this would raise additional questions about which server a device should poll, piling more compute-related strain onto the device.

4) The user needs to have an idea of the geographic location from which they want responses.

5) Nothing is said about network latency, CPU utilization, or power concerns (apart from moving away from polling in the future).

6) The use of Django (which, as noted above, the reviewer considers fragile).

7) False positives: if the user is served cached results, they may have to spend time rejecting some of them, which adds to user distraction and makes the operation costly; nothing is said about how much more costly it becomes.

8) The speed of the ML processing is not mentioned; it may add processing latency (the text corpus is a 9 GB dump of English Wikipedia).

9) Does not fully explain how the subset of potential responders is chosen: in the Load Balancing section, the authors say that after a small subset of users has been obtained based on their preferences, “QuiltView randomly chooses the desired number of users and delivers the query to them,” which appears to be at odds with what is suggested at the beginning of the paper.

10) The synthetic users do not pose any queries and do not move.


Advantages :
1) Leveraging the Richness of videos
2) A video-led social networking concept seems to hold potential.

Some of the future work that the paper acknowledges :
1) Dynamic adaptation of estimates based on actual experience is possible and more sophisticated user selection mechanisms (such as those based on user reputation)
can be envisioned for the future.

2) Scaling the number of central servers.

3) GCM-based notifications instead of polling from the central server.

4) Introducing mobility in the synthetic users.


Suggestion Points from Reviewer :
  1. Can we have a tag based approach to finding similarities?
  2. Should the user be able to choose how many responses they want? Otherwise this could lead to network congestion.
  3. If videos can be uploaded as part of queries, some constraints must apply there as well; and even if videos are part of queries, not many responders might follow through, given the user distraction and upload bandwidth involved. Others' thoughts on this are welcome.
------------  Edit 1  ---------------
Suggestion/Discussion Points contd.

1. Privacy: Privacy is a concern because videos are taken and transmitted publicly. This could facilitate criminal activity against people captured in the videos, whose day-to-day activities could be tracked. This breaches not just individual privacy but potentially the privacy laws of a region/state.

2. Local/Edge Processing:
1) One way to leverage this is peer-to-peer checking of cache metadata, so that communication with the central cloud is not always needed.
2) Furthermore, the ML processing might also be computed at the edge by the Glass devices, or by a set of volunteer nodes standing ready to do the ML processing. However, given that a 9 GB text corpus is used as the base, this does not seem like an edge-friendly task.

3. Reliability: Reliability of videos received by the user is questionable due to a few issues:
1) They might be getting stale cached data, and would have no way of knowing whether to reject the video in that case.
2) There is still no way of ascertaining the reliability of the videos received.

Timestamp: Thu, 1:19 pm

Wednesday 22 March 2017

Medusa: A programming framework for Crowd-Sensing Applications

This study proposes a novel approach to crowd-sensing that leverages the large number of smartphone users. Crowd-sensing is the idea of retrieving sensor data by distributing work to smartphone users. The paper is an innovative integration of two lines of research: I) supporting incentives and worker mediation (the involvement of human workers in the tasks), and II) enabling requestors (job recruiters) to gather sensor data from mobile users with minimal human interference.

The major contributions of this paper are MedScript, a high-level programming language for crowd-sensing tasks that can be used even by non-technical requestors with a reasonable degree of ease, and Medusa, a runtime engine that spans the cloud and the smartphone. The way it works is that a requestor writes a task in MedScript, divided into stages with incentives attached; workers can then join in (via the Medusa runtime on their smartphones) to complete those stages and earn the incentives. Workers then have the option to choose which parts of the data are sent to the cloud for requestors to observe and verify. Requestors can specify which stages of a task require human intelligence. The approach also supports reverse incentives, meaning workers can pay requestors to nominate themselves for tasks.
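MedScript itself is defined in the paper; the sketch below is not MedScript syntax but an illustrative Python structure capturing the same ideas: a task split into stages, some flagged as requiring human mediation, an incentive on completion, and an opt-in upload stage. All names and values are assumptions.

# Illustrative (not MedScript) representation of a Medusa-style crowd-sensing task.
video_documentation_task = {
    "name": "video_documentation",
    "incentive_usd": 5.00,          # paid to the worker on successful completion
    "reverse_incentive": False,     # True would mean workers pay to be nominated
    "stages": [
        {"name": "recruit",      "human": False, "op": "RECRUIT_WORKERS"},
        {"name": "record_video", "human": True,  "op": "VIDEO_CAPTURE"},
        {"name": "review",       "human": True,  "op": "CURATE_CLIPS"},
        {"name": "upload",       "human": True,  "op": "UPLOAD_DATA",
         "opt_in": True},        # worker chooses which data leaves the phone
    ],
}

def stages_needing_worker(task):
    """Stages the requestor has flagged as requiring human intelligence."""
    return [s["name"] for s in task["stages"] if s["human"]]

print(stages_needing_worker(video_documentation_task))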

In designing these, authors followed three core architectural principles which were:
I) Partitioned services: providing a collection of services for both the cloud and smartphones,
II) Dumb smartphones: minimizing the amount of task-execution state kept on smartphones, to prevent data loss when a smartphone fails (or is simply turned off), and
III) Opt-in data transfers: before any data is uploaded to the cloud, user permission is required, and users can opt out of sending part (or all) of the data.

Strengths:

1) The paper in general is well written, with good detailed examples. Moreover, it was really helpful to have a single example carried throughout the paper to explain so many different components.

2) The evaluation metrics chosen are appropriate to justify the importance of Medusa. It shows a two-order-of-magnitude reduction in lines of code compared with corresponding standalone applications, which is impactful. It was also impressive that the authors did not hesitate to mention the bottlenecks of their approach and where it can be improved; for example, they identify sending SMS messages to workers as a major bottleneck.

3) The stage libraries (stages are the elemental operations required to complete a task) are extensible; if requestors need more functionality, they can implement it themselves.

4) The high level language MedScript is really flexible, and it is easier for non-technical requestors to implement a task using this language.

5) The failure handling is explained really well and sounds reasonable for each component and its functionality.

Discussions:

1) Worker acquisition cost: One thing the authors have not discussed much is the acquisition cost of workers (smartphone users). It was sometimes confusing how workers would be acquired and at what cost. For example, the authors mention that a requestor now needs to hire 50 people in the hope of getting 15 right; it is worth noting that the requestor must then also spend effort curating the work of 50 workers instead of 15.

2) Scalability: The authors discuss scalability in terms of the number of task instances running, which is really nice. It would also be interesting to see how the system performs in a many-worker scenario, where a huge number of workers is required to achieve a common goal.

3) Liability: This system can quickly turn into a liability problem if not handled properly, because the requestor no longer has direct control over the tasks performed by the workers up to the point the data is uploaded to the cloud. For example, in the video-documentation task, what if a worker plagiarizes videos from somewhere else and the requestor has no way of verifying that?

4) Frequent failures: What if a task has a deadline associated with it, but that deadline is not met because of frequent failures (powering off) of smartphones?

Monday 20 March 2017

Trading Timeliness and Accuracy in Geo-Distributed Streaming Analytics


Summary:



Most geo-distributed streaming applications ingest data from geo-distributed sources for analytics. Typically, such settings follow a hub-and-spoke model: numerous distributed data sources send their streams to edge servers, which perform partial computation before forwarding results to a central location responsible for any remaining computation, storing the results, and providing the analytics service to users. In this setting we must trade off staleness against error in the data, because the WAN bandwidth available for forwarding traffic from the edges to the central location is limited.

The crucial question is how best to utilize both the edge and the central server: how much computation to offload to the center, and when to send the partial computations? The paper answers these questions in the context of windowed grouped aggregation, a key primitive for streaming analytics, where data streams are logically subdivided into non-overlapping time windows within which records are grouped by common attributes and a summary is computed for each group. The records are of the form (k, v), where k is a key and v is a value; records with the same key are grouped and combined by the aggregation function.
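A minimal sketch of the windowed grouped aggregation primitive described above: records in [T, T+W) are grouped by key and a summary (Sum is used here purely as an example) is computed per group. This only illustrates the primitive, not the paper's edge algorithms.

# Minimal windowed grouped aggregation over (timestamp, key, value) records.
from collections import defaultdict

def windowed_group_aggregate(records, window_start, window_len, agg=sum):
    """Group records falling in [T, T+W) by key and aggregate their values."""
    groups = defaultdict(list)
    for ts, key, value in records:
        if window_start <= ts < window_start + window_len:
            groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}

records = [(0, "us", 3), (1, "eu", 5), (2, "us", 4), (9, "eu", 1), (12, "us", 7)]
print(windowed_group_aggregate(records, window_start=0, window_len=10))  # {'us': 7, 'eu': 6}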

An aggregation algorithm runs on the edge and takes as input the sequence of arrivals of data records in each time window [T, T +W). The algorithm produces as output a sequence of updates that are sent to the center and its goal is to either minimize staleness given an error constraint or to minimize error given a staleness constraint.

Staleness can arise from limited network bandwidth or from delaying transmissions until the end of the window. Error is caused by updates that are never delivered to the center, either because the edge sent only partial aggregates or because it omitted some aggregates altogether. Thus, staleness can be reduced by omitting some updates entirely or by sending updates earlier during the window; unfortunately, both options for reducing staleness are direct sources of error.

Study of optimal offline algorithms for minimizing error under a staleness constraint and minimizing staleness under error constraint is used as a reference to design efficient online algorithms for effectively trading off timeliness and accuracy under bandwidth limitations.


Key Points:

  • Discussion on the challenges in optimizing error and staleness tradeoff for streaming, batch aggregation, and random sampling approaches.
  • Discussion on the optimal offline algorithms for the optimization problems of minimizing staleness under an error constraint (EPE) and minimizing error under a staleness constraint (SPEF), which are further used to evaluate and design the online algorithms.
  • Deriving an alternative offline algorithm, SPEF with Early Omissions (SPEF-EO), from the optimal offline algorithm SPEF (Smallest Potential Error First).
  • The optimal offline algorithms provide some high-level lessons: they flush each key at most once, thereby avoiding wasting scarce network resources, and they make the best possible use of the network resources available.
  • View the edge as a two-level cache abstraction, and implement online algorithms via the partition policy that defines the boundary between the two caches (a simplified sketch follows this list).
  • The online error-bound algorithm uses a value-based cache partitioning policy and emulates the offline optimal EPE algorithm. It uses LRU as the cache replacement policy.
  • The online staleness-bound algorithm uses a dynamic sizing-based cache partitioning policy to emulate the offline optimal SPEF-EO algorithm. It also uses LRU as the cache replacement policy.
  • The workload being used for the experiment is a web analytics service offered by Akamai, a large commercial CDN.
  • Used trace-driven simulations as well as experiments with an Apache Storm-based implementation deployed on PlanetLab.
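The sketch below is a much-simplified reading of the caching idea: the edge keeps per-key partial aggregates in a fixed-size cache, and evicting the least-recently-used key flushes an update to the center; remaining keys are flushed at the end of the window. The partition boundary, sizing policies, and the emulation of EPE/SPEF-EO from the paper are not reproduced here, and the class and method names are assumptions.

# Simplified edge cache: per-key partial sums, LRU eviction flushes an update to the center.
from collections import OrderedDict

class EdgeAggregationCache:
    def __init__(self, capacity, send_update):
        self.capacity = capacity
        self.send_update = send_update   # callable(key, aggregate): ships an update over the WAN
        self.cache = OrderedDict()       # key -> partial aggregate, kept in LRU order

    def insert(self, key, value):
        self.cache[key] = self.cache.pop(key, 0) + value     # aggregate and mark as recently used
        if len(self.cache) > self.capacity:
            old_key, aggregate = self.cache.popitem(last=False)  # evict least-recently-used key
            self.send_update(old_key, aggregate)                 # flush its partial aggregate

    def flush_all(self):
        """At the end of the window, flush whatever remains."""
        while self.cache:
            key, aggregate = self.cache.popitem(last=False)
            self.send_update(key, aggregate)

cache = EdgeAggregationCache(capacity=2, send_update=lambda k, v: print("update:", k, v))
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 5)]:
    cache.insert(k, v)
cache.flush_all()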

Strengths:


  • The paper presents a tradeoff between staleness and error using windowed grouped aggregation algorithms which can be used under practical settings for dynamic workloads and bandwidth conditions and across a range of error and staleness constraints.
  • Presents a novel approach of using two level cache implementation for practical windowed grouped aggregation algorithms.
  • The techniques described in the paper can be applied across a diverse set of aggregates, from distributive and algebraic aggregates such as Sum and Max to holistic aggregates such as unique count.
  • Authors have evaluated their implementation for large query workloads and with different aggregation algorithms based on streaming, batching, batching with random early updates and optimal offline algorithms suggesting a very neat evaluation of their design.

Negative Points/Discussion:

  • Relies on grouped aggregation of data, so it is not applicable to applications that do not rely on grouped aggregation.
  • The authors assume the computational overhead of the aggregation operator is a small constant compared to the network overhead of transmitting an update, but for very compute-intensive aggregation functions the computational overhead would need to be considered.
  • Replication/fault tolerance mechanisms and interconnectivity between edge nodes have not been discussed.

SpanEdge: Towards Unifying Stream Processing over Central and Near-the-Edge Data Centers

Summary:


The authors note that the downsides of typical stream processing are: 1) communication happens over the WAN between the network edge and the stream-processing application hosted in a central data center, and this WAN bandwidth is expensive and can be scarce; 2) network traffic has increased dramatically because the number of connected devices at the network edge has grown; 3) latency-sensitive applications cannot tolerate the delay introduced over WAN links; and 4) data movement is restricted by data-privacy concerns. To address these problems they introduce SpanEdge, a customized stream-processing framework that unifies stream processing across geo-distributed data centers and leverages state-of-the-art solutions for hosting applications close to the network edge. SpanEdge categorizes data centers into two tiers: central data centers in the first tier and near-the-edge data centers in the second tier. It places stream-processing applications close to the network edge and hence reduces the latency incurred by WAN links and speeds up analysis.


The architecture of SpanEdge is very similar to Apache Storm's. However, unlike Storm, SpanEdge has two types of workers - hub-workers and spoke-workers - where a hub-worker resides in a first-tier data center and a spoke-worker resides in a second-tier data center near the edge.
SpanEdge also lets programmers group the operators of a stream-processing graph as either a local-task or a global-task, where a local-task refers to operators that need to be close to the streaming data sources and a global-task refers to operators that process the results of a group of local-tasks. It also introduces a new scheduler, which converts the stream-processing graph into an execution graph by assigning local-tasks to spoke-workers and global-tasks to hub-workers, much as Storm's scheduler does (a small sketch of this assignment follows).
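The sketch below shows the scheduling idea under a simple assumed model: each local-task is assigned to the spoke-worker (second-tier data center) nearest to its data source, and global-tasks go to a hub-worker in a first-tier data center. The data centers, coordinates, distance metric, and task names are all illustrative, not SpanEdge's actual scheduler.

# Illustrative SpanEdge-style assignment of operator groups to workers.
hub_workers = ["dc-central-1"]
spoke_workers = {"dc-edge-eu": (50.1, 8.7), "dc-edge-us": (40.7, -74.0)}

tasks = [
    {"name": "parse_and_filter", "kind": "local",  "source_location": (48.8, 2.3)},
    {"name": "local_count",      "kind": "local",  "source_location": (41.0, -73.5)},
    {"name": "global_merge",     "kind": "global"},
]

def nearest_spoke(location):
    """Pick the spoke-worker closest (by squared distance) to the stream's data source."""
    return min(spoke_workers,
               key=lambda dc: sum((a - b) ** 2 for a, b in zip(spoke_workers[dc], location)))

def build_execution_graph(tasks):
    assignment = {}
    for task in tasks:
        if task["kind"] == "local":
            assignment[task["name"]] = nearest_spoke(task["source_location"])
        else:  # global-tasks run on a hub-worker in a central (first-tier) data center
            assignment[task["name"]] = hub_workers[0]
    return assignment

print(build_execution_graph(tasks))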
  


Strengths:


Fault tolerance and reliability are realized using Storm's heartbeats and ack messages. SpanEdge similarly inherits all the advantages of Storm.


The notion of operator groupings provides a general and extensible model for developing arbitrary operations. This enables programmers to implement algorithms on a geo-distributed infrastructure with a standard programming language and stream-processing graph.


Programmers can develop a stream processing application, regardless of the number of data sources and their geographical distributions.


The paper retained most of the semantics of stream processing frameworks, and introduced the novel notion of hub-workers and spoke-workers, which differ in their geographical location relative to the edge, and provide low-latency computation guarantees.


Since Storm is a pure stream processing framework, it ensures low latency guarantees.  


The explanations in the paper are simple and straightforward. The evaluation results are impressive and compelling.


Weaknesses / Discussion Points:


Storm is not the state-of-the-art stream-processing framework. The authors might have obtained better results had they used other stream-processing frameworks such as Flink or Spark Streaming. Storm is a relatively low-throughput framework and provides only at-least-once guarantees.


Output data generated by local-tasks could be aggregated/batched before being sent to global-tasks.


At times it felt that crucial details were not mentioned in the paper. Entities such as the “source discovery service”, which provides a map of streaming data sources to the spoke-workers, were not discussed at all.


Adaptive filtering wherein only a subset of output is passed along the execution graph could have been incorporated.     


What if the environment is highly dynamic? How do the scheduling algorithms adapt when there is network congestion or highly mobile resources?

Caching, acks and flow-control could be adopted between central and near-the-edge data centers.

Thursday 9 March 2017

A hybrid edge-cloud architecture for reducing on-demand gaming latency

Summary:

This paper aims to improve the on-demand gaming (cloud gaming) experience for end-users, with low network latency as the primary requirement. The authors examine the existing infrastructure for cloud gaming and find that the current cloud can serve only 70% of end-users within the acceptable latency target of 80 ms. They then examine an approach in which only edge servers serve end-users. Encouraged by those results, they propose a hybrid infrastructure that augments the cloud with edge servers and therefore leverages the advantages of both. The proposed hybrid approach shows a significant improvement in the percentage of end-users served (90% versus 70%) by using various strategies for game placement and edge-server site selection.


Key Points:


  1. Current architecture with cloud-only deployment can serve only 70% of the end-users with the targeted latency of 80ms. Amazon EC2 instances and PlanetLab nodes have been used for measurement purposes.
  2. 2,504 BitTorrent clients are utilized for latency measurements, since they help identify the geographic location of the users and provide an open TCP port for measuring the latency.
  3. The latency requirement of 80ms is the network latency, and the authors have assumed that the client latency and server latency (processing delay) is around 20ms.
  4. Even a large cloud infrastructure (20 datacenters) cannot achieve acceptable latency for everyone; it also cannot cover 50% of the population when the requirement is under 40 ms.
  5. Edge-only deployment is not cost-effective: a CDN edge-server cannot serve more than one end-user for cloud gaming purposes. Edge servers are chosen randomly from the remaining subset of BitTorrent clients (1,500 of the 2,504 clients are end-users).
  6. Configurations for edge servers include edge-server-to-end-user latency and random game placement on the edge nodes. Simulation results show that 2,000 edge servers are required to serve 70% of the population (the same as in the cloud-only setting).
  7. 300 smart-edge nodes (the edge in the hybrid infrastructure) are selected from the 2,504 IP addresses. Heuristics used for smart-edge selection -> random, population-based, region-based, voting for closest latency, voting for all clients.
  8. Heuristics used for game placement on smart edges -> random, top X most popular games, equal opportunity, voting-based.
  9. End-users are matched with smart edges using an online bipartite matching algorithm, which reduces the possibility of synchronization latency when new smart edges are added (a greedy stand-in is sketched after this list).
  10. Voting-based smart-edge selection and game-placement strategies prevail over the other strategies when considering the requirement to host fewer games on edge servers as well as the stricter 40 ms latency requirement.
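The paper matches end-users to smart edges with an online bipartite matching algorithm; the sketch below is only a greedy stand-in that captures the flavor: each arriving user is assigned to the lowest-latency edge that still has capacity and hosts the requested game, falling back to the cloud otherwise. All latencies, capacities, and names are made up.

# Greedy stand-in for online matching of end-users to smart edges (illustrative only).
edges = {
    "edge-A": {"games": {"dota2", "csgo"}, "slots": 2, "latency_ms": {"u1": 25, "u2": 70, "u3": 35}},
    "edge-B": {"games": {"csgo"},          "slots": 1, "latency_ms": {"u1": 60, "u2": 30, "u3": 90}},
}
CLOUD_LATENCY_MS = 85
TARGET_MS = 80

def assign(user, game):
    """Assign an arriving user to the best feasible edge, else fall back to the cloud."""
    candidates = [(info["latency_ms"][user], name) for name, info in edges.items()
                  if game in info["games"] and info["slots"] > 0
                  and info["latency_ms"][user] <= TARGET_MS]
    if candidates:
        latency, name = min(candidates)
        edges[name]["slots"] -= 1
        return name, latency
    return "cloud", CLOUD_LATENCY_MS

for user, game in [("u1", "dota2"), ("u2", "csgo"), ("u3", "csgo")]:
    print(user, "->", assign(user, game))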

Strengths:


  1. The paper is very well written and presents a comprehensive study of cloud gaming. Almost every aspect of the cloud-gaming experience is covered and addressed in detail.
  2. The authors have taken various genre/types of games like MMOGs, action/arcade games into account which have strict latency requirements.
  3. Studies done on top cloud gaming providers like OnLive, Gaikai and GamingAnywhere, help the authors gather better requirements and evaluation metrics for their approaches.
  4. Cloud gaming challenges include strict latency requirement (less than 100 ms), and the specialized hardware such as machines with GPUs, and the authors have addressed both the challenges by having the edge-servers and the cloud as the hybrid architecture. They have also discussed various challenges which come in using CDN servers as well as having larger cloud infrastructure.
  5. Study on stricter latency requirement of 40ms and the effect of using hybrid infrastructure is a great find by the authors. (50% improvement by voting based strategies)
  6. The authors have tried various combinations of using the edge-servers, number of games, requirements based on their genres, which shows extensive evaluations done for finding the optimal combination for low latency. The simulations are repeated 16 times to further check their readings on the latency and percentages of users served.

Weaknesses:


  1. Assuming 20 ms for combined client and server latency is a big assumption. Though the authors mention it may vary, 20 ms is very low and optimistic; the client may be running other applications on the network, which would impact the overall experience.
  2. BitTorrent clients are treated as CDN servers serving as the edge servers. As discussed in the paper, CDN edge-servers require GPUs for a good gaming experience, which I think is not accounted for in the simulations.
  3. The authors have not discussed potential network failures in their simulations. Networks can drop packets, resulting in loss, which degrades the gaming experience.

Discussion Points:


  1. Do you think this work will still attract extreme gamers (those using standalone machines) who demand the best performance while playing? Ideally, cloud gaming is equivalent to “remote-desktop” gaming, with its known drawbacks on game performance and overall experience. The authors account only for network latency and set aside server and client latency. What if the user experiences packet loss? In First-Person-Shooter (FPS) games, loss and choke (server lag) are the primary culprits behind high latency. Alternatively, I think this work could be of great use to live video-streaming services in the sports industry.
  2. CSGO and DotA 2 are the two most played games according to the Steam game statistics. Of the players of the top 10 games played per day, 75% play the top two; of the players of the top 5 games, 85% play the top two. Instead of placing 5 games per edge node, what if the authors deployed only the top 2 games on the edge nodes?

Wednesday 8 March 2017

Nebula: Distributed Edge Cloud for Data Intensive Computing

Objective:
Nebula is a dispersed cloud infrastructure that uses voluntary edge resources for both computation and storage. It is designed with the following goals:
1) Support for distributed data-intensive computing.
2) Location-aware resource management.
3) Sandboxed execution environment.
4) Fault tolerance.

Framework Architecture:
The main components of Nebula are:
Data Nodes: volunteer servers that offer storage to the system.
DataStore Master: maintains the storage-system metadata and makes data placement decisions.
Nebula Monitor: monitors the performance of volunteer nodes and network characteristics.
Nebula Central: the front-end of the Nebula ecosystem; it allows volunteers to join the system and application writers to inject applications into it.
ComputePool Master: coordinates task execution; that is, it chooses the best compute nodes to run specific tasks (a simplified locality-aware selection is sketched below).
Compute Nodes: provide computation resources. A compute node first downloads the required data from data nodes and then runs tasks on that data.
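The paper does not give an exact scheduling formula; the sketch below shows one plausible locality-aware choice under a simple assumption: estimate each candidate node's completion time as transfer time (input size over the bandwidth the Nebula Monitor would measure to the data node) plus compute time, and pick the minimum. All names and numbers are illustrative.

# Illustrative locality-aware compute-node choice for a Nebula-like ComputePool Master.
compute_nodes = {
    "node-1": {"bw_to_data_MBps": 80.0, "compute_MBps": 40.0},
    "node-2": {"bw_to_data_MBps": 10.0, "compute_MBps": 60.0},
    "node-3": {"bw_to_data_MBps": 50.0, "compute_MBps": 20.0},
}

def estimated_completion_s(node, input_mb):
    """Transfer time from the data node plus processing time on the compute node."""
    return input_mb / node["bw_to_data_MBps"] + input_mb / node["compute_MBps"]

def pick_compute_node(input_mb):
    return min(compute_nodes, key=lambda n: estimated_completion_s(compute_nodes[n], input_mb))

print(pick_compute_node(input_mb=400))  # favors nodes close (in bandwidth terms) to the data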

The Most Important Contribution of Nebula:
At the beginning of the paper the authors list four goals, but I think the most interesting one is location-aware resource management. With the help of the monitor service, Nebula can both store and process data in a geo-distributed manner.

Strengths:
1) Location-aware resource management helps Nebula to achieve efficiency in geo-distributed computing framework.
2) The DataStore Master and ComputePool Master help Nebula achieve fault tolerance. The DataStore Master maintains enough replicas of each file in the system in case a file is lost; the ComputePool Master restarts failed tasks once it determines that a task has failed.
3) Compared to previous volunteer-computing models (central-source/central-intermediate-data and central-source/distributed-intermediate-data), Nebula is a substantial enhancement.

Weaknesses:
1) Are there enough volunteers in the real world willing to offer their own resources?
2) The most important part of Nebula is how it manages geo-distributed resources, but the paper does not give quantitative metrics or a concrete method for how to place data and distribute the computing tasks.
3) Nowadays, compute-intensive frameworks are much more common and do not need expensive hardware; they can also do data-intensive work. Do we really need to design a new architecture?

Discussion & possible future direction:
1) Nebula is not very well suited to being a computing framework, but it could be a good geo-distributed storage framework that also offers reasonable computing ability on its own.
2) If you were the researchers, how would you develop Nebula further?

Tuesday 7 March 2017

Nebula: Distributed Edge Cloud for Data Intensive Computing

Summary
------------
Many big data applications rely on geographically distributed data, and with existing frameworks consisting of centralized computational resources, a non-trivial portion of the execution time and budget for processing such data is consumed by data upload. Another problem with existing applications is the high overhead involved in instantiating virtualized cloud resources. To address these problems, the paper proposes Nebula, a distributed cloud infrastructure that uses volunteer nodes as edge clouds for data-intensive computing. Large amounts of data processing can be done at the edge, located near the data, resulting in significant data compression and thus reducing the time and cost of transporting the data for centralized processing.
Nebula system architecture consists of the following components -
Nebula Central : A front-end web-based portal for volunteer nodes to join the Nebula system, and for application developers to inject code into the system.
DataStore : A simple per-application storage service consisting of the volunteer nodes that store the data, and the DataStore Master which manages how and where data is stored at the volunteer nodes.
ComputePool : A per-application computation resources service consisting of the volunteer compute nodes, and the ComputePool Master which schedules and coordinates the execution at volunteer compute nodes.
Nebula Monitor : A central system that monitors the performance of the volunteer nodes and the network characteristics. It is used by the DataStore Master and the ComputePool Master for data placement and scheduling.


Strengths
------------
  • The paper is well-detailed and clearly explains and analyzes all the design decisions and performance results.
  • Location-aware data placement and scheduling, and replication helps make Nebula highly efficient, scalable and fault tolerant for data-intensive computations.
  • Provides performance comparison results against the existing volunteer-based computing platforms - Central Source Central Intermediate Data, and Central Source Distributed Intermediate Data.
  • Proper care is taken to protect volunteer node from malicious code.
  • The design decisions have been well thought out, considering specific boundary cases as well:
    • Locality-aware scheduler limits the number of tasks per node in each scheduler iteration to avoid many concurrent tasks being assigned to high-speed nodes.
    • To avoid resource wastage, the timeout value (used to label a compute node as unresponsive) is set large enough to give the node a chance to make progress if it becomes responsive again quickly.
  
Weaknesses
---------------
  • The paper assumes that the data has already been stored into Nebula, and decomposition of input files is not needed as the number of files are much more than the number of tasks.
  • Nebula Monitor, DataStore Master, and ComputePool Master may act as single points of failure.
  • Edits or appends on files stored on the DataStore are not supported.
  • The scalability result with 30 volunteer nodes could have been further validated by running an experiment with at least on the order of hundreds of nodes, since thousands of nodes can be expected to join a volunteer-based system.

Discussion Points
----------------------
  • The paper does not talk about privacy of data/code from the perspective of the person initiating the computation. Should the data/code be encrypted ?
  • Usually there is a tradeoff between replication and performance in the no-failure case. But the results in the paper show that runtime with replication is lower even without failures, because with more replicas the ComputePool Master has more choices for assigning tasks to compute nodes closer to the data. It would be interesting to see at what replication factor performance starts to degrade because of replication.


For additional information related to Nebula -
[1] http://dcsg.cs.umn.edu/Projects/Nebula/

Monday 6 March 2017

SOUL: An Edge-cloud System for Mobile Applications in a Sensor-rich World

Summary:
This paper focuses on the challenges posed by the many kinds of sensors available today: managing their protocols, overcoming the resource constraints of devices, managing access privileges for the data generated, and scaling with the number of sensors and their mobile (dynamic) nature. SOUL stands for Sensors Of Ubiquitous Life; it moves the processing (the interaction between sensors and actuators) to the edge or a remote cloud. It always tries to provide applications with the best available sensor and corresponding actuator in terms of resource availability and proximity. Access privileges allowing third-party applications to use a given end-user's sensors and actuators are granted at runtime. The paper presents a system for virtualizing sensor-actuator-service triplets to provide consistent and easy access to them. The authors compare app performance and the effect on battery life when using the Android sensor framework versus SOUL aggregates. They believe SOUL will let applications use many (hundreds of) sensors with ease and without draining the battery, in contrast to the current situation where around 98.08% of the apps they surveyed use only one sensor.

Design:

There are two main components of SOUL:
1. SOUL core: Built on top of an edge-cloud infrastructure (PCloud). It processes all requests coming in from apps and gathers the necessary data from sensor-actuator-service triplets.
2. SOUL engine: Exposes a virtual layer called an aggregate to the apps. It runs on the user's device and is a middle layer between the app and the SOUL core.

Sensor data, when requested, is first stored in SOUL's sensor datastore before being delivered to the app. This lets SOUL provide both past and current data, enabling better sensor processing. Another important aspect of SOUL is runtime access permissions for nearby sensors; the authors rightly note that some use cases make sensor-data sharing inevitable. For a better user experience, SOUL assists users in setting up privacy policies via a reference monitor. Since all processing is done on remote resources, the app code is executed sandboxed. Access authentication between the user's app and the sensor's owner is driven by a Facebook app.

SOUL provides two kinds of data access: 1) Glance, which returns summarized data (whatever the sensor owner chooses to expose to everyone) even without an access permission, and 2) Read, for which the app must hold the proper access privilege before requesting data, and which returns detailed data. A toy sketch of this distinction follows.
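This is a minimal sketch of the glance/read distinction only: a toy reference-monitor check that returns summarized data for a glance and detailed data only when the requester holds a read privilege. The policy representation and data shapes are assumptions, not SOUL's actual API.

# Toy reference-monitor check for glance vs. read access (names are illustrative, not SOUL's API).
sensor_record = {
    "owner": "alice",
    "summary": {"avg_temperature_c": 21.5},                      # what the owner exposes to everyone
    "detail":  [("09:00", 20.9), ("09:05", 21.3), ("09:10", 22.2)],
}
read_grants = {("bob", "alice")}   # (requester, owner) pairs with read privilege, e.g. via social ties

def access(requester, record, mode):
    if mode == "glance":
        return record["summary"]                     # summarized data, no permission needed
    if mode == "read" and (requester, record["owner"]) in read_grants:
        return record["detail"]                      # detailed data requires a read privilege
    raise PermissionError(f"{requester} lacks read access to {record['owner']}'s sensor")

print(access("carol", sensor_record, "glance"))
print(access("bob", sensor_record, "read"))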

Strengths of the Paper:

1. One feature I really liked about SOUL is that it uses a social network (Facebook) to decide which policy should apply to a user. There is a lot of context between the two parties that can be leveraged, so there is plenty of scope for new heuristics in the policy generator.
2. Separating glance and read access requests helps SOUL cover many more use cases than similar systems. It exploits the fact that there is always some processed data that can be broadcast to everyone for community benefit.
3. Improved performance because of enhancement in data record transfer through batching.
4. The ability to compose several different sensors into one helps apps leverage richer datasets; a composed sensor can serve an entirely different objective than the individual ones.
5. Demonstrated benefits achieved in all the stated key principles well in the evaluation section.

Weaknesses of the Paper:
1. There's no support for sensors connected via USB or Bluetooth, because that would require a custom SDK, which may not support legacy apps, and one of SOUL's key principles is full support for legacy apps.
2. There's no support for apps to specify QoS requirements to SOUL. This could eventually be a great user-experience feature.
3. The authors show extensively (via results) that SOUL scales well for sensor use, but they do not show any performance evaluation of how a mobile/moving user's app maintains seamless access to sensors.

Discussion Points:
1. The guest token is renewed every 2 hours. The authors state that estimating social ties is time-consuming because it requires tracing all the social ties on the owner's Facebook account. Is there a smarter way to narrow down which social ties to examine, based on past data?
2. The privacy of the sensor datastore in the SOUL core is not discussed in this paper.
3. There is room for discussion around SOUL's discovery mechanism. Currently a directory service is hosted on the remote cloud. What other mechanisms (technologies) could be used, and what are their pros and cons?
4. There is not much discussion of why the authors chose externalization as the default option for running an app's SOUL aggregate; sometimes it is better to run it locally.

The Design and Implementation of a Wireless Video Surveillance System

Summary:

Current wired surveillance systems require massive cost in the deployment and development of infrastructure, which restricts camera coverage. The authors propose Vigil, a real-time distributed wireless video surveillance system that leverages edge computing to scale surveillance to many cameras and expand the coverage region under limited bandwidth. The major challenge in a wireless system is the limited capacity of the wireless spectrum; Vigil aims to minimize bandwidth consumption without sacrificing surveillance accuracy. It consists of two major components:
  • Edge Computing Nodes (ECNs): small computing platforms, like a laptop or embedded system, attached to each camera. Each ECN receives the video feed from its connected camera and performs initial processing on the stream, such as face detection, indexing, compression, and short-term storage of video. It uploads the analytics of the feed to the controller.
  • Controller: located in the cloud, it receives user queries and runs a frame-scheduling algorithm that requests the ECNs to upload only the relevant video frames. This content-aware uploading strategy suppresses a large fraction of redundant data transfer to the cloud, thereby saving bandwidth.

Architecture and Design:

Each ECN runs frameUtility on the frames to generate an array of analytics data called the util. These analytics are uploaded to the controller, which then decides which frames are the most valuable and requests the respective ECNs to upload them. Two designs are discussed in the paper:
  • Intra-cluster processing: when several cameras capture the same scene from different angles, there may be overlapping frames capturing the same object or person. The design uses a re-identification algorithm that identifies frames containing the same objects, and based on this the controller executes a scheduling algorithm to request frames from the ECNs within the same cluster.
  • Inter-cluster processing: the paper defines a metric, useful objects per second (ops), to maximize the number of useful objects per second delivered to the controller. The ops value captures how many useful objects per second the frame at the i-th index of the selected image sequence from cluster c would deliver if it were selected for transmission. The design uses ops for inter-cluster scheduling (a simplified greedy sketch follows this list).
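The sketch below is a simplified greedy reading of the inter-cluster idea: each cluster reports, per candidate frame, how many useful objects it contains and how long uploading it would take, and the controller repeatedly picks the frame with the highest useful-objects-per-second until the upload budget for the interval is spent. The actual Vigil DRR algorithm and the precise ops definition in the paper are more involved; all numbers and names here are made up.

# Simplified greedy stand-in for Vigil-style inter-cluster frame scheduling (illustrative only).
# Each entry: (cluster, frame_id, useful_objects, upload_seconds) as reported by the ECNs.
candidate_frames = [
    ("lobby",   "f1", 3, 0.50),
    ("lobby",   "f2", 1, 0.40),
    ("parking", "f7", 4, 1.00),
    ("parking", "f8", 2, 0.25),
]

def schedule(frames, budget_seconds):
    """Pick frames by useful objects per second (ops) until the upload budget is exhausted."""
    chosen, spent = [], 0.0
    for cluster, fid, objects, secs in sorted(frames, key=lambda f: f[2] / f[3], reverse=True):
        if spent + secs <= budget_seconds:
            chosen.append((cluster, fid))
            spent += secs
    return chosen

print(schedule(candidate_frames, budget_seconds=1.5))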

Strength:

  • The paper provides a very comprehensive description of the design and implementation. The examples were very useful in understanding frameUtility, utils, and the Vigil DRR algorithm. The initial scoping and problem description were very clear, with assumptions clearly stated, leaving little room for speculation.
  • Good evaluation methodology, providing experimental results for both the intra-cluster and inter-cluster algorithms discussed in the paper.
  • The system proposed is flexible in the sense that the definition of the utility can be changed based on the specific use case without changing the underlying algorithms.
  • Vigil outperforms the Round Robin and single camera approaches at high activity level providing a gain of 23-30%.
  • The inter-cluster design outperforms the traditional benchmarks for all activity levels and bandwidth.
  • The system is already deployed at two indoor and one outdoor site for the purpose of testing. 

Weakness:

  • Each camera requires an ECN to connect to the controller, which adds infrastructure and management overhead. The battery or power requirements of these devices should also be considered.
  • Additional hardware requirements to do video and image processing at the edge nodes. Some of these algorithms can be very computation intensive.

Discussion Points:

  • The re-identification algorithm fails when objects are very close together in two frames. The authors aim, in future work, to switch to traditional algorithms in such cases. How feasible is this switching, keeping in mind that these are real-time systems?
  • In outdoor scenarios, face detection or object detection can be tricky since there is a lot of interference and noise.  What will be the accuracy of content-based frame selection in such real life scenarios?

Thursday 2 March 2017

The Swarm at the Edge of the Cloud

This paper mainly discusses the opportunities and challenges of integrating sensors and cloud computing.

Background

In alignment with the Cloud and its all-present mobile access devices, a third layer of information acquisition and processing devices - called the sensory Swarm - is rapidly emerging, enabled by even more pervasive wireless networking and the introduction of novel ultra-low power technologies. The Swarm gives rise to the true emergence of concepts such as cyber-physical and cyber-biological systems, immersive computing, and augmented reality.

The TerraSwarm Research Center at UC Berkeley is addressing the huge potential (and associated risks) of the pervasive integration of smart, networked sensors and actuators into our connected world.

https://www.terraswarm.org/index.html

Important Aspects

Swarm - the set of sensor and actuator devices interacting with users and the physical world.

TerraSwarm - the emerging cyber-physical network consisting of a large number of sensors and actuators.

Swarmlets - The TerraSwarm applications which are characterized by their ability to dynamically recruit resources such as sensors, communication networks, computation and information from the cloud.

SwarmOS -  This will serve as a distributed executive and resource manager for TerraSwarm applications. The SwarmOS will mediate the needs of applications for services and clusters of resources where resource clusters may be, for example, a portion of a processor's resources, or a slice of bandwidth.

To achieve the TerraSwarm idea, the authors propose a three-level model: the cloud as a backbone offering extraordinary computing and networking capability, global data analytics, access, and archiving; a middle layer of battery-powered mobile devices that connect nearby swarm devices to the cloud; and the swarm devices themselves.

TerraSwarm research is aligned with four major themes:
a. Realization of a “Smart City” scenario.
b. Platform architectures and operating systems.
c. Services, applications, and cloud interaction.
d. Methodologies, models, and tools.

Key Challenges in TerraSwarm

Characteristics of TerraSwarm - Large scale, Distributed, Cyber–physical, Adaptive, Heterogeneous.
  1. The important challenges in the TerraSwarm include the generation of massive amounts of data, and data privacy and security.
  2. A critical research challenge is how to recruit and compose heterogeneous resources, how to dynamically adapt applications to changing resources and contention for resources, and how to share resources without compromising safety, security, or privacy.
  3. Need to develop an energy-efficient hardware support for encryption/decryption, authentication, and hardware-enforced key management.
  4. Aggregation of data without compromise on Data privacy.
  5. Ensuring that different components and subsystems of TerraSwarm can be dynamically recombined yet still function properly will require new, highly advanced development methodologies, models, and tools.
  6. The security of information, actuation, and brokerage is essential to the success of the TerraSwarm vision. 
  7. In the dynamic network of a TerraSwarm system, noninterference properties are very important.

Strengths of the Paper

1. Excellent introduction to the architecture, model, and themes of the TerraSwarm.
2. Clearly lays out the TerraSwarm vision.
3. Clearly describes the challenges encountered in the TerraSwarm, discussing the issues in enough detail to guide further research.
4. Multiple TerraSwarm applications are discussed in alignment with the research; examples include consumer applications, health-related applications, etc.

Discussion Points

1. The scalability of TerraSwarm applications is not discussed in the paper.
2. How to handle interference between different swarm devices in the TerraSwarm is not addressed.
3. Even though the paper discusses ideas for data privacy and protection and suggests multiple potential techniques that might increase data privacy, there is still no concrete vision of which exact protocols would work for the TerraSwarm; the techniques are only vaguely discussed.
4. One of the challenges of the TerraSwarm is real-time transfer of data and processing in the cloud for mission-critical systems. How to address this challenge is not discussed.



Wednesday 1 March 2017

The Swarm at the Edge of the Cloud



The paper presents a vision of the future in which thousands of smart sensing devices per person interact with other sensing devices, mobile personal devices, and the cloud to form a three-tier structure: the TerraSwarm vision.
This web of sensing devices forms a swarm, and the TerraSwarm applications that make use of such a system are called swarmlets. The authors believe that swarm-based systems can have at least as much impact as the web did when it was launched.

Developing and managing such a large-scale system comes with its own challenges. The paper discusses four themes to address them. The first theme focuses on the idea of smart cities, where swarmlets can drive innovation. The other themes are:
  1. Platform architecture and operating systems
  2. Services, applications and cloud interaction
  3. Methodologies, models and tools

Key Points :


  • The most important factor in the transition from a single application with a dedicated set of resources to TerraSwarm applications is the privacy and security of the information shared in the swarm network.
  • The large-scale presence of a distributed network of swarms poses big challenges of management, security and policy control.
  • Such an implementation needs a “SwarmOS” that supports continual reconfiguration of applications (because of the dynamic nature of the sensing devices) and solves the resource-management problems that arise when swarmlets contend for the same resources.
  • The system must be adaptive because of the dynamic nature of the environment.
  • The TerraSwarm vision has the potential to disrupt the consumer and enterprise markets thanks to its context-aware design and the wide range of application domains it can serve.

Strengths


  • The paper presents a clear vision of the structure of the TerraSwarm and an overview of the interaction between swarm applications, swarm resources, and the SwarmOS.
  • One of the paper's strong points is its acknowledgement of the challenges posed by the highly scaled model of the swarm and the ever-changing nature of the swarm environment.
  • The four-theme model presented for addressing the challenges provides an overview for future researchers and swarm-application developers.

Weakness and Discussions


  • Data accrued by the sensors is highly sensitive and at greater risk of being targeted by cyber criminals. Solutions to the security risks introduced by integrating and sharing sensor data with multiple third-party vendors are not addressed in this paper.
  • The paper talks about developing and configuring applications that have not been invented yet, without giving details about the scope and viability of such applications.
  • The continuously evolving nature of the swarm environment makes it difficult to test and verify the functionality of application components.
  • The paper does not address the communication and interfacing problems associated with multilayered application interaction.
  • Potential cloud-service failures can disrupt the commercial IT market; however, failures of services that support TerraSwarm applications like healthcare and disaster management can have more serious, life-threatening consequences.