Tuesday 31 January 2017

ThinkAir: Dynamic resource allocation and parallel execution in the cloud for mobile code offloading

Summary:
    ThinkAir is a framework that allows complex mobile applications to be partially run in the cloud, thereby removing handheld restrictions such as limited processing and storage capability, battery life, etc. ThinkAir achieves this through remote execution of code (method-level computation offloading) using multiple virtual machines (VMs) in the cloud. The main focuses of the paper are to address 'scalability' issues and to demonstrate parallel execution of offloaded tasks by exploiting the multiple VMs available in the cloud. ThinkAir claims superiority over related solutions like MAUI and CloneCloud with its on-demand and scalable infrastructure.

  Toolchain: (i) ThinkAir library (API), (ii) compiler, (iii) VM manager and (iv) parallel processing module in the cloud.

  Flow:
    Offline :-
      The programmer annotates methods in the code with @Remote to mark them as candidates for remote execution.
      The ThinkAir compiler translates the annotated code and generates remoteable method wrappers and utility functions (a rough sketch of this appears after the flow below).
    Online :-
      Mobile :-
        The Execution Controller on the mobile decides at run time whether the executing method should be offloaded to the cloud or continue with local execution.
          - it utilizes various profilers (Software, Hardware and Network) to determine if the execution switch is needed.
          - the method is offloaded if the input exceeds a Boundary Input Value (BIV).
      Cloud :-
        A light-weight application at a surrogate server, called the Client Handler, receives and executes the offloaded code and data.
          - it instantiates one or more VMs and delegates the task to them.
          - secondary VMs are kept in a 'paused' state and can be resumed for task processing.
          - parallelization is achieved by direct splitting of sub-tasks across different VMs.
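
To make the offline/online flow concrete, here is a rough Java sketch of an @Remote-annotated method and of what a generated wrapper could conceptually do at run time. Only the @Remote idea comes from the paper; every other name (ExecutionController, shouldOffload, executeRemotely, FaceDetector) is my own illustration, not ThinkAir's actual API.

    // Hedged sketch; apart from the @Remote annotation, all names are assumptions.
    import java.lang.annotation.*;

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Remote {}                     // marks a candidate for remote execution

    interface ExecutionController {          // hypothetical controller interface
        boolean shouldOffload(String method, long inputSize);   // profiler/BIV based decision
        Object executeRemotely(Object target, String method, Object arg) throws Exception;
    }

    class FaceDetector {
        @Remote
        int[] detectFaces(byte[] image) {
            return new int[0];               // placeholder for the heavy local computation
        }

        // roughly what a compiler-generated wrapper could do
        int[] detectFacesOffloadable(byte[] image, ExecutionController ctrl) {
            if (ctrl.shouldOffload("detectFaces", image.length)) {
                try {
                    return (int[]) ctrl.executeRemotely(this, "detectFaces", image);
                } catch (Exception e) {          // connectivity lost or remote failure
                    return detectFaces(image);   // fall back to local execution
                }
            }
            return detectFaces(image);
        }
    }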

Objectives and solutions:
1. Dynamic adaptation:
            what?
                adapt quickly as conditions change to achieve high performance and ensure correctness.
            how?
                - if connectivity to the server is lost, the framework falls back to local execution.
                - exceptions thrown at the application server are caught and re-thrown on the client side.
                - OutOfMemory situations at the application server are handled by cloning a more powerful VM to take over the task.
2. Ease of use:
            what?
                a shallow learning curve owing to a simple interface for developers.
            how?
                - no modification cost involved other than tagging/annotating the methods in the code.
                - semi-automatic offloading of code based on real-time environmental factors assessed by profilers.
 3. Performance improvement:
            what?
                improve computational performance and power efficiency of mobile devices by leveraging mobile clouds.
            how?
                - cloud-augmented execution of offloaded tasks exploiting smartphone virtualization techniques (Android x86 + VirtualBox)
                - recursive and data-intensive algorithms are split into multiple tasks and distributed over multiple VMs to achieve an efficient parallel execution model (see the sketch after this list).
 4. Dynamic scaling:
            what?
                dynamically scale the computational power at the application server to optimize the performance.
            how?
                - on-demand resource allocation: the client can request extra computational power.
                - exploits parallelism by dynamically creating, resuming and destroying VMs.
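
As a rough illustration of the parallel execution in objective 3 (and the VM pool of objective 4), the sketch below splits a data-intensive job (e.g. counting n-queens solutions) into sub-ranges and dispatches them to several cloud VMs. The VmConnection type and its countSolutions method are hypothetical placeholders, not part of ThinkAir.

    // Hedged sketch of splitting sub-tasks across cloud VMs; VmConnection is assumed.
    import java.util.*;
    import java.util.concurrent.*;

    interface VmConnection {
        long countSolutions(int lo, int hi) throws Exception;   // process one sub-range remotely
    }

    class ParallelOffload {
        // split [0, n) into one sub-range per available VM and merge the partial results
        static long solveOnVms(List<VmConnection> vms, int n) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(vms.size());
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (n + vms.size() - 1) / vms.size();
            for (int i = 0; i < vms.size(); i++) {
                final VmConnection vm = vms.get(i);
                final int lo = i * chunk;
                final int hi = Math.min(n, lo + chunk);
                parts.add(pool.submit(() -> vm.countSolutions(lo, hi)));
            }
            long total = 0;
            for (Future<Long> f : parts) total += f.get();   // wait for and merge all partial results
            pool.shutdown();
            return total;
        }
    }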

Limitations:

        1. High VM instantiation time (32 s is a lot!). Although the paper addresses this by keeping VMs in a 'paused' state, reducing the need to create VMs frequently, it is noted that even for seven simultaneous requests the resume time exceeds 7 seconds. This adds considerable overhead to task processing.
        2. ThinkAir assumes a trustworthy cloud execution environment: the hope is that whenever data is offloaded to the cloud, the code and data state are not maliciously modified or stolen. No authentication mechanism is involved; the application server will process submissions from any client (the authors propose to address this security concern in future work).
        3. It is assumed that the smartphone and its clone in the cloud are pre-synchronized, and this is not discussed anywhere in the paper.
 

Scope for discussions:

        1. Did they say multi-user? From the paper it appears that ThinkAir does not consider tasks per user but per process (or application). If a large number of users connect to the server, the scope for scalability is limited, as each VM reserves a considerable amount of memory (paused VMs also hold memory; only CPU cycles are not consumed!). This poses a resource constraint and casts doubt on commercial deployment of the system.
        2. It is not discussed in detail how profiling of the VMs is performed on the server side of the framework. If many applications are processed by the server at the same time, some applications will be delayed in getting a VM, and the profiler data will then include this wait time, which will negatively influence future decisions by the Execution Controller.
        3. More discussion of network constraints is needed. The Round Trip Time between clients and server is assumed to be negligible, and high bandwidth is assumed.
        4. ThinkAir uses a sub-optimal data-transfer approach. Not all data instance objects will be required by the cloud for processing, and between successive execution calls from the same client and application the state changes may be limited. This could be exploited to cache data on the server side and work with incremental updates from the client on successive processing calls.
        5. Will users/clients be remembered between tasks? If there is a client queue overflow at the application server, how will it be addressed? Will there be any priority-based scheduling?
        6. The BIV is estimated on the client side, so it is learnt by each instance of the process (application). This requires each client to start BIV learning from scratch, offloading everything it sees annotated as 'remote' until the boundary is learnt.
        7. Possible use-cases for deploying ThinkAir:
                i.   High-end processing: image processing applications like Face recognition systems
                ii.  Games: n-queens, sudoku solver etc.
                iii. Social media processing: video decoding, image file conversions etc.
                iv.  other data intensive applications?
        8. How does ThinkAir compare with other related works such as MAUI, CloneCloud, COMET, etc.?
 

Some other interesting observations in the paper:
 
        1. cascading profilers have issues working with a distributed kernel?
        2. energy usage patterns for 3G vs WiFi
        3. networking cannot be neglected in a distributed solution (from the performance discussion of 3G, WiFi and WiFi-local)
        4. communication may have a dramatic impact on performance
        5. carefully select the proper technologies

Monday 30 January 2017

Parametric Analysis for Adaptive Computation Offloading


SUMMARY
This paper frames the task of computation offloading as an optimization problem which can be solved using parametric analysis.
Computation offloading is useful only when it leads to an overall improvement in performance and energy saving. Therefore, when deciding whether or not a particular computation should be offloaded, there is a tradeoff between the communication cost and the computation cost. The premise of this paper is that for most applications these costs depend on the input parameters of the specific application instance, and hence an optimal program partitioning cannot be completely determined a priori. The paper associates with the program a cost formula which uses the runtime values of the parameters to efficiently determine an optimal partitioning.


The solution presented by this paper is as follows -
  1. Program Partitioning  -
    1. The program is partitioned into schedulable tasks.
    2. A Task Control Flow Graph (TCFG) is used to represent the program. Each node is a task and the directed edges denote the associated data transfer between these tasks.
    3. ‘Data validity States’ are maintained on each host to avoid redundant data transfers.
    4. ‘Memory Abstraction’ is used to resolve the issue of unknown data dependencies.
  2. Cost Analysis -
    1. Computation, Communication, Scheduling and Data Registration (Validity) costs are expressed as functions of program input parameter values.
    2. These are then used to derive a cost formula which must be optimised.
    3. This optimisation problem is modelled as a ‘min-cut network flow’ problem
    4. A parametric partitioning algorithm is used to solve this problem and the result is an optimal program partitioning, one for each possible range of input parameters.
    5. At runtime, the program’s tasks schedule themselves on either host based on the current parameters and the corresponding partitioning solution from step 2.4 (a minimal sketch of this runtime lookup follows below).
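
A minimal sketch of the runtime side of this scheme, under my own naming: partitioning solutions precomputed for ranges of an input parameter are stored in a lookup table, and at run time each task asks the table which host it should execute on. None of these names come from the paper.

    // Hedged sketch: look up a precomputed partitioning for the current parameter value.
    import java.util.*;

    enum Host { MOBILE, SERVER }

    class PartitionTable {
        // upper bound of a parameter range -> task assignment chosen offline for that range
        private final TreeMap<Integer, Map<String, Host>> byRange = new TreeMap<>();

        void addRange(int upperBound, Map<String, Host> assignment) {
            byRange.put(upperBound, assignment);
        }

        // pick the partitioning whose range covers the runtime parameter value
        Host hostFor(String task, int paramValue) {
            Map.Entry<Integer, Map<String, Host>> e = byRange.ceilingEntry(paramValue);
            if (e == null) e = byRange.lastEntry();          // beyond all ranges: use the largest
            return e.getValue().getOrDefault(task, Host.MOBILE);
        }
    }

    class Demo {
        public static void main(String[] args) {
            PartitionTable table = new PartitionTable();
            table.addRange(1000,   Map.of("encodeFrame", Host.MOBILE));   // small inputs: stay local
            table.addRange(100000, Map.of("encodeFrame", Host.SERVER));   // large inputs: offload
            System.out.println(table.hostFor("encodeFrame", 50000));      // prints SERVER
        }
    }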


STRENGTHS OF THE PAPER
  1. The Program Partitioning Model could be an important takeaway from this paper. Most computation offloading schemes require restructuring of the program, and the details discussed in section 2 may easily be applied to other offloading techniques. It clearly outlines the task granularity and breakpoints, and the TCFG provides a good visualisation.
  2. Most of the program partitioning and cost analysis work is done beforehand, which means it does not add latency, unlike frameworks that make offloading decisions entirely at runtime. At the same time these decisions are not agnostic to the program execution state: at runtime, the program refers to a predetermined chart to find its optimal partitioning, which means better decisions are made than in a purely static system.
  3. The authors provide a lot of detail about the experimental setup, which makes their results easier to understand and accept. The experiments are run with 4 different applications which differ in their complexity, the number of input parameters, as well as the results of the parametric analysis.
  4. The results of this approach, as discussed in the paper, are impressive: a 37% performance improvement with offloading compared to running the whole application locally.
  5. The writing style of the paper makes the fairly involved math and the many details relatively easy to understand, thanks to the abundance of examples, edge cases and explanations. A few instances of this are -
    1. The authors state that different input parameters can require different optimal partitionings, and they put this point across well with a concrete example of an audio file encoder in section 1.1.
    2. The authors provide reasoning for their decisions where possible; for example, they explain in section 2 why the granularity of a task must be smaller than a function.
    3. Edge cases are handled where possible; for example, section 2.2 provides the ‘Data Validity States’ to ensure communication costs are as close to the real costs as possible.


WEAKNESSES OF THE PAPER
  1. Program Partitioning -
    1. The program partitioning discussed in this paper analyses the code line by line to find task headers and branches, as discussed in section 2.1. This means that considerable programmer effort is required to restructure the program to break it into tasks. An even deeper understanding may be required to construct the correct TCFG.
    2. A point that is discussed in [2] states that such methods can only be applied to specific kinds of applications (multimedia). I believe this is true because it may be impossible to model very complex applications as a simple graph.
  2. Cost Analysis
    1. Extensive Experiments are needed to determine the values of the constants in the Cost Formula. These experiments must be re-run each time the application logic is changed and for every new application.
    2. For the more complex applications, user annotations (section 3.4) may be required to express the costs as functions of the input parameters.
    3. The results of the parametric analysis are different partitionings for different ranges of input parameters. The range of an input parameter can be very large, and more details are necessary to understand how these results are stored/evaluated/searched at runtime.
  3. Experiments -
    1. Experiments are only conducted with multimedia applications. As discussed in [1], experiments with simpler applications such as text editors could also have been included.
    2. The experiments do not provide comparisons with other similar frameworks, nor any exact numbers on energy consumption.
  4. Scheduling -
    1. The paper talks about scheduling only very briefly in section 2.
    2. Only one host is active at any point, which unnecessarily adds delay to the application execution. While the overall runtime reduces due to the offloading, such scheduling does not fully exploit the benefit of a distributed architecture.
  5. There is a lot of additional bookkeeping required in terms of the Data Validity States, the Abstract Memory Locations, the Mapping Tables etc., which need to be maintained on both the client and the server.
  6. This paper does not address the question of ‘how’ to offload and hence has no discussion of fault tolerance, security or details of the message-passing architecture.
  7. Weaknesses discussed in the paper itself (sections 5.2 & 5.3)
    1. The degeneracy problem, which exists due to the nature of the linear systems themselves.
    2. The path sensitivity problem: the order in which tasks are executed is not considered when making the offloading decision.


DISCUSSION POINTS
  1. For the more mathematically minded -
The paper frames the task of finding an optimal partitioning into an optimization problem and then goes on to model it as a ‘min-cut network flow’ problem. The paper does not explain why this approach is selected. So why this approach? Could another technique be used just as well?
  2. Is the 37% improvement in performance really worth all this extra effort? Do other techniques provide comparable improvements with less effort?
  3. How does this approach compare to MAUI, CloneCloud and COMET in terms of -
    1. The complexity of applications targeted
    2. Being fully automatic
    3. The offloading decision (dynamic or static)



Wednesday 25 January 2017

CloneCloud: Elastic Execution between Mobile Device and Cloud


Motivation:
The intuition behind CloneCloud is the potential cost optimization when the cost of sending and receiving the relevant data and code, plus the execution cost on the cloud, is significantly lower than the cost of executing on the mobile device. Similar approaches were proposed earlier, but they lacked fine-grained decision-making based on the device's current resource availability.
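
Put as a back-of-the-envelope condition (my paraphrase, not CloneCloud's actual cost model): offloading pays off only when the cost of shipping the data and code plus the cloud execution cost undercuts local execution, and analogously for energy.

    // Hedged paraphrase of the offloading trade-off, not CloneCloud's real optimizer.
    class OffloadCheck {
        static boolean worthOffloading(double tSend, double tCloudExec, double tReceive,
                                       double tLocalExec) {
            return tSend + tCloudExec + tReceive < tLocalExec;   // the same shape applies to energy
        }
    }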

Main Points:
- In the design proposed by the paper, the programmer does not have to predefine the partitions of the services between mobile and cloud. This is done automatically and seamlessly by CloneCloud.

- Partitioning of services is carried out on the basis of the expected workload and execution environment (CPU speeds, network performance and energy consumption).

- The CloneCloud migrator operates at thread granularity, which is a new approach compared to the traditional suspend-migrate-resume mechanisms for application migration. For offloaded execution, the design supports migrating particular threads with the relevant execution state on demand and merging the migrated state back into the original process.

- Using the proposed design, CloneCloud achieved up to a 20x speed increase and 20x less energy consumption on the smartphone.

Tradeoffs/Shortcomings:
- The static analyzer restricts migration and reintegration points to routine entry and exit points only (to keep the optimization problem tractable). The dynamic profiler uses random inputs for cost modelling, which might not give the best cost model, as some execution paths might not get covered.

- The system does not support the distributed shared memory (DSM) model i.e. worker threads sharing the same state cannot be offloaded at the same time.

- The design only supports migration at execution points where no native state needs to be collected and migrated. Otherwise the migrator would have to collect the native context for transfer, which increases the system's complexity due to processor architecture differences, file descriptor differences, etc.

- To control the complexity further, the pre-existing state on the mobile device remains unaltered until the migrant thread returns.


- For optimization, the paper considers network characteristics, CPU speeds and energy consumption. There is no mention of the overhead in case of a faulty clone.

COMET: Code Offload by Migrating Execution Transparently

Overview

COMET is a runtime system that allows unmodified multi-threaded applications to utilize multiple machines. It is publicly available here.


The design goals are: correctness in multi-threaded programs, speed-up of computation, no manual effort, fault tolerance, and generalize with existing applications.

COMET uses Distributed Shared Memory (DSM) to access and change memory between systems (as opposed to Remote Procedure Calls) thus allowing for multi-threading support and thread migration. It's the first to apply DSM to offloading.

The system operates on a VM-synchronization primitive using a push-pull protocol. Deltas among updates of objects are tracked through a "tracked set table" that denotes dirty fields.
1. The pusher and puller enter an executable exchange protocol to exchange binaries.
2. The pusher sends over information about each thread. All local threads are temporarily suspended.
3. The pusher sends over an update of the shared heap.
4. The puller buffers the rest of the synchronization operation, then temporarily suspends its local threads (mirroring the pusher's suspension above). It merges in the update to the heap first, then pulls in updates to the stack.
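
A toy sketch of the dirty-field bookkeeping behind this push-pull synchronization: a tracked-set table remembers which fields of which shared objects changed since the last sync, so only those deltas need to be sent. The classes below are my illustration, not COMET's DSM implementation.

    // Hedged illustration of a "tracked set" of dirty fields used to build sync deltas.
    import java.util.*;

    class TrackedSetTable {
        // object id -> names of fields modified since the last synchronization
        private final Map<Integer, Set<String>> dirty = new HashMap<>();

        void markDirty(int objectId, String field) {
            dirty.computeIfAbsent(objectId, k -> new HashSet<>()).add(field);
        }

        // build the heap-update delta to push, then reset the tracked set
        Map<Integer, Set<String>> collectDelta() {
            Map<Integer, Set<String>> delta = new HashMap<>(dirty);
            dirty.clear();
            return delta;
        }
    }

    class PushExample {
        public static void main(String[] args) {
            TrackedSetTable table = new TrackedSetTable();
            table.markDirty(42, "counter");            // fields of shared object 42 were written
            table.markDirty(42, "buffer");
            System.out.println(table.collectDelta());  // e.g. {42=[counter, buffer]} is what gets pushed
        }
    }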


Advantages

  • Significant speed-up in computation while still correctly managing multi-threaded programs.
  • Even with some amount of synchronization between threads, COMET still operates well. 
    • The paper demonstrates this through an application built around a queue of integers. There was an impressive speedup of 202x over Wi-Fi (equating to 44 minutes locally compared to 13 seconds over Wi-Fi).
  • As a side-effect of improving the speed of computation, COMET also improves energy-efficiency by offloading computationally intensive portions of code.
  • Failure recovery is almost cost-free. Clients can just resume computations on server failure.
    • However, the client should never enter a non-recoverable state upon server failure; thus it waits for all changes to be pulled/buffered before committing any changes.
  • Smart adaptability to network conditions in terms of latency and bandwidth of the connection.

Shortcomings

  • The authors concede that there are no built-in security mechanisms that protect data privacy or mitigate tampering with the computation results. 
    • Thus, the client must trust the server. However the converse does not apply because the server has no private data or dependency on the accuracy of results.
    • This may limit usage for enterprises -- ECOS (Gember et al.) attempts to address this problem of ensuring data privacy.
  •  COMET may decide to send over data that is not needed for computation thus wasting bandwidth.
  • The scheduling algorithm tasked with moving threads between endpoints (in order to maximize throughput) is somewhat naive. Essentially, a thread is migrated when its time has exceeded some (configurable) parameter t (a small sketch of this policy follows after this list).
    • However, it should be noted that scheduling isn't a main component of the paper.
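
As referenced above, a minimal sketch of that threshold policy (the parameter t comes from the paper; everything else here is my own illustration):

    // Hedged sketch of the naive policy: migrate a thread once it has run longer than t.
    class MigrationPolicy {
        private final long thresholdMillis;                 // the configurable parameter t

        MigrationPolicy(long thresholdMillis) { this.thresholdMillis = thresholdMillis; }

        boolean shouldMigrate(long threadRunningMillis) {
            return threadRunningMillis > thresholdMillis;   // exceeded t -> migrate to the server
        }
    }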

Discussion Points

  • The authors state that at the time of writing (2012) they had not found many mobile applications that rely on heavy computation. As a result, the practicality of COMET can be called into question -- however, the authors quickly follow up with the statement that COMET can now allow for these types of applications to exist and is further generalized to existing applications that require no offloading logic.
    • In the last 4 years, have mobile applications become any more computationally intensive?
  • COMET is limited to Android systems due to its reliance on the Java Memory Model and how synchronization is conducted. An interesting experiment would be researching an implementation for iOS devices.
    • The authors do state that the design of COMET is general enough that it can be applied to other environments such as Microsoft's Common Language Runtime.
  • How can the scheduler be improved? In other words, how can we increase throughput via scheduling thread migrations?

Saturday 21 January 2017

KaZaa

Hi,
I was reading more of the paper that aims to give us insight into how KaZaA works, and here are a few points which got me thinking:

1) "KaZaA hashes every file to a hash signature, which becomes the ContentHash of the file"

I understand that the ContentHash helps in uniquely identifying a file and its copies and also helps detect bogus files. But the paper does not say much about how these hashes are constructed.

From the paper "Malware Prevalence in the KaZaA File-Sharing Network", which I found on the internet:
"The KaZaA content hash is 20 bytes in size: the first 16 bytes are the MD5 [15] of the first 300 Kbyte of the file. The last 4 bytes are the value of the custom made hash function of the length of the file"

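Out of curiosity, here is how that description could be turned into code. The last 4 bytes use a placeholder, since the custom hash of the file length is not specified in either paper.

    // Hedged sketch of the described 20-byte ContentHash: MD5 of the first 300 KB plus 4 bytes
    // derived from the file length. The length part is a PLACEHOLDER, not KaZaA's real function.
    import java.io.*;
    import java.security.MessageDigest;

    class ContentHashSketch {
        static byte[] contentHash(File f) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] buf = new byte[300 * 1024];
            int total = 0, n;
            try (InputStream in = new FileInputStream(f)) {
                while (total < buf.length && (n = in.read(buf, total, buf.length - total)) != -1) {
                    total += n;                              // read at most the first 300 KB
                }
            }
            md5.update(buf, 0, total);
            byte[] hash = new byte[20];
            System.arraycopy(md5.digest(), 0, hash, 0, 16);  // bytes 0..15: MD5 of the prefix
            long len = f.length();
            for (int i = 0; i < 4; i++) {
                hash[16 + i] = (byte) (len >>> (8 * i));     // bytes 16..19: placeholder length hash
            }
            return hash;
        }
    }
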
2) "When a SN receives a query, it may forward the query to one or more of the SNs to which it is connected." followed by 
"Our measurement work has determined that SNs often change their SN-to-SN connections on a time scales of tens of minutes" 
-- Are they implying that an SN cache is maintained with the metadata from previously visited nodes? Because they have stated the contrary on Page 2 of the paper, saying "Our investigations have determined that SNs do not cache metadata when ONs disconnect from them."

3) DBB Files: Does the 'active monitor' reside in the SN and keep polling for any changes to its copy of the DBB? Is this something obvious or is there something I'm missing?

4) "We have determined that, as part of the signalling traffic, KaZaA nodes frequently exchange with each other lists of supernodes"

The paper claims that this is to ensure locality-aware ON-SN & SN-SN connections. But I don't understand why KaZaA would need its ONs to have this list because, unlike KaZaA Lite, they are not skipping between SNs.

5) With regards to their own tests, they state:
" When one of the workstations was promoted to a SN, we manipulated the Windows Registries in the other two ONs so that each of the two registries listed only the promoted SN" 
but later on in the paper state that
"First, we have observed that initially at startup, an ON probes candidate SNs listed in its Supernode List Cache with UDP packets for possible connections"
as the reason for noting many ON-SN connections within a stipulated time.

If the first point is true, why would the second point have happened at all? I'm sorry if I am missing some information here.

6) Definition of Round Trip Time

7)  "It can be observed from the plot that 13% of the ON peers are responsible for over 80% of the meta-data uploaded. It is interesting to compare this data with results reported in [6], wherein on University of Washington campus, 8.6% of KaZaA peers were serving 80% of the requests."
I don't get the importance of checking how much meta-data was uploaded; is this not highly dependent on the peers themselves and how much sharable data they contain?

I can't seem to come to terms with a few of the conclusions in this paper:
While they themselves claim that they have tested only from one location, isn't it rather a small local test bed? 
They forced the test bed to stick to a particular SN; I am not sure if their own version of the KaZaA clients did this. And if they did, how do they claim to use the 200-node list that the SN passes to the ON?
I also found extrapolation of results in the paper for higher scale networks, but I'm not sure if extrapolating is the best way to proceed.



Your thoughts about these are welcome!
Best Regards,
Ayushi