Thursday 23 February 2017

Toward Global Data Infrastructure

Summary:

The paper describes a prototype IoT edge infrastructure built around a data-centric design that matches IoT requirements. The authors discuss in detail how the IoT space differs fundamentally from web services, and why conventional cloud-centric approaches are not enough for IoT applications.

Their system, the GDP (Global Data Plane), handles transport, replication, preservation and integrity of the data streams from IoT clients. It gives IoT applications a new layer of data-centric abstraction. The core concept of the design is a single-writer, append-only log; the paper also describes a Common Access API and a flat namespace for addressing the logs in the GDP.

System Design:

The GDP sits above the network layer and offers Common Access APIs (CAAPIs) to applications rather than raw packet routing. The key mechanism for data storage and communication in the GDP is the secure single-writer log, which serves as the narrow waist of the infrastructure.

The log, an authenticated data structure, is the append-only time-series data unit associated with each sensor. Logs can be migrated to any location, and they support simultaneous readers and replication. Clients (sensors that generate data, actuators that act on the data, and gateway devices ranging from smartphones to Raspberry Pis) connect to the GDP and have read/write access to these logs. A flat 256-bit identifier called a GDP-name is used to address both the logs and the clients.
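To make the single-writer, append-only idea concrete, here is a minimal sketch in Python. It is not the GDP implementation: the class name, the rule that the 256-bit name is the SHA-256 of the log metadata, and the shared `writer_token` standing in for the writer's key are all assumptions made for illustration. A real design would use public-key signatures so that readers can check authorship; this sketch only shows hash chaining, which detects a modified record.

```python
import hashlib
import json


class SingleWriterLog:
    """Toy append-only log whose records are chained together by SHA-256."""

    def __init__(self, metadata: dict, writer_token: bytes):
        # Assumption: the 256-bit name is the SHA-256 of the log's metadata.
        encoded = json.dumps(metadata, sort_keys=True).encode()
        self.name = hashlib.sha256(encoded).hexdigest()
        self._writer_token = writer_token  # stands in for the writer's key
        self.records = []  # list of (payload, chained_digest) tuples

    def append(self, payload: bytes, writer_token: bytes) -> int:
        """Append a record; only the designated writer is allowed to."""
        if writer_token != self._writer_token:
            raise PermissionError("only the log's single writer may append")
        previous = self.records[-1][1] if self.records else self.name
        digest = hashlib.sha256(previous.encode() + payload).hexdigest()
        self.records.append((payload, digest))
        return len(self.records) - 1  # record number

    def verify(self) -> bool:
        """Recompute the chain; any modified payload breaks it."""
        previous = self.name
        for payload, digest in self.records:
            if hashlib.sha256(previous.encode() + payload).hexdigest() != digest:
                return False
            previous = digest
        return True


# Hypothetical usage: one sensor writes, anyone can re-check the chain.
log = SingleWriterLog({"sensor": "temp-1"}, b"secret")
log.append(b"21.5C", b"secret")
print(log.name, log.verify())
```

Because records are never rewritten, readers and replicas only ever need to fetch the records they have not yet seen.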

GDP-routers (SDN-based implementations built on the Click modular router) provide location-independent routing over an overlay network, using a DHT combined with selective routing. Control plane services enforce the policies required by the GDP; e.g., a control plane replication service could ensure durability of the logs.
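As a rough illustration of location-independent lookup over a flat namespace, the sketch below picks the router whose identifier is closest to a target name under XOR distance. This is a Kademlia-style rule chosen purely for illustration; the review does not say which DHT the GDP uses, and the function names and router labels here are hypothetical.

```python
import hashlib


def flat_name(label: str) -> int:
    """Derive a 256-bit flat identifier from a label (illustrative rule)."""
    return int.from_bytes(hashlib.sha256(label.encode()).digest(), "big")


def closest_router(target: int, routers: dict[str, int]) -> str:
    """Return the router whose identifier has the smallest XOR distance."""
    return min(routers, key=lambda router: routers[router] ^ target)


# Hypothetical routers, each identified by its own flat name.
routers = {label: flat_name(label) for label in ("router-a", "router-b", "router-c")}
print(closest_router(flat_name("log:temp-1"), routers))
```

The point of the sketch is that the name carries no location information: the same lookup works wherever the log currently lives.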

Advantages:
1. The GDP design gives sensors and actuators more functionality; for example, the log-based data structure supports queries over historical data.
2. Subscription-based logging allows real-time data updates from the sensors.
3. Access control is implemented at the log level, which avoids dependence on vendor-specific authentication mechanisms.
4. The paper also claims that the GDP follows security best practices and reduces the attack surface.
5. Cleaner design, with policy control (locality and replication decisions, etc.) separated from the application.
6. The solution supports heterogeneous hardware infrastructure.
7. The single-writer, append-only design gives a fault-tolerance model with simpler concurrency issues.
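The first two points (historical queries and subscriptions) can be sketched together. The class and method names below are hypothetical and do not come from the GDP API; the sketch only assumes that a log can replay old records and then push new ones to registered callbacks.

```python
class ObservableLog:
    """Toy log supporting range reads and push-style subscriptions."""

    def __init__(self):
        self._records = []
        self._subscribers = []

    def append(self, record) -> int:
        """Store a record and notify every subscriber with (recno, record)."""
        recno = len(self._records)
        self._records.append(record)
        for callback in self._subscribers:
            callback(recno, record)
        return recno

    def read_range(self, first: int, last: int) -> list:
        """Historical query: records first..last, inclusive."""
        return self._records[first:last + 1]

    def subscribe(self, callback, start: int = 0) -> None:
        """Replay records from `start`, then deliver future appends."""
        for recno in range(start, len(self._records)):
            callback(recno, self._records[recno])
        self._subscribers.append(callback)


# Hypothetical usage: an actuator catches up on history, then follows updates.
log = ObservableLog()
log.append("20.9C")
log.subscribe(lambda recno, record: print(recno, record))
log.append("21.5C")
```

A late subscriber sees the same sequence as an early one, which is what makes the log usable both as storage and as a communication channel.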

Disadvantages:
1. The GDP is still a proof of concept, not a bulletproof design. The paper acknowledges that the GDP has not been through wide-scale deployment testing.
2. The burden of encryption is left to the applications and clients. Not all clients are cryptography-friendly; encryption may only be feasible on devices with reasonable computational power, such as smartphones and Raspberry Pis.
3. Key management is simplistic. The keys are also stored in logs, and the transfer of keys to a remote entity is assumed to happen over a pre-secured, 'tamper-proof' channel.
4. The metadata adds overhead to each packet, making data transfer heavier.
5. The suggested overlay networks can severely affect round-trip latencies and cause a serious performance penalty. (Locality-aware distribution is suggested as an option, but latency will still be an issue.)
6. There is no clear separation of services between the GDP and the control plane.

Room for Discussion:
1. The GDP is still at the idea stage and needs more implementation specifics.
2. Policy-driven storage is a possibility with the GDP. What do you think the options are? How could such a system be implemented for IoT with the GDP?
3. How sound do you find the security aspects of the GDP? Is access control at the log level enough? For multiple applications sharing the same GDP, could we use stronger isolation such as SGX or containers?
4. What is your opinion on the 256-bit address space: is it too much or too little for an IoT environment? (Consider that the namespace also covers historical log data.)
5. Few applications work directly on raw sensor data from individual devices. How does the GDP support aggregation and analytics?
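As one input to question 4, the sketch below estimates the chance that two names collide in a 256-bit space, using the standard birthday-bound approximation. It assumes names are spread uniformly at random (for example, derived from a cryptographic hash), which the review does not confirm for the GDP; the function name and the figure of one trillion names are hypothetical.

```python
def collision_probability(num_names: int, bits: int = 256) -> float:
    """Birthday-bound estimate: p ~ n(n-1)/2 / 2^bits, valid while p is small."""
    pairs = num_names * (num_names - 1) // 2
    return pairs / 2 ** bits


# Hypothetical scale: one trillion logs and clients sharing the namespace.
print(collision_probability(10 ** 12))
```

Under these assumptions the estimate for a trillion names is on the order of 10^-54, so accidental collisions are not the constraint; the per-packet cost of carrying 32-byte names is the more practical side of the question.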

3 comments:

  1. Ajay: super blog. Great discussion questions!

  2. I am curious about the details of the "single writer". If there is only one writer and it crashes at some point, who will write the log? I think we need several writers (or, if one crashes, a way to select another one quickly), so we could use something like ZooKeeper. Or we could use a lock to synchronize writes to the log.

    Replies
    1. Hey Zheng! That's a great question. And it's interesting to know about process synchronization with Apache ZooKeeper. My guess about what they mean by "single-writer" is that each log is associated with a single input source (one sensor per log), which saves the trouble of handling concurrency. The data is saved as a log and then exposed to any readers (for example, actuators that may require this data). This makes the logs 'write-once, read-only' data, which gives the GDP all the flexibility described in the paper!
