CellMate achieves low latency for an interactive application, along with high accuracy and scalability, while taking into account the energy limitations of mobile devices. A 96% success rate and a feasible implementation for realistic use cases make the paper impressive.
The main application components and flow are as follows:
- A Thin Client: After running some experiments, the authors decide to offload all computation to the server, as this gives the best runtimes once both network latency and computation time are taken into account.
- User Input: The user needs to upload short videos of the “area” of interest. The user also needs to manually label/identify each appliance once in this feed. This forms the search database for this area.
- Offline Processing: These database images (from the video) are then used to create a 3D model of the area. If an appliance is labeled once in a single image, its projection in the 3D model can help identify this appliance in all the other images.
- The Android Application: When the user wants to identify an appliance in the area using an Android phone, he/she simply captures an image with the app, and the image is passed on to the server.
- Runtime Identification Pipeline:
  - On the server side, the query image is processed: its SURF features are extracted and a bag-of-words description is generated.
  - The database is searched for the reference image that most closely matches the query image.
  - Appropriate transformations are applied to identify the appliance in the query image, and the result is returned.
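The first two runtime steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the SURF descriptors have already been extracted (shown here as toy 2-D tuples), and the visual vocabulary, the cosine-similarity matching, and all function names are hypothetical choices made for the example.

```python
import math
from collections import Counter


def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor (squared Euclidean distance)."""
    best_index, best_distance = 0, float("inf")
    for index, word in enumerate(vocabulary):
        distance = sum((a - b) ** 2 for a, b in zip(descriptor, word))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def bag_of_words(descriptors, vocabulary):
    """L2-normalised histogram of visual-word counts for one image."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    histogram = [counts.get(i, 0) for i in range(len(vocabulary))]
    norm = math.sqrt(sum(c * c for c in histogram))
    return [c / norm for c in histogram] if norm else histogram


def best_reference(query_descriptors, database, vocabulary):
    """Name of the database image whose histogram is most similar to the query."""
    query = bag_of_words(query_descriptors, vocabulary)

    def similarity(name):
        # Dot product of normalised histograms, i.e. cosine similarity.
        return sum(q * r for q, r in zip(query, database[name]))

    return max(database, key=similarity)


# Toy data: three visual words and two reference images (all values are made up).
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
database = {
    "projector_view": bag_of_words([(0.1, 0.0), (0.9, 1.1), (1.0, 1.0)], vocabulary),
    "printer_view": bag_of_words([(5.0, 5.1), (4.9, 5.0), (0.0, 0.2)], vocabulary),
}
query_descriptors = [(1.1, 0.9), (0.1, 0.1), (1.0, 1.2)]
print(best_reference(query_descriptors, database, vocabulary))  # prints "projector_view"
```

A real system would use high-dimensional SURF descriptors and an indexed search rather than this linear scan; the sketch only shows how a bag-of-words description turns image matching into a vector comparison.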
Evaluation: The paper provides a detailed evaluation of different aspects of the design, including the impact of varying the number of areas, edge cases with occluded objects, server run times, and scalability.
Strengths of the Paper
- Any home automation application needs a way to identify appliances in the target area. Most prior approaches require either explicit and tedious queries or specialized infrastructure. CellMate has no such requirement and yet provides low latency (achieving almost human intelligence!).
- A changed environment can easily be updated by simply recording and uploading a new video to the CellMate server.
- As claimed by the authors, the Labeling Tool is the first of its kind, which is a major contribution. The technique of propagating labels by generating a 3D model is also an interesting and creative solution.
- Detailed statistics are provided for every component of the design. For example, the decision to offload all computation to the server is supported by thorough experiments.
- While propagating labels and identifying appliances, the system takes into account not just the SURF features but also the visual context (relative angles, positions, etc.). The authors demonstrate this experimentally by adding objects to and removing objects from a scene.
- The paper takes into consideration the energy constraints of mobile devices, which is always important for edge devices.
- The authors address scalability by using a larger database than previous work and by experimenting with multiple concurrent users. When the database holds information about a large number of areas, the results decline slightly. However, this is handled through indoor localisation, which reduces the search to a maximum of 10 areas, so in a practical scenario scalability is not affected by an increase in the number of areas.
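The area-pruning idea in the last point can be illustrated with a small sketch. How CellMate's indoor localisation works is not covered in this review, so the code assumes a coarse 2-D position estimate and one centre coordinate per area; these inputs, the function names, and the database layout are hypothetical. Only the cap of 10 areas comes from the paper.

```python
import math


def candidate_areas(location_estimate, area_centers, max_areas=10):
    """Ids of at most max_areas areas, nearest to the estimated position first."""
    ranked = sorted(
        area_centers,
        key=lambda area: math.dist(location_estimate, area_centers[area]),
    )
    return ranked[:max_areas]


def restrict_database(database, areas):
    """Keep only the reference images that belong to one of the candidate areas."""
    allowed = set(areas)
    return {name: record for name, record in database.items()
            if record["area"] in allowed}


# Toy building with 25 areas laid out along a corridor (made-up coordinates).
area_centers = {f"room{i}": (10.0 * i, 0.0) for i in range(25)}
database = {
    "img_a": {"area": "room4"},
    "img_b": {"area": "room20"},
}
nearby = candidate_areas((42.0, 0.0), area_centers)
search_space = restrict_database(database, nearby)
print(len(nearby), sorted(search_space))
```

With this kind of pre-filter, the image search only ever sees references from a bounded number of areas, no matter how many areas the full database contains.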
Weaknesses of the Paper
- One of the major downsides of this paper is that the system requires a lot of manual user input and labeling. Even though this might be easier than the prior approaches, a lot depends on how well the user operates the application, both when creating the database and when capturing query images.
- The requirement of special depth cameras might limit CellMate’s practical use, especially in personal homes. The authors agree that the current design fails without a depth camera and that techniques are needed to remove this requirement.
- The application design makes a lot of static choices, such as the offload-all policy and the use of 5 threads for the runtime pipeline (even though there are 8 cores?). As we have seen in other papers, even with domain knowledge, static offloading and parallelization decisions are not optimal.
- Several failed cases are discussed in the paper, which make it clear that either the application requires better training/learning techniques or the user needs to be trained to take better and more useful images. Finding the best techniques would require extensive experiments.
- Additionally, I felt that more details could have been included in places where the authors cite references to support their choices and assumptions, as this paper borrows conclusions from many places. It would have been an easier read if they had included a couple of lines about the reasons for their choices in the paper itself, alongside the reference (for example, 400ms, kd-trees, Building Management System Apps, etc.).
Discussion Points
- Using an entire video for identifying an appliance will involve some redundant work. What could be some techniques to reduce the size of this video or the number of images required? (For example, finding an optimal subset of images, or guiding the user to capture better videos?)
- Do you think such a design could be implemented to achieve something along the lines of Amazon Go? What changes might be necessary?
- The paper discusses only a couple of use-cases. What else can you think of?
- A domain-specific question: What techniques can be used to remove the depth camera requirement?
Great job. Weakness #5 is spot on. DP #4 is also worth discussing. Is the solution overkill for this problem domain? Are there other areas where the methodology might be applied?
The method does seem a bit much for simple applications like detecting a projector in a room, considering that most office rooms will only have a handful of appliances and there could be a much easier way to tag and identify these.
Also, as the complexity of scenes increases, the manual labeling would become unreasonable. This process needs to be automated using some object recognition techniques to make the design really scalable.
A 3D model is constructed to identify relative locations. Just a thought: can we not make use of the absolute geographic coordinates of appliances to do this (with some calculations, in some way)? Phones do have GPS.