Many apps are being developed around acoustic event detection, i.e., detection based on speech, music, heartbeat, etc. It is inconvenient for app developers to each write an efficient acoustic processing algorithm separately. Auditeur is a platform where developers can register for acoustic events, and whenever an event occurs the app is notified. Auditeur provides an API through which applications can register for events such as man-made sounds, music, and vehicle sounds. It generates a context-aware and energy-aware classifier for event detection and notifies the application when the event occurs.
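As a rough illustration of this registration model, here is a minimal Java sketch; the names (Auditeur, EventListener, register) and the callback shape are assumptions for illustration, not the actual API from the paper.

```java
// Hypothetical sketch of how an app might register for an acoustic event.
// The names below (Auditeur, EventListener, register) are illustrative
// assumptions, not the actual API described in the paper.
public class DoorKnockApp {
    public static void main(String[] args) {
        Auditeur auditeur = new Auditeur();
        // Register interest in a "door-knock" event; the platform is meant to
        // pick a classifier and invoke the callback whenever the event fires.
        auditeur.register("door-knock",
                event -> System.out.println("Detected: " + event));
    }
}

// Minimal stand-in for the platform so the sketch compiles.
class Auditeur {
    interface EventListener { void onEvent(String event); }

    void register(String eventTag, EventListener listener) {
        // In the real system this would trigger cloud-side model generation
        // and start the on-phone sound engine; here we just echo the tag.
        listener.onEvent(eventTag);
    }
}
```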
Collection of training data:
Training data used in the classification problem are collected in the form of soundlets, which are 3-30 s audio clips. Each audio clip is attached to contextual information that combines phone-generated context about the clip with user tags. There are two types of tags: content tags, which describe what the sound is, and container tags, which describe the background, e.g., office. Examples of phone-generated context include the location of the phone, the body position of the phone, etc.
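One possible way to model a soundlet's metadata is sketched below; the field names are assumptions for illustration, not the paper's actual schema.

```java
import java.util.List;

// One way to model a soundlet's metadata: content tags (what the sound is),
// container tags (the background), and phone-generated context. The field
// names are illustrative assumptions, not the paper's actual schema.
public class Soundlet {
    byte[] audioClip;           // the 3-30 s audio recording
    List<String> contentTags;   // e.g., "door-knock"
    List<String> containerTags; // e.g., "office"
    String location;            // phone-generated: where the phone was
    String bodyPosition;        // phone-generated: e.g., "in-pocket"

    public static void main(String[] args) {
        Soundlet s = new Soundlet();
        s.contentTags = List.of("door-knock");
        s.containerTags = List.of("office");
        System.out.println(s.contentTags + " in " + s.containerTags);
    }
}
```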
Storage of the training data:
The collected data is logically divided into public and private spaces. Soundlets in the public space can be shared between developers, while each developer has a separate private space. To prevent malicious tags in the public domain, sanity checks are performed and tags must come from a fixed, predefined set.
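A minimal sketch of such a predefined-tag sanity check might look as follows; the tag vocabulary and method names are assumptions for illustration.

```java
import java.util.Set;

// Minimal sketch of a predefined-tag sanity check: reject any soundlet whose
// tags fall outside a fixed vocabulary. The vocabulary and method names are
// assumptions for illustration only.
public class TagSanityCheck {
    private static final Set<String> ALLOWED_TAGS =
            Set.of("door-knock", "speech", "music", "vehicle", "office", "home");

    public static boolean isValid(Iterable<String> tags) {
        for (String tag : tags) {
            if (!ALLOWED_TAGS.contains(tag)) {
                return false; // unknown tag: reject from the public space
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid(java.util.List.of("speech", "office"))); // true
        System.out.println(isValid(java.util.List.of("malicious-tag")));    // false
    }
}
```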
Training model:
In the cloud, a 121-dimension feature vector is generated from the audio clip. The content tags are used in two different ways: to find the tags that describe the class of soundlets the app is interested in, and to find the tags that describe the other sounds in the universe. A request for training a model contains the sound the app is looking for, the other unwanted sounds that can occur in the environment, contextual information, and energy constraints. These parameters can be controlled by the developer of the app.
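An illustrative shape of such a training request is sketched below; the field names and the energy unit are assumptions, not the paper's actual request format.

```java
import java.util.List;

// Illustrative shape of a model-training request: the target sound, the
// "universe" of other expected sounds, the context, and an energy budget.
// All field names and the unit are assumptions, not the paper's format.
public class TrainingRequest {
    String targetSound = "door-knock";                  // sound the app wants
    List<String> universe = List.of("speech", "music"); // other expected sounds
    String context = "office";                          // container context
    int maxEnergyBudget = 50;                           // assumed energy cap (arbitrary unit)

    public static void main(String[] args) {
        TrainingRequest req = new TrainingRequest();
        System.out.println("Train '" + req.targetSound + "' against " + req.universe);
    }
}
```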
The Auditeur workflow is as follows:
1. Auditeur provides an API that can be used to record, tag, and upload soundlets to the cloud.
2. After a sound is captured, it is tagged by the user and/or annotated with phone-generated context.
3. The audio clip and tags are uploaded to the cloud.
4. Auditeur generates a model taking into account the energy constraints of the mobile device and transfers the plan to the phone in XML format, specifying which components to attach for that model (a hypothetical sketch of such a plan follows this list).
5. Periodically, if the phone's resources change (e.g., remaining battery), the model is regenerated with a reduced number of features.
6. A sound engine service inside the phone detects events by running the model and notifies the application.
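The following sketch shows the kind of XML plan step 4 describes; the element and attribute names are assumptions for illustration, not the paper's actual schema.

```java
// A minimal sketch of the kind of XML plan step 4 describes: which pipeline
// components and features the phone should attach. The element and attribute
// names are assumptions for illustration, not the paper's actual schema.
public class PlanExample {
    public static void main(String[] args) {
        String plan =
            "<plan event=\"door-knock\">\n" +
            "  <component name=\"framer\" frameMs=\"32\"/>\n" +
            "  <component name=\"features\" set=\"mfcc,zcr\"/>\n" +
            "  <component name=\"classifier\" type=\"knn\"/>\n" +
            "</plan>";
        System.out.println(plan);
    }
}
```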
Strengths
1. The paper is well written, and all the design choices made are thoroughly evaluated.
2. The framework is built to make development easy for application developers.
3. The framework presented in the paper combines in-phone and cloud processing, but once a model is obtained there is no latency from communicating with the cloud for further steps.
4. It is a single framework to which all acoustic applications can subscribe for event detection; if every application implemented its own event detection, energy consumption would be higher.
5. Context information is also taken into consideration while training a model, which improves the model's accuracy.
6. The pipeline developed is adaptable, i.e., it can be changed dynamically based on energy constraints.
7. A user-experience study was conducted with both developers and end users, and the feedback was incorporated into the system.
8. The classification pipeline contains both frame-level and window-level classification (sketched after this list).
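The sketch below illustrates the two-level idea in strength 8: label each short frame, then decide the window label by majority vote over the frame labels. The voting rule is an assumption for illustration; the paper's actual window-level step may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Frame-level labels feed a window-level decision via majority vote.
// The voting rule is an illustrative assumption, not the paper's exact method.
public class WindowVote {
    static String windowLabel(String[] frameLabels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : frameLabels) {
            counts.merge(label, 1, Integer::sum); // tally each frame's label
        }
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }

    public static void main(String[] args) {
        String[] frames = {"knock", "silence", "knock", "knock", "speech"};
        System.out.println(windowLabel(frames)); // knock
    }
}
```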
Weaknesses
1. The user, while labelling the content and container tags, should label them properly, especially for soundlets in the private space.
2. The paper mentions that the typical users of Auditeur are developers, and private-space soundlets are stored separately for each developer. What if a particular sound is required by multiple developers? Will Auditeur provide any API to share a private space between developers?
3. When uploading an audio clip to the cloud, the whole clip is uploaded instead of just the feature vector. The feature vector would be far smaller than the whole audio clip, so sending it instead would save energy and bandwidth on the phone (see the back-of-the-envelope numbers after this list).
4. Sanity checks are performed on public-space sound clips using an outlier detection method, but this method does not perform well when the training data grows substantially.
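Back-of-the-envelope numbers behind weakness 3, under assumed recording parameters (16 kHz, 16-bit mono, 10 s clip) and the 121-dimension feature vector stored as 4-byte floats:

```java
// Rough upload-cost comparison for weakness 3. The recording parameters
// (16 kHz, 16-bit mono, 10 s) and 4-byte floats are assumptions; the
// 121-dimension figure comes from the summary above.
public class UploadCost {
    public static void main(String[] args) {
        int clipBytes = 10 * 16_000 * 2; // 320,000 bytes of raw audio
        int featureBytes = 121 * 4;      // 484 bytes for the feature vector
        System.out.printf("clip: %d B, features: %d B, ratio: %.0fx%n",
                clipBytes, featureBytes, (double) clipBytes / featureBytes);
    }
}
```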
Discussion
1. Can edge devices such as Wi-Fi routers be used to process the sounds in the public domain? What would be the advantages and challenges of such an approach?
2. Can this framework be used for smart home
applications?
3. Since the processing is divided into stages, can we improve performance by offloading some of the computation, such as window-level classification, which takes up almost 90% of the time, to a nearby edge device such as a desktop when at home?
4. How feasible is it to extend the framework to
adapt to different languages?