The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood-style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches that combine features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, which captures visual information in detected faces; a deep belief net, which represents the audio stream; a K-Means based “bag-of-mouths” model, which extracts visual features around the mouth region; and a relational autoencoder, which addresses spatio-temporal aspects of videos. We explore multiple methods for combining cues from these modalities into one common classifier, which achieves considerably greater accuracy than our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test set accuracy of 47.67 % on the 2014 dataset.
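To make the modality-combination step concrete, the sketch below shows one common late-fusion strategy: a weighted average of per-modality class probabilities. The model names, weights, and probability values are illustrative assumptions; the paper explores several combination methods and does not necessarily use this exact one.

```python
# Minimal sketch of late fusion by weighted averaging of per-modality
# class probabilities. Model names, weights, and numbers are illustrative only.
import numpy as np

NUM_CLASSES = 7  # e.g. angry, disgust, fear, happy, sad, surprise, neutral

def fuse_predictions(modality_probs, weights):
    """Combine per-modality class probabilities into one prediction.

    modality_probs: dict mapping modality name -> (NUM_CLASSES,) probability vector
    weights:        dict mapping modality name -> scalar weight
    """
    fused = np.zeros(NUM_CLASSES)
    total = 0.0
    for name, probs in modality_probs.items():
        w = weights.get(name, 1.0)
        fused += w * np.asarray(probs)
        total += w
    fused /= total
    return fused, int(np.argmax(fused))

# Hypothetical outputs of per-modality specialists for one clip.
probs = {
    "cnn_faces":     np.array([0.05, 0.05, 0.10, 0.55, 0.10, 0.10, 0.05]),
    "dbn_audio":     np.array([0.10, 0.05, 0.05, 0.40, 0.20, 0.10, 0.10]),
    "bag_of_mouths": np.array([0.10, 0.10, 0.10, 0.35, 0.15, 0.10, 0.10]),
}
weights = {"cnn_faces": 2.0, "dbn_audio": 1.0, "bag_of_mouths": 1.0}
fused, label = fuse_predictions(probs, weights)
```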
This paper describes our approach towards robust facial expression recognition (FER) for the third Emotion Recognition in the Wild (EmotiW2015) challenge. We train multiple deep convolutional neural networks (deep CNNs) as committee members and combine their decisions. To improve this committee of deep CNNs, we present two strategies: (1) to obtain diverse decisions from the deep CNNs, we vary network architecture, input normalization, and random weight initialization when training these deep models, and (2) to form a better committee in structural and decisional aspects, we construct a hierarchical architecture of the committee with exponentially-weighted decision fusion. On the seven-class static FER in the wild problem of EmotiW2015, we achieve a test accuracy of 61.6 %. Moreover, on other public FER databases, our hierarchical committee of deep CNNs yields superior performance, outperforming or competing with state-of-the-art results for these databases.
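As an illustration of exponentially-weighted decision fusion, the sketch below weights each committee member's class-probability vector by an exponential function of its validation accuracy. The weighting formula, temperature, and all numbers are assumptions for illustration; the paper's exact scheme may differ.

```python
# Sketch of exponentially-weighted decision fusion for a CNN committee.
# Weighting each member by exp(validation_accuracy / T) is an assumption
# made for illustration; the paper's exact weighting scheme may differ.
import numpy as np

def exp_weighted_fusion(member_probs, member_val_acc, temperature=0.1):
    """member_probs:   (n_members, n_classes) per-member class probabilities.
    member_val_acc: (n_members,) validation accuracies used to derive weights.
    """
    member_probs = np.asarray(member_probs)
    weights = np.exp(np.asarray(member_val_acc) / temperature)
    weights /= weights.sum()
    fused = weights @ member_probs        # weighted average of member decisions
    return fused, int(np.argmax(fused))

# Hypothetical members differing in architecture, normalization, initialization.
probs = [[0.6, 0.2, 0.1, 0.05, 0.02, 0.02, 0.01],
         [0.4, 0.3, 0.1, 0.10, 0.05, 0.03, 0.02],
         [0.5, 0.1, 0.2, 0.10, 0.05, 0.03, 0.02]]
fused, label = exp_weighted_fusion(probs, member_val_acc=[0.58, 0.55, 0.60])
```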
This paper summarizes recent developments in audio and tactile feedback based assistive technologies targeting the blind community. Current technology allows applications to be efficiently distributed and run on mobile and handheld devices, even in cases where computational requirements are significant. As a result, electronic travel aids, navigational assistance modules, text-to-speech applications, as well as virtual audio displays which combine audio with haptic channels are becoming integrated into standard mobile devices. This trend, combined with the appearance of increasingly user-friendly interfaces and modes of interaction, has opened a variety of new perspectives for the rehabilitation and training of users with visual impairments. The goal of this paper is to provide an overview of these developments based on recent advances in basic research and application development. Using this overview as a foundation, an agenda is outlined for future research in mobile interaction design with respect to users with special needs, as well as ultimately in relation to sensor-bridging applications in general.
In this paper, we propose an audio-visual emotion recognition system using multi-directional regression (MDR) audio features and ridgelet transform based face image features. MDR features capture directional derivative information in the spectro-temporal domain of speech and are therefore suitable for encoding different levels of increasing or decreasing pitch and formant frequencies. For video inputs, interest points in a time frame are detected using spectro-temporal filters, and the ridgelet transform is applied to cuboids around the interest points. Two separate extreme learning machine classifiers, one for the speech modality and the other for the face modality, are used. The scores of these two classifiers are fused using a Bayesian sum rule to make the final decision. Experimental results on the eNTERFACE database show that the proposed method achieves an accuracy of 85.06 % using bimodal inputs, 64.04 % using speech only, and 58.38 % using face only; these accuracies exceed those obtained by other state-of-the-art systems on the same database.
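The score-level fusion step can be sketched with the classical sum rule: the fused score of each class is the (normalized) sum of the two classifiers' posteriors. Equal modality weights and the example posteriors below are assumptions; the paper's Bayesian formulation may include priors or weights not shown here.

```python
# Minimal sketch of score-level fusion with the sum rule: the fused score
# for each class is the sum of the two classifiers' posterior probabilities.
# Equal modality weights and the example values are assumptions.
import numpy as np

def sum_rule_fusion(p_audio, p_video):
    """p_audio, p_video: (n_classes,) posterior scores from the two classifiers."""
    p_audio = np.asarray(p_audio) / np.sum(p_audio)   # normalize each score vector
    p_video = np.asarray(p_video) / np.sum(p_video)
    fused = p_audio + p_video
    return int(np.argmax(fused)), fused / fused.sum()

# Hypothetical posteriors over six emotion classes for one test sample.
label, fused = sum_rule_fusion(
    p_audio=[0.30, 0.25, 0.15, 0.10, 0.10, 0.10],
    p_video=[0.20, 0.35, 0.15, 0.10, 0.10, 0.10],
)
```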
Currently, lane change decision aid systems primarily address foveal vision and thus compete for drivers’ attention with interfaces of other assistant systems. Also, alternative modalities such as acoustic perception (Mahapatra et al., in: 2008 International conference on advanced computer theory and engineering, pp 992–995. https://doi.org/10.1109/ICACTE.2008.165, 2008), tactile perception (Löcken et al., in: Adjunct proceedings of the 7th international conference on automotive user interfaces and interactive vehicular applications, AutomotiveUI ’15, pp 32–37. ACM, New York, NY, USA. https://doi.org/10.1145/2809730.2809758, 2015), or peripheral vision (Löcken et al., in: Proceedings of the 7th international conference on automotive user interfaces and interactive vehicular applications, AutomotiveUI ’15, pp 204–211. ACM, New York, NY, USA. https://doi.org/10.1145/2799250.2799259, 2015) have been introduced for lane change support. We are especially interested in ambient light displays (ALD) addressing peripheral vision since they can adapt to the driver’s attention using changing saliency levels (Matthews et al., in: Proceedings of the 17th Annual ACM symposium on user interface software and technology, UIST ’04, pp 247–256, ACM. https://doi.org/10.1145/1029632.1029676, 2004). The primary objective of this research is to compare the effect of ambient light and focal icons on driving performance and gaze behavior. We conducted two driving simulator experiments. The first experiment evaluated an ambient light cue in a free driving scenario. The second one focused on the difference in gaze behavior between ALD and focal icons, called “abstract faces with emotional expressions” (FEE). The results show that with ambient light cues drivers more often choose safe gaps in rightward maneuvers. Similarly, in the second experiment drivers decide to overtake more often with both displays when the gaps are large enough. Regarding gaze behavior, drivers looked longer towards the forward area, and less often and for shorter durations into the side mirrors when using the ALD. This effect supports the assumption that drivers perceive the ALD with peripheral vision. In contrast, FEE did not significantly affect gaze behavior compared to driving without assistance. These results help us to understand the effect of different modalities on performance and gaze behavior, and to explore appropriate modalities for lane change support.
Using touchscreens while driving introduces competition for visual attention that increases crash risk. To resolve this issue, we have developed an auditory-supported air gesture system. We conducted two experiments using a driving simulator to investigate the influence of this system on driving performance, eye glance behavior, secondary task performance, and driver workload. In Experiment 1 we investigated the impact of menu layout and auditory displays with 23 participants. In Experiment 2 we compared the best systems from Experiment 1 with equivalent touchscreen systems with 24 participants. Results from Experiment 1 showed that menus arranged in 2 × 2 grids outperformed systems with 4 × 4 grids across all measures and also demonstrated that auditory displays can be used to reduce the visual demands of in-vehicle controls. In Experiment 2, auditory-supported air gestures allowed drivers to look at the road more, yielded equivalent driver workload and driving performance, and slightly decreased secondary task performance compared to touchscreens. Implications are discussed in terms of multiple resource theory and Fitts’s law.
In the last decade, the number and variety of secondary tasks in modern vehicles have grown exponentially. To address this variety, drivers can choose between alternative input modalities to complete each task in the most adequate way. However, the process of switching between different modalities might cause increased cognitive effort and ultimately result in a loss of efficiency. Therefore, the effects of switching between input modalities have to be examined in detail. We present a user study with 18 participants that investigates these effects when switching between touch and speech input on task efficiency and driver distraction in a dual-task setup. Our results show that switching between modalities did not increase task completion times, so the sequential combination of the most adequate modality for each subtask reduced the duration of the entire interaction. We argue for promoting modality switches and discuss the implications for application areas beyond the automotive context.
Emotions influence the way drivers process and react to internal or environmental factors, but relatively little research has focused on drivers’ emotions. Of the many emotional states, anger is considered the most serious threat on the road. Therefore, an affective intelligent system in the car that can estimate drivers’ anger and respond to it appropriately can help drivers adapt to moment-to-moment changes in driving situations. To this end, we integrated behavioral, physiological, and subjective data to monitor drivers’ affective states in various driving contexts and to address the question: “can self-selected music mitigate the effects of anger on driving performance?” In our experiment, three groups of participants (52 in total) drove using a driving simulator: anger without music, anger with music, and neutral without music. Results showed that angry drivers who did not listen to music drove more riskily than emotion-neutral drivers. Results from heart rate, oxygenation level in the prefrontal cortex, and self-report questionnaires showed that music could help angry drivers react at a level similar to that of emotion-neutral drivers. Regarding personality characteristics, drivers with an anger-expression-out style showed riskier driving behavior. Drivers’ workload data showed lower performance and higher effort for angry drivers without music. In conclusion, this study shows that multimodal sensing can be effectively used to holistically assess drivers’ emotional states and that music can be used as a possible multimodal strategy to mitigate the effects of anger on driving performance as well as on drivers’ subjective experiences.
The general role of personal assistants in the form of anthropomorphised conversational, virtual, or robotic agents in cars has been a subject of research for several years, and first results indicate numerous positive effects of these anthropomorphised interfaces. However, no comprehensive review of the conducted studies has been compiled yet. Furthermore, existing studies on the effect of anthropomorphism mainly focus on passenger cars. This article provides a comprehensive review and summary of the conducted studies and investigates their applicability to commercial transportation, in particular to anthropomorphised interaction between truck driver and truck. In the first part of the article, a literature review describes the details, aspects, and various forms of anthropomorphism as well as its observed positive effects. The review focuses on studies referring to anthropomorphism in passenger cars, complemented by relevant research results from non-automotive disciplines. The second part of the article aims to derive innovative and applicable concepts for anthropomorphised driver-truck interfaces using the Design-Thinking approach: building on a comprehensive literature review to identify user needs and problems, an interdisciplinary expert workshop developed the first two anthropomorphised driver-truck interaction concepts. The paper concludes by carving out the differences between anthropomorphised car-driver and truck-driver interaction. The next step of research will be the implementation of the developed interaction concepts in a first prototype, followed by the respective user evaluation.
In this paper we investigate how natural language interfaces can be integrated with cars in a way that minimizes their influence on driving performance. In particular, we focus on how speech-based interaction can be supported through a visualization of the conversation. Our work is motivated by the fact that speech interfaces (like Alexa, Siri, Cortana, etc.) are increasingly finding their way into our everyday life, and we expect such interfaces to become commonplace in vehicles in the future. Cars are a challenging environment, since speech interaction here is a secondary task that should not negatively affect the primary task, that is, driving. At the outset of our work, we identify the design space for such interfaces. We then compare different visualization concepts in a driving simulator study with 64 participants. Our results show that (1) text summaries support drivers in recalling information and enhance user experience but can also increase distraction, (2) the use of keywords minimizes cognitive load and the influence on driving performance, and (3) the use of icons increases the attractiveness of the interface.
Assistive technology for visually impaired and blind people is a research field that is gaining increasing prominence owing to an explosion of new interest in it from disparate disciplines. The field has a very relevant social impact on our ever-increasing aging and blind populations. While many excellent state-of-the-art accounts have been written to date, all of them are subjective in nature. We performed an objective statistical survey across the various sub-disciplines in the field and applied information analysis and network-theory techniques to answer several key questions relevant to the field. To analyze the field, we compiled an extensive database of scientific research publications over the last two decades. We inferred interesting patterns and statistics concerning the main research areas and underlying themes, identified leading journals and conferences, captured growth patterns of the research field, identified active research communities, and present our interpretation of trends in the field for the near future. Our results reveal that there has been sustained growth in this field, from fewer than 50 publications per year in the mid-1990s to close to 400 scientific publications per year in 2014. Assistive technology for persons with visual impairments is expected to grow at a swift pace and to impact the lives of individuals and the elderly in ways not previously possible.
When engaging in social interaction, people rely on their ability to reason about unobservable mental content of others, which includes goals, intentions, and beliefs. This so-called theory of mind ability allows them to more easily understand, predict, and influence the behavior of others. People even use their theory of mind to reason about the theory of mind of others, which allows them to understand sentences like ‘Alice believes that Bob does not know about the surprise party’. But while the use of higher orders of theory of mind is apparent in many social interactions, empirical evidence so far suggests that people do not use this ability spontaneously when playing strategic games, even when doing so would be highly beneficial. In this paper, we attempt to encourage participants to engage in higher-order theory of mind reasoning by letting them play a game against computational agents. Since previous research suggests that competitive games may encourage the use of theory of mind, we investigate a particular competitive game, the Mod game, which can be seen as a much larger variant of the well-known rock–paper–scissors game. By using a combination of computational agents and Bayesian model selection, we simultaneously determine to what extent people make use of higher-order theory of mind reasoning, as well as to what extent computational agents can encourage the use of higher-order theory of mind in their human opponents. Our results show that participants who play the Mod game against computational theory of mind agents adjust their level of theory of mind reasoning to that of their computer opponent. Earlier experiments with other strategic games show that participants only engage in low orders of theory of mind reasoning. Surprisingly, we find that participants who knowingly play against second- and third-order theory of mind agents apply up to fourth-order theory of mind themselves, and achieve higher scores as a result.
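For readers unfamiliar with the Mod game, the sketch below implements its commonly used scoring rule, assuming the formulation in which each player picks an integer from 1 to m and earns a point for every opponent whose pick is exactly one below theirs modulo m (with m = 3 this reduces to rock-paper-scissors). The exact parameters of the experiment are not taken from the abstract and are illustrative only.

```python
# Minimal sketch of the Mod game scoring rule, assuming the common
# formulation: each player picks an integer in 1..m, and earns a point
# for every opponent whose pick is exactly one below theirs (modulo m).
def mod_game_scores(choices, m):
    """choices: list of each player's pick (1..m). Returns one score per player."""
    scores = []
    for i, c in enumerate(choices):
        beaten = (c - 2) % m + 1          # the pick that c beats (wraps around)
        scores.append(sum(1 for j, other in enumerate(choices)
                          if j != i and other == beaten))
    return scores

# Example with m = 24 and three players: 5 beats 4, so player 0 scores a point.
print(mod_game_scores([5, 4, 17], m=24))   # -> [1, 0, 0]
```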
Social interactions entail reciprocal reactions in which one’s communicative acts trigger responses in others. Fluent interpersonal exchange relies on the ability to discriminate behaviors produced by others that are responses to one’s actions, thus involving a social sense of agency. Given the pivotal role of gaze in human communication, we propose to use gaze following as a model for studying the sense of agency in social actions. The experiment investigates the influence of sensory expertise and of the timing of the action’s effects by comparing feedback provided by a human avatar versus a nonfigurative animated object (an arrow) and by varying the timing with which participants’ gaze controlled the feedback. Results revealed a linear relationship between the judgement of agency and feedback latencies, and higher agency-discrimination performance with the avatar. These outcomes suggest that classical cognitive accounts of the sense of agency can be expanded to the realm of social actions and provide important information for designing virtual agents to train social gaze interactions.
The way doctors deliver bad news has a significant impact on the therapeutic process. In order to facilitate doctors’ training, we have developed an embodied conversational agent simulating a patient to train doctors to break bad news. In this article, we present an evaluation of the virtual reality training platform comparing the users’ experience across virtual environment displays: a PC desktop, a virtual reality headset, and a four-wall fully immersive system. The results of the experiment, which included both real doctors and naive participants, reveal a significant impact of the environment display on the perception of the user (sense of presence, sense of co-presence, perception of the believability of the virtual patient), and moreover show that participants’ perceptions differ depending on their level of expertise.
A multimodal, cross-cultural corpus of affective behavior is presented in this research work. The corpus construction process, including issues related to the design and implementation of an experiment, is discussed along with the resulting acoustic prosody, facial expression, and gesture expressivity features. The research work presented here focuses primarily on the cross-cultural aspect of gestural behavior, defining a common corpus construction protocol that aims to identify cultural patterns in non-verbal behavior across cultures, i.e. German, Greek, and Italian. Culture-specific findings regarding gesture expressivity are derived from the affective analysis performed. Additionally, the multimodal aspect, including prosody and facial expressions, is investigated in terms of fusion techniques. Finally, a plan for releasing the corpus into the public domain is discussed, aiming to establish the current corpus as a benchmark multimodal, cross-cultural standard and reference point.
To date, multimodal speech recognition systems based on the processing of audio and video signals show significantly better results than their unimodal counterparts. In general, researchers divide the solution of the audio–visual speech recognition problem into two parts: first, extracting the most informative features from each modality, and second, fusing the two modalities in the most effective way. Ultimately, this leads to an improvement in the accuracy of speech recognition. Almost all modern studies use this approach with video data recorded at the standard speed of 25 frames per second. The choice of this recording speed is easily explained, since the vast majority of existing audio–visual databases are recorded at this rate. However, it should be noted that 25 frames per second is a worldwide standard in many areas and has never been specifically determined for speech recognition tasks. The main purpose of this study is to investigate the effect of high-speed video data (up to 200 frames per second) on speech recognition accuracy, and also to find out whether the use of a high-speed video camera makes speech recognition systems more robust to acoustic noise. To this end, we recorded a database of audio–visual Russian speech with high-speed video recordings, consisting of recordings of 20 speakers, each pronouncing 200 phrases of continuous Russian speech. Experiments performed on this database showed an improvement in the absolute speech recognition rate of up to 3.10%. We also showed that the use of the high-speed camera with 200 fps allows achieving better recognition results under acoustically noisy conditions (signal-to-noise ratio varied between 40 and 0 dB) with different types of noise (e.g. white noise, babble noise).