In this paper we propose a new no-reference (NR) image quality assessment (IQA) metric using the recently revealed free-energy-based brain theory and classical human visual system (HVS)-inspired features. The features used can be divided into three groups. The first group comprises features inspired by the free energy principle and the structural degradation model. The free energy theory also reveals that the HVS always tries to infer the meaningful part from the visual stimuli. Based on this finding, we first predict the image that the HVS perceives from a distorted image using the free energy theory; the second group then consists of HVS-inspired features (such as structural information and gradient magnitude) computed from the distorted and predicted images. The third group of features quantifies the possible loss of "naturalness" in the distorted image by fitting a generalized Gaussian distribution to the mean subtracted contrast normalized coefficients. After feature extraction, our algorithm uses a support vector machine based regression module to derive the overall quality score. Experiments on the LIVE, TID2008, CSIQ, IVC, and Toyama databases confirm the effectiveness of the proposed NR IQA metric compared to the state of the art.
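As a rough illustration of the third feature group, the mean subtracted contrast normalized (MSCN) coefficients can be sketched as below. This is a minimal sketch and not the authors' implementation: the uniform (box) local window, the `radius` parameter, and the stabilizing constant `c` are assumptions made here for simplicity (a Gaussian-weighted window is a common alternative), and the function names are hypothetical.

```python
import numpy as np

def local_mean(img, radius=3):
    """Mean over a (2*radius+1)^2 box window, with reflect padding (assumed window)."""
    img = np.asarray(img, dtype=np.float64)
    padded = np.pad(img, radius, mode="reflect")
    h, w = img.shape
    k = 2 * radius + 1
    acc = np.zeros((h, w), dtype=np.float64)
    # Sum every shifted copy of the image that falls inside the window.
    for dy in range(k):
        for dx in range(k):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / (k * k)

def mscn(img, radius=3, c=1.0):
    """MSCN coefficients: (I - local mean) / (local std + c)."""
    img = np.asarray(img, dtype=np.float64)
    mu = local_mean(img, radius)
    var = local_mean(img * img, radius) - mu * mu
    sigma = np.sqrt(np.maximum(var, 0.0))  # clip tiny negative values from rounding
    return (img - mu) / (sigma + c)
```

A generalized Gaussian distribution would then be fitted to the values returned by `mscn`; that fitting step is not shown here.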
Recently developed depth sensors, e.g., the Kinect sensor, have provided new opportunities for human-computer interaction (HCI). Although great progress has been made by leveraging the Kinect sensor, e.g., in human body tracking, face recognition and human action recognition, robust hand gesture recognition remains an open problem. Compared to the entire human body, the hand is a smaller object with more complex articulations and is more easily affected by segmentation errors. Recognizing hand gestures is thus a very challenging problem. This paper focuses on building a robust part-based hand gesture recognition system using the Kinect sensor. To handle the noisy hand shapes obtained from the Kinect sensor, we propose a novel distance metric, Finger-Earth Mover's Distance (FEMD), to measure the dissimilarity between hand shapes. As it matches only the finger parts rather than the whole hand, it can better distinguish hand gestures with slight differences. Extensive experiments demonstrate that our hand gesture recognition system is accurate (a 93.2% mean accuracy on a challenging 10-gesture dataset), efficient (average 0.0750 s per frame), robust to hand articulations, distortions and orientation or scale changes, and can work in uncontrolled environments (cluttered backgrounds and lighting conditions). The superiority of our system is further demonstrated in two real-life HCI applications.
For intelligent systems to make best use of the audio modality, it is important that they can recognize not just speech and music, which have been researched as specific tasks, but also general sounds in everyday environments. To stimulate research in this field we conducted a public research challenge: the IEEE Audio and Acoustic Signal Processing Technical Committee challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). In this paper, we report on the state of the art in automatically classifying audio scenes, and automatically detecting and classifying audio events. We survey prior work as well as the state of the art represented by the submissions to the challenge from various research groups. We also provide detail on the organization of the challenge, so that our experience as challenge hosts may be useful to those organizing challenges in similar domains. We created new audio datasets and baseline systems for the challenge; these, as well as some submitted systems, are publicly available under open licenses, to serve as benchmarks for further research in general-purpose machine listening.
The emerging high efficiency video coding standard (HEVC) adopts the quadtree-structured coding unit (CU). Each CU allows recursive splitting into four equal sub-CUs. At each depth level (CU size), the test model of HEVC (HM) performs motion estimation (ME) with different sizes including 2N × 2N, 2N × N, N × 2N and N × N. ME process in HM is performed using all the possible depth levels and prediction modes to find the one with the least rate distortion (RD) cost using Lagrange multiplier. This achieves the highest coding efficiency but requires a very high computational complexity. In this paper, we propose a fast CU size decision algorithm for HM. Since the optimal depth level is highly content-dependent, it is not efficient to use all levels. We can determine CU depth range (including the minimum depth level and the maximum depth level) and skip some specific depth levels rarely used in the previous frame and neighboring CUs. Besides, the proposed algorithm also introduces early termination methods based on motion homogeneity checking, RD cost checking and SKIP mode checking to skip ME on unnecessary CU sizes. Experimental results demonstrate that the proposed algorithm can significantly reduce computational complexity while maintaining almost the same RD performance as the original HEVC encoder.
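The depth decision described above can be sketched with the usual Lagrangian cost J = D + λR. This is a minimal sketch under stated assumptions, not the HM implementation: the function names, the dictionary of per-depth (distortion, rate) pairs, and the `depth_range` argument are hypothetical, and in a real encoder the skipped depths would never be evaluated at all.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * rate_bits

def best_depth(candidates, lam, depth_range=(0, 3)):
    """Return the depth with the lowest RD cost inside the allowed depth range.

    candidates maps depth -> (distortion, rate_bits); depths outside
    depth_range are skipped, mimicking a restricted CU depth search.
    """
    d_min, d_max = depth_range
    best, best_cost = None, float("inf")
    for depth in sorted(candidates):
        if depth < d_min or depth > d_max:
            continue  # skipped depth level: no cost is computed for it
        distortion, rate_bits = candidates[depth]
        cost = rd_cost(distortion, rate_bits, lam)
        if cost < best_cost:
            best, best_cost = depth, cost
    return best, best_cost
```

For example, with hypothetical costs `{0: (100, 10), 1: (80, 20), 2: (60, 50), 3: (55, 90)}` and `lam = 1.0`, the full search picks depth 1, while restricting the range to `(2, 3)` picks depth 2 after evaluating only two levels.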
Recent progress in using long short-term memory (LSTM) for image captioning has motivated the exploration of its application to video captioning. By taking a video as a sequence of features, an LSTM model is trained on video-sentence pairs and learns to associate a video with a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without considering an attention mechanism that allows salient features to be selected. Furthermore, existing approaches usually model the translation error, but ignore the correlations between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, to transfer videos to natural sentences. This framework integrates an attention mechanism with LSTM to capture salient structures of video, and explores the correlation between multimodal representations (i.e., words and visual content) for generating sentences with rich semantic content. Specifically, we first propose an attention mechanism that uses the dynamic weighted sum of local two-dimensional convolutional neural network representations. Then, an LSTM decoder takes these visual features at time t and the word-embedding feature at time t-1 to generate important words. Finally, we use multimodal embedding to map the visual and sentence features into a joint space to guarantee the semantic consistency of the sentence description and the video visual content. Experiments on the benchmark datasets demonstrate that our method using a single feature can achieve competitive or even better results than the state-of-the-art baselines for video captioning in both BLEU and METEOR.
In this paper, we consider the problem of pedestrian detection in natural scenes. Intuitively, instances of pedestrians with different spatial scales may exhibit dramatically different features. Thus, large variance in instance scales, which results in undesirable large intracategory variance in features, may severely hurt the performance of modern object instance detection methods. We argue that this issue can be substantially alleviated by the divide-and-conquer philosophy. Taking pedestrian detection as an example, we illustrate how we can leverage this philosophy to develop a Scale-Aware Fast R-CNN (SAF R-CNN) framework. The model introduces multiple built-in subnetworks which detect pedestrians with scales from disjoint ranges. Outputs from all of the subnetworks are then adaptively combined to generate the final detection results that are shown to be robust to large variance in instance scales, via a gate function defined over the sizes of object proposals. Extensive evaluations on several challenging pedestrian detection datasets demonstrate the effectiveness of the proposed SAF R-CNN. Particularly, our method achieves state-of-the-art performance on Caltech [P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 743-761, Apr. 2012], and obtains competitive results on INRIA [N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, pp. 886-893], ETH [A. Ess, B. Leibe, and L. V. Gool, "Depth and appearance for mobile scene analysis," in Proc. Int. Conf. Comput. Vis., 2007, pp. 1-8], and KITTI [A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 3354-3361].
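The idea of combining subnetwork outputs through a gate over proposal size can be sketched as follows. The abstract does not specify the gate, so this sketch assumes a sigmoid over the proposal height and only two subnetworks (one for large, one for small instances); the function names and the parameters `mean_height`, `alpha`, and `beta` are hypothetical.

```python
import math

def gate_weights(height, mean_height=50.0, alpha=1.0, beta=10.0):
    """Assumed sigmoid gate: weight of the large-scale subnetwork grows with height."""
    w_large = 1.0 / (1.0 + alpha * math.exp(-(height - mean_height) / beta))
    return w_large, 1.0 - w_large

def combine_scores(score_large, score_small, height, **gate_kwargs):
    """Blend the two subnetwork scores with the size-dependent gate weights."""
    w_large, w_small = gate_weights(height, **gate_kwargs)
    return w_large * score_large + w_small * score_small
```

Under these assumptions a proposal of average height weights both subnetworks equally, while a very tall proposal is scored almost entirely by the large-scale subnetwork.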
Face images appearing in multimedia applications, e.g., social networks and digital entertainment, usually exhibit dramatic pose, illumination, and expression variations, resulting in considerable performance degradation for traditional face recognition algorithms. This paper proposes a comprehensive deep learning framework to jointly learn face representation using multimodal information. The proposed deep learning structure is composed of a set of elaborately designed convolutional neural networks (CNNs) and a three-layer stacked auto-encoder (SAE). The set of CNNs extracts complementary facial features from multimodal data. Then, the extracted features are concatenated to form a high-dimensional feature vector, whose dimension is compressed by SAE. All of the CNNs are trained using a subset of 9,000 subjects from the publicly available CASIA-WebFace database, which ensures the reproducibility of this work. Using the proposed single CNN architecture and limited training data, 98.43% verification rate is achieved on the LFW database. Benefitting from the complementary information contained in multimodal data, our small ensemble system achieves higher than 99.0% recognition rate on LFW using publicly available training set.
Online social and news media generate rich and timely information about real-world events of all kinds. However, the huge amount of data available, along with the breadth of the user base, requires a substantial effort of information filtering to successfully drill down to relevant topics and events. Trending topic detection is therefore a fundamental building block to monitor and summarize information originating from social sources. A wide variety of methods and variables exist, and they greatly affect the quality of the results. We compare six topic detection methods on three Twitter datasets related to major events, which differ in their time scale and topic churn rate. We observe how the nature of the event considered, the volume of activity over time, the sampling procedure and the pre-processing of the data all greatly affect the quality of detected topics, which also depends on the type of detection method used. We find that standard natural language processing techniques can perform well for social streams on very focused topics, but novel techniques designed to mine the temporal distribution of concepts are needed to handle more heterogeneous streams containing multiple stories evolving in parallel. One of the novel topic detection methods we propose, based on n-gram co-occurrence and topic ranking, consistently achieves the best performance across all these conditions, thus being more reliable than other state-of-the-art techniques.
To offload and alleviate the heavy base station (BS) traffic load caused by the rapidly growing video services, device-to-device (D2D) communication, as one of the most indispensable technologies of the future cellular networks, can be potentially exploited by mobile users to distribute videos for a BS. In this paper, an effective pricing-based multicast video distribution system and a grid-based clustering method are proposed to support the distribution. Moreover, with the consideration of users' mobility and social characteristics, we classify them into multicast and core types by studying the user stay probability and familiarity. In particular, core users can cooperate with the BS to distribute videos to the multicast users through intracluster D2D multicast. However, core users cannot selflessly help the BS to distribute videos; instead, they will evaluate their personal benefits before distributing the videos to the multicast users. Further, a Stackelberg game-based pricing mechanism is proposed to inspire the core users to distribute videos. Simulation results demonstrate that the proposed mechanism can not only effectively alleviate the BS traffic load, but also significantly improve the effectiveness and reliability of video transmission.
This paper introduces a novel rotation-based framework for arbitrary-oriented text detection in natural scene images. We present the Rotation Region Proposal Networks, which are designed to generate inclined proposals with text orientation angle information. The angle information is then adapted for bounding box regression to make the proposals more accurately fit into the text region in terms of the orientation. The Rotation Region-of-Interest pooling layer is proposed to project arbitrary-oriented proposals to a feature map for a text region classifier. The whole framework is built upon a region-proposal-based architecture, which ensures the computational efficiency of the arbitrary-oriented text detection compared with previous text detection systems. We conduct experiments using the rotation-based framework on three real-world scene text detection datasets and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.
With the widespread adoption of multidevice communication, such as telecommuting, screen content images (SCIs) have become more closely and frequently related to our daily lives. For SCIs, the tasks of accurate visual quality assessment, high-efficiency compression, and suitable contrast enhancement have thus currently attracted increased attention. In particular, the quality evaluation of SCIs is important due to its good ability for instruction and optimization in various processing systems. Hence, in this paper, we develop a new objective metric for research on perceptual quality assessment of distorted SCIs. Compared to the classical MSE, our method, which mainly relies on simple convolution operators, first highlights the degradations in structures caused by different types of distortions and then detects salient areas where the distortions usually attract more attention. A comparison of our algorithm with the most popular and state-of-the-art quality measures is performed on two new SCI databases (SIQAD and SCD). Extensive results are provided to verify the superiority and efficiency of the proposed IQA technique.
As an essential way of understanding human emotional behavior, speech emotion recognition (SER) has attracted a great deal of attention in human-centered signal processing. Accuracy in SER heavily depends on finding good affect-related, discriminative features. In this paper, we propose to learn affect-salient features for SER using convolutional neural networks (CNN). The training of the CNN involves two stages. In the first stage, unlabeled samples are used to learn local invariant features (LIF) using a variant of sparse auto-encoder (SAE) with reconstruction penalization. In the second stage, LIF is used as the input to a feature extractor, salient discriminative feature analysis (SDFA), to learn affect-salient, discriminative features using a novel objective function that encourages feature saliency, orthogonality, and discrimination for SER. Our experimental results on benchmark datasets show that our approach leads to stable and robust recognition performance in complex scenes (e.g., with speaker and language variation, and environment distortion) and outperforms several well-established SER features.
Multimodal medical image fusion, as a powerful tool for clinical applications, has developed with the advent of various imaging modalities in medical imaging. The main motivation is to capture the most relevant information from the sources in a single output, which plays an important role in medical diagnosis. In this paper, a novel fusion framework is proposed for multimodal medical images based on the non-subsampled contourlet transform (NSCT). The source medical images are first transformed by NSCT, and their low- and high-frequency components are then combined. Two different fusion rules based on phase congruency and directive contrast are proposed and used to fuse low- and high-frequency coefficients. Finally, the fused image is constructed by the inverse NSCT with all composite coefficients. Experimental results and a comparative study show that the proposed fusion framework provides an effective way to enable more accurate analysis of multimodality images. Further, the applicability of the proposed framework is demonstrated on three clinical examples of persons affected with Alzheimer's disease, subacute stroke, and recurrent tumor.
Whereas deep neural networks were first mostly used for classification tasks, they are rapidly expanding in the realm of structured output problems, where the observed target is composed of multiple random variables that have a rich joint distribution, given the input. In this paper we focus on the case where the input also has a rich structure and the input and output structures are somehow related. We describe systems that learn to attend to different places in the input, for each element of the output, for a variety of tasks: machine translation, image caption generation, video clip description, and speech recognition. All these systems are based on a shared set of building blocks: gated recurrent neural networks and convolutional neural networks, along with trained attention mechanisms. We report on experimental results with these systems, showing impressively good performance and the advantage of the attention mechanism.
The newly developed HEVC video coding standard can achieve higher compression performance than the previous video coding standards, such as MPEG-4, H.263 and H.264/AVC. However, HEVC's high computational complexity raises concerns about the computational burden on real-time applications. In this paper, a fast pyramid motion divergence (PMD) based CU selection algorithm is presented for HEVC inter prediction. The PMD features are calculated from the estimated optical flow of the downsampled frames. Theoretical analysis shows that PMD can be used to help select the CU size. A k-nearest-neighbor-like method is used to determine the CU splittings. Experimental results show that the fast inter prediction method speeds up inter coding significantly with negligible loss of peak signal-to-noise ratio.
Previous works on image emotion analysis mainly focused on predicting the dominant emotion category or the average dimension values of an image for affective image classification and regression. However, this is often insufficient in various real-world applications, as the emotions that are evoked in viewers by an image are highly subjective and different. In this paper, we propose to predict the continuous probability distribution of image emotions which are represented in dimensional valence-arousal space. We carried out large-scale statistical analysis on the constructed Image-Emotion-Social-Net dataset, on which we observed that the emotion distribution can be well-modeled by a Gaussian mixture model. This model is estimated by an expectation-maximization algorithm with specified initializations. Then, we extract commonly used emotion features at different levels for each image. Finally, we formalize the emotion distribution prediction task as a shared sparse regression (SSR) problem and extend it to multitask settings, named multitask shared sparse regression (MTSSR), to explore the latent information between different prediction tasks. SSR and MTSSR are optimized by iteratively reweighted least squares. Experiments are conducted on the Image-Emotion-Social-Net dataset with comparisons to three alternative baselines. The quantitative results demonstrate the superiority of the proposed method.
Weakly-supervised image segmentation is a challenging problem with multidisciplinary applications in multimedia content analysis and beyond. It aims to segment an image by leveraging its image-level semantics (i.e., tags). This paper presents a weakly-supervised image segmentation algorithm that learns the distribution of spatially structural superpixel sets from image-level labels. More specifically, we first extract graphlets from a given image, which are small-sized graphs consisting of superpixels and encapsulating their spatial structure. Then, an efficient manifold embedding algorithm is proposed to transfer labels from training images into graphlets. It is further observed that there are numerous redundant graphlets that are not discriminative to semantic categories, which are abandoned by a graphlet selection scheme as they make no contribution to the subsequent segmentation. Thereafter, we use a Gaussian mixture model (GMM) to learn the distribution of the selected post-embedding graphlets (i.e., vectors output from the graphlet embedding). Finally, we propose an image segmentation algorithm, termed representative graphlet cut, which leverages the learned GMM prior to measure the structure homogeneity of a test image. Experimental results show that the proposed approach outperforms state-of-the-art weakly-supervised image segmentation methods, on five popular segmentation data sets. Besides, our approach performs competitively to the fully-supervised segmentation models.
This paper proposes a new hashing framework to conduct similarity search via the following steps. First, linear clustering methods are employed to obtain a set of representative data points and a set of landmarks of the big dataset. Second, the landmarks are used to generate a probability representation for each data point; this representation is further proved to preserve the neighborhood of each data point. Third, PCA is integrated with manifold learning to learn the hash functions using the probability representations of all representative data points. As a consequence, the proposed hashing method achieves efficient similarity search (with linear time complexity), effective hashing performance, and high generalization ability (simultaneously preserving two kinds of complementary similarity structures, i.e., local structures via manifold learning and global structures via PCA). Experimental results on four public datasets clearly demonstrate the advantages of our proposed method in terms of similarity search, compared to the state-of-the-art hashing methods.
Recently, position-patch based approaches have been proposed to replace the probabilistic graph-based or manifold learning-based models for face hallucination. In order to obtain the optimal weights of face hallucination, these approaches represent one image patch through other patches at the same position of training faces by employing least square estimation or sparse coding. However, they cannot provide unbiased approximations or satisfy rational priors, thus the obtained representation is not satisfactory. In this paper, we propose a simpler yet more effective scheme called Locality-constrained Representation (LcR). Compared with Least Square Representation (LSR) and Sparse Representation (SR), our scheme incorporates a locality constraint into the least square inversion problem to maintain locality and sparsity simultaneously. Our scheme is capable of capturing the non-linear manifold structure of image patch samples while exploiting the sparse property of the redundant data representation. Moreover, when the locality constraint is satisfied, face hallucination is robust to noise, a property that is desirable for video surveillance applications. A statistical analysis of the properties of LcR is given together with experimental results on some public face databases and surveillance images to show the superiority of our proposed scheme over state-of-the-art face hallucination approaches.
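A locality-constrained least squares representation of one patch can be sketched as below. The abstract does not state the objective, so this sketch assumes the commonly used form: minimize ||x - Yw||^2 + tau * sum_i (d_i * w_i)^2 subject to sum_i w_i = 1, where d_i is the Euclidean distance between the input patch x and the i-th training patch. Under that assumption the weights have a closed-form solution. The function name, `tau`, and the small ridge term `eps` (added only for numerical stability) are hypothetical.

```python
import numpy as np

def lcr_weights(x, Y, tau=0.1, eps=1e-8):
    """Weights for representing patch x by the columns of Y (assumed objective).

    x: (d,) input patch vector; Y: (d, n) training patches at the same position.
    """
    x = np.asarray(x, dtype=np.float64)
    Y = np.asarray(Y, dtype=np.float64)
    n = Y.shape[1]
    Z = x[:, None] - Y                    # differences between x and each training patch
    G = Z.T @ Z                           # local Gram matrix
    d = np.linalg.norm(Z, axis=0)         # distance from x to each training patch
    A = G + tau * np.diag(d ** 2) + eps * np.eye(n)
    w = np.linalg.solve(A, np.ones(n))    # solve A w = 1
    return w / w.sum()                    # enforce the sum-to-one constraint
```

In face hallucination the weights computed on low-resolution patches would then be applied to the corresponding high-resolution training patches; that reconstruction step is not shown. With this penalty, training patches far from the input receive small weights, which is what yields locality and approximate sparsity together.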
Uyghur text localization in images with complex backgrounds is a challenging yet important task for many applications. Generally, Uyghur characters in images consist of strokes with uniform features, and they are distinct from backgrounds in color, intensity, and texture. Based on these differences, we propose a FASTroke keypoint extractor, which is fast and stroke-specific. Compared with the commonly used MSER detector, FASTroke produces less than twice the amount of components and recognizes at least 10% more characters. Since the characters in a line usually have uniform features such as size, color, and stroke width, a component-similarity-based clustering is presented that requires no component-level classification. It thus avoids the extra errors that a component-level classifier would introduce, while the computing cost is drastically reduced. The experiments show that the proposed method can achieve the best performance on the UICBI-500 benchmark dataset.