In recent years, deep learning has led to very strong performance on a variety of problems, such as visual recognition, speech recognition and natural language processing. Among the different types of deep neural networks, convolutional neural networks (CNNs) have been the most extensively studied. Leveraging the rapid growth in the amount of annotated data and the great improvements in the strength of graphics processing units, research on convolutional neural networks has advanced swiftly and achieved state-of-the-art results on various tasks. In this paper, we provide a broad survey of the recent advances in convolutional neural networks. We detail the improvements of CNNs in several aspects, including layer design, activation functions, loss functions, regularization, optimization and fast computation. In addition, we introduce various applications of convolutional neural networks in computer vision, speech and natural language processing.
This paper presents a fiducial marker system particularly suited to camera pose estimation in applications such as augmented reality and robot localization. Three main contributions are presented. First, we propose an algorithm for generating configurable marker dictionaries (in size and number of bits) following a criterion that maximizes the inter-marker distance and the number of bit transitions. In the process, we derive the maximum theoretical inter-marker distance that dictionaries of square binary markers can have. Second, a method for automatically detecting the markers and correcting possible errors is proposed. Third, a solution to the occlusion problem in augmented reality applications is shown. To that aim, multiple markers are combined with an occlusion mask calculated by color segmentation. The experiments conducted show that our proposal obtains dictionaries with higher inter-marker distances and lower false negative rates than state-of-the-art systems, and provides an effective solution to the occlusion problem.
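As an illustration of the dictionary-generation idea described above, the sketch below greedily accepts random square binary markers whose rotation-invariant Hamming distance to every previously accepted marker (and between their own rotations) exceeds a threshold. It is a minimal sketch, not the paper's algorithm: the actual criterion also rewards bit transitions, which this sketch omits, and all names and parameter values are illustrative.

```python
import numpy as np

def marker_distance(a, b):
    """Minimum Hamming distance between marker `a` and the four 90-degree
    rotations of marker `b` (markers are n x n arrays of 0/1 bits)."""
    return min(int(np.sum(a != np.rot90(b, k))) for k in range(4))

def generate_dictionary(size=10, n=6, min_dist=10, seed=0):
    """Greedily accept random markers that keep a rotation-invariant
    Hamming distance of at least `min_dist` to all accepted markers and
    to their own rotations (to avoid rotational ambiguity)."""
    rng = np.random.default_rng(seed)
    dictionary = []
    while len(dictionary) < size:
        cand = rng.integers(0, 2, (n, n))
        self_ok = all(np.sum(cand != np.rot90(cand, k)) >= min_dist
                      for k in (1, 2, 3))
        if self_ok and all(marker_distance(cand, m) >= min_dist
                           for m in dictionary):
            dictionary.append(cand)
    return dictionary

markers = generate_dictionary()
```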
Identifying the same individual across different scenes is an important yet difficult task in intelligent video surveillance. Its main difficulty lies in how to preserve the similarity of the same person against large appearance and structure variations while discriminating between different individuals. In this paper, we present a scalable distance-driven feature learning framework based on a deep neural network for person re-identification, and demonstrate its effectiveness in handling the existing challenges. Specifically, given training images with class labels (person IDs), we first produce a large number of triplet units, each of which contains three images, i.e. one person with a matched reference and a mismatched reference. Treating the units as the input, we build a convolutional neural network to generate the layered representations, followed by a distance metric. Through parameter optimization, our framework tends to maximize the relative distance between the matched pair and the mismatched pair for each triplet unit. Moreover, a nontrivial issue arising with the framework is that the triplet organization cubically enlarges the number of training triplets, as one image can be involved in several triplet units. To overcome this problem, we develop an effective triplet generation scheme and an optimized gradient descent algorithm, making the computational load depend mainly on the number of original images instead of the number of triplets. On several challenging databases, our approach achieves very promising results and outperforms other state-of-the-art approaches.
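The relative-distance objective over a triplet unit is commonly written as a hinge loss; the sketch below shows this standard form (the paper's exact formulation may differ), assuming `anchor`, `positive` and `negative` are batches of embeddings produced by the shared network.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Push the matched (anchor, positive) distance below the mismatched
    (anchor, negative) distance by at least `margin` for each triplet."""
    d_pos = F.pairwise_distance(anchor, positive)  # matched pair
    d_neg = F.pairwise_distance(anchor, negative)  # mismatched pair
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage: batches of 128-d embeddings from the shared CNN.
a, p, n = (torch.randn(32, 128) for _ in range(3))
loss = triplet_loss(a, p, n)
```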
We present an analysis of three possible strategies for exploiting the power of existing convolutional neural networks (ConvNets or CNNs) in scenarios different from the ones in which they were trained: full training, fine tuning, and using ConvNets as feature extractors. In many applications, including remote sensing, it is not feasible to fully design and train a new ConvNet, as this usually requires a considerable amount of labeled data and demands high computational costs. Therefore, it is important to understand how to make better use of existing ConvNets. We perform experiments with six popular ConvNets using three remote sensing datasets. We also compare ConvNets in each strategy with existing descriptors and with state-of-the-art baselines. The results indicate that fine tuning tends to be the best performing strategy. In fact, using the features from the fine-tuned ConvNet with a linear SVM obtains the best results. We also achieved state-of-the-art results for the three datasets used.
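To make the transfer strategies concrete, here is a minimal PyTorch sketch of the feature-extractor and fine-tuning strategies; the specific network and class count are illustrative, not the paper's setup.

```python
import torch
import torchvision

NUM_CLASSES = 21  # illustrative number of scene classes

# Feature-extractor strategy: freeze every pretrained weight and train
# only a new classifier (here a linear layer; a linear SVM on the
# penultimate activations is the variant discussed in the abstract).
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, NUM_CLASSES)

# Fine-tuning strategy: unfreeze the pretrained layers as well and
# train the whole network with a small learning rate.
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```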
Facial expression recognition has been an active research area over the past ten years, with growing application areas including avatar animation, neuromarketing and sociable robots. The recognition of facial expressions is not an easy problem for machine learning methods, since people can vary significantly in the way they show their expressions. Even images of the same person showing the same facial expression can vary in brightness, background and pose, and these variations are emphasized when considering different subjects (because of variations in shape, ethnicity and other factors). Although facial expression recognition has been widely studied in the literature, few works perform a fair evaluation that avoids mixing subjects between the training and test sets. Hence, facial expression recognition is still a challenging problem in computer vision. In this work, we propose a simple solution for facial expression recognition that uses a combination of a Convolutional Neural Network and specific image pre-processing steps. Convolutional Neural Networks achieve better accuracy with big data. However, there are no publicly available datasets with sufficient data for facial expression recognition with deep architectures. Therefore, to tackle the problem, we apply some pre-processing techniques to extract only expression-specific features from a face image and explore the presentation order of the samples during training. The experiments used to evaluate our technique were carried out on three widely used public databases (CK+, JAFFE and BU-3DFE). A study of the impact of each image pre-processing operation on the accuracy rate is presented. The proposed method achieves competitive results when compared with other facial expression recognition methods (96.76% accuracy on the CK+ database), is fast to train, and allows real-time facial expression recognition with standard computers.
High-dimensional problem domains pose significant challenges for anomaly detection. The presence of irrelevant features can conceal the presence of anomalies. This problem, known as the ‘curse of dimensionality’, is an obstacle for many anomaly detection techniques. Building a robust anomaly detection model for use in high-dimensional spaces requires the combination of an unsupervised feature extractor and an anomaly detector. While one-class support vector machines are effective at producing decision surfaces from well-behaved feature vectors, they can be inefficient at modelling the variation in large, high-dimensional datasets. Architectures such as deep belief networks (DBNs) are a promising technique for learning robust features. We present a hybrid model in which an unsupervised DBN is trained to extract generic underlying features, and a one-class SVM is trained on the features learned by the DBN. Since a linear kernel can be substituted for nonlinear ones in our hybrid model without loss of accuracy, our model is scalable and computationally efficient. The experimental results show that our proposed model yields anomaly detection performance comparable to a deep autoencoder, while reducing its training and testing time by a factor of 3 and 1000, respectively.
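A minimal sketch of the hybrid idea, using scikit-learn's stacked BernoulliRBMs as a stand-in for the unsupervised DBN and a linear-kernel one-class SVM on top; the data and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.random((500, 100))   # normal data only, scaled to [0, 1]
X_test = rng.random((50, 100))

# Greedy layer-wise RBM stack as the unsupervised feature extractor,
# followed by a *linear* kernel one-class SVM on the learned features.
model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=64, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=32, random_state=0)),
    ("ocsvm", OneClassSVM(kernel="linear", nu=0.1)),
])
model.fit(X_train)
scores = model.decision_function(X_test)  # low scores flag anomalies
```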
Recently, increasing attention has been directed to the study of the emotional content of speech signals, and hence many systems have been proposed to identify the emotional content of a spoken utterance. This paper is a survey of speech emotion classification addressing three important aspects of the design of a speech emotion recognition system. The first is the choice of suitable features for speech representation. The second is the design of an appropriate classification scheme, and the third is the proper preparation of an emotional speech database for evaluating system performance. Conclusions about the performance and limitations of current speech emotion recognition systems are discussed in the last section of this survey, which also suggests possible ways of improving them.
In 1962, Hough was granted a patent for a method, popularly called the Hough Transform (HT), that efficiently identifies lines in images. It remains an important tool past its golden jubilee year, as evidenced by more than 2500 research papers dealing with its variants, generalizations, properties and applications in diverse fields. The current paper is a survey of the HT and its variants, their limitations and the modifications made to overcome them, implementation issues in software and hardware, and applications in various fields. Our survey, along with more than 200 references, will help researchers and students get a comprehensive view of the HT and guide them in applying it properly to their problems of interest.
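For reference, the standard HT votes each edge pixel into a (rho, theta) accumulator using the line parameterization rho = x cos(theta) + y sin(theta); a minimal implementation might look like this:

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    """Standard HT: every edge pixel votes for all (rho, theta) lines
    passing through it, with rho = x*cos(theta) + y*sin(theta); peaks
    in the accumulator correspond to detected lines."""
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag + 1, n_theta), dtype=np.int64)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1  # one vote per theta bin
    return acc, thetas
```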
Nonlinear methods such as Deep Neural Networks (DNNs) are the gold standard for various challenging machine learning problems such as image recognition. Although these methods perform impressively well, they have a significant disadvantage: their lack of transparency limits the interpretability of the solution and thus the scope of application in practice. DNNs in particular act as black boxes due to their multilayer nonlinear structure. In this paper we introduce a novel methodology for interpreting generic multilayer neural networks by decomposing the network's classification decision into contributions of its input elements. Although our focus is on image classification, the method is applicable to a broad set of input data, learning tasks and network architectures. Our method, called deep Taylor decomposition, efficiently utilizes the structure of the network by backpropagating the explanations from the output to the input layer. We evaluate the proposed method empirically on the MNIST and ILSVRC data sets.
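As a sketch of the backpropagation-of-explanations idea: for ReLU layers with non-negative inputs, deep Taylor decomposition reduces to the so-called z+ rule, which redistributes output relevance onto inputs in proportion to their positive contributions. A minimal NumPy version (shapes and names are illustrative):

```python
import numpy as np

def relevance_zplus(a, W, R_out):
    """One backward step of the z+ rule: redistribute the output
    relevance R_out onto the inputs `a` in proportion to the positive
    contributions a_j * max(w_jk, 0).

    a: (d_in,) input activations, W: (d_in, d_out), R_out: (d_out,)
    """
    Wp = np.maximum(W, 0.0)       # keep positive weights only
    z = a @ Wp + 1e-9             # total positive contribution per output
    s = R_out / z                 # relevance per unit of contribution
    return a * (s @ Wp.T)         # input relevance; conserves sum(R_out)
```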
In computer vision and pattern recognition research, the studied objects are often characterized by multiple feature representations of high dimensionality, so it is essential to encode these multiview features into a unified and discriminative embedding that is optimal for a given task. To address this challenge, this paper proposes an ensemble manifold regularized sparse low-rank approximation (EMR-SLRA) algorithm for multiview feature embedding. The EMR-SLRA algorithm is based on the framework of least-squares component analysis; in particular, the low-dimensional feature representation and the projection matrix are obtained by the low-rank approximation of the concatenated multiview feature matrix. To exploit the complementary property among multiple features, EMR-SLRA simultaneously enforces ensemble manifold regularization on the output feature embedding. To further enhance robustness against noise, group sparsity is introduced into the objective formulation to impose direct noise reduction on the input multiview feature matrix. Since there is no closed-form solution for EMR-SLRA, this paper provides an efficient optimization procedure to obtain the output feature embedding. Experiments on pattern recognition applications confirm the effectiveness of the EMR-SLRA algorithm compared with other multiview feature dimensionality reduction approaches.
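The abstract names the ingredients but not the formula; a plausible form of the objective, consistent with the description above though not necessarily the paper's exact formulation, is

```latex
\min_{P,\,Y,\,E,\,\boldsymbol{\mu}} \;
\lVert X - E - P Y \rVert_F^2
\;+\; \alpha \sum_{v=1}^{m} \mu_v \,\operatorname{tr}\!\left( Y L_v Y^{\top} \right)
\;+\; \beta \,\lVert E \rVert_{2,1}
\qquad \text{s.t.} \quad \sum_{v=1}^{m} \mu_v = 1, \;\; \mu_v \ge 0,
```

where X is the concatenated multiview feature matrix, PY its low-rank approximation, L_v the graph Laplacian of the v-th view (the ensemble manifold regularizer), E the group-sparse noise term, and alpha, beta trade-off parameters.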
Multi-label learning has received significant attention in the research community over the past few years, resulting in the development of a variety of multi-label learning methods. In this paper, we present an extensive experimental comparison of 12 multi-label learning methods using 16 evaluation measures over 11 benchmark datasets. We selected the competing methods based on their previous usage by the community, the representation of different groups of methods and the variety of basic underlying machine learning methods. Similarly, we selected the evaluation measures to be able to assess the behavior of the methods from a variety of viewpoints. In order to make conclusions independent of the application domain, we use 11 datasets from different domains. Furthermore, we compare the methods by their efficiency in terms of the time needed to learn a classifier and the time needed to produce a prediction for an unseen example. We analyze the results from the experiments using Friedman and Nemenyi tests for assessing the statistical significance of differences in performance. The results of the analysis show that for multi-label classification the best performing methods overall are random forests of predictive clustering trees (RF-PCT) and hierarchy of multi-label classifiers (HOMER), followed by binary relevance (BR) and classifier chains (CC). Furthermore, RF-PCT exhibited the best performance according to all measures for multi-label ranking. The recommendation from this study is that when new methods for multi-label learning are proposed, they should be compared to RF-PCT and HOMER using multiple evaluation measures.
► Experimental comparison of 12 multi-label learning methods on 11 benchmark datasets.
► Experimental evaluation using 16 performance measures and training and testing times.
► SVM-based methods perform better on small datasets compared to the other methods.
► RF-PCT and HOMER perform best over all evaluation measures.
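As a concrete baseline from the comparison above, binary relevance (BR) trains one independent binary classifier per label; in scikit-learn this corresponds to OneVsRestClassifier on a multi-label target. The dataset and base learner below are illustrative.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(n_samples=500, n_classes=5, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# Binary relevance: one independent binary classifier per label.
br = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
P = br.predict(X_te)
print(hamming_loss(Y_te, P), f1_score(Y_te, P, average="micro"))
```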
The validation of the results obtained by clustering algorithms is a fundamental part of the clustering process. The most widely used approaches for cluster validation are based on internal cluster validity indices. Although many indices have been proposed, there is no recent extensive comparative study of their performance. In this paper we show the results of an experimental study that compares 30 cluster validity indices in many different environments with different characteristics. These results can serve as a guideline for selecting the most suitable index for each possible application and provide deep insight into the performance differences between the currently available indices.
► We compare 30 cluster validity indices (CVIs) in 720 synthetic and 20 real datasets.
► We use a new comparison methodology and three clustering algorithms: k-means, Ward and Average-linkage.
► The CVI performance drops dramatically when noise is present or clusters overlap.
► Statistical tests suggest a division of three groups of CVIs.
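As a minimal illustration of internal cluster validation, the sketch below scores k-means partitions for several values of k with two classical indices available in scikit-learn; the synthetic data and index choice are illustrative, not the paper's benchmark.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (100, 2)) for c in ((0, 0), (5, 5), (0, 5))])

# Score k-means partitions with two classical internal CVIs; the best k
# optimizes the index (max silhouette, min Davies-Bouldin).
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels), davies_bouldin_score(X, labels))
```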
Classification is an essential task for predicting the class values of new instances. Both k-fold and leave-one-out cross validation are very popular for evaluating the performance of classification algorithms. Much of the data mining literature describes the procedures for these two kinds of cross validation and the statistical methods that can be used to analyze the resulting accuracies of algorithms, but these descriptions are not always consistent, and analysts can therefore be confused when performing a cross validation procedure. In this paper, the independence assumptions in cross validation are introduced, and the circumstances that satisfy the assumptions are addressed. The independence assumptions are then used to derive the sampling distributions of the point estimators for k-fold and leave-one-out cross validation. The cross validation procedure that yields such sampling distributions is discussed to provide new insights into evaluating the performance of classification algorithms.
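A minimal example of both procedures (illustrative dataset and classifier): in k-fold cross validation each instance is tested exactly once, which is the basis of the independence assumptions discussed above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# k-fold CV: each instance appears in exactly one test fold.
kfold = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
# Leave-one-out CV: each instance is its own test fold (n fits in total).
loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(kfold.mean(), loo.mean())
```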
Recently, sparse coding has been successfully applied in visual tracking. The goal of this paper is to review the state-of-the-art tracking methods based on sparse coding. We first analyze the benefits of using sparse coding in visual tracking and then categorize these methods into appearance modeling based on sparse coding (AMSC) and target searching based on sparse representation (TSSR), as well as their combination. For each category, we introduce the basic framework and subsequent improvements, with emphasis on their advantages and disadvantages. Finally, we conduct extensive experiments to compare the representative methods on a total of 20 test sequences. The experimental results indicate that: (1) AMSC methods significantly outperform TSSR methods. (2) For AMSC methods, both a discriminative dictionary and spatial-order-preserving pooling operators are important for achieving high tracking accuracy. (3) For TSSR methods, the widely used identity pixel basis will degrade the performance when the target or candidate images are not aligned well or severe occlusion occurs. (4) For TSSR methods, ℓ1 norm minimization is not necessary; ℓ2 norm minimization can obtain comparable performance at lower computational cost. The open questions and future research topics are also discussed.
► A comprehensive review of visual tracking based on sparse coding.
► Extensive experimental comparison of 15 state-of-the-art tracking methods on a total of 20 challenging sequences.
► Analysis of the benefits of using sparse coding in visual tracking.
► Future research topics are pointed out.
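As a sketch of the TSSR formulation discussed above: a candidate patch is coded over a dictionary of target templates augmented with identity ("trivial") pixel templates, and scored by its reconstruction error over the target templates. Dimensions and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, n = 256, 10                           # pixels per patch, target templates
T = rng.random((d, n))                   # target templates (vectorized patches)
D = np.hstack([T, np.eye(d)])            # append identity ("trivial") templates

y = T[:, 0] + 0.05 * rng.standard_normal(d)   # a candidate patch to score

# l1-regularized, non-negative coding of the candidate over the dictionary.
coef = Lasso(alpha=0.01, positive=True, max_iter=5000).fit(D, y).coef_
# Candidates reconstructed well by the target templates alone get low
# error, i.e. a high likelihood of being the tracked object.
error = np.linalg.norm(y - T @ coef[:n])
```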
Local Binary Patterns (LBP) have emerged as one of the most prominent and widely studied local texture descriptors. A truly large number of LBP variants have been proposed, to the point that it can become overwhelming to grasp their respective strengths and weaknesses, and there is a need for a comprehensive study of the prominent LBP-related strategies. New types of descriptors based on multistage convolutional networks and deep learning have also emerged. In individual papers, the performance comparison of the proposed methods to earlier approaches is mainly done on a few well-known texture datasets, with differing classifiers and testing protocols, and often without using the best parameter settings and multiple scales for the comparative methods. Very important aspects such as computational complexity and the effects of poor image quality are often neglected. In this paper, we provide a systematic review of current LBP variants and propose a taxonomy to more clearly group the prominent alternatives. Merits and demerits of the various LBP features and their underlying connections are also analyzed. We perform a large-scale performance evaluation for texture classification, empirically assessing forty texture features, including thirty-two of the most promising recent LBP variants and eight non-LBP descriptors based on deep convolutional networks, on thirteen widely used texture datasets. The experiments are designed to measure their robustness against different classification challenges, including changes in rotation, scale, illumination, viewpoint, number of classes and different types of image degradation, as well as their computational complexity. The best overall performance is obtained by the Median Robust Extended Local Binary Pattern (MRELBP) feature. For textures with very large appearance variations, Fisher vector pooling of deep Convolutional Neural Networks is clearly the best, but at the cost of very high computational complexity. Sensitivity to image degradations and computational complexity are among the key problems for most of the methods considered.
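For reference, the basic 3x3 LBP that these variants extend thresholds the 8 neighbors against the center pixel and packs the comparison bits into an 8-bit code; a minimal vectorized version:

```python
import numpy as np

def lbp8(img):
    """Basic 3x3 LBP: threshold each of the 8 neighbors against the
    center pixel and pack the comparison bits into one 8-bit code."""
    img = img.astype(np.float64)
    c = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        n = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= ((n >= c).astype(np.uint8) << bit)
    return code
```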
In recent years, there has been a proliferation of works on human action classification from depth sequences. These works generally present methods and/or feature representations for classifying actions from sequences of 3D locations of human body joints and/or other sources of data, such as depth maps and RGB videos. This survey highlights the motivations and challenges of this very recent research area by presenting technologies and approaches for 3D skeleton-based action classification. The work focuses on aspects such as data pre-processing, publicly available benchmarks and commonly used accuracy measurements. Furthermore, this survey introduces a categorization of the most recent works in 3D skeleton-based action classification according to the adopted feature representation. This paper aims to be a starting point for practitioners who wish to approach the study of 3D action classification and to gather insights on the main challenges to solve in this emerging field.
Human action recognition based on skeletons has wide applications in human–computer interaction and intelligent surveillance. However, view variations and noisy data bring challenges to this task. Moreover, effectively representing spatio-temporal skeleton sequences remains a problem. To address these problems jointly, this work presents an enhanced skeleton visualization method for view-invariant human action recognition. Our method consists of three stages. First, a sequence-based view-invariant transform is developed to eliminate the effect of view variations on the spatio-temporal locations of skeleton joints. Second, the transformed skeletons are visualized as a series of color images, which implicitly encode the spatio-temporal information of skeleton joints; visual and motion enhancement methods are then applied to the color images to enhance their local patterns. Third, a convolutional neural network-based model is adopted to extract robust and discriminative features from the color images. The final action class scores are generated by decision-level fusion of the deep features. Extensive experiments on four challenging datasets consistently demonstrate the superiority of our method.
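A minimal sketch of the visualization stage (the paper's exact encoding and enhancement steps may differ): joint coordinates are linearly rescaled to [0, 255] and stored as RGB channels of an image whose axes index joints and frames.

```python
import numpy as np

def skeleton_to_image(seq):
    """Encode a skeleton sequence as a pseudo-color image: rows index
    joints, columns index frames, and each (x, y, z) joint coordinate is
    linearly rescaled to [0, 255] and stored in the (R, G, B) channels."""
    seq = np.asarray(seq, dtype=np.float64)         # (frames, joints, 3)
    lo = seq.min(axis=(0, 1))
    hi = seq.max(axis=(0, 1))
    img = 255.0 * (seq - lo) / (hi - lo + 1e-9)
    return img.transpose(1, 0, 2).astype(np.uint8)  # (joints, frames, 3)
```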
Classification problems involving multiple classes can be addressed in different ways. One of the most popular techniques consists of dividing the original data set into two-class subsets, learning a different binary model for each new subset. These techniques are known as binarization strategies. In this work, we are interested in ensemble methods derived from binarization techniques; in particular, we focus on the well-known one-vs-one and one-vs-all decomposition strategies, paying special attention to the final step of the ensembles: the combination of the outputs of the binary classifiers. Our aim is to develop an empirical analysis of different aggregations for combining these outputs. To do so, we develop a double study: first, we use different base classifiers in order to observe the suitability and potential of each combination within each classifier. Then, we compare the performance of these ensemble techniques with the classifiers themselves, and hence analyse the improvement with respect to the classifiers that handle multiple classes inherently. We carry out the experimental study with several well-known algorithms from the literature, such as Support Vector Machines, Decision Trees, Instance Based Learning and Rule Based Systems. We show, supported by several statistical analyses, the goodness of the binarization techniques with respect to the base classifiers, and finally point out the most robust techniques within this framework.
► One-vs-one and one-vs-all are ensembles for multi-class problems.
► The confidence estimates and their aggregation are key factors of these ensembles.
► Aggregations based on voting and estimation of probabilities are the most robust.
► One-vs-one is more robust; one-vs-all has received less attention.
► Binarization is beneficial even when it is not necessary.
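For a concrete sense of the two decompositions, scikit-learn implements both strategies as meta-estimators over any binary base classifier; a minimal sketch with an illustrative dataset and base learner:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
base = SVC(kernel="rbf", gamma="scale")  # any binary base classifier works

for name, clf in [("one-vs-one", OneVsOneClassifier(base)),
                  ("one-vs-all", OneVsRestClassifier(base))]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```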
Shape-from-focus (SFF) has been widely studied in computer vision as a passive depth recovery and 3D reconstruction method. One of the main stages in SFF is the computation of the focus level for every pixel of an image by means of a focus measure operator. In this work, a methodology to compare the performance of different focus measure operators for shape-from-focus is presented and applied. The selected operators have been chosen from an extensive review of the state-of-the-art. The performance of the different operators has been assessed through experiments carried out under different conditions, such as image noise level, contrast, saturation and window size. This performance is discussed in terms of the working principles of the analyzed operators.
► Focus measure operators are applied in shape-from-focus for 3D recovery.
► Operators have been implemented and adapted in order to compare their performances.
► The relative performance of the operators depends highly on the imaging conditions.
► Operators of the same family respond similarly to noise, contrast and window size.
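Two representative families of focus measure operators from such reviews are Laplacian-based and statistics-based measures; minimal sketches, with the kernel and window size as illustrative choices:

```python
import numpy as np
from scipy import ndimage

def modified_laplacian(img):
    """Laplacian-family measure: |I * [-1, 2, -1]| summed over x and y
    (pointwise; SFF typically also sums it over a small window)."""
    img = img.astype(np.float64)
    lx = ndimage.convolve1d(img, [-1.0, 2.0, -1.0], axis=1)
    ly = ndimage.convolve1d(img, [-1.0, 2.0, -1.0], axis=0)
    return np.abs(lx) + np.abs(ly)

def graylevel_variance(img, size=9):
    """Statistics-family measure: local gray-level variance, which rises
    as the pixel content comes into focus."""
    img = img.astype(np.float64)
    mean = ndimage.uniform_filter(img, size)
    return ndimage.uniform_filter(img * img, size) - mean * mean

# In SFF, the depth at each pixel is the focus position (image index in
# the stack) that maximizes the chosen focus measure.
```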
This paper presents a novel method for interest region description. We adopt the idea that the appearance of an interest region can be well characterized by the distribution of its local features. The best-known descriptor built on this idea is the SIFT descriptor, which uses the gradient as the local feature. Thus far, existing texture features have not been widely utilized in the context of region description. In this paper, we introduce a new texture feature called the center-symmetric local binary pattern (CS-LBP), a modified version of the well-known local binary pattern (LBP) feature. To combine the strengths of SIFT and LBP, we use CS-LBP as the local feature in the SIFT algorithm. The resulting descriptor is called the CS-LBP descriptor. In matching and object category classification experiments, our descriptor compares favorably with SIFT. Furthermore, the CS-LBP descriptor is computationally simpler than SIFT.
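A minimal vectorized sketch of the CS-LBP code on the 8-neighborhood: only the 4 center-symmetric pixel pairs are compared (with a small threshold t for robustness on flat regions), giving 2^4 = 16 bins instead of basic LBP's 256.

```python
import numpy as np

def cs_lbp(img, t=0.01):
    """CS-LBP on the 8-neighborhood: compare the 4 center-symmetric
    pixel pairs instead of each neighbor against the center pixel."""
    img = img.astype(np.float64)
    pairs = [  # (neighbor, opposite neighbor) slices of the 3x3 window
        (img[0:-2, 0:-2], img[2:, 2:]),     # NW vs SE
        (img[0:-2, 1:-1], img[2:, 1:-1]),   # N  vs S
        (img[0:-2, 2:],   img[2:, 0:-2]),   # NE vs SW
        (img[1:-1, 2:],   img[1:-1, 0:-2]), # E  vs W
    ]
    code = np.zeros((img.shape[0] - 2, img.shape[1] - 2), dtype=np.uint8)
    for bit, (p, q) in enumerate(pairs):
        code |= ((p - q > t).astype(np.uint8) << bit)
    return code
```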