This paper investigates human mobility patterns in an urban taxi transportation system. This work focuses on predicting human mobility by discovering patterns in the passenger pick-up quantity (PUQ) of urban hotspots. This paper proposes an improved ARIMA-based prediction method to forecast the spatial-temporal variation of passengers in a hotspot. Evaluation with a large-scale real-world data set of 4 000 taxis' GPS traces over one year shows a prediction error of only 5.8%. We also explore the application of the prediction approach to help drivers find their next passengers. The simulation results using historical real-world data demonstrate that, with our guidance, drivers can reduce the time taken and distance travelled to find their next passenger by 37.1% and 6.4% respectively.
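As a rough illustration of the forecasting idea (not the paper's improved method), an ARIMA(1,1,0)-style one-step forecast differences the series once, fits an AR(1) coefficient, and integrates back. The hourly PUQ series below is invented for illustration.

```python
# Minimal ARIMA(1,1,0)-style one-step forecast for hotspot pick-up
# quantity (PUQ); a sketch only -- the paper's improved ARIMA method
# is more elaborate, and the input series here is made up.

def forecast_puq(series):
    """Difference once, fit AR(1) by least squares on the differences,
    then integrate the predicted difference back onto the last value."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    x, y = diffs[:-1], diffs[1:]
    denom = sum(v * v for v in x)
    phi = sum(a * b for a, b in zip(x, y)) / denom if denom else 0.0
    next_diff = phi * diffs[-1]
    return series[-1] + next_diff

puq = [120, 135, 128, 142, 150, 147, 158]  # hourly pick-ups (invented)
print(round(forecast_puq(puq), 1))
```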
Microblogging provides a new platform for communicating and sharing information among Web users. Users can express opinions and record daily life using microblogs. Microblogs that are posted by users indicate their interests to some extent. We aim to mine user interests via keyword extraction from microblogs. Traditional keyword extraction methods are usually designed for formal documents such as news articles or scientific papers. Messages posted by microblogging users, however, are usually noisy and full of new words, which is a challenge for keyword extraction. In this paper, we combine a translation-based method with a frequency-based method for keyword extraction. In our experiments, we extract keywords for microblog users from the largest microblogging website in China, Sina Weibo. The results show that our method can identify users' interests accurately and efficiently.
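One common way to combine a translation-based score with a frequency-based score is a linear interpolation of the two. The sketch below assumes the translation model's per-word scores are already available (the `trans_prob` dictionary is a stand-in for the alignment model the paper trains); the posts and weights are invented.

```python
# Hedged sketch: rank candidate keywords by interpolating term
# frequency with an assumed translation-model score.

from collections import Counter

def rank_keywords(posts, trans_prob, alpha=0.5):
    """Score each word as alpha * tf + (1 - alpha) * translation score."""
    words = [w for post in posts for w in post.split()]
    tf = Counter(words)
    total = sum(tf.values())
    scores = {w: alpha * tf[w] / total + (1 - alpha) * trans_prob.get(w, 0.0)
              for w in tf}
    return sorted(scores, key=scores.get, reverse=True)

posts = ["deep learning tutorial", "learning python", "python tips"]
trans = {"learning": 0.4, "python": 0.3, "tips": 0.1}  # assumed model output
print(rank_keywords(posts, trans)[:2])
```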
With the explosive growth of information, more and more organizations are deploying private cloud systems or renting public cloud systems to process big data. However, there is no existing benchmark suite for evaluating cloud performance at the whole-system level. To the best of our knowledge, this paper proposes the first benchmark suite, CloudRank-D, to benchmark and rank cloud computing systems that are shared for running big data applications. We analyze the limitations of previous metrics, e.g., floating point operations, for evaluating a cloud computing system, and propose two simple metrics: data processed per second and data processed per Joule as two complementary metrics for evaluating cloud computing systems. We detail the design of CloudRank-D, which considers representative applications, diversity of data characteristics, and dynamic behaviors of both applications and system software platforms. Through experiments, we demonstrate the advantages of our proposed metrics. In several case studies, we evaluate two small-scale deployments of cloud computing systems using CloudRank-D.
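The two proposed metrics are simple ratios, shown below with invented workload numbers (500 GB processed in 1 000 s at an average draw of 2 kW); the exact measurement methodology is defined by the benchmark suite itself.

```python
# Sketch of the two complementary metrics: data processed per second
# (throughput) and data processed per Joule (energy efficiency).
# The run figures below are invented for illustration.

def dps(bytes_processed, seconds):
    """Data processed per second."""
    return bytes_processed / seconds

def dpj(bytes_processed, joules):
    """Data processed per Joule."""
    return bytes_processed / joules

gb = 500 * 10**9          # 500 GB processed
print(dps(gb, 1_000) / 10**6, "MB/s")            # 1 000 s run
print(dpj(gb, 2_000 * 1_000) / 10**6, "MB/J")    # 2 kW for 1 000 s
```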
Social media websites allow users to exchange short texts such as tweets via microblogs and user status in friendship networks. Their limited length, pervasive abbreviations, and coined acronyms and words exacerbate the problems of synonymy and polysemy, and bring about new challenges to data mining applications such as text clustering and classification. To address these issues, we dissect some potential causes and devise an efficient approach that enriches data representation by employing machine translation to increase the number of features from different languages. Then we propose a novel framework which performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. The proposed approach is evaluated extensively in terms of effectiveness on two social media datasets from Facebook and Twitter. With its significant performance improvement, we further investigate potential factors that contribute to the improved performance.
In the rising tide of the Internet of things, more and more things in the world are connected to the Internet. Recently, data have kept growing at a rate more than four times that expected by Moore's law. This explosion of data comes from various sources such as mobile phones, video cameras, and sensor networks, which often present multidimensional characteristics. The huge amount of data brings many challenges to the management, transportation, and processing IT infrastructures. To address these challenges, state-of-the-art large-scale data center networks have begun to provide cloud services that are increasingly prevalent. However, how to build a good data center remains an open challenge. Concurrently, the architecture design, which significantly affects the total performance, is of great research interest. This paper surveys advances in data center network design. In this paper we first introduce the upcoming trends in the data center industry. Then we review some popular design principles for today's data center network architectures. In the third part, we present some up-to-date data center frameworks and make a comprehensive comparison of them. During the comparison, we observe that there is no single optimal data center, and that the design should differ according to the data placement, replication, processing, and query processing. After that, several existing challenges and limitations are discussed. According to these observations, we point out some possible future research directions.
Social networks often serve as a critical medium for information dissemination, diffusion of epidemics, and spread of behavior, through shared activities or similarities between individuals. Recently, we have witnessed an explosion of interest in studying social influence and spread dynamics in social networks. To date, relatively little material has provided a comprehensive review of this field. This brief survey addresses this issue. We present the current significant empirical studies on real social systems, including network construction methods, network measures, and new empirical results. We then provide a concise description of some related social models from both macro- and micro-level perspectives. Due to the difficulties in combining real data and simulation data for verifying and validating real social systems, we further emphasize the current research results of computational experiments. We hope this paper can provide researchers significant insights into better understanding the characteristics of personal influence and spread patterns in large-scale social systems.
System management is becoming increasingly complex and brings high costs, especially with the advent of cloud computing. Cloud computing involves numerous platforms, often consisting of virtual machines (VMs) and middleware, that have to be managed to make the whole system work cost-effectively after an application is deployed. In order to reduce management costs, in particular for manual activities, many computer programs have been developed to remove or reduce the complexity and difficulty of system management. These programs are usually hard-coded in languages like Java and C++, which bring enough capability and flexibility but also cause high programming effort and cost. This paper proposes an architecture for developing management programs in a simple but powerful way. First of all, the manageability of a given platform (via APIs, configuration files, and scripts) is abstracted as a runtime model of the platform's software architecture, which can automatically and immediately propagate any observable runtime changes of the target platforms to the corresponding architecture models, and vice versa. The management programs are developed using modeling languages, instead of those relatively low-level programming languages. Architecture-level management programs bring many advantages related to performance, interoperability, reusability, and simplicity. An experiment on a real-world cloud deployment and a comparison with traditional programming language approaches demonstrate the feasibility, effectiveness, and benefits of the new model-based approach for management program development.
It is increasingly common to find graphs in which edges are of different types, indicating a variety of relationships. For such graphs we propose a class of reachability queries and a class of graph patterns, in which an edge is specified with a regular expression of a certain form, expressing the connectivity of a data graph via edges of various types. In addition, we define graph pattern matching based on a revised notion of graph simulation. On graphs in emerging applications such as social networks, we show that these queries are capable of finding more sensible information than their traditional counterparts. Better still, their increased expressive power does not come with extra complexity. Indeed, (1) we investigate their containment and minimization problems, and show that these fundamental problems are in quadratic time for reachability queries and are in cubic time for pattern queries. (2) We develop an algorithm for answering reachability queries, in quadratic time as for their traditional counterpart. (3) We provide two cubic-time algorithms for evaluating graph pattern queries, as opposed to the NP-completeness of graph pattern matching via subgraph isomorphism. (4) The effectiveness and efficiency of these algorithms are experimentally verified using real-life data and synthetic data.
Increasingly powerful computational technology has caused enormous data growth in both size and complexity. A key issue is how to organize the data to meet the challenges of data analysis. This paper borrows ideas from the Internet of things (IOT) into the digital world and organizes data entities to form a network, the Internet of data (IOD), which has huge potential in data-intensive applications. In the IOD, data hiding technology is utilized to embed virtual tags, which record all the activities of the data entities since they are created, into every data entity in the system. The IOD aims to organize the data to be interconnected as a network and to collect useful information for data identification, data tracing, data vitalization, and further data analysis.
A vehicular ad-hoc network (VANET) can be visualized as a network of moving vehicles communicating in an asynchronous and autonomous fashion. Efficient and scalable information dissemination in VANET applications is a major challenge due to the movement of vehicles which causes unpredictable changes in network topology. The publish/subscribe communication paradigm provides decoupling in time, space, and synchronization between communicating entities, and presents itself as an elegant solution for information dissemination for VANET like environments. In this paper, we propose our approach for information dissemination which utilizes publish/subscribe and distributed hash table (DHT) based overlay networks. In our approach, we assume a hybrid VANET consisting of stationary info-stations and moving vehicles. These info-stations are installed at every major intersection of the city and vehicles can take the role of publisher, subscriber, or broker depending upon the context. The info-stations form a DHT based broker overlay among themselves and act as rendezvous points for related publications and subscriptions. Further, info-stations also assist in locating vehicles that have subscribed to information items. We consider different possible deployments of this hybrid VANET with respect to the number of info-stations and their physical connectivity with each other. We perform simulations to assess the performance of our approach in these different deployment scenarios and discuss their applicability in urban and semi-urban areas.
Meta-modelling plays an important role in model driven software development. In this paper, a graphic extension of BNF (GEBNF) is proposed to define the abstract syntax of graphic modelling languages. From a GEBNF syntax definition, a formal predicate logic language can be induced so that meta-modelling can be performed formally by specifying a predicate on the domain of syntactically valid models. In this paper, we investigate the theoretical foundation of this meta-modelling approach. We formally define the semantics of GEBNF and its induced predicate logic languages, then apply Goguen and Burstall's institution theory to prove that they form a sound and valid formal specification language for meta-modelling.
The field of social computing emerged more than ten years ago. During the last decade, researchers from a variety of disciplines have been closely collaborating to boost the growth of social computing research. This paper aims at identifying key researchers and institutions, and examining the collaboration patterns in the field. We employ co-authorship network analysis at different levels to study the bibliographic information of 6 543 publications in social computing from 1998 to 2011. This paper gives a snapshot of the current research in social computing and can provide initial guidance to new researchers in social computing.
Model-driven architecture (MDA) has become a mainstream technology for software-intensive system design. The main engineering principle behind it is that the inherent complexity of software development can only be mastered by building, analyzing and manipulating system models. MDA also deals with system complexity by providing component-based design techniques, allowing independent component design, implementation and deployment, and then system integration and reconfiguration based on component interfaces. The model of a system at any stage is an integration of models of different viewpoints. Therefore, for a model-driven method to be applied effectively, it must provide a body of techniques and an integrated suite of tools for model construction, validation, and transformation. This requires a number of modeling notations for the specification of different concerns and viewpoints of the system. These notations should have formally defined syntaxes and a unified theory of semantics. The underlying theory of the method is needed to underpin the development of tools and correct use of tools in software development, as well as to formally verify and reason about properties of systems in mission-critical applications. The modeling notations, techniques, and tools must be designed so that they can be used seamlessly in supporting development activities and documentation of artifacts in software design processes. This article presents such a method, called the rCOS, focusing on the models of a system at different stages in a software development process, their semantic integration, and how they are constructed, analyzed, transformed, validated, and verified.
Set queries are an important topic and have attracted a lot of attention. Earlier research mainly concentrated on set containment queries. In this paper we focus on the T-Overlap query, which is the foundation of the set similarity query. To address this issue, unlike traditional algorithms that are based on an inverted index, we design a new paradigm based on the prefix tree (trie) called the expanded trie index (ETI), which expands the trie node structure by adding some new properties. Based on ETI, we convert the T-Overlap problem to finding query nodes with a specific query depth equal to T and propose a new algorithm called T-Similarity to solve T-Overlap efficiently. Then we carry out a three-step framework to extend T-Overlap to other similarity predicates. Extensive experiments are carried out to compare T-Similarity with other inverted-index-based algorithms in terms of query cardinality, overlap threshold, dataset size, number of distinct elements, and so on. Results show that T-Similarity outperforms the state-of-the-art algorithms in many aspects.
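To make the T-Overlap query concrete, the sketch below indexes sorted records in a plain trie and prunes a depth-first search once a path has matched T query elements; the ETI's extra node properties and the paper's exact query-depth machinery are simplified away.

```python
# Hedged sketch of a trie-based T-Overlap query: records are inserted
# as sorted element paths, and the search stops descending once a path
# already shares t elements with the query (overlap can only grow).

class TrieNode:
    def __init__(self):
        self.children = {}   # element -> TrieNode
        self.records = []    # ids of records ending at this node

def insert(root, rec_id, elements):
    node = root
    for e in sorted(elements):
        node = node.children.setdefault(e, TrieNode())
    node.records.append(rec_id)

def t_overlap(root, query, t):
    """Return ids of records sharing at least t elements with query."""
    q, hits = set(query), []

    def collect(node):
        hits.extend(node.records)
        for child in node.children.values():
            collect(child)

    def dfs(node, matched):
        if matched >= t:
            collect(node)          # every record below already qualifies
            return
        for e, child in node.children.items():
            dfs(child, matched + (e in q))

    dfs(root, 0)
    return sorted(hits)

root = TrieNode()
insert(root, 1, [1, 2, 3])
insert(root, 2, [2, 4])
insert(root, 3, [1, 3, 5])
print(t_overlap(root, [1, 3], 2))
```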
Proxy signature schemes enable an entity to delegate its signing rights to another party, called the proxy signer. As a variant of the proxy signature primitive, proxy multi-signature allows a group of original signers to delegate their signing capabilities to a single proxy signer in such a way that the proxy signer can sign a message on behalf of the group of original signers. We propose a concrete ID-based proxy multi-signature scheme from bilinear pairings. The proposed scheme is existentially unforgeable against adaptively chosen message and given ID attacks in the random oracle model under the computational Diffie-Hellman (CDH) assumption. A fascinating property of the new scheme is that the size of a proxy multi-signature is independent of the number of original signers. Furthermore, the proposed scheme is simple and computationally more efficient than other ID-based proxy multi-signature schemes.
When reengineering a monolithic application into a distributed one, programmers always have to decide into how many parts the application should be partitioned, and write much code concerned with where a part will be placed on network nodes and how these parts communicate with each other through the network. This code usually has nothing to do with the business functions of the application, and it is laborious to write. In addition, as the distribution architecture of the application is finalized beforehand, it may not adapt well to the ever-changing execution environment. In this paper, we propose DPartner, an automatic partitioning system, to help programmers create a distributed Java application without explicitly writing the distribution-related code. Unlike other partitioning systems, DPartner does not partition an application directly into the coarse-grained client and server. Instead, it first partitions the application into several modules, where each module performs a relatively independent business function of the application. Then it makes these modules distributable through automatic bytecode rewriting. These modules can be distributed across different nodes and cooperate to work just as the original monolithic application. Such a module-based partitioning approach enables a relatively easy reshaping of the distribution architecture of an application, which helps the application adapt to environmental changes without manual recoding or repartitioning with regard to distribution. This paper gives the detailed design of DPartner, and evaluates it using real-world applications. The evaluation results demonstrate the effectiveness and efficiency of DPartner.
In recent years, the deep web has become extremely popular. Like any other data source, data mining on the deep web can produce important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, and therefore, data mining must be performed by sampling the datasets. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems, association mining and differential rule mining. These are proposed to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source, and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs, compared with a stratified sampling method that only considers estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.
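As background for allocation that balances estimation error against sampling cost, the classic cost-aware Neyman rule assigns stratum h samples proportional to N_h s_h / sqrt(c_h). The sketch below follows that textbook rule, not necessarily the paper's optimized method, and the strata figures are invented.

```python
# Hedged sketch of cost-aware stratified sample allocation (classic
# Neyman allocation with unit costs), shown for background only; the
# paper's optimized allocation method may differ.

import math

def allocate(strata, cost_budget):
    """strata: list of (size N_h, std s_h, unit query cost c_h).
    Returns per-stratum sample sizes n_h proportional to
    N_h * s_h / sqrt(c_h), scaled so total cost equals cost_budget."""
    k = cost_budget / sum(n * s * math.sqrt(c) for n, s, c in strata)
    return [k * n * s / math.sqrt(c) for n, s, c in strata]

# Two invented strata: a large high-variance cheap one, a smaller
# low-variance expensive one, with a total cost budget of 100.
strata = [(1000, 4.0, 1.0), (500, 2.0, 4.0)]
print([round(n, 2) for n in allocate(strata, 100)])
```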
Classical recommender systems provide users with a list of recommendations where each recommendation consists of a single item, e.g., a book or DVD. However, several applications can benefit from a system capable of recommending packages of items, in the form of sets. Sample applications include travel planning with a limited budget (price or time) and Twitter users wanting to select worthwhile tweeters to follow, given that they can deal with only a bounded number of tweets. In these contexts, there is a need for a system that can recommend the top-k packages for the user to choose from. Motivated by these applications, we consider composite recommendations, where each recommendation comprises a set of items. Each item has both a value (rating) and a cost associated with it, and the user specifies a maximum total cost (budget) for any recommended set of items. Our composite recommender system has access to one or more component recommender systems focusing on different domains, as well as to information sources which can provide the cost associated with each item. Because the problem of deciding whether there is a recommendation (package) whose cost is under a given budget and whose value exceeds some threshold is NP-complete, we devise several approximation algorithms for generating the top-k packages as recommendations. We analyze the efficiency as well as approximation quality of these algorithms. Finally, using two real and two synthetic datasets, we subject our algorithms to thorough experimentation and empirical analysis. Our findings attest to the efficiency and quality of our approximation algorithms for the top-k packages compared to exact algorithms.
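The budgeted package problem is a knapsack variant, so a simple value-per-cost greedy heuristic conveys the flavor of one approximate package; the paper's actual top-k algorithms are more sophisticated approximation schemes. Items and budget below are invented.

```python
# Hedged sketch: build one package under a cost budget with a greedy
# value/cost heuristic -- an illustration of the knapsack flavor of the
# problem, not the paper's top-k approximation algorithms.

def greedy_package(items, budget):
    """items: list of (name, value, cost). Greedily pick items with the
    highest value-per-cost ratio that still fit in the budget."""
    chosen, spent = [], 0
    for name, value, cost in sorted(items, key=lambda it: it[1] / it[2],
                                    reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

# Invented travel-planning items: (activity, rating, price).
items = [("museum", 8, 4), ("tour", 9, 6), ("show", 5, 5), ("park", 3, 1)]
print(greedy_package(items, 10))
```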
In most priority scheduling algorithms, the number of priority levels is assumed to be unlimited. However, if a task set requires more priority levels than the system can support, several jobs must in practice be assigned the same priority level. To solve this problem, a novel group priority earliest deadline first (GPEDF) scheduling algorithm is presented. In this algorithm, a schedulability test is given to form a job group, in which the jobs can arbitrarily change their order without reducing the schedulability. We consider jobs in the group as having the same priority level and use shortest job first (SJF) to schedule the jobs in the group to improve the performance of the system. Compared with earliest deadline first (EDF), best effort (BE), and group-EDF (gEDF), simulation results show that the new algorithm exhibits the least switching, the shortest average response time, and the fewest required priority levels. It also has a higher success ratio than both EDF and gEDF.
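The grouping idea can be sketched as follows: walk jobs in EDF order, grow a group while every member would still meet its deadline even if it ran last in the group, then run the group in SJF order. The schedulability test below is a simplification of the paper's, and the job set is invented.

```python
# Toy sketch of the GPEDF idea: jobs passing a (simplified)
# schedulability test share one priority group, and SJF orders the
# jobs inside each group.

def gpedf_order(jobs, now=0):
    """jobs: list of (name, exec_time, deadline), already in EDF order.
    Returns the execution order of job names."""
    schedule, t, i = [], now, 0
    while i < len(jobs):
        group, total = [jobs[i]], jobs[i][1]
        j = i + 1
        # Extend the group while the group's total work (including the
        # candidate) finishes before every member's deadline, so any
        # order inside the group is safe.
        while j < len(jobs) and all(t + total + jobs[j][1] <= d
                                    for _, _, d in group + [jobs[j]]):
            group.append(jobs[j])
            total += jobs[j][1]
            j += 1
        group.sort(key=lambda job: job[1])   # SJF inside the group
        for name, c, _ in group:
            t += c
            schedule.append(name)
        i = j
    return schedule

jobs = [("a", 3, 6), ("b", 1, 7), ("c", 4, 20)]  # sorted by deadline
print(gpedf_order(jobs))
```

Here "a" and "b" form one group (both meet their deadlines in either order), so the shorter "b" runs first; "c" fails the group test and starts a new group.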
Automatic object classification in traffic scene videos is an important issue for intelligent visual surveillance, with great potential for all kinds of security applications. However, this problem is very challenging for the following reasons. Firstly, regions of interest in videos are of low resolution and limited size due to the capacity of conventional surveillance cameras. Secondly, the intra-class variations are very large due to changes of view angles, lighting conditions, and environments. Thirdly, real-time performance of algorithms is always required for real applications. In this paper, we evaluate the performance of local feature descriptors for automatic object classification in traffic scenes. Image intensity or gradient information is directly used to construct effective feature vectors from regions of interest extracted via motion detection. This strategy has great advantages of efficiency compared to various complicated texture features. We not only analyze and evaluate the performance of different feature descriptors, but also fuse different scales and features to achieve better performance. Numerous experiments are conducted and experimental results demonstrate the efficiency and effectiveness of this strategy with robustness to noise, variance of view angles, lighting conditions, and environments.