This paper gives an overview of distributional modelling of word meaning for contemporary lexicography. We also apply it in a case study on automatic semantic shift detection in Slovene tweets. We use word embeddings to compare the semantic behaviour of frequent words from a reference corpus of Slovene with their behaviour on Twitter. Words with the highest model distance between the corpora are treated as semantic shift candidates. They are manually analysed and classified in order to evaluate the proposed approach as well as to gain a better qualitative understanding of the problem. Apart from the noise due to pre-processing errors (45%), the approach yields many valuable candidates, especially the novel senses occurring due to daily events and the ones produced in informal communication settings.
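The ranking step described above can be illustrated with a minimal sketch. It assumes that each word has one embedding vector per corpus, that the two embedding spaces are directly comparable (for example, aligned or jointly trained), and that "model distance" is measured as cosine distance; the abstract specifies none of these details. The function names, the toy words and the toy vectors are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_shift_candidates(reference_vectors, twitter_vectors, top_n=10):
    """Rank words shared by both corpora by the distance between their vectors.

    Assumption: both dictionaries map a word to a vector in a comparable space.
    """
    shared = set(reference_vectors) & set(twitter_vectors)
    scored = [
        (word, cosine_distance(reference_vectors[word], twitter_vectors[word]))
        for word in shared
    ]
    # Largest distance first: these are the semantic shift candidates.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy two-dimensional vectors (hypothetical, for illustration only).
reference_vectors = {"alpha": [1.0, 0.0], "beta": [0.0, 1.0], "gamma": [1.0, 1.0]}
twitter_vectors = {"alpha": [0.0, 1.0], "beta": [0.0, 2.0], "gamma": [1.0, 0.9]}

candidates = rank_shift_candidates(reference_vectors, twitter_vectors, top_n=2)
```

In this toy example, "alpha" changes direction completely between the two spaces and is ranked first, while "beta" only changes in vector length and is not returned. In practice, the top of such a list would still require the manual analysis and classification described in the abstract.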
The article presents the results of a survey on dictionary use in Europe, focusing on general monolingual dictionaries. It is the broadest survey of dictionary use to date, covering close to 10,000 dictionary users (and non-users) in nearly thirty countries. Our survey covers varied user groups, going beyond the students and translators who have tended to dominate such studies thus far. The survey was delivered via an online survey platform, in language versions specific to each target country. It was completed by 9,562 respondents, over 300 respondents per country on average. The survey consisted of the general section, which was translated and presented to all participants, as well as country-specific sections for a subset of 11 countries, which were drafted by collaborators at the national level. The present report covers the general section.
The paper focuses on collocations typical of Slovene computer-mediated communication (CMC), which comprises communication via social networks, forums, blogs, etc. The study examines the CMC-specific collocates of the most frequent Slovene nouns, as well as collocates of CMC-typical nouns. Collocations were automatically extracted with procedures based on the Sketch Engine corpus tool. For the identification of CMC-specific collocations, collocates occurring in the CMC and the reference corpus were compared. In the study, the extracted data were categorised as: (a) new vocabulary in relation to existing lexicographical resources; (b) orthographically or lexically non-standard; (c) terminology; (d) topic-/genre-related. The new vocabulary was furthermore explored as per the distinction between new words, new collocations, and new meanings, where semantic shifts are of special interest. The results indicate that comparative collocation extraction provides a good framework for detecting lexical novelties and can thus provide good support for describing contemporary language use.
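The comparison of collocates across the two corpora can be sketched as follows. This is a simplified illustration, not the extraction procedure used in the study: it assumes the collocates of each headword are already available as plain sets (for instance, exported from a corpus tool) and treats a collocate as CMC-specific when it is absent from the reference corpus for that headword. The function name and the toy English data are hypothetical.

```python
def cmc_specific_collocates(cmc, reference):
    """Return, per headword, the collocates found in CMC but not in the reference.

    Assumption: both arguments map a headword to a set of collocate lemmas.
    """
    result = {}
    for headword, collocates in cmc.items():
        specific = collocates - reference.get(headword, set())
        if specific:
            # Sort for a stable, readable candidate list.
            result[headword] = sorted(specific)
    return result


# Toy data (hypothetical, for illustration only).
cmc = {
    "profile": {"fake", "public", "delete"},
    "wall": {"post", "brick"},
}
reference = {
    "profile": {"public", "psychological"},
    "wall": {"brick", "stone"},
}

specific = cmc_specific_collocates(cmc, reference)
```

A strict set difference is only one possible criterion. A frequency- or score-based comparison would also capture collocates that occur in both corpora but are much more salient in CMC.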
The paper aims to establish a synergy between the lexicographic and natural language processing (NLP) communities in relation to concepts and classifications of multiword expressions (MWEs), their representation in dictionaries, dictionary databases, and NLP-oriented MWE lexicons. It begins with an overview of basic MWE-related linguistic concepts and how they are reflected in the lexicographic treatment of MWEs, as well as their role in language technology. A comparison of different lexicographic and NLP classifications of MWEs is presented, with an elaboration of why different typologies are (or are not) useful for different users from both communities. The methodology for the description of MWEs in a set of dictionary databases is discussed, and the results of an analysis of the representation of MWEs based on a small sample of dictionary projects are presented. Finally, some suggestions are provided on how to improve dictionary databases in relation to MWE description and how to improve the results of NLP tasks by using existing descriptions of MWEs in dictionaries.
This paper presents a new type of network lexicon for the Croatian language based on a syntactic and semantic computational framework. It begins with an overview of the existing Croatian e-dictionaries and online repositories, as well as a brief outline of other relevant network ontological models. The network lexicon, which is based on an innovative approach to word tagging, is described in the remainder of the paper. Instead of presenting a linear (e.g. MULTEXT-East) structure, this paper proposes a new hierarchical tree-like T-structure that is very similar to the structure of an ontology. In this approach, each word is processed on multiple levels: from its internal structure (morphs or syllables), via links to external network resources (encyclopaedias), to multiword expressions that can have distinctive roles, such as semantic domains, collocations and even figurative expressions. A network framework facilitates the fetching and filtering of the information related to the searched word in a paradigmatic sense because of the integration of the CroWN, the Croatian version of the English WordNet, and in a syntagmatic sense by building the database of the T-structure patterns from a selected corpus. Finally, the network framework enables the dynamic integration of the lexicon with the Linguistic Linked Open Data cloud; thus, each change in the lexicon will be automatically reflected in the cloud. It is therefore not necessary to perform any periodical synchronisation of the data, a task that is quite common when working with triples stored in a Virtuoso database. Special attention has been paid to the technical components and the data preparation process, which are described in detail to serve as a guide for transforming existing lexicographic data into Linked Open Data triples.
This study aims to explore types of motivation for smartphone dictionary use among Chinese university EFL learners. It is a mixed-method inquiry carried out under the frameworks of the Function Theory of Lexicography (Tarp 2007) and the Strategy Inventory for Dictionary Use (Gavriilidou 2013). Twenty-two semi-structured interviews were conducted, followed by a confirmatory survey (N=577). The interview data revealed ten themes of user strategies and purposes. Using a latent class analytical tool, Mplus, we identified from the survey a model with three user classes: Customisation (33.3%), Learning (51.9%) and Utility (14.8%). They respectively imply individuating use of dictionary features and automatic messaging, authentic English language learning, and utilitarian purposes like passing exams. In addition, three variables were used for predicting class membership, namely gender (male and female), English proficiency (high and low), and university type (key and non-key). Multinomial logistic regressions showed motivation tendencies among these demographic groups: male users or non-key university students were more likely to fall into the Customisation class; high-proficiency learners or key university students, Learning; and female users or low-proficiency learners, Utility. In sum, this research has furthered our understanding of motivation for second language learning through mobile technology applications. It has theoretical, methodological and practical implications for future studies, offering fresh insights into e-dictionary customisation and education in e-dictionary use.