This paper explores durational aspects of pauses, gaps and overlaps in three different conversational corpora with a view to challenging claims about precision timing in turn-taking. Distributions of pause, gap and overlap durations in conversations are presented, and methodological issues regarding the statistical treatment of such distributions are discussed. The results are related to published minimal response times for spoken utterances and thresholds for the detection of acoustic silences in speech. It is shown that turn-taking is generally less precise than is often claimed by researchers in the fields of conversation analysis and interactional linguistics. These results are discussed in the light of their implications for models of timing in turn-taking, and for interaction control models in speech technology. In particular, it is argued that the proportion of speaker changes that could potentially be triggered by information immediately preceding the speaker change is large enough for reactive interaction control models to be viable in speech technology.
Spontaneous phonetic imitation is the process by which a talker comes to sound more similar to a model talker as a result of exposure. The current experiment investigates this phenomenon, examining whether vowel spectra are automatically imitated in a lexical shadowing task and how social liking affects imitation. Participants were assigned to either a Black talker or a White talker; within this talker manipulation, participants were placed either in a condition with a digital image of their assigned model talker or in one without an image. Liking was measured through attractiveness ratings. Participants accommodated toward vowels selectively; the low vowels /æ ɑ/ showed the strongest imitation effects compared to the vowels /i o u/, but the degree of this trend varied across conditions. In addition to these findings of phonetic selectivity, the degree to which these vowels were imitated was subtly affected by attractiveness ratings, and this also interacted with the experimental condition. The results demonstrate the labile nature of linguistic segments with respect to both their perceptual encoding and their variation in production. ► Social factors such as liking and dialect influence the degree of spontaneous phonetic imitation. ► Phonetic knowledge is labile with respect to both perception and production. ► Auditory exposure influences subsequent production.
Just over fifty years ago, Lisker and Abramson proposed a straightforward measure of acoustic differences among stop consonants of different voicing categories, Voice Onset Time (VOT). Since that time, hundreds of studies have used this method. Here, we review the original definition of VOT, propose some extensions to the definition, and discuss some problematic cases. We propose a set of terms for the most important aspects of VOT and a set of Praat labels that could provide some consistency for future cross-study analyses. Although additions of other aspects of realization of voicing distinctions (F0, amplitude, duration of voicelessness) could be considered, they are rejected as adding too much complexity for what has turned out to be one of the most frequently used metrics in phonetics and phonology.
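The core of the measure reviewed above is simple to state operationally. The sketch below is an illustrative gloss, not the authors' proposal or Praat's implementation: VOT is the signed interval from the stop release burst to the onset of vocal-fold vibration, positive when voicing lags the release and negative when it leads it. All landmark times are hypothetical.

```python
def vot_ms(burst_release_s, voicing_onset_s):
    """Voice Onset Time in milliseconds: the signed interval from the
    release burst to the onset of vocal-fold vibration.
    Positive VOT: voicing lags the release (e.g., aspirated stops).
    Negative VOT: voicing leads the release (prevoiced stops)."""
    return (voicing_onset_s - burst_release_s) * 1000.0

# Hypothetical landmark times (seconds) from an annotated recording:
aspirated = vot_ms(burst_release_s=0.500, voicing_onset_s=0.565)  # about +65 ms
prevoiced = vot_ms(burst_release_s=0.500, voicing_onset_s=0.440)  # about -60 ms
```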
The performance of the rhythm metrics ΔC, %V, PVIs and Varcos, said to quantify rhythm class distinctions, was tested using English, German, Greek, Italian, Korean and Spanish. Eight participants per language produced speech using three elicitation methods: spontaneous speech, story reading, and reading a set of sentences divided into “uncontrolled” sentences from original works in each language, and sentences devised to maximize or minimize syllable structure complexity (“stress-timed” and “syllable-timed” sets respectively). Rhythm classifications based on pooled data were inconsistent across metrics, while cross-linguistic differences in scores were often statistically non-significant even for comparisons between prototypical languages like English and Spanish. Metrics showed substantial inter-speaker variation and proved very sensitive to elicitation method and syllable complexity; the size of both effects was large and often comparable to that of language. These results suggest that any cross-linguistic differences captured by metrics are not robust; metric scores range substantially within a language and are readily affected by a variety of methodological decisions, making cross-linguistic comparisons and rhythmic classifications based on metrics unsafe at best. ► The performance of the rhythm metrics %V, ΔC, nPVI, rPVI, VarcoC and VarcoV was tested. ► Large and varied samples of English, German, Greek, Italian, Korean and Spanish were used. ► Metrics showed substantial inter-speaker variation. ► Metrics proved very sensitive to elicitation method and syllable complexity. ► Timing features captured by metrics are unstable and overlap across languages.
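The metrics named above have standard definitions over vocalic and consonantal interval durations. A compact sketch follows (illustrative only; it assumes durations in seconds and uses the population standard deviation, while conventions vary across studies):

```python
import statistics

def percent_v(vocalic, consonantal):
    """%V: vocalic intervals as a percentage of total duration."""
    return 100.0 * sum(vocalic) / (sum(vocalic) + sum(consonantal))

def delta(intervals):
    """Delta-V / Delta-C: standard deviation of interval durations."""
    return statistics.pstdev(intervals)

def varco(intervals):
    """VarcoV / VarcoC: rate-normalised delta (coefficient of variation x 100)."""
    return 100.0 * delta(intervals) / statistics.mean(intervals)

def rpvi(intervals):
    """rPVI: mean absolute difference between successive intervals."""
    return statistics.mean(abs(a - b) for a, b in zip(intervals, intervals[1:]))

def npvi(intervals):
    """nPVI: like rPVI, but each pairwise difference is normalised by the
    local mean of the pair (x 100), reducing speech-rate effects."""
    return 100.0 * statistics.mean(
        abs(a - b) / ((a + b) / 2.0) for a, b in zip(intervals, intervals[1:]))
```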
The imitation paradigm has shown that speakers shift their production phonetically in the direction of the imitated speech, indicating the use of episodic traces in speech perception. Although word-level specificity of imitation has been shown, it is unknown whether imitation can also take place with sub-lexical units. Using a modified imitation paradigm, the current study investigated: (1) the generalizability of phonetic imitation at the phoneme and sub-phonemic levels; (2) word-level specificity through acoustic measurements of speech production; and (3) the automaticity of phonetic imitation and its sensitivity to linguistic structure. The sub-phonemic feature manipulated in the experiments was VOT on the phoneme /p/. The results revealed that participants produced significantly longer VOTs after being exposed to target speech with extended VOTs. Furthermore, this modeled feature was generalized to new instances of the target phoneme /p/ and the new phoneme /k/, indicating that sub-lexical units are involved in phonetic imitation. The data also revealed that lexical frequency had an effect on the degree of imitation. On the other hand, target speech with reduced VOT was not imitated, indicating that phonetic imitation is phonetically selective. ► Subjects' speech was examined before and after they listened to words with extended/shortened VOTs. ► Lengthened VOT was imitated, and the change was generalized at the phoneme and sub-phonemic levels. ► Lexical frequency was found to influence the degree of imitation. ► These results suggest that phonological representations are multi-leveled. ► Reduced VOT was not imitated, indicating that phonetic imitation is phonetically selective.
Previous research has shown that in languages like English, the implementation of voicing in voiced obstruents is affected by linguistic factors such as utterance position, stress, and the adjacent sound. The goal of the current study is to extend previous findings in two ways: (1) investigate the production of voicing in connected read speech instead of in isolation/carrier sentences, and (2) understand the implementation of partial voicing by examining where in the constriction voicing appears or dies out. The current study examines the voicing of stops and fricatives in the connected read speech of 37 speakers. Results confirm that phrase position, word position, lexical stress, and the manner and voicing of the adjacent sound condition the prevalence of voicing, but they have different effects on stops and fricatives. The analysis of where voicing is realized in the constriction interval shows that bleed from a preceding sonorant is common, but voicing beginning partway through the constriction interval (i.e., negative voice onset time) is much rarer. The acoustic, articulatory, and aerodynamic sources of the patterns of phonation found in connected speech are discussed.
Despite abundant evidence of malleability in speech production, previous studies of the effects of late second-language learning on first-language speech production have been limited to advanced learners. This study examined these effects in novice learners, adult native English speakers enrolled in elementary Korean classes. In two acoustic studies, learners' production of English was found to be influenced by even brief experience with Korean. The effect was consistently one of assimilation to phonetic properties of Korean; moreover, it occurred at segmental, subsegmental, and global levels, often simultaneously. Taken together, the results suggest that cross-language linkages are established from the onset of second-language learning at multiple levels of phonological structure, allowing for pervasive influence of second-language experience on first-language representations. The findings are discussed with respect to current notions of cross-linguistic similarity, language development, and historical sound change. ► Native English speakers produced modified English after brief Korean instruction. ► Changes in English production converged with phonetic properties of Korean. ► Korean-to-English influence showed both generality and specificity. ► Findings indicate early establishment of cross-language linkage at multiple levels.
Previous studies have found that talkers converge or diverge in phonetic form during a single conversational session or as a result of long-term exposure to a particular linguistic environment. In the current study, five pairs of previously unacquainted male roommates were recorded at four time intervals during the academic year. Phonetic convergence over time was assessed using a perceptual similarity test and measures of vowel spectra. There were distinct patterns of phonetic convergence during the academic year across roommate pairs, and perceptual detection of convergence varied for different linguistic items. In addition, phonetic convergence correlated moderately with roommates' self-reported closeness. These findings suggest that phonetic convergence in college roommates is variable and moderately related to the strength of a relationship. ► Examines phonetic convergence in college roommates over the academic year. ► Listeners perceived phonetic convergence in college roommates. ► Roommates' vowel spectra did not converge consistently. ► Perceived phonetic convergence was related to roommates' rated closeness. ► Phonetic convergence is a social device, not an automatic consequence of speech perception.
To understand how language influences the vocal communication of emotion, we investigated how discrete emotions are recognized and acoustically differentiated in four language contexts—English, German, Hindi, and Arabic. Vocal expressions of six emotions (anger, disgust, fear, sadness, happiness, pleasant surprise) and neutral expressions were elicited from four native speakers of each language. Each speaker produced pseudo-utterances (“nonsense speech”) which resembled their native language to express each emotion type, and the recordings were judged for their perceived emotional meaning by a group of native listeners in each language condition. Emotion recognition and acoustic patterns were analyzed within and across languages. Although overall recognition rates varied by language, all emotions could be recognized strictly from vocal cues in each language at levels exceeding chance. Anger, sadness, and fear tended to be recognized most accurately irrespective of language. Acoustic and discriminant function analyses highlighted the importance of speaker fundamental frequency (i.e., relative pitch level and variability) for signalling vocal emotions in all languages. Our data emphasize that while emotional communication is governed by display rules and other social variables, vocal expressions of ‘basic’ emotion in speech exhibit modal tendencies in their acoustic and perceptual attributes which are largely unaffected by language or linguistic similarity.
Competition between words in the lexicon is associated with hyperarticulation of phonetic properties in production. This correlation has been reported for metrics of competition varying in the phonetic specificity of the relationship between target and competitor (e.g., neighborhood density, onset competition, cue-specific minimal pairs). Sampling a systematic array of competition metrics, we tested their ability to predict voice onset times in both voiced and voiceless word-initial stops of conversational English. Linear mixed effects models were compared according to their corrected Akaike’s Information Criterion (AICc) values. High-performing models were evaluated using evidence ratios, with the competition metrics of top-performing models tested for significance using nested model comparisons. Words with a minimal pair defined for initial stop voicing were contrastively hyperarticulated, with shorter voice onset times for voiced stops and longer voice onset times for voiceless stops. No other competition metric reliably predicted hyperarticulation for both stop types. These results suggest that contrastive hyperarticulation is phonetically specific, increasing the perceptual distance between target and competitor.
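The model-selection machinery mentioned above is standard information-theoretic practice. The sketch below shows only the AICc correction and evidence ratios in the usual Burnham-and-Anderson form; it is a generic illustration, not the study's mixed-effects models, and all inputs are hypothetical.

```python
import math

def aicc(log_likelihood, k, n):
    """Corrected AIC for a model with k parameters fit to n observations.
    Adds a small-sample penalty to plain AIC; the two converge as n grows."""
    aic = -2.0 * log_likelihood + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)

def evidence_ratios(aicc_values):
    """Evidence ratio of the best model over each candidate: exp(delta / 2),
    where delta is each model's AICc difference from the minimum."""
    best = min(aicc_values)
    return [math.exp((a - best) / 2.0) for a in aicc_values]
```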
PRIMIR (Processing Rich Information from Multidimensional Interactive Representations) is a framework that encompasses the bidirectional relations between infant speech perception and the emergence of the lexicon. Here, we expand its mandate by considering infants growing up bilingual. We argue that, just like monolinguals, bilingual infants have access to rich information in the speech stream and by the end of their first year, they establish not only language-specific phonetic category representations, but also encode and represent both sub-phonetic and indexical detail. Perceptual biases, developmental level, and task demands work together to influence the level of detail used in any particular situation. In considering bilingual acquisition, we more fully elucidate what is meant by task demands, now understood both in terms of external demands imposed by the language situation, and internal demands imposed by the infant (e.g. different approaches to the same apparent task taken by infants from different backgrounds). In addition to the statistical learning mechanism previously described in PRIMIR, the necessity of a comparison–contrast mechanism is discussed. This refocusing of PRIMIR in the light of bilinguals more fully explicates the relationship between speech perception and word learning in all infants. ► The PRIMIR theoretical framework is extended to infants growing up bilingual. ► Refocusing PRIMIR in light of bilinguals further explicates the relationship between speech perception and word learning. ► A mechanism for comparing and contrasting information is added to the framework. ► A distinction is made between internal and external task demands. ► This expansion of PRIMIR helps to explain the behavior of infants growing up in a wide variety of language backgrounds.
In phonetics, many datasets are encountered which deal with dynamic data collected over time. Examples include diphthongal formant trajectories and articulator trajectories observed using electromagnetic articulography. Traditional approaches for analyzing this type of data generally aggregate data over a certain timespan, or only include measurements at a fixed time point (e.g., formant measurements at the midpoint of a vowel). This paper discusses generalized additive modeling, a non-linear regression method which does not require aggregation or the pre-selection of a fixed time point. Instead, the method is able to identify general patterns over dynamically varying data, while simultaneously accounting for subject and item-related variability. An advantage of this approach is that patterns may be discovered which are hidden when data is aggregated or when a single time point is selected. A corresponding disadvantage is that these analyses are generally more time-consuming and complex. This tutorial aims to overcome this disadvantage by providing a hands-on introduction to generalized additive modeling using articulatory trajectories from L1 and L2 speakers of English within the freely available R environment. All data and R code are made available to reproduce the analysis presented in this paper.
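The contrast drawn above, modeling a whole trajectory rather than sampling its midpoint, can be illustrated outside R. The sketch below is not a GAM (no penalised splines, no random effects for subjects or items); it is a plain polynomial least-squares smooth of a time series, standing in only for the idea of fitting a smooth function of time instead of aggregating. All data would be hypothetical.

```python
def polyfit(times, values, degree=3):
    """Least-squares polynomial fit via the normal equations; returns
    coefficients in ascending order. A crude, unpenalised stand-in for
    the smooth terms a generalized additive model would estimate."""
    m = degree + 1
    # Normal equations: (A^T A) c = A^T y for the polynomial basis.
    ata = [[sum(t ** (i + j) for t in times) for j in range(m)]
           for i in range(m)]
    aty = [sum(v * t ** i for t, v in zip(times, values)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for row in range(col + 1, m):
            f = ata[row][col] / ata[col][col]
            for c in range(col, m):
                ata[row][c] -= f * ata[col][c]
            aty[row] -= f * aty[col]
    # Back substitution.
    coefs = [0.0] * m
    for row in reversed(range(m)):
        rest = sum(ata[row][c] * coefs[c] for c in range(row + 1, m))
        coefs[row] = (aty[row] - rest) / ata[row][row]
    return coefs
```

A fitted curve of this kind can be evaluated at any time point, so no single measurement point has to be privileged; a real GAM additionally penalises wiggliness and pools variability across speakers and items.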
Variation across talkers in the acoustic-phonetic realization of speech sounds is a pervasive property of spoken language. The present study provides evidence that variation across talkers in the realization of American English stop consonants is highly structured. Positive voice onset time (VOT) was examined for all six word-initial stop categories in isolated productions of CVC syllables and in a multi-talker corpus of connected read speech. The mean VOT for each stop differed considerably across talkers, replicating previous findings, but importantly there were strong and statistically significant linear relations among the means (e.g., the mean VOTs of [pʰ] and [kʰ] were highly correlated across talkers, r > 0.80). The pattern of VOT covariation was not reducible to differences in speaking rate or other factors known to affect the realization of stop consonants. These findings support a uniformity constraint on the talker-specific realization of a phonetic property, such as glottal spreading, that is shared by multiple speech sounds. Because uniformity implies mutual predictability, the findings also shed light on listeners' ability to generalize knowledge of a novel talker from one stop consonant to another. More broadly, structured variation of the kind investigated here indicates a relatively low-dimensional encoding of talker-specific phonetic realization in both speech production and speech perception.
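The covariation finding above reduces, computationally, to a Pearson correlation over per-talker means. A minimal sketch with made-up numbers (the values below are hypothetical, not the study's data):

```python
import math
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between paired per-talker means."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-talker mean VOTs (ms) for [ph] and [kh]:
p_means = [55, 62, 70, 48, 80]
k_means = [68, 75, 85, 60, 95]
# Talkers with long [ph] VOTs also tend to have long [kh] VOTs,
# so r is close to 1 -- the "structured variation" pattern.
```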
This paper presents a comparative evaluation of metrics for the quantification of speech rhythm, comparing pairwise variability indices (nPVI-V and rPVI-C) and interval measures (ΔV, ΔC, %V), together with rate-normalised interval measures (VarcoV and VarcoC). First, we examined how well these metrics discriminated “stress-timed” English and Dutch and “syllable-timed” Spanish and French. Metrics of interval standard deviation such as ΔV and ΔC were strongly influenced by speech rate, but rate-normalised metrics of vocalic interval variation, VarcoV and nPVI-V, were shown to discriminate between hypothesised “rhythm classes”, as did %V, an index of the relative duration of vocalic and consonantal intervals. Second, we applied these metrics to quantifying the influence of first language on second language rhythm, with the expectation that speakers switching “rhythm classes” should show rhythm scores different from both their native and target languages. VarcoV offered the most discriminative analysis in this part of the study, with %V also suggesting insights into the process of accommodation to second language rhythm.
This study documents the relation between f0 and prevoicing in the production and perception of plosive voicing in Afrikaans. Acoustic data show that Afrikaans speakers differed in how likely they were to produce prevoicing to mark phonologically voiced plosives, but that all speakers produced large and systematic f0 differences after phonologically voiced and voiceless plosives to convey the contrast between the voicing categories. This pattern is mirrored in these same participants’ perception: although some listeners relied more than others on prevoicing as a perceptual cue, all listeners used f0 (especially in the absence of prevoicing) to perceptually differentiate historically voiced and voiceless plosives. This variation in the speech community is shown to be generationally structured such that older speakers were more likely than younger speakers to produce prevoicing, and to rely on prevoicing perceptually. These patterns are consistent with generationally determined differential cue weighting in the speech community and with an ongoing sound change in which the original consonantal voicing contrast is being replaced by a tonal contrast on the following vowel.
This study explores the relationship between prosodic strengthening and linguistic contrasts in English by examining the temporal realization of nasals (N-duration) in CVN# and #NVC contexts, and their coarticulatory influence on vowels (V-nasalization). Results show that different sources of prosodic strengthening bring about different types of linguistic contrasts. Prominence enhances the consonant's [nasality], as reflected in an elongation of N-duration, but it enhances the vowel's [orality] (rather than [nasality]), showing coarticulatory resistance to the nasal influence even when the nasal is phonologically focused. Boundary strength induces different types of enhancement patterns as a function of prosodic position (initial vs. final). In initial position, boundary strength reduces the consonant's [nasality], as evident in a shortening of N-duration and a reduction of V-nasalization, thus enhancing the CV contrast. The opposite is true for the nasal in final position, whose N-duration is lengthened and accompanied by greater V-nasalization, showing coarticulatory vulnerability. The systematic coarticulatory variation as a function of prosodic factors indicates that V-nasalization as a coarticulatory process is indeed under speaker control, fine-tuned in a linguistically significant way. In dynamical terms, these results may be seen as stemming from differential intergestural coupling relationships that may underlie the difference in V-nasalization in CVN# vs. #NVC. It is proposed that the timing initially determined by such coupling relationships must be fine-tuned by prosodic strengthening in a way that reflects the relationship between dynamical underpinnings of speech timing and linguistic contrasts.
Listeners use lexical or visual context information to recalibrate auditory speech perception. After hearing an ambiguous auditory stimulus between /aba/ and /ada/ coupled with a clear visual stimulus (e.g., lip closure in /aba/), an ambiguous auditory-only stimulus is perceived in line with the previously seen visual stimulus. What remains unclear, however, is what exactly listeners are recalibrating: phonemes, phone sequences, or acoustic cues. To address this question we tested generalization of visually-guided auditory recalibration to (1) the same phoneme contrast cued differently (i.e., /aba/-/ada/ vs. /ibi/-/idi/ where the main cues are formant transitions in the vowels vs. burst and frication of the obstruent), (2) a different phoneme contrast cued identically (/aba/-/ada/ vs. /ama/-/ana/ both cued by formant transitions in the vowels), and (3) the same phoneme contrast with the same cues in a different acoustic context (/aba/-/ada/ vs. /ubu/-/udu/). Whereas recalibration was robust for all recalibration control trials, no generalization was found in any of the experiments. This suggests that perceptual recalibration may be more specific than previously thought as it appears to be restricted to the phoneme category experienced during exposure as well as to the specific manipulated acoustic cues. We suggest that recalibration affects context-dependent sub-lexical units.
Within quantitative phonetics, it is common practice to draw conclusions based on statistical significance alone. Using final devoicing in German as a case study, we illustrate the problems with this approach. If researchers find a significant acoustic difference between voiceless and devoiced obstruents, they conclude that neutralization is incomplete; and if they find no significant difference, they conclude that neutralization is complete. However, such strong claims regarding the existence or absence of an effect based on significant results alone can be misleading. Instead, the totality of available evidence should be brought to bear on the question. Towards this end, we synthesize the evidence from 14 studies on incomplete neutralization in German using a Bayesian random-effects meta-analysis. Our meta-analysis provides evidence in favor of incomplete neutralization. We conclude with some suggestions for improving the quality of future research on phonetic phenomena: ensure that sample sizes allow for high-precision estimates of the effect; avoid the temptation to deploy researcher degrees of freedom when analyzing data; focus on estimates of the parameter of interest and the uncertainty about that parameter; attempt to replicate effects found; and, whenever possible, make both the data and analysis available publicly.
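The pooling logic of a random-effects meta-analysis can be sketched compactly. The paper's analysis is Bayesian; the stand-in below is the classical frequentist DerSimonian-Laird estimator, which shares the same random-effects structure (per-study effects scattered around a population mean with between-study variance tau-squared). Inputs would be study-level effect estimates, e.g. voiceless-minus-devoiced duration differences, and their standard errors; all of this is illustrative, not the paper's model.

```python
import math

def random_effects_meta(effects, ses):
    """DerSimonian-Laird random-effects pooling.
    effects: per-study effect estimates; ses: their standard errors.
    Returns (pooled_effect, pooled_se, tau2)."""
    w = [1.0 / se ** 2 for se in ses]               # inverse-variance weights
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                   # between-study variance
    w_star = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star)), tau2
```

When the pooled estimate's uncertainty interval excludes zero, the synthesis favours a real (if small) effect, which is the kind of evidence the meta-analysis above reports for incomplete neutralization.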
In conversation, turn transitions between speakers often occur smoothly, usually within a time window of a few hundred milliseconds. It has been argued, on the basis of a button-press experiment [De Ruiter, J. P., Mitterer, H., & Enfield, N. J. (2006). Projecting the end of a speaker's turn: A cognitive cornerstone of conversation. Language, 82(3), 515–535], that participants in conversation rely mainly on lexico-syntactic information when timing and producing their turns, and that they do not need to make use of intonational cues to achieve smooth transitions and avoid overlaps. In contrast to this view, but in line with previous observational studies, our results from a dialogue task and a button-press task involving questions and answers indicate that the identification of the end of intonational phrases is necessary for smooth turn-taking. In both tasks, participants never responded to questions (i.e., gave an answer or pressed a button to indicate a turn end) at turn-internal points of syntactic completion in the absence of an intonational phrase boundary. Moreover, in the button-press task, they often pressed the button at the same point of syntactic completion when the final word of an intonational phrase was cross-spliced at that location. Furthermore, truncated stimuli ending in a syntactic completion point but lacking an intonational phrase boundary led to significantly delayed button presses. In light of these results, we argue that earlier claims that intonation is not necessary for correct turn-end projection are misguided, and that research on turn-taking should continue to consider intonation as a source of turn-end cues along with other linguistic and communicative phenomena.
Research on the development of speech processing in bilingual children has typically implemented a cross-sectional design and relied on behavioral measures. The present study is the first to explore brain measures within a longitudinal study of this population. We report results from the first phase of data analysis in a longitudinal study exploring Spanish-English bilingual children and the relationships among (a) early brain measures of phonetic discrimination in both languages, (b) degree of exposure to each language in the home, and (c) children's later bilingual word production abilities. Speech discrimination was assessed with event-related brain potentials (ERPs). A bilingual questionnaire was used to quantify the amount of language exposure from all adult speakers in the household, and subsequent word production was evaluated in both languages. Our results suggest that bilingual infants' brain responses to speech differ from the pattern shown by monolingual infants. Bilingual infants did not show neural discrimination of either the Spanish or English contrast at 6–9 months. By 10–12 months of age, neural discrimination was observed for both contrasts. Bilingual infants showed continuous improvement in neural discrimination of the phonetic units from both languages with increasing age. Group differences in bilingual infants' speech discrimination abilities are related to the amount of exposure to each of their native languages in the home. Finally, we show that infants' later word production measures are significantly related to both their early neural discrimination skills and the amount of exposure to the two languages early in development. ► English/Spanish bilingual infants studied at 6–9 and 10–12 months of age. ► Brain responses to English/Spanish sounds changed with age and language exposure. ► Neural responses to English/Spanish linked to exposure to each language at home. ► Amount of exposure to each language at home linked to later word production.
► Bilingual infants may remain “open” longer to language experience than monolinguals.