Collocations in institutional academic English. Corpus and experimental perspectives

Ferraresi, Adriano

This book deals with one of the best-loved, and at the same time most elusive, notions in corpus linguistics: the notion of collocation. Its aim is to investigate the many facets of this notion, triangulating lexicographic, informant and psycholinguistic evidence. The seemingly straightforward task that occupies the bulk of the book consists in evaluating different statistical measures for the automatic extraction of lexical collocations from an ESP corpus. The task is far more complex than it seems, and has three related aspects. The first is mainly theoretical, and concerns the overlap between a performance-based view of collocation and its competence-based counterpart. While the former considers collocations as retrieved from a corpus, regardless of the method employed for retrieval, the latter relies on implicit evidence of psychological salience, or explicit endorsement of collocation status. Whether the two overlap substantially, somewhat or not at all is still an open question. The second, more practical/methodological concern has to do with comparing the different results yielded by four different measures of collocativity, in order to determine which measure best matches the competence-based evidence collected. This point has important implications for corpus use in general, and is crucial for NLP applications that aim to retrieve collocations automatically, as is the case, for instance, in unsupervised term extraction tasks. The third issue addressed in this book is descriptive, and concerns the typical phraseology employed within a well-defined genre, namely degree course descriptions published by British universities on the web. The increasingly globalised world of higher education consistently uses English as its lingua franca. Access to quality content is essential to develop reference resources (such as term and phrase banks) that support non-native authors in their effort to write institutional academic texts in English.
Let us briefly consider each of these three aspects in slightly more detail. Adopting a (very) loose corpus-based definition, we might describe collocations as sequences of words that occur repeatedly in texts, and that do so because they are “the preferred way of putting things” (Kennedy ): for instance, based on the relative frequency of occurrence of “final year” and “concluding year” in a corpus of degree course descriptions, it is possible to conclude that students who are at the end of their studies are more likely to be described as being in the former than in the latter, regardless of the fact that the two adjectives are near-perfect synonyms in context. The terms “preferred” and “likely” in the previous sentence hint at the fact that repeated cooccurrence is hypothesised not to be a random feature of texts, but rather the textual instantiation of psychological salience. Collocations form a non-negligible part of a speaker’s mental lexicon, therefore they are uttered or written often, therefore they are highly frequent in texts, and vice versa. While this seems a fair assumption, one that is taken for granted, either implicitly or explicitly, in most studies on the topic, the actual relation between a performance- and a competence-oriented view of collocation, i.e. between corpus and psycholinguistic data, is still underexplored in the field of corpus linguistics (Gilquin and Gries 2009; Siyanova-Chanturia 2015).
Moving on to the methodological focus of this book, different statistical measures have been proposed and are currently used in the literature, each giving more prominence (i.e. a higher collocativity score and rank) to one or another sequence (say, “second year” vs. “concluding year”). The question then arises as to what is the “best” measure of collocativity available, that is, which measure is able to retrieve from corpora the highest number of salient collocations while minimizing or scoring down non-collocations.
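How two measures can rank the same pairs differently may be illustrated with a minimal sketch. It assumes pointwise mutual information and the t-score as examples of commonly used association measures; the four measures compared in the book are not named here, and all counts below are invented for illustration only.

```python
import math

def pmi(pair_count, w1_count, w2_count, total_pairs):
    """Pointwise mutual information (log base 2) of a word pair."""
    expected = w1_count * w2_count / total_pairs  # count expected under independence
    return math.log2(pair_count / expected)

def t_score(pair_count, w1_count, w2_count, total_pairs):
    """t-score: observed minus expected count, scaled by sqrt(observed)."""
    expected = w1_count * w2_count / total_pairs
    return (pair_count - expected) / math.sqrt(pair_count)

def rank(counts, measure, total_pairs):
    """Return the pairs sorted from highest to lowest score on `measure`."""
    return sorted(counts, key=lambda p: measure(*counts[p], total_pairs), reverse=True)

# Hypothetical counts (NOT taken from the book's corpus):
# pair -> (pair frequency, adjective frequency, noun frequency)
TOTAL_PAIRS = 100_000
COUNTS = {
    ("final", "year"): (120, 150, 2000),
    ("concluding", "year"): (2, 2, 2000),
}

for name, measure in [("PMI", pmi), ("t-score", t_score)]:
    print(name, rank(COUNTS, measure, TOTAL_PAIRS))
```

With these invented counts the t-score places “final year” first, while PMI places the rare pair “concluding year” first, since PMI tends to reward low-frequency pairs whose words seldom occur apart. Which of the two rankings is closer to speakers’ judgements is exactly the kind of question the evaluation addresses.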
Evidence that “final year” (but not “concluding year”) is implicitly or explicitly recognised as a collocation by speakers of English would suggest that it is memorised as a single unit, i.e. that it is part of their mental lexicon rather than being compositional. An association measure that gives a higher score to “final year” than to “concluding year” better reflects human intuition than one that does the opposite, and there are obvious descriptive/theoretical and practical/methodological advantages in knowing which does what.
A final point concerns the variety of English focused upon, namely institutional academic language published on the web. The collocations evaluated in this book are extracted from a purpose-built corpus of BA degree course descriptions, collected through a semi-automatic procedure from the websites of British universities. This genre was selected since it provides a well-defined and clearly recognizable subset of an ESP variety that is currently the object of both descriptive (Biber 2006) and applied interest (Depraetere et al. 2011). The corpus construction method is described in detail, so as to allow interested readers to repeat or adapt the procedure for similar ESP web corpora, an added bonus. Lastly, while the analysis is limited to adjective-noun pairs, it does provide insights about typical phrases used in this native variety of English that are of interest both in their own right and for subsequent comparisons with lingua franca varieties.
The book is structured as follows. Chapters I to III describe the theoretical background to the work, focusing on the aspects of collocation studies of more immediate relevance to the present concerns, namely frequency-oriented views, statistical methods and process-oriented perspectives.
Chapter IV describes the ESP variety under study (institutional academic English, and in particular degree course descriptions), and presents the corpus that was built for this work, as well as the procedure developed for this purpose. Chapter V focuses on methodological aspects, providing a detailed account of the various phases of the research: formulation of the research hypotheses, evaluation tasks, and statistical methods used in the analysis of results. Chapters VI to VIII report on the results obtained in the three evaluation tasks, carrying out extensive quantitative and qualitative comparisons and discussing the points of contact and differences observed. Finally, Chapter IX recaps the main findings of the experiments, comments on their theoretical, methodological and applied relevance, and makes suggestions for further work.