Data Sci. Discussions: Natural annotation of crowdsourced online dictionaries

General discussion of the natural annotation of crowdsourced online dictionaries

I am interested in leveraging the natural annotation provided by a crowdsourced online dictionary, Wiwords, to actively and continuously address word sense disambiguation issues that arise when analyzing social media (Facebook, Instagram, Twitter, etc.) text data in a low-resource language, Saint Lucian Creole. Improving word sense disambiguation for the creole language may make it easier to assess the language’s vitality by measuring how frequently it is used in social media posts.

For example, avocado translates to zabòka in the official Saint Lucian Creole dictionary. However, posts on Instagram, Facebook, and Twitter (and even a book on Amazon) refer to the same item using the spellings ‘zaboca’, ‘zabocca’, and ‘zaboka’. The crowdsourced online dictionary Wiwords lists the same item with the spelling ‘zaboca’ and includes pictures for clarification.

A cursory search of Twitter revealed that the ‘zaboca’ spelling was just about as common as ‘zaboka’, yet the accented spelling, ‘zabòka’, was not present. In this sense, Wiwords did reflect the term’s typical informal (social media) form. While Wiwords does not list the official spelling or all of the alternative spellings, it does help in understanding when diacritical marks are present or absent, and it provides context needed to improve word sense disambiguation.
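As a rough illustration of how such spelling variants might be folded back onto the official headword before counting usage, here is a minimal Python sketch. The variant table, the function names, and the sample posts are assumptions made for illustration; they are not taken from Wiwords or from any real data set.

```python
import unicodedata
from collections import Counter


def strip_diacritics(word: str) -> str:
    """Remove combining accent marks, e.g. 'zabòka' becomes 'zaboka'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical variant table: unaccented informal spelling -> official headword.
VARIANTS = {
    "zaboca": "zabòka",
    "zabocca": "zabòka",
    "zaboka": "zabòka",
}


def canonical_form(token: str) -> str:
    """Map a token to its official headword if it is a known variant."""
    lowered = token.lower()
    return VARIANTS.get(strip_diacritics(lowered), lowered)


def count_headwords(posts):
    """Count tokens across posts after folding known variants together."""
    counts = Counter()
    for post in posts:
        for raw in post.split():
            token = raw.strip(".,!?;:'\"()#")
            if token:
                counts[canonical_form(token)] += 1
    return counts


# Made-up sample posts, for illustration only.
sample_posts = ["Fresh zaboca today!", "Zaboka and bread", "zabòka season"]
headword_counts = count_headwords(sample_posts)
```

Folding the variants together in this way means that a frequency estimate for the headword is not split across its informal spellings.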

Brief literature review

Online dictionaries allow users and compilers opportunities far beyond those of traditional publishing. For an endangered language, the use of such a resource is becoming increasingly essential. However, static online dictionaries may not be sufficient to meet the demands of a language that is slowly dying. Static dictionaries with dynamic qualities may better serve the needs of cataloging an endangered language.

Online dictionaries can benefit from using technology to update their records regularly. This pertains not only to new sections of the alphabet but “also with words of particular interest to its users at the time”. LOL and YOLO, both online slang, were added to the Oxford English Dictionary (OED) in 2011 and 2016 respectively. The third edition of the OED appears to no longer suppress widely used slang terms; however, it is certainly incapable of documenting the meaning and use of every fleeting term used among small groups of speakers across the entire English‑speaking world.

Urban Dictionary is a free online dictionary of contemporary English slang; it is a “collaborative project of over 1 million definitions for over 400,000 unique headwords”. It serves as a form of online lexicography in which public contributors explain the meanings of words and phrases not readily covered by traditional dictionaries; contributors collaborate, cooperate, and compete in meaning-making. Eventually, depending on how widely a word comes to be used in society, official dictionaries, such as the OED, may elect to adopt a word from such a collection.

The collaborative compilation of online lexicography, therefore, is known to linguistics. In fact, there is a website, similar to Urban Dictionary, that caters specifically to Caribbean users. While the region is home to various peoples with differing, complex histories, they do share similarities; most of these are the unfortunate result of colonization by similar superpowers, but certain words are ubiquitous because of the ease of traveling to and settling in neighboring countries. People in these areas experience similar flora and fauna, and environmental and social phenomena that are endemic to their setting and way of life. The website, Wiwords, is a dynamic online dictionary that highlights instances of shared vocabulary and also allows for contributions that are unique to certain countries in the Caribbean.

Overall, the Antillean creole spoken in Saint Lucia can benefit from utilizing online language resources. A pidgin is a language variety developed by speakers in contact who share no common language; it has limited functions of use, is linguistically simplified, and develops its own rules and norms of usage. Creole languages develop from pidgins and serve as the first language of some members of a speech community, where they are used for a wide range of functions. Since creole languages seldom achieve official status, the speakers of a fully formed creole may eventually feel compelled to conform their speech to one of the parent languages. This decreolization process typically brings about a post-creole speech continuum characterized by large-scale variation and hypercorrection in the language.

It should, however, be noted that Saint Lucians do not appear to be highly active in this online linguistic community. For example, there are no contributions to the section outlined for local “quotes” or sayings, when these are quite abundant in the local language. Most participation from Saint Lucia pertains to discussions of food, and animal and plant life. Therefore, to better preserve the creole language within Saint Lucia, a dynamic online community dictionary would be very useful in keeping certain words active.

It is also important to note that natural language processing issues may arise when manipulating the creole language, or when attempting to adapt it to a digital environment such as a comprehensive online dictionary.

Issues with natural annotation

Public contributors may know the language as speakers, but they may not be the best teachers; that is to say, contributors may not always be clear in their explanations or contributions. Overall, straying from the official spellings of words can contribute to data entry issues. For example, one can observe the accents used with “chofé” (“to heat up”) and “chofè” (a “driver”). Without context and accents, these words would be very difficult to tell apart. Contributors who apply accents inaccurately, or who cannot access the special characters (diacritics) required for formal spelling, can create problems in documenting the language.
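The “chofé” and “chofè” case can be made concrete with a small Python sketch that flags headwords which become indistinguishable once accents are dropped. The entry list reuses only the glosses mentioned in this post; the function names and the data structure are assumptions made for illustration.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word: str) -> str:
    """Remove combining accent marks from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_collisions(entries):
    """Group headwords by unaccented form; keep groups with two or more members."""
    groups = defaultdict(dict)
    for headword, gloss in entries.items():
        groups[strip_diacritics(headword)][headword] = gloss
    return {bare: forms for bare, forms in groups.items() if len(forms) > 1}


# Illustrative entries using glosses quoted in this post.
ENTRIES = {
    "chofé": "to heat up",
    "chofè": "driver",
    "zabòka": "avocado",
}

collisions = find_collisions(ENTRIES)
for bare, forms in collisions.items():
    print(bare, "is ambiguous without accents:", forms)
```

A check like this could be run over contributed entries to identify where an unaccented social media spelling needs surrounding context to be resolved.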

Problems of polysemy

Another issue with dictionary compilation in this region is the overlap of meanings attributed to certain words. Many-to-one relations between words and their meanings (e.g. homonymy and polysemy) are often not included in lists when studying this phenomenon for a unique language like creole. In addition to synonyms, compilers must contend with frequent polysemy, the presence of homonyms and homophones, and the phenomenon sometimes referred to as the “false friends” of the translator. Dictionaries recognize the distinction between polysemy and homonymy by making a polysemous item a single dictionary entry and making homonymous lexemes two or more separate entries; the senses of a polysemous term, though different, can be derived from one basic idea, and the related senses reflect that image.

Polysemy is a common occurrence in the endangered language of Saint Lucian Antillean creole. One indication that an English verb is polysemous is that it corresponds to different verbs when translated into other languages. For example, consider the English word “ask” (for information) and “ask” (for action); both are rendered by one word, “vragen”, in Dutch, but the majority of other languages use a different word for each sense, like “fragen” and “bitten” in German, and “preguntar” and “pedir” in Spanish. In Saint Lucian creole, this can be seen where the term “mwen” can mean “me” or “my”, or the term “asou” can mean “on” or “lean against”.

It is important to note that word sense disambiguation may also need to account for the intermingling of an almost identical creole from the neighboring island of Dominica. The creoles of the two islands sound similar and some words are indeed the same, but the written forms vary slightly. This might be due to the liberties taken by the different authors of their dictionaries; Frank (missing reference) highlighted his penchant for leaning on French when considering the spelling of words. Overall, it appears that while both countries favored phonetic spelling, there are slight differences in spelling and in diacritic use and placement.

Take, for example, the days of the week: the words for Sunday, Monday, Tuesday, and Saturday are the same, whereas those for Wednesday, Thursday, and Friday differ. Dominicans write Wednesday as Mèkwédi whereas Saint Lucians write it as Mékwédi, and Dominicans write Thursday as Jèdi whereas Saint Lucians write it as Jédi; in both cases the accent on the first vowel differs (Government of Dominica, 2018). Dominicans write Friday as Vanwédi whereas Saint Lucians write it as Vandwédi; here, while the accent is the same, the Saint Lucian form includes an additional ‘d’, reminiscent of the original French [< Fr. vendredi] (according to Frank (missing reference)).
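The day-name comparison above can be expressed as a short Python sketch that labels each Dominican and Saint Lucian pair as identical, differing only in diacritics, or differing in spelling. The spellings are those quoted in this post; the function names and category labels are assumptions made for illustration.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining accent marks from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify_difference(first: str, second: str) -> str:
    """Label how two written forms of the same word differ."""
    a = unicodedata.normalize("NFC", first).lower()
    b = unicodedata.normalize("NFC", second).lower()
    if a == b:
        return "identical"
    if strip_diacritics(a) == strip_diacritics(b):
        return "diacritic only"
    return "spelling"


# (Dominican form, Saint Lucian form), as quoted in the text.
DAYS = {
    "Wednesday": ("Mèkwédi", "Mékwédi"),
    "Thursday": ("Jèdi", "Jédi"),
    "Friday": ("Vanwédi", "Vandwédi"),
}

differences = {day: classify_difference(*pair) for day, pair in DAYS.items()}
for day, label in differences.items():
    print(day, "->", label)
```

Pairs labelled “diacritic only” are the ones most likely to collapse into a single form in informal text, where accents are often omitted.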

It is said that the shared creole alphabet writing system arose out of two creole ethnography workshops held in St. Lucia in January 1981 and September 1982; it was developed through the efforts of researchers at “the University of the West Indies (U.W.I.), The Université Antilles – Guyane groups from St. Lucia (MOKWÉYÓL), Dominica (K.E.K.) and the Groupé d’Etude et de Recherche en Espace, Creolophone (GEREC) from Martinique and Guadeloupe” (Government of Dominica, 2020).

Even the word “Creole” can be viewed as a contested, polysemous term in the English language. The term has been used at different times and in various geographical regions to describe distinct identities, languages, peoples, ethnicities, racial heritages, and cultural artefacts. As an adjective, Creole was applied as an indicator of higher status bestowed upon Louisiana-born slaves to distinguish them from those born in Africa. It was also used as a noun to designate local birth in Louisiana, regardless of racial heritage; later, Creole was used by Americans when referring to people of Spanish or French descent, yet it has often been conflated with the term “Cajun” (which described French colonists who settled in Canada’s Acadia region, then migrated to Louisiana). In fact, for some time, there was also a misconception that the term only referred to whites born in Louisiana.

Currently, Creole primarily refers to one’s linguistic heritage (French culture and a unique Franco-linguistic dialect) as the main source of their ethnic identity (particularly those of mixed or foreign ancestry). Creole, ‘Kreyol’ or ‘Kweyol’ also refers to the creole languages in the Caribbean, including Antillean Creole, Haitian Creole, and Jamaican Creole, among others.

Polysemy may indeed present an issue when attempting to study a language or its dialects; however, it is in fact the premise of much of the active literature in this language. Calypso, like most other endemic forms of music, may celebrate this ability to use words or phrases bearing double meanings to indirectly discuss topics that are often crude. Philips suggested that Calypsos can engage this method via ‘lamina lyrics’. Much like an onion, these Calypsos have a number of different levels of meaning, concealed one underneath the other. To achieve this effect, Calypsonians use frames and masks that manifest in Calypsos as metaphor, metonymy, polysemy, irony, and satire.

In 1863, August Schleicher, the German linguist, discussed languages as natural organisms that come into being, develop, age, and die according to laws that are quite independent of man’s will (cited by Arens 1969:259). This perception of language is contrary to the modern view that language is a set of communication habits principally defined by one’s social experience, led by an innate drive to decipher and learn the language practices of others. However, despite having a few members of the “native speaking generation” left, communities often lack structures dedicated to the retention of the language. Therefore, successive generations may require resources that express the necessity of the language as well as its desirability to ensure its continuance.

Mohan is of the view that living languages can die before “the last trace of memory has vanished”, and that these languages may be experiencing a feigned guise of life through the observation of “sympathetic post-users from outside its system” (1979:42). To further quote Mohan, an obsolescent language “is actually dead before its forms have totally disappeared, in two different senses. A small part of the non-native speaking generation has preserved dead tokens of the language. Also, the time gap between the age of the youngest native speaker and the latest possible age of language acquisition, in infancy, shows the language dead at its source, but with a now finite community of native speakers continuing, like the earlier light of a dead star, to travel its original course and give an illusion picture of vitality” (313).

Moreover, Sasse (2001) lists ten changes that occur during language obsolescence, several of which bear similarities to those described by Campbell & Muntzel (1989) and Palosaari & Campbell (2011), namely: loss of phonological distinctions, regularization of morphophonemics, loss of function words, analyticity, loss of morphology, loss of syntactic complexity, agrammatism, phonological and grammatical variability, reduction of vocabulary, and an increase in polysemy.

Evaluating and creating structures is crucial for the preservation and promotion of endangered languages.

Need to expand on the definition of systems dynamic model and other models here…

Perceptions of creole

Perceptions of the locals’ attitudes towards the creole language are often complicated. Language change in Saint Lucia has been a pressing concern for academics for many years. As recently as 1970, Douglas Midgett highlighted that the formal teaching of Patois would be viewed as “unjustifiable and in any case, would never be tolerated by even its most ardent user”. At the time, he noted that most people agreed that increased proficiency in spoken and written English (or French) would be an educational must; however, writers and other academics on the one hand, and practicing educators on the other, had differing views. The former group believed that the use of creole in schools could aid with recognizing English as a second language, whereas educators adamantly argued against any use of Creoles in the schools. Midgett believed that using Patois in the schools interchangeably with less formal, more colloquial English would help establish English in the minds of students as a functional Patois equivalent. He suggested that as long as educational institutions reinforce the conventional opinion that the two languages should be kept separate, the campaign for English literacy and spoken usage would not be widely effective. Midgett, however, underestimated how pervasive the English language would become; in fact, it is the creole language that currently lacks proper literacy among the public.

In 1998, Frank explored and even expanded the written form of the creole language in Saint Lucia while attempting to translate an English Bible into the local language. Upon concluding this work, he suggested that the Bible could indirectly boost creole literacy by giving people the motivation to read: “… for all practical purposes Creole remains an unwritten language for the majority of the population, which remains unaware of the books published in Creole. Attempts to teach Creole literacy have not met with much success because of lack of interest. Motivation is the most important factor in the success of any literacy program, and having something people want to read is the most important motivating factor”.

Investigation

This ultimately leaves one to wonder whether it is logical and possible to merge an official static online dictionary with a dynamic slang dictionary to increase overall fluency in an endangered language. Such a system would require access to an existing dictionary, as well as opportunities for dictionary contributions. The creation of an interactive dictionary may help reinvigorate the language through active use of, and contributions to, this resource, while also offering opportunities to actively discuss and clarify words and concepts associated with creole.
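One possible shape for such a merged resource is sketched below in Python (3.9 or later): official entries are loaded once and stay fixed, while community contributions attach variant spellings and notes to them. The class names, fields, and sample data are all hypothetical; this is a sketch of the idea under those assumptions, not a description of how Wiwords or any official dictionary is built.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One official headword plus whatever the community has added to it."""
    headword: str
    gloss: str
    variants: set[str] = field(default_factory=set)
    notes: list[str] = field(default_factory=list)


class HybridDictionary:
    """Static official entries with a dynamic layer of community contributions."""

    def __init__(self, official: dict[str, str]):
        self._entries = {hw: Entry(hw, gloss) for hw, gloss in official.items()}
        # Index from any known spelling (lowercased) to the official headword.
        self._index = {hw.lower(): hw for hw in official}

    def contribute(self, headword: str, variant: str, note: str = "") -> None:
        """Attach an informal spelling (and optional note) to an official entry."""
        entry = self._entries[headword]  # raises KeyError for unknown headwords
        entry.variants.add(variant)
        self._index[variant.lower()] = headword
        if note:
            entry.notes.append(note)

    def lookup(self, spelling: str):
        """Return the entry for an official or contributed spelling, else None."""
        headword = self._index.get(spelling.lower())
        return self._entries.get(headword) if headword else None


# Illustrative usage with the avocado example from this post.
dictionary = HybridDictionary({"zabòka": "avocado"})
dictionary.contribute("zabòka", "zaboca", note="common on social media")
found = dictionary.lookup("Zaboca")
```

Keeping the official headword as the key, and treating contributions as an index layered on top of it, is one way to let informal spellings resolve to the formal entry without overwriting it.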

  • Need to clean this up and add the references mentioned.

References:

  1. Government of Dominica. (2018). Kweyol Language. In A virtual Dominica. gov.dm. https://www.avirtualdominica.com/project/creole-kweyol-language/
  2. Government of Dominica. (2020). A Brief History of Kwéyòl (Patwa). In The Division of Culture. gov.dm. http://divisionofculture.gov.dm/creole-languages/5-kweyol

Published: November 27 2020
