Data Sci. Discussions: Capitonyms

General discussion of Capitonyms

Because of this Ph.D. program, I believe I have become a bit more capable of noticing applications of NLP concepts in my surroundings. I now look at Grammarly.com and wonder about the underlying tools and infrastructure. I think about Noam Chomsky’s ideas on the “autonomy of syntax”, where syntax and grammar convey information that is independent of the meaning and semantics of words, and how that may shape autocorrection here. Grammarly.com seems to offer some sort of interactive spelling correction, with an interesting combination of background lemmatization, tokenization, and PoS tagging; it probably also has some named entity recognition, sentence boundary detection, and a word sense disambiguation feature using some kind of Lesk algorithm and Snowball stemmer (like those present in Python’s NLTK). Concepts like the removal of stopwords are not applied here, and it is not yet capable of properly handling corrections across multiple languages.
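
To ground those terms, here is a minimal NLTK sketch of the building blocks I mention (tokenization, PoS tagging, Snowball stemming, and Lesk-based word sense disambiguation). This only illustrates the concepts; Grammarly.com’s actual pipeline is proprietary, so any resemblance is assumed, not confirmed.

```python
# Minimal sketch of the NLTK concepts mentioned above; an illustration
# of the techniques, not Grammarly's actual pipeline.
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.wsd import lesk

for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg, quiet=True)

sentence = "I deposited the check at the bank near the river."
tokens = nltk.word_tokenize(sentence)       # tokenization
print(nltk.pos_tag(tokens))                 # part-of-speech tagging

stemmer = SnowballStemmer("english")
print([stemmer.stem(t) for t in tokens])    # Snowball stemming

sense = lesk(tokens, "bank", "n")           # Lesk word sense disambiguation
if sense:
    print(sense.name(), "-", sense.definition())
```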

Te quiero/ Yo te quiero:

I also noticed that some phrases do not trigger error alerts in Grammarly.com. It appears to treat certain Spanish phrases as correct or incorrect on what is presumably an English-only platform. Sometimes it flags a Spanish word as incorrect simply because the word is not capitalized. For example, consider the Spanish phrase ‘te quiero’; this is the informal way of saying ‘I love you’, and it carries a lighter connotation of love and care than the more romantic phrase ‘te amo’. Grammarly.com currently suggests capitalizing only the word quiero to make the phrase acceptable and remove the grammar alert (‘te Quiero’).

‘Quiero’ appears to have Latin roots in the word ‘quaero’, meaning to ‘seek’ or ‘look for’. The word ‘quiero’ is indeed a tricky one, as it can be arrived at through conjugation in different scenarios. It can be derived from the reciprocal verb ‘quererse’, where it means ‘to feel affection for’ (to love each other). It can also be derived from the transitive verb ‘querer’; in its transitive form (where it takes a direct object), it has several slightly different meanings: it can mean ‘to feel affection’, which includes ‘to love’, ‘to like’, and ‘to be fond of’; it can also indicate intention (‘to mean’ and ‘to try’) or desire (‘to want’). The fact that quiero can be applied to so many different scenarios supports an argument that it is a demonstration of polysemy.
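
One quick way to see this conjugation relationship in code is to lemmatize the word: a lemmatizer maps ‘quiero’ back to its infinitive. Here is a small sketch with spaCy, assuming the es_core_news_sm Spanish model has been downloaded; lemma output can vary slightly by model version.

```python
# Lemmatizing 'quiero' back toward its infinitive 'querer' with spaCy.
# Assumes the model was installed via: python -m spacy download es_core_news_sm
import spacy

nlp = spacy.load("es_core_news_sm")
for token in nlp("Yo te quiero mucho."):
    print(token.text, token.lemma_, token.pos_)
# Expected (model-dependent): 'quiero' lemmatizes to 'querer'
```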

Perhaps this issue with Grammarly.com is indeed due to quiero’s typical usage; it typically indicates a want. For example, one can want an item, like ice cream: ‘Quiero helado’ translates to ‘I want ice cream’. Here ‘Quiero’ is viewed as correct, yet ‘helado’ is marked as non-English. I suppose the Spanish phrase for “want ice cream” may not be as popular/common, and one could say this may have been a flaw in the original training data; however, I am uninformed about the underlying structure, so I cannot state whether it is based entirely on machine learning models (or combined with a supporting language-translation tool) in which quiero would typically have been encountered at the beginning of a sentence.
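
To make that speculation concrete, here is a purely hypothetical sketch of a frequency-based truecaser, entirely my own illustration and not anything Grammarly has documented: if a word is capitalized in most of its training examples (say, because it usually starts a sentence, as in ‘Quiero helado’), the most frequent casing wins everywhere, which would produce exactly the ‘te Quiero’ suggestion.

```python
# Hypothetical frequency-based truecaser. The toy counts below are
# invented to show how mostly sentence-initial training examples could
# yield the 'te Quiero' suggestion discussed above.
from collections import Counter

case_counts = {
    "quiero": Counter({"Quiero": 90, "quiero": 10}),  # mostly sentence-initial
    "te":     Counter({"te": 80, "Te": 20}),
}

def truecase(word):
    counts = case_counts.get(word.lower())
    if counts is None:
        return word                      # unseen word: leave unchanged
    return counts.most_common(1)[0][0]   # most frequent casing wins

print(" ".join(truecase(w) for w in ["te", "quiero"]))  # -> te Quiero
```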

Interestingly, Google Translate and Python’s TextBlob library may also be a bit faulty, as they seem to offer varied capitalizations or meanings depending on whether a word or phrase is hyphenated, separated, or concatenated (Sarkar, 2019). When using Google Translate from English to Spanish to indicate that I want ice cream (with the words ice and cream separated), there are no capitalizations. However, if I hyphenate or concatenate, it capitalizes the word “Quiero”.
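
This observation is easy to try with TextBlob, whose translate() method proxied the Google Translate API at the time (it has since been deprecated in newer TextBlob releases, so outputs and even availability may differ today):

```python
# Comparing translations when 'ice cream' is separated, hyphenated,
# or concatenated. TextBlob's translate() proxied Google Translate
# circa 2020 and is deprecated in newer releases.
from textblob import TextBlob

for phrase in ["I want ice cream", "I want ice-cream", "I want icecream"]:
    print(phrase, "->", TextBlob(phrase).translate(from_lang="en", to="es"))
```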

This translation issue might also arise because the distinction between direct and indirect objects is not quite the same in Spanish as it is in English; thus the pronouns that represent them are sometimes called accusative and dative pronouns, respectively. Or it could be that the verb querer is considered a transitive verb (one that needs a direct object), and its usage is therefore more complex and difficult to program. In any case, since the language is so temperamental, does that mean that larger training datasets are needed to detect the nuances of appropriate use or translation, or is this something a language package update or a new package could address?

Current example of machine translation issues in action:

Moreover, in the last week, the singer Jennifer Lopez sang a controversial Spanish line in her latest song that prompted an interesting look into the capitonym challenges of NLP. She called herself “Tu negrita del Bronx”. Negrita, as you can probably guess, implies that something is dark, and the diminutive suffix “-ita” can mark endearment, though it can also be infantilizing or insulting when directed at a dark-skinned person. People were unsure of what she meant by its use, since she is clearly not dark in skin tone, but she explained that in some parts of the Spanish-speaking world, calling someone dark-skinned is applied broadly: basically, anyone who is not white is called dark, and people uplift each other for this quality. So confused fans went to Google for a literal translation, only to encounter further confusion: if you capitalized Bronx, it displayed the harsher version (your n-word from the Bronx), and if you did not capitalize it, it translated to the milder version (“Your black girl from the bronx”). TextBlob returns the same translation results for both scenarios.

Interestingly, that capitonym behavior also appears in TextBlob to a certain extent; it seems to be a means of differentiating between formal and informal forms of conversation. The formal form ‘yo te quiero’ and the informal form ‘te quiero’ translate to ‘I love you’ and ‘i love u’, respectively. Note that regardless of capitalization in the formal phrase, the result is capitalized. Yet only when the first character is capitalized in the informal form (‘Te quiero’) is it reciprocated in the result (‘I love u’); the other words remain in their informal forms (‘you’ remains as ‘u’). An additional consideration is that when the word ‘mucho’ accompanies the phrase ‘te quiero’ to form ‘te quiero mucho’ (‘I love you very much’), it seems to be regarded as formal, as all the expected formatting is present.
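
The behavior described above can be checked with a few lines (same caveat as before: translate() is deprecated and the backing service has changed since 2020, so the outputs may no longer match what I observed):

```python
# Checking how capitalization affects TextBlob's Spanish-to-English
# translations of the 'te quiero' variants discussed above.
from textblob import TextBlob

for phrase in ["yo te quiero", "Yo te quiero", "te quiero",
               "Te quiero", "te quiero mucho"]:
    print(repr(phrase), "->", TextBlob(phrase).translate(from_lang="es", to="en"))
```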

It will be interesting to see whether the translation changes over time to directly address this concern, as the scandal will have generated additional internet chatter. It does not appear that Google is addressing this matter through a live training dataset, but perhaps through a package like TextBlob. Using keywords like “capitonym”, “te quiero”, and “textblob”, I was unable to find any current literature on this occurrence via Google or in any scholarly archives.

Ultimately, I guess I would love to write some sort of commentary paper on this phenomenon: to discuss the possible inner workings of the underlying algorithm that furthered the scandal, and to discuss the possible future implications. Since world leaders are increasing their use of informal text and speech on social media platforms, yet these same governments rely on this technology for translation, there could be future conflicts due to literal misinterpretation. Perhaps data science studies can raise the alarm about the need to improve an underlying structure that lacks live updating of formal and informal dictionaries or demonstrates overfit models. That paper would fall under my Maslow’s needs sections on belonging or security, since miscommunication can fracture one’s sense of belonging and subsequently lead to a diminished sense of security.

I am currently reading a work called ‘The definite article in Spanish as a polysemous category’, published in the Journal of Cognitive Science in 2017; however, I acknowledge this is just for personal edification, since it falls more within the realm of Cognitive Linguistics than pure Data Science. I am also reading ‘Linguistic-based evaluation criteria to identify statistical machine translation errors’, since I am interested in statistical machine translation errors.

References:

  1. Sarkar, D. (2019). Text analytics with Python: A practitioner’s guide to natural language processing. Apress. https://doi.org/10.1007/978-1-4842-4354-1

Published: September 29, 2020
