Compiling a Latin American Corpus: Reflections

by Kenia Rodriguez and Vanessa López
Published August 2, 2018

The WE1S Research Blog posts discoveries, observations, and questions by project members bearing on WE1S's themes and methods. (For context, see "About" WE1S.)
This summer, the Latin America team completed a variety of tasks to create a foundational corpus covering Mexico, Central America, and South America. We sought to gather public discourse about the humanities by collecting online news articles. Because few Latin American news outlets publish in English, this region remains underrepresented in archives and databases such as LexisNexis. We collected data manually and assembled a corpus of articles and online publications using classic web searching methods. Although the data accumulation process was both time consuming and tedious, our methods surfaced a variety of unexplored content. After the subsequent development of effective web scraping tools, our corpus grew rapidly, providing WE1S with a wider range of Latin American publications. Even though we faced a number of challenges compiling useful data, we discovered vital information for future expansion of the WE1S corpus.

The evolving web scraping technologies employed throughout the summer research camp led to findings about the general corpus collection process for Latin America. Before any of our software developers created a web scraping tool, we gathered data manually. We searched for articles, copied their title and body, converted that information into plain text format, and stored it for future use. After the implementation of an early web scraping tool, we learned that many of our sources’ formats were difficult to navigate. For instance, some sites were missing a search bar, while others required a subscription for access. As for search term results, we found that “humanities” and “liberal arts” had fewer hits than “the arts,” a common theme in the WE1S corpora. This finding paved the way for our discussion about other terms that may be used in Latin America to refer to the humanities, such as “social sciences.” Even though we are still waiting for our developers to create an adaptable and consistent web scraping tool, our corpus now contains hundreds of Latin American articles. Despite this progress, our team faced a variety of challenges in trying to produce a large, representative corpus because of the WhatEvery1Says English language requirement.
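The manual steps described above (copy an article's title and body, convert them to plain text, and store the result) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool our developers built: the function names, the file-naming scheme, and the output folder are all hypothetical.

```python
import re
import unicodedata
from pathlib import Path


def to_plain_text(title: str, body: str) -> str:
    """Join a copied title and body into one normalized plain-text string."""
    text = f"{title.strip()}\n\n{body.strip()}"
    # Normalize Unicode so accented characters are stored consistently.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of spaces and tabs left over from copying a web page.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text


def slugify(title: str) -> str:
    """Build a simple ASCII file name from a title (hypothetical naming scheme)."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_title = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def save_article(title: str, body: str, out_dir: Path) -> Path:
    """Write one article as a UTF-8 .txt file and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{slugify(title)}.txt"
    path.write_text(to_plain_text(title, body), encoding="utf-8")
    return path
```

Calling `save_article(title, body, Path("corpus"))` for each article would leave one plain-text file per article in a `corpus` folder, which is roughly what we produced by hand.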

In most Latin American countries, English is a classed skill, accessible mostly to university-educated and middle-to-upper-class individuals. Although creoles and Indigenous languages are spoken daily in countries such as Honduras, Nicaragua, and Guatemala, Spanish is the official language of every Central American country except Belize. The same holds across most of South America, with exceptions such as Brazil, whose official language is Portuguese. This makes us question why the online English news sources were created and who their intended audience is. These sources appear to be aimed at a narrow group, since so few people in Latin America can actually read their news in English. We found that many of these sources are written by expatriates living in Latin America, and they reflect biases shaped by the writers' individual social location. Some sources clearly promote a world view that does not reflect that of the majority of the population. In our case, much of the online English news discourse about the humanities in Latin America does not reflect WhatEvery1Says, but what a select few are saying. Truly exploring the conversation as a whole would mean incorporating Spanish-language sources to reflect the majority’s views, which presents a further challenge to this project.

Another language-related challenge we faced is our limited number of search terms. Aside from the difficulty of achieving representativeness through the English language, our search terms (“humanities,” “liberal arts,” “the arts”) exclude some Latin American terminology relating to the humanities. After scraping multiple Latin American sources, we began to question why each source yielded so few articles compared with the corpora of other teams. For instance, when collecting from Amandala, a Belizean online news source, we found nine articles for “humanities,” 14 for “liberal arts,” and a total of 39 for “the arts.” The majority of these articles are irrelevant to our mission because they focus on different aspects of art, such as festivals, museums, and visual artists. Although these articles may be useful in the future, these results show that we must expand our search terms and create an ontology in order to compile a relevant Latin American corpus.

To expand our knowledge of the terminology used in discourse about the humanities in Latin America, we read “¿Quién piensa en las artes y las humanidades?” (Who Thinks about the Arts and the Humanities?), an online Colombian news article published in Semana in 2016. In this Spanish-language article, we find that the humanities are discussed alongside, or as, “ciencias sociales” (social sciences). This article suggests that one of the reasons we found such a small number of hits is our exclusion of relevant terms. In order to build a representative corpus, we must create a set of terms that reflects our understanding of the terminology used in Latin American discourse about the humanities.
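As a rough illustration of how an expanded term set could be checked against collected articles, the sketch below counts how many articles mention each term at least once. The three English terms are the ones we searched for, and the two Spanish terms are drawn from the Semana article. The list as a whole, and the function name, are assumptions made for the example, not an agreed WE1S ontology.

```python
import re

# Illustrative term set: our three English terms plus two Spanish terms
# taken from the Semana article. Not a finalized list.
SEARCH_TERMS = (
    "humanities",
    "liberal arts",
    "the arts",
    "humanidades",
    "ciencias sociales",
)


def count_articles_per_term(articles, terms=SEARCH_TERMS):
    """Return {term: number of articles that mention the term at least once}."""
    # Match whole words or phrases only, ignoring case.
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    counts = {term: 0 for term in terms}
    for text in articles:
        for term, pattern in patterns.items():
            if pattern.search(text):
                counts[term] += 1
    return counts
```

Counting articles rather than individual mentions mirrors the way we reported hits for each source. A count like this says nothing about relevance, so articles matching “the arts” would still need to be read and filtered.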

Beyond terminology, the article brought to light some of the social issues shaping the conversation surrounding higher education in Colombia. As in the United States, there is an ongoing debate throughout Latin America regarding the value of the humanities. On one end, there appears to be an inherent bias towards professions that are deemed most profitable, both for the individual and for the country as a whole. The humanities are often viewed as inferior to the hard sciences, with many promoting fields like engineering, chemistry, and mathematics as more valuable. Such fields are presumed to provide a competitive edge for success in the global market, which is seen as crucial to the progress of many developing nations. Scholars in the opposite camp insist that success should be measured both quantitatively and qualitatively. They argue that progress should not result in the death of the humanities, because the humanities promote understanding of social phenomena like power, culture, and artistic expression, and their decline could result in reduced critical thinking capacity. In their view, technological advancement should increase employment in STEM fields but also promote artistic and social theory, which can lead to an increase in economic and cultural capital in the international market.

Such conversations remind us of the importance of including commonly underrepresented, or third world, regions in our corpora. Discussion about the humanities will vary from region to region, but there are patterns to be discovered and differences to be discussed. As we learn from “¿Quién piensa en las artes y las humanidades?”, common Eurocentric opinions regarding the humanities are being emulated by a number of scholars around the world. Many ideas overlap, but it is important to study third world regions, such as Latin America, because their fight for development differs from our own. Analyzing such data will allow for a discussion about the standing of the humanities on an international level. To ensure that we do not miss any relevant data, we must broaden our searches and begin to analyze discourse about the humanities in languages other than English.