Scoping Representativeness

by Giorgina Paiella and Tyler Shoemaker
Published July 23, 2018

During the academic year 2017-18, the WE1S C-Hackers (short for “Corpus Research and Design Group”) at UC Santa Barbara and U Miami began to explore aspects of corpus collection for the WE1S main corpus. This scoping work identified key databases and publications that now serve as initial working materials for this summer’s Research Camp. Each RA selected an “area of focus,” for which they acted as a domain expert, acquainting themselves with the broadest topological features of their respective investigations – features like a country’s mediascape, a publication’s audience, source availability, and so forth.

These features had a tendency to turn into obstacles: the C-Hackers’ explorations quickly led them to confront their own situatedness in a global mediascape. Our team simply does not know what goes on in every corner of the world’s news, nor do we know, for example, how the organization of another country’s political parties differs from that of the US. A previous post explained that there is no a priori definition of representativeness for our project; defining representativeness for WE1S therefore also entails naming and reflecting on the partiality of our own shared knowledge about the world.

One obstacle that has been under much discussion is the projected corpus’s bias toward English-language sources. In part, this is an issue of pragmatics: our team is composed mostly of native English speakers, and thus we have a more thorough understanding of Anglophone publications native to the US. Furthermore, current topic modeling software only works on one language at a time, and so our project’s methods necessarily limit what we are able to analyze. But we cannot ignore the fact that this bias marks and shapes our investigations. Nor can we discount the continued influence of colonialist histories on proprietary databases, in which source availability often tends to fall along the lines of the colonizer and the colonized.

Patterns emerge across the C-Hacker Area of Focus Reports, articulating how these legacies pose difficulties for collecting meaningful metadata and scoping out viable sources for corpus inclusion. The five main areas that most acutely pose the problem of representation during WE1S corpus design and construction are below, along with reflections written by our C-Hackers as they collectively thought through these categories: Access and Availability, Adequate Metadata, Language Barriers, Audience, and Researcher Biases.

The C-Hackers’ Area of Focus Reports can be found here.
