These features tended to turn into obstacles: the C-Hackers’ explorations quickly led them to confront their own situatedness in a global mediascape. Our team simply does not know what goes on in every corner of the world’s news, nor do we know, for example, how the organization of a country’s political parties differs from that of the US. A previous post explained that there is no a priori definition of representativeness for our project; defining what representativeness means for WE1S therefore also entails naming and reflecting on the partiality of our own shared knowledge about the world.
One obstacle that has been much discussed is the projected corpus’s bias toward English-language sources. In part, this is an issue of pragmatics: our team is composed mostly of native English speakers, and thus we have a more thorough understanding of Anglophone publications native to the US. Furthermore, current topic modeling software works on only one language at a time, and so our project’s methods necessarily limit what we are able to analyze. But we cannot ignore the fact that this bias marks and shapes our investigations. Nor can we discount the continued influence of colonialist histories on proprietary databases, in which source availability often falls along the lines of the colonizer and the colonized.
Patterns emerge across the C-Hacker Area of Focus Reports, articulating how these legacies pose difficulties for collecting meaningful metadata and scoping out viable sources for corpus inclusion. The five areas that most acutely pose the problem of representation during WE1S corpus design and construction are listed below, along with reflections written by our C-Hackers as they collectively thought through these categories: Access and Availability, Adequate Metadata, Language Barriers, Audience, and Researcher Biases.
The C-Hackers’ Area of Focus Reports can be found here.