Our recognition of biases and gaps in research on diverse populations began when we noted, for example, the problem of global media conglomerates’ control of both national and local news sources and the resulting homogenization of news content. While harder to identify in topic modeling, large media conglomerates (such as the Gannett Company, owner of USA Today) publish newspapers across the globe, duplicating content in subtle ways across local and national news sources. This means the sources we seek out as supposedly representing diverse populations may be owned by companies with both national and global interests. It also means that adapted content may escape the de-duplication process in our topic models and affect our outcomes in unknown but significant ways.
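To illustrate why lightly adapted syndicated copy can slip past de-duplication: the following is a minimal sketch (not the actual WE1S pipeline; the sentences are invented examples) showing that exact-match comparison treats two near-identical articles as distinct, while a simple word-shingle Jaccard similarity flags them as near-duplicates.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word n-grams) in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# Hypothetical national wire story and its lightly localized reprint.
national = "The humanities face steep budget cuts at state universities this fall."
local = "The humanities face steep budget cuts at area universities this fall."

# Exact de-duplication sees two distinct documents...
print(national == local)  # False

# ...but shingle overlap reveals they are near-duplicates.
sim = jaccard(shingles(national), shingles(local))
print(sim >= 0.5)  # True (one changed word yields similarity of exactly 0.5 here)
```

Production de-duplication systems use more robust variants of this idea (e.g., MinHash over shingles), but the underlying trade-off is the same: the looser the similarity threshold, the more adapted wire content is caught, at the risk of conflating genuinely distinct local coverage.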
In addition, our research consistently demonstrated a historical and contemporary lack of available materials and accessibility for news outlets of underrepresented and minority groups. When considering the historical media archive of news, we are limited not only by technology but also by the availability of sources themselves, such as the lack of historical record and media footprint for minority groups, including African Americans and Asian Americans. As Giorgina Paiella indicates in her previous WE1S research post, there is even a problematic lack of “representativeness” built into the LexisNexis database itself, an all too common problem when dealing with issues of inclusion and exclusion and larger institutional, economic, and university structures. And while we have access to a larger corpus of contemporary resources (primarily from 1980 on), they still constitute a culturally exclusive collection of data. This highlights an ongoing gap in resources, which makes it difficult to obtain and analyze an inclusive, historically representative narrative of “humanities” as a discourse.
Finally, we also noted the absence of conversation itself, which can be both a noteworthy line of inquiry (to consider what everyone is not saying) and a problem (we don’t have access to what everyone is saying). For example, diverse news outlets accessible through LexisNexis, such as Ebony and Out and About, returned few or no results for multi-year query searches using humanities keywords (including “liberal arts” and “the arts”). In contrast, national U.S. news outlets, like The Los Angeles Times and The New York Times, returned many results for each of the keywords. Though it is difficult to determine exactly why these discussions are missing from our datasets, we can ask whether and how various groups use distinct languages and terms to discuss the humanities. We can also consider whether the sources in LexisNexis are perhaps the most accessible, but not the most representative of what these communities are talking about in relation to the humanities. As Giorgina mentions in “Thoughts on Diversity in the Archive,” the LexisNexis database “also allows us to take stock of what everyone is not saying about the humanities.” Considering this, the sources that we do have, and the scarcity of results they return, say something important about the accessibility of humanities discourses within diverse populations and how this affects our understanding of WE1S research.
After encountering these gaps and biases, the CSUN members representing the diverse populations group recognized the unique demands of representing diverse populations in machine learning research. We must consider issues of accessibility, but also our methods of representation, given the diversity within the diverse communities themselves, and our methods of advocacy. Improving the representation of diversity would involve, for example, topic modeling multiple journals representing a single community (such as pairing Ebony, Black Enterprise, and a regional historical black newspaper to represent a diverse black American identity) while also modeling single newspapers or journals. This will, at the very least, give us insight into the uneven media landscape of diverse populations and could help close the gap between what everyone is saying about diverse populations and what diverse peoples are actually saying. We also need to balance access to larger databases (such as LexisNexis and ProQuest) with more time-consuming data collection, such as web scraping, that would expand our representativeness for a richer corpus.
Furthering our goals of representativeness, we may also study the absence of conversation itself as a noteworthy line of inquiry (considering what everyone is not saying), while topic modeling the more easily accessible conversations about diverse groups within major news outlets. In short, what can we do with the limited tools we have to create more representative outputs? We could, for example, topic model different national newspapers (or retrieve topic models from the North American team) to reveal topics that associate different diverse populations with the humanities; topic model global and national journals linked by identity (such as a national newspaper of Mexico and a Mexican American newspaper); topic model the terms “diversity” or “diverse” within a corpus of articles containing the word “humanities”; and topic model specific community identity terms within a “humanities” corpus.
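The corpus-subsetting step shared by several of these proposed experiments can be sketched simply. The following is a hypothetical, simplified stand-in (the three-document corpus and all counts are invented, and full experiments would use a topic modeling tool rather than raw term counts): it subsets a corpus to articles mentioning “humanities” and then tallies diversity-related terms within that subset.

```python
# Sketch of the corpus-subsetting step for the proposed experiments.
# A real run would draw thousands of articles from LexisNexis or
# ProQuest exports and feed the subset to a topic modeling tool.
import re
from collections import Counter

# Hypothetical mini-corpus of article texts.
corpus = [
    "State funding for the humanities and liberal arts is shrinking.",
    "A new sports arena opens downtown next spring.",
    "Diverse voices in the humanities enrich campus life and diversity.",
]

def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Step 1: subset to documents containing the word "humanities".
subset = [doc for doc in corpus if "humanities" in tokenize(doc)]

# Step 2: count target identity/diversity terms within the subset.
counts = Counter(tok for doc in subset for tok in tokenize(doc))
print(len(subset))          # 2
print(counts["diversity"])  # 1
print(counts["diverse"])    # 1
```

The same two-step shape (filter by a seed keyword, then analyze the subset) applies whether the downstream analysis is raw counts, as here, or a full topic model over the filtered articles.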
For our future endeavors, in addition to experimenting with sources, applications, and methods, we may also work on advocating for greater inclusion and access to diverse sources, from databases to digitization projects. In particular, raising awareness about diversity in the archives and what is missing in the databases seems integral to the philosophy of WE1S: making sure that everyone understands what they’re finding or not finding when they search for what everyone says.