Reflections on the Methodology of the WE1S Project

by Scott Kleinman, Abigail Droge, Lindsay Thomas, and Alan Liu
Published November 23, 2021
This blog post complements the article we published in Daedalus after the conclusion of the WE1S project’s Mellon Foundation grant in 2021: Alan Liu, Abigail Droge, Scott Kleinman, Lindsay Thomas, Dan C. Baciu, and Jeremy Douglass, “What Everyone Says: Public Perceptions of the Humanities in the Media” (Daedalus, 2022). The Daedalus article includes briefer reflections on the nature and challenges of our methods and corpus. Also see another article by project members from the same period that sums up our data and findings: Lindsay Thomas and Abigail Droge, “The Humanities in Public: A Computational Analysis of US National and Campus Newspapers” (Journal of Cultural Analytics, 2021). Earlier, while the project was in mid-stream, many other project participants—including graduate-student and undergraduate research assistants—posted reports and blog posts reflecting richly, critically, and meditatively on our methods and corpus. See our Reports and our Research Blog.

As we in the WE1S project bring the active stage of our work to a close, we wanted to reflect a bit on some of the methods we employed in the project. In the discussion below, we provide links to many of our Key Methods and Key Findings cards where more specific details can be found.

Topic Modeling and Distant Reading

Topic modeling is one of many valuable methods for performing text analysis at scale, and it has both strengths and weaknesses. Topic modeling employs a so-called “bag of words” method to examine vocabulary co-occurrence across texts. For example, a topic about economics might be statistically heavier than one about the humanities in a particular collection of documents. It also shows which specific documents participate strongly in a topic (or in several topics at once, since an article mentioning “London” might talk partly about politics but also partly about economics, London being a financial capital as well). However, topic modeling does not account for the syntactic context in which words occur. For some purposes, we thus complemented our topic modeling with a keyphrase extraction algorithm that identifies prominent words within articles while taking the surrounding linguistic context into account. Assembling the most frequently occurring keyphrases across a collection of texts gives us a good basis for comparison with the topic model. In some cases, we additionally used the Wilcoxon rank-sum test, which compares two sets of texts on the basis of their word frequencies and can be used to identify the most “significant” words in a set of texts. These supplementary methods sometimes provided additional insights, but they also tended to reinforce what we found in our main topic models. Such guidance is an improvement on cherry-picking examples anecdotally or on the strength of a researcher’s preconceived “hunch” that a document might bear out a thesis.
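To make the “bag of words” approach concrete, here is a minimal sketch of topic modeling in Python using scikit-learn’s LDA implementation. The toy documents and parameters are ours for illustration only; they are not the project’s actual pipeline, corpus, or modeling toolchain.

```python
# A toy "bag of words" topic model using scikit-learn. Illustrative only:
# WE1S's own pipeline, corpus, and parameters differed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "The humanities program hosted a community event on campus.",
    "Stock markets in London reacted to the new economic policy.",
    "Students discussed literature and history at the museum exhibit.",
]

# Bag of words: word order and syntax are discarded; only counts remain.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Top words per topic, then each document's mixture of topics.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:5]]
    print(f"Topic {i}: {top}")
print(lda.transform(X))  # rows: documents; columns: topic proportions
```

Note that nothing in this representation records word order or syntax, which is exactly the limitation the keyphrase extraction step described above is meant to offset.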

Of course, statistical models of linguistic and cultural data present many difficulties for drawing valid, meaningful conclusions—conundrums that are the subject of active investigation in the digital humanities and elsewhere (such as in “interpretability” and “explainability” studies in computer science). One problem we faced is the uncertain epistemological relationship between the patterns identified by our topic models (and other statistical methods) and society’s views about the humanities, which were our ultimate object of study. We take journalism to be one important proxy for the social discourses that construct, reify, and/or reflect such views. And we take the patterns we find in the media to be a proxy of that proxy. Where specific vocabulary occurs with notably high frequency or prominence in the media, we assume that it provides evidence of attitudes about the humanities that journalism reflects or instills in readers’ minds, whether on the basis of pre-existing cultural norms or through the influence of the media itself. But the conventions for inferring a society’s “mentality” from its discourse (as it was called in histoire des mentalités) have evolved rapidly in the social and cultural disciplines in recent decades, and machine-learning methods add further change, and thus uncertainty, to such conventions—for example, to our understanding of how machine-learned “topics” relate to ancient and traditional understandings of “topoi” or commonplaces. To make transparent how we combined machine learning and human reading to reach conclusions, we developed a Topic Model Interpretation Protocol, which guides researchers through standard observation waypoints and tool use, and which—in the mode of “grounded theory” in the social sciences—also lays out standard steps for taking notes and evolving them into conclusions. This approach allowed us to follow up “distant reading” with confidence by using humanistic “close reading” to study specific, representative texts. We also hope that the Protocol can be adopted (and refined) by others as a transparent method for interpreting the results of complex topic models.

Corpus Collection, Representation, and Absences

We also had reservations about the uncertain ethical status of any large corpus of materials (and of data derived from them) whose aggregation is controlled by organizations with different standards of selection, access, and transparency—driven, ultimately, by a different purpose—than those of scholars. There is much discussion today about the use of unrepresentative or biased datasets in machine learning. The specific aspect of this problem for our project is that our effort to study what “everyone” says about the humanities was constrained by our necessary reliance on the proprietary databases from which we gathered most of our materials at scale. Database licensing stipulations and copyright restrictions, such as those associated with LexisNexis (our main data source), limited what materials we could share publicly. But we were also hindered by the fact that such databases provide little transparency about what materials are included or excluded and why, how complete those materials are, and what low-level filtering and other features have been pre-set. LexisNexis’s holdings also include relatively few newspapers specializing in coverage of, or for, particular gender, racial, ethnic, or sexual identity communities. We thus turned to ProQuest’s Ethnic NewsWatch and GenderWatch databases to facilitate the study of such groups as part of the “everyone” included in “what everyone says.” However, the relatively high level of manually assisted processing required to access these materials for our purposes meant that we could compile only a comparatively small set of texts for topic modeling. In other words, corporate decisions about what sources are made available in digital form, and in what algorithmically accessible ways, distort what gets counted as “what everyone says” in the media or elsewhere (see Giorgina Paiella’s blog post “Thoughts on Diversity in the Archive” for further discussion).

Other problems might be mentioned as well. The amount of data held in such databases—and on the internet writ large—is vast, so for pragmatic purposes we needed to limit the scope of our data collection in some way. We chose to focus on documents containing the word “humanities” (and a few related key terms), a selection criterion we thought likely to capture documents containing direct or indirect discourse about the humanities. However, we may rightfully question whether relying on a narrow set of search terms to collect materials distorts the baseline of our observations. Likewise, in the “post-print” era we may well ask whether text-based journalism still functions in the same way as either an influencer or a follower of social attitudes towards the humanities (or towards anything else). Further discussion of this point can be found in Ryan Leach’s report “Media Impact”. We have begun to address these issues by also collecting material from social media and television broadcast news. But we thought we needed to start somewhere, and we did so by gathering what materials we could in as thoughtful a way as we were able (prepared for by “Area of Focus” and “Scoping Reports” through which we educated ourselves about journalism around the world and about the problems of achieving a representative corpus). We acknowledge the limitations of our materials and hope that future researchers will add more materials for such a study as well as refine our methods, data, and analysis.
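In outline, the selection step amounts to a simple keyword filter of the kind sketched below. The term list, helper function, and sample documents here are hypothetical stand-ins: our actual queries were run against the search interfaces of the vendor databases rather than over local text.

```python
import re

# Hypothetical key-term filter of the kind described above. The term list
# and documents are placeholders; WE1S's actual queries were run against
# the search interfaces of the vendor databases.
KEY_TERMS = re.compile(r"\b(humanities|liberal arts)\b", re.IGNORECASE)

def select_documents(documents):
    """Keep only documents containing at least one key term."""
    return [doc for doc in documents if KEY_TERMS.search(doc)]

sample = [
    "The college announced a new humanities initiative.",
    "Quarterly earnings exceeded analyst expectations.",
    "A liberal arts education prepares students broadly.",
]
print(select_documents(sample))  # keeps the first and third documents
```

As the sketch makes obvious, everything that never mentions a key term drops out of view, which is precisely why the baseline question raised above matters.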

Even with such a “targeted” corpus, we find that the humanities occupy a relatively small share of the overall discourse. Nevertheless, we found ample evidence of meaningful engagements with the humanities in our corpus. Among our significant findings is that the humanities are embedded in the public’s day-to-day experiences through cultural events and museum exhibits and, on campus, through the daily activities of students. For instance, a Wilcoxon test on Collection 1 showed that in articles containing the word “humanities” collected from college, local, and regional newspapers, the most distinctive vocabulary includes “experience,” “community,” “opportunity,” and “event,” alongside verbs that index a host of routine pedagogical, collaborative, and community-oriented practices, such as “learn,” “share,” “create,” “speak,” “talk,” “hear,” “read,” “enjoy,” “thinking,” “engage,” “discussed,” and “remember.” Similarly, keyphrase extraction from student newspapers shows “event” (and phrases containing “event”) to be extremely prominent in articles containing “humanities” or “liberal arts.” The words “community” and “experience” are also highly ranked and occur in a number of compound phrases such as “LGBT community” and “learning experience.” These results can be found in the daedalus/wilcoxon-tests/top-circulation and daedalus/keyphrase-extraction/universitywire folders, respectively, at https://zenodo.org/record/5711303.
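For readers unfamiliar with the test, the sketch below shows how a per-word Wilcoxon rank-sum comparison of two corpora can be implemented in Python with SciPy: for each word, compare its per-document relative frequencies across the two sets and rank words by the resulting statistic. The corpora and tokenization here are invented toy data, not our collections or preprocessing.

```python
# Toy illustration of a per-word Wilcoxon rank-sum comparison between two
# corpora (here, two lists of pre-tokenized documents). The data are
# invented; WE1S's actual collections and preprocessing differed.
from collections import Counter
from scipy.stats import ranksums

corpus_a = [["community", "event", "campus", "learn"],
            ["event", "students", "learn", "share"]]
corpus_b = [["market", "economy", "policy", "trade"],
            ["economy", "trade", "event", "policy"]]

def rel_freqs(corpus, word):
    """Relative frequency of `word` in each document of `corpus`."""
    return [Counter(doc)[word] / len(doc) for doc in corpus]

vocab = {w for doc in corpus_a + corpus_b for w in doc}
scores = {}
for word in vocab:
    stat, p_value = ranksums(rel_freqs(corpus_a, word),
                             rel_freqs(corpus_b, word))
    scores[word] = stat  # positive values: more distinctive of corpus_a

# Words most distinctive of corpus_a, by test statistic.
for word, stat in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(word, round(stat, 2))
```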

The humanities have been at the forefront of much research and teaching about race, ethnicity, gender, and sexuality, to name just a few concepts that concern the histories, lived experiences, and cultures of various intersecting groups of people in the U.S. Yet we found it striking that academic discourse about race was not strongly represented in our corpus. While our analyses suggest that discussions of race are the most clearly shared point of connection between humanities research and the humanities in journalistic media, such media discussions do not tend to recognize the intellectual underpinning provided by humanities scholarship. (We should emphasize that the dates of our corpus collection do not include the past few years, in which Critical Race Theory has emerged as a prominent subject in the media.) For instance, in Goldstone and Underwood’s model of literary journals, Topic 26, which contains vocabulary related to race and racism, shows an increase in prominence during the publication dates covered by our collection of student newspapers. However, beyond basic terminology like “race,” “black,” and “white,” there is little shared vocabulary between that topic in the model of scholarly journals and the highly weighted vocabulary in Topic 39 of our student newspaper model (Collection 14), or the most frequently occurring keyphrases extracted from this collection (available in the daedalus/keyphrase-extraction/universitywire folder at https://zenodo.org/record/5711303). Overall, the impression we get is that scholarly research in the humanities rarely intersects with media discussions of race, ethnicity, gender, or sexuality in our corpus (see our Key Finding cards KF-3-1 and KF-3-5 for further discussion).

This remains true even if we examine sources about or for particular racial, ethnic, gender, or sexual identity groups (for example, The Gay and Lesbian Review, or The Philadelphia Tribune, the oldest continuously published African-American newspaper in the U.S.). When we look at our Collection 15 (a relatively small collection of articles from such sources gathered from ProQuest’s Ethnic NewsWatch and GenderWatch databases), we see that the humanities, or specific concepts drawing on humanities research, are rarely overtly discussed as such. Rather, awareness of the humanities is dispersed through discussions of the lived experiences of members of these communities and communicated implicitly or indirectly through reviews of scholarly and trade books, film reviews, and interviews. When we compare Collection 15 to collections sampled from top-circulation newspapers using a Wilcoxon test, we see that articles from its sources are characterized by a broad range of language about cultural diversity (“identity,” “diverse,” “culture”), civic and national action (“justice,” “rights,” “challenges”), storytelling (“literature,” “stories,” “books”), and intellectual pursuit (“scholarship,” “learning,” “understanding,” “knowledge”). Words distinctive to articles from these publications also convey a sense of moving through time, whether in respecting past action (“honored,” “legacy,” “recognized,” “award”), emphasizing personal or family experiences (“generation,” “family,” “experiences,” “life,” “youth,” “love”), or looking to the future (“goal,” “mission,” “opportunity”). Importantly, the top word associated with these sources is “community.” These findings complement and build on other project members’ analyses of the values most salient in relation to particular groups. See, for example, Susan (Su) Burtner and Giorgina Paiella, “Mapping HSIs, HBCUs, Women’s Colleges, and Tribal Colleges” and “Word Embeddings of College and University Mission Statements: Preliminary Findings”.

We also found that representations of the specific content and subject matter of humanities research are strikingly absent from our corpus. In our keyphrase extraction experiments, names of historical or literary figures, historical events, and even the cultural movements that delineate various subfields within the humanities are not sufficiently prominent to suggest that they form a significant part of the media discourse about the humanities. The one exception involves theatrical productions and other events associated with the arts, such as museum exhibitions, musical performances, and the like. For example, Shakespeare and the titles of his plays occur with some frequency in our corpus but are more likely to be found in references to dramatic performances than in references to academic study (see, for instance, Topic 82 in our 100-topic model of Collection 14). Further, though humanities practices such as events are mentioned frequently both in student newspapers and in publications for the wider community, we discovered that there is strikingly little shared vocabulary between these contexts and the language found, for example, in the scholarly works studied computationally in Andrew Goldstone and Ted Underwood’s “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us,” New Literary History 45.3 (2014): 359-84. (Points of connection do occur, however, notably around the categories of “race” and “war,” which are prominent in the keyphrases extracted from our collections as well as in Goldstone and Underwood’s model. In addition, keyphrases extracted from mainstream newspapers in our Collection 4 of top-circulation newspapers notably highlight method-oriented words like “development,” “discovery,” and “interpretation,” which show a keen, if general, understanding of the bases of scholarly activity in the humanities. A list of shared terms between our keyphrases and Goldstone and Underwood’s model can be found in the daedalus/keyphrase-extraction/top-circulation folder at https://zenodo.org/record/5711303.) Other project members have noted further absences in our corpus, calling attention to under-discussed humanities disciplines (KF-7-2), a lack of awareness of the labor dynamics of humanities personnel (KF-4-11), and the absence of media coverage of funding for humanities teaching (KF-7-4), particularly in relation to primary and secondary education (KF-7-3).
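Mechanically, the shared-terms comparison mentioned above reduces to a set intersection between the vocabulary of our extracted keyphrases and a topic’s top words. A minimal sketch follows, with invented word lists standing in for the real model outputs:

```python
# Sketch of a shared-vocabulary check between extracted keyphrases and a
# topic's top words. Both lists are invented placeholders, not actual
# WE1S or Goldstone-Underwood model outputs.
keyphrases = ["liberal arts", "community event", "racial justice",
              "war memorial"]
topic_top_words = {"race", "war", "black", "white", "history", "racism"}

# Split multi-word keyphrases into component terms before comparing.
keyphrase_vocab = {word for phrase in keyphrases for word in phrase.split()}
print(sorted(keyphrase_vocab & topic_top_words))  # -> ['war']
```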

Takeaways

These observations—and the challenges we faced in collecting and analyzing the data that produced them—lead us to ask what role the digital humanities might play in changing how we access and account for “what everyone says” about the humanities. What is needed are databases created with a firm grounding in humanistic principles and methods and attentive to questions of diversity and representation. This need grows when we consider the tools and methods required to address global contexts for the humanities, as laid out in Rebecca Baker’s “The Global Humanities and the ‘Crisis’ Therein”. Such work requires the interdisciplinary nuance, collaboration, and communication that arise when data scientists are well versed in the humanities and humanists are skilled in data science practices.

WE1S is part of a recent wave of digital, data science, and machine-learning humanities research that combines humanistic, scientific, and social science methods, and we close by suggesting its importance as a methodological example—one which, if extended and integrated into higher education more broadly, has the potential to address some of the problems that we have detailed here. WE1S has been a collaborative endeavor across three institutions, both public and private, involving researchers from multiple disciplines and all levels of the university, from undergraduates to senior faculty. In our media corpus, we saw that conversations about such collaborative research are most often associated with the sciences rather than with the humanities. Words like “team” and “group” align with scientific topics in our models; see, for example, Topic 65 in our 100-topic model of Collection 33. Importantly, however, discourse surrounding the digital humanities in our corpus expresses enthusiasm about the promise of new methods and working styles. As some articles put it, digital projects and resources can cross seemingly “uncrossable boundaries” and “create new opportunities for public scholarship and allow humanities researchers to work together in unprecedented ways.” Examples of such crossover can be found associated with Topic 46 of the same model: for instance, Tony Moore, “Visualize: Students Bring Data to Life through the Digital Boot Camp,” The Dickinsonian: Dickinson College, February 11, 2015, and Catherine Goldsmith, “Olin Digital CoLab to Bridge Gap Between Humanities and Technology,” Cornell Daily Sun: Cornell University, January 29, 2017.

The structure and infrastructure of our project are notable as an example of spanning humanistic and scientific paradigms. An advantage of crossing methodologically between the humanities and scientific and social-scientific research is the potential such an approach has for spanning scales of inquiry—something that, as some of our project findings suggest, the humanities have difficulty showing they can do. Digital humanities methods demonstrate back-and-forth practices of observing texts at scale and delving deeply into specific texts, practices that can potentially update for today’s challenges somewhat older humanities (and “human sciences”) methods of negotiating between micro and macro scales (historicism, formalism, structuralism, cultural criticism, and so on). In this regard, the WE1S Topic Model Interpretation Protocol mentioned earlier is a practical example of a scale-spanning method. The Protocol (essentially a standard questionnaire that our researchers fill out when studying a topic model) systematically leads investigators through multiple scales of observation and analysis to synthesize findings. Its modules bear titles such as “Take an Overview of a Topic Model,” “Analyze a Keyword,” “Analyze a Topic,” and “Compare Sets of Topics (Multiple Topics).” Also relevant in this context are the explanatory “cards”—inspired by forms of scientific and data science reporting—that we have repeatedly referenced above. Our cards act in concert with our larger reports to combine humanistic and scientific communication forms in a mixing of scales of knowledge that adapts to different public and academic audiences.

Such methodological crossing is already underway in academic circles. Consider, for example, the “cultural analytics” research in the digital humanities represented by the Journal of Cultural Analytics, started in 2016, or research in a similar mode conducted in Peter de Bolla’s The Architecture of Concepts: The Historical Formation of Human Rights (2013) and the Cambridge Concept Lab that he led from 2014 to 2018. In much the same spirit, those of us who worked on WE1S would like to foster future speculation about how our methods might bring the liberal arts and human sciences into conjunction with today’s most recent pan-disciplinary formation of knowledge: data science. Like the liberal arts, after all, data science now potentially spans all domain areas and disciplines, whether those of the sciences in the classical quadrivium or of the humanities in the classical trivium. Yet the relations of similarity and difference between the liberal arts and data science as formations of knowledge have yet to be mapped. That is one intellectual problem that will need to be thought through as we move forward to meet the challenges of our time.