We also bring to attention below some of the methodological problems we encountered, which range from the theoretical to the technical. As researchers, we know that no method is worth its salt (to use an old idiom) unless it comes with challenging problems. (Remembering the technical and ethical problems of slave-manned salt mining going back in history to Roman times and earlier, when salt was so highly valued as to be the basis of the word “salary,” is a good caution for “data mining” today!)
WE1S uses digital humanities methods to conduct its research. “DH” draws on computer science, information science, machine learning, text analysis, visualization, and other methods covered in our Bibliography. We’ve created 1-page explanation “cards” in plain language for many of these methods, each of which includes links to further information. Some cards also point out important issues and limitations in methods.
The primary method we use to study our collections of textual materials is “topic modeling.” Topic modeling is a leading method of “unsupervised” machine learning that discovers “topics” in texts by analyzing the statistical co-occurrence of words. It finds out which words tend to come up together in a document set (and in individual texts) when people discuss something or, as in newspapers, many things. Co-occurring words suggest “topics.” Topic modeling not only identifies topics but also indicates their relative weights. It also shows which specific documents participate strongly in a topic (or several topics at once). Topic models thus not only help with “distant reading” large collections but also guide researchers to specific documents to “close read” because of their statistical association with a topic.
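As a rough illustration of the raw signal that topic modeling builds on, the sketch below counts how often pairs of words appear together in the same document. It is not a topic model (which fits a probabilistic model to the whole collection) and is not WE1S's code; the function name `cooccurrence_counts` and the toy documents are invented for this example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents, stopwords=frozenset()):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for text in documents:
        # Use the set of distinct words so a pair counts once per document.
        words = {w for w in text.lower().split() if w not in stopwords}
        # Sorting gives each pair a stable (alphabetical) order as a key.
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return counts

# Toy documents, invented for illustration only.
docs = [
    "humanities students read literature",
    "science students run experiments",
    "humanities scholars read literature closely",
]
counts = cooccurrence_counts(docs)

# Pairs that recur across documents hint at a shared "topic".
for pair, n in counts.most_common(3):
    print(pair, n)
```

In this toy set, "humanities," "literature," and "read" keep turning up together, while "experiments" never appears alongside them. A topic model generalizes this kind of pattern across thousands of documents and assigns weights to both words and documents.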
Because complex data analysis can have a “black box” effect, researchers using machine-learning methods need not just to document technical workflows but also to make the steps in such workflows humanly understandable. (See WE1S Bibliography of Interpretability in Machine Learning and Topic Model Interpretation.) WE1S has developed a topic-model interpretation protocol that declares standard instructions and observation steps for researchers using topic models. Our Interpretation Protocol is customized for our materials and tools (e.g., advising our researchers which visualization interfaces in our Topic Model Observatory to use in observing parts of a topic model and what to record in their notes). We publish our protocol because it sets a paradigm. We hope that it will be forked, evolved, and adapted by other projects to create a shared, but customizable, practice of open, reproducible digital humanities research.
Besides topic modeling, we also used other machine learning methods to assist us in understanding our collections. Deployed for special purposes, these included text classification and word embedding.
Using machine-learning methods to “distant read” our materials, however, does not mean that we neglect to “close read,” that is, to apply the method of closely analyzing words, metaphors, style, and structure in texts that humanities scholars, especially in literary studies, developed into a distinctive modern method of textual study beginning in the 1920s. Among other purposes, our project uses topic modeling to identify specific articles in our collections that are highly associated statistically with topics of interest in addressing a research question. An example is topics in which the co-occurrence of words related to the sciences and humanities helps us grasp how the media discusses the two areas together. This makes the best of both distant and close reading. Machine learning steers us to particular texts to pay attention to (correcting for the anecdotal, intuition-based, or canonical selection principles that traditionally guided close reading); and our human readers, including graduate students extensively trained in literary or other humanities fields, then read those texts closely.
Using social-science “grounded theory” methods for the human annotation of research materials, we also “hand coded” (annotated and labeled) selected sources and articles in our collections of media in order to create what are essentially small-scale, human-made topic models of what materials are “about.” We used these annotations to help us understand better what the topics in our machine-learning models are “about,” and also to guide us in assembling appropriate subsets of materials for modeling to address specific research questions. We also used a similar method of hand coding to add to the metadata in our database so that we could understand our materials with a more fine-grained view, for example, of which topics in a topic model are prevalent in mainstream news publications compared to student newspapers.
To help us better understand the wider media discourse related to the humanities by seeing it from various specific perspectives—for example, from the viewpoint of students (including that of our project institutions’ many first-generation college or first-generation immigrant students)—we complemented our study of the media with localized “human subjects” research. We conducted surveys and focus groups at two of our project’s home institutions—U. California Santa Barbara, and U. Miami—to see how actual people around us viewed and experienced the humanities. While localized and small-scale, this research aided us in honing the research questions we addressed with our primary materials and methods and in grasping the results.
Interesting Methodological Problems
The following are the most important questions that posed problems for us, for methodological reasons intermixed with technical and other challenges:
We carefully scoped the “representative” corpus of journalistic and related materials we ideally wanted to gather by researching media sources in many nations and regions (see our Area of Focus reports). We also critically examined the notion of a “representative” corpus itself by comparing it to such concepts as “canon,” “edition,” and “corpus linguistics” in relation to the history and impact of newspapers (see our Scoping Research reports). But our ideal representative corpus met the hard reality that our primary means of collecting journalistic materials at scale was through LexisNexis and other databases. The provenance and balance of materials in such aggregators are outside our control and largely “black boxed.”
Originally, we wanted to gather representative English-language media from both the U.S. and much of the world for study in relation to the humanities. (Indeed, we began collecting a significant amount of materials from the United Kingdom and some other nations.) However, the scale involved in collecting and analyzing materials worldwide became unrealistic. In addition, we realized that even though our researchers included graduate students, postdoctoral scholars, and others originally from other nations, we could not as a group acquire an adequate understanding of the social, political, and cultural contexts of media in other nations. For example, we were not confident we could grasp the mission and audience of Anglophone newspapers in nations where English is not the native language. Other factors complicated researching our issues in other nations. For example, in the U.K. and many Commonwealth nations, “humanities” is a relatively recent Americanism for what was earlier, and often still predominantly, referred to as “the arts.” (See research we conducted on U.S. versus U.K. terms for the humanities early in our project.) In the end, our Advisory Board suggested that we concentrate on the U.S. to do something smaller, but better. We invite others to use our methods and tools to extend our project to other nations and languages.
One of our goals was to research how underrepresented social groups are positioned—and position themselves—in the media in relation to the humanities. In working carefully on this issue, we discovered difficulties on several fronts. One is that the amount of discussion of such groups in relation to the humanities is statistically tiny compared, for example, to discussion of women or people from underrepresented races and ethnicities in the sciences. Another is that the hyphenated or bigram names often used for underrepresented groups over the past several decades (e.g., “African-American,” “Asian-American,” “first-generation”) are invisible from the point of view of computers engaged in natural language processing and unsupervised machine-learning unless researchers intervene selectively in advance to decide which such names to treat as single “tokens” without neglecting other bigrams or similar semantic constructions that unforeseeably bear on the problem. (See our card on “Word Order Matters.”)
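The tokenization problem described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the WE1S preprocessing pipeline: the list `PROTECTED_TERMS` and the function `tokenize` are hypothetical, and a real list would have to be chosen by researchers in advance, which is exactly the difficulty.

```python
import re

# Hypothetical list of multiword names a researcher has chosen to protect.
PROTECTED_TERMS = ["african-american", "asian-american", "first-generation"]

def tokenize(text, protected_terms=PROTECTED_TERMS):
    """Lowercase and tokenize, keeping protected terms as single tokens."""
    text = text.lower()
    for term in protected_terms:
        # Accept either a hyphen or whitespace between the parts of the term.
        parts = re.split(r"[-\s]+", term)
        pattern = r"\b" + r"[-\s]+".join(map(re.escape, parts)) + r"\b"
        # Join the parts with an underscore so they survive tokenization.
        text = re.sub(pattern, "_".join(parts), text)
    # Tokens are runs of letters, digits, or underscores; hyphens split words.
    return re.findall(r"[a-z0-9_]+", text)

sentence = "First-generation students and African American writers"
print(tokenize(sentence, protected_terms=[]))  # names fall apart into parts
print(tokenize(sentence))                      # names kept as single tokens
```

Without the protected list, “first-generation” becomes the unrelated tokens “first” and “generation,” and the group name disappears from the model's view. Any bigram left off the list is silently split in the same way.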
A significant proportion of the journalistic media that is published (and then aggregated in databases such as LexisNexis) consists of what a human reader would consider duplicate articles. For example, newspapers publish articles in separate editions for different print or online markets, in sequences of original and corrected versions, and, for event listings or similar material, in gradually changing text (e.g., a few changed listings at a time). News services such as the Associated Press, whose material many newspapers include, complicate the problem further because their material can be inserted in varying ways. (For instance, a newspaper can use all or only part of an AP article.) We developed a “de-duplication” algorithm to remove duplicate articles from our study corpus with “fuzziness” settings we could set to determine when differences between texts are minor enough to treat texts as the “same” and when differences are substantial enough to treat texts as “different.” But deciding what is meaningfully the same versus different is a difficult theoretical and practical challenge.
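To make the role of a “fuzziness” setting concrete, here is a minimal sketch of threshold-based de-duplication using Python's standard `difflib` module. It is not the WE1S algorithm, whose details are not given here; the function names, the `threshold` parameter, and the toy articles are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-1 similarity ratio between two texts, compared word by word."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def deduplicate(articles, threshold=0.9):
    """Keep the first of any group of articles whose similarity meets the threshold."""
    kept = []
    for text in articles:
        # Keep a text only if it is sufficiently different from all kept texts.
        if all(similarity(text, other) < threshold for other in kept):
            kept.append(text)
    return kept

# Toy articles, invented for illustration: an original, a corrected
# version differing by one word, and an unrelated story.
articles = [
    "The university announced a new humanities program on Monday",
    "The university announced a new humanities program on Tuesday",
    "Local team wins the championship game in overtime",
]

print(len(deduplicate(articles, threshold=0.8)))   # looser setting
print(len(deduplicate(articles, threshold=0.95)))  # stricter setting
```

The first two toy articles share eight of nine words, so their ratio is about 0.89. A threshold of 0.8 treats them as the “same” article, while 0.95 keeps both as “different.” No setting resolves the underlying question of which judgment is right. This pairwise comparison is also slow for large collections, where a production system would need a more scalable technique.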
Our primary machine-learning method, topic modeling, is very fussy in two main ways. One has to do with pre-setting model parameters. For example, researchers must decide in advance what number of topics to model (e.g., 50 versus 250) based on their knowledge of the material and a feel for how granular to make the topics. Or, again, researchers must fuss with “hyperparameter” settings. Configuring, optimizing, and assessing topic models for the best number of topics, topic coherence, and other issues is as much art as science (issues with which our project’s Interpretation Lab researchers assisted us). The other kind of fussiness is the perverse sensitivity of topic models to even minor revisions in their underlying texts. As researchers, our instinct is to improve our collections of texts and make iterative adjustments such as de-duplicating more articles. In most kinds of research, such improvement would gradually improve our understanding. But in topic modeling, changing the underlying corpus even in minor ways has a radically discontinuous effect on the resulting model. For example, what was topic number 32 in the previous state of a model will no longer be there under the same number, and may not be there at all or will have its gist distributed differently among other topics. At one point, resetting our primary study collection (C-1) and its topic models due to de-duplication errors cost us much time in redoing our interpretive work.
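One way to see why topic numbers cannot be trusted across model runs is to compare the top words of each topic in two runs. The sketch below does this with a simple Jaccard overlap. It is an illustrative approach only, not a method WE1S reports using; the function names and the two toy “models” are invented.

```python
def jaccard(a, b):
    """Overlap between two word collections: shared words / all words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_matches(old_topics, new_topics):
    """For each old topic number, return (best new topic number, overlap score)."""
    matches = {}
    for old_id, old_words in old_topics.items():
        # Pick the new topic whose top words overlap most with the old topic.
        new_id = max(new_topics, key=lambda n: jaccard(old_words, new_topics[n]))
        matches[old_id] = (new_id, jaccard(old_words, new_topics[new_id]))
    return matches

# Invented top-word lists for two hypothetical runs of a model.
old_model = {
    32: ["humanities", "science", "students", "college", "major"],
    7: ["game", "team", "season", "coach", "win"],
}
new_model = {
    4: ["humanities", "science", "students", "degree", "major"],
    32: ["team", "game", "season", "coach", "player"],
}

matches = best_matches(old_model, new_model)
for old_id, (new_id, score) in matches.items():
    print(f"old topic {old_id} -> new topic {new_id} (overlap {score:.2f})")
```

In this toy case the old topic 32 reappears as topic 4, while the number 32 now labels an unrelated topic. A low best-overlap score would suggest that a topic has dissolved or been redistributed, which still requires human interpretation to confirm.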