Meanwhile, we study relevant contexts that help guide us in forming research questions and interpreting results. For example, our Area of Focus Reports canvass media sources in their sociopolitical milieus in different parts of the world, and our Scoping Research Reports explore a variety of concepts related to journalism, media impact and the problem of creating a “representative” study corpus. We also conduct “human subjects research” surveys and focus groups with students and others to complement our work with a ground-level view of what people think and feel about the humanities.
Our research outputs take the form of reports, blog posts, presentations, publications, datasets, topic models, and software and visualization tools. For broad accessibility, we also write up 1-page Key Findings and similar 1-page summaries of Key Methods, Key Software, and other results. (See the explanation of our “cards” reporting system.)
One-page “cards” in PDF format summarizing our key findings. Cards include narrative explanation and selected examples, visualizations, and references with links.
Area of Focus reports identify publication and database sources for journalistic materials for different nations and regions, alternative/indie sources, and news sources related to various racial, ethnic, and gender groups. They also discuss sociopolitical and other contexts of journalism within an area, as well as challenges for accessing a representative sample of materials.
Scoping Reports explore relevant contexts for the WE1S project by reviewing research on journalism and media impact as well as concepts such as the “edition,” “canon,” and “corpus linguistics” that provide analogies for creating a “representative” corpus of journalistic materials.
Surveys we conducted of students and others at two of our project campuses—UC Santa Barbara and U. Miami—complement our big-data analysis of media documents. See our WE1S Human Subjects Research page for an overview, with highlights of results. For more results, see our Key Findings (look for those labeled “Survey results”) and also our U. Miami Survey Mini-reports. In addition, we describe our survey response collections.
The WE1S corpus (body of materials) consists of over 1 million (1,028,629) unique English-language journalistic media articles and related documents mentioning the literal word “humanities” (and for some research purposes also the “liberal arts,” “the arts,” and “sciences”) from 1,053 U.S. and 437 international news and other sources from the 1980s to 2019, mostly after 2000 when news media began producing digital texts en masse. For comparison and other uses, we also gathered about 1.38 million unique documents representing a random sampling of all news articles. In addition, we harvested over 6 million posts mentioning the “humanities” and related terms from social media (about 5 million from Twitter and 1 million from Reddit).
Our primary source for news articles is LexisNexis Academic (via the LexisNexis “Web Services Kit” API), supplemented by other databases of news and other materials. We also directly harvested born-digital news and additional texts from the web, and posts from social media.
So that other researchers can fully benefit from the data we derived from these materials, we deposited our full datasets (six in total) in the Zenodo repository operated by CERN to promote “open science.” (These do not include readable original texts except for those we directly harvested from open and public sources.) (See our Datasets page for more information and access to the Zenodo deposits.)
Our datasets represent all documents collected by our project. By contrast, our “collections” are typically subsets of these datasets, filtered by keyword, source, or document type to help us address particular research questions. For instance, our Collection 1 (C-1) is a subset of data from our
humanities_keywords dataset of 474,930 unique documents mentioning the word “humanities.” C-1 focuses on mainstream U.S. journalism, reducing the set to 82,324 articles mentioning “humanities” from 850 U.S. news sources, excluding student newspapers. Other “collections” focus just on top-circulation newspapers, student newspapers, and so on, whose data is drawn from the same dataset or different datasets. (See the metadata tags we also add to our data to help analyze groups of publication sources.)
We make available fifteen of our collections for others to use and explore. (See the “cards” describing these on our Collections page.)
Each of our “collections” has a “Start Page” (example) that provides access to topic models of the texts in the collection that we created using the MALLET tool in a workflow conducted in our WE1S Workspace of Jupyter notebooks. (See our cards explaining “Topic Modeling” and “WE1S Workspace.”) We generated several models for each collection, each at a different granularity of numbers of topics (typically 25, 50, 100, 150, 200, and 250 topics). We make available the MALLET data and other files of these models.
Our topic models come with a suite of interactive visualizations allowing users to explore their topics and associated words and texts (though due to copyright restrictions the links to texts show only word frequency counts and other data instead of readable articles). We call this suite of visualizations, some original and others adapted from existing topic-model interfaces, our Topic Model Observatory. (See “Tools” below.)
WE1S uses digital humanities methods to conduct its research. “DH” draws on computer science, information science, machine learning, text analysis, visualization, and other methods covered in our Bibliography. We’ve created 1-page explanation “cards” in plain language for many of these methods, each of which includes links to further information. Some cards also point out important issues and limitations in methods.
The primary method we use to study our collections of textual materials is “topic modeling.” Topic modeling is a leading method of “unsupervised” machine learning that discovers “topics” in texts by analyzing the statistical co-occurrence of words. It finds out which words tend to come up together in a document set (and in individual texts) when people discuss something or, as in newspapers, many things. Co-occurring words suggest “topics.” Topic modeling not only identifies topics but also indicates their relative weights. It also shows which specific documents participate strongly in a topic (or several topics at once). Topic models thus not only help with “distant reading” large collections but also guide researchers to specific documents to “close read” because of their statistical association with a topic.
Because complex data analysis can have a “black box” effect, researchers using machine-learning methods need not just to document technical workflows but also to make the steps in such workflows humanly understandable. WE1S has developed a topic-model interpretation protocol that declares standard instructions and observation steps for researchers using topic models. Our Interpretation Protocol is customized for our materials and tools (e.g., advising our researchers which visualization interfaces in our Topic Model Observatory to use in observing parts of a topic model and what to record in their notes). We publish our Interpretation Protocol because it sets a paradigm. We hope that it will be forked, evolved, and adapted by other projects to create a shared, but customizable, practice of open, reproducible digital humanities research.
Besides topic modeling, we also used other machine learning methods to assist us in understanding our collections. Deployed for special purposes, these included text classification and word embedding.
Using machine-learning methods to “distant read” our materials, however, does not mean that we neglect to “close read”—that is, to apply the method of closely analyzing words, metaphors, style, and structure in texts that humanities scholars, especially in literary studies, developed into a distinctive modern method of textual study beginning in the 1920s. Among other purposes, our project uses topic modeling to identify specific articles in our collections that are highly associated statistically with topics of interest in addressing a research question—for example, topics in which the co-occurrence of words related to the sciences and humanities helps us grasp how the media discusses the two areas together. This approach makes the best of both distant and close reading. Machine learning steers us to particular texts to pay attention to (correcting for the anecdotal, intuition-based, or canonical selection principles that traditionally guided close reading); and our human readers, including graduate students extensively trained in literary or other humanities fields, then read those texts closely.
Using social-science “grounded theory” methods for the human annotation of research materials, we also “hand code” (annotate and label) selected sources and articles in our collections of media in order to create what are essentially small-scale, human-made topic models of what materials are “about.” We used these annotations to help us better understand what the topics in our machine-learning models are “about,” and also to guide us in assembling appropriate subsets of materials for modeling to address specific research questions. We also used a similar method of hand coding to add to the metadata in our database so that we could understand our materials with a more fine-grained view, for example, of which topics in a topic model are prevalent in mainstream news publications compared to student newspapers.
To help us better understand the wider media discourse related to the humanities by seeing it from various specific perspectives—for example, from the viewpoint of students (including that of our project institutions’ many first-generation college or first-generation immigrant students)—we complemented our study of the media with localized “human subjects” research. We conducted surveys and focus groups at two of our project’s home institutions—UC Santa Barbara and U. Miami—to see how actual people around us viewed and experienced the humanities. While localized and small-scale, this research aided us in honing the research questions we addressed with our primary materials and methods and in grasping the results.
Interesting Methodological Problems
The following are the most important research questions that posed methodological problems for us, often intermixed with technical and other challenges:
We carefully scoped the “representative” corpus of journalistic and related materials we ideally wanted to gather by researching media sources in many nations and regions (see our Area of Focus reports). We also critically examined the notion of a “representative” corpus itself by comparing it to such concepts as “canon,” “edition,” and “corpus linguistics” in relation to the history and impact of newspapers (see our Scoping Research reports). But our ideal representative corpus met the hard reality of the fact that our primary means of collecting journalistic materials at scale was through LexisNexis and other databases. The provenance and balance of materials in such aggregators are outside our control and largely “black boxed.”
Originally, we wanted to gather representative English-language media from both the U.S. and much of the world for study in relation to the humanities. (Indeed, we began collecting a significant amount of materials from the United Kingdom and some other nations.) However, the scale involved in collecting and analyzing materials worldwide became unrealistic. In addition, we realized that even though our researchers included graduate students, postdoctoral scholars, and others originally from other nations, we could not as a group acquire an adequate understanding of the social, political, and cultural contexts of media in other nations. For example, we were not confident we could grasp the mission and audience of Anglophone newspapers in nations where English is not the native language. Other factors complicated researching our issues in other nations. For example, in the U.K. and many Commonwealth nations, “humanities” is a relatively recent Americanism for what was earlier, and often still predominantly, referred to as “the arts.” (See research we conducted on U.S. versus U.K. terms for the humanities early in our project.) In the end, our Advisory Board suggested that we concentrate on the U.S. to do something smaller, but better. We invite others to use our methods and tools to extend our project to other nations and languages.
One of our goals was to research how underrepresented social groups are positioned—and position themselves—in the media in relation to the humanities. In working carefully on this issue, we discovered difficulties on several fronts. One is that the amount of discussion of such groups in relation to the humanities is statistically tiny compared, for example, to discussion of women or people from underrepresented races and ethnicities in the sciences. Another is that the hyphenated or bigram names often used for underrepresented groups over the past several decades (e.g., “African-American,” “Asian-American,” “first-generation”) are invisible from the point of view of computers engaged in natural language processing and unsupervised machine learning unless researchers intervene selectively in advance to decide which such names to treat as single “tokens,” without neglecting other bigrams or similar semantic constructions that unforeseeably bear on the problem. (See our card on “Word Order Matters.”)
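The kind of selective intervention described above can be sketched as a small preprocessing step. This is a hypothetical illustration, not WE1S’s actual pipeline: the `PROTECTED` list and the underscore-joining convention are assumptions for the example, and a real intervention would need a researcher-curated list far beyond these three names.

```python
# Hypothetical sketch: protect selected hyphenated group names as single
# tokens before bag-of-words processing, so that e.g. "African-American"
# is not split into "african" + "american". The PROTECTED list and the
# underscore-joining convention are illustrative assumptions.
import re

PROTECTED = ["african-american", "asian-american", "first-generation"]

def protect_bigrams(text: str) -> str:
    """Replace protected hyphenated names with single underscore tokens
    (lowercased, as is typical before bag-of-words tokenization)."""
    for name in PROTECTED:
        joined = name.replace("-", "_")
        text = re.sub(re.escape(name), joined, text, flags=re.IGNORECASE)
    return text

sample = "First-generation and African-American students in the humanities"
print(protect_bigrams(sample))
```

After this step, a tokenizer that splits on hyphens and whitespace sees `first_generation` as one token, so the name survives into the model instead of dissolving into its parts.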
A significant proportion of the journalistic media that is published (and then aggregated in databases such as LexisNexis) consists of what a human reader would consider duplicate articles. For example, newspapers publish articles in separate editions for different print or online markets, in sequences of original and corrected versions, and, for event listings or similar material, in gradually changing text (e.g., a few changed listings at a time). News services such as the Associated Press, whose material many newspapers include, complicate the problem further because their material can be inserted in varying ways. (For instance, a newspaper can use all or only part of an AP article.) We developed a “de-duplication” algorithm to remove duplicate articles from our study corpus with “fuzziness” settings we could set to determine when differences between texts are minor enough to treat texts as the “same” and when differences are substantial enough to treat texts as “different.” But deciding what is meaningfully the same versus different is a difficult theoretical and practical challenge.
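The idea of a tunable “fuzziness” threshold can be illustrated with the Python standard library. This is a minimal sketch, not WE1S’s actual de-duplication algorithm; the similarity measure, the 0.9 threshold, and the sample texts are all illustrative assumptions.

```python
# Minimal sketch of threshold-based fuzzy de-duplication using difflib;
# WE1S's actual algorithm and "fuzziness" settings differ, but the idea
# is the same: treat two texts as the "same" article when their
# similarity exceeds a chosen threshold.
from difflib import SequenceMatcher

def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Return True if texts a and b are similar enough to count as one."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

original = "The university announced new funding for the humanities today."
reprint = "The university announced new funding for the humanities Monday."
unrelated = "Local team wins championship in overtime thriller."

print(is_duplicate(original, reprint))    # near-identical editions
print(is_duplicate(original, unrelated))  # clearly different articles
```

Raising or lowering `threshold` is exactly the theoretical decision described above: how much two texts may differ before they count as “different.”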
Our primary machine-learning method, topic modeling, is very fussy in two main ways. One has to do with pre-setting model parameters. For example, researchers must decide in advance what number of topics to model (e.g., 50 versus 250) based on their knowledge of the material and a feel for how granular to make the topics. Or, again, researchers must fuss with “hyperparameter” settings. Configuring, optimizing, and assessing topic models for best number of topics, topic coherence, and other issues is as much art as science (issues with which our project’s Interpretation Lab researchers assisted us). The other kind of fussiness is the perverse sensitivity of topic models to even minor revisions in their underlying texts. As researchers, our instinct is to improve our collections of texts and make iterative adjustments such as de-duplicating more articles. In most kinds of research, such improvement would gradually improve our understanding. But in topic modeling, changing the underlying corpus even in minor ways has a radically discontinuous effect on the resulting model. For example, what was topic number 32 in the previous state of a model will no longer be there under the same number, and may not be there at all or will have its gist distributed differently among other topics. At one point, resetting our primary study collection (C-1) and its topic models due to de-duplication errors cost us much time in redoing our interpretive work.
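The first kind of fussiness—choosing a number of topics in advance—can be sketched as a sweep over granularities. This is an illustrative scikit-learn example (WE1S models with MALLET), using hypothetical toy documents; perplexity is only one rough quantitative aid, and as noted above the “best” size remains as much art as science.

```python
# Illustrative sketch (not the WE1S MALLET workflow): fit topic models
# at several granularities and compare perplexity on the training data
# as one rough aid, alongside human judgment, for choosing a topic count.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "humanities funding cut at the university",
    "university humanities department funding debate",
    "team wins the championship game on saturday",
    "saturday game ends with a championship win",
    "new art museum exhibit opens downtown",
    "downtown museum exhibit features modern art",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit a model at each candidate granularity (lower perplexity is
# "better", but coherence and interpretability matter more in practice).
scores = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    scores[k] = lda.perplexity(X)
    print(f"{k} topics: perplexity {scores[k]:.1f}")
```

Note that fixing `random_state` makes a single run repeatable but does not address the second kind of fussiness: even with fixed settings, small changes to `docs` can reorganize which topic carries which gist.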
WE1S has created or adapted for use software tools for processing, analyzing, and visualizing topic models of large collections of texts. These tools are assembled into an open-source workflow platform we call the WE1S Workspace, whose “modules” (sets of related Jupyter notebooks and associated tools) we make available through a containerized computing environment that can be downloaded for deployment on your own computers. You can run these tools on our collections of data about media coverage of the humanities (part of the way we support open and reproducible digital humanities). Or you can run the tools on your own texts by starting with our Jupyter notebooks for creating a project (usertest_create_template_archive.ipynb) and importing your materials (Import.ipynb). [[The WE1S Workspace and all tools below will soon be released.]]
Our Workspace is an ensemble of Jupyter notebooks that can be spun up in a user’s computer from the containerized WE1S Computing Environment. The notebooks can be used modularly or in a workflow sequence to collect, manage, analyze, topic model, visualize, and perform other operations on texts.
Important modules in our Workspace include those for creating and running a topic modeling project—setting up the project; importing, exporting, managing texts; pre-processing texts; performing various analyses (such as counting documents or terms); topic modeling; and conducting topic model diagnostics.
Our Workspace also includes Jupyter notebook modules for generating interactive visualizations of topic models. We call our suite of original or adapted visualization interfaces our Topic Model Observatory. Visualization interfaces in the Topic Model Observatory useful for general purposes—exploring a topic model and its underlying materials with various degrees of freedom in looking into topics, words, and documents—include Dfr-browser, TopicBubbles, and pyLDAvis. More specialized interfaces include Metadata&D, GeoD, and DendrogramViewer. (See our Topic Model Observatory Guide.)
We also make available our Chomp—a set of Python tools designed to find and collect text from webpages on specified sites that contain search terms of interest. Unlike other web scraping tools, Chomp is designed first and foremost to take a wide sweep—working at scale and across a variety of different platforms to gather material.
For collecting from Twitter, we offer our TweetSuite, a set of tools used to collect data from Twitter and prepare it for topic modeling. See also our research blog post on our methodology of collecting materials from Reddit.
We developed a topic-model “interpretation protocol” that declares standard instructions and observation steps for researchers using topic models. Our goal is a transparent, documented, and understandable process for the interaction between machine learning and human interpretation. We do not assert a definitive topic-model interpretation process (because this will be different depending on the nature of projects, materials, resources, and personnel). We declare our interpretation protocol to serve as a paradigm to be adapted, improved, and varied by others. The protocol takes the form of survey-style questionnaires that step researchers through looking at a topic model and drawing conclusions from it.
Our Interpretation Protocol is a workflow that is modularly customizable. For example, a researcher can start by initially exploring a model (modules 1-2) and then choose chains of other modules for specific research purposes—e.g., “Analyze a topic” followed by “Analyze a keyword.”
We implemented the Interpretation Protocol for our project as a modular series of Qualtrics questionnaires providing instructions to researchers about what to observe in a topic model and what questions to answer (in note fields that follow the principles of Grounded Theory human reporting on data). We provide our questionnaires not just as Qualtrics files (importable by others with institutional access to Qualtrics) but also as Word and PDF documents.
WE1S researchers are in the process of publishing, presenting, reporting, blogging, and posting on social media about our project’s work.
We’re starting to publish about our project’s findings, methods, tools, and recommendations.
See our research blog reporting on discoveries, challenges, and reflections on our work as we proceeded.
Our Bibliography includes research works, information resources, guides, and other materials that we found useful in conducting our exploration of public discourse on the humanities. Over a thousand citations cover the main categories of Humanities & Liberal Arts, Journalism & Media, Corpus Collection, Data Science & Machine Learning, Digital Humanities, and Quantitative Analysis Methods. (Main categories open into subcategories.) The Bibliography on our website is populated from an underlying, searchable Zotero library, which we also make accessible here.