Collections & Topic Models

WE1S studies a corpus of journalistic media and other documents related to the humanities that it harvested for text analysis (but does not store or make available as readable text due to copyright constraints).* This corpus is organized as approximately 30 “WE1S Collections of News and Other Media” (combinations of different kinds of sources and year ranges) to facilitate exploring different research questions. Collections are represented and made available as word-frequency, topic-modeling, and other data generated from analyzing the original texts.

The explanation “cards” below provide summaries of the collections and “start page” links for fuller information and visualizations of topic models. [Cards are under construction.]
In addition, surveys that WE1S conducted according to approved human subjects research protocols at two of its project campuses (UC Santa Barbara and U. Miami) complement its big-data analysis of media documents. Anonymized results of these surveys are presented as “WE1S Collections of Survey Results.”

WE1S also describes selected topic models of some collections under the heading below, “WE1S Topic Models of Collections (selected).”


* WE1S makes available only derived-data, “non-consumptive use” word frequency, topic model, and other datasets along with their visualizations. Datasets cannot be used to access, read, or reconstruct the original texts.

WE1S Collections of News and Other Media

Each collection described in a card below represents thousands to hundreds of thousands of documents assembled in specific combinations of sources and years from WE1S's overall corpus of news, media, and social media materials (currently primarily from the U.S.). Due to copyright and other constraints, WE1S makes available for download or interactive exploration only word-frequency and other data and metadata derived from the original texts, along with topic models and visualizations of the material.

Different collections are designed to facilitate asking certain kinds of research questions, e.g., about the profile of the humanities in the media at large, in top U.S. newspapers, in college and university newspapers, in articles mentioning the humanities and/or the sciences, etc. For example, Collection 1 is the large set of most of the materials mentioning the humanities that WE1S gathered, excluding social media. Collection 32, by contrast, is the large set of materials that is approximately a level sample of articles from top U.S. newspapers (not necessarily including mentions of the humanities). Collection 14 includes articles from campus student newspapers. Collection 38 includes Reddit posts mentioning the humanities. And so on.

The descriptive cards below provide summary information about each collection and its sources, its start page with a fuller description and links to topic model visualizations, and the location of its downloadable datasets and models (not including plain text). The cards also commonly include screenshots from some of WE1S's topic models of collections. (For cards describing some of these models, see below on this page.)

WE1S Collections of Survey Results

To complement its big-data analysis of news and other media mentioning the "humanities," WE1S also surveyed and held focus group meetings with students and others at two of its project's campuses: UC Santa Barbara and U. Miami. The following are data collections of survey results.

WE1S Topic Models of Collections (selected)

WE1S systematically topic-modeled its "collections" to create models at various levels of topic granularity, typically 25, 50, 100, 150, 200, and 250 topics. Each model of a collection comes with a number of interactive visualizations, including those in the WE1S Topic Model Observatory. Below are cards describing a few of the specific models that have been important in WE1S's research. (These topic models are labeled according to a "shelfmark" system so that "C-14.100" means the 100-topic (or "grain") model of Collection 14.)
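
For readers who handle these labels programmatically, the shelfmark convention can be sketched in a few lines of Python. This is only an illustration: the pattern is inferred from the single example "C-14.100" given above, and the names `Shelfmark`, `parse_shelfmark`, and `format_shelfmark` are hypothetical, not part of any WE1S tool. Actual shelfmarks may take forms this sketch does not cover.

```python
import re
from typing import NamedTuple


class Shelfmark(NamedTuple):
    """Hypothetical container for the two parts of a shelfmark."""
    collection: int  # collection number, e.g. 14
    topics: int      # number of topics ("grain") in the model, e.g. 100


# Assumed pattern, inferred only from the example "C-14.100":
# the letter C, a hyphen, the collection number, a dot, the topic count.
_SHELFMARK_PATTERN = re.compile(r"C-(\d+)\.(\d+)")


def parse_shelfmark(mark: str) -> Shelfmark:
    """Split a shelfmark such as "C-14.100" into collection and topic count."""
    match = _SHELFMARK_PATTERN.fullmatch(mark.strip())
    if match is None:
        raise ValueError(f"Not a recognized shelfmark: {mark!r}")
    return Shelfmark(collection=int(match.group(1)), topics=int(match.group(2)))


def format_shelfmark(collection: int, topics: int) -> str:
    """Build a shelfmark string from a collection number and topic count."""
    return f"C-{collection}.{topics}"


if __name__ == "__main__":
    mark = parse_shelfmark("C-14.100")
    print(f"Collection {mark.collection}, {mark.topics}-topic model")
```

Under these assumptions, `parse_shelfmark("C-14.100")` yields collection 14 and a topic count of 100, and `format_shelfmark(14, 100)` reproduces the original label.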