Conceptual Assumptions in WE1S Topic Modeling

by Joyce McGee Brummet, Mauro Carassai, Dr. Colleen Tripp, and Katie Wolf
Published July 30, 2018

The WE1S Research Blog posts discoveries, observations, and questions by project members bearing on WE1S's themes and methods. (For context, see "About" WE1S.)
In the Philosophy of Methods research group, we hoped this summer to make the implicit assumptions of WE1S’s methods transparent and to consider how these assumptions affect our research outcomes. In particular, we were interested in the overall interpretive protocol of the WE1S research project. For example, while the current WE1S “Interpreting Topic Models” resource page focuses on interpreting topic models through quantitative methods, our research on the Philosophy of WE1S Methods points to the significance of close-reading practices during that process and to the importance of literary theory in WE1S’s project of machine-coded learning.

With this in mind, the Philosophy of Methods research group developed a tentative interpretive protocol that recognizes the role of language and theory-oriented close-reading processes in producing WE1S outcomes. For example, when we develop WE1S topic labels, is it, as critic Miriam Posner would argue, something as simple as “finding patterns” in the associated topic words and crafting a label? Or is the process more complex and affected by our implicit ways of close-reading? Joan A., a humanities scholar interested in economics, for example, may see and label a preponderance of topics based on conceptual ideas, such as money, power, and labor, while Mary K., a high school computer science teacher, may note a preponderance of topics based on the particular and material, such as the work day in Dublin and middle-class administrators in a European university. Our group’s informal experiments with topic labeling produced just this kind of divergence in labels.

In short, even with the quantitative aspect of our machine-coded learning, WE1S’s methods should incorporate qualitative protocols and address issues of close-reading and literary theory. Operating within the WE1S open-source philosophy, our research group also aims to create a reproducible, standardized way of close-reading and topic labeling that acknowledges WE1S’s implicit ways of reading in the creation of our outputs. In this post, we outline the basic assumptions of the WE1S method. In our next post, “Between Subjectivity and Objectivity in Topic Labeling,” we explain the steps of a tentative, standardized interpretive protocol.

We list below the implicit assumptions about our collective ways of reading that have guided the WE1S activities so far:

  1. A large collection of documents (produced by many different authors, produced under many different circumstances, in different contexts, for different purposes, and available through many different channels) can be made sense of by using probabilistic computational methods (topic modeling) and interpreted through a humanistic lens.
  2. Even before the topic model is produced and considered by an interpreter, the topic modeling process includes multiple stages of subjective interpretive actions and decisions, including corpus design, creating stop word lists, organizing data, cleaning data, choosing the number of topics that appear in the model, etc.
  3. We operate under the assumption of a general reconfiguration of authorship (distributed), readership (human and computational blend), and temporal and spatial categories (language acts grouped across variable time intervals and variable geographical areas), and we might have to reinterpret data as a product of discourses (in which authors participate in different ways, to different degrees, at different times, from different positionalities, etc.).
  4. A rigorous, standardized interpretive protocol is desirable for the WE1S open-source aims. The protocol involves both human reading and other possible computational techniques (including word embeddings, word vectors, etc.) and/or statistical parameters (weight, relevance, saliency, marginal probabilities, etc.).
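
To make assumption 2 concrete, the sketch below records a few pre-modeling decisions (a stop word list, a minimum token length, a number of topics) in one explicit object and shows how changing only the stop word list changes which terms dominate a toy corpus. It is an illustration under stated assumptions, not the WE1S pipeline: the names `ModelingDecisions`, `preprocess`, and `top_terms` are hypothetical, the two sentences are invented, and no topic model is actually trained.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelingDecisions:
    """Hypothetical record of interpretive choices made before modeling."""
    stop_words: frozenset = frozenset()
    min_token_length: int = 1
    num_topics: int = 50  # recorded for transparency only; unused in this sketch


def preprocess(documents, decisions):
    """Lowercase, split on whitespace, trim punctuation, apply recorded filters."""
    cleaned = []
    for text in documents:
        tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
        cleaned.append([
            t for t in tokens
            if t
            and len(t) >= decisions.min_token_length
            and t not in decisions.stop_words
        ])
    return cleaned


def top_terms(token_lists, n=3):
    """Return the n most frequent terms across all documents."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return [term for term, _ in counts.most_common(n)]


# Invented toy corpus, for illustration only.
docs = [
    "The humanities and the sciences share the university.",
    "The university funds the humanities.",
]

no_stop_list = ModelingDecisions()
with_stop_list = ModelingDecisions(stop_words=frozenset({"the", "and"}))

terms_without = top_terms(preprocess(docs, no_stop_list))
terms_with = top_terms(preprocess(docs, with_stop_list))

print("No stop list:  ", terms_without)
print("With stop list:", terms_with)
```

Without a stop list, "the" is the most frequent term; with one, it disappears and content words move up. Keeping such choices in a single recorded object is one possible way to make them visible to later interpreters, in keeping with the reproducibility aim stated above.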

Each of the above assumptions requires the topic modeler to make some kind of explicit decision based on interpretation and close-reading. We will discuss these decisions in depth in a series of dedicated blog posts.