A Digital Humanities Study of Reddit Student Discourse about the Humanities

A STEM program is not superior to a Liberal Arts program and vice versa. There is a chance for success no matter the route any student takes — Reddit commenter VillageMed

This blog post documents how to locate Reddit social media comments that exemplify students’ and graduates’ discourse about the humanities using the tools and methods of the WhatEvery1Says project. It is based on research begun during the WhatEvery1Says 2018 Summer Research camp and work that continued into Fall 2018.

Reddit comments are the back-and-forth user posts and replies in titled subreddits, which are Reddit community forums. I show how Digital Humanities tools produce topics of interest or themes of student discourse such as “Jobs,” “PhD Advice,” “Stem vs. Non-Stem discourse,” “Teaching,” “Admissions,” and “Writing.” The reasons for the students’ positions are often stated clearly within the contexts of these thematic labels. Locating such “topics” of student discourse about the humanities helps to understand student issues by category. Further, a clearer understanding of the ideas and motives expressed in the Reddit comments facilitates advocacy for the humanities. Additionally, Digital Humanities newbies will learn from this study how to process the Reddit archive to answer their own research questions.

Let’s first look at the “what” of this article, namely four exemplary comments of the kind the research is expected to produce:

Comment #1 PhD Advice Topic 137 subreddit: askacademia
For most fields in the humanities and social_sciences, you have to accept that you may end up in a job that has nothing to do with your degree. There are not enough jobs in academia for the number of students graduating with PhDs [ . . . ] for those who are in top-tier programs and willing to make that sacrifice, grad school can obviously be very personally rewarding

Comment #2 Stem vs. Non-Stem Topic 105 subreddit: badhistory
Physics student, so I can chime in here. STEM students feel that their major is harder and more rigorous than those who major in the liberal_arts, particularly since STEM fields are very math heavy . . .

Comment #3 Political Rhetoric Topic 121 subreddit: changemyview
“SJW” [ . . . ] Notable demographics include liberal_arts majors in college, tumblr, BLM. As the acronym describes, they are “Social Justice Warriors” and fight for “Social Justice”

Comment #4 Stem vs. Non-Stem Topic 105 subreddit: college
It’s because of the stigma in society, because the way such subjects are advertised in society — they make it seem like math and science are difficult subjects [ . . . ] Not everyone can write a 70+ page essay with ease, some people find math and equations to be the easy thing, but many people assume that the opposite is true for everyone.

These comments touch on various issues of interest to the researcher who seeks to understand how students and graduates conceptualize the humanities. As will be seen, the process of generating the topic model itself plays a fundamental role in drawing comments like those above into the same contexts. The model demonstrates the semantic relationships between comments within the same topic and between one topic and another. Although this blog post doesn’t conduct a detailed review of how a close reading of exemplary comments such as those above may be used for advocacy, it does answer the question of why a researcher should use Reddit as a resource.

Reddit as a Data Source for Student Discourse about the Humanities

Reddit describes itself as “a website comprised of thousands of user-originated and operated communities, called ‘subreddits,’ or ‘subs,’ dedicated to a variety of interests.” Reddit’s data-rich set of global knowledge and discourse with “more than 330M monthly unique visitors and 18+billion views per month” provides the researcher through the use of Digital Humanities tools an in-depth look into comments about almost every public topic of interest (“Holiday on Reddit”).

All of this data is curated by “Moderators,” or “Mods,” who perform “a variety of functions within th[e] community, including removing spam and enforcing the rules of their subreddit” (Reddit). Since a Reddit user may create any number of pseudonyms to post comments, many comments can be expected to be deleted. The commenter might post angry comments that have nothing to do with the theme of the subreddit, or they may make irrational comments in another voice that aligns with their chosen pseudonym. Although the moderators’ deletions don’t scrub the comments of all irrelevant data, they do spare the researcher some of the work: off-topic comments and spam less often end up as tokens in the corpus submitted to computational analysis, so fewer of them get mixed into the topic model.

The Corpus

Each minute, as many as 5,000 new comments, or more than half a million new words, are added to Reddit. The following graphics, captured from the front page of pushshift.io, depict the statistical usage of Reddit (pushshift.io).

Statistical usage of Reddit. Source: pushshift.io

The first job in assembling a corpus from Reddit data is to establish constraints on how much of the material to collect. For this study, a total of 3.3 terabytes of open-access Reddit comments (approximately five billion) from January 2006 through October 2018 were downloaded in JSON format from pushshift.io. The scope of this corpus is such that, if the comments were printed three per sheet of paper and each sheet stacked one on top of another, the length of the paper stack would exceed 100 miles. With such a large amount of text in the archive, the question becomes how does the researcher find what they are looking for?

To collect a corpus comprised of documents with exemplary comments such as those above, I initially filtered the downloaded Reddit data for comments containing at least one of the keywords “humanities,” “liberal arts,” or “the arts,” which resulted in a corpus of 154 files, totaling 980.5 MB of text. I next performed three further refinements. First, I filtered comments that contained the keywords “student,” “major,” or “college” (with or without affixes) into a new corpus. The Python code to search the text of the 154 source files follows:

#!/usr/bin/python
import glob
import os

# Placeholder paths: point these at your own source and destination directories.
source_dir = '/home/path-to-your-source-files/Student-Major/'
dest_dir = '/home/path-to-your-destination-files/'

for json_filename in glob.glob(os.path.join(source_dir, '*.json')):
    filename_out = os.path.basename(json_filename) + '-student-majors-college.json'
    # Case-insensitive grep keeps every line (one comment per line) that
    # contains "student", "major", or "college", with or without affixes.
    grep_command = ("grep -i 'student\\|major\\|college' '" + json_filename
                    + "' > '" + os.path.join(dest_dir, filename_out) + "'")
    os.system(grep_command)
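
The same filter can also be sketched in pure Python, which avoids shelling out to grep. This is a minimal sketch rather than the script used in the study: it assumes one JSON comment per line, and the names KEYWORDS and filter_lines are hypothetical.

```python
import re

# Same three keywords as the grep command. Matching inside longer words
# ("students", "majors", "colleges") mirrors grep's substring behavior.
KEYWORDS = re.compile(r"student|major|college", re.IGNORECASE)

def filter_lines(lines, pattern=KEYWORDS):
    """Yield only the lines that contain at least one keyword."""
    for line in lines:
        if pattern.search(line):
            yield line

# Invented sample lines, for illustration only.
sample = [
    '{"body": "I am a Physics STUDENT."}',
    '{"body": "Nothing relevant here."}',
    '{"body": "Liberal arts majors write a lot."}',
]
matches = list(filter_lines(sample))
print(len(matches))
```

Like grep, this sketch matches anywhere on the line, so a comment can be kept because a keyword appears in its metadata (for example, the subreddit name or the permalink) rather than in the comment text itself.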

This second search results in 153 files totaling 335.5 MB, which were run through a Python preprocessing script for proper formatting before the data were uploaded to the WE1S server. The Python script removes comments containing fewer than 225 words and comments with a karma score of less than or equal to 2. It also calculates the sentiment and subjectivity values of each comment through the use of the Python TextBlob API, and it writes out each comment as a single JSON file containing both the comment text and the metadata. The resulting corpus (“Corpus-A”) contains a total of 22,160 comments.
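
The length and karma thresholds described above might be applied along the following lines. This is a minimal sketch under stated assumptions, not the WE1S preprocessing script: it assumes one JSON object per line with the lowercase Pushshift field names body and score, and the names MIN_WORDS, MIN_SCORE, and select_comments are hypothetical. Sentiment scoring is left out to keep the sketch dependency-free; with TextBlob installed, TextBlob(text).sentiment returns the polarity and subjectivity values.

```python
import json

MIN_WORDS = 225  # drop comments with fewer than 225 words
MIN_SCORE = 3    # drop comments with a karma score of 2 or less

def select_comments(lines):
    """Parse JSON lines and keep comments that pass both thresholds."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        comment = json.loads(line)
        word_count = len(comment.get("body", "").split())
        if word_count >= MIN_WORDS and comment.get("score", 0) >= MIN_SCORE:
            comment["wordcount"] = word_count
            kept.append(comment)
    return kept

# Invented sample data, for illustration only.
long_body = " ".join(["word"] * 300)
lines = [
    json.dumps({"body": long_body, "score": 3}),     # kept
    json.dumps({"body": long_body, "score": 2}),     # score too low
    json.dumps({"body": "too short", "score": 10}),  # too few words
]
kept = select_comments(lines)
print(len(kept), kept[0]["wordcount"])
```

Keeping the two thresholds as named constants makes them easy to adjust when re-winnowing the corpus for a different research question.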

Metadata

The JSON file format of the content downloaded from pushshift.io complements the researcher’s exploration by making parsing and processing easy with Python. Each line within the files contains a single comment with the following metadata:

{
    "author": "xPadawanRyan",
    "author_flair_css_class": "",
    "author_flair_text": "SSW / BA & MA History / PhD* Human Studies",
    "body": "It's because of the stigma in society, because the way such subjects are advertised in society -- they make it seem like math and science are difficult subjects . . . Not everyone can write a 70+ page essay with ease, some people find math and equations to be the easy thing, but many people assume that the opposite is true for everyone.",
    "can_gild": true,
    "controversiality": 0,
    "created_utc": 1509939054,
    "distinguished": null,
    "edited": false,
    "gilded": 0,
    "id": "dper8w9",
    "is_submitter": false,
    "link_id": "t3_7b2gls",
    "parent_id": "t3_7b2gls",
    "permalink": "/r/college/comments/7b2gls/why_do_people_assume_that_we_major_in_worthless/dper8w9/",
    "retrieved_on": 1512171532,
    "score": 3,
    "stickied": false,
    "subreddit": "college",
    "subreddit_id": "t5_2qh3z",
    "subreddit_type": "public"
}

Parsing and extracting information that relates to the research is necessary whatever format the corpus files are in, but if the data are in JSON format, a Python script can extract any of the Reddit metadata fields and use them elsewhere. For example, the permalink value that points to the comment thread on the Reddit website can be reformatted as a link in the dfr-browser tool for visualizing topic models. Comparing the original comment above with the record behind the “view JSON” link on the document title page (in WE1S’s customization of dfr-browser) shows that the researcher’s Python script has reformatted the JSON downloaded from pushshift.io to include the permalink value as a hyperlink to the original Reddit thread. The reformatted comment page includes other essential statistics such as the karma score.

{
    "title": "2017-11-humanities-student-major_569_college.txt",
    "pub_date": "2017-11-05T00:00:00Z",
    "Sentiment": "0.01",
    "Subjectivity": "0.56",
    "KarmaScore": "3",
    "Upvotes": "0",
    "Downvotes": "0",
    "Wordcount": "371",
    "Permalink": "http://reddit.com/r/college/comments/7b2gls/why_do_people_assume_that_we_major_in_worthless/dper8w9/",
    "Threadlink": "http://reddit.com/r/college/comments/7b2gls/why_do_people_assume_that_we_major_in_worthless",
    "Commenter": "xPadawanRyan",
    "content_scrubbed": "It's because of the stigma in society, because the way such subjects are advertised in society -- they make it seem like math and science are difficult subjects . . . Not everyone can write a 70+ page essay with ease, some people find math and equations to be the easy thing, but many people assume that the opposite is true for everyone."
}
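
A conversion of this kind could be sketched as follows. This is an illustration under stated assumptions, not the script used in the study: the function name reformat_comment is hypothetical, the raw record is assumed to use the lowercase Pushshift field names, and only a few of the output fields are built (the title, date, sentiment, and vote fields are omitted).

```python
def reformat_comment(raw):
    """Build a few of the reformatted fields from one raw comment."""
    permalink = "http://reddit.com" + raw["permalink"]
    # The thread link is the permalink without the trailing comment id.
    threadlink = permalink.rstrip("/").rsplit("/", 1)[0]
    return {
        "KarmaScore": str(raw["score"]),
        "Permalink": permalink,
        "Threadlink": threadlink,
        "Commenter": raw["author"],
        "content_scrubbed": raw["body"],
    }

# Abbreviated copy of the sample comment shown earlier in this post.
raw = {
    "author": "xPadawanRyan",
    "body": "It's because of the stigma in society ...",
    "permalink": "/r/college/comments/7b2gls/why_do_people_assume_that_we_major_in_worthless/dper8w9/",
    "score": 3,
}
record = reformat_comment(raw)
print(record["Threadlink"])
```

Deriving the thread link from the permalink, rather than storing it separately, keeps the two values consistent with each other.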

The karma metadata field is an important publicly assigned quality-of-commenter numerical value for the researcher to use as a proxy of authority when filtering for “higher” or “lower” quality comments. According to Reddit, “Posts and comments accrue votes, or points, called ‘karma’ . . . [it] is generally a measure of the perception of [the user’s] contribution to Reddit. Positive karma indicate[s] your fellow users regard your comments or posts as enjoyable and contributory to the subreddit.” The karma value is a seed used to winnow the search results into a corpus that includes the public’s approval of the comments being researched. Assuming users prefer a higher rating based on their overall karma points, then the bias of this metadata value is that it may be used to the exclusion of other commenters. The excluded commenters with low karma values could be authors of equally meaningful comments, but they are either new or their comments aren’t as highly rated by others.

Nonetheless, ghosting of the karma value onto the comments made by a commenter occurs since most commenters desire to increase their karma rating rather than lower it; they tend to produce meaningful comments to win more karma points. The implicit notion of a comment being equivalent to the karma rating of the commenter explicitly carries along with the comment within its metadata. Despite bias, as a research decision, the karma rating and the humanities search terms become adjustable variable values for creating quality corpora to answer the research question.

Overview of the Methodology

Corpus-A is the result of what Jo Guldi refers to as “an iterative research process that require[s] successively re-seeding, re-winnowing, and re-reading resulting samples of text from a corpus” (“Critical Search: A Procedure for Guided Reading in Large-Scale Textual Corpora”, 13). Filtering the entire Reddit archive of comments by the six search terms “humanities,” “liberal arts,” “the arts,” “student,” “major,” and “college” has “constrain[ed] a large corpus around a particular question” (Guldi 11). The research seeks exemplary comments for further analysis of how and why students and graduates talk about humanities fields in the way that they do. Since every comment includes at least two of the search terms, Corpus-A contains many comments worthy of closer inspection. But among more than 22,000 comments, many are not meaningful for understanding student discourse concerning the humanities, and many will not help determine what influences commenters’ viewpoints. In general, sociological, economic, and cultural factors, parental authority, and individual preference drive their opinions.

But anticipating these factors may lead to the exclusion of some specific and surprising possibilities. For instance, some comments may reveal stereotyping to the point of stigma as a primary element of student opinions within particular subreddits. Others may reveal an unexpected presence of references to “students” and “humanities” in some gaming subreddits. Therefore, a search made with the hope of finding the unknown about student discourse must be wide enough to include the broadest context of possible influences behind student opinions, and narrow enough to isolate the comments constrained by what is meant by the humanities.

As part of the WE1S project, I have analyzed the Reddit corpus using the WE1S workflow based around topic modeling using MALLET and visualized the resulting model with dfr-browser (Goldstone) and pyLDAvis (Mabey). “A ‘topic’ consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings” (MALLET). To this point Guldi says that “[t]opic models identify semantic similarities in collections of words that are used together” (19). The semantically similar collections, or topics, may be thought of as themes, such as “jobs,” “admissions,” or “campus infrastructure,” where the documents (in this case, Reddit comments) contain varying proportions of terms most highly associated with those topics. Each topic displayed in dfr-browser also includes a list of the comments, as individual documents, that contribute to it. Therefore, the grouping of the documents into coherent themes of discourse by the tools makes it possible to closely analyze the thematic bases of student rhetoric within individual Reddit comments.

PyLDAvis, a Python port of the LDAvis package for R, is an important tool in the WE1S workflow for ascertaining the semantic coherence of topics generated in the model. According to Sievert and Shirley, authors of the original LDAvis package, it “attempts to answer a few basic questions about a fitted topic model: (1) What is the meaning of each topic?, (2) How prevalent is each topic?, and (3) How do the topics relate to each other?” (63). Knowing the meaning of a topic and how prevalent it is helped me to label the thematic topics of Corpus-A. The main component of pyLDAvis that helped me to determine the semantic relationships of topics to the comments most heavily represented in them is the relevance indicator. The authors explain “relevance” as an indicator that lets the user weigh a term’s lift (“the ratio of a term’s probability within a topic to its marginal probability across the corpus”) against “the familiar ranking of terms in decreasing order of their topic-specific probability” (Sievert 65-6). Knowing the relevance of the terms of the topics, along with a close reading of a few of the comments that made up the topics, gave me confidence in the coherent semantic relationship between the topic’s label and the documents that make up the topic.
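
To make the relevance indicator concrete, here is a small sketch of the metric as Sievert and Shirley define it: a weighted sum of the log of a term’s probability within a topic and the log of its lift, with the weight set by the slider value. The function names and the toy probabilities are invented for illustration; this is not pyLDAvis’s own code.

```python
import math

def lift(p_term_in_topic, p_term_in_corpus):
    """Ratio of a term's probability within a topic to its corpus-wide probability."""
    return p_term_in_topic / p_term_in_corpus

def relevance(p_term_in_topic, p_term_in_corpus, lam):
    """Weighted mix of log probability (lam) and log lift (1 - lam)."""
    return (lam * math.log(p_term_in_topic)
            + (1 - lam) * math.log(lift(p_term_in_topic, p_term_in_corpus)))

# Toy values (invented): term -> (probability in topic, probability in corpus)
terms = {
    "people": (0.050, 0.040),     # common everywhere, so low lift
    "non-stem": (0.010, 0.0002),  # rare overall, concentrated in the topic
}

def rank(terms, lam):
    """Order terms from most to least relevant at a given slider value."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)

print(rank(terms, 1.0))  # slider at 1: ranked by topic-specific probability
print(rank(terms, 0.0))  # slider at 0: ranked by lift alone
```

With the slider at 1 the generic but frequent term ranks first, while at 0 the term that is distinctive to the topic ranks first; intermediate values balance the two.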

Interpretation and Methodology

Although I did not experiment extensively with the number of topics, I generated the Corpus-A topic model with 200 topics from a corpus of 21,018 de-duplicated documents, a setting that appears to provide close to optimal granularity for locating student discourses of interest. I used the following steps to prepare the corpus for modeling:

Remove 1376 stop words from the stoplist file
Normalize all versions of “United States of America” to “United States”
Remove punctuation
Merge some phrases from a standard list with underscore
Replace “‘s” with “[.]”
Remove duplicates from the corpus
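
Most of the steps above can be sketched in a few lines of Python. This is a simplified illustration, not the WE1S preprocessing code. The stop words and phrase list are tiny hypothetical stand-ins for the real files, lowercasing and the order of operations are my assumptions, only the literal phrase “United States of America” is normalized, and the “‘s” replacement step is omitted.

```python
import string

# Hypothetical stand-ins: the study used a 1376-word stoplist file and a
# standard phrase list, neither of which is reproduced here.
STOP_WORDS = {"the", "of", "a", "in", "and", "is"}
PHRASES = {"liberal arts": "liberal_arts", "social sciences": "social_sciences"}

# Strip punctuation but keep underscores so merged phrases survive.
PUNCTUATION = string.punctuation.replace("_", "")
STRIP_TABLE = str.maketrans("", "", PUNCTUATION)

def preprocess(text):
    """Normalize, merge phrases, strip punctuation, and drop stop words."""
    text = text.lower()  # assumption: case is folded before matching
    text = text.replace("united states of america", "united states")
    for phrase, merged in PHRASES.items():
        text = text.replace(phrase, merged)
    text = text.translate(STRIP_TABLE)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

def prepare_corpus(documents):
    """Preprocess every document and drop exact duplicates, keeping order."""
    seen = set()
    prepared = []
    for doc in documents:
        cleaned = preprocess(doc)
        if cleaned not in seen:
            seen.add(cleaned)
            prepared.append(cleaned)
    return prepared

# Invented sample documents, for illustration only.
docs = [
    "The liberal arts in the United States of America!",
    "The liberal arts in the United States of America!",
    "Jobs, jobs, and degrees.",
]
print(prepare_corpus(docs))
```

Merging phrases before removing punctuation and stop words is what lets a term such as “liberal_arts” appear as a single topic keyword.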

Once prepared, the corpus was modeled with MALLET, and the resulting model was visualized through the dfr-browser and pyLDAvis modules in Jupyter notebooks on the UCSB server.

A structured series of observation and judgment steps made in accordance with the WE1S interpretation protocol provided guidance for locating which topics are the most important in the model. Following WE1S guidelines, I went to dfr-browser’s List View and listed the topics with the most heavily weighted topics on top as in the screenshot below.

Mega topics shown in dfr-browser’s List View

List View shows the top 13 topics relative to their topic weights within the corpus along with their topic words and a graph of the topic distribution over time. In this view, the “mega topics” are those with weights greater than two percent of the corpus. The mega topics have the highest proportional weights, and because they consist of general topic words, they are difficult to label meaningfully. For illustrative purposes, I’ve labeled one such mega topic as “Non-noun Stop Words” since it contains mostly adjectives and adverbs: words that researchers sometimes add to the stop word list file to improve model coherence.

The protocol asks the researchers to note the topics of interest where the topic words appear to have a semantic relationship. In my experience, these have typically been topics that represent less than two percent but more than 0.5 percent of the corpus. For instance, in the graphic above, Topic 150, with a 1.6 percent representation of the corpus, has the keywords “degree,” “job,” “degrees,” “major,” “liberal_arts,” “college,” “people,” “jobs,” “field,” “majors,” “school,” “career,” “work,” “business,” and “market.” What is noteworthy about Topic 150 is the number of search terms that appear as keywords within the topic, such as “liberal arts,” “major,” and “college.” The presence of key search terms in a topic’s keywords suggests that the topic is worthy of further investigation. The implied theme or label of the topic might be “degrees that lead to jobs.”
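
The weight bands described here can be written as a simple rule of thumb. In this minimal sketch, the thresholds come from the text, while the function name, the band labels, and the two extra topic weights are invented for illustration.

```python
MEGA_THRESHOLD = 0.02       # more than two percent of the corpus
INTEREST_THRESHOLD = 0.005  # more than 0.5 percent of the corpus

def classify_topic(weight):
    """Place a topic in a band according to its share of the corpus."""
    if weight > MEGA_THRESHOLD:
        return "mega"
    if weight > INTEREST_THRESHOLD:
        return "candidate"
    return "minor"

# Topic 150's 1.6 percent comes from the text; topics 1 and 2 and their
# weights are hypothetical.
weights = {150: 0.016, 1: 0.031, 2: 0.003}
bands = {topic: classify_topic(w) for topic, w in weights.items()}
print(bands)
```

A band is only a first pass: a “candidate” topic still needs its keywords and contributing comments read before it earns a label.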

Since the keywords of Topic 150 appeared coherent to me in List View, I turned to Topic View to examine the topic more closely.

Most prominent topics for Humanities and STEM shown in dfr-browser’s Topic View

The protocol asks the researcher to read the comments that contribute to the topic. After reading five of the comments that contribute to Topic 150, I concluded that this is a topic of interest, but the comments seemed to discuss the benefits of either STEM or humanities majors rather than degrees that lead to jobs. I therefore labeled the topic “Stem or Non-Stem Discourse.”

As Guldi states, “by thrashing the data with different tools, the digital scholar obtains insight into the bias of the tools themselves, and the variety of answers they can produce” (25). Indeed, my interpretation takes place in a back-and-forth manner between the dfr-browser and pyLDAvis visualizations as needed. I’m interested in the verification of pyLDAvis by dfr-browser and vice versa. These tools have slightly different ways of representing topics, and comparing these representations aids the interpreter in developing semantically meaningful labels for significant topics.

In the screenshot below, I have added custom labels indicating my interpretations of the topic’s content or theme. For example, Topic 121 (“Political Rhetoric/Arguments”) reflects general political discourse entering into the conversation of students over time. Further research into the individual documents containing this topic may or may not reveal that divisiveness in student political opinions creates a reactionary environment that accentuates stereotyping of humanities students.

To find out how students argue for and against becoming humanities majors, the topics containing comments for investigation appear to be Topic 105 (“Stem vs. Non-Stem”), Topic 150 (“Stem or Non-Stem Discourse”), Topic 62 (“Follow Your Bliss”), and Topic 172 (“Humanities and Jobs”).

Topics likely to represent student discourse

These topics related to humanities majors together represent 4.2% of the corpus. Since the average topic in a 200-topic model accounts for 0.5%, the four topics, at roughly 1.05% each, carry a better than average proportion of the corpus. The documents that make up each topic of interest require sample reading of the underlying comments to verify whether they help answer our research question, but their labels indicate what we should expect to find.

Using pyLDAvis further helped me to locate coherent topics that consist of comments related to our inquiry and thus to eliminate much of the need for sample reading beyond the first few comments of each topic of interest (Wieringa). pyLDAvis simplifies the labeling of topics, and therefore it simplifies the process of determining where to search for the answer to our question within the corpus. Its visual interface for locating the topics of choice lets us look deep within the topics to know that the topics, and by association, the documents that represent the topic, are consistent with the theme of the label. In the model below, Topic 105, located in the lower right quadrant, stands out in red. The relevance slider in the upper right is the primary tool of pyLDAvis.

Topic 105 in pyLDAvis with Relevance set to 0.6

With the relevance value set to 0.6, the first five or so words often give the researcher enough information to label a topic appropriately. In this case, the words are “stem,” “majors,” “fields,” “humanities,” and “non-stem,” which suggests that the comments that make up the topic contain opposing student rhetoric about humanities and STEM majors.

The researcher may double-check this assumption by returning to dfr-browser’s word index page and clicking on the “humanities” link. Amongst the “Prominent Topics,” Topic 105 (“Stem vs. Non-Stem”) has the highest probability of containing the word “humanities.”

A sample reading of the Reddit comments associated with this topic supports the interpretation based on Topic 105’s keywords (“humanities,” “stem,” “liberal arts,” “engineering,” and “non-stem”); the discourse of the topic involves opposing points of view. Going back to pyLDAvis and sliding the relevance indicator to the far left ranks the terms by lift alone, that is, by the ratio of a term’s probability within the topic to its probability across the corpus. The two words of Topic 105 with the highest lift are “stem” and “non-stem.” The following screenshot shows the visualization with the relevance metric set to zero.

Topic 105 in pyLDAvis with Relevance set to 0

In this manner of back-and-forth “thrashing” of the models, the researcher gains assurance that dfr-browser and pyLDAvis agree: high polarization exists within the comments of Topic 105. The documents constituting “Stem vs. Non-Stem” most likely contain the sought-after student and graduate rhetoric.

Worthy of note is that the top 500 documents constituting Topic 105 come from 171 different subreddits, many of which are non-academic in nature. The first 15 of the 171 subreddits that contribute heavily to Topic 105 are “6thForm,” “ABCDesis,” “academiceconomics,” “actuallesbians,” “AdviceAnimals,” “Anarchism,” “Capitalism,” “antisrs,” “ApplyingToCollege,” “asianamerican,” “AsianParentStories,” “AskAcademia,” “AskAnAmerican,” “AskEngineers,” and “AskFeminists.” These results imply that discourse about humanities and STEM majors arises out of a broad demographic base and in contexts across a spectrum of interests.
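
A count of this kind is straightforward to reproduce once the top documents for a topic are in hand. The sketch below assumes each document is a dictionary with a subreddit field; the function name and the sample documents are invented for illustration.

```python
from collections import Counter

def subreddit_counts(documents):
    """Count how many of the given documents come from each subreddit."""
    return Counter(doc["subreddit"] for doc in documents)

# Invented stand-in for a topic's top-ranked documents.
top_documents = [
    {"subreddit": "AskAcademia"},
    {"subreddit": "college"},
    {"subreddit": "AskAcademia"},
    {"subreddit": "AskEngineers"},
]
counts = subreddit_counts(top_documents)
print(len(counts))                     # number of distinct subreddits
print(sorted(counts, key=str.lower))   # case-insensitive alphabetical list
```

The per-subreddit counts also show whether a topic is spread thinly across many communities or dominated by a few.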

Other topics of interest within the list labeled thus far include Topic 62 (“Follow Your Bliss”), wherein the comments, for the most part, subscribe to the idea that passions guide students towards a field of study; Topic 150 (“Stem or Non-Stem Discourse”), wherein the comments do not argue for or against the humanities but rather tell why the students have chosen a particular path; and Topic 172 (“Humanities and Jobs”) which speaks to the issue of job prospects for humanities majors. The comments that comprise each of the topics of interest require further examination to learn how best to address student concerns about the humanities.

Conclusion

The premise matching the goal of this research blog assumes that the researcher will, after studying the rhetoric and diction for and against the humanities in the documents of a topic such as Topic 105, develop optimal insight into how best to frame an answer presentable to the public in support of the humanities. Guldi states that at the end of the “critical search is [the] actual reading of particular texts,” which in this case are individual comments classified as exemplary (29). She refers to this stage as “Guided Reading,” where the “iterative encounters with the algorithm and reading allow[s] the researcher to find documents that fit best with [the] question” (29). And, although this research continues beyond the documentation here to re-model the exemplary comments of this and many other models combined, the results have proven the usefulness of the Digital Humanities tools used to find the comments and themes of student discourse about the humanities.

This blog post contains the technical information necessary for researchers who desire to explore Reddit for answers to particular questions about human discourse. It demonstrates that the Reddit archive is a vast aggregation of the English language worthy of investigating questions that would otherwise be impossible without Digital Humanities tools. Through software such as MALLET, dfr-browser, and pyLDAvis, the study shows that algorithmically analyzing a corpus into topics, or themed genres, consisting of file sets helps to answer the research question of how students talk about the humanities. For a detailed look at the results of this study, download the top-ranked 500 comments of Topic 105 (“Stem vs. Non-Stem”) here.

Works Cited

Goldstone, Andrew. Dfr-browser: Take a MALLET to Disciplinary History. 2013. 2018. GitHub, https://github.com/agoldst/dfr-browser.

Guldi, Jo. Critical Search: A Procedure for Guided Reading in Large-Scale Textual Corpora. Preprint, SocArXiv, 20 Dec. 2018. DOI.org (Crossref), doi:10.31235/osf.io/g286e.

“Holiday on Reddit.” Upvoted, http://redditblog.com/2018/11/13/holiday-on-reddit/. Accessed 4 Feb. 2019.

Mabey, Ben. pyLDAvis: Python Library for Interactive Topic Model Visualization. Port of the R LDAvis Package. 2015. 2019. GitHub, https://github.com/bmabey/pyLDAvis.

Pannucci, Christopher J., and Edwin G. Wilkins. “Identifying and Avoiding Bias in Research.” Plastic and Reconstructive Surgery, vol. 126, no. 2, Aug. 2010, pp. 619–25. PubMed Central, doi:10.1097/PRS.0b013e3181de24bc. Accessed 4 Mar. 2019.

Pushshift.io, files.pushshift.io/reddit/comments/. Accessed 27 Feb. 2019.

“Reddit: The Front Page of the Internet.” Reddit, https://www.reddit.com/r/askReddit/wiki/index. Accessed 4 Mar. 2019.

Sievert, Carson, and Kenneth Shirley. “LDAvis: A Method for Visualizing and Interpreting Topics.” Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, Association for Computational Linguistics, 2014, pp. 63–70. ACLWeb, http://www.aclweb.org/anthology/W14-3110.

“TextBlob 0.15.2 Documentation.” https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.TextBlob.sentiment. Accessed 4 Mar. 2019.

Wieringa, Jeri E. “Using pyLDAvis with Mallet.” From Data to Scholarship, http://jeriwieringa.com/2018/07/17/pyLDAviz-and-Mallet/. Accessed 4 Mar. 2019.