WE1S Spreadsheet for Predicting Optimal Number of Topics in a Topic Model

Created by Samantha Wallace (posted Aug. 2, 2018; last revised Aug. 2, 2018) [Documentation of the spreadsheet below is in draft form.]

An Excel spreadsheet implementing an equation that predicts the optimal number of topics in a WE1S topic model. (Excel is used because it is easier to make trend lines in it than in Google Sheets.) Download the spreadsheet: best-fit-for-number-of-topics

Explanation of the equation from the developer: “In order to predict the optimal number of topics, I began to qualitatively assess what number of topics in a topic model created fewer junk topics and more coherent topics. I created multiple topic models, with different numbers of topics, for a single corpus with a set number of clean articles. Once I found the best topic model, I entered the result into an Excel spreadsheet, using the number of clean articles in the corpus as the input (x) and the optimal number of topics for that corpus as the output (y). Once I had enough data, I created a scatter plot from it and inserted a best-fit trend line.

I began using exponential equation trend lines to predict the optimal number of topics. I incorrectly predicted that exponential equations would keep the number of topics under control, since a linear trend line could potentially lead us to having 800 topics for an 8,000-article corpus. As I began to input data collected from experimenting with large corpora, I realized that logarithmic equations would be more accurate for a best-fit trend line. The logarithmic equation has a smaller slope, and the slope gets smaller as x (input data) increases, which means that as we begin to work with very large corpora the optimal number of topics will remain manageable (170 topics for 8,000 articles) and smaller corpora will have enough topics to interpret (75 topics for 250 articles).”
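The logarithmic trend line described above can be sketched outside Excel. The Python below is a minimal illustration, not the spreadsheet's actual equation: it assumes the trend line has the form y = a·ln(x) + b and fits it by ordinary least squares on ln(x). The function names `fit_log_trend` and `predict_topics` are hypothetical, and the only data used are the two figures quoted in the explanation (75 topics for 250 articles, 170 topics for 8,000 articles), so the resulting coefficients are illustrative and will differ from those fitted to the full data in the spreadsheet.

```python
import math

def fit_log_trend(points):
    """Least-squares fit of y = a*ln(x) + b to a list of (x, y) pairs.

    x is the number of clean articles, y the optimal number of topics.
    """
    us = [math.log(x) for x, _ in points]   # regress on ln(x), not x
    ys = [y for _, y in points]
    n = len(points)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    sxx = sum((u - mean_u) ** 2 for u in us)
    sxy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    a = sxy / sxx
    b = mean_y - a * mean_u
    return a, b

def predict_topics(n_articles, a, b):
    """Predicted number of topics for a corpus of n_articles."""
    return a * math.log(n_articles) + b

# ASSUMPTION: only the two values quoted in the text are used here;
# the spreadsheet's own data set is not reproduced.
quoted_points = [(250, 75), (8000, 170)]
a, b = fit_log_trend(quoted_points)

for n in (250, 1000, 4000, 8000):
    print(n, round(predict_topics(n, a, b)))
```

With real measurements, `quoted_points` would be replaced by the (articles, topics) pairs recorded in the spreadsheet. The slope of this curve at any x is a/x, which is why each additional article adds fewer topics as the corpus grows.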

Excel spreadsheet to predict optimal number of topics for WE1S topic models

WE1S is an initiative of 4Humanities.org.