TMO Guide (4) – Metadata-pyLDAvis

WE1S Topic Model Observatory Guide (TMO Guide), chapter 4

4. Metadata-pyLDAvis

(Document created 2 June 2019. Last revised 2 July 2019.)

[Examples of this topic model interface in action: 1 | 2 (requires WE1S password)]

Credits: Metadata-pyLDAvis created by WE1S (Dan Costa Baciu, Scott Kleinman, Yichen Li, Junqing Sun). pyLDAvis is Ben Mabey’s port to Python of the LDAvis R package by Carson Sievert and Kenny Shirley.

Metadata-pyLDAvis visualizes topics and either the document titles or the publication sources associated with them (instead of words, as normally shown in the pyLDAvis interface). (“Metadata” means such identifiers as “publisher,” “author,” and “date” that give information about a document. These metadata are prepared in advance during the collection and pre-processing of the WE1S corpus.)

For general instructions on using the pyLDAvis interface, see TMO Guide, Chapter 3 on the normal version of pyLDAvis. The instructions on this page focus only on the special functions of Metadata-pyLDAvis.

(1) Metadata-pyLDAvis for analyzing documents

pyLDAvis.topics.documents (shows titles of documents most relevant to a topic)
Metadata-pyLDAvis for documents (shows titles of documents most relevant to a topic)

a. The left panel of Metadata-pyLDAvis for documents shows circles representing topics (as usual in the pyLDAvis interface). However, selecting a topic, or searching for it in the field at the top left of the interface, shows in the right panel not the words relevant to that topic but instead the titles of the topic’s most relevant documents. (Ignore the title label for the right panel that reads, “Top-30 most relevant terms for topic,” which is a holdover from the normal pyLDAvis interface.) Adjusting the relevance metric scale at the top of the right panel changes the sorting of the documents by “relevance” as defined in pyLDAvis. In this case, relevance is a balance between two quantities: the frequency of a document in a topic, represented by a red bar, and the “lift” of a document (how much its frequency in a topic stands out above the baseline of its overall frequency in the model as a whole, which is represented by a blue bar). (See TMO Guide Chapter 3, section 1.b, for a more detailed explanation of the relevance metric.)

b. Best practice in using Metadata-pyLDAvis for documents is to select the topic or topics you are interested in, and then examine the top documents relevant to each topic at several settings of the relevance metric, including λ = 1 (emphasizing frequency of documents in a topic), λ = 0 (emphasizing the “lift” of documents in a topic), and values in between. (Note that the value of λ = 0.6, which Sievert and Shirley’s article suggests was optimal after user-testing on a particular topic model, does not necessarily apply to Metadata-pyLDAvis for documents, since here the relevance metric is being used on documents and not words.)
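As a rough illustration of why the document ranking changes with the slider, the sketch below applies Sievert and Shirley's relevance formula (the slider weight, λ in their article, times the log of frequency in the topic, plus one minus that weight times the log of lift) to documents instead of words. The document names and probabilities are hypothetical, and treating a document's "frequency" as a simple probability is an assumption made for illustration; it is not necessarily how Metadata-pyLDAvis computes its bars.

```python
import math

def relevance(p_in_topic, p_overall, lam):
    """Relevance as defined by Sievert and Shirley:
    lam * log(p(item | topic)) + (1 - lam) * log(lift),
    where lift = p(item | topic) / p(item overall)."""
    lift = p_in_topic / p_overall
    return lam * math.log(p_in_topic) + (1 - lam) * math.log(lift)

# Hypothetical documents: (share of the selected topic, share of the whole model).
documents = {
    "Doc A": (0.30, 0.20),   # frequent in the topic, but frequent everywhere
    "Doc B": (0.10, 0.02),
    "Doc C": (0.05, 0.005),  # rare overall, so its lift is high
}

def rank(documents, lam):
    """Return document names sorted from most to least relevant at this weight."""
    return sorted(
        documents,
        key=lambda name: relevance(*documents[name], lam),
        reverse=True,
    )

for lam in (1.0, 0.6, 0.0):
    print(lam, rank(documents, lam))
```

With these made-up numbers, a weight of 1 puts the most frequent document first, while a weight of 0 reverses the order in favor of the document with the highest lift, which is why it is worth checking several slider positions.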

A particularly useful tactic is to run your mouse cursor down the whole series of top documents in a topic, seeing what other topics show up in the left panel as ones that the document is prominent in. Especially worth noticing are any documents that produce just a few topic circles in the left panel (meaning that the document is prominent in just a few of the model’s topics). Among that subset of documents, even more noteworthy are any that produce either a very large or a very small circle for the topic whose document list you are looking at, relative to the circles of other topics (meaning that while the document is prominent in your currently selected topic, it is much more so or less so than in other related topics).
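The same idea can be sketched offline for readers who have access to document-topic weights. The example below is a minimal sketch under stated assumptions: the weights, the document names, and the 0.10 cutoff for "prominent" are all hypothetical, and the interface itself does not expose such a function.

```python
# Hypothetical document-topic weights (one weight per topic, summing to 1).
doc_topic = {
    "Doc A": [0.70, 0.25, 0.02, 0.03],
    "Doc B": [0.30, 0.25, 0.25, 0.20],
    "Doc C": [0.05, 0.90, 0.03, 0.02],
}

THRESHOLD = 0.10  # assumed cutoff for counting a topic as "prominent"

def prominent_topics(weights, threshold=THRESHOLD):
    """Return the indices of topics whose weight meets the threshold."""
    return [topic for topic, weight in enumerate(weights) if weight >= threshold]

def focused_documents(doc_topic, max_topics=2, threshold=THRESHOLD):
    """Keep documents that are prominent in at most `max_topics` topics."""
    result = {}
    for doc, weights in doc_topic.items():
        topics = prominent_topics(weights, threshold)
        if len(topics) <= max_topics:
            result[doc] = topics
    return result

print(focused_documents(doc_topic))
```

Here "Doc B" is spread across all four topics and is dropped, while "Doc A" and "Doc C" are concentrated in one or two topics, which corresponds to seeing only a few circles in the left panel.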

The steps above will give you a sense of which documents you might want to study in closer detail–especially if confirmed by their appearance at the top of the article lists in Dfr-browser and TopicBubbles (see TMO Guide, chapters 1 and 2). You can use Dfr-browser’s “Bibliography” tab to locate the specific documents and look at their JSON files to examine information about them and where to access them for reading.

Also, once you have identified documents of special interest in a topic, you may want to switch to Metadata-pyLDAvis for sources to see what the top publication sources are for that topic. Note in particular if there is a correlation between the most relevant documents and publication sources for a topic–e.g., indicating that a particular newspaper or set of sources is responsible for much of the most relevant discussion.
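One way to check for such a correlation by hand is to tally the publication sources of a topic's top documents and compare the tally with the topic's top sources. The sketch below assumes you have copied both lists out of the interface yourself; the titles and source names are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical (title, source) pairs for the top documents of one topic.
top_documents = [
    ("Title 1", "Source X"),
    ("Title 2", "Source X"),
    ("Title 3", "Source Y"),
    ("Title 4", "Source X"),
]

# Hypothetical top sources for the same topic, in the order shown in the right panel.
top_sources = ["Source X", "Source Z", "Source Y"]

def source_overlap(top_documents, top_sources):
    """Count how many top documents come from each of the topic's top sources."""
    counts = Counter(source for _title, source in top_documents)
    return {source: counts[source] for source in top_sources if counts[source] > 0}

print(source_overlap(top_documents, top_sources))
```

In this made-up case, three of the four top documents come from "Source X", which would suggest that a single source accounts for much of the most relevant discussion in the topic.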

(2) Metadata-pyLDAvis for analyzing sources

pyLDAvis.topics.sources (shows publication sources most relevant to a topic)
Metadata-pyLDAvis for sources (shows publication sources most relevant to a topic)

a. The left panel of Metadata-pyLDAvis for sources shows circles representing topics (as usual in the pyLDAvis interface). However, selecting a topic, or searching for it in the field at the top left of the interface, shows in the right panel not the words of that topic but instead the publication sources most relevant to the topic (e.g., “The_Stanford_Daily:_Stanford_University”). (Ignore the title label for the right panel that reads, “Top-30 most relevant terms for topic,” which is a holdover from the normal pyLDAvis interface.) Adjusting the relevance metric scale at the top of the right panel changes the sorting of the publication sources by “relevance” as defined in pyLDAvis. In this case, relevance is a balance between two quantities: the frequency of a source in a topic, represented by a red bar, and the “lift” of a source (how much its frequency in a topic stands out above the baseline of its overall frequency in the model as a whole, which is represented by a blue bar). (See TMO Guide Chapter 3, section 1.b, for a more detailed explanation of the relevance metric.)

b. Best practice in using Metadata-pyLDAvis for sources, Method 1: First select the topic or topics you are interested in, and then examine the top sources relevant to each topic at several settings of the relevance metric, including λ = 1 (emphasizing frequency of sources in a topic), λ = 0 (emphasizing the “lift” of sources in a topic), and values in between. (Note that the value of λ = 0.6, which Sievert and Shirley’s article suggests was optimal after user-testing on a particular topic model, does not necessarily apply to Metadata-pyLDAvis for sources, since here the relevance metric is being used on sources and not words.)

A particularly useful tactic is to run your mouse cursor down the whole series of top sources in a topic, seeing what other topics show up in the left panel as ones that the source is prominent in. Especially worth noticing are any sources that produce just a few topic circles in the left panel (meaning that the source is prominent in just a few of the model’s topics). Among that subset of sources, even more noteworthy are any that produce either a very large or a very small circle for the topic whose source list you are looking at, relative to the circles of other topics (meaning that while the source is prominent in your currently selected topic, it is much more so or less so than in other related topics).

The steps above will give you a sense of which publication sources you might want to study in closer detail. Also, once you have identified sources of special interest in a topic, you may want to switch to Metadata-pyLDAvis for documents (pyLDAvis.topics.documents) to see what the top documents are for that topic. Note in particular whether there is a correlation between the most relevant documents and publication sources for a topic (e.g., indicating that a particular newspaper or set of sources is responsible for much of the most relevant discussion).

c. Best practice in using Metadata-pyLDAvis for sources, Method 2: You can use Metadata-pyLDAvis for sources as one way to compare two parts of a corpus (or part of a corpus to the whole corpus) represented by a topic model. To do so, first create for yourself a watchlist of publication sources representing the part(s) of the corpus you are interested in comparing. For example, create a list of newspapers representing student newspapers. Or create one list of newspapers representing major national news sources and another representing regional or local news. Then you can use Metadata-pyLDAvis for sources to inspect the topics in which the publication sources on your watchlist(s) are prominent. (Unfortunately, at present there is no way to search for particular publication sources in Metadata-pyLDAvis for sources. So you have to look in the right panel for appearances of the sources you are interested in, then click on them when they appear to see what topics in the left panel they are prominent in.)
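Because the interface has no search for sources, it can help to keep the watchlist in a small script and record the top sources you see for each topic as you go. The following is a minimal sketch of that bookkeeping; the topic numbers and all source names other than the Stanford Daily example are hypothetical, and nothing here is read automatically from Metadata-pyLDAvis.

```python
# Hypothetical notes: top sources copied by hand from the right panel, per topic.
top_sources_by_topic = {
    1: ["The_Stanford_Daily:_Stanford_University", "Example_National_Paper"],
    2: ["Example_Regional_Paper", "Example_National_Paper"],
    3: ["Example_Student_Paper", "The_Stanford_Daily:_Stanford_University"],
}

# Hypothetical watchlist representing student newspapers.
student_watchlist = {
    "The_Stanford_Daily:_Stanford_University",
    "Example_Student_Paper",
}

def topics_for_watchlist(top_sources_by_topic, watchlist):
    """Map each topic to the watchlist sources found among its top sources."""
    hits = {}
    for topic, sources in top_sources_by_topic.items():
        matched = [source for source in sources if source in watchlist]
        if matched:
            hits[topic] = matched
    return hits

print(topics_for_watchlist(top_sources_by_topic, student_watchlist))
```

Running the same function with a second watchlist (for example, national versus regional sources) gives two sets of topics that can then be compared side by side.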