A Summer 2018 Saga: Webscraping for Subcorpora

The WE1S Research Blog posts discoveries, observations, and questions by project members bearing on WE1S's themes and methods. (For context, see "About" WE1S.)

The What Every 1 Says (WE1S) project aims to visualize public discourse on the humanities across many different states, countries, and regions. Although the majority of our corpus comes from major media conglomerates, we hope to gather as much data as possible from other kinds of sources – especially publications from developing nations. A Latin American subcorpus, then, would be a valuable avenue for future research if we are to take seriously the burgeoning body of research into the “global humanities”.

However, our team faced unique challenges in data collection, for several reasons. Because the current manifestation of the WE1S project is English-language only, our list of possible sources was curtailed dramatically. Since Spanish (or, in Brazil, Portuguese) is the official language of most of these countries, almost all archived material in LexisNexis (our primary database), ProQuest, and similar databases was not in English. Meanwhile, although many major newspapers in Mexico, for example, had both a Spanish and an English version, only the Spanish version was digitized and stored. In the rare cases of English-first publications, we lacked access to print archives. We spoke about how a future iteration of this project might involve a Spanish-language subcorpus, and about how we could even reach out to journalists and scholars in these countries to ask about data access. However, in the short time that our summer camp had to gather enough data to generate meaningful topic models and research analyses, it was simply not viable to send out a string of emails and wait for days, possibly weeks, for a response. It quickly became clear that in order to determine what everyone says, we had to emulate what everyone does – that is, turn to web-based news sources for meaningful information.

Thus, we had to collect data from online news sources or online platforms for newspapers, magazines, and blogs published in English throughout Latin America. It is important to note that we limited our searches to public, open-access materials – we did not have the funds to get through paywalls, and we were not interested in craftily circumventing them. In order to err on the side of caution and respect intellectual property rights for all the countries involved, the articles that we collect are “bagified” and not stored in their original forms. Our methods involve “non-consumptive” data analysis – mostly topic modeling – which depends on word contents, but not interpretive reading of the original documents.
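To make “bagified” concrete, here is a minimal Python sketch of the general idea: an article is reduced to alphabetized word counts, so word frequencies survive but the original sequence of words does not. The function name `bagify` and the tokenizing rule are our own assumptions for illustration; the exact format that WE1S stores is not described in this post.

```python
import re
from collections import Counter


def bagify(text: str) -> dict[str, int]:
    """Reduce an article to word counts, discarding word order.

    Hypothetical helper for illustration only; not the WE1S implementation.
    """
    # Lowercase the text and keep runs of letters (straight apostrophes
    # are allowed inside a word, as in "don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    # Sort alphabetically so the original sequence cannot be read back.
    return dict(sorted(Counter(words).items()))


article = "The humanities matter. The humanities ask what matters."
print(bagify(article))
```

A topic model only needs counts like these, which is why this kind of analysis can work without keeping a readable copy of the article.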

Collecting from the Web presented a different set of problems. In order for a topic model to be meaningful, it should have hundreds – ideally, thousands, or even tens of thousands – of documents to work with. But we knew of no way to quickly and efficiently “grab” online articles. And so, at first, we were resigned to the fact that we would have to go manually through each page, laboriously copy and paste it into a text editor like Visual Studio Code, name it carefully, and save it in a format that (hopefully!) would import cleanly into the Jupyter Notebook Projects that we were using to run the topic modeling software.

Our initial, manual attempts looked something like this:

But this strategy quickly revealed itself to be unworkable for several reasons. On top of the mind-numbing, hundred-hour grind this would have entailed, another issue was simply finding the articles. The search feature for most websites was either not specific at all (a search of “humanities” tended to bring up “humanity,” “human,” or even “man”), gave a falsely low number of hits (one site that originally returned eleven hits for “humanities” ended up having over 200 articles that we collected!), or simply didn’t work at all. For real, workable data collection, we had to be able to find the web pages. And this is where the first iteration of web scraping came in.

The first version of the webscraper, which Sean Gilleran developed, was a “bloodhound” of sorts. Its job was to use Google to do a site-specific, exact-term query and produce a formatted list of results.
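For readers unfamiliar with this kind of query, the sketch below shows how one might be assembled in Python. It is not Sean’s code: the function names and the `example.com` domain are hypothetical, and we assume only Google’s standard `site:` operator, quotation marks for exact phrases, and the `q` and `start` URL parameters. Automated querying can also trip Google’s bot detection, so treat this purely as an illustration of the query format.

```python
from urllib.parse import urlencode


def build_site_query(domain: str, term: str) -> str:
    """Build a query limited to one site and one exact term."""
    # site: restricts results to a domain; quotation marks ask for an
    # exact match, so "humanities" should not also return "humanity".
    return f'site:{domain} "{term}"'


def build_search_url(domain: str, term: str, start: int = 0) -> str:
    """Build a results-page URL; start is the offset of the first result."""
    params = {"q": build_site_query(domain, term), "start": start}
    return "https://www.google.com/search?" + urlencode(params)


# example.com is a placeholder, not one of the project's sources.
print(build_site_query("example.com", "humanities"))
print(build_search_url("example.com", "humanities", start=10))
```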

If nothing else, we could follow the links and use our old method to ensure results. Still, it was not ideal. Meanwhile, Ray Steding built a tool that would take this list and scrape automatically from the web. The first iterations of the tool showed promising results, though we still had some trouble with Unicode errors and messy metadata. Many of the sites we collected from lacked maintenance, and the source code we found was a confused jumble of outdated HTML, JavaScript ad bots, and odd fonts. Still, the two tools were the closest that we had yet come to workable data in both human- and machine-readable format:

[Image: Ray’s web scraping tool]

Eventually, Ray’s scraping tool produced the data for a successful topic model for his own team, built from 1,622 scraped Alt-Right/Alt-Left files:

[Image: First topic model using Ray’s tool]

But this tool had a lot of “moving parts,” and the lack of programming experience on the part of members of the Latin American team, unfortunately, made it unlikely to be a viable option for us – at least in its current stage of development.

At the same time, Sihwa Park offered us a possible route. He explained the use of a Chrome extension scraping tool that he had been using (successfully) on several Korean websites. His tool required a lot of manual setup specific to each website, which was a bit finicky. However, with some training, it was usable by those who had no programming experience. Two issues arose with this tool. The first was that its output was in CSV format, as a single document for the entire site.

[Image: Sihwa’s Chrome extension]

This proved extremely difficult to convert into the JSON files that we needed for the Jupyter Notebooks. Additionally, the Chrome tool was entirely dependent on the site’s internal search – which, for the Korean sites, tended to be highly reliable. However, in our case, because we were working with many small and privately-owned sources, a given website’s search feature could not be trusted to produce valuable or consistent results.
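To show what the CSV-to-JSON conversion involves in the simplest case, here is a Python sketch that splits one site-wide CSV into one JSON file per article. The column names (`title`, `url`, `content`) and the file naming are assumptions for illustration; the extension’s real exports were evidently messier than this tidy sample, and the metadata fields that the Jupyter Notebooks expect are not listed in this post.

```python
import csv
import io
import json
from pathlib import Path


def split_csv(csv_text: str) -> list[dict]:
    """Turn each CSV row into its own dictionary, keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]


def write_json_files(records: list[dict], out_dir: str) -> list[Path]:
    """Write one JSON file per record and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, record in enumerate(records, start=1):
        # Hypothetical naming scheme: article_0001.json, article_0002.json, ...
        path = out / f"article_{i:04d}.json"
        path.write_text(
            json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        paths.append(path)
    return paths


# A tiny stand-in for a site-wide export; the columns are assumed.
sample = (
    "title,url,content\n"
    '"First piece",https://example.com/1,"Body text, with a comma"\n'
    '"Second piece",https://example.com/2,"More body text"\n'
)
records = split_csv(sample)
print(len(records), "articles found")
```

Calling `write_json_files(records, "json_out")` would then produce one file per article in a `json_out` folder.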

In the meantime, Sean Gilleran had been working on knitting his and Ray’s two tools together, to create a more centralized and full-featured web scraper – fondly called “Chomp” – that would be a “one stop shop” for what we needed. This scraper was designed to gather URLs from Google first, and then collect the articles one by one afterward.

This version ran from the command line, and required (very!) minimal programming experience. One of us (Rebecca Baker), despite having no coding abilities to speak of, did most of the active testing for this tool since all they could do was “paint by number,” following the exact directions that another one of us (Gilleran), very graciously and patiently, offered. The JSON output from Chomp was specifically designed to be compatible with the Jupyter Notebook metadata requirements.

This tool was not perfect. For example, after a certain number of automated searches Google would get suspicious and require the user to complete ever-more-complex CAPTCHAs in order to prove their humanity. On pages with heavy banners or pop-up ads, the tool would also get confused, run quite slowly, and occasionally collect no body text. However, some supervision and re-running were generally enough to get through these issues.
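One way to decide what needs re-running is to check the scraped records for missing or suspiciously short body text. The Python sketch below illustrates that check; it is not part of Chomp, and the `url` and `content` field names and the 20-word threshold are assumptions made for the example.

```python
def needs_rerun(record: dict, min_words: int = 20) -> bool:
    """Flag a scraped record whose body text is missing or very short."""
    # "content" is an assumed field name for the article body.
    body = (record.get("content") or "").strip()
    return len(body.split()) < min_words


def urls_to_rerun(records: list[dict], min_words: int = 20) -> list[str]:
    """Return the URLs of records that should be scraped again."""
    return [r["url"] for r in records if needs_rerun(r, min_words)]


# Made-up records: one complete, one with an empty body, one with no body.
scraped = [
    {"url": "https://example.com/a", "content": "word " * 30},
    {"url": "https://example.com/b", "content": ""},
    {"url": "https://example.com/c"},
]
print(urls_to_rerun(scraped))
```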

One salient example of the utility of this tool is our team’s collection from The Guyana Chronicle. This website was poorly maintained and extremely ad-heavy, making it a challenge for data collection by human or machine:

A search for “humanities” got about 140 hits, but these required clicking through several pages, many of which were not exact matches. This was a site that we had originally written off as “low value” because of the anticipated difficulty with collection. However, Chomp was able to collect from the entire site (albeit with some human supervision that required completing CAPTCHA puzzles and closing out of pop-ups in real time) in about an hour. This site ended up yielding about 240 hits for the humanities alone, and to date is one of our most valuable Latin American sources for the project.

Chomp has gone through several iterations since its initial launch, and we have yet to fully test the latest version – which includes a feature that allows it to scrape more quickly and efficiently from WordPress-based sites. Eventually, our plan is to have a version of the tool that runs in Docker as part of the overall WE1S Workflow Management System.

The unexpected outcome of this work, a functional, easy-to-use, and universal web scraper built by the WE1S research assistants, is an excellent tool for future researchers, both within the WE1S project and in other digital humanities projects. It greatly increased the representativeness of our corpus and the viability of subcorpora whose data are mostly or exclusively online – not just abroad, but also here in the United States. For example, many popular alternative or independent political news sources have no print counterparts. Our Latin America team – which, at the beginning of the summer, was unsure of its own survival – was able to use these web scrapers to collect hundreds of articles that otherwise would not have been represented.

WE1S is an initiative of 4Humanities.org.