Newspaper Corpus Design and Representativeness Report

Report by Samina Ali

Created May 31, 2018


The purpose of this report is to provide a preliminary overview of newspaper corpus design in academic research. To narrow my search, I limited the scholarly articles reviewed to research published within the past twenty-five years. I made this decision for two reasons: (1) to concentrate on contemporary research that might relate to our project, and (2) in consideration of digital accessibility. Newspaper research older than twenty-five years often involved going into physical archives and/or contacting the publication directly for access, which might not be a practical option for WE1S corpus collection.

The primary goal of this report is to develop a sense of which news publications are considered most impactful, most accessible, and most useful for corpus design and research. After I reviewed over fifty scholarly articles on studies involving newspapers around the world, a few important strands stood out. One is the use of academic databases: overwhelmingly, scholars use LexisNexis and ProQuest to collect newspaper articles for analysis. The use of LexisNexis often determines which newspapers will be selected for research; some scholars cite LexisNexis “Major World Publications” as their source when compiling a publications list for their project (Good 2008). This report also sheds light on the fields within which newspaper corpora are being constructed. Very few projects within the humanities analyze contemporary articles from LexisNexis or ProQuest. Newspaper corpus construction in the humanities often relies heavily on physical historical archives, or on online historical archives such as Chronicling America, the Australian Newspaper Digitisation Program, ProQuest Historical Newspapers, and Gale 19th Century Newspapers.[1] Some researchers do not specify the databases used to retrieve historical articles at all.[2] The contemporary newspaper corpus design we do see falls within the disciplines of Linguistics, Media Studies, Communications, Public Health, Environmental Policy, and Political Science.

Another key point in this preliminary report is that most research is not explicitly interested in upholding a sense of “representativeness” in corpus construction. The few studies that do mention “representativeness” specifically (which I discuss below) often include one or two sentences in which they explain their attempt at a representative corpus, why their corpus falls short, and why their final selection of publications comes as close as possible to “representativeness” within the scope of their research.

LexisNexis or Bust

Much of the academic research I surveyed on newspaper articles relies on LexisNexis to determine which publications are most useful (and most available) for representing regions and countries around the world. In Michael Boyle et al.’s “Adherence to the Protest Paradigm: The Influence of Protest Goals and Tactics on News Coverage in U.S. and International Newspapers” (2012), for instance, publications are selected based on name recognition/prestige and “for geographical and ideological diversity,” as long as they are available on the LexisNexis database (Boyle et al. 132).[3] Similarly, Kim et al., in their research on “Coverage and Framing of Racial and Ethnic Health Disparities in US Newspapers, 1996–2005,” use LexisNexis to determine the top forty newspapers in the United States. Although they begin with the Audit Bureau of Circulations’ list of the top 100 daily US newspapers, they subsequently use LexisNexis to narrow their research to what was available electronically (Kim et al. 2010). The assumption seems to be that if it is a top publication, it must be on LexisNexis. This rings true at first glance for US and European publications; however, cross-referencing with Wikipedia’s “Top Newspapers by Circulation” shows that many “top” publications (based on circulation numbers) are actually not available on LexisNexis.

Still, many academic researchers do not clarify how they determine “major” publications in a region or country, simply stating that they chose a “top” publication such as The New York Times (US) or The Independent (UK) with no further explanation. One term used quite frequently when explaining publication selection is “quality.” Pertti Alasuutari et al. use “quality” to narrow down their selection of newspapers in Pakistan, Britain, and Finland (2013).[4] Hans-Jörg Trenz goes so far as to state that his research “does not provide a quality test” when selecting publications for analysis (294).[5] The word “quality” seems to be synonymous with “elite”: although not explicitly linked with circulation numbers, the designation denotes a combination of influence, brow, and reach. “Elite” also becomes synonymous with the English language in certain cases. Rebecca de Souza highlights this limitation when she writes that “the English newspapers selected for analysis represent the elite press in India, which caters primarily to the middle and upper-middle classes of society; this is a small but influential segment that controls domains of professional and political prestige in the country” (de Souza 259).

In addition to LexisNexis, ProQuest is a popular database for academic newspaper corpus construction. Scholars such as Maxwell Boykoff use both LexisNexis and ProQuest when compiling articles (2007). As with LexisNexis, newspapers are often selected through ProQuest because “they are considered the most influential newspapers in their respective countries based on circulation, web-metrics, and influence” (Ford and King 138).

Because I took a regional approach when searching for projects that design newspaper corpora, I was able to identify other notable academic databases beyond LexisNexis and ProQuest. These include the Southern African Migration Project (SAMP), the National Library of Canada, NewsLibrary, Canada Newsstand, Factiva, and SA Media of Sabinet. For those conducting research on regions whose newspapers were less accessible through academic databases, reaching out to the publication itself was often the only option. Devin M. Dotson, whose work investigates climate change in Chile, notes that “since the newspapers are not chronicled in LexisNexis or other academic research portals, the article search and gathering was conducted using each newspaper’s online archives” (Dotson 69).

Complications with a Global Approach

To take a global approach to newspaper corpus design, many researchers acknowledge that they should include a multiplicity of languages in their analysis.[6] However, a multilingual approach brings a new set of issues. For instance, how does one grapple with the diverse cultural meanings of a keyword when searching a publication? Barkmeyer et al. consider this limitation when searching for the term “sustainability” in English, French, Portuguese, Spanish, German, Dutch, Italian, and Danish, but do not provide a solution for this challenge (Barkmeyer et al. 2009). Another complication is how languages are selected for this approach: languages from Asia and Africa are usually excluded due to software issues, translator availability, and newspaper access, leading to an overrepresentation of colonial languages within this area of research.

Do We Really Care About Representativeness?

Overall, the majority of publications surveyed do not question the use of LexisNexis in their research. If it is a top-quality newspaper, the consensus is that it will be available in said database. Due to the overwhelming use of LexisNexis, what we see are the same publications used repeatedly in contemporary newspaper research. Representativeness therefore becomes “resolved” by claiming that these major publications produce the most newsworthy stories that are then shared/repeated within other publications. A potential downside of this approach is the solidification of highbrow viewpoints in how we create/perceive acceptable forms of knowledge. What perspectives are left out of the picture when we focus our research solely on “quality” publications? This is especially relevant to research that focuses on the perception of marginalized groups. Xigen Li and Xudong Liu, in their article “Framing and Coverage of Same-Sex Marriage in U.S. Newspapers” (2010), survey The New York Times, Los Angeles Times, Washington Post, San Francisco Chronicle, and The Boston Globe, “based on the assumption that they exert more influence on the public because the public views their performance as superior to their peers” (80). While this claim might certainly hold some weight, how would their analysis look different if publications catering to the LGBTQIA community were included? How would these “niche” publications potentially affect how researchers interpret their results? Similarly, Peter Manning’s article “Arabic and Muslim People in Sydney’s Daily Newspapers, Before and After September 11th” (2003) examines two major Australian newspapers, The Sydney Morning Herald and The Daily Telegraph, to understand contemporary manifestations of orientalism. The point of view of the actual people being discussed in these publications is not included in Manning’s research. Both academic articles contribute important work about how marginalized groups are constructed in the public imagination. However, the concern highlighted here is less about the quality of research, and more about the limited attention that “niche” publications, which highlight the voices of those often silenced or ignored in society, seem to garner.

Conclusion: Considerations for WE1S Moving Forward

As WE1S moves forward, there are some key issues that will need to be taken into consideration:

Multilingual approach: Currently, WE1S has compiled a list of publications in English and Spanish. There have been brief discussions about including French (especially for the Caribbean and for Canada). But can we truly consider our corpus “multilingual” with these three languages? Publications in additional languages will be a challenge for two reasons: the linguistic capabilities of our research team and our software. Further, the inclusion of more languages will complicate how we think about cultural context when searching for keywords like “humanities” and “liberal arts.” As we continue our corpus design, it will become increasingly imperative for us to acknowledge the ways in which we are reinforcing certain hierarchies in language and academic research with our exclusions.

LexisNexis and ProQuest: Due to accessibility, LexisNexis and ProQuest are utilized heavily for WE1S’s collection/downloading of articles. Although our list of potential databases is growing, a deeper understanding of the history, intent, and objectives of large databases like LexisNexis will be necessary (report forthcoming). We must consider how these databases reify the status of certain publications while limiting the availability of others.

Quality: When we use popular rankings to determine which influential newspapers should be included in our corpus, we should explore how and why the “quality” label is determined. This label has been created based on a mix of prestige/elite status, circulation numbers, and the publication’s history. One challenge is the reliability of such determinants: in countries like Iran or Cuba, for instance, where the media is heavily controlled by the state, how certain can we be of reports on circulation or influence? We may also face a similar issue when including online publications. Do we consider page hits in the same way we use circulation numbers to determine impact?

Overall Representativeness: When thinking about the question of “representativeness,” the best approach might be to have a very narrow and focused list of research questions to anchor our approach. WE1S has done an excellent job thus far in including a wide-ranging list of potential publications. We might also consider a report on the history of terms like “quality” for newspapers, and how this label has been distributed over the years.


[1] See “‘Fugitive Verses’: The Circulation of Poems in Nineteenth-Century American Newspapers” (Cordell 2017), “Reprinting, Circulation, and the Network Author in Antebellum Newspapers” (Cordell 2015), and “Retrieving a world of fiction: building an index – and an archive – of serialized novels in Australian newspapers, 1850–1914” (Bode 2015).

[2] Examples of this include “Provincializing Harlem: The ‘Negro Metropolis’ as Northern Frontier of a Connected Caribbean” (Putnam 2013) and “Model Blacks or ‘Ras the Exhorter’: A Quantitative Content Analysis of Black Newspapers’ Coverage of the First Wave of Afro-Caribbean Immigration to the United States” (Tillery 2012).

[3] Boyle, McLeod, and Armstrong attempt representativeness by including newspapers from Asia, North America, and the Middle East. Asia: South China Morning Post, China Daily, The Straits Times (Singapore), and The Nation (Thailand). North America: The Toronto Sun and The Toronto Star (Canada), The New York Times, The Washington Post, The Los Angeles Times, and The Philadelphia Inquirer (US). Middle East: Jerusalem Post (Israel), The Daily Star (Lebanon), and Gulf News (Dubai).

[4] “The Domestication of Foreign News: News Stories Related to the 2011 Egyptian Revolution in British, Finnish, and Pakistani Newspapers” (Alasuutari et al. 2013).

[5] “Media Coverage on European Governance: Exploring the European Public Sphere in National Quality Newspapers” (Trenz 2004). Newspapers included: Frankfurter Allgemeine Zeitung (FAZ), Süddeutsche Zeitung (SZ) (Germany), Le Monde (LM), Libération (Li) (France), The Guardian (Gu), The Times (Ti) (UK), La Repubblica (Re), La Stampa (Stp) (Italy), Die Presse (Pr), Der Standard (Sta) (Austria), and El País (Spain). A control sample of The New York Times (USA) is also included.

[6] One example is “Constructing Climate Change in the Americas: An Analysis of News Coverage in U.S. and South American Newspapers” (Zamith et al. 2012).