Zenodo is the open-science repository for research data and related outputs created through the European OpenAIRE initiative and operated by CERN. Zenodo follows the FAIR (Findable, Accessible, Interoperable, Reusable) principles.

GitHub is a development platform (proprietary) commonly used by software and other project developers to evolve, maintain, and distribute their code and documentation.

WE1S practices principles of research sustainability and openness by depositing its data (datasets and "collections"), tools, and lab notes in the Zenodo open-science repository. We also distribute our code resources — for our computing "Workspace" (tools and workflow) and the Docker containerization of its computing environment — in GitHub repositories. Below are searchable and sortable tables of our Zenodo deposits and GitHub repos.

Glossary of terms useful for understanding WE1S deposits and repositories.

"Corpus / Corpora" -- The total set of texts (and data about them) that WE1S works with. (Compare Collection.)
"Datasets" -- Complete sets of data representing the WE1S corpus of texts that have been derived from the original texts but are not themselves readable as plain text. For example, data that the WE1S Workspace generates from texts include bags-of-words or term frequencies, ngram counts, etc.
"Collection" -- Derived data and visualization files representing a subset of WE1S's datasets and corpus (e.g. just top newspapers, or student newspapers, or only newspaper articles containing both the words humanities and science, etc.).
"Project production files" -- Software code (Jupyter notebooks and related tools), derived data, topic model files, and visualization files used to create models and visualizations of "collections".
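
As an illustration of the kinds of derived data named under "Datasets" above (bags-of-words or term frequencies, and ngram counts), here is a minimal Python sketch. It is not the WE1S Workspace's own code: the function names, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for this example, and the actual WE1S preprocessing may differ.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_frequencies(text):
    """Return a bag-of-words: each term mapped to the number of times it occurs."""
    return Counter(tokenize(text))


def ngram_counts(text, n=2):
    """Count every run of n consecutive tokens in the text."""
    tokens = tokenize(text)
    # zip over n staggered copies of the token list to form the n-grams
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Invented sample text, used only to show the shape of the output.
sample = "The humanities matter. The humanities and the sciences both matter."
bag = term_frequencies(sample)
bigrams = ngram_counts(sample, n=2)

print(bag.most_common(3))
print(bigrams[("the", "humanities")])
```

Data of this shape records how often words occur without preserving the readable text, which is the sense in which the table below calls such data "non-consumptive-use."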

WE1S Deposits in Zenodo: Collections

Deposit Title	Type	Brief Description	Open License	DOI
`humanities_keywords` Dataset	Dataset	The WE1S `humanities_keywords` dataset contains word-frequency and other non-consumptive-use data about 474,930 unique documents (no duplicates or close variants) mentioning the word "humanities" and other keywords related to the humanities in English-language news sources. Other keywords include "liberal arts," "the arts," "literature," "history," and "philosophy." The documents came from 850 U.S. and 437 international news sources with their associated blogs (including student newspapers) published mostly during 1989-2019. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.5068311
`comparison_not_humanities` Dataset	Dataset	The WE1S `comparison_not_humanities` dataset contains word-frequency and other non-consumptive-use data about 1,380,456 unique English-language news documents (no duplicate or close-variant documents) that do not contain the word "humanities." The documents came from mainstream U.S. news sources published during 2000-2019. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.5068699
`comparison_sciences` Dataset	Dataset	The WE1S `comparison_sciences` dataset contains word-frequency and other non-consumptive-use data about 553,699 unique English-language news documents (no duplicate or close-variant documents) that contain the words "science" or "sciences." The documents came from U.S. mainstream and student news sources published during 1977-2019 (though mostly from 1985-2019). WE1S researchers use this data to understand how public discourse about the humanities compares to public discourse about science. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.5068756
`twitter` Dataset	Dataset	The WE1S `twitter` dataset contains 5,024,756 tweets posted to Twitter between December 6th, 2013 and June 30th, 2019. The dataset is divided into subcollections based on the query terms "humanities", "liberal arts", "stem", "science", and "science-es" (that is, a query for the presence of either "science" or "sciences"). Subcollections can be identified in the dataset from the value of the `metapath` property. Collectively, the tweets represent the work of 1,886,739 distinct usernames. Each tweet's mentions, hashtags, and links are recorded, as well as the number of likes and retweets. Unlike most other WE1S datasets, the Twitter dataset does not contain extracted features. Instead, it contains the original text of the tweet (the value of the `content` property), along with a `tidy_tweet` property, which contains the text of the tweet after preprocessing. Tweets were preprocessed using a modified form of the WE1S preprocessing algorithm. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.5068253
`reddit` Dataset	Dataset	The WE1S `reddit` dataset contains 1,034,174 Reddit comments containing the terms "humanities", "liberal arts", or "the arts", downloaded by Raymond Steding using pushshift.io. Initially, comments posted between 2006 and 2018 were collected. Comments from 2019 were later added. This data has been processed using the WhatEvery1Says preprocessor, and, in addition to metadata downloaded from Reddit, sentiment scores generated with Textblob have been recorded. A description of the process at an early stage in the production of this dataset can be found in Steding's blog post "A Digital Humanities Study of Reddit Student Discourse about the Humanities." (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.5068267
`tvarchive` Dataset	Dataset	The WE1S `tvarchive` dataset contains word-frequency and other non-consumptive-use data about 1,205,844 English-language transcriptions of U.S. television news broadcasts. The documents were scraped from the Internet Archive's TV News Archive, which includes automatic captions of select U.S. news broadcasts since 2009. While the complete TV News Archive contains over 2.2 million transcripts, WE1S researchers were only able to collect about 1.2 million documents containing complete transcripts. The full TV News Archive includes transcripts from 33 networks and hundreds of shows. Unlike other WE1S datasets, the `tvarchive` dataset was not collected using keyword searches for specific terms (for example, documents containing the word "humanities"). (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.506826
Collection 1	Data / Topic model visualizations	U.S. News Media, c. 1989-2019 (WE1S core collection of articles mentioning "humanities") -- A collection of word-frequency and other data representing 82,324 unique articles mentioning "humanities" (no duplicate or close-variant documents) published mostly during 1989-2019 in 850 U.S. news sources and their associated blogs. (About 5,000 articles originate from earlier in the 1980s.) The word "humanities" occurs 134,948 times in the collection. WE1S and other researchers use this data to look for broad patterns and to help guide closer study. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4902187
Project Production Files for Collection 1	Data / Topic model visualizations	This is an archive of the WE1S project folder from which Collection 1 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5028254
Collection 2	Data / Topic model visualizations	U.S. News Media, c. 1989-2019 (articles mentioning "humanities" or "liberal arts") -- A collection of word-frequency and other data representing 94,816 unique articles mentioning "humanities" or "liberal arts" (no duplicate or close-variant documents) published mostly during 1989-2019 in 884 U.S. news sources and their associated blogs. (5,492 articles originate from earlier years going back to 1977.) WE1S and other researchers use this data to look for broad patterns and to help guide closer study. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4908882
Project Production Files for Collection 2	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 2 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5030554
Collection 3	Data / Topic model visualizations	U.S. News Media, c. 1989-2019 (articles mentioning "humanities" or "the arts") -- A collection of word-frequency and other data representing 108,207 unique articles mentioning "humanities" or "the arts" (no duplicate or close-variant documents) published mostly during 1989-2019 in 1,170 U.S. news sources and their associated blogs. (5,308 articles originate from earlier years going back to 1977.) WE1S and other researchers use this data to look for broad patterns and to help guide closer study. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4913688
Project Production Files for Collection 3	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 3 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5030871
Collection 4	Data / Topic model visualizations	U.S. Top Newspapers, 1977-2018 (articles mentioning "humanities") -- A collection of word-frequency and other data representing 28,375 unique articles mentioning "humanities" (no duplicate or close-variant documents) published from 1977 to 2018 in the 15 top-circulation U.S. news sources and their associated blogs. The word "humanities" occurs 39,852 times in 28,375 documents in the collection. WE1S and other researchers use this data to look for broad patterns and to help guide closer study. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4919794
Project Production Files for Collection 4	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 4 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5031979
Collection 5	Data / Topic model visualizations	U.S. Top Newspapers, 1977-2018 (articles mentioning "humanities" or "liberal arts") -- A collection of word-frequency and other data representing 30,323 unique articles mentioning "humanities" or "liberal arts" (no duplicate or close-variant documents) published from 1977 to 2018 in the 15 top-circulation U.S. news sources and their associated blogs. The word "humanities" occurs 39,890 times in 28,398 documents in the collection, while the phrase "liberal arts" occurs 2,888 times in 2,380 documents. WE1S and other researchers use this data to look for broad patterns and to help guide closer study. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4914736
Project Production Files for Collection 5	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 5 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5033192
Collection 14	Data / Topic model visualizations	U.S. Student Newspapers (articles mentioning "humanities" or "liberal arts") -- A collection of word-frequency and other data representing 21,182 unique articles mentioning "humanities" or "liberal arts" (no duplicates or close variants) published in 1998-2018 (primarily 2005-2018) in about 650 U.S. university and college student newspapers that are on the UWire news service. WE1S and other researchers use this data to look for broad patterns and help guide closer study. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4920178
Project Production Files for Collection 14	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 14 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5033590
Collection 15	Data / Topic model visualizations	Articles mentioning "humanities" or "literature" from ProQuest's Ethnic NewsWatch and GenderWatch -- A collection of word-frequency and other data representing 835 unique articles mentioning "humanities" or "literature" (no duplicate or close-variant documents) published mostly during 2016, 2018, and 2019 in 109 U.S. news sources gathered in ProQuest's Ethnic NewsWatch ("ethnic and minority press") and GenderWatch (sources gathered for "gender and women's studies, and gay, lesbian, bisexual, and transgender [GLBT] research"). WE1S and other researchers use this data to look for broad patterns and to help guide closer study. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4925152
Project Production Files for Collection 15	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 15 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5034610
Collection 18	Data / Topic model visualizations	U.S. Student Newspapers (articles mentioning "science(s)") -- A collection of word-frequency and other data representing 81,445 unique articles mentioning "science" or "sciences" from the UWire news service. Articles were published in 2000-2018 in 601 university and college student newspapers, mainly from the United States. There is a noticeable spike in the number of articles mentioning "science(s)" between 2017 and 2018, from 8,116 to 14,162. WE1S and other researchers can use this data to look for broad patterns and guide closer study. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4914288
Project Production Files for Collection 18	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 18 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5034718
Collection 20	Data / Topic model visualizations	U.S. Top Newspapers, 2000-2018 (sample of all articles) -- A collection of word-frequency and other data representing 29,183 unique articles (no duplicates or close variants) published during 2000-2018 in 15 top U.S. newspapers and their associated online blogs. WE1S and other researchers use this data to look for broad patterns and help guide closer study. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4927419
Project Production Files for Collection 20	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 20 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5035825
Collection 21	Data / Topic model visualizations	U.S. Top Newspapers, 2000-2018 (articles mentioning "humanities" or "science") -- A collection that contains data representing all 15,692 articles from its set of sources in these years mentioning "humanities" but only a sampling of the 388,691 articles mentioning "science" or "sciences" from those same sources and years. It downsamples "science(s)" articles (while maintaining the proportions of articles from particular sources and years) to achieve a 50/50 balance of articles related to the humanities and sciences. The purpose is to allow media discourse on the humanities to be studied alongside that on the sciences and not be buried so far down in the statistical pile that it cannot easily be seen in detail. Collection 21 is thus not a representation of the relative weight of discussion of the humanities and sciences but instead an aid to studying the fine features and structures of each. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4927745
Project Production Files for Collection 21	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 21 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5039471
Collection 28	Data / Topic model visualizations	Tweets containing keyword "humanities", c. 2014-2017 -- This collection of the WE1S Twitter corpus consists of 799,744 tweets containing the keyword "humanities" from authors who tweeted the term "humanities" more than once between Jan. 1, 2014, and Dec. 31, 2017. (See also C-29, which aggregates tweets by author.) (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4940253
Project Production Files for Collection 28	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 28 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5032911
Collection 29	Data / Topic model visualizations	Tweets containing keyword "humanities", c. 2014-2017 (tweets aggregated by author) -- This collection of the WE1S Twitter corpus consists of 799,744 tweets containing the keyword "humanities" from authors who tweeted the term "humanities" more than once between Jan. 1, 2014, and Dec. 31, 2017. This version of our Twitter corpus compiles tweets by each author into single "documents" for topic-modeling analysis, resulting in 132,562 total documents. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4940259
Project Production Files for Collection 29	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 29 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5033338
Collection 32	Data / Topic model visualizations	U.S. Top Newspapers (sample of all articles) -- A collection of word-frequency and other data representing 204,617 unique articles (no duplicates or close variants) published during 2012-2018 in 15 top U.S. newspapers and their associated online blogs. WE1S and other researchers use this data to look for broad patterns and help guide closer study. Included is data based on an approximately 1:40 proportional balance between articles mentioning "humanities" (about 5,000) and a sample of articles on everything else (about 200,000 more or less "random" documents found through searching on common English words). In essence, the collection is a sampled representation of "everything" in these sources for these years (limited by the fact that it is not feasible to know how many articles were actually published in these publications, to determine how completely they were collected in available database repositories, or to harvest everything from such databases.) (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4940326
Project Production Files for Collection 32	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 32 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5040629
Collection 33	Data / Topic model visualizations	Articles classified as being about the humanities or the sciences from U.S. top-circulating newspapers and student newspapers, c. 1998-2018 -- A collection of word-frequency and other data representing 13,214 unique articles (no duplicate or close-variant documents) classified as being about the humanities or science published from 1998-2018 in 507 U.S. top-circulating and student newspapers and their associated blogs. The collection includes 2,477 articles from U.S. top-circulating newspapers and 10,737 articles from student newspapers. Using supervised classification models, 2,869 articles in the collection have been classified as being about the humanities, and 10,345 articles in the collection have been classified as being about science. WE1S and other researchers use this data to look for broad patterns and to help guide closer study. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4940725
Project Production Files for Collection 33	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 33 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5042756
Collection 36	Data / Topic model visualizations	Articles containing the word "humanities" but that have been classified as not being about the humanities from U.S. top-circulating newspapers and student newspapers, c. 1998-2018 -- A collection of word-frequency and other data representing 27,362 unique articles (no duplicate or close-variant documents) that contain the word "humanities" but that have not been classified as being about the humanities published from 1998-2018 in 545 U.S. top-circulating and student newspapers and their associated blogs. WE1S and other researchers use this data to look for broad patterns and to help guide closer study. The collection includes 13,309 articles from U.S. top-circulating newspapers and 14,053 articles from student newspapers. Supervised classification models have classified these articles as not being about the humanities; this collection therefore helps WE1S understand what articles that contain the word "humanities" but that aren't about the humanities per se are like. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4948902
Project Production Files for Collection 36	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 36 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5043200
Collection 37	Data / Topic model visualizations	Articles containing the words "science" or "sciences" but that have been classified as not being about science, c. 1998-2018 -- A collection of word-frequency and other data representing 87,278 unique articles (no duplicate or close-variant documents) that contain the words "science" or "sciences" but that have not been classified as being about science published from 1998-2018 in 610 U.S. top-circulating and student newspapers and their associated blogs. WE1S and other researchers use this data to look for broad patterns and to help guide closer study. The collection includes 13,628 articles from U.S. top-circulating newspapers and 73,650 articles from student newspapers. Supervised classification models have classified these articles as not being about science; this collection therefore helps WE1S understand what articles that contain the words "science" or "sciences" but that aren't about science per se are like. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4958256
Project Production Files for Collection 37	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 37 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5043726
Collection 38	Data / Topic model visualizations	A collection of 124,340 Reddit comments longer than 225 words from 2006 to 2018 containing the terms "humanities," "liberal arts," or "the arts." WE1S and other researchers use this data to look for broad patterns and to help guide closer study. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4958695
Project Production Files for Collection 38	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 38 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5039647
Collection 39	Data / Topic model visualizations	A subset of WE1S's Collection 38 (C-38) Reddit collection -- Collection 39 is tailored to focus on student discourse about the humanities. Where C-38 includes Reddit comments longer than 225 words from 2006 to 2019 containing the terms humanities, liberal arts, or the arts, C-39 consists of 66,290 comments from that larger collection (about half the original number) that also contain at least one of the terms student, major, or college (including plurals and other forms). (Similar to C-38 is WE1S's Corpus-A, an earlier version of the same collection, but including only the years 2006-2018.) (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")	CC BY-SA 4.0	10.5281/zenodo.4959834
Project Production Files for Collection 39	Software code / Data / Topic model files / Visualization files	This is an archive of the WE1S project folder from which Collection 39 is derived. WE1S "projects" are folders that contain all notebooks, publicly available data, topic models and other analyses, and visualizations associated with the project. WE1S "collections" contain just the data and visualizations produced in project folders and may contain code updated for public presentation.	MIT	10.5281/zenodo.5044260
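
The description of Collection 21 above mentions downsampling "science(s)" articles while maintaining the proportions of articles from particular sources and years. The Python sketch below shows one common way such proportional (stratified) downsampling can be done. It is an illustration only and is not taken from the WE1S project files: the function name, the `source` and `year` field names, and the toy records are all assumptions.

```python
import random
from collections import defaultdict


def downsample_proportionally(articles, target, seed=0):
    """Sample roughly `target` articles, keeping each (source, year) group's share.

    `articles` is a list of dicts with hypothetical "source" and "year" keys.
    Because each group's quota is rounded, the result may differ slightly
    from `target` in size.
    """
    if not articles:
        return []
    groups = defaultdict(list)
    for article in articles:
        groups[(article["source"], article["year"])].append(article)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    fraction = target / len(articles)
    sample = []
    for key in sorted(groups):
        members = groups[key]
        quota = min(len(members), round(len(members) * fraction))
        sample.extend(rng.sample(members, quota))
    return sample


# Toy data: 60 articles from one source and year, 40 from another.
science = [{"source": "A", "year": 2000} for _ in range(60)]
science += [{"source": "B", "year": 2001} for _ in range(40)]

# Reduce 100 articles to about 10 while keeping the 60/40 split.
sample = downsample_proportionally(science, target=10)
print(len(sample))
```

To approximate a 50/50 balance of the kind described for Collection 21, `target` would be set to the number of "humanities" articles, so that the sampled "science(s)" set is about the same size.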

WE1S is an initiative of 4Humanities.org.