https://hobbitdata.informatik.uni-leipzig.de/EML4U/2021-02-10-Wikipedia-Texts/

This repository contains text files extracted from Wikipedia articles. Each text is available for two points in time, 2010 and 2020.

Category          Count
Living people     327,200
American films     11,020
British films       2,147
Indian films        3,596
Total             343,963

The texts are available in the respective archives:

wikipedia-texts-2010-2020-american-films.tar.gz
wikipedia-texts-2010-2020-british-films.tar.gz
wikipedia-texts-2010-2020-indian-films.tar.gz
wikipedia-texts-2010-2020-living-people.tar.gz

Additionally, the mappings between text file names and Wikipedia titles are available in:

indexes.tar.gz

Sources used:

https://archive.org/details/enwiki_20100408
https://dumps.wikimedia.org/enwiki/20201101/

Extraction tool used: https://github.com/EML4U/WikimediaDumpExtractor/

License: https://dumps.wikimedia.org/legal.html

Credits

Data Science Group (DICE) at Paderborn University

This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under grant no. 01IS19080B.
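
Example: inspecting an archive

The archives listed above can be inspected without unpacking them fully. The following is a minimal Python sketch, not part of the dataset tooling. It assumes that the archives sit directly under the repository URL and that an archive has already been downloaded to the working directory. The helper names archive_name, archive_url and list_files are hypothetical, and the internal layout of the archives is not described in this README, so the sketch only lists member names.

```python
import tarfile
from pathlib import Path

# Repository URL as given in this README.
BASE_URL = "https://hobbitdata.informatik.uni-leipzig.de/EML4U/2021-02-10-Wikipedia-Texts/"

# Category suffixes taken from the archive names listed in this README.
CATEGORIES = ("american-films", "british-films", "indian-films", "living-people")


def archive_name(category):
    """Return the archive file name for one category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return f"wikipedia-texts-2010-2020-{category}.tar.gz"


def archive_url(category):
    """Build a download URL (assumption: archives sit directly under BASE_URL)."""
    return BASE_URL + archive_name(category)


def list_files(archive_path, limit=10):
    """Return up to `limit` regular-file member names of a local .tar.gz archive."""
    names = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile():
                names.append(member.name)
                if len(names) >= limit:
                    break
    return names


if __name__ == "__main__":
    for category in CATEGORIES:
        print(archive_url(category))

    # Only inspect the archive if it has been downloaded already.
    local = Path(archive_name("british-films"))
    if local.exists():
        print(list_files(local))
```

The British films archive is used here only because it is the smallest category. The same calls apply to the other archives and, with a plain file name instead of archive_name, to indexes.tar.gz.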