5.2M Feb 15 07:39 deduplicated.pickle.bz2

Generated with: filtering-deduplication.ipynb

Contains 1,727,821 review numbers: 489,009 non-duplicate reviews and 1,238,812 deduplicated reviews with same year and stars.

| Stars | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 26 | 597 | 2512 | 3015 | 3597 | 3689 | 6643 | 10413 | 9943 | 11125 | 12661 | 14150 | 15822 | 19132 | 21570 | 134897 |
| 2 | nan | 30 | 437 | 2162 | 2541 | 3048 | 3364 | 4880 | 7053 | 7050 | 8067 | 8417 | 8846 | 9536 | 11363 | 12041 | 88835 |
| 3 | 1 | 65 | 880 | 3932 | 4562 | 5064 | 5860 | 8592 | 11420 | 11322 | 13932 | 13944 | 14835 | 14925 | 16796 | 17593 | 143723 |
| 4 | 4 | 146 | 2166 | 9832 | 11216 | 12257 | 13466 | 19364 | 25958 | 27917 | 37664 | 36838 | 37089 | 36408 | 40392 | 40528 | 351245 |
| 5 | 14 | 561 | 7266 | 25204 | 26294 | 29576 | 32416 | 46222 | 64445 | 71619 | 108952 | 104455 | 112998 | 113957 | 130571 | 134571 | 1009121 |
| Sum | 21 | 828 | 11346 | 43642 | 47628 | 53542 | 58795 | 85701 | 119289 | 127851 | 179740 | 176315 | 187918 | 190648 | 218254 | 226303 | 1727821 |

Note: The contained review-numbers are the line numbers of the original file movies.txt.gz. They can be mapped to the raw-ids of R.F. using: https://github.com/EML4U/ExplainingDriftTextEmbeddings/blob/2820a1f6825b763ca72b1ca2272e2787af717b90/access/amazon_pickle_reader.py#L68
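
The file can probably be opened with Python's standard library alone. The following is a minimal sketch, assuming that `deduplicated.pickle.bz2` is a single pickled object compressed with bz2, as its extension suggests. The helper name `load_bz2_pickle` is hypothetical and not part of the repository, and the structure of the loaded object is not documented here, so inspect it after loading.

```python
import bz2
import pickle


def load_bz2_pickle(path):
    """Load one pickled object from a bz2-compressed file.

    Hypothetical helper for illustration; not taken from the repository.
    Only unpickle files from a source you trust.
    """
    with bz2.open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Assumption: the file sits in the current working directory.
    review_numbers = load_bz2_pickle("deduplicated.pickle.bz2")
    # The container type is not stated in this document, so check it first.
    print(type(review_numbers))
```

If the assumption about the file format holds, the loaded object should hold the 1,727,821 review numbers described above.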