# Distributed memory (PV-DM) Doc2Vec embeddings of Twitter election data

Note: with `doc2vec_dm = 1`, gensim trains the distributed memory (PV-DM) variant of Doc2Vec, not distributed bag of words (PV-DBOW); the `dm/m` prefix in the model strings logged below confirms this.

## Used code

https://github.com/EML4U/Drift-detector-comparison/blob/dec5db45a2f3870188910bde7413f4bb69219fc3/word2vec/doc2vec_twitter_election.py

## Usage example

Load a trained model and infer a document vector for new, preprocessed text (a sketch of comparing two inferred vectors follows at the end of this page):

```python
import gensim
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("twitter_election_768.model")

# Tokenize and lowercase the text, then infer a document vector for it
vector = model.infer_vector(gensim.utils.simple_preprocess(
    "Joe Biden is the 46th president of the United States"))
```

## Config 50

```python
doc2vec_vector_size = 50  # Dimensionality of the feature vectors
doc2vec_min_count = 2     # Ignores all words with total frequency lower than this
doc2vec_epochs = 40       # Number of iterations (epochs) over the corpus; defaults to 10 for Doc2Vec
doc2vec_dm = 1            # Training algorithm: 1 = distributed memory (PV-DM), 0 = distributed bag of words (PV-DBOW)
num_of_tweets = -1        # -1 processes all tweets; set a smaller value during development
```

Log:

```
python3 doc2vec_twitter.py
Loading twitter file /home/eml4u/EML4U/data/twitter-election/election_dataset_raw.pickle
all_tweets 1201235
Creating tagged documents
Building vocabulary
Training model
Saving model file /home/eml4u/EML4U/data/amazon/twitter_election_50.model
Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3)
Runtime: 92.04042950868606 minutes
Gensim version: 3.8.3
```

The model string `Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3)` decodes as: PV-DM using the mean of context vectors (`dm/m`), 50-dimensional vectors (`d50`), 5 negative samples (`n5`), window size 5 (`w5`), min_count 2 (`mc2`), downsampling threshold 0.001 (`s0.001`), and 3 worker threads (`t3`).

## Config 768

Identical to Config 50 except for the vector dimensionality:

```python
doc2vec_vector_size = 768  # Dimensionality of the feature vectors
doc2vec_min_count = 2     # Ignores all words with total frequency lower than this
doc2vec_epochs = 40       # Number of iterations (epochs) over the corpus; defaults to 10 for Doc2Vec
doc2vec_dm = 1            # Training algorithm: 1 = distributed memory (PV-DM), 0 = distributed bag of words (PV-DBOW)
num_of_tweets = -1        # -1 processes all tweets; set a smaller value during development
```

Log:

```
python3 doc2vec_twitter.py
Loading twitter file /home/eml4u/EML4U/data/twitter-election/election_dataset_raw.pickle
all_tweets 1201235
Creating tagged documents
Building vocabulary
Training model
Saving model file /home/eml4u/EML4U/data/amazon/twitter_election_768.model
Doc2Vec(dm/m,d768,n5,w5,mc2,s0.001,t3)
Runtime: 89.1883107582728 minutes
Gensim version: 3.8.3
```

A sketch of the training pipeline behind these logs follows below.
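The log lines above (creating tagged documents, building vocabulary, training, saving) correspond to the standard gensim 3.8.3 Doc2Vec pipeline. Below is a minimal sketch of how the config parameters feed into it; the pickle layout, file names, and variable names are assumptions for illustration, and the actual code is the `doc2vec_twitter_election.py` script linked under "Used code".

```python
import pickle

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

doc2vec_vector_size = 768  # 50 for the smaller model
doc2vec_min_count = 2
doc2vec_epochs = 40
doc2vec_dm = 1  # 1 = distributed memory (PV-DM)

# Loading twitter file (assumption: the pickle holds an iterable of raw tweet texts)
with open("election_dataset_raw.pickle", "rb") as f:
    all_tweets = pickle.load(f)

# Creating tagged documents: each tweet is tokenized and tagged with a unique id
tagged_docs = [
    TaggedDocument(gensim.utils.simple_preprocess(text), [i])
    for i, text in enumerate(all_tweets)
]

model = Doc2Vec(
    vector_size=doc2vec_vector_size,
    min_count=doc2vec_min_count,
    epochs=doc2vec_epochs,
    dm=doc2vec_dm,
)

# Building vocabulary
model.build_vocab(tagged_docs)

# Training model
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)

# Saving model file
model.save("twitter_election_768.model")
```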
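One caveat for the usage example: `infer_vector` starts from a random initialization on every call, so repeated calls on the same text return slightly different vectors; passing a higher `epochs` value to `infer_vector` makes the result more stable. Below is a minimal sketch of comparing two inferred vectors by cosine similarity; the two example sentences are made up for illustration.

```python
import gensim
import numpy as np
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("twitter_election_768.model")

def embed(text, epochs=40):
    # Same preprocessing as in the usage example above
    return model.infer_vector(gensim.utils.simple_preprocess(text), epochs=epochs)

a = embed("Joe Biden is the 46th president of the United States")
b = embed("Biden won the 2020 United States presidential election")

# Cosine similarity between the two inferred document vectors
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```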