# Distributed memory (PV-DM) Doc2Vec embeddings of Twitter election data

Note: with `doc2vec_dm = 1`, gensim trains the distributed memory (PV-DM) variant of Doc2Vec, not distributed bag of words (PV-DBOW); the `dm/m` prefix in the model strings logged below confirms this.

## Used code

https://github.com/EML4U/Drift-detector-comparison/blob/dec5db45a2f3870188910bde7413f4bb69219fc3/word2vec/doc2vec_twitter_election.py

## Usage example

Load a trained model and infer a document vector for new, preprocessed text (a sketch of comparing two inferred vectors follows at the end of this page):

```python
import gensim
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("twitter_election_768.model")

# Tokenize and lowercase the text, then infer a document vector for it
vector = model.infer_vector(gensim.utils.simple_preprocess(
    "Joe Biden is the 46th president of the United States"))
```

## Config 50

```python
doc2vec_vector_size = 50  # Dimensionality of the feature vectors
doc2vec_min_count = 2     # Ignores all words with total frequency lower than this
doc2vec_epochs = 40       # Number of iterations (epochs) over the corpus; defaults to 10 for Doc2Vec
doc2vec_dm = 1            # Training algorithm: 1 = distributed memory (PV-DM), 0 = distributed bag of words (PV-DBOW)
num_of_tweets = -1        # -1 processes all tweets; set a smaller value during development
```

Log:

```
python3 doc2vec_twitter.py
Loading twitter file /home/eml4u/EML4U/data/twitter-election/election_dataset_raw.pickle
all_tweets 1201235
Creating tagged documents
Building vocabulary
Training model
Saving model file /home/eml4u/EML4U/data/amazon/twitter_election_50.model
Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3)
Runtime: 92.04042950868606 minutes
Gensim version: 3.8.3
```

The model string `Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3)` decodes as: PV-DM using the mean of context vectors (`dm/m`), 50-dimensional vectors (`d50`), 5 negative samples (`n5`), window size 5 (`w5`), min_count 2 (`mc2`), downsampling threshold 0.001 (`s0.001`), and 3 worker threads (`t3`).

## Config 768

Identical to Config 50 except for the vector dimensionality:

```python
doc2vec_vector_size = 768  # Dimensionality of the feature vectors
doc2vec_min_count = 2     # Ignores all words with total frequency lower than this
doc2vec_epochs = 40       # Number of iterations (epochs) over the corpus; defaults to 10 for Doc2Vec
doc2vec_dm = 1            # Training algorithm: 1 = distributed memory (PV-DM), 0 = distributed bag of words (PV-DBOW)
num_of_tweets = -1        # -1 processes all tweets; set a smaller value during development
```

Log:

```
python3 doc2vec_twitter.py
Loading twitter file /home/eml4u/EML4U/data/twitter-election/election_dataset_raw.pickle
all_tweets 1201235
Creating tagged documents
Building vocabulary
Training model
Saving model file /home/eml4u/EML4U/data/amazon/twitter_election_768.model
Doc2Vec(dm/m,d768,n5,w5,mc2,s0.001,t3)
Runtime: 89.1883107582728 minutes
Gensim version: 3.8.3
```

A sketch of the training pipeline behind these logs follows below.
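The log lines above (creating tagged documents, building vocabulary, training, saving) correspond to the standard gensim 3.8.3 Doc2Vec pipeline. Below is a minimal sketch of how the config parameters feed into it; the pickle layout, file names, and variable names are assumptions for illustration, and the actual code is the `doc2vec_twitter_election.py` script linked under "Used code".

```python
import pickle

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

doc2vec_vector_size = 768  # 50 for the smaller model
doc2vec_min_count = 2
doc2vec_epochs = 40
doc2vec_dm = 1  # 1 = distributed memory (PV-DM)

# Loading twitter file (assumption: the pickle holds an iterable of raw tweet texts)
with open("election_dataset_raw.pickle", "rb") as f:
    all_tweets = pickle.load(f)

# Creating tagged documents: each tweet is tokenized and tagged with a unique id
tagged_docs = [
    TaggedDocument(gensim.utils.simple_preprocess(text), [i])
    for i, text in enumerate(all_tweets)
]

model = Doc2Vec(
    vector_size=doc2vec_vector_size,
    min_count=doc2vec_min_count,
    epochs=doc2vec_epochs,
    dm=doc2vec_dm,
)

# Building vocabulary
model.build_vocab(tagged_docs)

# Training model
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)

# Saving model file
model.save("twitter_election_768.model")
```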
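One caveat for the usage example: `infer_vector` starts from a random initialization on every call, so repeated calls on the same text return slightly different vectors; passing a higher `epochs` value to `infer_vector` makes the result more stable. Below is a minimal sketch of comparing two inferred vectors by cosine similarity; the two example sentences are made up for illustration.

```python
import gensim
import numpy as np
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("twitter_election_768.model")

def embed(text, epochs=40):
    # Same preprocessing as in the usage example above
    return model.infer_vector(gensim.utils.simple_preprocess(text), epochs=epochs)

a = embed("Joe Biden is the 46th president of the United States")
b = embed("Biden won the 2020 United States presidential election")

# Cosine similarity between the two inferred document vectors
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```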