amazonreviews_e.tar.gz
all datasets, distributed bag of words, 40 epochs, 768 dimensions, Gensim 3.8.3

https://github.com/EML4U/Drift-detector-comparison/blob/e88de5f969e607823170c7daa9cde1c440c5529e/word2vec/doc2vec.py

max_year    = -1     # Max year for training
max_docs    = -1     # -1 to process all (for development)
print_texts = False  # Prints iterated texts (for development)

doc2vec_vector_size = 768  # Dimensionality of the feature vectors
doc2vec_min_count   = 2   # Ignores all words with total frequency lower than this
doc2vec_epochs      = 40  # Number of iterations (epochs) over the corpus. Defaults to 10 for Doc2Vec
doc2vec_dm          = 0   # Training algorithm: 1 = distributed memory (PV-DM), 0 = distributed bag of words (PV-DBOW)
doc2vec_seed        = -1  # -1, or int for reproducible results (under development)

2021-06-10 23:11:41.177173
Building vocabulary
Training model
Saved model file /home/eml4u/EML4U/data/amazon/amazonreviews_e.model
Doc2Vec(dbow,d768,n5,mc2,s0.001,t3)
Runtime: 157842.79472875595 seconds
= 2630 min = 43 hours 50 min
Gensim version: 3.8.3
EML4U experiment server
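
The settings above correspond to keyword arguments of Gensim's Doc2Vec class (vector_size, min_count, epochs, dm, seed). The following is a minimal sketch of that mapping, not the code from the linked script. The helper name doc2vec_kwargs is hypothetical, and treating seed = -1 as "no fixed seed" is an assumption about how the script handles that value.

```python
def doc2vec_kwargs(vector_size, min_count, epochs, dm, seed=-1):
    """Map the logged settings to keyword arguments for gensim's Doc2Vec."""
    if dm not in (0, 1):
        raise ValueError("dm must be 0 (PV-DBOW) or 1 (PV-DM)")
    kwargs = {
        "vector_size": vector_size,  # dimensionality of the document vectors
        "min_count": min_count,      # drop words rarer than this
        "epochs": epochs,            # passes over the corpus
        "dm": dm,                    # 1 = PV-DM, 0 = PV-DBOW
    }
    # Assumption: -1 means "no fixed seed", so the argument is left out
    if seed != -1:
        kwargs["seed"] = seed
    return kwargs


# Settings of the amazonreviews_e run
kwargs_e = doc2vec_kwargs(vector_size=768, min_count=2, epochs=40, dm=0)
print(kwargs_e)
```

The resulting dictionary could be passed on as Doc2Vec(**kwargs_e) after importing Doc2Vec from gensim.models.doc2vec. Note that a fixed seed alone is generally not enough for identical results when several worker threads are used.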


amazonreviews_d.tar.gz
all datasets, distributed bag of words, 40 epochs, 50 dimensions, Gensim 3.8.3

https://github.com/EML4U/Drift-detector-comparison/blob/882e5e1fd1da79708fa587bbd1161f1f3a0c3962/word2vec/paragraph-vector.py

max_year    = 9999   # Max year for training
max_docs    = -1     # -1 to process all (for development)
print_texts = False  # Prints iterated texts (for development)

doc2vec_vector_size = 50  # Dimensionality of the feature vectors
doc2vec_min_count   = 2   # Ignores all words with total frequency lower than this
doc2vec_epochs      = 40  # Number of iterations (epochs) over the corpus. Defaults to 10 for Doc2Vec
doc2vec_dm          = 0   # Training algorithm: 1 = distributed memory (PV-DM), 0 = distributed bag of words (PV-DBOW)
doc2vec_seed        = -1  # -1, or int for reproducible results (under development)

2021-05-21 18:56:33.734892
Doc2Vec(dbow,d50,n5,mc2,s0.001,t3)
Runtime: 116137.58712434769 seconds
= 1936 min = 32 hours 16 min
Gensim version: 3.8.3
EML4U experiment server
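
The runtime lines of the two server runs give seconds plus a conversion worked out by hand. As a cross-check, the conversion can be sketched as below. The function name format_runtime is hypothetical. It truncates instead of rounding, so it can differ by up to one minute from a figure rounded to the nearest minute.

```python
def format_runtime(seconds):
    """Split a runtime in seconds into whole hours, minutes and seconds."""
    total = int(seconds)  # drop the fractional part
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours} h {minutes} min {secs} s"


print(format_runtime(157842.79472875595))  # amazonreviews_e: 43 h 50 min 42 s
print(format_runtime(116137.58712434769))  # amazonreviews_d: 32 h 15 min 37 s
```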


amazonreviews_c.model
amazonreviews_c.model.dv.vectors.npy
up to year 2000, 10 epochs

https://github.com/EML4U/Drift-detector-comparison/tree/ad63c6c0ef10b32348d33a2d388a1491d3571b3c

max_year    = 2000   # Max year for training
max_docs    = -1     # -1 to process all (for development)
print_texts = False  # Prints iterated texts (for development)

doc2vec_vector_size = 50  # Dimensionality of the feature vectors
doc2vec_min_count   = 2   # Ignores all words with total frequency lower than this
doc2vec_epochs      = 10  # Number of iterations (epochs) over the corpus. Defaults to 10 for Doc2Vec
doc2vec_dm          = 1   # Training algorithm: 1 = distributed memory (PV-DM), 0 = distributed bag of words (PV-DBOW)
doc2vec_seed        = -1  # -1, or int for reproducible results (under development)

2021-05-19 15:35:30.675268
Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3)
Runtime: 2897.5691492557526 seconds
python3 paragraph-vector.py  3799,65s user 58,23s system 133% cpu 48:18,98 total
Notebook A.W.
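
The Doc2Vec(...) lines in these logs are the model's own one-line summary. As far as the Gensim output format is understood here, the first segment is the training mode (dbow, or dm/m for PV-DM with averaged context vectors) and the remaining segments are prefixed values: d = vector size, n = negative samples, w = window, mc = min_count, s = subsampling threshold, t = worker threads. The parser below is a rough sketch for comparing runs. The names parse_summary and NAMES are hypothetical, and it only handles the segment forms that appear in this file.

```python
import re

# Prefix -> parameter name (a reading of the summary format, see lead-in)
NAMES = {
    "d": "vector_size",
    "n": "negative",
    "w": "window",
    "mc": "min_count",
    "s": "sample",
    "t": "workers",
}


def parse_summary(summary):
    """Turn 'Doc2Vec(dbow,d768,...)' into a dict of named settings."""
    match = re.fullmatch(r"Doc2Vec\((.*)\)", summary.strip())
    if match is None:
        raise ValueError(f"not a Doc2Vec summary: {summary!r}")
    segments = match.group(1).split(",")
    result = {"mode": segments[0]}
    for segment in segments[1:]:
        m = re.fullmatch(r"(mc|[dnwst])(\d+(?:\.\d+)?)", segment)
        if m is None:
            # Keep anything unrecognised instead of guessing its meaning
            result.setdefault("other", []).append(segment)
            continue
        key, value = m.groups()
        result[NAMES[key]] = float(value) if "." in value else int(value)
    return result


print(parse_summary("Doc2Vec(dbow,d768,n5,mc2,s0.001,t3)"))
print(parse_summary("Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3)"))
```

In the logs above, the dbow summaries carry no w segment while the dm/m summaries do, so the parsed dictionaries differ in whether a window key is present.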


amazonreviews_b.model
up to year 1999, 40 epochs

https://github.com/EML4U/Drift-detector-comparison/blob/2e2d8f3539605eea07521410ea88b3c59c6b5471/word2vec/paragraph-vector.py

max_year    = 2000   # Max year for training
max_docs    = -1     # -1 to process all (for development)
print_texts = False  # Prints iterated texts (for development)

doc2vec_vector_size = 50  # Dimensionality of the feature vectors
doc2vec_min_count   = 2   # Ignores all words with total frequency lower than this
doc2vec_epochs      = 40  # Number of iterations (epochs) over the corpus. Defaults to 10 for Doc2Vec
doc2vec_seed        = -1  # -1, or int for reproducible results (dev)

2021-05-17 23:20:54.436872
Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3)
Runtime: 7149.856955528259 seconds
python3 paragraph-vector.py  7655,56s user 60,79s system 107% cpu 1:59:11,07 total
Notebook A.W.


amazonreviews_a.model
up to year 1999, 10 epochs

https://github.com/EML4U/Drift-detector-comparison/blob/d9e93bb03d655ab8170eb2283ffd7ecae1f1d9a4/word2vec/paragraph-vector.py

max_year    = 2000   # Max year for training
max_docs    = -1     # -1 to process all
print_texts = False  # Prints iterated texts (for development)

doc2vec_vector_size = 50  # Dimensionality of the feature vectors
doc2vec_min_count   = 2   # Ignores all words with total frequency lower than this
doc2vec_epochs      = 10  # Number of iterations (epochs) over the corpus. Defaults to 10 for Doc2Vec

2021-05-17 16:07:15.833986
Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3)
Runtime: 1994.4850919246674 seconds
Notebook A.W.