This class generates Tokenizers (documents) that follow Zipf's law. Words are represented as numbers.
A document may repeat each word k times, where k
is a random number from 1 to 9, so a word appears about 5 times on average. To generate documents with
an average length of L words, the Tokenizer is prepared in two steps: 1) L/5 unique words are chosen according to
Zipf's law; 2) each word is duplicated k times and the resulting words are shuffled.
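
The two steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the class's actual implementation: the function name `generate_document`, the `vocab_size` parameter, the uniform choice of k, and the 1/rank weighting used to approximate Zipf's law are all hypothetical.

```python
import random
from itertools import accumulate


def generate_document(avg_length, vocab_size=10_000, rng=None):
    """Return a shuffled list of word ids whose expected length is about avg_length."""
    rng = rng or random.Random()

    # Step 1: pick L/5 unique words; word ids are ranks 1..vocab_size.
    n_unique = max(1, avg_length // 5)
    if n_unique > vocab_size:
        raise ValueError("vocab_size must be at least avg_length // 5")

    ranks = range(1, vocab_size + 1)
    # Assumption: Zipf's law is approximated by weighting rank r with 1/r.
    cum_weights = list(accumulate(1.0 / r for r in ranks))

    unique_words = set()
    while len(unique_words) < n_unique:
        missing = n_unique - len(unique_words)
        unique_words.update(rng.choices(ranks, cum_weights=cum_weights, k=missing))

    # Step 2: repeat each word k times, then shuffle.
    # Assumption: k is uniform on 1..9, which gives a mean of 5.
    document = []
    for word in sorted(unique_words):
        document.extend([word] * rng.randint(1, 9))
    rng.shuffle(document)
    return document


if __name__ == "__main__":
    doc = generate_document(500, rng=random.Random(0))
    print(len(doc), len(set(doc)))
```

With `avg_length=500` the sketch selects 100 unique words, so the document length falls between 100 and 900 words and is about 500 in expectation.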