This class produces N-grams of words.
From a sentence with W words w1, w2, w3, ..., wW it produces W - N + 1 N-grams.
The N is determined by the Constants.WORDNGRAMS_LENGHT constant.
The filter treats all types of input tokens equally.
name - <NGRAM>
text - concatenation of the texts of the tokens belonging to the N-gram,
separated by spaces
weight - arithmetic mean of the weights of all tokens belonging to the N-gram
colS, lineS - taken from the first token belonging to the N-gram
colE, lineE - taken from the last token belonging to the N-gram
sentence, paragraph, sentenceInParagraph - taken from the first token;
they are equal for all tokens belonging to the N-gram
reloffset - taken from the first token belonging to the N-gram
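The sliding window and the text and weight attributes described above can be sketched as follows. This is only an illustration, not the filter's actual implementation: SketchToken and WordNgramSketch are hypothetical names, the real egothor token class is not used, and the position attributes (colS, lineS, and so on) are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token; NOT the egothor token class.
class SketchToken {
    final String text;
    final double weight;

    SketchToken(String text, double weight) {
        this.text = text;
        this.weight = weight;
    }
}

class WordNgramSketch {
    static final String NAME = "<NGRAM>";

    // Slides a window of n tokens over one sentence.
    // A sentence with W tokens yields W - n + 1 N-grams (none if W < n).
    static List<SketchToken> ngrams(List<SketchToken> sentence, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<SketchToken> out = new ArrayList<>();
        for (int start = 0; start + n <= sentence.size(); start++) {
            StringBuilder text = new StringBuilder();
            double sum = 0.0;
            for (int i = start; i < start + n; i++) {
                if (i > start) {
                    text.append(' '); // space separator between token texts
                }
                text.append(sentence.get(i).text);
                sum += sentence.get(i).weight;
            }
            // weight of the N-gram = arithmetic mean of its tokens' weights
            out.add(new SketchToken(text.toString(), sum / n));
        }
        return out;
    }

    public static void main(String[] args) {
        List<SketchToken> sentence = new ArrayList<>();
        for (String word : "the dog smelled like a skunk".split(" ")) {
            sentence.add(new SketchToken(word, 1.0));
        }
        for (SketchToken ngram : ngrams(sentence, 3)) {
            System.out.println(NAME + ngram.text);
        }
    }
}
```

In the sketch N is passed as a parameter for convenience; in the filter itself it comes from the constant named above.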
Sentence: "the dog smelled like a skunk"
N-grams (N = 3): "<NGRAM>the dog smelled", "<NGRAM>dog smelled like",
"<NGRAM>smelled like a", "<NGRAM>like a skunk"
Typically, the ParagraphPunctFilter filter should be applied
to the token sequence before this filter. If it is not, then
the whole document is treated as one sentence and N-grams are produced at the document level.
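The effect of sentence boundaries can be illustrated with a small sketch. It rests on assumptions: SentenceBoundarySketch is a hypothetical class, the per-token sentence index is modelled as a plain int array, and the indices are assumed to be non-decreasing. How the filter tracks sentences internally is not specified here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SentenceBoundarySketch {
    // words[i] belongs to the sentence with index sentence[i] (hypothetical model).
    // A window is emitted only when all of its tokens lie in the same sentence.
    static List<String> ngrams(String[] words, int[] sentence, int n) {
        if (n < 1 || words.length != sentence.length) {
            throw new IllegalArgumentException("invalid n or mismatched arrays");
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start + n <= words.length; start++) {
            int last = start + n - 1;
            // Indices are assumed non-decreasing, so comparing the first
            // and the last token of the window is sufficient.
            if (sentence[start] != sentence[last]) {
                continue; // window would cross a sentence boundary
            }
            out.add(String.join(" ", Arrays.copyOfRange(words, start, start + n)));
        }
        return out;
    }
}
```

With the words "the dog barked the cat ran", N = 3 and sentence indices {0, 0, 0, 1, 1, 1}, the sketch returns "the dog barked" and "the cat ran". If every token carries the same sentence index, as when no sentence segmentation has been applied, it also returns "dog barked the" and "barked the cat", which span the original sentence boundary.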
This filter should be the last filter to apply.
The filter has an inner context, so it cannot be shared in a filtering chain.
Fields inherited from class org.egothor.core.Filter