
CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=np.int64)

Convert a collection of text documents to a matrix of token counts.

This implementation produces a sparse representation of the counts.

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection, then the number of features will be equal to the vocabulary size found by analyzing the data.

Parameters:

analyzer : {'word', 'char', 'char_wb'} or callable, default='word'
Whether the feature should be made of word n-grams or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

If a callable is passed, it is used to extract the sequence of features out of the raw, unprocessed input.

Since v0.21, if input is filename or file, the data is first read from the file and then passed to the given callable analyzer.

max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
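As a rough illustration of how the document-frequency thresholds min_df and max_df prune the vocabulary, here is a minimal sketch. The three-sentence corpus and the variable names are made up for this example; only CountVectorizer, fit_transform, fit and vocabulary_ come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Default settings: no pruning, one column per distinct token.
full = CountVectorizer()
counts = full.fit_transform(docs)  # sparse matrix, one row per document
full_vocab = sorted(full.vocabulary_)

# min_df=2 is an integer, so it is an absolute count:
# terms appearing in fewer than 2 documents are dropped.
# max_df=0.9 is a float, so it is a proportion:
# terms appearing in more than 90% of the documents are dropped.
pruned = CountVectorizer(min_df=2, max_df=0.9)
pruned.fit(docs)
pruned_vocab = sorted(pruned.vocabulary_)

print(counts.shape)
print(full_vocab)
print(pruned_vocab)
```

In this corpus "the" occurs in every document, so the max_df threshold removes it, while "mat", "log" and "chased" each occur in a single document and fall below min_df.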

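The difference between the 'char' and 'char_wb' analyzers, and the use of a callable analyzer, may be easier to see on a tiny input. The sketch below assumes a two-word document and character trigrams; comma_fields is a hypothetical helper written for this example, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["ab cd"]

# 'char' slides a window over the whole string,
# so an n-gram can span the space between two words.
char_vec = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit(docs)
char_grams = sorted(char_vec.vocabulary_)

# 'char_wb' builds n-grams word by word, padding each word
# with a space on both sides, so no n-gram crosses a word boundary.
wb_vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit(docs)
wb_grams = sorted(wb_vec.vocabulary_)


def comma_fields(doc):
    """Hypothetical callable analyzer: one feature per comma-separated field."""
    return [field.strip() for field in doc.split(",")]


# A callable analyzer receives each raw document and returns its features.
callable_vec = CountVectorizer(analyzer=comma_fields).fit(["red, green", "green, blue"])
callable_terms = sorted(callable_vec.vocabulary_)

print(char_grams)
print(wb_grams)
print(callable_terms)
```

With 'char' the trigram "b c" straddles the two words; with 'char_wb' it does not appear, and the edge trigrams carry the padding space instead.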