From the book “Building Machine Learning Systems with Python”, I practiced the code below.
Code:
1: import os
2: import sys
3: import scipy as sp
4: from sklearn.feature_extraction.text import CountVectorizer
5: data_Dir = "E:\\Machine Learning\\Orange\\Finding Related Post\\toy\\"
6: posts = [open(os.path.join(data_Dir,f)).read() for f in os.listdir(data_Dir)]
7: from sklearn.feature_extraction.text import CountVectorizer
8: vectorizer = CountVectorizer(min_df=1)
9: print vectorizer
10: X_train = vectorizer.fit_transform(posts)
11: print posts
12: print X_train
13: num_samples, num_features = X_train.shape
14: print("#samples: %d, #features: %d" % (num_samples, num_features))
15: print(vectorizer.get_feature_names())
Result:
The output below shows the parameter settings of the CountVectorizer. We can see min_df=1, which we set on line 8. All other settings are the defaults.
CountVectorizer(analyzer=word, binary=False, charset=utf-8,
charset_error=strict, dtype=<type 'long'>, input=content,
lowercase=True, max_df=1.0, max_features=None, max_n=None,
min_df=1, min_n=None, ngram_range=(1, 1), preprocessor=None,
stop_words=None, strip_accents=None, token_pattern=(?u)\b\w\w+\b,
tokenizer=None, vocabulary=None)
From the output of lines 11, 12 and 15, we can understand what the fit_transform function did.

The result of fit_transform is:
(0, 0) 1
(0, 24) 1
(0, 23) 1
(0, 8) 1
(0, 9) 1
(0, 3) 1
(0, 10) 1
(0, 1) 1
(0, 12) 1
(0, 14) 1
(0, 11) 1
(0, 22) 1
(0, 15) 1
(0, 17) 1
(1, 18) 1
(1, 20) 1
(1, 7) 1
(1, 2) 1
(1, 5) 1
(2, 19) 1
(2, 13) 1
(2, 16) 1
(2, 7) 1
(2, 5) 1
(2, 6) 1
(3, 4) 1
(3, 7) 1
(3, 21) 1
(3, 5) 1
(4, 4) 3
(4, 7) 3
(4, 21) 3
(4, 5) 3
What does (0, 0) 1 actually mean?
(sample number, feature index) number of occurrences
The first 0 is the sample number, that is, the post file 01.txt. The second 0 is the feature index, which points to u'about' in the feature names printed by line 15:
[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']
The final 1 means the word “about” occurred only once in that sample.
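The lookup described above can be sketched in a few lines of plain Python. This is only an illustration: the feature names are copied from the printed list, the three entries are taken from the printed result of fit_transform, and the helper name describe_entry is my own, not part of scikit-learn.

```python
# Feature names copied from the get_feature_names() output (index = position).
feature_names = [
    'about', 'actually', 'capabilities', 'contains', 'data', 'databases',
    'images', 'imaging', 'interesting', 'is', 'it', 'learning', 'machine',
    'most', 'much', 'not', 'permanently', 'post', 'provide', 'safe',
    'storage', 'store', 'stuff', 'this', 'toy',
]


def describe_entry(sample, feature, count):
    """Hypothetical helper: turn one (sample, feature) count into a sentence."""
    word = feature_names[feature]
    return "sample %d: '%s' occurs %d time(s)" % (sample, word, count)


# A few entries taken from the printed result of fit_transform.
entries = [(0, 0, 1), (0, 24, 1), (4, 4, 3)]
for sample, feature, count in entries:
    print(describe_entry(sample, feature, count))
```

Each entry is read the same way: the first number picks the post, the second number is a position in the feature name list, and the last number is the count.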
As another example, take the second entry, (0, 24) 1.
0 is the sample number, which is 01.txt:
“This is a toy post about machine learning. Actually, it contains not much interesting stuff.”
24 is the feature index, which points to the feature name “toy” (“this” is at index 23).
1 is the number of occurrences of the word “toy”.