Code
1: import os
2: import sys
3: import scipy as sp
4: from sklearn.feature_extraction.text import CountVectorizer
5: data_Dir = "E:\\Machine Learning\\Orange\\Finding Related Post\\toy\\"
6: posts = [open(os.path.join(data_Dir,f)).read() for f in os.listdir(data_Dir)]
7: from sklearn.feature_extraction.text import CountVectorizer
8: vectorizer = CountVectorizer(min_df=1)
9: print vectorizer
10: X_train = vectorizer.fit_transform(posts)
11: print posts
12: print X_train
13: num_samples, num_features = X_train.shape
14: print("#samples: %d, #features: %d" % (num_samples, num_features))
15: print(vectorizer.get_feature_names())
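The listing above is written for Python 2 and an older scikit-learn release. For readers on Python 3, here is a minimal sketch of the same steps. It makes two assumptions that are not part of the original listing: the posts are given as an in-memory list instead of being read from the toy directory (only the text of 01.txt is known from this post), and the feature names are read with get_feature_names_out(), which newer scikit-learn versions provide, falling back to get_feature_names() on older ones.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: one in-memory post (the text of 01.txt) instead of the toy directory
posts = ["This is a toy post about machine learning. "
         "Actually, it contains not much interesting stuff."]

vectorizer = CountVectorizer(min_df=1)
X_train = vectorizer.fit_transform(posts)  # sparse matrix of word counts

num_samples, num_features = X_train.shape
print("#samples: %d, #features: %d" % (num_samples, num_features))

# Newer scikit-learn releases expose get_feature_names_out();
# older ones only have get_feature_names()
try:
    feature_names = list(vectorizer.get_feature_names_out())
except AttributeError:
    feature_names = list(vectorizer.get_feature_names())
print(feature_names)
```

With a single post the matrix has only one row, so the numbers differ from the results discussed in the rest of this post, which come from the original listing run on all five toy files.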
Result:
The result below, printed by line no 9, shows the parameter settings of the CountVectorizer. We can see min_df=1 in the result, which we assigned in line no 8. All other settings are the defaults.
CountVectorizer(analyzer=word, binary=False, charset=utf-8,
        charset_error=strict, dtype=<type 'long'>, input=content,
        lowercase=True, max_df=1.0, max_features=None, max_n=None,
        min_df=1, min_n=None, ngram_range=(1, 1), preprocessor=None,
        stop_words=None, strip_accents=None, token_pattern=(?u)\b\w\w+\b,
        tokenizer=None, vocabulary=None)
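Two of the settings printed above, lowercase=True and token_pattern=(?u)\b\w\w+\b, decide which words become features. The following is a minimal sketch of their effect using only Python's re module. It assumes the default word analyzer simply lowercases the text and then applies the token pattern; the helper name tokenize is hypothetical and not part of scikit-learn.

```python
import re

# token_pattern copied from the printed settings: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text, then pull out every token matching the pattern."""
    return TOKEN_PATTERN.findall(text.lower())

# Text of 01.txt, as quoted in this post
sample = ("This is a toy post about machine learning. "
          "Actually, it contains not much interesting stuff.")
tokens = tokenize(sample)
print(tokens)
```

Because the pattern needs at least two word characters, the one-letter word "a" is dropped, and "This" and "Actually" are counted in lowercase.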
From the results printed by lines no 11, 12 & 15, we can understand what the fit_transform function did.
The result of fit_transform, printed by line no 12, is:
(0, 0) 1
(0, 24) 1
(0, 23) 1
(0, 8) 1
(0, 9) 1
(0, 3) 1
(0, 10) 1
(0, 1) 1
(0, 12) 1
(0, 14) 1
(0, 11) 1
(0, 22) 1
(0, 15) 1
(0, 17) 1
(1, 18) 1
(1, 20) 1
(1, 7) 1
(1, 2) 1
(1, 5) 1
(2, 19) 1
(2, 13) 1
(2, 16) 1
(2, 7) 1
(2, 5) 1
(2, 6) 1
(3, 4) 1
(3, 7) 1
(3, 21) 1
(3, 5) 1
(4, 4) 3
(4, 7) 3
(4, 21) 3
(4, 5) 3
What does (0, 0) 1 actually mean?
(sample number, feature index) number of occurrences
The first 0 is the sample number, that is, the post file 01.txt. The second 0 is the feature index, which stands for u'about'; see the feature names printed by line no 15:
[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']
The 1 means the word “about” occurred only once in that sample.
As another example, take the second entry, (0, 24) 1.
0 is the sample number, which is 01.txt:
“This is a toy post about machine learning. Actually, it contains not much interesting stuff.”
24 is the feature index. Counting from 0 in the feature names list, index 24 is “toy” (“this” is index 23).
1 is the number of occurrences of the word “toy” in that sample.
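The lookup described above can be sketched in a few lines of Python. The feature names and the entries are copied from the results in this post; the helper name decode is hypothetical and only serves the illustration.

```python
# Feature names as printed by line no 15 (the list position is the feature index)
feature_names = ['about', 'actually', 'capabilities', 'contains', 'data',
                 'databases', 'images', 'imaging', 'interesting', 'is', 'it',
                 'learning', 'machine', 'most', 'much', 'not', 'permanently',
                 'post', 'provide', 'safe', 'storage', 'store', 'stuff',
                 'this', 'toy']

# A few (sample number, feature index, count) entries from the fit_transform result
entries = [(0, 0, 1), (0, 24, 1), (0, 23, 1), (4, 4, 3), (4, 7, 3)]

def decode(entry):
    """Replace the feature index of an entry with the word it stands for."""
    sample, feature_index, count = entry
    return sample, feature_names[feature_index], count

decoded = [decode(e) for e in entries]
for sample, word, count in decoded:
    print("sample %d: %r occurs %d time(s)" % (sample, word, count))
```

Reading the entries this way shows, for instance, that the words at indexes 4 and 7 ("data" and "imaging") each occur three times in sample 4.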