Machine Learning With Orange: Counting words - Clustering–Find Related post

I practiced the code below, taken from the book “Building Machine Learning Systems with Python”.

Code

import os
import sys
import scipy as sp
from sklearn.feature_extraction.text import CountVectorizer
data_Dir = "E:\\Machine Learning\\Orange\\Finding Related Post\\toy\\"
posts = [open(os.path.join(data_Dir,f)).read() for f in os.listdir(data_Dir)]
# min_df=1 keeps every word that appears in at least one post
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
X_train = vectorizer.fit_transform(posts)
print(posts)
print(X_train)
num_samples, num_features = X_train.shape
print("#samples: %d, #features: %d" % (num_samples, num_features))
print(vectorizer.get_feature_names())

Result:

The first output below shows the parameter settings of the CountVectorizer. We can see min_df=1 in the result, which we assigned on line 8 of the code. All other settings are the defaults.

CountVectorizer(analyzer=word, binary=False, charset=utf-8,
        charset_error=strict, dtype=<type 'long'>, input=content,
        lowercase=True, max_df=1.0, max_features=None, max_n=None,
        min_df=1, min_n=None, ngram_range=(1, 1), preprocessor=None,
        stop_words=None, strip_accents=None, token_pattern=(?u)\b\w\w+\b,
        tokenizer=None, vocabulary=None)
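Two of these settings, lowercase=True and token_pattern=(?u)\b\w\w+\b, decide which words end up as features. The snippet below is only a rough sketch of that behaviour using the standard re module, not the library's own code; the helper name tokenize is made up for illustration. It is applied to the text of 01.txt quoted later in this post.

```python
import re

# Token pattern copied from the printed settings: runs of 2 or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Hypothetical helper: lowercase the text (as lowercase=True would),
    then pull out every token that matches the pattern."""
    return TOKEN_PATTERN.findall(text.lower())

post = ("This is a toy post about machine learning. "
        "Actually, it contains not much interesting stuff.")

tokens = tokenize(post)
print(tokens)
print(len(tokens))
```

Because the pattern needs at least two word characters, the single-letter word "a" is dropped, which is why it does not appear among the feature names.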

From the output printed by lines 11, 12 and 15 of the code, we can see what the fit_transform function did.

The result of fit_transform is:

  (0, 0)    1
  (0, 24)    1
  (0, 23)    1
  (0, 8)    1
  (0, 9)    1
  (0, 3)    1
  (0, 10)    1
  (0, 1)    1
  (0, 12)    1
  (0, 14)    1
  (0, 11)    1
  (0, 22)    1
  (0, 15)    1
  (0, 17)    1
  (1, 18)    1
  (1, 20)    1
  (1, 7)    1
  (1, 2)    1
  (1, 5)    1
  (2, 19)    1
  (2, 13)    1
  (2, 16)    1
  (2, 7)    1
  (2, 5)    1
  (2, 6)    1
  (3, 4)    1
  (3, 7)    1
  (3, 21)    1
  (3, 5)    1
  (4, 4)    3
  (4, 7)    3
  (4, 21)    3
  (4, 5)    3

What does (0, 0) 1 actually mean?

(sample number, feature index) number of occurrences

The first 0 is the sample number, that is, the post file 01.txt. The second 0 is the index of the feature u'about' in the feature names printed by line 15 of the code:

[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']

The final 1 means the word “about” occurred only once in that sample.
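The same reading works for every entry in the matrix. As a small sketch, the snippet below copies the feature names and a few entries from the printed output above and turns each feature index back into its word. The helper name decode and the choice of entries are only for illustration.

```python
# Feature names copied from the printed output of get_feature_names().
feature_names = [
    'about', 'actually', 'capabilities', 'contains', 'data', 'databases',
    'images', 'imaging', 'interesting', 'is', 'it', 'learning', 'machine',
    'most', 'much', 'not', 'permanently', 'post', 'provide', 'safe',
    'storage', 'store', 'stuff', 'this', 'toy',
]

# A few (sample, feature index, count) entries copied from the printed matrix.
entries = [(0, 0, 1), (4, 4, 3), (4, 7, 3), (4, 21, 3), (4, 5, 3)]

def decode(entries, names):
    """Hypothetical helper: replace each feature index with its word."""
    return [(sample, names[col], count) for sample, col, count in entries]

decoded = decode(entries, feature_names)
for sample, word, count in decoded:
    print("sample %d: %r occurs %d time(s)" % (sample, word, count))
```

Read this way, the last four rows of the matrix say that sample 4 contains each of the words "data", "imaging", "store" and "databases" three times.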

As another example, take the second entry, (0, 24) 1.

0 is the sample number, which is 01.txt:

“This is a toy post about machine learning. Actually, it contains not much interesting stuff.”

24 is the index of the feature “toy”, the last entry in the feature names list (the word “this” has index 23).

1 is the number of occurrences of the word “toy” in that sample.
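To check this reading, the whole row for sample 0 can be rebuilt by hand. The sketch below assumes the vocabulary is simply the printed feature names in order and uses plain Python instead of scikit-learn, so it only imitates what fit_transform produces for 01.txt.

```python
import re
from collections import Counter

# Feature names copied from the printed output; position in the list = feature index.
feature_names = [
    'about', 'actually', 'capabilities', 'contains', 'data', 'databases',
    'images', 'imaging', 'interesting', 'is', 'it', 'learning', 'machine',
    'most', 'much', 'not', 'permanently', 'post', 'provide', 'safe',
    'storage', 'store', 'stuff', 'this', 'toy',
]
index_of = {name: i for i, name in enumerate(feature_names)}

# Text of 01.txt as quoted in this post.
post_01 = ("This is a toy post about machine learning. "
           "Actually, it contains not much interesting stuff.")

# Lowercase, split into words of 2+ characters, then count per feature index.
words = re.findall(r"(?u)\b\w\w+\b", post_01.lower())
row = Counter(index_of[w] for w in words)

for idx in sorted(row):
    print((0, idx), row[idx], feature_names[idx])
```

The indices this prints are the same fourteen that appear for sample 0 in the fit_transform output, each with a count of 1.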

Tuesday, 13 May 2014
