Tuesday, 13 May 2014

Counting words - Clustering - Finding related posts - scikit-learn

From the book “Building Machine Learning Systems with Python”, I practiced the code below.

Code

   1: import os
   2: import sys
   3: import scipy as sp
   4: from sklearn.feature_extraction.text import CountVectorizer
   5: data_Dir = "E:\\Machine Learning\\Orange\\Finding Related Post\\toy\\"
   6: posts = [open(os.path.join(data_Dir,f)).read() for f in os.listdir(data_Dir)]
   7: from sklearn.feature_extraction.text import CountVectorizer
   8: vectorizer = CountVectorizer(min_df=1)
   9: print vectorizer
  10: X_train = vectorizer.fit_transform(posts)
  11: print posts
  12: print X_train
  13: num_samples, num_features = X_train.shape
  14: print("#samples: %d, #features: %d" % (num_samples, num_features))
  15: print(vectorizer.get_feature_names())
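Line 6 reads the files in whatever order os.listdir returns them, and that order is not guaranteed, so sample 0 is not necessarily 01.txt on every machine. Below is a minimal sketch of a loader that sorts the file names first and closes each file after reading. The helper name load_posts and the demo file contents are my own placeholders, not from the book, and the sketch is written for Python 3.

```python
import os
import tempfile


def load_posts(data_dir):
    """Return (filenames, texts) for the files in data_dir, sorted by name."""
    # os.listdir gives names in arbitrary order, so sort them to make
    # sample 0 reliably correspond to the first file name (e.g. 01.txt).
    filenames = sorted(
        name for name in os.listdir(data_dir)
        if os.path.isfile(os.path.join(data_dir, name))
    )
    texts = []
    for name in filenames:
        # The with-block closes each file as soon as it has been read.
        with open(os.path.join(data_dir, name), encoding="utf-8") as handle:
            texts.append(handle.read())
    return filenames, texts


# Demo on a throwaway directory with hypothetical file contents.
with tempfile.TemporaryDirectory() as demo_dir:
    for name, text in [("02.txt", "second post"), ("01.txt", "first post")]:
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    demo_names, demo_posts = load_posts(demo_dir)

print(demo_names)
print(demo_posts)
```

With the names returned next to the texts, a row number in the matrix can always be traced back to its file.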

 

Result:


The result below shows the parameter settings of the CountVectorizer, printed by line 9. We can see min_df=1 in the result, which we assigned on line 8. All other settings are defaults.





CountVectorizer(analyzer=word, binary=False, charset=utf-8,
        charset_error=strict, dtype=<type 'long'>, input=content,
        lowercase=True, max_df=1.0, max_features=None, max_n=None,
        min_df=1, min_n=None, ngram_range=(1, 1), preprocessor=None,
        stop_words=None, strip_accents=None, token_pattern=(?u)\b\w\w+\b,
        tokenizer=None, vocabulary=None)
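Two of the printed defaults decide which words get counted: lowercase=True and token_pattern=(?u)\b\w\w+\b. Here is a small sketch, using only Python's re module, of what those two settings do to the first toy post. It is my own approximation for illustration, not scikit-learn's actual code, and tokenize is a name I made up.

```python
import re

# The default token pattern from the printed settings:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text, then extract tokens with the default pattern."""
    return TOKEN_PATTERN.findall(text.lower())


first_post = ("This is a toy post about machine learning. "
              "Actually, it contains not much interesting stuff.")
tokens = tokenize(first_post)
print(tokens)
print(len(tokens))
```

The single-letter word "a" does not match the pattern, so it never becomes a feature, and "This" is counted as "this" because of the lowercasing.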


 


From the output of lines 11, 12 and 15, we can see what the fit_transform function did.




 


The result of fit_transform (printed by line 12) is:



  (0, 0)    1
  (0, 24)    1
  (0, 23)    1
  (0, 8)    1
  (0, 9)    1
  (0, 3)    1
  (0, 10)    1
  (0, 1)    1
  (0, 12)    1
  (0, 14)    1
  (0, 11)    1
  (0, 22)    1
  (0, 15)    1
  (0, 17)    1
  (1, 18)    1
  (1, 20)    1
  (1, 7)    1
  (1, 2)    1
  (1, 5)    1
  (2, 19)    1
  (2, 13)    1
  (2, 16)    1
  (2, 7)    1
  (2, 5)    1
  (2, 6)    1
  (3, 4)    1
  (3, 7)    1
  (3, 21)    1
  (3, 5)    1
  (4, 4)    3
  (4, 7)    3
  (4, 21)    3
  (4, 5)    3
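To see where these numbers come from, here is a rough plain-Python sketch of the two steps: build an alphabetically sorted vocabulary ("fit"), then count each word per post ("transform"). This is an illustration under my own assumptions, not scikit-learn's implementation, and toy_fit_transform is a made-up name. Only the first post is quoted in this article. The other four are reconstructed from the feature names and counts, so their exact wording is an assumption.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

posts = [
    # Quoted in this article (01.txt).
    "This is a toy post about machine learning. "
    "Actually, it contains not much interesting stuff.",
    # Reconstructed from the printed features and counts (assumed wording).
    "Imaging databases provide storage capabilities.",
    "Most imaging databases safe images permanently.",
    "Imaging databases store data.",
    "Imaging databases store data. " * 3,
]


def toy_fit_transform(documents):
    """Return (feature_names, rows); each row maps feature index -> count."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    # "fit": collect every distinct token and sort alphabetically.
    feature_names = sorted({tok for doc in tokenized for tok in doc})
    index_of = {name: i for i, name in enumerate(feature_names)}
    # "transform": count how often each feature index occurs per document.
    rows = [dict(Counter(index_of[tok] for tok in doc)) for doc in tokenized]
    return feature_names, rows


feature_names, rows = toy_fit_transform(posts)
print("#samples: %d, #features: %d" % (len(rows), len(feature_names)))
for sample, row in enumerate(rows):
    for feature, count in sorted(row.items()):
        print((sample, feature), count)
```

The sketch prints each row's entries sorted by feature index, so the order within a row differs from the listing above, but the (sample, feature) count triples are the same.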

 


What does (0, 0) 1 actually mean?


(sample number, feature index)  number of occurrences


The first 0 is the sample number, that is, the post file 01.txt. The second 0 is the feature index, which points to u'about' in the feature names printed by line 15:



[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']

The 1 means the word “about” occurred only once in that sample.


For example, take the second entry, (0, 24) 1.


0 is the sample number, which is 01.txt:


“This is a toy post about machine learning. Actually, it contains not much interesting stuff.”


24 is the feature index. Counting from 0 in the feature names list, index 24 is “toy” (index 23 is “this”).


1 is the number of occurrences of the word “toy” in that sample.
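Counting list positions by hand is error-prone, so here is a short sketch that looks the words up instead. It copies the feature names printed by line 15 and decodes a few of the (sample, feature) count entries from the result. The helper name describe is my own.

```python
# Feature names as printed by line 15 (without the Python 2 u'' prefix).
feature_names = [
    'about', 'actually', 'capabilities', 'contains', 'data', 'databases',
    'images', 'imaging', 'interesting', 'is', 'it', 'learning', 'machine',
    'most', 'much', 'not', 'permanently', 'post', 'provide', 'safe',
    'storage', 'store', 'stuff', 'this', 'toy',
]

# A few (sample, feature index, count) entries taken from the result.
entries = [(0, 0, 1), (0, 24, 1), (0, 23, 1), (4, 4, 3)]


def describe(sample, feature, count):
    """Turn one sparse-matrix entry into a readable sentence."""
    word = feature_names[feature]
    return "sample %d: '%s' occurs %d time(s)" % (sample, word, count)


for entry in entries:
    print(describe(*entry))
```

Indexing starts at 0, so the 25 features run from index 0 ('about') to index 24 ('toy').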
