Wednesday, 14 May 2014

Counting words - Clustering–Find Related post–part2

 

 

Key point

Euclidean distance is a distance measure commonly used in unsupervised learning tasks such as clustering.

Why Euclidean distance?

Euclidean distance measures the straight-line distance between two points in the xy-plane. The formula and its explanation can be found at the link below; it matches our dist_raw(v1, v2) function.

http://www.cut-the-knot.org/pythagoras/DistanceFormula.shtml
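The dist_raw(v1, v2) function itself is not listed in this post, so the snippet below is only a minimal sketch of the same formula using the standard library. The name euclidean_distance is a hypothetical stand-in, and plain Python lists are assumed instead of the vectors used in the book.

```python
import math


def euclidean_distance(v1, v2):
    """Return the straight-line distance between two equal-length vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    # Square the per-coordinate differences, sum them, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


# The classic 3-4-5 right triangle: the distance from (0, 0) to (3, 4).
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

The same function works for vectors of any length, which is what matters when each post is represented by its word counts.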

Aptana studio 3 – Debugger tool


Stepping through code

Set a breakpoint by double-clicking in the left margin next to a line of code. A breakpoint icon will appear next to the line where the breakpoint is set.
Now click the Debug button; execution will halt at the line with the breakpoint. From there we can step through the code line by line and see the result of each step.

 https://www.youtube.com/watch?v=7ROg6Wwz7Z0

Loading a sub-package and avoiding AttributeError

A SciPy sub-package may need to be imported individually, for example with from scipy import linalg. Otherwise we get an error when the code calls into the sub-package.
http://stackoverflow.com/questions/9819733/scipy-special-import-issue
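As a small sketch of the explicit import, assuming NumPy and SciPy are installed: whether a bare import scipy also exposes scipy.linalg depends on the SciPy version, so importing the sub-package by name is the safer habit. The vectors here are made-up illustration values.

```python
import numpy as np
from scipy import linalg  # import the sub-package explicitly

# Illustration values only: the difference between two 2-D points.
delta = np.array([3.0, 4.0]) - np.array([0.0, 0.0])

# linalg.norm returns the Euclidean length of the difference vector.
distance = linalg.norm(delta)
print(distance)
```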

Tuesday, 13 May 2014

Counting words - Clustering–Find Related post–scikit

From the book “Building Machine Learning Systems with Python”, I practiced the code below.

Code

   1: import os
   2: import sys
   3: import scipy as sp
   4: from sklearn.feature_extraction.text import CountVectorizer
   5: data_Dir = "E:\\Machine Learning\\Orange\\Finding Related Post\\toy\\"
   6: posts = [open(os.path.join(data_Dir,f)).read() for f in os.listdir(data_Dir)]
   7: from sklearn.feature_extraction.text import CountVectorizer
   8: vectorizer = CountVectorizer(min_df=1)
   9: print vectorizer
  10: X_train = vectorizer.fit_transform(posts)
  11: print posts
  12: print X_train
  13: num_samples, num_features = X_train.shape
  14: print("#samples: %d, #features: %d" % (num_samples, num_features))
  15: print(vectorizer.get_feature_names())

 

Result:


The output below shows the parameter settings of the CountVectorizer. We can see min_df=1 in the output, which we assigned on line 8. All other settings are defaults.





CountVectorizer(analyzer=word, binary=False, charset=utf-8,
        charset_error=strict, dtype=<type 'long'>, input=content,
        lowercase=True, max_df=1.0, max_features=None, max_n=None,
        min_df=1, min_n=None, ngram_range=(1, 1), preprocessor=None,
        stop_words=None, strip_accents=None, token_pattern=(?u)\b\w\w+\b,
        tokenizer=None, vocabulary=None)


 


From the output printed by lines 11, 12 and 15, we can understand what the fit_transform function did.




 


The result of fit_transform is:

  (0, 0)    1
  (0, 24)    1
  (0, 23)    1
  (0, 8)    1
  (0, 9)    1
  (0, 3)    1
  (0, 10)    1
  (0, 1)    1
  (0, 12)    1
  (0, 14)    1
  (0, 11)    1
  (0, 22)    1
  (0, 15)    1
  (0, 17)    1
  (1, 18)    1
  (1, 20)    1
  (1, 7)    1
  (1, 2)    1
  (1, 5)    1
  (2, 19)    1
  (2, 13)    1
  (2, 16)    1
  (2, 7)    1
  (2, 5)    1
  (2, 6)    1
  (3, 4)    1
  (3, 7)    1
  (3, 21)    1
  (3, 5)    1
  (4, 4)    3
  (4, 7)    3
  (4, 21)    3
  (4, 5)    3

 


What does (0, 0) 1 actually mean?

(sample number, feature index) number of occurrences

The first 0 is the sample number, that is, the post file 01.txt. The second 0 is the feature index, which points to u'about' in the feature names printed by line 15:

[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']

The 1 means the word “about” occurred only once in that sample.

For example, take the second entry, (0, 24) 1.

0 is the sample number, which is 01.txt:

“This is a toy post about machine learning. Actually, it contains not much interesting stuff.”

24 is the feature index. Counting from 0 in the list above, it points to the feature name “toy” (“this” is index 23).

1 is the number of occurrences of the word “toy”.
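To make the (sample number, feature index) reading concrete, here is a small standard-library sketch that recounts the words of 01.txt against the feature list above. The helper count_entries is hypothetical and is not part of scikit-learn. It assumes lowercasing and the token pattern shown in the printed CountVectorizer settings, and it prints entries sorted by feature index rather than in the order scikit-learn displays them.

```python
import re

# Feature names as printed by line 15 of the script (u'' prefixes dropped).
feature_names = [
    'about', 'actually', 'capabilities', 'contains', 'data', 'databases',
    'images', 'imaging', 'interesting', 'is', 'it', 'learning', 'machine',
    'most', 'much', 'not', 'permanently', 'post', 'provide', 'safe',
    'storage', 'store', 'stuff', 'this', 'toy',
]

post_01 = ("This is a toy post about machine learning. "
           "Actually, it contains not much interesting stuff.")


def count_entries(sample_index, text, vocabulary):
    """Return (sample, feature index, count) triples for one post."""
    index_of = {word: i for i, word in enumerate(vocabulary)}
    counts = {}
    # Tokens of two or more word characters, lowercased first.
    for token in re.findall(r"(?u)\b\w\w+\b", text.lower()):
        col = index_of[token]
        counts[col] = counts.get(col, 0) + 1
    return [(sample_index, col, n) for col, n in sorted(counts.items())]


entries = count_entries(0, post_01, feature_names)
for row, col, n in entries:
    # Show each sparse entry next to the word its feature index stands for.
    print("(%d, %d) %d -> %s" % (row, col, n, feature_names[col]))
```

Each printed line pairs a sparse-matrix entry with the word behind its feature index, which makes it easy to check any entry by hand.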

Monday, 12 May 2014

Regex pattern in token_pattern for CountVectorizer

Regex

The result in the book showed

#samples: 5, #features: 25

[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']

One word is missing from the result above: “a”, which appears in 01.txt of the toy posts.

01.txt: This is a toy post about machine learning. Actually, it contains not much interesting stuff.

 

Why is it missing?

This is caused by the default token_pattern for CountVectorizer, (?u)\b\w\w+\b. It only matches tokens of two or more word characters, so the single-character word ‘a’ is dropped.

 

How do we get single-character words?

By changing the token_pattern for CountVectorizer to \\b\\w+\\b (written as an escaped Python string), we can see single-character words in the result.
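The effect of the two patterns can be checked with Python's re module alone. This is only a sketch of the tokenization step, assuming the text is lowercased first as the printed lowercase=True setting suggests; the variable names are made up for illustration.

```python
import re

text = "This is a toy post about machine learning."

# Default pattern: two or more word characters per token.
default_pattern = r"(?u)\b\w\w+\b"
# Relaxed pattern: one or more word characters per token.
single_char_pattern = r"(?u)\b\w+\b"

default_tokens = re.findall(default_pattern, text.lower())
all_tokens = re.findall(single_char_pattern, text.lower())

print(default_tokens)  # 'a' is absent
print(all_tokens)      # 'a' is present
```

With scikit-learn, the relaxed pattern would be passed as CountVectorizer(token_pattern=r"(?u)\b\w+\b"); a raw string avoids having to double the backslashes.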

 

Do we need single-character words?

Depending on the application, we may need to extract single-character words.

Ref

http://stackoverflow.com/questions/20717641/countvectorizer-i-not-showing-up-in-vectorized-text