Machine Learning With Orange: May 2014

Wednesday, 14 May 2014

Counting words - Clustering–Find Related post–part2

Key point

Euclidean distance is a distance measure commonly used in unsupervised learning, for example to compare posts when clustering.

Why Euclidean distance?

Euclidean distance measures the straight-line distance between two points in the xy plane. The formula and its explanation can be found at the link below; it matches our dist_raw(v1, v2) function exactly.

http://www.cut-the-knot.org/pythagoras/DistanceFormula.shtml
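
The body of dist_raw is not shown in this post, so the following is only a minimal sketch of what such a function could look like, assuming the two vectors are plain Python lists of numbers. It applies the distance formula directly: subtract coordinate by coordinate, square, sum, and take the square root.

```python
import math


def dist_raw(v1, v2):
    """Euclidean distance between two equal-length sequences of numbers.

    Hypothetical plain-Python version; the real function may work on
    a different vector type.
    """
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    # Sum of squared coordinate differences, then the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


# The points (0, 0) and (3, 4) form a 3-4-5 right triangle.
print(dist_raw([0, 0], [3, 4]))
```

The same function works for any number of dimensions, which is what we need once each post becomes a vector of word counts.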

Aptana Studio 3 – Debugger tool

Stepping through code

Set a breakpoint by double-clicking in the left margin of the code editor. A breakpoint icon appears next to the line where the breakpoint is set.
Now click the Debug button; execution halts at the line with the breakpoint. From there we can step through the code line by line and see the result of each step.

https://www.youtube.com/watch?v=7ROg6Wwz7Z0

Loading a sub-package to avoid an attribute error

A SciPy sub-package such as linalg should be imported individually, for example with from scipy import linalg; otherwise we get an error when the code calls into the sub-package.
http://stackoverflow.com/questions/9819733/scipy-special-import-issue
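
As a small illustration of the explicit import, the sketch below imports linalg on its own and calls linalg.norm on a two-element vector. The vector values are made up for the example.

```python
from scipy import linalg  # import the sub-package explicitly

# Length of the vector (3, 4); linalg.norm defaults to the Euclidean norm.
length = linalg.norm([3.0, 4.0])
print(length)
```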

Tuesday, 13 May 2014

Counting words - Clustering–Find Related post–scikit

I practiced the code below from the book “Building Machine Learning Systems with Python”.

Code

import os
import sys
import scipy as sp
from sklearn.feature_extraction.text import CountVectorizer
data_Dir = "E:\\Machine Learning\\Orange\\Finding Related Post\\toy\\"
posts = [open(os.path.join(data_Dir,f)).read() for f in os.listdir(data_Dir)]
# min_df=1 keeps every word that appears in at least one post
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
X_train = vectorizer.fit_transform(posts)
print(posts)
print(X_train)
num_samples, num_features = X_train.shape
print("#samples: %d, #features: %d" % (num_samples, num_features))
print(vectorizer.get_feature_names())

Result:

The output below shows the parameter settings of the CountVectorizer. We can see min_df=1, which we set on line 8 of the code; all other settings are defaults.

CountVectorizer(analyzer=word, binary=False, charset=utf-8,
        charset_error=strict, dtype=<type 'long'>, input=content,
        lowercase=True, max_df=1.0, max_features=None, max_n=None,
        min_df=1, min_n=None, ngram_range=(1, 1), preprocessor=None,
        stop_words=None, strip_accents=None, token_pattern=(?u)\b\w\w+\b,
        tokenizer=None, vocabulary=None)

From the output printed by lines 11, 12 and 15, we can understand what the fit_transform function did.

The result of fit_transform is:

  (0, 0)    1
  (0, 24)    1
  (0, 23)    1
  (0, 8)    1
  (0, 9)    1
  (0, 3)    1
  (0, 10)    1
  (0, 1)    1
  (0, 12)    1
  (0, 14)    1
  (0, 11)    1
  (0, 22)    1
  (0, 15)    1
  (0, 17)    1
  (1, 18)    1
  (1, 20)    1
  (1, 7)    1
  (1, 2)    1
  (1, 5)    1
  (2, 19)    1
  (2, 13)    1
  (2, 16)    1
  (2, 7)    1
  (2, 5)    1
  (2, 6)    1
  (3, 4)    1
  (3, 7)    1
  (3, 21)    1
  (3, 5)    1
  (4, 4)    3
  (4, 7)    3
  (4, 21)    3
  (4, 5)    3

What does (0, 0) 1 actually mean?

(sample number, feature index) number of occurrences

The first 0 is the sample number, that is, the post file 01.txt. The second 0 is the index of the feature u'about'; see the feature names printed by line 15.

[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']

1 means the word “about” occurred only once in that sample.

For example, the second entry is (0, 24) 1.

0 is the sample number, which is 01.txt:

“This is a toy post about machine learning. Actually, it contains not much interesting stuff.”

24 is the index of the feature “toy” (“this” has index 23).

1 is the number of occurrences of the word “toy”.
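
To make the (sample number, feature index) reading concrete, here is a small sketch that looks up entries by hand. The feature names and the three entries are copied from the output above; the describe helper is a hypothetical name introduced only for this example and is not part of scikit-learn.

```python
# Feature names copied from the get_feature_names() output above.
feature_names = [
    'about', 'actually', 'capabilities', 'contains', 'data', 'databases',
    'images', 'imaging', 'interesting', 'is', 'it', 'learning', 'machine',
    'most', 'much', 'not', 'permanently', 'post', 'provide', 'safe',
    'storage', 'store', 'stuff', 'this', 'toy',
]


def describe(sample, feature_index, count):
    """Turn one (sample, feature index) count entry into readable text."""
    word = feature_names[feature_index]
    return "sample %d: '%s' occurs %d time(s)" % (sample, word, count)


# Three entries taken from the fit_transform output above.
for entry in [(0, 0, 1), (0, 24, 1), (4, 4, 3)]:
    print(describe(*entry))
```

The sample number is zero-based, so sample 0 is 01.txt and sample 4 is the fifth post.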

Monday, 12 May 2014

Regex pattern in token_pattern for CountVectorizer

Regex

The book shows this result:

#samples: 5, #features: 25

[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']

One word is missing from the result above: “a”, which appears in 01.txt of the toy posts.

01.txt: This is a toy post about machine learning. Actually, it contains not much interesting stuff.

Why is it missing?

This is caused by the default token_pattern of CountVectorizer, (?u)\b\w\w+\b. It only matches tokens of two or more word characters, so the single-character word “a” is dropped.

How to get the single-character word?

By changing the token_pattern of CountVectorizer to "\\b\\w+\\b" (the same regex as the raw string r"\b\w+\b"), single-character words appear in the result.
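
The effect of the two patterns can be checked with Python's re module alone. This is only a sketch of the regex behaviour, not of CountVectorizer itself; lower-casing the text by hand stands in for the vectorizer's lowercase=True setting.

```python
import re

# First sentence of 01.txt, lower-cased like CountVectorizer does by default.
text = "This is a toy post about machine learning.".lower()

default_pattern = r"(?u)\b\w\w+\b"    # two or more word characters
single_char_pattern = r"(?u)\b\w+\b"  # one or more word characters

default_tokens = re.findall(default_pattern, text)
all_tokens = re.findall(single_char_pattern, text)

print(default_tokens)  # "a" is not matched
print(all_tokens)      # "a" is matched
```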

Do we need single-character words?

Depending on the application, we may need to extract single-character words.

Ref

http://stackoverflow.com/questions/20717641/countvectorizer-i-not-showing-up-in-vectorized-text