Monday, 12 May 2014

Regex pattern in token_pattern for CountVectorizer

Regex

The example in the book produced the following result:

#samples: 5, #features: 25

[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']

One word is missing from the result above: “a”, which appears in 01.txt of the toy posts.

01.txt This is a toy post about machine learning. Actually, it contains
not much interesting stuff.

Why is it missing?

This is caused by the default token_pattern for CountVectorizer, which is (?u)\b\w\w+\b. It only matches tokens of two or more word characters, so the single-character word “a” is dropped.
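As a quick check, the default pattern can be tried directly with Python's re module. This is only a sketch: it lowercases the sample sentence by hand and applies the regex with re.findall, which approximates the tokenizing step but is not everything CountVectorizer does internally.

```python
import re

# Default token_pattern quoted above: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# First sentence of 01.txt from the toy posts.
text = "This is a toy post about machine learning."

# Lowercase by hand, then collect every match of the pattern.
tokens = re.findall(DEFAULT_PATTERN, text.lower())
print(tokens)
```

The word “a” never appears in the token list because \w\w+ needs at least two characters to match.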

How do we get the single-character word?

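By changing token_pattern for CountVectorizer to '\\b\\w+\\b' (the same pattern as the raw string r'\b\w+\b'), tokens of one or more word characters are matched, so the single-character word “a” shows up in the result.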
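A minimal sketch of the comparison, assuming scikit-learn is installed. The variable names are my own, and only the text of 01.txt is used here rather than all five toy posts from the book.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Text of 01.txt only (an assumption for brevity; the book uses five posts).
posts = [
    "This is a toy post about machine learning. "
    "Actually, it contains not much interesting stuff."
]

# Vectorizer with the default token_pattern (two or more word characters).
default_vec = CountVectorizer()
default_vec.fit(posts)

# Vectorizer with a pattern that also accepts single-character tokens.
single_char_vec = CountVectorizer(token_pattern=r"\b\w+\b")
single_char_vec.fit(posts)

# vocabulary_ maps each token to its column index; sort the keys for display.
default_vocab = sorted(default_vec.vocabulary_)
single_vocab = sorted(single_char_vec.vocabulary_)

print(default_vocab)
print(single_vocab)
```

For this one post, the only difference between the two vocabularies should be the extra token “a”.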

Do we need the single-character word?

Depending on the application, we may need to extract single-character words.

Ref

http://stackoverflow.com/questions/20717641/countvectorizer-i-not-showing-up-in-vectorized-text
