Regex
The example from the book printed the following result:
#samples: 5, #features: 25
[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']
One word is missing from the result above: “a”, which appears in 01.txt of the toy posts.
01.txt: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
Why is it missing?
This is caused by the default token_pattern for CountVectorizer, which is (?u)\b\w\w+\b. The pattern only matches tokens of two or more word characters, so the single-character word ‘a’ is dropped.
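To see the effect of the pattern in isolation, here is a minimal sketch that applies the same regular expression with Python's re module to the first sentence of 01.txt. It assumes the vectorizer lowercases the text and then collects every match of the pattern, which is how I understand the default tokenizer to work; the variable names are only for illustration.

```python
import re

# First sentence of 01.txt, lowercased as CountVectorizer does by default
text = "This is a toy post about machine learning.".lower()

# The default token_pattern quoted above: two or more word characters
default_pattern = re.compile(r"(?u)\b\w\w+\b")

tokens = default_pattern.findall(text)
print(tokens)
# "a" does not appear in the list because it is only one character long
```

Every other word in the sentence has at least two characters, so only “a” is lost.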
How to get the single character word?
By changing token_pattern for CountVectorizer to \\b\\w+\\b, we can see the single-character word in the result.
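The change can be sketched as follows, using only the text of 01.txt rather than all five toy posts. The pattern is written here as a raw string with the (?u) flag kept from the default; the variable names and the use of the vocabulary_ attribute to list the features are my own choices for illustration, not the book's code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Text of 01.txt from the toy posts
posts = [
    "This is a toy post about machine learning. "
    "Actually, it contains not much interesting stuff."
]

# Default token_pattern: tokens need at least two word characters
default_vec = CountVectorizer(min_df=1)
default_vec.fit(posts)
default_vocab = sorted(default_vec.vocabulary_)

# Relaxed token_pattern: tokens of one or more word characters
relaxed_vec = CountVectorizer(min_df=1, token_pattern=r"(?u)\b\w+\b")
relaxed_vec.fit(posts)
relaxed_vocab = sorted(relaxed_vec.vocabulary_)

print("a" in default_vocab)
print("a" in relaxed_vocab)
# Words gained by relaxing the pattern
print(set(relaxed_vocab) - set(default_vocab))
```

With this post, the only difference between the two vocabularies should be the word “a”.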
Do we need the single character word?
Depending on the application, we may need to extract single-character words.
Ref
http://stackoverflow.com/questions/20717641/countvectorizer-i-not-showing-up-in-vectorized-text