Machine Learning With Orange: Regex pattern in token

Regex

The result of the book showed

One word is missing from the result above: “a”, which does appear in 01.txt of the toy posts.

01.txt This is a toy post about machine learning. Actually, it contains
not much interesting stuff.

Why is it missing?

This is caused by the default token_pattern for CountVectorizer, which is (?u)\b\w\w+\b. The pattern matches only tokens of two or more word characters, so the single-character word ‘a’ is dropped.
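To see the effect of the pattern on its own, here is a minimal sketch that applies the same regular expression with Python's re module to the text of 01.txt. It only imitates the tokenizing step (lowercasing, then matching the pattern); the variable names are my own and it is not the library's internal code.

```python
import re

# Text of 01.txt from the toy posts, joined into one string.
post = ("This is a toy post about machine learning. "
        "Actually, it contains not much interesting stuff.")

# Default token_pattern: a word boundary, then two or more word characters.
default_pattern = r"(?u)\b\w\w+\b"

# Lowercase first, as CountVectorizer does by default, then collect matches.
tokens = re.findall(default_pattern, post.lower())

print(tokens)
print("a" in tokens)  # the single-character word is not matched
```

Because \w\w+ needs at least two characters, “a” can never match, while “is” and “it” still do.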

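By changing token_pattern for CountVectorizer to \\b\\w+\\b (as written in an ordinary Python string; the raw string r"\b\w+\b" is equivalent), single-character tokens are kept and “a” appears in the result.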
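The difference can be checked with a short sketch like the one below, assuming scikit-learn is installed. It fits two vectorizers on the 01.txt text only (not the full set of toy posts) and compares their vocabularies; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Only 01.txt is used here; the other toy posts are left out for brevity.
posts = ["This is a toy post about machine learning. "
         "Actually, it contains not much interesting stuff."]

# Default settings: tokens need at least two word characters.
default_vec = CountVectorizer()
default_vec.fit(posts)
default_vocab = sorted(default_vec.vocabulary_)

# Custom pattern: one or more word characters, so "a" is kept.
single_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_vec.fit(posts)
single_vocab = sorted(single_vec.vocabulary_)

print("a" in default_vocab)
print("a" in single_vocab)
```

The (?u) prefix is kept only to mirror the default pattern; in Python 3, str patterns are already Unicode-aware.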

Depending on the application, we may need to extract single-character words.
