Tuesday, 22 April 2014

Bag-of-words using Orange

For the main task of finding related posts, we first have to learn the bag-of-words approach to text processing in machine learning.

A default installation of Orange does not include text processing widgets, so we have to install an add-on. I installed the Textable add-on for this work.

1. From the Textable add-on, add the Text Field widget.

2. Copy and paste the following text (the part within the quotes) into it:

“How to format my hard disk

Hard disk format problems”

2014-04-22_15-11-00

3. Add Textable\preprocessor to convert the text to lower case. Converting the entire text to lower case ensures that upper-case and lower-case forms of a word are not counted as different words; for example, How and how are treated as the same word.

2014-04-22_15-17-18
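To make the effect of this step concrete, here is a minimal sketch in plain Python. It is not Textable's own code; it only illustrates the same case folding on the sample text, and the variable names are made up for the example.

```python
# Illustration only: plain-Python case folding, not Textable's implementation.
text = "How to format my hard disk\n\nHard disk format problems"

# Lower-casing makes "How" and "how", or "Hard" and "hard", the same token.
lowered = text.lower()
print(lowered)
```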

4. First, we will segment the text into lines. For this, add a Textable\segment widget and configure it as shown in the image below.

2014-04-22_15-38-06

Configuration of Textable\segment widget

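  • Check Advanced settings
  • Set the mode to Tokenize (the regex .+ describes the lines themselves, not the separators between them)
  • Set the regex to .+ to match lines
  • Check Unicode dependent (u)
  • Click Add
  • Under Options, change the label name (for example, Lines)
  • Check Auto-number with key and name the key (for example, num, which step 6 uses)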
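The line segmentation configured above can be approximated in plain Python with the standard re module. This is only a sketch of what the regex .+ does, not how Textable works internally; the dictionary layout and numbering from 1 are assumptions made for the illustration.

```python
import re

text = "how to format my hard disk\n\nhard disk format problems"

# .+ matches one or more characters other than a newline,
# so every match is one non-empty line (tokenizing keeps the matches).
lines = re.findall(r".+", text, flags=re.UNICODE)

# Splitting on the same regex keeps only what lies between the matches.
leftovers = re.split(r".+", text, flags=re.UNICODE)

# Rough analogue of "Auto-number with key": store a running number under "num".
numbered = [{"num": i, "content": line} for i, line in enumerate(lines, start=1)]

print(lines)
print(leftovers)
print(numbered)
```

Comparing lines with leftovers shows why the mode matters when the regex describes the lines rather than the separators.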

5. Next, we will segment the lines further into words. For this, add another Textable\segment widget and configure it as shown in the image below.

2014-04-22_16-27-52

Configuration of Textable\segment widget

  • Check Advanced settings
  • Set the mode to Tokenize
  • Set the regex to \w+ to match words
  • Click the Add button
  • Under Options, change the segment label (for example, words)
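As a rough plain-Python equivalent of this word segmentation (again a sketch with the standard re module, not Textable's code), \w+ picks out each run of letters, digits, or underscores:

```python
import re

lines = ["how to format my hard disk", "hard disk format problems"]

# \w+ matches maximal runs of word characters, so spaces and punctuation
# act as boundaries between tokens.
words_per_line = [re.findall(r"\w+", line, flags=re.UNICODE) for line in lines]

for words in words_per_line:
    print(words)
```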

6. Count the word segments within each context (line). For this, add a Textable\count widget.

2014-04-22_16-35-45

  • Set the segmentation to words
  • Set the context mode to Containing segmentation
  • Set the context segmentation to Lines
  • Set the context annotation key to num
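Counting words within each containing line can be sketched in plain Python with collections.Counter. This only mirrors the idea of the Count widget; keying the result by a line number that starts at 1 is an assumption standing in for the num annotation.

```python
import re
from collections import Counter

lines = ["how to format my hard disk", "hard disk format problems"]

# One Counter per line, keyed by the line number (playing the role of "num").
counts = {
    num: Counter(re.findall(r"\w+", line))
    for num, line in enumerate(lines, start=1)
}

for num, counter in counts.items():
    print(num, dict(counter))
```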

7. Insert a Textable\Convert widget and connect it to a Data Table widget.
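The table that results from these steps is a bag-of-words matrix: one row per line and one column per distinct word. The following is a minimal plain-Python sketch of that end result, assuming the two sample lines from step 2; it does not use Orange or Textable, and the column order (alphabetical) is a choice made for the example.

```python
import re
from collections import Counter

lines = ["how to format my hard disk", "hard disk format problems"]

# Count the words in each line.
counts = [Counter(re.findall(r"\w+", line)) for line in lines]

# The vocabulary is every distinct word seen in any line, sorted for stable columns.
vocabulary = sorted(set().union(*counts))

# One row per line; a Counter returns 0 for words the line does not contain.
table = [[c[word] for word in vocabulary] for c in counts]

print("num", *vocabulary, sep="\t")
for num, row in enumerate(table, start=1):
    print(num, *row, sep="\t")
```

Two posts are then related to the extent that their rows share non-zero columns, which is what makes this representation useful for finding related posts.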

See the complete canvas

2014-04-23_09-10-03

 

Special Thanks

Special thanks to Mr. Axanthos, site admin at Langtech.ch. I am new to machine learning and did not know the bag-of-words concept or how to implement it in Orange. I posted a support request on the Langtech.ch forums, and Mr. Axanthos replied with a clear explanation and code.

http://langtech.ch/forum/textable/viewtopic.php?f=4&t=12

 

Reference

http://langtech.ch/forum/textable/viewtopic.php?f=4&t=4

http://moodle2.unil.ch/course/view.php?id=574

https://orange-textable.readthedocs.org/en/latest/

http://orange.biolab.si/forum/viewtopic.php?f=4&t=1949#p5655

1 comment:

  1. Hi Swetha,
    Always happy to help. Thank you for posting this explanation. Just a note: in bullet point 4 above, I think you need to set the mode to "Tokenize" (because the regex .+ describes the lines and not the separators between them).
    All the best,
    Aris
