Tuesday, 22 April 2014

Bag-of-words using Orange

For the main task of finding related posts, we first have to learn the bag-of-words approach to text processing in machine learning.

A default installation of Orange has no text processing widgets, so we have to install an add-on. I installed the Textable add-on for this work.

1. From the Textable add-on, add the Text Field widget.

2. Copy and paste the following text (the part within the quotes):

“How to format my hard disk

Hard disk format problems”

2014-04-22_15-11-00

3. Add the Textable\Preprocess widget to convert the text to lower case. We convert the entire text so that upper-case and lower-case forms of a word are not counted as different words; for example, How and how are treated as the same word.

2014-04-22_15-17-18

4. First we will segment the text into lines. For this, add the Textable\Segment widget and configure it as shown in the image below.

2014-04-22_15-38-06

Configuration of the Textable\Segment widget (lines)

  • Check Advanced settings
  • Set Mode to Split
  • Set Regex to .+ so that each segment is one line
  • Check Unicode dependent (u)
  • Click Add
  • Under Options, change the label name
  • Check Auto-number with key and give the key a name (step 6 refers to it as num)

5. Next we will segment the lines further into words. For this, add another Textable\Segment widget and configure it as shown in the image below.

2014-04-22_16-27-52

Configuration of the Textable\Segment widget (words)

  • Check Advanced settings
  • Set Mode to Tokenize
  • Set Regex to \w+ so that each segment is one word
  • Click Add
  • Under Options, change the segment label

6. Count the words within each context (line). For this, add the Textable\Count widget.

2014-04-22_16-35-45

  • Set the segmentation to be counted to words
  • For the contexts:
    • set Mode to Containing segmentation
    • set Segmentation to lines
    • set Annotation key to num

7. Add the Textable\Convert widget and connect it to the Data Table widget.

See the complete canvas

2014-04-23_09-10-03
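For readers who want to cross-check the widget output outside Orange, the same pipeline can be sketched in plain Python. This is only a rough equivalent under my own assumptions: the function name bag_of_words is made up, and Python's re module stands in for Textable's segmentation. The two regexes (.+ and \w+) and the sample text are the ones used in the steps above.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return one word-count Counter per non-empty line of `text`."""
    lowered = text.lower()                 # step 3: convert to lower case
    lines = re.findall(r".+", lowered)     # step 4: one segment per line
    # steps 5 and 6: tokenise each line into words and count them
    return [Counter(re.findall(r"\w+", line)) for line in lines]

text = "How to format my hard disk\n\nHard disk format problems"
counts = bag_of_words(text)

# Print a small table: one row per line, one column per word.
vocabulary = sorted(set().union(*counts))
print("line", *vocabulary)
for number, row in enumerate(counts, start=1):
    print(number, *(row[word] for word in vocabulary))
```

Each row of the printed table corresponds to one line of the input, which is roughly what the Data Table widget displays at the end of the canvas.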

 

Special Thanks

Special thanks to Mr. Axanthos, site admin at Langtech.ch. I am new to machine learning, and I did not know the bag-of-words concept or how to implement it in Orange. I posted a support request on the Langtech.ch forum, and Mr. Axanthos replied with a clear explanation and code.

http://langtech.ch/forum/textable/viewtopic.php?f=4&t=12

 

Reference

http://langtech.ch/forum/textable/viewtopic.php?f=4&t=4

http://moodle2.unil.ch/course/view.php?id=574

https://orange-textable.readthedocs.org/en/latest/

http://orange.biolab.si/forum/viewtopic.php?f=4&t=1949#p5655

Friday, 18 April 2014

Tweets sentiment analysis

Download hand-classified tweets from http://www.sananalytics.com/lab/twitter-sentiment/

Counting words - Clustering - Find related posts - Part 1

Assume our website already has a few posts. We want to find the most relevant post for the post currently being viewed, “Imaging database”.

The table below shows the existing posts on the website.

Sl. No.  Post Content
1        This is a toy post about machine learning. Actually, it contains not much interesting stuff.
2        Imaging databases can get huge.
3        Most imaging databases safe images permanently.
4        Imaging databases store images.
5        Imaging databases store images. Imaging databases store images. Imaging databases store images.

In the book, the posts are opened using the following script.

Python_Script_2014-04-30_15-17-52

Script 1 - reading the files and printing the output

    import os
    import sys
    import scipy as sp
    from sklearn.feature_extraction.text import CountVectorizer

    # Folder that holds one text file per post.
    # (sys, sp and CountVectorizer are imported but not used yet.)
    data_Dir = "E:\\Machine Learning\\Orange\\Finding Related Post\\toy\\"
    # Read every file in the folder into a list of strings.
    posts = [open(os.path.join(data_Dir, f)).read() for f in os.listdir(data_Dir)]
    print(posts)

Result 1:

Running script:


['This is a toy post about machine learning. Actually, it contains not much interesting stuff.', 'Imaging databases provide storage capabilities.', 'Most imaging databases safe images permanently.', 'Imaging databases store data.', 'Imaging databases store data. Imaging databases store data. Imaging databases store data.']
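As a preview of where this series is heading, here is a minimal sketch of finding the related post by counting words. It uses only the standard library rather than the CountVectorizer imported in Script 1, and it takes the post texts from Result 1. The helper names, the lower-cased query text "imaging databases", and the choice of plain Euclidean distance between raw count vectors are my own assumptions, not something taken from the book's script.

```python
import math
import re
from collections import Counter

posts = [
    "This is a toy post about machine learning. Actually, it contains not much interesting stuff.",
    "Imaging databases provide storage capabilities.",
    "Most imaging databases safe images permanently.",
    "Imaging databases store data.",
    "Imaging databases store data. Imaging databases store data. Imaging databases store data.",
]

def count_words(text):
    """Lower-case the text and count its \\w+ tokens."""
    return Counter(re.findall(r"\w+", text.lower()))

def distance(a, b):
    """Euclidean distance between two word-count Counters."""
    words = set(a) | set(b)
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in words))

new_post = count_words("imaging databases")   # assumed query text
distances = [distance(new_post, count_words(p)) for p in posts]

# Index of the post with the smallest distance to the query.
best = min(range(len(posts)), key=distances.__getitem__)
print("Most related post:", posts[best])
```

With raw counts, the post that repeats the same sentence three times ends up farthest from the query even though its wording matches; normalising the count vectors is one common way to deal with that.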

 

Implementing Script 1 in Orange


We will modify the bag-of-words file from the earlier post.

2014-04-30_15-34-00

Text_Files_2014-04-30_15-35-36

1. Replace the Textable\Text Field widget with the Textable\Text Files widget.

2. Click Advanced settings.

3. Click the Browse button to open the file dialog.

4. Select all the files and click Open.

5. You can see the result by connecting a Textable\Display widget to the Textable\Lowercase widget.

 

 

Learning from this post


- Opening a directory path and reading its files in Python
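A slightly more defensive version of the file-reading step might look like the sketch below. The function name read_posts is my own; it only relies on os.listdir, os.path.join and os.path.isfile from the standard library, and it closes each file after reading.

```python
import os

def read_posts(directory):
    """Return the text of every regular file in `directory`, in name order."""
    posts = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):          # skip sub-directories
            with open(path) as handle:    # the file is closed after reading
                posts.append(handle.read())
    return posts
```

Calling read_posts with the data_Dir folder from Script 1 should give the same list of posts, in file-name order.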


Reference


http://langtech.ch/forum/textable/viewtopic.php?f=4&t=4


http://moodle2.unil.ch/course/view.php?id=574


https://orange-textable.readthedocs.org/en/latest/


http://orange.biolab.si/forum/viewtopic.php?f=4&t=1949#p5655


How to work with directories: http://www.diveintopython.net/file_handling/os_module.html

Thursday, 17 April 2014

Classification of wheat seeds dataset

In the previous post we built a classifier for the Iris dataset without acquiring any theoretical knowledge of machine learning.

In the same way, in this post I am going to try to classify the more complex seeds dataset, again without attempting to learn the theory.

The wheat seeds dataset contains 3 varieties and 7 features. It can be obtained from http://archive.ics.uci.edu/ml/

Visualising data

1. Drag and drop the File widget from the data palette.

2. Add a Scatter Plot from the data visualisation palette.

2014-04-17_12-19-09

 

2014-04-17_12-19-23

 

Find best threshold using inbuilt method

1. Double-click the scatter plot.

2. Click VizRank under optimisation dialogs.

3. Click Start evaluating projections and see the result.

2014-04-17_12-25-01

4. For better results, click Locally optimise best projections and see the result.

2014-04-17_12-27-08

5. From the result it is easy to separate Rosa from Kama and Canadian.

6. Let us find the threshold: we found threshold = 5.573.

Apply threshold for Rosa seeds

1. Add the Interactive Tree Builder from the classify palette.

2. Double-click the Interactive Tree Builder. Set split selection to Length of kernel groove and cut-off point to 5.573, then click Split. See the report.

2014-04-17_12-34-04
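What the split does can be sketched in a few lines of Python. The rows below are invented placeholder values, not measurements from the seeds dataset, and the function name split_at and the choice of "at or below" versus "above" for the two branches are my own assumptions; only the cut-off 5.573 comes from the steps above.

```python
from collections import Counter

def split_at(rows, feature, cutoff):
    """Split rows into (at or below cutoff, above cutoff) on one feature."""
    low = [r for r in rows if r[feature] <= cutoff]
    high = [r for r in rows if r[feature] > cutoff]
    return low, high

# Placeholder rows for illustration only (not real seeds measurements).
rows = [
    {"groove": 5.0, "variety": "A"},
    {"groove": 5.2, "variety": "B"},
    {"groove": 6.1, "variety": "C"},
    {"groove": 5.9, "variety": "C"},
]

low, high = split_at(rows, "groove", 5.573)
# Class counts per branch, similar in spirit to the tree builder report.
print("<= 5.573:", Counter(r["variety"] for r in low))
print(">  5.573:", Counter(r["variety"] for r in high))
```

A branch that contains a single variety means the threshold separates that variety cleanly.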

Find best threshold for separating Canadian and Kama seeds

1. Double-click scatter plot (1).

2. Click VizRank under optimisation dialogs.

3. Click Start evaluating projections, then click Locally optimise best projections and see the result.

2014-04-17_12-54-22

4. From the results I selected the rank 7 combination, Length of kernel groove vs. area, because it is easy to place a threshold on it. However, the accuracy of this classification is only 89.8 percent. Threshold: area = 13.55.

Now I feel I should understand what is really happening.

Tuesday, 15 April 2014

Iris dataset classification

First, let us visualize the Iris dataset.

1. Drag and drop the File widget from the data palette.

2. Add a Scatter Plot from the data visualisation palette.

3. Double-click the File widget and load the Iris dataset.

The Iris dataset can be downloaded from http://orange.biolab.si/datasets.psp

Iris Data Visualisation

4. Double-click the scatter plot and explore the data by changing the x and y parameters.

Scatter plot

 

How to change the point shape?

Under additional point properties, change the point shape attribute to iris.

Scatter plot change shape

Can I save the plot?

You can save the plot by using Save Graph. Plots can be saved as a picture or as a matplotlib script.

Sp_Len_Vs_Sp_wid

 

Building the classification model

From the plot below, it seems easy to separate Iris-setosa from the other two iris species.

Spllen_ptllen

We can write a simple rule: if the petal length is less than 2.0 cm, the flower is Iris-setosa; otherwise it is one of the other two species.
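Written out as Python, the rule is a one-line comparison. The function name is_setosa and the two example petal lengths are my own choices for illustration; the 2.0 cm cut-off is the one read off the plot.

```python
def is_setosa(petal_length_cm):
    """Rule read off the scatter plot: short petals mean Iris-setosa."""
    return petal_length_cm < 2.0

for petal_length in (1.4, 4.7):   # example values chosen for illustration
    label = "Iris-setosa" if is_setosa(petal_length) else "other species"
    print(petal_length, "->", label)
```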

How to classify based on the petal length?

1. Add the Interactive Tree Builder from the classify palette and rename it Iris setosa classifier.

Classifier

2. Double-click the Interactive Tree Builder. Set split selection to petal length and cut-off point to 2.0, then click Split. See the report.

2014-04-16_12-54-34

We have successfully classified iris-setosa.

How to classify other two species?

We will first remove Iris-setosa from the data and then plot only Iris-virginica and Iris-versicolor.

Identifying a threshold to separate these two species is not as easy as it was for Iris-setosa, so we have to find a better method.

I found a simple inbuilt method to identify the best threshold for separating these two species. I am not sure it will work for all data.

Separate Iris-setosa

1. Add the Select Data widget from the data palette.

2014-04-16_16-39-31

 

2. Double-click the Select Data widget. Select iris under attribute, equals under operator and Iris-setosa under value, then check Negate.

3. Click the Add button.

2014-04-16_16-39-46
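The effect of this condition with Negate checked can be sketched as a simple filter. The rows and their petal lengths below are illustrative values I made up for the example, and the "iris" key simply mirrors the attribute name chosen in the widget.

```python
# Hypothetical rows; the "iris" key mirrors the attribute chosen in the widget.
rows = [
    {"petal length": 1.4, "iris": "Iris-setosa"},
    {"petal length": 4.7, "iris": "Iris-versicolor"},
    {"petal length": 6.0, "iris": "Iris-virginica"},
]

# "iris equals Iris-setosa" with Negate checked keeps everything else.
remaining = [row for row in rows if not (row["iris"] == "Iris-setosa")]
print([row["iris"] for row in remaining])
```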

Inbuilt method to find the best threshold

1. Double-click scatter plot (1).

2. Click VizRank under optimisation dialogs.

2014-04-16_16-40-05

3. Click the Start evaluating projections button in the VizRank dialog.

4. Click the Locally optimise best projections button (watch the plots while this runs).

5. See the results: the best projection is achieved with the combination of petal width and petal length.

2014-04-16_16-40-25

6. The best projection is achieved with the default settings.

2014-04-16_16-40-38

7. Now, by applying a threshold of 1.65 on petal width, we can separate Iris-virginica from Iris-versicolor.

8. Add an Interactive Tree Builder and set split selection to petal width and cut-off point to 1.65.

2014-04-16_16-56-49

Our Final Code

2014-04-16_16-56-35
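Put together, the two splits amount to a small two-level decision rule, sketched here in Python. The function name classify_iris is my own. Which side of the 1.65 cut-off is Iris-versicolor is not stated above; I am assuming the narrower petals belong to Iris-versicolor, as they generally do in the standard iris data.

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Two-level rule built from the two tree builder cut-offs."""
    if petal_length_cm < 2.0:       # first tree builder: cut-off 2.0
        return "Iris-setosa"
    if petal_width_cm < 1.65:       # second tree builder: cut-off 1.65
        return "Iris-versicolor"    # assumption: narrower petals
    return "Iris-virginica"

# Example measurements chosen for illustration.
print(classify_iris(1.4, 0.2))
print(classify_iris(4.7, 1.4))
print(classify_iris(6.0, 2.5))
```

This rule will still misclassify some flowers near the 1.65 boundary, since the two species overlap there.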

If you have any problem understanding this, feel free to call me.