Machine Learning With Orange: April 2014

Tuesday, 22 April 2014

Bag-of-words using Orange

To find related posts, which is the main task, we first have to learn the bag-of-words approach to text processing in machine learning.

A default installation of Orange has no text-processing widgets, so we have to install an add-on. I installed the Textable add-on for this work.

1. From the Textable add-on, add a Text Field widget.

2. Copy and paste the following text (the part inside the quotes):

“How to format my hard disk

Hard disk format problems”

3. Add a Textable\Preprocess widget to convert the text to lower case. We convert the entire text so that upper-case and lower-case forms of a word are not counted as different words; for example, How and how are treated as the same word.

4. First we segment the text into lines. For this, add a Textable\Segment widget and configure it as listed below.

Configuration of the Textable\Segment widget

Check Advanced settings
Set Mode to Split
Set Regex to .+ to specify lines
Check Unicode dependent (u)
Click Add
Under Options, change the label name
Check Auto-number with key and name the key (step 6 uses num)

5. Next we segment the lines further into words. For this, add another Textable\Segment widget and configure it as listed below.

Configuration of the Textable\Segment widget

Check Advanced settings
Set Mode to Tokenize
Set Regex to \w+ to specify words
Click the Add button
Under Options, change the segment label

6. Count the words within each line. For this, add a Textable\Count widget and set:

Segmentation to words (the units to count)

Mode to Containing segmentation
Segmentation to Lines (the context)
Annotation key to num

7. Insert a Textable\Convert widget and connect it to a Data Table widget.

See the complete canvas
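
For comparison, the same steps can be sketched in plain Python. This is only an illustration of what the widget chain computes, not Textable's own code: the function name bag_of_words and the variable names are made up for this sketch, while the two regular expressions are the ones entered in the Segment widgets above.

```python
import re
from collections import Counter

# The two example lines typed into the Text Field widget
text = "How to format my hard disk\nHard disk format problems"


def bag_of_words(text):
    """Return (vocabulary, rows): one row of word counts per line.

    Hypothetical helper for illustration; it mirrors the steps
    lowercase -> lines (.+) -> words (\\w+) -> count per line.
    """
    lowered = text.lower()                      # step 3: lower case
    lines = re.findall(r".+", lowered)          # step 4: one segment per line
    vocabulary = sorted(set(re.findall(r"\w+", lowered)))  # step 5: words
    rows = []
    for line in lines:
        counts = Counter(re.findall(r"\w+", line))         # step 6: count
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


vocabulary, rows = bag_of_words(text)
print(vocabulary)
for num, row in enumerate(rows, start=1):
    print("line %d: %s" % (num, row))
```

Each row corresponds to one line of the input and each column to one word, which is the kind of table the Convert widget passes on.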

Special Thanks

Special thanks to Mr. Axanthos, site admin of Langtech.ch. I am new to machine learning and did not know the bag-of-words concept or how to implement it in Orange. I posted a support request on the Langtech.ch forum, and Mr. Axanthos replied with a clear explanation and code.

http://langtech.ch/forum/textable/viewtopic.php?f=4&t=12

Reference

http://langtech.ch/forum/textable/viewtopic.php?f=4&t=4

http://moodle2.unil.ch/course/view.php?id=574

https://orange-textable.readthedocs.org/en/latest/

http://orange.biolab.si/forum/viewtopic.php?f=4&t=1949#p5655

Friday, 18 April 2014

Tweets sentiment analysis

Download hand-classified tweets from http://www.sananalytics.com/lab/twitter-sentiment/

Counting words - Clustering - Find related posts - part 1

Assume our website already has a few posts. We are going to find the most relevant post for the post currently being viewed, “Imaging database”.

The table below shows the existing posts on the website.

Si.No	Post Content
1	This is a toy post about machine learning. Actually, it contains not much interesting stuff.
2	Imaging databases can get huge.
3	Most imaging databases safe images permanently.
4	Imaging databases store images.
5	Imaging databases store images. Imaging databases store images. Imaging databases store images.
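
As a rough sketch of where this is heading, the posts in the table can be compared with the current post by counting words. The post does not say how relatedness is measured, so the Euclidean distance between raw word-count vectors used here is an assumption, and the helper names word_counts and euclidean_distance are made up for the illustration.

```python
import math
import re
from collections import Counter

# The five posts from the table above
posts = [
    "This is a toy post about machine learning. Actually, it contains "
    "not much interesting stuff.",
    "Imaging databases can get huge.",
    "Most imaging databases safe images permanently.",
    "Imaging databases store images.",
    "Imaging databases store images. Imaging databases store images. "
    "Imaging databases store images.",
]


def word_counts(text):
    """Lower-case the text and count its words (hypothetical helper)."""
    return Counter(re.findall(r"\w+", text.lower()))


def euclidean_distance(a, b):
    """Distance between two word-count vectors (assumed similarity measure)."""
    words = set(a) | set(b)
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in words))


new_post = word_counts("Imaging database")
distances = [euclidean_distance(word_counts(p), new_post) for p in posts]
best = min(range(len(posts)), key=lambda i: distances[i])
print("Closest post: number %d (distance %.2f)" % (best + 1, distances[best]))
```

With raw counts, post 5 ends up far from the current post even though it only repeats post 4 three times, which suggests that plain counting alone may not be enough.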

In the book, the posts are opened using the following script.

Script 1: reading the files and printing the output

    # Imports as listed in the book; only os is used in this script
    import os
    import sys
    import scipy as sp
    from sklearn.feature_extraction.text import CountVectorizer

    # Directory that holds one text file per post
    data_Dir = "E:\\Machine Learning\\Orange\\Finding Related Post\\toy\\"

    # Read every file in the directory into a list of strings
    posts = [open(os.path.join(data_Dir, f)).read() for f in os.listdir(data_Dir)]

    print(posts)

Result 1:

Running script:

['This is a toy post about machine learning. Actually, it contains not much interesting stuff.', 'Imaging databases provide storage capabilities.', 'Most imaging databases safe images permanently.', 'Imaging databases store data.', 'Imaging databases store data. Imaging databases store data. Imaging databases store data.']

Implementation of script 1 in Orange

We will modify the bag-of-words file.

1. Replace the Textable\Text Field widget with a Textable\Text Files widget.

2. Check Advanced settings.

3. Click the Browse button to open the file dialog.

4. Select all the files and click Open.

5. You can see the result by connecting a Textable\Display widget to the Textable\Lowercase widget.

Learning from this post

- Opening a directory path and reading the files in it with Python
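
A slightly more defensive way of reading a directory of posts might look like the sketch below. It is not the book's script: the function name read_posts is made up, and the temporary directory with two demo files only stands in for the toy folder so that the example runs anywhere.

```python
import os
import tempfile


def read_posts(data_dir):
    """Return the contents of every file in data_dir, in file-name order."""
    posts = []
    for name in sorted(os.listdir(data_dir)):   # sorted: stable order
        path = os.path.join(data_dir, name)
        if os.path.isfile(path):                # skip sub-directories
            with open(path) as handle:          # file is closed automatically
                posts.append(handle.read())
    return posts


# Demo data: a throwaway directory standing in for the toy folder
demo_dir = tempfile.mkdtemp()
demo_texts = ["Imaging databases can get huge.",
              "Imaging databases store images."]
for number, content in enumerate(demo_texts, start=1):
    with open(os.path.join(demo_dir, "%02d.txt" % number), "w") as handle:
        handle.write(content)

posts = read_posts(demo_dir)
print(posts)
```

os.listdir returns names in arbitrary order, so sorting them keeps the list of posts aligned with the file numbering.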

Reference

http://langtech.ch/forum/textable/viewtopic.php?f=4&t=4

http://moodle2.unil.ch/course/view.php?id=574

https://orange-textable.readthedocs.org/en/latest/

http://orange.biolab.si/forum/viewtopic.php?f=4&t=1949#p5655

How to work with directories: http://www.diveintopython.net/file_handling/os_module.html

Thursday, 17 April 2014

Classification of wheat seeds dataset

In the previous post we built a classifier for the Iris dataset without knowing or acquiring any theoretical knowledge of machine learning.

In the same way, in this post I am going to try to classify the more complex seeds dataset, again without attempting to learn the theory.

The wheat seeds dataset contains 3 varieties and 7 features. It can be obtained from http://archive.ics.uci.edu/ml/

Visualising data

1. Drag and drop the File widget from the Data palette.

2. Add a Scatter Plot from the data visualisation palette.

Find the best threshold using the inbuilt method

1. Double-click the Scatter Plot.

2. Click VizRank under Optimization dialogs.

3. Click Start evaluating projections and see the result.

4. For better results, click Locally optimise best projection and see the result.

5. From the result it is easy to separate Rosa from Kama and Canadian.

6. Let us find the threshold. We found threshold = 5.573 for the length of kernel groove.
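
A cut-off for a single feature could also be searched for in code. The sketch below is a much simpler, swapped-in technique and not what VizRank does: it tries the midpoint between each pair of neighbouring values and keeps the one with the highest accuracy. The function name best_threshold is made up, and the measurements are invented illustrative values, not rows from the seeds dataset.

```python
def best_threshold(values, labels, positive):
    """Return (threshold, accuracy) for the rule: value > threshold -> positive.

    Hypothetical helper: scans the midpoints between neighbouring values.
    """
    ordered = sorted(set(values))
    best = (None, 0.0)
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2.0
        correct = sum((value > threshold) == (label == positive)
                      for value, label in zip(values, labels))
        accuracy = correct / float(len(values))
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best


# Invented groove lengths for illustration only (not real measurements)
groove_lengths = [5.0, 5.2, 5.4, 6.0, 6.2]
varieties = ["other", "other", "other", "Rosa", "Rosa"]

threshold, accuracy = best_threshold(groove_lengths, varieties, "Rosa")
print("threshold = %.3f, accuracy = %.2f" % (threshold, accuracy))
```

On real data the classes usually overlap, so the best accuracy stays below 1.0 and the chosen cut-off depends on which points are in the sample.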

Apply the threshold for Rosa seeds

1. Add Interactive Tree Builder from the Classify palette.

2. Double-click Interactive Tree Builder. Set split selection to length of kernel groove and cut-off point to 5.573, then click Split. See the report.

Find the best threshold for separating Canadian and Kama seeds

1. Double-click Scatter Plot (1).

2. Click VizRank under Optimization dialogs.

3. Click Start evaluating projections, then click Locally optimise best projection and see the result.

4. From the results I selected the rank 7 combination, length of kernel groove vs. area. In this projection it is easy to place a threshold, but the accuracy of this classification is only 89.8 percent. Threshold: area = 13.55.
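
Put together, the two cut-offs amount to a small decision rule. The sketch below is an assumption-laden illustration, not the tree Orange builds: the post gives the cut-off values but not which side each variety falls on, so the directions (longer groove means Rosa, larger area means Kama) are assumed here, and the function name classify_seed and the sample values are made up.

```python
def classify_seed(groove_length, area,
                  groove_cutoff=5.573, area_cutoff=13.55):
    """Two-step threshold rule using the cut-offs found above.

    Assumed directions (not stated in the post):
    groove_length > cutoff -> Rosa, then area > cutoff -> Kama.
    """
    if groove_length > groove_cutoff:
        return "Rosa"
    if area > area_cutoff:
        return "Kama"
    return "Canadian"


# Invented measurements for illustration only
print(classify_seed(groove_length=6.0, area=18.0))
print(classify_seed(groove_length=5.1, area=14.5))
print(classify_seed(groove_length=5.1, area=12.0))
```

The second split is the weaker one, since the post reports only 89.8 percent accuracy for it.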

Now I feel I should understand what is really happening.

Tuesday, 15 April 2014

Iris dataset classification

First let us visualize the Iris dataset.

1. Drag and drop the File widget from the Data palette.

2. Add a Scatter Plot from the data visualisation palette.

3. Double-click File and load the Iris dataset.

The Iris dataset can be downloaded from http://orange.biolab.si/datasets.psp

4. Double-click the Scatter Plot and explore the data by changing the x and y attributes.

How to change the point shape?

Under additional point properties, change the point shape attribute to iris.

Can I save the plot?

You can save the plot by using Save graph. Plots can be saved as a picture or as a matplotlib script.

Building the classification model

From the plot, it seems easy to separate Iris-setosa from the other two iris species.

We can write a simple rule: if the petal length is less than 2.0 cm, the flower is Iris-setosa; otherwise it is one of the other two species.

How to classify based on the petal length?

1. Add Interactive Tree Builder from the Classify palette and rename it Iris setosa classifier.

2. Double-click Interactive Tree Builder. Set split selection to petal length and cut-off point to 2.0, then click Split. See the report.

We have successfully classified Iris-setosa.

How to classify the other two species?

We will first remove Iris-setosa from the data and plot only Iris-virginica and Iris-versicolor.

Identifying a threshold to separate these two species is not as easy as it was for Iris-setosa, so we have to find a better method.

I found one simple inbuilt method to identify the best threshold for separating these two species. I am not sure it will work for all data.

Separate Iris-setosa

1. Add the Select Data widget from the Data palette.

2. Double-click the Select Data widget. Select iris under Attribute, equals under Operator and Iris-setosa under Value, and check Negate.

3. Click the Add button.
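
In plain Python, the condition set up in Select Data corresponds to a filter with a negated equality test. This is only a sketch of the idea, not Orange's API: the function name select_rows and the dictionary layout are made up, and the three rows are illustrative sample values.

```python
def select_rows(rows, attribute, value, negate=False):
    """Keep rows where row[attribute] == value, or the opposite if negate."""
    return [row for row in rows if (row[attribute] == value) != negate]


# Illustrative sample rows (petal measurements in cm)
rows = [
    {"petal length": 1.4, "petal width": 0.2, "iris": "Iris-setosa"},
    {"petal length": 4.7, "petal width": 1.4, "iris": "Iris-versicolor"},
    {"petal length": 6.0, "petal width": 2.5, "iris": "Iris-virginica"},
]

# Same condition as in the widget: iris equals Iris-setosa, negated
remaining = select_rows(rows, "iris", "Iris-setosa", negate=True)
print([row["iris"] for row in remaining])
```

Negating the test keeps everything except Iris-setosa, which leaves only the two species that still need to be separated.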

Inbuilt method to find the best threshold

1. Double-click Scatter Plot (1).

2. Click VizRank under Optimization dialogs.

3. Click the Start evaluating projections button in the VizRank dialog.

4. Click the Locally optimise best projections button (watch the plots while you do this).

5. See the results: the best projection is achieved with the combination of petal width and petal length.

6. The best projection is achieved with the default settings.

7. Now, by applying a threshold of 1.65 on petal width, we can separate Iris-virginica from Iris-versicolor.

8. Add Interactive Tree Builder, set the split to petal width and the cut-off to 1.65.
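
The two splits above can be written as one small rule. This is a hedged sketch rather than the tree Orange reports: the cut-offs 2.0 and 1.65 come from the steps above, the direction of the second split (wider petals mean Iris-virginica) is an assumption the post does not spell out, and the function name classify_iris and the sample values are made up.

```python
def classify_iris(petal_length, petal_width):
    """Threshold rule built from the two cut-offs found above (in cm)."""
    if petal_length < 2.0:
        return "Iris-setosa"
    # Assumed direction: wider petals -> Iris-virginica
    if petal_width > 1.65:
        return "Iris-virginica"
    return "Iris-versicolor"


# Illustrative sample measurements: (petal length, petal width)
for petal_length, petal_width in [(1.4, 0.2), (4.7, 1.4), (6.0, 2.5)]:
    print(petal_length, petal_width, classify_iris(petal_length, petal_width))
```

A fixed rule like this will still misclassify flowers near the 1.65 boundary, where the two species overlap.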

Our Final Code

If you have any problem understanding this, feel free to call me.