Machine Learning in Python



Vandana Bachani

http://infolab.tamu.edu

Spring 2012


What is scikit-learn?


How can it be useful to the lab?


There are other packages too!


Features


Usage


Conclusion


scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib)



A comprehensive package for all machine learning needs.

Fast: the numerical work is delegated to numpy/scipy and compiled code.

Accuracy? Given the right data and features, the implementations are dependable.

Ref: http://scikit-learn.org/stable/



Our daily jobs:


Regression/Prediction


Text Classification


Text Feature Extraction


Text Feature Selection


Using Chi-Square and other metrics (see the sketch after this list)


Cross-Validation

K-Fold

Clustering (K-Means, etc.)


Maybe in the future:


Image Classification

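For the Chi-Square item above, a minimal sketch of how feature selection looks in scikit-learn; the toy X and y below are illustrative stand-ins, not lab data:

from sklearn.feature_selection import SelectKBest, chi2

X = [[1, 0, 3],
     [0, 2, 1],
     [4, 0, 0],
     [0, 1, 2]]      # e.g. term counts per document
y = [0, 1, 0, 1]     # class label per document

# keep the 2 features with the highest chi-square score w.r.t. y
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)
print X_new          # same rows, only the 2 selected columns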

All in one package!

                 NLTK                          Orange                          scikit-learn

Scope            Machine Learning +            Machine Learning +              Machine Learning
                 Text Processing + ...         visualizations

Maturity         Mature (a book exists!)       Naïve and sophisticated         New, still developing

Documentation    Not so great                  Good; sufficient code           Very good, but incomplete
                                               examples

Functionality    Lacks ML functionality;       Lacks a lot of functionality    Almost complete w.r.t. ML,
                 old school                    (unsupervised learning)         plus additional utilities

Metrics          Good metrics support          Good metrics support            Good metrics support

Ease of use      Complicated to use            Easy to use                     Easy and intuitive to use

API              REST API                      No API support                  No API support


Linear Models


Regression (Predicting Continuous Values)

Example: Prices of houses (Boston house dataset)


Linear, Ridge, Lasso (for sparse coefficients; useful in the field of compressed sensing), LARS (for very high-dimensional data), Bayesian

Classification


Logistic Regression, Stochastic Gradient Descent



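A brief hedged sketch of the regression side, reusing the tiny dataset from the LinearRegression example later in this deck; the alpha values are arbitrary:

from sklearn import linear_model

# Ridge: least squares with an L2 penalty; alpha sets the shrinkage
ridge = linear_model.Ridge(alpha=0.5)
ridge.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

# Lasso: the L1 penalty can drive coefficients exactly to zero,
# which is why it suits sparse / compressed-sensing problems
lasso = linear_model.Lasso(alpha=0.1)
lasso.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
print lasso.coef_    # some entries may be exactly zero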

Support Vector Machines


Classification


SVC (one-vs-one), LinearSVC (one-vs-rest)


Regression


SVR


Density Estimation & Outlier Detection (unsupervised learning)

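A minimal sketch of the SVC, LinearSVC, and SVR estimators named above, on a toy dataset:

from sklearn import svm

X = [[0., 0.], [1., 1.]]
y = [0, 1]

# SVC trains one-vs-one classifiers between every pair of classes
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

# LinearSVC trains one-vs-rest, one binary classifier per class
lin = svm.LinearSVC()
lin.fit(X, y)

# SVR applies the same machinery to continuous targets
reg = svm.SVR()
reg.fit(X, [0.5, 2.5])
print reg.predict([[1., 1.]])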

Unsupervised Learning


Clustering


K-Means, Mean Shift, Spectral Clustering

Ward (hierarchical; constructs a tree)


Manifold Learning


Dimensionality Reduction (for visualization, etc.)


Novelty and Outlier Detection


Uses the one-class SVM

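A minimal K-Means sketch on toy points; note that current releases spell the parameter n_clusters, while the early release this deck targets spelled it k:

from sklearn.cluster import KMeans

# two obvious groups of 2-D points
X = [[1, 2], [1, 4], [1, 0],
     [4, 2], [4, 4], [4, 0]]

km = KMeans(n_clusters=2)   # older releases: KMeans(k=2)
km.fit(X)
print km.labels_            # cluster index assigned to each point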

Miscellaneous


Nearest neighbors


Unsupervised, Classification


Decision Trees


Classification, Regression


Gaussian Processes


Regression


Metrics


metrics.roc_curve(y_true, y_score)

metrics.precision_recall_fscore_support(...)


joblib and pickle (for model persistence)

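A hedged sketch of the two metric calls and of joblib persistence; the toy labels and scores are illustrative only, and joblib is imported from sklearn.externals as bundled at the time (today you would import joblib directly):

from sklearn import metrics, svm
from sklearn.externals import joblib

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # e.g. decision_function output

# ROC curve: false/true positive rate at each score threshold
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)

y_pred = [0, 1, 1, 1]             # hard predictions for P/R/F
p, r, f, support = metrics.precision_recall_fscore_support(y_true, y_pred)

# joblib handles the numpy arrays inside a fitted estimator
# more efficiently than plain pickle
clf = svm.SVC().fit([[0, 0], [1, 1]], [0, 1])
joblib.dump(clf, 'model.joblib')
clf = joblib.load('model.joblib')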


Cross-Validation

cross_validation.KFold(n, k[, indices])


Datasets


Feature Extraction


Text


feature_extraction.text.WordNGramAnalyzer([...])

feature_extraction.text.CharNGramAnalyzer([...])


Image


feature_extraction.image.extract_patches_2d(...)


Feature Selection


feature_selection.chi2(X, y)

feature_selection.SelectKBest(score_func[, k])

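A minimal sketch of how the folds are consumed, following the KFold(n, k) signature listed above (newer releases moved this to sklearn.model_selection and renamed the arguments):

from sklearn.cross_validation import KFold

# n=4 samples, k=2 folds: each fold serves once as the test set
for train_index, test_index in KFold(4, 2):
    print "TRAIN:", train_index, "TEST:", test_index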


Linear Regression

>>> from sklearn import linear_model
>>> clf = linear_model.LinearRegression()
>>> clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)



Classification

>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = SGDClassifier(loss="hinge", penalty="l2")
>>> clf.fit(X, y)
SGDClassifier(alpha=0.0001, class_weight=None, eta0=0.0, fit_intercept=True,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, rho=0.85, seed=0, shuffle=False,
       verbose=0)

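Once fitted, either estimator above can score unseen points; a small follow-on in the same session (the exact output depends on the fitted coefficients):

>>> clf.predict([[2., 2.]])
array([1])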


SVC & Cross-Validation

>>> from sklearn import datasets
>>> from sklearn import svm
>>> from sklearn import cross_validation
>>> iris = datasets.load_iris()
>>> clf = svm.SVC(kernel='linear')
>>> scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores
array([ 1.  ...,  0.96...,  0.9 ...,  0.96...,  1.  ...])






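The usual summary is the mean accuracy across the five folds; continuing the session above:

>>> scores.mean()
0.96...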


# trainData / testData: dataset objects with .data and .target,
# loaded earlier (e.g. via sklearn.datasets.load_files)
from time import time
from sklearn.feature_extraction.text import Vectorizer

data_train, data_test = trainData.data, testData.data
y_train, y_test = trainData.target, testData.target

print "Extracting features from the training dataset"
# a specific analyzer can be passed to the vectorizer;
# by default WordNGramAnalyzer is used
vectorizer = Vectorizer()

t0 = time()
X_train = vectorizer.fit_transform(data_train)
print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_train.shape

print "Extracting features from the test dataset"
t0 = time()
X_test = vectorizer.transform(data_test)
print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_test.shape






from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import BernoulliNB

penalty = "l2"

# LinearSVC can be tried with L1 or L2 penalties
print "LinearSVC"
linearSVC = LinearSVC(loss='l2', penalty=penalty, C=1000, dual=False, tol=1e-3)
classify(linearSVC, X_train, y_train, X_test, y_test)

# SGDClassifier
print "SGDClassifier"
sgdClf = SGDClassifier(alpha=.0001, n_iter=50, penalty=penalty)
classify(sgdClf, X_train, y_train, X_test, y_test)

print "NaiveBayes - Bernoulli"
bernoulliNBClf = BernoulliNB(alpha=.01)
classify(bernoulliNBClf, X_train, y_train, X_test, y_test)


--------------

from sklearn import metrics

def classify(clf, X_train, y_train, X_test, y_test):
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print "train time: %0.3fs" % train_time

    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print "test time: %0.3fs" % test_time

    # categories: list of class names, e.g. trainData.target_names
    print "classification report:"
    print metrics.classification_report(y_test, pred, target_names=categories)

SGDClassifier
train time: 1.505s
test time: 0.023s
classification report:

             precision    recall  f1-score   support

 TECHNOLOGY       0.75      0.99      0.85      3918
     IDIOMS       0.94      0.66      0.78      5205
  POLITICAL       0.88      0.99      0.93      4268
      MUSIC       0.90      0.74      0.81       872
      GAMES       0.97      0.95      0.96       457
     SPORTS       0.87      0.98      0.92       443
     MOVIES       0.97      0.90      0.93      1092
  CELEBRITY       0.73      0.46      0.56        24

avg / total       0.88      0.86      0.86     16279



If you are a Python person, this seems like a good library.

NLTK + scikit-learn should make an excellent pair for our lab.


Good documentation wins!


Thanks


Email: vandana_bvj_tamu@tamu.edu
