CSE 591: Machine learning and Applications


Jieping Ye

Department of Computer Science & Engineering

Arizona State University

Brief Introduction


Dr. Jieping Ye


Assistant Professor at CSE Dept.


Affiliated with the Center for Evolutionary Functional
Genomics at the Biodesign Institute


Research interests: machine learning, data mining
and their applications to bioinformatics


Dimensionality reduction


Semi-supervised learning


Kernel learning


Biological image analysis

Outline of lecture


Course information



Project



Introduction to ML



Course schedule



Survey

Course Information


Instructor: Dr. Jieping Ye


Office: BY 568


Phone: 727-7451


Email: jieping.ye@asu.edu


Web: http://www.public.asu.edu/~jye02/CLASSES/Spring-2007/


Time: TTh 4:40 pm - 5:55 pm


Office hours: TTh 10:00 am - 11:45 am


Location: BYAC 270



TA: Jianhui Chen


Office hours: Th 3:30 pm - 4:30 pm

Course information (Cont’d)


Prerequisites: Basics of linear algebra, and algorithm design and
analysis.



Course textbook: No textbook is required. (Papers and other
materials are available at the class web page)



Objective: An in-depth understanding of some of the important
machine learning methods and their applications in bioinformatics
and other domains.



Topics: Clustering, regression, classification, semi-supervised
learning, feature reduction, manifold learning, ranking, and kernel
learning.


Reference books


Pattern Classification. Duda, et al., 2000.



The Elements of Statistical Learning: Data Mining, Inference,
and Prediction. Hastie, et al., 2001.



Kernel Methods in Computational Biology. Schölkopf, et al.,
editors. 2004.



Kernel Methods for Pattern Analysis. Shawe-Taylor and Cristianini,
2004.



Introduction to Data Mining. Tan, et al., 2005.

Grading


Homework (3): 30%



Project: 40%. Two to three students form a group to carry out a small
research project.


A survey of the state of the art in an area related to this course


Machine learning techniques for specific applications


A comparative study of several well-known algorithms.


Design of a novel algorithm related to this course.



Exam (1): 20%. There will be one open-book exam on 3/22/07.



Class participation: 10%. Students are required to attend the lectures and
participate in class discussions.



A: 90-100, A-: 85-89, B+: 80-84, B: 70-79, C: 60-70

Project


The project proposal is due on 2/08/07


One half to one page


Topics, references, and plan



The intermediate project report is due on 4/05/07


Five to ten pages



The final project report is due on 4/26/07


Fifteen to twenty pages



Project presentation


About 5 minutes

Programming languages


Matlab


Tutorials


http://www.math.ufl.edu/help/matlab-tutorial/


http://www.math.mtu.edu/~msgocken/intro/node1.html


R (Statistics)


http://www.r-project.org/



Or other languages

What is machine learning?


Machine learning is the study of computer systems that improve
their performance through experience.


Learn existing and known structures and rules.


Discover new findings and structures.


Face recognition


Bioinformatics



Supervised learning vs. unsupervised learning



Semi-supervised learning

Machine learning versus data mining


A lot of common topics


Clustering


Classification


Many others


Different focuses


ML focuses more on theory (statistics)


DM focuses more on applications

Clustering


Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups

Inter-cluster distances are maximized

Intra-cluster distances are minimized
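
The two criteria above can be illustrated with a small sketch. It is not taken from the course materials: it assumes k-means (Lloyd's algorithm) as the clustering method, and the toy points, the starting centroids, and the helper names `assign`, `kmeans`, and `mean_pairwise_distance` are all hypothetical.

```python
import math

def assign(points, centers):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

def kmeans(points, centers, iterations=10):
    """Plain k-means (Lloyd's algorithm) starting from the given centroids."""
    for _ in range(iterations):
        labels = assign(points, centers)
        updated = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            # Move the centroid to the mean of its members; keep it if the cluster is empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else old)
        centers = updated
    return assign(points, centers), centers

def mean_pairwise_distance(points, labels, same_cluster):
    """Average distance over point pairs that are (or are not) in the same cluster."""
    dists = [math.dist(points[i], points[j])
             for i in range(len(points))
             for j in range(i + 1, len(points))
             if (labels[i] == labels[j]) == same_cluster]
    return sum(dists) / len(dists)

# Hypothetical toy data: two well-separated groups in the plane.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, centers=[(0, 0), (10, 10)])
intra = mean_pairwise_distance(points, labels, same_cluster=True)
inter = mean_pairwise_distance(points, labels, same_cluster=False)
print(labels, intra, inter)
```

On this toy data the average intra-cluster distance comes out much smaller than the average inter-cluster distance, which is the property a good clustering is meant to have.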

Applications of Cluster Analysis


Understanding


Group genes and proteins that have similar
functionality, or group stocks with similar price
fluctuations


Summarization


Reduce the size of large data sets


Clustering precipitation in Australia

Classification: Definition


Given a collection of records (training set)


Each record contains a set of attributes; one of the attributes is
the class.


Find a model for the class attribute as a function of the values of the other
attributes.


Goal: previously unseen records should be assigned a class as
accurately as possible.


A test set is used to determine the accuracy of the model. Usually,
the given data set is divided into training and test sets, with the
training set used to build the model and the test set used to validate
it.
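
The train/test procedure described above can be sketched as follows. This is only an illustration under stated assumptions: a 1-nearest-neighbor rule stands in for the model, and the toy records, the every-third-record split, and the function names are hypothetical.

```python
import math

def nearest_neighbor_predict(train, x):
    """Return the class of the training record whose attributes are closest to x."""
    _, label = min(train, key=lambda rec: math.dist(rec[0], x))
    return label

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / len(test)

# Hypothetical records: (attributes, class). Class "a" lies near the origin,
# class "b" near (5, 5).
records = [((i % 3, i // 3), "a") for i in range(9)]
records += [((5 + i % 3, 5 + i // 3), "b") for i in range(9)]

# Hold out every third record as the test set; the rest form the training set.
test = records[::3]
train = [r for i, r in enumerate(records) if i % 3 != 0]

test_accuracy = accuracy(train, test)
print(len(train), len(test), test_accuracy)
```

Because the two toy classes are far apart, every held-out record is classified correctly here; on real data the test accuracy is an estimate of how the model behaves on previously unseen records.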

Classification Example

[Figure: a classifier is learned from the training set to build a model, which is then applied to the test set.]

Classification: Application


Fraud Detection


Goal: Predict fraudulent cases in credit card transactions.


Approach:


Use credit card transactions and information on the
account holder as attributes.


When the customer buys, what the customer buys, how often
the customer pays on time, etc.


Label past transactions as fraud or fair transactions. This
forms the class attribute.


Learn a model for the class of the transactions.


Use this model to detect fraud by observing credit card
transactions on an account.

Character Recognition


Given a digit representation.


What is its class?



AT&T have used


Neural Networks


Support Vector Machines



Error rates ~1.4%



Inputs are 28x28 greyscale images.

Other applications


Face recognition



Protein function prediction



Cancer detection



Document categorization

Data representation


Traditional algorithms work on vectors.



Images can be represented as matrices or vectors.



Abstract data


Graphs


Sequences


3D structures



Kernel Methods: Basic ideas

[Figure: a mapping f takes points from the original space to the feature space.]
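
One way to make the original-space versus feature-space picture concrete is the sketch below. It is an illustration rather than the slide's own example: it assumes the degree-2 homogeneous polynomial kernel in two dimensions, and the names `phi`, `dot`, and `poly2_kernel` are hypothetical.

```python
import math

def phi(x):
    """Explicit feature map for the kernel k(x, y) = (x . y)^2 in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Kernel evaluated directly in the original space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
in_original_space = poly2_kernel(x, y)      # uses only the 2-D inputs
in_feature_space = dot(phi(x), phi(y))      # uses the mapped 3-D vectors
print(in_original_space, in_feature_space)
```

Both computations give the same value, so an algorithm that only needs inner products can work in the feature space without ever constructing the mapped vectors.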

Applications in bioinformatics


Protein sequence


Protein structure

Data integration

mRNA expression data

protein-protein interaction data

hydrophobicity data

sequence data (gene, protein)

Genome-wide data

Curse of dimensionality


A large sample size is required for high-dimensional data.



Query accuracy and efficiency degrade rapidly as the dimension
increases.



Strategies


Feature reduction


Feature selection


Manifold learning


Kernel learning
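
The degradation of distance-based queries noted above can be observed with a small experiment. This is a sketch under stated assumptions: points are drawn uniformly from the unit hypercube with a fixed seed, and `relative_contrast` is a hypothetical helper that measures how much farther the farthest point is than the nearest one.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

# Assumed dimensions chosen only for illustration.
contrast_by_dim = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrast_by_dim.items():
    print(d, round(c, 3))
```

As the dimension grows, the nearest and farthest points end up at nearly the same distance from the query, so nearest-neighbor queries become less meaningful. This is one motivation for the reduction strategies listed above.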


Manifold learning


A manifold is a topological space which is
locally Euclidean.

Intuition: how does your brain
store these pictures?

Model selection


Choose the best model from a set of different models to fit to
the data



Support Vector Machines (SVM), Linear Discriminant Analysis
(LDA)


Models are specified by certain parameters.


How to choose the best parameters?


Cross-validation (leave-one-out, k-fold CV)
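
Parameter selection by cross-validation can be sketched as follows. The sketch makes several assumptions that are not from the slides: a k-nearest-neighbor classifier stands in for models such as SVM or LDA, the parameter being tuned is the number of neighbors, and the seeded Gaussian toy data and the function names are hypothetical.

```python
import math
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training records closest to x."""
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def cross_val_accuracy(records, k_neighbors, n_folds=5):
    """Cross-validated accuracy; n_folds=len(records) gives leave-one-out."""
    correct = 0
    for fold in range(n_folds):
        # Every n_folds-th record (offset by `fold`) is held out for testing.
        test = records[fold::n_folds]
        train = [r for i, r in enumerate(records) if i % n_folds != fold]
        correct += sum(knn_predict(train, x, k_neighbors) == y for x, y in test)
    return correct / len(records)

# Hypothetical data: two overlapping Gaussian classes in the plane.
rng = random.Random(0)
records = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(30)]
records += [((rng.gauss(2, 1), rng.gauss(2, 1)), "b") for _ in range(30)]
rng.shuffle(records)

# Score each candidate parameter value with 5-fold CV and keep the best one.
scores = {k: cross_val_accuracy(records, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
loo_score = cross_val_accuracy(records, best_k, n_folds=len(records))
print(scores, best_k, loo_score)
```

Each record is used for testing exactly once, so the score for a parameter value reflects performance on data the model was not trained on. Leave-one-out is the special case where every fold contains a single record.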

Machine learning applications


Bioinformatics: Huge amounts of biological data from
the human genome project and human proteomics
initiative.


Goal: Understanding of biological systems at the molecular
level from diverse sources of biological data.


Challenge: Scalability, multiple sources, abstract data.


Applications: Microarray data analysis, Protein classification,
Mass spectrometry data analysis, Protein-protein interaction.




Others: Computer vision, information retrieval, image
processing, text mining, web mining, etc.

Course schedule

Survey


Why are you taking this course?



What would you like to gain from this course?



What topics are you most interested in learning about from this
course?




Any other suggestions?



Next class


Topics


Basics of linear algebra


Basics of probability



Readings (available at the class webpage)


Mini tutorial on the Singular Value Decomposition