Introduction to AI: Informal list of topics (2009)
Introduction to Artificial Intelligence
An informal Q/A list about the contents of the course

TOPICS

Overview of Artificial Intelligence
[behaviour, agent, intelligent behaviour, goals, environment, …]

Making Decisions under Uncertainty
[agent, environment, actions, utility, risk, MEU principle, rational behaviour, evidence, decision, MAP decision, ML estimation]

Probabilistic Inference with Graphical Models
[probabilistic inference, joint probability, conditional probability, marginalisation, maximisation, MAP solution, belief networks, ML parameter estimation, hidden Markov models, representing problems with BBNs, …]

Machine Learning
[definition of machine learning, formal model for iid supervised learning problems: classification and regression, input space, output space, hypothesis space, measuring performance: confusion matrix, sensitivity, specificity, accuracy, Bayes theorem, linear classifiers, linear regression, dual representation of linear hypotheses, kernel functions, kernel methods, nearest neighbour]

Text Processing & Information Retrieval
[vector space models: bag of words, stemming, stop words, text categorisation, tf/idf, cosine similarity, inverted index, search engines and PageRank, some simple corpus statistics]

Perception
[Information Value Theory, Sensory information versus Perception, Inverse Problems, Ill-Posed Problems, Poverty of Stimulus, Noisy Channel Model]

Data Mining / Association Rules [this is not taught/examined]
[KDD: knowledge discovery in databases, importance of data mining, association rules, itemsets, confidence, support, …]
Textbook: Artificial Intelligence: A Modern Approach (2nd edition), Russell, S. and Norvig, P.
Overview of Artificial Intelligence
[behaviour, agent, intelligent behaviour, goals, environment, …]
(textbook section 2.2)

Q: What is an intelligent agent?
A: An agent that can make decisions and choices in a way to pursue its goals, maximising its utility, under conditions not foreseen by the designer. That is, a robust, adaptive, goal-driven system.
Learning, perception, inference, and pattern discovery are all aspects of intelligent behaviour. Both natural and artificial systems can be modelled as intelligent agents; bacteria, plants, insect colonies, animals and robots all need to choose the next action based on available information, in a way to pursue their goals, or to maximise their utility. They differ vastly in the way they represent information internally, and hence in the kind of planning, learning and decision-making they can produce.

Q: What is behaviour?
A: Making the appropriate decision in a given situation: making choices between available options or actions, based on information gathered from the environment. There are various ways to make these decisions. In intelligent behaviour, these decisions are directed towards achieving a goal.

Q: What is a Rational Agent?
A: An agent is something that perceives and acts in an environment. The agent function for an agent specifies the action taken by the agent in response to any percept sequence. The performance measure evaluates the behaviour of the agent in an environment. A rational agent acts so as to maximize the expected value (expected utility) of the performance measure in an uncertain environment, given the percept sequence it has seen so far.
A task environment specification includes the performance measure, the external environment, the actuators, and the sensors. In designing an agent, the first step must always be to specify the task environment as fully as possible.
Making Decisions under Uncertainty
[agent, environment, actions, utility, risk, MEU principle, rational behaviour, evidence, decision, MAP decision, ML estimation]
(from the textbook)
This chapter shows how to combine utility theory with probability to enable an agent to select actions that will maximize its expected performance.
Probability theory describes what an agent should believe on the basis of evidence, utility theory describes what an agent wants, and decision theory puts the two together to describe what an agent should do.
We can use decision theory to build a system that makes decisions by considering all possible actions and choosing the one that leads to the best expected outcome. Such a system is known as a rational agent.

General setting and terminology (Section 16.1)

Q: Define utility function, and Expected Utility as an expectation of the utility received for a given action, under all possible (unknown) states of nature, given a belief in their probability.
A: If I have a belief in the probability of each unknown state of nature, and a pay-off matrix for each action in each state, then I can calculate the expected reward for each action; then I can pick the maximum expected reward action, as described below.

Q: State in probabilistic terms how an agent can hold beliefs, and generate a behaviour (formalise the space of possible actions, the space of possible states of the environment, the utility function...).
A: An agent A has a set of possible actions {ACTIONS} that can be taken in a given environment. The environment is in an unknown state S, and the utility of each action depends (probabilistically) on the actual state of the environment. Observable quantities that depend on the hidden state of the environment can be used to gain some knowledge about its state.
A nondeterministic action A will have possible outcome states Re_i(A), where the index i ranges over the different outcomes. Prior to the execution of A, the agent assigns probability P(Re_i(A) | Do(A), E) to each outcome, where E summarises the agent's available evidence about the world and Do(A) is the proposition that action A is executed in the current state.

Q: What is a rational decision? What is the expected utility? How to make decisions under uncertainty?
A: A rational behaviour in a system is one that maximises some utility function in the agent; under uncertainty the system can only maximise its expected utility. (Section 16.1, pp. 585)

Q: State the principle of maximum expected utility (MEU) for rational agents. (see Section 16.1 pp. 585 and the notes below)
A: Expected Utility (EU)
A nondeterministic action A will have possible outcome states Re_i(A), where the index i ranges over the different outcomes. Prior to the execution of A, the agent assigns probability P(Re_i(A) | Do(A), E) to each outcome, where E summarises the agent's available evidence about the world and Do(A) is the proposition that action A is executed in the current state. Then we can calculate the expected utility of the action given the evidence, EU(A | E), using the following formula:

EU(A | E) = Σ_i P(Re_i(A) | Do(A), E) · U(Re_i(A))

where P(Re_i(A) | Do(A), E) is the probability of outcome state i resulting from doing action A, Do(A) is the proposition that the agent executes action A, and E is the available evidence.
MEU principle
The principle of Maximum Expected Utility (MEU) says that a rational agent should choose an action that maximises the agent's expected utility. In other words, an agent is rational if and only if it chooses the action that yields the highest expected utility, averaged over all possible outcomes of the action.
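The MEU principle above can be sketched in a few lines. This is a minimal illustration with made-up actions, outcome probabilities and utilities (the pay-off table is hypothetical, not from the textbook):

```python
# Each action has outcomes with probabilities P(Re_i(A) | Do(A), E)
# and utilities U(Re_i(A)); the rational agent picks the MEU action.

def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

# Hypothetical pay-off table: action -> [(probability of outcome, utility)]
actions = {
    "take_umbrella": [(0.3, 70), (0.7, 80)],    # rain / no rain
    "leave_umbrella": [(0.3, 0), (0.7, 100)],
}

best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best)  # the MEU action: take_umbrella (EU 77 vs 70)
```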

Q: Define the Information Value of a sensory action α.
A: We assume that exact evidence is obtained about the value of some random variable E_j, so the phrase value of perfect information (VPI) is used. Let the agent's current knowledge be E.
The expected utility of the current best action α is defined by

EU(α | E) = max_A Σ_i P(Re_i(A) | Do(A), E) · U(Re_i(A))

and the value of the new best action α_{e_jk} (after the new evidence E_j = e_jk is obtained) will be

EU(α_{e_jk} | E, E_j = e_jk) = max_A Σ_i P(Re_i(A) | Do(A), E, E_j = e_jk) · U(Re_i(A))

But E_j is a random variable whose value is currently unknown, so we must average over all possible values e_jk that we might discover for it, using our current beliefs about its value. The value of discovering E_j, given current information E, is then defined as

VPI_E(E_j) = ( Σ_k P(E_j = e_jk | E) · EU(α_{e_jk} | E, E_j = e_jk) ) − EU(α | E).
Probabilistic Inference with Graphical Models
[probabilistic inference, joint probability, conditional probability, marginalisation, maximisation, MAP solution, belief networks, ML parameter estimation, hidden Markov models, representing problems with BBNs, …]
(Textbook: Sections 14.1, 14.2 (all); Sections 13.4, 13.5, and 13.6)
A Bayesian network is a directed acyclic graph whose nodes correspond to random variables; each node has a conditional distribution for the node, given its parents. Bayesian networks provide a concise way to represent conditional independence relationships in the domain.
A Bayesian network specifies a full joint distribution; each joint entry is defined as the product of the corresponding entries in the local conditional distributions. A Bayesian network is often exponentially smaller than the full joint distribution.
Inference in Bayesian networks means computing the probability distribution of a set of query variables, given a set of evidence variables. Exact inference algorithms, such as variable elimination, evaluate sums of products of conditional probabilities as efficiently as possible.
In polytrees (singly connected networks), exact inference takes time linear in the size of the network. This includes HMMs as an important special case. In the general case, the problem is intractable.

Q: What is inference, and what types of inference exist?
A: Inference is the process of forming beliefs about variables that cannot be observed directly, based on other observable quantities and on some theoretical models; logic vs probabilistic; deductive vs inductive; etc. In most of our course, beliefs are maintained as distributions of probability, and states of hidden nodes. Inference is often concerned with guessing one of those two types of quantities, and generally the probability of some random variable having a certain value…

Q: Define probabilistic inference as obtaining conditional probabilities connecting some random variables.
A: == Section 13.4, starts at pp. 475 == (read also Sections 13.5 and 13.6 for background)

Q: How can we obtain a conditional probability, by marginalisation of joint probability?
A: == Section 13.4, pp. 475 ==

Q: How is marginalisation of a joint probability used to perform probabilistic inference?
A: Conditional probability is the probability of some event A, given the occurrence of some other event B, P(A|B). Joint probability is the probability of two events in conjunction, P(A,B). Marginal probability is the unconditional probability P(A) of the event A; that is, the probability of A, regardless of whether event B did or did not occur. The marginal probability of A=a can be obtained by summing the joint probabilities over all outcomes for B. For example, if there are two possible outcomes for B with corresponding events b and b', this means that
P(A=a) = P(A=a,B=b) + P(A=a,B=b').
This operation is called marginalisation of the joint probability P(A,B) with respect to B and allows us to infer P(A).
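The marginalisation and conditioning steps above can be shown on a toy joint distribution (the probability values here are made up purely for illustration):

```python
# Toy joint distribution P(A,B) over two binary events.
joint = {("a", "b"): 0.2, ("a", "b'"): 0.1, ("a'", "b"): 0.3, ("a'", "b'"): 0.4}

# Marginalisation: P(A=a) = P(A=a,B=b) + P(A=a,B=b')
p_a = sum(p for (a, b), p in joint.items() if a == "a")

# Conditioning: P(A=a | B=b) = P(A=a,B=b) / P(B=b)
p_b = sum(p for (a, b), p in joint.items() if b == "b")
p_a_given_b = joint[("a", "b")] / p_b

print(p_a)          # 0.2 + 0.1 = 0.3
print(p_a_given_b)  # 0.2 / 0.5 = 0.4
```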

Q: Define Independence and Conditional Independence of Random Variables.
A: 2 possible definitions… == Section 13.5, pp. 481–482 ==

Q: Factorisation of joint probabilities, exploiting CI assumptions.
A: == pp. 478 ==

Q: Explain how we can obtain a graphical representation of a factorised joint as a BBN.
A: Show how a graph can be mapped to a joint, with nodes mapping to random variables, edges to CPTs, and lack of edges to CI assumptions, etc… == Sections 14.1 and 14.2 ==

Q: Define Belief Network.
A: (see textbook pp. 493)
A Bayesian network is a directed acyclic graph whose nodes correspond to random variables; each node has a conditional distribution for the node, given its parents. Bayesian networks provide a concise way to represent conditional independence relationships in the domain.
A Bayesian network specifies a full joint distribution; each joint entry is defined as the product of the corresponding entries in the local conditional distributions. A Bayesian network is often exponentially smaller than the full joint distribution.
We can represent a joint distribution as a product of conditional probabilities (factorisation) as follows:

P(x_1, x_2, ..., x_n) = Π_{i=1}^{n} P(x_i | parents(X_i))
Q:
Explain how we can marginalise and maximise over a BBN
.
A:
G
eneral
discussion
of the problem
, not detailed derivation
of algorithms
.
Even though the worst
case complexity of enumerating all states of all hidden nodes may be intractable, for many graph
topologies we can benefit from dynamic programming algori
thms
that make
the exact computation
tractable. For large graphs, there are approximate solution methods. Various methods are
available in our Matlab toolbox, or in many packages.

Q: What are observable and hidden nodes in a BBN?
A: We can interpret all nodes in a BBN as random variables. Some will be directly observable by the agent, and will form the Evidence used in inference. Others will not be observable, and the value of some of them will be the quantity of interest to the agent. Others still will remain unobserved, and can be dealt with by marginalisation; see interpretations of BBNs in Chapter 14.

Q: What is a Hidden Markov Model?
A: A special case of a BBN, described in Section 15.3. Applied to bioinformatics, speech, and text analysis. Often used as an example of a noisy channel model. It involves a hidden Markov chain, and an observable sequence of the same length, whose symbols are probabilistic functions of the hidden Markov chain symbols (no details required).

Q: What is the optimization problem solved by the Viterbi algorithm?
A: In a HMM (hidden Markov model) it is the problem of finding the state sequence of the hidden chain that maximises the joint probability P(o,h) of the observable and hidden sequences, having fixed a setting of the parameters. == pp. 547–549 / no formal details needed / ==

Q: Define likelihood, and the maximum likelihood principle.
A: The likelihood is the probability of the given data under a certain hypothesis. The maximum-likelihood principle prescribes choosing the value of the parameters that maximizes the likelihood. This choice of parameters becomes particularly easy to compute in the case of Markov chains, or also HMMs when all variables are observable (reducing often to the computation of transition and emission frequencies in the sample). (see textbook pp. 716 to 720)

Q: What is the learning problem associated to a BBN?
A: Given a BBN, and given a set of observations of all or some of its variables, infer the parameters (CPTs) that maximise the likelihood.

Q: What is the purpose of the EM algorithm?
A: To infer parameters from data.

Q: What is the statistical principle behind EM?
A: Maximum likelihood; see below...

Q: What are the algorithmic limitations of the EM algorithm?
A: Not guaranteed to find the ML solution, but often it can approximate it.
NOTE: we do not do any algorithmic details about the EM algorithm.

Q: What is a BBN? How can probabilistic inference be performed by marginalisation in a BBN?
A: A Belief Network (or Bayesian Belief Network, or in general Probabilistic Graphical Model) is a data structure that represents dependencies among random variables (nodes) and gives a concise specification of any full joint probability distribution.
A Bayesian Belief Network is a directed graph in which each node is annotated with quantitative probability information. The full specification is as follows: (see textbook for details)
a) A set of random variables (discrete or continuous) makes up the nodes of the network.
b) A set of directed links or arrows connects pairs of nodes. If there is an arrow from node X to node Y, then X is said to be a parent of Y.
c) Each node X_i has a conditional probability distribution P(X_i | parents(X_i)) that quantifies the effect of the parents on the node.
d) The graph has no directed cycles (Directed Acyclic Graph, DAG).

Q: Define what a Hidden Markov Model is (state formally the various components of the model).
A: Given a hidden alphabet H, and a visible alphabet V, define a Markov chain over H, whose transition matrix is T. Define also a process emitting a visible symbol for each element of the hidden chain formed by the above process. This follows the emission probabilities stored in E. The visible sequence is generated by first generating the hidden one by using T, then the visible one, using E and the hidden chain. Most quantities are easily computed in HMMs: P(v,h), max_{h} P(v,h), etc.

Q: State Bayes Rule. (see textbook, Section 13.6)
A: P(H|D) = P(D|H)P(H)/P(D), where we could interpret D and H as data and hypothesis, hence finding a natural way to choose hypotheses based on data, in machine learning. We could also interpret these as Hidden and Visible chains in a HMM, or generally as sent/received message in a noisy channel model.
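Bayes rule is easy to check numerically. A minimal sketch with invented prior and likelihoods, where P(D) is obtained by marginalising over the two hypotheses:

```python
# Bayes rule: P(H|D) = P(D|H) P(H) / P(D)
p_h = 0.01             # prior P(H)
p_d_given_h = 0.9      # likelihood P(D|H)
p_d_given_not_h = 0.1  # P(D|not H)

# P(D) by marginalisation over H
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
p_h_given_d = p_d_given_h * p_h / p_d
print(p_h_given_d)  # 0.009 / 0.108 = 1/12, despite the strong likelihood
```

Note how the small prior keeps the posterior low even though the likelihood ratio favours H by 9 to 1.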

Q: What is a MAP (maximum a posteriori) decision? (see textbook pp. 714)
Machine Learning
[definition of machine learning, formal model for iid supervised learning problems: classification and regression, input space, output space, hypothesis space, measuring performance: confusion matrix, sensitivity, specificity, accuracy, Bayes theorem, linear classifiers, linear regression, dual representation of linear hypotheses, kernel functions, kernel methods, nearest neighbour]
(see Chapters 18 and 20)

Q: How many different types of learning can you list, in humans?
A: Open-ended; but any basic list should include: memorisation, discovery, motor coordination, linguistic learning, meaning, etc.

Q: What is the general machine learning problem?
A: Updating the beliefs of the agent with experience; these beliefs are used to map input to output in any behaviour generation method, e.g. in MEU.

Q: Can we generate behaviour without necessarily modelling a complete joint distribution? How do we model the acquisition of direct input/output couplings?
A: We can have a class of hypotheses H, and a set of training examples S from an input space X; then we use S to find the most suitable hypothesis in H to generate the required behaviour. For example, we may be given a bilingual corpus of French and English documents, and the hypothesis we are looking for is a translation model that maps French sentences into English ones.

Q: Can you formalise the special case of learning classification functions, given a finite sample of their target behaviour?
A: X input space; Y output space (e.g. Y = {0,1}); H hypothesis space, H = {f: X → Y | some property of f}; S is a subset of X, each element of S labelled with an element of Y (training set).
Given a LOSS function to assess the performance of any hypothesis in H, the task is to use S to find the best (minimal expected loss on future data) hypothesis in H. Assumptions on the generation of data are usually needed; typically it is assumed that data are sampled iid from (X,Y).

Q: How do we compare performance in machine learning algorithms?
A: Fixed a loss function (e.g. the number of mistaken predictions, when doing classification learning), typically 2 methods are compared by testing them on several datasets under identical conditions. It is important to assess the average performance of a method over a dataset, as well as its standard deviation. This is often obtained by repeatedly generating random subsets from the dataset, training on the subset and testing on the remaining data.
Various options of loss function are available. Important are: accuracy, sensitivity, specificity, and generally it is useful to keep track of the confusion matrix in classification problems (the full statistical tally of what types of errors were made on the test set).

Q: List some statistical learning methods.
A: Statistical learning methods range from simple calculation of averages to the construction of complex models such as Bayesian networks and neural networks. They have applications throughout computer science, engineering, neurobiology, psychology, and physics. This chapter has presented some of the basic ideas and given a flavour of the mathematical underpinnings. The main points are as follows:
Bayesian learning methods formulate learning as a form of probabilistic inference, using the observations to update a prior distribution over hypotheses. This approach provides a good way to implement Ockham's razor, but quickly becomes intractable for complex hypothesis spaces.
Maximum a posteriori (MAP) learning selects a single most likely hypothesis given the data. The hypothesis prior is still used and the method is often more tractable than full Bayesian learning.
Maximum likelihood learning simply selects the hypothesis that maximizes the likelihood of the data; it is equivalent to MAP learning with a uniform prior. In simple cases such as linear regression and fully observable Bayesian networks, maximum likelihood solutions can be found easily in closed form.

Q: What is a classification problem?
A: Given a set of examples from X marked as belonging to one of k classes, find a function in H that can correctly assign new examples to the appropriate class (example: handwritten digit recognition).

Q: Define a machine learning model (input space, output space, hypothesis space, feature extraction, model selection etc.).
A: Input space X, labels Y, sample S divided into train and test, hypothesis space H; use the training set to find the best function f in H that makes predictions on the test set, with lowest possible risk (which itself is a blend of loss and outcome probability)...

Q: State the problem of learning classification functions. Specify: input, output and hypothesis space. In the case of linear classifiers, state the hypothesis space both in primal and in dual representation.

Q: Discuss the difference between primal and dual representations of linear classifiers.

Q: State the hypothesis space of kernel classifiers. What is a feature space? Definition of a kernel function.

Q: Define the nearest-neighbour classification rule.

Q: What is a kernel function?
A: A kernel function between 2 points in an input space, x ∈ X, z ∈ X, returns the result of the inner product between their images in a "feature" space F: K(x,z) = <φ(x), φ(z)>, with φ any map from X to F: φ: X → F.
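The defining property K(x,z) = <φ(x), φ(z)> can be verified on a small example. Here is a sketch using the degree-2 polynomial kernel in 2D, whose feature map φ is known explicitly (the sample points are arbitrary):

```python
import math

def k(x, z):
    # degree-2 polynomial kernel: K(x,z) = (<x,z>)^2
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    # explicit feature map for this kernel: (x1^2, x2^2, sqrt(2)*x1*x2)
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, z = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(k(x, z), explicit)  # the two computations agree (up to rounding)
```

The point of kernel methods is that k(x, z) never builds φ explicitly, which matters when F is high- or infinite-dimensional.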

Q: State the input space, output space, hypothesis space, of the basic least squares linear regression method in one dimension? In many dimensions?
A: (one dimension)
Input space: X = ℝ
Output space: Y = ℝ
Hypothesis space: H = {f: X → Y | f(x) = ax + b, with a ∈ ℝ, b ∈ ℝ}
(n dimensions)
Input space: X = ℝ^n
Output space: Y = ℝ
Hypothesis space: H = {f: X → Y | f(x) = <w,x> + b, with w ∈ ℝ^n, b ∈ ℝ}

Q: State the input space, output space, hypothesis space, of the basic 2-class linear classification method in many dimensions.
A:
Input space: X = ℝ^n
Output space: Y = {0,1}
Hypothesis space: H = {f: X → Y | f(x) = sign(<w,x> + b), with w ∈ ℝ^n, b ∈ ℝ}
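A hypothesis from this space is just a thresholded inner product. A minimal sketch (the weights here are chosen by hand for illustration; learning w and b is a separate problem):

```python
# 2-class linear classifier: f(x) = sign(<w,x> + b), mapped to {0,1}
def classify(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else 0

w, b = [1.0, -2.0], 0.5
print(classify(w, b, [3.0, 1.0]))  # 3 - 2 + 0.5 = 1.5 >= 0 -> class 1
print(classify(w, b, [0.0, 1.0]))  # 0 - 2 + 0.5 = -1.5 < 0 -> class 0
```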

Q: What are the hypothesis space, input space, output space and loss function of least squares regression?
A: Suppose that we have a matrix X with n rows and m columns (n observations of the m random variables) and a target vector y with n elements.
Input space: X ∊ ℝ^{n×m}
Output space: y ∊ ℝ^n
Hypothesis space: H = {f: X → y | f(X) = Xw + b, with w ∊ ℝ^m, b ∊ ℝ^n}
Loss function: the squared error L(w, b) = ||y − (Xw + b)||², i.e. the sum of squared residuals over the n observations.
Text Processing and Information Retrieval
[vector space models: bag of words, stemming, stop words, text categorisation, tf/idf, cosine similarity, inverted index, search engines and PageRank, some simple corpus statistics]
Read Chapter 23.2 from the textbook and the Google paper
(http://www.math.upenn.edu/~kazdan/210/210F08/LectureNotes/Google/Brin-Page.pdf).

Q: What is the vector space representation of a document?
A: It involves mapping a document to a vector space, where every dimension corresponds to a possible word in the corpus. Typically the value of the corresponding coordinate will reflect the frequency (or other property) of the given word, and often the length of the vector is normalised. Inner products in this space often correlate to semantic similarity, and are used for document retrieval.

Q: What is text categorisation and retrieval?
A: In text categorisation, a classification algorithm is requested to learn to assign documents to one of a set of classes. Special cases include spam filtering, and topic detection. In information retrieval, a query document is presented, and the algorithm is requested to rank the remaining ones in order of relevance, or semantic similarity, to the query. Vector space representations are often used in both cases.

Q: What is a "bag of words" representation?
A: Essentially it is another name for a vector space representation. In order to represent a document as a vector, we lose all information deriving from the order of words in a text. All that is kept is the set of words, and their frequency: what we call "a bag of words".
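A bag of words is literally a word-to-frequency map; any two texts that are permutations of each other get the same representation. A minimal sketch:

```python
from collections import Counter

def bag_of_words(text):
    # keep only words and their frequencies; word order is discarded
    return Counter(text.lower().split())

bow = bag_of_words("the dog chased the cat")
print(bow)  # {'the': 2, 'dog': 1, 'chased': 1, 'cat': 1}

# word order is lost: a permuted sentence has the same bag of words
print(bow == bag_of_words("the cat chased the dog"))  # True
```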

Q: What are stop words?
A: Most words in text documents do not contain any information relative to the topic of the document, but are just grammatical words such as 'the' and 'of'. Since we remove ordering information, these contain even less information. Removing them often helps to detect semantic similarity.

Q: What is stemming?
A: Direct comparison of words will consider as different 2 words that perhaps differ just by their number (e.g. dog vs dogs) or tense (e.g. go vs gone). Removing inflection information, and keeping just word stems, can improve the quality of the semantic similarity detected by vector space models.

Q: Define: tf/idf and cosine measure.
A: The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d. The relevance of a term is not directly proportional to its term frequency. Also, some terms are just more common than others. The first problem can be solved perhaps by using the log of tf. But the second needs more information. Rare terms are more informative than frequent ones. We can count the document frequency of a term as follows: df(t), the document frequency of term t in the entire corpus, is defined as the total number of documents that contain it. It is an inverse measure of how informative the term is. We can define idf(t) as log(N/df(t)), where N is the total number of documents, and the log is used to moderate the impact of terms that have very high df.
Every document can be represented as a vector. Formally, given a vocabulary V of |V| distinct terms, a document D can be considered as a |V|-dimensional vector which assigns a numerical value to each term of the vocabulary.
Let N be the total number of documents in the system or the collection and df_i be the number of documents in which term t_i appears at least once. Let f_ij be the raw frequency count of term t_i in a document d_j. Then, the normalised term frequency tf_ij of t_i in d_j is given by

tf_ij = f_ij / max_k f_kj

where the maximum is computed over all terms that appear in document d_j. If term t_i does not appear in d_j then tf_ij = 0.
The inverse document frequency idf_i of term t_i is given by

idf_i = log(N / df_i)

The final TF-IDF term weight is given by

w_ij = tf_ij × idf_i
Stemming refers to the process of reducing words to their stems or roots. A stem is the portion of a word that is left after removing its prefixes and suffixes. Stemming is useful because it increases the recall and reduces the size of the indexing structure (vocabulary).
Stop words are frequently occurring and insignificant words in a language that help construct sentences but do not represent any content of the documents, e.g. 'the', 'a', 'in', etc. Stop word removal is the procedure of removing this category of words before processing documents (before converting documents to vector space representations). Clearly, removing those words reduces the size of the vocabulary and keeps the focus of the vector space document representation on the words that significantly shape its meaning.
Given that A and B are the vector representations of documents d_A and d_B respectively, the cosine similarity (cosim) between d_A and d_B is defined as

cosim(d_A, d_B) = (A · B) / (|A| |B|) = cos θ

where θ is the angle between vectors A and B.
Advantages: Simple method with very small computational complexity.
Disadvantage: Assumes that all the terms (words) are pairwise independent (which is not the case).
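The cosine measure above is a one-liner over document vectors. A minimal sketch (the vectors are arbitrary; in practice they would be tf-idf vectors):

```python
import math

def cosim(a, b):
    # cosine similarity: (A . B) / (|A| |B|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosim([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # parallel vectors -> 1.0
print(cosim([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal vectors -> 0.0
```

Note that the normalisation by the vector lengths makes the measure independent of document length.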

Q: Define: PageRank of a webpage.
A: PageRank is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents with the purpose of "measuring" its relative importance within the set. The numerical weight that it assigns to any given element E is also called the PageRank of E and denoted by PR(E). It can equivalently be interpreted either as a measure of authority of a page or as a measure of probability of it being visited in a random walk. In the first case, the authority of a page results from the authority of all the pages pointing to it. In the second case the probability of a node being visited in a random walk on the hypertext graph depends on the same quantity for all its neighbours.
The PageRank PR of a web page p_i is defined as

PR(p_i) = (1 − d)/N + d · Σ_{p_j ∈ M(p_i)} PR(p_j) / L(p_j)

where M(p_i) is the set of pages that link to p_i, L(p_j) denotes the number of outbound links for page p_j, and N is the total number of web pages involved.
The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d. If we assume that the user never stops clicking, then d = 1 and the formula reduces to

PR(p_i) = Σ_{p_j ∈ M(p_i)} PR(p_j) / L(p_j)

The general intuition of the above formulas is that a page gets a higher ranking when pages with high ranking and a relatively small number of links include links to it.
With d = 1 this leads to the eigenproblem M · R = R, where R is the vector of PageRanks and M is the matrix with M_ij = 1/L(p_j) if p_j links to p_i and 0 otherwise; it can therefore be solved efficiently.
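The fixed point of the PageRank equation can be found by simple iteration. A sketch on a hand-made three-page graph (the link structure and d = 0.85 are illustrative choices):

```python
# PageRank by power iteration on a tiny graph: page -> pages it links to.
links = {0: [1, 2], 1: [2], 2: [0]}
N, d = 3, 0.85

pr = [1.0 / N] * N  # start from the uniform distribution
for _ in range(100):
    new = [(1 - d) / N] * N
    for p, outs in links.items():
        for q in outs:
            # page p passes d * PR(p) / L(p) to each page it links to
            new[q] += d * pr[p] / len(outs)
    pr = new

print([round(x, 3) for x in pr])  # page 2, with two in-links, ranks highest
```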
Q: Can you derive / discuss PageRank in a different way?
A: [Here is a derivation based on the notion of centrality of a node in a network. I am using notation and equations from Wikipedia.]
Google PageRank is a measure of the authority of a node in a network.
The principle is that connections to high-authority nodes should result in a higher authority for a node. It is part of the general concept of 'eigenvector centrality' of a node in a network.
Let x_i denote the score of the i-th node. Let A_{i,j} be the adjacency matrix of the network. Hence A_{i,j} = 1 if the i-th node is adjacent to the j-th node, and A_{i,j} = 0 otherwise. More generally, the entries in A can be real numbers representing connection strengths, as in a stochastic matrix.
For the i-th node, let the centrality score be proportional to the sum of the scores of all nodes which are connected to it. Hence

x_i = (1/λ) Σ_{j ∈ M(i)} x_j = (1/λ) Σ_{j=1}^{N} A_{i,j} x_j

where M(i) is the set of nodes that are connected to the i-th node, N is the total number of nodes and λ is a constant. In vector notation this can be rewritten as x = (1/λ) A x, or as the eigenvector equation

A x = λ x
In general, there will be many different
eigenvalues
λ
for which an eigenvecto
r solution exists.
However, the additional requirement that all the entries in the eigenvector be positive implies (by the
Perron
–
Frobenius theorem
)
that only the greatest eigenvalue results in the desired centrality
measure.
[7]
The
i
th
component of the related eigenvector then gives the centrality score of the
i
th
node in the
network.
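In practice the dominant eigenvector can be found by power iteration: repeatedly apply the matrix and renormalise. Here is a minimal sketch in Python; the 4-node example network is made up for illustration.

```python
def eigenvector_centrality(A, iterations=100):
    """Power iteration: repeatedly apply x <- A x and renormalise.
    This converges to the eigenvector of the greatest eigenvalue,
    which by Perron-Frobenius has all-positive entries."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(iterations):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(y)
        x = [v / norm for v in y]
    return x

# A small symmetric 4-node network: node 0 is linked to all others.
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
scores = eigenvector_centrality(A)
# Node 0 receives links from every other node, so it ends up with
# the highest centrality score.
```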

Q: Define: precision / recall and related quantities (sensitivity/specificity, etc.)
A: In IR, precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search, and recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents (which should have been retrieved).
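Both quantities can be computed directly from set sizes; a minimal sketch, with made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved AND relevant| / |retrieved|
       Recall    = |retrieved AND relevant| / |relevant|"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# A search returns 4 documents; 8 documents are actually relevant,
# and only 2 of the retrieved ones are among them.
p, r = precision_recall(retrieved={1, 2, 3, 4},
                        relevant={2, 3, 5, 6, 7, 8, 9, 10})
# p = 2/4 = 0.5, r = 2/8 = 0.25
```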

Q: What is the distribution of frequencies of words in natural language text?
A: Words are distributed according to a power law. The number k(n) of words with frequency n is distributed as

k(n) = Z / n^γ

where Z is just a normalisation factor (and γ a positive exponent). This shows that there are many words with low probability, and a few words with very high probability, and the tail of the distribution vanishes very slowly …
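The frequency-of-frequencies count k(n) is easy to compute on any text; a minimal sketch (the toy sentence is made up):

```python
from collections import Counter

def frequency_of_frequencies(text):
    """Return k: k[n] = how many distinct words occur exactly n times."""
    word_counts = Counter(text.lower().split())
    return Counter(word_counts.values())

text = "the cat sat on the mat and the dog sat near the cat"
k = frequency_of_frequencies(text)
# Most words occur only once, while a single word ("the") dominates:
# exactly the heavy-tailed shape the power law describes.
```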
Perception
[Information Value Theory, Sensory information versus Perception, Inverse Problems, Ill Posed problems, Poverty of Stimulus, Noisy Channel Model]
Autonomous agents use information from their environment in order to make decisions that are aimed at achieving their goals. Access to environmental information is essential for this process, as much as is their capability to act on the environment. This will create a feedback loop, coupling agent and environment.
Sensory Information and its Value
Environmental information that is relevant to the agent may be acquired by the use of sensors. Any measurement apparatus may qualify as a sensor, so that an agent acting in the physical world may be provided with sensors for: light, gravity, acidity, presence of nutrients, noise, magnetic fields, contact, and so on. Biological agents, like single cells, are provided with membrane receptors, linked by sensory pathways to the rest of the cell’s metabolic network. Binding with certain chemicals – for example – may trigger escape reactions in bacteria. Software agents can be modelled in the same way, for example by considering an agent’s capability to acquire statistical information about the content of a webpage as a sensory capability.
The acquisition of information can be the result of a deliberate act by the agent (a sensory action). In evolution, the existence itself of a sensory receptor can be seen as the result of a ‘decision’ by the species. In both cases we can ask the question: how valuable is the information acquired by the agent?
Information value theory

Q: How to assess the value of a piece of information? Or the value of maintaining a sensory apparatus?
A: The basic idea is that (see textbook for more details) information has value to the extent that
1) it is likely to cause a change of plan
2) the new plan will be significantly better than the old plan.
This suggests that we can measure it by computing the expected utility of the optimal behaviour (action) chosen with or without that piece of information.
(from textbook) The value of information derives from the fact that with the information, one’s course of action can be changed to suit the actual situation. One can discriminate according to the situation, whereas without the information, one has to do what’s best on average over the possible situations. In general, the value of a given piece of information is defined to be the difference in expected value between best actions before and after information is obtained.
[Note: it is often the case that a sensory apparatus is lost when a species does not benefit from it, due to a change in the habitat. See eyes in the caecilians, or odour receptors in humans. Entire pathways can be lost in parasitic bacteria. The value of the information did not match its cost.]
A formal definition for the information value of a sensory action α can be given as follows (taken from the textbook):
We assume that exact evidence is obtained about the value of some random variable E_j, so the phrase value of perfect information (VPI) is used. Let the agent’s current knowledge be E. The expected utility of an action α is defined by

EU(α | E) = max_A Σ_i U(Result_i(A)) P(Result_i(A) | Do(A), E)

and the value of the new best action (after the new evidence E_j is obtained) will be

EU(α_{E_j} | E, E_j) = max_A Σ_i U(Result_i(A)) P(Result_i(A) | Do(A), E, E_j)

But E_j is a random variable whose value is currently unknown, so we must average over all possible values e_jk that we might discover for E_j, using our current beliefs about its value. The value of discovering E_j, given current information E, is then defined as

VPI_E(E_j) = ( Σ_k P(E_j = e_jk | E) EU(α_{e_jk} | E, E_j = e_jk) ) − EU(α | E).
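The VPI definition above can be sketched numerically for the simplest case, where the sensory action reveals the state of the world exactly; the utility table below is a made-up example:

```python
def expected_utility(action_utils, p_state):
    """EU of an action: sum over states of P(state) * U(action, state)."""
    return sum(p * u for p, u in zip(p_state, action_utils))

def best_eu(utils, p_state):
    """Utility of the best action under the current belief p_state."""
    return max(expected_utility(a, p_state) for a in utils)

def vpi(utils, p_state):
    """Value of perfect information about the state: the average utility
    of choosing the best action AFTER observing the state, minus the
    utility of the best action chosen on the prior belief alone."""
    with_info = sum(p * max(a[s] for a in utils)
                    for s, p in enumerate(p_state))
    return with_info - best_eu(utils, p_state)

# Hypothetical example: utils[action][state], two equally likely states.
utils = [[100, 0],   # action 0 pays off only in state 0
         [0, 100]]   # action 1 pays off only in state 1
value = vpi(utils, [0.5, 0.5])
# On the prior, each action has EU 50; knowing the state guarantees 100,
# so the information is worth 50.
```

Note that when one action dominates in every state, VPI is zero: the information cannot change the plan, matching condition 1) above.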
Perception as Inference (and as Inverse Problem)
Often the most relevant piece of information for the agent cannot be directly observed. This may be because it is something about the future, or because the agent has no sensor to measure it directly. Still, an educated guess about its value can be made, based on partial or indirect information: the agent can try to infer the value of interest. Inferring the value of hidden variables is a difficult task, but can be performed provided that assumptions are made by the agent, and/or various sources of information are integrated. This may involve distinct sensory channels, or information from memory, or simply expectations (based on beliefs and experience).
The result of this inference is what we call Perception: relevant information about the state of the world, not directly observed, but obtained by combining direct sensory data with other data and with prior information.
Sometimes the observation that sensory information is not sufficient to infer the value of the hidden variable (the relevant state of the environment) is called “poverty of the stimulus”. This consideration shows that Perception is typically an inverse problem, and as such an ill-posed problem (definitions below). We will model this with a “noisy channel model” (definition below).
Poverty of Stimulus
Often the information coming via the senses is not sufficient for the agent to form a useful picture of the environment. It is necessary to integrate it with expectations, extra information, other senses, etc. The same stimulus could lead to different perceptions in different agents. Perception is an inverse problem.
Definition: Well Posed Problem
In mathematics, a well-posed problem has the 3 following properties (Hadamard):
1) there exists a solution
2) the solution is unique
3) the solution is a smooth function of the inputs (in some reasonable metric)
Well posed problem: The mathematical term well-posed problem stems from a definition given by Jacques Hadamard. He believed that mathematical models of physical phenomena should have the properties that 1. A solution exists 2. The solution is unique 3. The solution depends continuously (even smoothly) on the data, in some reasonable topology. INVERSE PROBLEM: An inverse problem is the task that often occurs in many branches of science and mathematics where the values of some model parameter(s) must be obtained from the observed data.
Definition: Ill-Posed Problem
An ill-posed problem is a problem that is not well posed in the sense of the definition above. Problems that allow multiple solutions, or that are unstable under perturbations of the input, are ill-posed. Regularization techniques are often used to address ill-posed problems, or to improve their stability.
Hadamard, in a 1902 paper, defined as ill posed those mathematical problems whose solution does not exist, or is not unique, or is not stable under perturbations of the data.
Definition: Regularization
Ill-posed problems can be treated by means of regularisation techniques. These involve adding extra constraints to the ill-specified problem so that the solution becomes unique...
One type of ill-posed problem, important in modelling Perception but also in Machine Learning algorithms, is that of inverse problems.
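As a minimal illustration of regularisation, ridge (Tikhonov) regression adds a penalty term to a least-squares fit, making the solution unique and stable; the data points below are made up:

```python
def ridge_slope(xs, ys, lam):
    """Fit y ~ w*x by minimising  sum (y - w x)^2 + lam * w^2.
    The penalty lam*w^2 is the extra constraint (regularisation) that
    keeps the solution unique and stable under noisy data.
    Closed form: w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.0]
w_unreg = ridge_slope(xs, ys, lam=0.0)   # ordinary least squares
w_reg = ridge_slope(xs, ys, lam=10.0)    # penalised fit, shrunk towards 0
```

Larger values of lam trade fidelity to the data for stability of the solution, which is exactly the compromise regularisation makes on ill-posed problems.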
Definition: Inverse Problem
In mathematics, problems that involve reconstructing the parameters of a model from partial observations of its behaviour (e.g.: image reconstruction for X-ray computerized tomography). In machine learning we need to reconstruct a target function based on finite samples of its behaviour (training data). In Perception we need to estimate a hidden state of the world based on measurements of related data (sensory information). Every time a PGM estimates the value of hidden nodes, as in a Hidden Markov Model, it is solving an inverse problem. Indeed, strong assumptions are made by HMMs in order to solve it.
“Inverse problems are concerned with determining causes for a desired or an observed effect or calibrating the parameters of a mathematical model to reproduce observations. Inverse problems most often do not fulfil Hadamard’s postulates of well-posedness: they might not have a solution in the strict sense, solutions might not be unique and/or might not depend continuously on the data. Hence their mathematical analysis is subtle. However they have many applications in engineering, physics and other fields” [http://www.proba.jussieu.fr/pageperso/ramacont/inverse.html]
A Model for Inverse Problems: the Noisy Channel
Definition: Noisy Channel
Assume that a set of random variables are read, and their values are sent along a noisy channel, that can corrupt their value. You receive the corrupted message and are asked to guess the original one. Under assumptions about the kind of noise, and the kind of random variables generating the original message, this can be modelled as a PGM. [NCM = Noisy Channel Model.]
Example: If the original message is a string randomly generated by a Markov chain, and if the noise is independently applied to each symbol, then the NCM is equivalent to an HMM.
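This equivalence can be sketched with Viterbi decoding of the HMM, which recovers the most likely original message from the corrupted one; the transition and emission probabilities below are made-up illustrative values:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden sequence given noisy observations.
    The hidden Markov chain plays the role of the original message;
    emit_p models the per-symbol corruption of the channel."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][o], p)
                             for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

# Hypothetical binary channel: the source tends to repeat its last
# symbol (p = 0.9); the channel flips each symbol with probability 0.1.
states = ['0', '1']
start_p = {'0': 0.5, '1': 0.5}
trans_p = {'0': {'0': 0.9, '1': 0.1}, '1': {'0': 0.1, '1': 0.9}}
emit_p = {'0': {'0': 0.9, '1': 0.1}, '1': {'0': 0.1, '1': 0.9}}
decoded = viterbi(['0', '0', '1', '0', '0'], states, start_p, trans_p, emit_p)
# The isolated '1' is more plausibly channel noise than a real symbol
# change, so the decoder corrects it.
```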
How can the noisy channel be used to model the inverse problem of Perception? We assume that the hidden variable of interest is represented by a hidden node in a PGM, and that it is linked to a set of variables that can be observed. One particular example of this is when we can regard the observable variables as the corrupted version of the hidden variables. Sensory information (possibly from multiple channels) can then be used to reconstruct the original information that could not be sensed.
Data Mining / Association Rules
[this is not taught/examined]
[KDD: knowledge discovery in databases, importance of data mining, association rules, itemsets, confidence, support, …]

Q: What is Data Mining?
A: Data mining is the process of extracting hidden patterns from data. It is commonly used in a wide range of applications, such as marketing, fraud detection and scientific discovery.

Q: What is market basket analysis?
A: An important task in data mining is the discovery of co-occurrences among items in a transactions database (e.g.: products sold together, web pages viewed by the same users, etc.). An intuitive example of this is market basket analysis in the retail business. In this case retailers use it to understand the purchase behaviour of groups of customers, and use it for cross-selling, store design, discount plans and promotions. The most common example is of course Amazon.com. Typical outputs of this analysis are association rules (see below).

Q: What are itemsets and transactions?

Q: What is the support of an itemset?

Q: What is an association rule?

Q: What is the confidence and support of an association rule?
General answer (more details on course webpage): In the context of association rule mining, we are initially given a space of items, and a transaction is a subset of items that have been bought together. A transactions database is formed by a set of transactions. An itemset is any subset of the space of items, and its support is the cardinality of the set of transactions that contain it. A frequent itemset in a given database is an itemset with support larger than a chosen threshold.
(A: For the previous 4 definitions, see the various links to papers on the course webpage, and also feel free to google about this; see for example http://en.wikipedia.org/wiki/Association_rule)
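Support and confidence can be sketched in a few lines of code; the toy transactions database below is made up:

```python
def support(itemset, transactions):
    """Support of an itemset = number of transactions containing it."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs:
    support(lhs union rhs) / support(lhs), i.e. the fraction of
    transactions containing lhs that also contain rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

transactions = [{'bread', 'milk'},
                {'bread', 'butter'},
                {'bread', 'milk', 'butter'},
                {'milk'}]
s = support({'bread', 'milk'}, transactions)       # appears in 2 baskets
c = confidence({'bread'}, {'milk'}, transactions)  # 2 of 3 bread baskets
```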