What is Cognitive Science?

Oct 18, 2013

Josh Tenenbaum

MLSS 2010

Psychology/CogSci and machine learning: a long-term relationship


Unsupervised learning


Factor analysis


Multidimensional scaling


Mixture models (finite and infinite) for classification


Spectral clustering


Topic modeling by factorizing document-word count matrices


“Collaborative filtering” with low-rank factorizations


Nonlinear manifold learning with graph-based approximations


Supervised learning


Perceptrons


Multi-layer perceptrons (“backpropagation”)


Kernel-based classification


Bayesian concept learning


Reinforcement learning


Temporal difference learning

“Hebb rule”

A success story in the 1980s-1990s:

The “standard model of learning”

“Long term potentiation”

A success story in the 1980s-1990s:

The “standard model of learning”

Outline


The big problems of cognitive science.


How machine learning can help.


A brief introduction to cognition viewed
through the lens of statistical inference and
learning.


The big question

How does the mind get so much out of so
little?




Our minds build rich models of the world and make strong
generalizations from input data that is sparse, noisy, and
ambiguous: in many ways far too limited to support the
inferences we make.


How do we do it?

Visual perception

(Marr)


Goal of visual
perception is to recover
world structure from
visual images.




Why the problem is
hard: many world
structures can produce
the same visual input.



Illusions reveal the
visual system’s implicit
knowledge of the
physical world and the
processes of image
formation.

Ambiguity in visual perception

(Shepard)

Learning-based machine vision: state of the art

(Choi, Lim, Torralba, Willsky)

Input

Output

Learning concepts from examples

“tufa”

“tufa”

“tufa”

Humans and bumble bees

“According to the theory of aerodynamics, a
bumble bee can’t fly.”


According to statistical learning theory, a
person can’t learn a concept from just one
or a few positive examples…



Causal inference

Does this drug help you get over a cold faster?

[Figure: 2×2 contingency table with counts 1, 5, 6, 2, crossing “Didn’t take drug” / “Took drug” with two cold-duration outcomes (“cold … 1 week”).]

Causal inference

How does a child learn not to touch a hot stove? (cf. Hume)

[Figure: 2×2 contingency table with counts 1, 5, 6, 2, crossing “Didn’t touch stove” / “Touched stove” with “Got burned” / “Didn’t get burned”.]

What happens if I press this button over here on the wall …?

Language


Parsing:


The girl saw the boy with the telescope.


Two cars were reported stolen by the Groveton police
yesterday.


The judge sentenced the killer to die in the electric chair
for the second time.


No one was injured in the blast, which was attributed to
a buildup of gas by one town official.


One witness told the commissioners that she had seen
sexual intercourse taking place between two parked
cars in front of her house.

(Pinker)

Language

Ervey tihs si yuo enve msipleeld thugho wrdo cna stennece reda.

Language


Parsing


Acquisition:


Learning verb forms


English past tense: rule vs. exceptions


Spanish or Arabic past tense: multiple rules plus
exceptions


Learning verb argument structure


e.g., “give” vs. “donate”, “fill” vs. “load”


Learning to be bilingual

Theory construction in science

Intuitive theories


Physics


Parsing: Inferring support relations, or the causal
history and properties of an object.


Acquisition: Learning about gravity and support.


Gravity -- what’s that?


Contact is sufficient


Mass distribution and location are important


A different intuitive theory…

Two Demos.




“If you have a mate, and there is a rival, go and peck that rival…”

Intuitive theories


Psychology


Parsing: Inferring beliefs, desires, plans.


Acquisition: Learning about agents.


Recognizing intentionality, but without mental state reasoning


Reasoning about beliefs and desires


Reasoning about plans, rationality and “other minds”.

Outline


The big problems of cognitive science.


How machine learning can help.


A brief introduction to cognition viewed
through the lens of statistical inference and
learning.


The big questions

1. How does knowledge guide inductive learning,
inference, and decision-making from sparse, noisy or
ambiguous data?

2. What are the forms and contents of our knowledge of
the world?

3. How is that knowledge itself learned from experience?

4. How do we balance constraint and flexibility,
assimilating new data to our current model versus
accommodating our model to the new data?

5. How can accurate inductive inferences be made
efficiently, even in the presence of complex
hypothesis spaces?

Machine learning provides a toolkit for
answering these questions

1. Bayesian inference in probabilistic generative models

2. Probabilities defined over structured representations:
graphs, grammars, predicate logic, programs

3. Hierarchical probabilistic models, with inference at all
levels of abstraction

4. Adaptive nonparametric or “infinite” models, which
can grow in complexity or change form in response to
the observed data.

5. Approximate methods of learning and inference, e.g.,
Markov chain Monte Carlo (MCMC), importance
sampling, and sequential importance sampling
(particle filtering).

Basics of Bayesian inference


Bayes’ rule:

$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h'} P(d \mid h')\,P(h')}$$

An example

Data: John is coughing

Some hypotheses:

1. John has a cold
2. John has lung cancer
3. John has a stomach flu

Likelihood $P(d \mid h)$ favors 1 and 2 over 3

Prior probability $P(h)$ favors 1 and 3 over 2

Posterior probability $P(h \mid d)$ favors 1 over 2 and 3
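
The coughing example can be made concrete with a few lines of Python. This is only a sketch: the talk gives the qualitative ordering of priors and likelihoods, not numbers, so every value below is a made-up assumption chosen to respect that ordering, and the function name `posterior` is illustrative.

```python
def posterior(prior, likelihood):
    """Return P(h | d) for each hypothesis h via Bayes' rule."""
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(d), the normalizing constant
    return {h: p / evidence for h, p in unnormalized.items()}

# Hypothetical numbers (assumptions, not from the talk).
# Prior favors "cold" and "stomach flu" over "lung cancer".
prior = {"cold": 0.50, "lung cancer": 0.01, "stomach flu": 0.49}
# Likelihood of coughing favors "cold" and "lung cancer" over "stomach flu".
likelihood = {"cold": 0.80, "lung cancer": 0.80, "stomach flu": 0.10}

post = posterior(prior, likelihood)
for h, p in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{h}: {p:.3f}")
```

With these assumed values, only the cold hypothesis scores well on both prior and likelihood, so it dominates the posterior.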

Grammar $G$
    $P(S \mid G)$
Phrase structure $S$
    $P(U \mid S)$
Utterance $U$

$$P(S \mid U, G) \propto \underbrace{P(U \mid S)}_{\text{bottom-up}} \times \underbrace{P(S \mid G)}_{\text{top-down}}$$

“Universal Grammar”
Hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
    P(grammar | UG)
Grammar
    P(phrase structure | grammar)
Phrase structure
    P(utterance | phrase structure)
Utterance
    P(speech | utterance)
Speech signal

“Universal Grammar”
Compositional scene grammars (e.g., attribute graph grammar, AND/OR grammar) (Han & Zhu, 2006)
    P(grammar | UG)
Grammar
    P(parsing graph | grammar)
Parsing graph
    P(surfaces | parsing graph)
Surfaces
    P(image | surfaces)
Image

Principles

Structure

Data

Whole-object principle

Shape bias

Taxonomic principle

Contrast principle

Basic-level bias

Learning word meanings

Causal learning and reasoning

Principles

Structure

Data

Goal-directed action
(production and comprehension)

(Wolpert et al., 2003)

Marr’s levels

Computational

Algorithmic

Neural

Importance sampling

Markov chain Monte Carlo (MCMC)

Bayes meets Marr: the Sampling
Hypothesis

Particle filtering



t=150 ms

Outline


The big problems of cognitive science.


How machine learning can help.


A very brief introduction to cognition
viewed through the lens of statistical
inference and learning.


Five big ideas


Understanding human cognition as Bayesian inference over
probabilistic generative models of the world.


Building probabilistic models defined over structured
knowledge representations, such as graphs, grammars,
predicate logic, functional programs.


Explaining the origins of knowledge by learning in
hierarchical probabilistic models, with inference at multiple
levels of abstraction.


Balancing constraint with flexibility, via adaptive
representations and nonparametric (“infinite”) models that
grow in complexity or change form in response to the data.


Tractable methods for approximate learning and inference
that can react to new data in real time and scale up to large
problems (e.g., Markov chain Monte Carlo, Sequential MC).

Cognition as probabilistic inference

Visual perception
[Weiss, Simoncelli, Adelson, Richards, Freeman, Feldman,
Kersten, Knill, Maloney, Olshausen, Jacobs, Pouget, ...]

Language acquisition and processing
[Brent, de Marcken, Niyogi, Klein,
Manning, Jurafsky, Keller, Levy, Hale, Johnson, Griffiths, Perfors, Tenenbaum, …]

Motor learning and motor control
[Ghahramani, Jordan, Wolpert, Kording,
Kawato, Doya, Todorov, Shadmehr, …]

Associative learning
[Dayan, Daw, Kakade, Courville, Touretzky, Kruschke, …]

Memory
[Anderson, Schooler, Shiffrin, Steyvers, Griffiths, McClelland, …]

Attention
[Mozer, Huber, Torralba, Oliva, Geisler, Yu, Itti, Baldi, …]

Categorization and concept learning
[Anderson, Nosofsky, Rehder, Navarro,
Griffiths, Feldman, Tenenbaum, Rosseel, Goodman, Kemp, Mansinghka, …]

Reasoning
[Chater, Oaksford, Sloman, McKenzie, Heit, Tenenbaum, Kemp, …]

Causal inference
[Waldmann, Sloman, Steyvers, Griffiths, Tenenbaum, Yuille, …]

Decision making and theory of mind
[Lee, Stankiewicz, Rao, Baker,
Goodman, Tenenbaum, …]

Bayesian inference in perceptual and
motor systems

Weiss, Simoncelli & Adelson (2002)

Kording & Wolpert (2004)

Bayesian ideal observers using
natural scene statistics

Wainwright, Schwartz & Simoncelli (2002)

Does this approach extend to cognition?

Modeling basic cognitive capacities as
intuitive Bayesian statistics


Similarity (Tenenbaum & Griffiths, BBS 2001; Kemp & Tenenbaum, Cog Sci 2005)

Representativeness and evidential support (Tenenbaum & Griffiths, Cog Sci 2001)

Causal judgment (Steyvers et al., 2003; Griffiths & Tenenbaum, Cog. Psych. 2005)

Coincidences and causal discovery (Griffiths & Tenenbaum, Cog Sci 2001; Cognition 2007; Psych. Review, in press)

Diagnostic inference (Krynski & Tenenbaum, JEP: General 2007)

Predicting the future (Griffiths & Tenenbaum, Psych. Science 2006)

Coin flipping

Which sequence is more likely to be produced
by flipping a fair coin?



HHTHT




HHHHH
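
As a check on the question above: under a fair coin, every specific sequence of five flips has the same probability. A minimal Python sketch, assuming independent flips (the function name `sequence_probability` is illustrative, not from the talk):

```python
from fractions import Fraction

def sequence_probability(seq, p_heads=Fraction(1, 2)):
    """Probability of one exact sequence of independent flips, e.g. "HHTHT"."""
    prob = Fraction(1)
    for flip in seq:
        prob *= p_heads if flip == "H" else 1 - p_heads
    return prob

for seq in ("HHTHT", "HHHHH"):
    print(seq, sequence_probability(seq))  # both print 1/32
```

The two sequences are equally likely under a fair coin, so the intuition that one looks "more random" cannot come from the likelihood under that hypothesis alone.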

Predict a random sequence of coin flips: Mathcamp 2001, 2003

Mathcamp 2001, 2003 data: collapsed over parity

Zenith radio data (1930’s): collapsed over parity

Coin flipping

Why do some sequences appear much more
likely to be produced by flipping a fair
coin?



HHTHT




HHHHH

“We can introspect about the outputs
of cognition, not the processes or the
intermediate representations of the
computations.”

Predictive versus inductive reasoning

Prediction: given hypothesis $H$, predict data $D$. Likelihood: $P(D \mid H)$

Induction: given data $D$, infer hypothesis $H$. Likelihood ratio: $P(D \mid H_1) / P(D \mid H_2)$

Comparing two hypotheses

Different patterns of observed data: $D$ = HHTHT or HHHHH

Contrast simple hypotheses:

$H_1$: “fair coin”, $P(\text{heads}) = 0.5$

$H_2$: “always heads”, $P(\text{heads}) = 1.0$

Bayes’ rule in odds form:

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

Comparing two hypotheses

$D$: HHTHT

$H_1$, $H_2$: “fair coin”, “always heads”

$P(D \mid H_1) = 1/2^5$, $P(H_1) = ?$

$P(D \mid H_2) = 0$, $P(H_2) = 1 - ?$



Comparing two hypotheses

$D$: HHTHT

$H_1$, $H_2$: “fair coin”, “always heads”

$P(D \mid H_1) = 1/2^5$, $P(H_1) = e$

$P(D \mid H_2) = 0$, $P(H_2) = 1 - e$



Comparing two hypotheses

$D$: HHHHH

$H_1$, $H_2$: “fair coin”, “always heads”

$P(D \mid H_1) = 1/2^5$, $P(H_1) = e$

$P(D \mid H_2) = 1$, $P(H_2) = 1 - e$



Comparing two hypotheses

$D$: HHHHH

$H_1$, $H_2$: “fair coin”, “always heads”

$P(D \mid H_1) = 1/2^5$, $P(H_1) = 999/1000$

$P(D \mid H_2) = 1$, $P(H_2) = 1/1000$



Comparing two hypotheses

$D$: HHHHHHHHHH

$H_1$, $H_2$: “fair coin”, “always heads”

$P(D \mid H_1) = 1/2^{10}$, $P(H_1) = 999/1000$

$P(D \mid H_2) = 1$, $P(H_2) = 1/1000$
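
Plugging the numbers from the last two cases into Bayes’ rule in odds form gives the posterior odds directly. The sketch below uses the prior 999/1000 for the fair coin stated above; the function name `posterior_odds` is illustrative.

```python
from fractions import Fraction

def posterior_odds(n_heads, prior_fair=Fraction(999, 1000)):
    """Posterior odds of "fair coin" vs. "always heads" after n_heads heads in a row."""
    lik_fair = Fraction(1, 2) ** n_heads   # P(D | fair coin)
    lik_trick = Fraction(1)                # P(D | always heads)
    prior_odds = prior_fair / (1 - prior_fair)
    return (lik_fair / lik_trick) * prior_odds

print(posterior_odds(5))   # 999/32, about 31 to 1 for the fair coin
print(posterior_odds(10))  # 999/1024, roughly even odds
```

Five heads leave the fair coin strongly favored, while ten heads bring the two hypotheses to roughly even odds, which is about where suspicion begins.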




Measuring prior knowledge

1. The fact that HHHHH looks like a “mere coincidence”,
without making us suspicious that the coin is unfair, while
HHHHHHHHHH does begin to make us suspicious, measures
the strength of our prior belief that the coin is fair.

If $q$ is the threshold for suspicion in the posterior odds, and $D^*$ is
the shortest suspicious sequence, the prior odds for a fair coin are
roughly $q / P(D^* \mid \text{“fair coin”})$.

If $q \approx 1$ and $D^*$ is between 10 and 20 heads, the prior odds in favor of
a fair coin are roughly between 1,000 and 1,000,000; equivalently, the prior odds
of a trick coin are roughly between 1/1,000 and 1/1,000,000.
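
The arithmetic behind this estimate can be checked in a few lines. This is a sketch under the assumptions stated above (an “always heads” alternative with likelihood 1, and a suspicion threshold $q$ of about 1); the function name `implied_prior_odds_fair` is illustrative.

```python
from fractions import Fraction

def implied_prior_odds_fair(n_heads, q=Fraction(1)):
    """Prior odds for a fair coin implied by becoming suspicious after n_heads heads.

    Computes q / P(D* | fair coin), where D* is n_heads heads in a row.
    """
    lik_fair = Fraction(1, 2) ** n_heads
    return q / lik_fair

print(implied_prior_odds_fair(10))  # 1024
print(implied_prior_odds_fair(20))  # 1048576
```

The results, about 1,000 and about 1,000,000 in favor of the fair coin, correspond to odds of about 1/1,000 and 1/1,000,000 for the trick coin.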

2. The fact that HHTHT looks representative of a fair coin, and
HHHHH does not, reflects our prior knowledge: intuitive
theories about possible causal mechanisms in the world.

Easy to imagine how a trick all-heads coin could work: low (but
not negligible) prior probability.

Hard to imagine how a trick “HHTHT” coin could work: extremely
low (negligible) prior probability.


You read about a movie that has made $60 million to date.
How much money will it make in total?


You see that something has been baking in the oven for 34
minutes. How long until it’s ready?


You meet someone who is 78 years old. How long will they
live?


Your friend quotes to you from line 17 of his favorite poem.
How long is the poem?


You meet a US congressman who has served for 11 years.
How long will he serve in total?


You encounter a phenomenon or event with an unknown
extent or duration, $t_{\text{total}}$, at a random time or value of
$t < t_{\text{total}}$.

What is the total extent or duration $t_{\text{total}}$?

Everyday prediction problems

(Griffiths & Tenenbaum, Psych. Science 2006)

Priors $P(t_{\text{total}})$ based on empirically measured durations or magnitudes
for many real-world events in each class:

Median human judgments of the total duration or magnitude $t_{\text{total}}$ of
events in each class, given one random observation at a duration or
magnitude $t$, versus Bayesian predictions (median of $P(t_{\text{total}} \mid t)$).
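
The Bayesian prediction (the posterior median of $t_{\text{total}}$ given $t$) can be sketched on a discrete grid. Two assumptions are made here that the slides do not spell out: the observation $t$ is taken to be uniformly distributed between 0 and $t_{\text{total}}$, so that $P(t \mid t_{\text{total}}) = 1/t_{\text{total}}$, and the prior in the usage example is a flat placeholder, not one of the empirically measured priors. The function name `predict_total` is illustrative.

```python
def predict_total(t, prior):
    """Posterior median of t_total given one observation t.

    `prior` maps each candidate t_total to its prior probability P(t_total).
    """
    # Assumed likelihood: t is uniform on (0, t_total), so P(t | t_total) = 1 / t_total.
    posterior = {tt: p / tt for tt, p in prior.items() if tt > t}
    total = sum(posterior.values())
    cumulative = 0.0
    for tt in sorted(posterior):
        cumulative += posterior[tt]
        if cumulative >= total / 2:
            return tt  # smallest candidate holding at least half the posterior mass
    raise ValueError("no candidate t_total exceeds t")

# Placeholder prior: flat over 1..100 (an assumption for illustration only).
flat_prior = {tt: 1 / 100 for tt in range(1, 101)}
prediction = predict_total(50, flat_prior)
print(prediction)
```

Swapping in a prior with a different shape for each class of event changes the prediction, and that dependence on the prior is what the comparison with human judgments probes.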

Learning words for objects

“tufa”

“tufa”

“tufa”

What is the right prior?

What is the right hypothesis space?

How do learners acquire that background knowledge?