Extracting Academic Affiliations Status Report

13 Dec 2013

Alicia Tribble

Einat Minkov

Andy Schlaikjer

Laura Kieras


The Problem

Identify people who are affiliated with an academic institution
  Degrees earned
  Positions held (student, post-doc, faculty)
  Current position

Class of beliefs to be learned:
  affiliated(<person>,<degree>,<institution>)

The System / Algorithm

[Diagram. Labels: Html files; Search Engine Interface; Extract patterns; Extract relations; Patterns; Relations (facts); Assess patterns; Assess relations; Query pattern; Query relation; Query Generator]

Algorithm Details


Pattern query formulation
  Replace <arg> in pattern string with '*' operator
  Remove leading and trailing '*'s
  Wrap query string in quotes

Example:
  "<PERSON> received his <DEGREE> from <UNIVERSITY>"
  - becomes -
  '"received his * from"'
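The three steps above can be sketched as follows. This is a minimal Python illustration rather than the authors' implementation; the function name `pattern_to_query` is an assumption introduced here.

```python
import re

def pattern_to_query(pattern: str) -> str:
    """Turn a pattern string into a quoted wildcard search query (hypothetical helper)."""
    # Step 1: replace every <ARG> placeholder with the '*' operator.
    wildcarded = re.sub(r"<[^<>]+>", "*", pattern)
    # Step 2: remove leading and trailing '*'s (and the spaces around them).
    trimmed = wildcarded.strip(" *")
    # Step 3: wrap the query string in quotes.
    return '"' + trimmed + '"'

print(pattern_to_query("<PERSON> received his <DEGREE> from <UNIVERSITY>"))
# prints "received his * from"
```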

Algorithm Details


Relation Extraction (Slot filling)
  Find the relevant sentence(s) on a page
  Alignment
  Slot filling
  Some cleanup
    "he", capitalization

Examples:
  Robertson, Ph.D. in ecology and evolutionary biology, Indiana University
  Jeff, B.S., Bucknell University
  Rex Jung, degree, University of New Mexico
  Alavosius, BA in psychology, Clark University
  Jacobs, B.E.E. degree, Cornell University
  He, Associates Degree in Livestock Production, Northeast Community College
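One way to picture alignment and slot filling is to compile a pattern into a regular expression with one named group per slot and apply it to a candidate sentence. The sketch below is an assumption about how this could be done, not the system's actual code: `pattern_to_regex` and `fill_slots` are hypothetical names, and the final slot is simply cut at the next period, comma, or semicolon.

```python
import re

SLOT = re.compile(r"<([A-Z]+)>")

def pattern_to_regex(pattern):
    """Compile '<PERSON> received his <DEGREE> from <UNIVERSITY>' into a regex
    with one lazy named group per slot (hypothetical helper)."""
    parts, pos = [], 0
    for m in SLOT.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))   # literal text between slots
        parts.append("(?P<%s>.+?)" % m.group(1).lower())  # the slot itself
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    # Assumption: the last slot ends at sentence punctuation or end of text.
    return re.compile("".join(parts) + r"(?=[.,;]|$)")

def fill_slots(pattern, sentence):
    """Return a dict of slot values, or None if the pattern does not align."""
    match = pattern_to_regex(pattern).search(sentence)
    if match is None:
        return None
    # Minimal cleanup: trim surrounding whitespace from each slot value.
    return {name: value.strip() for name, value in match.groupdict().items()}

print(fill_slots("<PERSON> received his <DEGREE> from <UNIVERSITY>",
                 "Scott Fahlman received his Ph.D. from MIT."))
# {'person': 'Scott Fahlman', 'degree': 'Ph.D.', 'university': 'MIT'}
```

A sentence that begins with "He" would fill the person slot with the pronoun, which is the cleanup case noted on the slide.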


Algorithm Details


Relation query formulation
  All argument values become query terms

Example:
  (William Cohen, Ph.D., Rutgers)
  - becomes -
  'William Cohen Ph.D. Rutgers'
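As a small illustration of the step above, the following Python sketch joins the argument values into one query string. The function name `relation_to_query` and the optional `quote_args` flag are assumptions added here; the quoted form mirrors the query style shown on the later "Interim Conclusions" slides.

```python
def relation_to_query(relation, quote_args=False):
    """Build a search query from a (person, degree, institution) tuple.

    quote_args is a hypothetical option: when True, each argument is
    wrapped in double quotes so the engine treats it as an exact phrase.
    """
    if quote_args:
        return " ".join('"%s"' % arg for arg in relation)
    return " ".join(relation)

print(relation_to_query(("William Cohen", "Ph.D.", "Rutgers")))
# William Cohen Ph.D. Rutgers
```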

Algorithm Details


Pattern Extraction

Build a regex from a relation, one per argument
  (Mr\.|Mr|MR|M\.?+r\.?+|Dr\.?+|Mrs\.?+|MRS|Ms|MS)* ?+(Scott Fahlman|Scott|Fahlman)
  ([a-zA-Z]*? [dD]egree|[Dd]octoral [Dd]egree|PhD|Ph\.D\.|Doctorate|PHD)
  (MIT)

Apply regex to input and for every match, extract intermediate string and generalize
  <PERSON> received her <DEGREE> from <UNIVERSITY>
  <PERSON> received his <DEGREE> from <UNIVERSITY>
  <PERSON> earned a <DEGREE> from <UNIVERSITY>
  <PERSON>s, MD <UNIVERSITY>
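The extract-and-generalize step can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the system's code: each argument is matched by a plain alternation of known surface forms (the title prefixes in the regexes above are omitted), the helper names `alternation` and `extract_patterns` are hypothetical, and the `max_gap` limit on the intermediate strings is an invented parameter.

```python
import re

def alternation(variants):
    """Non-capturing alternation of literal variants, longest first."""
    ordered = sorted(variants, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(v) for v in ordered) + ")"

def extract_patterns(text, person_variants, degree_variants,
                     university_variants, max_gap=40):
    """Find person ... degree ... university in text and generalize each
    match into a pattern string with slot placeholders."""
    gap = r"(.{1,%d}?)" % max_gap  # the intermediate string between two arguments
    regex = re.compile(
        alternation(person_variants) + gap
        + alternation(degree_variants) + gap
        + alternation(university_variants)
    )
    patterns = []
    for match in regex.finditer(text):
        person_to_degree, degree_to_university = match.groups()
        patterns.append("<PERSON>" + person_to_degree
                        + "<DEGREE>" + degree_to_university + "<UNIVERSITY>")
    return patterns

print(extract_patterns(
    "Scott Fahlman received his Ph.D. from MIT.",
    ["Scott Fahlman", "Scott", "Fahlman"],
    ["Ph.D.", "PhD", "Doctorate"],
    ["MIT"],
))
# ['<PERSON> received his <DEGREE> from <UNIVERSITY>']
```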



Experimental Settings

Initial seeds
  Relations
    affiliated('William Cohen', 'Ph.D.', 'Duke University')
    affiliated('Tom Mitchell', 'Ph.D.', 'Stanford')
    affiliated('Scott Fahlman', 'Ph.D.', 'MIT')
  Patterns
    <PERSON> received his <DEGREE> from <UNIVERSITY>
    <PERSON> earned his <DEGREE> from <UNIVERSITY>
    <PERSON> earned a <DEGREE> from <UNIVERSITY>

Testing and development performed with 2 bootstrap iterations, using only Google snippets
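The bootstrap setting described here (seed relations and seed patterns, 2 iterations) can be pictured with the loop below. It is only a sketch: the slides do not spell out the loop, so the order of the two steps inside an iteration is an assumption, and `relations_for_pattern` and `patterns_for_relation` are hypothetical callables standing in for "query the search engine, then extract".

```python
# Seeds as listed on the slide.
SEED_RELATIONS = {
    ("William Cohen", "Ph.D.", "Duke University"),
    ("Tom Mitchell", "Ph.D.", "Stanford"),
    ("Scott Fahlman", "Ph.D.", "MIT"),
}
SEED_PATTERNS = {
    "<PERSON> received his <DEGREE> from <UNIVERSITY>",
    "<PERSON> earned his <DEGREE> from <UNIVERSITY>",
    "<PERSON> earned a <DEGREE> from <UNIVERSITY>",
}

def bootstrap(seed_relations, seed_patterns,
              relations_for_pattern, patterns_for_relation, iterations=2):
    """Alternate between pattern-driven relation extraction and
    relation-driven pattern extraction (assumed ordering)."""
    relations, patterns = set(seed_relations), set(seed_patterns)
    for _ in range(iterations):
        new_relations = set()
        for pattern in patterns:
            new_relations.update(relations_for_pattern(pattern))
        new_patterns = set()
        for relation in relations:
            new_patterns.update(patterns_for_relation(relation))
        relations |= new_relations
        patterns |= new_patterns
    return relations, patterns
```

Passing the search and extraction steps in as functions keeps the loop testable without issuing real queries.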

Results!

              patterns    relations
initial       3           3
iteration 0   6 (+3)      13 (+3)
iteration 1   14 (+9)     0
total         23          16




Interim Conclusions


Issue I: over-specificity of query arguments

  Q: "Oren Etzioni" "Ph.D" "CMU"

  But what if the actual relevant mention includes:

  A: "Oren Etzioni", "doctorate" "Carnegie Mellon University"?

Possible avenues:
  Larger dictionaries
  Unquote query arguments? (allow for some variation)
  Allow argument values to include arbitrary intervening terms, e.g. "Oren * Etzioni"

These options might introduce more noise and require additional queries to be issued per relation.
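To make the trade-off concrete, the Python sketch below generates the strict quoted query plus the two relaxed variants listed above (unquoted arguments, and a '*' between the tokens of each argument). The function name `query_variants` is hypothetical, and issuing all three per relation is exactly the extra query cost the slide warns about.

```python
def query_variants(relation):
    """Return [quoted, unquoted, wildcard] query strings for one relation."""
    quoted = " ".join('"%s"' % arg for arg in relation)
    unquoted = " ".join(relation)
    # Insert '*' between the tokens of each argument: "Oren * Etzioni".
    wildcard = " ".join('"%s"' % " * ".join(arg.split()) for arg in relation)
    return [quoted, unquoted, wildcard]

for query in query_variants(("Oren Etzioni", "Ph.D", "CMU")):
    print(query)
# "Oren Etzioni" "Ph.D" "CMU"
# Oren Etzioni Ph.D CMU
# "Oren * Etzioni" "Ph.D" "CMU"
```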





Interim Conclusions


Issue II: name and pronoun resolution

  Q: "Oren Etzioni" "Ph.D" "CMU"

  But what if the actual relevant mention includes:

  A: "He received his Ph.D from CMU in..."

Rate of occurrence of "S/he..." in extracted relations
  1 pattern, 50 queries: 56.8% (96/169)

Possible avenues:
  Identify homepages and extract names from titles, or other unambiguous sources on the page
  Simple pronoun resolution techniques? (For example, identify the immediately preceding name mention. This may require NER.)
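A very simple version of the "immediately preceding name mention" idea might look like the Python sketch below. It is a stand-in under loud assumptions: instead of NER it uses a naive regex for runs of two or more capitalized words, so it will also pick up non-person phrases such as "Carnegie Mellon", and the names `resolve_person`, `NAME` and `PRONOUNS` are hypothetical.

```python
import re

# Naive stand-in for NER: two or more consecutive capitalized words.
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")
PRONOUNS = {"he", "she"}

def resolve_person(person, text_before):
    """If the extracted person slot is a pronoun, replace it with the most
    recent name-like mention in the text preceding the sentence."""
    if person.strip().lower() not in PRONOUNS:
        return person
    mentions = NAME.findall(text_before)
    return mentions[-1] if mentions else None

print(resolve_person("He", "Oren Etzioni is a professor of computer science. "))
# Oren Etzioni
```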





Interim Conclusions


Issue III: compound sentences

  Q: "Oren Etzioni" "Ph.D" "CMU"

  But what if the actual relevant mention includes:

  A: "Oren Etzioni received his MS from <UNIVERSITY>, and his Ph.D from CMU"

Possible avenues:
  Extensions to the pattern extraction technique
  May require dependency parsing





Software / Resources


A generic search framework which allows asynchronous processing of search tasks, as well as "filter" tasks (processing of resulting URLs)

A URL caching implementation of Java 1.5's java.net.ResponseCache using Hibernate, supporting centralized caching and remote access
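The search-task / filter-task split can be illustrated with a short Python sketch. The original framework is written in Java and its API is not shown on the slides, so everything here is an assumption: `run_search`, `search` and `url_filter` are hypothetical names, and a thread pool stands in for whatever asynchronous mechanism the framework uses.

```python
from concurrent.futures import ThreadPoolExecutor

def run_search(query, search, url_filter, max_workers=8):
    """Run one search task, then process every resulting URL with a
    filter task in parallel, and collect all extractions in URL order.

    search(query) -> list of URLs          (hypothetical callable)
    url_filter(url) -> list of extractions (hypothetical callable)
    """
    urls = search(query)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input URLs.
        per_url = list(pool.map(url_filter, urls))
    return [item for items in per_url for item in items]
```

In a real run, `search` would call the search engine interface and `url_filter` would fetch a page (ideally through the URL cache) and apply the extraction step.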


Generic Search Framework

[Diagram. Labels: Search; URL; Extraction; Result; Search Tasks; Filter Tasks; SearchProcessor; Extraction Search; Extraction Filter]

Test run: 1 Search, 50 URLs, 169 Extractions, 15 seconds

Search Framework System Flow

[Diagram. Labels: Relation; Pattern; SearchProcessor; Validate]

Extensions


Dictionaries (next slide)
Simple pronoun resolution
Extraction validation metrics
  URL of professor's personal home page
  Clustering of people / universities, or normalization of names
  Identify biography section of personal home pages
  Links incoming and outgoing from personal home page

Additional information


Dictionary of institution names
Tiny dictionary of degrees
  E.g. Ph.D., B.S., B. Tech., etc.
Map of domain names to institution names
  E.g. cmu.edu : Carnegie Mellon University
  This could be learned but we will leave that for another group!
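A domain-to-institution map like the one above could be applied to a page URL as in the following Python sketch. The dictionary contents are just the examples from the slide, and the lookup function `institution_for_url` is a hypothetical helper, not part of the described system.

```python
from urllib.parse import urlparse

# Tiny dictionaries seeded with the examples from the slide.
DEGREES = {"Ph.D.", "B.S.", "B. Tech."}
DOMAIN_TO_INSTITUTION = {"cmu.edu": "Carnegie Mellon University"}

def institution_for_url(url):
    """Map a page URL to an institution by trying each suffix of its host,
    so that www.cs.cmu.edu resolves through the cmu.edu entry."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in DOMAIN_TO_INSTITUTION:
            return DOMAIN_TO_INSTITUTION[candidate]
    return None

print(institution_for_url("http://www.cs.cmu.edu/~wcohen/"))
# Carnegie Mellon University
```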


Example extracted relations

