Open Source Text Mining

1

Open Source Text Mining

Text Mining 2003 @ SDM03

Cathedral Hill Hotel, San Francisco


Hinrich Schütze, Enkata

May 3, 2003

2

Motivation


Open source used to be a crackpot idea.


Bill Gates on Linux (1999.03.24): “I really don't think in the commercial market, we'll see it in any significant way.”


MS 10-Q quarterly filing (2003.01.31): “The popularization of the open source movement continues to pose a significant challenge to the company's business model.”


Open source is an enabler for radical new things


Google


Ultra-cheap web servers


Free news


Free email


Free …


Class projects


Walmart PC for $200

3

GNU/Linux

4

Web Servers:

Open Source Dominates

Source: Netcraft

5

Motivation (cont.)


Text mining has not had much impact.


Many small companies & small projects


No large-scale adoption


Exception: text-mining-enhanced search


Text mining could transform the world.


Unstructured → structured


Information explosion


Amount of information has exploded


Amount of accessible information
has not


Can open source text mining make this
happen?

6

Unstructured vs Structured
Data

Prabhakar Raghavan, Verity

7

Business Motivation


High cost of deploying text mining solutions


How can we lower this cost?


100% proprietary solutions


Require re-invention of core infrastructure


Leave fewer resources for high-value applications built on top of core infrastructure

8

Definitions


Open source


Public domain, BSD, GPL (GNU General Public License)


Text mining


Like data mining but for text


NLP (Natural Language Processing)
subdiscipline


Has interesting applications now


More than just information retrieval /
keyword search


Usually: some statistical, probabilistic or frequentist component


9

Text Mining vs. NLP


(Natural Language Processing)


What is not text mining: speech, language
models, parsing, machine translation


Typical text mining: clustering, information
extraction, question answering


Statistical and high volume

10

Text Mining: History


80s: Electronic text gives birth to Statistical
Natural Language Processing (StatNLP).


90s: DARPA sponsors Message
Understanding Conferences (MUC) and
Information Extraction (IE) community.


Mid-90s: Data Mining becomes a discipline and usurps much of IE and StatNLP as “text mining”.

11

Text Mining: Hearst’s Definition


Finding nuggets


Information extraction


Question answering


Finding patterns


Clustering


Knowledge discovery


Text visualization

12

foodscience.com-Job2


JobTitle:

Ice Cream Guru


Employer:

foodscience.com


JobCategory:

Travel/Hospitality


JobFunction:

Food Services


JobLocation:

Upper Midwest

Contact Phone:

800-488-2611


DateExtracted:

January 8, 2001


Source:

www.foodscience.com/jobs_midwest.html


OtherCompanyJobs:

foodscience.com-Job1

Information Extraction
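A record like the one above is the output of template filling. A minimal sketch of that step, where the posting text and the regex patterns are purely illustrative (nothing like the actual foodscience.com page or a production extractor):

```python
# Toy template-filling extractor; text and patterns are illustrative only.
import re

posting = """JobTitle: Ice Cream Guru
Employer: foodscience.com
Contact Phone: 800-488-2611"""

FIELDS = {
    "JobTitle": r"JobTitle:\s*(.+)",
    "Employer": r"Employer:\s*(.+)",
    "ContactPhone": r"Contact Phone:\s*([\d-]+)",
}

# Fill the template: one regex per slot, first match wins.
record = {name: m.group(1).strip()
          for name, pat in FIELDS.items()
          if (m := re.search(pat, posting))}
print(record)
```

Real IE systems work from raw pages with learned patterns; the point here is only the slot-filling shape of the output.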

13

Knowledge Discovery:
Arrowsmith


Goal: Connect two disconnected subfields of
medicine.


Technique


Start with 1st subfield


Identify key concepts


Search for 2nd subfield with same concepts


Implemented in Arrowsmith system


Discovery: magnesium is potential treatment
for migraine
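The technique can be sketched as a term-overlap search between the two literatures; the documents and stopword list below are toy stand-ins, not MEDLINE:

```python
# Swanson/Arrowsmith-style discovery, minimal sketch: find concepts shared
# by two disconnected literatures. All documents here are toy examples.

def key_concepts(docs, stopwords):
    """Content terms occurring in a subfield's literature."""
    terms = set()
    for doc in docs:
        terms.update(w for w in doc.lower().split() if w not in stopwords)
    return terms

migraine_docs = [
    "Migraine patients show spreading cortical depression",
    "Serotonin and vascular reactivity in migraine",
]
magnesium_docs = [
    "Magnesium deficiency causes spreading cortical depression",
    "Magnesium modulates vascular reactivity",
]

stop = {"and", "in", "show", "causes", "patients"}
# Bridge concepts suggest an indirect magnesium-migraine connection.
bridge = key_concepts(migraine_docs, stop) & key_concepts(magnesium_docs, stop)
print(sorted(bridge))
```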


14

Knowledge Discovery:
Arrowsmith

15

When is Open Source
Successful?


“Important” problem


Many users (operating system)


Fun to work on (games)


Public funding available (OpenBSD, security)


Open source author gains fame/satisfaction/immortality/community


Adaptation


A little adaptation is easy


Most users do not need any adaptation (out of the box use)


Incremental releases are useful


Cost sharing without administrative/legal overhead


Dozens of companies with significant interest in Linux (IBM, …)


Many of these companies contribute to open source


This is in effect an informal consortium


A formal effort probably would have killed Linux.


Same applies to text mining?


Also: bugs, security, high-availability; ideal for consulting & hardware companies like IBM

16

When is Open Source Not
Successful?


Boring & rare problem


Print driver for 10 year old printer


Complex integrated solutions


QuarkXPress


ERP systems


Good UI experience for non-geeks


Apple


Microsoft Windows


(at least for now)





17

Text Mining and Open Source


Pro


Important problem: fame, satisfaction,
immortality, community can be gained


Pooling of resources / critical mass


Con


Non-incremental?


Most text mining requires significant
adaptation.


Most text mining requires data resources as well as source code.


The need for data resources does not fit well
into the open source paradigm.


18

Text Mining Open Source Today


Lucene


Excellent for information retrieval, but not
much text mining.


Rain/bow, Weka, GTP, TDMAPI


Text mining algorithms / infrastructure, no
data resources


NLTK


NLP toolkit, some data resources


WordNet, DMOZ


Excellent data resources, but not enough
breadth/depth.

19

Open Source with Open Data


Spell checkers (e.g., emacs)


Antispam software (e.g., spamassassin)


Named entity recognition (Gate/Annie)


Free version less powerful than in-house


20

SpamAssassin: Code + Data

21

Open Data Resources:
Examples


SpamAssassin


Classification model for spam


Named entity recognition


Word lists, dictionaries


Information extraction


Domain model, taxonomies, regular
expressions


Shallow parsing


Grammars


22

Code vs Data

                          Proprietary code            Open-source code
No resources needed       Complex & integrated SW,    Linux,
                          good UI design              web servers
Significant resources     Text classification,        Spam filtering,
needed                    N. entity recognition,      spell checkers
                          information extraction (?)

23

Open Source with Data: Key Issues


Can data resources be recycled?


Problems have to be similar.


More difficult than one would expect: my first attempt failed (MEDLINE/Reuters).


Next: case study


Assume there is a large library of data resources
available.


How do we identify the data resources that can be
recycled?


How do we adapt them?


How do we get from here to there?


Need incremental approach that is sustained by
successes along the way.

24

Text Mining without Data
Resources


Premise: “Knowledge-poor” text mining taps only a small part of the potential of text mining.


Knowledge-poor text mining examples


Clustering


Phrase extraction


First story detection


Many success stories

25

Case Study: ODP → Reuters

Train on ODP, apply to Reuters

26

Case Study: Text Classification


Key Issues for text classification


Show that text classifiers can be recycled


How can we select reusable classifiers for a
particular task?


How do we adapt them?


Case Study


Train classifiers on open directory (ODP)


165,000 docs (nodes), crawled in 2000, 505 classes


Apply classifiers to Reuters RCV1


780,000 docs, >1000 classes


Hypothesis: A library of classifiers based on
ODP can be recycled for RCV1.



27

Experimental Setup


Train 505 classifiers on ODP


Apply them to Reuters


Compute χ² for all ODP × Reuters pairs


Evaluate the n pairs with the best χ²


Evaluation Measures


Area under ROC curve


Plot false positive rate vs true positive rate


Compute area under the curve


Average precision


Rank documents, compute precision for each rank


Average for all positive documents


Estimated based on 25% sample
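The pairing step above can be sketched as a χ² score over the 2×2 contingency table of classifier decisions vs. Reuters labels; the label vectors below are toy data, not ODP or RCV1 output:

```python
# Score one (ODP classifier, Reuters class) pair by the chi-square
# statistic of their 2x2 contingency table. Toy labels, not real RCV1 data.

def chi_square(pred, gold):
    n = len(pred)
    a = sum(1 for p, g in zip(pred, gold) if p and g)        # both positive
    b = sum(1 for p, g in zip(pred, gold) if p and not g)    # pred only
    c = sum(1 for p, g in zip(pred, gold) if g and not p)    # gold only
    d = n - a - b - c                                        # both negative
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

pred = [1, 1, 0, 0, 1, 0, 0, 0]   # ODP classifier decisions on Reuters docs
gold = [1, 1, 0, 0, 0, 0, 1, 0]   # Reuters topic labels
print(chi_square(pred, gold))     # larger = stronger candidate pair
```

The highest-scoring pairs would then be evaluated by ROC area and average precision as described above.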

28

Japan: ODP → Reuters

29

Some Results

30

BusIndTraMar0 / I76300: Ports

31

Discussion


Promising results


These are results without any adaptation.


Performance expected to be much better
after adaptation.

32

Discussion (cont)


Class relationships are m:n, not 1:1


Reuters: GSPO


SpoBasCol0


SpoBasMinLea0


SpoBasReg0


SpoHocIceLeaNatPla0


SpoHocIceLeaPro0


ODP: RegEurUniBusInd0 (UK industries)


I13000 (petroleum & natural gas)


I17000 (water supply)


I32000 (mechanical engineering)


I66100 (restaurants, cafes, fast food)


I79020 (telecommunications)


I9741105 (radio broadcasting)

33

Why Recycling Classifiers is
Difficult


Autonomous vs relative decisions


ODP Japan classifier w/o modifications has
high precision, but only 1% recall on RCV1!


Most classifiers are tuned for optimal
performance in embedded system.


Tuning decreases robustness in recycling.


Tokenization, document length, numbers


Numbers throw off MEDLINE vs. non-MEDLINE categorizer (financial classified as medical)


Length-sensitive multinomial Naïve Bayes: nonsensical results
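The length problem can be seen directly: a multinomial Naïve Bayes score is a sum of per-token log-odds, so it scales with document length, and a threshold tuned on one collection is meaningless on another. A minimal sketch with invented per-word weights:

```python
# Length sensitivity of multinomial Naive Bayes: the log-odds score grows
# linearly with document length. Weights below are invented for illustration.

log_odds = {"tokyo": 2.0, "yen": 1.5, "the": -0.1}  # toy per-word log-odds

def nb_score(doc):
    """Sum of per-token log-odds, as in multinomial NB."""
    return sum(log_odds.get(w, 0.0) for w in doc)

short_doc = ["tokyo", "yen"]
long_doc = short_doc * 10                # same content, 10x the length
print(nb_score(short_doc), nb_score(long_doc))   # 3.5 vs. 35.0
```

Normalizing by length (e.g., thresholding the per-token average) is one way to make such a classifier more robust when it is recycled.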

34

Specifics


What would an open source text classification
package look like?


Code


Text mining algorithms


Customization component


To adapt recycled data resources


Creation component


To create new data resources


Data


Recycled data resources


Newly created data resources


Pick a good area


Bioinformatics: genes / proteins


Product catalogs
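One way to picture the split above: every name in this skeleton is hypothetical, sketching only the separation into a creation component, a customization component, and shareable data resources:

```python
# Hypothetical skeleton of an open source text classification package;
# none of these names refer to an existing project.
from dataclasses import dataclass, field

@dataclass
class DataResource:
    """A shareable data resource: model parameters, word lists, taxonomies."""
    name: str
    origin: str                       # "recycled" or "newly created"
    payload: dict = field(default_factory=dict)

class TextMiningPackage:
    def __init__(self):
        self.resources = []           # the data half of the package

    def create(self, name, payload):
        """Creation component: build a new data resource from scratch."""
        self.resources.append(DataResource(name, "newly created", payload))

    def adapt(self, resource, overrides):
        """Customization component: tune a recycled resource to a new task."""
        resource.payload.update(overrides)
        return resource

pkg = TextMiningPackage()
pkg.create("odp-japan-classifier", {"threshold": 0.5})
pkg.adapt(pkg.resources[0], {"threshold": 0.2})  # e.g., lower threshold for recall
print(pkg.resources[0].payload["threshold"])
```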

35

Other Text Mining Areas


Named entity recognition


Information extraction


Shallow parsing

36

Data vs Code


What about just sharing training sets?


Often proprietary


What about just sharing models?


Small preprocessing changes can throw you
off completely


Share (simple?) classifier cum preprocessor
and models


Still proprietary issues
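Why sharing a model without its preprocessor fails can be shown in a few lines; the weights and tokenizers are toy examples:

```python
# "Small preprocessing changes can throw you off": the same shared model
# scores the same document differently under two tokenizers, which is why
# the (simple) classifier, preprocessor, and model must travel together.

weights = {"u.s.": 3.0, "stocks": 1.0}   # toy shared model

def original_tokenize(s):
    return s.lower().split()                     # keeps "u.s." as one token

def variant_tokenize(s):
    return s.lower().replace(".", " ").split()   # splits it into "u", "s"

def score(text, tokenize):
    return sum(weights.get(t, 0.0) for t in tokenize(text))

doc = "U.S. stocks rally"
print(score(doc, original_tokenize), score(doc, variant_tokenize))  # 4.0 vs. 1.0
```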


37

Open Source & Data

Public: Code+Data V1.0
  → adapt → Proprietary: Enhanced Code+Data
  → sanitize → Sanitized & Enhanced Code+Data
  → publish (new release) → Public: Code+Data V1.1

38

Free Riders?


Open source is successful because it makes
free riding hard.


Viral nature of GPL.


Harder to achieve for some data resources


Download models


Apply to your data


Retrain


You own 100% of the result


Less of a problem for dictionaries and
grammars

39

Data Licenses


Open Directory License


http://rdf.dmoz.org/license.html


BSD flavor


WordNet


http://www.cogsci.princeton.edu/~wn/license.shtml


Copyright


No license to sell derivative works?


Some criteria for derivative works


Substantially similar (Seinfeld trivia)


Potential damage to future marketing of derivative
works


40

Code vs Data Licenses


Some similarity


If I open-source my code, then I will benefit from bug fixes & enhancements written by others.


If I open-source my data resource, then my classification model may become more robust due to improvements made by others.


Some dissimilarity


Code is very abstract: few issues with
proprietary information creeping in.


Text mining resources are not very abstract:
there is a potential of sensitive information
leaking out.

41

Areas in Need of Research


How to identify reusable text mining components


ODP/Reuters case study does not address this.


Need (small) labeled sample to be able to do this?


How to adapt reusable text mining components


Active learning


Interactive parameter tweaking?


Combination of recycled classifier and new training
information


Estimate performance


Most estimation techniques require large labeled
samples.


The point is to avoid construction of a large labeled
sample.


Create a viral license for data resources.


42

Summary


Many interesting research issues


Need institution/individual to take the lead


Need motivated network of contributors


data resource contributors


source code contributors


Start with small & simple project that proves
idea


If it works … text mining could become an enabler on a par with Linux.

43

More Slides

44

ODP classifier            Reuters class   ROC area   Avg. prec.
RegAsiJap0                JAP             0.86       0.62
RegAsiPhi0                PHLNS           0.91       0.56
RegAsiIndSta0             INDIA           0.85       0.53
SpoSocPla0                CCAT            0.60       0.53
RegEurRus0                CCAT            0.58       0.51
RegEurRus0                RUSS            0.85       0.51
SpoSocPla0                GSPO            0.78       0.42
SpoBasReg0                GSPO            0.75       0.33
RegAsiIndSta0             MCAT            0.56       0.32
SpoBasPla1                GSPO            0.80       0.31
SpoBasCol0                GSPO            0.78       0.31
SpoBasCol1                GSPO            0.74       0.26
RegEurSlo0                SLVAK           0.86       0.25
SpoBasPla0                GSPO            0.77       0.24
RegEurRus0                MCAT            0.49       0.23
BusIndTraMar0             I76300          0.81       0.23
SpoHocIceLeaPro0          GSPO            0.71       0.20
SpoBasMinLea0             GSPO            0.71       0.20
RegMidLeb0                LEBAN           0.83       0.19
RecAvi0                   I36400          0.74       0.18
RegSou0                   BRAZ            0.84       0.18
RegAsiHonBus0             HKONG           0.66       0.18
SpoMotAut0                GSPO            0.67       0.18
SpoHocIceLeaNatPla0       GSPO            0.72       0.17
SocPol0                   EEC             0.85       0.17
RegAsiIndSta0             M14             0.59       0.17
RegAsiChiPro0             CHINA           0.67       0.17
RecAvi0                   I3640010        0.77       0.17
SpoFooAmeColNca1          GSPO            0.72       0.17
SocPol0                   G15             0.86       0.16
RegEurBul0                BUL             0.72       0.15
RegAsiIndPro0             INDON           0.72       0.13
SpoSocPla0                UK              0.49       0.12
RegEurUkr0                UKRN            0.73       0.11
RegEurRus0                GPOL            0.48       0.11
RegEurPolVoi0             POL             0.67       0.11
RegAsiIndSta0             M141            0.61       0.10
SpoFooAmeNflPla0          GSPO            0.65       0.09
RegEurGerSta0             GFR             0.56       0.09
RegEurFra0                FRA             0.54       0.09
RegCar0                   CUBA            0.76       0.09
RegEurUniBusInd0          C18             0.59       0.08
RegEurUniEngEss0          I66200          0.72       0.08
RegSou0                   PERU            0.88       0.08
ComHar0                   C22             0.61       0.08
RegMidTur0                TURK            0.69       0.08
RegAsiIndSta0             M13             0.56       0.08
RegEurUniBusInd0          C181            0.59       0.07
RegNorUniCalLocPxx0       LATV            0.64       0.07
RegEurRus0                GVIO            0.52       0.07
SpoSocPla0                ITALY           0.58       0.07
RegEurUniSco0             GSPO            0.54       0.07
RegEurNet0                NETH            0.65       0.07
RegEurRus0                GDIP            0.46       0.07
ArtMusStyCouBan0          GENT            0.52       0.07
RegEurRus0                BYELRS          0.92       0.06
BusIndTraMar0             C24             0.54       0.06
BusIndTraMar0             I74000          0.72       0.06
RegNorMexSta0             I76300          0.58       0.06
SpoHocIceLeaNatPla0       CANA            0.54       0.06
RegSou0                   MRCSL           1.00       0.06
SocRelBud0                GREL            0.57       0.05
RegEurBel0                FRA             0.49       0.05
SpoSocPla0                FRA             0.50       0.05
RegEurUniBusInd0          I6540005        0.69       0.05
RegNorCanQueLoc0          FRA             0.46       0.05
RegEurGerSta0             GSPO            0.45       0.05
RegAsiIndSta0             M131            0.61       0.05
RegAsiPak0                SHAJH           0.76       0.05
SpoSocPla0                GFR             0.48       0.05
RegSou0                   PARA            0.90       0.04
RegEurUniBusInd0          I9741109        0.59       0.04
RegSou0                   BOL             0.90       0.04
RegEurRus0                UKRN            0.83       0.04
SpoSocPla0                SPAIN           0.61       0.04
NewOnlCnn0                BAH             0.56       0.04
ArtAniVoi0                I97100          0.70       0.03
RegEurRus0                NATO            0.75       0.03
RegEurRus0                GDEF            0.55       0.03
SpoSocPla0                MONAC           0.87       0.03
SciEarPal0                GSCI            0.42       0.03
RegEurRom0                ROM             0.57       0.03
RegAsiPhi0                I85000          0.66       0.03
SpoBasReg0                SPAIN           0.59       0.03
BusIndTraMar0             USSR            0.47       0.03
SpoSocPla0                NETH            0.54       0.03
SpoFooAmeNflPla0          CANA            0.48       0.03
RegEurRus0                AZERB           0.94       0.03
SciBioTaxTaxPlaMagMag0    ECU             0.54       0.03
RegNorUniCalLocPxx0       I41500          0.65       0.02
RegEurRus0                TADZK           0.95       0.02
RegEurUniBusInd0          I8150206        0.71       0.02
RegEurUniBusInd0          I81502          0.58       0.02
RegSou0                   URU             0.88       0.02
RegEurUniBusInd0          I50300          0.74       0.02
RegEurUniBusInd0          I37100          0.79       0.02
RefFlaReg0                GUREP           0.69       0.02
SciBioTaxTaxPlaMagMag0    I0100144        0.58       0.02
NewOnlCnn0                GWEA            0.66       0.02
RegEurUniBusInd0          I85000          0.57       0.02
ArtCelMxx0                I97100          0.66       0.02
SpoMotAut0                SMARNO          0.88       0.02
RegEurUniBusInd0          I5020022        0.79       0.02
NewOnlCnn0                DOMR            0.55       0.02
ArtMusStyCouBan0          GPRO            0.45       0.02
RegEurUniEngEss0          I83954          0.66       0.02
SpoBasReg0                GREECE          0.51       0.02
RegEurRus0                GRGIA           0.84       0.02
RegEurRus0                KAZK            0.82       0.02
RegEurNet0                M142            0.45       0.02
RegEurUniBusInd0          I83200          0.67       0.01
NewOnlCnn0                BELZ            0.50       0.01
RegEurUniBusInd0          C34             0.49       0.01
RegEurUniEngEss0          I82002          0.56       0.01
SpoBasReg0                ISRAEL          0.38       0.01
RegEurUniBusInd0          I83400          0.73       0.01
RegEurUniBusInd0          I83954          0.67       0.01
RegEurPolVoi0             FIN             0.58       0.01
RegEurRus0                USSR            0.82       0.01
RegEurUniBusInd0          I9741105        0.58       0.01
RegEurUniBusInd0          I32852          0.80       0.01
RegEurUniBusInd0          I83940          0.63       0.01
BusIndTraMar0             BUL             0.37       0.01
RegEurUniBusInd0          I61000          0.68       0.01
BusIndTraMar0             ESTNIA          0.60       0.01
NewOnlCnn0                GABON           0.46       0.01
NewOnlCnn0                CVI             0.70       0.01
SciBioTaxTaxAniChoAve0    GENV            0.45       0.01
SpoMotAut0                MONAC           0.71       0.01
ArtCelBxx0                I97100          0.64       0.01
SpoBasReg0                TURK            0.46       0.01
BusIndTraMar0             PORL            0.57       0.01
SpoBasReg0                CRTIA           0.48       0.01
RegEurUniBusInd0          I95100          0.65       0.01
BusIndTraMar0             CRTIA           0.41       0.01
BusIndTraMar0             UKRN            0.43       0.01
ArtCelLxx0                I97100          0.60       0.01
RegEurRus0                MOLDV           0.78       0.01
RegSou0                   SURM            0.80       0.01
BusIndTraMar0             LATV            0.60       0.01
BusIndTraMar0             ALB             0.24       0.01
BusIndTraMar0             LITH            0.58       0.01
ArtCelSxx0                I97100          0.63       0.01
RegEurUniBusInd0          I16000          0.59       0.01
SpoBasCol0                E71             0.42       0.01
SciBioTaxTaxPlaMagMag0    BELZ            0.53       0.01
ArtMusStyCouBan0          GOBIT           0.53       0.01
BusFinBanBanReg0          C173            0.68       0.01
RegEurRus0                ARMEN           0.85       0.01
RegEurRus0                I22471          0.66       0.01
RegEurRus0                TURKM           0.86       0.01
BusIndTraMar0             ROM             0.40       0.01
BusIndTraMar0             TUNIS           0.67       0.00
RegAsiChiPro0             I5020006        0.76       0.00
ArtTelNet0                I9741105        0.67       0.00
BusIndTraMar0             YEMAR           0.49       0.00
BusIndTraMar0             CYPR            0.40       0.00
RefFlaReg0                SLVNIA          0.57       0.00
RegEurUniEngEss0          I9741105        0.57       0.00
RegEurRus0                KIRGH           0.83       0.00
RegCar0                   GTOUR           0.55       0.00
BusIndTraMar0             UAE             0.48       0.00
NewOnlCnn0                BERM            0.52       0.00
BusIndTraMar0             NAMIB           0.48       0.00
BusIndTraMar0             JORDAN          0.36       0.00
RecAvi0                   C313            0.42       0.00
BusIndTraMar0             MOZAM           0.51       0.00
RegEurUniBusInd0          I66200          0.66       0.00
BusIndTraMar0             SILEN           0.34       0.00
RegMidLeb0                I9741105        0.54       0.00
RegAsiHonBus0             I81400          0.61       0.00
RefFlaReg0                WORLD           0.43       0.00
RegNorUniCalLocVxx0       C313            0.39       0.00
RegAsiHonBus0             I64700          0.72       0.00
RefFlaReg0                UPVOLA          0.58       0.00
SciBioTaxTaxPlaMagMag0    I0100216        0.66       0.00
RegAsiHonBus0             I3640048        0.70       0.00
SciBioTaxTaxAniChoAve0    AARCT           0.53       0.00
RegSou0                   I5020051        0.84       0.00
NewOnlCnn0                TCAI            0.00       0.00

45

Resources


http://www-csli.stanford.edu/~schuetze (this talk, some additional material)


Source of Gates quote: http://www.techweb.com/wire/story/TWB19990324S0014


Kurt D. Bollacker and Joydeep Ghosh. A scalable method for classifier knowledge reuse. In Proceedings of the 1997 International Conference on Neural Networks, pages 1474-79, June 1997. (proposes a measure for selecting classifiers for reuse)


W. Cohen, D. Kudenko: Transferring and Retraining Learned Information Filters, Proceedings of the Fourteenth National Conference on Artificial Intelligence, AAAI 97. (transfer within the same dataset)


Kurt D. Bollacker and Joydeep Ghosh. A supra-classifier architecture for scalable knowledge reuse. In The 1998 International Conference on Machine Learning, pp. 64-72, July 1998. (transfer within the same dataset)


Motivation of open source contributors:
http://newsforge.com/newsforge/03/04/19/2128256.shtml?tid=11,
http://cybernaut.com/modules.php?op=modload&name=News&file=article&sid=8&mode=thread&order=0&thold=0