Major Web Intelligence Tools

Posted by boorishadamant in Artificial Intelligence and Robotics, 29 Oct 2013

© 2005

Web Intelligence Tools


I. Collection
    Offline Explorer
    SpidersRUs (AI Lab)
    Google Scholar

II. Analysis (Data and Text Mining)
    Google APIs
    Google Translation
    GATE
    Arizona Noun Phraser (AI Lab)
    Self-Organizing Map, SOM (AI Lab)
    Weka

III. Visualization
    NetDraw
    JUNG
    Analyst’s Notebook and Starlight

Collection: Offline Explorer

Developed by MetaProducts Corporation, Offline Explorer can download Web sites to your hard disk for offline browsing.
http://www.metaproducts.com/OE.html



Advantages of Offline Explorer:
    Save Time: Download up to 500 files simultaneously.
    Save Yesterday's Web Sites for Tomorrow's Use
    Monitor Web Sites
    Mine your Data: The TextPipe tool in the Offline Explorer Pro edition can extract or change the desired data, or even export it to a database.

Offline Explorer

[Screenshot: the project list and the project properties setup window. Callouts mark the download URLs, the download level, the file modification check, and the file filters, URL filters, and other advanced properties.]
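
The download level and URL filter settings can be pictured as a depth-limited crawl. The following is a minimal sketch in Python, not Offline Explorer's actual implementation. The names `crawl`, `fetch_links`, `fake_fetch` and the in-memory `SITE` are hypothetical, invented here so the example runs without network access.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, fetch_links, max_level=2, allowed_hosts=None):
    """Breadth-first crawl limited by link depth and an optional host filter.

    fetch_links is a caller-supplied function: URL -> list of linked URLs.
    max_level plays the role of a "download level" setting (an assumption).
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    downloaded = []
    while queue:
        url, level = queue.popleft()
        downloaded.append(url)
        if level == max_level:
            continue  # do not follow links beyond the download level
        for link in fetch_links(url):
            host = urlparse(link).netloc
            if allowed_hosts is not None and host not in allowed_hosts:
                continue  # URL filter: skip hosts outside the allowed set
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return downloaded


# Hypothetical in-memory "Web site" standing in for real HTTP fetches.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://other.org/x"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/c"],
}


def fake_fetch(url):
    return SITE.get(url, [])


pages = crawl("http://example.com/", fake_fetch,
              max_level=2, allowed_hosts={"example.com"})
```

With a download level of 2, the page at `/c` is never fetched, and the link to `other.org` is dropped by the host filter.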

SpidersRUs

The SpidersRUs Digital Library Toolkit was developed by the Artificial Intelligence Lab at the University of Arizona.
http://ai.eller.arizona.edu/spidersrus/

It provides modular tools for spidering, indexing, and searching, so that digital libraries in different languages can be built in a simple DIY (Do-It-Yourself) way. Users can create their own search engines easily and quickly via the friendly user interface.

SpidersRUs can automate the development of vertical search engines in different domains and languages. It works with non-English languages such as Asian and Middle Eastern languages.
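
The indexing and searching steps can be illustrated with a small inverted index. This is a sketch only and does not reflect SpidersRUs code. The function names and sample documents are hypothetical, and the whitespace tokenization is an assumption that would have to be replaced by word segmentation for languages such as Chinese.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, as a spider might have collected them.
DOCS = {
    1: "digital library toolkit",
    2: "vertical search engine toolkit",
    3: "search engine for digital library",
}
INDEX = build_index(DOCS)
hits = search(INDEX, "digital library")
```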


SpidersRUs

[Screenshot: an example of a Chinese search engine built by SpidersRUs, showing the keyword search and the search results.]

Google Scholar

Google Scholar provides a simple way to broadly search for scholarly literature.
http://scholar.google.com/

Features of Google Scholar:
    Search diverse sources from one convenient place
    Find papers, abstracts and citations
    Locate the complete paper through your library or on the web
    Learn about key papers and scholars in any area of research


Google Scholar

[Screenshot: a search for “Bioterrorism” in Google Scholar. Callouts mark a result with 366 citations and the list of papers citing that paper.]

Analysis: Google APIs

Google provides many APIs to help you quickly develop your own applications.
http://code.google.com/more/

Examples of Google APIs:
    Google API for Inlink: Discovers what pages link to your website.
    Google Data APIs: Provide a simple, standard protocol for reading and writing data on the Web. Several Google services provide a Google Data API, including Google Base, Blogger, Google Calendar, Google Spreadsheets and Picasa Web Albums.
    Google AJAX Search API: Uses JavaScript to embed a simple, dynamic Google search box and display search results in your own Web pages.
    Google Analytics: Allows users to gather, view, and analyze data about their Web site traffic. Users can see which content gets the most visits, as well as the average page views and time on site per visit.
    Google Safe Browsing APIs: Allow client applications to check URLs against Google's constantly updated blacklists of suspected phishing and malware pages.
    YouTube Data API: Integrates online videos from YouTube into your applications.

Example: Google API for Inlink

[Screenshot: input “link URL” and search. The results list all the related inlink Web pages.]

Google Translation

Google's Translate function.
http://www.google.com/language_tools?hl=en

The input and output languages can be Arabic, Chinese, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian or Spanish.

Major functions of Google Translation include:
    Search multilingual Web pages
        Search the Internet in one language and get the results in another.
    Translate text
        Translate free text into multiple languages.
    Translate a Web page
        Translate a Web page into multiple languages.

Google Translation

[Screenshots: translating text from Arabic to English, searching multilingual Web pages, and translating a Web page.]

GATE

General Architecture for Text Engineering (GATE) is a toolkit for text mining. It was developed by the NLP group at the University of Sheffield (UK).
http://gate.ac.uk

Information Extraction tasks:
    Named Entity Recognition (NE)
        Finds names, places, dates, etc.
    Co-reference Resolution (CO)
        Identifies identity relations between entities in texts.
    Template Element Construction (TE)
        Adds descriptive information to NE results (using CO).
    Template Relation Construction (TR)
        Finds relations between TE entities.
    Scenario Template Production (ST)
        Fits TE and TR results into specified event scenarios.

GATE also includes:
    Parsers, stemmers, and Information Retrieval tools;
    Tools for visualizing and manipulating ontologies; and
    Evaluation and benchmarking tools.

GATE

[Screenshot: the GATE interface, with callouts for the project information, the results display, and the attributes. Picture is from http://nlp.shef.ac.uk]

Arizona Noun Phraser

The Arizona Noun Phraser was developed by the Artificial Intelligence Lab at the University of Arizona.
http://ai.arizona.edu/

The Arizona Noun Phraser is made up of three major components: a tokenizer, a part-of-speech tagger, and a phrase generation tool. It generates precise topic descriptions.

Tokenizer
    Separates punctuation and symbols from text without affecting content.

Part of Speech (POS) Tagger
    Uses both lexical and contextual disambiguation in POS assignment;
    Lexicons include: Brown Corpus, Wall Street Journal, and Specialist Lexicon.

Phrase Generation
    Uses a simple Finite State Automaton (FSA) of noun phrasing rules;
    Breaks sentences and clauses into grammatically correct noun phrases.
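
The three-stage pipeline can be sketched in a few lines of Python. This is an illustration under strong assumptions, not the Arizona Noun Phraser itself: the toy `LEXICON`, the tag names, and the single rule "zero or more adjectives followed by one or more nouns" are hypothetical, and the tagger below does plain dictionary lookup with no contextual disambiguation.

```python
import re

# Hypothetical toy lexicon; a real tagger would draw on large corpora.
LEXICON = {
    "the": "DET", "a": "DET",
    "fast": "ADJ", "digital": "ADJ",
    "library": "NOUN", "toolkit": "NOUN", "spider": "NOUN",
    "web": "NOUN", "pages": "NOUN",
    "indexes": "VERB", "builds": "VERB",
}


def tokenize(text):
    """Split text into words and separate punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(tokens):
    """Assign a part of speech by lexicon lookup only (no context)."""
    return [(tok, LEXICON.get(tok.lower(), "OTHER")) for tok in tokens]


def noun_phrases(tagged):
    """Finite state automaton accepting ADJ* NOUN+ sequences."""
    phrases = []
    current = []      # words of the candidate phrase
    state = "START"   # START -> ADJ -> NOUN
    for word, pos in tagged + [("", "END")]:
        if pos == "NOUN":
            current.append(word)
            state = "NOUN"
        elif pos == "ADJ" and state != "NOUN":
            current.append(word)
            state = "ADJ"
        else:
            if state == "NOUN":
                phrases.append(" ".join(current))  # accept the phrase
            # An adjective right after a noun run starts a new candidate.
            current = [word] if pos == "ADJ" else []
            state = "ADJ" if pos == "ADJ" else "START"
    return phrases


phrases = noun_phrases(tag(tokenize("The fast spider indexes web pages.")))
```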

SOM

The multi-level self-organizing map neural network algorithm was developed by the Artificial Intelligence Lab at the University of Arizona.

Using a 2D map display, similar topics are positioned closer together according to their co-occurrence patterns; more important topics occupy larger regions.
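
For readers unfamiliar with self-organizing maps, the sketch below shows a plain, single-level SOM in pure Python. It is not the AI Lab's multi-level algorithm. The function names, the linear decay schedules, and the sample vectors (which stand in for term co-occurrence features) are all assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, vec):
    """Return the grid node whose weight vector is closest to vec."""
    return min(weights, key=lambda node: math.dist(weights[node], vec))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighborhood shrinks
        for vec in data:
            bmu = best_matching_unit(weights, vec)
            for node, w in weights.items():
                grid_dist = math.dist(node, bmu)
                if grid_dist <= radius:
                    # Nodes near the winner move toward the input vector.
                    influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                    for i in range(dim):
                        w[i] += lr * influence * (vec[i] - w[i])
    return weights


# Hypothetical feature vectors for four documents (two rough topics).
DATA = [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9], [1.0, 0.9, 0.0], [0.9, 1.0, 0.1]]
som = train_som(DATA, rows=3, cols=3, epochs=20)
cell = best_matching_unit(som, DATA[0])
```

After training, each document is placed at the map cell returned by `best_matching_unit`, so documents with similar vectors tend to land in the same or neighboring cells.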


SOM

Example: FMD Paper Content Map (2001~2005)

[Figure: content map developed by the AI Lab at the University of Arizona. Callouts mark the different topics, a topic region, a topic label, and the number of documents belonging to that topic. Warm colors represent new topics.]

Weka

Weka was developed at the University of Waikato in New Zealand.
http://www.cs.waikato.ac.nz/~ml/

Tools include:
    Data preprocessing (e.g., Data Filters),
    Classification (e.g., BayesNet, KNN, C4.5 Decision Tree, Neural Networks, SVM),
    Regression (e.g., Linear Regression, Isotonic Regression, SVM for Regression),
    Clustering (e.g., Simple K-means, Expectation Maximization (EM), Farthest First),
    Association rules (e.g., Apriori Algorithm, Predictive Accuracy, Confirmation Guided),
    Feature Selection (e.g., Cfs Subset Evaluation, Information Gain, Chi-squared Statistic), and
    Visualization (e.g., View different two-dimensional plots of the data).
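
As one concrete instance of the clustering tools listed above, here is a minimal k-means sketch in pure Python. It does not use or reproduce Weka's own implementation. The function name, the choice of the first k points as initial centroids, and the sample points are assumptions made to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Basic k-means: returns (centroids, cluster label per point)."""
    # Assumption: seed the centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(x) / len(members) for x in zip(*members)]
    return centroids, assignment


# Hypothetical two-dimensional data with two obvious groups.
POINTS = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0),
          (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(POINTS, 2)
```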

Weka

[Screenshot: the Weka interface, with callouts for the different analysis tools, the different attributes to choose from, and the value set of the chosen attribute with the number of input items for each value.]

Visualization: NetDraw

NetDraw is an open source program written by Steve Borgatti from Analytic Technologies for visualizing both 1-mode and 2-mode social network data.
http://www.analytictech.com/downloadnd.htm

NetDraw can handle multiple relations at the same time, and can use node attributes to set colors, shapes, and sizes of nodes. Pictures can be saved in metafile, jpg, gif and bitmap formats.

Two basic kinds of layouts are implemented: a circle and an MDS/spring embedding based on geodesic distance. You can also rotate, flip, shift, resize and zoom configurations.
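
The two ingredients named above, a circle layout and geodesic distance, can be sketched as follows. This is not NetDraw code. The function names and the small `NETWORK` are hypothetical, and the sketch stops at computing the distances that an MDS or spring embedding would take as input.

```python
import math
from collections import deque


def circle_layout(nodes, radius=1.0):
    """Place nodes at equal angles around a circle; returns {node: (x, y)}."""
    n = len(nodes)
    return {
        node: (radius * math.cos(2 * math.pi * i / n),
               radius * math.sin(2 * math.pi * i / n))
        for i, node in enumerate(nodes)
    }


def geodesic_distances(adjacency, source):
    """Shortest path length (in links) from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Hypothetical 1-mode network: a chain of four people.
NETWORK = {
    "Ann": ["Bob"],
    "Bob": ["Ann", "Cat"],
    "Cat": ["Bob", "Dan"],
    "Dan": ["Cat"],
}
positions = circle_layout(list(NETWORK))
distances = geodesic_distances(NETWORK, "Ann")
```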

NetDraw

[Screenshot: the NetDraw interface, with callouts for the different functions, the display setup of the nodes and relations, and the network itself, in which nodes represent the individuals and links represent the relations.]

JUNG

The Java Universal Network/Graph Framework (JUNG) is a software library for the modeling, analysis, and visualization of data that can be represented as a graph or network. It was developed by the School of Information and Computer Science at the University of California, Irvine.
http://jung.sourceforge.net/index.html

The current distribution of JUNG includes implementations of a number of algorithms from graph theory, data mining, and social network analysis:
    Clustering
    Decomposition
    Optimization
    Random Graph Generation
    Statistical Analysis
    Calculation of Network Distances and Flows and Importance Measures (Centrality, PageRank, HITS, etc.).
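
To show what one of the importance measures computes, the sketch below runs PageRank by power iteration. It is written in Python for brevity and does not use the JUNG API. The function name, the damping factor of 0.85, the fixed iteration count, and the sample graph are assumptions; the code also assumes every link target appears as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; graph maps node -> list of link targets."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same base share from random jumps.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical directed graph: C is linked to by A, B and D.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(GRAPH)
```

In this graph node C collects links from three other nodes and ends up with the highest score, while D, which nothing links to, keeps only the base share.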


JUNG

[Figure: examples of visualization types. Pictures are from http://jung.sourceforge.net/index.html]

Analyst’s Notebook & Starlight

Analyst’s Notebook, by i2: A 2D graph and timeline layout tool for crime and intelligence analysis

Starlight, by Pacific Northwest Lab (PNL): A 3D network visualization and navigation tool for intelligence analysis

[Screenshots: Analyst’s Notebook (i2) and Starlight (PNL).]