Blog Mining Project: Initial Specifications

17 Nov 2013



Project Modules:

1. Blog Crawler
2. Text Classifier
3. Indexer
4. Subtopic Identification and Labeling
5. Sentiment Analysis
6. Ranking
7. User Interface


Support Modules:

1. RSS/RDF parser
2. WordNet Module
3. PoS and/or NE Tagger


[Figure: system architecture. Components: Blog Crawler, Text Classifier (SVM),
Indexer, Ranking Module, Subtopic Identification & Labeling, Sentiment Analysis,
User Interface. Data stores: IR Index, RDBMS.]

Blog Crawler:

Uses various search engine APIs to issue a query on the selected topic. All the blogs
on the results page are fetched. The crawler does not "crawl" the blogosphere in the
conventional sense by following the links on the pages visited. Instead it relies on
the search engines to provide the links. The search engines used are mainly RSS feed
search engines like feedster.com. The result pages of such engines are available in
RDF/RSS format and hence are easier to process using ready-made toolkits.
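
The crawler step described above could be sketched roughly as follows. This is only
an illustration under assumptions: the SEARCH_URL template, the function names and
the idea of passing in a `fetch` callable are hypothetical and do not correspond to
any real engine's API. The result page is assumed to be RSS 2.0.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus

# Hypothetical URL template; a real feed search engine defines its own.
SEARCH_URL = "http://search.example.com/rss?q={query}"

def build_query_url(topic, template=SEARCH_URL):
    """Return the URL that asks an engine for results on `topic`."""
    return template.format(query=quote_plus(topic))

def extract_links(rss_text):
    """Return the <link> of every <item> in an RSS 2.0 result page."""
    root = ET.fromstring(rss_text)
    return [item.findtext("link").strip()
            for item in root.iter("item")
            if item.findtext("link")]

def crawl(topic, fetch, templates=(SEARCH_URL,)):
    """Query each engine once and collect the distinct blog links.

    `fetch` is any callable mapping a URL to the response text. Links come
    only from the engines' result pages; no page links are followed.
    """
    seen, links = set(), []
    for template in templates:
        page = fetch(build_query_url(topic, template))
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                links.append(link)
    return links
```

Injecting `fetch` keeps the sketch independent of any HTTP library and makes it easy
to run against saved result pages.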


Text Classifier:

We cannot rely on the quality of search engine results. Hence, to make sure that the
blogs we process are indeed about the selected topic, we have to filter out the
irrelevant blogs. The classifier uses readily available features from the blogs. An
SVM is used for the text classification; SVM packages such as libSVM and SVMlight
are available.
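
To make the filtering step concrete, here is a small sketch of a linear SVM over
bag-of-words features. It does not use libSVM or SVMlight; as an assumption made for
illustration, it trains a bias-free linear SVM with plain sub-gradient descent on the
regularized hinge loss. The feature choice, function names and parameter values are
all hypothetical.

```python
from collections import Counter

def features(text):
    """Bag-of-words term counts, standing in for the blog features."""
    return Counter(text.lower().split())

def score(weights, text):
    """Linear decision value; positive means 'relevant'."""
    return sum(weights.get(term, 0.0) * n
               for term, n in features(text).items())

def train_linear_svm(examples, lam=0.01, eta=0.1, epochs=50):
    """Fit a bias-free linear SVM to (text, label) pairs, label in {+1, -1}.

    Uses sub-gradient descent on the L2-regularized hinge loss.
    """
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            violated = label * score(weights, text) < 1.0
            for term in weights:          # regularization step
                weights[term] *= 1.0 - eta * lam
            if violated:                  # hinge-loss step
                for term, n in features(text).items():
                    weights[term] = weights.get(term, 0.0) + eta * label * n
    return weights

def is_relevant(weights, text):
    """Classify a blog text as on-topic (True) or off-topic (False)."""
    return score(weights, text) > 0.0
```

In practice one of the packages named above would replace `train_linear_svm`; the
sketch only shows where labeled blogs go in and a relevance decision comes out.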


The crawler and the classifier are run on a daily basis. The classifier writes the
list of relevant blogs to a file. The blogs that appear highly relevant over a period
of time are chosen for indexing and further processing. This process is repeated
frequently so that we do not miss newly started blogs.
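
One possible reading of "highly relevant over a period of time" is that a blog shows
up in the classifier's daily list on enough days. The sketch below assumes that
reading; the `min_days` threshold and the function name are hypothetical.

```python
from collections import Counter

def select_for_indexing(daily_lists, min_days):
    """Return blogs listed as relevant on at least `min_days` days.

    `daily_lists` holds one list of blog URLs per daily classifier run.
    """
    counts = Counter()
    for day in daily_lists:
        counts.update(set(day))  # count each blog at most once per day
    return sorted(url for url, days in counts.items() if days >= min_days)
```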


Indexer:

Indexes the content of a blog and the metadata associated with it. All available
metadata and the link structure are stored using a database such as Berkeley DB or
MySQL. A full-text index is created for the blog content using Jakarta Lucene. The
indexer provides the required interface to use these two indices in a simple manner.
It is also responsible for creating and managing any other type of index, for example
NE indices, if required.
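
The idea of one interface over two stores can be sketched as below. This is not the
planned implementation: as stand-ins chosen for a self-contained example, an
in-memory SQLite database plays the role of the metadata store and a plain inverted
index plays the role of Lucene. The class, table and column names are hypothetical.

```python
import sqlite3
from collections import defaultdict

class Indexer:
    """One facade over a metadata store and a full-text index."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE blog (url TEXT PRIMARY KEY, title TEXT, author TEXT)")
        self.db.execute("CREATE TABLE link (src TEXT, dst TEXT)")
        self.postings = defaultdict(set)  # term -> set of blog URLs

    def add(self, url, title, author, content, out_links=()):
        """Store metadata and links, then index the content terms."""
        self.db.execute("INSERT OR REPLACE INTO blog VALUES (?, ?, ?)",
                        (url, title, author))
        self.db.executemany("INSERT INTO link VALUES (?, ?)",
                            [(url, dst) for dst in out_links])
        for term in content.lower().split():
            self.postings[term].add(url)

    def search(self, query):
        """Return metadata rows of blogs containing every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        urls = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.db.execute(
                    "SELECT url, title, author FROM blog WHERE url = ?",
                    (url,)).fetchone()
                for url in sorted(urls)]

    def in_links(self, url):
        """Return the blogs that link to `url`."""
        rows = self.db.execute(
            "SELECT src FROM link WHERE dst = ? ORDER BY src", (url,))
        return [row[0] for row in rows]
```

Callers such as the ranking module would only see `search` and `in_links`, not which
store answers the request.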


Subtopic Identification and Labeling:

The indexed blogs are then clustered based on the subtopic they discuss. Each cluster
is given a label automatically. Wei would be using his technique here. However, other
techniques for hierarchical and non-hierarchical clustering can be used too.
Hierarchical techniques include single link, complete link and group average.
Non-hierarchical techniques include single pass, k-means and EM.
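
As an illustration of one of the alternative techniques named above, the following
is a minimal sketch of single-pass clustering. It is not Wei's technique and covers
only the clustering step, not the labeling. Bag-of-words vectors, cosine similarity
and the threshold value are assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def single_pass_cluster(docs, threshold=0.3):
    """Cluster documents in one pass over the input.

    Each document joins the most similar existing cluster if the similarity
    reaches `threshold`; otherwise it starts a new cluster. Returns a list of
    clusters, each a list of document indices.
    """
    clusters = []  # each: {"centroid": Counter, "members": [index, ...]}
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            # Summed counts serve as the centroid; cosine ignores the scale.
            best["centroid"].update(vec)
            best["members"].append(i)
    return [cluster["members"] for cluster in clusters]
```

The result depends on the order in which blogs are processed, which is a known
property of single-pass methods.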


Sentiment Analysis:

Each blog is analyzed for whether it says something positive or negative about the
selected issue/topic. It can also be assigned a value based on the intensity of the
opinion expressed, such as strong, neutral or somewhere in between. This would
involve NLP techniques for sentiment analysis.
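
A very rough sketch of assigning a polarity and an intensity is given below. It
assumes a simple word-lexicon approach, which is only one possible technique and far
simpler than the NLP the module would need. The toy lexicon, the 0.75 cut-off and the
label names are hypothetical.

```python
# Hypothetical toy lexicon: word -> polarity weight in [-1, 1].
LEXICON = {"good": 0.5, "great": 1.0, "excellent": 1.0,
           "bad": -0.5, "terrible": -1.0, "awful": -1.0}

def sentiment(text, lexicon=LEXICON):
    """Return (polarity, intensity) for a blog text.

    polarity is "positive", "negative" or "neutral";
    intensity is "strong", "moderate" or "neutral".
    """
    words = [w.strip(".,!?;:") for w in text.lower().split()]
    hits = [lexicon[w] for w in words if w in lexicon]
    score = sum(hits) / len(hits) if hits else 0.0
    if score > 0:
        polarity = "positive"
    elif score < 0:
        polarity = "negative"
    else:
        polarity = "neutral"
    magnitude = abs(score)
    if magnitude >= 0.75:
        intensity = "strong"
    elif magnitude > 0:
        intensity = "moderate"
    else:
        intensity = "neutral"
    return polarity, intensity
```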


Ranking Module:

The relevant blogs are ranked based on various factors such as link structure,
content, subtopic and sentiment.

A new metric for producing the ranking of blogs is required.
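
Since the metric itself is still to be designed, the sketch below is only a
placeholder showing how several factor scores could be combined. It assumes each
factor has already been normalized to [0, 1] and uses a weighted sum; the weights and
factor names are hypothetical.

```python
# Hypothetical weights; the actual ranking metric is still to be designed.
WEIGHTS = {"links": 0.4, "content": 0.3, "subtopic": 0.2, "sentiment": 0.1}

def rank(blogs, weights=WEIGHTS):
    """Sort blog URLs by a weighted sum of per-factor scores in [0, 1].

    `blogs` maps a URL to a dict of factor scores; a missing factor counts
    as 0. The highest combined score comes first.
    """
    def combined(scores):
        return sum(weight * scores.get(name, 0.0)
                   for name, weight in weights.items())
    return sorted(blogs, key=lambda url: combined(blogs[url]), reverse=True)
```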


User Interface:

The system will have a web interface for querying, viewing mining results, providing
feedback and browsing.

Administrators can configure the whole system using the web interface.

The UI would be developed as a web application using JSP/Servlets, optionally using
Apache Struts, hosted on Apache Tomcat.

The UI would be as simple as possible; at the same time it should provide an
intuitive and flexible way of visualizing information. The layout of the main results
page is shown in the figure at the end of this document.





The RSS/RDF parser:

This is developed using ready-made tools/APIs. It will provide a simple interface to
extract the information needed from the RSS/RDF content.

The indexer and the crawler are the main users.
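
The kind of simple interface meant here could look like the sketch below. It assumes
the standard-library XML parser as the ready-made tool and handles two input shapes:
RSS 2.0 (no namespace) and RSS 1.0, which is RDF-based and namespaced. The function
name and the choice of returned fields are hypothetical.

```python
import xml.etree.ElementTree as ET

RSS1 = "{http://purl.org/rss/1.0/}"  # namespace used by RSS 1.0 (RDF) feeds

def parse_feed(xml_text):
    """Return a list of {"title", "link", "description"} dicts.

    Accepts RSS 2.0 input and RSS 1.0 / RDF input. Missing fields are
    returned as empty strings.
    """
    root = ET.fromstring(xml_text)
    prefix = RSS1 if root.tag.endswith("}RDF") else ""
    entries = []
    for item in root.iter(prefix + "item"):
        entries.append({
            field: (item.findtext(prefix + field) or "").strip()
            for field in ("title", "link", "description")
        })
    return entries
```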


WordNet and the PoS/NE tagger would be needed for sentiment analysis.




[Figure: layout of the main results page. Regions: Header; Higher-level category;
Lower-level categories; Results section. Each result shows metadata, link(s) and an
extract. The background of the link is color coded based on positive/negative opinion
and its intensity.]
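
The color coding of link backgrounds by opinion and intensity could be sketched as
follows. The mapping is an assumption made for illustration: a sentiment score in
[-1, 1] is taken as input, positive values shade towards green and negative values
towards red. The function name and the color scheme are hypothetical.

```python
def link_background(score):
    """Map a sentiment score in [-1, 1] to a CSS hex color.

    Positive scores shade towards green, negative scores towards red; the
    further the score is from 0, the more saturated the color. Scores outside
    the range are clamped.
    """
    score = max(-1.0, min(1.0, score))
    fade = int(round(255 * (1.0 - abs(score))))  # 255 = white, 0 = full color
    if score >= 0:
        red, green, blue = fade, 255, fade
    else:
        red, green, blue = 255, fade, fade
    return "#{:02x}{:02x}{:02x}".format(red, green, blue)
```

The returned string could be written directly into the `background-color` style of a
result link by the JSP page.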