A collaborative approach to Spam E-Mail Filtering: Recommendation Using Ontological user profiling



Dr. Lina Zhou

Xiaoli Jiao


Department of Information Systems

University of Maryland Baltimore County



Abstract


Most definitions assume UCE (Unsolicited Commercial Email) and spam to be
synonymous. However, people do not classify emails as spam objectively, purely on
whether they adhere to a definition, but rather subjectively, on whether the email is of
interest to them. It is noteworthy that some consider email to be spam even if they have
explicitly given the sender permission to contact them. This reflects some of the
conundrums of legislative debates on spam.

Server-based collaborative filtering has been very successful in many systems, but
such filters favor ubiquitous computing settings. We propose a personalized,
collaborative approach to filtering spam using a recommendation system. In addition, we
explore a novel ontological approach to improve user profiling and hence the
recommendation accuracy.




1. Introduction


The increasing volume of unsolicited bulk e-mail (spam) has generated a need for
reliable anti-spam filters. There are a number of very successful collaborative filters
today: Vipul’s Razor [17], Distributed Checksum Clearinghouse [18] and
SpamNet [19]. However, very few of them take domain or user personalization
into account.

Studies [16] show that people have their personal views on what constitutes spam.
A centralized server filter will cause false positives for users whose opinions
differ from the majority. In our research, we address this issue with the aid of
user profiling in collaborative spam email filtering.

The user profiling approach used by most recommendation systems is behavior
based, commonly using a binary class model to represent what users find interesting and
uninteresting. However, a binary profile does not lend itself to sharing examples of
interest or integrating any domain knowledge that might be available.

We use the term ontology to refer to the classification structure and instances
within a knowledge base. We propose an approach to filtering spam email that combines
collaborative and content-based recommendation techniques and represents user
profiles in ontological terms.


2. Related work


Alan Gray and Mads Haahr analyzed three assumptions implicit in centralized spam
filtering and described how they affect spam filtering. They presented an architecture for
personalized, collaborative spam filtering and described the design and implementation
of a proof-of-concept, peer-to-peer, signature-based system based on that architecture.
Ernesto Damiani and his colleagues proposed a decentralized, privacy-preserving approach
to spam filtering in their paper “P2P-Based Collaborative Spam Detection and Filtering.”
They exploit digests to identify messages that are slight variations of one another, and
a structured peer-to-peer architecture between mail servers to collaboratively share
knowledge about spam.


3. Our Proposed Approach


We propose a personalized, collaborative approach to filtering spam using a
recommendation system. The approach is intended to improve the overall
performance of conventional server-based collaborative filters. Recommendations are
based on a group of users with similar user profiles; in this way, when a false
positive or false negative happens to one user, the correction helps others. A user
interest profile is computed by correlating previous email messages with their
classifications, and it forms the part of the user profile used to determine
similarities among users. We use ontological inference to improve user
profiling and hence the recommendation accuracy.

The proposed approach leaves much flexibility for user input and feedback.
Users can define parameters such as cost values, interest topics, recommendation groups
and trust levels; where any of these parameters is absent, the system can stay in the
background and determine it by watching the user’s behavior.


3.1 Message representation and classification


Messages are represented with term vectors, and each term has a weight.
Similarities among messages can be computed using their term vectors. Features
may include the sender’s ID (name, domain, IP address), subject, classification,
and so on.
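As a minimal sketch of the term-vector comparison described above, the snippet
below builds sparse term vectors and compares them with cosine similarity. The
raw term-frequency weighting, the whitespace tokenizer and the function names are
illustrative assumptions; the text does not fix a weighting scheme or a similarity
measure.

```python
import math
from collections import Counter

def term_vector(text):
    """Map a message to term -> weight (raw term frequency, an assumed weighting)."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty message is treated as similar to nothing
    return dot / (norm_u * norm_v)

a = term_vector("buy cheap pills now")
b = term_vector("cheap pills available now")
print(cosine_similarity(a, b))
```

Sender or subject features could be added as extra, specially prefixed terms in
the same vector.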

The classification of a message is determined by its content. Several
machine learning methods, such as k-Nearest Neighbor, Expectation
Maximization (EM), Support Vector Machine, Naïve Bayes and decision trees, may
be applied to develop a classifier.
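Of the methods listed, Naïve Bayes is the simplest to illustrate. The following is
a sketch only, assuming whitespace tokenization, add-one smoothing and the two
labels "spam" and "legitimate"; the class and method names are hypothetical and
the proposal does not commit to this learner.

```python
import math
from collections import Counter

class NaiveBayes:
    """Tiny multinomial Naive Bayes; train on at least one message per class."""

    def __init__(self):
        self.term_counts = {"spam": Counter(), "legitimate": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        terms = text.lower().split()
        self.term_counts[label].update(terms)
        self.doc_counts[label] += 1
        self.vocab.update(terms)

    def log_score(self, terms, label):
        # log prior + sum of log likelihoods with add-one (Laplace) smoothing
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        total_terms = sum(self.term_counts[label].values())
        for t in terms:
            score += math.log((self.term_counts[label][t] + 1)
                              / (total_terms + len(self.vocab)))
        return score

    def classify(self, text):
        terms = text.lower().split()
        return max(self.term_counts, key=lambda label: self.log_score(terms, label))

nb = NaiveBayes()
nb.train("cheap pills buy now", "spam")
nb.train("win money now", "spam")
nb.train("meeting agenda attached", "legitimate")
nb.train("project report draft", "legitimate")
print(nb.classify("buy cheap money"))
```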

To help identify spam that differs only slightly in content, a signature is
computed on every new email received and compared to those of known messages.
For the signature algorithms to be more robust (or “fuzzy”, i.e. ignoring small
randomizations in the text), they need to be more content-aware, so that
unimportant discrepancies between messages do not change the signature. When a
new message arrives for a user, its signature is first compared to the known
message signatures of this user; if there is a match, its spamminess is taken
from the known message and no further step is needed.
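One simple way to make a signature insensitive to small randomizations is to
normalize the text before hashing it. The sketch below assumes that case, digits,
punctuation and spacing are the "unimportant discrepancies"; this normalization
and the names fuzzy_signature and lookup are illustrative assumptions, not the
algorithm used by any of the systems cited in this paper.

```python
import hashlib
import re

def fuzzy_signature(text):
    """Hash a normalized form of the message body."""
    # Assumed normalization: lowercase, keep only letters and whitespace,
    # then collapse runs of whitespace to a single space.
    normalized = re.sub(r"[^a-z\s]", "", text.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def lookup(text, known_signatures):
    """Return the stored label for a matching signature, or None if unseen."""
    return known_signatures.get(fuzzy_signature(text))

known = {fuzzy_signature("Buy CHEAP pills now!!! 4821"): "spam"}
print(lookup("buy cheap   pills now 9917", known))  # matches the stored message
print(lookup("Meeting moved to noon", known))       # no match, classify further
```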


3.2 Recommendation


A threshold is set for every user based on the cost values assigned to false
positive and false negative messages. When a message arrives, an accumulated
value is calculated from the recommendations of a group of users with similar
profiles. If the accumulated value is greater than the threshold, the message is
classified as spam for this user.

A user can define the user group from which recommendations are made, or
let the system define one based on similarities of user profiles. The
accumulated value for a message is calculated from the recommendation
value and the trust level of each user in this user’s recommendation group.
The system looks into the message folders of each user in the recommendation
group and assigns a recommendation value indicating how likely the message is
to be spam, by comparing its similarity with previous messages, either spam or
legitimate. (This procedure can be seen as a particular message classification
problem in which only two classes are available: spam or legitimate. In this
way, a classifier is trained for each user.) Individual sets of spam and
non-spam messages are maintained for each user’s profile.
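The decision rule of this section can be sketched as follows. The
trust-weighted average used as the accumulated value, and the mapping from a cost
ratio t to the threshold t / (1 + t), are assumptions made for illustration; the
text only states that the value depends on recommendation values and trust
levels, and that the threshold depends on the cost values.

```python
def accumulated_value(recommendations):
    """recommendations: list of (recommendation_value, trust_level), both in [0, 1]."""
    total_trust = sum(trust for _, trust in recommendations)
    if total_trust == 0:
        return 0.0  # nobody in the group is trusted: no evidence of spam
    return sum(value * trust for value, trust in recommendations) / total_trust

def spam_threshold(cost_ratio):
    """Assumed mapping: a false positive costs `cost_ratio` times a false negative."""
    return cost_ratio / (1.0 + cost_ratio)

def is_spam(recommendations, cost_ratio):
    return accumulated_value(recommendations) > spam_threshold(cost_ratio)

group = [(0.9, 1.0), (0.8, 0.5), (0.2, 0.5)]  # hypothetical group of three users
print(is_spam(group, cost_ratio=1))  # threshold 0.5
print(is_spam(group, cost_ratio=9))  # threshold 0.9, a more cautious user
```

A user who assigns a high cost to false positives therefore needs stronger
agreement within the group before a message is filtered.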



3.3 User Profiling


In addition to an individual interest profile computed by correlating
previous email messages browsed or flagged as spam with their classifications, a
user’s profile may include features such as a keyword list, rules, a blacklist,
a whitelist, etc. Ontological relationships between email classifications are
used to infer topics that might not have been specified explicitly.


When the system starts, a uniform profile is assigned to every user. Each
individual user profile is then updated on a daily basis, or upon acceptance
of recommendations or receipt of a false positive/negative report from the
user. The system does this by unobtrusively monitoring user behavior.
After user profiles are updated, the similarities between the changed profiles
and other user profiles need to be updated, followed by an update of the
recommendation groups for users.
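Recomputing a recommendation group after a profile update could look like the
sketch below. It assumes that a profile is a topic-to-interest mapping, that
profile similarity is cosine similarity, and that the group is the k most similar
users; all three choices and the names used are hypothetical.

```python
import math

def profile_similarity(p, q):
    """Cosine similarity between two topic -> interest dictionaries."""
    dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
    norm = (math.sqrt(sum(x * x for x in p.values()))
            * math.sqrt(sum(x * x for x in q.values())))
    return dot / norm if norm else 0.0

def recommendation_group(user, profiles, k=2):
    """Return the k users whose profiles are most similar to that of `user`."""
    others = [u for u in profiles if u != user]
    others.sort(key=lambda u: profile_similarity(profiles[user], profiles[u]),
                reverse=True)
    return others[:k]

profiles = {
    "alice": {"finance": 1.0, "travel": 0.2},
    "bob": {"finance": 0.9, "travel": 0.1},
    "carol": {"sports": 1.0},
    "dave": {"travel": 1.0, "finance": 0.1},
}
print(recommendation_group("alice", profiles))
```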

User interest feedback specifies a level of interest or dislike in a topic. This
feedback enables the spam filter to track concept drift in spam and to be retrained
in the case of false positives [15]. We are going to develop a profiling algorithm
that automatically adjusts user profiles to match any topic interest/dislike levels
declared via profile feedback. An instance of spam for a specific class may add a
percentage of its value to the super-class.
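The super-class rule can be sketched as a walk up the class hierarchy. The
ontology is assumed here to be a simple child-to-parent mapping, and the 50% share
passed to each super-class is an illustrative parameter; the text says only that
"a percentage" is added. The topic names are hypothetical.

```python
def add_interest(profile, topic, value, parent_of, share=0.5):
    """Add `value` to `topic` and a shrinking share of it to each ancestor class."""
    while topic is not None and value > 0:
        profile[topic] = profile.get(topic, 0.0) + value
        value *= share                # assumed fraction passed up one level
        topic = parent_of.get(topic)  # None once the root class is reached
    return profile

# Hypothetical fragment of an email-topic ontology (child -> super-class).
parent_of = {"stock tips": "finance", "finance": "commercial"}

profile = add_interest({}, "stock tips", 1.0, parent_of)
print(profile)
```

In this way a user who flags stock-tip messages as spam also gains a weaker
dislike for finance mail in general, even if that topic was never declared.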

A time-decay function and other existing profiling algorithms may also be
invoked to find current interests.
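As one possible time-decay function, the sketch below discounts each piece of
interest evidence exponentially with its age. The exponential form and the 30-day
half-life are assumptions for illustration only; the proposal does not specify
the decay function.

```python
import math

def decayed_interest(events, now, half_life_days=30.0):
    """events: list of (day_observed, value); older evidence counts for less."""
    rate = math.log(2) / half_life_days
    return sum(value * math.exp(-rate * (now - day)) for day, value in events)

# One observation today and one from 30 days ago (worth half as much).
print(decayed_interest([(30.0, 1.0), (0.0, 1.0)], now=30.0))
```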


3.4 Ontology construction


An ontology is a conceptualization of a domain into a human-understandable
and machine-readable format consisting of entities, attributes, relationships, and
axioms [14]. Ontologies can provide a rich conceptualization of the working
domain of an organization, representing the main concepts and relationships of
the work activities.

There are two ways to accomplish this task: use an existing ontology, or
construct one based on the classification of email messages.


Although there are many topic-specific messages in a moderated mailing
list, most of them will fall into standard categories. We are going to use an
existing taxonomy with appropriate customizations based on specific domain
knowledge. For example, if the application is in an academic research domain,
we can add additional topics to the ontology for the target researchers.


Kazem Taghva and Julie Borsack reported on the construction of an
ontology that applies rules for identification of features to be used for email
classification [12].




4. Evaluation Measures



Suppose S and N are the numbers of spam and non-spam messages for each user.
S+ is the number of spam messages that are correctly classified by a system, and
S- is the number of spam messages misclassified as non-spam. Similarly, N+
denotes the number of non-spam messages that are correctly classified, and N-
the number of non-spam messages misclassified as spam.

The following measures can be calculated from these four values to measure the
performance of the system.

4.1 Filtering accuracy



4.1.1 Precision, recall and F-measure value

We can calculate precision, recall and F-measure values from these
four values to measure the performance of the system.
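The formulas are not spelled out in the text. A common reading in terms of the
counts defined at the start of Section 4, given here as an assumption (spam
treated as the positive class, and the balanced form of the F-measure), is:

```latex
\[
P = \frac{S^{+}}{S^{+} + N^{-}}, \qquad
R = \frac{S^{+}}{S^{+} + S^{-}} = \frac{S^{+}}{S}, \qquad
F = \frac{2PR}{P + R}
\]
```

Here P is the share of messages filtered as spam that really are spam, and R is
the share of all spam that the filter catches.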

4.1.2 Utility measures

In this measure, a loss value V is attached to each of S-, S+, N- and N+.
The overall performance of a system is the sum of the products of the four
message counts and their corresponding V values.
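A short sketch of this sum follows. The message counts and loss values are made
up for illustration (correct decisions cost nothing, and a lost legitimate message
is assumed to cost nine times as much as a missed spam message).

```python
def utility(counts, values):
    """Sum of count * loss value over the four outcomes S+, S-, N+ and N-."""
    return sum(counts[k] * values[k] for k in ("S+", "S-", "N+", "N-"))

counts = {"S+": 90, "S-": 10, "N+": 195, "N-": 5}  # hypothetical test results
values = {"S+": 0, "S-": 1, "N+": 0, "N-": 9}      # hypothetical loss values V
print(utility(counts, values))  # 10 * 1 + 5 * 9 = 55
```

Because V is a loss, a lower total indicates a better filter under these values.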

4.1.3 Weighted accuracy

Suppose misclassifying a non-spam message as spam is t times more
costly than the opposite misclassification. A version of accuracy
sensitive to the cost ratio t, written with the counts defined above, is:

$W_{acc} = \dfrac{t \cdot N^{+} + S^{+}}{t \cdot N + S}$
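As a worked example of weighted accuracy, the sketch below uses the counts
defined at the start of Section 4, with N+ as the number of correctly classified
non-spam messages and N as the total number of non-spam messages. The numbers are
hypothetical.

```python
def weighted_accuracy(s_plus, s_total, n_plus, n_total, t):
    """Accuracy in which each non-spam message counts as t messages."""
    return (t * n_plus + s_plus) / (t * n_total + s_total)

# 100 spam messages (90 caught) and 200 non-spam messages (195 kept).
print(weighted_accuracy(90, 100, 195, 200, t=1))  # plain accuracy, 285 / 300
print(weighted_accuracy(90, 100, 195, 200, t=9))  # 1845 / 1900
```

With t = 1 the measure reduces to ordinary accuracy; larger values of t make the
handling of non-spam messages dominate the score.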


4.2 Ontological inference in user profiling

We are going to compare the accuracy values defined in Section 4.1 with and
without ontological inference, to identify to what extent ontological profiling
helps improve the overall system performance.


4.3 User satisfaction



A higher accuracy value is taken to indicate higher user satisfaction.


(Some metrics for measuring recommendation performance are suggested by
Schein et al. [2002]. Herlocker et al. [3] gave an overview of the factors that
have been considered in evaluations and introduced new factors that they believe
should be considered.)




References

[1] Middleton, S. E., Shadbolt, N. R. and De Roure, D. C. (2004).
Ontological User Profiling in Recommender Systems. ACM Transactions on
Information Systems (TOIS) 22, pages 54-88.

[2] Joseph A. Konstan. Introduction to recommender systems: Algorithms and
Evaluation. ACM Transactions on Information Systems (TOIS) 22, pages 1-4.

[3] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen and
John T. Riedl. Evaluating collaborative filtering recommender systems.
ACM Transactions on Information Systems (TOIS) 22, pages 5-53.

[4] Fabrizio Sebastiani. Machine learning in automated text
categorization. ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.

[5] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.
Spyropoulos, and P. Stamatopoulos. Learning to filter spam e-mail: A
comparison of a naive bayesian and a memory-based approach. In Workshop
on Machine Learning and Textual Information Access, 4th European
Conference on Principles and Practice of Knowledge Discovery in
Databases (PKDD).

[6] Ken Lang. NewsWeeder: Learning to filter netnews. In Machine
Learning: Proceedings of the Twelfth International Conference, Lake
Tahoe, California, 1995.

[7] J. Canny. Collaborative filtering with privacy. In IEEE Symposium
on Security and Privacy, pages 45-57, Oakland, CA, May 2002.

[8] Mark Claypool, Anuja Gokhale, Tim Miranda, Pavel Murnikov, Dmitry
Netes and Matthew Sartin. Combining Content-Based and Collaborative
Filters in an Online Newspaper. ACM SIGIR Workshop on Recommender
Systems, Berkeley, CA (1999).

[9] Cranor, L. F. and LaMacchia, B. A. 1998. Spam! Communications of the ACM,
41(8):74-83.

[10] “MIT Spam Conference looks beyond filters” by Paul Roberts, IDG
News Service, Boston Bureau, January 20, 2004.

[11] Mitchell, T. M. 1997. Machine Learning. McGraw-Hill.

[12] Kazem Taghva, Julie Borsack, Jeffrey Coombs, Allen Condit, Steve
Lumos, Tom Nartker. Ontology-based Classification of Email. Proceedings
of the International Conference on Information Technology: Computers
and Communications (2003).

[13] Masahiro Morita and Yoichi Shinoda. Information filtering based on
User Behavior Analysis and Best Match Text Retrieval.

[14] Guarino, N. and Giaretta, P. 1995. Ontologies and knowledge bases:
Towards a terminological clarification. In Towards Very Large Knowledge
Bases: Knowledge Building and Knowledge Sharing, N. Mars, Ed. IOS
Press, 25-32.

[16] Deborah Fallows. Spam: How it is hurting email and degrading life
on the Internet. Pew Internet and American Life Project, October 2003.


Internet resources:


http://www.postini.com/newsletter/

http://spamconference.org/talks2004.html

http://jamesthornton.com/cf/

http://spamlinks.openrbl.org/filter-research.htm