A collaborative approach to Spam E-mail
Recommendation Using Ontological user profiling
Dr. Lina Zhou
Department of Information Systems
University of Maryland Baltimore County
Most definitions assume UCE (Unsolicited Commercial Email) and spam to be
synonymous. However, people do not classify emails as spam objectively, purely on
whether they adhere to a definition, but rather subjectively, on whether the email is of
interest to them. It is noteworthy that some consider e-mail to be spam even if they have
explicitly given the sender permission to contact them. This reflects some of the
conundrums of legislative debates on spam.
Server-based collaborative filtering has been very successful in many systems, but
they are in favor of ubiquitous computing settings. We propose a personalized,
collaborative approach to filtering spam using a recommendation system. In addition, we
explore a novel ontological approach to improve user profiling and hence the
recommendation accuracy.
1. Introduction
The increasing volume of unsolicited bulk e-mail (spam) has generated a need for
spam filters. There are a number of very successful collaborative filters
today, such as Vipul's Razor and Distributed Checksum Clearinghouse;
however, very few take careful account of domain or user personalization.
Studies show that people have their personal views on what constitutes spam.
A centralized server filter will cause false positives for users whose opinions
differ from the majority. In our research, we address the issue with the aid of user
profiling in collaborative spam email filtering.
The user profiling approach used by most recommendation systems is behavior-based,
commonly using a binary class model to represent what users find interesting and
uninteresting. However, a binary profile does not lend itself to sharing examples of
interest or integrating any domain knowledge that might be available.
We use the term ontology to refer to the classification structure and instances
within a knowledge base. We propose an approach to filtering spam email that combines
collaborative and content-based recommendation techniques and represents user
profiles in ontological terms.
2. Related work
Alan Gray and Mads Haahr analyzed three assumptions implicit in centralized spam
filtering and described how they affect spam filtering. They presented an architecture for
personalized, collaborative spam filtering and described the design and implementation
of a system based on the architecture.
Ernesto Damiani and his colleagues proposed a decentralized privacy-preserving approach
to spam filtering in their paper "P2P-Based Collaborative Spam Detection and Filtering."
They exploit digests to identify messages that are slight variations of one another, and
a structured peer-to-peer architecture between mail servers to collaboratively share
knowledge about spam.
3. Our Proposed Approach
We propose a personalized, collaborative approach to filtering spam using a
recommendation system. The approach is intended to improve the overall
performance of conventional server-based collaborative filters. Recommendations are
based on a group of users with similar user profiles; in this way, when a false
positive/negative happens to one user, it helps the others. A user's interest profile is
computed by correlating previous email messages with their classifications, and it forms
part of the user profile used to determine similarities among users. We use ontological
inference to improve user profiling and hence the recommendation accuracy.
The proposed approach leaves much flexibility for user input and feedback.
Users can define parameters such as cost values, interest topics and recommendation
groups; where any of these parameters is absent, the system can work in the background
and determine it by watching the user's behavior.
Message representation and classification
Messages are represented as term vectors, in which each term has a weight.
Similarities among messages can be computed using their term vectors. Features
may include the sender's ID (name, domain, IP address), subject, classification
and so on.
The classification of a message is determined by its content. Several
machine learning methods, such as k-Nearest Neighbor, Expectation-Maximization
(EM), Support Vector Machine, Naïve Bayes and decision trees, may
be applied to develop a classifier.
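
As a minimal sketch of the term-vector representation, the following Python
fragment builds a weighted term vector for a message and compares two messages
by cosine similarity. The source does not fix a weighting scheme or a similarity
measure; raw term frequency and cosine similarity are assumptions made here for
illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a message to {term: weight}; the weight is raw term frequency (an assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors; 0.0 if either vector is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Sender features such as domain or IP address could be added to the same vector
as extra terms, which is one possible way to combine them with content terms.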
To help identify spam that varies only slightly in content, a signature is
computed for every new email received and compared to those of known
messages. For the signature algorithms to be robust (or "fuzzy", i.e. to
ignore small randomizations in the text), they need to be designed so that
unimportant discrepancies between messages do not change the
signature. When a new message arrives for a user, its signature is first
compared to the known message signatures of this user; if there is a match, its
spamminess is taken from the known message and no further step is needed.
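
The signature lookup can be sketched as follows. The source does not name a
signature algorithm, so this example assumes a simple normalize-then-hash
scheme: lowercase the text, drop digits and punctuation, collapse whitespace,
and hash the result. It only tolerates that kind of surface noise; a production
fuzzy digest would need to be more tolerant. All names are hypothetical.

```python
import hashlib
import re


def signature(text):
    """Hash of the message after removing surface noise (assumed normalization)."""
    letters_only = re.sub(r"[^a-z\s]", "", text.lower())
    normalized = " ".join(letters_only.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def lookup(text, known_signatures):
    """Return the stored label for a matching signature, or None if unseen."""
    return known_signatures.get(signature(text))
```

A per-user dictionary mapping signatures to "spam" or "legitimate" is enough to
short-circuit the later recommendation step when a match is found.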
A threshold is set for every user based on the cost values assigned to false
positive/negative messages. When a message arrives, an accumulated value is
calculated based on the recommendations of a group of users with similar profiles.
If the accumulated value is greater than the threshold, the message is classified as
spam for this user.
A user can define the user group from which recommendations are made, or
let the system define one based on the similarities of user profiles. The
accumulated value for a message is calculated from the recommendation
value and the trust level of each user in the recommendation group for this user.
The system looks into the message folders of each user in the recommendation
group and assigns a recommendation value indicating how likely the message is to be
spam, by comparing its similarity with previous messages, either spam or legitimate.
(This procedure can be seen as a particular message classification problem in which
only two classes are available: spam or legitimate. In this way, a classifier is
trained for each user.) Individual sets of spam and non-spam messages are
maintained for each user's profile.
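
The thresholding step can be illustrated with a short sketch. Two choices here
are assumptions, not taken from the source: the accumulated value is computed as
a trust-weighted average of the group's recommendation values (each in [0, 1]),
and the per-user threshold is derived from the two cost values as
cost_fp / (cost_fp + cost_fn), the usual cost-sensitive decision rule.

```python
def cost_threshold(cost_fp, cost_fn):
    """Threshold above which labelling as spam has the lower expected cost (assumed rule)."""
    return cost_fp / (cost_fp + cost_fn)


def accumulated_value(recommendations):
    """Trust-weighted average of (recommendation_value, trust_level) pairs."""
    total_trust = sum(trust for _, trust in recommendations)
    if total_trust == 0:
        return 0.0
    return sum(value * trust for value, trust in recommendations) / total_trust


def is_spam(recommendations, cost_fp, cost_fn):
    """Classify as spam when the accumulated value exceeds the user's threshold."""
    return accumulated_value(recommendations) > cost_threshold(cost_fp, cost_fn)
```

With this rule, a user who considers a false positive nine times as costly as a
false negative gets a threshold of 0.9, so a split recommendation group does not
push a message into that user's spam folder.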
In addition to an individual interest profile computed by correlating
previous email messages, browsed or flagged as spam, with their classifications, a
user's profile may include features such as a keyword list, rules, a black list and a
white list. Ontological relationships between email classifications are used to infer
topics that might not have been specified explicitly.
When the system starts, a uniform profile is assigned to
every user. Each individual user profile is then updated on a daily basis, or upon
acceptance of a recommendation or receipt of a false positive/negative report
from the user. The system does this through unobtrusive monitoring of user behavior.
After a user profile is updated, the similarities between the changed profile and the
other user profiles need to be updated, followed by the recommendation groups.
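
One possible way to rebuild a recommendation group after a profile changes is
sketched below. It assumes that a profile is a mapping from topic to interest
weight, that profile similarity is cosine similarity, and that a group consists
of the k most similar users; none of these choices is specified in the source,
and the names are hypothetical.

```python
import math


def profile_similarity(p, q):
    """Cosine similarity between two {topic: weight} profiles."""
    dot = sum(weight * q.get(topic, 0.0) for topic, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def recommendation_group(user, profiles, k=2):
    """Names of the k users whose profiles are most similar to `user`'s."""
    scored = [
        (profile_similarity(profiles[user], profile), name)
        for name, profile in profiles.items()
        if name != user
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

Only the groups that involve the changed profile need recomputing, so the cost
of an update grows with the number of users rather than the number of user pairs.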
User interest feedback details a level of interest or dislike in a topic. This
feedback enables the spam filter to track concept drift in spam and to be retrained
in the case of false positives. We are going to develop a profiling algorithm that
automatically adjusts user profiles to match any topic
declared via profile feedback. An instance of spam for a specific class may add a
percentage of its value to the super-class.
A decay function and other existing profiling algorithms may also be
invoked to find current interests.
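
The idea of passing a percentage of a class's value up to its super-class can be
sketched as below. The ontology is assumed to be a tree given as a child-to-parent
mapping, and the share passed upward at each level (50% here) is an illustrative
parameter, not a value from the source. Topic names are hypothetical.

```python
def propagate_to_superclasses(interest, parent, share=0.5):
    """Return a profile in which every class also credits its ancestors.

    interest: {topic: value} observed directly from the user's messages.
    parent:   {topic: super_class}; topics without an entry are roots.
    share:    fraction of a contribution passed up one level (assumed value).
    """
    profile = dict(interest)
    for topic, value in interest.items():
        contribution = value
        node = parent.get(topic)
        while node is not None:
            contribution *= share
            profile[node] = profile.get(node, 0.0) + contribution
            node = parent.get(node)
    return profile
```

In this way a user who flags several pharmacy messages also accumulates interest
in the broader class above it, even though that class was never specified
explicitly. The loop assumes the parent mapping has no cycles.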
An ontology is a conceptualization of a domain into a human-readable
format consisting of entities, attributes, relationships, and
axioms. Ontologies can provide a rich conceptualization of the working
domain of an organization, representing the main concepts and relationships of
the work activities.
There are two ways to obtain the ontology: use an existing one, or construct one
based on the classification of email messages.
Although there are many topic-specific messages in a moderated mailing
list, most of them will fall into standard categories. We are going to use an existing
taxonomy with appropriate customizations based on specific domain knowledge.
For example, if the application is in the academic research domain, we can add
additional topics to the ontology for the target researchers.
Kazem Taghva, Julie Borsack and their colleagues reported on the construction of an
ontology that applies rules for the identification of features to be used for email
classification.
4. Evaluation Measures
Suppose S and N are the numbers of spam and non-spam messages for each user.
S+ is the number of spam messages that are correctly classified by the system, and S- the
number of spam messages misclassified as non-spam. Similarly, N+ denotes the number of
non-spam messages that are correctly classified, and N- the number of non-spam messages
misclassified as spam, so that S = S+ + S- and N = N+ + N-.
The following measures can be calculated from these four values to measure the
performance of the system.
4.1 Filtering accuracy
4.1.1 Precision, recall and F-measure
We can calculate precision, recall and F-measure values from these
four values to measure the performance of the system.
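
For concreteness, the three measures can be computed from the four counts as in
the sketch below, treating spam as the positive class. The balanced F-measure
(equal weight on precision and recall) is assumed, since the source does not
state a weighting; the function names are hypothetical.

```python
def precision(s_plus, n_minus):
    """Share of messages labelled spam that really are spam: S+ / (S+ + N-)."""
    labelled_spam = s_plus + n_minus
    return s_plus / labelled_spam if labelled_spam else 0.0


def recall(s_plus, s_minus):
    """Share of actual spam that was caught: S+ / (S+ + S-)."""
    actual_spam = s_plus + s_minus
    return s_plus / actual_spam if actual_spam else 0.0


def f_measure(p, r):
    """Balanced harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

The zero checks only guard against empty folders; they return 0.0 rather than
raising, which is a design choice of this sketch.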
4.1.2 Utility measures
In this measure, a loss value V is attached to each of S-, S+, N- and N+. The
overall performance of a system is the sum of the products of the four
message counts and their corresponding V values.
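
A minimal sketch of this utility computation follows. The dictionary keys and
the example loss values are illustrative assumptions; the source does not
specify how V is chosen.

```python
def total_loss(counts, loss):
    """Sum over the four outcomes of (number of messages) x (loss value V)."""
    outcomes = ("S+", "S-", "N+", "N-")
    return sum(counts[key] * loss[key] for key in outcomes)


# Hypothetical example: correct decisions cost nothing, a missed spam costs 1,
# and a legitimate message filed as spam costs 9.
example_counts = {"S+": 8, "S-": 2, "N+": 9, "N-": 1}
example_loss = {"S+": 0, "S-": 1, "N+": 0, "N-": 9}
example_total = total_loss(example_counts, example_loss)
```

Because V is a loss, a lower total indicates better performance under this
measure.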
4.1.3 Weighted accuracy
Suppose misclassifying a non-spam message as spam is t times more
costly than the symmetric misclassification. A version of accuracy
sensitive to this cost is:
Wacc = (t*N+ + S+) / (t*N + S)
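
A short sketch of the weighted accuracy computation, written with the non-spam
counts N+ and N from the definitions at the start of this section. Each non-spam
message is counted as if it were t messages; the function name is hypothetical.

```python
def weighted_accuracy(t, n_plus, n, s_plus, s):
    """(t*N+ + S+) / (t*N + S): accuracy with non-spam messages weighted t times."""
    denominator = t * n + s
    if denominator == 0:
        return 0.0
    return (t * n_plus + s_plus) / denominator
```

With t = 1 this reduces to ordinary accuracy; larger values of t penalize a
filter more heavily for each legitimate message it blocks.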
4.2 Ontological inference in user profiling
We are going to compare the accuracy values of Section 4.1 with those obtained in
the absence of ontological inference, to identify to what extent ontological inference
helps improve the overall system performance.
4.3 User satisfaction
A higher accuracy value indicates higher user satisfaction.
(Some metrics for measuring recommendation performance are suggested by
Schein et al. Herlocker et al. overviewed the factors that have been
considered in evaluations and introduced new factors that they believe
should be considered in evaluation.)
References
Middleton, S. E., Shadbolt, N. R. and De Roure, D. C. Ontological User Profiling in
Recommender Systems. ACM Transactions on Information Systems (TOIS).
Joseph A. Konstan. Introduction to recommender systems: Algorithms and Evaluation.
ACM Transactions on Information Systems (TOIS).
Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl.
Evaluating collaborative filtering recommender systems. ACM Transactions on
Information Systems (TOIS).
Fabrizio Sebastiani. Machine learning in automated text categorization. ACM
Computing Surveys, Vol. 34, No. 1, March 2002.
I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. Spyropoulos, and
P. Stamatopoulos. Learning to filter spam e-mail: A comparison of a naive Bayesian
and a memory-based approach. In Workshop on Machine Learning and Textual
Information Access, 4th European Conference on Principles and Practice of
Knowledge Discovery in Databases.
Ken Lang. NewsWeeder: Learning to filter netnews. In Machine Learning:
Proceedings of the Twelfth International Conference, Lake Tahoe, California, 1995.
John Canny. Collaborative filtering with privacy. In IEEE Symposium on Security
and Privacy, pages 45-57, Oakland, CA, May 2002.
Mark Claypool, Anuja Gokhale, Tim Miranda, Pavel Murnikov, Dmitry Netes and
Matthew Sartin. Combining Content-Based and Collaborative Filters in an Online
Newspaper. ACM SIGIR Workshop on Recommender Systems, Berkeley, CA.
Cranor, L.F. and B.A. LaMacchia. 1998. Spam! Communications of the ACM.
Paul Roberts. "MIT Spam Conference looks beyond filters." IDG News Service,
Boston Bureau, January 20, 2004.
Mitchell, T.M. 1997. Machine Learning.
Kazem Taghva, Julie Borsack, Jeffrey Coombs, Allen Condit, Steve Lumos,
Tom Nartker. Ontology-based Classification of Email. Proceedings of the
International Conference on Information Technology: Computers and Communications.
Masahiro Morita and Yoichi Shinoda. Information filtering based on User Behavior
Analysis and Best Match Text Retrieval.
Guarino, N. and Giaretta, P. 1995. Ontologies and knowledge bases: Towards a
terminological clarification. In Towards Very Large Knowledge Bases: Knowledge
Building and Knowledge Sharing, N. Mars, Ed. IOS Press.
Deborah Fallows. Spam: How it is hurting email and degrading life on the Internet.
Pew Internet and American Life Project, October 2003.