School of Science and Technology 科技學院
Computing Programmes 電腦學系

Hyperpersonal Effect and Spam Mail Detection

HUI Chik Keung

Programme: Bachelor of Science with Honours in Computing
Supervisor: Dr. Andrew Lui
Areas: Artificial Intelligence and Robotics
Year of Completion: 2011

Objectives

The aim of the project is to investigate spam mail detection that takes the hyper-personal effect into account. The hyper-personal effect lets spammers selectively project their images and personalities when written words are their only means of presenting themselves.

My idea is that if spammers can selectively present their images in text, then information about those images can be extracted from the text. The images and personalities projected by a mail's author can serve as evidence of whether the mail is spam. The hyper-personal effect is therefore investigated in this project to identify the personalities and images projected by spammers, and the result is used to assist spam mail detection.

The project objectives are described in detail below:

  • To examine and identify how spammers present themselves when visual anonymity lets them selectively present themselves over email.
  • To implement a system that detects textual spam mail, and to recognize the text in image spam so that the same system can be used to detect it.
  • To evaluate which approach gives the best spam detection performance when the hyper-personal effect is considered, measured by precision and recall.
  • To develop a web-based system that demonstrates the spam mail detection system.

Background and Methodology

Spam mail has long been a problem, and it affects every email user and provider. “A report in May 2009 by Symantec suggests that 90.4% of emails were spam, Google (postini) rated spam volume to be around 90-95% in all four quarters of 2009, while a report from Microsoft in September 2009 rated spam to be at 97.3% of emails.” (Feng Qian, Abhinav Pathak, Y. Charlie Hu, Z. Morley Mao & Yinglian Xie, 2010).

There are many spam mail detection techniques. Currently, statistical machine learning filters are popular in many commercial anti-spam filters (Seongwook Youn & Dennis McLeod, 2009). Machine learning filters fall into three types: supervised, semi-supervised and unsupervised. Apart from the machine learning approach, word filters and rule-based scoring systems can also detect spam mail, by simply identifying any email that contains certain keywords.
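As a rough illustration of the simplest technique mentioned above, a word filter can be sketched as follows. This is a minimal sketch, not any particular product's filter: the keyword list and the function name `keyword_filter` are assumptions made for the example.

```python
import re

# Hypothetical keyword list, invented for illustration only.
SPAM_KEYWORDS = {"viagra", "lottery", "winner"}

def keyword_filter(mail_text, keywords=SPAM_KEYWORDS):
    """Return True if the mail contains any listed keyword as a whole word."""
    words = set(re.findall(r"[a-z0-9']+", mail_text.lower()))
    return not words.isdisjoint(keywords)

print(keyword_filter("You are a lottery WINNER!"))    # keywords present
print(keyword_filter("Minutes of Monday's meeting"))  # no keywords
print(keyword_filter("You are a l0ttery w1nner!"))    # obfuscated spelling
```

The third call shows the weakness of this approach: a small change in spelling is enough to avoid every keyword in the list.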

The existing solutions have limitations. One example is the false positive: a mail is classified as spam (a positive result for spam detection) although it is actually a legitimate message (so the result is false). False positives can occur with every spam filter, and they are unacceptable because an important email may be identified as spam and rejected. Besides, rule-based methods are difficult to evaluate and need to be maintained manually. If spammers know the keywords used to filter spam, they can avoid those keywords and so keep their spam ahead of the filter.
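To make the notion of a false positive concrete, the four possible outcomes of a spam decision can be counted as below. This is a minimal sketch on invented toy labels; the helper name `confusion_counts` is an assumption for the example.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count TP, FP, FN and TN, where True means 'spam'."""
    counts = Counter()
    for is_spam, flagged in zip(actual, predicted):
        if flagged and is_spam:
            counts["TP"] += 1
        elif flagged and not is_spam:
            counts["FP"] += 1  # legitimate mail wrongly rejected
        elif is_spam:
            counts["FN"] += 1  # spam that slipped through
        else:
            counts["TN"] += 1
    return counts

# Toy labels, invented for illustration.
actual    = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]

c = confusion_counts(actual, predicted)
false_positive_rate = c["FP"] / (c["FP"] + c["TN"])
print(dict(c), false_positive_rate)
```

The false positive rate is the share of legitimate mails that the filter rejects, which is the quantity a filter must keep low.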

Current spam mail filters need to keep up with spamming techniques. As spam keeps evolving, spam senders frequently change the terms they use in order to bypass keyword rule-based scoring filters. They may also insert non-spam content from books or newspapers into their mails to make them look legitimate.

We try to develop a more robust method by analyzing spamming behaviours and senders' intentions, which change infrequently. Our method has a theoretical part and an implementation part. The theoretical part builds a model of spam senders' self-presentation. The implementation part develops a spam mail detection system that detects spam based on the theoretical model.

In this project, we investigate a set of spam mails to find out how spam senders present themselves to deceive and persuade mail receivers. A model containing the hyper-personal effect properties that appear in the spam mails is produced. The features in this model are the personalities and images projected by spam senders when they are able to selectively present themselves over email. These images and personalities are extracted from the set of 300 spam mails used in this project. Using a different set of spam mails may produce a different model.

Based on our model, more than one image or personality can appear in the same mail.
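Because one mail can carry several images at once, pairs of images that tend to appear together can be counted directly. The following is a minimal sketch under stated assumptions: the label sets are invented, and "authority" and "friend" are placeholder names rather than images from the project's model.

```python
from collections import Counter
from itertools import combinations

# Hypothetical image labels per mail; invented for illustration.
mails = [
    {"businessman", "group"},
    {"group", "authority"},
    {"businessman", "group", "authority"},
    {"friend"},
]

def cooccurrence(mail_images):
    """Count how often each unordered pair of images shares a mail."""
    pair_counts = Counter()
    for images in mail_images:
        for pair in combinations(sorted(images), 2):
            pair_counts[pair] += 1
    return pair_counts

pairs = cooccurrence(mails)
for pair, n in pairs.most_common():
    print(pair, n)
```

Sorting each label set before pairing ensures that the same two images always produce the same key, whatever order they were detected in.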

There is potential for finding significant co-occurrences of images and personalities. In this project we group similar images and personalities into the same class, as shown below:

We end up with a model containing 7 images and personalities of spammers for the set of spam mails used in this project, as shown below:

At this stage, each image and personality in the spammers' self-presentation model is used on its own to detect spam mail. The result below shows how significant each image is, and it is used to weight the images in our model:

The spam mail decider is built using two approaches: rule-based scoring and a cascaded model. In the former, a mail is weighted according to the images and personalities it projects, and it is classified as spam if its weighting reaches a certain level. The details are shown below for illustration:
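Apart from the author's own illustration, the scoring idea can be sketched in a few lines of code. This is a minimal sketch, not the project's implementation: the image names, the weights and the default threshold are all invented for the example.

```python
# Hypothetical weights per image. The project derives its weights from how
# well each image alone detects spam; the values here are invented.
IMAGE_WEIGHTS = {"businessman": 1.5, "group": 1.0, "friend": 0.4}

def spam_score(images, weights=IMAGE_WEIGHTS):
    """Sum the weights of the distinct images projected in a mail."""
    return sum(weights.get(image, 0.0) for image in set(images))

def is_spam(images, threshold=2.0, weights=IMAGE_WEIGHTS):
    """Classify as spam when the score reaches the threshold."""
    return spam_score(images, weights) >= threshold

print(is_spam(["businessman", "group"]))  # two heavier images together
print(is_spam(["friend", "group"]))       # two lighter images together
```

Raising the threshold makes the decider stricter, so fewer mails are flagged and the ones that are flagged carry more evidence.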

The cascaded model approach resembles a decision tree. We design a cascaded model that considers the personalities and images projected by the mail at certain nodes. The mail is classified as spam if it finally reaches the node “Spam”, and as non-spam otherwise. The details are shown below for illustration:
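Apart from the author's own illustration, a cascade of this kind can be written as a chain of tests in which each test plays the role of one node. The node order and the conditions below are invented for the example and do not reproduce the project's model.

```python
def cascaded_classify(images):
    """Walk a small, hypothetical cascade; each test is one node."""
    images = set(images)
    # Node 1: a strong image decides immediately.
    if "businessman" in images:
        return "Spam"
    # Node 2: the group image alone is not enough; require a second image.
    if "group" in images:
        if len(images) >= 2:
            return "Spam"
        return "Non-spam"
    # Fall-through leaf: no relevant image was found.
    return "Non-spam"

print(cascaded_classify({"businessman"}))
print(cascaded_classify({"group", "friend"}))
print(cascaded_classify({"group"}))
print(cascaded_classify(set()))
```

Unlike the scoring approach, a cascade has no single threshold to tune; its behaviour is changed by reordering or rewriting the nodes.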

A web-based spam mail detector will be developed to demonstrate the spam mail detection system. It will be developed using JSP. After receiving a web request containing the unknown mail from the client, the server returns the analyzed mail. The analyzed mail highlights the sentences that project the personalities or images of spammers and includes the detection result. The main interface of the system is shown below:
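The sentence highlighting step can be sketched independently of the web layer. The project itself uses JSP; the sketch below is in Python for brevity, and its sentence splitter, the `toy_sentence_classifier` stand-in and the `<mark>` output format are all assumptions made for the example.

```python
import html
import re

def split_sentences(text):
    """Naive splitter on '.', '!' and '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def toy_sentence_classifier(sentence):
    """Hypothetical stand-in for the trained sentence classifier."""
    lowered = sentence.lower()
    if re.search(r"\b(we|our)\b", lowered):
        return "group"
    if "company" in lowered:
        return "businessman"
    return None

def highlight(text, classify=toy_sentence_classifier):
    """Return the mail as HTML with flagged sentences marked, plus the images found."""
    parts, images = [], set()
    for sentence in split_sentences(text):
        image = classify(sentence)
        safe = html.escape(sentence)
        if image is None:
            parts.append(safe)
        else:
            images.add(image)
            parts.append(f'<mark title="{image}">{safe}</mark>')
    return " ".join(parts), images

marked, found = highlight("Hello there. We offer the best prices! Reply today.")
print(marked)
print(found)
```

The set of images returned alongside the marked-up text is what a decider such as rule-based scoring would consume to produce the detection result.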

Evaluation

To evaluate how significant the images in our model are, each image projected by spammers is used on its own to detect spam mail. For example, we treat a mail as spam if the sender presents themselves as a group of people. We then evaluate the precision and recall of using that image alone to detect spam.
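The per-image evaluation described above amounts to treating "the mail shows this image" as the whole detector and measuring its precision and recall. The following is a minimal sketch on invented toy data; the function name `precision_recall` and the labels are assumptions for the example.

```python
def precision_recall(mails, image):
    """mails: list of (is_spam, images) pairs; a mail is flagged if it shows `image`."""
    tp = sum(1 for is_spam, images in mails if image in images and is_spam)
    fp = sum(1 for is_spam, images in mails if image in images and not is_spam)
    fn = sum(1 for is_spam, images in mails if image not in images and is_spam)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data, invented for illustration: (is_spam, images found in the mail).
mails = [
    (True, {"group", "businessman"}),
    (True, {"group"}),
    (True, {"friend"}),
    (False, {"group"}),
    (False, set()),
]

print(precision_recall(mails, "group"))
print(precision_recall(mails, "businessman"))
```

In this toy set the "businessman" image never flags a legitimate mail but appears in only one spam mail, so its precision is high and its recall is low, which is the pattern a strong but rarely used image would show.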

In this project, we use a set of 700 spam mails and 700 legitimate mails provided by SpamAssassin as the test set for evaluating our model and our spam mail detection system.

The result showing how significant the images in our model are is shown below:

From the above figure, some images and personalities in our model are significant indicators of spam while others are weak. For example, a spammer presenting himself as a businessman is fairly strong evidence that a mail is spam. However, few spammers try to project this image in their spam mail.

Using the developed sentence classifier, we extract all the images and personalities projected in a mail by analyzing every sentence in it. After analyzing our model, all of these images and personalities are weighted. We develop two approaches to make use of the information extracted by our system: the rule-based scoring approach and the cascaded model approach.

To evaluate the rule-based scoring approach, we analyze a set of 700 spam mails and 700 legitimate mails from 2005, provided by SpamAssassin, and classify these mails with the spam mail's score set from 1.0 to 3.3. The following table shows the precision and recall of the rule-based scoring approach:
Spam mail's score    Precision    Recall
1.0                  0.80663      0.80431
1.1                  0.80663      0.80431
1.2                  0.80663      0.80431
1.3                  0.80663      0.80431
1.4                  0.82689      0.76978
1.5                  0.83855      0.76978
1.6                  0.90322      0.72518
1.7                  0.92         0.69496
1.8                  0.928        0.66762
1.9                  0.928        0.66762
2.0                  0.928        0.66762
2.1                  0.928        0.66762
2.2                  0.93131      0.66331
2.3                  0.95205      0.6
2.4                  0.96163      0.57698
2.5                  0.97041      0.47194
2.6                  0.97248      0.45755
2.7                  0.97214      0.45180
2.8                  0.97214      0.45180
2.9                  0.97214      0.45180
3.0                  0.98083      0.44173
3.1                  0.98276      0.41007
3.2                  0.99528      0.30360

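
One way to read the table above is to ask which score gives a required precision at the smallest cost in recall. The sketch below copies the rows from the table; the selection rule and the function name `lowest_threshold_for_precision` are assumptions for the example, not part of the project.

```python
# (score, precision, recall) rows copied from the table above.
RESULTS = [
    (1.0, 0.80663, 0.80431), (1.1, 0.80663, 0.80431), (1.2, 0.80663, 0.80431),
    (1.3, 0.80663, 0.80431), (1.4, 0.82689, 0.76978), (1.5, 0.83855, 0.76978),
    (1.6, 0.90322, 0.72518), (1.7, 0.92, 0.69496), (1.8, 0.928, 0.66762),
    (1.9, 0.928, 0.66762), (2.0, 0.928, 0.66762), (2.1, 0.928, 0.66762),
    (2.2, 0.93131, 0.66331), (2.3, 0.95205, 0.6), (2.4, 0.96163, 0.57698),
    (2.5, 0.97041, 0.47194), (2.6, 0.97248, 0.45755), (2.7, 0.97214, 0.45180),
    (2.8, 0.97214, 0.45180), (2.9, 0.97214, 0.45180), (3.0, 0.98083, 0.44173),
    (3.1, 0.98276, 0.41007), (3.2, 0.99528, 0.30360),
]

def lowest_threshold_for_precision(target, rows=RESULTS):
    """Return the first row (lowest score) whose precision meets the target."""
    for score, precision, recall in rows:  # rows are ordered by score
        if precision >= target:
            return score, precision, recall
    return None  # no score in the table reaches the target

print(lowest_threshold_for_precision(0.90))
print(lowest_threshold_for_precision(0.95))
print(lowest_threshold_for_precision(0.999))
```

Since recall never increases as the score rises in this table, the lowest qualifying score is also the one that keeps the most recall.
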
The following figure shows the ROC curve for our system. The curve lies above the line of no discrimination, which shows that our system performs better than classifying a mail as spam or legitimate completely at random.
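
For readers unfamiliar with ROC curves, the sketch below shows how the points of such a curve can be computed by sweeping a threshold over mail scores. The scores and labels are invented toy data, not the project's results. The line of no discrimination is the diagonal where the true positive rate equals the false positive rate, corresponding to an area under the curve of 0.5.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data, invented for illustration: True means spam.
scores = [3.0, 2.5, 2.0, 1.5, 1.0, 0.5]
labels = [True, True, False, True, False, False]

points = roc_points(scores, labels)
print(points)
print(auc(points))
print(all(tpr >= fpr for fpr, tpr in points))  # on or above the diagonal
```
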

The cascaded model is another approach to make use of the information extracted by our system. We apply the above cascaded model to the test set of 700 spam mails and 700 legitimate mails from 2005 provided by SpamAssassin. The precision and recall are 0.8909426987060998 and 0.6935251798561151 respectively.

The following cascaded model is created to counter spam mails, based mostly on the group property of the mail sender:

Further evaluation compares our system with other spam mail detection approaches, and the result is shown below:

Approach          Precision    Recall
K-means, NNC      97.5%        63.4%
K-means, K-NNC    74.4%        97.6%
BIRCH, NNC        93.7%        58.1%
BIRCH, K-NNC      91.6%        52.3%

The above table shows part of the evaluation results for different text-based clustering approaches. It is quoted from the paper “A Novel Method of Spam Mail Detection using Text Based Clustering Approach” (M. Basavaraju & Dr. R. Prabhakar, 2010).

The following table shows part of our evaluation result for comparison:

Approach                    Precision    Recall
Rule-based Scoring (1.0)    80.6%        80.4%
Rule-based Scoring (2.0)    92.8%        66.8%
Rule-based Scoring (2.5)    97.0%        47.2%
Cascaded Model              97.4%        38.7%

 

Conclusion and Future Development

In this project, the Hyperpersonal Effect is investigated in order to implement a spam mail detection system. We detect spam mails by considering the personalities and images that spammers project when they are able to selectively present themselves under the Hyperpersonal Effect. We implement our system using Java and open-source software such as Weka, SpamAssassin and LingPipe. We measure the performance of our system by its precision and recall. Because our system does not use the keyword rule-based approach, spammers cannot bypass it simply by changing the terms they use in their mail. Also, we consider mail content at the sentence level, so that non-spam content intentionally inserted into the mail by spammers cannot confuse our system. It also has the potential to discover hidden spam mails.

Discovering more personalities and images and adding them to the spammers' self-presentation model may enable the system to detect more types of spam mail and improve the performance of our spam mail detection system. We may also examine the co-occurrence of the images and personalities in the model to produce more significant combinations. Besides, improving the precision of the sentence classifier may also improve the performance of our system.

Copyright Hui Chik Keung and Andrew Lui 2011

