School of Science and Technology 科技學院
Computing Programmes 電腦學系

A Programming Assignment Plagiarism Detection and Automatic Marking System based on Tree-based Comparison

WONG Lai Shan

ProgrammeBachelor of Computing with Honours in Computing
SupervisorDr. Vanessa Ng
AreasTeaching and Classroom Support
Year of Completion2012


The aim of this project is to design and implement a program comparator. This comparator can be used to detect plagiarism and mark programs for assignments. A system is developed in this project for the evaluation of the comparator.

To achieve the aim, the project has defined a number of objectives as follows:

  • To design a program-comparing algorithm and develop a program comparator.
  • To implement the comparator as a plagiarism detection tool.
  • To implement the comparator as an automatic marking tool.
  • To evaluate the comparator. An online assignment management system including plagiarism detection and automatic marking is developed.

Background and Methodology

Programming assignments are usually considered as the major assessment component of a programming course. Since the programming assignments are usually done using computers, it is very easy for students to copy the source code from each other. Plagiarism in programming has been a serious problem within the years. Also, students may submit their assignments after the due date, so the teacher may have to mark the assignments over a long period of time. They may forget similar scripts they marked before. Moreover, a programming class usually contains a large number of students. So it is very difficult for teachers to find out similar pairs of programs from a large set of programs. Due to the serious problem of plagiarism in programming courses, the main focus of this project is to design and implement a program comparator to detect plagiarism.

Also it is time consuming for the teacher to mark the assignments one by one manually in a large set of programs. As the teacher only needs to check the program with the model answer for most programs, the program comparator can be implement for automatic marking. Most previous program comparing algorithms compare either syntax or semantic only, this is not accurate enough for program comparison. Therefore, an efficient and accurate program comparing tool comparing both syntax and semantic is highly demanded.

The marking of programming assignments involve two processes, accurate plagiarism detection and effective marking of the assignments. This project focuses on the analysis excising program comparing methods and how students carried out plagiarism in order to develop an accurate plagiarism detection tool. Marking of the assignments is a time consuming job. So this project also analyses existing auto marking methods and develop an effective automatic marking tool.

Metric-based is an earlier approach. This is also called attribute-counting-metric since it compares the frequency of keywords between pairs of programs. Different systems use different metrics. For example, the following are some of the metrics used by the Faidhi-Robinson System (Faidhi and Robinson, 1987), number of program statements, repetitive statement percentage, conditional statement percentage, number of modules, module contribution percentage (the number of code lines inside procedures, count of unique identifiers, average spaces percentage per line, average identifier length.

Later on, a more accurate and reliable approach structural-based comparison is developed. The programs are parsed and transformed to token streams. The following shows an example of Java source text and corresponding tokens:

Java Source CodeGenerated Tokens
public class Count{BEGINCLASS
public static void main(String [] args)VARDEF,BEGINMETHOD
int count = 0 ;VARDEF,ASSIGN
System.out.println(count+” chars.”);APPLY

Tree-based comparing approach is chosen because it can compare both the structure and the content of the programs. Each program is transformed into a parse tree with nodes and tokens. Then, a program comparator is developed. It compares the trees in pairs and produces a similarity score for each pair.

After the comparator is developed, it is applied as a plagiarism detection tool. A threshold rate is defined. If the similarity rate of a pair of programs is higher than the threshold rate, the pair is considered as plagiarized programs. The comparator can also be applied as an automatic marking tool. The more similar they are, the higher the score of the program. An online assignment management system is developed to evaluate the comparator. The system includes managing assignments, checking plagiarism and marking assignments. An example of a simple parse tree is shown below:

The following shows an example of the hierarchy of a parse tree:

There are four components in this system. They include assignment submission, plagiarism detection, automatic marking and assignment management. Only teachers can access the plagiarism detection and automatic marking sub-system. The following shows the system component diagram:


The system is a web-based online system that provides an easy and comfortable interface for students and teachers to open and submit programming assignments. Teachers can open new assignment submission and upload model answers, and then students can submit their assignments. All the files submitted and the assignment information is sent to the database. Apache Commons FileUpload Java libraries (The Apache Software Foundation) and Apache Commons IO Java libraries (The Apache Software Foundation) are added to the system to help upload submitted files to the database. The main interfaces of the system are shown below:

A program comparator for comparing the parse trees is developed. The comparator receives a set of parse trees and compares a pair of parse trees each time. The trees generated are complicated so they are broken down into sub-trees when comparing. Each sub-tree represents a sub-part of the program, which represents a package, an import, a class or an interface, a constructor, a field, a method, an expression, or a statement. The comparator uses depth-first approach to compare two sub-trees. It compares the nodes and the tokens in the two sub-trees and different similarity score is returned according to different conditions. Then, a sub-total similarity rate is calculated using breath-first approach for each sub-tree. The final similarity score of the pair of programs is the total scores for all sub-trees. Some extensions have also been added to the comparator to suit the conditions that should considered as similar.

The following shows an example of detected exact copy and the similar rate is 100%:

The comparator can also match different identifier name, and statements or methods in different order. The details is shown below:

In addition, statements with similar structure can also be matched, and the rate is high as shown below:

Different structures should have different nodes, since the nodes are different, the returned ratio should be relatively low. The following shows an example of programs with different structure:


The following shows some simple programs for comparison:

HelloWorld.javaContains a main function which assigns “HelloWorld” to a string “a” and prints a.
HelloWorld2.javaContains a main function which assigns “HelloWorld” to a string “b” and prints b.
IfElse.javaContains a main function with a simple if-else statement
IfElseCopy.javaA copy of
WhileLoop.javaContains a main function with a simple while loop and the program function is same as
ForLoop.javaContains a main function with a simple for loop and the program function is same as

The comparing results are shown below:

Compared FileCompared File DescriptionMetric-basedStructure-basedJPlagTree-based
IfElse.javaSame structure and same content100%100%100%100%     
HelloWorld.javaSame content but different variable100%100%100%97%     
ForLoop.javaSimilar structure and similar content100%80%10%100%     
IfElse.javaDifferent structure and different content66%75%10%46%    

The results for tree-based are more reasonable. The programs of the first pair are exactly the same, so their similar rate should be 100%. The programs of the second pair has different variable name, so the rate should be high (97%) but not as high as 100% as they are not exactly the same. The programs of the third pair have similar content and similar structure, so the rate should be high (100%). The programs of the forth pair has totally different structure and content, so the rate should be low (46%). As a result, the comparator based on tree-based approach works.

Other than accuracy, efficiency is also a concern. A few set of programs are tested and the time taken is measured. From the result, the time taken for plagiarism detection in our system is acceptable. The result is shown below:

SetNumber of successful parsed filesAverage Number of linesTime Taken for Plagiarism Detection
173382:30 min
260311:30 min
364532:00 min

Teachers and schoolmates have been invited to try the system and conduct a questionnaire for system evaluation. The goal of the evaluation is to measure the objective and subjective value of the system. So the main focus is on the accuracy, efficiency, functionality, and usability of the four sub-systems. Five teachers and Five students are invited. Since students won’t be able to use the plagiarism detection and automatic marking system, they do not need to rate on accuracy and efficiency. For the close-option questions, 5 rating options is provided from 1 to 5, which indicates from very bad, bad, acceptable, good and very good. The overall comments and ratings are good. The result is shown below:

Plagiarism Detection3.
Automatic Marking3.
Assignment Submission4.04.2
Assignment Management4.24.2

Conclusion and Future Development

In this project, different approaches on program comparing, plagiarism detection and automatic marking are reviewed in the background. According to the analysis, tree-based comparison is chosen for higher accuracy since it compares both the structure and the contents of the programs. A tree-based program comparator is designed and implemented as a plagiarism detection tool in an online assignment management system. Other than plagiarism detection, the comparator is also implemented as an automatic marking tool since the previous common approach is not applicable in some situations. By evaluating the comparator and the system, results show that the use of tree-base comparison is a feasible method. It gives a realistic and accurate comparison result. In addition, the system can relieve the workload of the markers by reducing the time and effort for marking programming assignments. Evaluation has shown that the system is generally useable while some areas of the system are still needed to improve.

There are a number of extensions can be added to this project. One suggestion is that it will be better if it can support multiple files for a question, such that it can call other classes while running test cases. It can also be improved to support more different languages such as C, since the program structure is similar.

From the feedback among the teachers, they find that it is not so convenient while using automatic marking, as they have to write another testing program. So an extension of this project includes automatic generating testing programs in the future.

Copyright Wong Lai Shan and Vanessa Ng 2012

Jonathan Chiu
Marketing Director
3DP Technology Limited

Jonathan handles all external affairs include business development, patents write up and public relations. He is frequently interviewed by media and is considered a pioneer in 3D printing products.

Krutz Cheuk
Biomedical Engineer
Hong Kong Sanatorium & Hospital

After graduating from OUHK, Krutz obtained an M.Sc. in Engineering Management from CityU. He is now completing his second master degree, M.Sc. in Biomedical Engineering, at CUHK. Krutz has a wide range of working experience. He has been with Siemens, VTech, and PCCW.

Hugo Leung
Software and Hardware Engineer
Innovation Team Company Limited

Hugo Leung Wai-yin, who graduated from his four-year programme in 2015, won the Best Paper Award for his ‘intelligent pill-dispenser’ design at the Institute of Electrical and Electronics Engineering’s International Conference on Consumer Electronics – China 2015.

The pill-dispenser alerts patients via sound and LED flashes to pre-set dosage and time intervals. Unlike units currently on the market, Hugo’s design connects to any mobile phone globally. In explaining how it works, he said: ‘There are three layers in the portable pillbox. The lowest level is a controller with various devices which can be connected to mobile phones in remote locations. Patients are alerted by a sound alarm and flashes. Should they fail to follow their prescribed regime, data can be sent via SMS to relatives and friends for follow up.’ The pill-dispenser has four medicine slots, plus a back-up with a LED alert, topped by a 500ml water bottle. It took Hugo three months of research and coding to complete his design, but he feels it was worth all his time and effort.

Hugo’s public examination results were disappointing and he was at a loss about his future before enrolling at the OUHK, which he now realizes was a major turning point in his life. He is grateful for the OUHK’s learning environment, its industry links and the positive guidance and encouragement from his teachers. The University is now exploring the commercial potential of his design with a pharmaceutical company. He hopes that this will benefit the elderly and chronically ill, as well as the society at large.

Soon after completing his studies, Hugo joined an automation technology company as an assistant engineer. He is responsible for the design and development of automation devices. The target is to minimize human labor and increase the quality of products. He is developing products which are used in various sections, including healthcare, manufacturing and consumer electronics.

Course CodeTitleCredits
 COMP S321FAdvanced Database and Data Warehousing5
 COMP S333FAdvanced Programming and AI Algorithms5
 COMP S351FSoftware Project Management5
 COMP S362FConcurrent and Network Programming5
 COMP S363FDistributed Systems and Parallel Computing5
 COMP S382FData Mining and Analytics5
 COMP S390FCreative Programming for Games5
 COMP S492FMachine Learning5
 ELEC S305FComputer Networking5
 ELEC S348FIOT Security5
 ELEC S371FDigital Forensics5
 ELEC S431FBlockchain Technologies5
 ELEC S425FComputer and Network Security5
 Course CodeTitleCredits
 ELEC S201FBasic Electronics5
 IT S290FHuman Computer Interaction & User Experience Design5
 STAT S251FStatistical Data Analysis5
 Course CodeTitleCredits
 COMPS333FAdvanced Programming and AI Algorithms5
 COMPS362FConcurrent and Network Programming5
 COMPS363FDistributed Systems and Parallel Computing5
 COMPS380FWeb Applications: Design and Development5
 COMPS381FServer-side Technologies and Cloud Computing5
 COMPS382FData Mining and Analytics5
 COMPS390FCreative Programming for Games5
 COMPS413FApplication Design and Development for Mobile Devices5
 COMPS492FMachine Learning5
 ELECS305FComputer Networking5
 ELECS363FAdvanced Computer Design5
 ELECS425FComputer and Network Security5