Plagiarism Detector

NLTK Library Tutorial

Detecting plagiarism in Arabic texts using the NLTK library

Plagiarism is a serious issue in academia and other fields where originality and authenticity are valued. With the abundance of information available online, it has become easy to copy and paste text from other sources without proper citation or attribution. This is where natural language processing (NLP) comes in as a powerful tool for detecting plagiarism and ensuring academic integrity. In this tutorial, we will explore how to use the Natural Language Toolkit (NLTK) in Python to build a simple plagiarism detector. We will start by preprocessing the text using tokenization, stop-word removal, and stemming, and then convert the text into numerical vectors using a TF-IDF representation. We will then calculate the cosine similarity between pairs of text vectors and set a threshold on the similarity score to determine whether a piece of text is plagiarized from another. By the end of the tutorial, you will have a basic understanding of how to use NLTK for plagiarism detection and will be able to customize and extend the detector to fit your specific needs.

Introduction

In this tutorial, we will demonstrate how to use the Natural Language Toolkit (NLTK) in Python to build a plagiarism detector for Arabic texts. We will start by preprocessing the Arabic texts using techniques such as tokenization, stemming, and stop-word removal. We will then use TF-IDF vectorization to convert the preprocessed texts into numerical vectors, which will enable us to calculate the cosine similarity between them. Finally, we will set a threshold for the similarity score to determine whether a text should be flagged as plagiarized. Throughout the tutorial, we will explain the NLP concepts and techniques used and provide code examples to illustrate the implementation. By the end of the tutorial, you will have a solid understanding of how to use NLTK for plagiarism detection in Arabic texts.
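
Before working through the Arabic pipeline, it helps to see the core idea in isolation. The following minimal sketch, using two made-up English sentences rather than our Arabic corpus, shows how TF-IDF vectorization and cosine similarity combine into a single overlap score:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Two toy documents with partial word overlap (placeholders, not real data).
    doc_a = "students submit original essays for review"
    doc_b = "students submit copied essays without citation"

    # Fit TF-IDF on both documents, then score their pairwise similarity.
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([doc_a, doc_b])
    score = cosine_similarity(vectors[0], vectors[1])[0][0]

    print(f"Cosine similarity: {score:.2f}")  # 1.0 = identical, 0.0 = no shared terms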

Implementation

To train our plagiarism detector for Arabic texts, we first need a dataset of Arabic texts to train on. We created a folder called "Trainingdata" and added a set of Arabic text files to it. Next, we used a sample file and checked its similarity against each training file, taking the training file with the highest similarity score as the potential source of plagiarism. This process enabled us to test the plagiarism detector and fine-tune the similarity threshold to achieve the desired level of accuracy.

To build the detector, we load several libraries: "os" and "glob" for file handling, "nltk" for NLP techniques such as tokenization, stemming, and stop-word removal, and "sklearn" for vectorization and cosine similarity calculation. We also download the necessary NLTK data (the stop-word lists and the punkt tokenizer models) to ensure the NLP techniques work correctly.


    import os
    import glob
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import SnowballStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Download the NLTK data used below: the stop-word lists and the
    # punkt tokenizer models that word_tokenize() relies on.
    nltk.download('stopwords')
    nltk.download('punkt')

Preprocessing the data

The preprocess function cleans and standardizes textual data before further analysis. First, it tokenizes the input text into individual words using the word_tokenize() function from the NLTK library. It then filters the tokens with a list comprehension, keeping only alphabetic words that do not appear in the Arabic stop-word set obtained from NLTK's stopwords corpus; the isalpha() check also discards punctuation and numbers. Next, the function uses NLTK's SnowballStemmer class to reduce the remaining words to their root forms; the Snowball stemmer is a widely used stemming algorithm that supports several languages, including Arabic. Finally, the stemmed words are joined back into a single string, which is returned as the preprocessed text.


    def preprocess(text):
        # Split the raw text into word tokens.
        tokens = word_tokenize(text)
        # Keep only alphabetic tokens that are not Arabic stop words.
        stop_words = set(stopwords.words('arabic'))
        words = [word for word in tokens if word.isalpha() and word not in stop_words]
        # Reduce each remaining word to its stem.
        stemmer = SnowballStemmer('arabic')
        words = [stemmer.stem(word) for word in words]
        # Rejoin the stems into a single normalized string.
        return " ".join(words)
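
As a quick sanity check, you can call preprocess on a short Arabic sentence. The sentence below is an arbitrary example; the exact stems produced depend on your NLTK version's stop-word list and stemmer:

    # Arbitrary example sentence, roughly "the students study at the university".
    sample = "الطلاب يدرسون في الجامعة"
    print(preprocess(sample))
    # Stop words such as "في" should be dropped and the remaining words
    # reduced to stems (exact output depends on the NLTK version).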
      

Training the Detector

The train function prepares the dataset for the plagiarism detector. It reads all text files in the specified folder and preprocesses each one with the preprocess function described above, which tokenizes the text into words, removes stop words and punctuation, and stems the words. The preprocessed texts are then vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) method, producing a numerical representation that captures the relative importance of each word in a text with respect to the entire dataset. The TfidfVectorizer from scikit-learn performs the vectorization. The function returns the resulting vectors, the fitted TfidfVectorizer object, and the filenames of the texts in the dataset; the compare function later uses these to match a given text against the trained dataset and identify the most similar file.


    def train(folder):
        # Collect every .txt file in the training folder.
        files = glob.glob(os.path.join(folder, "*.txt"))
        texts = []
        for file in files:
            with open(file, encoding='utf-8') as f:
                texts.append(f.read())

        # Clean each document, then learn TF-IDF weights over the whole corpus.
        preprocessed_texts = [preprocess(text) for text in texts]
        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(preprocessed_texts)

        return vectors, vectorizer, files
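
To see what the vectorizer has learned, you can inspect the fitted vocabulary. A small sketch, assuming scikit-learn 1.0 or later (where get_feature_names_out() is available) and a hypothetical "Trainingdata" path:

    # Hypothetical path; substitute your own training folder.
    vectors, vectorizer, files = train("Trainingdata")

    # Each column of the TF-IDF matrix corresponds to one stemmed term.
    print(f"{len(files)} documents, {vectors.shape[1]} terms")
    print(vectorizer.get_feature_names_out()[:10])  # first few learned terms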
      

Comparing the Sample Text

The compare function measures the similarity between a given file and every file in the trained dataset using cosine similarity. It first reads the contents of the file and preprocesses the text with the preprocess function. It then vectorizes the preprocessed text using the vectorizer fitted during training and calculates the cosine similarity between the resulting query vector and the vectors of all texts in the dataset. The index of the text with the highest similarity is used to look up its filename. Finally, the function returns that filename together with the percentage similarity between the given file and the most similar text, rounded to two decimal places.


    def compare(file, vectors, vectorizer, files):
        # Read and preprocess the query document.
        with open(file, encoding='utf-8') as f:
            text = f.read()
        preprocessed_text = preprocess(text)

        # Project the query into the TF-IDF space fitted during training,
        # then score it against every training document.
        query_vector = vectorizer.transform([preprocessed_text])
        similarities = cosine_similarity(query_vector, vectors)[0]

        # Pick the closest match and report its score as a percentage.
        index = similarities.argmax()
        percentage_similarity = round(similarities[index] * 100, 2)
        filename = os.path.basename(files[index])

        return filename, percentage_similarity
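
A single best match is sometimes not enough; the same similarity array can rank several candidate sources. The top_matches helper below is an illustrative extension, not part of the tutorial's code, built on the same preprocessing and vectorization steps:

    def top_matches(file, vectors, vectorizer, files, k=3):
        # Reuse the same preprocessing and vectorization as compare().
        with open(file, encoding='utf-8') as f:
            preprocessed = preprocess(f.read())
        similarities = cosine_similarity(
            vectorizer.transform([preprocessed]), vectors)[0]
        # argsort() sorts ascending, so reverse and take the first k indices.
        ranked = similarities.argsort()[::-1][:k]
        return [(os.path.basename(files[i]), round(similarities[i] * 100, 2))
                for i in ranked]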
    

Plagiarism Checking

This code runs the plagiarism checker on the file at the path stored in file, training on the set of text files in the folder directory. It first calls the train function to train the detector on the text files in the folder, then calls the compare function to match the given file against the trained dataset, and finally prints the name of the most similar file along with the percentage similarity to the console.


    # Paths to the training corpus and the sample file to check.
    folder = r"C:\Users\91801\Desktop\Trainingdata"
    file = r"C:\Users\91801\Documents\sample text.txt"

    # Train on the corpus, then compare the sample against it.
    vectors, vectorizer, files = train(folder)
    filename, percentage_similarity = compare(file, vectors, vectorizer, files)

    print(f"Most similar file: {filename}")
    print(f"Percentage similarity: {percentage_similarity}%")
           

Output for the sample text file:
Most similar file: text7.txt
Percentage similarity: 61.56%
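
The introduction mentioned setting a threshold on the similarity score, but the code above only reports the raw percentage, leaving the accept/reject decision to you. A minimal sketch of that final step, where the 70% cutoff is an arbitrary assumption you should tune on your own data:

    THRESHOLD = 70.0  # assumed cutoff; tune it on a labeled validation set

    if percentage_similarity >= THRESHOLD:
        print(f"Possible plagiarism: {filename} ({percentage_similarity}%)")
    else:
        print("No training file exceeds the similarity threshold.")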

References:

NLTK
simple-plagiarism-detection-using-nlp
Arabic-Plagiarism-Detector-Using-NLP