Plagiarism Detector

NLTK Library Tutorial

Detecting plagiarism in Arabic texts using the NLTK library

Plagiarism is a serious issue in academia and other fields where originality and authenticity are valued. With the abundance of information available online, it has become easy to copy and paste text from other sources without proper citation or attribution. This is where natural language processing (NLP) comes in as a powerful tool for detecting plagiarism and ensuring academic integrity. In this tutorial, we will explore how to use the Natural Language Toolkit (NLTK) in Python to build a simple plagiarism detector. We will start by preprocessing the text using tokenization, stop-word removal, and stemming, and then convert the text into numerical vectors using a TF-IDF representation. We will then calculate the cosine similarity between pairs of text vectors and set a threshold on the similarity score to determine whether a piece of text is plagiarized from another. By the end of the tutorial, you will have a basic understanding of how to use NLTK for plagiarism detection and will be able to customize and extend the detector to fit your specific needs.

Introduction

In this tutorial, we will demonstrate how to use the Natural Language Toolkit (NLTK) in Python to build a plagiarism detector for Arabic texts. We will start by preprocessing the Arabic texts using techniques such as tokenization, stemming, and stop-word removal. We will then use TF-IDF vectorization to convert the preprocessed texts into numerical vectors, which will enable us to calculate the cosine similarity between them. Finally, we will set a threshold for the similarity score to determine whether a text should be flagged as plagiarized. Throughout the tutorial, we will explain the NLP concepts and techniques used and provide code examples to illustrate the implementation. By the end of the tutorial, you will have a solid understanding of how to use NLTK for plagiarism detection in Arabic texts.
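
Before working through the Arabic pipeline, it helps to see the core idea in isolation. The following minimal sketch, using two made-up English sentences rather than our Arabic corpus, shows how TF-IDF vectorization and cosine similarity combine into a single overlap score:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Two toy documents with partial word overlap (placeholders, not real data).
    doc_a = "students submit original essays for review"
    doc_b = "students submit copied essays without citation"

    # Fit TF-IDF on both documents, then score their pairwise similarity.
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([doc_a, doc_b])
    score = cosine_similarity(vectors[0], vectors[1])[0][0]

    print(f"Cosine similarity: {score:.2f}")  # 1.0 = identical, 0.0 = no shared terms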

Implementation

To train our plagiarism detector for Arabic texts, we first need a dataset of Arabic texts to train on. We created a folder called "Trainingdata" and added a set of Arabic text files to it. Next, we used a sample file and checked its similarity against each training file, taking the training file with the highest similarity score as the potential source of plagiarism. This process enabled us to test the plagiarism detector and fine-tune the similarity threshold to achieve the desired level of accuracy.

To build the detector, we load several libraries: "os" and "glob" for file handling, "nltk" for NLP techniques such as tokenization, stemming, and stop-word removal, and "sklearn" for vectorization and cosine similarity calculation. We also download the necessary NLTK data (the stop-word lists and the punkt tokenizer models) to ensure the NLP techniques work correctly.


    import os
    import glob
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import SnowballStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Download the NLTK data used below: the stop-word lists and the
    # punkt tokenizer models that word_tokenize() relies on.
    nltk.download('stopwords')
    nltk.download('punkt')

Preprocessing the data

The preprocess function cleans and standardizes textual data before further analysis. First, it tokenizes the input text into individual words using the word_tokenize() function from the NLTK library. It then filters the tokens with a list comprehension, keeping only alphabetic words that do not appear in the Arabic stop-word set obtained from NLTK's stopwords corpus; the isalpha() check also discards punctuation and numbers. Next, the function uses NLTK's SnowballStemmer class to reduce the remaining words to their root forms; the Snowball stemmer is a widely used stemming algorithm that supports several languages, including Arabic. Finally, the stemmed words are joined back into a single string, which is returned as the preprocessed text.


    def preprocess(text):
        # Split the raw text into word tokens.
        tokens = word_tokenize(text)
        # Keep only alphabetic tokens that are not Arabic stop words.
        stop_words = set(stopwords.words('arabic'))
        words = [word for word in tokens if word.isalpha() and word not in stop_words]
        # Reduce each remaining word to its stem.
        stemmer = SnowballStemmer('arabic')
        words = [stemmer.stem(word) for word in words]
        # Rejoin the stems into a single normalized string.
        return " ".join(words)
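
As a quick sanity check, you can call preprocess on a short Arabic sentence. The sentence below is an arbitrary example; the exact stems produced depend on your NLTK version's stop-word list and stemmer:

    # Arbitrary example sentence, roughly "the students study at the university".
    sample = "الطلاب يدرسون في الجامعة"
    print(preprocess(sample))
    # Stop words such as "في" should be dropped and the remaining words
    # reduced to stems (exact output depends on the NLTK version).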
      

Training the Detector

The train function prepares the dataset for the plagiarism detector. It reads all text files in the specified folder and preprocesses each one with the preprocess function described above, which tokenizes the text into words, removes stop words and punctuation, and stems the words. The preprocessed texts are then vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) method, producing a numerical representation that captures the relative importance of each word in a text with respect to the entire dataset. The TfidfVectorizer from scikit-learn performs the vectorization. The function returns the resulting vectors, the fitted TfidfVectorizer object, and the filenames of the texts in the dataset; the compare function later uses these to match a given text against the trained dataset and identify the most similar file.


    def train(folder):
        # Collect every .txt file in the training folder.
        files = glob.glob(os.path.join(folder, "*.txt"))
        texts = []
        for file in files:
            with open(file, encoding='utf-8') as f:
                texts.append(f.read())

        # Clean each document, then learn TF-IDF weights over the whole corpus.
        preprocessed_texts = [preprocess(text) for text in texts]
        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(preprocessed_texts)

        return vectors, vectorizer, files
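
To see what the vectorizer has learned, you can inspect the fitted vocabulary. A small sketch, assuming scikit-learn 1.0 or later (where get_feature_names_out() is available) and a hypothetical "Trainingdata" path:

    # Hypothetical path; substitute your own training folder.
    vectors, vectorizer, files = train("Trainingdata")

    # Each column of the TF-IDF matrix corresponds to one stemmed term.
    print(f"{len(files)} documents, {vectors.shape[1]} terms")
    print(vectorizer.get_feature_names_out()[:10])  # first few learned terms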
      

Comparing the Sample Text

The compare function measures the similarity between a given file and every file in the trained dataset using cosine similarity. It first reads the contents of the file and preprocesses the text with the preprocess function. It then vectorizes the preprocessed text using the vectorizer fitted during training and calculates the cosine similarity between the resulting query vector and the vectors of all texts in the dataset. The index of the text with the highest similarity is used to look up its filename. Finally, the function returns that filename together with the percentage similarity between the given file and the most similar text, rounded to two decimal places.


    def compare(file, vectors, vectorizer, files):
        # Read and preprocess the query document.
        with open(file, encoding='utf-8') as f:
            text = f.read()
        preprocessed_text = preprocess(text)

        # Project the query into the TF-IDF space fitted during training,
        # then score it against every training document.
        query_vector = vectorizer.transform([preprocessed_text])
        similarities = cosine_similarity(query_vector, vectors)[0]

        # Pick the closest match and report its score as a percentage.
        index = similarities.argmax()
        percentage_similarity = round(similarities[index] * 100, 2)
        filename = os.path.basename(files[index])

        return filename, percentage_similarity
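
A single best match is sometimes not enough; the same similarity array can rank several candidate sources. The top_matches helper below is an illustrative extension, not part of the tutorial's code, built on the same preprocessing and vectorization steps:

    def top_matches(file, vectors, vectorizer, files, k=3):
        # Reuse the same preprocessing and vectorization as compare().
        with open(file, encoding='utf-8') as f:
            preprocessed = preprocess(f.read())
        similarities = cosine_similarity(
            vectorizer.transform([preprocessed]), vectors)[0]
        # argsort() sorts ascending, so reverse and take the first k indices.
        ranked = similarities.argsort()[::-1][:k]
        return [(os.path.basename(files[i]), round(similarities[i] * 100, 2))
                for i in ranked]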
    

Plagiarism Checking

This code runs the plagiarism checker on the file at the path stored in file, training on the set of text files in the folder directory. It first calls the train function to train the detector on the text files in the folder, then calls the compare function to match the given file against the trained dataset, and finally prints the name of the most similar file along with the percentage similarity to the console.


    # Paths to the training corpus and the sample file to check.
    folder = r"C:\Users\91801\Desktop\Trainingdata"
    file = r"C:\Users\91801\Documents\sample text.txt"

    # Train on the corpus, then compare the sample against it.
    vectors, vectorizer, files = train(folder)
    filename, percentage_similarity = compare(file, vectors, vectorizer, files)

    print(f"Most similar file: {filename}")
    print(f"Percentage similarity: {percentage_similarity}%")
           

Output for the sample text file:
Most similar file: text7.txt
Percentage similarity: 61.56%
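
The introduction mentioned setting a threshold on the similarity score, but the code above only reports the raw percentage, leaving the accept/reject decision to you. A minimal sketch of that final step, where the 70% cutoff is an arbitrary assumption you should tune on your own data:

    THRESHOLD = 70.0  # assumed cutoff; tune it on a labeled validation set

    if percentage_similarity >= THRESHOLD:
        print(f"Possible plagiarism: {filename} ({percentage_similarity}%)")
    else:
        print("No training file exceeds the similarity threshold.")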

References:

NLTK
simple-plagiarism-detection-using-nlp
Arabic-Plagiarism-Detector-Using-NLP