Cross-language text alignment: A proposed two-level matching scheme for plagiarism detection

Year 2020

Author Roostaee M.

Abstract The exponential growth of documents in various languages across the web, along with the availability of several editing and translation tools, has made cross-language plagiarism detection a challenging issue. Given its importance, the present study focuses on the task of cross-language text alignment, also known as detailed analysis, which works on the outputs of the source retrieval step of cross-language plagiarism detection systems. The paper proposes a two-level matching approach that considers both syntactic and semantic information to accurately align plagiarism fragments from the source and suspicious documents. At the first level, a vector space model that employs a dictionary based on multilingual word embeddings and a local weighting technique is used to extract a minimal set of high-potential candidate fragment pairs rather than considering all possible pairs of fragments. This step also includes a dynamic expansion technique that covers more candidate pairs in order to improve the system's recall. It is followed by a more precise algorithm that examines the candidate pairs at the sentence level using a graph-of-words representation of text. As a result, by modelling both the words and their relationships, an acceptable increase in the system's precision, which is the goal of the second level, is also observed. To identify evidence of plagiarism, i.e. potential cases of unauthorized text reuse, the algorithm tries to find maximum cliques in the match graph of the source and suspicious texts. With this two-level investigation, the approach is capable of discriminating true plagiarism cases from the original text. The experimental results on different datasets such as PAN-PC-11, PAN-PC-12, and SemEval-2017 show that the proposed cross-language text alignment approach significantly outperforms the state-of-the-art models and can be fed into an expert system for further improvement of cross-language plagiarism detection.
The source code is publicly available on GitHub for the purposes of reproducible research. © 2020 Elsevier Ltd
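
The two levels described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy dictionary `TOY_DICT`, the raw term-count weighting, the threshold of 0.5, the window size, and the way the match graph is built (shared words linked when they co-occur in both texts) are all assumptions made for illustration. The paper itself uses a dictionary derived from multilingual word embeddings, a local weighting technique, and a dynamic expansion step that are not reproduced here.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy dictionary (Spanish -> English), standing in for the
# embedding-based dictionary mentioned in the abstract.
TOY_DICT = {"el": "the", "gato": "cat", "come": "eats", "pescado": "fish",
            "perro": "dog", "duerme": "sleeps"}

def translate(tokens):
    """Map each token through the toy dictionary, keeping unknown words."""
    return [TOY_DICT.get(t, t) for t in tokens]

def cosine(a, b):
    """Cosine similarity of two token lists using raw term counts (assumed weighting)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def candidate_pairs(src_frags, susp_frags, threshold=0.5):
    """Level 1: keep only fragment pairs whose similarity reaches the threshold."""
    return [(i, j)
            for i, s in enumerate(src_frags)
            for j, t in enumerate(susp_frags)
            if cosine(translate(s), t) >= threshold]

def graph_of_words(tokens, window=3):
    """Undirected edges between distinct words inside a span of `window` tokens."""
    edges = set()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            if v != w:
                edges.add(frozenset((w, v)))
    return edges

def match_graph(src_tokens, susp_tokens, window=3):
    """Level 2 (assumed construction): shared words, linked if they co-occur in both texts."""
    src = translate(src_tokens)
    nodes = set(src) & set(susp_tokens)
    edges = graph_of_words(src, window) & graph_of_words(susp_tokens, window)
    return nodes, edges

def max_clique(nodes, edges):
    """Largest clique found by basic Bron-Kerbosch enumeration (no pivoting)."""
    adj = {n: set() for n in nodes}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = set(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(nodes), set())
    return best

# Illustrative data only; fragments are pre-tokenised lists of words.
src_frags = [["el", "gato", "come", "pescado"], ["el", "perro", "duerme"]]
susp_frags = [["the", "cat", "eats", "fish"], ["a", "bird", "sings"]]

pairs = candidate_pairs(src_frags, susp_frags)
i, j = pairs[0]
nodes, edges = match_graph(src_frags[i], susp_frags[j])
clique = max_clique(nodes, edges)
print(pairs, sorted(clique))
```

In this sketch the first level prunes the four possible fragment pairs down to the ones worth a closer look, and the second level treats a large clique in the match graph as evidence that the two fragments share both vocabulary and word relationships. Exhaustive clique enumeration is exponential in the worst case, so it is only practical here because the candidate fragments are short.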

DOI 10.1016/j.eswa.2020.113718

Journal Expert Systems with Applications

Source Scopus

Meta Tags

Source from CARI Engine

Provinces : Papua

Cities : KEEROM

Districts :

Hazards :

Sub DM Phase : Early Warning, Hazard Assessment

Sub Aspects :

Citing Articles

Source from Semantic Scholar

Candidate entity generation in lexical semantics

The Utilisation of TF-IDF and Cosine Similarity for Automating Course Waivers in Academic Institutions

SentiVol-GA: a volatility-scaled genetic fusion of predictive models and financial sentiment for adaptive stock forecasting

E-BERT: A Deep Learning and Local Alignment-Based Approach for Paraphrased Plagiarism Detection

Plagiarism types and detection methods: a systematic survey of algorithms in text analysis

A New Classification Model Using a Decision Tree Generated from Hyperplanes in Dimensional Space

Second-Order Text Matching Algorithm for Agricultural Text

Machine learning model for chatGPT usage detection in students’ answers to open-ended questions: Case of Lithuanian language

An effective text plagiarism detection system based on feature selection and SVM techniques

A Review on diverse algorithms used in the context of Plagiarism Detection

A Deep Learning Approach to Detect Plagiarism in Bengali Textual Content using Similarity Algorithms

A Simple and Effective Method of Cross-Lingual Plagiarism Detection

Automatic Plagiarism Detection Using Natural Language Processing

Important Arguments Nomination Based on Fuzzy Labeling for Recognizing Plagiarized Semantic Text

Cross-lingual sentence embedding for mining low-resources parallel sentences

Improving plagiarism detection in text document using hybrid weighted similarity

Transformer-Based Multilingual Language Models in Cross-Lingual Plagiarism Detection

Reliable plagiarism detection system based on deep learning approaches

Do Language Models Plagiarize?

An external plagiarism detection system based on part-of-speech (POS) tag n-grams and word embedding

Citation Worthiness Identification for Fine-Grained Citation Recommendation Systems

Duplicate product record detection engine for e-commerce platforms

Plagiarism detection and prevention: a primer for researchers

A Systematic Review of Multilingual Plagiarism Detection: Approaches and Research Challenges

A simple and efficient text matching model based on deep interaction

Hierarchical ensemble framework for detecting paraphrased near duplicates in scientific abstracts

References Articles

Source from Semantic Scholar