SDRS: A new lossless dimensionality reduction for text corpora

Published on Jul 1, 2020 in Information Processing and Management (3.892)
· DOI: 10.1016/J.IPM.2020.102249
Iñaki Velez de Mendizabal (Estimated H-index: 1)
Vitor Basto-Fernandes (Estimated H-index: 7; ISCTE-IUL: ISCTE – University Institute of Lisbon)
+ 2 Authors
Urko Zurutuza
Abstract: In recent years, most content-based spam filters have been implemented using Machine Learning (ML) approaches over token-based representations of textual content. Multiple performance enhancements have been introduced, but their impact has been virtually irrelevant. Recent studies have introduced synset-based content representations as a reliable way to improve classification, along with different ways of exploiting semantic information to address problems such as dimensionality reduction. These preliminary solutions have some limitations and enforce simplifications that must be gradually redefined in order to obtain significant improvements in spam content filtering. This study addresses the problem of feature reduction by introducing a new semantic-based proposal (SDRS) that avoids losing knowledge (lossless). Synset features can be semantically grouped by taking advantage of taxonomic relations (mainly hypernyms) provided by the BabelNet ontological dictionary (e.g. “Viagra” and “Cialis” can be summarized into the single feature “anti-impotence drug”, “drug” or “chemical substance”, depending on whether the generalization spans 1, 2 or 3 levels). To decide how many levels should be used to generalize each synset of a dataset, our proposal takes advantage of Multi-Objective Evolutionary Algorithms (MOEA), and particularly of the Non-dominated Sorting Genetic Algorithm (NSGA-II). We have compared the performance achieved by a Naive Bayes classifier using both token-based and synset-based dataset representations, with and without dimensionality reduction. As a result, our lossless semantic reduction strategy was able to find optimal semantic-based feature grouping strategies for the input texts, leading to better performance of Naive Bayes classifiers.
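The hypernym-based grouping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `HYPERNYM` map is a hypothetical hand-written stand-in for BabelNet lookups, and the function names `generalize` and `reduce_features` are invented for illustration. The per-synset level assignment is passed in by hand here, whereas the abstract states that SDRS searches for it with NSGA-II (that search is not shown). Summing the counts of merged features is one assumed reading of "lossless".

```python
from collections import Counter

# Hypothetical child -> parent hypernym map, built from the example in the
# abstract. A real system would query BabelNet instead.
HYPERNYM = {
    "Viagra": "anti-impotence drug",
    "Cialis": "anti-impotence drug",
    "anti-impotence drug": "drug",
    "drug": "chemical substance",
}


def generalize(synset, levels):
    """Climb up to `levels` hypernym steps, stopping early at the root."""
    for _ in range(levels):
        parent = HYPERNYM.get(synset)
        if parent is None:
            break  # no known hypernym: keep the most general synset reached
        synset = parent
    return synset


def reduce_features(counts, levels_per_synset):
    """Merge features whose synsets generalize to the same hypernym.

    Counts of merged features are summed rather than discarded
    (an assumption about what "lossless" means in practice).
    """
    reduced = Counter()
    for synset, count in counts.items():
        target = generalize(synset, levels_per_synset.get(synset, 0))
        reduced[target] += count
    return dict(reduced)


# Toy document: two drug mentions collapse into one feature at level 1.
document = {"Viagra": 2, "Cialis": 1, "meeting": 3}
levels = {"Viagra": 1, "Cialis": 1}  # candidate solution, chosen by hand here
print(reduce_features(document, levels))
```

With the level vector above, three features become two; raising both levels to 2 or 3 would map the same counts onto "drug" or "chemical substance" instead.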
  • References (35)
  • Citations (0)
#1 Zenun Kastrati (NTNU: Norwegian University of Science and Technology), H-Index: 6
#2 Ali Shariq Imran (NTNU: Norwegian University of Science and Technology), H-Index: 7
Last. Sule Yildirim Yayilgan (NTNU: Norwegian University of Science and Technology), H-Index: 7
Abstract This paper presents a semantically rich document representation model for automatically classifying financial documents into predefined categories utilizing deep learning. The model architecture consists of two main modules including document representation and document classification. In the first module, a document is enriched with semantics using background knowledge provided by an ontology and through the acquisition of its relevant terminology. Acquisition of terminology integrated...
1 Citation
#2 Abhishek Jamadar, H-Index: 1
Last. Siddharth Dudugu, H-Index: 1
As digital data collection techniques grow rapidly, they have given way to large amounts of data. More than 85% of present-day data is comprised of unsaturated and unstructured data. Determining definite patterns and trends when examining textual data is the biggest issue in text mining. The various domains associated together in data mining are text mining, web mining, graph mining, and sequencing mining. The selection of proper and correct technique of text mining enhances the hustle and...
1 Citation
#1 José Ramon Méndez (University of Vigo), H-Index: 16
Last. David Ruano-Ordás (University of Vigo), H-Index: 8
Abstract The Internet emerged as a powerful infrastructure for the worldwide communication and interaction of people. Some unethical uses of this technology (for instance spam or viruses) generated challenges in the development of mechanisms to guarantee an affordable and secure experience concerning its usage. This study deals with the massive delivery of unwanted content or advertising campaigns without the accordance of target users (also known as spam). Currently, words (tokens) are selected...
1 Citation
#1 Xuelian Deng (Xida: Guangxi University), H-Index: 1
#2 Yuqing Li (Xida: Guangxi University), H-Index: 1
Last. Jilian Zhang (JNU: Jinan University), H-Index: 13
Big multimedia data is heterogeneous in essence, that is, the data may be a mixture of video, audio, text, and images. This is due to the prevalence of novel applications in recent years, such as social media, video sharing, and location based services (LBS), etc. In many multimedia applications, for example, video/image tagging and multimedia recommendation, text classification techniques have been used extensively to facilitate multimedia data processing. In this paper, we give a comprehensive...
12 Citations
#1 Eman M. Bahgat (Ain Shams University), H-Index: 3
#2 Sherine Rady (Ain Shams University), H-Index: 5
Last. Ibrahim F. Moawad (Ain Shams University), H-Index: 7
Abstract Emails have become one of the major applications in daily life. The continuous growth in the number of email users has led to a massive increase of unsolicited emails, which are also known as spam emails. Managing and classifying this huge number of emails is an important challenge. Most of the approaches introduced to solve this problem handled the high dimensionality of emails by using syntactic feature selection. In this paper, an efficient email filtering approach based on semantic ...
#1 Berna Altınel (Marmara University), H-Index: 7
#2 Murat Can Ganiz (Marmara University), H-Index: 12
Abstract Automatic text classification is the task of organizing documents into pre-determined classes, generally using machine learning algorithms. Generally speaking, it is one of the most important methods to organize and make use of the gigantic amounts of information that exist in unstructured textual format. Text classification is a widely studied research area of language processing and text mining. In traditional text classification, a document is represented as a bag of words where the ...
9 Citations
#1 David Ruano-Ordás (University of Vigo), H-Index: 8
#2 Florentino Fdez-Riverola (University of Vigo), H-Index: 23
Last. José Ramon Méndez (University of Vigo), H-Index: 16
Abstract One of the most relevant problems affecting the efficient use of e-mail to communicate worldwide is the spam phenomenon. Spamming involves flooding Internet with undesired messages aimed to promote illegal or low value products and services. Beyond the existence of different well-known machine learning techniques, collaborative schemes and other complementary approaches, some popular anti-spam frameworks such as SpamAssassin or Wirebrush4SPAM enabled the possibility of using regular exp...
5 Citations
Currently, short communication channels are growing due to the huge increase in the number of smartphone and online social network users. This growth attracts malicious campaigns, such as spam campaigns, that are a direct threat to the security and privacy of the users. While most research focuses on automatic text classification, in this work we demonstrate the possibility of improving current short-message spam detection systems using a novel method. We combine personality recognit...
1 Citation
#1 Renato Moraes Silva (State University of Campinas), H-Index: 5
#2 Tulio C. Alberto (UFSCar: Federal University of São Carlos), H-Index: 3
Last. Akebo Yamakami (State University of Campinas), H-Index: 16
A new classifier is presented to detect undesired short text comments. The proposed approach is light, fast, multinomial and offers incremental learning. The impact of applying text normalization and semantic indexing is studied. The results indicate the proposed techniques outperformed most of the approaches. Text normalization and semantic indexing enhanced the classifiers' performance. The popularity and reach of short text messages commonly used in electronic communication have led spammers to us...
6 Citations
Jun 21, 2017 in HAIS (Hybrid Artificial Intelligence Systems)
#1 David Ruano-Ordás (University of Vigo), H-Index: 8
#2 Vitor Basto-Fernandes (Polytechnic Institute of Leiria), H-Index: 7
Last. José Ramon Méndez (University of Vigo), H-Index: 16
This paper presents an evolutionary multi-objective optimization problem formulation for the anti-spam filtering problem, addressing both the classification quality criteria (False Positive and False Negative error rates) and email message classification time (minimization). This approach is compared to single objective problem formulations found in the literature, and its advantages for decision support and flexible/adaptive anti-spam filtering configuration are demonstrated. A study is perform...
1 Citation
Cited By (0)