A hybrid approach for extracting informative content from web pages

Uzun, Erdinc; Agun, Hayri Volkan; Yerlikaya, Tarik

dc.authorid	Uzun, Erdinç/0000-0003-4351-2244
dc.authorid	Agun, Hayri Volkan/0000-0002-4253-8920
dc.authorid	Yerlikaya, Tarik/0000-0002-9888-0151
dc.authorwosid	Uzun, Erdinç/AAG-5529-2019
dc.authorwosid	Yerlikaya, Tarık/AGP-6489-2022
dc.authorwosid	Agun, Hayri Volkan/P-5002-2019
dc.contributor.author	Uzun, Erdinc
dc.contributor.author	Agun, Hayri Volkan
dc.contributor.author	Yerlikaya, Tarik
dc.date.accessioned	2024-06-12T11:17:14Z
dc.date.available	2024-06-12T11:17:14Z
dc.date.issued	2013
dc.department	Trakya Üniversitesi	en_US
dc.description.abstract	Eliminating noisy information and extracting informative content have become important issues for web mining, search and accessibility. This extraction process can employ automatic techniques and hand-crafted rules. Automatic extraction techniques focus on various machine learning methods, but implementing these techniques increases time complexity of the extraction process. Conversely, extraction through hand-crafted rules is an efficient technique that uses string manipulation functions, but preparing these rules is difficult and cumbersome for users. In this paper, we present a hybrid approach that contains two steps that can invoke each other. The first step discovers informative content using Decision Tree Learning as an appropriate machine learning method and creates rules from the results of this learning method. The second step extracts informative content using rules obtained from the first step. However, if the second step does not return an extraction result, the first step gets invoked. In our experiments, the first step achieves high accuracy with 95.76% in extraction of the informative content. Moreover, 71.92% of the rules can be used in the extraction process, and it is approximately 240 times faster than the first step. (C) 2013 Elsevier Ltd. All rights reserved.	en_US
dc.identifier.doi	10.1016/j.ipm.2013.02.005
dc.identifier.endpage	944	en_US
dc.identifier.issn	0306-4573
dc.identifier.issn	1873-5371
dc.identifier.issue	4	en_US
dc.identifier.scopus	2-s2.0-84875710694	en_US
dc.identifier.scopusquality	Q1	en_US
dc.identifier.startpage	928	en_US
dc.identifier.uri	https://doi.org/10.1016/j.ipm.2013.02.005
dc.identifier.uri	https://hdl.handle.net/20.500.14551/24628
dc.identifier.volume	49	en_US
dc.identifier.wos	WOS:000319543800015	en_US
dc.identifier.wosquality	Q2	en_US
dc.indekslendigikaynak	Web of Science	en_US
dc.indekslendigikaynak	Scopus	en_US
dc.language.iso	en	en_US
dc.publisher	Elsevier Sci Ltd	en_US
dc.relation.ispartof	Information Processing & Management	en_US
dc.relation.publicationcategory	Article - International Refereed Journal - Institutional Faculty Member	en_US
dc.rights	info:eu-repo/semantics/closedAccess	en_US
dc.subject	Web Content Extraction	en_US
dc.subject	Template Detection	en_US
dc.subject	Web Cleaning	en_US
dc.subject	Web Learning Modeling	en_US
dc.subject	Searching Strategies	en_US
dc.title	A hybrid approach for extracting informative content from web pages	en_US
dc.type	Article	en_US

Collections

WoS Indexed Publications Collection
Scopus Indexed Publications Collection
