Deep Learning-Based Imbalanced Data Classification for Drug Discovery

dc.authorid: Korkmaz, Selçuk/0000-0003-4632-6850
dc.authorwosid: Korkmaz, Selçuk/AAU-4677-2020
dc.contributor.author: Korkmaz, Selcuk
dc.date.accessioned: 2024-06-12T11:17:42Z
dc.date.available: 2024-06-12T11:17:42Z
dc.date.issued: 2020
dc.department: Trakya Üniversitesi [en_US]
dc.description.abstract: Drug discovery studies have become increasingly expensive and time-consuming. In the early phase of drug discovery, an extensive search is performed to find drug-like compounds, which can then be optimized over time to become a marketed drug. One of the conventional ways of detecting active compounds is to perform a high-throughput screening (HTS) experiment. As of July 2019, the PubChem repository contained 1.3 million bioassays generated through HTS experiments. This makes PubChem a great resource for applying machine learning algorithms to develop classification models that detect active compounds for drug discovery studies. However, data sets obtained from PubChem are highly imbalanced, and this imbalance has a negative impact on the classification performance of machine learning algorithms. Here, we explored the classification performance of deep neural networks (DNN) on imbalanced compound data sets after applying various data balancing methods. We used five confirmatory HTS bioassays from the PubChem repository and applied one undersampling and three oversampling methods as data balancing methods. We used a fully connected, two-hidden-layer DNN model for the classification of active and inactive molecules. To evaluate the performance of the network, we calculated six performance metrics: balanced accuracy, precision, recall, F1 score, Matthews correlation coefficient, and area under the ROC curve. The study results showed that the effect of imbalanced data on network performance could be mitigated to a degree by applying the data balancing methods. The level of imbalance, however, has a negative effect on the performance of the network. [en_US]
dc.identifier.doi: 10.1021/acs.jcim.9b01162
dc.identifier.endpage: 4190 [en_US]
dc.identifier.issn: 1549-9596
dc.identifier.issn: 1549-960X
dc.identifier.issue: 9 [en_US]
dc.identifier.pmid: 32573225 [en_US]
dc.identifier.scopus: 2-s2.0-85091807294 [en_US]
dc.identifier.scopusquality: Q1 [en_US]
dc.identifier.startpage: 4180 [en_US]
dc.identifier.uri: https://doi.org/10.1021/acs.jcim.9b01162
dc.identifier.uri: https://hdl.handle.net/20.500.14551/24792
dc.identifier.volume: 60 [en_US]
dc.identifier.wos: WOS:000576675900011 [en_US]
dc.identifier.wosquality: Q1 [en_US]
dc.indekslendigikaynak: Web of Science [en_US]
dc.indekslendigikaynak: Scopus [en_US]
dc.indekslendigikaynak: PubMed [en_US]
dc.language.iso: en [en_US]
dc.publisher: Amer Chemical Soc [en_US]
dc.relation.ispartof: Journal Of Chemical Information And Modeling [en_US]
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member [en_US]
dc.rights: info:eu-repo/semantics/closedAccess [en_US]
dc.subject: Available Python Package [en_US]
dc.subject: Support Vector Machines [en_US]
dc.subject: Nearest-Neighbor Rule [en_US]
dc.subject: Neural-Networks [en_US]
dc.title: Deep Learning-Based Imbalanced Data Classification for Drug Discovery [en_US]
dc.type: Article [en_US]
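
The abstract lists balanced accuracy, precision, recall, F1 score, and Matthews correlation coefficient among its evaluation metrics. As an illustration only, the sketch below computes these five from binary labels using their standard textbook definitions. It is not the author's code: the function names (`confusion_counts`, `imbalance_metrics`), the 1 = active / 0 = inactive encoding, and the convention of returning 0.0 when a denominator is zero are all assumptions made here. Area under the ROC curve is left out because it needs predicted scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (assumed: 1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when den is zero (a convention assumed here)."""
    return num / den if den else 0.0


def imbalance_metrics(y_true, y_pred):
    """Threshold-based metrics commonly reported for imbalanced classification."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    recall = safe_div(tp, tp + fn)        # sensitivity on the active class
    specificity = safe_div(tn, tn + fp)   # recall on the inactive class
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy example with a 2:8 active-to-inactive ratio (made-up labels, not study data).
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    for name, value in imbalance_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.4f}")
```

On the toy labels, plain accuracy would be 0.8, while balanced accuracy and MCC come out noticeably lower. This shows why metrics of this kind are preferred when inactive compounds greatly outnumber active ones.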
