Clustering English news articles based on relevant domains: comparative study using three clustering algorithms
| dc.contributor.author | Disayiram, N. | |
| dc.contributor.author | Rupasingha, R.A.H.M. | |
| dc.date.accessioned | 2026-03-12T05:37:24Z | |
| dc.date.available | 2026-03-12T05:37:24Z | |
| dc.date.issued | 2022-10-28 | |
| dc.description.abstract | News tells us what is happening around us, and many people now read it on news websites. News falls into many categories, and each reader prefers different ones, so every category matters. A large volume of news is published on news websites every day. News sites typically categorize their articles, but no single site covers every category: most prioritise some categories while others receive less coverage. This makes it difficult for readers and content seekers to find the relevant section of a news site. Clustering English news articles by their relevant category offers a way to address these problems. This study aims to cluster news articles by relevant domain using machine-learning algorithms. We consider five domains: politics, sports, health, technology, and business. The data, collected online, was converted into vector format using term frequency-inverse document frequency (TF-IDF) vectorization. Three clustering algorithms, Expectation Maximization, Simple K-means, and agglomerative Hierarchical Clustering, were then applied separately to the news body and to the news headline. Accuracy was calculated using the classes-to-clusters evaluation mode in the WEKA tool. The results show that the Expectation Maximization algorithm achieved the highest accuracy, 87.9%, compared with 83.8% for the Simple K-means algorithm, while the Hierarchical Clustering method achieved the lowest accuracy. Comparing headlines with article bodies shows that clustering on the body of the news performed better than clustering on the headline. | |
| dc.identifier.citation | Proceedings of the Postgraduate Institute of Science Research Congress (RESCON) -2022, University of Peradeniya, P 102 | |
| dc.identifier.isbn | 978-955-8787-09-0 | |
| dc.identifier.uri | https://ir.lib.pdn.ac.lk/handle/20.500.14444/7632 | |
| dc.language.iso | en_US | |
| dc.publisher | Postgraduate Institute of Science (PGIS), University of Peradeniya, Sri Lanka | |
| dc.subject | Clustering | |
| dc.subject | Domain | |
| dc.subject | Machine learning | |
| dc.subject | News article | |
| dc.title | Clustering English news articles based on relevant domains: comparative study using three clustering algorithms | |
| dc.title.alternative | ICT, Mathematics and Statistics | |
| dc.type | Article |
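
The abstract names two computational steps, TF-IDF vectorization and classes-to-clusters evaluation, that can be illustrated with a minimal sketch. This is not the authors' WEKA workflow: the function names `tfidf_vectors` and `classes_to_clusters_accuracy` are hypothetical, tokenization is plain whitespace splitting, the TF-IDF weighting is the basic unsmoothed form, and the evaluation uses a simple majority-class mapping that may differ from WEKA's own class-to-cluster assignment. The clustering step itself is left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for whitespace-tokenized documents.

    Assumption: tf = term count / document length, idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero on empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def classes_to_clusters_accuracy(true_labels, cluster_ids):
    """Label each cluster with its majority class; return the fraction correct.

    Assumption: a simplified majority-vote mapping, not WEKA's exact procedure.
    """
    by_cluster = {}
    for label, cid in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cid, Counter())[label] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return correct / len(true_labels)
```

In use, the vectors from `tfidf_vectors` would be passed to a clustering algorithm of choice, and the resulting cluster ids compared against the known domain labels with `classes_to_clusters_accuracy`.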