Web Page Classification Based on a Least Square Support Vector Machine with Latent Semantic Analysis
Chinese web page classification (WPC) is an active research area in data mining. To classify web pages effectively, we present a web page categorization method based on a least square support vector machine (LS-SVM) combined with latent semantic analysis (LSA). LSA uses Singular Value Decomposition (SVD) to obtain the latent semantic structure of the original term-document matrix, which addresses the problem of polysemous and synonymous keywords. LS-SVM is an effective method for learning classification knowledge from massive data, especially when labeled examples are costly to obtain. We adopt a novel web page representation and use a summarization algorithm to reduce the noise in web pages. A preliminary experimental comparison shows encouraging results.
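As a rough illustration of the two components named above, the following minimal sketch projects a toy term-document matrix into a latent space with a truncated SVD and then trains an LS-SVM classifier by solving its linear system. The toy vocabulary, the linear kernel, the latent dimension k = 2, the regularization value gamma = 10, and the function names lsa_project, lssvm_train and lssvm_predict are all assumptions made for this example; the abstract does not specify them, and the summarization and page-representation steps are not covered.

```python
import numpy as np

def lsa_project(term_doc, k):
    """Truncated SVD of a (terms x docs) matrix.

    Returns document coordinates in the k-dim latent space and a
    function that folds new documents into the same space.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]
    doc_coords = Vt[:k, :].T * s[:k]   # equals term_doc.T @ U_k

    def fold_in(new_docs):
        # new_docs is (terms x n_new); project onto the top-k term directions
        return new_docs.T @ U_k

    return doc_coords, fold_in

def lssvm_train(X, y, gamma=10.0):
    """LS-SVM classifier with a linear kernel (assumed here).

    Solves  [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]
    with Omega_ij = y_i * y_j * K(x_i, x_j).
    """
    n = len(y)
    K = X @ X.T
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = np.outer(y, y) * K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]             # alpha, b

def lssvm_predict(X_train, y_train, alpha, b, X_new):
    """Sign of sum_i alpha_i * y_i * K(x, x_i) + b."""
    return np.sign(X_new @ X_train.T @ (alpha * y_train) + b)

# Toy vocabulary (rows): car, automobile, engine, flower, petal, garden
term_doc = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)
labels = np.array([1.0, 1.0, -1.0, -1.0])   # +1: vehicles, -1: plants

docs_latent, fold_in = lsa_project(term_doc, k=2)
alpha, b = lssvm_train(docs_latent, labels)

# Two unseen pages: "car automobile" and "flower garden"
new_docs = np.array([
    [1, 0],
    [1, 0],
    [0, 0],
    [0, 1],
    [0, 0],
    [0, 1],
], dtype=float)
pred = lssvm_predict(docs_latent, labels, alpha, b, fold_in(new_docs))
print(pred)
```

Because LS-SVM replaces the inequality constraints of a standard SVM with equalities, training reduces to a single call to a linear solver rather than a quadratic program. In the sketch, documents that use "car" and documents that use "automobile" land on the same latent direction through their shared co-occurrence with "engine", which is the synonymy effect LSA is meant to capture.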