Topic Analysis of Microblog About “Didi Taxi” Based on K-means Algorithm
American Journal of Information Science and Technology
Volume 3, Issue 3, September 2019, Pages: 72-79
Received: Jul. 30, 2019; Accepted: Aug. 16, 2019; Published: Sep. 2, 2019
Yonghe Lu, School of Information Management, Sun Yat-sen University, Guangzhou, China
Xin Xiong, School of Information Management, Sun Yat-sen University, Guangzhou, China
In the age of information and digitization, most users publish and obtain real-time information through microblogs on social networks. Effective text-mining methods make it possible to discover, organize, and use the valuable information hidden in the massive volume of short texts on these networks, and to identify hot topics in microblogs, which supports public opinion monitoring and marketing. Didi Taxi has become an essential travel option for many users. This paper applies the K-means clustering algorithm to the topic analysis of Sina microblog short texts about Didi Taxi. We crawled 17,226 microblog search results related to Didi Taxi, posted from April 2019 to June 2019. After data cleaning and preprocessing, 15,054 texts remained, which we represented using the TF-IDF method. Based on an evaluation of the silhouette coefficient, we set the text vector dimension to 300 and the number of K-means clusters to 34. We then extracted 8 topic clusters from the 34 clusters, covering the advantages and disadvantages of Didi Taxi and its development status. Finally, we examined the results by manual checking from a semantic perspective. This topic analysis helps reveal the public’s attitude toward Didi Taxi and provides a basis for future management decisions by the government or the company.
K-means Clustering, Topic Analysis, Microblog Text, Didi Taxi
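The pipeline summarized in the abstract (TF-IDF representation, K-means clustering, and choosing the number of clusters by silhouette coefficient) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The function names, the toy English tokens standing in for word-segmented microblog posts, the selection of features by document frequency, and the deterministic farthest-point seeding are all assumptions made for illustration. `max_features` plays the role of the text dimension (300 in the paper) and the candidate values of `k` play the role of the cluster count (34 in the paper).

```python
import math
from collections import Counter


def tfidf_vectors(docs, max_features=300):
    """Map tokenized documents to L2-normalized TF-IDF vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumption: keep the max_features terms with the highest document frequency.
    vocab = sorted(df, key=lambda t: (-df[t], t))[:max_features]
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] / max(len(doc), 1) * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, max_iter=100):
    """Lloyd's algorithm; seeding is deterministic farthest-point (an assumption)."""
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(vectors, key=lambda v: min(dist(v, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(v, centers[c])) for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(vectors, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the point's own cluster; b: to the nearest other cluster.
        a = sum(dist(vectors[i], vectors[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(vectors[i], vectors[j]) for j in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def best_k(vectors, candidates):
    """Return the candidate k with the highest silhouette, plus all scores."""
    scores = {k: silhouette(vectors, kmeans(vectors, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Hypothetical toy corpus: two themes (service complaints, price promotions).
docs = [
    ["driver", "late", "cancel", "order"],
    ["driver", "cancel", "order", "complaint"],
    ["late", "driver", "complaint"],
    ["coupon", "discount", "cheap", "ride"],
    ["discount", "coupon", "ride"],
    ["cheap", "ride", "coupon"],
]
vectors = tfidf_vectors(docs, max_features=300)
k, scores = best_k(vectors, candidates=(2, 3))
print("chosen k:", k, "silhouette by k:", scores)
```

On real data the same loop would run over a wider range of candidate cluster counts and dimensions, and the clusters for the chosen setting would still need manual inspection to decide which of them form meaningful topics, as the abstract describes.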
To cite this article
Yonghe Lu, Xin Xiong, Topic Analysis of Microblog About “Didi Taxi” Based on K-means Algorithm, American Journal of Information Science and Technology. Vol. 3, No. 3, 2019, pp. 72-79. doi: 10.11648/j.ajist.20190303.13
Copyright © 2019 Authors retain the copyright of this article.
This article is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Science Publishing Group