A New Stylometry Method Basing on the Numerals Statistic
International Journal on Data Science and Technology
Volume 3, Issue 2, March 2017, Pages: 16-23
Received: Mar. 22, 2017; Accepted: Apr. 25, 2017; Published: May 22, 2017
Andrei Viacheslavovich Zenkov, Department “Modelling of Controllable Systems”, Ural Federal University, Ekaterinburg, Russia
Larisa Anatolievna Sazanova, Department of Statistics, Econometrics and Computer Science, Ural State University of Economics, Ekaterinburg, Russia
A new method of statistical analysis of texts is proposed. We consider the frequency distribution of the first significant digits of numerals in connected authorial English-language texts. Benford's law is found to hold approximately for these frequencies, with a marked predominance of the digit 1. Deviations from Benford's law are statistically significant author-specific features that, under certain conditions, make it possible to address the problem of authorship and to distinguish between texts by different authors. Toward the end of the digit sequence {1, 2, …, 8, 9}, the digit distribution is subject to strong fluctuations and is therefore unrepresentative for our purpose. The proposed approach and its conclusions are supported by examples of computer analysis of works by W. M. Thackeray, M. Twain, R. L. Stevenson, and others. The results are confirmed using the non-parametric rank-based Mann-Whitney and Kruskal-Wallis tests as well as the parametric Pearson's chi-squared test.
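The comparison described in the abstract can be illustrated with a minimal sketch. Benford's law gives the expected share of leading digit d as log10(1 + 1/d), so the digit 1 is expected in about 30.1% of cases. The code below is an illustration only, not the authors' implementation: the function names (`benford_expected`, `first_digit_counts`, `chi_squared_vs_benford`) are hypothetical, and it assumes that numerals appear in the text as digit strings, which is a simplification because numerals written out as words are not detected.

```python
import math
import re
from collections import Counter


def benford_expected():
    """Expected share of each leading digit d = 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_counts(text):
    """Count first significant (non-zero) digits of digit-written numerals.

    Assumption: only numerals written with digits are considered, e.g.
    "1847", "1,200" or "0.05"; numerals spelled out as words are ignored.
    """
    counts = Counter()
    for token in re.findall(r"\d[\d,]*(?:\.\d+)?", text):
        for ch in token:
            if ch in "123456789":
                counts[int(ch)] += 1
                break  # only the first significant digit matters
    return {d: counts.get(d, 0) for d in range(1, 10)}


def chi_squared_vs_benford(counts):
    """Pearson chi-squared statistic of observed counts against Benford's law."""
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no numerals found in the text")
    expected = benford_expected()
    return sum(
        (counts[d] - n * expected[d]) ** 2 / (n * expected[d])
        for d in range(1, 10)
    )


if __name__ == "__main__":
    sample = "In 1847 he paid 1,200 pounds, 35 shillings and 0.05 of a guinea."
    observed = first_digit_counts(sample)
    print(observed)
    print(chi_squared_vs_benford(observed))
```

A sample this small says nothing about an author; the statistic becomes meaningful only for texts containing many numerals. Because the abstract notes that the frequencies of the last digits in the sequence fluctuate strongly, a practical comparison might restrict attention to the first few digits.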
Benford's Law, Statistic of Numerals, Text Attribution, Text Processing, English-Language Fiction, Mann-Whitney U Test, Pearson's Chi-Squared Test
To cite this article
Andrei Viacheslavovich Zenkov, Larisa Anatolievna Sazanova, A New Stylometry Method Basing on the Numerals Statistic, International Journal on Data Science and Technology. Vol. 3, No. 2, 2017, pp. 16-23. doi: 10.11648/j.ijdst.20170302.11
Copyright © 2017 Authors retain the copyright of this article.
This article is an open access article distributed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
