A New Stylometry Method Basing on the Numerals Statistic
International Journal on Data Science and Technology
Volume 3, Issue 2, March 2017, Pages: 16-23
Received: Mar. 22, 2017; Accepted: Apr. 25, 2017; Published: May 22, 2017
Andrei Viacheslavovich Zenkov, Department “Modelling of Controllable Systems”, Ural Federal University, Ekaterinburg, Russia
Larisa Anatolievna Sazanova, Department of Statistics, Econometrics and Computer Science, Ural State University of Economics, Ekaterinburg, Russia
A new method for the statistical analysis of texts is proposed. It considers the frequency distribution of the first significant digits of numerals in connected authorial English-language texts. These frequencies are found to follow Benford's law approximately, with a marked predominance of the digit 1. Deviations from Benford's law are statistically significant and author-specific, which, under certain conditions, makes it possible to address the problem of authorship and to distinguish between texts by different authors. Toward the end of the row {1, 2, …, 8, 9}, the digit frequencies are subject to strong fluctuations and are therefore unrepresentative for this purpose. The proposed approach and its conclusions are supported by examples of computer analysis of works by W. M. Thackeray, M. Twain, R. L. Stevenson, and others. The results are confirmed using the non-parametric rank-based Mann-Whitney and Kruskal-Wallis tests as well as the parametric Pearson's chi-squared test.
Benford’s Law, Statistic of Numerals, Text Attribution, Text Processing, English-Language Fiction, Mann-Whitney U Test, Pearson's Chi-Squared Test
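The digit-counting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the abstract does not say how numerals are extracted, and the sketch assumes that only digit-written numbers are counted (numerals spelled out in words are ignored). It also uses all nine digits in the chi-squared statistic, whereas the abstract notes that the higher digits fluctuate strongly.

```python
import math
import re
from collections import Counter

# Benford's law: expected probability of first significant digit d, d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digits(text):
    """Return the first significant (nonzero) digit of each digit-written number.

    Assumption: only numbers written with digits are considered.
    """
    digits = []
    for token in re.findall(r"\d[\d,]*(?:\.\d+)?", text):
        for ch in token:
            if ch in "123456789":
                digits.append(int(ch))
                break  # keep only the first nonzero digit of this number
    return digits

def digit_frequencies(digits):
    """Relative frequency of each first digit 1..9 (empty dict if no digits)."""
    n = len(digits)
    if n == 0:
        return {}
    counts = Counter(digits)
    return {d: counts.get(d, 0) / n for d in range(1, 10)}

def chi_squared_vs_benford(digits):
    """Pearson's chi-squared statistic of observed counts against Benford's law."""
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * BENFORD[d]) ** 2 / (n * BENFORD[d])
        for d in range(1, 10)
    )

sample = "In 1847 he paid 3 pounds, 12 shillings and 0.05 more; 1,200 in all."
print(first_digits(sample))
print(digit_frequencies(first_digits(sample)))
```

With nine digit classes the statistic has 8 degrees of freedom, so on a real corpus it would be compared with the corresponding chi-squared critical value; the toy sentence above is far too short for such a test to be meaningful.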
To cite this article
Andrei Viacheslavovich Zenkov, Larisa Anatolievna Sazanova, A New Stylometry Method Basing on the Numerals Statistic, International Journal on Data Science and Technology. Vol. 3, No. 2, 2017, pp. 16-23. doi: 10.11648/j.ijdst.20170302.11
Copyright © 2017 Authors retain the copyright of this article.
This article is an open access article distributed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.