Events and Meetings of Italian Statistical Society, Advances in Latent Variables - Methods, Models and Applications

Finding Scientific Topics Revisited
Martin Ponweiser, Bettina Grün, Kurt Hornik

Last modified: 2013-06-14


Publishing statistical results obtained with computational tools requires that both the data and the code be provided, so that the results can be reproduced and verified with reasonable effort. However, this only allows the exact same analysis to be rerun. While rerunning helps to understand and retrace the steps of the analysis that led to the published results, it constitutes only a limited proof of reproducibility. In fact, for "true" reproducibility one might require that essentially the same results are obtained in an independent analysis. To check for this "true" reproducibility of the results of a text mining application, we replicate a study in which a latent Dirichlet allocation model was fitted to the document-term matrix derived from the abstracts of the papers published in the Proceedings of the National Academy of Sciences from 1991 to 2001. Comparing the results, we assess (1) how well the corpus and the document-term matrix can be reconstructed, (2) whether the same model would be selected, and (3) whether the analysis of the fitted model leads to the same main conclusions and insights. Our study indicates that the results of the original study are robust with respect to slightly different preprocessing steps and the use of different software to fit the model.
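As a rough illustration of check (1), the sketch below builds a document-term matrix from two toy documents under two slightly different preprocessing settings and measures how much the resulting vocabularies overlap. It is not the procedure used in the study: the function names `document_term_matrix` and `jaccard`, the toy documents, the stopword lists, and the minimum term length are all assumptions made for this example.

```python
import re
from collections import Counter


def document_term_matrix(docs, stopwords=frozenset(), min_len=1):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i.

    Hypothetical helper: lowercases, keeps alphabetic tokens only, then drops
    stopwords and tokens shorter than min_len.
    """
    tokenized = [
        [t for t in re.findall(r"[a-z]+", doc.lower())
         if t not in stopwords and len(t) >= min_len]
        for doc in docs
    ]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocab])
    return vocab, rows


def jaccard(a, b):
    """Jaccard similarity of two term collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Toy stand-ins for abstracts; not data from the study.
docs = ["The protein binds the receptor.", "Receptor signaling in the cell."]

# Two preprocessing variants that differ in stopword list and minimum length.
vocab_a, dtm_a = document_term_matrix(docs, stopwords={"the", "in"})
vocab_b, dtm_b = document_term_matrix(docs, stopwords={"the"}, min_len=5)

overlap = jaccard(vocab_a, vocab_b)
print(vocab_a, vocab_b, overlap)
```

A vocabulary overlap close to 1 would suggest that the two preprocessing variants reconstruct nearly the same set of terms; the per-document counts in the two matrices could be compared in the same spirit.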
