Comparing programming languages for data analytics: Accuracy of estimation in Python and R

Chelsey Hill, Lanqing Du, Marina Johnson, B. D. McCullough

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Several open-source programming languages, particularly R and Python, are utilized in industry and academia for statistical data analysis, data mining, and machine learning. While most commercial software programs and programming languages provide a single way to deliver a statistical procedure, open-source programming languages have multiple libraries and packages offering many ways to complete the same analysis, often with varying results. Applying the same statistical method across these different libraries and packages can lead to entirely different solutions because their implementations differ. Therefore, reliability and accuracy should be essential considerations when choosing libraries and packages for statistical analysis in open-source programming languages. In practice, however, most users take reliability and accuracy for granted, assuming that their chosen libraries and packages produce accurate results. To this end, this study assesses the estimation accuracy and reliability of various Python and R libraries and packages by evaluating the univariate summary statistics, analysis of variance (ANOVA), and linear regression procedures using benchmarking data from the National Institute of Standards and Technology (NIST). Further, experimental results are presented comparing machine learning methods for classification and regression. The libraries and packages assessed in this study include the stats package in R and Pandas, Statistics, NumPy, statsmodels, SciPy, scikit-learn, and pingouin in Python. The results show that the stats package in R and the statsmodels library in Python are reliable for univariate summary statistics. In contrast, Python's scikit-learn library produces the most accurate results and is recommended for ANOVA. Among the libraries and packages assessed for linear regression, the results demonstrate that the stats package in R is more reliable, accurate, and flexible; thus, it is recommended for linear regression analysis. Further, we present results and recommendations for machine learning using R and Python. This article is categorized under: Algorithmic Development > Statistics; Application Areas > Data Mining Software Tools.
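The kind of accuracy check the abstract describes, comparing a library's estimate against a known reference value, can be sketched as follows. This is a minimal illustration, not the authors' procedure: the log relative error (LRE) measure, the helper names `log_relative_error` and `naive_sample_std`, and the three-point data set are all assumptions introduced here. The data are illustrative values with an analytically known mean and standard deviation, not NIST benchmark data.

```python
import math
import statistics


def log_relative_error(computed, reference):
    """Rough count of correct significant digits in `computed`.

    Hypothetical helper: LRE = -log10(|computed - reference| / |reference|),
    capped at 15 digits (about the limit of double precision).
    """
    if computed == reference:
        return 15.0
    if reference == 0:
        return min(15.0, -math.log10(abs(computed)))
    return min(15.0, -math.log10(abs(computed - reference) / abs(reference)))


def naive_sample_std(xs):
    """One-pass textbook formula; vulnerable to catastrophic cancellation."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    variance = (total_sq - total * total / n) / (n - 1)
    return math.sqrt(max(variance, 0.0))


# Illustrative data (NOT a NIST data set): large offset, tiny spread.
data = [1e9 + 1, 1e9 + 2, 1e9 + 3]
reference_mean = 1e9 + 2   # exact by construction
reference_std = 1.0        # exact sample standard deviation

mean_lre = log_relative_error(statistics.mean(data), reference_mean)
stdev_lre = log_relative_error(statistics.stdev(data), reference_std)
naive_lre = log_relative_error(naive_sample_std(data), reference_std)

print(f"statistics.mean  LRE: {mean_lre:.1f}")
print(f"statistics.stdev LRE: {stdev_lre:.1f}")
print(f"naive one-pass   LRE: {naive_lre:.1f}")
```

Two implementations of the same statistic can disagree sharply on such data: the one-pass formula subtracts two nearly equal large numbers and can lose most of its significant digits, while `statistics.stdev` works from exact rational arithmetic internally.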

Original language: English
Article number: e1531
Journal: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
Volume: 14
Issue number: 3
State: Published - 1 May 2024

Keywords

  • comparing Python and R
  • open-source software for data analytics
  • statistical software reliability and accuracy
