Selecting the optimal statistical programming language for a data science application
August 14, 2021 3282 words 16 minutes to read Last Updated: August 29, 2021
For many projects, the best statistical programming language is simply the one that you know best. Don’t let “analysis paralysis” prevent you from getting started - learning a statistical programming language is an applied, experiential activity. Carefully but quickly pick a language, and then write three programs in it: a light data analysis project, a somewhat heavier data mining project, and a prototypical deep learning project.
Let’s get started
Your enterprise has just launched a new data sciences division whose work activities will include artificial intelligence, statistical analysis, data analysis, predictive analytics, data mining, business intelligence, machine learning, and deep learning projects. You have been tasked with identifying, recommending, and selecting the optimal statistical programming language for the in-house data science applications.
Initially, your team must identify and develop a range of evaluation criteria, including business, technology, process, and training requirements. These will be exercised to assess the optimal solution for your firm.
Statistical programming is essential for a successful data science division. This programming must support the stakeholder to act upon big data through a life cycle of activities: pre-processing, analysis, visualization, prediction, and preservation.
A quick Google search by your team lead produced pages upon pages of other “experts’” informed (and biased) opinions about the best statistical programming language. You need to separate the signal from the noise and understand business implications like:
- one-time and recurring costs of the software,
- one-time costs of training your team and other stakeholders, and
- one-time and recurring costs of necessary computer resources.
The requirements analysis process
The requirements analysis process should encompass the following steps:
1. Identify key stakeholders.
2. Capture stakeholder requirements.
- Execute stakeholder interviews and focus groups;
- Develop “use cases;” and
- Build prototypes.
3. Categorize requirements.
- Functional requirements – features and functions for interaction by the stakeholders;
- Operational requirements – back-office operations to run and maintain the applications;
- Technical requirements – technical issues and configuration management concerns considered necessary to successfully implement the programming languages and associated processes; and
- Transformational requirements – steps and activities necessary to smoothly implement the programming language, requisite hardware, computer processing, and storage management.
4. Interpret and record requirements.
- Precisely define requirements in sufficient detail;
- Prioritize requirements from critical to optional;
- Execute an impact analysis associated with people, processes, and other applications;
- Execute a scenario analysis to resolve stakeholder conflict issues and a range of possible “futures;” and
- Execute a scenario analysis to assess the reliability and ease-of-use of the selected programming language.
5. Obtain sign off.
Sounds complicated. It can be, depending upon the complexity of anticipated use cases and scope of the projects: broad or deep, web application deployment or off-the-shelf report dashboard, data visualization, etc. One size does not fit all situations.
Many programming languages are geared to statistics, but each of them is suitable for specific circumstances. The remainder of this blog post will be purposefully limited to the critical pros and cons of the features associated with the statistical programming languages, important requirements, and software quality attributes.
Initially, the evaluation team must find a method to select a subset of all the statistical programming languages (SPLs) available. One approach would be to review the 2018 Kaggle Machine Learning and Data Science Survey, which ranked the SPLs on “regular use” by 18,827 respondents. Or the team could check out the infographic from Maryville University’s Bachelor’s in Data Science program. Even a glance at the PYPL Popularity of Programming Language Index, which was created by analyzing the volume of computer language tutorials searched on Google, would provide a jump start. Finally, the 5th Annual Developer Ecosystem Survey conducted by JetBrains presents exhibits from its analysis of responses from 31,743 developers in 183 countries.
With so many sources to cherry-pick from, we would suggest that an important first step would be to use the TIOBE Quality Indicator to select the software tools with the highest quality indicators. Software quality will be one of the most critical success factors for acquisition and deployment. TIOBE’s mission is to evaluate over a billion lines of “software code for its customers worldwide, real-time, each day.” The evaluation criteria are based upon the ISO 25010:2011 Standard. The resulting score ranges from an A (100) to an F (0). The standard automatically measures 350 standardized factors of the computer code.
The TIOBE Quality Indicator for Aug of 2021 assessed the top statistical programming languages (SPLs) used in data science applications to be:
TABLE 1: TIOBE Aug 2021 Top Statistical Programming Languages
The TIOBE Community Index rating is an important index. It is a popularity indicator based upon the global volume of skilled engineers, training courses, and 3rd-party vendors. When your firm adopts a new tool, having a vibrant community to interact with is a must. Let’s take a snapshot of some of the significant strengths and shortcomings of each SPL, in ranked sequence, that are widely accepted in the industry.
Pros and cons of statistical programming languages (SPLs)
The following table highlights several criteria important for your evaluation:
|TIOBE Ranking||Statistical Programming Language (SPL)||Open Source/ Open Standard/ Free?||Flexible SPL?||Clean, Easy, and Intuitive HCI?||High Speed/ Performance?||Easy to Learn?||Minimal Statistical Features||Useful Application Libraries||Statistical Libraries||Active or Limited Community?|
|2||Python||***||***||***||***||TensorFlow, PyTorch, Keras, Scikit-learn||Pandas, NumPy, SciPy, Matplotlib||Active|
|10||SQL||***||***||***||Variety of implementations: MySQL, SQLite, PostgreSQL||Active|
|14||R||***||Comprehensive R Archive Network (CRAN), Apache Spark, RMySQL||Active|
|17||MATLAB||***||***||***||Interfaces with Simulink, CarSim, PreScan||Expansive library of predefined functions||Active|
|21||SAS||***||***||***||Statistical analysis and machine learning libraries||Active|
|26||Julia||***||***||***||***||***||Extensive and written in Julia; facilitates calls to C and Fortran libraries||Limited|
|32||Scala||***||***||***||***||Apache Spark||Extensive libraries||Limited|
TABLE 2: Evaluation of Statistical Programming Languages
Credits: Standard C++ Foundation
Being really good at C++ is like being really good at using rocks to sharpen sticks
- C/C++ is a well-respected, mature programming language that is relevant as an SPL.
- C/C++ is a high-performing SPL. It is the best match for projects requiring massive scalability and significant speed.
- C/C++ permits users to focus and refine certain application processing because of its low-level orientation.
- C/C++ is a low-level language, reducing its popularity as an SPL and increasing the learning commitment.
- C/C++ does not offer well-known data visualization libraries like those available for Python and R.
- C/C++ does not encompass a strong data science and analytics community.
Credit: Python Software Foundation
The joy of coding Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code — not in reams of trivial code that bores the reader to death.
– Guido van Rossum, from Flanders, C. T. (2018). The Digital Research Skills Cookbook: An Introduction to the Research Bazaar Community. Research Platforms Services.
- Python is a rapidly growing high-level, interpreted language for general-purpose applications.
- Python is one of the most popular SPLs among data scientists. This level of popularity means significant support and a wide range of available resources for continuing education.
- Python has efficient high-level data structures and effective execution of object-oriented programming.
- Python’s popularity is founded on its clean, elegant, and easy-to-read syntax, especially when it comes to matrix operations.
- Python is especially popular with machine learning and deep learning projects thanks to its wide array of libraries.
- Python’s heavy reliance on indentation and slow performance annoy some developers, but the language’s flexibility outweighs the drawbacks for many applications.
- Python visualization code can be convoluted, although the language has excellent visualization libraries like Matplotlib.
- Python has less functionality than R.
Java is the SUV of programming tools. A project done in Java will cost 5 times as much, take twice as long, and be harder to maintain than a project done in a scripting language such as PHP or Perl. … But the programmers and managers using Java will feel good about themselves because they are using a tool that, in theory, has a lot of power for handling problems of tremendous complexity. Just like the suburbanite who drives his SUV to the 7-11 on a paved road but feels good because in theory, he could climb a 45-degree dirt slope.
- Famous for its “write once, run anywhere” mantra, Java is a universal programming language that may be used for pretty much any purpose. If the goal is to build complete applications with some statistical elements, Java’s platform-agnostic approach makes it the right selection for many situations.
- Java pays attention to security requirements.
- Java has implemented an efficient garbage collection feature.
- Java is one of the most popular and adopted languages in terms of mobile and web applications.
- Java is not the most potent tool specifically for statistical computations and analysis.
- Java does not offer data visualization libraries like those available for Python and R.
- Java does not encompass a strong data science and analytics community.
Credits: W3 Schools
Big data will spell the death of customer segmentation and force the marketer to understand each customer as an individual within 18 months or risk being left in the dust.
–Ginni Rometty, CEO IBM
- With SQL, users can easily and speedily create, update, and retrieve datasets containing millions of data points.
- As a non-procedural, declarative programming language, SQL also makes data manipulation easy since it doesn’t require the user to specify how operations should be performed. The user only needs to express what is needed via statements.
- SQL’s crucial advantage is language statement standardization.
- Strictly speaking, SQL (Structured Query Language, pronounced “S-Q-L” or “sequel”) isn’t intended for statistical computations. Rather, SQL is a supplemental tool for statistics that can facilitate the handling and querying of structured data.
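To make the declarative point above concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module to run a single SQL statement. The `sales` table, its columns, and the sample rows are hypothetical and invented purely for illustration. The query states what is wanted (the mean amount per region) and leaves the how to the database engine.

```python
import sqlite3


def average_by_group(rows):
    """Return {region: mean amount} computed by one declarative SQL query."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        # Hypothetical table used only for this illustration.
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        # The statement says WHAT is needed; the engine decides HOW to compute it.
        cursor = conn.execute(
            "SELECT region, AVG(amount) FROM sales "
            "GROUP BY region ORDER BY region"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


# Sample data invented for illustration only.
sample = [("east", 10.0), ("east", 30.0), ("west", 5.0)]
print(average_by_group(sample))
```

The same grouping written procedurally would need an explicit loop, a running total, and a count per region, which is the kind of detail the SQL statement lets the user omit.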
Credits: R Project
The idea behind rubber duck debugging is to pretend that you are trying to explain what your code is doing to an inanimate object, like a rubber duck. Often, the process of explaining it aloud is enough to help you find the problem.
–Russell Poldrack, from Poldrack, R. A. (2020). An R Companion to Statistical Thinking for the 21st Century.
- R was developed specifically for statistical and mathematical calculations and is perhaps the very best programming language in existence for statistical analysis.
- R is likely the optimal SPL if the enterprise’s use cases are heavily focused on statistical computations and do not involve much else. It provides efficient array operations and data-handling capabilities.
- R offers a vast range of packages for statistical tests, time-series analysis, linear and non-linear modeling, graphical plotting, and many other purposes. As of late July 2021, the Comprehensive R Archive Network (CRAN), a collection of R documentation and packages, among other things, contained close to 18,000 packages.
- R is both a procedurally-oriented and object-oriented language.
- R is arguably less intuitive and somewhat more difficult to learn than Python and many other SPLs.
- R tends to be a memory hog.
- R is not as flexible as Python.
- R lacks built-in web security and should not be used for calculations as a back-end server.
[Interviewer] Would you have done anything differently in the development of MATLAB if you had the chance?
[CM]: …Its evolution, from a primitive calculator to a modern programming language, has been very difficult. If I had started out at the beginning to design a programming language, MATLAB probably would have been something quite different. The original intention was that it would be easy to use and that it would have solid mathematics underlying it. I’m glad I did it the way I did, but if I knew what it was going to be today, I might not have done it that way.
–Cleve Moler (Father of MathWorks)
- MATLAB is particularly popular in academia, where faculty members, researchers, and students can use it for free. It is often used as a resource to accelerate knowledge acquisition of data science concepts.
- MATLAB is favored for its graphical user interfaces and wide range of visualization and statistical computation tools, and it even has machine learning and deep learning toolboxes.
- MATLAB interacts with 3rd-party software, like Simulink, CarSim, and PreScan.
- MATLAB remains a popular choice for numeric computing and statistics, although other languages – such as Python – today offer the same and perhaps better functionality than MATLAB.
- MATLAB is proprietary and requires the purchase of a license.
- MATLAB is an interpreted language and a memory hog when processing data. Slow computational speeds are apparent when processing a large dataset.
Credits: SAS Institute Inc.
Torture the data, and it will confess to anything.
–Ronald Coase, British Economist and Author
- One of the oldest languages designed for statistics.
- SAS is a highly reliable, stable, and secure platform for analytical processing.
- SAS has fallen behind with the introduction of advanced and open-source software.
- SAS lacks graphical representation capability and is therefore challenged to translate data into a visual form.
- SAS makes it difficult and expensive to incorporate advanced tools and features that already exist in other SPLs.
Credits: Julia Lab at MIT
We want a language that’s open-source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with an obvious, familiar mathematical notation like MATLAB. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as MATLAB, as good at gluing programs together as the shell. Something that is dirt simple to learn yet keeps the most serious hackers happy. We want it interactive, and we want it compiled.
- Julia is a relatively young programming language aimed at large-scale numerical computing.
- Julia’s syntax is as user-friendly as Python’s syntax.
- Julia is an ideal language for artificial intelligence and machine learning. It supports both parallel and distributed computing.
- Despite its comparatively short presence in the industry, Julia is a major threat to other languages and is becoming an increasingly popular tool in the data science sector.
- Julia’s strength is its speed and high performance (comparable to C). Julia is fast because it is compiled rather than interpreted. Furthermore, not only does Julia perform operations markedly quicker than other programming languages (most notably Java, Python, and R), but it is also easy to learn thanks to its light syntax.
- Julia permits stakeholders to call C and Fortran functions without glue code or compilation.
- Julia’s community is still small, which could severely impact developers’ ability to troubleshoot code or find solutions to their issues.
- Julia is an immature SPL and needs significant improvements. Julia’s tools are not as polished and reliable as those of other languages.
- Julia does not yet excel in visualization capabilities, but JuliaPlots offers a range of simple but powerful plotting options.
- Julia lags behind Python and R because its debugging tools are limited, which makes issues harder to identify.
Credits: Scala Center
Eric Raymond introduced the cathedral and bazaar as two metaphors of software development. The cathedral is a near-perfect building that takes a long time to build. Once built, it stays unchanged for a long time. The bazaar, by contrast, is adapted and extended each day by the people working in it. In Raymond’s work the bazaar is a metaphor for open-source software development. Guy Steele noted in a talk on “growing a language” that the same distinction can be applied to language design. Scala is much more like a bazaar than a cathedral, in the sense that it is designed to be extended and adapted by the people programming in it. Instead of providing all constructs you might ever need in one “perfectly complete” language, Scala puts the tools for building such constructs into your hands.
–Odersky, M., Spoon, L., & Venners, B. (2008). Programming in Scala (1st ed.). Scala Publications.
- Running on the Java Virtual Machine (JVM), Scala is effectively an extension of Java.
- Scala integrates with Java programs and addresses many of Java’s issues, bringing a lighter and more intuitive experience to users.
- Scala combines features of object-oriented and functional languages; and facilitates parallel processing on a large scale, making it ideal for working with high-volume data sets.
- Scala’s strongest statistics and data manipulation feature is Apache Spark – a toolset for large-scale data computations. Spark is written in Scala and is also available in Java, Python, R, and SQL, among other programming languages.
- Scala allows users to parallelize operations, dramatically increase computational performance, and improve hardware utilization.
- Scala’s syntax and type system require a steep learning curve.
- Scala has a small, limited developer community.
Other less popular statistical programming languages (SPLs) include Haskell, Swift, Octave, Perl, Lisp, and F#. The market penetration for these tools is low, so fewer individuals have gained experience or training. Of special note is RapidMiner. It is not an SPL from a definitional level but is a data science platform for AI and machine learning. This tool was selected as a Visionary in the Gartner 2021 Magic Quadrant for Data Science & Machine Learning. In addition, RapidMiner was selected as a Leader in The Forrester Wave™: Multimodal Predictive Analytics And Machine Learning Solutions, Q3 2020. This tool was not mentioned in the TIOBE Index for August 2021.
A great place to begin your evaluation is a review of Farshidi’s Multi-Criteria Decision-Making Model. Outside resources could prove useful in developing your team’s requirements analysis. A valuable approach to evaluation has been documented by Farshidi, Jansen, & Deldar (2021).
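As a rough illustration of multi-criteria scoring, the sketch below ranks candidates with a plain weighted sum. This is a deliberately simple stand-in and not the Farshidi, Jansen, & Deldar model itself. The function name `rank_languages`, the criteria, the weights, and the ratings are all hypothetical placeholders that your team would replace with the outcomes of its own requirements analysis.

```python
def rank_languages(weights, scores):
    """Rank candidates by weighted-sum score, highest first.

    weights: {criterion: importance}; normalized here so they sum to 1.
    scores:  {language: {criterion: rating on a shared scale, e.g. 1-5}}.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    ranked = []
    for language, ratings in scores.items():
        missing = set(weights) - set(ratings)
        if missing:
            raise ValueError(f"{language} lacks ratings for {sorted(missing)}")
        # Weighted sum: each rating counts in proportion to its criterion's weight.
        value = sum(weights[c] / total * ratings[c] for c in weights)
        ranked.append((language, round(value, 3)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Placeholder numbers only; they are not real assessments of any language.
weights = {"cost": 2, "learning_curve": 1, "performance": 1}
scores = {
    "Language A": {"cost": 5, "learning_curve": 4, "performance": 2},
    "Language B": {"cost": 1, "learning_curve": 3, "performance": 5},
}
for language, value in rank_languages(weights, scores):
    print(f"{language}: {value}")
```

A weighted sum is easy to explain to stakeholders at sign-off, but it lets a strong rating on one criterion mask a weak one on another, so critical requirements are usually better treated as pass/fail filters before scoring.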