The Data Lab

Which Data Science Platform is Best? The Challenges of Explainable ML and AI

Technical Skills, Thought Leadership 31/08/2018

I recently finished a project testing the performance of Machine Learning algorithms across different data science platforms, with an explicit focus on explainability. In this post, I describe some of the criteria and the platforms used in the project.

In the era of the internet, sectors ranging from pharmaceuticals to molecular biology generate vast amounts of data. The need for automated tools that can analyse these data and surface business-critical insights is greater today than it has ever been.

Data science platforms, software hubs for integrating data and for building and deploying models, have seen a sharp rise in both supply and demand. But which of these platforms is ‘best’?

Without getting into a philosophical debate, the term ‘best’ will mean very different things to different individuals and organisations. However, platforms that offer interpretable Machine Learning (ML) and Artificial Intelligence (AI) outputs are generally preferred to ‘black box’ ones.

Platforms that met our criteria

The number of available tools keeps increasing as friction in the market for data science services disappears. How should one choose the right platform for data science tasks?
In developing this project, we first set out what would qualify as a data science platform. To make it onto our list, a platform had to allow for data pre-processing, feature selection, classifier choice and parameter tuning, and to support open source use. Of the 41 platforms identified, R and Python were set as the benchmarks, and the five others that satisfied the selection criteria were chosen for further study, in which supervised and unsupervised ML models were run on each.
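
The screening step described above can be pictured as a simple set check. The following is a minimal sketch, not the tooling used in the project: the capability labels paraphrase the criteria listed above, and the two entries (`platform_a`, `platform_b`) are hypothetical placeholders rather than any of the 41 platforms surveyed.

```python
# Capabilities a platform had to offer to make the list (paraphrased from the text).
REQUIRED = frozenset({
    "data_preprocessing",
    "feature_selection",
    "classifier_choice",
    "parameter_tuning",
    "open_source_support",
})


def meets_criteria(capabilities):
    """Return True only if every required capability is present."""
    return REQUIRED <= set(capabilities)


def missing_criteria(capabilities):
    """Return the required capabilities a platform lacks, sorted by name."""
    return sorted(REQUIRED - set(capabilities))


# Hypothetical entries, for illustration only.
candidates = {
    "platform_a": REQUIRED | {"visualisation"},
    "platform_b": {"data_preprocessing", "classifier_choice"},
}

shortlist = sorted(name for name, caps in candidates.items() if meets_criteria(caps))
print(shortlist)
```

Treating the criteria as a set makes the rule explicit: a platform is shortlisted only if it offers all of the required capabilities, and `missing_criteria` shows why a rejected platform fell short.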

R is an open source statistical platform that allows users to build very advanced ML models. Its functionality is widely known among industry users and academics alike. R offers a large collection of libraries that are fully open source, so the functions within them are transparent and can be inspected, and many are documented mathematically. R is arguably the best visualisation platform available. It runs on Windows, Mac OS X and Linux and can read data from a range of sources such as Microsoft Excel, Microsoft Access, MySQL, SQLite and Oracle. One of the remarkable features of the R language is its adaptability: because of R’s popularity, expressive power and transparency, developers keep building creative interfaces to software that complements R’s strengths. R’s memory management has been a drawback. However, recent advances in techniques allow developers to better understand R’s memory management and ultimately make functions and loops run faster.

Python is a widely used programming language and data science (DS) platform. It is also widely used for web and game development, and it supports object-oriented programming. Python is used in many different software packages and in sectors ranging from academia to pharmaceuticals. It is used at Google and by YouTube, Dropbox, Reddit, Quora, Disqus and FriendFeed, and organisations such as NASA, IBM and Mozilla also make use of it. Because it is open source and allows systems to be integrated quickly and effectively, Python is exceptionally appealing to startups and smaller companies.

H2O is an open source, in-memory, distributed ML platform. H2O is written in Java and uses a distributed key-value store internally, which lets data, models and other objects be accessed across different machines. Computation uses a distributed MapReduce framework built on the Java Fork/Join framework. Data is loaded into an H2O data frame that is distributed across the nodes of the cluster and held in memory. H2O’s intelligent data parser can guess the schema of incoming datasets and supports data ingest from multiple sources in various formats. H2O’s API lets scripts and external programs access the platform via JSON over HTTP. The API is used by H2O’s web interface (Flow UI), R binding (H2O-R) and Python binding (H2O-Python).
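
To make the map-reduce idea concrete, here is a minimal single-machine sketch of how a column split across nodes can be summarised without gathering all of the data in one place. It is plain Python and does not use or reproduce H2O's own code; the function names and the chunking scheme are assumptions made for illustration.

```python
from functools import reduce


def split_into_chunks(values, n_chunks):
    """Split a column into contiguous chunks, one per simulated node."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def map_chunk(chunk):
    """Map step: each node summarises only its own chunk as (sum, count)."""
    return (sum(chunk), len(chunk))


def combine(left, right):
    """Reduce step: merge two partial summaries into one."""
    return (left[0] + right[0], left[1] + right[1])


def distributed_mean(values, n_chunks=3):
    """Compute the mean of a column from per-chunk partial summaries."""
    if not values:
        raise ValueError("cannot take the mean of an empty column")
    partials = [map_chunk(chunk) for chunk in split_into_chunks(values, n_chunks)]
    total, count = reduce(combine, partials, (0, 0))
    return total / count


column = [2, 4, 6, 8, 10, 12, 14]
print(distributed_mean(column))
```

Each simulated node only ever returns a small `(sum, count)` pair, which is why this style of computation suits data frames that are spread across several machines.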

BigML allows developers and enterprises to build ML models. It offers an abstract, simple interface to a wide range of ML algorithms, which can be used in isolation at a very high level or combined, by means of domain-specific languages (DSLs), into new, more complex algorithmic workflows. It therefore covers the gamut from users who barely know the particulars of the algorithm they are invoking to savvy data scientists who can combine many algorithms in complex ways. BigML lets users tap the functionality of the platform through an open API (an API whose methods and outcomes are public and usable by anyone without any kind of reverse engineering) rather than a proprietary one. BigML connects to R via the ‘bigml’ package, which wraps the BigML API. However, this package is old and has not been updated. It includes methods that provide straightforward access to basic API functionality, as well as methods that accommodate local R data types and concepts. BigML also offers open source bindings for many other languages, such as Python, Java, Ruby and Clojure.

RapidMiner offers data mining and ML procedures including data loading and transformation, data pre-processing and visualisation, modelling, evaluation and deployment. It is written in Java. It also integrates the learning schemes and attribute evaluators of the Weka machine learning environment and the statistical modelling schemes of the R project. The platform benefits from an extensive built-in library and integrates with existing databases and with the most common open source DS programming languages, such as R and Python. Its Auto ML function automates the lifecycle of building ML models.

Dataiku’s Data Science Studio (DSS) platform can connect to many kinds of data store, which reduces the need for separate integration stages. DSS can detect erroneous entries and helps to cleanse, transform and enrich data automatically. Visualisation features make it easier to find correlations, relevant variables and patterns that help predict future outcomes and trends. DSS also has features that support collaborative data science, making it easier for different roles such as data engineers, business analysts, business stakeholders, hardcore coders, R users and Python users to work together on DS projects. The platform runs Python in memory.

Azure ML Studio allows users to develop models in the cloud. It is also integrated with R and Python environments, which makes it possible for data scientists to write and run R and Python programs in the cloud as well.

So, which platform is best?

One of the remarkable features of the chosen platforms is that they offer extensive support for collaborative data science at scale and allow for integration with the benchmark platforms.
This project lays a path towards comparing platforms by algorithm performance, rather than just testing algorithms in isolation. The need for automated ML and AI will keep rising, and research in this area will enable industry and academia to weigh the trade-offs between different platforms and decide which is best suited to their purpose. All in all, there is no one-size-fits-all platform that solves every problem. The measures in this project show that the chosen platforms provide functionality and results similar to those of the benchmark platforms.
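
As a rough illustration of what comparing platforms by algorithm performance can look like in practice, the sketch below scores predictions from two sources against the same held-out labels. It is an assumption-laden example, not the evaluation code from the project: the labels, the predictions and the names `benchmark` and `platform_x` are made up, and accuracy stands in for whichever measures a real comparison would use.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty test set")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def compare_platforms(y_true, predictions_by_platform):
    """Score each platform's predictions against the same held-out labels."""
    return {
        name: accuracy(y_true, preds)
        for name, preds in predictions_by_platform.items()
    }


# Made-up labels and predictions, for illustration only; not results from the project.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
predictions = {
    "benchmark": [1, 0, 1, 1, 0, 0, 0, 0],
    "platform_x": [1, 0, 1, 0, 0, 1, 0, 1],
}

scores = compare_platforms(y_true, predictions)
print(scores)
```

Holding the test labels fixed across platforms is the important design choice here: it keeps differences in the scores attributable to the platforms rather than to the data.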

For more information on the criteria, the platforms not included here, the ML algorithms and the results of this project, please see my thesis, and reference it appropriately if used.

If this project is of interest to you, please get in touch.

Tags: AI, ML


Get in touch

t: +44 (0) 131 651 4905

info@thedatalab.com


The Data Lab is part of the University of Edinburgh, a charitable body registered in Scotland with registration number SC005336.


© 2025 The Data Lab