Anyone who programs in R typically does so using the wonderful RStudio IDE. It's a neat, intuitive tool with excellent and regular maintenance updates. Plenty of tools for other languages have tried to copy RStudio's style, but to no avail; it stands out as one of the best coding tools in the community (not to mention it's open source!).
But performing Big Data tasks with R has been a little challenging. Sure, there are a few packages like sparklyr that make things easier, but scaling up has been an obstacle for many an organization. This gap is now being addressed through an integrated platform developed by Databricks and RStudio. Databricks was founded by the creators of Apache Spark and has recently been in the news thanks to MLflow, their open source platform that works with any language, tool, and algorithm.
The platform, provided by Databricks, integrates seamlessly with RStudio and enables data scientists and data engineers to execute R code at unprecedented scale. Both of the popular R packages used for connecting to and interacting with Apache Spark, sparklyr and SparkR, can be used inside RStudio on Databricks. Awesome!
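To give a flavour of what that looks like, here is a minimal sketch of connecting to Spark from an RStudio session hosted on Databricks using sparklyr. Treat it as an illustration rather than the official workflow; the mtcars pipeline is just a stand-in for your own data.

```r
# Minimal sketch: sparklyr from RStudio hosted on Databricks.
# The "databricks" connection method attaches to the cluster running
# the RStudio session, so no master URL is needed.
library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

# Copy a local data frame to Spark and run a dplyr pipeline on the cluster
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# The same session can also be driven through SparkR instead:
# library(SparkR)
# sparkR.session()
```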
Databricks has also provided a demo of the platform that walks through a KNN regression problem. You can either view the HTML version or download the R Markdown file and watch the magic unfold inside RStudio itself.
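If you want a feel for the technique before opening the demo, here is a tiny, self-contained KNN regression sketch in plain R using the FNN package. It is not the Databricks demo itself, just an illustration of the idea on synthetic data.

```r
# KNN regression on synthetic data with FNN::knn.reg()
# install.packages("FNN")  # if not already installed
library(FNN)

set.seed(42)
x <- matrix(runif(200), ncol = 1)              # single predictor
y <- sin(2 * pi * x[, 1]) + rnorm(200, sd = 0.1)

train_idx <- sample(seq_len(nrow(x)), 150)     # 150 train / 50 test split
fit <- knn.reg(train = x[train_idx, , drop = FALSE],
               test  = x[-train_idx, , drop = FALSE],
               y     = y[train_idx],
               k     = 5)

# Root mean squared error on the held-out points
sqrt(mean((fit$pred - y[-train_idx])^2))
```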
As mentioned in this Databricks blog post, “R users can get access to the full ETL capabilities of Databricks to provide access to relevant datasets including optimizing data formats, cleaning up data, and joining datasets to provide the perfect dataset for your analytics”.
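Here is a hedged sketch of what that kind of ETL could look like from R using sparklyr's dplyr interface. The table names and output path below are placeholders, not part of the Databricks demo.

```r
# Sketch of an ETL step on Spark from R: clean, join, and persist
# in an optimized columnar format. "flights", "airlines" and the
# output path are assumed/hypothetical names.
library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

flights  <- tbl(sc, "flights")     # assumed existing Spark tables
airlines <- tbl(sc, "airlines")

clean <- flights %>%
  filter(!is.na(dep_delay)) %>%                 # clean up data
  left_join(airlines, by = "carrier") %>%       # join datasets
  select(carrier, name, dep_delay, arr_delay)

# Write the prepared dataset as Parquet for downstream analytics
spark_write_parquet(clean, path = "/tmp/flights_clean", mode = "overwrite")
```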
Any data engineer (or, to a certain extent, data scientist) who currently works with R will love this release. Despite recent advances in R, performing Big Data tasks has always been a challenge, and most data engineers prefer working with Python for that reason. It helps massively that an R Markdown file is available to get you started, and there's a free trial so you can test the platform out before applying it to your current project.
All the data engineers out there – what do you make of this release? Will it make your current job easier? Let me know in the comments section below.