Writing a CSV File with Scala and Using it to Create a Machine Learning Model

Saanya Last Updated : 24 Nov, 2020

5 min read

This article was published as a part of the Data Science Blogathon.

Introduction

Scala is difficult to learn, true, but it’s worth the hard work. Scala has much easier syntax and is also more expressive. Scala codes are much concise than Java’s and an engineer who can write short and expressive code while also making it a type-safe and high-performance application are be considered valuable.

In this project, that I made as a college project, we’ll see how to write in a .csv file using Scala, which we will then use to create a basic fruit detection Machine Learning model.

Data Set

The data set that we will use can be found here.

The data set contains 4 fruits – Apple, Mandarin, Orange, and Lemons. And we will classify them,
solely on the basis of the given height, width, mass, and color score.

Although our dataset is already cleaned, if you wish to use a different dataset, make sure to clean and preprocess the data using python or any other way you want, to get the maximum out of your data, while training the model.

Writing in the CSV file

For writing the CSV file, we’ll use Scala’s BufferedWriter,

FileWriter and csvWriter.

We need to import all the above files before moving forward to deciding a path and giving column headings to our file.

We take a few rows of our data to take as input for the training dataset and to use it in writing our CSV file.

1. val out = new BufferedWriter(new FileWriter("D:/Academic/Assignments/Scala/Fruits.csv")) //this line will locate the file in the said directory
2. val writer = new CSVWriter(out) //this creates a csvWriter object for our file
3. val FruitSchema=Array("fruit_label","fruit_name","fruit_subtype","mass","width","height","color_score") // these are the schemas/headings of our csv file

Then we create arrays of our dataset, according to our schema plan.

To write this data into our csv file, we need to add this code snippet,

1. var listOfRecords=List() // this creates a list which holds our data
2. writer.writeAll(listOfRecords) // this adds our data into csv file
3. out.close() //closing the file

Whew, we got that right,

Creating a file using random data

We’ve created our CSV file using Scala. Although, there is one more way to do this, where we generate our data randomly, using ranges which we can then convert to lists.

Firstly, we import all the required libraries.

Then, we’ll now create our lists and ranges, which will contain the data we need in our CSV file.

1. val widthList = Range.BigDecimal(5.8,9.6,0.1).toList // BigDecimal(starting number, ending number, step count) is used to accept float in range and the toList function converts this range to a list
2. val random = new Random() // this function is used to generate the data randomly

Now we will put all this data in our CSV

1. var listOfRecords = new ListBuffer[Array[String]]() // this buffer holds all our data
2. listOfRecords += csvFields // this adds our schemas/headings
3. for(i<- 1 until 50){ listOfRecords+=Array(i.toString,nameList(random.nextInt(nameList.length)),massList(random.nextInt(massList.length)).toString(), widthList(random.nextInt(widthList.length)).toString(),heightList(random.nextInt(heightList.length)).toString(),colorList(random.nextInt(colorList.length)).toString())}  //the loop that which adds data to buffer

I used the Vlookup function in excel to add the fruit label.

This code generates the data purely randomly, so we need to be very careful before using it.

Creating Machine Learning model

To build our model, we’ll use Jupyter IDE of python.

I added a few more rows of data in my first CSV file, to get more accurate results.

Let’s get started, by importing the required libraries.

Now, it’s always better to have all the CSV files and python files in the same folder, so that it’s easy for us to code and organize and for python to find the file. Now, we will read in our CSV file.

We can also visualize the data using the seaborn library of python, to understand the data better. I, for one, have skipped it for now.

Lets split our data into training and test data,

After splitting the data, let’s check what model we can use, I first tried using Decision Tree, as we have a comparatively lesser data

But, we can clearly see that this model is overfitting, so we reject this. Now, let’s check for K Nearest Neighbor,

We can see that accuracy for both, training and test set is pretty good, so we can use this model, as it is neither overfitting nor underfitting.

Let’s fit our data in the KNN model and check for the best neighbor value.

After finding the perfect value, let’s see the prediction score of our model,

It’s not the best, but because we took a small dataset for this project, this is quite nice, now, we’ll finally plot the decision boundaries of our project.

And, we’re done!

Conclusion

We have now learned, how to create CSV files using Scala and the basics of Machine Learning!

Saanya

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Writing a CSV File with Scala and Using it to Create a Machine Learning Model

Introduction

Data Set

Writing in the CSV file

Creating a file using random data

Creating Machine Learning model

Conclusion

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS