A command line is a valuable tool for productivity in daily Data Science activities. As Data Scientists, we are adept at using Jupyter Notebooks and RStudio to obtain, scrub, explore, model, and interpret data (the OSEMN process). From Pandas to the Tidyverse, messy data is handled effectively and effortlessly to provide input to machine learning algorithms for modeling. However, simple operations such as sorting dataframes, filtering rows on a condition, or building data pipelines for large datasets can be performed just as quickly with the Bash command-line interface. First released in 1989, Bash (a.k.a. the Bourne Again Shell) is an essential part of a data scientist’s toolkit, yet one that is not commonly taught in data science bootcamps, Master’s programs, or even online courses.
This article will introduce you to the fantastic world of Bash, beyond the commonly used basic commands such as printing the working directory with pwd, changing directories with cd, listing the items in a folder with ls, copying items with cp, moving items with mv, and deleting items with rm. By the end of the article, you will be familiar with the built-in data wrangling commands Bash puts at your disposal.
Bonus: A Cheatsheet has also been provided for quick reference to the commands and how they work! Some miscellaneous commands are included, so make sure to check it out!
A Command-Line Interface (CLI) allows users to write commands that instruct the computer to perform specific tasks. The “Shell” is a CLI, so named because it is the outer layer that wraps around the inner operating system, i.e., the kernel. The shell is tasked with reading the commands, interpreting them, and directing the operating system to perform those tasks.
Each command is preceded by a dollar sign ($) called the prompt. Only the $ sign is shown in the examples in this article because the prompt is irrelevant to the actual commands; it changes as you move between directories and can also be customized. The command structure follows this sequence: command -options arguments.
Let’s look at some simple commands in Bash:
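As a quick refresher, here is a minimal sketch of those basic commands; the directory and file names are placeholders, not files from this article.

$ pwd                          # print the current working directory
$ ls                           # list the items in the current folder
$ cd projects/                 # change into the projects directory
$ cp data.csv data_copy.csv    # copy a file
$ mv data_copy.csv archive/    # move the copy into another folder
$ rm archive/data_copy.csv     # delete the copy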
Let us review the suite of commands in Bash that Data Scientists use. We will use two datasets: the first is Apple’s 40-year stock history from January 1, 1981, to December 31, 2020, which can be downloaded from Yahoo Finance here. The second is a custom dataset, shown below.
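For illustration, the custom dataset referenced in the examples could look something like the hypothetical CSV below; the file name data.csv and its contents are assumptions chosen only to match the commands that follow (the stock file is assumed to be named AAPL.csv).

$ cat data.csv
ID,Name,Age,Major
1312,John,22,Statistics
7891,Mary,25,Economics
9112,Sam,21,Data Science
2236,John,28,Computer Science
4561,Jane,24,Mathematics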
The wc command, for word count, returns the number of lines, words, and characters in a file, in that order.
Print number of lines, words, and characters of a file using wc command
You can use the input redirection operator “<” if you do not want the file path shown alongside the result.
Print a file’s number of lines, words, and characters without showing the file path using the wc command.
Furthermore, use the -l option to return only the number of lines, -w to return only the number of words, and -m to return only the number of characters.
Print the number of lines, words, and characters using wc command with specified options
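A sketch of the wc variants described above, assuming the Apple stock file is named AAPL.csv:

$ wc AAPL.csv        # lines, words, and characters, followed by the file path
$ wc < AAPL.csv      # the same counts, without the file path
$ wc -l AAPL.csv     # number of lines only
$ wc -w AAPL.csv     # number of words only
$ wc -m AAPL.csv     # number of characters only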
The head command returns the top 10 lines of a file by default.
Print the first 10 lines of the file using head command
To see the first n lines, pass in the -n option stating the number of lines.
Print the first 2 lines of the file using head command
Print the first 5 lines of the file using head command
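A sketch of the head variants above, again assuming the file is AAPL.csv:

$ head AAPL.csv        # first 10 lines by default
$ head -n 2 AAPL.csv   # first 2 lines
$ head -n 5 AAPL.csv   # first 5 lines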
The tail command returns the bottom 10 lines of a file by default.
Print the last 10 lines of the file using tail command
Similar to the head command, pass in the -n option to see the last n lines.
Print the last 3 lines of the file using tail command
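A sketch of the tail commands above (AAPL.csv is the assumed file name):

$ tail AAPL.csv        # last 10 lines by default
$ tail -n 3 AAPL.csv   # last 3 lines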
cat, short for concatenate, is a multi-purpose command used for creating files and viewing file contents, among other uses. To view an existing file’s contents, simply pass the file path as the argument. cat returns all the lines in the file.
Displaying first 5 lines of entire file output returned by cat command
To keep track of the line numbers, the -n option can be used.
Displaying first 5 lines of entire file output returned by cat command, showing line numbers using -n option
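A sketch of the two viewing variants, piping into head to show only the first 5 lines (the pipe operator “|” is covered in more detail below; AAPL.csv is the assumed file name):

$ cat AAPL.csv | head -n 5      # view the first 5 lines of the full cat output
$ cat -n AAPL.csv | head -n 5   # the same lines, prefixed with their line numbers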
To create a new file, use the output redirection operator “>” after cat and specify the filename. Add your content in the provided space, and press CTRL+D to exit the editor.
Create a new file and add content using cat command
To append to an existing file, use the append operator “>>” after cat and specify the filename. As before, add the content to be appended, and press CTRL+D to exit the editor.
Append to an existing file using cat command
Display the contents of NewFile.csv after creation and appending using cat command
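A sketch of creating, appending to, and displaying NewFile.csv with cat:

$ cat > NewFile.csv     # type the new content, then press CTRL+D to save and exit
$ cat >> NewFile.csv    # type the content to append, then press CTRL+D
$ cat NewFile.csv       # display the file after creation and appending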
The sort command is used to sort the contents of a file in ASCII order: blanks first, then digits, then uppercase letters, followed by lowercase letters.
Let us sort the custom dataset to understand this command. By default, the sort command sorts in ascending order and acts lexicographically on the first character of each line in the dataset (I, 1, 7, 9, 2, 4). Lexicographic sorting means that “29” comes before “4,5”. However, since we have a comma-separated file, we want sort to act on columns, by default the first column (ID, 1312, 7891, 9112, 2236, 4561). Thus, we pass in the delimiter option -t with the comma delimiter.
Sorting the first column using default sort command with -t option
Notice that the header row is shifted to the end, since uppercase letters come after digits in ASCII order. To sort numerically, we should use the -n option. This ensures that sorting is done numerically rather than lexicographically.
Sorting the first column numerically using sort command with -n option
Notice now that the header row is unaffected. To reverse sort the dataset, pass in the option -r. The output is the reverse of the numerical sorting output.
Reverse sorting the first column using sort command with -r option
To sort a particular column, pass in the -k option with the column number. Here, let us sort on age in ascending order.
Sorting age column using sort command with -k option
Let us also see an example of sorting non-numeric columns, such as the last column of “major”.
Sorting non-numeric column using sort command
The sort command can also be used to sort month columns using the -M option, check whether a column is already sorted using the -c option, and remove duplicates while sorting using the -u option.
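A sketch of the sort variants above, assuming the custom dataset is stored in data.csv with age in the third column and major in the fourth (hypothetical positions):

$ sort -t "," data.csv           # lexicographic sort; the header row drops to the end
$ sort -t "," -n data.csv        # numeric sort on the first column; the header row stays on top
$ sort -t "," -n -r data.csv     # numeric sort in reverse order
$ sort -t "," -n -k 3 data.csv   # numeric sort on the age column
$ sort -t "," -k 4 data.csv      # lexicographic sort on the major column
$ sort -t "," -u data.csv        # sort and drop duplicate lines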
The tr command stands for “translate”, and is used for translating and deleting characters. It reads only from standard input and shows the output on standard output.
Here, we will introduce the pipe operator “|” that passes the standard output of one command as standard input into another command, like a pipeline. Let us again use the custom dataset for understanding this command.
To convert uppercase characters to lowercase, pass in “[:upper:]” as the first argument and “[:lower:]” as the second, and vice-versa to go the other way. Alternatively, the first argument can be “[A-Z]” and the second “[a-z]”.
Converting uppercase characters to lowercase using tr command
To translate the comma-separated file into a tab-separated file, use tr command with first argument as “,” and second argument as “\t”.
Converting csv file format to a tsv file format using tr command
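A sketch of the translation examples above; since tr reads only from standard input, the file (the assumed data.csv) is piped in with cat:

$ cat data.csv | tr "[:upper:]" "[:lower:]"   # convert uppercase characters to lowercase
$ cat data.csv | tr "[A-Z]" "[a-z]"           # equivalent, using character ranges
$ cat data.csv | tr "," "\t"                  # turn the comma-separated file into tab-separated output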
To delete a character in a file, use the delete option -d with the tr command. The operation is case-sensitive.
Deleting the character “S” using tr command with -d option
Notice that the character “S” is deleted from the entire file. Similarly, to remove all uppercase letters, use the character class “[:upper:]”; to remove all digits, use “[:digit:]”; and so on.
Deleting all digit characters using tr command with -d option
To delete everything except a character, use the complement option -c and the delete option -d with the tr command.
Delete all characters except uppercase letters using tr command with -c and -d options
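A sketch of the deletion examples above (data.csv is the assumed file name):

$ cat data.csv | tr -d "S"              # delete every occurrence of "S" (case-sensitive)
$ cat data.csv | tr -d "[:digit:]"      # delete all digit characters
$ cat data.csv | tr -c -d "[:upper:]"   # delete everything except uppercase letters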
To replace multiple continuous occurrences of a character with a single occurrence, use the squeeze repeats option -s with the tr command, giving only one argument as input.
Keeping a single character instance of “2” using tr command with -s option
To replace all single and multiple continuous occurrences of a character with another character, use the squeeze repeats option -s with the tr command, giving two arguments as input. Note that runs of the character are also collapsed into a single instance of the replacement character.
Replacing all single and multiple occurrences of “2” with “h” using tr command with -s option
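A sketch of the squeeze examples above (data.csv is the assumed file name):

$ cat data.csv | tr -s "2"        # collapse runs of "2" into a single "2"
$ cat data.csv | tr -s "2" "h"    # replace every single "2" and every run of "2" with one "h"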
The paste command joins two files horizontally using a tab delimiter by default.
The -d option can be used to specify a custom delimiter. Let’s concatenate the two datasets with a comma delimiter and see the first 6 rows using the pipe operator “|”.
Concatenating two datasets horizontally using the paste command with -d option specified as a comma
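A sketch of the paste examples above, assuming the two files are AAPL.csv and data.csv:

$ paste AAPL.csv data.csv                      # join the two files line by line with a tab
$ paste -d "," AAPL.csv data.csv | head -n 6   # join with a comma and view the first 6 rows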
The uniq command detects and filters out adjacent duplicate rows in a file, which is why its input is typically sorted first.
Let us use the custom dataset, and append two duplicate lines to the file first.
Appending duplicate rows to file using append operator “>>” with cat command
This is how the dataset looks now.
View of data file after adding duplicate line items
Now, let’s see each line item along with its count using the -c option.
Find the count of each line item using uniq command with -c option
Notice that the detection of duplicate entries is case-sensitive. To ignore case, use the -i option.
Find the count of each line item ignoring case using uniq command with -c and -i options
Other valuable options with the uniq command include the -u option, which returns only the unique line items, and the -d option, which returns only the duplicate line items.
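A sketch of the uniq examples above (data.csv is the assumed file name; since uniq compares adjacent lines, unsorted input is usually piped through sort first):

$ cat >> data.csv              # append two duplicate rows, then press CTRL+D
$ sort data.csv | uniq -c      # count of each line item
$ sort data.csv | uniq -c -i   # count of each line item, ignoring case
$ sort data.csv | uniq -u      # only the line items that appear exactly once
$ sort data.csv | uniq -d      # only the line items that are duplicated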
grep stands for “global regular expression print” and is Bash’s in-built utility for searching line items matching a regular expression.
Let us search for all rows which have “John” using grep.
Searching for line items matching regular expression using grep command
Since grep is case-sensitive, we can use -i option to ignore the case for matching.
Searching for line items matching regular expression (case-insensitive) using grep command with -i option
The number of lines containing “John” can be returned using the count option -c.
Counting number of lines matching regular expression using grep command with -c option
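A sketch of the basic grep searches above (data.csv is the assumed file name):

$ grep "John" data.csv       # rows containing "John" (case-sensitive)
$ grep -i "john" data.csv    # the same search, ignoring case
$ grep -c "John" data.csv    # number of lines containing "John"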
To match whole words instead of a substring using grep command, use the word option -w. To demonstrate, let’s first append a new row using cat command.
Appending a new row in data file using append operator “>>” with cat command
Let’s see the default output of grep command.
Default search for regular expression using grep command
To search only for the whole word “ohn” instead of all substrings, let’s now use the word option -w.
Searching for whole words of regular expression using grep command
To keep track of the line numbers of line items returned by grep command, use the -n option.
Print line numbers for lines matched by regular expression using grep command with -n option
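A sketch of the whole-word and line-number examples above (data.csv is the assumed file name):

$ cat >> data.csv           # append a row containing "ohn", then press CTRL+D
$ grep "ohn" data.csv       # also matches "John", since "ohn" is a substring
$ grep -w "ohn" data.csv    # matches only the whole word "ohn"
$ grep -n "John" data.csv   # prefix each matching line with its line number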
The cut command is used to cut out and extract sections from each line of a file.
The field option -f must be used to return a particular column. Field numbering starts at 1, not 0, for the first column onwards.
Return a field in a data file using cut command with the -f option
As we can see, cut isn’t able to identify the columns on its own, so the delimiter option -d must be used in conjunction with -f.
Return a field in a data file using cut command with the -d and -f options
Let’s look at a more complex example of the cut command using the Apple stock prices dataset. Specifically, we want to see the columns Date, High, Low, and Volume for the first 10 data rows in the file. To do this, we first get the first 11 rows (including the header) as standard output using the head command, and pipe it into the cut command. Note that the field option -f accepts multiple column numbers as input.
Returning a subset of rows and columns using head and cut commands with pipe operator
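A sketch of the cut examples above; the stock file is assumed to be AAPL.csv with the usual Yahoo Finance column order of Date, Open, High, Low, Close, Adj Close, Volume:

$ cut -f 2 data.csv                             # without a delimiter, the whole line comes back unsplit
$ cut -d "," -f 2 data.csv                      # second column of the comma-separated file
$ head -n 11 AAPL.csv | cut -d "," -f 1,3,4,7   # Date, High, Low, Volume for the first 10 data rows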
The special “!$” designator in Bash refers to the last argument of the preceding command. CTRL + R is used for reverse searching through the commands in the Bash session history.
To understand it better, let’s see an example of “!$” character.
Returning standard output of data file using cat command
Now, say we want to look at only the first 3 rows in the file. Instead of typing the entire file path again in the new command, we can type “head -3 !$”. The “!$” designator will automatically substitute the path.
Printing the first 3 rows using the “!$” designator with the head command
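A minimal sketch of the substitution; the long file path is purely hypothetical:

$ cat datasets/custom/data.csv   # the last argument here is the file path
$ head -3 !$                     # expands to: head -3 datasets/custom/data.csv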
CTRL + R for reverse-i-search is beneficial for searching through any old, long command you’ve written and want to bring up again. The search starts at the most recent matching command and moves back up the history, and the characters you type are matched incrementally against previous commands.
Reverse searching for grep commands in the session history
Reverse searching for grep command with -i option having “John” value in the session history
The entire command history of the session can be seen using the history command if a manual search needs to be done.
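A sketch of both approaches; the searched-for command is hypothetical:

# Press CTRL+R and start typing; the prompt changes to something like:
# (reverse-i-search)`John': grep -i "John" data.csv
$ history                  # list the full command history of the session
$ history | grep "grep"    # manually filter the history for grep commands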
The Bash command line is a very useful tool for data scientists doing quick data analysis, without launching any integrated development environment. All the commands become more powerful when combined with Bash’s input/output redirection (“<”, “>”, “>>”) and pipe (“|”) utilities. Experiment with these utilities and find efficient ways of wrangling your data. Make sure to get your hands dirty to leverage Bash for your data needs!
Q1. What are Bash commands?
A. Bash commands are executed in the Bash shell, a popular command-line interface and scripting language used in Unix-like operating systems. They allow users to interact with the operating system, execute programs, manipulate files, and perform various system-related tasks.

Q2. What is the “ls” command in Bash?
A. The “ls” command is used in Bash to list the contents of a directory. It displays the files and directories in the current working directory or any specified directory, along with their details, such as permissions, size, and timestamps.

Q3. How do I start a Bash command?
A. To start a Bash command, you need to open a terminal or command prompt, depending on your operating system. Once the terminal is open, you can type and execute Bash commands directly by typing the command followed by pressing the Enter key.

Q4. What is Git Bash?
A. Git Bash is a command-line interface for Windows that provides a Unix-like environment, including a Bash shell and a collection of Git commands. It allows Windows users to use the Git version control system and execute Bash commands, making it easier to work with Git repositories and perform command-line operations.