This article was published as a part of the Data Science Blogathon
Working with text data can be fun and interesting. There are plenty of opportunities in NLP, text analytics, text mining, and related fields. But before proceeding with any of these, one must know how to work with text data in Python.
(Image: https://www.pexels.com/photo/coffee-writing-computer-blogging-34600/)
Working with text data poses a number of challenges.
Say we have an array of numbers: we can easily find the sum and the average of all the numbers. Or say we want to fit a regression model to this data. Numeric data is straightforward to process.
Now, coming to text data. How do we compare two book reviews, or let’s say two different comments on a Facebook post? How do we determine if a tweet carries a positive sentiment or a negative sentiment?
All these challenges can be addressed with NLP, text analytics, text mining, and other text-based solutions.
Text data constitutes a large part of all data online: Wikipedia pages, tweets, Amazon product reviews, and so on. The amount of text data is going to increase with time and encompass a wider variety of text. This data can yield many important insights, so there is an increasing need for data professionals who can tap into this potential. Python can be used to process text data, conduct various analyses, and gather metrics. But not all of this data is going to be clean or easily processable.
But before one proceeds with these things, one must know the basic text operations in Python. Knowing how to use Python's string functions properly makes working with and manipulating text data easy and fast.
Let us proceed with the code.
Python Code:
w="London is a big city."
l1=list(w)
print(l1)
t="London is a big city"
print("Text Length:", len(t))
print("List Length:", len(list(t)))
s= "London"
t="Lo"
print("Checking if s startswith 't':")
print(s.startswith(t))
The list() function can be used to get all the individual characters from a string. It returns every character, including whitespace and punctuation, as a list.
We can see that all the characters have been added to a list. Now, each individual character can be accessed. Let us check if the length of the list is equal to the length of the text.
Both have the same length, which confirms that list() produces exactly one element per character.
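A string is itself a sequence, so converting it to a list is often optional. The short sketch below reuses the sentence from the example above and shows indexing, slicing, and iteration directly on the string. The variable names are only illustrative.

```python
w = "London is a big city."

# Indexing and slicing work directly on the string.
first_char = w[0]    # 'L'
last_char = w[-1]    # '.'
first_word = w[:6]   # 'London'

# Iterating over the string visits one character at a time,
# so the spaces can be counted without building a list first.
space_count = sum(1 for ch in w if ch == " ")

print(first_char, last_char, first_word, space_count)
```

The list() conversion is still handy when the characters need to be modified in place, because strings themselves cannot be changed.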
Suppose we want to check if a particular string is present at the beginning of a larger text. In that case, we can use the startswith() function, which checks whether a string starts with the given prefix.
Let us see the implementation.
s = "London"
t = "Lo"
print(s.startswith(t))
Output:
True
Let us check another input.
s = "London"
print(s.startswith("Ne"))
Output:
False
The endswith() function does the opposite, as the name implies. It checks if a particular string is present at the end of another string.
s = "London"
print(s.endswith("on"))
Output:
True
So, these two functions can be used to check the start and the end of a string. They are useful when we are searching for a prefix or suffix.
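Both functions also accept a tuple of candidates, which is convenient when several prefixes or suffixes are acceptable. Here is a minimal sketch; the file names are made up purely for illustration.

```python
# Hypothetical file names, used only for illustration.
files = ["report.csv", "notes.txt", "data.csv", "image.png"]

# endswith() with a single suffix keeps only the CSV files.
csv_files = [name for name in files if name.endswith(".csv")]
print(csv_files)  # ['report.csv', 'data.csv']

# A tuple of suffixes matches if the string ends with any one of them.
text_like = [name for name in files if name.endswith((".csv", ".txt"))]
print(text_like)  # ['report.csv', 'notes.txt', 'data.csv']

# Both checks are case-sensitive.
lowercase_prefix = "London".startswith("lo")
print(lowercase_prefix)  # False
```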
s.isupper()
This function checks if all the letters in a string are in upper case. Implementation is simple, and it returns True or False.
w = "BERLIN"
print(w.isupper())
Output:
True
As the name implies, islower() is the counterpart of the previous function. It checks if all the letters in a string are in lower case and returns True or False.
w = "BERLIN"
print(w.islower())
Output:
False
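It helps to know how these two checks treat characters that have no case, such as digits, spaces, and punctuation. The sketch below runs both functions on a few sample strings chosen for illustration.

```python
samples = ["BERLIN", "BERLIN 2021", "Berlin", "berlin!", "2021"]

for text in samples:
    # Digits, spaces and punctuation are ignored by both checks;
    # only the letters have to agree on their case.
    print(repr(text), text.isupper(), text.islower())

upper_flags = [text.isupper() for text in samples]
lower_flags = [text.islower() for text in samples]
```

Note that "2021" fails both checks: a string needs at least one letter before either function can return True.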
t in s:
The keyword “in” can be used to check if a particular substring is present in a larger string. This can be used to find some string in a larger text, or check if the word we need is present in a larger paragraph.
The implementation is simple.
s = "Berlin is the capital of Germany"
print("Berlin" in s)
Output:
True
We get the appropriate output.
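The "in" check is case-sensitive, so a common approach is to lowercase both sides before comparing. A minimal sketch follows; contains_ignore_case is a hypothetical helper written for this example, not a built-in.

```python
s = "Berlin is the capital of Germany"

print("berlin" in s)          # False: the case does not match
print("berlin" in s.lower())  # True: both sides are lowercase now


def contains_ignore_case(text, query):
    """Return True if query occurs in text, ignoring case (hypothetical helper)."""
    return query.lower() in text.lower()


found = contains_ignore_case(s, "GERMANY")
absent = "France" not in s
print(found, absent)  # True True
```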
The istitle() function checks whether a text is in title format, for example, “United States”. Basically, the first letter of every word must be a capital, and the remaining letters must be lowercase, for the text to be in title format.
Let us see the implementation code.
s = "New York"
print(s.istitle())
Output:
True
As each word starts with a capital letter followed by lowercase letters, the function returns True.
Let us try a different example.
print("roMe".istitle())
Output:
False
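A few more inputs make the rule clearer: an uppercase letter is only allowed at the start of a word, and a lowercase letter only after another letter. The sample strings below are chosen for illustration.

```python
examples = ["New York", "New york", "NEW YORK", "roMe", "McDonald", "New York 2021"]

# Map each sample to the result of istitle().
title_flags = {text: text.istitle() for text in examples}

for text, flag in title_flags.items():
    print(repr(text), flag)
```

Only "New York" and "New York 2021" pass. Names such as "McDonald" fail because of the capital letter in the middle of the word.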
The isalpha() function checks if all the characters in a string are letters.
s = "dsnlmls"
print(s.isalpha())
Output:
True
So, we can see that as all the characters in the above string are letters, the function returns True.
Let us try with different input.
s = "56700#"
print(s.isalpha())
Output:
False
The output is as expected.
The isdigit() function checks if all the characters in a string are digits.
s = "2021"
print(s.isdigit())
Output:
True
As every character in the input is a digit, the function returns True.
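Since isdigit() looks at individual characters, a minus sign or a decimal point makes it return False. When the goal is to parse numbers rather than to test characters, trying the conversion is one possible alternative. In the sketch below, to_int_or_none is a hypothetical helper written for this example.

```python
values = ["2021", "-5", "3.14", "20 21", ""]

# Only the first value consists purely of digit characters.
digit_flags = [v.isdigit() for v in values]
print(digit_flags)  # [True, False, False, False, False]


def to_int_or_none(text):
    """Hypothetical helper: return int(text), or None if text is not an integer."""
    try:
        return int(text)
    except ValueError:
        return None


parsed = [to_int_or_none(v) for v in values]
print(parsed)  # [2021, -5, None, None, None]
```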
The isalnum() function checks if a string contains only letters and digits. If special characters are present, False is returned.
s = "jan2021"
print(s.isalnum())
Output:
True
Let us try a text with a special character.
s = "@1234"
print(s.isalnum())
Output:
False
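Spaces and underscores count as special characters here too. The same check can also be applied character by character to clean a string, as the following sketch with made-up sample values shows.

```python
samples = ["jan2021", "jan 2021", "jan_2021", "@1234"]

alnum_flags = [text.isalnum() for text in samples]
print(alnum_flags)  # [True, False, False, False]

# Keep only the letters and digits, dropping everything else.
cleaned = "".join(ch for ch in "@jan-2021!" if ch.isalnum())
print(cleaned)  # jan2021
```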
The lower() function converts all the characters of the string to lowercase. It is used when we want uniformity in our data.
Let’s see how it works.
s = "KOLKATA"
print(s.lower())
Output:
kolkata
As we can see, all the characters have been converted to lowercase.
As the name suggests, the upper() function converts all lowercase characters of a string to uppercase.
s = 'Kolkata'
print(s.upper())
Output:
KOLKATA
The title() function converts the first letter of every word to uppercase and the remaining letters to lowercase.
s = "kolKata is a bIg city"
print(s.title())
Output:
Kolkata Is A Big City
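Two practical notes on the case conversions, shown as a small sketch. Lowercasing both sides gives a simple case-insensitive comparison, and title() treats an apostrophe as a word boundary, which may not be what we want. string.capwords() from the standard library is one alternative that only splits on whitespace.

```python
import string

# Lowercasing both sides gives a simple case-insensitive comparison.
same_city = "KOLKATA".lower() == "Kolkata".lower()
print(same_city)  # True

s = "they're from new york"

# title() starts a new word after the apostrophe.
titled = s.title()
print(titled)  # They'Re From New York

# capwords() splits on whitespace only, so the apostrophe is left alone.
capped = string.capwords(s)
print(capped)  # They're From New York
```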
Earlier, we saw how to split the text into characters, but what if we want to get the words?
We can use the split() function to split the text into smaller pieces based on a separator. That is, the separator serves as the split point.
s = "Mumbai is the financial capital of India"
print(s.split(" "))
Output:
['Mumbai', 'is', 'the', 'financial', 'capital', 'of', 'India']
As we can see all the words have been returned in a list. Now, we can access all the words individually.
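Real text often contains tabs and newlines in addition to single spaces. Calling split() without an argument splits on any run of whitespace, and an optional second argument limits the number of splits. A minimal sketch with made-up inputs:

```python
s = "Mumbai is\tthe financial\ncapital of India"

# Splitting on a single space leaves the tab and the newline inside the pieces.
by_space = s.split(" ")
print(by_space)  # ['Mumbai', 'is\tthe', 'financial\ncapital', 'of', 'India']

# With no argument, any run of whitespace acts as the separator.
by_whitespace = s.split()
print(by_whitespace)  # ['Mumbai', 'is', 'the', 'financial', 'capital', 'of', 'India']

# The second argument caps the number of splits (here: split once).
key, value = "city=Mumbai=India".split("=", 1)
print(key, value)  # city Mumbai=India
```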
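Now, think of a situation where we have to join all these words back into a single string. The join() function does this: it is called on the separator string and takes the list of pieces.
Let us see how to implement it.
s = "Mumbai is the financial capital of India"
s_split = s.split(" ")
res = " ".join(s_split)
print(res)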
Output:
Mumbai is the financial capital of India
We get the joined string.
Let us try the same with some different data.
s = ["Ram", ",", "Shyam", ",", "Ravi", ",", "Hari"]
res = "".join(s)
print(res)
Output:
Ram,Shyam,Ravi,Hari
We get the output as desired.
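Usually the separator itself goes in front of join(), so the list only needs to hold the actual items. The sketch below also shows that join() expects strings, so numbers must be converted first, and that split() and join() together can tidy up irregular whitespace. All sample values are made up.

```python
names = ["Ram", "Shyam", "Ravi", "Hari"]

# The separator string is placed between the items.
csv_line = ", ".join(names)
print(csv_line)  # Ram, Shyam, Ravi, Hari

# join() only accepts strings, so numbers are converted with str() first.
years = [2019, 2020, 2021]
year_line = "-".join(str(y) for y in years)
print(year_line)  # 2019-2020-2021

# split() followed by join() collapses tabs, newlines and extra spaces.
messy = "Mumbai \t is\n the financial capital"
tidy = " ".join(messy.split())
print(tidy)  # Mumbai is the financial capital
```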
If we have to remove whitespace around a text, we can use the strip() function.
s = " London"
s.strip()
Output:
'London'
The strip() function also removes whitespace from the end of the string.
s = " London "
s.strip()
Output:
'London'
We get the output as expected.
The rstrip() function removes whitespace, but only from the end of the string.
s = " London "
s.rstrip()
Output:
' London'
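For completeness, lstrip() is the left-hand counterpart, and all three functions treat tabs and newlines as whitespace too. They also accept an optional set of characters to remove instead of whitespace. A short sketch, using repr() so that the remaining whitespace is visible:

```python
s = "\t London \n"

both = s.strip()
left = s.lstrip()
right = s.rstrip()

print(repr(both))   # 'London'
print(repr(left))   # 'London \n'
print(repr(right))  # '\t London'

# With an argument, the listed characters are stripped instead of whitespace.
tag = "##London!!".strip("#!")
print(tag)  # London
```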
The find() function can be used to find a particular string in a larger string.
It returns the index at which the search string first occurs.
s = "London is the capital of UK"
s.find("is")
Output:
7
Here, the string starts at the 8th character. As Python counts positions from 0, the output is 8-1=7.
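When the search string is absent, find() returns -1 rather than raising an error, so the result should be checked before it is used as an index. The related index() function raises a ValueError instead. A minimal sketch:

```python
s = "London is the capital of UK"

position = s.find("capital")
print(position)  # 14

# A missing substring gives -1, which must not be used as an index.
missing = s.find("Paris")
if missing == -1:
    print("'Paris' was not found")

# An optional second argument sets where the search starts.
second_o = s.find("o", 2)
print(second_o)  # 4

# index() behaves like find() but raises ValueError when nothing is found.
try:
    s.index("Paris")
    index_failed = False
except ValueError:
    index_failed = True
```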
The replace() function is used to replace one string with another. Let us see how it works.
s = "London is the capital of UK"
s = s.replace("London", "Rome")
s = s.replace("UK", "Italy")
print(s)
Output:
Rome is the capital of Italy
As we can see, the appropriate edits have been made.
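Note that replace() returns a new string and leaves the original untouched, which is why the result is assigned back to s above. It also takes an optional count that limits how many occurrences are replaced. A small sketch with made-up values:

```python
s = "London is the capital of UK"

# The original string is not modified; a new one is returned.
new_s = s.replace("London", "Rome")
print(s)      # London is the capital of UK
print(new_s)  # Rome is the capital of UK

# The third argument limits the number of replacements.
dashed = "2021-07-15".replace("-", "/", 1)
print(dashed)  # 2021/07-15

# The match is case-sensitive, so nothing changes here.
unchanged = s.replace("london", "Rome")
print(unchanged == s)  # True
```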
Suppose we are extracting text from a web source. The splitlines() function can be used to split the text into lines at each newline character; here, each line holds one sentence.
t = "Berlin is the capital of Germany. \n London is the capital of UK. \n Paris is the capital of France."
t.splitlines()
Output:
['Berlin is the capital of Germany. ', ' London is the capital of UK. ', ' Paris is the capital of France.']
The \n stands for the newline character. So, the three individual lines, each holding one sentence, are returned here. Note that the spaces around each \n are kept.
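Compared with split("\n"), splitlines() also understands Windows-style \r\n line endings and does not produce an empty string for a trailing newline. The sketch below illustrates this with a made-up three-line text, and combines splitlines() with strip() to drop leftover spaces.

```python
t = "Berlin\nLondon\r\nParis\n"

lines = t.splitlines()
print(lines)  # ['Berlin', 'London', 'Paris']

# keepends=True keeps the line-ending characters on each line.
with_ends = t.splitlines(keepends=True)
print(with_ends)  # ['Berlin\n', 'London\r\n', 'Paris\n']

# split("\n") leaves the '\r' behind and adds an empty last item.
naive = t.split("\n")
print(naive)  # ['Berlin', 'London\r', 'Paris', '']

# strip() removes the spaces left around each line.
raw = "Berlin is the capital of Germany. \n London is the capital of UK."
cleaned = [line.strip() for line in raw.splitlines()]
print(cleaned)
```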
There is a lot more to learn in Python.
Prateek Majumder
Data Science and Analytics | SEO | Content Creation
Thank You.