In this article, I am going to explain how we can use log parsing with Spark and Scala to get meaningful data out of unstructured data. In my experience of parsing many logs from different sources, I have found that no data is truly unstructured; there is always some meaningful way to look at it and understand it. In the case of logs, there are configuration files that define which field gets written at which position. To start with, I will take a simple NGINX log line and explain how it can be parsed using regular expressions. But before that, let's go through a brief overview of the tools involved.
A regular expression is a sequence of characters used to search for patterns in text. It is supported by almost all programming languages, and even NoSQL databases allow regular expressions in their queries.
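As a quick illustration (my own minimal example, not yet part of the log pipeline), here is how a regular expression can be used from Scala to pull an IP address out of a piece of text:

// Minimal sketch: capture an IPv4-looking token with a Scala regular expression.
val ipPattern = """(\d+\.\d+\.\d+\.\d+)""".r
val sample    = "47.29.201.179 - - [28/Feb/2019:13:17:10 +0000]"
ipPattern.findFirstIn(sample)   // Some(47.29.201.179)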
Spark is a data processing engine that is widely used for its parallel processing abilities. It is written in Scala and can be used from Java, Python, R and Scala. Because it is written in Scala, many of the projects I have seen prefer to build their data pipelines in Scala.
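The snippets that follow assume a SparkSession is available. In spark-shell it already exists as `spark`; otherwise, a minimal setup sketch (the app name and local master are my own assumptions) looks like this:

import org.apache.spark.sql.SparkSession

// Minimal local session, just for experimenting with the examples below.
val spark = SparkSession.builder()
  .appName("nginx-log-parsing")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._   // enables .toDF and the $"column" syntax used later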
Let's take an example. Say I am running NGINX on one of my servers and need to monitor its access log, so that failures, API usage and other meaningful data points can be derived from it.
A sample log line from the NGINX access log looks like this:
47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?e=1 HTTP/2.0" 200 5316 "https://sampleexample.com/?e=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
The configuration that defines the log format can be found in the nginx.conf file and looks like this:
log_format main '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"';
To human eyes the correlation is clear:
| Log format field | Sample value |
| --- | --- |
| $remote_addr | 47.29.201.179 |
| - | - |
| $remote_user | - |
| [$time_local] | [28/Feb/2019:13:17:10 +0000] |
| $request | GET /?e=1 HTTP/2.0 |
| $status | 200 |
| $body_bytes_sent | 5316 |
| $http_referer | https://sampleexample.com/?e=1 |
| $http_user_agent | Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36 |
But making this correlation programmatically is a little more difficult.
To start with, we have to write a regular expression that can recognise these patterns in the log. I have written one below, with an explanation to follow. You can write your own regular expression or adapt this one to your use case.
(\d+\.\d+\.\d+\.\d+) - - (\[.*?\]) (".*?") (\d+) (\d+) (".*?") (".*?")
The correlation between the regular expression and the log format is given in the table below:
| Log format field | Regular expression |
| --- | --- |
| $remote_addr | (\d+\.\d+\.\d+\.\d+) |
| - | - |
| $remote_user | - |
| [$time_local] | (\[.*?\]) |
| "$request" | (".*?") |
| $status | (\d+) |
| $body_bytes_sent | (\d+) |
| "$http_referer" | (".*?") |
| "$http_user_agent" | (".*?") |
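Before wiring the pattern into Spark, it can help to sanity-check it in plain Scala. The sketch below (my own check, not part of the original pipeline) uses the same regular expression and the sample log line from above; unapplySeq returns the captured groups only when the whole line matches:

// Sanity check: the regex should capture all seven groups from the sample line.
val nginxRegex =
  """(\d+\.\d+\.\d+\.\d+) - - (\[.*?\]) (".*?") (\d+) (\d+) (".*?") (".*?")""".r

val sampleLine =
  """47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?e=1 HTTP/2.0" 200 5316 "https://sampleexample.com/?e=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36""""

nginxRegex.unapplySeq(sampleLine)
// Some(List(47.29.201.179, [28/Feb/2019:13:17:10 +0000], "GET /?e=1 HTTP/2.0", 200, 5316, ...))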
Now, let's load the log line into a data frame, since in most real cases you will have a lot of log data and data frames are the most convenient structure for processing it. You can follow the code below for the same.
val log = """47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?e=1 HTTP/2.0" 200 5316 "https://sampleexample.com/?e=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36""""

// toDF requires import spark.implicits._ (see the setup above)
val df = Seq(log).toDF
df.show(false)
Now, let's create a UDF to parse logs using regular expressions.
import org.apache.spark.sql.functions.{col, lit, udf}

// UDF that applies a regular expression to a string and returns the captured groups
val regex_split = udf((value: String, regex: String) => regex.r.unapplySeq(value))

val regularExpression = """(\d+\.\d+\.\d+\.\d+) - - (\[.*?\]) (".*?") (\d+) (\d+) (".*?") (".*?")"""

val df_parsed = df.withColumn("parsed_fields", regex_split($"value", lit(regularExpression)))
In the above code block, the "regex_split" UDF is created and then applied to the log data. The output data frame, "df_parsed", has two fields, one of which is an array of the captured groups generated by the UDF.
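To see this for yourself, you can inspect the schema and the new column (a small sketch using standard DataFrame calls):

// parsed_fields comes back as an array<string> column holding the seven captured groups
df_parsed.printSchema()
df_parsed.select($"parsed_fields").show(false)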
Now we have to separate the elements of the array into different columns. The code below shows how that can be done using a map of positions to column names.
val colname_position_map = Map(
  0 -> "remote_addr",
  1 -> "time",
  2 -> "request",
  3 -> "status",
  4 -> "body_bytes_sent",
  5 -> "http_referer",
  6 -> "http_user_agent"
)

df_parsed.select(
  colname_position_map.keys.toList
    .map(x => col("parsed_fields")(x).as(colname_position_map(x))): _*
).show
The final output is a data frame with one named column per parsed log field (remote_addr, time, request, status, body_bytes_sent, http_referer and http_user_agent).
This solution is scalable and can be used in real-time Spark streaming jobs as well.
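As a sketch of that idea (the input directory and the console sink are illustrative assumptions), the same UDF and column mapping can be applied to a Structured Streaming source:

// Read log files as they arrive in a directory and parse them with the same UDF.
val streamingLogs = spark.readStream.text("/var/log/nginx/stream-in")   // hypothetical path

val parsedStream = streamingLogs
  .withColumn("parsed_fields", regex_split($"value", lit(regularExpression)))
  .select(colname_position_map.keys.toList
    .map(x => col("parsed_fields")(x).as(colname_position_map(x))): _*)

// Write the parsed columns to the console for demonstration purposes.
val query = parsedStream.writeStream
  .format("console")
  .outputMode("append")
  .start()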
I hope you enjoyed this article on log parsing. If you have any queries or feedback, you can reach out to me on LinkedIn.