Converting nginx access logs to TSV using bash

To my humble satisfaction, Gwasanaethau Cymru (Services Wales) was
launched a mere week and a half ago. It is my first genuine effort to
write a publicly accessible web application that I intend to actively
maintain so that I can grow my Java development skills.

I have an nginx web server sitting in front of my instance of Tomcat
and I’ve noticed myself becoming increasingly fascinated by the
access.log found in /var/log/nginx.

I find myself checking daily to see how many people are visiting. What
I am most interested in finding out is whether people are visiting my
page multiple times. As the number of visitors is very modest at the
moment, it would still be realistic for me to load this data into
Microsoft Excel (or LibreOffice Calc in my case) and use a pivot table
for this purpose.

Unfortunately the data isn’t in a format that Excel could read in
nicely:

Of course, I could separate the fields on whitespace; however, I am
only interested in extracting the IP address and the full timestamp.
After some effort, I found that I could use a combination of grep and
xargs to achieve this:

We use cat to concatenate all of our access.log files together. When
we pipe this into grep, we use -e twice to match both the IP address
field and the timestamp field. Unfortunately, each IP/timestamp pair
that grep matches is split over two lines. We can rectify this by
piping to xargs, which lets us regroup the input n items at a time, in
this case 2.
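The steps described above can be sketched roughly as follows. This is a minimal illustration rather than the exact command from the post: the sample log lines, the temporary directory and both regular expressions are assumptions, and the logs are assumed to be in nginx's "combined" format. Because xargs splits its input on whitespace, the timestamp pattern here stops before the timezone offset so that each match stays a single item.

```shell
#!/usr/bin/env bash
# Hypothetical sample data in nginx's "combined" log format (an assumption);
# the real files would be the access.log files in the nginx log directory.
log_dir="$(mktemp -d)"
printf '%s\n' '203.0.113.5 - - [12/Mar/2014:10:15:32 +0000] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0"' > "$log_dir/access.log"
printf '%s\n' '198.51.100.7 - - [11/Mar/2014:09:02:47 +0000] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0"' > "$log_dir/access.log.1"

# cat joins every access.log file into one stream.
# grep -o prints only the matched text, one match per line; -E enables
# extended regular expressions. The first -e matches the leading IPv4
# address, the second the date and time without the timezone offset.
# xargs -n 2 then regroups the matches two per output line.
pairs="$(cat "$log_dir"/access.log* \
  | grep -oE \
      -e '^[0-9]{1,3}(\.[0-9]{1,3}){3}' \
      -e '[0-9]{2}/[A-Za-z]{3}/[0-9]{4}(:[0-9]{2}){3}' \
  | xargs -n 2)"

printf '%s\n' "$pairs"
rm -rf "$log_dir"
```

Note that xargs joins each pair with a single space, so the result is space-separated rather than tab-separated; a spreadsheet import would need to be told to split on spaces.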

The output looks like this:

There are certainly cleaner ways of achieving this, but I think using
grep and xargs in this way gives you the flexibility to match a number
of arbitrary patterns in a file such as access.log. This could also be
used to strip out unwanted data from very large log files so that
their size is more manageable for use in an application like Excel.
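As one possible example of a cleaner route, awk can pick the fields by position and emit genuinely tab-separated output, keeping the timezone offset as well. This is only a sketch under the assumption that the logs use nginx's "combined" format, where the IP address is field 1 and the bracketed timestamp spans fields 4 and 5; the sample lines and the temporary file are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample data in nginx's "combined" log format (an assumption).
log_file="$(mktemp)"
printf '%s\n' \
  '203.0.113.5 - - [12/Mar/2014:10:15:32 +0000] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0"' \
  '198.51.100.7 - - [12/Mar/2014:10:17:01 +0000] "GET /about HTTP/1.1" 200 845 "-" "Mozilla/5.0"' \
  > "$log_file"

tsv="$(awk '{
  # $4 holds "[12/Mar/2014:10:15:32" and $5 holds "+0000]".
  # substr() drops the opening and closing square brackets.
  ts = substr($4, 2) " " substr($5, 1, length($5) - 1)
  # Emit the IP address and the full timestamp separated by a tab.
  printf "%s\t%s\n", $1, ts
}' "$log_file")"

printf '%s\n' "$tsv"
rm -f "$log_file"
```

Redirecting the output to a file with a .tsv extension should give something a spreadsheet can open directly, although this approach is tied to the field layout and is less flexible than matching arbitrary patterns with grep.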
