
Piping Too Much Data

November 14, 2010

I recently ran into a problem where I had a 1.1 gig Apache transfer log and needed to produce a list of the IPs that had accessed either of two URLs more than 5 times, excluding a /24 CIDR block from the list.

Here’s an example of what transfer.log looked like:

8.8.8.8 - - [14/Mar/2010:03:14:15 -0400] "GET /foo.php?search=bar HTTP/1.1" 200 867 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
4.2.2.4 - - [14/Mar/2010:03:14:15 -0400] "GET /foo.php?search=bar HTTP/1.0" 200 867 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 6.03 [en]"
192.168.1.6 - - [14/Mar/2010:03:14:15 -0400] "GET /blog/qwerty/ HTTP/1.0" 200 5309 "-" "Opera/8.51 (Windows NT 5.1; U; en)"
3.14.159.26 - - [14/Mar/2010:03:14:15 -0400] "GET /blog/qwerty/ HTTP/1.0" 200 5309 "-" "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)"
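
For reference, awk splits each line on whitespace, so in this log format $1 is the client IP and $7 is the requested path, which are the two fields everything below keys off of. Running a quick print over just the four example lines above (saved here as sample.log, a throwaway name used only for this check) makes that easy to see:

$ gawk '{print $1, $7}' ./sample.log
8.8.8.8 /foo.php?search=bar
4.2.2.4 /foo.php?search=bar
192.168.1.6 /blog/qwerty/
3.14.159.26 /blog/qwerty/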

I quickly threw together something using awk, sort, and uniq -c, but it took ~3 minutes to produce the list of IPs, which was annoyingly slow. It got the job done, but clearly there had to be a better way:

$ time gawk '
> $1 !~ /192\.168\.1\./ && ($7 ~ /\/foo.php\?search=bar/ || $7 ~ /\/blog\/qwerty\//) {
> 	print $1
> }' ./transfer.log | sort | uniq -c | gawk '$1 > 5 {print $2}' > ./ips.txt
real	2m55.607s
user	2m59.874s
sys	0m4.139s

The Unix philosophy talks about small programs doing one thing well. awk, uniq, and sort all do a great job of what they're supposed to do; the slowdown isn't the fault of any one of them, but of the huge amount of data being passed through the pipes that each program has to process.

There’s three pipes in the above command: sort, uniq -c, gawk. The transfer log starts out at 5,513,132 lines and after passing through the first gawk it’s only reduced to 5,388,614 lines which sort sorts, then passes the sorted 5,388,614 lines to uniq -c to count the number of times an IP is on the list which reduces the number of lines to 4534, which is then passed to gawk to find out which of those IPs accessed the two pages more than five times. In summary gawk, sort, and uniq are each having to process over 5 million lines of text which is why things are slow.

To speed things up, we need to cut down on the data being passed between them. The sort | uniq -c stage basically builds an associative array of the IPs and the number of times each has accessed either of the URLs. gawk has associative arrays of its own, so instead of printing out the IP, we count it in an array and tack an END block onto the gawk program to print the array out:

gawk '
> $1 !~ /192\.168\.1\./ && ($7 ~ /\/foo.php\?search=bar/ || $7 ~ /\/blog\/qwerty\//) {
> 	freq[$1]++
> }
> END {
> 	for (x in freq) {
> 		print freq[x], x
> 	}
> }' ./transfer.log | gawk '$1 > 5 {print $2}' > ./ips.txt
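
If you want to convince yourself the array really is equivalent to sort | uniq -c, a quick test on a few made-up IPs shows the same counts coming out of both. (awk's for (x in array) loop makes no ordering promises, so the lines may come out in a different order, but the counts match; the exact spacing of the uniq -c column can also vary.)

$ printf '1.1.1.1\n2.2.2.2\n1.1.1.1\n1.1.1.1\n' | sort | uniq -c
      3 1.1.1.1
      1 2.2.2.2
$ printf '1.1.1.1\n2.2.2.2\n1.1.1.1\n1.1.1.1\n' | gawk '{freq[$1]++} END {for (x in freq) print freq[x], x}'
3 1.1.1.1
1 2.2.2.2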

Piping gawk into gawk is pointless and looks bad, so we move the logic from the last stage, gawk '$1 > 5 {print $2}', into the END block of the first and re-run our test:

$ time gawk '
> $1 !~ /192\.168\.1\./ && ($7 ~ /\/foo.php\?search=bar/ || $7 ~ /\/blog\/qwerty\//) {
> 	freq[$1]++
> }
> END {
> 	for (x in freq) {
> 		if (freq[x] > 5) {
> 			print x
> 		}
> 	}
> }' ./transfer.log > ./ips.txt
real	0m16.529s
user	0m12.863s
sys	0m3.148s

We went from taking 3 minutes to process a 1.1 gig log file to ~15 seconds, which is a huge improvement. We could call it a day here (I initially did), but the code can still be sped up some. We translated the sort | uniq -c pipe into awk code literally, so our associative array holds the total number of times each IP address accessed the two URLs. But we don't care about the total; we only want to know which IPs accessed the URLs more than 5 times, so there's no point in keeping count after an IP has hit our threshold. We just add freq[$1] <= 5 to the front of the conditional to take care of that. Because && short-circuits, once an IP has been counted more than five times gawk skips the two regex matches against $7 for the rest of that IP's lines, which is where the extra savings comes from:

$ time gawk '
> freq[$1] <= 5 && $1 !~ /192\.168\.1\./ && ($7 ~ /\/foo.php\?search=bar/ || $7 ~ /\/blog\/qwerty\//) {
> 	freq[$1]++
> }
> END {
> 	for (x in freq) {
> 		if (freq[x] > 5) {
> 			print x
> 		}
> 	}
> }' ./transfer.log > ./ips.txt
real	0m10.364s
user	0m7.291s
sys	0m3.065s

Instead of having three programs operate on 5,000,000+ lines of piped text, we've condensed it down to one (gawk) making a single pass, taking us from 3 minutes to 10 seconds. If the transfer log had been smaller, or if the number of lines for IPs accessing the two URLs had been 5,000 instead of 5,000,000, none of this would have mattered. It's when you start dealing with bigger sets of data that the inefficiencies in your code become evident.
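
If this turns into something you run more than once, the final command drops easily into a small shell script. This is just a sketch; the script name, the excluded 192.168.1 block, and the two URL patterns are the ones from this post and would be swapped out for whatever you actually need to match:

#!/bin/sh
# busy-ips.sh -- print IPs (outside 192.168.1.*) that hit either URL more than 5 times
# Usage: ./busy-ips.sh /path/to/transfer.log > ips.txt
LOG="${1:-./transfer.log}"

gawk '
freq[$1] <= 5 && $1 !~ /192\.168\.1\./ && ($7 ~ /\/foo.php\?search=bar/ || $7 ~ /\/blog\/qwerty\//) {
	freq[$1]++
}
END {
	for (x in freq) {
		if (freq[x] > 5) {
			print x
		}
	}
}' "$LOG"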
