One Liners for Apache Log Files

January 21, 2011

Apache One-Liners

I frequently need to look at Apache log files to diagnose problems. Over time I’ve developed a series of one-liners I can copy and paste to quickly analyze a log file and look for problems, abuse, popular pages, etc.

If someone is reporting a slow site, it can be useful to see whether one IP is accessing URLs much more than other IPs, since this can be an indication of a poorly written crawler that is using up a lot of resources. Other times a site might be slow because it’s getting a lot of traffic, so it can be useful to look at the top referrers to see where the site is being linked from, or to look at the most popular URLs and cache those pages.

The one-liners are usually just a first step in diagnosing the problem. For example, I might only want to look at a certain time, so instead of using tail on the transfer log I’ll use fgrep '2011:05:' ./transfer.log to look at what happened between 5:00 AM and 5:59 AM. Or maybe I want to see what one IP was doing, so I’ll grep out that IP and see what its top 20 URLs were, then maybe narrow it down further to only the requests that were POST instead of GET. If you don’t know awk, you should really learn it since it is great for this kind of work; otherwise you can just use grep to get the data you want. Here’s the above example using both lots of greps and one of the one-liners below:

fgrep '2011:05:' ./transfer.log | fgrep '1.2.3.4' | grep 'POST' | awk '{freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20

If the log file is large or you want to get fancy, you can do it in awk instead:

cat ./transfer.log | awk '$4 ~ /2011:05:/ && $1 ~ /1\.2\.3\.4/ && $6 ~ /POST/ {freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20

The Art of Web and Wicked Cool Shell Scripts both cover some of this, although using their method on large log files you can end up piping too much data to sort | uniq -c, which awk can handle more efficiently. These one-liners show both methods of getting the information. If you’re interested in seeing the effects of piping too much data, run some of them against your biggest log file to see how much faster they are when the sort | uniq -c is moved into awk.
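
For a rough comparison, you can time both forms against a full log file, for example with something like this (using bash’s time builtin and sending the output to /dev/null so printing to the terminal doesn’t skew the numbers); the gap will depend on the size of the log and the number of unique URLs:

# sort | uniq -c version vs. awk array version, timed over the whole log
time (awk '{print $7}' ./transfer.log | sort | uniq -c | sort -rn | head -20 > /dev/null)
time (awk '{freq[$7]++} END {for (x in freq) {print freq[x], x}}' ./transfer.log | sort -rn | head -20 > /dev/null)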

# top 20 URLs from the last 5000 hits
tail -5000 ./transfer.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
tail -5000 ./transfer.log | awk '{freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20

# top 20 URLs excluding POST data from the last 5000 hits
tail -5000 ./transfer.log | awk -F"[ ?]" '{print $7}' | sort | uniq -c | sort -rn | head -20
tail -5000 ./transfer.log | awk -F"[ ?]" '{freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20

# top 20 IPs from the last 5000 hits
tail -5000 ./transfer.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
tail -5000 ./transfer.log | awk '{freq[$1]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20

# top 20 URLs requested from a certain ip from the last 5000 hits
IP=1.2.3.4; tail -5000 ./transfer.log | grep $IP | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
IP=1.2.3.4; tail -5000 ./transfer.log | awk -v ip=$IP ' $1 ~ ip {freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20

# top 20 URLs requested from a certain IP, excluding POST data, from the last 5000 hits
IP=1.2.3.4; tail -5000 ./transfer.log | fgrep $IP | awk -F "[ ?]" '{print $7}' | sort | uniq -c | sort -rn | head -20
IP=1.2.3.4; tail -5000 ./transfer.log | awk -F"[ ?]" -v ip=$IP ' $1 ~ ip {freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20

# top 20 referrers from the last 5000 hits
tail -5000 ./transfer.log | awk '{print $11}' | tr -d '"' | sort | uniq -c | sort -rn | head -20
tail -5000 ./transfer.log | awk '{freq[$11]++} END {for (x in freq) {print freq[x], x}}' | tr -d '"' | sort -rn | head -20

# top 20 user agents from the last 5000 hits
tail -5000 ./transfer.log | cut -d\  -f12- | sort | uniq -c | sort -rn | head -20

# sum of data (in MB) transferred in the last 5000 hits
tail -5000 ./transfer.log | awk '{sum+=$10} END {print sum/1048576}'
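
As a variation on the last one, the same awk array approach can be pointed at other fields; for example, something along these lines should show which IPs transferred the most data, assuming the usual log format where field 10 is the response size in bytes:

# top 20 IPs by data (in MB) transferred in the last 5000 hits
tail -5000 ./transfer.log | awk '{sum[$1]+=$10} END {for (x in sum) {print sum[x]/1048576, x}}' | sort -rn | head -20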
Posted in: Linux
  • Techedemic

    I know this is an old post, but thanks anyway. Very useful. Been trying to get my head around awk for a while now.

  • That’s great to hear! It’s nice to see that the content has a second life :)

  • Very helpful, thanks! I’ve had great luck with the “apachetop” app, which tries to give you “top”-style output specifically for your Apache logs. It gives auto-updating reports on the top URLs being loaded, the IPs loading things, and the referrers sending them. It’s really useful when there’s an obvious, terrible problem, but those reports often don’t surface bots that load expensive URLs without loading their assets: in terms of “resources loaded”, each bot IP competes with actual users who load 30 objects in a second because they’re loading a real page, but that page was cached, unlike the random month archives the bot hits.

    The main thing I miss from Apachetop is user-agent listing, since that’s so often the way I identify bots while visually scanning logs. Your script is really helpful, but I wish it was self-updating. Any leads on some kind of “top-like” app that would let me track which user-agents are in use?

  • Mark McKinstry

    Jeremy,

    I don’t know of anything like that off the top of my head, but you could always combine the one-liner with watch, like below, to get it to self-update:

    watch 'tail -50 ./transfer.log | cut -d" " -f12- | sort | uniq -c | sort -rn | head -20'

  • Olivier Dulac

    awk can also replace all those | grep (and | sed too) ^^
    For example: ( $4 ~ /regexp/ ) will compare $4 to that regexp, or just /regexp/ to compare the whole line ($0) to the regexp. So your first example could be:

    <./transfer.log awk ' /2011:05:/ && /1.2.3.4/ && /POST/ { freq[$7]++} END {for (x in freq) {print freq[x],x}}' | sort -rn | head -20

    or if you want to be sure you only compare the date in the right field, and the ip in the first field, and post in the Nth:

    <./transfer.log awk ' ( $4 ~ /2011:05:/ ) && ( $1 == "1.2.3.4" ) && /POST / { freq[$7]++} END {for (x in freq) {print freq[x],x}}' | sort -rn | head -20

    Another way to run awk on a file (or several) is to add those files at the end of the awk line, after the closing single quote. Above, instead, I ran awk on stdin (since I specified no files after the closing single quote) and redirected stdin to read from the single file you want to look at instead of from the terminal.
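
    For example, the same filter could be given the log file (or several) as arguments after the closing single quote instead of redirecting stdin; transfer.log.1 here is just a hypothetical rotated log:

    awk ' ( $4 ~ /2011:05:/ ) && ( $1 == "1.2.3.4" ) && /POST / { freq[$7]++} END {for (x in freq) {print freq[x],x}}' ./transfer.log ./transfer.log.1 | sort -rn | head -20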

  • Awesome list of some VERY useful apache one liners. Thank you so much, within 5 seconds I could identify 5 new bad referers in the last day from a very large apache log file. I’m keeping this list on hand.

  • Hi Ben, these have proven so useful to me, especially pulling the top referrers out of my logs every day, which I run as a cron job and email to myself every morning so I can quickly add bad referrers to my bad-referrer block list.

    I have just one question though: is it possible to take your one-liner for pulling out the referrers and make it exclude certain referrers, like my own domain name, bing.com, and google.com? It would then produce a much shorter list to scan through each morning.

    I am no boffin with grep or awk. Thanks in advance, and thanks again for making my daily life SO much easier.

  • Mark McKinstry

    For the ones where you’re not using the -F flag for awk, you can add '$11 !~ /google|bing|mydomain/' to the beginning of the query, and for the ones that have the -F you can add '$12 !~ /google|bing|mydomain/' to the beginning:

    # top 20 URLs from the last 5000 hits excluding google, bing, and mydomain referrers
    tail -5000 ./transfer.log | awk '$11 !~ /google|bing|mydomain/ {freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20

    # top 20 URLs excluding POST data from the last 5000 hits and excluding google, bing, and mydomain referrers
    tail -5000 ./transfer.log | awk -F"[ ?]" '$12 !~ /google|bing|mydomain/ {freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20

    You can actually put awk '$11 !~ /google|bing|mydomain/' immediately after cat-ing or tailing the transfer log to exclude referrers, so something like this works too:

    # top 20 user agents from the last 5000 hits excluding google, bing, and mydomain referrers
    tail -5000 ./transfer.log | awk '$11 !~ /google|bing|mydomain/' | cut -d" " -f12- | sort | uniq -c | sort -rn | head -20

    I just added the little awk command to the existing one liner.

  • Thank you so much Mark, that is awesome and works an absolute treat for what I need on a daily basis. Just your little one liner has given me such a better understanding of how to use awk and multiple awk’s in one line. You are a star !!!

  • Good solid content, never ages. The web these days is so full of “tutorials” that are so useless and end up causing more confusion than anything. 90% of them are just copied and pasted and slightly modified to make it look like their own work. Good solid “how to’s” like this should be available until the end of time.

  • How would one check for log lines containing an empty referer?

  • will_hough

    tail -500 transfer.log | awk '$11 == "\"-\"" {print $0}'

  • Thank you Will, that works like an absolute charm !!!

  • Here’s another interesting one for all you one-liner fans.

    I’d like to be able to take a plain text list of domains (over 4000 currently), run a bash script or one-liner that greps for each one in the logs I point it to, and have it return how many times each one appears.

    So in essence I’d like to see results like this:

    domain1 – 210
    domain2 – 0
    domain3 – 15
    domain4 – 27

    etc etc.
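
  • One possible approach (just a sketch, assuming the list is one domain per line in a file called domains.txt and that a plain substring match against each log line is good enough):

    # count hits per domain from a one-per-line list (domains.txt is an assumed filename)
    while read -r domain; do
      [ -n "$domain" ] || continue    # skip blank lines
      echo "$domain - $(fgrep -c "$domain" ./transfer.log)"
    done < domains.txt

    With 4000+ domains this rescans the log once per domain, so on a very large log a single awk pass keyed on the referrer field would be faster.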