Referrer Filter

Referrer Filter (ReferFilter) is a Java utility for filtering server logs based on a configurable referrer whitelist. ReferFilter is © 2013 by Christoph Nahr but available for free download under the MIT license.

The download package ReferFilter.zip (23.2 KB) contains the executable JAR file with complete source code and documentation. Please see the enclosed ReadMe file for system requirements, usage instructions, and the copyright notice. The rest of this page provides background information on referrer spam and ReferFilter, as well as sample statistics from my website.

Overview

ReferFilter is designed to purge referrer spam from server logs. Referrer spam is far from the only kind of unwanted Internet traffic, but it’s the only kind that’s easily caught by hand-written filters. Comprehensive spam fighting requires dedicated web services with continually updated blacklists, so as to reliably classify traffic by context and originating IP. Examples include Akismet, Spamhaus, and SURBL.

There are similar services for gathering reliable visitor statistics, such as Google Analytics and WordPress.com Stats. However, these two rely on easily blocked client-side scripts which may cause them to miss legitimate visitors. Server logs have the opposite problem: they record everything, so they need additional filtering in order to eliminate spam and bots.

The program I use to analyze my server logs, WebLog Expert, already has a fairly complete list of search spiders, so I wrote ReferFilter as an additional filter against referrer spam.

Note — As mentioned above, there are malicious requests that don’t conveniently identify themselves by their referrer headers, as well as benign bots that shouldn’t count as visitors. ReferFilter won’t do you much good if you get a lot of unwanted traffic that’s not referrer spam. I reached that point about four months after I wrote ReferFilter, so today I simply use Google Analytics to determine actual human visitors. Consider this page a historical artifact in the long futile battle against the Internet’s bot infestation.

Referrer Spam

The sadly misspelled referer header is an optional part of every HTTP request. It indicates that the requested URI was obtained from the referring URI, typically by clicking on a hyperlink. Since this header is controlled by the client, users can disable it entirely for privacy reasons (e.g. Mozilla’s sendRefererHeader option) – or they can forge it. That’s what spammers do.
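
For illustration only (this is not part of ReferFilter, and the URIs below are placeholders), here is a minimal Java sketch showing how trivially a client can send an arbitrary Referer header:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    /** Illustration only: any HTTP client can put an arbitrary URI into the Referer header. */
    public class ForgedReferer {
        public static void main(String[] args) throws Exception {
            HttpRequest request = HttpRequest
                    .newBuilder(URI.create("http://example.com/some-page.html"))  // placeholder target
                    .header("Referer", "http://spam-domain.example/")             // forged referrer
                    .GET()
                    .build();
            HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.discarding());
        }
    }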

Even back in 2004, Rui Carmo ranted about widespread referrer spam. In the early days of blogging people thought it would be a great idea to publish all their incoming referrer URIs. Spammers quickly exploited this opportunity to get clicks on their fake referrers. Moreover, many websites stored their server logs in directories that were unintentionally visible to search spiders. So all referrer URIs would get picked up by search engines, boosting the link count and search ranking of spammed URIs.

Exploding referrer spam has ensured that nobody publishes these headers anymore, intentionally or otherwise. Yet the spam continues unabated. I’m guessing this is partly because the cost is so low, and partly because commercial spammers would hardly tell their foolish clients how ineffective their “service” really is. So even in 2013, you must filter out referrer spam to get a realistic idea of your pages’ popularity.

Using Whitelists

ReferFilter attempts to clean up server logs using a whitelist as its primary filtering mechanism, rather than a blacklist. Intuitively, this should be a better choice because it reflects the different intentions behind legitimate and malicious domain referrals.

  1. Good guys want few, stable, recognizable domains which are easy to remember.
  2. Bad guys want many, always slightly changing domains which are hard to block.

Whitelisting plays to the strength of the good guys and exploits the weakness of the bad guys. We add a legitimate domain once and we’re done. We don’t need to go back every week and add another dozen variations that a spammer would create to avoid detection.
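
The sketch below shows the basic idea in Java. This is not ReferFilter’s actual code; the matching rule (accept a referrer whose host equals a whitelisted domain or is one of its subdomains) is merely a plausible assumption, and the listed entries other than Stack Overflow and Google are placeholders.

    import java.net.URI;
    import java.util.List;

    /** Minimal sketch of whitelist-based referrer filtering (not ReferFilter's actual code). */
    public class WhitelistSketch {

        // Hypothetical entries; the real whitelist can use partial domains that match many hosts.
        private static final List<String> WHITELIST =
                List.of("mysite.example", "stackoverflow.com", "google.com");

        static boolean isWhitelisted(String referrer) {
            if (referrer == null || referrer.isEmpty() || referrer.equals("-"))
                return true;                                // no referrer cannot be referrer spam
            String host;
            try {
                host = URI.create(referrer).getHost();
            } catch (IllegalArgumentException e) {
                return false;                               // malformed referrer: reject
            }
            if (host == null) return false;
            for (String entry : WHITELIST)
                if (host.equals(entry) || host.endsWith("." + entry))
                    return true;                            // exact or subdomain match
            return false;
        }

        public static void main(String[] args) {
            System.out.println(isWhitelisted("https://stackoverflow.com/questions/12345"));  // true
            System.out.println(isWhitelisted("http://buy-cheap-pills.example/"));            // false
        }
    }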

Of course, the total number of legitimate domains is pretty large, although not deliberately kept in constant flux like spam domains. So we still need to update our whitelist regularly to keep up with legitimate visitors. Did we actually gain anything, or is this just as laborious as maintaining a blacklist?

Sample Statistics

Let’s look at the archived server logs for my little website (excluding the weblog). These logs cover nearly 28 months and count “visitors” as defined by WebLog Expert: page views, excluding known search spiders and repeated hits from the same IP address within a 30-minute timeout (roughly as sketched after the list below).

  • 87,585 visitors total
  • 52,350 visitors (59.8%) have no referrer at all
  • 16,063 visitors (18.3%) refer from my own website (including weblog)
  • 2,622 visitors (3%) refer from a single website (Stack Overflow)
  • 5,843 visitors (6.7%) refer from the next ten domains (Google etc., including variations that can be whitelisted collectively with a single rule)
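
For reference, here is a rough Java sketch of that visitor definition. This is not WebLog Expert’s actual algorithm; the rolling 30-minute timeout and the spider exclusion are taken from the description above, the rest is guesswork.

    import java.util.HashMap;
    import java.util.Map;

    /** Rough sketch of the "visitor" definition above (not WebLog Expert's actual algorithm). */
    public class VisitorCounter {

        private static final long TIMEOUT_MS = 30 * 60 * 1000L;
        private final Map<String, Long> lastSeen = new HashMap<>();  // IP address -> time of last hit
        private long visitors;

        /** Feed page views in chronological order. */
        void addPageView(String ip, long timeMs, boolean isKnownSpider) {
            if (isKnownSpider) return;                       // spiders never count as visitors
            Long previous = lastSeen.put(ip, timeMs);
            if (previous == null || timeMs - previous > TIMEOUT_MS)
                visitors++;                                  // first hit, or previous hit timed out
        }

        long visitors() { return visitors; }
    }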

Since requests without a referrer header are obviously not referrer spam, I can clear over 78% of all visitors by simply whitelisting my own domain! Adding just one more domain increases the share to 81%, and the next ten to 88%. Clearly I won’t have to do a lot of whitelisting to get a fairly comprehensive picture. But how many of the remaining 12% are spam?

Whitelist Performance

I filtered this time range using 393 whitelist entries, mostly partial domains that can match multiple host names. This eliminated 7% out of the questionable 12% of page visitors. I’m confident this result is fairly accurate because I inspected all accepted and rejected referrers. There were only a few requests I was unsure about. Other measures – raw non-spider hits, raw page views, etc. – differ slightly, but all show a rejection rate of 4–8%. How does that fairly small drop translate to unique referrers and their domains?

  • Unfiltered — 4,814 unique referrers from 2,620 domains
  • Whitelist — 2,496 unique referrers (–48.2%) from 566 domains (–78.4%)

Although less than 10% of all requests are spam, nearly 80% of referring domains are! Given that my whitelist required 393 entries to cover 566 legitimate domains, a blacklist might well need over 1,400 entries for the remaining 2,054 spam domains. Let’s examine ReferFilter’s diagnostic output to see how this balance is likely to shift in the future. Note that the following figures are not directly comparable to those listed above because they include non-page hits (images etc.) as well as weblog requests.

  • Filtering 28 months at once — 2,660 unique referrers from 2,259 spam domains
  • Filtering each year separately — 825 + 1,204 + 781 = 2,810 unique referrers from 717 + 1,014 + 679 = 2,410 spam domains (12 + 12 + 4 months)

The sum of separate annual runs barely exceeds the true totals. There is remarkably little overlap between years, as spammers keep varying their domains ever so slightly. Worse, their numbers grow along with legitimate traffic. If the current trend continues, I can expect perhaps twice as many spam domains in 2013 as in 2012. Given that they already outnumber legitimate domains 4:1, and that the bulk of legitimate traffic shows either no referrer or a handful of stable domains, I think it’s clear that a whitelist is the only practical approach to manual spam filtering.

Targeted Spam

At this point you might wonder why I even bother with spam filtering, now that I know the percentage. Couldn’t I just deduct 7% from all page visits and call it a day? Sadly no, for two reasons. First, there’s no guarantee that this share remains constant over time. More importantly, it isn’t even remotely constant across pages. Comparing spam-filtered page visits to the unfiltered totals:

  • The most popular page dropped by only 0.1% (High DPI on Windows)…
  • …but the 2nd most popular page dropped by 2.7% (WPF Performance)
  • The Civilization V page dropped by only 0.3%…
  • …but the Civilization IV page dropped by 11.5%
  • The worst drop was a shocking 36.8% (Class Diagrammer)

Spammers like to hit one page repeatedly, so the discrepancies are worse in terms of raw page hits, reaching 59.7% for Class Diagrammer. Evidently, spammers target not my entire website but individual pages, with no apparent relation to age or popularity. Unless Russian prostitutes suddenly developed a keen interest in UML diagrams, I can only assume that some URLs got randomly fed into a spam network. Filtering is indeed necessary since the relative popularity of different pages is so grossly distorted by targeted spam.

Other Unwanted Traffic

A similar case turned up when I compared my weblog’s server stats to its WordPress statistics. One single post got hundreds of requests that WP didn’t show. Most came from two narrow IP ranges owned by a notoriously spam-friendly hosting service. The requests made no sense for human users, so WP was likely correct to suppress them – or perhaps the bots simply didn’t run JavaScript code. At any rate, they sailed right through my filter since they specified no referrers. I only noticed them because of a sudden inexplicable rise in page views. ReferFilter works great against referrer spam, but you’ll want a professional service’s comprehensive IP blacklist to guard against this kind of malicious request.
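
For illustration only, here is a minimal Java sketch of blocking requests by IPv4 range. This is not a ReferFilter feature, and the CIDR ranges are placeholder documentation addresses, not the actual hosting service.

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.List;

    /** Minimal sketch of IPv4 range blocking (not a ReferFilter feature). */
    public class IpBlocklist {

        // Placeholder CIDR ranges (RFC 5737 documentation addresses), not real spam sources.
        private static final List<String> BLOCKED_CIDRS =
                List.of("192.0.2.0/24", "198.51.100.0/24");

        static boolean isBlocked(String ip) throws UnknownHostException {
            long addr = toLong(InetAddress.getByName(ip).getAddress());
            for (String cidr : BLOCKED_CIDRS) {
                String[] parts = cidr.split("/");
                long network = toLong(InetAddress.getByName(parts[0]).getAddress());
                int prefix = Integer.parseInt(parts[1]);
                long mask = prefix == 0 ? 0 : (-1L << (32 - prefix)) & 0xFFFFFFFFL;
                if ((addr & mask) == (network & mask)) return true;
            }
            return false;
        }

        private static long toLong(byte[] bytes) {           // IPv4 bytes to unsigned 32-bit value
            long value = 0;
            for (byte b : bytes) value = (value << 8) | (b & 0xFF);
            return value;
        }
    }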

Tips — Some camouflaged attack bots are easily eliminated by only including HTTP GET requests in your server log evaluation. ReferFilter doesn’t provide this filter because it’s already built into WebLog Expert (use Edit Profile: Filters). Dropping POST requests uncovered another targeted bot attack that boosted my old Galactopedia page by over 100 fake hits. Moreover, dropping HEAD requests eliminates OpenGraph previews that are legitimate but shouldn’t count as page views.

If your website doesn’t use queries or PHP, another simple anti-bot measure is to ignore any visitors that send queries (…?…) or attempt to load PHP scripts. WebLog Expert once again provides built-in filters for these cases: simply ignore all query strings (*=*) and “visitors by file” *.php and *.php/*. You may have to disable the analysis option Truncate text after question marks (?) in file names for this to work.