ReadMe for ReferFilter 1.0.0

Program:	Referrer Filter for Server Logs
Version:	1.0.0 (initial release)
Released:	04 May 2013
Author:	Christoph Nahr (Copyright)
Contact:	webmaster@kynosarges.org
Website:	http://kynosarges.org/ReferFilter.html

1. System Requirements
2. Package Contents
3. Usage Overview
3.1 Configuration File
3.2 Diagnostic Output
4. Known Issues
5. Copyright Notice

1. System Requirements

ReferFilter is a Java command-line application that should work on any system with a command prompt and a current JRE. The source code was written for JDK 7. The precompiled JAR file should be compatible with JRE 5/6, but I haven’t tested that.

Microsoft Windows Tips

Please see “Oracle Java on Windows” for information on how to avoid Oracle’s terrible default Windows JRE.

ReferFilter outputs large amounts of data to both stdout and stderr. Use the operator 2> to redirect stderr to a file, same as on typical Unix shells. See Microsoft’s “Using command redirection operators” for more details.

Server logs are usually stored in GZip archives, so you’ll want a program capable of unpacking those to stdout. You can use 7-Zip with the -so switch, or alternatively get the GnuWin port of GZip. The lightweight GnuWin tools are fully compatible with the Windows command prompt and file system. They don’t need an entire emulated Unix environment like Cygwin tools.

2. Package Contents

The directory to which the archive was unpacked contains the following files:

`ReadMe.html`	This file
`ReferFilter.jar`	Precompiled executable Java archive
`sample.cfg`	Sample configuration file with popular domains
`compile.bat`	Windows batch file for rebuilding `ReferFilter.jar`
`org/kynosarges/*`	Java source code files for `ReferFilter.jar`

Full source code is provided so you can inspect and change how ReferFilter operates. The Windows batch file compile.bat is provided for convenience and rebuilds the executable Java archive. This requires a JDK 7 installation in your PATH.

3. Usage Overview

ReferFilter is executed from the command line, using the following inputs and outputs:

Configuration file defines the server log format and domain filters.
stdin provides the list of original server log entries to filter.
stdout receives the list of server log entries that passed the filter.
stderr receives error messages and diagnostic output, including all filtered referrers.

The configuration file is specified by a mandatory command line argument. The following usage samples show how to filter an uncompressed and compressed server log, respectively, with the included sample configuration file:

java -jar ReferFilter.jar sample.cfg <access.log >filter.log 2>message.log
gzip -cd access.log.gz | java -jar ReferFilter.jar sample.cfg >filter.log 2>message.log

3.1 Configuration File

The mandatory configuration file is read line-by-line. All leading and trailing spaces are trimmed. Empty lines and lines beginning with a hash (#) are ignored. Non-ASCII characters must use UTF-8 encoding. The first non-ignored line must contain the server log format. All other non-ignored lines are optional and define whitelisted or blacklisted domains.
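
The line-handling rules above can be sketched as follows. This is a minimal illustration, not the actual ReferFilter source: the class name ConfigSketch, its fields, and the parse method are hypothetical, and the choice to throw an exception when no format line exists is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the configuration rules; not the ReferFilter source. */
final class ConfigSketch {

    /** First non-ignored line: the server log format (a regex pattern). */
    final String logFormat;

    /** All remaining non-ignored lines: domain filters, possibly prefixed with '@'. */
    final List<String> domainFilters;

    private ConfigSketch(String logFormat, List<String> domainFilters) {
        this.logFormat = logFormat;
        this.domainFilters = Collections.unmodifiableList(domainFilters);
    }

    /** Applies the trimming and skipping rules to already decoded lines. */
    static ConfigSketch parse(List<String> lines) {
        String logFormat = null;
        List<String> filters = new ArrayList<String>();
        for (String raw : lines) {
            // Note: trim() removes tabs and control characters as well as spaces.
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // ignored line
            }
            if (logFormat == null) {
                logFormat = line; // first non-ignored line
            } else {
                filters.add(line);
            }
        }
        if (logFormat == null) {
            // Assumption: a missing format line is treated as an error.
            throw new IllegalArgumentException("No server log format found");
        }
        return new ConfigSketch(logFormat, filters);
    }

    public static void main(String[] args) {
        ConfigSketch config = parse(Arrays.asList(
            "# sample", "", "  ^(\\S+) .*$  ", "google.com", "@spam.example"));
        System.out.println(config.logFormat);
        System.out.println(config.domainFilters);
    }
}
```

To feed this sketch from a real file, the lines could be read with Files.readAllLines(Paths.get("sample.cfg"), StandardCharsets.UTF_8), which is available since Java 7 and matches the UTF-8 requirement.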

Server Log Format

The first non-ignored line defines the format of all server log entries, using a Java regular expression pattern. The format in the sample configuration file matches the enhanced Apache format used by 1&1 servers. I adapted the format from a simpler version in LogEval, an open-source Java server log analyzer and parser. Any server log entry that does not match the specified pattern results in an error message.
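
To illustrate what such a pattern looks like, here is a minimal sketch that matches the standard Apache “combined” log format. This is an assumption for demonstration only: it is not the enhanced 1&1 pattern from sample.cfg, and the class name LogFormatSketch, the helper referrerOf, and the group numbering are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; the pattern in sample.cfg differs from this one. */
final class LogFormatSketch {

    // Assumed format: host ident user [time] "request" status bytes "referrer" "agent"
    static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)"
        + " \"([^\"]*)\" \"([^\"]*)\"$");

    /** Returns the referrer field, or null if the entry does not match. */
    static String referrerOf(String entry) {
        Matcher m = COMBINED.matcher(entry);
        return m.matches() ? m.group(8) : null;
    }

    public static void main(String[] args) {
        String entry = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326 "
            + "\"http://www.example.com/start.html\" "
            + "\"Mozilla/4.08 [en] (Win98; I ;Nav)\"";
        System.out.println(referrerOf(entry));
    }
}
```

The sketch does not handle escaped quotes inside quoted fields; an entry containing them would simply fail to match, which corresponds to the error case described above.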

Domain Filter List

All other non-ignored lines define domain filters, one per line. These use a custom format with the following rules:

An initial www. is always optionally accepted. Never specify it explicitly.
Starting a filter with a dot indicates that other optional subdomains are acceptable.
Ending a filter with a dot indicates that superdomains are required to perform the match.

Here are some examples of these rules in action:

google.com matches google.com and www.google.com, nothing else.
.google.com matches google.com, www.google.com, and any other subdomains such as plus.google.com.
.google. matches google.com, www.google.com, any other subdomains such as plus.google.com, and any other superdomains such as google.co.uk, but not google or www.google (no superdomain).
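
The three rules and the examples above can be expressed as a short matching routine. The following is a minimal sketch written from this description only; the class name DomainFilterSketch and the method matches are hypothetical, and the actual implementation in ReferFilter may work differently.

```java
import java.util.Locale;

/** Hypothetical sketch of the domain filter rules; not the ReferFilter source. */
final class DomainFilterSketch {

    /** Tests a host name against one filter, without any leading '@'. */
    static boolean matches(String filter, String host) {
        boolean anySubdomain = filter.startsWith(".");
        boolean needSuperdomain = filter.endsWith(".");
        int start = anySubdomain ? 1 : 0;
        int end = filter.length() - (needSuperdomain ? 1 : 0);
        if (end <= start) {
            throw new IllegalArgumentException("Empty domain filter: " + filter);
        }
        String core = filter.substring(start, end).toLowerCase(Locale.ROOT);
        String name = host.toLowerCase(Locale.ROOT);

        // Try every occurrence of the core and check what surrounds it.
        for (int i = name.indexOf(core); i >= 0; i = name.indexOf(core, i + 1)) {
            String prefix = name.substring(0, i);
            String suffix = name.substring(i + core.length());
            if (prefixOk(prefix, anySubdomain) && suffixOk(suffix, needSuperdomain)) {
                return true;
            }
        }
        return false;
    }

    // Without a leading dot only "www." may precede the core;
    // with a leading dot any complete subdomain chain may.
    private static boolean prefixOk(String prefix, boolean anySubdomain) {
        if (prefix.isEmpty()) return true;
        return anySubdomain ? prefix.endsWith(".") : prefix.equals("www.");
    }

    // With a trailing dot at least one superdomain must follow;
    // without it the host must end with the core.
    private static boolean suffixOk(String suffix, boolean needSuperdomain) {
        if (!needSuperdomain) return suffix.isEmpty();
        return suffix.length() > 1 && suffix.startsWith(".");
    }

    public static void main(String[] args) {
        System.out.println(matches("google.com", "www.google.com"));
        System.out.println(matches(".google.", "google.co.uk"));
        System.out.println(matches(".google.", "www.google"));
    }
}
```

Checking the text around each occurrence of the core keeps look-alike hosts such as notgoogle.com from matching .google.com, because the preceding text must end at a label boundary.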

The order in which domain filters are specified doesn’t matter. Domain filters act as a whitelist by default: any referrer header that specifies a matching domain is accepted. Preceding the line with an at sign (@) changes the filter to a blacklist entry. The blacklist applies only to ReferFilter’s heuristic for uncertain referrers, described in section 3.2.

Implicit Filter Rules

In addition to the explicitly specified domain list, ReferFilter automatically accepts any requests with the following referrers:

Referrers that are empty or a single dash (-). We’re looking for referrer spam, so requests without a referrer are obviously fine.
Referrers whose scheme doesn’t contain http, e.g. links in locally saved HTML pages. The spam I’ve seen always includes http.
Referrers whose domain is a purely numerical IP address. These are usually legitimate requests via translation or anonymization proxies, or workarounds for DNS issues. Edit method ReferFilter.isAccepted to change this behavior.
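
The implicit rules above amount to a few checks performed before any domain filter is consulted. Here is a minimal sketch, assuming that “purely numerical IP address” means an IPv4 dotted quad. The class ImplicitRulesSketch and its method are hypothetical and are not the actual ReferFilter.isAccepted method.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical sketch of the implicit rules; not ReferFilter.isAccepted. */
final class ImplicitRulesSketch {

    // Assumption: only IPv4 dotted-quad notation counts as a numerical IP.
    private static final Pattern IPV4 =
        Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

    /** Returns true if the referrer is accepted without checking any filter. */
    static boolean isImplicitlyAccepted(String referrer) {
        if (referrer == null) return true;
        String r = referrer.trim();
        if (r.isEmpty() || r.equals("-")) return true; // no referrer at all

        URL url;
        try {
            url = new URL(r); // constructor is deprecated since Java 20
        } catch (MalformedURLException e) {
            // Unparsable referrers are left to the correction heuristic.
            return false;
        }
        String scheme = url.getProtocol().toLowerCase(Locale.ROOT);
        if (!scheme.contains("http")) return true; // e.g. file:
        return IPV4.matcher(url.getHost()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isImplicitlyAccepted("-"));
        System.out.println(isImplicitlyAccepted("file:///C:/saved/page.html"));
        System.out.println(isImplicitlyAccepted("http://192.168.0.1/page"));
        System.out.println(isImplicitlyAccepted("http://spam.example/"));
    }
}
```

A false result here only means that no implicit rule applies; the referrer would then still be tested against the explicit domain filter list.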

3.2 Diagnostic Output

Aside from the filtered server log on stdout, ReferFilter produces a large volume of diagnostic output on stderr. This is what you’ll use to build your referrer whitelist. The output begins with overall statistics, followed by three lists of referrers: corrected, uncertain, and rejected.

Corrected Referrers

ReferFilter employs a heuristic correction when a referrer header cannot be parsed using the Java URL constructor. Any substring up to the first colon (:), if any, is interpreted as a protocol. Any non-empty protocol that doesn’t contain http is implicitly accepted – whatever it is, it’s probably not spam. Otherwise, the heuristic skips any slashes (/) following the protocol, and then grabs all consecutive characters that are alphanumerical or dots. The result is interpreted as the host name. This approach seems to work well enough for accidentally or deliberately mangled URLs. Method Request.correctHost implements the heuristic.
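
The steps of this heuristic can be sketched as shown below. This is written from the description above and is not the code of Request.correctHost; the class name CorrectHostSketch and the convention of returning null for implicitly accepted referrers are assumptions made for this example.

```java
import java.util.Locale;

/** Hypothetical sketch of the correction heuristic; not Request.correctHost. */
final class CorrectHostSketch {

    /**
     * Extracts a host name from a mangled referrer.
     * Returns null if the referrer is implicitly accepted because it has
     * a non-empty protocol that does not contain "http".
     */
    static String correctHost(String referrer) {
        // Everything before the first colon, if any, is the protocol.
        int colon = referrer.indexOf(':');
        String protocol = colon < 0 ? "" : referrer.substring(0, colon);
        if (!protocol.isEmpty()
                && !protocol.toLowerCase(Locale.ROOT).contains("http")) {
            return null;
        }

        // Skip any slashes following the protocol.
        int start = colon + 1; // 0 when there is no colon
        while (start < referrer.length() && referrer.charAt(start) == '/') {
            start++;
        }

        // Take all consecutive letters, digits, and dots as the host name.
        // Following the description literally, a hyphen ends the host name.
        int end = start;
        while (end < referrer.length()) {
            char c = referrer.charAt(end);
            if (!Character.isLetterOrDigit(c) && c != '.') break;
            end++;
        }
        return referrer.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(correctHost("http:///spam.example.com/x?y"));
        System.out.println(correctHost("http:spam.example.com"));
        System.out.println(correctHost("android-app://com.example"));
    }
}
```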

Uncertain Referrers

Uncertainty arises when the same client IP first refers from a rejected domain, then shortly afterward from an accepted domain. This might indicate that the rejected referrer has directed legitimate traffic to your site and should be whitelisted. On the other hand, it might indicate an undeclared web crawler or camouflaged attack bot.

ReferFilter shows a sample of one accepted referrer for each rejected referrer where this happens. When a case isn’t clear from the sample, you should inspect the original server log for more context. The IP comparison considers the last 100 requests with rejected referrers. All new requests from the same IP are filtered out, including any repeat requests beyond the sample.

By default, ReferFilter only considers such requests uncertain if both the client IP and the user agent are identical. This helps avoid false positives for dynamic IPs. However, if you’re sure that a specific referrer is associated with undesirable bots, you can add it to the blacklist (precede domain with @). This has two consequences. First, request sequences originating from that referrer will be silently filtered, without appearing in the list of uncertain referrers. Second, repeat hits will be filtered based only on the client IP, even if the user agent is different. That’s because bots that obfuscate their referrers will also obfuscate their user agents.
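
The comparison against recent rejected requests can be pictured as a bounded queue. The following minimal sketch assumes one possible data structure; the class UncertainSketch, the Rejected record, and both method names are hypothetical, and only the window size of 100 and the matching rules are taken from the text above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

/** Hypothetical sketch of the uncertainty check; not the ReferFilter source. */
final class UncertainSketch {

    /** Number of rejected requests considered, as stated above. */
    static final int WINDOW = 100;

    /** One request whose referrer was rejected. */
    static final class Rejected {
        final String clientIp;
        final String userAgent;
        final String referrerHost;
        final boolean blacklisted; // referrer matched an '@' filter

        Rejected(String clientIp, String userAgent,
                 String referrerHost, boolean blacklisted) {
            this.clientIp = clientIp;
            this.userAgent = userAgent;
            this.referrerHost = referrerHost;
            this.blacklisted = blacklisted;
        }
    }

    private final Deque<Rejected> recent = new ArrayDeque<Rejected>();

    /** Remembers a rejected request, dropping the oldest beyond the window. */
    void recordRejected(Rejected request) {
        if (recent.size() == WINDOW) {
            recent.removeFirst();
        }
        recent.addLast(request);
    }

    /**
     * Looks for a recent rejected request from the same client, newest first.
     * Returns null if a request with an accepted referrer may pass.
     */
    Rejected findEarlierRejection(String clientIp, String userAgent) {
        Iterator<Rejected> it = recent.descendingIterator();
        while (it.hasNext()) {
            Rejected r = it.next();
            if (!r.clientIp.equals(clientIp)) continue;
            // Blacklisted referrers match on IP alone;
            // all others also require an identical user agent.
            if (r.blacklisted || r.userAgent.equals(userAgent)) return r;
        }
        return null;
    }
}
```

In this sketch a non-null result for a blacklisted referrer would lead to silent filtering, while any other non-null result would also be reported in the list of uncertain referrers.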

Rejected Referrers

The list of rejected referrers shows all referrers whose requests were filtered from the server log. Entries are grouped by unique host name and sorted by descending number of requests per host name. Each entry lists all unique referrers for that host name. If a rejected host name appears in just one unique referrer, then only that referrer is listed.
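
The grouping and ordering of this list can be sketched as follows. This is an illustration only: the class RejectedReportSketch and its members are hypothetical, and breaking ties by host name is an assumption that merely keeps the output deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the rejected list; not the ReferFilter source. */
final class RejectedReportSketch {

    /** Request count and unique referrers for one rejected host name. */
    static final class HostEntry {
        final String host;
        int requests;
        final Set<String> referrers = new TreeSet<String>();

        HostEntry(String host) {
            this.host = host;
        }
    }

    private final Map<String, HostEntry> byHost = new HashMap<String, HostEntry>();

    /** Counts one rejected request and remembers its referrer. */
    void add(String host, String referrer) {
        HostEntry entry = byHost.get(host);
        if (entry == null) {
            entry = new HostEntry(host);
            byHost.put(host, entry);
        }
        entry.requests++;
        entry.referrers.add(referrer);
    }

    /** Returns all hosts, ordered by descending request count. */
    List<HostEntry> sorted() {
        List<HostEntry> list = new ArrayList<HostEntry>(byHost.values());
        Collections.sort(list, new Comparator<HostEntry>() {
            @Override
            public int compare(HostEntry a, HostEntry b) {
                if (a.requests != b.requests) {
                    return a.requests > b.requests ? -1 : 1;
                }
                return a.host.compareTo(b.host); // assumed tie-breaker
            }
        });
        return list;
    }

    public static void main(String[] args) {
        RejectedReportSketch report = new RejectedReportSketch();
        report.add("a.example", "http://a.example/1");
        report.add("b.example", "http://b.example/x");
        report.add("b.example", "http://b.example/y");
        for (HostEntry e : report.sorted()) {
            System.out.println(e.requests + " " + e.host + " " + e.referrers);
        }
    }
}
```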

This sort order helps you quickly identify important legitimate traffic sources that should be whitelisted. Over a 28-month period on my website, two thirds of all spam domains only sent requests in the single digits. So you’ll likely risk no more than a few false negatives if you examine only the top referrers and skip the long tail of this list entirely. Please also see the Referrer Filter project page for more details on my sample statistics.

4. Known Issues

None at this time, but please take note of the possibly unexpected implicit behavior described above.

5. Copyright Notice

All files – individual files, multi-file packages, and individual files contained in multi-file packages – that constitute the original distribution of ReferFilter are Copyright © 2013 by Christoph Nahr.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.