|Referrer Filter for Server Logs
|1.0.0 (initial release)
|04 May 2013
|Christoph Nahr (Copyright)
ReferFilter is a Java command-line application that should work on any system with a command prompt and a current JRE. The source code was written for JDK 7. The precompiled JAR file should be compatible with JRE 5/6, but I haven’t tested that.
Please see Oracle Java on Windows for information on how to avoid Oracle’s terrible default Windows JRE.
ReferFilter outputs large amounts of data to both stdout and stderr. Use the operator 2> to redirect stderr to a file, same as on typical Unix shells. See Microsoft’s Using command redirection operators for more details.
Server logs are usually stored in GZip archives, so you’ll want a program capable of unpacking those to stdout. You can use 7-Zip with the -so switch, or alternatively get the GnuWin port of GZip. The lightweight GnuWin tools are fully compatible with the Windows command prompt and file system. They don’t need an entire emulated Unix environment like Cygwin tools.
The directory to which the archive was unpacked contains the following files:
|Precompiled executable Java archive
|Sample configuration file with popular domains
|Windows batch file for rebuilding
|Java source code files for
Full source code is provided so you can inspect and change how ReferFilter operates. The Windows batch file compile.bat is provided for convenience and rebuilds the executable Java archive. This requires a JDK 7 installation in your
ReferFilter is executed from the command line, using the following inputs and outputs:
stdin provides the list of original server log entries to filter.
stdout receives the list of server log entries that passed the filter.
stderr receives error messages and diagnostic output, including all filtered referrers.
The configuration file is specified by a mandatory command line argument. The following usage samples show how to filter an uncompressed and compressed server log, respectively, with the included sample configuration file:
java -jar ReferFilter.jar sample.cfg <access.log >filter.log 2>message.log
gzip -cd access.log.gz | java -jar ReferFilter.jar sample.cfg >filter.log 2>message.log
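The stream contract described above (entries in on stdin, accepted entries out on stdout, diagnostics on stderr) can be sketched in a few lines of Java. This is only an illustration of the plumbing, not ReferFilter's actual code: the class name `StreamFilterSketch` and the trivial `accept` test are hypothetical placeholders.

```java
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

final class StreamFilterSketch {

    /** Hypothetical stand-in for the real filter: rejects entries containing "spam". */
    static boolean accept(String logEntry) {
        return !logEntry.contains("spam");
    }

    /**
     * Copies accepted entries to out and reports rejected ones on err.
     * Returns the number of accepted entries.
     */
    static int filter(Readable in, PrintStream out, PrintStream err) {
        Scanner scanner = new Scanner(in);
        int accepted = 0;
        while (scanner.hasNextLine()) {
            String entry = scanner.nextLine();
            if (accept(entry)) {
                out.println(entry);            // passes through to the filtered log
                accepted++;
            } else {
                err.println("rejected: " + entry); // diagnostic output only
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // stdin -> filter -> stdout, with diagnostics on stderr
        filter(new InputStreamReader(System.in, StandardCharsets.UTF_8),
               System.out, System.err);
    }
}
```

Because the two output streams are kept separate, the redirections `>filter.log` and `2>message.log` in the usage samples end up with the filtered log and the diagnostics in different files.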
The mandatory configuration file is read line-by-line. All leading and trailing spaces are trimmed. Empty lines and lines beginning with a hash (#) are ignored. Non-ASCII characters must use UTF-8 encoding. The first non-ignored line must contain the server log format. All other non-ignored lines are optional and define whitelisted or blacklisted domains.
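The reading rules above could be implemented roughly as follows. This is a minimal sketch under stated assumptions, not the parser that ships with ReferFilter: the class `ConfigSketch` and its field names are hypothetical, and domain filter lines are stored verbatim without interpreting them.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ConfigSketch {
    /** Compiled server log format, taken from the first non-ignored line. */
    final Pattern logFormat;
    /** All remaining non-ignored lines, kept as raw text. */
    final List<String> domainFilters = new ArrayList<String>();

    ConfigSketch(List<String> lines) {
        Pattern format = null;
        for (String raw : lines) {
            // Note: String.trim also removes tabs and other control characters.
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;                       // empty line or comment
            }
            if (format == null) {
                format = Pattern.compile(line); // first real line: log format
            } else {
                domainFilters.add(line);        // every other line: domain filter
            }
        }
        if (format == null) {
            throw new IllegalArgumentException("missing server log format");
        }
        logFormat = format;
    }

    public static void main(String[] args) throws IOException {
        // The configuration file is decoded as UTF-8, as the rules require.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        ConfigSketch config = new ConfigSketch(lines);
        System.out.println("log format: " + config.logFormat.pattern());
        System.out.println("domain filters: " + config.domainFilters);
    }
}
```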
The first non-ignored line defines the format of all server log entries, using a Java regular expression pattern. The format in the sample configuration file matches the enhanced Apache format used by 1&1 servers. I adapted the format from a simpler version in LogEval, an open-source Java server log analyzer and parser. Any server log entry that does not match the specified pattern results in an error message.
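To illustrate how a regular expression can describe a log entry, here is a sketch that matches the widely used Apache "combined" format and extracts the referrer. The pattern below is an assumption written for this example; it is not the pattern from the sample configuration file, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogFormatSketch {
    // Hypothetical pattern for the Apache "combined" format. Capturing groups:
    // 1 client IP, 2 identity, 3 user, 4 timestamp, 5 request line,
    // 6 status, 7 size, 8 referrer, 9 user agent.
    // Simplification: quoted fields must not contain escaped quotes.
    static final Pattern COMBINED = Pattern.compile(
        "(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

    /** Returns the referrer field, or null if the entry does not match the format. */
    static String referrerOf(String entry) {
        Matcher m = COMBINED.matcher(entry);
        return m.matches() ? m.group(8) : null;
    }

    public static void main(String[] args) {
        String entry = "203.0.113.7 - - [04/May/2013:10:00:00 +0200] "
            + "\"GET / HTTP/1.1\" 200 1234 \"http://example.com/page\" \"Mozilla/5.0\"";
        System.out.println(referrerOf(entry));          // prints http://example.com/page
        System.out.println(referrerOf("not a log entry")); // prints null
    }
}
```

An entry for which `matches()` returns false corresponds to the case that produces an error message.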
All other non-ignored lines define domain filters, one per line. These use a custom format with the following rules:
www. is always optionally accepted. Never specify it explicitly.
Here are some examples of these rules in action:
www.google.com, nothing else.
www.google.com, and any other subdomains such as
www.google.com, any other subdomains such as
plus.google.com, and any other superdomains such as
google.co.uk, but not
www.google (no superdomain).
The order in which domain filters are specified doesn’t matter. Domain filters act as a whitelist by default: any referrer header that specifies a matching domain is accepted. Preceding the line with an at sign (@) changes the filter to a blacklist. This blacklist applies only to ReferFilter’s heuristic of uncertain referrers.
In addition to the explicitly specified domain list, ReferFilter automatically accepts any requests with the following referrers:
-). We’re looking for referrer spam, so requests without a referrer are obviously fine.
http, e.g. links in locally saved HTML pages. The spam I’ve seen always includes
ReferFilter.isAccepted to change this behavior.
Aside from the filtered server log on stdout, ReferFilter produces a large volume of diagnostic output on stderr. This is what you’ll use to build your referrer whitelist. The output begins with overall statistics, followed by three lists of referrers: corrected, uncertain, and rejected.
ReferFilter employs a heuristic correction when a referrer header cannot be parsed using the Java URL constructor. Any substring up to the first colon (:), if any, is interpreted as a protocol. Any non-empty protocol that doesn’t contain http is implicitly accepted – whatever it is, it’s probably not spam. Otherwise, the heuristic skips any slashes (/) following the protocol, and then grabs all consecutive characters that are alphanumerical or dots. The result is interpreted as the host name. This approach seems to work well enough for accidentally or deliberately mangled URLs. Method Request.correctHost implements the heuristic.
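The steps of the heuristic can be sketched as follows. This is a reconstruction from the prose description only, not the actual body of Request.correctHost: the class name `HostHeuristicSketch`, the use of `null` to signal implicit acceptance, and the case-sensitive test for "http" are all assumptions.

```java
final class HostHeuristicSketch {

    /**
     * Extracts a host name from a mangled referrer.
     * Returns null if the referrer has a non-empty protocol without "http",
     * which means it is implicitly accepted and needs no host name.
     */
    static String correctHost(String referrer) {
        // Anything before the first colon counts as the protocol.
        int colon = referrer.indexOf(':');
        String protocol = (colon < 0) ? "" : referrer.substring(0, colon);
        if (!protocol.isEmpty() && !protocol.contains("http")) {
            return null;
        }

        // Skip any slashes that follow the protocol (or start at 0 without a colon).
        int start = colon + 1;
        while (start < referrer.length() && referrer.charAt(start) == '/') {
            start++;
        }

        // Grab consecutive letters, digits, and dots as the host name.
        // As described, hyphens and other characters end the host name.
        int end = start;
        while (end < referrer.length()) {
            char c = referrer.charAt(end);
            if (!Character.isLetterOrDigit(c) && c != '.') {
                break;
            }
            end++;
        }
        return referrer.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical mangled inputs, chosen only to exercise the sketch.
        System.out.println(correctHost("http:/example.com/page"));    // prints example.com
        System.out.println(correctHost("www.example.org/path"));      // prints www.example.org
        System.out.println(correctHost("android-app://com.example")); // prints null
    }
}
```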
Uncertainty arises when the same client IP first refers from a rejected domain, then shortly afterward from an accepted domain. This might indicate that the rejected referrer has directed legitimate traffic to your site and should be whitelisted. On the other hand, it might indicate an undeclared web crawler or camouflaged attack bot.
ReferFilter shows a sample of one accepted referrer for each rejected referrer where this happens. When a case isn’t clear from the sample, you should inspect the original server log for more context. The IP comparison considers the last 100 requests with rejected referrers. All new requests from the same IP are filtered out, including any repeat requests beyond the sample.
By default, ReferFilter only considers such requests uncertain if both the client IP and the user agent are identical. This helps avoid false positives for dynamic IPs. However, if you’re sure that a specific referrer is associated with undesirable bots, you can add it to the blacklist (precede domain with @). This has two consequences. First, request sequences originating from that referrer will be silently filtered, without appearing in the list of uncertain referrers. Second, repeat hits will be filtered based only on the client IP, even if the user agent is different. That’s because bots that obfuscate their referrers will also obfuscate their user agents.
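One way to picture the uncertainty check is a sliding window over the most recent rejected requests. The sketch below is an interpretation of the description above, not ReferFilter's actual data structures: the class `UncertaintySketch`, the `Verdict` names, and the exact matching on host strings are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

final class UncertaintySketch {
    enum Verdict { ACCEPT, UNCERTAIN, SILENTLY_FILTERED }

    /** One remembered request whose referrer was rejected. */
    private static final class Rejected {
        final String ip, userAgent, host;
        Rejected(String ip, String userAgent, String host) {
            this.ip = ip;
            this.userAgent = userAgent;
            this.host = host;
        }
    }

    static final int WINDOW = 100;   // only the last 100 rejected requests count
    private final Deque<Rejected> recent = new ArrayDeque<Rejected>();
    private final Set<String> blacklist;

    UncertaintySketch(Set<String> blacklistedHosts) {
        blacklist = new HashSet<String>(blacklistedHosts);
    }

    /** Remembers a request with a rejected referrer, evicting the oldest if full. */
    void recordRejected(String ip, String userAgent, String host) {
        if (recent.size() == WINDOW) {
            recent.removeLast();
        }
        recent.addFirst(new Rejected(ip, userAgent, host));
    }

    /** Classifies a request whose own referrer would otherwise be accepted. */
    Verdict checkAccepted(String ip, String userAgent) {
        for (Rejected r : recent) {              // iterates newest first
            if (!r.ip.equals(ip)) {
                continue;
            }
            if (blacklist.contains(r.host)) {
                return Verdict.SILENTLY_FILTERED; // blacklisted: the IP alone decides
            }
            if (r.userAgent.equals(userAgent)) {
                return Verdict.UNCERTAIN;         // default: IP and user agent must match
            }
        }
        return Verdict.ACCEPT;
    }
}
```

In this sketch a blacklisted host differs from an ordinary rejected host in two ways that mirror the two consequences listed above: the verdict is silent, and the user agent is not compared.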
The list of rejected referrers shows all referrers whose requests were filtered from the server log. It is sorted by descending number of requests from unique host names, each containing a list of all unique referrers per host name. If a rejected host name appears in just one unique referrer, then only that referrer is listed.
This sort order helps you quickly identify important legitimate traffic sources that should be whitelisted. Over a 28-month period on my website, two thirds of all spam domains only sent requests in the single digits. So you’ll likely risk no more than a few false negatives if you examine only the top referrers and skip the long tail of this list entirely. Please also see the Referrer Filter project page for more details on my sample statistics.
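The ordering of the rejected list can be sketched as a grouping step followed by a sort on request counts. This is an illustrative reconstruction, not ReferFilter's reporting code: the class `RejectedReportSketch`, the output layout, and the alphabetical tie-break between hosts with equal counts are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class RejectedReportSketch {
    /** Request count and unique referrers for one rejected host name. */
    static final class HostStats {
        final String host;
        int requests;
        final Set<String> referrers = new TreeSet<String>();
        HostStats(String host) { this.host = host; }
    }

    private final Map<String, HostStats> byHost = new HashMap<String, HostStats>();

    /** Records one rejected request. */
    void add(String host, String referrer) {
        HostStats stats = byHost.get(host);
        if (stats == null) {
            stats = new HostStats(host);
            byHost.put(host, stats);
        }
        stats.requests++;
        stats.referrers.add(referrer);
    }

    /** Returns report lines, hosts with the most requests first. */
    List<String> report() {
        List<HostStats> sorted = new ArrayList<HostStats>(byHost.values());
        Collections.sort(sorted, new Comparator<HostStats>() {
            @Override
            public int compare(HostStats a, HostStats b) {
                if (a.requests != b.requests) {
                    return Integer.compare(b.requests, a.requests); // descending count
                }
                return a.host.compareTo(b.host);                    // assumed tie-break
            }
        });
        List<String> lines = new ArrayList<String>();
        for (HostStats stats : sorted) {
            if (stats.referrers.size() == 1) {
                // Single unique referrer: list only that referrer.
                lines.add(stats.requests + " " + stats.referrers.iterator().next());
            } else {
                lines.add(stats.requests + " " + stats.host);
                for (String referrer : stats.referrers) {
                    lines.add("    " + referrer);
                }
            }
        }
        return lines;
    }
}
```

With the busiest hosts at the top, reviewing only the first few entries covers most of the rejected traffic.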
None at this time, but please take note of the possibly unexpected implicit behavior described above.
All files – individual files, multi-file packages, and individual files contained in multi-file packages – that constitute the original distribution of ReferFilter are Copyright © 2013 by Christoph Nahr.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.