Archive for the ‘Website’ Category

Web Server Logs

Wednesday, May 4th, 2005

It used to be that there was a fairly good correlation between “unique visitors” in your web server logs and, well, unique visitors – real people who had actually looked at your web site.

These days it seems that I get almost more junk traffic than real page views, with most of it being automated.

Here’s a list roughly ordered by volume.

  • Referrer spam. Thanks to blogging software that automatically displays a list of the top linking pages on the front page, it is worthwhile for spammers to make hundreds of bogus page requests with spoofed HTTP referer headers (usually a gambling website URL).
  • Comment spam. Even many months after removing Movable Type, spam bots are still hitting my non-existent mt-comments.cgi page. See my article on Combatting Comment Spam.
  • Deep linking of images. When an image is getting an order of magnitude more hits than the page it is displayed on, you know that someone has deep linked your images. This is easy enough to prevent, but most guides that show you how neglect to mention a side effect: if a visitor’s proxy strips the “referer” header from the HTTP request, or their user agent does not send the header at all, the image will not load even when they are viewing it on your own webpage.
  • RSS feed requests. Some of this is from RSS aggregator applications, and some seems to be from websites that aggregate RSS feeds or provide RSS search facilities.
  • Search engine spiders. Since the search engine wars hotted up, it seems that I am getting hits from Googlebot, MSNbot and Yahoo’s slurp multiple times per day.
  • Hacking attempts. I always find it amusing when someone runs IIS exploits against a Linux webserver running Apache.
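The deep linking caveat above can be sketched in a few lines. This is a minimal illustration rather than a drop-in server rule: the function name and the `ALLOWED_HOSTS` set are hypothetical, and the host names are placeholders. The point is simply that a request with no referer header is served rather than blocked.

```python
from urllib.parse import urlparse

# Hypothetical: host names that are allowed to embed our images.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def should_serve_image(referer):
    """Return True if an image request should be served.

    A missing or empty referer is allowed, so visitors whose proxy or
    user agent drops the header still see the images on our own pages.
    """
    if not referer:
        return True
    host = urlparse(referer).hostname
    return host in ALLOWED_HOSTS
```

The trade-off is that anyone who strips the header can still deep link, but that seems better than showing broken images to legitimate visitors.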

I need to find a decent web log analysing program that will filter all this junk out for me.
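As a rough idea of what such a filter might look like, here is a sketch that sorts lines from an Apache "combined" format log into the categories listed above. Everything in it is an assumption for illustration: the regular expression only handles well-formed lines, and the keyword lists (`SPAM_REFERER_WORDS`, `IIS_PROBES`, the feed suffixes) are examples that would need tuning for a real site.

```python
import re

# Matches the Apache "combined" log format (well-formed lines only).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative keyword lists; these are assumptions, not a complete set.
SPIDER_AGENTS = ("googlebot", "msnbot", "slurp")
SPAM_REFERER_WORDS = ("casino", "poker")
IIS_PROBES = ("cmd.exe", "root.exe", "default.ida")
FEED_SUFFIXES = (".rss", ".rdf", ".xml")


def classify(line):
    """Return a rough category name for one log line."""
    match = LOG_LINE.match(line)
    if match is None:
        return "unparsed"
    path = match["path"].lower()
    referer = match["referer"].lower()
    agent = match["agent"].lower()
    if any(probe in path for probe in IIS_PROBES):
        return "hacking attempt"
    if "mt-comments.cgi" in path:
        return "comment spam"
    if any(word in referer for word in SPAM_REFERER_WORDS):
        return "referrer spam"
    if any(name in agent for name in SPIDER_AGENTS):
        return "spider"
    if path.endswith(FEED_SUFFIXES):
        return "feed"
    return "visitor"
```

Counting only the lines classified as "visitor" would give a figure closer to real page views, although deep linked images would still need to be handled separately.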

Web Site Management

Sunday, January 9th, 2005

I am looking for a no-cost tool (or tools) that will crawl my website and validate my XHTML and CSS, and check all of my links. If I had a real Content Management System(tm), I am sure that it would handle this sort of thing for me but, as I do not and this should be a common problem, there has to be a solution out there somewhere…

Update: REL Link Checker Lite looks useful, but I am still looking for a total solution.

Combatting Comment Spam

Friday, January 7th, 2005

Automated comment spam is a huge problem for anyone using any of the popular blogging programs. For anyone sick of comments about cheap Viagra or fake Rolexes, here are my suggestions.

1. Make it hard to identify your website as a weblog.

The web is a target-rich environment; a Google search for mt-comments.cgi gives 3.7 million potential targets from one popular piece of software alone. As with any automated attack, being one of many vulnerable sites is not a protection.

Ideally, you want to make it hard for spammers to find your blog without making it hard for everyone else.

Most of the default templates that come with blogging software have a “Powered by {SoftwareName}” link on every page. This is the first thing I would remove from any template, along with any direct links to the file that handles comments. This is not something I want showing up in a Google search.

Turning off the option to notify other websites when you update (weblog ping) reduces spam, but it will also reduce the number of new people coming to your site.

2. Make spam hard to automate.

Making people log in before leaving comments is the most obvious solution to this problem, but it will obviously reduce the number of legitimate comments that are left. I don’t like the idea of discouraging casual comments.

I also think that IP blacklists are a bad idea. They prevent regular users who share a proxy with a spammer from leaving comments. They also have the same problem as any blacklist – the centrally maintained database is subject to poisoning.

Captchas aren’t foolproof, but I think they are the best solution at this time. No blogging tools have included a captcha implementation in the main source tree, but they all seem to have this option as a plug-in or mod. The main objection to captchas is that they disadvantage blind people – this is coming from people who have no problem denying all requests from a class C IP address range. Surely they couldn’t object to captchas if users were given the alternative of logging in if they couldn’t handle it.

A very temporary fix that seems to be effective against the majority of the current generation of spam bots is renaming the file that is used to post comments. While it is simple to write a bot that will handle this, until more people do this the spammers may not bother. It’s like the joke about the two men who stumble upon the hungry mountain lion. “I don’t have to outrun the lion, I just have to outrun you.”

3. Filter anything that does get through.

With the number of spam comments a well connected blog receives, sending all new comments to a queue for approval before being displayed is no longer a practical solution.

The current version of WordPress (version 1.2.2) has an option to automatically moderate comments that contain certain keywords or have more than a certain number of links. I would expect to see more detailed heuristic filtering like this in the future, with the obvious next step being Bayesian filtering.
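A keyword and link count check of this kind might look something like the sketch below. This is not the WordPress code, just an illustration of the idea; the word list, the `MAX_LINKS` threshold and the function name are all made up for the example.

```python
import re

# Illustrative values only; a real word list would be much longer.
BLOCKED_WORDS = {"viagra", "rolex"}
MAX_LINKS = 2  # hypothetical threshold

LINK_PATTERN = re.compile(r"https?://", re.IGNORECASE)


def needs_moderation(comment):
    """Return True if a comment should be held for approval."""
    text = comment.lower()
    # Hold anything that mentions a blocked keyword.
    if any(word in text for word in BLOCKED_WORDS):
        return True
    # Hold anything with more links than the threshold allows.
    return len(LINK_PATTERN.findall(comment)) > MAX_LINKS
```

Only the comments that trip one of the two rules go to the approval queue, so ordinary comments still appear straight away.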

4. Eliminate reward.

The main reason blogs – even those with very low traffic – receive comment spam is Google PageRank. The more pages that link to a website, the higher its PageRank, and the higher it appears in Google’s results for related search terms. Comment spam is a way of artificially inflating a site’s PageRank.

Some blogging software has started to redirect links in comments through an intermediate page, so they will receive no PageRank benefit. Unfortunately, spambots are too dumb to realise that you are doing this, and post spam comments anyway. Until everyone does this, it will not discourage spammers.
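For the curious, the link rewriting could be sketched roughly as follows. The `/redirect?url=` path and the function name are hypothetical, and the simple regular expression assumes comment links are written as double-quoted `href` attributes, so treat it as an illustration rather than how any particular blogging package does it.

```python
import re
from urllib.parse import quote

# Hypothetical path of the intermediate page on your own site.
REDIRECT_PREFIX = "/redirect?url="

# Assumes links appear as double-quoted href attributes.
HREF = re.compile(r'href="(https?://[^"]+)"', re.IGNORECASE)


def route_links_through_redirect(comment_html):
    """Rewrite external links so they point at the intermediate page."""

    def replace(match):
        # Percent-encode the whole target so it survives as one parameter.
        target = quote(match.group(1), safe="")
        return 'href="' + REDIRECT_PREFIX + target + '"'

    return HREF.sub(replace, comment_html)
```

The intermediate page itself would then need to send visitors on to the original address, which is not shown here.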