Web Stats Read Me

Includes:
  • How the web works
  • The general summary
  • Graphical statistics
  • File and directory reports

    How the web works

    [this section cut from the readme for
    analog web statistics software]

    This page is about what happens when somebody connects to your web site, and what statistics you can and can't calculate. There is a lot of confusion about this. It's not helped by statistics programs which claim to calculate things which cannot really be calculated, only estimated, with varying degrees of accuracy. The simple fact is that certain data which we are used to knowing for traditional print and even broadcast media are simply not available on the web.

    I should say that these ideas are not new to me. In particular, I can recommend four excellent articles about this subject: Interpreting WWW Statistics by Doug Linder; Making Sense of Web Usage Statistics by Dana Noonan; Getting Real about Usage Statistics by Tim Stehle; and, the most negative of all, Why Web Usage Statistics are (Worse Than) Meaningless by Jeff Goldberg.


    1. The basic model. Let's suppose I visit your web site. I follow a link from somewhere else to your front page, read some pages, and then follow one of your links out of your site.

    So, what do you know about it? First, I make one request for your front page. You know the date and time of the request and which page I asked for (of course), and the internet address of my computer (my host). I also usually tell you which page referred me to your site, and the make and model of my browser. I do not tell you my user name or my e-mail address.

    Next, I look at the page (or rather my browser does) to see if it's got any graphics on it. If so, and if I've got image loading turned on in my browser, I make a separate connection to retrieve each of these graphics. I never log into your site: I just make a sequence of requests, one for each new file I want to download. The referring page for each of these graphics is your front page. Maybe there are 10 graphics on your front page. Then so far I've made 11 requests to your server.

    After that, I go and visit some of your other pages, making a new request for each page and graphic that I want. Finally, I follow a link out of your site. You never know about that at all. I just connect to the next site without telling you.
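
    The facts sent with each request can be pictured as one line in the server's logfile. The sketch below assumes the widely used Combined Log Format, which this readme does not name, and the sample line, host name and URLs are invented for illustration.

```python
import re

# Combined Log Format (an assumption; the readme does not name a log format):
# host ident authuser [date] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Hypothetical log line for the front-page request described above.
sample = ('host1.example.com - - [10/Oct/1999:13:55:36 +0000] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://www.example.org/links.html" "Mozilla/4.08 [en] (Win98; I)"')

fields = parse_line(sample)
print(fields["host"], fields["when"], fields["path"])
print(fields["referrer"], "|", fields["browser"])
```

    Note what is missing: there is no field for a user name or an e-mail address, and nothing is recorded when the reader leaves for another site.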


    2. Caches. It's not always quite as simple as that. One major problem is caching. There are two major types of caching. First, my browser automatically caches files when I download them. This means that if I visit them again, the next day, say, I don't need to download the whole page again. Depending on the settings on my browser, I might check with you that the page hasn't changed: in that case, you do know about it, and analog will count it as a new request for the page. But I might set my browser not to check with you: then I will read the page again without you ever knowing about it.

    The other sort of cache is on a larger scale. I'm in the UK. Because the link across the Atlantic is sometimes very congested, we've set up a national cache. (Many individual ISPs also do the same thing.) I can set my browser to get your pages from the national cache instead of directly from you. If anyone else in the country has used the cache to look at your pages recently, the cache will have saved them, and will give them out to me without ever telling you about it. So hundreds of people could read your pages, even though you'd only sent them out once. Also, if the page I wanted wasn't already stored in the cache, the cache would ask for it from you on my behalf. This would mean that the request appeared to come from the cache, rather than from me. If several people did this, you would think that only one host was accessing your site, rather than lots of different ones.
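
    The effect of a shared cache can be imitated in a few lines. This is a toy model only: the class names and the host name are hypothetical, and a real cache would also expire pages and check them for changes.

```python
class OriginServer:
    """Stands in for your web server; it logs the host behind every request."""
    def __init__(self):
        self.log = []

    def get(self, requesting_host, path):
        self.log.append((requesting_host, path))
        return f"contents of {path}"

class SharedCache:
    """A hypothetical proxy cache that keeps every page it has fetched."""
    def __init__(self, name, origin):
        self.name = name
        self.origin = origin
        self.store = {}

    def get(self, path):
        if path not in self.store:
            # Cache miss: the origin sees the cache's host, not the reader's.
            self.store[path] = self.origin.get(self.name, path)
        return self.store[path]

origin = OriginServer()
cache = SharedCache("cache.example.ac.uk", origin)

readers = [f"reader{i}" for i in range(300)]
for reader in readers:
    cache.get("/index.html")

print(len(readers))      # 300 people read the page
print(len(origin.log))   # 1 request reached the server
print(origin.log[0][0])  # cache.example.ac.uk
```

    The server's log holds a single request from a single host, although 300 readers saw the page.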


    3. What you can know. The only things you can know for certain are the number of requests made to your server, when they were made, which files were asked for, and which host asked you for them.

    You can also know what people told you their browsers were, and what the referring pages were. You should be aware, though, that many browsers lie deliberately about what sort of browser they are, or even let users configure the browser name. Also, some browsers send incorrect referrers, telling you the last page that the user was on even if they weren't referred by that page.


    4. What you can't know.
    1. You can't tell the identity of your readers. Unless you explicitly require users to provide a password, you don't know who's connected or what their e-mail addresses are.
    2. You can't tell how many visitors you've had. You can guess by looking at the number of distinct hosts that have requested things from you. But this is not always a good estimate for three reasons. First, if users get your pages from a local cache server, you will never know about it. Secondly, sometimes many users connect from the same host: either users from the same company or ISP, or users using the same cache server. Finally, sometimes one user connects from many different hosts. In most countries, 'phone calls are not free. So users sometimes download one page, disconnect from their ISP, and then reconnect to follow a link: but when they reconnect, they will often be allocated a different hostname by their ISP. The same can happen if users access the web from their company through a firewall. Some ISPs even allocate users a different hostname for every request within a session.
    3. You can't tell how many visits you've had. Many programs, under pressure from advertisers' organisations, define a "visit" (or "session") as a sequence of requests from the same host until there is a half-hour gap. This is an unsound method for several reasons. First, it assumes that each host corresponds to a separate person and vice versa. This is simply not true in the real world, as discussed in the last paragraph. Secondly, it assumes that there is never a half-hour gap in a genuine visit. This is also untrue. I quite often follow a link out of a site, then step back in my browser and continue with the first site from where I left off. Should it really matter whether I do this 29 or 31 minutes later? Finally, to make the computation tractable, such programs also need to assume that your logfile is in chronological order: it isn't always, and analog will produce the same results however you jumble the lines up.
    4. You can't follow a person's path through your site. Even if you assume that each person corresponds one-to-one to a host, you don't know their path through your site. It's very common for people to go back to pages they've downloaded before. You never know about these subsequent visits to that page, because their browser has cached them. So you can't track their path through your site accurately.
    5. You often can't tell where they entered your site, or where they found out about you from. If they are using a cache server, they will often be able to retrieve your home page from their cache, but not all of the subsequent pages they want to read. Then the first page you know about them requesting will be one in the middle of their true visit.
    6. You can't tell how they left your site, or where they went next. They never tell you about their connection to another site, so there's no way for you to know about it.
    7. You can't tell how long people spent reading each page. The same comments apply as in the previous paragraph. You can't tell which pages they are reading between successive requests for pages. They might be reading some pages they downloaded earlier. They might have followed a link out of your site, and they might or might not return later. They might have interrupted their reading for a quick game of Minesweeper. You just don't know.
    The bottom line is that HTTP is a stateless protocol. People don't log in and retrieve several documents: they make a separate connection for each file they want. And a lot of the time they don't even behave as if they were logged into one site. Hence analog's emphasis on requests, rather than visits.
    I've presented a somewhat negative view on this page, emphasising what you can't find out. Web statistics are still informative: it's just important not to slip from "this page has received 30,000 requests" to "30,000 people have read this page."
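
    To see how arbitrary the half-hour rule for "visits" is, here is a minimal sketch of it. This is not something analog does; the function name and the sample data are hypothetical. The sketch sorts the requests first, which is exactly the chronological-order assumption mentioned above.

```python
from datetime import datetime, timedelta

def count_visits(requests, gap=timedelta(minutes=30)):
    """Count "visits": runs of requests from one host with no gap longer than `gap`.

    `requests` is an iterable of (host, datetime) pairs in any order.
    """
    last_seen = {}
    visits = 0
    for host, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(host)
        if previous is None or when - previous > gap:
            visits += 1
        last_seen[host] = when
    return visits

start = datetime(1999, 10, 10, 13, 0)
back_after_29 = [("host1", start), ("host1", start + timedelta(minutes=29))]
back_after_31 = [("host1", start), ("host1", start + timedelta(minutes=31))]

print(count_visits(back_after_29))  # 1
print(count_visits(back_after_31))  # 2
```

    The same reader doing the same thing is counted as one visit or two depending on a two-minute difference. Two different people behind one cache host would be merged into a single visit by the same rule.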

    The general summary

    This summary gives the headline numbers. The first number on each line is the total since the logfile was created; that date is listed above. The number in parentheses is the figure for the last week.

    Successful requests
    This is the total number of hits to your web site. A hit occurs any time someone downloads a file from your site. Each HTML file counts as a hit when someone reads it, and so does each graphic, sound, animation, or other file that they see or hear. For example, take a page made up of an HTML file, a logo, and a picture, with a link to a downloadable file. Just viewing the page is 3 hits (the HTML file, the logo, and the picture); clicking on the link and downloading the file is another hit. As you can see, this isn't a very useful number for tracking visitors, but it will give you a general idea of the traffic to your site.

    Average successful requests per day
    Same as above but averaged per day.

    Successful requests for pages
    This is the total number of HTML pages that have been requested. It does not include graphics, sounds, etc. This is a good number to look at to see whether people are actually viewing a large amount of the site. A high number here means either that more of the site has been viewed or that many people have viewed just the home page. The request report at the bottom of the web stats expands on this.
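
    The difference between hits and page requests can be shown with the example page described under "Successful requests". The file names are hypothetical, and the rule used to decide what counts as a page is an assumption made for this sketch.

```python
# Hypothetical requests for the example page: the HTML file, a logo, a
# picture, and then the linked file once the reader clicks on it.
requested_files = ["/products.html", "/logo.gif", "/photo.jpg", "/brochure.pdf"]

# Assumption: a "page" is any file whose name ends in .html or .htm, or a
# directory name ending in "/".
PAGE_SUFFIXES = (".html", ".htm", "/")

def is_page(path):
    """True if the requested path counts as a page under the rule above."""
    return path.lower().endswith(PAGE_SUFFIXES)

hits = len(requested_files)
page_requests = sum(1 for path in requested_files if is_page(path))

print(hits)           # 4
print(page_requests)  # 1
```

    One reader looking at one page and downloading one file produces 4 successful requests but only 1 successful request for a page.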

    Average successful requests for pages per day
    Same as above but averaged per day.

    Failed requests
    These occur either when something is wrong with part of your web site or when people try to go to a part of the site that isn't there anymore, for example by following a link from somewhere else to a page that you have removed. They also occur because search engines look for an optional instruction file called robots.txt, which may not exist.

    Redirected requests
    Indicates that the user was directed to a different file instead of the one requested. The most common cause is that the user has incorrectly requested a directory name without the trailing slash. The server replies with a redirection and the user then makes a second connection to get the correct document (although usually the browser does it automatically without the user's intervention or knowledge). The other common cause of redirected requests is their use in "click-thru" advertising banners.

    Distinct files requested
    This is the number of distinctly named files that have been served. This number won't change much unless you change the file names in your web site.

    Distinct hosts served
    Every computer on the internet has a unique address. This counts the number of unique addresses that have requested something from your site. It is only a rough guide to the number of people who have seen your web site, for the reasons given in "How the web works" above. At the very least, it correlates with the number of computers that have been to your web site.
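
    Both "distinct" counts amount to collecting the different values seen in the logfile. A minimal sketch, with hypothetical host and file names:

```python
# Hypothetical (host, file) pairs taken from a logfile.
requests = [
    ("host1.example.com", "/index.html"),
    ("host1.example.com", "/logo.gif"),
    ("cache.example.net", "/index.html"),
    ("host2.example.org", "/index.html"),
    ("cache.example.net", "/logo.gif"),
]

# A set keeps each value once, however many times it was requested.
distinct_hosts = {host for host, _ in requests}
distinct_files = {path for _, path in requests}

print(len(requests))        # 5
print(len(distinct_hosts))  # 3
print(len(distinct_files))  # 2
```

    The host cache.example.net counts once here, however many readers were behind it.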

    Data Transferred
    This is the amount of data sent to the browser from your web site. This number will grow as people view more of the site, or download files from the site.

    Average data transferred per day
    Same as above but averaged per day.
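
    As a rough illustration of how a total and a per-day average relate, the sketch below adds up some invented byte counts. How the averaging period is chosen is an assumption here; the report itself may define it differently.

```python
from datetime import date

# Hypothetical (day, bytes sent) pairs, one per successful request.
transfers = [
    (date(1999, 10, 10), 2326),
    (date(1999, 10, 10), 10240),
    (date(1999, 10, 11), 2326),
    (date(1999, 10, 13), 51200),
]

total_bytes = sum(size for _, size in transfers)

# Assumption: the average is taken over the whole period the logfile covers,
# including days with no requests at all.
first_day = min(day for day, _ in transfers)
last_day = max(day for day, _ in transfers)
days_covered = (last_day - first_day).days + 1

average_per_day = total_bytes / days_covered

print(total_bytes)      # 66092
print(days_covered)     # 4
print(average_per_day)  # 16523.0
```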

    Graphical statistics

    Monthly Report
    Graphs out the usage per month.

    Daily Report
    Graphs out the usage per day. This is good for seeing whether people go to your site during the week or at the weekend, which tells you something about their surfing habits.

    Hourly Report
    Graphs out the usage per hour of the day. The totals keep increasing each day as new requests are added. This lets you see at what time of day the most users visit you.
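
    A text version of such a graph is easy to sketch: requests from every day are added into 24 hourly totals, and each bar is scaled to the busiest hour. The function name, bar character and sample times are all hypothetical.

```python
from collections import Counter
from datetime import datetime

def hourly_report(timestamps, width=20):
    """Return one text line per hour of the day, with a bar scaled to the busiest hour."""
    counts = Counter(ts.hour for ts in timestamps)
    busiest = max(counts.values(), default=0)
    lines = []
    for hour in range(24):
        n = counts[hour]
        bar = "+" * round(width * n / busiest) if busiest else ""
        lines.append(f"{hour:02d}: {n:4d} {bar}")
    return lines

# Hypothetical request times spread over two days.
times = [
    datetime(1999, 10, 10, 9, 5), datetime(1999, 10, 10, 9, 40),
    datetime(1999, 10, 10, 14, 12), datetime(1999, 10, 11, 9, 1),
    datetime(1999, 10, 11, 14, 30), datetime(1999, 10, 11, 14, 31),
    datetime(1999, 10, 11, 14, 59), datetime(1999, 10, 11, 22, 15),
]

for line in hourly_report(times):
    print(line)
```

    With this data the longest bar is at 14:00, with 4 requests over the two days. Remember that the bars count requests, not people.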

    File and directory reports

    Domain Report
    This report is not used by our server.

    Directory Report
    Shows what directories have the most traffic in your web site.

    File Type Report
    Shows what file types have the most traffic.
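
    The Directory Report and the File Type Report both group the same requests, once by the first part of the path and once by the file extension. A minimal sketch follows; the helper names, the labels in square brackets and the sample paths are assumptions made for illustration.

```python
from collections import Counter
import posixpath

def directory_of(path):
    """Top-level directory of a requested path, or a label for files in the root."""
    parts = path.lstrip("/").split("/")
    return f"/{parts[0]}/" if len(parts) > 1 else "[root directory]"

def file_type_of(path):
    """Lower-cased file extension, or a label for directories and bare names."""
    if path.endswith("/"):
        return "[directories]"
    extension = posixpath.splitext(path)[1]
    return extension.lower() or "[no extension]"

# Hypothetical requested paths.
requests = [
    "/index.html", "/images/logo.gif", "/images/photo.jpg",
    "/docs/", "/docs/guide.html", "/images/logo.gif",
]

by_directory = Counter(directory_of(path) for path in requests)
by_type = Counter(file_type_of(path) for path in requests)

# List the busiest directories and file types first.
for name, count in by_directory.most_common():
    print(count, name)
for name, count in by_type.most_common():
    print(count, name)
```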

    Status Code Report
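    Shows the status codes that are returned to the browser by the server.
    This is getting into the details of how the HTTP protocol works. Here is a general overview of what the codes mean, in numerical order:

      • 1xx: informational
      • 2xx: the request succeeded
      • 3xx: redirection; the browser has to look somewhere else to complete the request
      • 4xx: client error, such as a request for a file that does not exist
      • 5xx: server error

    For a more complete overview of HTTP and these codes, see RFC 2068; the status codes are in section 10.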
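
    The status codes also decide which line of the general summary a request is counted under. The sketch below shows one plausible mapping; treating 304 (Not Modified) as successful is an assumption, and the function name and sample codes are hypothetical.

```python
from collections import Counter

def summary_line(status):
    """Map an HTTP status code to a line of the general summary.

    Assumption: 304 (Not Modified) counts as successful, because the reader
    still ends up with the file, taken from their own cache.
    """
    if 200 <= status < 300 or status == 304:
        return "successful"
    if 300 <= status < 400:
        return "redirected"
    if 400 <= status < 600:
        return "failed"
    return "other"

# Hypothetical status codes, one per request in the logfile.
statuses = [200, 200, 304, 301, 404, 200, 500, 404]

status_report = Counter(statuses)
summary = Counter(summary_line(code) for code in statuses)

for code, count in sorted(status_report.items()):
    print(code, count, summary_line(code))
print(dict(summary))
```

    With these sample codes, 4 requests are successful, 1 is redirected and 3 have failed.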

    Request Report
    This shows you which files on your site are the most popular ones. Very useful to see where users are going in your site and what they are looking at.