Welcome to Geeklog Friday, November 24 2017 @ 06:02 am EST


Status: offline

LWC

Forum User
Full Member
Registered: 19/02/2004
Posts: 818
I've just taken a look at my site's log and saw that my most popular Geeklog page is my calendar.

I don't use my calendar...

Is there some known security flaw in Geeklog's calendar which makes spammers look for it or something?

Status: offline

Dirk

Site Admin
Admin
Registered: 12/01/2002
Posts: 13073
Location:Stuttgart, Germany
Did you check who was visiting the calendar? Googlebot seems to love it, for example ...

bye, Dirk

Status: offline

LWC

Forum User
Full Member
Registered: 19/02/2004
Posts: 818
Actually, today alone, my referer log looks something like this:
PHP Formatted Code

http://www.google.com/search?q=...-> /article.php/...
http://my-own-domain/article.php/... -> /images/speck.gif
http://www.google.com/search?q=...-> /article.php/...
http://www.google.com/search?q=...-> /article.php/...
- -> /calendar.php
- -> /calendar.php
- -> /calendar.php

About 5 pages of "- -> /calendar.php" later...

- -> /calendar.php
- -> /calendar.php
- -> /calendar.php

And so on...
Until finally, when I checked it out myself:

http://my-own-domain/ -> /calendar.php

 


That's insane...
Also, since there's no referer, it means that it's not even internal (because that would have been shown as my-own-domain, like the final entry). It's as if someone goes there manually or via favorites!

Let's say I'm wrong and it is internal - it's not possible that Geeklog loads it every time anyone goes to ANY Geeklog page, is it? Because why would it take up like 90% of my referer log?

Status: offline

Dirk

Site Admin
Admin
Registered: 12/01/2002
Posts: 13073
Location:Stuttgart, Germany
Don't you have access to your webserver's logfiles? Without that, it's all speculation ...

As I said, Googlebot and the other search engine spiders seem to love the calendar:
PHP Formatted Code
lj2248.inktomisearch.com - - [05/Feb/2005:13:28:51 +0100] "GET /calendar.php?view=day&mode=&day=10&month=10&year=2003 HTTP/1.0" 200 17011 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
lj2035.inktomisearch.com - - [05/Feb/2005:17:00:20 +0100] "GET /calendar.php?month=1&year=2004 HTTP/1.0" 200 27367 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
lj2411.inktomisearch.com - - [05/Feb/2005:18:50:57 +0100] "GET /calendar.php?mode=&view=week&month=9&day=7&year=2003 HTTP/1.0" 200 11751 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
 

Etc., spread out over the entire day.

That's from a randomly selected logfile of one of my sites. It's Yahoo's spider in this case, but I've seen similar "walkthroughs" from other spiders, including Googlebot and msnbot.

bye, Dirk

Status: offline

LWC

Forum User
Full Member
Registered: 19/02/2004
Posts: 818
What do you mean? This was my webserver's logfile...
It's called referer.log and it's formatted like this:
left side --> right side

where "left side" = the referer and "right side" = the requested page.

So if "left side" is empty, that means there's no "official" referer - which means someone who came in manually or via favorites, but also someone who shut off his/her referer's header.
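
A quick way to check the "90% of my referer log" impression is to count the entries. The sketch below assumes the two-column layout described above (referer, an arrow, then the requested page), with "-" standing for an empty referer. The function name and the sample list are made up for illustration; the sample lines are copied from the excerpt earlier in the thread, elisions included.

```python
from collections import Counter


def empty_referrer_share(lines):
    """Count hits without a referer per target and their share of all entries."""
    no_referrer = Counter()
    total = 0
    for line in lines:
        # Split on the last "->"; the excerpt is not consistent about
        # the spaces around the arrow, so strip both halves afterwards.
        referrer, arrow, target = line.rpartition("->")
        if not arrow:
            continue  # not a log entry (blank line, commentary, ...)
        total += 1
        if referrer.strip() in ("", "-"):
            no_referrer[target.strip()] += 1
    share = sum(no_referrer.values()) / total if total else 0.0
    return no_referrer, share


# Lines taken from the referer.log excerpt quoted in the thread.
sample = [
    "http://www.google.com/search?q=...-> /article.php/...",
    "http://my-own-domain/article.php/... -> /images/speck.gif",
    "- -> /calendar.php",
    "- -> /calendar.php",
    "- -> /calendar.php",
    "http://my-own-domain/ -> /calendar.php",
]

counts, share = empty_referrer_share(sample)
print(counts.most_common(), share)
```

As the rest of the thread points out, a log in this format cannot say who sent those requests; it only shows how many arrived without a referer.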

Status: offline

Dirk

Site Admin
Admin
Registered: 12/01/2002
Posts: 13073
Location:Stuttgart, Germany
Quote by LWC: What do you mean? This was my webserver's logfile...

Sorry, never seen a logfile like that. Doesn't it list the user agent (aka browser)?

Quote by LWC: So if "left side" is empty, that means there's no "official" referer - which means someone who came in manually or via favorites, but also someone who shut off his/her referer's header.

If you check what I quoted from my logfiles above, you'll see that those also came in without any referrer (that's the "-" bit), which is common for search engine robots.

bye, Dirk
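
For readers who want to pull those fields apart, here is a minimal sketch of a parser for the "combined" format shown in Dirk's excerpt. The regular expression and the function name are assumptions written for this example, not part of Geeklog or the web server; the log line is the first one quoted above.

```python
import re

# host ident user [time] "request" status size "referrer" "user agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_combined(line):
    """Return the fields of one combined-format line as a dict, or None."""
    match = COMBINED.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # A literal "-" is how the server records a missing Referer header.
    if entry["referrer"] == "-":
        entry["referrer"] = None
    return entry


line = (
    'lj2248.inktomisearch.com - - [05/Feb/2005:13:28:51 +0100] '
    '"GET /calendar.php?view=day&mode=&day=10&month=10&year=2003 HTTP/1.0" '
    '200 17011 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
    'http://help.yahoo.com/help/us/ysearch/slurp)"'
)

entry = parse_combined(line)
print(entry["host"], entry["referrer"], entry["agent"])
```

The first field is the visitor's host and the last quoted field is the user agent; the referrer is the quoted field just before it, which is empty ("-") here.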

Status: offline

LWC

Forum User
Full Member
Registered: 19/02/2004
Posts: 818
Your format is the "combined" one, which I guess is the most popular one, but there are others like mine - two separate logs - referrer log and browser log.

But if they have no referer, what's inktomisearch.com?

Anyway, so what you're saying is that spiders, respected (e.g. Google, MSNBot, etc.) or otherwise (i.e. spammers) have no referers.
And they all come to stare at my blank calendar...

Well, maybe if I start writing a calendar, my site would be popular.

Status: offline

Dirk

Site Admin
Admin
Registered: 12/01/2002
Posts: 13073
Location:Stuttgart, Germany
Quote by LWC: what's inktomisearch.com?

They're owned by Yahoo and provide the search results for them.

Everyone's talking about Google vs. MSN, but Yahoo has quietly acquired a lot of search engine technology (Inktomi, Overture, AltaVista, ...) to do their own thing ...

Quote by LWC: Anyway, so what you're saying is that spiders, respected (e.g. Google, MSNBot, etc.) or otherwise (i.e. spammers) have no referers.

Actually, spammers DO have referrers. Have you never wondered about all those porn and poker sites supposedly linking to you? Referrer spam is also very popular these days.

bye, Dirk

Status: offline

LWC

Forum User
Full Member
Registered: 19/02/2004
Posts: 818
That's all nice and well, but, as you could see, when it comes to the calendar, the referers are empty.

Of course the log contained spam referers, but none that came to the calendar (usually just to the front page).

Is it a coincidence that usually the referer is logged, unless the accessed page is the calendar?

For example, if someone had kept it in his/her favorites and clicked it 500 times, that would have explained it.

Like you said, if it was Googlebot, it would say something like:
PHP Formatted Code

http://googlebot.google.com - -> /calendar.php

 

Status: offline

Dirk

Site Admin
Admin
Registered: 12/01/2002
Posts: 13073
Location:Stuttgart, Germany
grumpy
Quote by LWC: That's all nice and well, but, as you could see, when it comes to the calendar, the referers are empty.

Yes, because that's what the search engine spiders do. I thought we already covered that?

bye, Dirk

Status: offline

LWC

Forum User
Full Member
Registered: 19/02/2004
Posts: 818
Then I assume "inktomisearch.com" was the visitor and not the referer (which I mistook it for in the earlier posts).
So in my case, it would have appeared only in the browser log.

Basically, that means that with separate logs, like I have, one can never match referers with visitors.

Which means one can't pinpoint the visitors (with empty referers) that visit the calendar...

Status: offline

tstockma

Forum User
Full Member
Registered: 22/07/2003
Posts: 169
Some answers, and a question

Spiders and Search Engines try to find all your pages, and every day on your calendar is a link, even if you have nothing listed. They'll cheerfully spend eternity following up all the "links" your empty calendar contains.

You can use robots.txt to exclude search engines from your calendar entirely, but if you have some events, you'll exclude those events from the SEs.

My question: how can I turn off the link to a day on the main calendar page if there's nothing listed? That link is what the calendar currently uses to allow entry of a new event, but could we set up a more generic "add calendar event" function that we could exclude with robots.txt, where you specify the date as part of that process?

That would eliminate the SE hits we currently see. (It would also help my "broken links" searcher considerably, it's stupid and never stops looking at empty calendar days.)

Sorry if this has been asked before...thanks for any comments!
Tom www.southparkcity.com

Status: offline

LWC

Forum User
Full Member
Registered: 19/02/2004
Posts: 818
First of all, Dirk will be glad to know that my access log now also shows referrers (in addition to the specialized referrer log...). And I think you're right - a sample check showed that Googlebot is the main visitor of the calendar.

But the sample check also revealed that there's a MUCH (and I mean MUCH) worse file - submit.php! And again, it's Googlebot that won't stop visiting it!

Which brings us to tstockma...
[quote ...a snippet from my robots.txt]
User-agent: *
.
.
.
Disallow: /calendar.php
.
.
.
Disallow: /submit.php
.
.
.
[/quote]
And yet, these two are bombed with hits...

Before you blame the file, I try to test my robots.txt with a personal search engine from time to time to make sure it's valid.

So if even a "respected" bot like Google's ignores robots.txt anyway, I don't know if it's worth coming up with a solution to your request.

BTW, I've known for a long time that Google ignores my robots.txt file (and never managed to change that), but what I didn't know was how much traffic its bot causes those forbidden files!

Status: offline

Dirk

Site Admin
Admin
Registered: 12/01/2002
Posts: 13073
Location:Stuttgart, Germany
Quote by LWC: BTW, I've known for a long time that Google ignores my robots.txt file (and never managed to change that), but what I didn't know was how much traffic its bot causes those forbidden files!

Hmm, I would be surprised if it ignored the robots.txt.

Are you sure your robots.txt is syntactically correct? Did you check it?

Also, are you sure it's really Googlebot and not some other bot claiming to be Googlebot? Check the IPs it's coming from - they should all belong to Google.

bye, Dirk
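
The IP check can be scripted. The sketch below follows the usual two-step verification: reverse-resolve the address, make sure the name ends in googlebot.com or google.com, then resolve that name forward and confirm it maps back to the same address. The function name, the injectable lookup parameters, and the stub resolvers (with documentation-only 192.0.2.x addresses) are hypothetical and exist only so the example runs offline; with the defaults it performs real DNS lookups.

```python
import socket


def is_verified_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the domain, then forward-resolve to confirm."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no reverse DNS entry at all
    if not hostname.lower().endswith((".googlebot.com", ".google.com")):
        return False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False  # the claimed name does not resolve
    # The forward lookup must lead back to the address that made the request.
    return ip in addresses


# Hypothetical stub resolvers so the sketch can be tried without a network.
FAKE_REVERSE = {
    "192.0.2.10": "crawl-192-0-2-10.googlebot.com",  # genuine-looking crawler
    "192.0.2.50": "crawl-192-0-2-50.googlebot.com",  # spoofed reverse record
    "192.0.2.99": "host.example.com",                # ordinary visitor
}
FAKE_FORWARD = {"crawl-192-0-2-10.googlebot.com": ["192.0.2.10"]}


def fake_reverse(ip):
    if ip not in FAKE_REVERSE:
        raise socket.herror("unknown host")
    return (FAKE_REVERSE[ip], [], [ip])


def fake_forward(hostname):
    if hostname not in FAKE_FORWARD:
        raise socket.gaierror("unknown host")
    return (hostname, [], FAKE_FORWARD[hostname])


for address in ("192.0.2.10", "192.0.2.50", "192.0.2.99"):
    print(address, is_verified_googlebot(address, fake_reverse, fake_forward))
```

The forward lookup matters because anyone who controls the reverse DNS for their own address range can make it claim a googlebot.com name; only the real owner of that domain can make the name point back.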

Status: offline

LWC

Forum User
Full Member
Registered: 19/02/2004
Posts: 818
[quote ...a sample from the access log]
crawl-66-249-65-137.googlebot.com - - [06/Mar/2005:00:06:19 -0700] "GET /submit.php?type=event&mode=&month=04&day=11&year=2002&hour=16 HTTP/1.1" 200 14840 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
crawl-66-249-65-137.googlebot.com - - [06/Mar/2005:00:21:17 -0700] "GET /submit.php?type=event&mode=&month=05&day=07&year=2002&hour=14 HTTP/1.1" 200 14840 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
crawl-66-249-65-137.googlebot.com - - [06/Mar/2005:00:28:14 -0700] "GET /submit.php?type=event&mode=&month=07&day=26&year=2002&hour=23 HTTP/1.1" 200 14840 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[/quote]
What you see there can go for pages!
Hmm, when I look at it, Googlebot never uses referrers.

And your site has approved my robots.txt

Status: offline

Dirk

Site Admin
Admin
Registered: 12/01/2002
Posts: 13073
Location:Stuttgart, Germany
Yep, that's Googlebot. In that case I think you should email them and tell them that Googlebot has been a bad boy.

If you go to this URL, it already has an option "Googlebot is overloading my servers".

Keep us posted, I'd be interested in their response.

bye, Dirk

Status: offline

LWC

Forum User
Full Member
Registered: 19/02/2004
Posts: 818
I feel there's too much risk involved.

The "overloading" form states that sending it would lead to Google visiting my site less often.
That's not what I want at all! My site is not updated enough in Google as it is (the problems are that it ignores my robots.txt and never drops outdated pages)...

These are probably automated forms, so even if I tell them that, the only one who'd read it is some robot, which won't care about anything I write other than that I filled in a "please send me less traffic" form.

Status: offline

RickW

Forum User
Full Member
Registered: 28/01/2004
Posts: 240
Location:United States
Do what I did and remove the calendar files:

http://www.geeklog.net/forum/viewtopic.php?forum=3&showtopic=38627

I had the same problem - I wasn't using the calendar, it was just an empty feature I didn't want visitors to see, and google was spidering it like crazy. I had 1000+ calendar pages in link:to-my-site.com, but they had zero page rank because it was all redundant and I had a feeling it was actually penalizing my site (google is sensitive to spamming).

This is what my robots.txt file looks like:

PHP Formatted Code
User-agent: *
Disallow: /comment.php
Disallow: /submit.php
Disallow: /profiles.php
Disallow: /calendar.php
Disallow: /usersettings.php
Disallow: /forum/createtopic.php


 


Even after a proper robots.txt file, and removing the calendar, if you want to cover all your bases you could add this to your header template in with the other meta tags:

PHP Formatted Code

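<?php
// Name of the script rendering the current page, e.g. "calendar.php".
$this_page = basename($_SERVER['PHP_SELF']);

switch ($this_page) {
    // Utility pages that should stay out of search engine indexes.
    case 'comment.php':
    case 'submit.php':
    case 'profiles.php':
    case 'calendar.php':
    case 'usersettings.php':
    case 'createtopic.php':
        echo '<meta name="robots" content="noindex, nofollow, noarchive">';
        break;

    // Everything else may be indexed and followed as usual.
    default:
        echo '<meta name="robots" content="index, follow">';
}
?>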



 


It works. Just for kicks, I'm testing it on my temp site now.
www.antisource.com

Status: offline

ronack

Forum User
Full Member
Registered: 27/05/2003
Posts: 612
I just wanted to share my robots.txt file. I watched that blasted Googlebot and really don't want it in some areas. Oh, and the MSNbot was the worst; I had to Disallow the MSNbot completely. When the MSNbot hit my sites, it stayed for a very long time and hit every day in the calendar and every photo (one of my sites has over 1000 photos). My server slowed down to a crawl and sometimes froze up. Since I added the robots.txt file I rarely have a slowdown or lockup.

PHP Formatted Code
User-agent: *
Disallow: /comment.php
Disallow: /submit.php
Disallow: /forum/createtopic.php
Disallow: /calendar.php
Disallow: /admin/
Disallow: /layout/
Disallow: /images/
Disallow: /stats/
Disallow: /search.php


User-agent: msnbot
Disallow: /
 

Status: offline

LWC

Forum User
Full Member
Registered: 19/02/2004
Posts: 818
Well, if we're all sharing, here's my full robots.txt:
PHP Formatted Code

User-agent: Googlebot
Disallow: /*.php/*/print$

User-agent: *
Disallow: /calendar.php
Disallow: /comment.php
Disallow: /index.php?topic=
Disallow: /pollbooth.php?qid=
Disallow: /portal.php
Disallow: /profiles.php
Disallow: /search.php
Disallow: /submit.php
Disallow: /stats.php
Disallow: /users.php
Disallow: /admin/
Disallow: /chatterblock/
Disallow: /filemgmt/brokenfile.php
Disallow: /filemgmt/downloadhistory.php
Disallow: /filemgmt/ratefile.php
Disallow: /filemgmt/viewcat.php
Disallow: /filemgmt/visit.php

 

Update:
I've just checked and most of these are on Google despite the fact that they're mentioned in my robots.txt!

Ok, when I think about it, I've recently switched my site from http://lior.weissbrod.com/lior (synonymous with http://www.weissbrod.com/lior - i.e. just a simple redirect) to simply http://lior.weissbrod.com (virtually simulating a standalone site).
And it turns out that most of the forbidden pages are from the former - now outdated - site!

When I did the switch, I assumed that Google would be smart enough to omit all the outdated "/lior/" results by itself because they all now give 404 errors. I guess I was wrong...

Well, before you think this explains everything, this is just the general case. In some cases, Google indexes them on the new site (for example, the calendar) so it really does ignore my robots.txt (at least sometimes).
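
One possible explanation, which nobody in the thread confirms: a crawler that finds a group addressed to it by name follows only that group and does not merge it with the "User-agent: *" group. In the file above, the Googlebot group contains only the print rule, so under that reading Googlebot was never told to stay away from calendar.php or submit.php. The sketch below checks this with Python's standard urllib.robotparser on a shortened copy of the file. Treat the parser's verdict as an approximation of what a real crawler does; it may not interpret the "*" and "$" wildcards in the print rule, which does not affect the two paths tested here.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the robots.txt quoted above: the Googlebot group
# plus two of the rules from the catch-all group.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /*.php/*/print$

User-agent: *
Disallow: /calendar.php
Disallow: /submit.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Compare a crawler that has its own group with one that falls back to "*".
for agent in ("Googlebot", "msnbot"):
    for path in ("/calendar.php", "/submit.php"):
        print(agent, path, parser.can_fetch(agent, path))
```

If this is the cause, repeating the Disallow lines inside the Googlebot group (or dropping that group) should make the file mean what was intended. It would not explain the stale "/lior/" results, which come from the old site layout.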
