Geeklog Forums
Calendar? What calendar?
Status: offline
LWC
Forum User
Full Member
Registered: 02/19/04
Posts: 818
I've just taken a look at my site's log and saw that my most popular Geeklog page is my calendar.
I don't use my calendar...
Is there some known security flaw in Geeklog's calendar which makes spammers look for it or something?
Status: offline
LWC
Forum User
Full Member
Registered: 02/19/04
Posts: 818
Actually, today alone, my referer log looks something like this:
http://www.google.com/search?q=...-> /article.php/...
http://my-own-domain/article.php/... -> /images/speck.gif
http://www.google.com/search?q=...-> /article.php/...
http://www.google.com/search?q=...-> /article.php/...
- -> /calendar.php
- -> /calendar.php
- -> /calendar.php
About 5 pages of "- -> /calendar.php" later...
- -> /calendar.php
- -> /calendar.php
- -> /calendar.php
And so on...
Until finally, when I checked it out myself:
http://my-own-domain/ -> /calendar.php
That's insane...
Also, since there's no referer, it means that it's not even internal (because that would have shown up as my-own-domain, like the final entry). It's as if someone goes there manually or via favorites!
Let's say I'm wrong and it is internal - it's not possible that Geeklog loads it every time anyone goes to ANY Geeklog page, is it? Because why would it take up like 90% of my referer log?
Status: offline
Dirk
Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Don't you have access to your webserver's logfiles? Without that, it's all speculation ...
As I said, Googlebot and the other search engine spiders seem to love the calendar:
lj2248.inktomisearch.com - - [05/Feb/2005:13:28:51 +0100] "GET /calendar.php?view=day&mode=&day=10&month=10&year=2003 HTTP/1.0" 200 17011 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
lj2035.inktomisearch.com - - [05/Feb/2005:17:00:20 +0100] "GET /calendar.php?month=1&year=2004 HTTP/1.0" 200 27367 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
lj2411.inktomisearch.com - - [05/Feb/2005:18:50:57 +0100] "GET /calendar.php?mode=&view=week&month=9&day=7&year=2003 HTTP/1.0" 200 11751 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
Etc., spread out over the entire day.
That's from a randomly selected logfile of one of my sites. It's Yahoo's spider in this case, but I've seen similar "walkthroughs" from other spiders, including Googlebot and msnbot.
bye, Dirk
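A minimal sketch of how hits like these could be tallied from a combined-format access log, assuming Python 3 is available. The function name `count_agents` and the regular expression are illustrative and not part of Geeklog or the webserver; the two sample lines are copied from the log excerpt quoted in this post.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def count_agents(lines, path="/calendar.php"):
    """Count hits on `path` (query string ignored) per (referer, user agent)."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a combined-format line, skip it
        if match["url"].split("?", 1)[0] == path:
            hits[(match["referer"], match["agent"])] += 1
    return hits

# Sample lines taken from the log excerpt in this post.
sample = [
    'lj2035.inktomisearch.com - - [05/Feb/2005:17:00:20 +0100] '
    '"GET /calendar.php?month=1&year=2004 HTTP/1.0" 200 27367 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    'lj2411.inktomisearch.com - - [05/Feb/2005:18:50:57 +0100] '
    '"GET /calendar.php?mode=&view=week&month=9&day=7&year=2003 HTTP/1.0" 200 11751 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
]

for (referer, agent), count in count_agents(sample).most_common():
    print(count, repr(referer), agent)
```

Grouping by user agent rather than by referer is the point here: a spider shows up with "-" as its referer, so only the agent (or the hostname) tells you who it is.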
Status: offline
LWC
Forum User
Full Member
Registered: 02/19/04
Posts: 818
What do you mean? This was my webserver's logfile...
It's called referer.log and it's formatted like this:
left side --> right side
where "left side" = the referer and "right side" = the requested page.
So if "left side" is empty, that means there's no "official" referer - which means someone who came in manually or via favorites, but also someone who shut off his/her referer's header.
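A rough sketch of how such a referer log could be summarized per requested page, assuming Python 3. The helper name `empty_referer_share` is made up for illustration, and the separator handling is an assumption: it accepts both "->" and "-->" because the thread shows both spellings. The sample lines are copied from the referer log quoted earlier in this thread.

```python
import re
from collections import Counter

# Assumed separator: optional spaces, one or more dashes, then ">".
SEPARATOR_RE = re.compile(r"\s*-+>\s*")

def empty_referer_share(lines):
    """Return {page: (hits_without_referer, total_hits)} for a referer log."""
    total = Counter()
    empty = Counter()
    for line in lines:
        parts = SEPARATOR_RE.split(line.strip(), maxsplit=1)
        if len(parts) != 2:
            continue  # not a "referer -> page" line
        referer, page = parts[0].strip(), parts[1].strip()
        total[page] += 1
        if referer in ("", "-"):
            empty[page] += 1
    return {page: (empty[page], total[page]) for page in total}

# Sample lines taken from the referer log quoted earlier in the thread.
sample = [
    "http://www.google.com/search?q=...-> /article.php/...",
    "http://my-own-domain/article.php/... -> /images/speck.gif",
    "- -> /calendar.php",
    "- -> /calendar.php",
    "http://my-own-domain/ -> /calendar.php",
]

for page, (without, hits) in empty_referer_share(sample).items():
    print(f"{page}: {without} of {hits} hits had no referer")
```

This only shows which pages attract referer-less requests; it cannot say who sent them, since the referer log carries no hostname or user agent.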
Status: offline
Dirk
Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by LWC: What do you mean? This was my webserver's logfile...
Sorry, never seen a logfile like that. Doesn't it list the user agent (aka browser)?
Quote by LWC: So if "left side" is empty, that means there's no "official" referer - which means someone who came in manually or via favorites, but also someone who shut off his/her referer's header.
If you check what I quoted from my logfiles above, you'll see that those also came in without any referrer (that's the "-" bit), which is common for search engine robots.
bye, Dirk
Status: offline
LWC
Forum User
Full Member
Registered: 02/19/04
Posts: 818
Your format is the "combined" one, which I guess is the most popular one, but there are others like mine - two separate logs - referrer log and browser log.
But if they have no referer, what's inktomisearch.com?
Anyway, so what you're saying is that spiders, respected (e.g. Google, MSNBot, etc.) or otherwise (i.e. spammers) have no referers.
And they all come to stare at my blank calendar...
Well, maybe if I start writing a calendar, my site would be popular.
Status: offline
Dirk
Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by LWC: what's inktomisearch.com?
They're owned by Yahoo and provide the search results for them.
Everyone's talking about Google vs. MSN, but Yahoo has quietly acquired a lot of search engine technology (Inktomi, Overture, AltaVista, ...) to do their own thing ...
Quote by LWC: Anyway, so what you're saying is that spiders, respected (e.g. Google, MSNBot, etc.) or otherwise (i.e. spammers) have no referers.
Actually, spammers DO have referrers. Have you never wondered about all those porn and poker sites supposedly linking to you? Referrer spam is also very popular these days.
bye, Dirk
Status: offline
LWC
Forum User
Full Member
Registered: 02/19/04
Posts: 818
That's all nice and well, but, as you could see, when it comes to the calendar, the referers are empty.
Of course the log contained spam referers, but none that came to the calendar (usually just to the front page).
Is it a coincidence that usually the referer is logged, unless the accessed page is the calendar?
For example, if someone had kept it in his/her favorites and clicked it 500 times, that would have explained it.
Like you said, if it was Googlebot, it would say something like:
http://googlebot.google.com - -> /calendar.php
Quote by LWC: That's all nice and well, but, as you could see, when it comes to the calendar, the referers are empty.
Yes, because that's what the search engine spiders do. I thought we already covered that?
bye, Dirk
Status: offline
LWC
Forum User
Full Member
Registered: 02/19/04
Posts: 818
Then I assume "inktomisearch.com" was the visitor and not the referer (which I mistook it for in the earlier posts).
So in my case, it would have appeared only in the browser log.
Basically, that means that with separate logs, like I have, one can never match referers with visitors.
Which means one can't pinpoint the visitors (that have empty referers) that visit the calendar...
Status: offline
tstockma
Forum User
Full Member
Registered: 07/22/03
Posts: 169
Some answers, and a question
Spiders and Search Engines try to find all your pages, and every day on your calendar is a link, even if you have nothing listed. They'll cheerfully spend eternity following up all the "links" your empty calendar contains.
You can use robots.txt to exclude search engines from your calendar entirely, but if you have some events, you'll exclude those events from the SEs.
My question: how can I turn off the link to a day on the main calendar page if there's nothing listed? That link is what Calendar currently uses to allow entry of a new event, but could we set up a more generic "add calendar event" function that we could exclude with robots.txt, where you specify the date when you enter that process?
That would eliminate the SE hits we currently see. (It would also help my "broken links" searcher considerably; it's stupid and never stops looking at empty calendar days.)
Sorry if this has been asked before...thanks for any comments!
Tom
www.southparkcity.com
Status: offline
LWC
Forum User
Full Member
Registered: 02/19/04
Posts: 818
First of all, Dirk would be glad to know that my access log now also shows referrers (in addition to the specialized referrers log...). And I think you're right - a sample check showed that Googlebot is the main visitor of the calendar.
But the sample check also revealed that there's a MUCH (and I mean MUCH) worse file - submit.php! And again, it's Googlebot that won't stop visiting it!
Which brings us to tstockma...
[quote ...a snippet from my robots.txt]
User-agent: *
.
.
.
Disallow: /calendar.php
.
.
.
Disallow: /submit.php
.
.
.
[/quote]
And yet, these two are bombed with hits...
Before you blame the file, I try to test my robots.txt with a personal search engine from time to time to make sure it's valid.
So if even a "respected" bot like Google's ignores robots.txt anyway, I don't know if it's worth coming up with a solution to your request.
BTW, I've known for a long time that Google ignores my robots.txt file (and never managed to change that), but what I didn't know was how much traffic its bot causes those forbidden files!
Status: offline
Dirk
Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by LWC: BTW, I've known for a long time that Google ignores my robots.txt file (and never managed to change that), but what I didn't know was how much traffic its bot causes those forbidden files!
Hmm, I would be surprised if it ignored the robots.txt.
Are you sure your robots.txt is syntactically correct? Did you check it?
Also, are you sure it's really Googlebot and not some other bot claiming to be Googlebot? Check the IPs it's coming from - they should all belong to Google.
bye, Dirk
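One way to run that IP check is a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, make sure the hostname is under a Google domain, then resolve that hostname back and confirm it returns the same IP. The sketch below assumes Python 3; the function name `is_google_crawler` and the suffix list are assumptions for illustration (the suffixes reflect hostnames of the form crawl-66-249-65-137.googlebot.com as they appear in access logs). The demo uses stub resolvers with made-up answers so it runs without network access; `gethostbyname_ex` covers IPv4 only.

```python
import socket

# Assumed list of domains Google's crawler hostnames end in.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_crawler(ip,
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no reverse DNS entry
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False  # hostname is not under a Google domain
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False  # hostname does not resolve
    return ip in addresses  # forward lookup must lead back to the same IP

# Stub resolvers (NOT real DNS answers) so the example runs offline.
def stub_reverse(ip):
    return ("crawl-66-249-65-137.googlebot.com", [], [ip])

def stub_forward(hostname):
    return (hostname, [], ["66.249.65.137"])

print(is_google_crawler("66.249.65.137", stub_reverse, stub_forward))
print(is_google_crawler("10.0.0.1", stub_reverse, stub_forward))
```

Calling `is_google_crawler(ip)` without the stub arguments performs real lookups. The user agent string alone proves nothing, because any client can send "Googlebot" in that header.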
Status: offline
LWC
Forum User
Full Member
Registered: 02/19/04
Posts: 818
[quote ...a sample from the access log]
crawl-66-249-65-137.googlebot.com - - [06/Mar/2005:00:06:19 -0700] "GET /submit.php?type=event&mode=&month=04&day=11&year=2002&hour=16 HTTP/1.1" 200 14840 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
crawl-66-249-65-137.googlebot.com - - [06/Mar/2005:00:21:17 -0700] "GET /submit.php?type=event&mode=&month=05&day=07&year=2002&hour=14 HTTP/1.1" 200 14840 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
crawl-66-249-65-137.googlebot.com - - [06/Mar/2005:00:28:14 -0700] "GET /submit.php?type=event&mode=&month=07&day=26&year=2002&hour=23 HTTP/1.1" 200 14840 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[/quote]
What you see there can go for pages!
Hmm, when I look at it, Googlebot never uses referrers.
And your site has approved my robots.txt
Status: offline
Dirk
Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Status: offline
LWC
Forum User
Full Member
Registered: 02/19/04
Posts: 818
I feel there's too much risk involved.
The "overloading" form states that submitting it would lead to Google visiting my site less often.
That's not what I want at all! My site is not updated enough in Google as it is (the problems are that it ignores my robots.txt and never drops outdated pages)...
These are probably automated forms, so even if I tell them that, the only one who'd read it is some robot, which won't care about anything I write other than the fact that I filled in a "please send me less traffic" form.
Status: offline
RickW
Forum User
Full Member
Registered: 01/28/04
Posts: 240
Location:United States
Do what I did and remove the calendar files:
http://www.geeklog.net/forum/viewtopic.php?forum=3&showtopic=38627
I had the same problem - I wasn't using the calendar, it was just an empty feature I didn't want visitors to see, and google was spidering it like crazy. I had 1000+ calendar pages in link:to-my-site.com, but they had zero page rank because it was all redundant and I had a feeling it was actually penalizing my site (google is sensitive to spamming).
This is what my robots.txt file looks like:
User-agent: *
Disallow: /comment.php
Disallow: /submit.php
Disallow: /profiles.php
Disallow: /calendar.php
Disallow: /usersettings.php
Disallow: /forum/createtopic.php
Even after a proper robots.txt file, and removing the calendar, if you want to cover all your bases you could add this to your header template in with the other meta tags:
<?php
// Name of the script handling this request, e.g. "calendar.php".
$this_page = basename($_SERVER['PHP_SELF']);
switch ($this_page)
{
    // Pages that search engines should neither index nor follow.
    case "comment.php":
    case "submit.php":
    case "profiles.php":
    case "calendar.php":
    case "usersettings.php":
    case "createtopic.php":
        echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\">";
        break;
    // Every other page may be indexed as usual.
    default:
        echo "<meta name=\"robots\" content=\"index, follow\">";
}
?>
It works; just for kicks, I'm testing it on my temp site now.
www.antisource.com
Status: offline
ronack
Forum User
Full Member
Registered: 05/27/03
Posts: 612
I just wanted to share my robots.txt file. I watched that blasted Googlebot and really don't want it in some areas. Oh, and MSNbot was the worst; I had to disallow it completely. When MSNbot hit my sites, it stayed for a very long time and hit every day in the calendar and every photo (one of my sites has over 1000 photos). My server slowed down to a crawl and sometimes froze up. Since I added the robots.txt file, I rarely have a slowdown or lockup.
User-agent: *
Disallow: /comment.php
Disallow: /submit.php
Disallow: /forum/createtopic.php
Disallow: /calendar.php
Disallow: /admin/
Disallow: /layout/
Disallow: /images/
Disallow: /stats/
Disallow: /search.php
User-agent: msnbot
Disallow: /
Status: offline
LWC
Forum User
Full Member
Registered: 02/19/04
Posts: 818
Well, if we're all sharing, here's my full robots.txt:
User-agent: Googlebot
Disallow: /*.php/*/print$
User-agent: *
Disallow: /calendar.php
Disallow: /comment.php
Disallow: /index.php?topic=
Disallow: /pollbooth.php?qid=
Disallow: /portal.php
Disallow: /profiles.php
Disallow: /search.php
Disallow: /submit.php
Disallow: /stats.php
Disallow: /users.php
Disallow: /admin/
Disallow: /chatterblock/
Disallow: /filemgmt/brokenfile.php
Disallow: /filemgmt/downloadhistory.php
Disallow: /filemgmt/ratefile.php
Disallow: /filemgmt/viewcat.php
Disallow: /filemgmt/visit.php
Update:
I've just checked and most of these are on Google despite the fact that they're mentioned in my robots.txt!
Ok, when I think about it, I've recently switched my site from http://lior.weissbrod.com/lior (synonymous with http://www.weissbrod.com/lior - i.e. just a simple redirect) to simply http://lior.weissbrod.com (virtually simulating a standalone site).
And it turns out that most of the forbidden pages are from the former - now outdated - site!
When I did the switch, I assumed that Google would be smart enough to omit all the outdated "/lior/" results by itself because they all now give 404 errors. I guess I was wrong...
Well, before you think this explains everything, this is just the general case. In some cases, Google indexes them on the new site (for example, the calendar) so it really does ignore my robots.txt (at least sometimes).
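One possible explanation that nobody in the thread raises: under the robots exclusion convention, a crawler obeys only the group that names it and falls back to the "User-agent: *" group only when no such group exists. With a separate "User-agent: Googlebot" group in the file, Googlebot would therefore read just the print rule and never see the calendar.php or submit.php lines. The sketch below, assuming Python 3, checks this with the standard library's `urllib.robotparser`; the file is an abbreviated copy of the one above, "SomeOtherBot" is a made-up agent name, and `allowed` is a hypothetical helper. Note that Python's parser does not interpret the `*` and `$` wildcards in the print rule, which does not matter for the two paths tested here.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the robots.txt posted above (only two "*" rules kept).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /*.php/*/print$

User-agent: *
Disallow: /calendar.php
Disallow: /submit.php
"""

def allowed(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `path` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

for agent in ("Googlebot", "SomeOtherBot"):
    for path in ("/calendar.php", "/submit.php"):
        verdict = "allowed" if allowed(agent, path) else "blocked"
        print(f"{agent:13} {path:14} {verdict}")
```

With this parser, Googlebot comes out as allowed on both pages while the other agent is blocked. If that is what is happening here, repeating the Disallow lines inside the Googlebot group would be the usual remedy.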