Posted on: 02/06/05 07:19am
By: LWC
I've just taken a look at my site's log and saw that my most popular Geeklog page is my calendar.
I don't use my calendar...
Is there some known security flaw in Geeklog's calendar which makes spammers look for it or something?
Calendar? What calendar?
Posted on: 02/06/05 07:29am
By: Dirk
Did you check who was visiting the calendar? Googlebot seems to love it, for example ...
bye, Dirk
Calendar? What calendar?
Posted on: 02/06/05 09:40am
By: LWC
Actually, today alone, my referer log looks something like this:
http://www.google.com/search?q=...-> /article.php/...
http://my-own-domain/article.php/... -> /images/speck.gif
http://www.google.com/search?q=...-> /article.php/...
http://www.google.com/search?q=...-> /article.php/...
- -> /calendar.php
- -> /calendar.php
- -> /calendar.php
About 5 pages of "- -> /calendar.php" later...
- -> /calendar.php
- -> /calendar.php
- -> /calendar.php
And so on...
Until finally, when I checked it out myself:
http://my-own-domain/ -> /calendar.php
That's insane...
Also, since there's no referer, it means that it's not even internal (because that would have been shown as my-own-domain, like the final entry). It's as if someone goes there manually or via favorites!
Let's say I'm wrong and it is internal - it's not possible that Geeklog loads it every time anyone goes to ANY Geeklog page, is it? Because why would it take up like 90% of my referer log?
Calendar? What calendar?
Posted on: 02/06/05 10:16am
By: Dirk
Don't you have access to your webserver's logfiles? Without that, it's all speculation ...
As I said, Googlebot and the other search engine spiders seem to love the calendar:
lj2248.inktomisearch.com - - [05/Feb/2005:13:28:51 +0100] "GET /calendar.php?view=day&mode=&day=10&month=10&year=2003 HTTP/1.0" 200 17011 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
lj2035.inktomisearch.com - - [05/Feb/2005:17:00:20 +0100] "GET /calendar.php?month=1&year=2004 HTTP/1.0" 200 27367 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
lj2411.inktomisearch.com - - [05/Feb/2005:18:50:57 +0100] "GET /calendar.php?mode=&view=week&month=9&day=7&year=2003 HTTP/1.0" 200 11751 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
Etc., spread out over the entire day.
That's from a randomly selected logfile of one of my sites. It's Yahoo's spider in this case, but I've seen similar "walkthroughs" from other spiders, including Googlebot and msnbot.
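To see who is hitting the calendar, lines in the "combined" format shown above can be tallied by user agent. The following is only a minimal sketch in Python: the regular expression assumes the standard combined layout, and the function name `calendar_hits_by_agent` is made up for this illustration.

```python
import re
from collections import Counter

# Apache "combined" layout:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def calendar_hits_by_agent(lines, page="/calendar.php"):
    """Count requests for `page` (ignoring the query string) per user agent."""
    hits = Counter()
    for line in lines:
        match = COMBINED.match(line.strip())
        if match is None:
            continue  # not a combined-format line, skip it
        path = match.group("url").split("?", 1)[0]
        if path == page:
            hits[match.group("agent")] += 1
    return hits

# Two of the log lines quoted in this post.
SAMPLE = [
    'lj2248.inktomisearch.com - - [05/Feb/2005:13:28:51 +0100] '
    '"GET /calendar.php?view=day&mode=&day=10&month=10&year=2003 HTTP/1.0" '
    '200 17011 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; '
    'http://help.yahoo.com/help/us/ysearch/slurp)"',
    'lj2035.inktomisearch.com - - [05/Feb/2005:17:00:20 +0100] '
    '"GET /calendar.php?month=1&year=2004 HTTP/1.0" '
    '200 27367 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; '
    'http://help.yahoo.com/help/us/ysearch/slurp)"',
]

for agent, count in calendar_hits_by_agent(SAMPLE).most_common():
    print(count, agent)
```

Run over a whole day's log, the agents with the highest counts are the first suspects.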
bye, Dirk
Calendar? What calendar?
Posted on: 02/06/05 11:32am
By: LWC
What do you mean? This was my webserver's logfile...
It's called referer.log and it's formatted like this:
left side --> right side
where "left side" = the referer and "right side" = the requested page.
So if "left side" is empty, that means there's no "official" referer - which means someone who came in manually or via favorites, but also someone who turned off the referer header.
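For a log in this "left side -> right side" layout, the pages that are requested without a referer can be counted with a few lines of Python. This is a rough sketch that assumes the exact layout described above, with "-" standing for an empty referer; the function names are invented for the example.

```python
from collections import Counter

def parse_referer_line(line):
    """Split 'referer -> page' into (referer, page); return None if no arrow."""
    referer, arrow, page = line.rpartition("->")
    if not arrow:
        return None
    return referer.strip(), page.strip()

def pages_without_referer(lines):
    """Count how often each page was requested with an empty ('-') referer."""
    counts = Counter()
    for line in lines:
        parsed = parse_referer_line(line)
        if parsed is not None and parsed[0] == "-":
            counts[parsed[1]] += 1
    return counts

# Entries in the shape quoted earlier in the thread (elisions kept as posted).
SAMPLE = [
    "http://www.google.com/search?q=...-> /article.php/...",
    "- -> /calendar.php",
    "- -> /calendar.php",
    "http://my-own-domain/ -> /calendar.php",
]

print(pages_without_referer(SAMPLE))
```

This only shows which pages get referer-less hits. It cannot say who made them, because this log does not record the visitor.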
Calendar? What calendar?
Posted on: 02/06/05 12:07pm
By: Dirk
[QUOTE BY= LWC] What do you mean? This was my webserver's logfile...[/QUOTE]
Sorry, never seen a logfile like that. Doesn't it list the user agent (aka browser)?
[QUOTE BY= LWC] So if "left side" is empty, that means there's no "official" referer - which means someone who came in manually or via favorites, but also someone who shut off his/her referer's header.[/QUOTE]
If you check what I quoted from my logfiles above, you'll see that those also came in without any referrer (that's the "-" bit), which is common for search engine robots.
bye, Dirk
Calendar? What calendar?
Posted on: 02/06/05 01:25pm
By: LWC
Your format is the "combined" one, which I guess is the most popular one, but there are others like mine - two separate logs - referrer log and browser log.
But if they have no referer, what's inktomisearch.com?
Anyway, so what you're saying is that spiders, respected (e.g. Google, MSNBot, etc.) or otherwise (i.e. spammers) have no referers.
And they all come to stare at my blank calendar...
Well, maybe if I start writing a calendar, my site would be popular.
Calendar? What calendar?
Posted on: 02/06/05 01:48pm
By: Dirk
[QUOTE BY= LWC] what's inktomisearch.com?[/QUOTE]
They're owned by Yahoo and provide the search results for them.
Everyone's talking about Google vs. MSN, but Yahoo has quietly acquired a lot of search engine technology (Inktomi, Overture, Altavista, ...) to do their own thing ...
[QUOTE BY= LWC] Anyway, so what you're saying is that spiders, respected (e.g. Google, MSNBot, etc.) or otherwise (i.e. spammers) have no referers.[/QUOTE]
Actually, spammers DO have referrers. Have you never wondered about all those porn and poker sites supposedly linking to you? Referrer spam is also very popular these days.
bye, Dirk
Calendar? What calendar?
Posted on: 02/06/05 03:23pm
By: LWC
That's all nice and well, but, as you could see, when it comes to the calendar, the referers are empty.
Of course the log contained spam referers, but none that came to the calendar (usually just to the front page).
Is it a coincidence that usually the referer is logged, unless the accessed page is the calendar?
For example, if someone had kept it in his/her favorites and clicked it 500 times, that would have explained it.
Like you said, if it was Googlebot, it would say something like:
http://googlebot.google.com - -> /calendar.php
Calendar? What calendar?
Posted on: 02/06/05 03:38pm
By: Dirk
[QUOTE BY= LWC] That's all nice and well, but, as you could see, when it comes to the calendar, the referers are empty.[/QUOTE]
Yes, because that's what the search engine spiders do. I thought we already covered that?
bye, Dirk
Calendar? What calendar?
Posted on: 02/06/05 07:35pm
By: LWC
Then I assume "inktomisearch.com" was the visitor and not the referer (which I mistook it for in the earlier posts).
So in my case, it would have appeared only in the browser log.
Basically, that means that with separate logs, like I have, one can never match referers with visitors.
Which means one can't pinpoint the visitors (with empty referers) that visit the calendar...
Some answers, and a question
Posted on: 03/07/05 12:56pm
By: tstockma
Spiders and Search Engines try to find all your pages, and every day on your calendar is a link, even if you have nothing listed. They'll cheerfully spend eternity following up all the "links" your empty calendar contains.
You can use robots.txt to exclude search engines from your calendar entirely, but if you have some events, you'll exclude those events from the SEs.
My question: how can I turn off the link to a day on the main calendar page if there's nothing listed? That link is what the calendar currently uses to allow entry of a new event, but could we set up a more generic "add calendar event" function that we could exclude with robots.txt, where you specify the date as part of that process?
That would eliminate the SE hits we currently see. (It would also help my "broken links" searcher considerably; it's stupid and never stops looking at empty calendar days.)
Sorry if this has been asked before...thanks for any comments!
Calendar? What calendar?
Posted on: 03/08/05 02:58pm
By: LWC
First of all, Dirk will be glad to know that my access log now also shows referrers (in addition to the specialized referrers log...). And I think you're right - a sample check proved that Googlebot is the main visitor of the calendar.
But the sample check also revealed that there's a MUCH (and I mean MUCH) worse file - submit.php! And again, it's Googlebot that won't stop visiting it!
Which brings us to tstockma...
[quote ...a snippet from my robots.txt]
User-agent: *
.
.
.
Disallow: /calendar.php
.
.
.
Disallow: /submit.php
.
.
.
[/quote]
And yet, these two are bombed with hits...
Before you blame the file, I try to test my robots.txt with a personal search engine from time to time to make sure it's valid.
So if even a "respected" bot like Google's ignores robots.txt anyway, I don't know if it's worth coming up with a solution to your request.
BTW, I've known for a long time that Google ignores my robots.txt file (and never managed to change that), but what I didn't know was how much traffic its bot causes those forbidden files!
Calendar? What calendar?
Posted on: 03/08/05 03:12pm
By: Dirk
[QUOTE BY= LWC] BTW, I've known for a long time that Google ignores my robots.txt file (and never managed to change that), but what I didn't know was how much traffic its bot causes those forbidden files![/QUOTE]
Hmm, I would be surprised if it ignored the robots.txt.
Are you sure your robots.txt is syntactically correct? Did you check[*1] it?
Also, are you sure it's really Googlebot and not some other bot claiming to be Googlebot? Check the IPs it's coming from - they should all belong to Google.
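One way to check whether an IP really belongs to Googlebot is a reverse DNS lookup followed by a forward lookup of the returned name. The sketch below assumes that genuine crawler hosts resolve to names under googlebot.com or google.com. The function name and the injectable lookup parameters are made up for this example, so that it can be tried offline with stub data.

```python
import socket

# Assumed suffixes for genuine Google crawler hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def _reverse(ip):
    """Reverse DNS: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]

def _forward(host):
    """Forward DNS: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]

def is_real_googlebot(ip, reverse_lookup=_reverse, forward_lookup=_forward):
    """True if ip reverse-resolves to a Google name that resolves back to ip."""
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no reverse record at all
    if not host.lower().endswith(GOOGLE_SUFFIXES):
        return False  # claims to be Googlebot but lives elsewhere
    try:
        return ip in forward_lookup(host)
    except OSError:
        return False

# Offline demo with stub lookups (hypothetical data, no DNS query is made).
stub_reverse = lambda ip: "crawl-66-249-65-137.googlebot.com"
stub_forward = lambda host: ["66.249.65.137"]
print(is_real_googlebot("66.249.65.137", stub_reverse, stub_forward))
```

Calling `is_real_googlebot(ip)` without the stubs performs real DNS lookups, so it needs network access.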
bye, Dirk
Calendar? What calendar?
Posted on: 03/08/05 04:33pm
By: LWC
[quote ...a sample from the access log]
crawl-66-249-65-137.googlebot.com - - [06/Mar/2005:00:06:19 -0700] "GET /submit.php?type=event&mode=&month=04&day=11&year=2002&hour=16 HTTP/1.1" 200 14840 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
crawl-66-249-65-137.googlebot.com - - [06/Mar/2005:00:21:17 -0700] "GET /submit.php?type=event&mode=&month=05&day=07&year=2002&hour=14 HTTP/1.1" 200 14840 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
crawl-66-249-65-137.googlebot.com - - [06/Mar/2005:00:28:14 -0700] "GET /submit.php?type=event&mode=&month=07&day=26&year=2002&hour=23 HTTP/1.1" 200 14840 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[/quote]
What you see there can go for pages!
Hmm, when I look at it, Googlebot never uses referrers.
And your site has approved my robots.txt
Calendar? What calendar?
Posted on: 03/08/05 04:49pm
By: Dirk
Yep, that's Googlebot. In that case I think you should email them and tell them that Googlebot has been a bad boy.
If you go to this URL[*2], it already has an option "Googlebot is overloading my servers".
Keep us posted, I'd be interested in their response.
bye, Dirk
Calendar? What calendar?
Posted on: 03/08/05 05:31pm
By: LWC
I feel there's too much risk involved.
The "overloading" form states that sending it would lead to Google visiting my site less often.
That's not what I want at all! My site is not updated enough in Google as it is (the problems are that it ignores my robots.txt and never drops outdated pages)...
These are probably automated forms, so even if I tell them that, the only one who'd read it is some robot, which won't care about anything I write other than that I filled in a "please send me less traffic" form.
Calendar? What calendar?
Posted on: 03/08/05 10:56pm
By: RickW
Do what I did and remove the calendar files:
http://www.geeklog.net/forum/viewtopic.php?forum=3&showtopic=38627[*3]
I had the same problem - I wasn't using the calendar, it was just an empty feature I didn't want visitors to see, and google was spidering it like crazy. I had 1000+ calendar pages in link:to-my-site.com, but they had zero page rank because it was all redundant and I had a feeling it was actually penalizing my site (google is sensitive to spamming).
This is what my robots.txt file looks like:
User-agent: *
Disallow: /comment.php
Disallow: /submit.php
Disallow: /profiles.php
Disallow: /calendar.php
Disallow: /usersettings.php
Disallow: /forum/createtopic.php
Even after a proper robots.txt file, and removing the calendar, if you want to cover all your bases you could add this to your header template in with the other meta tags:
<?php
// Name of the script being served, e.g. "calendar.php"
$this_page = basename($_SERVER['PHP_SELF']);
switch ($this_page)
{
    // Pages that search engines should neither index nor follow
    case "comment.php":
    case "submit.php":
    case "profiles.php":
    case "calendar.php":
    case "usersettings.php":
    case "createtopic.php":
        echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\">";
        break;
    // Everything else may be indexed as usual
    default:
        echo "<meta name=\"robots\" content=\"index, follow\">";
}
?>
It works, just for kicks I'm testing it on my temp site now.
Calendar? What calendar?
Posted on: 03/08/05 11:36pm
By: ronack
I just wanted to share my robots.txt file. I watched that blasted Googlebot and really don't want it in some areas. Oh, and the MSNbot was the worst; I had to disallow the MSNbot completely. When the MSNbot hit my sites, it stayed for a very long time and hit every day in the calendar and every photo (one of my sites has over 1000 photos). My server slowed down to a crawl and sometimes froze up. Since I added the robots.txt file I rarely have a slowdown or lockup.
User-agent: *
Disallow: /comment.php
Disallow: /submit.php
Disallow: /forum/createtopic.php
Disallow: /calendar.php
Disallow: /admin/
Disallow: /layout/
Disallow: /images/
Disallow: /stats/
Disallow: /search.php
User-agent: msnbot
Disallow: /
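A robots.txt like the one above can be sanity-checked offline with the parser in Python's standard library. This is only a sketch: it uses a shortened copy of the rules, the helper `build_parser` is invented here, and Python's parser is not what Googlebot or MSNbot actually run, so it shows how the rules read rather than what a given bot will do.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /comment.php
Disallow: /submit.php
Disallow: /calendar.php
Disallow: /search.php

User-agent: msnbot
Disallow: /
"""

def build_parser(text):
    """Parse robots.txt content that is already in memory (no download)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
for agent, path in [
    ("Googlebot", "/calendar.php"),
    ("Googlebot", "/article.php"),
    ("msnbot", "/article.php"),
]:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(agent, path, verdict)
```

Note that `can_fetch` expects a short agent token such as "Googlebot", not the full user agent string from the access log.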
Calendar? What calendar?
Posted on: 03/09/05 05:59am
By: LWC
Well, if we're all sharing, here's my full robots.txt:
User-agent: Googlebot
Disallow: /*.php/*/print$
User-agent: *
Disallow: /calendar.php
Disallow: /comment.php
Disallow: /index.php?topic=
Disallow: /pollbooth.php?qid=
Disallow: /portal.php
Disallow: /profiles.php
Disallow: /search.php
Disallow: /submit.php
Disallow: /stats.php
Disallow: /users.php
Disallow: /admin/
Disallow: /chatterblock/
Disallow: /filemgmt/brokenfile.php
Disallow: /filemgmt/downloadhistory.php
Disallow: /filemgmt/ratefile.php
Disallow: /filemgmt/viewcat.php
Disallow: /filemgmt/visit.php
Update:
I've just checked and most of these are on Google despite the fact that they're mentioned in my robots.txt!
Ok, when I think about it, I've recently switched my site from http://lior.weissbrod.com/lior (synonymous with http://www.weissbrod.com/lior - i.e. just a simple redirect) to simply http://lior.weissbrod.com (virtually simulating a stand-alone site).
And it turns out that most of the forbidden pages are from the former - now outdated - site!
When I did the switch, I assumed that Google would be smart enough to omit all the outdated "/lior/" results by itself because they all now give 404 errors. I guess I was wrong...
Well, before you think this explains everything, this is just the general case. In some cases, Google indexes them on the new site (for example, the calendar) so it really does ignore my robots.txt (at least sometimes).
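One possible explanation for the robots.txt quoted in this post, offered as a guess rather than a diagnosis: under the robots exclusion standard a crawler obeys only the most specific group that names it. Googlebot would then read just the "User-agent: Googlebot" group and skip the "User-agent: *" rules entirely. The sketch below demonstrates that group selection with Python's standard parser, which is not Google's and does not understand the wildcard line, using a shortened copy of the file.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the robots.txt from this post.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /*.php/*/print$

User-agent: *
Disallow: /calendar.php
Disallow: /submit.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own group, so the "*" rules are never consulted.
googlebot_allowed = parser.can_fetch("Googlebot", "/calendar.php")
# A bot without a group of its own falls back to the "*" rules.
other_allowed = parser.can_fetch("SomeOtherBot", "/calendar.php")

print("Googlebot may fetch /calendar.php:", googlebot_allowed)
print("SomeOtherBot may fetch /calendar.php:", other_allowed)
```

If that is what is happening, repeating the Disallow lines inside the Googlebot group should close the gap.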
Calendar? What calendar?
Posted on: 03/09/05 08:15am
By: RickW
[QUOTE BY= LWC]
When I did the switch, I assumed that Google would be smart enough to omit all the outdated /lior/ results by itself. I guess I was wrong...[/QUOTE]
Why would Google consider that smart? It crawls links, indexes them, then reindexes what it already has in its database over and over. You never explicitly told Google to drop it from their index. Over time, Google would have seen the duplicate content, neither copy of which is outdated because they are the same content, and it would have decided first to drop page rank to 0, then eventually might drop them from the results.
Just admit that you made a web design mistake, stop trying to blame everyone and everything but yourself, learn and move on.
Put this in your .htaccess file (create one if you don't have one):
Redirect 301 /lior/ http://lior.weissbrod.com/
If Google tries to crawl any of your pages in the subdirectory instead of through the subdomain, it's going to be redirected to your subdomain and told that the page has permanently moved - and any of those pages in their index will be dropped really quick.
Calendar? What calendar?
Posted on: 03/09/05 08:44am
By: LWC
Ah, but you underestimate me...
RewriteCond %{HTTP_HOST} !^lior\.weissbrod\.com$ [NC,OR]
RewriteRule ^.*$ http://lior.weissbrod.com/ [R,L]
(from my old .htaccess)
See? Google has no excuse...
Alright, alright, I'm willing to admit that I think the default R is 302 (MOVED TEMPORARILY) while you pointed out 301 (or, as I've found out, the easier to remember word "permanent").
I've just changed it into
RewriteCond %{HTTP_HOST} !^lior\.weissbrod\.com$ [NC,OR]
RewriteRule ^.*$ http://lior.weissbrod.com/ [R=permanent,L]
Let's see if Google catches on.
Calendar? What calendar?
Posted on: 03/09/05 09:07am
By: RickW
[QUOTE BY= LWC]
I've just changed it into
[quote ...a snippet from my new .htaccess]
RewriteCond %{HTTP_HOST} !^lior.weissbrod.com$ [NC]
RewriteRule ^.*$ http://lior.weissbrod.com [R=permanent,L]
[/quote]
Let's see if Google catches on.[/QUOTE]
What is it you are accomplishing with that code?
edit:
Okay if I understand what you're doing, if I tried to access the domain in any other way (just weissbrod.com or www.weissbrod.com) then I will be redirected to lior.weissbrod.com.
What I'm confused about, is above you made the statement:
I've recently switched my site from http://lior.weissbrod.com/lior to simply http://lior.weissbrod.com
Maybe you mean that you switched from WWW.weissbrod.com/lior to lior.weissbrod.com? If that is the case, then yes, by using a 302 temporary redirect you are actually telling Google NOT to visit your new subdomain, and if Googlebot does happen to crawl the new links, it will give precedence to your old links.
Oops.
Calendar? What calendar?
Posted on: 03/09/05 09:25am
By: LWC
I used to redirect outdated pages with a 302 status ("temporary redirect") and from now on, thanks to you, I'll redirect them with a 301 status ("permanent redirect").
If what you don't understand is the entire command, then it's called "mod_rewrite". Actually, it is your command that I'm not familiar with, but "rewrite" has many redirect options and they're not only based on http_host.
For example, I use it to (now permanently...) redirect referrer spam[*4] (BTW, I can't use "storyid:" because it adds article.php to the beginning of the URL).
Calendar? What calendar?
Posted on: 03/09/05 09:30am
By: RickW
See above - you beat me to a reply before I finished my post edit.
Calendar? What calendar?
Posted on: 03/10/05 11:09am
By: LWC
Actually, I've made a mistake, but not the one you think.
"www.weissbrod.com" (or "weissbrod.com") is a site on its own (BTW, I've set it up so that the latter redirects to the former via the same mod_rewrite method in .htaccess).
"lior.weissbrod.com" is a subsite of that domain, but the ISP's server is smart enough to make it virtually a stand alone site (it has its own robots.txt, .htaccess, etc.).
However - and here lies the problem - it wasn't always like that. Before my ISP upgraded to their smart server, I had to resort to a little CGI script (called "DomainDirector") that just made a simple redirect to "www.weissbrod.com/lior".
To make things even more complicated, during that time a new version of that CGI script came out and made it look better by redirecting it to "lior.weissbrod.com/lior" (the first "lior" is fake. Unlike now, the "subsite" was still "www").
...So when I quoted my .htaccess in the previous post (after upgrading it using your suggestion), I only fixed the "www.weissbrod.com/lior" problem!
Alas, since the new version of that CGI script came almost as soon as I started to use the script in the first place, that version of the site didn't last long - so it barely matters anyway (only 1 match in Google...).
The big problem is "lior.weissbrod.com/lior" and my .htaccess currently has no solution for that.
But since I no longer use "/lior/", every page that Google has with it throws back a 404 message!
This time don't blame me - Google's own FAQ states that
"you don't need to bother us to remove your pages - just throw back 404 messages!"
Yet it has been a long time now and tons of "/lior/" pages still show up!
So my question is this - should I just give up on those 404 messages and use
RewriteCond %{HTTP_HOST} !^lior\.weissbrod\.com$ [NC,OR]
RewriteCond %{REQUEST_URI} ^/lior/ [NC]
RewriteRule ^.*$ http://lior.weissbrod.com/ [R=permanent,L]
?
In other words, 404 (like Google supposedly wants, but doesn't seem to respect) or 301?
Calendar? What calendar?
Posted on: 03/10/05 11:44am
By: RickW
[QUOTE BY= LWC]
So now my question is this - is the new robots.txt enough?
Or do you think I should use the new robots.txt (now with "/lior/" in it), but also use
[quote in .htaccess]
RewriteCond %{REQUEST_URI} ^/lior/ [NC]
RewriteRule ^.*$ http://lior.weissbrod.com [R=permanent,L]
[/quote]?[/QUOTE]
I would use the 301 redirect. I prefer the code I mentioned, it's just much cleaner looking. Also I'd suggest the php meta robots script I worked up for you to ensure your calendar and other files aren't getting indexed in case the robots.txt file isn't working 100%. The noindex and noarchive will promptly drop those pages from google.
I just checked out your 404 error via one of those links still in Google's index. It loads up Geeklog's 404.php page. Perhaps that is the problem? The page says 404, but the redirect that Geeklog uses might not indicate that to a search engine.
Dirk, if you're still following this thread, can you elaborate on the 404? Where is that code stored?
Calendar? What calendar?
Posted on: 03/10/05 12:47pm
By: LWC
First of all, I've updated my post (and also the one where I gave my entire robots.txt).
Basically, I've realized that it's absurd to enter "/lior/" in my robots.txt because robots.txt is supposed to store existing links for files and folders.
Anyway, if you look at my new .htaccess quote, you'd see why it's probably better with "rewrite" (because it's two lines and uses "or" - can you do this with your method?).
Now, Dirk has nothing to do with 404 (other than providing nice text). Although some sites probably default to 404.php, you'd still better ensure this with, again, .htaccess:
[quote another snippet from my .htaccess]
ErrorDocument 404 /404.php
[/quote]
So it has to send back a real 404 error. Do you have a site or software that shows you HTTP responses (that means official status codes and not just the text output)? I've tried one and indeed it officially said "404".
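For anyone who would rather check the "official" response locally than through a website, a short Python sketch can send a HEAD request without following redirects and report the status code. The function names and the one-line hints are made up for this example; the hints are a rough reading of what each code says to a crawler, not a statement of how any particular search engine reacts.

```python
import http.client
from http import HTTPStatus
from urllib.parse import urlsplit

# Rough, informal reading of the codes discussed in this thread.
CRAWLER_HINTS = {
    301: "permanent redirect: the old URL should be replaced by the new one",
    302: "temporary redirect: the old URL is expected to come back",
    404: "not found: the page is missing, possibly only for now",
    410: "gone: the page was removed on purpose",
}

def fetch_status(url, timeout=10.0):
    """HEAD the URL without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()

def describe(status):
    """Turn a numeric status into 'code phrase (hint)'."""
    try:
        phrase = HTTPStatus(status).phrase
    except ValueError:
        phrase = "Unknown"
    hint = CRAWLER_HINTS.get(status, "no special hint in this sketch")
    return f"{status} {phrase} ({hint})"

# Usage (needs network access), e.g.:
#   status, location = fetch_status("http://example.com/some/old/page")
#   print(describe(status), location)
print(describe(404))
```

`http.client` never follows redirects on its own, which is what makes it suitable for seeing a 301 or 302 as the crawler would.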
Calendar? What calendar?
Posted on: 03/10/05 01:29pm
By: RickW
[QUOTE BY= LWC] First of all, I've updated my post (and also the one where I gave my entire robots.txt).
Basically, I've realized that it's absurd to enter "/lior/" in my robots.txt because robots.txt is supposed to store existing links for files and folders.
Anyway, if you look at my new .htaccess quote, you'd see why it's probably better with "rewrite" (because it's two lines and uses "or" - can you do this with your method?).
Now, Dirk has nothing to do with 404 (other than providing nice text). Although some sites probably default to 404.php, you still better ensure this with, again, .htaccess:
[quote another snippet from my .htaccess]
ErrorDocument 404 /404.php
[/quote]
So it has to send back a real 404 error. Do you have a site or software that shows you HTTP responses (that means official status codes and not just the text output)? I've tried one and indeed it officially said "404".[/QUOTE]
It's not absurd, because if Google still has those pages in its index, it's going to try and recrawl them. For some reason the cached version is not getting dropped, which is why they're still there. That's what the noarchive is for.
And I'm under the impression that Geeklog does do something with the 404.php page, because it forwards you there if an article ID doesn't exist. The code used to parse out the "static" looking links allows for all sorts of fake directories to be inserted in the path if you wanted. I could be wrong.
I don't know why you think your 3 lines of code are more efficient than:
Redirect 301 /lior/ http://lior.weissbrod.com/
Personal preference I guess.
Calendar? What calendar?
Posted on: 03/10/05 02:01pm
By: LWC
robots.txt is not supposed to be matched against a (search engine's) predefined list, but against links found in the current crawl.
For example, Googlebot comes into my site, sees a link, checks if robots.txt allows it to be indexed, if so - indexes it, if not - ignores it, checks another link and so on...at least that's how it's supposed to be.
What article ID? Try just article. An article, dynamic or not, is still a page. If a page is not there, says .htaccess, show them the page called 404.php
I think you got confused by the fact that 404.php mentions which page is missing. It just takes that from http_referer. Before I entered it in .htaccess, my site just showed an internal default text ("file x is missing"). Ok, maybe, just maybe, Googlebot somehow ignores the 404 and thinks 404.php is just a - probably temporary - redirect.
And Google's cached versions are Google pages - not mine (except images, because Google rudely hotlinks them from the original pages).
Redirect 301 /lior/ http://lior.weissbrod.com/
And what about "www.lior.weissbrod.com", etc.? I take less chances by just using an "if not my chosen URL" statement.
Calendar? What calendar?
Posted on: 03/10/05 02:06pm
By: Dirk
[QUOTE BY= LWC] But since I no longer use "/lior/", every page that Google has with it throws back a 404 message!
This time don't blame me - Google's own FAQ states that "you don't need to bother us to remove your pages - just throw back 404 messages!"
Yet it has been a long time now and tons of "/lior/" pages still show up![/QUOTE]
In my experience, 404s don't really help. I've resorted to 410 ("Gone") in cases where I really wanted to get rid of the old URL and use "redirect permanent" (301) everywhere else.
bye, Dirk
Calendar? What calendar?
Posted on: 03/10/05 02:10pm
By: Dirk
[QUOTE BY= LWC] Ok, maybe, just maybe, Geeklog somehow ignores the 404 and thinks 404.php is just a - probably temporary - redirect.[/QUOTE]
I assume you meant "Googlebot" here, not "Geeklog"?
When set up with ErrorDocument 404 /404.php, Googlebot will still get a 404 response code (instead of a 200) for Geeklog's 404 page. There's no way it could mistake that for a redirect.
bye, Dirk
Calendar? What calendar?
Posted on: 03/10/05 02:18pm
By: LWC
Alright, I've started to use the aforementioned new .htaccess (that permanently redirects "/lior/" URLs into the new main page too).
About 404 (fixed that error - thanks...), that's what I thought, but what can I tell you? Google just won't remove my dead pages...
Calendar? What calendar?
Posted on: 03/10/05 02:18pm
By: RickW
[QUOTE BY= LWC]What article ID? Try just article. An article, dynamic or not, is still a page. If a page is not there, says .htaccess, show them the page called 404.php
I think you got confused by the fact that 404.php mentions which page is missing. It just takes that from http_referer. Before I entered it in .htaccess, my site just showed an internal default text ("file x is missing"). Ok, maybe, just maybe, Geeklog somehow ignores the 404 and thinks 404.php is just a - probably temporary - redirect.[/QUOTE]
Yep you're right about geeklog and the 404, my bad. I just took it out of the htaccess to test it. With it out, if I put in a bad article id then I get redirected to the main page, or if I put in a bad path then I get a normal 404 error.
I can see what you're saying about the www.
Here's a theory why Google won't remove those pages. Those pages no longer have links going to them, so Googlebot isn't crawling to them, so it's not encountering them in order to see the 404, so even a meta noarchive won't help. Your robots.txt is also telling Google not to revisit those pages, so they're going to remain forever stale in its index.
In the bottom of your main page, put in a really small link that nobody will notice that points to the old main page you want to get rid of. Take it out of your robots.txt for now. Googlebot will crawl it and go OOPS, 404, delete from index.
Calendar? What calendar?
Posted on: 03/10/05 02:28pm
By: LWC
RickW, quoting is for specific relevant issues. Don't abuse it...why quote my entire posts - especially when you quote them only one post after I posted them (pardon the pun)? This topic is long enough as it is. Besides, I constantly update my posts anyway. So can you please edit some of your posts and erase some of those long quotes (when you quote an entire post)? Believe me, you'll thank me later if you come back to this topic.
Calendar? What calendar?
Posted on: 03/10/05 02:42pm
By: RickW
[QUOTE BY= LWC]Believe me, you'll thank me later if you come back to this topic. [/QUOTE]
I don't plan on coming back to this topic, I've lost my patience talking to you.
Calendar? What calendar?
Posted on: 03/10/05 02:46pm
By: LWC
RickW, why?
Hmm, strange. If someone tries to access story X (with "/lior/"), they now get to the topic in which the story is in and the location bar says "lior.weissbrod.com/?story=X" .
Maybe I should just use G (gone) instead of 301, like Dirk has suggested:
RewriteCond %{HTTP_HOST} !^lior\.weissbrod\.com$ [NC,OR]
RewriteCond %{REQUEST_URI} ^/lior/ [NC]
RewriteRule ^.*$ http://lior.weissbrod.com/ [G,L]
Update:
As for your theory, RickW (are you still here? Don't go...) - if that were true, Google would be completely worthless. When Google updates the index, it divides the work into 2 parts: verifying existing pages and crawling in search of brand new pages. If they only did the latter part, they'd have not 8 billion, but 8 googol pages in their index...of course, the first part is not perfect.
Calendar? What calendar?
Posted on: 03/10/05 03:01pm
By: RickW
[QUOTE BY= LWC] RickW, why? [/QUOTE]
Sorry. It just seems like this entire thread has been a 1 way discussion, you're right and everyone and google is wrong. Maybe that's not your intention, so I apologize for being rude. Just acknowledge people are helping you instead of disagreeing with everything - if you knew all the answers then you wouldn't have started the thread.
Calendar? What calendar?
Posted on: 03/10/05 03:16pm
By: LWC
Not sure what you mean...I've applied the 301 instead of 302 suggestion (and thanked you for it). Now I've applied the 410 instead of 301 (and thanked Dirk for it). What exactly wasn't I convinced with? robots.txt? Because I didn't agree it was used for pre-defined lists? Well, sorry, I still don't think so. Or that 404.php is not called upon by Geeklog? But you admitted I was right.
I think what really happened is that you got upset over my "suggestion" not to quote whole posts. If that's so, I apologize if I sounded rude. I really didn't mean to.
Calendar? What calendar?
Posted on: 03/10/05 03:21pm
By: RickW
[QUOTE BY= LWC]If that's so, I apologize if I sounded rude. I really didn't mean to. [/QUOTE]
Calendar? What calendar?
Posted on: 03/15/05 11:36pm
By: RickW
[QUOTE BY= LWC]
Update:
As for your theory, RickW (are you still here? Don't go...) - if that were true, Google would be completely worthless. When Google updates the index, it divides the work into 2 parts: verifying existing pages and crawling in search of brand new pages. If they only did the latter part, they'd have not 8 billion, but 8 googol pages in their index...of course, the first part is not perfect.[/QUOTE]
Depends on how many levels down the page is, what its page rank is, who is linking to that page - I've seen some pages get really stagnant. Maybe in your case some of the issue is with the robots.txt and htaccess - it could be that a combination of algo flags and new algos just put some of your pages into a part of an index that just never gets updated. I still think if you remove the block from robots.txt, and put in a tiny link in your footer to those pages, and put in the php meta routine to set the noarchive, then you'll get rid of them for good.
edit:
Try putting this in your htaccess:
RedirectMatch 410 /lior/*
RewriteCond %{HTTP_HOST} !^lior\.weissbrod\.com [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^(.*) http://lior.weissbrod.com/$1 [L,R=301]
You can also add ErrorDocument 410 /410.php if you want, just make a copy of the 404.php and change the message.
I think that htaccess will match up with exactly what you need to accomplish.
edit2:
Hey, have you checked your htaccess to make sure it's doing what you think it's doing? I just put in http://lior.weissbrod.com at http://www.searchenginepromotionhelp.com/m/http-server-response/code-checker.php and got back a normal response 200. But then I tried http://www.weissbrod.com/lior, and it gave back a response 302! With that response, Google will continue to assume that your old path is the correct one.
Calendar? What calendar?
Posted on: 03/16/05 04:28am
By: LWC
Oops, that was because I shouldn't have used
RewriteCond %{HTTP_HOST} !^lior\.weissbrod\.com$ [NC,OR]
RewriteCond %{REQUEST_URI} ^/lior/ [NC]
RewriteRule ^.*$ http://lior.weissbrod.com/ [G,L]
The correct way is
RewriteCond %{HTTP_HOST} !^lior\.weissbrod\.com$ [NC,OR]
RewriteCond %{REQUEST_URI} ^/lior/ [NC]
RewriteRule .* - [G,L]
(the difference is in the last line)
Thanks!
If you want to re-test it, notice that it should be 410, not 301 (Dirk suggested it's even stronger).
Also, like I said, Google only has one such WWW link anyway. The problem is with the endless http://lior.weissbrod.com/lior/ links.
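To make the intent of the corrected rule set easier to follow, its two conditions can be mirrored in a few lines of Python. This is only an illustration of the logic (wrong host OR old "/lior/" prefix leads to 410), not a model of how Apache evaluates .htaccess files; the function name `status_for` is made up.

```python
import re

# Mirrors: RewriteCond %{HTTP_HOST} !^lior\.weissbrod\.com$ [NC,OR]
CANONICAL_HOST = re.compile(r"^lior\.weissbrod\.com$", re.IGNORECASE)
# Mirrors: RewriteCond %{REQUEST_URI} ^/lior/ [NC]
OLD_PREFIX = re.compile(r"^/lior/", re.IGNORECASE)

def status_for(host, request_uri):
    """Return 410 if either condition fires, else None (rule not applied)."""
    wrong_host = CANONICAL_HOST.search(host) is None
    old_path = OLD_PREFIX.search(request_uri) is not None
    if wrong_host or old_path:
        return 410  # corresponds to the [G] flag
    return None

for host, uri in [
    ("lior.weissbrod.com", "/article.php"),
    ("lior.weissbrod.com", "/lior/article.php"),
    ("www.weissbrod.com", "/lior/"),
]:
    print(host, uri, status_for(host, uri))
```

Read this way, any request that reaches the subsite under a host name other than lior.weissbrod.com is answered with 410 as well, not only the old "/lior/" paths.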
P.S.
Did you know this topic is on Google already? :-)
Dirk could probably pass such a fix in no time...
Calendar? What calendar?
Posted on: 03/16/05 08:10am
By: RickW
When you paste your htaccess onto here, make sure to wrap it in CODE tags - your escaping backslashes are getting stripped (assuming you're using them).
Calendar? What calendar?
Posted on: 03/16/05 08:40am
By: LWC
Fixed all of them, thanks!
But the same goes for your quotes of my .htaccess.