
Geeklog Forums

Google News Can't Crawl My Site

Marites

Forum User
Chatty
Registered: 02/04/04
Posts: 64
The blank posting above showed the text below in my preview, but when I pressed submit it came up blank. I cut and pasted it a second time and it showed up.


Thanks ... basically I am using the Professional theme and, as a matter of interest, I just set up an out-of-the-box install of 1.3.11, ran that through HTML Validate and got similar errors, so it would seem that the theme has these errors already in place ... which is not good. Did you run it on Zim, as my validator is also throwing up many errors? If you are running a site based on the Professional theme, a similar test on that may also throw up errors.

The page size is not much more than this site's, and Google search does not run away from its size: within an hour Google search (not news) has the article referenced. I do consider the error issue very serious, though, and I will see it is addressed early next week.

I sincerely thank you for running validate, something I should have done rather than taking it for granted that the Professional theme was error free.

Regards

Tess

I have just run the W3 validator and would make these comments ... 90% of the errors you relate come from the advertising banner code, which uses FRAMES. The majority of the other errors are actually caused by character problems, as the origin of our text is Asia (e.g. non SGML character number 147).

Our articles are emailed in from journalists using Macs, PCs and such, with the language set to who knows what. Our server, on the other hand, is set up correctly for US English/English; in consequence, differences in character codes throw up errors on a regular basis ... so I have discounted this type of error. The other 5 or 6 are sloppy code, either in the theme or our own, and we shall address them.

Funny, news.google have never said the problem is W3 errors; they concentrate only on URL problems. I have written to the head of news.google, with whom I have been dealing, attaching a copy of your posting, and will come back when he comments.


Marites

Forum User
Chatty
Registered: 02/04/04
Posts: 64
Thank you Dirk

I will get someone to check the addition of extensions in lib-common.php and see what happens. I am not sure a change at this time, with 25000 stories in the archive, would make sense, as most would be without an extension and others with.

We have checked the errors: most, as said, are language related; the 5 or 6 real errors, some caused by us and some in the theme templates, have been corrected.

news.google say it is not our page size (average 63) which is the problem; compared with CNN and BBC ours is quite compact. They also say the W3 errors should not raise a problem. They are still adamant the problem is how GL produces the URLs. As GL seems to be the only PHP-type CMS that is not conforming to convention, it is not worth news.google adjusting the crawler to conform, as they are aware we are the only major organization (their terminology) using GL with such an input and output.

Our page hits for October were just short of 1.8 million, with 700 plus posted news articles. 95% of our articles are emailed into GL through Sendmail and go straight into MySQL and, as stated, originate from a variety of users set to who knows what language.

It is because of our unique content that news.google have spent so much time trying to get everything to work.

Dirk, if you are interested I can send you the various technical responses Google have sent ... but as I say, the alterations needed are just not worth the effort from a GL development point of view.

Despite the problems, Dirk, I am sticking with GL - my additions to this thread were not by way of complaint but just to respond to our Zimbabwe colleague, who I felt was wasting his time expecting news.google to index his site after the problems we have had.

Thanks as always Dirk.

Marites


Quote by Dirk:
Quote by Marites: So now you are saying there is nothing to stop this - can you say how - we can hack lib.common if need be.

What I meant was that you could simply add ".php" or ".html" to the end of your story's ID (assuming this is really what Google News is looking for).

You didn't say how you post those news automatically. If you're relying on Geeklog to give them a story id (i.e. one of those numeric IDs like 20051105102630123), then you could hack COM_makesid (in lib-common.php) to add that extension (although that would then also apply to IDs for events and links).

Otherwise, you could modify whatever you use to post the news to add the extension there.

Again, I have no experience with getting a site into Google News so I don't know if this is really your problem. Others have pointed out other possible issues with your site, maybe you should look into those first ...

bye, Dirk

Munazip

Anonymous
Thanks Dirk for your answer. I think Google News want a .html at the end of the story ID, that's exactly what they are saying. I have now been to COM_makesid in lib-common.php and this is what I got.

Text Formatted Code
/**
* Makes an ID based on current date/time
*
* This function creates a 17 digit sid for stories based on the 14 digit date
* and a 3 digit random number that was seeded with the number of microseconds
* (.000001th of a second) since the last full second.
* NOTE: this is now used for more than just stories!
*
* @return   string  $sid  Story ID
*
*/

function COM_makesid()
{
    $sid = date( 'YmdHis' );
    srand(( double ) microtime() * 1000000 );
    $sid .= rand( 0, 999 );

    return $sid;
}
 



Now I need to add the .html automatically in COM_makesid so that I won't have to add it manually every time I post a story. Please help me with the hack.


Thank you in advance


Dirk

Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by Munazip: Now i need to add the .html automatically in COM_makesid so that i wont have to add everytime i post a story manually.

I have to correct myself here: Changing COM_makesid would, as mentioned above, also change the IDs used for links and events. The problem there is that those are too short (20 characters, while an ID with attached ".html" would be up to 23 characters long).

So, plan B is to change it where COM_makesid is called. In admin/story.php:
Text Formatted Code
$A['sid'] = COM_makesid() . '.html';
 
And in submit.php:
Text Formatted Code
$A['sid'] = COM_makeSid() . '.html';
 
and a second time:
Text Formatted Code
$A['sid'] = addslashes (COM_makeSid () . '.html');
 

Note that there are further calls to COM_makeSid in submit.php, but those are for events ($A['eid'] = ...) and links ($A['lid'] = ...) and should be left unchanged.
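
The idea above, appending the extension only where story IDs are created, can also be sketched as a small wrapper. This is a hedged illustration, not Geeklog code: demo_makesid() is a stand-in that mimics the COM_makesid quoted earlier in this thread so the snippet runs outside Geeklog, and make_story_sid() is a hypothetical helper name.

```php
<?php
// Stand-in for Geeklog's COM_makesid(), mimicking the code quoted earlier
// in this thread so that the sketch runs outside Geeklog.
function demo_makesid()
{
    // 14-digit timestamp followed by a random number between 0 and 999
    return date('YmdHis') . rand(0, 999);
}

// Hypothetical helper: only story IDs get the extension. Event and link
// IDs keep calling the plain generator, so their length does not change.
function make_story_sid($extension = '.html')
{
    return demo_makesid() . $extension;
}

$storySid = make_story_sid();   // e.g. for $A['sid']
$eventId  = demo_makesid();     // e.g. for $A['eid'], left unchanged

echo $storySid, "\n", $eventId, "\n";
```

Keeping the extension out of the shared generator is the point of plan B: events and links never see the longer ID.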

bye, Dirk

Dirk

Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by Marites: Dirk, if you are interested I can send you the various technical responses Google have sent ... but as I say, the alterations needed are just not worth the effort from a GL development point of view.

If you have any definitive information about what it is that Google News doesn't like about Geeklog's URLs, I'd like to hear it. At the very least, we could make an FAQ entry for it.

bye, Dirk

Munazip

Anonymous
Thanks Dirk, I am trying to understand where you say "and a second time".

Where do i add that?

Thank you in advance

Marites

Forum User
Chatty
Registered: 02/04/04
Posts: 64
Will put the relevant emails together and mail to you personally early next week.

Regards

- Tess -


Marites

Forum User
Chatty
Registered: 02/04/04
Posts: 64
Munazip ... Earlier this year on a test site we hand added .html extensions to about 50 articles - Google still could not read them. I see you have added an extension to the first story on your site; it will be interesting to see if news.google say they can read it or not.

- Tess -


Dirk

Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by Munazip: Where do i add that?

Just search for "makesid" (or "makeSid") in the two files and you will see.

bye, Dirk

Munazip

Anonymous
Thanks Tess and Dirk, I have done the necessary amendments but for some reason I am failing to see where I can add this
Text Formatted Code
$A['sid'] = addslashes (COM_makeSid () . '.html');
 


It looks like it works fine without the line above.


And the other thing is my images now have a .html extension as well, will that affect anything?

Dirk

Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by Munazip: for some reason am failing to see where i can add this
Text Formatted Code
$A['sid'] = addslashes (COM_makeSid () . '.html');

 

This should be the second hit for "makeSid" in submit.php (assuming you're on Geeklog 1.3.11sr2 - haven't checked with any older versions). It's called when normal users (not admins) submit stories.


Quote by Munazip: And the other thing is my images now have a .html extension as well, will that affect anything?

I have to admit that I didn't think of this, but then again, they should have two extensions now (20051105215730123.html_1.gif etc.), and that should be fine.
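
A quick way to see why the double extension should be harmless: PHP's pathinfo() reports only the part after the last dot as the extension. The sketch below uses the file name from the example above. How Geeklog itself inspects image names is not shown in this thread, so treat it as an illustration only.

```php
<?php
// File name taken from the example above: story ID with ".html" attached,
// followed by the image counter and the real image extension.
$imageName = '20051105215730123.html_1.gif';

// pathinfo() treats everything after the last dot as the extension.
$extension = pathinfo($imageName, PATHINFO_EXTENSION);

echo $extension, "\n";
```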

bye, Dirk

ByteEnable

Anonymous
The Google bot can crawl through bad HTML, but it's always best to have good HTML. You never know what will make the crawler fail. My site LinuxElectrons has no problems with news.google.

The only major difference between my site and your site, is your story page still has the story title as a url. It's very hard to believe that the crawler wants a .html extension at the end. If this is true, it must be due to your story page title being a URL, instead of a real story title:

Yours: My Story

Mine: My Story

Byte

ByteEnable

Anonymous
Text Formatted Code

Yours: <h2><a href="/article.php/2343243">My Story</a></h2>

Mine: <h1>My Story</h1>


 

Marites

Forum User
Chatty
Registered: 02/04/04
Posts: 64
I just did a search and I find that news.google's uptake of your stories is no larger than ours - I counted one from your front page. As I stated earlier, news.google can pick up some items but not all. Adding an extension will make no difference whatsoever: we set up a test site earlier this year where each URL had an .html extension, and the pickup rate was 2 or 3 from the 50.

Maybe we are all blaming the URL when it is something else - I have tried one part story to a page and 20, and that makes no difference. I spent weeks with Google trying to find out what the problem was, and at no time could 100% of the articles be picked up in a crawl. At Google's suggestion I set up the same test site in Mambo (the paid version) and there was no problem: every item was picked up, as it was with ArticleLive and others.

Sadly I am not a programmer so I can only stab in the dark at solutions and causes and no doubt all are wrong.

Could there be something between the URL on (we'll call it) page 1 and the link to the full article? I do find that the smaller articles, which don't require a 'Read More', get picked up every time.

I reiterate: google.com has no such problem and every article is picked up. It sounds crazy, but what is the difference between the crawler on google.com and the one on news.google.com?

- Tess -


samstone

Forum User
Full Member
Registered: 09/29/02
Posts: 820
Aha, it is always the big guy that calls the shots!

See, as you said, CNN and BBC have no problem picking up the news. But Google never thought about spanking their bot for not picking up the news faithfully. Isn't it a little arrogant that they would rather try to fix your Geeklog, than look into their own bot?

Well, I feel for your problem, so I am just joining the rant. Sometimes Google deserves a stick.

Sam


ByteEnable

Anonymous
Google News only retains news for 30 or 60 days. I can't remember exactly which one. I just did a news.google.com search and got:

Results 1 - 10 of about 333 for linuxelectrons

Search URL.
Text Formatted Code

http://news.google.com/news?hl=en&ned=us&q=linuxelectrons&btnG=Search+News

 


On Yahoo I get:
NEWS STORIES - Results 1 - 10 of about 62 for linuxelectrons.

I did a search on your domain, and not one Search engine has you listed, except for the one link on www.google.com.

I even did a search on www.google.com.ph, and there are no links. It probably has something to do with all that JavaScript breaking the crawler, got to be.

Byte

Marites

Forum User
Chatty
Registered: 02/04/04
Posts: 64
Try this URL

Look here

I guess it depends on the search criteria.

It's 30 days btw.

Tess





Marites

Forum User
Chatty
Registered: 02/04/04
Posts: 64
I don't know whether it is because I am of the opposite sex, but every time I get involved in a thread on this forum I receive disgusting suggestions from another member saying that I should not be running a web site but a home instead. (Kept polite)

I know who this is as does he ... he has participated in this thread.

For everyone's interest, including his, Balita is a part-government owned site that has been online for 11 years now, and yes, we are able to work with people at news.google.

We are not a site that needs to make itself known to every search engine nor are we after accolades. We get that from our visitors.

I participated in this discussion to help a fellow GL'er who was experiencing similar problems to what our site had experienced.

Zimdaily does not carry Java, which was suggested as the reason our site was not crawled ... As I pointed out, our site was crawled hourly, but it was the URLs that could not be captured, the same problem Zimdaily had.

news.google have provided me with the explanation of why our site cannot be crawled, and it is not HTML errors, Java or page sizes. I understand the same reason applies to ZimDaily (if he contacts me privately I will explain).

Enjoy your little chauvinist clique - I for one will read but will not participate in this forum again.

A million thanks go to Dirk and all the other polite members of the forum I have enjoyed their company.

Marites

ByteEnable

Forum User
Full Member
Registered: 10/20/03
Posts: 138
Quote by Marites: Whether it is because I am of the opposite sex but everytime I get involved in a thread on this forum I receive disgusting suggestions from an/other member saying that should not be running a web site but a home instead. (Kept polite)

I know who this is as does he ... he has participated in this thread.

You should inform Dirk so he can take care of this problem.


Quote by Marites: I participated in this discussion to help a fellow GL'er who was experiencing similar problems to what our site had experienced.

Hmmm....looked like you were asking for help.

For others Google says this:

# Make sure that your TITLE and ALT tags are descriptive and accurate.
# Check for broken links and correct HTML.
# If you decide to use dynamic pages (i.e., the URL contains a "?" character), be aware that not every search engine spider crawls dynamic pages as well as static pages. It helps to keep the parameters short and the number of them few.
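
The third point can be checked mechanically. A minimal sketch, assuming that "dynamic" simply means the URL contains a "?" as in the wording above; is_dynamic_url() is a hypothetical helper, and the query-string URL below is a made-up example, not taken from either site.

```php
<?php
// Hypothetical helper: a URL counts as "dynamic" here if it contains "?".
function is_dynamic_url($url)
{
    return strpos($url, '?') !== false;
}

// Path-style URL as in the markup example posted earlier in this thread
var_dump(is_dynamic_url('/article.php/2343243'));
// Made-up query-string URL for comparison
var_dump(is_dynamic_url('/article.php?story=2343243'));
```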



Quote by Marites: Enjoy your little chauvinist clique - I for one will read but will not participate in this forum again.

Don't let one bad apple spoil it, your posts brought out discussion which has never before been discussed, which is great!

ByteEnable

Forum User
Full Member
Registered: 10/20/03
Posts: 138
I've been looking at this lately because these posts got me thinking again. For a while I was diving into SEO techniques. I spent quite a bit of time on research. In the end, trying to be crafty required lots of time monitoring various search engines with very little payback. In one of Google's FAQs they say don't try to get too crafty, just build a good website and they will come.

At any rate, a previous post in this thread mentioned that Google requires a ".extension" at the end of a URL. I still find this hard to believe: why would such a high-tech outfit as Google have their bread and butter algorithm fail because of a .extension? I guess it's possible. I'm just in shock and disbelief. However, lightly scanning through some URLs at news.google, most of them have a .extension.

I also found another reference saying that the headline on the article page should not be a link but a plain heading (an h tag):

we found the reason why our system may not crawl some your articles because your articles' headlines are still active links on the article page. If it is feasible, we recommend making the article headline inactive on the article page but not on the hubpage.
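
The recommendation quoted above could be sketched roughly as follows. render_headline() and its parameters are hypothetical; Geeklog builds its pages from theme templates, so a real change would more likely be made in the theme's story templates than in a function like this.

```php
<?php
// Hypothetical helper: link the headline on the hub (index) page only,
// and print it as a plain heading on the article page itself.
function render_headline($title, $url, $isArticlePage)
{
    $safeTitle = htmlspecialchars($title);
    if ($isArticlePage) {
        return '<h1>' . $safeTitle . '</h1>';
    }
    return '<h2><a href="' . htmlspecialchars($url) . '">' . $safeTitle . '</a></h2>';
}

echo render_headline('My Story', '/article.php/2343243', false), "\n";
echo render_headline('My Story', '/article.php/2343243', true), "\n";
```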
