Geeklog Forums

Google News Can't Crawl My Site


Munazip

Anonymous
I sent my Geeklog site, Zimdaily.com, to Google News for crawling, and this is the response I have just received:

Thank you for your inquiry regarding inclusion in Google News. After some
investigation, we've found that our system cannot crawl your articles
because of the format of their URLs. In order to have your articles
crawled by Google News, their URLs must contain a number consisting of at
least three digits.

For example, our news crawler would not crawl articles with the following
URLs:
www.google.com/news/article23.html
www.google.com/lemurs_in_the_mist.html

It would crawl these pages:
www.google.com/news/08112003/article.html
www.google.com/news/lemurs_in_the_mist/23467.html

An example of a site that we are able to crawl successfully is
http://english.chosun.com. Please note that each article on this site has
a highly unique URL.

We apologize for this limitation of our system. If you are able to make
changes on your end to allow us to crawl your content, please let us know.

Regards,
The Google Team

Someone out there please help

eyecravedvd

Forum User
Full Member
Registered: 06/09/03
Posts: 152
Answer
Shane | www.EyeCraveDVD.com

Munazip

Anonymous
I have done all that. You can check my links; I have done what is said in your answer.

eyecravedvd

Forum User
Full Member
Registered: 06/09/03
Posts: 152
Interesting that other people's sites can get added, but yours won't. I wonder if it has to do with your naming of the stories?
Shane | www.EyeCraveDVD.com

Munazip

Anonymous
Google sent me the e-mail above, and I am trying to understand what they mean by

It would crawl these pages:
www.google.com/news/08112003/article.html
www.google.com/news/lemurs_in_the_mist/23467.html


and

For example, our news crawler would not crawl articles with the following
URLs:
www.google.com/news/article23.html
www.google.com/lemurs_in_the_mist.html



Here is an example link from my site http://zimdaily.com/news2/article.php/migrants_zim

How should I name the articles for Google News to be happy? I named this one migrants_zim. Any suggestions?

Dirk

Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Sounds like they insist on numbers in the story's URL (which I find odd, but then again I've never tried to submit a site to Google News).

bye, Dirk

Munazip

Anonymous
Dirk, do you mean I should use the numbers? And the other thing: does the .html at the end make any difference? It looks like the examples they have given me end in .html, and Geeklog uses .php.

Dirk

Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by Munazip: In order to have your articles crawled by Google News, their URLs must contain a number consisting of at least three digits.

At first I thought this meant that they wanted numbers in your URLs, e.g. http://zimdaily.com/news/article.php/migrants_zim12345 or something like that.

But I just noticed the "news2" in your URL - maybe that's what they mean: They want at least three digits there. "news123" would be okay, "news12" wouldn't.

No guarantees. If in doubt, try asking them again.
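
For anyone who wants to test their own URLs against the literal wording of Google's e-mail, here is a minimal Python sketch. It assumes the rule means a run of three or more consecutive digits anywhere in the URL path. That interpretation and the helper name has_three_digit_number are assumptions, not anything Google has confirmed.

```python
import re
from urllib.parse import urlparse

# Assumption: "a number consisting of at least three digits" means three or
# more consecutive digits somewhere in the path part of the URL.
THREE_DIGITS = re.compile(r"\d{3,}")


def has_three_digit_number(url: str) -> bool:
    """Return True if the URL path contains at least three consecutive digits."""
    # urlparse only fills in .path reliably when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    return THREE_DIGITS.search(urlparse(url).path) is not None


examples = [
    "www.google.com/news/article23.html",
    "www.google.com/lemurs_in_the_mist.html",
    "www.google.com/news/08112003/article.html",
    "www.google.com/news/lemurs_in_the_mist/23467.html",
    "http://zimdaily.com/news2/article.php/migrants_zim",
]
for url in examples:
    print(has_three_digit_number(url), url)
```

Under this reading, the two "would not crawl" examples and the zimdaily.com link come back False, and the two "would crawl" examples come back True.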

bye, Dirk

Munazip

Anonymous
Thanks, Dirk. I have sent them an e-mail and will update you on what they say.

Thank you again

Marites

Forum User
Chatty
Registered: 02/04/04
Posts: 64
I have been battling with this one for over a year. Our site posts 50 or so news items daily, but news.google cannot use the GL URLs, which is not good, as we are the oldest and busiest news web site in our country.

Google (the search engine side), however, has no problem with the GL URLs, and our 25,000-plus stories can be located on that part of Google without problem.

We spent a great deal of time and many exchanges of mail with news.google, and in the end had a custom PHP page written, which the Google bot visits many times a day to pick up URLs.

Now here is a result from a Google search using the following:

"security in Philippine skies"

and the result


30 - Security nightmare with unguarded Philippine skies ... Unguarded Philippine skies is no longer a wishful thinking. For military and strategic defense ...
news.balita.ph/html/article.php/20051030144608086 -
25k - Cached - Similar pages - Filter


I did not respond straight away to your posting, as I emailed a friend at Google first. He says their bot has no problem with URLs made of numbers, letters, or any combination of the two. So I assume you are trying to get links on news.google, which, as I say, is not as cut and dried as with plain Google.

I realise this reply does not address your problem directly; it only gives you an insight into how we approached it.

I will say I upgraded lately and all the URLs changed. Google put these changed URLs in place within 24 to 36 hours, and I did not bother to write the script some said would be required for Google to show the new URLs.

I do understand your frustrations; as they say, I have been there and bought the T-shirt.

Can I suggest you go to my site, look at the top, and click Headlines? It is a page similar to this, which Google accesses each day to extract our news.

It works for us so should in theory work for you also.
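
To give an idea of what such a page involves, here is a rough Python sketch that builds a plain list of story links. It is not our actual script (that one is custom PHP). The function name build_headlines_page and the page layout are assumptions for illustration, and Google News would still have to agree to read the page.

```python
from html import escape


def build_headlines_page(stories, title="Headlines"):
    """Return a small static HTML page with one plain link per story.

    `stories` is an iterable of (headline, url) pairs.
    """
    items = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(headline)}</a></li>'
        for headline, url in stories
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body><h1>{escape(title)}</h1>\n<ul>\n{items}\n</ul>\n</body></html>\n"
    )


stories = [
    ("Security nightmare with unguarded Philippine skies",
     "http://news.balita.ph/html/article.php/20051030144608086"),
]
print(build_headlines_page(stories))
```

The point of the design is that the bot gets one lightweight page with nothing but headlines and links, instead of having to discover stories through the full site layout.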

Marites

Munazip

Anonymous
Marites, thank you for your post. I have been to your site, and it looks fine. What do you suggest I do with my articles so that Google News can be happy?

Marites

Forum User
Chatty
Registered: 02/04/04
Posts: 64
With Google.com (the search engine) there is no problem; they will find and link you.

news.google is a different matter, as they say they cannot read the GL URLs. All I can suggest is that you set up a page similar to our 'Headlines' and ask Google News if they will use it.

Over the past few weeks out of 500 plus stories these are the only ones news.google has picked up:

See here

I did bring this up with Dirk over a year ago. As only a limited number of GL users need to have their items listed on news.google, I guess it is difficult for Dirk to know what news.google needs in order to read GL URLs. I really do not understand why there is a difference, or what it is, between how google.com and news.google handle the URL.

All I can do is wish you luck and hope you have more success than I have.

Any thoughts Dirk ? or better still any suggestions.

I am always around.

Late note: I have just looked back over my correspondence with news.google. It seems they prefer a URL to be like this:

h**p://news.balita.ph/html/article//20051104110320755.(extension)
rather than:
h**p://news.balita.ph/html/article.php/20051104110320755 as with GL.

Perhaps it is just the .(extension) that is missing (ie htm, http, php .... )

Marites



Dirk

Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by Marites: Perhaps it is just the .(extension) that is missing (ie htm, http, php .... )

There should be nothing stopping you from using ".html" or ".php" as part of the story's ID ...

bye, Dirk

samstone

Forum User
Full Member
Registered: 09/29/02
Posts: 820
OK, here is my little help with what my little brain can come up with.

I think Google has given you guys a canned answer without investigating specifically why their system can't list you. They list my site with no problem so far. I use a basic GL installation without even having URL rewrite turned on. The only tweak is removing the hyperlink from the full article's title.

- For both Zimdaily.com and Balita.com, I think the problem could be that you forward your domain root to a subdirectory (news2 or html).

- For Balita.com, you would need to remove the hyperlink from the full article title.

Other than that I don't see any problem.

Sam




Marites

Forum User
Chatty
Registered: 02/04/04
Posts: 64
Dirk, a year ago I asked you about using .php or .html and how this could be done automatically. You may or may not remember that you told me this was not possible. Bear in mind we do not post stories by hand; they are posted automatically, so renaming 50 to 100 stories a day would not be practicable. (I run 6 web sites, all news and GL based.)

So now you are saying there is nothing to stop this - can you say how - we can hack lib.common if need be.

Tess


Marites

Forum User
Chatty
Registered: 02/04/04
Posts: 64
Can't vouch for zimdaily, but Balita worked with news.google on this one for almost 3 to 4 months, as they wanted our news. We set up a test rig with Mambo and another paid CMS, and both were readable by their bot. For various reasons our organization and writers prefer GL and want to stick with it.

I don't think it is the subdirectory structure, as our other sites use differing layouts (i.e. without the html) and it does not make the slightest difference to the news.google bot.

Tess





samstone

Forum User
Full Member
Registered: 09/29/02
Posts: 820
How about the hyperlink on the full story, which could get the bot caught in a loop, since it is linked to itself?

I am just helping you brainstorm. Since mine works out of the box with the hyperlink on the full story title, it makes me think it might be something you did, rather than something you didn't, that changes the course.

Sam

Marites

Forum User
Chatty
Registered: 02/04/04
Posts: 64
Bit lost on this one ... surely each title has a hyperlink, and that should be sufficient. Our government partner, also using a CMS (not GL) that can be read by news.google, has their hyperlink on 'see more'.

As I say, with news.google we tried dozens of methods. Whereas zimdaily may have got a 'canned' answer, we did not; news.google even tried to write scripts to help. They are adamant GL is at fault .... I am not certain what the problem is, but I would have thought that if there was a solution, news.google programmers would have found it for us.

Having said that, both CNN and the BBC are able to read our URLs and often link to our news stories.

Thanks. Tell me what you had in mind to hyperlink and I will get someone to try.

Tess


ByteEnable

Forum User
Full Member
Registered: 10/20/03
Posts: 138
Quote by Marites:
As I say, with news.google we tried dozens of methods. Whereas zimdaily may have got a 'canned' answer, we did not; news.google even tried to write scripts to help. They are adamant GL is at fault .... I am not certain what the problem is, but I would have thought that if there was a solution, news.google programmers would have found it for us.


Your site has numerous html problems:
http://news.balita.ph/html/article.php/20051104110320755

# Error  Line 318 column 138: end tag for element "A" which is not open.
# Error Line 361 column 141: there is no attribute "FRAMESPACING".
# Error Line 361 column 160: value of attribute "FRAMEBORDER" cannot be "NO"; must be one of "1", "0".
# Error Line 364 column 172: value of attribute "FRAMEBORDER" cannot be "NO"; must be one of "1", "0".
# Error Line 364 column 232: there is no attribute "ALLOWTRANSPARENCY".
# Error Line 367 column 194: end tag for element "A" which is not open.
# Error Line 368 column 29: end tag for element "NOLAYER" which is not open.
# Error Line 369 column 100: end tag for element "ILAYER" which is not open.
# Error Line 378 column 11: there is no attribute "SRC".
# Error Line 378 column 116: there is no attribute "WIDTH".
# Error Line 378 column 129: there is no attribute "HEIGHT".
# Error Line 378 column 146: there is no attribute "VISIBILITY".
# Error Line 378 column 162: there is no attribute "ONLOAD".
# Error Line 378 column 269: element "LAYER" undefined.
# Error Line 461 column 29: required attribute "TYPE" not specified.
# Error Line 471 column 71: required attribute "TYPE" not specified.
# Error Line 480 column 175: value of attribute "FRAMEBORDER" cannot be "NO"; must be one of "1", "0".
# Error Line 592 column 188: value of attribute "FRAMEBORDER" cannot be "NO"; must be one of "1", "0".
# Error Line 647 column 11: end tag for element "TABLE" which is not open.

 


Also on that page:

Your story title should be:

<h1>04 - Rescuers unable to find plane wreckage (with earlier report)</h1>

 

Not a URL. Also, your HTML is full of JavaScript and lots of unused whitespace.

Also, your page is 167K in size. Google is not going to wait around too long for a page to load.
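
If you want to check those two points on your own pages, here is a small Python sketch using only the standard library. It reports the page size and whether the story title sits in a plain h1 or is wrapped in a link. The names H1Finder and audit_page are made up for this example, and what Google News actually looks at is an assumption on my part.

```python
from html.parser import HTMLParser


class H1Finder(HTMLParser):
    """Collect the text of every <h1> and note whether a link sits inside one."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []
        self.link_in_h1 = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> and <h1> both match.
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")
        elif tag == "a" and self.in_h1:
            self.link_in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings[-1] += data


def audit_page(html_text: str) -> dict:
    """Return the page size in KB, the h1 texts, and whether a title is a link."""
    finder = H1Finder()
    finder.feed(html_text)
    return {
        "size_kb": round(len(html_text.encode("utf-8")) / 1024, 1),
        "h1_titles": [h.strip() for h in finder.headings],
        "title_is_link": finder.link_in_h1,
    }


sample = (
    "<html><body><h1>04 - Rescuers unable to find plane wreckage "
    "(with earlier report)</h1></body></html>"
)
print(audit_page(sample))
```

Feed it the saved source of one of your article pages and compare the result before and after trimming the template.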

Byte
LinuxElectrons

Dirk

Site Admin
Admin
Registered: 01/12/02
Posts: 13073
Location:Stuttgart, Germany
Quote by Marites: So now you are saying there is nothing to stop this - can you say how - we can hack lib.common if need be.

What I meant was that you could simply add ".php" or ".html" to the end of your story's ID (assuming this is really what Google News is looking for).

You didn't say how you post those news items automatically. If you're relying on Geeklog to give them a story ID (i.e. one of those numeric IDs like 20051105102630123), then you could hack COM_makesid (in lib-common.php) to add that extension (although that would then also apply to IDs for events and links).

Otherwise, you could modify whatever you use to post the news to add the extension there.
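
As an illustration of that second option, here is a minimal Python sketch of a posting script that builds the ID itself. It is not a port of COM_makesid. The ID layout (a 14-digit timestamp plus three random digits) is only inferred from the example IDs in this thread, and make_story_id is a hypothetical helper name.

```python
import random
from datetime import datetime


def make_story_id(now=None, extension=".html", rng=random):
    """Build an ID shaped like 20051105102630123.html.

    Assumption: timestamp (YYYYMMDDHHMMSS) + three random digits + extension.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d%H%M%S")
    suffix = f"{rng.randrange(1000):03d}"  # zero-padded so the length stays fixed
    return f"{stamp}{suffix}{extension}"


print(make_story_id(datetime(2005, 11, 5, 10, 26, 30)))
```

Doing this in the posting script rather than in lib-common.php keeps the change away from the IDs for events and links.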

Again, I have no experience with getting a site into Google News so I don't know if this is really your problem. Others have pointed out other possible issues with your site, maybe you should look into those first ...

bye, Dirk