Each time the spider runs on our intranet, it reports errors for pages that haven't existed for several years. I manually removed them from the textbase.
The consistent message is Spider error - HTTP error 404 File not found, URL:
If I untick "re-spider documents already catalogued", it doesn't appear to pick up any new pages that have been added.
Ticking the box means a massive log file.
If these files aren't in the textbase, aren't on the server, and have no links pointing to them (I know this because we moved from .htm to .shtm extensions on everything), why is the spider still reporting these broken links?
Because they're still in the NAVSDB.* files, which is where the pages are "already catalogued". The Spider doesn't read your textbase to find out which "already catalogued" pages to spider; it reads these NAVSDB.* files.
You can start your spider from scratch by deleting these files, but you'll need to make sure your Initial URL List is right, or the spider won't crawl anything.
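If you'd rather not delete the files outright, here is a rough Python sketch that moves them into a backup folder instead, so you can put them back if the fresh crawl goes wrong. The function name and both folder paths are my own placeholders, not anything the Spider defines; you'd need to point it at wherever your NAVSDB.* files actually live, and I'm assuming the spider isn't running while you do this.

```python
import shutil
from pathlib import Path


def archive_spider_catalog(spider_dir, backup_dir):
    """Move every NAVSDB.* file from spider_dir into backup_dir.

    Both arguments are hypothetical paths supplied by the caller.
    Returns the names of the files that were moved, sorted.
    """
    src = Path(spider_dir)
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)

    moved = []
    # Only files directly inside spider_dir are considered; subfolders are skipped.
    for path in sorted(src.glob("NAVSDB.*")):
        if path.is_file():
            shutil.move(str(path), str(dst / path.name))
            moved.append(path.name)
    return moved


# Example call (placeholder paths, adjust to your own setup):
# archive_spider_catalog(r"C:\path\to\spider", r"C:\path\to\spider\navsdb_backup")
```

Note that the pattern match is case-sensitive on Linux and macOS but not on Windows. Once the files are out of the way, check the Initial URL List before the next run, as mentioned above.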