|
I'm probably really goofing up here, but I cannot get the spider to spider my intranet.
I've put the host's IP address and name in the Initial URL, but the same message, 'Inaccessible URL', is returned. The spider sits on the intranet server: would this make a difference? |
|||
|
It shouldn't matter what server the spider is on, as long as the account being used has the correct permissions. If you don't remember setting up any accounts or permissions with the service, then it's likely just using the system account. This is fine if you're spidering the same server.
Some notes:
- In the domain list, put in *only* the IP or domain (e.g. www.domain.com or 127.0.0.1 or servername).
- In the initial URL list, add the http:// to the front (e.g. http://www.domain.com/index.htm). |
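To make the split between the two settings concrete, here is a minimal Python sketch that derives the domain-list entry from an initial URL. The helper name `domain_entry` is hypothetical and is not part of the spider; it only shows which part of the URL belongs in which field.

```python
from urllib.parse import urlsplit


def domain_entry(initial_url: str) -> str:
    """Return the bare host (no scheme, port, or path) for the domain list.

    Hypothetical helper for illustration only; the spider has no such call.
    """
    parts = urlsplit(initial_url)
    # The initial URL needs the http:// prefix; without it there is no host part.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"initial URL needs an http:// prefix and a host: {initial_url!r}")
    return parts.hostname


if __name__ == "__main__":
    for url in ("http://www.domain.com/index.htm", "http://127.0.0.1/index.htm"):
        print(url, "->", domain_entry(url))
```

The initial URL keeps its full form, while the domain list gets only the value returned here.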
||||
|
This is weird.
I did what you said and specified just the server. But now it only spiders non-HTML files (*apart from* the Initial URL file). |
||||
|
This is embarrassing: I think I've worked out why. The Initial URL file doesn't have links to the other docs.
I was hoping that the spider would catalog all docs in a folder, without there needing to be a page linking to those docs first. I'd hoped that a page returned by a canned query would in fact serve as an index page. I don't suppose there's any way to do that? |
||||
|
I'll answer the second part of your question first: no, I don't think there's any way to have the spider execute the canned query through webpublisher. You might try putting the canned query string in the initial URL field to see if that will work, but no guarantees. The dynamic nature of pulling things out of databases isn't conducive to spidering technology. It's a kludge, but you could just write the report to a file (HTML) and use that as your initial URL. It doesn't have to be an active page on your intranet; it is used just as a spider starting point.
Ok, now the first part of your question... To be ultra simplistic, there are two parts of the spider: the File Crawler and the Spider. The Spider part is what goes link to link through HTML documents. The File Crawler can be given a directory tree to start in, and will grab all documents in those directories, so long as they match the file extensions you've specified (this is so you can control what documents get imported, rather than EVERYTHING that may be in a folder).
You can do one, the other, or both during a given spider session, but it's redundant to spider http://localhost/ and also to crawl c:\inetpub\wwwroot\, because you'll get the same document twice, specified once with a URL and once with a file URL. Also, the crawler doesn't care about links. It only cares about cataloging the documents that it grabs, so if a document has a link to a doc that is in a separate directory tree, the crawler will ignore that doc, but the spider will get to it.
Hope that's cleared a few things up.
------------------
Rachel |
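The "write it to a file and use that as the starting point" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_index`, the folder, the base URL, and the extension list are all hypothetical, and nothing here is part of the spider itself. It walks a directory tree, keeps files with the chosen extensions, and returns a static HTML page of links that could then be saved and used as the initial URL.

```python
import os
from html import escape
from urllib.parse import quote


def build_index(root_dir, base_url, extensions):
    """Return an HTML page linking to every matching file under root_dir.

    Hypothetical helper: base_url is assumed to be the web address that
    maps onto root_dir on the server.
    """
    wanted = {ext.lower() for ext in extensions}
    links = []
    for folder, subdirs, files in os.walk(root_dir):
        subdirs.sort()  # visit subfolders in a stable order
        for name in sorted(files):
            if os.path.splitext(name)[1].lower() not in wanted:
                continue  # skip extensions we did not ask for
            rel = os.path.relpath(os.path.join(folder, name), root_dir)
            # Build a URL path from the relative file path, escaping spaces etc.
            url = base_url.rstrip("/") + "/" + quote(rel.replace(os.sep, "/"))
            links.append(f'<li><a href="{escape(url)}">{escape(rel)}</a></li>')
    return "<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>"
```

Saving the returned string as, say, an `index.htm` inside the web root gives the Spider a page whose links it can follow, which is the same effect as the exported report described above.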
||||
|
Another way to have the spider crawl all files in a directory is to:
- disable the default page, and
- enable directory browsing, and
- specify a directory rather than a specific page as your initial URL (e.g., http://www.domain.com/files/)
If there's no default page for a directory, the HTTP server will return a directory listing page instead. The spider will then crawl all the links on this page. I don't believe you can use this for recursive subdirectory searching, just for that one top-level directory. It will also then continue to spider all the links in those pages. |
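To show why a directory listing works as a starting point, here is a rough Python sketch of what a link-following spider sees on such a page. The class and function names are hypothetical, and the sample markup is an assumption: real servers format their listing pages differently, and the spider's own link extraction is not documented in this thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative and root-relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def links_in_listing(html_text, base_url):
    """Return the links found in a directory listing page (illustrative only)."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links


if __name__ == "__main__":
    # Assumed listing markup, for demonstration only.
    sample = (
        '<pre><a href="/files/report.doc">report.doc</a><br>'
        '<a href="notes.htm">notes.htm</a></pre>'
    )
    for link in links_in_listing(sample, "http://www.domain.com/files/"):
        print(link)
```

Every file in the folder shows up as an ordinary link, so no hand-written index page is needed.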
||||
|
Thank you, both. You've answered all my questions!
|
||||
|
Glad to have helped--I love the spider. (Don't always love troubleshooting it, but ... that comes with the territory.)
|
||||

