How to index specific directories and eliminate others?


  • #1

    Hello, I have been working with Zoom Search Engine for years on our Intranet, but I'm running into a new challenge. Our Intranet has grown larger, and I need the search engine to cover specific directories and skip others. In the past I set up spider mode to start from a specific URL (directory) and then added additional start points (new directories). But as those directories do not contain index files, they appear to be skipped ("External site - does not match base URL").

    I considered building an offline index of a directory containing copies of all the pertinent files, but this would not help our users, as they need to be able to locate the files on our Intranet.

    Any advice on the best solution? Thanks!! /Cat

  • #2
    Originally posted by Cat H
    ... but as they do not have index files within the directories, it seems they are being skipped (External site - does not match base URL).
    If the skip message you are seeing is "External site - does not match base URL", then the reason is not that they have no index files (at least, that is not the only reason). The reason is that the URL is outside the specified base URL.

    For a URL to be considered part of the site, it must contain the entirety of the base URL. So if your base URL is:
    http://www.mysite.com/sectionA/

    Then a link to:
    http://www.mysite.com/sectionB/page.htm
    or
    http://www.mysite.com/index.html

    would both be considered "external" and not part of the same site. What you can do is CHANGE your base URL (click "More" and then "Edit") and, for example, change it to:
    http://www.mysite.com/

    Then all of the above URLs will be considered part of the site.
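    The matching rule boils down to a simple prefix check. As a sketch (this is an illustration of the rule described above, not Zoom's actual code):

    ```python
    # Illustration only: a URL counts as "internal" when it contains the
    # entire base URL as a prefix. This is a sketch of the rule described
    # above, not Zoom's actual implementation.
    def is_internal(url: str, base_url: str) -> bool:
        """A URL is part of the site only if it starts with the base URL."""
        return url.startswith(base_url)

    # With a narrow base URL, sibling sections are treated as external:
    base = "http://www.mysite.com/sectionA/"
    print(is_internal("http://www.mysite.com/sectionA/page.htm", base))  # True
    print(is_internal("http://www.mysite.com/sectionB/page.htm", base))  # False
    print(is_internal("http://www.mysite.com/index.html", base))         # False

    # Widening the base URL makes all of the above internal:
    base = "http://www.mysite.com/"
    print(is_internal("http://www.mysite.com/sectionB/page.htm", base))  # True
    ```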

    Now, the fact that you have no index files and are pointing at a directory is potentially another problem: Spider Mode needs an HTML page to crawl for more links.

    Some web servers are configured to generate a "directory listing" page, so that when you access a directory URL with no index page, the server automatically returns an HTML page with links to the contents of that directory. You may want to enable this on your server if it is not already. It is easy to check: just go to that URL in your browser and see if you get a directory listing with hyperlinks that you can follow through to the files you want indexed.
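    All the spider needs from such a listing page is a set of links it can follow. This sketch parses a made-up, Apache-style listing (the HTML below is a hypothetical example, not output from your server) to show what the crawler extracts:

    ```python
    # Sketch: extract followable links from a directory-listing page.
    # The HTML below is a made-up example of an auto-generated listing.
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            # Collect the href of every <a> tag found on the page.
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    listing_html = """
    <html><body><h1>Index of /green/</h1>
    <a href="../">Parent Directory</a>
    <a href="report1.pdf">report1.pdf</a>
    <a href="report2.html">report2.html</a>
    </body></html>
    """

    parser = LinkExtractor()
    parser.feed(listing_html)
    print(parser.links)  # ['../', 'report1.pdf', 'report2.html']
    ```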

    Originally posted by Cat H
    I thought to do a offline index on a directory with copies of all pertinent files, but this would not help our users as they need to be able to locate the file on our Intranet.
    You can use Offline Mode for this situation. The difference with Offline Mode is that the URL in the search results is rewritten based on your "Base URL"; it does not link to the actual file that was indexed. So if you index (offline) from this start folder:
    C:\MyFiles\MyLocalCopySite\

    And we index this file:
    C:\MyFiles\MyLocalCopySite\news.html

    And you've specified your base URL is:
    http://www.mysite.com/

    Then the search result will link to:
    http://www.mysite.com/news.html
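    The rewrite rule can be sketched like this (an assumed illustration of the behaviour described above, not Zoom's actual code): the start folder prefix is swapped for the base URL.

    ```python
    # Sketch (assumed behaviour, not Zoom's code): replace the start folder
    # prefix of the local path with the base URL, fixing path separators.
    def rewrite_link(local_path: str, start_folder: str, base_url: str) -> str:
        relative = local_path[len(start_folder):].replace("\\", "/")
        return base_url + relative

    url = rewrite_link("C:\\MyFiles\\MyLocalCopySite\\news.html",
                       "C:\\MyFiles\\MyLocalCopySite\\",
                       "http://www.mysite.com/")
    print(url)  # http://www.mysite.com/news.html
    ```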

    Hope that clears things up.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      Thank you Ray for your reply.
      My concern is that if I set the base URL to the primary directory, it will index from our Intranet home page and pick up a bunch of files that I don't want.

      Here’s an analogy. I only want “green” files. So, I set the base URL to…
      http://dsshome/green/, INDEX_AND_FOLLOW_ALL, http://dsshome/green/

      Now that more “green” files have been added within additional directories that are not in the green directory, I added additional starting points as follows:
      http://dsshome/other1/, INDEX_AND_FOLLOW, http://dsshome/other1/
      http://dsshome/other2/, INDEX_AND_FOLLOW, http://dsshome/other2/

      But these directories are skipped. How can I add these files to the search?

      I checked, and we do have the server configured to display a directory listing for a URL to a directory with no index page.

      Thanks so much for your help!



      • #4
        Originally posted by Cat H
        Now that more “green” files have been added within additional directories that are not in the green directory I added additional starting points as follows:
        http://dsshome/other1/, INDEX_AND_FOLLOW, http://dsshome/other1/
        http://dsshome/other2/, INDEX_AND_FOLLOW, http://dsshome/other2/

        But, these directories are skipped. How can I add these files to the search?
        I don't see how this would be the case if the error was regarding the base URL, because here the base URL matches the start URL.

        The devil is very likely in the details. Can you give us the real URLs and the actual log messages? Save the entire index log and e-mail it to us (zipped if large).

        If you are getting a different skip message for these start points (e.g. the URL is already indexed), then that would be possible, because your first start point (the green one) is set to "INDEX_AND_FOLLOW_ALL", which means it follows links outside of the green folder. Are you sure you need this? You haven't said anything that indicates you need it. Perhaps you misunderstood how it works; check the Help button on that Spider Options window.

        Because of this, it is possible the spider found a link to "/other1/" earlier and already indexed that page, so the additional start point is skipped by the time the spider gets around to it.
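        To see why the order matters, here is a toy simulation of that behaviour (an assumed sketch with a hypothetical link graph, not Zoom's actual crawler): the crawler keeps one "already indexed" set, so a start point reached earlier via FOLLOW_ALL is later skipped as a duplicate.

        ```python
        # Sketch (assumed behaviour): one shared "already indexed" list means
        # a start point can be pre-empted by an earlier crawl that reached it.
        def crawl(start_points, links_from):
            indexed, skipped = [], []
            for url in start_points:
                queue = [url]
                while queue:
                    current = queue.pop(0)
                    if current in indexed:
                        skipped.append(current)  # duplicate: already crawled
                        continue
                    indexed.append(current)
                    queue.extend(links_from.get(current, []))
            return indexed, skipped

        # Hypothetical link graph: the green start point (FOLLOW_ALL) links
        # out to /other1/ before that start point is ever processed.
        links = {"http://dsshome/green/": ["http://dsshome/other1/"]}
        indexed, skipped = crawl(
            ["http://dsshome/green/", "http://dsshome/other1/"], links)
        print(skipped)  # ['http://dsshome/other1/']
        ```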
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine

