V5 development progress - Incremental indexing

  • V5 development progress - Incremental indexing

    These features will allow you to update or manage an existing set of index files, without having to perform a full re-index. There are several options available, which can all be found under the "Index" menu.

    Requirements
    ------------
    First of all, incremental indexing is only available for the PHP, ASP, and CGI versions. It is not available for the JavaScript version, since that version cannot index a set of files large enough for incremental indexing to be beneficial.

    Second, in order to use incremental indexing, you must NOT have modified your indexing configuration since the last index was made. The ZCFG file must contain the exact same settings, and the index files must still be in the output folder specified.

    "Update existing index"
    ------------------------
    This option will look through the list of pages found in your existing index and check if they have since been modified. It will then perform a partial index of only the pages that have changed (and potentially index any new pages that you have added links to).

    Note that there are some limitations to this, and that with each subsequent update, the index gets larger and less efficient. We recommend performing a full re-index regularly where possible (perhaps once a week, or once a month, depending on how often you perform a partial index).

    Note also that Zoom's ability to determine whether a file was modified depends entirely on the last-modified date and file size it retrieves. If these attributes are inaccurate or do not reflect the changes to the file, Zoom will not be able to accurately find the files which have changed.
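
    As a rough illustration of why this matters, here is a minimal sketch of a date-and-size comparison. It is not Zoom's actual code: the `PageStamp` type and `looks_modified` function are hypothetical names, and treating a missing date or size as "changed" is an assumption made for this sketch.

```python
from typing import NamedTuple, Optional


class PageStamp(NamedTuple):
    """Attributes recorded for a page (hypothetical structure)."""
    last_modified: Optional[str]  # last-modified date as reported, or None if absent
    size: Optional[int]           # file size in bytes, or None if unknown


def looks_modified(indexed: PageStamp, current: PageStamp) -> bool:
    """Return True if the page should be treated as changed."""
    # Assumption: with no date or size there is nothing to compare,
    # so the page is treated as changed.
    if current.last_modified is None or current.size is None:
        return True
    # Otherwise the page counts as changed only if the date or size differs.
    return current != indexed
```

    With a check of this kind, a page whose content changed but whose reported date and size stayed the same would be skipped, which is the limitation described above.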

    "Add start points to existing index"
    -------------------------------------------------
    This option allows you to add and index a list of start points (usually a new website, or a part of a new website) to an existing index. This can be useful if you manage a list of websites as start points and you wish to add new start points to the index on a regular basis.

    It will index the new start point, append this data to the existing index (without having to re-index the existing start points) and save the configuration with your added start points (so that on your next full re-index, the new start points will be included).

    "Add list of new or updated pages"
    -------------------------------------------------
    This feature allows you to specify a list of new pages which are to be indexed and added to the existing index. If you specify a page here which already exists in the index, Zoom will assume that this page has been updated/modified, and will remove the old data for this page, and add the new one.

    "View or delete pages from existing index"
    -------------------------------------------------
    This allows you to browse the list of pages which exist in your current index. It also allows you to mark certain pages for deletion - removing them from the searchable content. Note that deleting pages using this function will NOT decrease the size of your index files.

    To summarize regarding the effect of these features on your existing index:

    - Adding new pages does not compromise the efficiency of an existing index.
    - Updating and removing pages cause an existing index to become progressively less efficient (as more pages are removed/updated).
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine

  • #2
    Command-line parameters for incremental indexing
    ------------------------------------------------
    We've added a list of command-line parameters to Zoom that will allow you to call upon the above incremental indexing features via the command-line. This will allow developers to call Zoom to perform these operations via external scripts or applications (eg. you could have a server-side script which calls upon Zoom to add a new start point to an existing index when a user submits them via a webpage).

    The new commands are:

    -update
    This will perform an incremental update (as described above) on the specified ZCFG file. You must also specify the index mode (offline or spider) and the config file like so:

    Code:
    ZoomIndexer.exe -s zoom.zcfg -update

    -addpage
    This will add a specific page to the existing index specified by the config file and index mode. eg.

    Code:
    ZoomIndexer.exe -s zoom.zcfg -addpage http://www.mywebsite.com/newpage.html

    Note that in offline mode, the page to add must be followed by a pipe ("|") character and a base URL, eg.

    Code:
    ZoomIndexer.exe -o zoom.zcfg -addpage C:\mywebsite\newpage.html|http://www.mywebsite.com/

    -addpages
    This is the same as -addpage but allows you to specify a text file containing a list of new pages (rather than calling it for one page only). eg.

    Code:
    ZoomIndexer.exe -s zoom.zcfg -addpages newpages.txt

    Similarly, offline mode will expect a base URL following the text filename (separated by a pipe character).

    -addstartpt
    This option will perform an incremental add start point operation on the specified config file and index mode. eg.

    Code:
    ZoomIndexer.exe -s zoom.zcfg -addstartpt http://www.mynewsite.com/

    Offline mode will expect a base URL following the start directory (separated by a pipe character). eg.

    Code:
    ZoomIndexer.exe -o zoom.zcfg -addstartpt C:\mynewwebsite\|http://www.mynewwebsite.com/

    -addstartpts
    This is the same as -addstartpt but allows you to specify a text file containing a list of start points.

    In spider mode, the format of this text file is the same as for the "Import start points" feature, which allows you to specify spidering options such as "index and follow" or "index only", as well as a limit on the number of pages to index for each start point. See the chapter on "Importing and exporting additional URLs" in the Users Guide for more information.


    -deletepage
    This parameter will delete the specified page from the index as configured by the ZCFG file given and the index mode specified. eg.

    Code:
    ZoomIndexer.exe -s zoom.zcfg -deletepage http://www.mywebsite.com/oldnews.html
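
    For anyone calling these parameters from an external script, the following is a minimal sketch of how the command lines above could be assembled in Python. The helper name `build_zoom_command` and its arguments are hypothetical; only the switches and the pipe-separated offline format come from the descriptions above. It only builds the argument list and does not run anything.

```python
from typing import List, Optional

# Mode switches used in the examples above: -s (spider) and -o (offline).
MODE_FLAGS = {"spider": "-s", "offline": "-o"}

# Operations described above as taking "target|base URL" in offline mode.
NEEDS_BASE_URL_OFFLINE = {"-addpage", "-addpages", "-addstartpt"}


def build_zoom_command(
    mode: str,
    config: str,
    operation: str,
    target: Optional[str] = None,
    base_url: Optional[str] = None,
    exe: str = "ZoomIndexer.exe",
) -> List[str]:
    """Return the argument list for one incremental indexing call."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown index mode: {mode!r}")
    args = [exe, MODE_FLAGS[mode], config, operation]
    if operation == "-update":
        return args  # -update takes no further argument
    if target is None:
        raise ValueError(f"{operation} needs a page, start point, or list file")
    if mode == "offline" and operation in NEEDS_BASE_URL_OFFLINE:
        if base_url is None:
            raise ValueError("offline mode needs a base URL after the pipe")
        # Offline format from the examples above: target|base URL
        target = f"{target}|{base_url}"
    args.append(target)
    return args
```

    For example, `build_zoom_command("spider", "zoom.zcfg", "-deletepage", "http://www.mywebsite.com/oldnews.html")` yields the same arguments as the last command above. The list could then be passed to `subprocess.run`; passing a list rather than a single string means the pipe character is not interpreted by a command shell.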
    --Ray


    • #3
      "Add list of new or updated pages"
      -------------------------------------------------
      This feature allows you to specify a list of new pages which are to be indexed and added to the existing index. If you specify a page here which already exists in the index, Zoom will assume that this page has been updated/modified, and will remove the old data for this page, and add the new one.


      Hello, does this mean that if I update the page storm.html on my server, I can add this page using the method above? Am I correct in thinking Zoom will remove storm.html and all its info and then add it again to the index?

      Does this mean no performance loss?

      Thank you
      AG!



      • #4
        Yes, you can update a single page in the index (or a list of pages). When you use any of the update options, the set of index files becomes less tightly compressed. This has a small impact on search performance, which becomes greater as more and more files are replaced. Adding new pages that don't already exist in the index doesn't have any significant performance impact. The situation is analogous to hard disk fragmentation: every so often you'll need to defrag. In the case of Zoom, the defrag is a full re-index.

        We don't have any good benchmark figures to post, but as a rule of thumb, if you have updated more than 20% of the pages in your index, you should do a full re-index to 'defrag' the index.
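
        A script that schedules updates could apply this rule of thumb with a check like the one below. This is only a sketch: the function name and the idea of tracking the two page counts yourself are assumptions, since nothing in this thread says how Zoom exposes them.

```python
def should_full_reindex(pages_in_index: int, pages_updated: int,
                        threshold: float = 0.20) -> bool:
    """Return True once more than `threshold` of the indexed pages were updated."""
    if pages_in_index <= 0:
        raise ValueError("pages_in_index must be positive")
    # "More than 20%" is read as a strict comparison here.
    return pages_updated / pages_in_index > threshold
```

        For instance, with 1000 pages in the index, 250 updated pages would trigger a full re-index, while 100 would not.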



        • #5
          Sounds good, thank you for the quick reply.

          One more quick question: is it possible to 'defrag' on the local machine instead of having to re-index? (Theoretically all of the data is in the files; it just needs repacking.) I'm asking because a) it would be quicker and b) it would save bandwidth, and some external sites which I index can't take the extra load.

          AG!


          • #6
            We currently do not have plans to add a feature to "re-compact" an index without performing a full re-index. We agree that it would be useful, and it is something we could consider for a future version. It is technically complicated to implement though, and we've locked down the feature set for V5 and are hoping for a release soon, so... maybe V6.

            In the meantime, you could consider increasing the size of your web cache (in Windows/Internet Explorer which shares a common cache) and allowing Zoom to use the cache to minimize web traffic.
            --Ray


            • #7
              I'm finally beginning to get my teeth into these features.

              However, some things are still unclear.

              **These questions are regarding the CLI**

              Add start points - If I have a start point already in the cfg file, say google.com and I want to add a single page to the index from another domain, do I have to add a start point first or can I just add the additional page?

              Update existing index - If the site I'm indexing is dynamic (with no date or meta info), what does Zoom do? Does it skip the file and not update it (matching it with the URL), OR does it delete the file from the index and add it anyway, OR does it do something else?

              Does a command-line command autostart zoom? (I assume it does)
              If yes, does it auto-close zoom upon completion?
              If zoom is already open what happens?
              If zoom is in the middle of an operation and a command is executed, what will happen?

              I intend to host zoom remotely on a shared / dedicated server, what are hosting requirements / any tips?

              Thank you for your time
              AG!



              • #8
                Originally posted by AG!
                Add start points - If I have a start point already in the cfg file, say google.com and I want to add a single page to the index from another domain, do I have to add a start point first or can I just add the additional page?
                You can just add the additional page.

                Update existing index - If the site I'm indexing is dynamic (with no date or meta info), what does Zoom do? Does it skip the file and not update it (matching it with the URL), OR does it delete the file from the index and add it anyway, OR does it do something else?
                If a page does not return filesize or date information, Zoom will presume that it is dynamic and has changed and it will update the file.

                If a site does not contain any date or filesize information, we would not recommend using the Incremental Update feature (since all pages will need to be removed and re-indexed, so you would be better off doing a full proper re-index).

                Does a command-line command autostart zoom? (I assume it does)
                Yes. All command-line features autostart Zoom. But you will need to specify the -s or -o or -r commands to autostart Zoom in either spider, offline, or report mode respectively. See the existing Users Guide regarding these autostart commands.

                If yes, does it auto-close zoom upon completion?
                Yes.

                If zoom is already open what happens?
                A new instance of Zoom will start and perform the operation you specified.

                If zoom is in the middle of an operation and a command is executed, what will happen?
                See above.

                I intend to host zoom remotely on a shared / dedicated server, what are hosting requirements / any tips?
                Your server must run Windows and meet the System Requirements for running the Zoom Indexer application.

                Try to schedule/run the indexing during low load periods on your server.

                Most shared hosting solutions will not allow you to run a native Windows application on the server. You may need to have a dedicated (or your own hosted) server to do this.
                --Ray


                • #9
                  Hello

                  When using incremental indexing with command-line, can I use a remote url to the cfg file?

                  Thanks
                  AG!



                  • #10
                    No. You can't use a URL. You can use a config file on a networked drive however.



                    • #11
                      I'm a bit fuzzy on where the command lines are put to make all this work.



                      • #12
                        The command line interface is really for advanced users. You can access the same functions from the Index menu in the Zoom graphical user interface.
