Re-indexing large sites takes too long

  • Re-indexing large sites takes too long

    Hi,
    we use the Zoom Search v6 Enterprise edition, and I have to crawl a large e-commerce site with nearly 60,000 items/pages.
    I use some data fields for article numbers, EAN, price and manufacturer.
    I stripped out as much as possible via ZOOMSTOP/ZOOMRESTART, use a lot of follow/noindex meta tags, and adjusted the robots.txt so that only the needed article data gets indexed. I also use the CGI option.
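    For reference, the ZOOMSTOP/ZOOMRESTART stripping is assumed to look roughly like this (the surrounding markup is illustrative, not my actual template):

    Code:
    <!--ZOOMSTOP-->
    <div class="tech-specs">width: 62 mm, height: 33 mm</div>
    <!--ZOOMRESTART-->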
    So far so good.
    Indexing the whole site takes around two hours on the first crawl.
    Sending a search query against this dataset takes up to 2 seconds. That seems like a lot, since your comparison of crawling/searching large websites shows lower query times.
    On the other hand, I think that's because all the technical data that gets indexed bloats the index files. How can I optimize this?

    So this is all fine when building the index for the first time. But I have one problem.
    I tried incremental indexing and it took too long to add new and changed pages (and changes happen every 15 minutes).
    The pages return a proper Last-Modified header, so it doesn't re-index the whole site, but it's still slow.
    What can I do?
    I saw the possibility of providing a text file with new/changed pages and using console mode.
    Would it be faster to provide all changed/new pages in this text file?
    I suppose I can't provide a URL that lists all new pages?
    I ask because I could set up a simple SQL query to print out all pages that are newer than the zoom_index file time.

    What is the best strategy to handle this case?
    Re-index all new/updated files every 15-30 minutes, provided via HTTP as a text file?

    Thanks, and sorry for all the rambling ^^ I just want to describe my scenario as well as possible.

  • #2
    The noindex meta tag isn't anywhere near as efficient as using either the robots.txt file or the page skip list in Zoom.

    Using the meta tag means Zoom still needs to hit your page, and the database behind it, in order to discover the meta tag. Using the skip list avoids the page hit entirely.
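    For example, a couple of robots.txt rules like these (standard syntax; the paths are purely illustrative) let the spider skip entire sections without requesting a single page:

    Code:
    User-agent: *
    Disallow: /cart/
    Disallow: /compare/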

    The second point is that 2 sec is a long time to spend in your database. There should be scope for speeding this up (if it is your own database). Things to look at include:
    - Checking that the query is efficient
    - Checking that the correct index fields are defined in the DB
    - Checking the locking, to make sure queries don't block each other
    - Checking that there is enough RAM in the machine to cache the DB
    - Looking at what you are getting from the DB. Maybe you don't need to fetch some of that data for Zoom (and Google) indexing, i.e. serve a cut-down, efficient page for the spiders.
    - Checking the networking, if the DB is on a separate machine from the web server
    - Checking the general background load on the machine
    I would have thought 0.2 seconds would be a more reasonable time for the DB query behind a web page.

    Another option would be to have a 2nd staging server and do the indexing on that machine. Or some more complex setup with static versions of the pages alongside the dynamic ones.

    In terms of incremental index updates in Zoom, feeding it a list of known page changes (as you are suggesting) will save a huge amount of time. Adding 5 new pages via a list might only take 10 sec, depending on your hardware. You save the spider all the work of visiting each page just to check its modification date and time.
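    For illustration, such a page list is assumed to be a plain text file with one absolute URL per line (these URLs are placeholders):

    Code:
    http://www.example.com/product.php?id=123
    http://www.example.com/product.php?id=456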

    This FAQ might also help,
    http://www.wrensoft.com/zoom/support...rge_sites.html



    • #3
      Thanks.
      For the incremental update, I wrote a batch file that fetches a URL list of updated pages and writes it to disk, then starts the Zoom indexer with -addpages urllist.txt.
      It works fast, yes - super (only the uploading takes time).

      One question:
      If I start the batch from the program dir like this:

      Code:
      ZoomIndexer64.exe -s -c "D:\zoom.zcfg" -addpages updated.txt
      the indexer breaks with
      Code:
      08|07/14/11 15:36:35|Error: Could not open text file containing list of pages to delete: C:\ProgramData\Wrensoft\Zoom Search Engine Indexer\updated.txt
      So by default the Zoom indexer looks in the ProgramData path?
      Can I change this?

      For the slow queries:
      Unfortunately you misunderstood me.
      It's not the generation of my pages that takes 2 seconds - they are generated fast enough.
      It's the search query against the Zoom index files that takes around 2 seconds in CGI mode.

      Here is some info about the finished crawl:
      Code:
      12|07/14/11 12:40:29|INDEX SUMMARY
      12|07/14/11 12:40:29|Files indexed: 59675
      12|07/14/11 12:40:29|Files skipped: 79783
      12|07/14/11 12:40:29|Files filtered: 0
      12|07/14/11 12:40:29|Files downloaded: 70144
      12|07/14/11 12:40:29|Unique words found: 285869
      12|07/14/11 12:40:29|Variant words found: 204525
      12|07/14/11 12:40:29|Total words found: 4683457
      12|07/14/11 12:40:29|Avg. unique words per page: 4.79
      12|07/14/11 12:40:29|Avg. words per page: 78
      12|07/14/11 12:40:29|Start index time: 10:25:16 (2011/07/14)
      12|07/14/11 12:40:29|Elapsed index time: 02:15:13
      12|07/14/11 12:40:29|Peak physical memory used: 221 MB
      12|07/14/11 12:40:29|Peak virtual memory used: 508 MB
      12|07/14/11 12:40:29|Errors: 0
      12|07/14/11 12:40:29|URLs visited by spider: 74961
      12|07/14/11 12:40:29|URLs in spider queue: 0
      12|07/14/11 12:40:29|Total bytes scanned/downloaded: 923655701
      12|07/14/11 12:40:29|File extensions: 
      12|07/14/11 12:40:29|    .php indexed: 59675
      We are not on a shared hosting machine!
      I fear the crawler finds too much garbage that no one searches for, like product-specific data,
      for example: width: 62 mm, height: 33 mm.
      I exclude as much text as possible and try to keep just the product description, article numbers, EAN numbers, price, headers and meta tags.

      Anyway - thanks for clearing up the fact that <meta robots follow,noindex> slows down the process.
      I suspected as much when watching the log scroll by.

      What is processed first:
      robots.txt or the internal skip list?

      Thanks, m.



      • #4
        Originally posted by localhorst
        One question:
        If I start the batch from the program dir like this:

        Code:
        ZoomIndexer64.exe -s -c "D:\zoom.zcfg" -addpages updated.txt
        the indexer breaks with
        Code:
        08|07/14/11 15:36:35|Error: Could not open text file containing list of pages to delete: C:\ProgramData\Wrensoft\Zoom Search Engine Indexer\updated.txt
        So by default the Zoom indexer looks in the ProgramData path?
        Can I change this?
        The default path is the working directory.

        Specify a full path for the page list file in your batch script, i.e.:

        Code:
        ZoomIndexer64.exe -s -c "D:\zoom.zcfg" -addpages C:\somewhereelse\updated.txt
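        As a minimal sketch of the whole update cycle (assuming curl is available on the machine, and that http://www.example.com/changed-pages.php is a hypothetical endpoint, e.g. backed by the SQL query you mentioned, returning one changed URL per line):

        Code:
        @echo off
        rem Fetch the list of changed/new pages (hypothetical endpoint, one URL per line)
        curl -s -o C:\lists\updated.txt http://www.example.com/changed-pages.php
        rem Run the incremental update, using full paths as described above
        ZoomIndexer64.exe -s -c "D:\zoom.zcfg" -addpages C:\lists\updated.txt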
        Originally posted by localhorst
        For the slow queries:
        Unfortunately you misunderstood me.
        It's not the generation of my pages that takes 2 seconds - they are generated fast enough.
        It's the search query against the Zoom index files that takes around 2 seconds in CGI mode.
        That is unusual, given the number of files and words indexed.

        Things to check:

        - What sort of queries are you submitting? Wildcards and exact phrases can be slower than keyword searches.

        - Are you looking at the search time measured by Zoom itself? If not, you should enable this for comparison at least, under "Configure"->"Search Page"->"Show time taken to perform search".

        - Is the CGI called from another script? For example, if you have a PHP page which acts as a wrapper around the CGI output, there may be other things adding to the search time.

        - Is the search function online? We can take a look at it if you give us the URL (via PM or email if needed). We may notice something this way more quickly than by guessing.

        Originally posted by localhorst
        What is processed first:
        robots.txt or the internal skip list?
        They happen at practically the same time; there's little difference in terms of speed. Strictly speaking, the internal skip list is checked first. Note also that there is extra time needed to download the robots.txt file and parse it into memory, especially if it is a long list.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine



        • #5
          The batch file for the incremental index works perfectly now.

          For the other problem:
          The time is measured by Zoom itself.
          There is a PHP file which acts as a wrapper (via curl), but that's not the problem, because searching via the plain CGI page takes the same time.

          Right now I am updating the index; after that I will send you a PM with some relevant files (zcfg, last index log, HTML structure etc.) and a link to the live site (the Zoom search is hidden right now).
          Thanks in advance!
