refusing aspx addition - skipping some files

  • refusing aspx addition - skipping some files

    As our site contains a lot of ASP files that generate URLs differing by only a few characters, the option to use CRC on the results still produces lots of duplicate pages (the same content, slightly different URL).
    So we tried to disregard the ASP files and added an ASPX file that searches the database and generates only one copy of every entry with the same content.
    But the .aspx file is not seen by Zoom. It seems every addition we make gets ignored by Zoom. I tried several extensions with only one file each, but all of them were missed. The index status does show these file extensions, but with 0 results.
    Even manually adding the ASPX pages does not make Zoom "see" them. What can be wrong?
    Please advise.

  • #2
    ( the same content, slightly different url)
    If this were the case, then CRC should filter the pages. But I would bet that the content was also subtly different, in addition to the URL being different.
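
Why a checksum-based duplicate check misses "almost identical" pages can be sketched as follows. This is only an illustration using Python's standard `zlib.crc32`; the exact checksum Zoom computes and what part of the page it covers are not stated in this thread, and the sample pages below are made up.

```python
import zlib


def page_crc(html: str) -> int:
    """Return the CRC-32 of a page body (hypothetical helper, not Zoom's code)."""
    return zlib.crc32(html.encode("utf-8"))


# Invented sample pages: page_b differs from page_a by a single character.
page_a = "<html><body><h1>Vacature</h1><p>id=3000</p></body></html>"
page_b = "<html><body><h1>Vacature</h1><p>id=3001</p></body></html>"
page_c = page_a  # byte-for-byte identical copy served under another URL

# Identical content gives the same checksum, so a duplicate check can drop it.
print(page_crc(page_a) == page_crc(page_c))

# A one-character difference gives a different checksum, so both pages are kept.
print(page_crc(page_a) == page_crc(page_b))
```

So if each ASP page embeds something that varies with the URL (an id, a date, a counter), a checksum comparison will treat every variant as a distinct page.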

    Indexing ASPX pages should be no problem in spider mode. But it won't work very well in offline mode, as ASPX scripts need a server to execute them.

    Are you using spider mode?

    There are several reasons why a file might not be indexed. Start with these FAQ questions.

    Q. Why are some of my pages being skipped by the indexer?

    Q. Why are links in my Javascript menus being skipped?

    Q. I am indexing with spider mode but it is not finding all the pages on my web site

    You should also turn on verbose logging, as there might be a log entry that indicates why the file was skipped during indexing.


    • #3
      The verbose logging creates an enormous file. It will take me some time to find something there ;(

      But one problem is obvious. I am running spider mode, but looking at the results, the added extensions are not found. I even linked a page so that it would be followed during the crawl. The result is still 0 files in the counter. So even if the file is not working, Zoom should see it, I presume?


      working with Zoom 1007
      added the aspx extension in scan options (tried several other extensions too, just to test, like .qqq)
      created a page with this extension on the website
      put a linked aspx page on the website
      results counter = 0
      Last edited by sed; Apr-04-2007, 09:39 AM.


      • #4
        Load the (verbose) log file into a text editor (or even MS Word) then do a search for the file name you think should have been found.

        You might find a warning like,
        File skipped because it was too large
        File skipped because it failed the CRC duplicate page check
        File failed to download with an HTTP 403 error
        or one of many other possible reasons.

        There is no problem indexing ASPX pages. Your problem is surely not with the file extension but elsewhere, I would think.


        • #5
          Sorry, after a complete index run the log file is about 1 GB. Word is unable to read such monster files, Notepad is not an option, and even WordPad crashes.
          Any thoughts on how to read this kind of monster file?

          Never mind... after a tip I found EditPad
          http://www.editpadpro.com/
          Last edited by sed; Apr-05-2007, 07:33 AM.


          • #6
            We use UltraEdit V12 to open huge files like this.

            Or we do a simple 'grep' on the file to pull out some of the data.
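
If grep is not at hand (for example on a Windows server), the same idea can be sketched in a few lines of Python. This is a minimal sketch, not a Zoom tool: the function name `find_lines` and the file name in the usage comment are made up for illustration. It reads the log one line at a time, so the size of the file does not matter for memory.

```python
from typing import Iterator, Tuple


def find_lines(log_path: str, needle: str) -> Iterator[Tuple[int, str]]:
    """Yield (line number, line) for every log line containing `needle`.

    The file is streamed line by line, so a log of 1 GB or more is never
    loaded into memory as a whole.
    """
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(log_path, "r", encoding="utf-8", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if needle in line:
                yield number, line.rstrip("\n")


# Example usage (the log path is an assumption, adjust to your own setup):
#
#     for number, line in find_lines("indexlog.txt", "zoekrobot2006.aspx"):
#         print(number, line)
```

Searching for the file name rather than the full URL catches the queue, download and filter entries for that page in one pass.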


            • #7
              Brr, some hours of intensive reading... I do not like this kind of log file ;(


              The file is seen by Zoom: it's queued, downloaded and... "filtered" by Zoom.

              I do not have a clue why... the only filter word is -admin
              the only skip word is rss

              I assume skipping rss only skips files and folders with that name, not a page that has the word rss somewhere on it?
              That would be the only explanation I can think of.


              • #8
                brr, some hours of intensive reading
                It shouldn't take you more than 15 seconds to search a log file with a text editor for the file name.

                Can you post the full log lines concerning the file in question?

                There are options in Zoom for filtering pages based on their URL (skip options), and additional options for filtering files based on their content (content filtering). There are also a few others, like filtering based on a CRC duplicate page check (scan options).

                I don't know which of these options you are using, but it would seem that one of them triggered the filtering of the file. Probably the content filter.


                • #9
                  Took an Easter break over here.

                  Reading this monster file takes a lot of time since there are multiple entries for the file we are talking about. So instead of 15 seconds it literally takes hours to read and search for the next entry.

                  Running a full index with debug log will take more time. A normal full index already takes about 5 hours; with the log option on it takes more than 12 hours. So do not hold your breath, I will get back with the requested info.

                  In the last run I ended up with lots of entries saying that the file was filtered, but I did not see a reason for it. Is there a special place in the log to look for such an entry?
                  (just to save me some precious time)


                  • #10
                    If a file is filtered as a result of the content filtering options you have set, then you will see a line like this in the log.

                    Filtered out file: http://www.yourdomain.com/folder/file.htm

                    If it is skipped for some other reason, then the message will be different.
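
Pulling every such line out of a large log can be sketched like this. It is a hedged example: the marker text is the one quoted above, the timestamp prefix follows the log lines posted later in this thread, and the function name `filtered_urls` and the sample lines are made up.

```python
from collections import Counter
from typing import Iterable

# Marker text as quoted in this post; everything after it is taken as the URL.
MARKER = "Filtered out file: "


def filtered_urls(lines: Iterable[str]) -> Counter:
    """Count how often each URL appears in a 'Filtered out file:' log line."""
    counts: Counter = Counter()
    for line in lines:
        _, found, url = line.partition(MARKER)
        if found:  # empty string when the marker is not on this line
            counts[url.strip()] += 1
    return counts


# Invented sample lines in the same shape as the log excerpts in this thread.
sample = [
    "04/11/07 11:57:45 - Downloading file http://www.yourdomain.com/folder/file.htm",
    "04/11/07 11:58:15 - Filtered out file: http://www.yourdomain.com/folder/file.htm",
]
print(filtered_urls(sample))
```

Passing an open file object instead of the sample list streams the log line by line, so the whole file never has to fit in an editor.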

                    reading this monster file takes a lot of time since there are multiple entries for the file we are talking about.
                    Then I can only suggest you are using the wrong tools. We regularly deal with files much larger than 1GB without any drama. Grep is your friend!


                    • #11
                      A total server crash took some time. Somewhere during the full index with debug logs the server crashed severely. It took some time to restore the thing. Not sure if there's a connection, but it took more time than we expected.

                      To give all the data I can find, I am copying the log entries and the Zoom config. Maybe you can shed some light on this. We sure do need it.


                      Queued URL: http://www.2college.nl/web/asp/2006/zoekrobot2006.aspx

                      04/11/07 11:57:45 - DL Thread #3, got URL (http://www.2college.nl/web/asp/2006/zoekrobot2006.aspx) off queue

                      04/11/07 11:57:45 - Downloading file http://www.2college.nl/web/asp/2006/zoekrobot2006.aspx

                      04/11/07 11:57:46 - Index Thread got ready buffer for http://www.2college.nl/web/asp/2006/zoekrobot2006.aspx (Content-type: HTML text)

                      04/11/07 11:57:46 - Spidering for links on http://www.2college.nl/web/asp/2006/zoekrobot2006.aspx


                      04/11/07 11:58:15 - Filtered out file: http://www.2college.nl/web/asp/2006/zoekrobot2006.aspx


                      Summary (snip) ======


                      04/11/07 17:52:39 - File extensions:
                      04/11/07 17:52:39 - .htm indexed: 1430
                      04/11/07 17:52:39 - .html indexed: 1242
                      04/11/07 17:52:39 - .txt indexed: 0
                      04/11/07 17:52:39 - .php indexed: 0
                      04/11/07 17:52:39 - .cgi indexed: 0
                      04/11/07 17:52:39 - .aspx indexed: 0
                      04/11/07 17:52:39 - .pl indexed: 0
                      04/11/07 17:52:39 - .php3 indexed: 0
                      04/11/07 17:52:39 - .pdf indexed: 4
                      04/11/07 17:52:39 - .ppt indexed: 14
                      04/11/07 17:52:39 - .pot indexed: 0
                      04/11/07 17:52:39 - .pps indexed: 0
                      04/11/07 17:52:39 - .asp indexed: 83666
                      04/11/07 17:52:39 - No extensions indexed: 38



                      ZOOM config:

                      __5_0
                      #STARTDIR:E:\inetpub\www.2college.nl
                      #SPIDERURL:http://www.2college.nl/index.htm
                      #BASEURL:http://www.2college.nl/
                      #OUTDIR:E:\inetpub\www.2college.nl\search
                      #SPIDERURLTYPE:0
                      #SPIDERURLUSELIMIT:0
                      #SPIDERURLLIMIT:0
                      #USE-CRC:1
                      #CURRENTMODE:1
                      #DLTHREADS:3
                      #NOCACHE:1
                      #BEEP-ON-FINISH:0
                      #OUTPUT:CGI
                      #OUTPUT_OS:0
                      #VERBOSE:1
                      #LOGOPTIONS:INDEXED|INIT|DOWNLOAD|UPLOAD|FILEIO|PLUGIN|INFO|ERROR|WARNING|QUEUE|SUMMARY|
                      #LOGWRITETOFILE:0
                      #LOGWRITETOFILENAME:C:\Program Files\Zoom Search Engine 5.0\indexlog.txt
                      #LOGDEBUGMODE:1
                      #SCAN_NOEXTENSION:1
                      #SCAN_FILELINKS:1
                      #SCAN_USELOCALDESCPATH:0
                      #SCAN_LOCALDESCPATH:
                      #REWRITELINKS:0
                      #REWRITEFIND:
                      #REWRITEWITH:
                      #INDEXOPTIONS:METADESC|CONTENT|TITLE|KEYWORDS|
                      #RESULTOPTIONS:NUMBER|TITLE|CONTEXT|TERMS|DATE|URL|
                      #USE-UTF8:0
                      #CODEPAGE:1252
                      #ZLANGFILE:Dutch.zlang
                      #SKIPUNDERSCORE:1
                      #MINWORDLEN:2
                      #FORMFORMAT:2
                      #HIGHLIGHTING:1
                      #GOTOHIGHLIGHT:0
                      #USEXML:0
                      #XMLTITLE:
                      #XMLDESC:
                      #XMLURL:
                      #XML_OPENSEARCH_DESCURL:
                      #LOGGING:1
                      #LOGGING_FILE:C:\Program Files\Zoom Search Engine 5.0\statistics\logs\zoek.log
                      #TIMING:1
                      #NOCHARSET:0
                      #DEFAULT_TO_AND:1
                      #CONTEXTSIZE:30
                      #EXACTPHRASE:500
                      #SEARCHASSUBSTRING:0
                      #NO_TOLOWER:0
                      #ZOOMINFO:0
                      #USEDATETIME:1
                      #WORDJOINCHARS:.-_'
                      #ZOOMIMAGE:0
                      #SPELLING:1
                      #SPELLINGWHENLESSTHAN:5
                      #WIZARD_UPLOADREQD:0
                      #REPORTLOGFILE:C:\Program Files\Zoom Search Engine 5.0\statistics\logs\zoek.log
                      #REPORTOUTDIR:C:\Program Files\Zoom Search Engine 5.0\statistics\
                      #REPORTUSEDATES:0
                      #REPORT_TOP10:3
                      #REPORT_TOPNR:3
                      #REPORT_DAY:3
                      #REPORT_DAY_TYPE:0
                      #REPORT_WEEK:2
                      #REPORT_WEEK_TYPE:0
                      #REPORT_MONTH:3
                      #REPORT_MONTH_TYPE:0
                      #REPORT_LISTALL:100
                      #WORDWEIGHT_TITLE:1
                      #WORDWEIGHT_DESC:0
                      #WORDWEIGHT_KEYWORDS:3
                      #WORDWEIGHT_FILENAME:0
                      #WORDWEIGHT_HEADINGS:0
                      #WORDWEIGHT_LINKTEXT:0
                      #WORDWEIGHT_DENSITY:2
                      #WORDWEIGHT_SHORTURLS:0
                      #USE-AUTH:0
                      #USE-COOKIES:1
                      #BINUSEDESC:0
                      #PLUGIN_DESCFILES:
                      #PLUGIN_USEMETA:PDF|DOC|PPT|RTF|SWF|WPD|XLS|DJVU|IMAGE|MP3|DWF|
                      #PLUGIN_USETECHNICAL:MP3|IMAGE|DWF|
                      #PLUGIN_PDF_METHOD:0
                      #PLUGIN_PDF_HIGHLIGHT:1
                      #PLUGIN_IMG_MINFILESIZE:5
                      #MAXPAGES_LIMIT:200000
                      #MAXWORDS_LIMIT:200000
                      #MAXFILESIZE_LIMIT:1048576
                      #DESCLENGTH_LIMIT:150
                      #OPTIMIZE_SETTING:3
                      #EXTENSIONS_START
                      .htm
                      .html
                      .txt
                      .php
                      .cgi
                      .aspx
                      .pl|THUMBSEXT:.jpg|THUMBSPATH:./
                      .php3
                      .pdf
                      .ppt
                      .pot
                      .pps
                      .asp
                      #EXTENSIONS_END
                      #ADDSTARTURLS_START
                      http://www.2college.nl/
                      2|0|0
                      http://www.2college.nl/
                      #ADDSTARTURLS_END
                      #SKIPPAGES_START
                      #SKIPPAGES_END
                      #SKIPWORDS_START
                      and
                      or
                      the
                      it
                      is
                      an
                      on
                      we
                      us
                      to
                      of
                      has
                      be
                      all
                      for
                      in
                      as
                      so
                      are
                      that
                      can
                      you
                      at
                      its
                      by
                      have
                      with
                      into
                      ed
                      of
                      in
                      met
                      #SKIPWORDS_END
                      #USECATS:0
                      #USEDEFCATNAME:0
                      #SEARCHMULTICATS:0
                      #SYNONYMS_START:
                      vacature,vakature
                      durendaal,durendael
                      cobbenhagen,cobbehagen
                      #SYNONYMS_END
                      #RECOMMENDED_START:
                      vakanties
                      http://www.2college.nl/web/informati...e/vakantie.htm
                      Vakanties overzicht

                      open dagen
                      http://www.2college.nl/web/asp/2006/...erslag&id=3000
                      Open dagen 2007
                      Overzicht van Open dagen alle locaties 2007
                      vakantie
                      http://www.2college.nl/web/informati...e/vakantie.htm
                      overzicht vakanties

                      rss
                      http://www.2college.nl/web/asp/info/rss.asp
                      Informatie omtrent RRS gebruik

                      vacature
                      http://www.2college.nl/web/asp/2006/...ag&action=zoek
                      recente vacatures

                      #RECOMMENDED_END
                      #RECOMMENDED_MAX:3
                      #USEFILTER:1
                      #FILTER_START
                      -admin
                      #FILTER_END
                      #SITEMAP_TXT:1
                      #SITEMAP_XML:1
                      #SITEMAP_UPLOAD:1
                      #SITEMAP_UPLOADPATH:e:/inetpub/www.2college.nl/
                      #SITEMAP_USEPAGEBOOST:1
                      #SITEMAP_BASEURL:http://www.2college.nl/


                      • #12
                        Thanks for posting those details. I think it might be a bug or design flaw in our software (or at the very least something we forgot to document).

                        It seems the content filter that you are using, -admin, is causing this page to be filtered. That page contains words like "administratief", and the filter matches this word and others that start with "admin".

                        But this should not happen. Zoom should be doing an exact word match and not a partial word match. So this is something we need to fix or document.

                        For a quick solution I suggest,

                        1) Removing the -admin content filter word. Instead try filtering pages based on the URL by using the skip list instead, if you can.

                        OR

                        2) Adding a space character after the 'm' in admin on the content filter page. This will stop the partial match occurring.
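
The difference between the partial match described above and an exact word match can be sketched as follows. This is not Zoom's code: the two helper functions and the sample sentences are made up purely to illustrate why "admin" hits a page containing "administratief" and why a trailing space avoids it.

```python
import re


def partial_match(filter_word: str, text: str) -> bool:
    """Substring test: 'admin' also matches inside 'administratief'."""
    return filter_word.lower() in text.lower()


def whole_word_match(filter_word: str, text: str) -> bool:
    """Exact word test: the filter word must be delimited by word boundaries."""
    pattern = r"\b" + re.escape(filter_word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


# Invented page text containing a word that merely starts with "admin".
page = "Administratief medewerker gezocht"

print(partial_match("admin", page))      # True: the page would be filtered
print(whole_word_match("admin", page))   # False: the page would be kept
print(partial_match("admin ", page))     # False: trailing-space workaround
print(whole_word_match("admin", "Login for admin users"))  # True
```

The trailing-space workaround only helps where "admin" is followed by a space in the page text; an occurrence directly followed by punctuation would slip through a substring test like this one.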


                        • #13
                          Oh, and regarding the system crash. I don't think this was directly due to Zoom. But,

                          A) Zoom can use a lot of RAM. So make sure you have enough.

                          B) If Zoom is hitting 80,000+ of your ASP pages, then maybe you have a memory leak or resource leak in your ASP scripts. ASP as a language had a lot of problems like this. So monitor your system's resource usage during indexing.


                          • #14
                            Content filtering will be changed in the next release (V5.0 build 1008) so that the words will have to match entirely rather than allowing partial matches (i.e. "-admin" will not skip pages containing "administrator" as it currently does).
                            --Ray
                            Wrensoft Web Software
                            Sydney, Australia
                            Zoom Search Engine
