
Suspected invalid html on page with version 6.0.1010


  • Suspected invalid html on page with version 6.0.1010

    I have upgraded to the latest version, 6.0.1010, and for the first time I am experiencing the following problem:

    In about 7% of the pages I get the following Warning message:

    "Suspected invalid HTML on page .... (content may not be correctly indexed)"

    The pages that receive this error message are not scanned.

    I have been using Zoom Search Engine since version 4.
    The same html files have been scanned with version 4, version 5 and version 6 (prior to this release).
    I have never had any problem like this and all the files were scanned successfully.


    Is this a bug or is something wrong with my configuration?
    I am starting to think about going back to version 5. Version 6 is much more powerful, but with v5 everything was working fine for me.

    Thank you

  • #2
    This is a new error message to indicate that not all text on the page could be parsed due to HTML syntax errors.

    It was a trade-off. Previous releases were slightly more tolerant of bad HTML, but as a result they ended up indexing some HTML source code some of the time (the syntax errors on bad pages meant it wasn't possible to determine what was HTML markup and what was text).

    We had a few bug reports of HTML code being indexed. So to avoid indexing HTML source code, we tightened the HTML syntax checking in a few areas, especially around link tags.
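
    To illustrate the ambiguity described above, here is a minimal Python sketch. It is not Zoom's parser: the `naive_strip_tags` helper and the sample link are hypothetical, chosen only to show how a simplistic tag stripper can leak markup into the extracted text when a ">" appears inside an attribute value.

```python
import re


def naive_strip_tags(markup: str) -> str:
    """Remove anything that looks like <...>, ending each tag at the first '>'."""
    return re.sub(r"<[^>]*>", "", markup)


# Hypothetical link with an unescaped ">" inside a quoted attribute value.
unescaped = '<a title="next > page" href="x.html">Next</a>'
# The same link with the character written as the entity &gt;.
escaped = '<a title="next &gt; page" href="x.html">Next</a>'

leaked = naive_strip_tags(unescaped)  # attribute fragments survive as "text"
clean = naive_strip_tags(escaped)     # only the link text survives

print(repr(leaked))
print(repr(clean))
```

    With the unescaped link, part of the tag (including the href attribute) ends up in the extracted text, which is the kind of "HTML source code being indexed" problem mentioned above. A stricter parser avoids this but has to reject or warn about more pages.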

    Can you post some of the URLs that are throwing the invalid HTML errors so I can check the HTML on the page?



    • #3
      Thank you for answering my question.

      I have sent an email with an example of a file giving errors.
      Unfortunately it is not easy to access my website because it is running only in localhost.
      From my check of the file, there are some errors in some a tags, which have duplicate attributes.
      Of the 49,000 files indexed on my website, I get an error in about 2,900.
      The pattern is not always the same, so it is not possible to fix the small errors with regular expressions. Fixing them manually would take years.
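
      Where regular expressions are too brittle, an HTML parser can at least locate the affected tags. The following is a rough sketch using Python's standard `html.parser` module. The class and function names are made up for illustration, and it only reports duplicate attributes rather than fixing them.

```python
from collections import Counter
from html.parser import HTMLParser


class DuplicateAttrFinder(HTMLParser):
    """Collect start tags that repeat the same attribute name."""

    def __init__(self):
        super().__init__()
        self.problems = []  # list of (line number, tag name, duplicated attribute names)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; repeated names are kept as-is.
        counts = Counter(name for name, _ in attrs)
        dupes = sorted(name for name, n in counts.items() if n > 1)
        if dupes:
            line, _col = self.getpos()
            self.problems.append((line, tag, dupes))


def find_duplicate_attrs(source: str):
    """Return every tag in `source` that has a duplicated attribute."""
    finder = DuplicateAttrFinder()
    finder.feed(source)
    finder.close()
    return finder.problems


sample = '<p>ok</p>\n<a href="a.html" class="x" href="b.html">link</a>'
for line, tag, dupes in find_duplicate_attrs(sample):
    print(f"line {line}: <{tag}> repeats {', '.join(dupes)}")
```

      Running something like this over each file in the site would give a list of tags to review, which may be more manageable than checking all of the flagged pages by hand.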

      Are duplicate attributes creating problems for the engine?

      Is there any change I could make in the configuration to make the engine more tolerant of HTML errors?

      Thank you



      • #4
        I put the page you sent us through the official HTML validator. It gave a list of 61 Errors and 55 warnings for this single page.

        But for the most part Zoom deals with all of these errors and you can ignore them.

        The only significant issue seems to be the page meta description. You have this code:

        Code:
        <meta name="description" content="Unit IV - Drugs Affecting
        the Cardiovascular System > Chapter 17 - Antiarrhythmics" />
        You should avoid using the ">" character inside a tag when you don't want to close the tag. You should use the character entity &gt; instead.

        Here is the relevant section of the HTML specifications.

        "Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values."

        If we just switched back to the old behaviour, we would have other users complaining that we don't parse valid HTML code correctly. So we thought it is better that we error out on the invalid code rather than the valid code.
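
        If the pages are generated by a script, the escaping can be done when the tag is written. Below is a minimal sketch, assuming Python and its standard `html.escape` function. The `meta_description` helper is hypothetical and not part of Zoom.

```python
import html


def meta_description(text: str) -> str:
    """Build a meta description tag with special characters escaped.

    Assumes `text` is plain text; entities already present in it
    (such as &amp;) would be escaped a second time.
    """
    safe = html.escape(text, quote=True)  # escapes &, <, >, " and '
    return f'<meta name="description" content="{safe}" />'


tag = meta_description(
    "Unit IV - Drugs Affecting the Cardiovascular System > Chapter 17 - Antiarrhythmics"
)
print(tag)
```

        The only ">" left in the output is the one that closes the tag, so a strict parser has no ambiguity about where the tag ends.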



        • #5
          If anyone else has this problem, let us know. We might be able to write extra code to deal with a larger variety of bad HTML code, at the cost of indexing speed and programming effort. Alternatively, we could allow the user to select a parsing method (i.e. tell the parser how it should deal with bad code if you have systematic faults across an entire site).



          • #6
            Thanks to your help I have managed to correct all html errors that were causing the "suspected invalid html" message with version 6.0.1010.

            I believe however that allowing the user to select a parsing method would be ideal.

            Thank you



            • #7
              I have just upgraded to 6.0.1011 and get this with PDF files. How can a PDF file have suspect html?

              I am also now getting lots of warnings that there was no text in a PDF file that the search indexer checked. I know that, so why is this warning necessary?

              Bob
              Robert Isaac
              Volvo Owners Club



              • #8
                There was a change in build 1011 to improve the tolerance of invalid HTML and the parsing of inline JavaScript (such as onClick, onHover, etc.). This addresses issues with build V6.0.1010 where users found that many pages with HTML errors would be skipped.

                But you shouldn't be seeing this for PDF files. What is the URL for an example PDF that reports this message?



                • #9
                  Although I have corrected all the suspected errors that appeared in version 6.0.1010, with version 6.0.1011 I get new warning messages.
                  Are these pages skipped or is this only an error report without skipping pages?

                  Thank you



                  • #10
                    Originally posted by wrensoft View Post
                    There was a change in build 1011 to improve the tolerance of invalid HTML and the parsing of inline JavaScript (such as onClick, onHover, etc.). This addresses issues with build V6.0.1010 where users found that many pages with HTML errors would be skipped.

                    But you shouldn't be seeing this for PDF files. What is the URL for an example PDF that reports this message?
                    They are all in a password-protected section of our web site, and you do not have the facility to add attachments here.

                    Bob
                    Robert Isaac
                    Volvo Owners Club



                    • #11
                      Originally posted by cardiogr View Post
                      Although I have corrected all the suspected errors that appeared in version 6.0.1010, with version 6.0.1011 I get new warning messages.
                      Are these pages skipped or is this only an error report without skipping pages?
                      "Warning" messages are just that, warnings. They do not imply that a page was skipped. Some of the warning messages would indicate if the problem on the page might mean that the page was not indexed correctly.

                      Can you find the HTML problems in these new URLs that are being reported? If not, send us some example URLs which have these problems and we can take a look. We are not aware of any incorrect warnings with build 6.0.1011 at the moment. It should be much better than 6.0.1010, in that it will:
                      (1) Index more content, and much more cleanly even when there are HTML mistakes on the page.
                      (2) Report more accurate warning messages to tell you what the problems are.

                      Originally posted by firstrebel View Post
                      I am also now getting lots of warnings that there was no text in a PDF file that the search indexer checked. I know that, so why is this warning necessary?
                      This warning was actually added in build 6.0.1010. We often get people who index PDF files without realizing that the files only contain scanned images of printed pages. This warning brings this to the user's attention.

                      It is generally unusual to want to index PDF files without any content. If that was your intention, and you only need to index their filenames, you would be better off adding ".pdf" as a "Binary (filename only)" file type as it would index much faster.

                      Originally posted by firstrebel View Post
                      They are all in a password-protected section of our web site, and you do not have the facility to add attachments here.
                      Bob, you can e-mail us your PDF files.
                      --Ray
                      Wrensoft Web Software
                      Sydney, Australia
                      Zoom Search Engine



                      • #12
                        I have emailed the file and warning text.

                        I have PDFs that have content as images to preserve originality and some that have searchable text, so unless there is a way for Zoom to differentiate, there is not much I can do.

                        Bob
                        Robert Isaac
                        Volvo Owners Club



                        • #13
                          Judging by the ZCFG file you sent us previously, you may have ".pdf" added to your extensions list as "HTML text" file type.

                          Check this under "Configure"->"Scan options". On the Extensions list, look for ".pdf". In the second column, if it says "HTML text", then this is what's happening. Zoom is configured to handle ".pdf" files as HTML.

                          To correct this, remove the extension and add a new entry for ".pdf", making sure to leave it on the default for "Acrobat document".
                          --Ray
                          Wrensoft Web Software
                          Sydney, Australia
                          Zoom Search Engine



                          • #14
                            Originally posted by Ray View Post
                            "Warning" messages are just that, warnings. They do not imply that a page was skipped. Some of the warning messages would indicate if the problem on the page might mean that the page was not indexed correctly.

                            Can you find the HTML problems in these new URLs that are being reported? If not, send us some example URLs which have these problems and we can take a look. We are not aware of any incorrect warnings with build 6.0.1011 at the moment. It should be much better than 6.0.1010, in that it will:
                            (1) Index more content, and much more cleanly even when there are HTML mistakes on the page.
                            (2) Report more accurate warning messages to tell you what the problems are.
                            Thank you for your answer.

                            In version 6.0.1010 the pages with errors were skipped. This is why I am asking whether pages are also skipped in this version.
                            I can identify the HTML errors (it now gives more information in the warning description). However, it is very painful to fix 2000 pages (even with regular expressions), so for now I had to downgrade to 6.0.1010.



                            • #15
                              Originally posted by Ray View Post
                              Judging by the ZCFG file you sent us previously, you may have ".pdf" added to your extensions list as "HTML text" file type.

                              Check this under "Configure"->"Scan options". On the Extensions list, look for ".pdf". In the second column, if it says "HTML text", then this is what's happening. Zoom is configured to handle ".pdf" files as HTML.

                              To correct this, remove the extension and add a new entry for ".pdf", making sure to leave it on the default for "Acrobat document".
                              You are right, Ray. Has this changed since 5.1? I am sure they were set up correctly in that version.

                              Bob
                              Robert Isaac
                              Volvo Owners Club

