Indexing words split over line break


  • Indexing words split over line break

    The engine does not currently index words that are split over a line break properly.
    Consider the following examples (where // stands for a line break):
    consti-//tution; family-//owned
    The engine will index "consti", "tution", "family" and "owned", but neither "constitution" nor "family-owned".
    I suggest that when a hyphen is followed by a line break, two index entries be added: the whole word or expression with and without the hyphen, i.e. here:
    constitution, consti-tution
    familyowned, family-owned
    I think this would solve the problem in all cases.
    Would that be possible in a future release?
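
To make the suggestion concrete, here is a minimal sketch of the idea in Python. It is not how Zoom works internally: the names `SPLIT_WORD` and `index_terms` are hypothetical, and the sketch assumes the text has already been extracted with its line breaks preserved. It also assumes the separate fragments ("consti", "tution") do not need to be indexed once the joined forms are.

```python
import re

# Hypothetical pattern (an assumption, not Zoom's tokenizer): a word fragment,
# a hyphen, a line break, then the continuation fragment on the next line.
SPLIT_WORD = re.compile(r"(\w+)-[ \t]*\r?\n[ \t]*(\w+)")


def index_terms(text):
    """Return the set of lowercase index terms for the given text."""
    terms = set()

    def expand(match):
        left, right = match.group(1), match.group(2)
        # Add both candidate forms: joined without and with the hyphen.
        terms.add((left + right).lower())
        terms.add((left + "-" + right).lower())
        # Remove the split word from the text so its fragments are not indexed.
        return " "

    remaining = SPLIT_WORD.sub(expand, text)
    # Index everything else as ordinary words (in-line hyphens are kept).
    terms.update(word.lower() for word in re.findall(r"\w+(?:-\w+)*", remaining))
    return terms


if __name__ == "__main__":
    sample = "The consti-\ntution of a family-\nowned firm"
    print(sorted(index_terms(sample)))
```

With both forms in the index, a search for "constitution" or for "family-owned" would match, at the cost of a few entries such as "familyowned" that nobody is likely to search for.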

  • #2
    Are you referring to PDF files?

    Sometimes these words are actually stored correctly (unbroken) in the text layer within the PDF file. You might be able to index this by changing the scan method within Zoom, under "Configure"->"Scan options"->Select the pdf extension and click "Configure". Then change the "Scan method" to "text layer" or "raw formatting order" and see if they make a difference for you.

    It all depends on how the PDF file was created. Some have the actual text content stored, and then wrap the words for layout/presentation purposes. Other times, it's the same as what you see. The different scan methods allow you to try what works best for your particular set of files.

    In the case of non-PDF files, it's a bit more complicated. Line breaks are not necessarily defined strictly within HTML, which is a content markup, not a presentation markup. While some people do use <br> tags, it's not really the designed purpose. It gets more complicated with different standards, and also ways to define a non-breaking wrap, etc.
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      I should have been more specific: I was talking about PDF files, in the case where the hyphen is added manually (and is thus part of the content) so as to force a line break at that exact location.
      The "text layer" mode seems to solve the problem, as it apparently ignores the sequence hyphen + line break. That is fine in most cases, except when the hyphen is part of the expression, as in "family-//owned", which will be indexed as "familyowned". That is why I thought words split across two lines could be indexed both with and without a hyphen. If this is an issue at all, I agree it is a low-priority one.
      Last edited by Dacey; 08-07-2015, 08:38 AM.