Using Header1, Header2, etc. from Word .doc format

  • Using Header1, Header2, etc. from Word .doc format

    I currently have a search engine set up with PHP that I made with Zoom. I have a rather large index of .doc files that are named by date (the document titles are all the date as well). Instead of displaying, for instance, 122006.doc as the link title on the results page, how would I modify the results to display the Microsoft Word Header1 or Header2?

    Is this possible?

    Thanks!

  • #2
    You can set the document title from within Microsoft Word (from the File / Properties menu). You then need to configure Zoom to pick up this meta-data from Word documents, and it will be used in the Zoom search results.

    Another option might be using .DESC files.

    • #3
      I guess the plugin doesn't support pulling Header information from .doc files then? Can I customize what meta-data Zoom is pulling from them?

      Is there a quick way to create .desc files for about 2000 Word documents?

      • #4
        I think you mean 'heading' and not 'header'?

        Heading text is extracted from a Word file, but only the text. All formatting is lost, including the heading level styles you defined in MS Word.

        It would not be too hard to write a script that creates .DESC files in bulk, but the real problem is how to get sensible, meaningful data into each file.
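
As a rough illustration of the bulk-creation script mentioned above, here is a minimal Python sketch. It assumes you already have a mapping from each document filename to the title you want (which, as noted, is the hard part). The function name `write_desc_files`, the `<filename>.desc` naming, and the single `<title>` tag written into each file are all assumptions made for illustration; check the Zoom documentation for the exact .DESC naming and contents it expects.

```python
from html import escape
from pathlib import Path


def write_desc_files(titles, doc_dir):
    """Write one description file per entry in `titles`.

    `titles` maps a document filename (e.g. "122006.doc") to the title
    that should be shown for it. Returns the list of files written.
    """
    doc_dir = Path(doc_dir)
    written = []
    for filename, title in titles.items():
        doc_path = doc_dir / filename
        if not doc_path.is_file():
            continue  # skip entries that have no matching document
        # ASSUMPTION: the description file sits next to the document and
        # is named by appending ".desc" to the full filename.
        desc_path = doc_path.with_name(doc_path.name + ".desc")
        # ASSUMPTION: a <title> tag is how the file supplies the title.
        desc_path.write_text(f"<title>{escape(title)}</title>\n", encoding="utf-8")
        written.append(desc_path)
    return written
```

Calling `write_desc_files({"122006.doc": "December 2006 Newsletter"}, "docs")` would, under those assumptions, create `docs/122006.doc.desc`. The titles themselves still have to come from somewhere, for example a spreadsheet exported to CSV.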

        • #5
          Ah yes, I meant heading. I'm not worried about the styles, I basically want to display heading1 text instead of the title as the search result link. It sounds like this is doable?

          • #6
            Heading levels are styles. They are the same thing. They are just a style with the name heading. And all style information is lost during the text extraction. So no, Zoom can't search for text in the document which has the heading style and use that as the title.

            • #7
              I think I'm going to convert everything to HTML and then use <a name> markers... Can I make Zoom display the section of a page that the keyword was found in using something like this?

              By the way, thank you for your quick and informed responses.

              • #8
                Did you check the File / Properties menu in MS Word? I would have thought ensuring a valid title in the title field would be the best solution. Word often sets this value automatically to something half sensible.

                I don't see how conversion to HTML solves the problem of not having titles set? Doesn't this just transfer the problem to having HTML files without a valid title?

                Zoom will automatically display the block of text around the keyword in a search result (except with the JavaScript option).

                • #9
                  I've got titles for all of the documents. I am just trying to accomplish something that seems logical to me but may not be possible with Zoom. Allow me to explain further: I need to be sent to the portion of the page containing the keyword whenever I click the result link on the search page. For example, I search for "warehouse" and get a result for a document titled "January 2005 Newsletter" (the filename is 0105.htm). When I click the link, I want to be sent to the section of the page that contains "warehouse" (let's say that section is called "Safety Hazards"). If I convert the .doc files to .html, I could set up a named anchor for that section:

                  <h1><a name="Section1">Safety Hazards</a></h1>

                  And I want the result link to send me to <a href="0105.htm#Section1">.
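
For the anchor markup described above, a small Python sketch of how the converted HTML could be given numbered named anchors. The function name `add_section_anchors` and the `SectionN` naming are hypothetical, and the regular expression only handles plain `<h1>` tags without attributes, so treat it as a starting point rather than a robust HTML rewriter. It only creates the anchors; it does not make any search tool link to them.

```python
import re

# Matches a plain <h1>...</h1> pair (no attributes on the opening tag).
H1_PATTERN = re.compile(r"<h1>(.*?)</h1>", re.IGNORECASE | re.DOTALL)


def add_section_anchors(html):
    """Wrap the content of each plain <h1> in <a name="SectionN">, N from 1."""
    counter = 0

    def wrap(match):
        nonlocal counter
        counter += 1
        return f'<h1><a name="Section{counter}">{match.group(1)}</a></h1>'

    return H1_PATTERN.sub(wrap, html)
```

After running this over 0105.htm, a hand-written link such as `0105.htm#Section1` would jump to the first heading.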

                  • #10
                    I see. This is a different issue from having valid looking titles in the search results.

                    In the index Zoom creates, there is an association between the keywords and the documents they were found in. But we do not store any information that indicates which page of a Word file a word was found on. And in the case of an HTML file, we do not store which named anchor each occurrence of a keyword was found near.

                    So it is not possible to do exactly what you want.

                    However, there is another possible solution. For HTML documents we have a feature called highlight and jump, which will scroll an HTML page to the keyword being searched for. There is also a similar feature for PDF files. It might be a reasonable compromise.

                    • #11
                      That's even better than what I was talking about! I can't believe I didn't see that checkbox.

                      Thanks again for all of your help!
