Zoom cgi handling of extended chars
    We have a number of PDF files in our library which were inadvertently created some years ago with a faulty header template, so that the first line of text extracted by the pdftotext utility you use is a garbage sequence of characters, typically something like this:

    æ õíí øð ï ìé îð ðë íð Š ©© ©ò®³ ò ®¹

    i.e. a mixture of characters above hex AA, intermixed with spaces.

    When these files are returned as search matches, part or all of the sequence is sometimes displayed.

    It is easy enough to edit this garbage out of the raw text produced by pdftotext, but the results that come back from search.cgi have been modified: some (but not all) of the characters are expressed as numeric escapes of the form &#nnn; (i.e. as Latin-1 character references) instead of appearing directly.

    It's not clear to me which of the extended characters will be escaped and which will not, which makes it difficult to write a regular expression in our post-processor to remove them reliably. For instance, the example above comes back as follows (I've added extra spaces after the ampersands and before the hashes to make the Latin-1 escape sequences visible):

    æ õíí øð ï ìé îð ðë íð Š & #169;& #169; & #169;ò& #174;³ ò & #174;¹

    Could you give me a pointer as to when your scripts translate characters into Latin-1 escapes and when they do not?
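    For reference, this is a minimal sketch of the kind of post-processing we are attempting (Python, hypothetical helper name; it assumes the escapes are standard &#nnn; numeric character references and that everything above 0x7F in the first line is garbage):

    ```python
    import html
    import re

    def strip_high_garbage(line: str) -> str:
        """Remove high-bit garbage from a line of search output."""
        # First decode numeric character references such as &#169;
        # back to literal characters, so that escaped and literal
        # high-bit characters can be handled uniformly.
        decoded = html.unescape(line)
        # Then drop everything in the Latin-1 high range (0x80-0xFF),
        # which is where the garbage characters fall.
        cleaned = re.sub(r"[\u0080-\u00ff]", "", decoded)
        # Finally collapse the runs of spaces left behind.
        return re.sub(r" {2,}", " ", cleaned).strip()
    ```

    This only works reliably if we can be sure every &#nnn; sequence in the result is a genuine escape rather than literal text, which is part of what prompts the question above.
    
    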

    thanks