Zoom cgi handling of extended chars
    We have a number of PDF files in our library which were inadvertently created some years ago with a faulty header template, so that the first line of text extracted by the pdftotext utility you use is a garbage sequence of characters, typically something like this:

    æ õíí øð ï ìé îð ðë íð Š ©© ©ò®³ ò ®¹

    i.e. a mixture of characters above hex AA, intermixed with spaces.

    When these files are returned as search matches, part or all of the sequence is sometimes displayed.

    It is easy enough to edit this garbage out of the raw text produced by pdftotext, but the results that come back from search.cgi have been modified: some (but not all) of the characters are expressed as numeric escapes of the form &#nnn; (i.e. as Latin-1 character references) instead of appearing directly.

    It's not clear to me which of the extended characters will be escaped and which will not, which makes it difficult to write a regular expression in our post-processor to remove them reliably. For instance, the example above comes back as follows (I've added extra spaces after the ampersands and before the hashes to make the Latin-1 escape sequences visible):

    æ õíí øð ï ìé îð ðë íð Š & #169;& #169; & #169;ò& #174;³ ò & #174;¹

    Could you give me a pointer as to when your scripts translate characters into Latin-1 escapes and when they do not?
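    For reference, this is a minimal sketch of the kind of post-processing we are attempting (Python, hypothetical helper name; it assumes the escapes are standard &#nnn; numeric character references and that everything above 0x7F in the first line is garbage):

    ```python
    import html
    import re

    def strip_high_garbage(line: str) -> str:
        """Remove high-bit garbage from a line of search output."""
        # First decode numeric character references such as &#169;
        # back to literal characters, so that escaped and literal
        # high-bit characters can be handled uniformly.
        decoded = html.unescape(line)
        # Then drop everything in the Latin-1 high range (0x80-0xFF),
        # which is where the garbage characters fall.
        cleaned = re.sub(r"[\u0080-\u00ff]", "", decoded)
        # Finally collapse the runs of spaces left behind.
        return re.sub(r" {2,}", " ", cleaned).strip()
    ```

    This only works reliably if we can be sure every &#nnn; sequence in the result is a genuine escape rather than literal text, which is part of what prompts the question above.
    
    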

    thanks