OnDemand Users Group

Title: Indexing files containing DBCS (Double-Byte Character Set)
Post by: frasert on January 20, 2011, 09:05:17 PM
Has anyone successfully processed files (PDF or otherwise) containing DBCS?

We have a PDF that contains Chinese characters, but the PDF graphical indexer fails with a server error.
arspdump is also unable to process the file:

$ arspdump -f file.pdf -p 1 | head -40
file.pdf
Number of Pages = 4

WordFinder version: 3

------------- Page 1 -------------

?
ul.h = 0.49     ul.v = 0.18     lr.h = 0.61     lr.v = 0.37

?
ul.h = 0.59     ul.v = 0.18     lr.h = 0.65     lr.v = 0.37

....
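
For what it's worth, the bounding boxes above suggest the word finder locates the words but can't map the character codes to anything printable. As a sanity check outside of CMOD, something like the sketch below (using pdfminer.six, which is not a CMOD tool; the file name is just my example) can confirm whether the Chinese text in the PDF is extractable at all, i.e. whether the fonts carry a usable Unicode mapping:

# Sketch only: pdfminer.six, not a CMOD utility.
# If this also returns nothing (or junk) for page 1, the problem is likely
# in the PDF's font encoding rather than in the graphical indexer.
from pdfminer.high_level import extract_text

text = extract_text("file.pdf", page_numbers=[0])   # first page only
print(repr(text[:200]))

# Count characters in the CJK Unified Ideographs range.
cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
print("CJK characters found on page 1:", cjk)
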
Title: Re: Indexing files containing DBCS (Double-Byte Character Set)
Post by: Justin Derrick on January 21, 2011, 08:55:26 AM
Even if you get past that issue, I don't think you'll be able to store those double-byte index values in your database without switching it to UTF-16. That's something I run into constantly during migrations. Non-ASCII characters get converted to double-byte strings, so they won't fit in columns defined in a database with an 8-bit codepage.
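
To put rough numbers on that, here's a quick sketch (plain Python, nothing CMOD-specific, and the sample index value is made up) of how a four-character Chinese value grows once it's encoded:

value = "北京分行"   # hypothetical 4-character index value

print(len(value))                      # 4 characters
print(len(value.encode("utf-8")))      # 12 bytes in UTF-8
print(len(value.encode("utf-16-le")))  # 8 bytes in UTF-16

# A column sized as CHAR(4) under a single-byte codepage holds 4 bytes,
# so the same value needs the column widened (or the database created
# with a multi-byte codepage) before it will fit.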

The icing on the cake is that I have no idea how this would affect searching. (Can a Windows client or web browser pass double-byte characters all the way through CMOD down to the database for a query?)

Hopefully some of our users from Europe will be able to help with this.

-JD.