OnDemand Users Group

Title: Indexing files containing DBCS (Double-Byte Character Set)
Post by: frasert on January 20, 2011, 09:05:17 PM
Has anyone successfully processed files (PDF or otherwise) containing DBCS?

We have a PDF that contains Chinese characters, but the PDF graphical indexer fails with a server error.
arspdump is also unable to process the file:

$ arspdump -f file.pdf -p 1 | head -40
file.pdf
Number of Pages = 4

WordFinder version: 3

------------- Page 1 -------------

?
ul.h = 0.49     ul.v = 0.18     lr.h = 0.61     lr.v = 0.37

?
ul.h = 0.59     ul.v = 0.18     lr.h = 0.65     lr.v = 0.37

....
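
For what it's worth, the bounding boxes above suggest the word finder locates the words but can't map the character codes to anything printable. As a sanity check outside of CMOD, something like the sketch below (using pdfminer.six, which is not a CMOD tool; the file name is just my example) can confirm whether the Chinese text in the PDF is extractable at all, i.e. whether the fonts carry a usable Unicode mapping:

# Sketch only: pdfminer.six, not a CMOD utility.
# If this also returns nothing (or junk) for page 1, the problem is likely
# in the PDF's font encoding rather than in the graphical indexer.
from pdfminer.high_level import extract_text

text = extract_text("file.pdf", page_numbers=[0])   # first page only
print(repr(text[:200]))

# Count characters in the CJK Unified Ideographs range.
cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
print("CJK characters found on page 1:", cjk)
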
Title: Re: Indexing files containing DBCS (Double-Byte Character Set)
Post by: Justin Derrick on January 21, 2011, 08:55:26 AM
Even if you get past that issue, I don't think you'll be able to store those double-byte index values in your database without switching it to UTF-16. That's something I run into constantly during migrations. Non-ASCII characters get converted to double-byte strings, so they won't fit in columns defined in a database with an 8-bit codepage.
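
To put rough numbers on that, here's a quick sketch (plain Python, nothing CMOD-specific, and the sample index value is made up) of how a four-character Chinese value grows once it's encoded:

value = "北京分行"   # hypothetical 4-character index value

print(len(value))                      # 4 characters
print(len(value.encode("utf-8")))      # 12 bytes in UTF-8
print(len(value.encode("utf-16-le")))  # 8 bytes in UTF-16

# A column sized as CHAR(4) under a single-byte codepage holds 4 bytes,
# so the same value needs the column widened (or the database created
# with a multi-byte codepage) before it will fit.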

The icing on the cake is that I have no idea how this would affect searching. (Can a Windows client or web browser pass double-byte characters all the way through CMOD down to the database for a query?)

Hopefully some of our users from Europe will be able to help with this.

-JD.