Has anyone successfully processed files (PDF or otherwise) containing DBCS?
We have a PDF that contains Chinese characters, but the PDF graphical indexer fails with a server error.
arspdump is also unable to process the file:
$ arspdump -f file.pdf -p 1 | head -40
file.pdf
Number of Pages = 4
WordFinder version: 3
------------- Page 1 -------------
?
ul.h = 0.49 ul.v = 0.18 lr.h = 0.61 lr.v = 0.37
?
ul.h = 0.59 ul.v = 0.18 lr.h = 0.65 lr.v = 0.37
....
Even if you get past that issue, I don't think you'll be able to store those double-byte index values in your database without switching it to UTF-16. I run into this constantly during migrations: non-ASCII characters get converted to double-byte strings, so the values no longer fit in columns defined in databases with 8-bit codepages.
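To make the column-size problem concrete, here is a rough Python sketch (not anything CMOD-specific). The index value, the 8-byte column width, and the helper name `byte_length` are all made up for illustration; the point is only that a value that fits when counted in characters can overflow once it is counted in bytes.

```python
# Hypothetical example: the value, the column width and the helper
# name are assumptions for illustration, not taken from CMOD.

def byte_length(value: str, encoding: str) -> int:
    """Return how many bytes the value needs in the given encoding."""
    return len(value.encode(encoding))

value = "发票12345"   # 2 Chinese characters + 5 ASCII digits = 7 characters
column_bytes = 8      # assumed width of a column sized for single-byte data

for encoding in ("utf-8", "utf-16-le", "gbk"):
    size = byte_length(value, encoding)
    verdict = "fits" if size <= column_bytes else "too long"
    print(f"{encoding}: {size} bytes ({verdict})")

# An 8-bit codepage such as Latin-1 cannot represent the value at all.
try:
    value.encode("latin-1")
except UnicodeEncodeError:
    print("latin-1: cannot encode the Chinese characters")
```

The value is only 7 characters long, yet it needs more than 8 bytes in every encoding that can actually represent it, which is the overflow I keep hitting.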
On top of that, I have no idea how this would affect searching. (Can a Windows client or web browser pass double-byte characters all the way through CMOD down to the database for a query?)
Hopefully some of our users from Europe will be able to help with this.
-JD.