Hi
If I have multiple documents in the input file then I would have to specify the GROUP_OFFSET and GROUP_LENGTH.
Suppose I have an input file with multiple Word documents. How can I find the values of GROUP_OFFSET and GROUP_LENGTH for each one?
Thanks
Pankaj.
Hello Pankaj,
If you have one index file and several separate Word documents, your index file might look like this:
CODEPAGE:923
COMMENT:DOCUMENT 1
GROUP_FIELD_NAME:field1
GROUP_FIELD_VALUE:value1
...
GROUP_FIELD_NAME:field4
GROUP_FIELD_VALUE:value4
GROUP_OFFSET:0
GROUP_LENGTH:0
GROUP_FILENAME:word1.doc
COMMENT:DOCUMENT 2
GROUP_FIELD_NAME:field1
GROUP_FIELD_VALUE:value1
...
GROUP_FIELD_NAME:field4
GROUP_FIELD_VALUE:value4
GROUP_OFFSET:0
GROUP_LENGTH:0
GROUP_FILENAME:word2.doc
COMMENT:DOCUMENT 3
GROUP_FIELD_NAME:field1
GROUP_FIELD_VALUE:value1
...
GROUP_FIELD_NAME:field4
GROUP_FIELD_VALUE:value4
GROUP_OFFSET:0
GROUP_LENGTH:0
GROUP_FILENAME:word3.doc
COMMENT:DOCUMENT 4
GROUP_FIELD_NAME:field1
GROUP_FIELD_VALUE:value1
...
GROUP_FIELD_NAME:field4
GROUP_FIELD_VALUE:value4
GROUP_OFFSET:0
GROUP_LENGTH:0
GROUP_FILENAME:word4.doc
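An index file like the one above is repetitive, so it can be generated rather than typed. Here is a minimal sketch in Python, assuming one block per document with GROUP_OFFSET and GROUP_LENGTH both left at 0 as in the example. The helper name `build_index` and the field data are hypothetical and only for illustration.

```python
def build_index(documents, codepage=923):
    """Return index-file text for separate files, one block per document.

    `documents` is a list of (filename, fields) pairs, where `fields` is a
    list of (name, value) pairs. Hypothetical helper, not part of any product.
    """
    lines = [f"CODEPAGE:{codepage}"]
    for number, (filename, fields) in enumerate(documents, start=1):
        lines.append(f"COMMENT:DOCUMENT {number}")
        for name, value in fields:
            lines.append(f"GROUP_FIELD_NAME:{name}")
            lines.append(f"GROUP_FIELD_VALUE:{value}")
        # Each document is its own file, so offset and length stay 0,
        # mirroring the example above.
        lines.append("GROUP_OFFSET:0")
        lines.append("GROUP_LENGTH:0")
        lines.append(f"GROUP_FILENAME:{filename}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = [
        ("word1.doc", [("field1", "value1"), ("field4", "value4")]),
        ("word2.doc", [("field1", "value1"), ("field4", "value4")]),
    ]
    print(build_index(docs), end="")
```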
However, if all the Word files are concatenated into one file, you must know the offset and length of each document inside the concatenated file:
CODEPAGE:923
COMMENT:DOCUMENT 1
GROUP_FIELD_NAME:field1
GROUP_FIELD_VALUE:value1
...
GROUP_FIELD_NAME:field4
GROUP_FIELD_VALUE:value4
GROUP_OFFSET:0
GROUP_LENGTH:1000
GROUP_FILENAME:wordsingle.concat
COMMENT:DOCUMENT 2
GROUP_FIELD_NAME:field1
GROUP_FIELD_VALUE:value1
...
GROUP_FIELD_NAME:field4
GROUP_FIELD_VALUE:value4
GROUP_OFFSET:1001
GROUP_LENGTH:1203
GROUP_FILENAME:wordsingle.concat
COMMENT:DOCUMENT 3
GROUP_FIELD_NAME:field1
GROUP_FIELD_VALUE:value1
...
GROUP_FIELD_NAME:field4
GROUP_FIELD_VALUE:value4
GROUP_OFFSET:2205
GROUP_LENGTH:800
GROUP_FILENAME:wordsingle.concat
COMMENT:DOCUMENT 4
GROUP_FIELD_NAME:field1
GROUP_FIELD_VALUE:value1
...
GROUP_FIELD_NAME:field4
GROUP_FIELD_VALUE:value4
GROUP_OFFSET:3006
GROUP_LENGTH:997
GROUP_FILENAME:wordsingle.concat
If you don't have the offsets and lengths, you must ask whoever provided the file. Otherwise you need to know exactly how a Word file is structured and locate each document's boundaries with a tool.
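If you build the concatenated file yourself, the values can be recorded while concatenating, because each length is simply the size of the original file. The following Python sketch is only an illustration. The function names and the `gap` parameter are hypothetical. With `gap=0` it assumes zero-based offsets and documents stored back to back, which is an assumption to verify against the indexer documentation. With `gap=1` it reproduces the numbering shown in the example above.

```python
def compute_groups(sizes, gap=0):
    """Return (offset, length) pairs for documents stored one after another.

    `sizes` are the byte lengths of the original files, in concatenation
    order. `gap` is the number of bytes skipped between documents:
    0 assumes zero-based, contiguous offsets (an assumption);
    1 reproduces the numbering used in the example above.
    """
    groups = []
    offset = 0
    for size in sizes:
        groups.append((offset, size))
        offset += size + gap
    return groups


def concatenate(paths, target):
    """Concatenate `paths` into `target` and return each file's size in bytes."""
    sizes = []
    with open(target, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                data = src.read()
            out.write(data)
            sizes.append(len(data))
    return sizes


if __name__ == "__main__":
    # Lengths taken from the example above.
    for offset, length in compute_groups([1000, 1203, 800, 997], gap=1):
        print(f"GROUP_OFFSET:{offset}")
        print(f"GROUP_LENGTH:{length}")
```

Calling `concatenate` first and passing its result to `compute_groups` keeps the index values tied to the bytes actually written, rather than to sizes measured separately.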
Cheers,
Alessandro
Minor Correction Alessandro...
In your second sample, the first GROUP_LENGTH is 1000, so the next GROUP_OFFSET needs to be incremented by 1, giving 1001. You've made this mistake throughout your example.
-JD.
Quote from: Justin Derrick on April 15, 2011, 01:23:53 PM
Minor Correction Alessandro...
In your second sample, the first GROUP_LENGTH is 1000, so the next GROUP_OFFSET needs to be incremented by 1, giving 1001. You've made this mistake throughout your example.
-JD.
Hello Justin,
Thanks, I've corrected the example!
Cheers,
Alessandro