Word .doc file, extract text

Started by TimoVJL, March 24, 2019, 07:39:18 PM

Previous topic - Next topic

TimoVJL

A simple example just for extracting text portion from Word .doc file, no formatting or other processing.
May the source be with you

bitcoin

It's perfect! Thank you very much! This is very hard code - i don't see any doc files parser before.

TimoVJL

#2
Not so usefull code, don't even handle piece table :(

Links:
Office Binary (doc, xls, ppt) Translator to Open XML

EDIT Test program using some OLE functions:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <ole2.h>

#pragma comment(lib, "ole32.lib")

int __cdecl main(int argc, char **argv)
{
LPSTORAGE lpStorage;
BYTE szTmp[512];

MultiByteToWideChar(CP_OEMCP, 0, argv[1], -1, (WCHAR*)szTmp, 260);
SCODE sc = StgOpenStorage((WCHAR*)szTmp, NULL, STGM_READ | STGM_SHARE_EXCLUSIVE, 0, 0, &lpStorage);
if (sc == NOERROR)
{
LPSTREAM lpStream;
sc = lpStorage->lpVtbl->OpenStream(lpStorage, L"WordDocument", NULL, STGM_READ | STGM_SHARE_EXCLUSIVE, 0, (void*)&lpStream);
if (lpStream) {
STATSTG statsg;
DWORD nRead;
puts("WordDocument");
lpStream->lpVtbl->Stat(lpStream, &statsg, STATFLAG_NONAME);
LARGE_INTEGER li = {0};
lpStream->lpVtbl->Seek(lpStream, li, STREAM_SEEK_SET, NULL);
lpStream->lpVtbl->Read(lpStream, &szTmp, 32, &nRead);
if (*(WORD*)szTmp == 0xA5EC || *(WORD*)szTmp == 0xA5DC) { // Word.8 Word.6
DWORD nTxOfs1 = *(DWORD*)(szTmp+0x18);
DWORD nTxOfs2 = *(DWORD*)(szTmp+0x1C);
printf("text starts: %Xh\n", *(DWORD*)(szTmp+0x18));
printf("text ends:   %Xh\n", *(DWORD*)(szTmp+0x1C));
strcpy(szTmp, argv[1]);
strcat(szTmp, ".txt");
HANDLE hFileTxt = CreateFile(szTmp, GENERIC_WRITE, 0, NULL,
CREATE_ALWAYS ,FILE_FLAG_SEQUENTIAL_SCAN, NULL);
DWORD nSize = statsg.cbSize.u.LowPart;
DWORD nWrite;
li.u.LowPart = nTxOfs1; // start of text (incremental saving)
lpStream->lpVtbl->Seek(lpStream, li, STREAM_SEEK_SET, NULL);
nSize = nTxOfs2 - nTxOfs1; // saving area
while (nSize) {
nRead = nSize > 512 ? 512 : nSize;
lpStream->lpVtbl->Read(lpStream, &szTmp, nRead, &nRead);
WriteFile(hFileTxt, szTmp, nRead, &nWrite, NULL);
nSize -= nRead;
}
CloseHandle(hFileTxt);

}
lpStream->lpVtbl->Release(lpStream);
}
lpStorage->lpVtbl->Release(lpStorage);
}

return 0;
}
May the source be with you

bitcoin

TimoVJL
I am often need to find some word in a lot of documents. Your code, I think, will be very useful.

jj2007

Quote from: TimoVJL on March 26, 2019, 06:28:05 PMEDIT Test program using some OLE functions

Works like a charm, Timo :)

I have MS Word installed; would it work without?

TimoVJL

#5
Quote from: jj2007 on March 28, 2019, 12:54:19 PM
I have MS Word installed; would it work without?
It doesn't depend on MS Word, just API OLE2.

Notepad2 shows some special chars for formatting.
May the source be with you

bitcoin

Quote from: jj2007 on March 28, 2019, 12:54:19 PM
I have MS Word installed; would it work without?
In my home computer I don't have MS Word.
But it works good. :)

jj2007

#7
Good to know. So you are basically looking for some magic numbers, right? Are they documented somewhere?

if (*(WORD*)szTmp == 0xA5EC || *(WORD*)szTmp == 0xA5DC)

I saw something here at MIT, but it doesn't look very official. This looks better 8)


Vortex

Hi Timo,

Nice work. Thanks for the new tool.
Code it... That's all...