Pelles C forum

C language => Work in progress => Topic started by: TimoVJL on March 24, 2019, 07:39:18 PM

Title: Word .doc file, extract text
Post by: TimoVJL on March 24, 2019, 07:39:18 PM
A simple example that extracts just the text portion of a Word .doc file, with no formatting or other processing.
Title: Re: Word .doc file, extract text
Post by: bitcoin on March 26, 2019, 05:31:47 PM
It's perfect! Thank you very much! This is very difficult code; I haven't seen a .doc file parser before.
Title: Re: Word .doc file, extract text
Post by: TimoVJL on March 26, 2019, 06:28:05 PM
Not very useful code; it doesn't even handle the piece table :(

Links:
Office Binary (doc, xls, ppt) Translator to Open XML (http://b2xtranslator.sourceforge.net/)

EDIT: Test program using some OLE functions:
Code:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <ole2.h>
#include <stdio.h>

#pragma comment(lib, "ole32.lib")

int __cdecl main(int argc, char **argv)
{
    LPSTORAGE lpStorage;
    WCHAR wszName[MAX_PATH];
    char szOut[MAX_PATH + 4];
    BYTE szTmp[512];

    if (argc < 2) {
        printf("Usage: %s file.doc\n", argv[0]);
        return 1;
    }
    // StgOpenStorage needs a wide-character file name
    if (!MultiByteToWideChar(CP_OEMCP, 0, argv[1], -1, wszName, MAX_PATH))
        return 1;
    SCODE sc = StgOpenStorage(wszName, NULL, STGM_READ | STGM_SHARE_EXCLUSIVE, NULL, 0, &lpStorage);
    if (sc == NOERROR)
    {
        LPSTREAM lpStream = NULL;
        sc = lpStorage->lpVtbl->OpenStream(lpStorage, L"WordDocument", NULL, STGM_READ | STGM_SHARE_EXCLUSIVE, 0, &lpStream);
        if (sc == NOERROR && lpStream) {
            STATSTG statsg = {0};
            DWORD nRead = 0;
            puts("WordDocument");
            lpStream->lpVtbl->Stat(lpStream, &statsg, STATFLAG_NONAME);
            LARGE_INTEGER li = {0};
            lpStream->lpVtbl->Seek(lpStream, li, STREAM_SEEK_SET, NULL);
            lpStream->lpVtbl->Read(lpStream, szTmp, 32, &nRead);
            // the first WORD of the header is the magic number
            if (nRead == 32 && (*(WORD*)szTmp == 0xA5EC || *(WORD*)szTmp == 0xA5DC)) { // Word.8 Word.6
                DWORD nTxOfs1 = *(DWORD*)(szTmp+0x18);
                DWORD nTxOfs2 = *(DWORD*)(szTmp+0x1C);
                printf("text starts: %lXh\n", nTxOfs1);
                printf("text ends:   %lXh\n", nTxOfs2);
                // output name goes into its own buffer, truncated if too long
                snprintf(szOut, sizeof(szOut), "%s.txt", argv[1]);
                HANDLE hFileTxt = CreateFileA(szOut, GENERIC_WRITE, 0, NULL,
                    CREATE_ALWAYS, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
                if (hFileTxt != INVALID_HANDLE_VALUE) {
                    DWORD nSize = 0;
                    DWORD nWrite;
                    // copy only if the offsets lie inside the stream
                    if (nTxOfs2 >= nTxOfs1 && nTxOfs2 <= statsg.cbSize.QuadPart)
                        nSize = nTxOfs2 - nTxOfs1; // saving area
                    li.u.LowPart = nTxOfs1; // start of text (incremental saving)
                    lpStream->lpVtbl->Seek(lpStream, li, STREAM_SEEK_SET, NULL);
                    while (nSize) {
                        nRead = nSize > sizeof(szTmp) ? (DWORD)sizeof(szTmp) : nSize;
                        HRESULT hr = lpStream->lpVtbl->Read(lpStream, szTmp, nRead, &nRead);
                        if (FAILED(hr) || nRead == 0)
                            break; // stop on error or unexpected end of stream
                        WriteFile(hFileTxt, szTmp, nRead, &nWrite, NULL);
                        nSize -= nRead;
                    }
                    CloseHandle(hFileTxt);
                }
            }
            lpStream->lpVtbl->Release(lpStream);
        }
        lpStorage->lpVtbl->Release(lpStorage);
    }

    return 0;
}
Title: Re: Word .doc file, extract text
Post by: bitcoin on March 26, 2019, 06:31:36 PM
TimoVJL
I often need to find a word across a lot of documents. I think your code will be very useful.
Title: Re: Word .doc file, extract text
Post by: jj2007 on March 28, 2019, 12:54:19 PM
Quote from TimoVJL: "EDIT Test program using some OLE functions"

Works like a charm, Timo :)

I have MS Word installed; would it work without?
Title: Re: Word .doc file, extract text
Post by: TimoVJL on March 28, 2019, 01:19:50 PM
Quote from jj2007: "I have MS Word installed; would it work without?"
It doesn't depend on MS Word, just on the OLE2 API.

Notepad2 shows some special characters used for formatting in the extracted text.
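
A rough sketch of how those special characters could be filtered out, assuming the extracted text is 8-bit (not the 16-bit text that newer files may contain). The helper name clean_doc_text is hypothetical, and the mapping is an assumption based on my reading of the format notes: 0x0D is the paragraph mark, 0x0B a line break, 0x0C a page break, 0x07 a table cell mark, and 0x13 to 0x15 are field marks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: clean a buffer of 8-bit .doc text in place.
   Returns the new length. Assumed mapping (not verified against every
   Word version):
     0x0D paragraph mark, 0x0B line break, 0x0C page break -> '\n'
     0x07 table cell mark                                  -> '\t'
     other control characters below 0x20 (for example the
     field marks 0x13, 0x14, 0x15)                         -> dropped */
size_t clean_doc_text(unsigned char *buf, size_t len)
{
    size_t out = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c == 0x0D || c == 0x0B || c == 0x0C)
            buf[out++] = '\n';
        else if (c == 0x07 || c == '\t')
            buf[out++] = '\t';
        else if (c >= 0x20)
            buf[out++] = c;      /* ordinary character, keep it */
        /* anything else is a control character and is skipped */
    }
    return out;
}
```

In the test program this could be called on szTmp after each Read, writing only the returned number of bytes to the .txt file.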
Title: Re: Word .doc file, extract text
Post by: bitcoin on March 28, 2019, 01:27:41 PM
Quote from jj2007: "I have MS Word installed; would it work without?"
On my home computer I don't have MS Word, but it works well. :)
Title: Re: Word .doc file, extract text
Post by: jj2007 on March 28, 2019, 01:38:47 PM
Good to know. So you are basically looking for some magic numbers, right? Are they documented somewhere?

Code:
if (*(WORD*)szTmp == 0xA5EC || *(WORD*)szTmp == 0xA5DC)
I saw something here at MIT (https://stuff.mit.edu/afs/sipb/user/zacheiss/wv/notes/kb40.html), but it doesn't look very official. This looks better (https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/2edea690-135f-4c73-a9ae-b296ab70ce51) 8)
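
For reference, a portable sketch of the same check, assuming the layout used in Timo's program: a 16-bit magic number at offset 0 (0xA5EC for Word.8, 0xA5DC for Word.6) and the 32-bit text start and end offsets at 0x18 and 0x1C, all little-endian. The function names are hypothetical; reading byte by byte avoids the pointer casts and works on any host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read little-endian values one byte at a time. */
uint16_t rd16le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

uint32_t rd32le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: hdr holds the first bytes of the WordDocument
   stream. Returns 1 and fills *start / *end when the magic number
   matches and the offsets are ordered, otherwise returns 0. */
int fib_text_range(const unsigned char *hdr, size_t len,
                   uint32_t *start, uint32_t *end)
{
    uint16_t ident;

    assert(hdr != NULL && start != NULL && end != NULL);
    if (len < 32)
        return 0;                       /* header too short */
    ident = rd16le(hdr);
    if (ident != 0xA5EC && ident != 0xA5DC)
        return 0;                       /* not a Word.8 / Word.6 header */
    *start = rd32le(hdr + 0x18);        /* start of text */
    *end   = rd32le(hdr + 0x1C);        /* end of text */
    return *end >= *start;
}
```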
Title: Re: Word .doc file, extract text
Post by: TimoVJL on March 28, 2019, 01:45:44 PM
https://docs.microsoft.com/en-us/openspecs/windows_protocols/MS-CFB/53989ce4-7b05-4f8d-829b-d08d6148375b
https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22
http://www.opennet.ru/docs/formats/wword8.html
https://www.decalage.info/file_formats_security/office
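
The container itself also has a magic number. A minimal sketch, assuming only the 8-byte signature that starts a compound file header (D0 CF 11 E0 A1 B1 1A E1); the function name is hypothetical, and it says nothing about whether the file actually contains a WordDocument stream.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the buffer starts with the
   compound file signature, 0 otherwise. */
int is_compound_file(const unsigned char *hdr, size_t len)
{
    static const unsigned char sig[8] = {
        0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1
    };

    assert(hdr != NULL || len == 0);
    return len >= sizeof sig && memcmp(hdr, sig, sizeof sig) == 0;
}
```

This could serve as a quick pre-check on the first bytes of a file before handing it to StgOpenStorage, for example when scanning many files.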
Title: Re: Word .doc file, extract text
Post by: Vortex on March 28, 2019, 06:50:01 PM
Hi Timo,

Nice work. Thanks for the new tool.