A simple example that only extracts the text portion of a Word .doc file, with no formatting or other processing.
It's perfect! Thank you very much! This is very difficult code; I hadn't seen any .doc file parser before.
Not very useful code; it doesn't even handle the piece table :(
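For context on the piece table objection: in Word 97 and later files the text can be split into pieces, and each piece has an 8-byte descriptor saying where its characters live in the WordDocument stream. Below is a minimal sketch of decoding one such descriptor, based on my reading of the MS-DOC specification. `PieceLocation`, `pcd_read_le32` and `decode_pcd` are hypothetical names for illustration, and locating the piece table itself (through the table stream) is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: where one piece of text lives in the stream. */
typedef struct {
    uint32_t offset;     /* byte offset of the piece in the WordDocument stream */
    int      compressed; /* 1 = 8-bit characters, 0 = 16-bit Unicode characters */
} PieceLocation;

/* Read a little-endian 32-bit value without alignment assumptions. */
static uint32_t pcd_read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode one 8-byte piece descriptor. Bytes 2..5 hold the fc field:
   bits 0..29 are the offset, bit 30 is the "compressed" flag
   (layout assumed from the MS-DOC specification). */
static PieceLocation decode_pcd(const unsigned char *pcd)
{
    PieceLocation loc;
    uint32_t raw;

    assert(pcd != NULL);
    raw = pcd_read_le32(pcd + 2);
    loc.compressed = (raw & 0x40000000u) != 0;
    loc.offset = raw & 0x3FFFFFFFu;
    if (loc.compressed)
        loc.offset /= 2; /* compressed pieces store twice the real offset */
    return loc;
}
```

A full extractor would walk all descriptors in order and read each piece from its decoded offset, rather than copying one contiguous range.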
Links:
Office Binary (doc, xls, ppt) Translator to Open XML (http://b2xtranslator.sourceforge.net/)
EDIT: Test program using some OLE functions:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <ole2.h>
#include <stdio.h>
#include <string.h>
#pragma comment(lib, "ole32.lib")

int __cdecl main(int argc, char **argv)
{
    LPSTORAGE lpStorage = NULL;
    WCHAR wszPath[MAX_PATH];   // wide copy of the input path for StgOpenStorage
    char szOut[MAX_PATH + 4];  // input path plus ".txt"
    BYTE buf[512];             // header and copy buffer

    if (argc < 2 || strlen(argv[1]) >= MAX_PATH) {
        printf("usage: %s file.doc\n", argv[0]);
        return 1;
    }
    // argv[] of a narrow main() is in the ANSI code page, so convert with CP_ACP
    if (!MultiByteToWideChar(CP_ACP, 0, argv[1], -1, wszPath, MAX_PATH))
        return 1;
    HRESULT hr = StgOpenStorage(wszPath, NULL, STGM_READ | STGM_SHARE_EXCLUSIVE, NULL, 0, &lpStorage);
    if (SUCCEEDED(hr))
    {
        LPSTREAM lpStream = NULL;
        hr = lpStorage->lpVtbl->OpenStream(lpStorage, L"WordDocument", NULL, STGM_READ | STGM_SHARE_EXCLUSIVE, 0, &lpStream);
        if (SUCCEEDED(hr) && lpStream) {
            DWORD nRead = 0;
            LARGE_INTEGER li = {0};
            puts("WordDocument");
            lpStream->lpVtbl->Seek(lpStream, li, STREAM_SEEK_SET, NULL);
            lpStream->lpVtbl->Read(lpStream, buf, 32, &nRead);
            // signature at offset 0: 0xA5EC = Word.8, 0xA5DC = Word.6
            if (nRead == 32 && (*(WORD*)buf == 0xA5EC || *(WORD*)buf == 0xA5DC)) {
                DWORD nTxOfs1 = *(DWORD*)(buf + 0x18); // start of text
                DWORD nTxOfs2 = *(DWORD*)(buf + 0x1C); // end of text
                printf("text starts: %lXh\n", nTxOfs1);
                printf("text ends: %lXh\n", nTxOfs2);
                strcpy(szOut, argv[1]); // length was checked above
                strcat(szOut, ".txt");
                HANDLE hFileTxt = CreateFileA(szOut, GENERIC_WRITE, 0, NULL,
                    CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
                if (hFileTxt != INVALID_HANDLE_VALUE && nTxOfs2 > nTxOfs1) {
                    DWORD nWrite;
                    DWORD nSize = nTxOfs2 - nTxOfs1; // saving area
                    li.u.LowPart = nTxOfs1; // start of text (incremental saving)
                    lpStream->lpVtbl->Seek(lpStream, li, STREAM_SEEK_SET, NULL);
                    while (nSize) {
                        DWORD nChunk = nSize > sizeof(buf) ? (DWORD)sizeof(buf) : nSize;
                        nRead = 0;
                        hr = lpStream->lpVtbl->Read(lpStream, buf, nChunk, &nRead);
                        if (FAILED(hr) || nRead == 0)
                            break; // stop on error or end of stream, avoids an endless loop
                        WriteFile(hFileTxt, buf, nRead, &nWrite, NULL);
                        nSize -= nRead;
                    }
                }
                if (hFileTxt != INVALID_HANDLE_VALUE)
                    CloseHandle(hFileTxt);
            }
            lpStream->lpVtbl->Release(lpStream);
        }
        lpStorage->lpVtbl->Release(lpStorage);
    }
    return 0;
}
TimoVJL
I often need to find a particular word in a lot of documents. Your code, I think, will be very useful.
Quote from: TimoVJL on March 26, 2019, 06:28:05 PM
EDIT Test program using some OLE functions
Works like a charm, Timo :)
I have MS Word installed; would it work without?
Quote from: jj2007 on March 28, 2019, 12:54:19 PM
I have MS Word installed; would it work without?
It doesn't depend on MS Word, only on the OLE2 API.
Notepad2 shows some special chars for formatting.
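Those special characters are most likely Word's own control codes sitting in the text range. A minimal sketch of cleaning them up, assuming 8-bit text and the commonly documented codes (0x0D paragraph mark, 0x0B line break, 0x07 table cell mark, 0x13 to 0x15 field marks, 0x0C page break); `strip_word_marks` is a hypothetical helper, not part of the program above:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: clean 8-bit Word text in place, return the new length. */
static size_t strip_word_marks(unsigned char *text, size_t len)
{
    size_t in, out = 0;

    assert(text != NULL || len == 0);
    for (in = 0; in < len; in++) {
        unsigned char c = text[in];
        if (c == 0x0D || c == 0x0B)       /* paragraph mark, manual line break */
            text[out++] = '\n';
        else if (c == 0x07)               /* table cell / row mark */
            text[out++] = '\t';
        else if (c == '\t' || c >= 0x20)  /* ordinary text is kept as-is */
            text[out++] = c;
        /* other control codes (field marks, page breaks, ...) are dropped */
    }
    return out;
}
```

It could be called on each buffer just before WriteFile, writing only the returned length. Field codes themselves (the text between a field begin and separator mark) would still show up, and 16-bit Unicode text would need a separate pass.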
Quote from: jj2007 on March 28, 2019, 12:54:19 PM
I have MS Word installed; would it work without?
On my home computer I don't have MS Word.
But it works well. :)
Good to know. So you are basically looking for some magic numbers, right? Are they documented somewhere?
if (*(WORD*)szTmp == 0xA5EC || *(WORD*)szTmp == 0xA5DC)
I saw something here at MIT (https://stuff.mit.edu/afs/sipb/user/zacheiss/wv/notes/kb40.html), but it doesn't look very official. This looks better (https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/2edea690-135f-4c73-a9ae-b296ab70ce51) 8)
https://docs.microsoft.com/en-us/openspecs/windows_protocols/MS-CFB/53989ce4-7b05-4f8d-829b-d08d6148375b
https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22
http://www.opennet.ru/docs/formats/wword8.html
https://www.decalage.info/file_formats_security/office
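For anyone who wants to try the signature check outside Windows, here is a small portable sketch of reading the same three header fields the test program uses (signature at offset 0, text start at 0x18, text end at 0x1C) without pointer casts. `FibHeader`, `fib_le16`, `fib_le32` and `parse_fib_header` are hypothetical names, and the buffer is assumed to hold the first 32 bytes of the WordDocument stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the header fields used in this thread. */
typedef struct {
    uint16_t ident;      /* 0xA5EC (Word.8) or 0xA5DC (Word.6) */
    uint32_t text_start; /* value at offset 0x18 */
    uint32_t text_end;   /* value at offset 0x1C */
} FibHeader;

static uint16_t fib_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t fib_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 on success, 0 if the buffer is too short, the signature is
   unknown, or the text range is inverted. */
static int parse_fib_header(const unsigned char *buf, size_t len, FibHeader *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 32)
        return 0;
    out->ident = fib_le16(buf);
    if (out->ident != 0xA5EC && out->ident != 0xA5DC)
        return 0;
    out->text_start = fib_le32(buf + 0x18);
    out->text_end = fib_le32(buf + 0x1C);
    return out->text_end >= out->text_start;
}
```

Reading byte by byte keeps the check independent of host endianness and alignment, which the `*(WORD*)` casts in the original rely on.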
Hi Timo,
Nice work. Thanks for the new tool.