Topic: Word .doc file, extract text

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 1836
Word .doc file, extract text
« on: March 24, 2019, 07:39:18 pm »
A simple example that just extracts the text portion from a Word .doc file, with no formatting or other processing.
« Last Edit: March 24, 2019, 07:42:27 pm by TimoVJL »
May the source be with you

Offline bitcoin

  • Member
  • *
  • Posts: 49
Re: Word .doc file, extract text
« Reply #1 on: March 26, 2019, 05:31:47 pm »
It's perfect! Thank you very much! This is very difficult code; I haven't seen a .doc file parser before.

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 1836
Re: Word .doc file, extract text
« Reply #2 on: March 26, 2019, 06:28:05 pm »
Not such useful code; it doesn't even handle the piece table :(
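For readers wondering what handling the piece table involves: a minimal sketch, assuming the Word 97 and later layout, in which each piece descriptor stores a 32-bit value whose low 30 bits are a file offset and whose bit 30 flags 8-bit ("compressed") text. The names `piece_pos`, `decode_piece_fc` and `piece_byte_len` are made up for this illustration and are not part of the attached example. Locating and walking the piece table itself is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    uint32_t offset;  /* byte offset of the piece in the WordDocument stream */
    int is_8bit;      /* 1: one byte per character, 0: 16-bit characters */
} piece_pos;

/* Decode the 32-bit position field of a piece descriptor. */
static piece_pos decode_piece_fc(uint32_t raw)
{
    piece_pos p;
    uint32_t fc = raw & 0x3FFFFFFFu;       /* low 30 bits: stored offset */
    p.is_8bit = (raw & 0x40000000u) != 0;  /* bit 30: compressed (8-bit) text */
    p.offset = p.is_8bit ? fc / 2 : fc;    /* 8-bit pieces store the offset doubled */
    return p;
}

/* Number of bytes a piece occupies, given its character position range. */
static uint32_t piece_byte_len(uint32_t cp_start, uint32_t cp_end, int is_8bit)
{
    assert(cp_end >= cp_start);
    return (cp_end - cp_start) * (is_8bit ? 1u : 2u);
}
```

A full extractor would read each piece at `offset` for `piece_byte_len(...)` bytes and convert 16-bit pieces before writing them out.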

Links:
Office Binary (doc, xls, ppt) Translator to Open XML

EDIT: Test program using some OLE functions:
Code:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <ole2.h>
#include <stdio.h>
#include <string.h>

#pragma comment(lib, "ole32.lib")

int __cdecl main(int argc, char **argv)
{
    LPSTORAGE lpStorage = NULL;
    WCHAR wszName[MAX_PATH];
    char szOut[MAX_PATH + 8];
    BYTE buf[512];

    // need a file name that still fits after ".txt" is appended
    if (argc < 2 || strlen(argv[1]) + sizeof(".txt") > sizeof(szOut)) {
        puts("usage: <program> file.doc");
        return 1;
    }
    // argv comes from the ANSI command line, so convert with CP_ACP
    if (!MultiByteToWideChar(CP_ACP, 0, argv[1], -1, wszName, MAX_PATH))
        return 1;
    HRESULT hr = StgOpenStorage(wszName, NULL, STGM_READ | STGM_SHARE_EXCLUSIVE, NULL, 0, &lpStorage);
    if (SUCCEEDED(hr))
    {
        LPSTREAM lpStream = NULL;
        hr = lpStorage->lpVtbl->OpenStream(lpStorage, L"WordDocument", NULL, STGM_READ | STGM_SHARE_EXCLUSIVE, 0, &lpStream);
        if (SUCCEEDED(hr) && lpStream) {
            STATSTG statstg;
            DWORD nRead = 0;
            puts("WordDocument");
            hr = lpStream->lpVtbl->Stat(lpStream, &statstg, STATFLAG_NONAME);
            LARGE_INTEGER li = {0};
            lpStream->lpVtbl->Seek(lpStream, li, STREAM_SEEK_SET, NULL);
            lpStream->lpVtbl->Read(lpStream, buf, 32, &nRead); // start of the header
            if (SUCCEEDED(hr) && nRead == 32 &&
                (*(WORD*)buf == 0xA5EC || *(WORD*)buf == 0xA5DC)) { // Word.8 Word.6
                DWORD nTxOfs1 = *(DWORD*)(buf+0x18);
                DWORD nTxOfs2 = *(DWORD*)(buf+0x1C);
                printf("text starts: %lXh\n", nTxOfs1);
                printf("text ends:   %lXh\n", nTxOfs2);
                strcpy(szOut, argv[1]);
                strcat(szOut, ".txt");
                HANDLE hFileTxt = CreateFileA(szOut, GENERIC_WRITE, 0, NULL,
                    CREATE_ALWAYS, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
                if (hFileTxt != INVALID_HANDLE_VALUE) {
                    // copy only when the range is sane and lies inside the stream
                    if (nTxOfs1 <= nTxOfs2 && nTxOfs2 <= statstg.cbSize.QuadPart) {
                        DWORD nWrite;
                        DWORD nSize = nTxOfs2 - nTxOfs1; // saving area
                        li.QuadPart = nTxOfs1; // start of text (incremental saving)
                        lpStream->lpVtbl->Seek(lpStream, li, STREAM_SEEK_SET, NULL);
                        while (nSize) {
                            DWORD nChunk = nSize > (DWORD)sizeof(buf) ? (DWORD)sizeof(buf) : nSize;
                            nRead = 0;
                            hr = lpStream->lpVtbl->Read(lpStream, buf, nChunk, &nRead);
                            if (FAILED(hr) || nRead == 0)
                                break; // avoid looping forever on a short read
                            WriteFile(hFileTxt, buf, nRead, &nWrite, NULL);
                            nSize -= nRead;
                        }
                    }
                    CloseHandle(hFileTxt);
                }
            }
            lpStream->lpVtbl->Release(lpStream);
        }
        lpStorage->lpVtbl->Release(lpStorage);
    }

    return 0;
}
« Last Edit: March 26, 2019, 08:03:35 pm by TimoVJL »
May the source be with you

Offline bitcoin

  • Member
  • *
  • Posts: 49
Re: Word .doc file, extract text
« Reply #3 on: March 26, 2019, 06:31:36 pm »
TimoVJL,
I often need to find a word in a lot of documents. I think your code will be very useful.

Offline jj2007

  • Member
  • *
  • Posts: 506
Re: Word .doc file, extract text
« Reply #4 on: March 28, 2019, 12:54:19 pm »
Quote: EDIT Test program using some OLE functions

Works like a charm, Timo :)

I have MS Word installed; would it work without?

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 1836
Re: Word .doc file, extract text
« Reply #5 on: March 28, 2019, 01:19:50 pm »
Quote: I have MS Word installed; would it work without?
It doesn't depend on MS Word, just the OLE2 API.

Notepad2 shows some special characters for formatting.
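Those special characters could be filtered after extraction. A minimal sketch, assuming 8-bit text and the usual Word control codes (0x0D paragraph mark, 0x0B line break, 0x0C page break, 0x07 table cell mark, 0x13/0x14/0x15 field marks, 0x1E and 0x1F hyphens). The function name `clean_word_text` is made up for this illustration; field instruction text between the field marks is left in place, and 16-bit text would need the same mapping applied per character.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src to dst (dst must hold at least len bytes), turning Word's
   structural marks into plain whitespace and dropping other control
   characters. Returns the number of bytes written. */
static size_t clean_word_text(const unsigned char *src, size_t len, unsigned char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        switch (c) {
        case 0x0D:             /* paragraph mark */
        case 0x0B:             /* hard line break */
        case 0x0C:             /* page or section break */
            dst[n++] = '\n';
            break;
        case 0x07:             /* table cell or row mark */
            dst[n++] = '\t';
            break;
        case 0x1E:             /* non-breaking hyphen */
            dst[n++] = '-';
            break;
        default:
            /* keep tabs and printable bytes, drop remaining control codes
               (field marks 0x13-0x15, optional hyphen 0x1F, and so on) */
            if (c == '\t' || c >= 0x20)
                dst[n++] = c;
            break;
        }
    }
    return n;
}
```

Running the extracted bytes through such a filter before `WriteFile` would give a .txt that opens cleanly in a plain editor.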
« Last Edit: March 28, 2019, 01:23:36 pm by TimoVJL »
May the source be with you

Offline bitcoin

  • Member
  • *
  • Posts: 49
Re: Word .doc file, extract text
« Reply #6 on: March 28, 2019, 01:27:41 pm »
Quote: I have MS Word installed; would it work without?
On my home computer I don't have MS Word, but it works well. :)

Offline jj2007

  • Member
  • *
  • Posts: 506
Re: Word .doc file, extract text
« Reply #7 on: March 28, 2019, 01:38:47 pm »
Good to know. So you are basically looking for some magic numbers, right? Are they documented somewhere?

Code:
if (*(WORD*)szTmp == 0xA5EC || *(WORD*)szTmp == 0xA5DC)
I saw something here at MIT, but it doesn't look very official. This looks better 8)
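The check quoted above tests the first 16-bit word of the WordDocument stream. A minimal sketch of the same test, plus the two text offsets at 0x18 and 0x1C that Timo's program reads, written without pointer casts so it does not depend on alignment or host byte order. The helper names (`rd16`, `rd32`, `fib_text_range`) are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read little-endian values one byte at a time. */
static uint16_t rd16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 and fills *start / *end when hdr begins with one of the two
   magic numbers tested in the program above and the range is ordered;
   returns 0 otherwise. */
static int fib_text_range(const unsigned char *hdr, size_t len,
                          uint32_t *start, uint32_t *end)
{
    assert(hdr != NULL && start != NULL && end != NULL);
    if (len < 0x20)
        return 0;                      /* need at least the first 32 bytes */
    uint16_t ident = rd16(hdr);
    if (ident != 0xA5EC && ident != 0xA5DC)
        return 0;                      /* not a recognised Word header */
    *start = rd32(hdr + 0x18);         /* offset where the text starts */
    *end = rd32(hdr + 0x1C);           /* offset where the text ends */
    return *start <= *end;
}
```

The 32 bytes read from the stream would be passed as `hdr`; a return value of 0 means the file should be skipped.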
« Last Edit: March 28, 2019, 01:42:05 pm by jj2007 »


Offline Vortex

  • Member
  • *
  • Posts: 517
    • http://www.vortex.masmcode.com
Re: Word .doc file, extract text
« Reply #9 on: March 28, 2019, 06:50:01 pm »
Hi Timo,

Nice work. Thanks for the new tool.
Code it... That's all...