Web-crawler

kobold · July 28, 2006, 05:20:01 PM

Hi folks!

Right now, after one week of spare-time programming, the first alpha version of my self-made web crawler is working (more or less).

Because of g00gle's supremacy and all those open but lame Java crawlers and search engines, I decided to write something of my own in C.

Some information:
- I am using SQLite for data storage and Winsock for the i-net connection stuff.
- Around 1100 lines of code so far.
- A lot of text handling and manipulation routines (pure horror in C - f*** htm!).
- Indexing is done by a simple text analysis.
- Very simple ranking is implemented.
- Mail harvesting is possible too but not finished yet.
- A CGI-based search engine for the database is planned.

Some problems:
- Special characters (apostrophes) in link addresses cause trouble in combination with the database (which uses apostrophes to delimit strings!).
- Invalid web addresses cause an endless loop (invalid addresses get requested over and over again).
- The database stores 'ä', 'ö', 'ü', 'æ', 'ø', 'å' etc. in corrupted form (?) (thanks to all those different character sets and HTML).
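
For the apostrophe problem, a minimal sketch of one common fix: in SQL, an apostrophe inside a string literal is written as two apostrophes. The helper name `sql_escape_quotes` is made up for illustration and is not part of the crawler. SQLite also offers `sqlite3_bind_text()` with prepared statements and the `%q` format of `sqlite3_mprintf()`, which avoid building the quoted string by hand.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the crawler's source): returns a
   malloc'd copy of `in` with every apostrophe doubled, so the result
   can be placed inside a '...' SQL string literal.
   The caller frees the result. Returns NULL if allocation fails. */
char *sql_escape_quotes(const char *in)
{
    size_t len = strlen(in);
    /* Worst case: every character is an apostrophe and gets doubled. */
    char *out = malloc(2 * len + 1);
    char *p = out;

    if (out == NULL)
        return NULL;

    for (; *in != '\0'; in++) {
        if (*in == '\'')
            *p++ = '\'';   /* emit the extra apostrophe first */
        *p++ = *in;        /* then the character itself */
    }
    *p = '\0';
    return out;
}
```

Passing a link such as `http://example.com/o'brien.html` through this helper gives `http://example.com/o''brien.html`, which can then be embedded between apostrophes in an INSERT statement.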

Another bad thing is that I am on holiday and will be away for a week.

If someone is working on a similar project or is interested in mine - let me know.

MrBcx · July 29, 2006, 11:21:23 PM

Consider me interested.

Will you be sharing source code?

kobold · August 05, 2006, 11:03:47 PM

I am back from holidays... hehe - weather was sh**! Rain, rain and even more rain... :cry:

I thought about sharing the code... hm.. but I do not know if you want to see my ugly code :wink:
Before my trip I added some bugfixes. Now only one major problem remains.
Once I have fixed it I will publish the code for this very first alpha version. Some features are still missing, but the whole hack runs very smoothly and stably.

iancasey · August 06, 2006, 11:25:52 PM

Kobold,
sounds interesting. Have you had any luck on Yahoo Groups? The BCXOLR team made a DB of all BCX msgs, but I can no longer save the Yahoo msgs to a database. Your program sounds almost perfect.

Regards,
Ian

kobold · August 07, 2006, 08:41:41 PM

The endless loop was the result of a stupid programming error - yesterday I got it to work.

I ran the program until the DB reached a size of 1.3 MB (over 8100 links). Works great so far.

(update: tried another page - another endless loop, but for a different reason... :roll: the program caught or created an invalid address, I guess)
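
One way to keep a bad address from being requested forever, whatever the cause, is to count failed attempts per address and give up after a limit. This is only a sketch under assumptions: the names, the table size and the limit of three attempts are all invented here, and in the real crawler the counter would more naturally be a column in the SQLite table than an in-memory array.

```c
#include <string.h>

#define MAX_TRACKED  1024  /* assumed capacity for this sketch */
#define MAX_URL_LEN  256   /* assumed maximum address length */
#define MAX_ATTEMPTS 3     /* assumed retry limit */

struct attempt {
    char url[MAX_URL_LEN];
    int  failures;
};

static struct attempt attempts[MAX_TRACKED];
static int attempt_count = 0;

/* Find the record for `url`, creating it on first sight.
   Returns NULL if the address is too long or the table is full. */
static struct attempt *find_attempt(const char *url)
{
    int i;

    if (strlen(url) >= MAX_URL_LEN)
        return NULL;
    for (i = 0; i < attempt_count; i++)
        if (strcmp(attempts[i].url, url) == 0)
            return &attempts[i];
    if (attempt_count == MAX_TRACKED)
        return NULL;
    strcpy(attempts[attempt_count].url, url);
    attempts[attempt_count].failures = 0;
    return &attempts[attempt_count++];
}

/* Call this whenever fetching `url` fails. */
void record_failure(const char *url)
{
    struct attempt *a = find_attempt(url);

    if (a != NULL)
        a->failures++;
}

/* Returns 1 if the crawler should still try `url`, 0 if it should give up. */
int should_fetch(const char *url)
{
    struct attempt *a = find_attempt(url);

    if (a == NULL)
        return 0;  /* cannot track it, so skip it rather than risk a loop */
    return a->failures < MAX_ATTEMPTS;
}
```

The main loop would check `should_fetch()` before each request and call `record_failure()` when the request fails, so an invalid address drops out after the third failure instead of being retried endlessly.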

One thing I know now: I have to separate the link crawler from the indexer, because the number of new links grows faster than the number of indexed pages. But that is only one item on my long to-do list.
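
A rough sketch of what that separation could look like, assuming each page record carries a state so the two parts can run at their own pace. The states, struct and function names are invented for illustration and are not taken from the posted code; the actual download and text analysis are left out.

```c
#include <stddef.h>

/* Assumed life cycle of a page in this sketch. */
enum page_state { PAGE_DISCOVERED, PAGE_FETCHED, PAGE_INDEXED };

struct page {
    const char *url;
    enum page_state state;
};

/* Return the first page in state `want`, or NULL if there is none. */
struct page *next_in_state(struct page *pages, size_t n, enum page_state want)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (pages[i].state == want)
            return &pages[i];
    return NULL;
}

/* One step of the link crawler: take a discovered page and mark it
   fetched. Returns 1 if a page was handled, 0 if nothing was waiting. */
int crawl_step(struct page *pages, size_t n)
{
    struct page *p = next_in_state(pages, n, PAGE_DISCOVERED);

    if (p == NULL)
        return 0;
    /* download and link extraction would happen here */
    p->state = PAGE_FETCHED;
    return 1;
}

/* One step of the indexer: take a fetched page and mark it indexed.
   Returns 1 if a page was handled, 0 if nothing was waiting. */
int index_step(struct page *pages, size_t n)
{
    struct page *p = next_in_state(pages, n, PAGE_FETCHED);

    if (p == NULL)
        return 0;
    /* text analysis and ranking would happen here */
    p->state = PAGE_INDEXED;
    return 1;
}
```

Because each step only looks at the state, the crawler and the indexer no longer have to run in lockstep: they could be two loops, two programs sharing the database, or the indexer could simply be given more steps per round when it falls behind.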

If you want to help me (sending some bucks, coding, webspace, ideas, documentation, tidying up my room etc.) feel free to mail me :mrgreen:

But now the ugly hack for all interested people... have fun!

PhilG57 · January 15, 2013, 03:28:52 AM

I know this is an old thread but I've just downloaded the code and am playing with it. It's a pretty easy way to begin to learn some network programming and some SQL usage as well...
