Link crawler/harvester

kobold · February 07, 2006, 10:57:26 PM

This tool is able to find 'http://' strings in any kind of file and write the complete link into an html file.

I played around with http://www.yacy.net/yacy/.
It is a p2p search engine for the web, a crawler, indexer, proxy, dns, webserver...

Because I had saved all my links as shortcuts (for a better overview), it was not possible to pass them to the yacy crawler. So I decided to code a little tool to crawl all my shortcuts and save the 'http://' links in an .html file. This way it worked.

How to use (in the console):
link_crawler [-s directory] [-d html_file] [-f filter]

-s 'directory': this directory and all subdirectories will be crawled
-d 'html_file': path and name for the result file
-f 'filter': the type of file the program will search in ('.txt' for example, without the quotes)

Known limitations:
- Only the first 1 MB of a file will be scanned (should be enough).
- The program searches only one type of file per run; it is not possible to search multiple file types at one time.
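The scanning step described above can be sketched roughly as follows. This is not the tool's actual source (which is not shown in this thread), only a minimal illustration. The names `find_bytes`, `is_link_end`, `scan_links` and `link_cb` are made up for this example, and the rule for where a link ends (whitespace, quotes, angle brackets, control bytes) is an assumption. A byte-wise search is used instead of strstr so that NUL bytes in binary files do not stop the scan.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find needle in a buffer that may contain NUL bytes. */
static const char *find_bytes(const char *buf, size_t len,
                              const char *needle, size_t nlen)
{
    if (nlen == 0 || len < nlen)
        return NULL;
    for (size_t i = 0; i + nlen <= len; i++) {
        if (buf[i] == needle[0] && memcmp(buf + i, needle, nlen) == 0)
            return buf + i;
    }
    return NULL;
}

/* Assumed rule: a link ends at whitespace, a quote, an angle bracket,
   a control byte or a non-ASCII byte. */
static int is_link_end(unsigned char c)
{
    return c <= ' ' || c == '"' || c == '\'' || c == '<' || c == '>' || c >= 0x7f;
}

/* Callback type: receives each match (not NUL-terminated) and its length. */
typedef void (*link_cb)(const char *link, size_t len, void *user);

/*
 * Scans buf[0..len) for needle (for example "http://") and calls emit
 * for every match that has at least one character after the needle.
 * emit may be NULL. Returns the number of matches reported.
 */
size_t scan_links(const char *buf, size_t len, const char *needle,
                  link_cb emit, void *user)
{
    size_t nlen = strlen(needle);
    size_t count = 0;
    const char *end = buf + len;
    const char *p = buf;

    while ((p = find_bytes(p, (size_t)(end - p), needle, nlen)) != NULL) {
        const char *q = p + nlen;

        /* Extend the match until a link-ending byte or end of buffer. */
        while (q < end && !is_link_end((unsigned char)*q))
            q++;

        /* Skip matches that consist of the needle alone. */
        if ((size_t)(q - p) > nlen) {
            if (emit != NULL)
                emit(p, (size_t)(q - p), user);
            count++;
        }
        p = q; /* continue after this match */
    }
    return count;
}
```

Scanning the text `see http://www.yacy.net/yacy/ for details` for `http://` would report one match, `http://www.yacy.net/yacy/`. The callback could then write each match as a line into the result file.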

It is still alpha, but it works for me. If you find any bugs, you can post them here.

kobold · February 08, 2006, 10:26:45 PM

I modified some parts of the code and added an option to change the search string (max 1024 bytes).
Now you can search for any word you want.

I discovered that it sometimes inserts blank lines. I do not know why.

Pelle · February 09, 2006, 07:44:10 PM

Not sure if it's a bug or a feature, but if I specify .c for 'filter' it will also match .css, .cpp... It's a bit surprising at least.

Pelle

kobold · February 10, 2006, 04:36:07 PM

Oh... yes... I call it a 'feature' :mrgreen:
I look for the filter with strstr, so it can match more files than you expect, including files with other extensions.
This is useful if you type in '.htm', because it will then find '.html' files too. But for other files it could be bad. I will look into it.
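To illustrate the difference, here is a small sketch of the two ways to test a file name against the filter. The function names `filter_substring` and `filter_suffix` are invented for this example and are not taken from the tool. The first one mirrors the strstr approach described above; the second is one possible stricter alternative that only accepts names ending with the filter.

```c
#include <string.h>

/* Substring test, as described in the post: the filter may appear
   anywhere in the file name, so ".c" also matches "style.css". */
int filter_substring(const char *filename, const char *filter)
{
    return strstr(filename, filter) != NULL;
}

/* Stricter alternative (an assumption, not the tool's behaviour):
   the file name must END with the filter. The comparison is
   case-sensitive, so ".C" and ".c" are treated as different. */
int filter_suffix(const char *filename, const char *filter)
{
    size_t nlen = strlen(filename);
    size_t flen = strlen(filter);

    if (flen == 0 || flen > nlen)
        return 0;
    return strcmp(filename + (nlen - flen), filter) == 0;
}
```

With the suffix test, '.c' no longer matches 'style.css' or 'main.cpp', but '.htm' also stops matching 'index.html', so the convenient part of the 'feature' is lost as well.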

THX

kobold · February 10, 2006, 09:09:07 PM

And again a new version

- No more empty lines if you crawl HTML files :wink:
- Results now have at least the length of your search string + 1. If you crawl for http://, empty addresses are suppressed. (This program is not meant as an alternative to the Windows search function.)
- The whole filename is written to the HTML file. This makes searching easier if you find something interesting or strange.
- Speed optimizations (~20% faster, thanks to the profiler output)
- The file buffer size can now be set. This is the amount of memory reserved for loading a file. If a file is bigger than the buffer, only the beginning of the file is loaded and crawled. (switch '-b', size in KB)
- Some information is printed when the program has finished its job. Example:

Results:
--------

Found:
3597 matches in 429 files. Crawled 1634 folders.

Settings:
Sourcedir:      x:
HTML file:      z:\dir.html
Filter:         .htm
Searchstring:   http://
Buffersize:     1024 kb

CPU time left:  2 seconds (2296 ticks)
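The buffer size behaviour from the list above (switch '-b') could look roughly like this. It is only a sketch under assumptions: the function name `load_head` and its parameters are hypothetical, and the real tool may allocate and read differently.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Reads at most limit_kb kilobytes from the beginning of the file at path.
 * Returns a malloc'ed buffer (the caller must free it) and stores the
 * number of bytes actually read in *out_len. Returns NULL on error.
 * If the file is larger than the limit, the rest of it is ignored.
 */
char *load_head(const char *path, size_t limit_kb, size_t *out_len)
{
    size_t cap = limit_kb * 1024;
    FILE *fp;
    char *buf;

    *out_len = 0;
    if (cap == 0)
        return NULL;

    fp = fopen(path, "rb"); /* binary mode: no newline translation */
    if (fp == NULL)
        return NULL;

    buf = malloc(cap);
    if (buf == NULL) {
        fclose(fp);
        return NULL;
    }

    *out_len = fread(buf, 1, cap, fp);
    fclose(fp);
    return buf;
}
```

With a limit of 1024 KB, as in the settings printed above, a 3 MB file would only have its first 1 MB loaded and crawled, while smaller files are read completely.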

---
For the next version I am planning to build in templates, so you can create your own HTML page and fit it to your needs.
Further, support for '*' in the filter argument.
Also, more powerful search strings, so you can search for whole sentences and any character on your keyboard.

Freddy · April 08, 2006, 12:57:17 AM

How do you know the speed of your code?
How does profiling work in Pelles C?

Thanks!

kobold · April 11, 2006, 09:27:33 PM

The Pelles C help has a short section about profiling. Unfortunately it is a little more complicated to activate than in LCC... but better than nothing.

Such an output could look like this:

2006-02-10, 20.55.11:
1580 ms, 429 time(s): _searchlinks
780 ms, 1635 time(s): _findfiles
0 ms, 3597 time(s): _cut_end
0 ms, 1 time(s): _StartCrawl
0 ms, 1 time(s): _main

But now it is getting a little bit off-topic :mrgreen:
