[REQ] Extremely Fast Database
CoLt-[45]
05-04-2005, 07:52 AM
Looking for a dupe script that can handle at least 50,000 or 60,000 files (NOTE: NOT DIRS!!! - I'm talking about FILES!). I've tried a couple of scripts, but every time someone uploads a file it lags like hell - around 5 seconds before moving on to the next one.
Any suggestions, new scripts to test, etc. would be appreciated - I'd gladly try them out.
I'm willing to install a fresh ioFTPD if I have to, but other than that I'm still running on
[ioFTPD 5-8-5r]-[dZSbot 1.15]-[Eggdrop 1.6.17]
TIA
FTPServerTools
05-04-2005, 05:21 PM
Do me a favor and try my dupechecker with that amount of files. An upload check against 60,000 files should take about 17 reads in the file, meaning it can check a dupe out of 60,000 files within 0.2 seconds, and yet it is still just a simple SORTED!! list. It works on dirs, but I can extend it to work on files. The drawback is that after a save it takes more time to process... Have you considered using SQLite with an index in Tcl? There is a Tcl extension for SQLite that might give you what you need; you would need some Tcl scripting then, though. 60,000 filenames with an average of, let's say, 14 characters each make only a measly 830K file, which is basically a small file - reading such a file can be done very quickly. You can use DupeLister to make new dupelists. Please let me know if it is fast enough for you. I have been given reports of 400,000 entries being handled within 2 seconds, so I assume it would be fast enough in your case. If it is, let me know and I'll see if I can add file support as well (shouldn't be hard at all).
DupeSearch and DupeLister are what you need for testing. OnDirCreated does the dupe dir blocking; I can make an OnPreFileUpload or something like that which tests a filename against a list like the one created by DupeLister.
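For reference, the SQLite-with-an-index-in-Tcl route mentioned above could look roughly like the sketch below. This is only an illustration, not part of any of the scripts discussed here: the dupes.db path, the table name and the proc names are made up, and it assumes the sqlite3 Tcl package is installed.

# Rough sketch of an indexed SQLite dupe list in Tcl (all names illustrative).
package require sqlite3

# Open (or create) the dupe database file.
sqlite3 dupedb "dupes.db"

# PRIMARY KEY gives an implicit index, so lookups stay fast even with
# hundreds of thousands of rows.
dupedb eval {
    CREATE TABLE IF NOT EXISTS dupes (
        name TEXT PRIMARY KEY
    )
}

# Return 1 if the filename is already in the dupe list, 0 otherwise.
proc is_dupe {name} {
    return [dupedb exists {SELECT 1 FROM dupes WHERE name = $name}]
}

# Record a new upload; INSERT OR IGNORE silently skips existing names.
proc add_file {name} {
    dupedb eval {INSERT OR IGNORE INTO dupes (name) VALUES ($name)}
}

An upload hook would then just call is_dupe before accepting the file and add_file once the upload completes.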
esmandil
05-05-2005, 03:54 PM
Alternatively, the newest version of esmNewdir has some support for file dupes... and should be fast enough for your needs. Be warned, however, that I don't use the file dupe functions myself, so there are probably still a couple of bugs hiding in there.
CoLt-[45]
05-05-2005, 03:58 PM
Heh, yeah saw that part :p
and posted ;)
http://www.ioftpd.com/board/showthread.php?s=&postid=31975#post31975
Thanks by the way.
Colt
Block dupes by filenames for ioFTPd.
Undupe with wildcard.
Alter database on delete.
Compiled in C from a modified poci source.
http://ioftpd.humandroids.net/
badDUPE may be fast enuff...?
darkone
05-14-2005, 01:09 PM
Originally posted by FTPServerTools
Do me a favor and try my dupechecker with that amount of files. An upload check against 60,000 files should take about 17 reads in the file, meaning it can check a dupe out of 60,000 files within 0.2 seconds, and yet it is still just a simple SORTED!! list. [...]
Consider saving the data to more than one file when the file (database) size grows too large. I'm assuming you're using a method similar to the binary search on a sorted file that I posted a while ago.
Here's a simple example of how the contents of the files could look (a rough Tcl sketch of reading this header follows the layout):
filedb_1.dat
[number of files in database]
[min value of database 1]
[filename of database 1]
[min value of database ...]
[filename of database ...]
...
[min value of database N]
[filename of database N]
[database contents part 1]
filedb_....dat
[database contents part ...]
filedb_N.dat
[database contents part N]
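Purely as an illustration of the layout above, reading that header and picking which part file to search could be done in Tcl roughly like this (the header field meanings, the filedb_1.dat name and the proc name are assumptions, not existing code):

# Sketch: parse the header in filedb_1.dat and pick the part file whose
# range should contain the given name. Assumes one header field per line:
# first the number of part databases, then a (min value, filename) pair
# for each part, sorted by min value.
proc pick_part {headerfile name} {
    set fh [open $headerfile r]
    set nparts [gets $fh]
    set parts {}
    for {set i 0} {$i < $nparts} {incr i} {
        lappend parts [list [gets $fh] [gets $fh]]
    }
    close $fh

    # Binary search over the min values: keep the last part whose
    # min value is <= the name we are looking for.
    set lo 0
    set hi [expr {[llength $parts] - 1}]
    set best 0
    while {$lo <= $hi} {
        set mid [expr {($lo + $hi) / 2}]
        if {[string compare [lindex $parts $mid 0] $name] <= 0} {
            set best $mid
            set lo [expr {$mid + 1}]
        } else {
            set hi [expr {$mid - 1}]
        }
    }
    return [lindex $parts $best 1]
}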
When a file grows larger than e.g. 5,000 entries, it's split into two and the header information in filedb_1.dat is updated. With 1,000,000 entries you'd end up with 200 files. That equals a maximum of 8 comparisons (binary search variant on the min values) + 13 comparisons (binary search within a file). Neat and very efficient. It might also be wise to limit filenames to a fixed size and use a read buffer twice that size, so you'll always get a full entry no matter what.
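Searching inside one part with fixed-length entries could then be done with seek/read along these lines (again just a sketch, not an existing script; the 64-byte record size is an arbitrary stand-in for whatever fixed filename length is chosen):

# Sketch: binary search over fixed-length records in one sorted part file.
# Each entry is assumed to be padded with spaces to reclen bytes, so record
# number i starts at byte offset i * reclen and one read returns a full entry.
proc part_has_dupe {partfile name reclen} {
    set fh [open $partfile r]
    fconfigure $fh -translation binary
    set count [expr {[file size $partfile] / $reclen}]
    set lo 0
    set hi [expr {$count - 1}]
    set found 0
    while {$lo <= $hi} {
        set mid [expr {($lo + $hi) / 2}]
        seek $fh [expr {$mid * $reclen}]
        set entry [string trimright [read $fh $reclen]]
        set cmp [string compare $entry $name]
        if {$cmp == 0} {
            set found 1
            break
        } elseif {$cmp < 0} {
            set lo [expr {$mid + 1}]
        } else {
            set hi [expr {$mid - 1}]
        }
    }
    close $fh
    return $found
}

With 5,000 entries per part that is at most 13 seeks, matching the comparison count above; e.g. part_has_dupe filedb_3.dat some.release-name 64.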
If you need further information, just message me on irc.