ioFTPD General: New releases, comments, questions regarding the latest version of ioFTPD.
08-12-2004, 11:52 PM
#1
Disabled
FlashFXP Registered User ioFTPD Administrator
Join Date: Dec 2001
Posts: 2,230
New core (revision 3)
Ok, as most of you know, I've been working on a non-ioFTPD related project for the past 'few' months. While I've been busy with the other project, I've also made some plans (notepad drafts & quick performance tests) for the new io (input/output) core.
It took 3 rewrites to get it all working, but now I'm confident that I've overcome all the theoretical problems I had with the fully asynchronous processing pipeline.
The old core used to process everything in a fixed order. When reading data from an SSL-encrypted socket, it first called the 'read' function. Once the read completed, it called the 'decryption' function. Finally, when decryption completed, it called some other function (to look for linefeeds or so). Because everything is done in linear order, it is very easy to synchronize: 'read' -> 'decrypt' -> 'do something with buffer'.
The new core incorporates a completely new ideology. The order of events is no longer predetermined. The 'read' function may be called at the same time as the 'decryption' function is processing the buffer returned by the previous read call. It's also possible that the function that does something with the decrypted buffer is running while decryption is in progress...
As you can imagine, it is no longer a trivial task to get this synchronized properly. By properly, I mean that it is not acceptable for one thread to block another for longer than a few quantums (a quantum is the slice of CPU time a thread gets from the kernel to execute), because whenever the pipeline stalls, efficiency drops. Therefore I have tried to figure out a way to make stalls as short as possible, and in many ways I think I have succeeded.
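A minimal sketch of how such a completion-port-driven pipeline can be structured (every structure and function name here is invented for illustration; this is not the actual core code):
Code:
#include <winsock2.h>
#include <windows.h>

/* Invented for this sketch. One buffer travels through the stages
 * read -> decrypt -> parse, but each stage runs on whichever worker
 * thread happens to dequeue it. */
typedef struct _PIPELINE_BUFFER {
	OVERLAPPED Overlapped;	/* first member, so the completion packet
				   casts straight back to the buffer */
	DWORD      dwStage;
	DWORD      dwBytes;
	BYTE       pData[16384];
} PIPELINE_BUFFER;

enum { STAGE_READ, STAGE_DECRYPT };

static HANDLE hCompletionPort;

static void DecryptBuffer(PIPELINE_BUFFER *lpBuffer)    { (void)lpBuffer; /* ssl decrypt */ }
static void ScanForLinefeeds(PIPELINE_BUFFER *lpBuffer) { (void)lpBuffer; /* find commands */ }
static void IssueNextRead(PIPELINE_BUFFER *lpBuffer)    { (void)lpBuffer; /* WSARecv() again */ }

/* Workers pull whatever completion happens to be ready: a new read may
 * complete and be dequeued by one thread while another thread is still
 * decrypting the previous buffer - the order of events is not fixed. */
static DWORD WINAPI WorkerThread(LPVOID lpParam)
{
	DWORD            dwBytes;
	ULONG_PTR        ulKey;
	LPOVERLAPPED     lpOverlapped;
	PIPELINE_BUFFER *lpBuffer;

	(void)lpParam;
	for (;;) {
		if (!GetQueuedCompletionStatus(hCompletionPort, &dwBytes,
					       &ulKey, &lpOverlapped, INFINITE)) continue;
		lpBuffer = (PIPELINE_BUFFER *)lpOverlapped;
		lpBuffer->dwBytes = dwBytes;

		switch (lpBuffer->dwStage) {
		case STAGE_READ:
			DecryptBuffer(lpBuffer);
			lpBuffer->dwStage = STAGE_DECRYPT;
			/* Hand the decrypted buffer back to the port; any
			 * idle worker may process it while this thread
			 * moves on to the next completion. */
			PostQueuedCompletionStatus(hCompletionPort, lpBuffer->dwBytes,
						   ulKey, &lpBuffer->Overlapped);
			break;
		case STAGE_DECRYPT:
			ScanForLinefeeds(lpBuffer);
			lpBuffer->dwStage = STAGE_READ;
			IssueNextRead(lpBuffer);
			break;
		}
	}
	return 0;
}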
I have placed high hopes on the new core, and when it is done, I hope I can say it was worth it. There are no examples of anything like it available - and AFAIK this method has never been done before. When I have the performance figures - and if they are what I expect them to be - I can be sure that it's a unique solution.
- Old core: ~160 MB/sec cached disk read (to socket)
- New core (rev 2): ~220 MB/sec cached disk read (to socket)
- New core (rev 3): faster than rev 2.
08-13-2004, 12:04 AM
#2
Disabled
FlashFXP Registered User ioFTPD Administrator
Join Date: Dec 2001
Posts: 2,230
Notepad draft from yesterday:
http://www.ioftpd.com/~darkone/tmp/stupid.txt
As usual, the example does not compile, nor does it have any real functionality. The idea of the example is to show the complexity (and/or simplicity) of the algorithms in use. For legal reasons I have to mention that one is not allowed to use this code or its direct or indirect derivatives without my permission.
08-13-2004, 12:57 AM
#3
Senior Member
FlashFXP Scripter ioFTPD Foundation User
Join Date: Sep 2003
Posts: 132
Oh man, oh man. Have I been waiting for this post to come!
Great going d1. Gonna go read your post more thoroughly now!
Keep doing what you do best!
Let's show those Linux servers who's got the fastest daemon/servers around
Microsoft Windows Server 2003 vs. Linux Competitive File Server
peep
08-16-2004, 03:21 AM
#4
Disabled
FlashFXP Registered User ioFTPD Administrator
Join Date: Dec 2001
Posts: 2,230
I wrapped up a few tests for the new core...

ftp buffered file -> socket -> socket [ftp.exe] -> null
- Parameters: FILE_FLAG_SEQUENTIAL_SCAN, SO_SNDBUF = 0, 3 * 32 KB application write buffers, ftp.exe
- Result: ~180 MB/sec
- Conclusion: No significant speed gain over the current core. Reduced memory and CPU usage (~25% in user mode).

ftp unbuffered file -> socket -> socket [ftp.exe] -> null
- Parameters: FILE_FLAG_NO_BUFFERING, SO_SNDBUF = 0, 3 * 8 KB application write buffers, ftp.exe
- Result: ~60 MB/sec
- Conclusion: User mode CPU usage reduced significantly (to 1-2%), while kernel mode CPU usage increased somewhat. Greatest benefits when used with asynchronous devices (SCSI, network shares).

buffered loopback transfer file -> socket -> socket -> null
- Parameters: FILE_FLAG_SEQUENTIAL_SCAN, no socket write buffers, 3 * 32 KB application read buffers + 3 * 8 KB application write buffers, within a single daemon thread
- Result: ~120 MB/sec transfer (240 MB/sec throughput)
- Conclusion: Result is a bit lower than expected. But because the test was limited to a single thread, it only utilized the capacity of one processor.

It currently seems that there is very little left to optimize in the main code, as nearly 100% of CPU time is now spent in mandatory calls: GetQueuedCompletionStatus(), WSASend(), WSARecv(), ReadFile() & WriteFile().
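For reference, the parameters above map onto Win32 calls roughly like this (a simplified sketch; the file name is invented and error handling is stripped):
Code:
#include <winsock2.h>
#include <windows.h>

int main(void)
{
	WSADATA wsa;
	HANDLE  hBuffered, hUnbuffered;
	SOCKET  s;
	int     nZero = 0;

	WSAStartup(MAKEWORD(2, 2), &wsa);

	/* Buffered test: FILE_FLAG_SEQUENTIAL_SCAN hints the cache
	 * manager to read ahead aggressively. */
	hBuffered = CreateFileA("testfile.bin", GENERIC_READ, FILE_SHARE_READ,
				NULL, OPEN_EXISTING,
				FILE_FLAG_SEQUENTIAL_SCAN | FILE_FLAG_OVERLAPPED, NULL);

	/* Unbuffered test: FILE_FLAG_NO_BUFFERING bypasses the cache, but
	 * buffer addresses and read sizes must then be sector aligned -
	 * part of what makes supporting it cost extra code. */
	hUnbuffered = CreateFileA("testfile.bin", GENERIC_READ, FILE_SHARE_READ,
				  NULL, OPEN_EXISTING,
				  FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);

	/* SO_SNDBUF = 0 disables the kernel's per-socket send buffer, so
	 * overlapped WSASend() transmits straight from application memory -
	 * which is why several (here 3) application write buffers are kept
	 * in flight at once. */
	s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	setsockopt(s, SOL_SOCKET, SO_SNDBUF, (const char *)&nZero, sizeof(nZero));

	if (hBuffered != INVALID_HANDLE_VALUE)   CloseHandle(hBuffered);
	if (hUnbuffered != INVALID_HANDLE_VALUE) CloseHandle(hUnbuffered);
	closesocket(s);
	WSACleanup();
	return 0;
}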
08-16-2004, 01:07 PM
#5
Senior Member
FlashFXP Registered User ioFTPD Foundation User
Join Date: Sep 2003
Posts: 142
A long time ago I saw an article on Onversity about optimizing loops. It may help you. Here's a link to the PDF file:
d-loop
08-19-2004, 12:35 PM
#6
Disabled
FlashFXP Registered User ioFTPD Administrator
Join Date: Dec 2001
Posts: 2,230
I actually use similar optimizations already...
Code:
void foo(int lOperation)
{
	switch (lOperation) {
	case 0:
		DoSomething1();
		break;
	case 1:
		DoSomething2();
		break;
	}
}
could be written as:
Code:
void (*lpProc[2])(void) = { DoSomething1, DoSomething2 };

void foo(int lOperation)
{
	lpProc[lOperation]();
}
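(Worth noting: the indexed call removes the compare-and-branch chain, but it blindly trusts lOperation to be a valid index, so it is only safe where the value is generated internally and already known to be in range.)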
08-20-2004, 04:15 AM
#7
Disabled
FlashFXP Registered User ioFTPD Administrator
Join Date: Dec 2001
Posts: 2,230
Finished the first test round with 2048 test connections; total transfer speed was ~180 MB/sec. In the test, the daemon was acting as both client and server, so the number of downloading client connections was 1024 and the number of uploading server connections was 1024 as well. CPU usage was at ~80% on both CPUs. Memory usage, using 3 read & 3 write buffers of 16 KB each, was ~160 MB.
=> There's still room for optimizations, but the core now seems to handle heavy io loads with ease.
08-20-2004, 05:34 AM
#8
Senior Member
FlashFXP Beta Tester ioFTPD Scripter
Join Date: Aug 2003
Posts: 517
(stupid question, but who am i ) do these optimizations for extreme situations (1000+ transfers) produce any overhead for normal situations (~20 transfers)?
08-20-2004, 07:00 AM
#9
Disabled
FlashFXP Registered User ioFTPD Administrator
Join Date: Dec 2001
Posts: 2,230
Of course not. The resources freed by the reduction in memory copying negate the overhead that the required thread synchronization adds.
For uploads, I expect a tremendous performance improvement; the separate encryption thread pool is now gone - io threads are able to transform themselves into encryption threads when required. This means that a thread is in many situations able to decrypt/encrypt/hash/whatever the buffer immediately after processing it. Just like with the old core, the number of threads performing such operations simultaneously is a definable parameter.
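A rough sketch of that idea (illustrative only - every name here, including the MAX_CRYPTO_THREADS parameter, is invented rather than taken from the actual implementation):
Code:
#include <windows.h>

/* An io thread that has just dequeued a completed network read decrypts
 * the buffer itself if the configured limit of simultaneous crypto
 * workers hasn't been reached; otherwise it defers the buffer to
 * whichever thread frees up first. */
#define MAX_CRYPTO_THREADS 2	/* the definable parameter */

typedef struct _IO_BUFFER { BYTE pData[16384]; DWORD dwBytes; } IO_BUFFER;

static volatile LONG lActiveCryptoThreads = 0;

static void DecryptBuffer(IO_BUFFER *lpBuffer)    { (void)lpBuffer; /* ssl decrypt */ }
static void ProcessDecrypted(IO_BUFFER *lpBuffer) { (void)lpBuffer; /* parse/store */ }
static void QueueForCrypto(IO_BUFFER *lpBuffer)   { (void)lpBuffer; /* handle later */ }

void OnReadCompleted(IO_BUFFER *lpBuffer)
{
	/* Try to "become" a crypto thread for the duration of this buffer. */
	if (InterlockedIncrement(&lActiveCryptoThreads) <= MAX_CRYPTO_THREADS) {
		/* No handoff to a separate pool: decrypt immediately after
		 * dequeuing the completion - no extra copy, no context switch. */
		DecryptBuffer(lpBuffer);
		InterlockedDecrement(&lActiveCryptoThreads);
		ProcessDecrypted(lpBuffer);
	} else {
		/* Limit reached: back off and defer the buffer. */
		InterlockedDecrement(&lActiveCryptoThreads);
		QueueForCrypto(lpBuffer);
	}
}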
08-20-2004, 08:19 PM
#10
Disabled
FlashFXP Registered User ioFTPD Administrator
Join Date: Dec 2001
Posts: 2,230
It seems that I'm getting close to the performance requirements I set for the core.
Transfer times for an 800 MB file, when the daemon is working as both client and server:
1 connection: 6.3 seconds (253 MB/sec)
2 connections: 12.6 seconds
10 connections: 62.6 seconds
100 connections: 619.8 seconds
1000 connections: 6285.2 seconds
Performance of a single cached transfer seems to remain constant.
Just one odd thing.. I noticed I had pulled the wrong figure for the old core's performance: http://www.ioftpd.com/board/showthread.php?threadid=3174
... and the odd thing is that now when I try to transfer the same file, I get lower performance (even with the old io). And I can't remember any changes since (other than that I added 1 GB of memory).
08-21-2004, 02:09 AM
#11
Junior Member
Join Date: Jun 2004
Posts: 25
"Transfer times for 800mb file, when daemon is working as both client and server:
1 connection: 6.3 seconds (253mb/sec)
"
The above reslt confuses me.How Can transfering a 800 mb file with 6.3 seconds,and the transfer speed is 253 mb/sec.
800 mb / 6.3 seconds = 253 mb/sec?
08-21-2004, 03:32 AM
#12
Senior Member
FlashFXP Beta Tester ioFTPD Scripter
Join Date: Aug 2003
Posts: 517
800 MB / 6.3 s = 126.98 MB/s
"when daemon is working as both client and server"
in those 6.3 s, it has both uploaded and downloaded 800 MB, so it's 126.98 MB/s "full duplex", or a total of 253 MB/s
08-21-2004, 04:54 AM
#13
Senior Member
ioFTPD Foundation User
Join Date: Nov 2002
Posts: 220
Quote:
ftp unbuffered file -> socket -> socket [ftp.exe] -> null
- Parameters: FILE_FLAG_NO_BUFFERING, SO_SNDBUF = 0, 3 * 8 KB application write buffers, ftp.exe
- Result: ~60 MB/sec
- Conclusion: User mode CPU usage reduced significantly (to 1-2%), while kernel mode CPU usage increased somewhat. Greatest benefits when used with asynchronous devices (SCSI, network shares).
d1, does io need special buffer settings for asynchronous devices? Because I'm using some SCSI RAIDs..
cya
08-21-2004, 09:23 AM
#14
Disabled
FlashFXP Registered User ioFTPD Administrator
Join Date: Dec 2001
Posts: 2,230
Not likely that I'll support FILE_FLAG_NO_BUFFERING; the implementation seems to be too costly. (amount of code compared to the performance delta)
08-21-2004, 09:39 AM
#15
Senior Member
ioFTPD Foundation User
Join Date: Nov 2002
Posts: 220
sounds quite good