Comments on: TCP Stack Flaking Out http://laurentszyster.be/blog/tcp-stack-flaking-out/ Python on Peers Tue, 07 Feb 2012 14:29:09 +0000

by: frontera000 http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2732 Sun, 27 Aug 2006 17:05:24 +0000

Hi,

I think you should know that I have tried varying the sleep times in the server (upping them to 1 or 2 seconds) as well as using very small buffers on the server. These things don’t make any difference in terms of connections dropping. They don’t. As long as the clients and the server check the error conditions properly (ENOBUFS, EWOULDBLOCK, etc.), the OS will not send a RST to terminate the TCP connection. If it did, it would indeed be quite broken. It does not do that. If you don’t believe me, please run the program I have posted and tweak all the numbers (SNDBUF and RCVBUF as well as various others, including the number of threads, how much to send, etc.). Do all the tweaking you can, but you won’t see the connections dropping.

That’s about all I can say here. It is true that many software developers are “opinionated”, but I am not. I could be wrong. It would be nice to be corrected with clear evidence rather than conjecture. But if you cannot be convinced by the test program I wrote, I have no more to contribute here.

Regards to you and have a good day!

by: Laurent Szyster http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2722 Sun, 27 Aug 2006 03:03:24 +0000

> If you notice, I wait until all clients have connected, before doing the I/O intensive testing. If you don’t do that it can cause RESET conditions as in your case.

That’s not the case, as you can see from the log entries I get from the server, which clearly show that:

debug
Flaking id="a95610"
in="18432" out="104448"

the client managed to send 18432 bytes and then broke the connection.

Also, if you care to read it once more, flaking_client.py does connect all its dispatchers’ sockets before polling for I/O.

Finally, I could not establish a relation with the number of clients: 500 clients with buffers half the size of the server’s would do fine, but when the same 500 clients sent chunks of 128KB at once to a server with a 512-byte application buffer … many closed after having actually sent only 18432 bytes (in that lapse of time the server looped 51 times around select and managed to send 102KB in chunks of 2048 bytes). So, presumably the error happened as there was 42MB of data buffered *by the OS* without raising an error, never actually sent.

> And everything sent by client is not read “at once” by the server. That is incorrect.

hmm.

Indeed, the server sends and receives half as much as its clients at once, but it is also presumably a lot faster (there is only one server thread, not a thousand). And even if it sleeps 0.1 seconds between lapses, that’s barely the order of magnitude of the time it takes to loop through a thousand file descriptors in C, then call send and recv once.

FYI, a select-based TCP server can handle around one connection per millisecond implementing a non-trivial protocol like HTTP … in Python. So, a test-purpose C server doing a lot less *must* be at least ten times faster, maybe a lot more.

Probably well below 0.1 second for 1000 concurrent connections. And maybe twice as fast as 1000 client threads contending in the same process.

I suggest you time each run of the server loop when the clients run in a separate process: you may be surprised …
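One way to get such a timing, as a rough sketch only: the snippet below measures a single pass of a select() loop over a set of local sockets. It uses `socket.socketpair()` as a stand-in for real client connections (an assumption, since the tests discussed here run over TCP on 127.0.0.1), and the function name `time_select_pass` and its parameters are made up for the illustration.

```python
import select
import socket
import time


def time_select_pass(n_pairs=50, payload=b"ping"):
    """Time how long it takes to select() on n_pairs sockets and read each once.

    Returns (elapsed_seconds, bytes_received).
    """
    pairs = [socket.socketpair() for _ in range(n_pairs)]
    try:
        # Make every reader readable before the timed section starts.
        for writer, _reader in pairs:
            writer.send(payload)
        pending = {reader for _writer, reader in pairs}
        received = 0
        start = time.perf_counter()
        while pending:
            readable, _, _ = select.select(list(pending), [], [], 1.0)
            if not readable:
                break  # timed out: stop rather than spin forever
            for sock in readable:
                received += len(sock.recv(4096))
                pending.discard(sock)
        elapsed = time.perf_counter() - start
        return elapsed, received
    finally:
        for writer, reader in pairs:
            writer.close()
            reader.close()


if __name__ == "__main__":
    elapsed, received = time_select_pass()
    print("one pass: %.6f s for %d bytes" % (elapsed, received))
```

The absolute number depends on the machine and the OS, so it only says something when compared against the sleep interval used by the server under test.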

Regards,

by: frontera000 http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2719 Sat, 26 Aug 2006 21:46:40 +0000

More clarifications.

By the way, your observation “Using threads, you will instead glob all CPU time and bring the OS to a crawling state before you can overflow its shared socket buffer.” is incorrect.

Using threads it is very easy to glob all buffer spaces. Besides the socket buffers are *not* shared. Socket buffers are per socket.

And everything sent by client is not read “at once” by the server. That is incorrect.

by: frontera000 http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2718 Sat, 26 Aug 2006 21:42:08 +0000

Actually I have done the experiments you proposed already. I have a version of the code that has client threads and server in different processes. It makes no difference. I also have experimented with ridiculously small send and receive buffer sizes at socket level on the server side. No difference. This is just how things should be. There are no problems other than the applications getting ENOBUFS errors as they should.

Hope that clarifies your concerns.

I think one of the errors you may not be catching (other than the send not checking ENOBUFS error condition) is that you may be connecting clients fast to the server, while I/O is going on at the same time. Doing that can cause problems in Windows as well as BSD, due to the fact that the backlog on listen is set at 5 maximum.

If you notice, I wait until all clients have connected, before doing the I/O intensive testing. If you don’t do that it can cause RESET conditions as in your case.
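The "connect everything first, then do I/O" ordering described above can be sketched in Python as follows. This is an illustration under assumptions, not the C test program: the function name `connect_all_then_accept`, the client count and the backlog value of 128 are all chosen for the example, and the OS may silently cap whatever backlog is requested.

```python
import socket


def connect_all_then_accept(n_clients=20, backlog=128):
    """Connect n_clients to a loopback listener, then accept them all.

    No application data is exchanged until every connection exists.
    Returns the number of accepted connections.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    clients, accepted = [], []
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(backlog)         # pending connections queue up here
        address = listener.getsockname()
        for _ in range(n_clients):
            clients.append(socket.create_connection(address, timeout=5))
        listener.settimeout(5)
        for _ in range(n_clients):
            conn, _peer = listener.accept()
            accepted.append(conn)
        return len(accepted)
    finally:
        for sock in clients + accepted + [listener]:
            sock.close()


if __name__ == "__main__":
    print(connect_all_then_accept())
```

Keeping the number of not-yet-accepted clients below the backlog is what makes the connect phase safe here; with more clients than the queue holds, the server has to accept while the clients are still connecting.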

by: Laurent Szyster http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2713 Sat, 26 Aug 2006 09:58:07 +0000

Well, of course I disagree: this is about *opinionated* software ;-)

First, you are using threads to multiplex your clients instead of an asynchronous loop around calls to the select() interface. Second, you are not using two different program instances for the server and the clients, but different threads in the same process. Third, you don’t slow down your server enough: reading 34000 bytes from each of 1000 sockets every 100 milliseconds is still reading more than 300MB per second.
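The "more than 300MB per second" figure follows directly from the three numbers quoted above. A small check, with a helper name (`read_rate_bytes_per_second`) invented for the purpose:

```python
def read_rate_bytes_per_second(bytes_per_socket, sockets, interval_ms):
    """Aggregate read rate when every socket is drained once per interval."""
    return bytes_per_socket * sockets * 1000 // interval_ms


# 34000 bytes from each of 1000 sockets, once every 100 milliseconds.
rate = read_rate_bytes_per_second(34000, 1000, 100)
print(rate, "bytes per second")
```

That works out to 340,000,000 bytes per second, assuming the server really does complete a full pass over all 1000 sockets in every interval.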

So there is no chance that you’ll hit the OS limit on the total size of the buffer space allocated for all sockets, because everything sent by the client is read at once by the server!

Using threads, you will instead glob all CPU time and bring the OS to a crawling state before you can overflow its shared socket buffer.

To reproduce the same error condition, that is, to make the client program fast enough to overflow the OS buffer allocated to TCP output, you should remove any contention between client threads *and* between the clients and the server thread (that’s why it can “hang” your PC if you run the clients forever: all CPU is allocated to contention between server and client threads ;-). Then, setting smaller buffers for the server and the client, on input and output, will get your two programs to a state where the client is outputting data so fast that the OS runs out of global buffer.

If you do such an experiment in C, be sure that I’m interested in the result!

One thing I agree with you on is that ENOBUFS is obviously not available in CPython. So I had a look at its socket.c binding to look for a handler of that error condition … and found none:

http://svn.python.org/view/python/trunk/Modules/socketmodule.c

If you look into its sock_send, there is no other error handling but a call to set_error. So, the CPython bindings don’t handle the WSAENOBUFS error condition on send, they simply report the OS error condition. Yet what the socket.send calls of Allegra’s async_core.Dispatcher *do* raise is a WSAECONNRESET, which puzzled me before I could reproduce it systematically and make sense of it.

Anyway, even if we still disagree, this turns out to be a nice way to learn about non-blocking socket programming on Windows for both of us (and our readers).

Regards,

by: frontera000 http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2702 Sat, 26 Aug 2006 03:41:43 +0000

Hello again. Regarding ENOBUFS not being in the Python API, I believe you. I was not suggesting otherwise. We are not talking about Python features. We are talking about whether Windows TCP flakes out.

Since I cannot be certain that the Allegra sample program is not flaking out due to Allegra itself, I wanted a simpler program. So I wrote one in C. Why? This topic interests me. I wrote a TCP/IP implementation a long time ago and it is a pet peeve of mine.

Anyway, you can see another posting I made which lists the C program I used to test. Here is the link: http://sparebandwidth.blogspot.com/2006/08/more-on-tcp-flaking-out.html

In any case, the conclusion, at least on my end, is that the Windows socket implementation and TCP/IP are OK. I think you have some bugs in your code. I think the test I wrote is pretty good. It creates 1000 threads sending to one server that is artificially slowed by a Sleep(). Things work as expected.

If you disagree, I’d like to hear about it.

by: Laurent Szyster http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2690 Fri, 25 Aug 2006 17:39:52 +0000

Hi again,

Thanks for that gentler response, although I still can’t see where you found an ENOBUFS constant in Python’s socket bindings on Win32:

c:\python25\python
>>> import socket
>>> [s for s in dir (socket) if s.find ('ENOBUFS') > -1]
[]

Don’t take my word for granted, try it ;-)
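For comparison, the same kind of lookup can be run against the standard `errno` module instead of `socket`. This is only a sketch for a current CPython 3; whether the Python 2.5 Win32 build shown above gives the same answer is not verified here, and the helper name `names_containing` is made up for the example.

```python
import errno


def names_containing(fragment):
    """Return the sorted names in the errno module that contain fragment."""
    return sorted(name for name in dir(errno) if fragment in name)


if __name__ == "__main__":
    # On Windows builds the list may also include a WSA-prefixed variant.
    print(names_containing("NOBUFS"))
```

If the constant shows up there, a failed `send` can be matched against it through the `errno` attribute of the raised exception, even though the `socket` module itself does not export the name.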

>Your test programs use Allegra which makes it look like it could be Allegra problem, not Winsock API implementation on Windoze boxes.

Indeed it could.

But if you have the patience to follow up on what flaking_client.py imports, namely async_loop and async_core, then you will quickly notice that there is very little between this simplistic client and the select() call. Not even application input buffers or output queues (those are in async_chat and async_net).

>I cannot understand precisely what your thoughts on Win select/socket API is.

To make it brief, here is what I think: the Win32 implementation of the POSIX select call must be compatible with the rest of the API … which is tailored for both asynchronous applications and an asynchronous operating system (Windows 95 ran on top of a single-process DOS, and Microsoft nicely followed the path of least resistance to an asynchronous OS).

Win32 actually “fakes” blocking sockets like it “fakes” threads and processes (hence the blue screen of death). So, when Windows stops faking a block on a socket operation, your application will fall back on the native asynchronous system call … which does raise a fatal error when the buffer allocated to TCP is overflowed.

So that the application fails instead of the system.

From what I read here and there, Windows can allocate a pretty large elastic buffer shared by all sockets (that’s easy … in an asynchronous OS). That’s why TCP sockets will start to “flake out” when they overflow that buffer: as long as its shared output buffer is not full, Windows apparently signals write events on sockets *independently* of the network or the actual state of the TCP protocol.

If the connected peer is too slow and the volume of data too big, that buffer will overflow. And the sockets at fault will be closed. That case is not uncommon for a BitTorrent peer used to distribute very large files over a slow network from peers with limited upload bandwidth.

Regards,

by: frontera000 http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2689 Fri, 25 Aug 2006 17:06:07 +0000

Hi, I am aware that you are talking mostly about Windows specific TCP issues. I think that this article is unclear about exactly why the problem is due to Windows TCP. (It’s funny but I guess I am defending Windows TCP socket interface API, shudders.) Because of the lack of clarity in this article, it is hard to see what problem is being described.

I have read your flaking server and client code. They are written to use your library called Allegra. I looked into Allegra. Allegra does not seem to handle errors in a robust way. For send, it seems to only check for EWOULDBLOCK. The Winsock API is considerably more complicated than the equivalent POSIX socket API. It is also far more capable (I can’t believe myself saying so, but it is true) and versatile. (Probably more buggy too... haha!)

For nonblocking send() in Winsock, the following errors can be returned. ENOBUFS, ENOTCONN, ENOTSOCK, ESHUTDOWN, EINTR, EINPROGRESS, ENETDOWN, EMSGSIZE, EHOSTUNREACH, EINVAL, ECONNABORTED, ECONNRESET, ETIMEDOUT, and of course EWOULDBLOCK.
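To make the point about checking more than EWOULDBLOCK concrete, here is a minimal Python sketch that sorts a failed send into three buckets by its error number. The grouping into "retry", "closed" and "fatal" is an assumption made for this example (in particular, treating ENOBUFS as retryable), not something taken from Allegra or from the Winsock documentation, and `classify_send_error` is a hypothetical helper.

```python
import errno

# Assumption: these mean "nothing was sent this time, try again later".
RETRY = {errno.EWOULDBLOCK, errno.EAGAIN, errno.EINTR,
         errno.EINPROGRESS, errno.ENOBUFS}

# Assumption: these mean the connection is gone and the socket should be closed.
CLOSED = {errno.ECONNRESET, errno.ECONNABORTED, errno.ENOTCONN,
          errno.ESHUTDOWN, errno.ETIMEDOUT, errno.EPIPE}


def classify_send_error(exc):
    """Map an OSError raised by socket.send() to 'retry', 'closed' or 'fatal'."""
    if exc.errno in RETRY:
        return "retry"
    if exc.errno in CLOSED:
        return "closed"
    return "fatal"


if __name__ == "__main__":
    for code in (errno.EWOULDBLOCK, errno.ECONNRESET, errno.EINVAL):
        print(errno.errorcode[code], classify_send_error(OSError(code, "demo")))
```

A caller would wrap `sock.send(data)` in `try ... except OSError as exc` and act on the returned label, so that a reset is logged as a reset instead of being mistaken for a full buffer.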

Similarly there are many different error codes that can result from recv() and select() in Winsock.

Are you sure your code (Allegra) is not at fault? I am not saying so myself, but perhaps it is best to write a simpler test case — not using Allegra but just using socket API to prove your point.

Your test programs use Allegra, which makes it look like it could be an Allegra problem, not the Winsock API implementation on Windoze boxes.

Why not just write a simple test case, in C or Python?

I guess I could do that just as easily, but I am not sure exactly what you are trying to do in your tests (due to the use of the underlying Allegra library). If you are simply trying to use non-blocking sockets to do lots of writes from the client while the server is reading very slowly, I can write that and test it out. But it would be better if you could first write a simple socket test program in C or Python, instead of Allegra, to prove your point, since I cannot understand precisely what your thoughts on the Win select/socket API are.

by: Laurent Szyster http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2669 Thu, 24 Aug 2006 16:46:36 +0000

Hi frontera000,

> This is not something like what you described in this blog entry. Your understanding is completely wrong.

Before publishing this article I took great care to do some research and check my facts. You should do the same before coming to such a definitive conclusion. At least you should have read the sources and run the tests linked in the article:

http://svn.berlios.de/svnroot/repos/allegra/test/flaking_server.py

and

http://svn.berlios.de/svnroot/repos/allegra/test/flaking_client.py

Then you would have noticed that I’m conducting those tests on the loopback device address (127.0.0.1) and that their results are independent of the network state, bandwidth, etc.

More puzzling, it looks as if you have not read the article itself, which makes very clear that I’m not talking about a frame stack, POSIX systems or a BSD implementation of the TCP stack.

This article is about a Win32 *API* problem, a reminder about the fact that the implementation of select () must somehow be compatible with the asynchronous nature of Windows NT and its applications.

Read the sources and run the tests on Windows first.

Once you do, I’m sure that you’ll have more relevant comments to make. Because obviously you know a lot more than I do about Lisp and TCP socket programming on POSIX systems:

http://sparebandwidth.blogspot.com/2006/08/trivial-p2p-in-newlisp.html

Regards,

by: frontera000 http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2664 Thu, 24 Aug 2006 06:07:03 +0000

Windows TCP/IP networking stack code and socket layer code is loosely based on BSD UNIX code. BSD TCP/IP code is one of the original and most widely deployed implementations. Windows networking code is not that different in handling socket level buffering.

The term “stack” in the previous paragraph means protocol stack, as in the OSI 7-layer model. There is no “TCP stack” in the sense you use the term, as in a function call stack or stack as opposed to heap. That is not commonly understood terminology.

I guess what you mean is the socket-level buffering (as in mbufs). For every socket, the kernel's socket-level code will allocate (or rather reserve, without actually allocating) a high-water mark that indicates how much data may be buffered for that socket. There are two socket-level buffers for each socket: one for the read side and one for the write side.

A socket programmer using the socket API can reserve a different amount of buffer space per socket with setsockopt(). It is also possible to get the current size and to find out how much data is pending in the socket-level buffers, via ioctl(), fcntl() or other means. It is easy to find out how much one can write before overflowing the buffers. Application programmers should write programs so that the code that writes checks the buffering beforehand, especially when the application is I/O intensive.
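
As a rough illustration of the setsockopt() point, here is a minimal Python 3 sketch that reads and requests socket-level buffer sizes. The helper names buffer_sizes and request_buffer_sizes are made up for this example, and the sizes a kernel actually grants are platform dependent (some kernels round or clamp the request). Querying the amount of pending data needs platform-specific ioctl calls and is not shown.

```python
import socket


def buffer_sizes(sock):
    """Return the (send, receive) buffer sizes the kernel reports, in bytes."""
    snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return snd, rcv


def request_buffer_sizes(sock, snd, rcv):
    """Ask for new buffer sizes and return what the kernel reports afterwards.

    The kernel is free to round or clamp the requested values.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, snd)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcv)
    return buffer_sizes(sock)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print("default:", buffer_sizes(s))
        print("after request:", request_buffer_sizes(s, 16 * 1024, 16 * 1024))
    finally:
        s.close()
```

The values printed differ between operating systems, so treat them as something to measure rather than assume.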

With blocking I/O, a call to write() on a socket will block until more buffer space (the socket-level buffer for the write side) is available. This has nothing to do with the OS kernel being slower at sending data than your spiffy application code. It has to do with data transfer over the network. The OS is plenty fast, usually faster and more efficiently coded than any application written in Python. It has to do with flow control and injecting data into the network at the right rate. TCP depends on many different heuristics to balance throughput and latency. Read up on Van Jacobson's slow-start and congestion control algorithms. Sending as fast as you can is not going to give you the best throughput over a network that has dynamic behavior.

Getting back to the non-blocking I/O case: since you mentioned “asynchronous” I/O, I suppose you mean that you can write without blocking. In those cases, a proper kernel will notify you with an error. Say there is space for 100 bytes in your socket-level buffer, and you call write() repeatedly on that socket via non-blocking I/O (I assume you set a flag on that socket, like FIONBIO or FIOASYNC). That attempts to write data into space that is not sufficient to hold the request. The request will be rejected and an error code will be returned. That is what happens in most kernels, including most Unix-style kernels and Windows.
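
The non-blocking behavior described above can be observed with a small Python 3 sketch. It assumes a local socket pair whose receiving end never reads, and the helper name fill_send_buffer is invented for this example. Note that send() may accept only part of a chunk before the kernel starts refusing, so the code counts the bytes actually accepted.

```python
import socket


def fill_send_buffer(sock, chunk=b"x" * 4096, limit=64 * 1024 * 1024):
    """Write to a non-blocking socket until the kernel refuses more data.

    Returns (bytes_accepted, refused). refused is True when the kernel
    reported EWOULDBLOCK/EAGAIN (BlockingIOError in Python 3) before
    `limit` bytes were accepted.
    """
    sock.setblocking(False)
    accepted = 0
    while accepted < limit:
        try:
            # send() returns how many bytes the kernel took, possibly
            # fewer than len(chunk).
            accepted += sock.send(chunk)
        except BlockingIOError:
            return accepted, True
    return accepted, False


if __name__ == "__main__":
    writer, reader = socket.socketpair()  # reader never reads here
    try:
        print(fill_send_buffer(writer))
    finally:
        writer.close()
        reader.close()
```

In this setup the writer gets an error back once the buffers are full, and the connection stays usable: as soon as the reader drains some data, further sends succeed.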

If you are seeing a reset on the connection, it is likely the result of a network event. A TCP packet with the RST flag set is probably being received, which resets the connection. To see why this happens, you will need to trace the network and look at the packets on the wire.

This is nothing like what you described in this blog entry. Your understanding is completely wrong.

by: Laurent Szyster http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2609 Mon, 21 Aug 2006 23:37:32 +0000

Hi Stephen,

> I find it astounding that you’ve not presented the basic solution for this problem.

I do. Polling for writability before calling handle_write_event is not enough; your application must also never call the dispatcher’s socket.send outside of that write event handler.

> I see Allegra doesn’t handle all the error conditions optimally.

Thanks for the pointers to various possible TCP errors. Regarding error handling, I’ll point you back to the async_loop._io_poll methods and to async_core.Dispatcher’s send and recv.

Here is how socket errors are handled.

a) EINTR is tolerated for the select or poll call; anything else will be raised, which is possibly a good place to exit the loop.

b) EWOULDBLOCK is handled by recv, and ECONNRESET, ENOTCONN and ESHUTDOWN in send are translated into a handle_close event. Any other exception thrown by the socket in a dependency of handle_read_event and handle_write_event will be caught by the try: except: clause in the io_poll loop.

c) Finally, note that the EINPROGRESS, EALREADY and EWOULDBLOCK error conditions on a socket.connect are also handled.

Up until now that’s been safe enough for asyncore and asynchat too. I don’t know about ENOBUFS on Win32, and running out of OS buffers is precisely what Allegra is expected *not* to do.
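
To make the policy in a) to c) concrete, here is a minimal Python 3 sketch of such an errno classification. It is not Allegra's or asyncore's actual code: the function name classify_socket_error and the returned labels are invented for this illustration, and it only covers the error codes named above.

```python
import errno

# Hypothetical policy table built from the error codes named in the text.
TRY_AGAIN = {errno.EWOULDBLOCK, errno.EAGAIN}
PEER_GONE = {errno.ECONNRESET, errno.ENOTCONN, errno.ESHUTDOWN}


def classify_socket_error(exc):
    """Map an OSError raised by send/recv to 'retry', 'close' or 'raise'.

    'retry' : nothing could be transferred right now, wait for the next event
    'close' : the connection is gone, translate into a close event
    'raise' : unexpected (EBADF, ENOBUFS, ...), let the loop's handler see it
    """
    if exc.errno in TRY_AGAIN:
        return "retry"
    if exc.errno in PEER_GONE:
        return "close"
    return "raise"
```

Whether ENOBUFS belongs in the retry set is exactly the open question in this exchange, so the sketch leaves it in the unexpected category.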

As for EBADF, it’s a fatal error that you can’t handle! Here is an enlightening thread about that exception and select:

http://www.developerweb.net/forum/archive/index.php/t-3247.html

I cite:

“But, really, this shouldn’t be a situation that’s regularly occurring, either… It indicates some sort of bug in your code, not some normal, expected condition, which you should try to work around (…)”

and

“Well, it most likely would suggest a coding (or logic) flaw, wherein a failed read() or write() was not detected or handled, but there is a window between the last successful I/O operation and the call to select() during which a socket may go bad due to peer activity.”

and

“Yeah, as Anders says, the situation you describe just shouldn’t ever happen, at all… select() failing with EBADF doesn’t mean there’s some socket error on the FD, or something; it means that the FD is literally bogus, and not usable, at all… Usually, because you’ve probably already closed it, and failed to remove it from your fd_sets, or something… But, no, it doesn’t suggest a simple error with the connection, at all…”

Can I stop now and get out of Twistedland?

Please …

by: Greg Hazel http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2606 Mon, 21 Aug 2006 22:15:25 +0000

So, I’m the developer who switched BitTorrent over to using Twisted.

The problem BitTorrent had is that select() sometimes throws an error if you pass it too many file descriptors. It results in ENOBUFS. This happens particularly on Win9x, and even more so with applications like NOD32 installed. Because Python’s select wrapper sucks, the lists you passed it will not be modified, so initially the BitTorrent client would eat 100% CPU (repeating the same call with the same list). They discovered this problem and simply threw the ‘tcp stack flaking out’ message, which is incorrect.
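
For illustration only, here is one way an event loop could avoid the 100% CPU spin when select() fails immediately with ENOBUFS. This is a Python 3 sketch, not what BitTorrent or Twisted actually do: the wrapper name safe_select, the backoff value and the injectable select_fn parameter are all assumptions made for this example.

```python
import errno
import select
import time


def safe_select(rlist, wlist, timeout, select_fn=select.select, backoff=0.05):
    """Call select and return (readable, writable) lists.

    On EINTR, report that nothing is ready. On ENOBUFS, sleep briefly
    before reporting that nothing is ready, so that a caller looping on
    the same descriptor lists does not spin. Other errors propagate.
    """
    try:
        readable, writable, _ = select_fn(rlist, wlist, [], timeout)
        return readable, writable
    except OSError as exc:
        if exc.errno == errno.EINTR:
            return [], []
        if exc.errno == errno.ENOBUFS:
            time.sleep(backoff)
            return [], []
        raise
```

Backing off only hides the symptom. If the failure really comes from passing too many descriptors, the caller still has to pass fewer of them or move to another mechanism.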

I switched to Twisted so I could get IOCP, which works better and faster than select(), and does not run into this problem. Twisted is “fast” enough to hit the exact same error with select, which I reported here:

http://twistedmatrix.com/trac/ticket/1228

by: Stephen Thorne http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2587 Mon, 21 Aug 2006 04:16:02 +0000

I find it astounding that you’ve not presented the basic solution for this problem: using select to find out when a socket is writable. You seem to have alluded to this in the text, but haven’t bothered to present it to the reader:

from select import select

# readables and writeables are dicts of {fd: ConnectionAbstraction()}
rlist, wlist, ignored = select(list(readables), list(writeables), [], timeout)
for r in rlist:
    readables[r].readData()
for w in wlist:
    writeables[w].writeData()

where writeData does buffering correctly, limits each .send to 128*1024 bytes (to stop Windows from flaking out on a big .send()), and stops when it runs out of data to send or the socket buffer is full. There’s a bunch of error conditions you have to be mindful of: EINTR for signals, EWOULDBLOCK and ENOBUFS for the send, EBADF for the select.
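
A writeData along those lines might look like the following Python 3 sketch. The Connection class, its queue method and the MAX_SEND name are hypothetical, made up to illustrate the description above. Treating ENOBUFS the same way as EWOULDBLOCK (stop and wait for the next writable event) is an assumption, not a tested recipe for every platform.

```python
import errno
import socket

MAX_SEND = 128 * 1024  # per-call cap described in the text


class Connection:
    """Hypothetical connection abstraction with an application-level buffer."""

    def __init__(self, sock):
        sock.setblocking(False)
        self.sock = sock
        self.out = bytearray()

    def queue(self, data):
        """Append data to the outgoing buffer without touching the socket."""
        self.out += data

    def writeData(self):
        """Send queued data until it runs out or the socket buffer is full.

        Returns the number of bytes handed to the kernel in this call.
        """
        sent_total = 0
        while self.out:
            try:
                sent = self.sock.send(bytes(self.out[:MAX_SEND]))
            except (BlockingIOError, InterruptedError):
                break  # EWOULDBLOCK or EINTR: wait for the next select
            except OSError as exc:
                if exc.errno != errno.ENOBUFS:
                    raise
                break  # assumption: treat ENOBUFS like a full buffer
            del self.out[:sent]
            sent_total += sent
        return sent_total
```

The loop only calls writeData when select reports the socket as writable, and application code calls queue instead of sending directly, so nothing writes to the socket outside the write handler.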

I see Allegra doesn’t handle all the error conditions optimally.

How does Allegra handle EBADF and ENOBUFS? EBADF could be raised from _io_select() if the file descriptor is closed on you, ENOBUFS could be raised from Dispatcher.handle_write(), and if that happens I’m pretty certain from reading the code that you’ll close the socket….

by: Laurent Szyster http://laurentszyster.be/blog/tcp-stack-flaking-out/#comment-2534 Fri, 18 Aug 2006 14:34:27 +0000

>You seem to have written Allegra in the wrong language according to your own logic!

That’s not my logic, it’s the logic of a Mac user who can compare against an available Java peer, and goes out of their way to find out why the 4.0 twisted release could be so slow. I can tell it’s not Python.

According to my results, and you have everything at hand to test that, raw async_core dispatchers can saturate the OS on a loopback device. That’s around what? With buffers between 16KB and 4KB, I reached between 1.2MBps and 0.8MBps on the loopback device under XP.

That’s well above the 512Kbps upload bandwidth of a cable-modem peer. With the appropriate buffers, even the added processing of async_net or async_chat buffering I/O may not prevent distributed Allegra application peers from saturating their 100Mbps network by all chatting together.

I found out about “TCP stack flaking out” while stressing the PNS metabase and the HTTP/1.1 server and clients, as they were simple and “almost finished but not quite yet”, just working enough. It happened as soon as the MIME body producer filled a buffer much larger than the TCP send window for many clients concurrently, or with pipelined protocols that fill up the buffer too fast. As soon as you cache content in memory this “bug” must be fixed, because moving bytes in the application address space is much faster than the network.

The trouble with Python is that there *must* be less code in order to run fast enough. Allegra passed that test, moving bytes in and out of memory faster than the OS can hand them over to a starved network.

Python is still as slow. But its C parts are damn fast.

Here less is *practically* more.
