Wednesday, June 04, 2003

Monday was a pretty good day - work went well, and we spent it troubleshooting Friday night's difficulties. After more work on Tuesday, and many tests, we were unable to reproduce the problem, but I think I have at least a rough understanding of what happened:
1> BrightStor (BEB) connects to the W2K server. Several months ago, when backups would fail, the connection/port was being dropped due to corrupted packets. We changed the reconnect port and it began functioning.
2> Now it reconnects. A lot. If the network is stressed and the W2K server is dropping ports to BEB, it reconnects constantly - like 1500+ connections in under 90 minutes. Each connection termination means another corrupted packet, adding to the ones that caused the port to be dropped.
3> Thus we end up with hundreds of dead ports and only 1 or 2 live ones, huge IP input error numbers on the BEB server side, and huge IP output error numbers on the W2K side.
4> When the network is pushing 70-80% utilization, with 30% retransmits due to errors, and hundreds of ports are occupied by the reconnecting BEB software, the clients that need 24/7 access to this W2K server sometimes get dropped or refused - which takes the client offline.
5> When the client is offline, it doesn't talk to the Data General server, which (since there is no response to its queries) decides that its network card is dead and locks up. *sigh*
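
To put the reconnect count from step 2 in perspective, here is a quick back-of-the-envelope sketch in Python. It uses only the two figures quoted above (1500+ connections, under 90 minutes), so the results are lower bounds. The function and constant names are made up for illustration and have nothing to do with BrightStor itself.

```python
# Figures taken from step 2 above; both are bounds, not exact measurements.
RECONNECTS = 1500       # "1500+ connections" -> treated as a minimum
WINDOW_MINUTES = 90     # "in under 90 minutes" -> treated as a maximum


def reconnect_rate(reconnects, window_minutes):
    """Return (reconnects per minute, seconds between reconnects)."""
    per_minute = reconnects / window_minutes
    seconds_between = (window_minutes * 60) / reconnects
    return per_minute, seconds_between


per_minute, seconds_between = reconnect_rate(RECONNECTS, WINDOW_MINUTES)
print(f"at least {per_minute:.1f} reconnects per minute")        # at least 16.7 reconnects per minute
print(f"a new connection at least every {seconds_between:.1f} seconds")  # ... every 3.6 seconds
```

In other words, the backup software was opening a fresh connection roughly every three and a half seconds or faster, which fits with the hundreds of dead ports described in step 3.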