From:   SMTP%"CARROLL@pppl.gov"  5-DEC-2001 09:16:25.01
To:     RONEY@pppl.gov, TGIBNEY@pppl.gov
CC:     TCARROLL@pppl.gov
Subj:   Notes for Europa's ethernet interface FASTFD 100MBIT Full Duplex

Phyllis and Tom,

This is for your notes file:

The reboot of Europa yesterday, after the CPUs were swapped and the
cards reseated, created a new and mysterious problem. The console
ethernet device eia0 was set to FASTFD, but the PPLCC hub changed its
speed to 100 half duplex. This created a very slow cluster and a Europa
that ran at one tenth its normal speed. Many tasks on all cluster nodes
sat in RWSCS state (lock management wait) for very long periods of
time. Many disks were also going into mount verify timeout across the
cluster.

On Europa, the command

  $ mcr ncp show known line counters

displayed send failures at a rate of about 1/sec.

  $ mcr ncp zero known line counters  ! to zap them to get the current
                                      ! situation

Bob Persing and Ken Tindall worked on this and said the problem was
that Europa was now allowing auto-negotiate on its NIC. Compaq did
update the console firmware to version 6, but they did not think it
would change the NIC settings. The display of those settings looked
normal from the Europa console's point of view.

After various tests, the PPLCC hub port was changed to always
auto-negotiate with the NIC. Europa was still at FASTFD on its console
and was unchanged in all these tests. We powered off Europa and
rebooted several times, and it always came up OK if the hub was set to
auto-neg. In tests where the PPLCC hub was set to 100 full duplex, it
failed every time.

We have decided to leave the PPLCC hub with the auto-neg setting for
the Europa port. This is a new problem, so it may happen with other
cluster nodes. I suspect the new Cabletron hubs installed on Nov 19
created this new behavior, because we never had an issue like this
before with Europa.

We did have an issue with IO where it changed its speed while IO was up
and running.
Tom Gibney was involved with this issue when I was out some time ago.
After a week it was traced to the network speed issue.

If this occurs when the experiment is running, it will degrade the
cluster into a non-functional state, because the lock management can't
be done in a timely way. In other words, we need reliable networking to
run.

---Tom

**************************************************************
* Tom Carroll              | Princeton Plasma Physics Lab    *
* Analyst/Programmer       ! Computer Division               *
* Email: tcarroll@pppl.gov ! Office phone (609) 243-3419     *
**************************************************************
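
For the notes file, a minimal sketch in Python of the arithmetic behind
the "about 1/sec" send-failure figure quoted above: zero the line
counters, wait a known interval, read the send-failure counter again,
and divide the difference by the elapsed time. The function name and
the sample readings are hypothetical and are not taken from Europa's
actual counters; reading the counter itself is still done by hand with
NCP.

```python
def send_failure_rate(count_start, count_end, elapsed_seconds):
    """Return send failures per second between two counter readings."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if count_end < count_start:
        # A smaller later reading usually means the counters were zeroed
        # (or wrapped) between the two samples, so the rate is meaningless.
        raise ValueError("counter went backwards between readings")
    return (count_end - count_start) / elapsed_seconds


# Hypothetical readings: counters zeroed, then 62 send failures 60 s later.
rate = send_failure_rate(0, 62, 60.0)
print(f"{rate:.2f} send failures/sec")
```

A rate that stays well above zero after the counters are zeroed suggests
the problem is ongoing rather than left over from an earlier event.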