Monday, 29 May 2017

Troubleshooting Exadata Infiniband Switch "Port xx has xx total errors"

We received the following alert from OEM reporting errors on port 36 of IB switch dm01sw-iba01.

Alert message from OEM 12c:

Host=dm01db01.netsoftmate.com
Target type=Oracle Infiniband Switch
Target name=dm01sw-iba01.netsoftmate.com
Categories=Error
Message=Port 36 has 10 total errors, crossed warning (10) or critical ( ) threshold.
Severity=Warning
Event reported time=May 29, 2017 2:11:14 AM CDT
Target Lifecycle Status=Production
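
When several of these alerts arrive, it can help to pull the port number and error count out of the Message line automatically. The snippet below is a minimal sketch in Python; the pattern is inferred only from the single sample message above, and the names ALERT, ALERT_RE and parse_port_alert are illustrative, not part of OEM.

```python
import re

# Message line copied from the alert above.
ALERT = "Message=Port 36 has 10 total errors, crossed warning (10) or critical ( ) threshold."

# Pattern inferred from this one sample; other OEM versions may word the message differently.
ALERT_RE = re.compile(
    r"Port (?P<port>\d+) has (?P<errors>\d+) total errors, "
    r"crossed warning \(\s*(?P<warning>\d*)\s*\) "
    r"or critical \(\s*(?P<critical>\d*)\s*\) threshold"
)

def parse_port_alert(message):
    """Return port, error count and thresholds as ints; an empty threshold becomes None."""
    match = ALERT_RE.search(message)
    if match is None:
        return None
    return {key: int(value) if value else None
            for key, value in match.groupdict().items()}

alert = parse_port_alert(ALERT)
print(alert)
```

In this alert the critical threshold is empty, so only the warning threshold of 10 was crossed.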


The following IB commands can be used to investigate a problem with an IB port.

Troubleshooting steps:

  • Log in to the affected IB switch as the root user using PuTTY
login as: root
root@dm01sw-iba01.netsoftmate.com's password:
Last login: Wed May 17 01:36:17 2017 from dm01db01.netsoftmate.com
You are now logged in to the root shell.
It is recommended to use ILOM shell instead of root shell.
All usage should be restricted to documented commands and documented
config files.
To view the list of documented commands, use "help" at linux prompt.

 
  • Use the listlinkup command to check the port status.
[root@dm01sw-iba01 ~]# listlinkup
Connector  0A Not present
Connector  1A Not present
Connector  2A Not present
Connector  3A Not present
Connector  4A Not present
Connector  5A Not present
Connector  6A Present <-> Switch Port 35 is up (Enabled)
Connector  7A Present <-> Switch Port 33 is up (Enabled)
Connector  8A Present <-> Switch Port 31 is up (Enabled)
Connector  9A Present <-> Switch Port 14 is up (Enabled)
Connector 10A Present <-> Switch Port 16 is up (Enabled)
Connector 11A Present <-> Switch Port 18 is up (Enabled)
Connector 12A Not present
Connector 13A Present <-> Switch Port 09 is up (Enabled)
Connector 14A Present <-> Switch Port 07 is up (Enabled)
Connector 15A Present <-> Switch Port 05 is up (Enabled)
Connector 16A Present <-> Switch Port 03 is up (Enabled)
Connector 17A Present <-> Switch Port 01 is up (Enabled)
Connector  0B Not present
Connector  1B Not present
Connector  2B Not present
Connector  3B Not present
Connector  4B Not present
Connector  5B Not present
Connector  6B Present <-> Switch Port 36 is up (Enabled)
Connector  7B Present <-> Switch Port 34 is up (Enabled)
Connector  8B Present <-> Switch Port 32 is down (Enabled)
Connector  9B Present <-> Switch Port 13 is up (Enabled)
Connector 10B Present <-> Switch Port 15 is up (Enabled)
Connector 11B Present <-> Switch Port 17 is up (Enabled)
Connector 12B Present <-> Switch Port 12 is up (Enabled)
Connector 13B Present <-> Switch Port 10 is up (Enabled)
Connector 14B Present <-> Switch Port 08 is up (Enabled)
Connector 15B Present <-> Switch Port 06 is up (Enabled)
Connector 16B Present <-> Switch Port 04 is up (Enabled)
Connector 17B Present <-> Switch Port 02 is up (Enabled)


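From the above output we can see that port 36 (connector 6B) is up and Enabled, so listlinkup reports no issue for it. Note that port 32 (connector 8B) is shown as down, but it is not the port named in the alert.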
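
With 36 connectors it is easy to overlook a single line, so the listlinkup output can also be checked by a script. The following is a rough sketch in Python that parses saved listlinkup text; the line format is assumed from the output above, and parse_listlinkup is a hypothetical helper name.

```python
import re

# Excerpt of the listlinkup output shown above.
SAMPLE = """\
Connector  5B Not present
Connector  6B Present <-> Switch Port 36 is up (Enabled)
Connector  7B Present <-> Switch Port 34 is up (Enabled)
Connector  8B Present <-> Switch Port 32 is down (Enabled)
"""

# Matches only connectors that have a cable present.
LINE_RE = re.compile(
    r"Connector\s+(?P<connector>\w+)\s+Present <-> Switch Port (?P<port>\d+) "
    r"is (?P<link>up|down) \((?P<admin>\w+)\)"
)

def parse_listlinkup(text):
    """Map switch port number -> (connector, link state, admin state)."""
    ports = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:  # "Not present" connectors do not match and are skipped
            ports[int(match["port"])] = (
                match["connector"], match["link"], match["admin"])
    return ports

ports = parse_listlinkup(SAMPLE)
down_ports = [port for port, (_, link, _) in ports.items() if link == "down"]
print("Port 36:", ports[36])
print("Ports that are down:", down_ports)
```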


  • Use the ibportstate and getportstatus commands to check the port state
ibportstate takes the LID of the switch and the port number, so first identify the LID of the affected IB switch with ibswitches.
Here the LID for IB switch dm01sw-iba01 is 1.

[root@dm01sw-iba01 ~]# ibswitches
Switch  : 0x0010e0650e2ea0a0 ports 36 "SUN DCS 36P QDR dm01sw-iba01 10.21.50.2" enhanced port 0 lid 1 lmc 0
Switch  : 0x0010e0650d90a0a0 ports 36 "SUN DCS 36P QDR dm01sw-ibb01 10.21.50.3" enhanced port 0 lid 2 lmc 0
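
The LID lookup above can also be scripted. This is a minimal sketch in Python that assumes the ibswitches line layout shown here; find_lid is a hypothetical helper, and the switch name is matched as a whole word inside the quoted node description.

```python
import re

# ibswitches output copied from above.
SAMPLE = '''\
Switch  : 0x0010e0650e2ea0a0 ports 36 "SUN DCS 36P QDR dm01sw-iba01 10.21.50.2" enhanced port 0 lid 1 lmc 0
Switch  : 0x0010e0650d90a0a0 ports 36 "SUN DCS 36P QDR dm01sw-ibb01 10.21.50.3" enhanced port 0 lid 2 lmc 0
'''

# Captures the GUID, the quoted node description and the LID of each switch.
SWITCH_RE = re.compile(
    r'^Switch\s*:\s*(?P<guid>0x[0-9a-f]+) ports \d+ "(?P<desc>[^"]*)".*\blid (?P<lid>\d+)'
)

def find_lid(text, switch_name):
    """Return the LID of the switch whose description contains switch_name, else None."""
    for line in text.splitlines():
        match = SWITCH_RE.match(line)
        if match and switch_name in match["desc"].split():
            return int(match["lid"])
    return None

print(find_lid(SAMPLE, "dm01sw-iba01"))
```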

[root@dm01sw-iba01 ~]# ibportstate 1 36
PortInfo:
# Port info: Lid 1 port 36
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Peer PortInfo:
# Port info: Lid 1 DR path slid 65535; dlid 65535; 0,36 port 2
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps

[root@dm01sw-iba01 ~]# getportstatus 36
Port status for connector 6B Switch port 36
Adminstate:......................Enabled
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkSpeedActive:.................10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
NeighborMTU:.....................4096
OperVLs:.........................VL0-3


From the above output we can see that port 36 is Enabled, its link state is Active, and it is running at the full 4X width and 10.0 Gbps speed, so no issues are reported.
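
The same check can be expressed in code. The sketch below, in Python, parses the dotted "Name:....Value" lines of a single getportstatus section and compares them with the healthy values seen above. The helper names are hypothetical, and the expected values are simply those from this output (4X at 10.0 Gbps), which may differ on other hardware. For ibportstate output, each section (local and peer) would need to be parsed separately, because both use the same field names.

```python
import re

# Part of the getportstatus output shown above.
SAMPLE = """\
Port status for connector 6B Switch port 36
Adminstate:......................Enabled
LinkWidthActive:.................4X
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkSpeedActive:.................10.0 Gbps
"""

# A field line is a name, a colon, a run of dots, then the value.
FIELD_RE = re.compile(r"^(?P<key>\w+):\.+(?P<value>.+)$")

def parse_port_fields(text):
    """Turn 'Name:....Value' lines into a dict; other lines are ignored."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match["key"]] = match["value"].strip()
    return fields

def link_looks_healthy(fields):
    """Compare against the values seen in the healthy output above (an assumption)."""
    return (fields.get("LinkState") == "Active"
            and fields.get("PhysLinkState") == "LinkUp"
            and fields.get("LinkWidthActive") == "4X"
            and fields.get("LinkSpeedActive") == "10.0 Gbps")

fields = parse_port_fields(SAMPLE)
print(fields)
print("Healthy:", link_looks_healthy(fields))
```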


  • Use the ibdiagnet command to check the fabric for link quality problems and errors.
[root@dm01sw-iba01 ~]# ibdiagnet
Loading IBDIAGNET from: /usr/lib/ibdiagnet1.2
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
Loading IBDM from: /usr/lib/ibdm1.2
-I- Using port 0 as the local port.
-I- Discovering ... 17 nodes (2 Switches & 15 CA-s) discovered.


-I---------------------------------------------------
-I- Bad Guids/LIDs Info
-I---------------------------------------------------
-I- skip option set. no report will be issued

-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found

-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-I- No illegal PM counters values were found

-I---------------------------------------------------
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I---------------------------------------------------

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x0001 QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- No members found for group
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- No members found for group

-I---------------------------------------------------
-I- Bad Links Info
-I- No bad link were found
-I---------------------------------------------------
----------------------------------------------------------------
-I- Stages Status Report:
    STAGE                                    Errors Warnings
    Bad GUIDs/LIDs Check                     0      0
    Link State Active Check                  0      0

    Performance Counters Report              0      0
    Partitions Check                         0      0
    IPoIB Subnets Check                      0      2

Please see /tmp/ibdiagnet.log for complete log
----------------------------------------------------------------

-I- Done. Run time was 15 seconds.



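From the above output we can see that every stage reports zero errors. The only two warnings come from the IPoIB Subnets Check ("No members found for group") and are not specific to port 36.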
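
The Stages Status Report at the end of the ibdiagnet output is the quickest part to check, and it can be summarized programmatically. Below is a small sketch in Python that assumes the column layout shown above (stage name, then error and warning counts); parse_stage_report is a hypothetical helper, not part of ibdiagnet.

```python
import re

# Stages Status Report copied from the ibdiagnet output above.
SAMPLE = """\
-I- Stages Status Report:
    STAGE                                    Errors Warnings
    Bad GUIDs/LIDs Check                     0      0
    Link State Active Check                  0      0
    Performance Counters Report              0      0
    Partitions Check                         0      0
    IPoIB Subnets Check                      0      2

Please see /tmp/ibdiagnet.log for complete log
"""

# Indented stage name, at least two spaces, then the two counters.
STAGE_RE = re.compile(
    r"^\s+(?P<stage>\S.*?)\s{2,}(?P<errors>\d+)\s+(?P<warnings>\d+)\s*$"
)

def parse_stage_report(text):
    """Map stage name -> (errors, warnings); the header and other lines are skipped."""
    stages = {}
    for line in text.splitlines():
        match = STAGE_RE.match(line)
        if match:
            stages[match["stage"]] = (int(match["errors"]), int(match["warnings"]))
    return stages

stages = parse_stage_report(SAMPLE)
flagged = {name: counts for name, counts in stages.items() if any(counts)}
print("Stages with errors or warnings:", flagged)
```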


Conclusion:
In this article we have learned how to use several IB switch commands to check an IB port for errors or other issues.

