Tuesday 30 May 2017

Exadata Infiniband Switch Port Issue



Sometimes we encounter Infiniband port-related issues. These alerts can be triggered from OEM or any other monitoring tool.
 
Sample Alert from OEM 12c:

Example 1:
Port xx on dm01sw-ib3.netsoftmate.com is disconnected from port xx 

Example 2:
Cable is present on Port xx but the port is disabled.

This document provides the steps to resolve Infiniband Switch Port related issues mentioned above.

Unless otherwise stated, run all the commands from compute node 1.

Identify the Problematic Infiniband Switch Port

  • Using OEM 12c


Log in to OEM 12c using a web browser of your choice

Click on Target -> Exadata
From the list select the appropriate Exadata Cluster
From the left pane expand “IB Network”
Select the Infiniband switch that has the problem.

This displays the switch status. Any port with an issue is marked in RED.


From the picture above we can see that there is an issue with Port 35 on Infiniband Switch “dm01sw-ib3”


  • Using IB Switch Commands

Verify-Topology

Exadata ships with a utility, /opt/oracle.SupportTools/ibdiagtools/verify-topology, that validates the InfiniBand network layout.

Verify the InfiniBand topology using the following command from a database server or Exadata Storage Server:

[root@dm01db01]# cd /opt/oracle.SupportTools/ibdiagtools/
[root@dm01db01]# ./verify-topology

The verify-topology utility can identify the following network connection problems:

  • Missing InfiniBand cable 
  • Missing InfiniBand connection
  • Incorrectly-seated cable 
  • Cable connected to the wrong endpoint
[root@dm01db01]# cd /opt/oracle.SupportTools/ibdiagtools/

[root@dm01db01]# ./verify-topology

        [ DB Machine Infiniband Cabling Topology Verification Tool ]
                [Version IBD VER 2.d ]
External non-Exadata-image nodes found:
...will check for ZFS if on SSC - else ignore

Found 2 leaf, 1 spine, 0 top spine switches

Check if all hosts have 2 HCAs to different switches...............[SUCCESS]
Leaf switch check: cardinality and even distribution..............[SUCCESS]
Spine switch check: Are any Exadata nodes connected ..............[SUCCESS]
Spine switch check: Any inter spine switch links..................[SUCCESS]
Spine switch check: Any inter top-spine switch links..............[SUCCESS]
Spine switch check: Correct number of spine-leaf links............[SUCCESS]
Leaf switch check: Inter-leaf link check..........................[SUCCESS]
Leaf switch check: Correct number of leaf-spine links.............[SUCCESS]


 In the example above, there are NO ERRORS reported.
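
When the tool is run unattended, the check lines can be filtered so that only the ones that did not pass are shown. The following is a minimal sketch and not part of Oracle's tooling: failed_checks is a hypothetical helper name, and it assumes that a failing check ends in a bracketed tag other than [SUCCESS], since the sample above shows only passing checks.

```shell
#!/bin/sh
# failed_checks (hypothetical helper): read verify-topology output on stdin
# and print every check line whose trailing [TAG] is not [SUCCESS].
# Assumption: failing checks use the same "dots then [TAG]" layout as passing ones.
failed_checks() {
    awk '/\.\.\.+\[[^]]*\][ \t]*$/ && !/\[SUCCESS\][ \t]*$/'
}

# Demo on lines copied from the sample output; nothing is printed
# because both checks passed.
failed_checks <<'EOF'
Found 2 leaf, 1 spine, 0 top spine switches
Leaf switch check: Inter-leaf link check..........................[SUCCESS]
Leaf switch check: Correct number of leaf-spine links.............[SUCCESS]
EOF
```

On a database server this would be used as ./verify-topology | failed_checks, where empty output means every check passed.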

 Listlinkup

Run the listlinkup command on the problematic Infiniband switch to check whether each cabled port is up or down:

[root@dm01db01]# ssh root@dm01sw-ib3
[root@dm01sw-ib3 ~]# listlinkup

[root@dm01sw-ib3 ~]# listlinkup
Connector  0A Not present
Connector  1A Not present
Connector  2A Not present
Connector  3A Not present
Connector  4A Not present
Connector  5A Not present
Connector  6A Present <-> Switch Port 35 is down (AutomaticHighErrorRate)
Connector  7A Present <-> Switch Port 33 is up (Enabled)
Connector  8A Present <-> Switch Port 31 is up (Enabled)
Connector  9A Present <-> Switch Port 14 is up (Enabled)
Connector 10A Present <-> Switch Port 16 is up (Enabled)
Connector 11A Present <-> Switch Port 18 is up (Enabled)
Connector 12A Not present
Connector 13A Not present
Connector 14A Present <-> Switch Port 07 is up (Enabled)
Connector 15A Not present
Connector 16A Not present
Connector 17A Present <-> Switch Port 01 is up (Enabled)
Connector  0B Not present
Connector  1B Not present
Connector  2B Not present
Connector  3B Not present
Connector  4B Not present
Connector  5B Present <-> Switch Port 29 is up (Enabled)
Connector  6B Not present
Connector  7B Present <-> Switch Port 34 is up (Enabled)
Connector  8B Not present
Connector  9B Present <-> Switch Port 13 is up (Enabled)
Connector 10B Present <-> Switch Port 15 is up (Enabled)
Connector 11B Present <-> Switch Port 17 is up (Enabled)
Connector 12B Not present
Connector 13B Present <-> Switch Port 10 is up (Enabled)
Connector 14B Not present
Connector 15B Not present
Connector 16B Present <-> Switch Port 04 is up (Enabled)
Connector 17B Present <-> Switch Port 02 is up (Enabled)

There is an issue with port 35 (connector 6A) on the Infiniband Switch “dm01sw-ib3”: the port is down with the reason AutomaticHighErrorRate.
This needs to be addressed.
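
With 36 connectors in the listing it is easy to misread a line, so it can help to filter the output down to cabled ports that are not up. This is a rough sketch under the assumption that listlinkup lines always follow the layout shown above; down_ports is a hypothetical helper name, not a switch command.

```shell
#!/bin/sh
# down_ports (hypothetical helper): read listlinkup output on stdin and print
# "connector port reason" for every cabled port that is not reported as up.
down_ports() {
    awk '/Present <-> Switch Port/ && !/ is up / {
        reason = $NF                 # last field, e.g. (AutomaticHighErrorRate)
        gsub(/[()]/, "", reason)     # strip the surrounding parentheses
        print $2, $7, reason         # connector, switch port, reason
    }'
}

# Demo on three lines copied from the listing above.
down_ports <<'EOF'
Connector  5A Not present
Connector  6A Present <-> Switch Port 35 is down (AutomaticHighErrorRate)
Connector  7A Present <-> Switch Port 33 is up (Enabled)
EOF
# prints: 6A 35 AutomaticHighErrorRate
```

From compute node 1 the same filter could be fed over ssh, for example ssh root@dm01sw-ib3 listlinkup | down_ports.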

Ibswitches

Use this command to get the Infiniband switch LID number.

[root@dm01sw-ib3 ~]# ibswitches
Switch  : 0x002128469deca0a0 ports 36 "SUN DCS 36P QDR dm01sw-ib3 10.213.23.85" enhanced port 0 lid 3 lmc 0
Switch  : 0x002128469e45a0a0 ports 36 "SUN DCS 36P QDR dm01sw-ib2 10.213.23.84" enhanced port 0 lid 1 lmc 0

Here the lid number for dm01sw-ib3 is 3.
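
If the LID is needed in a script rather than read by eye, it can be pulled out of the ibswitches output by switch name. The snippet below is a minimal sketch that assumes the line layout shown above; lid_for_switch is a hypothetical helper name.

```shell
#!/bin/sh
# lid_for_switch (hypothetical helper): read ibswitches output on stdin and
# print the LID of the switch whose description contains the name given in $1.
lid_for_switch() {
    awk -v name="$1" '
        $1 == "Switch" && index($0, " " name " ") {
            # The LID is the field that follows the literal word "lid".
            for (i = 1; i < NF; i++)
                if ($i == "lid") { print $(i + 1); exit }
        }'
}

# Demo on the two lines shown above.
lid_for_switch dm01sw-ib3 <<'EOF'
Switch  : 0x002128469deca0a0 ports 36 "SUN DCS 36P QDR dm01sw-ib3 10.213.23.85" enhanced port 0 lid 3 lmc 0
Switch  : 0x002128469e45a0a0 ports 36 "SUN DCS 36P QDR dm01sw-ib2 10.213.23.84" enhanced port 0 lid 1 lmc 0
EOF
# prints: 3
```

On the switch this would be run as ibswitches | lid_for_switch dm01sw-ib3.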

Ibportstate

Use this command to identify the port state.

[root@dm01sw-ib3 ~]# ibportstate 3 35
PortInfo:
# Port info: Lid 3 port 35
LinkState:.......................Down
PhysLinkState:...................Disabled
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................2.5 Gbps

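From the output above we can see that the port is disabled (LinkState is Down, PhysLinkState is Disabled) and the link speed is reduced (LinkSpeedActive is 2.5 Gbps although 10.0 Gbps is supported and enabled).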
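
The individual fields can also be extracted for use in a monitoring script. The following is a small sketch that assumes the "Name:....value" layout shown above; port_field is a hypothetical helper name. It returns the first match only, so when a peer section is present the value comes from the local port.

```shell
#!/bin/sh
# port_field (hypothetical helper): read ibportstate output on stdin and print
# the value of the field named in $1 (first occurrence only).
port_field() {
    awk -v key="$1" '
        {
            n = index($0, ":")
            if (n && substr($0, 1, n - 1) == key) {
                val = substr($0, n + 1)
                sub(/^\.+/, "", val)   # drop the dot leader before the value
                print val
                exit
            }
        }'
}

# Demo on lines copied from the output above.
state='PortInfo:
# Port info: Lid 3 port 35
LinkState:.......................Down
PhysLinkState:...................Disabled
LinkSpeedActive:.................2.5 Gbps'

printf '%s\n' "$state" | port_field LinkState        # prints: Down
printf '%s\n' "$state" | port_field PhysLinkState    # prints: Disabled
printf '%s\n' "$state" | port_field LinkSpeedActive  # prints: 2.5 Gbps
```

The field name is matched exactly, so asking for LinkState does not pick up the PhysLinkState line.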

Getportstatus

Use this command to get the port status.

[root@dm01sw-ib3 ~]# getportstatus 35
Port status for connector 6A Switch port 35
Adminstate:......................Disabled (AutomaticHighErrorRate)
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState:.......................Down
PhysLinkState:...................Disabled
LinkSpeedActive:.................2.5 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
NeighborMTU:.....................4096
OperVLs:.........................VL0


  • Steps to resolve the IB Port Issue

Autodisable is a feature that can disable connectors in the presence of high error rates or suboptimal link speed or width.
The feature does not cause any issues by itself; it alerts the customer to the abnormal status of a connector.
The autodisable feature was introduced in firmware 2.1 and does not apply to firmware 1.3.
Before upgrading or downgrading the firmware to a different version, check whether any auto-disabled ports exist and, if so, re-enable them using enableswitchport --automatic. This ensures compatible settings when moving between firmware versions.

Problematic Infiniband switch details:
Switch name              :           dm01sw-ib3
Firmware version         :           2.1.3-4
Port number             :           35
Lid number                :           3

This solution applies to Infiniband switch firmware version “2.1.3-4”.

To reenable an autodisabled connector or IB switch port, on the leaf switch dm01sw-ib3 do the following:

[root@dm01sw-ib3 ~]# enableswitchport --automatic Switch 35
Enable connector 6A Switch port 35
Adminstate:......................Enabled
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState:.......................Down
PhysLinkState:...................PortConfigurationTraining
LinkSpeedActive:.................2.5 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
NeighborMTU:.....................4096
OperVLs:.........................VL0
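
When several ports have been auto-disabled, for example ahead of a firmware change, the commands can be generated from the listlinkup output instead of being typed one by one. This is a dry-run sketch only: reenable_commands is a hypothetical helper name, it prints commands without running them, and it matches only the AutomaticHighErrorRate reason seen in this article; any other auto-disable reason would need its own pattern.

```shell
#!/bin/sh
# reenable_commands (hypothetical helper): read listlinkup output on stdin and
# print one enableswitchport command per port that was auto-disabled for a
# high error rate. The command syntax is the one used in this article.
reenable_commands() {
    awk '/Switch Port/ && /\(AutomaticHighErrorRate\)/ {
        print "enableswitchport --automatic Switch " $7
    }'
}

# Demo on two lines copied from the earlier listlinkup output.
reenable_commands <<'EOF'
Connector  6A Present <-> Switch Port 35 is down (AutomaticHighErrorRate)
Connector  7A Present <-> Switch Port 33 is up (Enabled)
EOF
# prints: enableswitchport --automatic Switch 35
```

Printing the commands first leaves room to review them, and to investigate why the error rate was high, before anything is re-enabled.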


  • Verify

Now verify the port status using the following commands.

Ibportstate command

[root@dm01sw-ib3 ~]# ibportstate 3 35
PortInfo:
# Port info: Lid 3 port 35
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Peer PortInfo:
# Port info: Lid 3 DR path slid 65535; dlid 65535; 0,35 port 2
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps

Getportstatus command

[root@dm01sw-ib3 ~]# getportstatus 35
Port status for connector 6A Switch port 35
Adminstate:......................Enabled
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkSpeedActive:.................10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
NeighborMTU:.....................4096
OperVLs:.........................VL0
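
The three fields that changed after the fix (Adminstate, LinkState and PhysLinkState) can be turned into a simple pass/fail check. The snippet below is a minimal sketch that assumes the getportstatus layout shown above; port_ok is a hypothetical helper name, and the expected values are the ones from this article's healthy output.

```shell
#!/bin/sh
# port_ok (hypothetical helper): read getportstatus output on stdin and
# succeed only if the port is Enabled, Active and LinkUp.
port_ok() {
    awk -F: '
        { val = $2; sub(/^\.+/, "", val); sub(/[ \t\r]+$/, "", val) }
        $1 == "Adminstate"    { admin = val }
        $1 == "LinkState"     { link  = val }
        $1 == "PhysLinkState" { phys  = val }
        END { exit !(admin == "Enabled" && link == "Active" && phys == "LinkUp") }'
}

# Demo on lines copied from the output above.
if port_ok <<'EOF'
Port status for connector 6A Switch port 35
Adminstate:......................Enabled
LinkState:.......................Active
PhysLinkState:...................LinkUp
EOF
then
    echo "port 35 looks healthy"
else
    echo "port 35 needs attention"
fi
```

On the switch this would be used as getportstatus 35 | port_ok, with the exit status deciding whether to raise an alert.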

Listlinkup command

[root@dm01sw-ib3 ~]# listlinkup
Connector  0A Not present
Connector  1A Not present
Connector  2A Not present
Connector  3A Not present
Connector  4A Not present
Connector  5A Not present
Connector  6A Present <-> Switch Port 35 is up (Enabled)
Connector  7A Present <-> Switch Port 33 is up (Enabled)
Connector  8A Present <-> Switch Port 31 is up (Enabled)
Connector  9A Present <-> Switch Port 14 is up (Enabled)
Connector 10A Present <-> Switch Port 16 is up (Enabled)
Connector 11A Present <-> Switch Port 18 is up (Enabled)
Connector 12A Not present
Connector 13A Not present
Connector 14A Present <-> Switch Port 07 is up (Enabled)
Connector 15A Not present
Connector 16A Not present
Connector 17A Present <-> Switch Port 01 is up (Enabled)
Connector  0B Not present
Connector  1B Not present
Connector  2B Not present
Connector  3B Not present
Connector  4B Not present
Connector  5B Present <-> Switch Port 29 is up (Enabled)
Connector  6B Not present
Connector  7B Present <-> Switch Port 34 is up (Enabled)
Connector  8B Not present
Connector  9B Present <-> Switch Port 13 is up (Enabled)
Connector 10B Present <-> Switch Port 15 is up (Enabled)
Connector 11B Present <-> Switch Port 17 is up (Enabled)
Connector 12B Not present
Connector 13B Present <-> Switch Port 10 is up (Enabled)
Connector 14B Not present
Connector 15B Not present
Connector 16B Present <-> Switch Port 04 is up (Enabled)
Connector 17B Present <-> Switch Port 02 is up (Enabled)

Conclusion
In this article we have learned various Infiniband Switch commands to identify the port status and resolve port-related issues.

Monday 29 May 2017

Exadata Infiniband Switch ILOM Snapshot



When working with Oracle Support on an Infiniband Switch Hardware Service Request, Oracle Support requests that you upload an ILOM snapshot to properly assess the hardware failure. Starting with Exadata X4 and higher, you can collect the snapshot for an Infiniband Switch using the web browser interface.

In this article I will demonstrate the steps to collect ILOM snapshot data for an Infiniband Switch by connecting to the switch with a web browser.

Steps to collect ILOM Snapshot for IB Switch

  • Open a web browser (use something other than Internet Explorer) and enter the Infiniband Switch hostname.

Note: There is NO *-ILOM* in the hostname.

  • Enter root as User Name and its password and click on Log In.

 

Note: You may see browser security warnings. Ignore or override them by clicking I understand the risks / Add exception / Confirm Security Exception.

  • Select Maintenance -> Snapshot

  • This will take you to the Server Snapshot Utility page shown below

 


On the above screen, select Data Set “Normal”, select Transfer Method “Browser” and click “Run”.

Normal - Specifies that ILOM, operating system, and hardware information is collected.
The download file will be saved according to your browser settings.

Important Note:  Do not enable this option: 'Collect Only Log Files from Data Set'.  Doing so will limit the snapshot to a much smaller sub-section of log files.
 


  • In the dialog box, specify the directory to which to save the file and the file name.

Click OK. The file is saved to the specified directory.
 



  • Upload the zip to Oracle Support SR for review.


Conclusion
In this article we have learned how to collect the ILOM snapshot diagnostic data for an Infiniband Switch. Oracle Support commonly requests this snapshot when investigating hardware issues on an IB switch.
 
