Take a step back and think about what is being seen. Ask the following questions: If the SAN has been working, what has changed? Apart from a catastrophic hardware failure, SANs typically do not just stop working – something must have changed.

What is the observed behavior compared to the expected behavior?

For example, a planned storage failover during system maintenance took eight minutes to complete, but was expected to complete in two minutes.

This observation should be performed on two levels – the overall problem with the system, e.g. users of the system experienced an outage of eight minutes, and the specific underlying problem, e.g. the storage controller is reporting an error.

Is the expected behavior supported by the storage and system providers?

Is a manually initiated path failover of two minutes supported by the storage and path management software providers? For example, a storage provider may only support two-minute failovers when their internal RAID controller cache is disabled.

• What are the exact symptoms? Make a list. For example:

  • Mouse pointer stopped moving for 30 seconds immediately after failover was initiated, then changed to an hourglass until the failover completed.
  • Path management software reported errors in the system error log 60 seconds after the failover was initiated.
  • HBA Fibre Channel link to switch dropped and did not recover immediately after failover was initiated.

Is the problem repeatable? Can it be set up on a non-production test system?

Collecting additional information, such as system error logs, is often necessary, which may require duplicating the problem. Since production systems are not generally set to collect this extra information in normal operation, it is important to be able to configure the system to collect data and recreate the problem. Obviously, it is best to do this on a non-production test system.

 

Step 2: Narrow Domain Where the Problem Lies

 

Once the problem is observed and analyzed, the next step is to determine where the problem originates. Eliminate possible problem sources on a coarse level:

 

Is the problem in the SAN or in the server?

Determining this cuts the size and scope of the troubleshooting effort to a fraction: if the SAN can be eliminated, only the server and its components need to be examined.

SAN components include the switch, the storage and the cabling. Clues that the problem lies in the SAN are:

        • Cabling:

  1. Bad cables – Noise generates CRC errors and other invalid data on the Fibre Channel link. This translates into slower performance on disks and failed backups on tapes.
  2. Old cables – 4Gb/s Fibre Channel cannot run as far as 2Gb/s Fibre Channel over the same cable.

        • Switch:

  1. Improper zoning – Devices are not discovered properly, SAN does not recover from disturbances.
  2. Link problems – Wrong speed, wrong topology, no link at all.

        • Storage:

  1. Performance – Common causes include too few spindles dedicated to the busiest Logical Units (LUNs) and too many hosts communicating with a particular storage port.
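The bad-cable clue above can be watched for programmatically: a CRC error counter that keeps climbing between two samples of a switch port's statistics is the classic signature of a marginal cable or connector. The sketch below is purely illustrative – the port names, counter values, and growth threshold are made up, not vendor figures:

```python
# Sketch: flag switch ports whose CRC error counters are still climbing
# between two samples -- a classic bad-cable signature. Port names,
# counter values, and the threshold are hypothetical.

def suspect_ports(before, after, threshold=10):
    """Return ports whose CRC error count grew by more than `threshold`
    between two samples of per-port counters ({port: crc_errors})."""
    return sorted(
        port for port, errs in after.items()
        if errs - before.get(port, 0) > threshold
    )

sample_1 = {"port4": 12, "port5": 0, "port9": 3021}
sample_2 = {"port4": 13, "port5": 0, "port9": 4980}  # port9 still climbing
print(suspect_ports(sample_1, sample_2))  # ['port9']
```

In practice the two samples would come from the switch's per-port statistics (e.g. its CLI or management interface), taken a few minutes apart under load.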

Server components include the HBA and its firmware, HBA device drivers, HBA management software, the operating system SCSI stack drivers, and path management software. Clues that the problem lies in the server are:

        • HBA and its firmware

  1. HBA and Firmware level not qualified by the storage and systems providers – unpredictable behaviors could result.
  2. Wrong HBA model – Some dual-channel HBAs are not supported by certain systems.
  3. Old firmware – Link problems (wrong speed, wrong topology, no link at all), failed tape backups when using the FCP-2 Sequence Level Recovery feature which is enabled automatically if the tape system supports it.

        • HBA device drivers

  1. Level and driver type not qualified by the storage and systems providers – For example, the Windows FC-Port driver is being used when the storage or system provider requires the Windows Storport driver. Or a brand-new driver version that the storage or system provider supports in general is installed, but the provider's compatibility matrix specifies an older driver for that particular configuration.
  2. Driver configuration parameters are set incorrectly – Erratic behaviors occur such as very long failure timeouts, storage being overrun by too many concurrent commands, and failures in path management software that relies on specific driver settings.
  3. Persistent Binding of targets is not correct – devices that are expected to show in the operating system storage management tool are not present.

        • HBA management software

  1. Management software package does not match driver version – Erratic behavior occurs with the HBA API management standard, firmware downloads, and HBAnyware functions. HBAnyware works best when it is matched with the driver it was packaged with. Follow the upgrade instructions below.

        • Operating system SCSI stack drivers

  1. Wrong OS patch level – Operating system providers often patch elements of the SCSI stack to fix problems and improve performance. OS providers generally maintain their knowledge base with problem and fix information so it is important to search there for information on problems being experienced. Occasionally, OS patches require coordinated changes in low-level device drivers from HBA makers to work properly.

        • Path management software

  1. Unsupported configuration – Incorrect SAN zoning or configuration causes unexpected behaviors.
  2. Unsupported HBA drivers – Tight integration between the software and HBA drivers is critical. The correct version of the HBA driver must be used for proper operation.
  3. Unsupported HBA driver settings – Failovers that take too long or trigger unnecessarily can be the result of incorrectly configured timers in the HBA driver. The correct settings in the HBA driver specified by the path management software provider must be used.
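Several of the server-side clues above come down to one question: is this exact driver/firmware pair on the provider's compatibility matrix? A minimal sketch of that check follows – the matrix contents and version strings are hypothetical, and in practice the authoritative source is the storage and system providers' published matrices:

```python
# Sketch: check an HBA driver/firmware pair against a provider's
# compatibility matrix. The matrix entries below are hypothetical;
# consult the providers' published matrices in practice.

QUALIFIED = {
    # (driver_version, firmware_version) pairs the provider has qualified
    ("8.2.0.3", "2.82a4"),
    ("8.2.0.3", "2.82a3"),
    ("8.1.10.9", "2.80a4"),
}

def is_qualified(driver, firmware):
    """True only if this exact driver/firmware pair is on the matrix."""
    return (driver, firmware) in QUALIFIED

print(is_qualified("8.2.0.3", "2.82a4"))  # True
print(is_qualified("8.2.0.3", "2.72a2"))  # False -- firmware not qualified
```

Note that a driver and a firmware that are each supported individually may still be an unqualified combination – the matrix qualifies pairs, not components.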
Checklist for a Common Connectivity Problem

Storage does not appear or is marked offline in the operating system storage tool

 

After setting up a new SAN or reconfiguring an existing SAN, connectivity problems occasionally occur. This is defined as storage volumes (devices) not appearing in the operating system storage management tool as being ready for use. The devices could be marked offline or not present at all. Below is a checklist that can be used to help solve connectivity problems.

        • Check for a good Fibre Channel link

 

Verifying a good link at the desired speed is the most overlooked troubleshooting step. Most HBAs have LED lights that provide an indication of the link status and current speed.
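On Linux hosts the same check can be done without looking at the LEDs: the Fibre Channel transport class exposes per-HBA link attributes under /sys/class/fc_host/hostN/ (e.g. port_state and speed). The helper below interprets those strings; the expected-speed value is an assumption for illustration:

```python
# Sketch: interpret the Linux fc_host sysfs link attributes
# (/sys/class/fc_host/hostN/port_state and .../speed).
# The expected speed "4 Gbit" is an illustrative assumption.

def link_ok(port_state, speed, expected_speed="4 Gbit"):
    """Return (ok, reason) given the contents of the sysfs
    port_state and speed attributes."""
    if port_state.strip() != "Online":
        return False, f"link is {port_state.strip()}, not Online"
    if speed.strip() != expected_speed:
        return False, f"link came up at {speed.strip()}, expected {expected_speed}"
    return True, "link is up at the expected speed"

print(link_ok("Online\n", "4 Gbit\n"))   # (True, 'link is up at the expected speed')
print(link_ok("Linkdown\n", "unknown"))  # (False, 'link is Linkdown, not Online')
```

On a live system, read the two strings from the sysfs files for each fc_host instance; a link at the wrong speed is just as much a clue as no link at all.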


        • Check switch zoning

Large switched fabrics are complicated to set up and maintain. Zoning is a key element of a good configuration, and it is also one of the most complicated to manage. An improperly zoned fabric can cause a variety of problems that are not always directly attributable to zoning. For example, a small disturbance on a large fabric with no zoning can cause connectivity problems because all the devices on the fabric will be trying to resolve the disturbance at the same time, which overloads the switch. It is important to verify that the device that is not being discovered is properly zoned.
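The core zoning check is simple to state: the initiator's WWPN and the target's WWPN must share at least one zone in the active zone set, or the fabric will not let them see each other. A sketch of that check follows – the zone names and WWPNs are made up for illustration:

```python
# Sketch: verify that an initiator (HBA) WWPN and a target (storage
# port) WWPN share at least one zone in the active zone set.
# Zone names and WWPNs are illustrative.

def zones_connecting(zoneset, initiator_wwpn, target_wwpn):
    """Return the zones containing both WWPNs; an empty result means
    the fabric will not let the two ports see each other."""
    return [name for name, members in zoneset.items()
            if initiator_wwpn in members and target_wwpn in members]

active_zoneset = {
    "zone_hostA_array1": {"10:00:00:00:c9:3c:f7:c8", "50:06:01:60:10:60:08:a2"},
    "zone_hostB_array1": {"10:00:00:00:c9:41:22:10", "50:06:01:60:10:60:08:a2"},
}
print(zones_connecting(active_zoneset,
                       "10:00:00:00:c9:3c:f7:c8",
                       "50:06:01:60:10:60:08:a2"))  # ['zone_hostA_array1']
```

On a real fabric, the zone set would be pulled from the switch (its CLI or management interface shows the active configuration) rather than written out by hand.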

 

        • Use HBA utility software to examine the devices discovered by the HBA and driver

In nearly all operating systems, the device driver for the Fibre Channel HBA is responsible for discovering all the devices in a SAN. These devices must then be assigned SCSI Target IDs and mapped to the operating system. The HBA utility software (provided with the HBA) offers the unique ability to see all the devices on the SAN that the device driver discovered, regardless of whether the devices were mapped to the operating system. This provides a good way of determining whether or not the device that is having connectivity problems is visible on the SAN.

Make sure your topology is set to point-to-point if you are connected to a fabric.
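Comparing what the driver discovered with what the operating system actually mapped narrows the fault quickly: a target WWPN in the first set but not the second points at a mapping or masking problem rather than a link or zoning problem. A sketch of that triage (all WWPNs are illustrative):

```python
# Sketch: compare the target WWPNs the HBA driver discovered on the
# fabric (as shown by the HBA utility) with those the OS actually
# mapped, and triage accordingly. All WWPNs are illustrative.

def triage(discovered, os_mapped, wanted):
    """Suggest where to look next for one target WWPN."""
    if wanted not in discovered:
        return "not on fabric: check link and zoning"
    if wanted not in os_mapped:
        return "discovered but not mapped: check persistent binding / LUN masking"
    return "visible to OS: check LUN configuration"

discovered = {"50:06:01:60:10:60:08:a2", "50:06:01:68:10:60:08:a2"}
os_mapped  = {"50:06:01:60:10:60:08:a2"}
print(triage(discovered, os_mapped, "50:06:01:68:10:60:08:a2"))
# discovered but not mapped: check persistent binding / LUN masking
```

The two input sets would come from the HBA utility's discovery view and the operating system's storage management tool, respectively.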

 

        • Check the storage configuration

When storage devices are presented in the operating system storage management tool, they are typically described as volumes or disks. The HBA device driver maps discovered storage ports on a SAN as SCSI Targets, while the Logical Units (LUNs) are discovered and managed by the operating system, not the HBA driver. If LUNs are configured on a particular storage controller port and that port is not discovered by the HBA, none of the LUNs associated with that port will be available for use.


Many storage systems provide security by requiring that each storage controller port be configured with the host server World Wide Names that will be communicating with that storage port. The storage controller will not allow a server with unknown World Wide Names to communicate with LUNs on the storage system. Further, some storage systems allow LUNs to be allocated to particular servers based on World Wide Name.

If a storage controller with this security and LUN mapping feature is not properly configured, the controller will allow the server HBA port to log in and communicate with the storage controller; however, the storage controller will not present any LUNs to the server. This means that the HBA driver will complete its discovery and map the Targets to the operating system, but the operating system will not find LUNs with which to communicate.
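The symptom just described – HBA logs in, targets are mapped, but the operating system sees no LUNs – usually means the host's WWPN is not registered for any LUNs on the controller. A sketch of that lookup follows; the storage-group layout and WWPNs are hypothetical:

```python
# Sketch: model LUN masking on a storage controller. A host WWPN only
# receives the LUNs of storage groups it is registered in; an
# unregistered WWPN can log in yet see no LUNs at all.
# The storage group layout and WWPNs are hypothetical.

storage_groups = {
    # storage group -> (registered host WWPNs, LUNs presented)
    "sg_hostA": ({"10:00:00:00:c9:3c:f7:c8"}, [0, 1, 2]),
    "sg_hostB": ({"10:00:00:00:c9:41:22:10"}, [3]),
}

def luns_for(wwpn):
    """LUNs the controller will present to this host WWPN."""
    return sorted(lun for hosts, luns in storage_groups.values()
                  if wwpn in hosts for lun in luns)

print(luns_for("10:00:00:00:c9:3c:f7:c8"))  # [0, 1, 2]
print(luns_for("10:00:00:00:c9:99:99:99"))  # [] -- logged in, but no LUNs presented
```

This is why the checklist below includes verifying that the HBAs are registered on the array and that the storage is actually assigned to the host.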

 

Basic Check List

More important than any of the points below: check whether anyone could have made changes without telling you.

1.     Check that the HBA is present in Device Manager

2.     Make sure there is a link light on the HBA

3.     Log in to the switch and make sure the port is seeing the connection

4.     Check the fibre cable is intact

5.     Make sure the correct settings are in the HBA BIOS (e.g. point-to-point for a fabric)

6.     Check your Zoning

7.     Check your HBAs are registered in Navisphere

8.     Check your driver and firmware versions

9.     Check other HBAs still have access to storage

10.  Check the storage is assigned to the Host