9th September 2013 – Tony de Figueiredo
Often system administrators, DBAs and application owners are quick to blame the Storage at the slightest hint of performance degradation.
On the receiving end, storage administrators have to spend many hours troubleshooting and proving that their Storage was not the cause.
Yet we rarely keep an eye on the Storage Area Network infrastructure, the glue that connects hosts to storage.
Storage Area Networks are made up of Fibre Channel switches and directors, using the FC, FCIP, iFCP, iSCSI and FCoE protocols.
Basic storage and host management tools cannot pinpoint Storage Area Network problems such as latency, slow draining devices, flapping HBAs, damaged cables or a failing GBIC/SFP.
It’s essential to understand Fibre Channel transmission and the steps to investigate these errors. A few errors in a Storage Area Network, such as CRC Errors, Class-3 Frame Discards, Code Violation Errors, Loss of Sync and Slow Draining Devices, can result in performance degradation and even an outage.
What are CRC Errors?
The CRC error count is the number of frames that have failed their Cyclic Redundancy Check. The CRC is a four-byte field that verifies the data integrity of the frame header and Data Field.
By looking at the logs, you can diagnose the cause of most CRC errors. For example, “enc out” errors on their own imply a faulty fibre cable, while “enc out” and “crc err” errors together imply a faulty GBIC/SFP.
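The rule of thumb above can be written as a small decision function. This is a minimal sketch, assuming you have read a port's “enc out” and “crc err” counters twice and pass in how much each one increased between readings; the function name and the returned labels are hypothetical, and only the two cases described in the text are encoded.

```python
def suspect_component(enc_out_delta: int, crc_err_delta: int) -> str:
    """Map increases in a port's error counters to a likely faulty component.

    Encodes only the rule of thumb from the article:
      - "enc out" errors alone          -> suspect the fibre cable
      - "enc out" + "crc err" together  -> suspect the GBIC/SFP
    Any other non-zero combination is left for manual investigation.
    """
    if enc_out_delta > 0 and crc_err_delta > 0:
        return "GBIC/SFP"
    if enc_out_delta > 0:
        return "fibre cable"
    if crc_err_delta > 0:
        return "investigate"  # not covered by the rule of thumb
    return "healthy"
```

For example, `suspect_component(12, 0)` points at the cable, while `suspect_component(5, 3)` points at the optic. Treat the result as a starting point for troubleshooting, not a verdict.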
What are “Loss of Sync” errors?
Every time an administrator disables/enables a port, reboots or power-cycles a host or storage array, or disconnects/reconnects a fibre cable, the port logs “loss sig” (loss of signal) and “loss sync” (loss of synchronization) errors. These counters are therefore expected to increase during planned maintenance.
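Because these errors are expected whenever a link is deliberately bounced, one way to separate planned events from suspicious ones is to check their timestamps against your maintenance windows. A minimal sketch, assuming you already collect a timestamp for each observed “loss sync”/“loss sig” increase and keep a list of maintenance windows; the `Window` alias and the function name are hypothetical.

```python
from datetime import datetime

# Hypothetical representation: a maintenance window is a (start, end) pair.
Window = tuple[datetime, datetime]


def unexplained_sync_losses(events: list[datetime], windows: list[Window]) -> list[datetime]:
    """Return the loss-of-sync events that fall outside every maintenance window."""
    return [
        t for t in events
        if not any(start <= t <= end for start, end in windows)
    ]


# Usage: one event during planned work, one at 03:15 the next morning.
windows = [(datetime(2013, 9, 9, 22, 0), datetime(2013, 9, 9, 23, 0))]
events = [datetime(2013, 9, 9, 22, 30), datetime(2013, 9, 10, 3, 15)]
print(unexplained_sync_losses(events, windows))  # only the 03:15 event remains
```

Events left in the result were not explained by planned work, so they are the ones worth correlating with cable, optic or HBA problems.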
What are “Code Violation” Errors?
Code Violation errors are bit errors caused by corruption in the transmitted frame sequence, such as character corruption. Typical causes are a failing Host Bus Adapter, an optic that is degrading before it fails completely, or incompatible speeds between two points. They also appear when a Host Bus Adapter or GBIC/SFP is flapping and about to fail, causing an imminent Loss of Signal; again, this is typical of optical degradation within the SAN infrastructure.
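Since code violations often show up while an optic or HBA is degrading, it can help to watch the trend of the counter rather than a single reading. A sketch, assuming you poll a port's cumulative code-violation counter at regular intervals; the function name and the "rose in each of the last three intervals" rule are illustrative choices, not a vendor threshold.

```python
def steadily_increasing(samples: list[int], min_intervals: int = 3) -> bool:
    """Return True if a cumulative error counter rose in each of the last
    `min_intervals` polling intervals.

    `samples` are successive readings of the same counter, oldest first.
    A counter that keeps climbing, rather than jumping once, is treated here
    as a hint of progressive degradation. A counter reset (e.g. after the
    statistics are cleared) produces a negative delta and returns False.
    """
    if len(samples) < min_intervals + 1:
        return False  # not enough history to judge a trend
    recent = samples[-(min_intervals + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    return all(d > 0 for d in deltas)
```

For example, readings of `[0, 4, 9, 15, 22]` are flagged, while `[0, 0, 0, 30, 30]` (a single burst, such as one planned cable swap) is not.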
What are “Class-3 Frame Discard” errors?
Class-3 is a connectionless datagram service based on frame switching. Class-3 frame discards are caused by destination address errors, where a frame cannot be routed to its destination.
Class-3 discards can also indicate zoning conflicts, where a frame has been transmitted but cannot reach its destination device. These are caused by legacy zones, zoning mistakes or decommissioned devices left in the configuration, and they lead to Class-3 frame discards and degraded throughput. Such zones are known as “hanging zones”, and the recommendation is to delete all hanging zones from the zone set configuration.
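A hanging zone can be spotted by comparing each zone's members with the devices actually logged in to the fabric. A minimal sketch, assuming WWPN-based zoning, a zone configuration exported as a mapping from zone name to member WWPNs, and a set of WWPNs currently registered in the fabric name server; the data layout, function name and example WWPNs are hypothetical.

```python
def find_hanging_zones(zones: dict[str, set[str]],
                       logged_in: set[str]) -> dict[str, set[str]]:
    """Return zones that reference WWPNs not currently logged in to the fabric.

    zones:     zone name -> set of member WWPNs
    logged_in: WWPNs currently present in the fabric
    The result maps each affected zone to its missing members.
    """
    hanging = {}
    for name, members in zones.items():
        missing = members - logged_in
        if missing:
            hanging[name] = missing
    return hanging


# Usage with made-up WWPNs: the old host has been decommissioned.
zones = {
    "z_host1_array": {"10:00:00:00:c9:00:00:01", "50:06:01:60:00:00:00:01"},
    "z_oldhost_array": {"10:00:00:00:c9:00:00:99", "50:06:01:60:00:00:00:01"},
}
logged_in = {"10:00:00:00:c9:00:00:01", "50:06:01:60:00:00:00:01"}
print(find_hanging_zones(zones, logged_in))
```

A device may be offline only temporarily (for example during a reboot), so confirm each candidate before deleting the zone.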
What are “Slow Draining Devices”?
Slow draining devices are devices that request more data than they can consume. This can be because they have slower link rates than the rest of the environment, or because other factors within the device prevent it from operating at its optimal bandwidth.
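A slow draining device, or an Inter Switch Link (ISL) with a low physical data rate, can impact the whole fabric. Fibre Channel uses buffer-to-buffer credits for flow control: a port may only send a frame while it holds a credit, and the receiving port returns a credit once it has freed the buffer. A slow draining device returns credits slowly, so frames queue up in the switch and back up across the ISLs, delaying other devices that share those links. The longer the delay caused by the device in returning credits to the switch, the more severe the problem.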
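To see why the credit-return delay matters, buffer-to-buffer credits can be treated as a sliding window: with N credits, at most N frames can be in flight before the sender must wait. This is a deliberately simplified model that ignores protocol overhead; the function name is hypothetical, and the full-size frame of about 2148 bytes, the 8 credits and the roughly 800 MB/s link in the usage note are illustrative assumptions, not defaults for any particular switch.

```python
def credit_limited_throughput(bb_credits: int, frame_bytes: int,
                              credit_return_s: float,
                              link_bytes_per_s: float) -> float:
    """Approximate best-case throughput (bytes/s) under buffer-to-buffer
    credit flow control.

    The sender may have at most `bb_credits` frames outstanding, and each
    credit comes back `credit_return_s` seconds after its frame was sent,
    so at most bb_credits * frame_bytes can be sent per credit_return_s.
    The physical link rate caps the result.
    """
    credit_bound = bb_credits * frame_bytes / credit_return_s
    return min(link_bytes_per_s, credit_bound)


# Usage: same link and credits, healthy vs. slow credit return.
fast = credit_limited_throughput(8, 2148, 10e-6, 800e6)  # link-bound
slow = credit_limited_throughput(8, 2148, 1e-3, 800e6)   # credit-bound
print(fast, slow)
```

With a 10 µs credit return the link itself is the bottleneck, but when the device takes 1 ms to return credits the same port is capped at about 17 MB/s, which is why a single slow drainer can hold up traffic queued behind it.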
The issues above can all impact your Storage Area Network. To reduce errors in your fabric and keep your Storage Area Network infrastructure healthy, consider the following recommendations:
- Plan your Storage Area Network infrastructure (e.g. check for Slow Draining Devices),
- Allow enough bandwidth between your fabrics to reduce bottlenecks,
- Schedule regular Storage Area Network Health Checks,
- Install central management software to keep an “eagle eye” over your Storage Area Network infrastructure,
- Set up “Call Home”, SNMP Traps or email alerting,
- Keep the Storage Area Network FC Switches/Directors and the Host Bus Adapters firmware/drivers up to date,
- Verify the correct fibre cable specifications,
- Follow the installation guides,
- Keep dust off unused GBIC/SFPs and fibre cables,
- Remove “hanging zones” from the environment,