[OmniOS-discuss] iSCSI traffic suddenly comes to a halt and then resumes

Tue May 5 07:46:01 UTC 2015

Hello!

Back again with a follow-up to 'iSCSI target hang, no way to restart 
but server reboot', where our iSCSI target froze at random and only a 
reboot helped.

Once we had had enough, we switched to new hardware and software:
- new server - IBM xServer 3550 M4 with 265GB memory and SAS HBA LSI 
Logic SAS2308 controller
- installed the latest OmniOS LTS(r151014)
- updated the firmware on LSI controller to version P19.

We kept our existing SATA hard drives in the Supermicro JBOD with a SAS 
expander.

After the upgrade, things ran smoothly for about a week with no errors 
in the logs.

After a week, some clients reported that their iSCSI drive had failed 
and been remounted read-only. Oddly, Nagios on our end did not report 
any anomaly. I looked at the OmniOS logs, and there was nothing related 
to iSCSI in them at all. After a while, all clients reconnected, so the 
iSCSI target did not crash like it used to.

Looking at the clients' logs, it seems there was a connection error:
Apr 29 10:33:53 317 kernel: connection1:0: detected conn error (1021)
Apr 29 10:33:54 317 iscsid: Kernel reported iSCSI connection 1:0 error 
(1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result 
of SCSI error recovery) state (3)
Apr 29 10:33:56 317 iscsid: connection1:0 is operational after recovery 
(1 attempts)
Apr 29 10:36:37 317 kernel: connection1:0: detected conn error (1021)
Apr 29 10:36:37 317 iscsid: Kernel reported iSCSI connection 1:0 error 
(1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result 
of SCSI error recovery) state (3)
Apr 29 10:36:40 317 iscsid: connection1:0 is operational after recovery 
(1 attempts)
Apr 29 10:36:50 317 kernel: sd 3:0:0:0: Device offlined - not ready 
after error recovery
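
To line these drops up against the traffic graph, the error timestamps 
can be pulled out of a client log with a few lines of Python. This is 
only a sketch: it assumes each syslog entry sits on a single line in the 
real log file (unlike the wrapped quote above), and the function name and 
the year argument are made up for illustration, since syslog timestamps 
carry no year.

```python
import re
from datetime import datetime

# Matches the kernel line that open-iscsi clients log when a connection drops,
# e.g. "Apr 29 10:33:53 317 kernel: connection1:0: detected conn error (1021)"
CONN_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<conn>connection\d+:\d+): detected conn error \((?P<code>\d+)\)"
)


def find_conn_errors(lines, year=2015):
    """Return (timestamp, host, connection, error_code) for each conn error line."""
    events = []
    for line in lines:
        m = CONN_ERROR.match(line)
        if m:
            # syslog omits the year, so it has to be supplied by the caller
            when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
            events.append((when, m["host"], m["conn"], int(m["code"])))
    return events


# The relevant lines from the log above, unwrapped
SAMPLE = """\
Apr 29 10:33:53 317 kernel: connection1:0: detected conn error (1021)
Apr 29 10:33:56 317 iscsid: connection1:0 is operational after recovery (1 attempts)
Apr 29 10:36:37 317 kernel: connection1:0: detected conn error (1021)
Apr 29 10:36:50 317 kernel: sd 3:0:0:0: Device offlined - not ready after error recovery
"""

events = find_conn_errors(SAMPLE.splitlines())
for when, host, conn, code in events:
    print(when.isoformat(), host, conn, code)
```

Running the same function over the logs of several clients would show 
whether they all lose the session at the same moment.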

As a test, I set up a ping from my workstation to the client's server 
and to our iSCSI target, to see whether there is a network problem when 
iSCSI drops. A week later it happened again. The pings went through 
without a problem and the Nagios check on the iSCSI port kept working, 
yet our traffic graph shows a 100% drop:
http://i59.tinypic.com/59vl10.png

I did not manage to catch the server in the 'down' state to investigate.

Searching the internet for the error the clients get, it looks like too 
many commands may have been sent and iSCSI timed out.
Our pool is made of about 40 drives in one RAIDZ vdev, so it can't do 
many IOPS. I suspect the clients send too many I/O requests, the server 
takes too long to respond, and iSCSI drops. Does that sound like a 
plausible explanation?
Is there a way to measure how many iSCSI commands are sent to the 
drives, to see if there is a peak when it happens?
Is there a way to measure how busy the disks are and whether they 
really can't return data fast enough?
What else should/can I check or monitor to find out what our problem is?
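
To put rough numbers on the IOPS suspicion, here is a back-of-the-envelope 
sketch. It leans on the common rule of thumb that a RAIDZ vdev delivers 
about the random IOPS of a single member drive. The per-drive figure and 
the backlog sizes are assumptions picked for illustration, not 
measurements from this pool.

```python
PER_DRIVE_IOPS = 100  # assumption: ballpark random IOPS of one 7200 rpm SATA drive
RAIDZ_VDEVS = 1       # from the post: all of the ~40 drives sit in a single RAIDZ vdev


def pool_random_iops(vdevs, per_drive_iops):
    # Rule of thumb: each RAIDZ vdev contributes about one drive's worth of random IOPS.
    return vdevs * per_drive_iops


def drain_seconds(outstanding_commands, iops):
    # Time to work through a backlog, assuming no new commands arrive meanwhile.
    return outstanding_commands / iops


pool_iops = pool_random_iops(RAIDZ_VDEVS, PER_DRIVE_IOPS)
for backlog in (500, 3000):  # hypothetical numbers of queued commands
    seconds = drain_seconds(backlog, pool_iops)
    print(f"{backlog} queued commands at ~{pool_iops} IOPS: ~{seconds:.0f} s to drain")
```

If the drain time for a realistic backlog approaches whatever command 
timeout the initiators use, the overload theory becomes plausible.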

Matej