[OmniOS-discuss] iSCSI traffic suddenly comes to a halt and then resumes
Matej Zerovnik
matej at zunaj.si
Tue May 5 07:46:01 UTC 2015
Hello!
I'm back with a follow-up to 'iSCSI target hang, no way to restart
but server reboot', where our iSCSI target froze at random and only a
reboot helped.
Once we had had enough, we switched to new hardware and software:
- new server - IBM xServer 3550 M4 with 265GB memory and SAS HBA LSI
Logic SAS2308 controller
- installed the latest OmniOS LTS (r151014)
- updated the firmware on the LSI controller to version P19.
We kept our SATA hard drives in the Supermicro JBOD with a SAS
expander.
After the upgrade, things worked smoothly for about a week with no
errors in the logs.
After a week, some clients reported that their iSCSI drives had failed
and been remounted read-only. Oddly, Nagios on our end did not report
any anomaly. I looked at the OmniOS logs, and there was nothing related
to iSCSI in them at all. After a while, all clients reconnected, so the
iSCSI target did not crash like it used to.
Looking at the clients' logs, it seems there was a connection error:
Apr 29 10:33:53 317 kernel: connection1:0: detected conn error (1021)
Apr 29 10:33:54 317 iscsid: Kernel reported iSCSI connection 1:0 error
(1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result
of SCSI error recovery) state (3)
Apr 29 10:33:56 317 iscsid: connection1:0 is operational after recovery
(1 attempts)
Apr 29 10:36:37 317 kernel: connection1:0: detected conn error (1021)
Apr 29 10:36:37 317 iscsid: Kernel reported iSCSI connection 1:0 error
(1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result
of SCSI error recovery) state (3)
Apr 29 10:36:40 317 iscsid: connection1:0 is operational after recovery
(1 attempts)
Apr 29 10:36:50 317 kernel: sd 3:0:0:0: Device offlined - not ready
after error recovery
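To line up these drops with the traffic graph, the client logs can be reduced to a list of outages. The following is only a rough sketch: it assumes the open-iscsi message format shown above with each message unwrapped onto one line, and the year is an assumption because syslog timestamps do not carry one. The names `outages` and `parse_ts` are made up for this example.

```python
import re
from datetime import datetime

# Excerpt of the client log above, with each message on a single line.
LOG = """\
Apr 29 10:33:53 317 kernel: connection1:0: detected conn error (1021)
Apr 29 10:33:56 317 iscsid: connection1:0 is operational after recovery (1 attempts)
Apr 29 10:36:37 317 kernel: connection1:0: detected conn error (1021)
Apr 29 10:36:40 317 iscsid: connection1:0 is operational after recovery (1 attempts)
"""

STAMP = r"(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d)"
ERROR = re.compile(
    STAMP + r" \S+ kernel: (?P<conn>connection\d+:\d+): detected conn error \((?P<code>\d+)\)"
)
RECOVER = re.compile(
    STAMP + r" \S+ iscsid: (?P<conn>connection\d+:\d+) is operational after recovery"
)


def parse_ts(text, year=2015):
    # Syslog omits the year, so it is supplied by the caller (assumed 2015 here).
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")


def outages(log_text):
    """Return (connection, error_time, seconds_until_recovery) tuples."""
    pending = {}  # connection name -> time of the first unrecovered error
    result = []
    for line in log_text.splitlines():
        m = ERROR.match(line)
        if m:
            pending.setdefault(m["conn"], parse_ts(m["ts"]))
            continue
        m = RECOVER.match(line)
        if m and m["conn"] in pending:
            start = pending.pop(m["conn"])
            gap = (parse_ts(m["ts"]) - start).total_seconds()
            result.append((m["conn"], start, gap))
    return result


for conn, start, gap in outages(LOG):
    print(f"{conn} dropped at {start} and recovered after {gap:.0f}s")
```

Running this over the logs of every client would show whether all initiators drop at the same moment, which would point at the target rather than at a single client.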
As a test, I set up pings from my workstation to the client's server and
to our iSCSI target, to see whether there is a network problem when iSCSI
drops. A week later it happened again. The pings went through without a
problem and the Nagios check on the iSCSI port kept working, yet our
traffic graph shows a 100% drop:
http://i59.tinypic.com/59vl10.png
I failed to catch the server in the 'down' state to investigate.
Searching the internet for the error the clients get, it looks like too
many commands may have been sent and iSCSI timed out.
Our pool is made of about 40 drives in one RAIDZ vdev, so we can't do
many IOPS. I suspect the clients send too many I/O requests, the server
takes too long to respond, and iSCSI fails. Does that sound like a
plausible explanation?
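A back-of-the-envelope estimate of why one wide vdev limits IOPS, written as a small sketch. It assumes the common rule of thumb that a RAIDZ vdev delivers roughly the random IOPS of a single member disk, and an assumed figure of about 100 random IOPS per SATA drive; neither number was measured on this pool. The four-vdev layout is purely hypothetical and only there for comparison.

```python
# Assumption: ~100 random IOPS per SATA drive (not measured on this pool).
ASSUMED_DISK_IOPS = 100


def pool_random_iops(vdev_count, iops_per_disk):
    # Rule of thumb (assumption): each RAIDZ vdev behaves like one disk
    # for small random I/O, so the pool scales with the number of vdevs,
    # not with the number of drives.
    return vdev_count * iops_per_disk


one_wide_vdev = pool_random_iops(1, ASSUMED_DISK_IOPS)      # ~40 drives, 1 vdev
four_narrow_vdevs = pool_random_iops(4, ASSUMED_DISK_IOPS)  # hypothetical layout

print("one wide RAIDZ vdev:    ~", one_wide_vdev, "random IOPS")
print("four narrower vdevs:    ~", four_narrow_vdevs, "random IOPS")
```

Under these assumptions the whole pool would handle on the order of a hundred random I/Os per second, so a burst from several clients could plausibly queue for long enough to trip the initiators' timeouts.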
Is there a way to measure how many iSCSI commands are sent to the drives,
to see whether there is a peak when it fails?
Is there a way to measure how busy the disks are, and whether they really
can't return data that fast?
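One possible way to look at disk load is the extended output of `iostat -xn 5`, where `%b` is the percentage of time a device was busy and `asvc_t` is the average service time in milliseconds. The sketch below parses that column layout and flags busy devices. The sample numbers are invented for illustration, and the thresholds (80% busy, 50 ms) are arbitrary assumptions, not recommended values.

```python
# Column layout of `iostat -xn` extended device statistics.
FIELDS = ["r/s", "w/s", "kr/s", "kw/s", "wait", "actv",
          "wsvc_t", "asvc_t", "%w", "%b", "device"]

# Invented sample data in the iostat -xn layout, for illustration only.
SAMPLE = """\
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   12.0   30.0  512.0 1024.0  0.0  0.4    0.0    9.5   0  21 c1t0d0
   95.0  110.0 3100.0 4200.0  0.0  9.8    0.0   47.8   0  99 c1t1d0
"""


def parse_iostat(text):
    """Turn iostat -xn data lines into dicts keyed by column name."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip banner lines, header lines and anything with the wrong width.
        if len(parts) != len(FIELDS) or parts[0] == "r/s":
            continue
        try:
            values = [float(p) for p in parts[:-1]]
        except ValueError:
            continue
        row = dict(zip(FIELDS, values))
        row["device"] = parts[-1]
        rows.append(row)
    return rows


def busy_disks(rows, busy_pct=80.0, svc_ms=50.0):
    # Thresholds are assumptions chosen for the example.
    return [r["device"] for r in rows
            if r["%b"] >= busy_pct or r["asvc_t"] >= svc_ms]


print(busy_disks(parse_iostat(SAMPLE)))
```

Logging this at a short interval and keeping the output would leave a record to compare against the time of the next drop, even if nobody is watching when it happens.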
What else should or can I check or monitor to find out what our problem is?
Matej