[OmniOS-discuss] iSCSI target hang, no way to restart but server reboot

Tue Mar 31 12:54:37 UTC 2015

We were primarily using the machines for serving iscsi to VMs, and we'd see
bad cascading failures (iscsi lun timeouts would cause the watchdog to kick
in on the linux hosts, resetting the initiator, meanwhile the VM would
decide that the virtio devices in the VM were dead, requiring a client
reboot). In some cases, the problems would happen across all luns, in
others it would be just particular luns. I assume this followed the
severity of the situation with the failing drive (or number of failing
drives before we got aggressive about replacement). Similarly, we'd see a
range of behaviors with local pool commands, ranging from everything
looking alright to zpool commands hanging or running *extremely* slowly.

I'd hacked up some quick scripts to correlate info from the different
sources. They are here:
https://github.com/narayandesai/diy-lsi
They may or may not be portable, but demonstrate all of the info gathering
methods we found useful. Another thing that was useful was maintaining a
pool inventory (stored somewhere else) with device addresses, serial
numbers, and jbod bay mappings. Having to map that out when things are
falling apart is seriously sad times.
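
As a rough illustration of the inventory idea (not taken from the diy-lsi
scripts), here is a minimal Python sketch that joins a stored inventory
against per-device error counts. The CSV column names, device names, serial
numbers, and the sample text are all made up; the sample is only assumed to
resemble the per-device summary line of iostat -En, so check it against real
output before relying on the pattern.

```python
import csv
import io
import re

# Hypothetical inventory: column names and rows are invented for illustration.
INVENTORY_CSV = """device,serial,jbod,bay
c1t0d0,Z1X0AAAA,jbod0,3
c1t1d0,Z1X0BBBB,jbod0,4
"""

# Sample text shaped like the per-device summary line of `iostat -En`
# (an assumption; verify against real output on your system).
IOSTAT_SAMPLE = """c1t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t1d0 Soft Errors: 0 Hard Errors: 2 Transport Errors: 7
"""

ERROR_LINE = re.compile(
    r"^(\S+)\s+Soft Errors:\s*(\d+)\s+Hard Errors:\s*(\d+)"
    r"\s+Transport Errors:\s*(\d+)"
)


def load_inventory(text):
    """Return a dict mapping device name -> inventory row."""
    return {row["device"]: row for row in csv.DictReader(io.StringIO(text))}


def devices_with_errors(iostat_text):
    """Yield (device, soft, hard, transport) for devices with a nonzero count."""
    for line in iostat_text.splitlines():
        m = ERROR_LINE.match(line)
        if m is None:
            continue
        soft, hard, transport = (int(x) for x in m.groups()[1:])
        if soft or hard or transport:
            yield m.group(1), soft, hard, transport


def suspects(inventory, iostat_text):
    """Describe each erroring device along with its physical location."""
    out = []
    for dev, soft, hard, transport in devices_with_errors(iostat_text):
        row = inventory.get(dev)
        if row is None:
            where = "not in inventory"
        else:
            where = "%s bay %s, serial %s" % (row["jbod"], row["bay"], row["serial"])
        out.append("%s (%s): soft=%d hard=%d transport=%d"
                   % (dev, where, soft, hard, transport))
    return out


inventory = load_inventory(INVENTORY_CSV)
for line in suspects(inventory, IOSTAT_SAMPLE):
    print(line)
```

Keeping the inventory as a flat file stored off the box means it can still be
read when the pool commands themselves are hanging.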

fwiw, you might still be ok with seagate drives; we were only using the
self-check predictive failure flag, as opposed to anything more
complicated.
good luck
 -nld

On Tue, Mar 31, 2015 at 5:08 AM, Matej Zerovnik <matej at zunaj.si> wrote:

>
> On 27. 03. 2015 16:13, Narayan Desai wrote:
>
>> Having been on the receiving end of similar advice, it is a frustrating
>> situation to be in, since you have (and will likely continue to have) the
>> hardware in production, without much option for replacement.
>>
>> When we had systems like this, we had a lot of success being aggressive
>> in swapping out disks that were showing signs of going bad, even before
>> critical failures occurred. We also looked at SMART statistics and
>> aggressively replaced drives flagged there. This made the situation manageable.
>> Basically, having sata drives in sas expanders means the system is brittle,
>> and you should treat it as such. Look for:
>>  - errors in iostat -En
>>  - high service times in iostat -xnz
>>  - smartctl (this causes harmless sense messages when devices are probed,
>> but it is easy enough to ignore these)
>>  - any errors reported out of lsiutil, showing either problems with
>> cabling/enclosures, or devices
>>  - decode any sense errors reported by the lsi driver
>>
>> Aggressively replace devices implicated by these, and hope for the best.
>> The best may or may not be what you're hoping for, but may be livable; it
>> was for us.
>>
> When errors happened to you, were you able to use the pool itself and
> only iscsi target froze or did you have troubles with the pool itself as
> well...
>
> Because on our end, when iscsi target freezes, zpool is perfectly ok. We
> can access it and use it locally, but iscsi target is frozen and can't be
> restarted.
>
> I will check my system with iostat and smartctl, but we are using seagate
> drives, so some of the smartctl stats are useless at first sight :)
>
> Matej
>