[OmniOS-discuss] iSCSI traffic suddenly comes to a halt and then resumes
Matej Zerovnik
matej at zunaj.si
Fri May 22 09:50:58 UTC 2015
After having trouble almost every week and always missing the window to
catch the bastard, today I finally had the opportunity to catch it in
action :)
As it turns out, it looks like a ZFS (not likely) or HW (probably)
problem. While in the "hangup" state, iSCSI and the network worked
flawlessly and I was able to connect to iSCSI (but mounting the FS and
issuing commands (show lvm volume, ...) was really slow). I was also
able to work on the server, so it wasn't locked up.
Then I decided to check the ZFS FS. I tried to create a file in the ZFS
mount directory by issuing 'touch test-file' and the command froze. I
tried to kill it with CTRL+C with no success. I tried to kill the
process with kill -9, but that did not help either. Looking at the
iostat output, there was some reading happening, but absolutely no
writes (0, nada).
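
For anyone who wants to catch this kind of stall without sitting at the
console, the 'touch test-file' check above could be scripted roughly as
below. This is only a sketch under assumptions: the function name
probe_write, the probe file name and the 5 second timeout are made up
for illustration and are not something I actually ran on this server.
The write runs in a child process so that a frozen filesystem blocks the
child and not the script itself.

```python
import os
import subprocess
import tempfile
import time


def probe_write(directory, timeout_s=5.0, poll_s=0.1):
    """Time a small create-and-delete in `directory`.

    Returns the elapsed seconds on success, or None if the write did not
    finish within `timeout_s` or the child exited with an error.
    """
    # Hypothetical probe file name; any unused name would do.
    path = os.path.join(directory, ".write-probe-%d" % os.getpid())
    start = time.monotonic()
    # Do the write in a child process so a stalled filesystem blocks the
    # child, not this script.
    child = subprocess.Popen(
        ["sh", "-c", 'touch "$1" && rm -f "$1"', "sh", path]
    )
    while time.monotonic() - start < timeout_s:
        if child.poll() is not None:
            if child.returncode == 0:
                return time.monotonic() - start
            return None
        time.sleep(poll_s)
    # Deliberately no child.wait() after this: a process blocked inside
    # the kernel does not act on signals, so waiting could block forever.
    child.kill()
    return None
```

Calling probe_write("/path/to/zfs/mount") from cron or a loop and
logging every None result, together with the iostat output at that
moment, would at least give a timestamp for when writes stopped.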
I used 'lsiutils' to connect to my LSI HBA and issued a port reset and a
hard SAS link reset in the hope it would come back, but it was still
frozen. I also checked the 'phy counters' in lsiutils, and there were
some devices with errors, but that could be due to the port / link reset.
Long story short, after 30 min everything returned to normal, without
any error message in the logs or anywhere else. The bad thing is, the
iSCSI target froze a few minutes later and the only way to resolve the
trouble was to restart the server :(
Matej
On 12. 05. 2015 07:13, Matej Zerovnik wrote:
> I know building a single 50-drive RaidZ2 is a bad idea. As I said,
> it's a legacy that I can't easily change. I already have a backup pool
> with 7x10-drive RaidZ2 to which I hope I will be able to switch this
> week. I hope to get some better results and less crashing...
>
> What is interesting is that when the 'event' happens, the server works
> normally and ZFS is accessible and writable (at least, there are no
> errors in the log files); only iSCSI reports errors and drops the
> connection. Another interesting thing is that after the 'event', all
> writes stop and only reads continue for another 30 min. After 30 min,
> all traffic stops for half an hour. After that, everything starts
> coming back up... Weird?!
>
> Matej
>
> On 09. 05. 2015 02:49, Richard Elling wrote:
>>
>>> On May 5, 2015, at 9:48 AM, Matej Zerovnik <matej at zunaj.si> wrote:
>>>
>>> I will replace the hardware in about 4 months with all SAS drives,
>>> but I would love to have a working setup for the time being as well ;)
>>>
>>> I looked at the SMART stats and there don't seem to be any errors.
>>> Also, no hard/soft/transfer errors are reported by any drive. I will
>>> take a look at service times tomorrow, and maybe put the drives into
>>> graphite and look at them over a longer period.
>>>
>>> I looked at the iostat -x output today and the stats for the pool
>>> itself reported 100% busy most of the time, 98-100% wait, 500-1300
>>> transactions in queue, around 500 active,... The first line, which
>>> is the average since boot, says the avg service time is around
>>> 1600ms, which seems like a lot. Can it be due to a really big queue?
>>>
>>> Would it help to create five 10-drive raidz pools instead of one
>>> with 50 drives?
>>
>> It is a bad idea to build a single raidz set with 50 drives. Very bad.
>> Hence the zpool man page says, "The recommended number is between 3
>> and 9 to help increase performance." But this recommendation applies
>> to reliability, too.
>> -- richard
>>