[OmniOS-discuss] iSCSI traffic suddenly comes to a halt and then resumes

Fri May 22 09:50:58 UTC 2015

After having trouble almost every week and always missing the window to 
catch the bastard, today I finally had the opportunity to catch it in 
action :)

As it turns out, it looks like a ZFS (not likely) or HW (probably) 
problem. While in the "hangup" state, iSCSI and the network worked 
flawlessly and I was able to connect to the iSCSI target (but mounting 
the FS and issuing commands (show LVM volumes, ...) was really slow). I 
was also able to work on the server, so it wasn't locked up.

Then I decided to check the ZFS FS. I tried to create a file in the ZFS 
mount directory by issuing 'touch test-file' and the command froze. I 
tried to kill it with CTRL+C, without success. I tried to kill the 
process with kill -9, but that did not help either. Looking at the iostat 
output, there was some reading happening, but absolutely no writes (0, nada).
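
To catch this state earlier next time, a write probe along the following lines could be run periodically. This is only a minimal sketch, not something used in the session described here: the function name probe_write, the 10 s timeout and the probe file name are made up for illustration. The write happens in a child process because, as seen above, a process stuck on a hung write may ignore even kill -9, so the parent only polls and never blocks waiting for the child.

```python
import os
import subprocess
import sys
import time

# Code run by the child: create a file, force it to stable storage, remove it.
_CHILD_CODE = """
import os, sys
fd = os.open(sys.argv[1], os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
os.write(fd, b"probe")
os.fsync(fd)
os.close(fd)
os.unlink(sys.argv[1])
"""


def probe_write(directory, timeout_s=10.0, poll_s=0.1):
    """Return the seconds needed to create and fsync a file in `directory`,
    or None if the write did not finish within `timeout_s` seconds."""
    # Hypothetical probe file name; any unused name in the directory works.
    path = os.path.join(directory, ".write-probe-%d" % os.getpid())
    start = time.monotonic()
    child = subprocess.Popen([sys.executable, "-c", _CHILD_CODE, path])
    while time.monotonic() - start < timeout_s:
        status = child.poll()  # non-blocking check, never waits on the child
        if status is not None:
            if status != 0:
                raise RuntimeError("write probe failed with exit status %d" % status)
            return time.monotonic() - start
        time.sleep(poll_s)
    # Best effort only: a child blocked inside the kernel may not die.
    child.kill()
    return None
```

Calling probe_write() with the ZFS mount directory returns the elapsed time on a healthy pool and None when the write hangs, which could then be logged together with a timestamp.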

I used 'lsiutils' to connect to my LSI HBA and issued a port reset, 
followed by a hard SAS link reset, in the hope that it would come back, 
but it was still frozen. I also checked the 'phy counters' in lsiutils, 
and there were some devices with errors, but that could be due to the 
port / link reset.

Long story short, after 30 min everything returned to normal, without an 
error message in the logs or anywhere else. The bad thing is, the iSCSI 
target froze a few minutes later and the only way to resolve the trouble 
was to restart the server :(

Matej

On 12. 05. 2015 07:13, Matej Zerovnik wrote:
> I know building a single 50-drive RaidZ2 is a bad idea. As I said, 
> it's a legacy setup that I can't easily change. I already have a backup 
> pool with 7x10-drive RaidZ2 to which I hope I will be able to switch 
> this week. I hope to get some better results and less crashing...
>
> What is interesting is that when the 'event' happens, the server works 
> normally and ZFS is accessible and writable (at least, there are no 
> errors in the log files); only iSCSI reports errors and drops the 
> connection. Another interesting thing is that after the 'event', all 
> writes stop and only reads continue for another 30 min. After 30 min all 
> traffic stops for half an hour. After that, everything starts coming 
> back up... Weird?!
>
> Matej
>
> On 09. 05. 2015 02:49, Richard Elling wrote:
>>
>>> On May 5, 2015, at 9:48 AM, Matej Zerovnik <matej at zunaj.si> wrote:
>>>
>>> I will replace the hardware in about 4 months with all SAS drives, 
>>> but I would love to have a working setup for the time being as well ;)
>>>
>>> I looked at the SMART stats and there don't seem to be any errors. 
>>> Also, no hard/soft/transfer errors are reported by any drive. I will 
>>> take a look at service times tomorrow, and maybe feed the drive stats 
>>> into Graphite and look at them over a longer period.
>>>
>>> I looked at iostat -x today and the stats for the pool itself 
>>> showed 100% busy most of the time, 98-100% wait, 500-1300 
>>> transactions in the queue, around 500 active,... The first line, which 
>>> is the average since boot, says the avg service time is around 
>>> 1600 ms, which seems like a lot. Can it be due to the really big queue?
>>>
>>> Would it help to create five 10-drive raidz pools instead of one with 
>>> 50 drives?
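
On the question of whether the queue alone can explain the service time: Little's law (outstanding requests = throughput x response time) suggests it can. Below is a rough sketch, assuming that svc_t in iostat -x covers both waiting and active time, and mixing the since-boot average (1600 ms) with the momentary queue figures quoted above, so the numbers are illustrative only. The function names are made up.

```python
def implied_iops(outstanding_ios, response_time_ms):
    """Little's law rearranged: throughput = outstanding requests / response time."""
    if response_time_ms <= 0:
        raise ValueError("response time must be positive")
    return outstanding_ios * 1000.0 / response_time_ms


def implied_response_time_ms(outstanding_ios, iops):
    """Little's law rearranged: response time = outstanding requests / throughput."""
    if iops <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 * outstanding_ios / iops


# Figures quoted above: ~500 active plus 500-1300 waiting, 1600 ms service time.
low = implied_iops(500 + 500, 1600)    # 625.0 IOPS
high = implied_iops(500 + 1300, 1600)  # 1125.0 IOPS
print("implied pool throughput: %.0f-%.0f IOPS" % (low, high))
```

In other words, a pool sustaining somewhere around 600-1100 IOPS with 1000-1800 requests outstanding would show roughly that service time even with every disk healthy, so the deep queue by itself may be enough to explain the number.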
>>
>> It is a bad idea to build a single raidz set with 50 drives. Very 
>> bad. Hence the zpool man page says, "The recommended number is 
>> between 3 and 9 to help increase performance." But this 
>> recommendation applies to reliability, too.
>>  -- richard
>>
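
To put numbers on the layouts discussed in this thread, here is a small sketch. The class name RaidzLayout and its fields are made up for illustration, and the 5x10 case assumes raidz2, which the original question does not specify. The 3 to 9 range is the one quoted from the zpool man page above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaidzLayout:
    vdevs: int           # number of raidz vdevs in the pool
    disks_per_vdev: int  # width of each vdev, parity disks included
    parity: int          # 1 for raidz, 2 for raidz2, 3 for raidz3

    @property
    def total_disks(self):
        return self.vdevs * self.disks_per_vdev

    @property
    def parity_disks(self):
        return self.vdevs * self.parity

    @property
    def data_disks(self):
        return self.total_disks - self.parity_disks

    def within_recommended_width(self, lo=3, hi=9):
        """True if the vdev width falls in the range quoted from the man page."""
        return lo <= self.disks_per_vdev <= hi


layouts = {
    "1x50 raidz2 (current)": RaidzLayout(vdevs=1, disks_per_vdev=50, parity=2),
    "5x10 raidz2 (assumed parity)": RaidzLayout(vdevs=5, disks_per_vdev=10, parity=2),
    "7x10 raidz2 (backup pool)": RaidzLayout(vdevs=7, disks_per_vdev=10, parity=2),
}

for name, layout in layouts.items():
    print(name, layout.total_disks, layout.data_disks, layout.parity_disks,
          layout.within_recommended_width())
```

The single 50-wide vdev spends only 2 disks on parity for 48 data disks, while the 7x10 pool spends 14 of its 70 disks on parity. Note that even a 10-wide vdev sits just outside the quoted 3 to 9 range. As a commonly cited rule of thumb (not something stated in this thread), random I/O tends to scale with the number of vdevs rather than the number of disks, which would favour the layouts with more vdevs.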
