[OmniOS-discuss] [discuss] COMSTAR hanging
Matej Žerovnik
matej at zunaj.si
Wed Jun 1 11:26:19 UTC 2016
Hey there,
I was a bit too quick to call it solved. It crashed again a few days later.
What I discovered is that the number of iSCSI sessions starts climbing (one client opens tons of sessions), and when it reaches a certain number, the target crashes. When Dan and I looked at the memory dump, the client with the open sessions was the one holding the lock, so stmf/target won’t even restart (via svcadm restart).
On the attached graph (iscsi-sessions.png) we can see the number of sessions rising until the crash. The number of TCP sessions stays the same.
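
One way to spot a client that keeps opening sessions is to count sessions per initiator. The sketch below is only an illustration: it assumes text shaped like the output of `stmfadm list-target -v`, with one indented `Initiator:` line per session. The sample listing, the IQNs and the function name are made up.

```python
from collections import Counter


def sessions_per_initiator(listing: str) -> Counter:
    """Count 'Initiator:' lines per initiator name in the given listing."""
    counts = Counter()
    for line in listing.splitlines():
        stripped = line.strip()
        if stripped.startswith("Initiator:"):
            # IQNs contain colons themselves, so split only on the first one.
            name = stripped.split(":", 1)[1].strip()
            counts[name] += 1
    return counts


# Hypothetical sample; real output has more fields per target and session.
SAMPLE = """\
Target: iqn.2010-09.org.example:target0
    Operational Status: Online
    Sessions          : 3
        Initiator: iqn.1993-08.org.example:client-a
        Initiator: iqn.1993-08.org.example:client-a
        Initiator: iqn.1993-08.org.example:client-b
"""

if __name__ == "__main__":
    # Print the busiest initiators first.
    for name, n in sessions_per_initiator(SAMPLE).most_common():
        print(f"{n:5d}  {name}")
```

Run periodically against the real listing, a count that keeps growing for a single initiator would match the pattern on the graph.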
Matej
> On 13 Apr 2016, at 16:07, Matej Žerovnik <matej at zunaj.si> wrote:
>
> Hey there,
>
> we were having the same problems on an old storage box with SATA drives and on a new one with SAS drives and all hardware from the Nexenta HCL, so everything should work. Yet the iSCSI target died every 2-6 days nonetheless. With the help of a memory dump and Dan, it looks like we managed to solve the issue.
>
> In my case, one of the threads was waiting for memory to be freed. I remembered I hadn’t set a max ARC size on my system, so it was possible for the ARC to eat all my free memory and not release it quickly enough. After setting a max ARC size, everything has worked for the last month. I still can’t say if this really fixed the problem or not, but so far so good :)
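
On illumos, an ARC cap like the one described above is commonly set with a `zfs_arc_max` line in /etc/system, which takes effect after a reboot. The thread does not say which value or mechanism was actually used, so the 16 GiB figure and the helper name below are purely illustrative assumptions:

```python
GIB = 1024 ** 3  # bytes per GiB


def arc_max_line(arc_max_gib: int) -> str:
    """Build an /etc/system line capping the ZFS ARC at arc_max_gib GiB."""
    if arc_max_gib <= 0:
        raise ValueError("ARC cap must be a positive number of GiB")
    # zfs_arc_max is given in bytes.
    return f"set zfs:zfs_arc_max={arc_max_gib * GIB}"


if __name__ == "__main__":
    # Assumed example: cap the ARC at 16 GiB.
    print(arc_max_line(16))
```

The right cap depends on how much memory the rest of the system needs, so the number has to be chosen per machine.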
>
> Matej
>
>
>> On 13 Jan 2016, at 05:52, John Barfield <john.barfield at bissinc.com <mailto:john.barfield at bissinc.com>> wrote:
>>
>> Oh, I didn't catch that detail.
>>
>> Okay, well, never mind :)
>>
>>
>>
>>
>>
>> On Tue, Jan 12, 2016 at 8:21 PM -0800, "Brian Hechinger" <wonko at 4amlunch.net <mailto:wonko at 4amlunch.net>> wrote:
>>
>> In my case the SATA disks aren’t on the 1068E.
>>
>> -brian
>>
>>> On Jan 12, 2016, at 11:19 PM, John Barfield <john.barfield at bissinc.com <mailto:john.barfield at bissinc.com>> wrote:
>>>
>>> BTW I left off that it has the same LSI controller chipset
>>>
>>> _____________________________
>>> From: John Barfield <john.barfield at bissinc.com <mailto:john.barfield at bissinc.com>>
>>> Sent: Tuesday, January 12, 2016 10:17 PM
>>> Subject: Re: [OmniOS-discuss] [discuss] COMSTAR hanging
>>> To: <discuss at lists.illumos.org <mailto:discuss at lists.illumos.org>>, omnios-discuss <omnios-discuss at lists.omniti.com <mailto:omnios-discuss at lists.omniti.com>>
>>>
>>>
>>> My input may or may not be valid, but I'm going to throw it out there anyway :)
>>>
>>> Do you have any mpt disconnect errors in /var/adm/messages?
>>>
>>> Also, do you have smartmontools installed?
>>>
>>> I ran into similar issues recently just booting a Sun Fire X4540 off of OmniOS live: I/O would just hang while probing device nodes.
>>>
>>> I found the drive that was acting up and pulled it.
>>>
>>> All of a sudden everything miraculously worked.
>>>
>>> I compiled smartmontools after I got it to boot and found 10 drives out of 48 with bad sectors in prefail state.
>>>
>>> I don't know if this happens with SAS drives or not, but I'm using SATA and saw this was a common issue in old OpenSolaris threads.
>>>
>>> -barfield
>>>
>>>
>>>
>>>
>>> On Tue, Jan 12, 2016 at 8:08 PM -0800, "Brian Hechinger" <wonko at 4amlunch.net <mailto:wonko at 4amlunch.net>> wrote:
>>>
>>> In the meantime I’ve removed the SLOG and L2ARC just in case. I don’t think that’s it, though. At least I will have some sort of data point to work with here. :)
>>>
>>> -brian
>>>
>>> > On Jan 12, 2016, at 10:55 PM, Brian Hechinger <wonko at 4amlunch.net <mailto:wonko at 4amlunch.net>> wrote:
>>> >
>>> > Ok, it has happened.
>>> >
>>> > Checking this here, the pool seems to be fine. I can read and write files.
>>> >
>>> > Except ‘zpool status’ is now hanging. I can still read/write from the pool, however.
>>> >
>>> > I can telnet to port 3260, but restarting target services has hung.
>>> >
>>> > root@basket1:/tank/Share# svcs -a | grep stmf
>>> > online Jan_05 svc:/system/stmf:default
>>> > root@basket1:/tank/Share# svcs -a | grep target
>>> > disabled Jan_05 svc:/system/fcoe_target:default
>>> > online Jan_05 svc:/network/iscsi/target:default
>>> > online Jan_05 svc:/system/ibsrp/target:default
>>> > root@basket1:/tank/Share# svcadm restart /system/ibsrp/target
>>> > root@basket1:/tank/Share# svcadm restart /network/iscsi/target
>>> > root@basket1:/tank/Share# svcadm restart /system/stmf
>>> > root@basket1:/tank/Share# svcs -a | grep target
>>> > disabled Jan_05 svc:/system/fcoe_target:default
>>> > online* 22:43:03 svc:/system/ibsrp/target:default
>>> > online* 22:43:13 svc:/network/iscsi/target:default
>>> > root@basket1:/tank/Share# svcs -a | grep stmf
>>> > online* 22:43:18 svc:/system/stmf:default
>>> > root@basket1:/tank/Share#
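
In svcs output, an asterisk appended to the state marks an instance that is in transition, which is where the restarted services in the transcript are stuck. Below is a minimal sketch for picking such instances out of captured output, assuming the whitespace-separated STATE / STIME / FMRI columns shown in the transcript; the function name is hypothetical.

```python
from typing import List


def in_transition(svcs_output: str) -> List[str]:
    """Return FMRIs whose state column ends with '*' (instance in transition)."""
    stuck = []
    for line in svcs_output.splitlines():
        fields = line.split()
        # Expect at least STATE, STIME and FMRI; skip anything else.
        if len(fields) >= 3 and fields[0].endswith("*"):
            stuck.append(fields[-1])
    return stuck


# Lines taken from the transcript in this thread.
SAMPLE = """\
disabled Jan_05 svc:/system/fcoe_target:default
online* 22:43:03 svc:/system/ibsrp/target:default
online* 22:43:13 svc:/network/iscsi/target:default
online* 22:43:18 svc:/system/stmf:default
"""

if __name__ == "__main__":
    for fmri in in_transition(SAMPLE):
        print(fmri)
```

A service that still shows up in this list long after the restart was issued is a sign of the hang described here.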
>>> >
>>> > I’m doing a crash dump reboot. I’ll post the output somewhere.
>>> >
>>> > The output of echo '$<threadlist' | mdb -k is attached.
>>> >
>>> > <threadlist.out>
>>> >
>>> >> On Jan 8, 2016, at 3:11 PM, Matej Zerovnik <matej at zunaj.si <mailto:matej at zunaj.si>> wrote:
>>> >>
>>> >> Is the pool usable during the COMSTAR hang?
>>> >> Can you write to and read from the pool? Test both: in my case, when the pool froze, I wasn’t able to write to the pool, but I could read.
>>> >>
>>> >> Again, this might not be connected with COMSTAR, but in my case, COMSTAR hangs and pool hangs were alternating.
>>> >>
>>> >> Matej
>>> >>
>>> >>> On 08 Jan 2016, at 20:11, Brian Hechinger <wonko at 4amlunch.net <mailto:wonko at 4amlunch.net>> wrote:
>>> >>>
>>> >>> Yeah, I’m using the 1068E to boot from (this has been supported since before Illumos) but that doesn’t have anything accessed by COMSTAR.
>>> >>>
>>> >>> It’s the ICH10R SATA that hosts the disks that COMSTAR shares out space from.
>>> >>>
>>> >>> -brian
>>> >>>
>>> >>>> On Jan 8, 2016, at 1:31 PM, Richard Jahnel <rjahnel at ellipseinc.com <mailto:rjahnel at ellipseinc.com>> wrote:
>>> >>>>
>>> >>>> First off, love SuperMicro, good choice IMHO.
>>> >>>>
>>> >>>> This board has two on board controllers.
>>> >>>>
>>> >>>> LSI SAS1068E (not 100% sure there are working illumos drivers for this one)
>>> >>>>
>>> >>>> And
>>> >>>>
>>> >>>> Intel ICH10R SATA (So I'm guessing you're using this one.)
>>> >>>>
>>> >>>> -----Original Message-----
>>> >>>> From: OmniOS-discuss [ mailto:omnios-discuss-bounces at lists.omniti.com <mailto:omnios-discuss-bounces at lists.omniti.com>] On Behalf Of Brian Hechinger
>>> >>>> Sent: Friday, January 08, 2016 12:16 PM
>>> >>>> To: Matej Zerovnik <matej at zunaj.si <mailto:matej at zunaj.si>>
>>> >>>> Cc: omnios-discuss <omnios-discuss at lists.omniti.com <mailto:omnios-discuss at lists.omniti.com>>
>>> >>>> Subject: Re: [OmniOS-discuss] [discuss] COMSTAR hanging
>>> >>>>
>>> >>>>
>>> >>>>> Which controller exactly do you have?
>>> >>>>
>>> >>>> Whatever AHCI stuff is built into the motherboard. Motherboard is X8DTL-3F.
>>> >>>>
>>> >>>>> Do you know firmware version?
>>> >>>>
>>> >>>> I’m assuming this is linked to the BIOS version?
>>> >>>>
>>> >>>>> Which hard drives?
>>> >>>>
>>> >>>> Hitachi-HUA723030ALA640-MKAOAA50-2.73TB
>>> >>>>
>>> >>>>> It might not tell much, but it’s good to have as much information as possible.
>>> >>>>>
>>> >>>>> When COMSTAR hangs, can you telnet to the iSCSI port?
>>> >>>>> What does svcs say, is the service running?
>>> >>>>> What happens if you try to restart it?
>>> >>>>> How do you restart it?
>>> >>>>
>>> >>>> I’ll try all these things next time.
>>> >>>>
>>> >>>>> In my case, svcs reported the service as running, but when I tried to telnet, there was no connection, and no listening port was open when checking with 'netstat -an'. I tried to restart the target and stmf services, but the stmf service got stuck in the online* state and would not start. Reboot was the only solution in my case, but as I said, the latest 014 release is working OK (but then again, load got reduced).
>>> >>>>
>>> >>>> All good info. Thanks!
>>> >>>>
>>> >>>> -brian
>>> >>>>
>>> >>>>>
>>> >>>>> Matej
>>> >>>>>
>>> >>>>>> On 08 Jan 2016, at 17:50, Dave Pooser <dave-oo at pooserville.com <mailto:dave-oo at pooserville.com>> wrote:
>>> >>>>>>
>>> >>>>>>>> On Jan 8, 2016, at 11:22 AM, Brian Hechinger <wonko at 4amlunch.net <mailto:wonko at 4amlunch.net>> wrote:
>>> >>>>>>>>
>>> >>>>>>>> No, ZFS raid10
>>> >>>>>>>
>>> >>>>>>> Saw the HW-RAID term, and got concerned. That's what, raidz2 in ZFS-ese?
>>> >>>>>>
>>> >>>>>> It's a zpool with multiple mirror vdevs.
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> Dave Pooser
>>> >>>>>> Cat-Herder-in-Chief, Pooserville.com <http://pooserville.com/>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> _______________________________________________
>>> >>>>>> OmniOS-discuss mailing list
>>> >>>>>> OmniOS-discuss at lists.omniti.com <mailto:OmniOS-discuss at lists.omniti.com>
>>> >>>>>> http://lists.omniti.com/mailman/listinfo/omnios-discuss <http://lists.omniti.com/mailman/listinfo/omnios-discuss>
>>> >>>>>
>>> >>>
>>> >>
>>> >
>>>
>>>
>>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: iscsi-sessions.png
Type: image/png
Size: 25575 bytes
Desc: not available
URL: <https://omniosce.org/ml-archive/attachments/20160601/ec44d641/attachment-0001.png>