From richard at netbsd.org Thu Oct 1 09:50:03 2015
From: richard at netbsd.org (Richard PALO)
Date: Thu, 01 Oct 2015 11:50:03 +0200
Subject: [OmniOS-discuss] strangeness ssh into omnios from oi_151a9
In-Reply-To: <20150930080248.GA4668@gutsman.lotheac.fi>
References: <5606BAD8.8090101@netbsd.org> <33923013-0E59-4223-8EF4-A77A168E1C70@omniti.com> <8963D7A6-2339-4E6F-9559-9DBAAAAD23BF@omniti.com> <20150928134639.GC17072@gutsman.lotheac.fi> <20150928154027.GD5062@gutsman.lotheac.fi> <20150929103507.GE17072@gutsman.lotheac.fi> <560B95BF.4080404@netbsd.org> <20150930080248.GA4668@gutsman.lotheac.fi>
Message-ID: <560D01CB.1060303@netbsd.org>

On 30/09/15 10:02, Lauri Tirkkonen wrote:
> On Wed, Sep 30 2015 09:56:47 +0200, Richard PALO wrote:
>>> To be clear, it's not implementing RFC 1323 (and not even *not*
>>> implementing 7323) that causes the issue. 1323 actually didn't specify
>>> what to do with non-timestamped segments on a timestamp-negotiated
>>> connection, and illumos pre-5850 did something very surprising which I
>>> doubt anybody else did (stop generating timestamps on all future
>>> segments), so I don't think you will be able to reproduce the hang with
>>> other operating systems, but you'll likely be able to see the unexpected
>>> non-timestamped segments in connections between other OSes as well (but
>>> I still can't be sure because I don't know what middlebox is injecting
>>> them or why :)
>>
>> In that case, wouldn't setting tcp_tstamp_always on OI to '1' be better
>> in this case (or would OI not honour that setting correctly)?
>
> It wouldn't work. From what I can tell, those ndd settings only affect
> the SYN segments (i.e. timestamp negotiation); pre-5850 illumos will
> always stop timestamping mid-connection if it receives a non-timestamped
> segment.

Okay, I set tcp_tstamp_if_wscale to 0 and it does seem to work fine.
(Hoping there isn't any fallout from doing this now...)

kiitoksia (thanks)

From lotheac at iki.fi Thu Oct 1 09:58:00 2015
From: lotheac at iki.fi (Lauri Tirkkonen)
Date: Thu, 1 Oct 2015 12:58:00 +0300
Subject: [OmniOS-discuss] strangeness ssh into omnios from oi_151a9
In-Reply-To: <560D01CB.1060303@netbsd.org>
References: <33923013-0E59-4223-8EF4-A77A168E1C70@omniti.com> <8963D7A6-2339-4E6F-9559-9DBAAAAD23BF@omniti.com> <20150928134639.GC17072@gutsman.lotheac.fi> <20150928154027.GD5062@gutsman.lotheac.fi> <20150929103507.GE17072@gutsman.lotheac.fi> <560B95BF.4080404@netbsd.org> <20150930080248.GA4668@gutsman.lotheac.fi> <560D01CB.1060303@netbsd.org>
Message-ID: <20151001095800.GB4668@gutsman.lotheac.fi>

On Thu, Oct 01 2015 11:50:03 +0200, Richard PALO wrote:
>>> In that case, wouldn't setting tcp_tstamp_always on OI to '1' be better
>>> in this case (or would OI not honour that setting correctly)?
>>
>> It wouldn't work. From what I can tell, those ndd settings only affect
>> the SYN segments (i.e. timestamp negotiation); pre-5850 illumos will
>> always stop timestamping mid-connection if it receives a non-timestamped
>> segment.
>
> Okay, I set tcp_tstamp_if_wscale to 0 and it does seem to work fine.

Thanks, that pretty much confirms the issue is what I suspected it is.

> (Hoping there isn't any fallout from doing this now...)

As long as that middlebox has been mucking with your traffic in the way
it is, timestamps have been getting turned off mid-connection for your
pre-5850 box. I recommend you upgrade to post-5850 if you can, or
scream loudly at whoever is modifying your traffic :)

-- 
Lauri Tirkkonen | lotheac @ IRCnet
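[The two tunables discussed above are global TCP knobs on illumos. A
minimal sketch of inspecting and changing them with ndd(1M), mirroring
what Richard did; note that ndd reads a value when given no flag, and
that changes apply system-wide, to newly negotiated connections only:]

    # read the current values
    ndd /dev/tcp tcp_tstamp_always
    ndd /dev/tcp tcp_tstamp_if_wscale

    # what Richard set: stop negotiating timestamps when window
    # scaling is in use
    ndd -set /dev/tcp tcp_tstamp_if_wscale 0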
From richard at netbsd.org Thu Oct 1 11:49:03 2015
From: richard at netbsd.org (Richard PALO)
Date: Thu, 01 Oct 2015 13:49:03 +0200
Subject: [OmniOS-discuss] strangeness ssh into omnios from oi_151a9
In-Reply-To: <20151001095800.GB4668@gutsman.lotheac.fi>
References: <33923013-0E59-4223-8EF4-A77A168E1C70@omniti.com> <8963D7A6-2339-4E6F-9559-9DBAAAAD23BF@omniti.com> <20150928134639.GC17072@gutsman.lotheac.fi> <20150928154027.GD5062@gutsman.lotheac.fi> <20150929103507.GE17072@gutsman.lotheac.fi> <560B95BF.4080404@netbsd.org> <20150930080248.GA4668@gutsman.lotheac.fi> <560D01CB.1060303@netbsd.org> <20151001095800.GB4668@gutsman.lotheac.fi>
Message-ID:

On 01/10/15 11:58, Lauri Tirkkonen wrote:
> On Thu, Oct 01 2015 11:50:03 +0200, Richard PALO wrote:
>>>> In that case, wouldn't setting tcp_tstamp_always on OI to '1' be better
>>>> in this case (or would OI not honour that setting correctly)?
>>>
>>> It wouldn't work. From what I can tell, those ndd settings only affect
>>> the SYN segments (i.e. timestamp negotiation); pre-5850 illumos will
>>> always stop timestamping mid-connection if it receives a non-timestamped
>>> segment.
>>
>> Okay, I set tcp_tstamp_if_wscale to 0 and it does seem to work fine.
>
> Thanks, that pretty much confirms the issue is what I suspected it is.
>
>> (Hoping there isn't any fallout from doing this now...)
>
> As long as that middlebox has been mucking with your traffic in the way
> it is, timestamps have been getting turned off mid-connection for your
> pre-5850 box. I recommend you upgrade to post-5850 if you can, or
> scream loudly at whoever is modifying your traffic :)

Actually I still notice some problems... This morning, in the direction
OI => omnios, things seemed okay. Now, omnios => OI, I just experienced
the hang again, and it is repeatable.

Could it be that your workaround is only useful for outbound connections
(relative to OI)?

-- 
Richard PALO
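[The unexpected non-timestamped segments can also be observed directly
on the wire; a rough sketch with snoop(1M), where the interface name is
an assumption. The verbose decode prints each segment's TCP options, so
a timestamp option that disappears mid-connection becomes visible:]

    # full per-segment decode of the ssh connection, including options
    snoop -d e1000g0 -v port 22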
From lotheac at iki.fi Thu Oct 1 12:24:11 2015
From: lotheac at iki.fi (Lauri Tirkkonen)
Date: Thu, 1 Oct 2015 15:24:11 +0300
Subject: [OmniOS-discuss] strangeness ssh into omnios from oi_151a9
In-Reply-To:
References: <20150928134639.GC17072@gutsman.lotheac.fi> <20150928154027.GD5062@gutsman.lotheac.fi> <20150929103507.GE17072@gutsman.lotheac.fi> <560B95BF.4080404@netbsd.org> <20150930080248.GA4668@gutsman.lotheac.fi> <560D01CB.1060303@netbsd.org> <20151001095800.GB4668@gutsman.lotheac.fi>
Message-ID: <20151001122411.GD4668@gutsman.lotheac.fi>

On Thu, Oct 01 2015 13:49:03 +0200, Richard PALO wrote:
> Actually I still notice some problems... This morning, in the direction
> OI => omnios, things seemed okay. Now, omnios => OI, I just experienced
> the hang again, and it is repeatable.
>
> Could it be that your workaround is only useful for outbound connections
> (relative to OI)?

Yeah, it's possible. Whoever sends the SYN expresses their capability to
timestamp by including the tsopt, and you can disable that with the ndd
options. I assumed that the ndd options would affect the SYNACK as well,
but I didn't actually read the code; I guess that's not the case after
all, so inbound connections still get timestamping negotiated. I don't
have a workaround for this, sorry.

-- 
Lauri Tirkkonen | lotheac @ IRCnet

From richard at netbsd.org Thu Oct 1 12:29:32 2015
From: richard at netbsd.org (Richard PALO)
Date: Thu, 01 Oct 2015 14:29:32 +0200
Subject: [OmniOS-discuss] strangeness ssh into omnios from oi_151a9
In-Reply-To: <20151001122342.GC4668@gutsman.lotheac.fi>
References: <20150928134639.GC17072@gutsman.lotheac.fi> <20150928154027.GD5062@gutsman.lotheac.fi> <20150929103507.GE17072@gutsman.lotheac.fi> <560B95BF.4080404@netbsd.org> <20150930080248.GA4668@gutsman.lotheac.fi> <560D01CB.1060303@netbsd.org> <20151001095800.GB4668@gutsman.lotheac.fi> <20151001122342.GC4668@gutsman.lotheac.fi>
Message-ID: <560D272C.9040506@netbsd.org>

On 01/10/15 14:23, Lauri Tirkkonen wrote:
> Yeah, it's possible. Whoever sends the SYN expresses their capability to
> timestamp by including the tsopt, and you can disable that with the ndd
> options. I assumed that the ndd options would affect the SYNACK as well,
> but I didn't actually read the code; I guess that's not the case after
> all, so inbound connections still get timestamping negotiated. I don't
> have a workaround for this, sorry.

Too bad.
Naturally it isn't feasible to turn things off via ndd on omnios for just
one target. Is there any way to do that differently? That is, for only
one target (and primarily ssh)?

Unfortunately, it also seems my inquiry to the OI list went unheard, even
after subscribing (again). Must not have any moderators any longer... oh
bother. The easiest would be to have 5850 integrated into OI.

-- 
Richard PALO

From lotheac at iki.fi Thu Oct 1 12:34:11 2015
From: lotheac at iki.fi (Lauri Tirkkonen)
Date: Thu, 1 Oct 2015 15:34:11 +0300
Subject: [OmniOS-discuss] strangeness ssh into omnios from oi_151a9
In-Reply-To: <560D272C.9040506@netbsd.org>
References: <20150928154027.GD5062@gutsman.lotheac.fi> <20150929103507.GE17072@gutsman.lotheac.fi> <560B95BF.4080404@netbsd.org> <20150930080248.GA4668@gutsman.lotheac.fi> <560D01CB.1060303@netbsd.org> <20151001095800.GB4668@gutsman.lotheac.fi> <20151001122342.GC4668@gutsman.lotheac.fi> <560D272C.9040506@netbsd.org>
Message-ID: <20151001123411.GE4668@gutsman.lotheac.fi>

On Thu, Oct 01 2015 14:29:32 +0200, Richard PALO wrote:
> Too bad. Naturally it isn't feasible to turn things off via ndd on
> omnios for just one target. Is there any way to do that differently?
> That is, for only one target (and primarily ssh)?

Not that I know of.

-- 
Lauri Tirkkonen | lotheac @ IRCnet

From pekka.niiranen at pp5.inet.fi Thu Oct 1 19:00:18 2015
From: pekka.niiranen at pp5.inet.fi (Pekka Niiranen)
Date: Thu, 1 Oct 2015 22:00:18 +0300
Subject: [OmniOS-discuss] Intel HD 3000 graphics
Message-ID:

Hello,

has anybody managed to get Intel HD 3000 in Sandy Bridge to work using X
from SmartOS packages? Dmesg shows the chip and "X -configure" selects
the "intel" driver, but X only starts after replacing that with "vesa";
otherwise I get "No screens found".

I found https://www.illumos.org/issues/4044 but what does "Work
completed" in there mean for the user? Do I need to build the driver
myself?

-pekka-

From daleg at omniti.com Thu Oct 1 19:31:42 2015
From: daleg at omniti.com (Dale Ghent)
Date: Thu, 1 Oct 2015 15:31:42 -0400
Subject: [OmniOS-discuss] Intel HD 3000 graphics
In-Reply-To:
References:
Message-ID: <28E20456-7B62-445E-B60D-5A4F93CA8A2B@omniti.com>

> On Oct 1, 2015, at 3:00 PM, Pekka Niiranen wrote:
>
> has anybody managed to get Intel HD 3000 in Sandy Bridge to work using
> X from SmartOS packages?

Xwindows/Xorg is not a target audience for OmniOS, so little, if any,
attention is given to issues surrounding support for a windowing
environment.
/dale

From richard at netbsd.org Thu Oct 1 19:47:41 2015
From: richard at netbsd.org (Richard PALO)
Date: Thu, 01 Oct 2015 21:47:41 +0200
Subject: [OmniOS-discuss] Intel HD 3000 graphics
In-Reply-To:
References:
Message-ID:

On 01/10/15 21:00, Pekka Niiranen wrote:
> has anybody managed to get Intel HD 3000 in Sandy Bridge to work using
> X from SmartOS packages? Dmesg shows the chip and "X -configure"
> selects the "intel" driver, but X only starts after replacing that
> with "vesa"; otherwise I get "No screens found".
>
> I found https://www.illumos.org/issues/4044 but what does "Work
> completed" in there mean for the user? Do I need to build the driver
> myself?

according to http://www.x.org/wiki/IntelGraphicsDriver/ I believe you
will need a working Kernel Mode Setting (KMS) implementation. It seems
all versions of the Intel driver since 2.15 only support KMS...

-- 
Richard PALO

From Robert.Brock at 2hoffshore.com Fri Oct 2 09:26:13 2015
From: Robert.Brock at 2hoffshore.com (Robert A. Brock)
Date: Fri, 2 Oct 2015 09:26:13 +0000
Subject: [OmniOS-discuss] Cannot access CIFS with \\servername.fqdn
Message-ID: <2859482C466CCA42AD9B84B9F56212301506AF27@2H199.2hukwok2.local>

List,

I've got a box that's doing this: https://www.illumos.org/issues/1087

Trying to hit the CIFS share by FQDN gets me this:

Error: \\server.domain.local is not accessible. You might not have
permission to use this network resource. Contact the administrator of
this server to find out if you have access permissions. The account is
not authorized to log in from this station.

But I can connect fine with \\hostname. For added weirdness, I have
another OmniOS box that isn't exhibiting this problem. Anybody seen this
before and managed to figure out the cause?

Regards,

Rob

From gate03 at landcroft.co.uk Fri Oct 2 22:56:07 2015
From: gate03 at landcroft.co.uk (Michael Mounteney)
Date: Sat, 3 Oct 2015 08:56:07 +1000
Subject: [OmniOS-discuss] Unable to install r151014
Message-ID: <20151003085607.19a2bf44@pimple.landy.net>

Hello, I tried to install LTS yesterday, but at the menu where one
selects [1] to install, it just paused for a moment before flashing up
what looked like a Python stacktrace, then returned to the five-option
menu.

The machine (Supermicro SYS 5107C-LF) is running r151013, so I believe it
is compatible hardware.

The first step is to capture that stacktrace. How to do that?

______________
Michael Mounteney

From danmcd at omniti.com Sat Oct 3 01:34:51 2015
From: danmcd at omniti.com (Dan McDonald)
Date: Fri, 2 Oct 2015 21:34:51 -0400
Subject: [OmniOS-discuss] Unable to install r151014
In-Reply-To: <20151003085607.19a2bf44@pimple.landy.net>
References: <20151003085607.19a2bf44@pimple.landy.net>
Message-ID: <6EF1C261-400F-4020-A2D8-C58119CF2561@omniti.com>

If you're running 013, you can just use IPS to upgrade. But yes, I'd like
to see what happened. I'll also reverify locally. You using a USB or an
ISO?

Dan

Sent from my iPhone (typos, autocorrect, and all)

> On Oct 2, 2015, at 6:56 PM, Michael Mounteney wrote:
>
> The first step is to capture that stacktrace. How to do that?
>
> ______________
> Michael Mounteney
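[One way to capture that stacktrace, sketched under assumptions: the
installer's GRUB menu already offers ttya/ttyb console entries, so boot
the "ttya" entry and log the serial console from a second machine over a
null-modem cable. The device name below is an assumption:]

    # on the second machine: record the session, then attach to the line
    script install-stacktrace.log
    tip -9600 /dev/term/a
    # boot the target with the serial-console menu entry; the Python
    # traceback then lands in install-stacktrace.log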
From marcus at plumdev.com Mon Oct 5 00:35:06 2015
From: marcus at plumdev.com (Marcus Marinelli)
Date: Sun, 4 Oct 2015 17:35:06 -0700
Subject: [OmniOS-discuss] Unable to boot r151014 USB installation media; difference between ISO and USB files?
Message-ID:

Hi All,

I'm hoping to be able to use OmniOS in a hosted/cloud environment, but
I'm having some problems getting a working image set up.

I began by trying to install the latest stable (r151014) by writing the
USB disk installation file (OmniOS_Text_r151014.usb-dd) to a virtual
device on the VM provider, adding a blank disk to install to, and booting
the virtualized KVM node against the USB image. The VM comes up and shows
the OmniOS GRUB splash screen, and lets me choose the regular or
ttya/ttyb boot menu selections; however, after the initial "SunOS Release
5.11 Version omnios-f090f73 64-bit" messages show up, things start going
poorly. After "Preparing image for use" is shown, next I see:

Requesting System Maintenance Mode
(See /lib/svc/share/README for more information.)
Console login service(s) cannot run

Enter user name for system maintenance (control-d to bypass):

At this point, once I drop into the shell, svcs -vx shows
svc:/system/filesystem/root-assembly:media (Installation file system
assembly) is in state maintenance due to reason "Start method exited with
$SMF_EXIT_ERR_FATAL" and consequently a whole bunch (55) of dependent
services are not running.

If I look at the svc log for system-filesystem-root-assembly:media, I
see:

Executing start method ("/lib/svc/method/media-assembly")
Unable to mount media

Looking in /lib/svc/method/media-assembly, I believe it is failing quite
early:

. /lib/svc/share/media_include.sh
. /lib/svc/share/smf_include.sh
. /lib/svc/share/fs_include.sh

volsetid=$( < "/.volsetid" )

echo "\rPreparing image for use" >/dev/msglog
/usr/sbin/mount_media $volsetid
if [ $? -ne 0 ]; then
        echo "Unable to mount media"
        exit $SMF_EXIT_ERR_FATAL
fi

For reference, the contents of /.volsetid ($volsetid) is
"r151014-2015-09-30T13:22:46.571554"

If I run the /lib/svc/method/media-assembly script at this point (from
the shell) I can see that /usr/sbin/mount_media is returning exit code 1,
which explains why the script is failing and consequently SMF has given
up.
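[A sketch of re-running that failing step by hand from the maintenance
shell, matching the method script quoted above:]

    # reproduce the start method's check manually
    volsetid=$(cat /.volsetid)
    /usr/sbin/mount_media "$volsetid"
    echo $?   # prints 1 in this case; any non-zero exit is what sends
              # root-assembly:media into maintenance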
In order to diagnose this further, I then took the same USB image
(OmniOS_Text_r151014.usb-dd), wrote it to a real physical USB disk, and
booted a real physical machine from it. The machine is 2 or 3 years old
at this point - not brand new, but not ancient either. The surprising
thing is that I experienced the same behavior - the "real" machine failed
to boot in the same way, with seemingly the same issue.

At this point, I thought maybe something was wrong with the r151014 USB
installation media, so I downloaded r151012, r151010, r151006 and
r151004's USB installation files, flashed them all (in sequence, working
backwards) to a real USB stick, and experienced the same problem on all
of them.

I then took the r151014 ISO image file, burned that to a disc, put it in
the same physical computer, and it booted right up and loaded the OmniOS
installer fine.

It looks like this may be related to another user's recent experience:
http://lists.omniti.com/pipermail/omnios-discuss/2015-September/005651.html
(at least it sure seems like the same failure mode, although it's not
clear from that email chain whether the user installed r151012
successfully from the .usb-dd image or whether they, perhaps, also used
an ISO image for the successful '012 installation they mentioned)

Can anyone (Dan? :)) help shed some light on what I might be doing wrong
here with the USB installation media?

Thanks,
Marcus

From danmcd at omniti.com Mon Oct 5 14:04:32 2015
From: danmcd at omniti.com (Dan McDonald)
Date: Mon, 5 Oct 2015 10:04:32 -0400
Subject: [OmniOS-discuss] Unable to boot r151014 USB installation media; difference between ISO and USB files?
In-Reply-To:
References:
Message-ID: <18281AE8-D7CF-48E4-8B85-7B132118C4D2@omniti.com>

> On Oct 4, 2015, at 8:35 PM, Marcus Marinelli wrote:
>
> Can anyone (Dan? :)) help shed some light on what I might be doing
> wrong here with the USB installation media?

It is possible there's a problem in distro_const(1M) when it comes to the
USB media. What's concerning, however, is this:

> At this point, I thought maybe something was wrong with the r151014 USB
> installation media, so I downloaded r151012, r151010, r151006 and
> r151004's USB installation files, flashed them all (in sequence,
> working backwards) to a real USB stick, and experienced the same
> problem on all of them.

Another user recently had a problem with USB, but it turned out it was a
bad USB drive on his end. Please make sure. Also, and this may be a
documentation error, are you writing to the right disk device?

Dan
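[On Dan's last question, a sketch of writing the usb-dd image to the
whole-disk device. The device name here is an assumption; check it with
format(1M) first, since writing to a slice or partition device instead
of the whole disk is an easy mistake to make:]

    # p0 addresses the whole disk on illumos
    dd if=OmniOS_Text_r151014.usb-dd of=/dev/rdsk/c4t0d0p0 bs=1024k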
From asc1111 at gmail.com Mon Oct 5 16:45:37 2015
From: asc1111 at gmail.com (Aaron Curry)
Date: Mon, 5 Oct 2015 10:45:37 -0600
Subject: [OmniOS-discuss] zfs send/receive corruption?
Message-ID:

We have a file server implementing CIFS to serve files to our users.
Periodic snapshots are replicated to a secondary system via zfs
send/receive. I recently moved services (shares, IP addresses, etc.) to
the secondary system while we performed some maintenance on the primary
server. Shortly after everything was up and running on the secondary
system, that server panic'ed. Here's the stack trace:

panicstr = assertion failed: 0 == zfs_acl_node_read(dzp, B_TRUE, &paclp,
B_FALSE), file: ../../common/fs/zfs/zfs_acl.c, line: 1717
panicstack = fffffffffba8b1a8 () | zfs:zfs_acl_ids_create+4d2 () |
zfs:zfs_make_xattrdir+96 () | zfs:zfs_get_xattrdir+103 () |
zfs:zfs_lookup+1b6 () | genunix:fop_lookup+a2 () |
genunix:xattr_dir_realdir+b3 () | genunix:xattr_lookup_cb+65 () |
genunix:gfs_dir_lookup_dynamic+7c () | genunix:gfs_dir_lookup+18c () |
genunix:gfs_vop_lookup+35 () | genunix:fop_lookup+a2 () |
smbsrv:smb_vop_lookup+ea () | smbsrv:smb_vop_stream_lookup+e5 () |
smbsrv:smb_fsop_lookup_name+158 () | smbsrv:smb_open_subr+1b8 () |
smbsrv:smb_common_open+54 () | smbsrv:smb_com_nt_create_andx+ac () |
smbsrv:smb_dispatch_request+687 () | smbsrv:smb_session_worker+a0 () |
genunix:taskq_d_thread+b7 () | unix:thread_start+8 () |

Luckily, it wasn't hard to identify the steps to reproduce the problem.
Accessing a particular directory from a Mac OS X system causes this panic
every time, but only on the secondary (zfs send/receive target) system.
Accessing the same directory on the primary system does not cause a
panic.

I have tested this on other systems and have been able to reproduce the
panic on the zfs send/receive target every time. Also, I have been able
to reproduce it with OmniOS versions 151010, 151012 and the latest
151014. Replicating between two separate systems and replicating to the
local system both exhibit the same behavior.

While I have been able to reliably pin down a particular file or
file/directory combination that is causing the problem and can easily
reproduce the panic, I am at a loss as to where to go from here. Are
there any known issues with zfs send/receive? Any help would be
appreciated.

Aaron
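[A sketch of the replication path Aaron describes, for anyone trying to
reproduce locally; pool and dataset names are placeholders:]

    # replicate the affected filesystem to a second dataset on one host
    zfs snapshot tank/files@repro
    zfs send tank/files@repro | zfs receive -u tank/files-copy
    # then share the received copy over CIFS and browse the problem
    # directory from OS X, which is what triggers the panic here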
From mir at miras.org Mon Oct 5 17:09:46 2015
From: mir at miras.org (Michael Rasmussen)
Date: Mon, 5 Oct 2015 19:09:46 +0200
Subject: [OmniOS-discuss] zfs send/receive corruption?
In-Reply-To:
References:
Message-ID: <20151005190946.503fa851@sleipner.datanom.net>

On Mon, 5 Oct 2015 10:45:37 -0600, Aaron Curry wrote:
>
> While I have been able to reliably pin down a particular file or
> file/directory combination that is causing the problem and can easily
> reproduce the panic, I am at a loss as to where to go from here. Are
> there any known issues with zfs send/receive? Any help would be
> appreciated.
>

What is the sync setting on the receiving pool?

-- 
Hilsen/Regards
Michael Rasmussen

Get my public GnuPG keys:
michael rasmussen cc
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xD3C9A00E
mir datanom net
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xE501F51C
mir miras org
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xE3E80917
--------------------------------------------------------------
/usr/games/fortune -es says:
There are no emotional victims, only volunteers.

From mir at miras.org Mon Oct 5 17:45:41 2015
From: mir at miras.org (Michael Rasmussen)
Date: Mon, 5 Oct 2015 19:45:41 +0200
Subject: [OmniOS-discuss] zfs send/receive corruption?
In-Reply-To:
References: <20151005190946.503fa851@sleipner.datanom.net>
Message-ID: <20151005194541.6e7d7b70@sleipner.datanom.net>

On Mon, 5 Oct 2015 11:30:04 -0600, Aaron Curry wrote:
> # zfs get sync pool/fs
> NAME     PROPERTY  VALUE     SOURCE
> pool/fs  sync      standard  default
>
> Is that what you mean?
>
Yes. Default means honor sync requests.

-- 
Hilsen/Regards
Michael Rasmussen

Get my public GnuPG keys:
michael rasmussen cc
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xD3C9A00E
mir datanom net
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xE501F51C
mir miras org
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xE3E80917
--------------------------------------------------------------
/usr/games/fortune -es says:
Love isn't only blind, it's also deaf, dumb, and stupid.

From chip at innovates.com Mon Oct 5 18:28:48 2015
From: chip at innovates.com (Schweiss, Chip)
Date: Mon, 5 Oct 2015 13:28:48 -0500
Subject: [OmniOS-discuss] zfs send/receive corruption?
In-Reply-To: <20151005194541.6e7d7b70@sleipner.datanom.net>
References: <20151005190946.503fa851@sleipner.datanom.net> <20151005194541.6e7d7b70@sleipner.datanom.net>
Message-ID:

This smells of a problem reported fixed on FreeBSD and ZoL:

http://permalink.gmane.org/gmane.comp.file-systems.openzfs.devel/1545

On the illumos ZFS list the question was posed whether those fixes have
been incorporated, but it went unanswered:

http://www.listbox.com/member/archive/182191/2015/09/sort/time_rev/page/1/entry/23:71/20150916025648:1487D326-5C40-11E5-A45A-20B0EF10038B/

I'd be curious to confirm whether this has been fixed in illumos or not,
as I now have systems with lots of CIFS and ACLs that are potentially
vulnerable to the same sort of problem. Thus far I cannot find reference
to it, but I could be looking in the wrong place, or for the wrong
keywords.

-Chip

On Mon, Oct 5, 2015 at 12:45 PM, Michael Rasmussen wrote:
> Yes. Default means honor sync requests.

From henson at acm.org Tue Oct 6 20:54:26 2015
From: henson at acm.org (Paul B. Henson)
Date: Tue, 06 Oct 2015 13:54:26 -0700
Subject: [OmniOS-discuss] zdb -h bug?
In-Reply-To: <034601d0f278$08ce27b0$1a6a7710$@acm.org>
References: <005801d0ef58$da1898f0$8e49cad0$@acm.org> <55F8A73E.6050003@genashor.com> <034601d0f278$08ce27b0$1a6a7710$@acm.org>
Message-ID: <02f401d10079$32e1b2b0$98a51810$@acm.org>

> From: Paul B. Henson
> Sent: Friday, September 18, 2015 6:11 PM
>
> Thanks for the verification. I gotta tell you, when zdb core dumped
> while I was trying to determine if my pool had been corrupted by the
> L2ARC bug, it was not a good feeling 8-/. But I'm pretty sure at this
> point it is an unrelated bug with zdb and not a pool corruption issue.
> I still haven't had time to set up a test environment to reproduce it,
> maybe next week.

I noticed a review request for "6290 zdb -h overflows stack" fly by on
the ZFS development list. I haven't confirmed it, but I think this is the
bug we were running into that was causing zdb -h to dump core. It's a
pretty trivial change; Dan, any chance you might be able to backport it
to LTS at some point?

Thanks.

From danmcd at omniti.com Tue Oct 6 23:10:05 2015
From: danmcd at omniti.com (Dan McDonald)
Date: Tue, 6 Oct 2015 19:10:05 -0400
Subject: [OmniOS-discuss] zdb -h bug?
In-Reply-To: <02f401d10079$32e1b2b0$98a51810$@acm.org>
References: <005801d0ef58$da1898f0$8e49cad0$@acm.org> <55F8A73E.6050003@genashor.com> <034601d0f278$08ce27b0$1a6a7710$@acm.org> <02f401d10079$32e1b2b0$98a51810$@acm.org>
Message-ID:

> On Oct 6, 2015, at 4:54 PM, Paul B. Henson wrote:
>
> Dan, any chance you might be able to backport it to LTS at some point?

I'm hoping the next batch of changes stays entirely in userland and
doesn't require a reboot. The problem with patching zdb is that it
requires an upgrade of the whole ZFS package, which forces a reboot.
r151014(~)[0]% pkg search `which zdb`
INDEX      ACTION    VALUE         PACKAGE
path       hardlink  usr/sbin/zdb  pkg:/system/file-system/zfs@0.5.11-0.151014
r151014(~)[0]%

So if it happens, it mightn't happen as quickly as some other things I
have in the backport pipeline.

Dan

From henson at acm.org Wed Oct 7 19:43:20 2015
From: henson at acm.org (Paul B. Henson)
Date: Wed, 07 Oct 2015 12:43:20 -0700
Subject: [OmniOS-discuss] zdb -h bug?
In-Reply-To:
References: <005801d0ef58$da1898f0$8e49cad0$@acm.org> <55F8A73E.6050003@genashor.com> <034601d0f278$08ce27b0$1a6a7710$@acm.org> <02f401d10079$32e1b2b0$98a51810$@acm.org>
Message-ID: <03fb01d10138$6bec7530$43c55f90$@acm.org>

> From: Dan McDonald
> Sent: Tuesday, October 06, 2015 4:10 PM
>
> I'm hoping the next batch of changes stays entirely in userland and
> doesn't require a reboot. The problem with patching zdb is that it
> requires an upgrade of the whole ZFS package, which forces a reboot.

That's a bummer. Maybe you can pull it in the next time you backport a
zfs kernel change. It's not particularly important; I'm mostly curious to
see whether it fixes the core dump I get from zdb, which is the last
niggly annoyance I have left from the L2ARC corruption scare :).

Thanks.

From bmx1955 at gmail.com Wed Oct 7 20:59:14 2015
From: bmx1955 at gmail.com (Mick Burns)
Date: Wed, 7 Oct 2015 16:59:14 -0400
Subject: [OmniOS-discuss] big zfs storage?
In-Reply-To:
References: <559EE5BF.4040900@kateley.com> <559FF4DE.4040202@kateley.com>
Message-ID:

So... how does Nexenta cope with hot spares and all kinds of disk
failures? Adding hot spares is part of their administration manuals, so
can we assume things are almost always handled smoothly? I'd like to hear
about tangible experiences in production.

thanks

On Mon, Jul 13, 2015 at 7:58 AM, Schweiss, Chip wrote:
> Liam,
>
> This report is encouraging. Please share some details of your
> configuration. What disk failure parameters have you set? Which JBODs
> and disks are you running?
>
> I have mostly DataON JBODs and some Supermicro. DataON has PMC SAS
> expanders and Supermicro has LSI; both setups have pretty much the same
> behavior with disk failures. All my servers are Supermicro with LSI
> HBAs.
>
> If there's a magic combination of hardware and OS config out there that
> solves the disk-failure panic problem, I will certainly change my
> builds going forward.
>
> -Chip
>
> On Fri, Jul 10, 2015 at 1:04 PM, Liam Slusser wrote:
>>
>> I have two 800T ZFS systems on OmniOS and a bunch of smaller <50T
>> systems. Things generally work very well. We lose a disk here and
>> there but it's never resulted in downtime. They're all on Dell
>> hardware with LSI or Dell PERC controllers.
>>
>> Putting in smaller disk failure parameters, so disks fail quicker, was
>> a big help when something does go wrong with a disk.
>>
>> thanks,
>> liam
>>
>> On Fri, Jul 10, 2015 at 10:31 AM, Schweiss, Chip wrote:
>>>
>>> Unfortunately, for the past couple of years panics on disk failure
>>> have been the norm. All my production systems are HA with RSF-1, so
>>> at least things come back online relatively quickly. There are quite
>>> a few open tickets in the illumos bug tracker related to mpt_sas
>>> panics.
>>>
>>> Most of the work to fix these problems has been committed in the past
>>> year, though problems still exist. For example, my systems are
>>> dual-path SAS; however, mpt_sas will panic if you pull a cable
>>> instead of dropping a path to the disks. Dan McDonald is actively
>>> working to resolve this.
>>> He is also pushing a bug fix in genunix from Nexenta that appears to
>>> fix a lot of the panic problems. I'll know for sure in a few months,
>>> after I see a disk or two drop, whether it truly fixes things. Hans
>>> Rosenfeld at Nexenta is responsible for most of the updates to
>>> mpt_sas, including support for the 3008 (12G SAS).
>>>
>>> I haven't run any 12G SAS yet, but plan to on my next build in a
>>> couple of months. This will be about 300TB using an 84-disk JBOD.
>>> All the code from Nexenta to support the 3008 appears to be in
>>> illumos now, and they fully support it, so I suspect it's pretty
>>> stable now. From what I understand there may be some 12G performance
>>> fixes coming sometime.
>>>
>>> The fault manager is nice when the system doesn't panic. When it
>>> panics, the fault manager never gets a chance to take action. It is
>>> still the consensus that it is better to run pools without hot
>>> spares, because there are situations where the fault manager will do
>>> bad things. I witnessed this myself when building a system: the
>>> fault manager replaced 5 disks in a raidz2 vdev inside 1 minute,
>>> trashing the pool. I haven't completely yielded to that "best
>>> practice"; I now run one hot spare per pool. I figure with raidz2,
>>> the odds of the fault manager causing something catastrophic are much
>>> lower.
>>>
>>> -Chip
>>>
>>> On Fri, Jul 10, 2015 at 11:37 AM, Linda Kateley wrote:
>>>>
>>>> I have to build and maintain my own system. I usually help others
>>>> build (I teach zfs and freenas classes/consulting). I really love
>>>> fault management in solaris and miss it. Just thought since it's my
>>>> system and I get to choose, I would use omni. I have 20+ years
>>>> using solaris and only 2 on freebsd.
>>>>
>>>> I like freebsd for how well tuned for zfs it is oob. I miss the
>>>> network, v12n and resource controls in solaris.
>>>>
>>>> Concerned about panics on disk failure. Is that common?
>>>>
>>>> linda
>>>>
>>>> On 7/9/15 9:30 PM, Schweiss, Chip wrote:
>>>>
>>>> Linda,
>>>>
>>>> I have 3.5 PB running under OmniOS. All my systems have LSI 2108
>>>> HBAs, which is considered the best choice for HBAs.
>>>>
>>>> Illumos leaves a bit to be desired in handling faults from disks or
>>>> SAS problems, but things under OmniOS have been improving, much
>>>> thanks to Dan McDonald and OmniTI. We have paid support on all of
>>>> our production systems with OmniTI. Their response and dedication
>>>> has been very good. Other than the occasional panic and restart
>>>> from a disk failure, OmniOS has been solid. ZFS, of course, has
>>>> never lost a single bit of information.
>>>>
>>>> I'd be curious why you're looking to move. Have there been specific
>>>> problems under BSD or ZoL? I've been slowly evaluating FreeBSD ZFS,
>>>> but of course the skeletons in the closet never seem to come out
>>>> until you do something big.
>>>>
>>>> -Chip
>>>>
>>>> On Thu, Jul 9, 2015 at 4:21 PM, Linda Kateley wrote:
>>>>>
>>>>> Hey, is there anyone out there running big zfs on omni?
>>>>>
>>>>> I have been doing mostly zol and freebsd for the last year but have
>>>>> to build a 300+TB box, and I want to come back home to roots
>>>>> (solaris). Feeling kind of hesitant :) Also, if you had to do it
>>>>> over, is there anything you would do differently?
>>>>>
>>>>> Also, what is the go-to HBA these days? Seems like I saw stable
>>>>> code for the lsi 3008?
>>>>>
>>>>> TIA
>>>>>
>>>>> linda
>>>>
>>>> --
>>>> Linda Kateley
>>>> Kateley Company
>>>> Skype ID-kateleyco
>>>> http://kateleyco.com

From richard.elling at richardelling.com Wed Oct 7 22:38:11 2015
From: richard.elling at richardelling.com (Richard Elling)
Date: Wed, 7 Oct 2015 15:38:11 -0700
Subject: [OmniOS-discuss] big zfs storage?
In-Reply-To:
References: <559EE5BF.4040900@kateley.com> <559FF4DE.4040202@kateley.com>
Message-ID: <6A4C3B06-D2C5-4F07-B9A3-D0F477AE89AA@richardelling.com>

> On Oct 7, 2015, at 1:59 PM, Mick Burns wrote:
>
> So... how does Nexenta cope with hot spares and all kinds of disk
> failures? Adding hot spares is part of their administration manuals,
> so can we assume things are almost always handled smoothly? I'd like
> to hear about tangible experiences in production.

I do not speak for Nexenta.

Hot spares are a bigger issue when you have single-parity protection.
With double parity and large pools, warm spares are a better approach.
The reasons are:

1. Hot spares exist solely to eliminate the time between disk failure
   and human intervention for corrective action. There is no other
   reason to have hot spares. The exposure from a single disk failure
   under single-parity protection is too risky for most folks, but with
   double parity (eg raidz2 or RAID-6) the few hours you save have
   little impact on overall data availability vs warm spares.

2. Under some transient failure conditions (eg isolated power failure,
   IOM reboot, or fabric partition), all available hot spares can be
   kicked into action. This can leave you with a big mess for large
   pools with many drives and spares. You can avoid this by making a
   human be involved in the decision process, rather than relying on
   *locally isolated,* automated decision making.

-- richard
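[A sketch of the contrast Richard draws; pool and device names are
placeholders. A hot spare is attached to the pool so the fault manager
can activate it automatically; a warm spare is cabled and visible but
deliberately left out of the pool, so a human issues the replacement:]

    # hot spare: automatic activation on fault
    zpool add tank spare c9t0d0

    # warm spare: nothing configured up front; after a confirmed disk
    # failure an operator runs the replacement by hand
    zpool replace tank c3t2d0 c9t0d0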
From chip at innovates.com Thu Oct 8 00:56:30 2015
From: chip at innovates.com (Schweiss, Chip)
Date: Wed, 7 Oct 2015 19:56:30 -0500
Subject: [OmniOS-discuss] big zfs storage?
In-Reply-To: <6A4C3B06-D2C5-4F07-B9A3-D0F477AE89AA@richardelling.com>
References: <559EE5BF.4040900@kateley.com> <559FF4DE.4040202@kateley.com> <6A4C3B06-D2C5-4F07-B9A3-D0F477AE89AA@richardelling.com>
Message-ID:

I completely concur with Richard on this. Let me give a real example that
emphasizes this point, as it's a critical design decision. I never fully
understood it until I saw in action the problems automated hot spares can
cause.

I had all 5 hot spares get put into action on one raidz2 vdev of a 300TB
pool. This was triggered by an HA event that was taking SCSI reservations
in a split-brain situation that was supposed to trigger a panic on one
system. This caused a highly corrupted pool. Fortunately this was not a
production pool, and I simply trashed it and started reloading data.

Now I only run one hot spare per pool. Most of my pools are raidz2 or
raidz3. This way an event like this cannot take out more than one disk,
and data parity will never be lost.

There are other causes that can trigger multiple disk replacements. I
have not encountered them. If I do, they won't hurt my data with the
limit of one hot spare.

-Chip

On Wed, Oct 7, 2015 at 5:38 PM, Richard Elling
<richard.elling at richardelling.com> wrote:
>
> I do not speak for Nexenta.
>
> Hot spares are a bigger issue when you have single-parity protection.
> With double parity and large pools, warm spares are a better approach.
> The reasons are:
>
> 1. Hot spares exist solely to eliminate the time between disk failure
>    and human intervention for corrective action. There is no other
>    reason to have hot spares. The exposure from a single disk failure
>    under single-parity protection is too risky for most folks, but
>    with double parity (eg raidz2 or RAID-6) the few hours you save
>    have little impact on overall data availability vs warm spares.
>
> 2. Under some transient failure conditions (eg isolated power failure,
>    IOM reboot, or fabric partition), all available hot spares can be
>    kicked into action. This can leave you with a big mess for large
>    pools with many drives and spares. You can avoid this by making a
>    human be involved in the decision process, rather than relying on
>    *locally isolated,* automated decision making.
>
> -- richard
From cks at cs.toronto.edu Thu Oct 8 01:36:43 2015
From: cks at cs.toronto.edu (Chris Siebenmann)
Date: Wed, 07 Oct 2015 21:36:43 -0400
Subject: [OmniOS-discuss] big zfs storage?
In-Reply-To: chip's message of Wed, 07 Oct 2015 19:56:30 -0500.
Message-ID: <20151008013643.2D31F7A06B2@apps0.cs.toronto.edu>

> I completely concur with Richard on this. Let me give a real example
> that emphasizes this point [...]
> Now I only run one hot spare per pool. Most of my pools are raidz2 or
> raidz3. This way an event like this cannot take out more than one
> disk, and data parity will never be lost.
>
> There are other causes that can trigger multiple disk replacements. I
> have not encountered them. If I do, they won't hurt my data with the
> limit of one hot spare.

My view is that spare handling needs to be a local decision based on your
storage topology and pool and vdev structure (and on your durability
needs, and even on how staffing is handled, eg if you have a 24/7 on-call
rotation). I don't think there is any single global right answer; hot
spares will be good for some people and bad for others.

Locally we use mirrored vdevs, multiple pools, an iSCSI SAN to connect to
actual disks, multiple backend disk controllers, and no 24/7 on-call
setup. We've developed an automated spares-handling system that knows a
great deal about our local storage topology (so it knows what are 'good'
and 'bad' spares for any particular bad disk, using various criteria),
and having it available has been very helpful in the face of various
things going wrong, both individual disk failures and entire backend
disk controllers suffering power failures after the end of the workday.
Our solution is of course very local, but the important thing is that
it's clear that automating this has been the right tradeoff *for us*.

(In another environment it would probably be the wrong answer, eg if we
had a 24/7 NOC staffed with people to swap physical disks and hardware
at any time of the day, night, or holidays, and a 24/7 on-call sysadmin
to do system things like 'zpool replace'. There are other parts of the
university which do have this. I suspect that they don't use an
automated spares system of any kind, although I don't know for sure.)
- cks From lotheac at iki.fi Thu Oct 8 07:59:20 2015 From: lotheac at iki.fi (Lauri Tirkkonen) Date: Thu, 8 Oct 2015 10:59:20 +0300 Subject: [OmniOS-discuss] zfs recv assertion failed when scrubbing source pool Message-ID: <20151008075920.GA10733@gutsman.lotheac.fi> We're sending nightly incremental replication snapshots of a large filesystem tree (about 3900 filesystems) to a backup host. It's been working mostly okay - we scrub the source pool every month and that hasn't had any effect on the sends/receives. However, on Sep 21, I upgraded the backup host from entire at 11-0.151014:20150402T192159Z to entire at 11-0.151014:20150914T123242Z, and during the zpool scrub on the source host at the start of October we got this: Assertion failed: ilen <= SPA_MAXBLOCKSIZE, file ../common/libzfs_sendrecv.c, line 1706, function recv_read It seemed to be a transient failure as I was at first unable to reproduce it, but firing off another scrub on the source pool did cause it to happen again the following night, when the scrub was still running. I further upgraded the backup host to the Sep29 151014 update (which apparently didn't bump the 'entire' version), and it's still happening. The source host is currently in production and still running omnios-170cea2 (or entire at 11-0.151014:20150402T192159Z); we're scheduled to upgrade it next Monday. It had a cache device up until Dan's recent advice to remove it; I suspected maybe we'd been hit by corruption, but that doesn't explain why the assertion happens only when the source pool is scrubbing. We use this kind of command to send snapshots to the backup host: zfs send -R -i $yesterday ${filesystem}@today | ssh backuphost zfs recv -ud $targetfs We're not running either send or recv as root, opting to use delegations instead. Don't know if that's relevant or not. Any clues? -- Lauri Tirkkonen | lotheac @ IRCnet From peter.tribble at gmail.com Thu Oct 8 14:35:50 2015 From: peter.tribble at gmail.com (Peter Tribble) Date: Thu, 8 Oct 2015 15:35:50 +0100 Subject: [OmniOS-discuss] pkg verify failing on pyc files Message-ID: I'm using pkg verify to ensure that nothing has been tampered with.
Unfortunately, I keep getting verify errors like the following: pkg://omnios/library/python-2/ply ERROR file: usr/lib/python2.6/vendor-packages/ply/__init__.pyc Group: 'root (0)' should be 'bin (2)' Size: 178 bytes should be 176 Hash: d769283e99c45552467e95e55e5f5a3df00875b4 should be d2f6ea4ff88fd7a35b5e56b7d34dd71a72479fd6 file: usr/lib/python2.6/vendor-packages/ply/lex.pyc Group: 'root (0)' should be 'bin (2)' Size: 26838 bytes should be 26728 Hash: 00fb25fb4ab79ec5cb85fb745ff0996ab34230a8 should be 1c5b1e78d4531bcc2c0c202c153531377b0cc17f file: usr/lib/python2.6/vendor-packages/ply/yacc.pyc Group: 'root (0)' should be 'bin (2)' Size: 63183 bytes should be 62924 Hash: 4064718466fb95fc61eb5c585e8463b1f68ce884 should be 67d904213ad4ab85d34e6564920acda6582deb9a pkg://omnios/library/python-2/pybonjour ERROR file: usr/lib/python2.6/vendor-packages/pybonjour.pyc Group: 'root (0)' should be 'bin (2)' Size: 54053 bytes should be 53919 Hash: 361960dab53ecc51d163b6bfd840309115db41c9 should be 68717718c8a8ac2bfb21a1442e6f143218e7cb60 pkg://omnios/library/python-2/pyopenssl-26 ERROR file: usr/lib/python2.6/vendor-packages/OpenSSL/__init__.pyc Group: 'root (0)' should be 'bin (2)' Size: 959 bytes should be 957 Hash: 751a19a02c3fd4bbd6f33aec2c4cb3de0cd79af4 should be b6d57b6ec31252af5fff3098bc80e9ece560f91d file: usr/lib/python2.6/vendor-packages/OpenSSL/version.pyc Group: 'root (0)' should be 'bin (2)' Size: 253 bytes should be 251 Hash: 4062c7a082e198a3e3a8208e19776df8903ec255 should be 3b4fc0e99a45ffdb472d17dde5d9bfc0e74f4fa2 It looks like something is deciding to recompile the pyc files, which ends up changing them. Is there any way to stop this, to keep pkg verify clean? Thanks, -- -Peter Tribble http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From danmcd at omniti.com Thu Oct 8 15:12:41 2015 From: danmcd at omniti.com (Dan McDonald) Date: Thu, 8 Oct 2015 11:12:41 -0400 Subject: [OmniOS-discuss] pkg verify failing on pyc files In-Reply-To: References: Message-ID: <37E9FE40-E0B7-4E9C-A201-7D18749E5306@omniti.com> > On Oct 8, 2015, at 10:35 AM, Peter Tribble wrote: > > It looks like something is deciding to recompile the pyc files, which ends up > changing them. > > Is there any way to stop this, to keep pkg verify clean? You'll notice several workarounds in omnios-build. For example from python26-coverage: # Prevents pkgdepend from freaking out. set pkg.depend.bypass-generate .* > set pkg.depend.bypass-generate .* > set pkg.depend.bypass-generate .* > Here's Tim Foster's blog entry about it: https://timsfoster.wordpress.com/2011/02/24/pkgdepend-improvements/ You may need to bypass-generate a few things. Dan From bmx1955 at gmail.com Thu Oct 8 15:14:57 2015 From: bmx1955 at gmail.com (Mick Burns) Date: Thu, 8 Oct 2015 11:14:57 -0400 Subject: [OmniOS-discuss] big zfs storage? In-Reply-To: <20151008013643.2D31F7A06B2@apps0.cs.toronto.edu> References: <20151008013643.2D31F7A06B2@apps0.cs.toronto.edu> Message-ID: Thanks everyone who answered, very insightful. What scares me the most is hearing about the panics and FMA not having time to react at all, and also stories of sub-optimal multi hot-spare kicking into action like described by Chip. Recipe for disaster. I guess this is an area where Nexenta has worked hard in implementing their own graceful handling of various tested failure scenarios. However you're covered if and only if you have a system conforming to their HCL. 
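Before trusting any automation with spares, the stock illumos tooling makes it straightforward to audit what the fault manager has actually been deciding; these are standard commands, listed here only as a starting point:

    fmadm faulty     # current faults and the resources they affect
    fmdump -v        # history of diagnosed faults
    fmdump -eV       # raw error telemetry behind those diagnoses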
This goes in-line with what Chris has implemented where he works; very customized to their environment and policies. On Wed, Oct 7, 2015 at 9:36 PM, Chris Siebenmann wrote: >> I completely concur with Richard on this. Let me give a real example >> that emphasizes this point as it's a critical design decision. > [...] >> Now I only run one hot spare per pool. Most of my pools are raidz2 or >> raidz3. This way any event like this cannot take out more than one >> disk and data parity will never be lost. >> >> There are other causes that can trigger multiple disk replacements. I >> have not encountered them. If I do, they won't hurt my data with the >> limit of one hot spare. > > My view is that spare handling needs to be a local decision based on > your storage topology and pool and vdev structure (and on your durability > needs, and even on how staffing is handled, eg if you have a 24/7 on > call rotation). I don't think there is any single global right answer; > hot spares will be good for some people and bad for others. > > Locally we use mirrored vdevs, multiple pools, an iSCSI SAN to connect > to actual disks, multiple backend disk controllers, and no 24/7 on call > setup. We've developed an automated spares handling system that knows a > great deal about our local storage topology (so it knows what are 'good' > and 'bad' spares for any particular bad disk, using various criteria) > and having it available has been very helpful in the face of various > things going wrong, both individual disk failures and entire backend > disk controllers suffering power failures after the end of the workday. > Our solution is of course very local, but the important thing is that > it's clear that automating this has been the right tradeoff *for us*. > > (In another environment it would probably be the wrong answer, eg if we > had a 24/7 NOC staffed with people to swap physical disks and hardware > at any time of the day, night, or holidays, and a 24/7 on call sysadmin > to do system things like 'zpool replace'. There are other parts of > the university which do have this. I suspect that they don't use an > automated spares system of any kind, although I don't know for sure.) > > - cks > _______________________________________________ > OmniOS-discuss mailing list > OmniOS-discuss at lists.omniti.com > http://lists.omniti.com/mailman/listinfo/omnios-discuss From peter.tribble at gmail.com Thu Oct 8 17:56:51 2015 From: peter.tribble at gmail.com (Peter Tribble) Date: Thu, 8 Oct 2015 18:56:51 +0100 Subject: [OmniOS-discuss] pkg verify failing on pyc files In-Reply-To: <37E9FE40-E0B7-4E9C-A201-7D18749E5306@omniti.com> References: <37E9FE40-E0B7-4E9C-A201-7D18749E5306@omniti.com> Message-ID: On Thu, Oct 8, 2015 at 4:12 PM, Dan McDonald wrote: > > > On Oct 8, 2015, at 10:35 AM, Peter Tribble > wrote: > > > > It looks like something is deciding to recompile the pyc files, which > ends up > > changing them. > > > > Is there any way to stop this, to keep pkg verify clean? > > You'll notice several workarounds in omnios-build. For example from > python26-coverage: > > # Prevents pkgdepend from freaking out. > set > pkg.depend.bypass-generate .* > > set > pkg.depend.bypass-generate .* > > set > pkg.depend.bypass-generate .* > > > Here's Tim Foster's blog entry about it: > > > https://timsfoster.wordpress.com/2011/02/24/pkgdepend-improvements/ > > You may need to bypass-generate a few things. This is all coming from omnios so it wouldn't be me who would be making changes...
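For anyone following along, one way to compare what pkg(5) expects against what is on disk, using one of the failing packages from the report above (pkg file hashes are SHA-1, so digest(1) can reproduce them):

    pkg verify library/python-2/ply
    pkg contents -m library/python-2/ply | grep '\.pyc'             # hashes pkg expects
    digest -a sha1 /usr/lib/python2.6/vendor-packages/ply/lex.pyc   # hash on disk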
I'm a little more confused than I was, though. The .pyc files encode the python version and the metadata of the source file (in particular, its timestamp) in the .pyc file. That's all correct. The packages appear to have been published using pkgsend -T so that the timestamps on the .py files are preserved, and they match the timestamps encoded into the packaged .pyc files.. Which makes me wonder even more why it's found it necessary to recompile the files, it's not the normal python version or source timestamp mismatch. The only thing I notice is that the original (packaged) .pyc file has site-packages encoded in it, whereas the recompiled version has the vendor-packages path (which is where the files are installed to). As far as I can tell that's the only difference in the contents of the .pyc files. Curious... -- -Peter Tribble http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From lotheac at iki.fi Thu Oct 8 19:37:00 2015 From: lotheac at iki.fi (Lauri Tirkkonen) Date: Thu, 8 Oct 2015 22:37:00 +0300 Subject: [OmniOS-discuss] pkg verify failing on pyc files In-Reply-To: References: <37E9FE40-E0B7-4E9C-A201-7D18749E5306@omniti.com> Message-ID: <20151008193700.GC26155@gutsman.lotheac.fi> On Thu, Oct 08 2015 18:56:51 +0100, Peter Tribble wrote: > That's all correct. The packages appear to have been published using > pkgsend -T so > that the timestamps on the .py files are preserved, and they match the > timestamps > encoded into the packaged .pyc files.. > > Which makes me wonder even more why it's found it necessary to recompile > the files, it's not the normal python version or source timestamp mismatch. I think I found a clue for this. I checked an old box I've upgraded through multiple releases and sure enough, .pyc files had been regenerated after install for, among other things, the simplejson-26 package. I checked the timestamps for one such .py/.pyc pair: -rw-r--r-- 1 root bin 1036 Jul 22 2014 /usr/lib/python2.6/vendor-packages/simplejson/compat.py -rw-r--r-- 1 root root 2040 Apr 3 2015 /usr/lib/python2.6/vendor-packages/simplejson/compat.pyc But this is strange - surely the package is newer than July 2014 on this 151014 box? % pkg list -Hv simplejson-26 pkg://omnios/library/python-2/simplejson-26 at 3.6.5-0.151014:20150402T184431Z i-- Yes, yes it is, and it *was* built using pkgsend -T '*.py'. However, the commit in omnios-build introducing that was authored on Mar 25 2015 (8cc8f3ef45d9c7d8ccdfda608d00599cd3890597). My theory is that if the file content does not change, even if pkgsend -T is used to preserve the timestamps, the file is not touched on update; would that help explain what you're seeing? -- Lauri Tirkkonen | lotheac @ IRCnet From eric.sproul at circonus.com Thu Oct 8 20:23:35 2015 From: eric.sproul at circonus.com (Eric Sproul) Date: Thu, 8 Oct 2015 16:23:35 -0400 Subject: [OmniOS-discuss] pkg verify failing on pyc files In-Reply-To: <20151008193700.GC26155@gutsman.lotheac.fi> References: <37E9FE40-E0B7-4E9C-A201-7D18749E5306@omniti.com> <20151008193700.GC26155@gutsman.lotheac.fi> Message-ID: On Thu, Oct 8, 2015 at 3:37 PM, Lauri Tirkkonen wrote: > % pkg list -Hv simplejson-26 > pkg://omnios/library/python-2/simplejson-26 at 3.6.5-0.151014:20150402T184431Z i-- > > Yes, yes it is, and it *was* built using pkgsend -T '*.py'. However, > the commit in omnios-build introducing that was authored on Mar 25 2015 > (8cc8f3ef45d9c7d8ccdfda608d00599cd3890597). 
My theory is that if the > file content does not change, even if pkgsend -T is used to preserve the > timestamps, the file is not touched on update; would that help explain > what you're seeing? Since .pyc files are evidently locally modified outside of pkg(5), would it make sense to mark them as such in their manifests, i.e. setting the "preserve" attribute? This makes pkg verify not report differences from the installed manifest. Perhaps not a total win, though, as it would mean potentially losing upgrade content, unless preserve was set to "renameold" or some such. Eric From peter.tribble at gmail.com Thu Oct 8 20:31:07 2015 From: peter.tribble at gmail.com (Peter Tribble) Date: Thu, 8 Oct 2015 21:31:07 +0100 Subject: [OmniOS-discuss] pkg verify failing on pyc files In-Reply-To: <20151008193700.GC26155@gutsman.lotheac.fi> References: <37E9FE40-E0B7-4E9C-A201-7D18749E5306@omniti.com> <20151008193700.GC26155@gutsman.lotheac.fi> Message-ID: On Thu, Oct 8, 2015 at 8:37 PM, Lauri Tirkkonen wrote: > On Thu, Oct 08 2015 18:56:51 +0100, Peter Tribble wrote: > > That's all correct. The packages appear to have been published using > > pkgsend -T so > > that the timestamps on the .py files are preserved, and they match the > > timestamps > > encoded into the packaged .pyc files.. > > > > Which makes me wonder even more why it's found it necessary to recompile > > the files, it's not the normal python version or source timestamp > mismatch. > > I think I found a clue for this. I checked an old box I've upgraded > through multiple releases That seems to be the key point. I'm seeing it on upgraded boxes. Just checked a fresh install, that's clean. Going back, in earlier omnios releases the .py files didn't have fixed timestamps. Some of them (pybonjour.py for example) should be dated 2008. As a result, the older releases pretty much always rebuilt the pyc files. The upgrade process correctly sets the timestamp on the .py files. That bit appears to work. Presumably, it also should put the correct .pyc file from the repo, but that doesn't seem to work correctly. I suspect some sort of race between python explicitly writing the .pyc file from the repo and python recompiling the .pyc file because the one it had becomes invalid as soon as the timestamp on the .py file gets updated. In any event, I suspect that if you run pkg fix after the upgrade, then because everything now matches up, it'll be good in the future. At least, I've done that on one system and it hasn't deviated since. > and sure enough, .pyc files had been > regenerated after install for, among other things, the simplejson-26 > package. I checked the timestamps for one such .py/.pyc pair: > > -rw-r--r-- 1 root bin 1036 Jul 22 2014 > /usr/lib/python2.6/vendor-packages/simplejson/compat.py > -rw-r--r-- 1 root root 2040 Apr 3 2015 > /usr/lib/python2.6/vendor-packages/simplejson/compat.pyc > > But this is strange - surely the package is newer than July 2014 on this > 151014 box? > > % pkg list -Hv simplejson-26 > pkg://omnios/library/python-2/simplejson-26 at 3.6.5-0.151014:20150402T184431Z > i-- > > Yes, yes it is, and it *was* built using pkgsend -T '*.py'. However, > the commit in omnios-build introducing that was authored on Mar 25 2015 > (8cc8f3ef45d9c7d8ccdfda608d00599cd3890597). My theory is that if the > file content does not change, even if pkgsend -T is used to preserve the > timestamps, the file is not touched on update; would that help explain > what you're seeing? 
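A quick way to test that theory on a suspect pair: CPython 2 records the source mtime it compiled against in bytes 4-7 of the .pyc header, so on a little-endian (x86) box the two timestamps can be compared directly. A sketch using the pair from Lauri's example:

    PY=/usr/lib/python2.6/vendor-packages/simplejson/compat.py
    od -A n -t u4 -j 4 -N 4 ${PY}c    # mtime stored inside the .pyc, in epoch seconds
    ls -E ${PY}                       # full-resolution timestamp of the source

If the two disagree, python recompiles the .pyc on the next import, which would match the regeneration being reported.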
> > -- > Lauri Tirkkonen | lotheac @ IRCnet > _______________________________________________ > OmniOS-discuss mailing list > OmniOS-discuss at lists.omniti.com > http://lists.omniti.com/mailman/listinfo/omnios-discuss > -- -Peter Tribble http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From lotheac at iki.fi Thu Oct 8 20:32:50 2015 From: lotheac at iki.fi (Lauri Tirkkonen) Date: Thu, 8 Oct 2015 23:32:50 +0300 Subject: [OmniOS-discuss] pkg verify failing on pyc files In-Reply-To: References: <37E9FE40-E0B7-4E9C-A201-7D18749E5306@omniti.com> <20151008193700.GC26155@gutsman.lotheac.fi> Message-ID: <20151008203250.GA26977@gutsman.lotheac.fi> On Thu, Oct 08 2015 16:23:35 -0400, Eric Sproul wrote: > On Thu, Oct 8, 2015 at 3:37 PM, Lauri Tirkkonen wrote: > > % pkg list -Hv simplejson-26 > > pkg://omnios/library/python-2/simplejson-26 at 3.6.5-0.151014:20150402T184431Z i-- > > > > Yes, yes it is, and it *was* built using pkgsend -T '*.py'. However, > > the commit in omnios-build introducing that was authored on Mar 25 2015 > > (8cc8f3ef45d9c7d8ccdfda608d00599cd3890597). My theory is that if the > > file content does not change, even if pkgsend -T is used to preserve the > > timestamps, the file is not touched on update; would that help explain > > what you're seeing? > > Since .pyc files are evidently locally modified outside of pkg(5), > would it make sense to mark them as such in their manifests, i.e. > setting the "preserve" attribute? This makes pkg verify not report > differences from the installed manifest. > > Perhaps not a total win, though, as it would mean potentially losing > upgrade content, unless preserve was set to "renameold" or some such. One could just not ship them at all; I was just trying to find out why they're being regenerated after the package install. -- Lauri Tirkkonen | lotheac @ IRCnet From rjahnel at ellipseinc.com Thu Oct 8 22:28:55 2015 From: rjahnel at ellipseinc.com (Richard Jahnel) Date: Thu, 8 Oct 2015 22:28:55 +0000 Subject: [OmniOS-discuss] Two panics now while writing eager zeros to zvols Message-ID: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> VMware Esx 5.5 QLogic Fibre Omnios r14 I have 3 zvols on 3 zpools (1ea) I am attempting to get rid of the garbage in the empty space by creating eager zeroed vmdks in the free space on vmfs5 volumes backed by zfs zvols. The zvols are hosted on zpools with 2 cache ssds, 2 log ssds in mirror, and lz4 compression turned on. Twice in the past 24 hours the Omnios host has panicked after about 8 hours of writing eager zeros to one or more vmdks. Any ideas? Dump available upon request. [Ellipse Communications] Richard Jahnel | Senior Network Engineer Ellipse Communications - Corporate Office 14800 Quorum Dr, Suite 420 Dallas, TX 75254 TF: 888-678-3869 | F: 972-479-9115 Email * Website * Facebook * Twitter ________________________________ The content of this e-mail (including any attachments) is strictly confidential and may be commercially sensitive. If you are not, or believe you may not be, the intended recipient, please advise the sender immediately by return e-mail, delete this e-mail and destroy any copies.
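For reference, a minimal mdb session for pulling the basics out of a saved crash dump, of the sort requested in the next reply (the dump path and number here are the ones that appear later in this thread):

    # mdb /var/crash/unknown/unix.4 /var/crash/unknown/vmcore.4
    > ::status    # panic string and dump metadata
    > $c          # kernel stack at the time of panic
    > ::msgbuf    # console messages leading up to the panic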
From daleg at omniti.com Thu Oct 8 23:09:06 2015 From: daleg at omniti.com (Dale Ghent) Date: Thu, 8 Oct 2015 19:09:06 -0400 Subject: [OmniOS-discuss] Two panics now while writing eager zeros to zvols In-Reply-To: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> Message-ID: <0B6890AE-EC67-43C9-815D-1BAB2924EB53@omniti.com> A stack trace from the panic would be good to know in general, to see if it matches up with an already-known issue. /dale > On Oct 8, 2015, at 6:28 PM, Richard Jahnel wrote: > > VMware Esx 5.5 > > QLogic Fibre > > Omnios r14 > > I have 3 zvols on 3 zpools (1ea) > > I am attempting to get rid of the garbage in the empty space by creating eager zeroed vmdks in the free space on vmfs5 volumes backed by zfs zvols. > > The zvols are hosted on zpools with 2 cache ssds, 2 log ssds in mirror, and lz4 compression turned on. > > Twice in the past 24 hours the Omnios host has panicked after about 8 hours of writing eager zeros to one or more vmdks. > > Any ideas? Dump available upon request. > > > > Richard Jahnel | Senior Network Engineer > Ellipse Communications - Corporate Office > 14800 Quorum Dr, Suite 420 Dallas, TX 75254 > TF: 888-678-3869 | F: 972-479-9115 > Email * Website * Facebook * Twitter > > > The content of this e-mail (including any attachments) is strictly confidential and may be commercially sensitive. If you are not, or believe you may not be, the intended recipient, please advise the sender immediately by return e-mail, delete this e-mail and destroy any copies. > > _______________________________________________ > OmniOS-discuss mailing list > OmniOS-discuss at lists.omniti.com > http://lists.omniti.com/mailman/listinfo/omnios-discuss From martin.truhlar at archcon.cz Fri Oct 9 10:53:38 2015 From: martin.truhlar at archcon.cz (Martin Truhlář) Date: Fri, 9 Oct 2015 12:53:38 +0200 Subject: [OmniOS-discuss] iSCSI poor write performance In-Reply-To: References: <15C9B79E-7BC4-4C01-9660-FFD64353304D@omniti.com> <8D1002D9-69E2-4857-945A-746B821B27A1@omniti.com> Message-ID: So I've moved a bit. I've disabled write synchronisation and voila! From writing hell straight to heaven: 7MB/s -> 245MB/s. I'm aware this is not a very secure solution, but it works. There is some problem with the ZIL SSDs, right? I've used new mirrored Intel SSD 530s, connected directly to the HBA. Any advice before I buy a new pair of SSDs? I'm still a little disappointed with 4K queued writing, which I would expect to be higher. Now it is 20k IOPS for reading and 15k IOPS for writing, but the Intel SSD 530 is capable of 24k IOPS for reading and 80k IOPS for writing. Actually, I don't know what performance to expect... Martin -----Original Message----- From: Martin Truhlář [mailto:martin.truhlar at archcon.cz] Sent: Wednesday, September 23, 2015 10:51 AM To: Dan McDonald Cc: omnios-discuss at lists.omniti.com Subject: Re: [OmniOS-discuss] iSCSI poor write performance Tests revealed that the problem is somewhere in the disk array itself. Write performance of a disk connected directly (via iSCSI) to KVM is poor as well; even write performance measured on Omnios is very poor. So the loop is tightening, but there still remain a lot of possible hacks. I strove to use professional hw (disks included), so I would try to seek the error in the software setup first. Do you have any ideas where to search first (and second, third...)? FYI mirror 5 was added lately to the running pool.
pool: dpool state: ONLINE scan: scrub repaired 0 in 5h33m with 0 errors on Sun Sep 20 00:33:15 2015 config: NAME STATE READ WRITE CKSUM CAP Product /napp-it IOstat mess dpool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c1t50014EE00400FA16d0 ONLINE 0 0 0 1 TB WDC WD1002F9YZ-0 S:0 H:0 T:0 c1t50014EE2B40F14DBd0 ONLINE 0 0 0 1 TB WDC WD1003FBYX-0 S:0 H:0 T:0 mirror-1 ONLINE 0 0 0 c1t50014EE05950B131d0 ONLINE 0 0 0 1 TB WDC WD1002F9YZ-0 S:0 H:0 T:0 c1t50014EE2B5E5A6B8d0 ONLINE 0 0 0 1 TB WDC WD1003FBYZ-0 S:0 H:0 T:0 mirror-2 ONLINE 0 0 0 c1t50014EE05958C51Bd0 ONLINE 0 0 0 1 TB WDC WD1002F9YZ-0 S:0 H:0 T:0 c1t50014EE0595617ACd0 ONLINE 0 0 0 1 TB WDC WD1002F9YZ-0 S:0 H:0 T:0 mirror-3 ONLINE 0 0 0 c1t50014EE0AEAE7540d0 ONLINE 0 0 0 1 TB WDC WD1002F9YZ-0 S:0 H:0 T:0 c1t50014EE0AEAE9B65d0 ONLINE 0 0 0 1 TB WDC WD1002F9YZ-0 S:0 H:0 T:0 mirror-5 ONLINE 0 0 0 c1t50014EE0AEABB8E7d0 ONLINE 0 0 0 1 TB WDC WD1002F9YZ-0 S:0 H:0 T:0 c1t50014EE0AEB44327d0 ONLINE 0 0 0 1 TB WDC WD1002F9YZ-0 S:0 H:0 T:0 logs mirror-4 ONLINE 0 0 0 c1t55CD2E404B88ABE1d0 ONLINE 0 0 0 120 GB INTEL SSDSC2BW12 S:0 H:0 T:0 c1t55CD2E404B88E4CFd0 ONLINE 0 0 0 120 GB INTEL SSDSC2BW12 S:0 H:0 T:0 cache c1t55CD2E4000339A59d0 ONLINE 0 0 0 180 GB INTEL SSDSC2BW18 S:0 H:0 T:0 spares c2t2d0 AVAIL 1 TB WDC WD10EFRX-68F S:0 H:0 T:0 errors: No known data errors Martin -----Original Message----- From: Dan McDonald [mailto:danmcd at omniti.com] Sent: Wednesday, September 16, 2015 1:51 PM To: Martin Truhl?? Cc: omnios-discuss at lists.omniti.com; Dan McDonald Subject: Re: [OmniOS-discuss] iSCSI poor write performance > On Sep 16, 2015, at 4:04 AM, Martin Truhl?? wrote: > > Yes, I'm aware, that problem can be hidden in many places. > MTU is 1500. All nics and their setup are included at this email. Start by making your 10GigE network use 9000 MTU. You'll need to configure this on both ends (is this directly-attached 10GigE? Or over a switch?). Dan _______________________________________________ OmniOS-discuss mailing list OmniOS-discuss at lists.omniti.com http://lists.omniti.com/mailman/listinfo/omnios-discuss -------------- next part -------------- A non-text attachment was scrubbed... Name: before.PNG Type: image/png Size: 41935 bytes Desc: before.PNG URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: after.PNG Type: image/png Size: 49541 bytes Desc: after.PNG URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: pools.PNG Type: image/png Size: 24757 bytes Desc: pools.PNG URL: From danmcd at omniti.com Fri Oct 9 17:46:43 2015 From: danmcd at omniti.com (Dan McDonald) Date: Fri, 9 Oct 2015 13:46:43 -0400 Subject: [OmniOS-discuss] Two panics now while writing eager zeros to zvols In-Reply-To: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> Message-ID: <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> > On Oct 8, 2015, at 6:28 PM, Richard Jahnel wrote: > > Any ideas? Dump available upon request. 
> I have to go, but I took a quick look at your one dump: > $c vpanic() hati_pte_map+0x3ab(ffffff16fb3dde70, 6f, ffffff007aef5358, 800000101cecd007, 0, 0) hati_load_common+0x139(ffffff15ae13ca88, 806f000, ffffff007aef5358, 40b, 0, 0) hat_memload+0x75(ffffff15ae13ca88, 806f000, ffffff007aef5358, b, 0) segvn_faultpage+0x730(ffffff15ae13ca88, ffffff169c042ee8, 806f000, d000, 0, ffffff009a23fb50) segvn_fault+0x8e6(ffffff15ae13ca88, ffffff169c042ee8, 806f000, 1000, 1, 2) as_fault+0x31a(ffffff15ae13ca88, ffffff168f631de0, 806ff20, 1, 1, 2) pagefault+0x96(806ff20, 1, 2, 0) trap+0x2c7(ffffff009a23ff10, 806ff20, b) 0xfffffffffb8001d6() > ::status debugging crash dump vmcore.4 (64-bit) from vstore1 operating system: 5.11 omnios-f090f73 (i86pc) image uuid: 37ff548e-1a7e-48b7-cbc1-9577366cda82 panic message: hati_pte_map: flags & HAT_LOAD_REMAP dump content: kernel pages only > This is a panic while servicing a page fault. Usually when I see something like this, I have to ask if your HW is okay. ("fmadm faulty" show anything?) Beyond that, I'll need to dig deeper, but I can't now. Just wanted to let you know the stuff on the surface, at least. Thanks for sharing the dump! Dan From danmcd at omniti.com Fri Oct 9 17:48:36 2015 From: danmcd at omniti.com (Dan McDonald) Date: Fri, 9 Oct 2015 13:48:36 -0400 Subject: [OmniOS-discuss] Two panics now while writing eager zeros to zvols In-Reply-To: <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> Message-ID: Process for this thread was: R 1178 520 1178 1178 0 0x42014000 ffffff15b6cfc080 VS20Vol20_snapc y T 0xffffff15a1ea4780 Dan From gate03 at landcroft.co.uk Fri Oct 9 23:40:39 2015 From: gate03 at landcroft.co.uk (Michael Mounteney) Date: Sat, 10 Oct 2015 09:40:39 +1000 Subject: [OmniOS-discuss] ISC-DHCPD in a zone Message-ID: <20151010094039.4fde856f@pimple.landy.net> I'm sure this has been done before but I can't find anything in my archive of this list. In a zone, isc-dhcpd is failing on start because it can't obtain the details of an interface. Zone# ifconfig -a lo0:2: flags=2001000849 mtu 8232 index 1 inet 127.0.0.1 netmask ff000000 e1000g1:2: flags=1100843 mtu 1500 index 2 inet 192.168.1.2 netmask ffffff00 broadcast 192.168.1.255 lo0:2: flags=2002000849 mtu 8252 index 1 inet6 ::1/128 According to various sources, one can specify interfaces on the command line to restrict dhcpd: /usr/sbin/dhcpd -cf /etc/dhcpd.conf -lf /var/db/dhcpd.leases -p 67 -s 192.168.1.2 e1000g1 binding to user-specified port 67 Internet Systems Consortium DHCP Server 4.3.1 Copyright 2004-2014 Internet Systems Consortium. All rights reserved. For info, please visit https://www.isc.org/software/dhcp/ Config file: /etc/dhcpd.conf Database file: /var/db/dhcpd.leases PID file: /var/run/dhcpd.pid irs_resconf_load failed: 59. Unable to set resolver from resolv.conf; startup continuing but DDNS support may be affected Wrote 0 deleted host decls to leases file. Wrote 0 new dynamic host decls to leases file. Wrote 0 leases to leases file. Error getting interface flags for 'lo0:2'; No such device or address Error getting interface information. If you think you have received this message due to a bug rather than a configuration issue please read the section on submitting bugs on either our web page at www.isc.org or in the README file before submitting a bug. These pages explain the proper process and the information we find helpful for debugging.. exiting. 
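For reference, the exclusive-IP approach suggested in the replies below would look roughly like this (link and zone names are hypothetical); with its own stack, the zone gets plain interfaces instead of the lo0:2/e1000g1:2 aliases dhcpd is tripping over:

    # dladm create-vnic -l e1000g1 dhcp0
    # zonecfg -z services
    zonecfg:services> set ip-type=exclusive
    zonecfg:services> add net
    zonecfg:services:net> set physical=dhcp0
    zonecfg:services:net> end
    zonecfg:services> commit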
Specifying e1000g1:2 makes no difference. Obviously normally isc-dhcpd is run via a service, but the above command is what is eventually execed and it fails with the above message in the log. Really I don't care about that lo0:2 interface. Is it the unconfigured ipv6 ? If I could get rid of that, it would solve my problem. Any help ? Either restrict isc-dhcpd or eliminate the interface. ______________ Michael Mounteney From rjahnel at ellipseinc.com Sat Oct 10 03:32:08 2015 From: rjahnel at ellipseinc.com (Richard Jahnel) Date: Sat, 10 Oct 2015 03:32:08 +0000 Subject: [OmniOS-discuss] Two panics now while writing eager zeros to zvols In-Reply-To: <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> Message-ID: <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> Faulty output for yesterday: --------------- ------------------------------------ -------------- --------- TIME EVENT-ID MSG-ID SEVERITY --------------- ------------------------------------ -------------- --------- Oct 08 16:18:59 37ff548e-1a7e-48b7-cbc1-9577366cda82 SUNOS-8000-KL Major Host : vstore1 Platform : PowerEdge-R510Chassis_id : 8K307S1 Product_sn : Fault class : defect.sunos.kernel.panic Affects : sw:///:path=/var/crash/unknown/.37ff548e-1a7e-48b7-cbc1-9577366cda82 faulted but still in service Problem in : sw:///:path=/var/crash/unknown/.37ff548e-1a7e-48b7-cbc1-9577366cda82 faulted but still in service Description : The system has rebooted after a kernel panic. Refer to http://illumos.org/msg/SUNOS-8000-KL for more information. Response : The failed system image was dumped to the dump device. If savecore is enabled (see dumpadm(1M)) a copy of the dump will be written to the savecore directory /var/crash/unknown. Impact : There may be some performance impact while the panic is copied to the savecore directory. Disk space usage by panics can be substantial. Action : If savecore is not enabled then please take steps to preserve the crash image. Use 'fmdump -Vp -u 37ff548e-1a7e-48b7-cbc1-9577366cda82' to view more panic detail. Please refer to the knowledge article for additional information. -----Original Message----- From: Dan McDonald [mailto:danmcd at omniti.com] Sent: Friday, October 09, 2015 12:47 PM To: Richard Jahnel Cc: omnios-discuss at lists.omniti.com; imemo; Dan McDonald Subject: Re: [OmniOS-discuss] Two panics now while writing eager zeros to zvols > On Oct 8, 2015, at 6:28 PM, Richard Jahnel wrote: > > Any ideas? Dump available upon request. > I have to go, but I took a quick look at your one dump: > $c vpanic() hati_pte_map+0x3ab(ffffff16fb3dde70, 6f, ffffff007aef5358, 800000101cecd007, 0, 0) hati_load_common+0x139(ffffff15ae13ca88, 806f000, ffffff007aef5358, 40b, 0, 0) hat_memload+0x75(ffffff15ae13ca88, 806f000, ffffff007aef5358, b, 0) segvn_faultpage+0x730(ffffff15ae13ca88, ffffff169c042ee8, 806f000, d000, 0, ffffff009a23fb50) segvn_fault+0x8e6(ffffff15ae13ca88, ffffff169c042ee8, 806f000, 1000, 1, 2) as_fault+0x31a(ffffff15ae13ca88, ffffff168f631de0, 806ff20, 1, 1, 2) pagefault+0x96(806ff20, 1, 2, 0) trap+0x2c7(ffffff009a23ff10, 806ff20, b) 0xfffffffffb8001d6() > ::status debugging crash dump vmcore.4 (64-bit) from vstore1 operating system: 5.11 omnios-f090f73 (i86pc) image uuid: 37ff548e-1a7e-48b7-cbc1-9577366cda82 panic message: hati_pte_map: flags & HAT_LOAD_REMAP dump content: kernel pages only > This is a panic while servicing a page fault. 
Usually when I see something like this, I have to ask if your HW is okay. ("fmadm faulty" show anything?) Beyond that, I'll need to dig deeper, but I can't now. Just wanted to let you know the stuff on the surface, at least. Thanks for sharing the dump! Dan ________________________________ The content of this e-mail (including any attachments) is strictly confidential and may be commercially sensitive. If you are not, or believe you may not be, the intended recipient, please advise the sender immediately by return e-mail, delete this e-mail and destroy any copies. From danmcd at omniti.com Sat Oct 10 03:33:39 2015 From: danmcd at omniti.com (Dan McDonald) Date: Fri, 9 Oct 2015 23:33:39 -0400 Subject: [OmniOS-discuss] Two panics now while writing eager zeros to zvols In-Reply-To: <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> Message-ID: That's just the "you had a kernel panic" message. Shoot. I was hoping for hardware problems. What is that process I mentioned -- VS20Vol20_snapcy ? What is it doing? It's driven from cron, but I can't tell much beyond that. (Most kernel dumps don't take in userspace text.) Dan From jimklimov at cos.ru Sat Oct 10 05:33:29 2015 From: jimklimov at cos.ru (Jim Klimov) Date: Sat, 10 Oct 2015 07:33:29 +0200 Subject: [OmniOS-discuss] ISC-DHCPD in a zone In-Reply-To: <20151010094039.4fde856f@pimple.landy.net> References: <20151010094039.4fde856f@pimple.landy.net> Message-ID: 10 ??????? 2015??. 1:40:39 CEST, Michael Mounteney ?????: >I'm sure this has been done before but I can't find anything in my >archive of this list. In a zone, isc-dhcpd is failing on start because >it can't obtain the details of an interface. > >Zone# ifconfig -a >lo0:2: flags=2001000849 mtu >8232 index 1 inet 127.0.0.1 netmask ff000000 >e1000g1:2: flags=1100843 >mtu 1500 index 2 inet 192.168.1.2 netmask ffffff00 broadcast >192.168.1.255 lo0:2: >flags=2002000849 mtu 8252 >index 1 inet6 ::1/128 > >According to various sources, one can specify interfaces on the command >line to restrict dhcpd: > >/usr/sbin/dhcpd -cf /etc/dhcpd.conf -lf /var/db/dhcpd.leases -p 67 -s >192.168.1.2 e1000g1 >binding to user-specified port 67 >Internet Systems Consortium DHCP Server 4.3.1 >Copyright 2004-2014 Internet Systems Consortium. >All rights reserved. >For info, please visit https://www.isc.org/software/dhcp/ >Config file: /etc/dhcpd.conf >Database file: /var/db/dhcpd.leases >PID file: /var/run/dhcpd.pid >irs_resconf_load failed: 59. >Unable to set resolver from resolv.conf; startup continuing but DDNS >support may be affected >Wrote 0 deleted host decls to leases file. >Wrote 0 new dynamic host decls to leases file. >Wrote 0 leases to leases file. >Error getting interface flags for 'lo0:2'; No such device or address >Error getting interface information. > >If you think you have received this message due to a bug rather >than a configuration issue please read the section on submitting >bugs on either our web page at www.isc.org or in the README file >before submitting a bug. These pages explain the proper >process and the information we find helpful for debugging.. > >exiting. > >Specifying e1000g1:2 makes no difference. > >Obviously normally isc-dhcpd is run via a service, but the above >command is what is eventually execed and it fails with the above >message in the log. 
> >Really I don't care about that lo0:2 interface. Is it the unconfigured >ipv6 ? If I could get rid of that, it would solve my problem. > >Any help ? Either restrict isc-dhcpd or eliminate the interface. > >______________ >Michael Mounteney >_______________________________________________ >OmniOS-discuss mailing list >OmniOS-discuss at lists.omniti.com >http://lists.omniti.com/mailman/listinfo/omnios-discuss With the alias interfaces in play - do you use a shared-ip zone? That may be the limit; try switching to exclusive-ip with dedicated vnic(s). Also see if any zone or process rbac privileges seem suitable additions to the service (especially if it works from shell and fails from SMF even as root): things like promiscuity or not-owned file access are dropped by default. Jim -- Typos courtesy of K-9 Mail on my Samsung Android From moo at wuffers.net Sat Oct 10 07:18:49 2015 From: moo at wuffers.net (wuffers) Date: Sat, 10 Oct 2015 03:18:49 -0400 Subject: [OmniOS-discuss] Two panics now while writing eager zeros to zvols In-Reply-To: References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> Message-ID: Is this the same bug I ran into in March? http://lists.omniti.com/pipermail/omnios-discuss/2015-March/004540.html I'm running a newer stmf_sbd that Dan made which solved my issue. It had something to do with the WRITE_SAME VAAI primitive, but I'm also running with COMSTAR. Dan was pretty busy preparing for R151014 at the time, so he hasn't had a chance to upstream it back. On Fri, Oct 9, 2015 at 11:33 PM, Dan McDonald wrote: > That's just the "you had a kernel panic" message. Shoot. I was hoping > for hardware problems. > > What is that process I mentioned -- VS20Vol20_snapcy ? What is it doing? > It's driven from cron, but I can't tell much beyond that. (Most kernel > dumps don't take in userspace text.) > > Dan > > _______________________________________________ > OmniOS-discuss mailing list > OmniOS-discuss at lists.omniti.com > http://lists.omniti.com/mailman/listinfo/omnios-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From johan.kragsterman at capvert.se Sat Oct 10 08:49:35 2015 From: johan.kragsterman at capvert.se (Johan Kragsterman) Date: Sat, 10 Oct 2015 10:49:35 +0200 Subject: [OmniOS-discuss] Ang: Re: Two panics now while writing eager zeros to zvols In-Reply-To: References: , <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> Message-ID: Hi! -----"OmniOS-discuss" skrev: ----- Till: Dan McDonald Fr?n: wuffers S?nt av: "OmniOS-discuss" Datum: 2015-10-10 09:20 Kopia: Richard Jahnel , imemo , "omnios-discuss at lists.omniti.com" ?rende: Re: [OmniOS-discuss] Two panics now while writing eager zeros to zvols Is this the same bug I ran into in March? http://lists.omniti.com/pipermail/omnios-discuss/2015-March/004540.html I'm running a newer?stmf_sbd that Dan made which solved my issue. It had something to do with the WRITE_SAME VAAI primitive, but I'm also running with COMSTAR. Dan was pretty busy preparing for R151014 at the time, so he hasn't had a chance to upstream it back. Dan, is this upstreamed at all...? 
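For anyone checking the ESXi side of this, the VAAI state of each device is visible from the host, and hardware-accelerated zeroing (the WRITE_SAME path in question) can be disabled as a blunt workaround while a fixed stmf_sbd is sorted out:

    esxcli storage core device vaai status get    # per-device VAAI primitive support
    esxcli system settings advanced set -o /DataMover/HardwareAcceleratedInit -i 0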
Rgrds Johan _______________________________________________ OmniOS-discuss mailing list OmniOS-discuss at lists.omniti.com http://lists.omniti.com/mailman/listinfo/omnios-discuss _______________________________________________ OmniOS-discuss mailing list OmniOS-discuss at lists.omniti.com http://lists.omniti.com/mailman/listinfo/omnios-discuss From gate03 at landcroft.co.uk Sat Oct 10 08:50:36 2015 From: gate03 at landcroft.co.uk (Michael Mounteney) Date: Sat, 10 Oct 2015 18:50:36 +1000 Subject: [OmniOS-discuss] ISC-DHCPD in a zone In-Reply-To: References: <20151010094039.4fde856f@pimple.landy.net> Message-ID: <20151010185036.3d904430@coomera> On Sat, 10 Oct 2015 07:33:29 +0200 Jim Klimov wrote: > With the alias interfaces in play - do you use a shared-ip zone? That > may be the limit; try switching to exclusive-ip with dedicated > vnic(s). That would explain why my setup notes (this is a fresh installation) have DHCP in its own zone and all other services (IMAP, version control repositories, TFTP, rsync server etc.) in another. It's not the answer for which I was hoping. It would be neater to have all services together in one zone and not have to run a second zone, just for one service. Is there another way? Anything else I can try? > Also see if any zone or process rbac privileges seem suitable > additions to the service (especially if it works from shell and fails > from SMF even as root): things like promiscuity or not-owned file > access are dropped by default. It's the same both from the command line and via a service. Thanks for your reply. ______________ Michael Mounteney From lists at marzocchi.net Sat Oct 10 16:09:05 2015 From: lists at marzocchi.net (Olaf Marzocchi) Date: Sat, 10 Oct 2015 18:09:05 +0200 Subject: [OmniOS-discuss] Maildir: ACLs/Unix perms: unlink(...) failed: Permission denied In-Reply-To: <56086833.7090507@marzocchi.net> References: <55FD6E91.7020505@marzocchi.net> <839515024ef34c25a9bbe682a454855c@valo.at> <56086833.7090507@marzocchi.net> Message-ID: <56193821.7010503@marzocchi.net> I solved the issue I mentioned some days ago. I checked in the logs the date the issue appeared, and I noticed it did not correspond to a dovecot update, so dovecot was not the culprit. The date also did not correspond to an update of OmniOS, and in any case the previous OmniOS update contained only userland updates. Since the issue appeared when I assigned ACLs to my home folder on the fileserver for the first time, to make it better compatible with SMB sharing, I decided the easiest way was to start a new ZFS dataset only for mail, splitting home folder and mail. $ zfs create -o compression=on tank/mail $ chgrp mail /tank/mail $ mkdir /tank/mail/olaf $ mv /tank/home/olaf/Maildir /tank/mail/olaf/ $ chown -R olaf:olaf /tank/mail/olaf $ find Maildir -type d -exec chmod 700 {} \; $ find Maildir -type f -exec chmod 600 {} \; $ svcadm enable dovecot This time in the dataset I did not set the options: -o aclinherit=passthrough-x -o aclmode=passthrough because dovecot does not need ACLs anyway. I'm not even sure those two options are what I actually need, but the server is running so I won't change them. Anyway, the server is running fine now. I'm not sure why I cannot see any "Trash" folder in Thunderbird, but if I try to create one it fails with "Folder already existing". I will find out.
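One dovecot-side way to make the special folders appear without any client-side tricks is the auto setting on the mailbox blocks (available in the 2.2.x used in this thread); a sketch against the namespace config quoted further down:

    namespace inbox {
      mailbox Trash {
        auto = create        # create the folder at login if it is missing
        special_use = \Trash
      }
    }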
I also wrote a summary of the issue and of the solution here, because other people had the same problem in the past (http://www.dovecot.org/list/dovecot/2013-November/093778.html) and there was no solution posted. http://www.marzocchi.net/Olafsen/Software/InstallationOfOmniOSAndBasicSetup Cheers, Olaf On 28/09/2015 00:05, Olaf Marzocchi wrote: > Hi, > I tried again with some other options. > > After finding > http://www.dovecot.org/list/dovecot/2013-November/093793.html > I deleted every ACL from the directory Maildir and I also assigned the > group "mail" to it, recursively: > > OmniOS-Xeon:/tank/home/olaf/Maildir/.Generiche $ ls -lV > total 903 > drwxrwxrwx 2 olaf mail 2 Sep 27 23:47 cur > owner@:rwxp--aARWcCos:-------:allow > group@:rwxp--a-R-c--s:-------:allow > everyone@:rwxp--a-R-c--s:-------:allow > (and so on) > > I tried also > mail_full_filesystem_access = yes > hoping that it would solve the issue, but nothing. Even with > mail_debug = yes > the log does not give any info besides > dovecot: [ID 583609 mail.error] imap(olaf): Error: > unlink(/tank/home/olaf/Maildir/.Generiche/dovecot-uidlist.tmp) failed: > Permission denied > > (it shows also "rename" instead of "unlink") > > With these additional info, has anyone any idea about the cause of the > problem? > > My doveconf -n: > > # 2.2.18: /etc/dovecot/dovecot.conf > # OS: SunOS 5.11 i86pc zfs > mail_debug = yes > mail_full_filesystem_access = yes > mail_location = maildir:/tank/home/%u/Maildir > mail_privileged_group = mail > namespace inbox { > inbox = yes > location = > mailbox Sent { > special_use = \Sent > } > mailbox "Sent Messages" { > special_use = \Sent > } > mailbox Trash { > special_use = \Trash > } > prefix = > } > passdb { > driver = pam > } > protocols = imap > ssl = required > ssl_cert = ssl_key = userdb { > driver = passwd > } > > > Any help will be appreciated. > > Regards, > Olaf Marzocchi > > > > > On 19/09/2015 19:22, Christian Kivalo wrote: >> Hi, >> >> On 2015-09-19 16:17, Olaf Marzocchi wrote: >>> Dear Dovecot users, hello. >>> I will merge two issues I have into a single email because they may be >>> related. >>> >>> I used dovecot on a OmniOS server since 2014 (currently OmniOS >>> r151014) with the following configuration (it shows 2.2.18 because I >>> recently updated dovecot, skipping only the PostgreSQL plugin): >>> >>> # 2.2.18: /etc/dovecot/dovecot.conf >>> # OS: SunOS 5.11 i86pc zfs >>> mail_location = maildir:/tank/home/%u/Maildir >>> mail_privileged_group = mail >>> namespace inbox { >>> inbox = yes >>> location = >>> mailbox Drafts { >>> special_use = \Drafts >>> } >>> mailbox Junk { >>> special_use = \Junk >>> } >>> mailbox Sent { >>> special_use = \Sent >>> } >>> mailbox "Sent Messages" { >>> special_use = \Sent >>> } >>> mailbox Trash { >>> special_use = \Trash >>> } >>> prefix = >>> } >>> passdb { >>> driver = pam >>> } >>> protocols = imap >>> ssl = required >>> ssl_cert = >> ssl_key = >> userdb { >>> driver = passwd >>> } >>> >>> You can see that I set the Maildir folder inside the shared home >>> folders of my server (it is only one user, anyway). >>> It always worked perfectly, but one-two months ago I changed the >>> permissions of my whole home folder, recursively, to add proper ACLs. >>> I needed them because the clients started using illumos kernel SMB >>> (relying on ACLs) instead of Netatalk/AFP (relying on Unix perms >>> only). >>> I didn't realise I applied the ACLs also to the Maildir folder. 
>>> >>> Dovecot worked for several weeks fine, I noticed the issue only >>> yesterday when a mailbox (see below) appeared in Thunderbird >>> completely empty even if the "cur" subfolder on the server still >>> contains all the mails. >>> >>> Dovecot was throwing some errors like: >>> >>> dovecot: [ID 583609 mail.error] imap(olaf): Error: >>> rename(/tank/home/olaf/Maildir/.&A6k- Mailing >>> Lists.Log/dovecot.index.cache) failed: Permission denied >>> (euid=501(olaf) egid=501(olaf) UNIX perms appear ok (ACL/MAC wrong?)) >>> dovecot: [ID 583609 mail.error] imap(olaf): Error: >>> rename(/tank/home/olaf/Maildir/.&A6k- Mailing >>> Lists.Log/dovecot.index.tmp, /tank/home/olaf/Maildir/.&A6k- Mailing >>> Lists.Log/dovecot.index) failed: Permission denied >>> dovecot: [ID 583609 mail.error] imap(olaf): Error: >>> unlink(/tank/home/olaf/Maildir/subscriptions.lock) failed: Permission >>> denied >>> dovecot: [ID 583609 mail.error] imap(olaf): Error: >>> rename(/tank/home/olaf/Maildir/subscriptions.lock, >>> /tank/home/olaf/Maildir/subscriptions) failed: Permission denied >>> >>> I will post here the current permissions of the folder containing >>> Maildir, of the Maildir itself, of its contents, and of the folder >>> that appears empty when browsed with a client (Thunderbird). >>> >>> /tank/home/olaf $ ls -lV .. >>> drwx------+ 16 olaf olaf 17 Sep 19 01:52 olaf >>> user:olaf:rwxpdDaARWcCos:fd-----:allow >>> group:2147483648:rwxpdDaARWcCos:fd-----:allow >>> everyone@:rwxpdDaARWcCos:fd-----:deny >>> >>> /tank/home/olaf $ ls -lV >>> drwxrwx--- 348 olaf olaf 359 Sep 19 01:51 Maildir >>> owner@:rwxp--aARWcCos:-------:allow >>> group@:rwxp--a-R-c--s:-------:allow >>> everyone@:------a-R-c--s:-------:allow >>> >>> /tank/home/olaf $ ls -lV Maildir/ >>> drwxrwx--- 2 olaf olaf 2 Jan 30 2014 cur >>> owner@:rwxp--aARWcCos:-------:allow >>> group@:rwxp--a-R-c--s:-------:allow >>> everyone@:------a-R-c--s:-------:allow >>> -rwxrwx--- 1 olaf olaf 21 Jan 30 2014 dovecot-keywords >>> owner@:rwxp--aARWcCos:-------:allow >>> group@:rwxp--a-R-c--s:-------:allow >>> everyone@:------a-R-c--s:-------:allow >>> (ALL THE SAME PERMISSIONS FOR THE OTHER FILES EXCEPT...) >>> -rwxrwx--- 1 olaf olaf 13735 Jan 24 2015 subscriptions >>> owner@:rwxp--aARWcCos:-------:allow >>> group@:rwxp--a-R-c--s:-------:allow >>> everyone@:------a-R-c--s:-------:allow >>> -rw-rw---- 1 olaf olaf 13709 Sep 19 01:51 subscriptions.lock >>> owner@:rw-p--aARWcCos:-------:allow >>> group@:rw-p--a-R-c--s:-------:allow >>> everyone@:------a-R-c--s:-------:allow >>> >>> The folder that appears empty: >>> >>> /tank/home/olaf $ ls -lV Maildir/.Generiche/ >>> total 513 >>> drwxrwx--- 2 olaf olaf 949 Sep 18 01:42 cur >>> owner@:rwxp--aARWcCos:-------:allow >>> group@:rwxp--a-R-c--s:-------:allow >>> everyone@:------a-R-c--s:-------:allow >>> -rwxrwx--- 1 olaf olaf 46 May 18 2014 dovecot-keywords >>> owner@:rwxp--aARWcCos:-------:allow >>> group@:rwxp--a-R-c--s:-------:allow >>> everyone@:------a-R-c--s:-------:allow >>> (ALL THE SAME PERMISSIONS FOR THE OTHER FILES) >>> >>> >>> I really hope you will have the time to help me because I already >>> applied the permissions recursively and I removed the ACLs, almost as >>> it was before my mistake. 
>>> I specified "almost" because originally (I checked the backups) the >>> Maildir folder had an ACL that gave access permissions also to the >>> group "mail": >>> >>> drwxrwx---+349 olaf olaf 359 Feb 16 2014 Maildir >>> group:mail:rwxpdDaARWcCos:fd-----:allow >>> owner@:rwxpdDaARWcCos:fd----I:allow >>> group@:rwxpdDaARWcCos:fd----I:allow >>> everyone@:rwxpdDaARWcCos:fd----I:deny >>> >>> Yesterday I haven't replicated it because from the documentation I >>> understood it was not necessary. >> >> From my view the permissions seem to be set correctly, i have to admin, >> its been a while since i moved to virtual users so i may be wrong here... >> >> The log output also seems to support that permissions are correct. >> >> Have you tried adding the group:mail:.... ACLs back? >> >> Have you set mail_debug=yes or other more verbose logging settings? >> http://wiki2.dovecot.org/Logging From hannohirschberger at googlemail.com Sat Oct 10 17:23:15 2015 From: hannohirschberger at googlemail.com (Hanno Hirschberger) Date: Sat, 10 Oct 2015 19:23:15 +0200 Subject: [OmniOS-discuss] Maildir: ACLs/Unix perms: unlink(...) failed: Permission denied In-Reply-To: <56193821.7010503@marzocchi.net> References: <55FD6E91.7020505@marzocchi.net> <839515024ef34c25a9bbe682a454855c@valo.at> <56086833.7090507@marzocchi.net> <56193821.7010503@marzocchi.net> Message-ID: <56194983.20401@googlemail.com> On 10.10.2015 18:09, Olaf Marzocchi wrote: > I'm not sure why I cannot see in Thunderbird any folder "Trash" but if I > try to create one it fails with "Folder already existing", but I will > find out Had the same problem before so let's give it a try! Are all the folders subscribed in Thunderbird? Right click on the mail account name in the mailbox list and go to "Subscribe...". See if the checkbox on the "Trash" entry is activated. Regards, Hanno -------------- next part -------------- A non-text attachment was scrubbed... Name: subscribe.jpg Type: image/jpeg Size: 21212 bytes Desc: not available URL: From lists at marzocchi.net Sat Oct 10 17:45:52 2015 From: lists at marzocchi.net (Olaf Marzocchi) Date: Sat, 10 Oct 2015 19:45:52 +0200 Subject: [OmniOS-discuss] Maildir: ACLs/Unix perms: unlink(...) failed: Permission denied In-Reply-To: <56194983.20401@googlemail.com> References: <55FD6E91.7020505@marzocchi.net> <839515024ef34c25a9bbe682a454855c@valo.at> <56086833.7090507@marzocchi.net> <56193821.7010503@marzocchi.net> <56194983.20401@googlemail.com> Message-ID: <56194ED0.8000208@marzocchi.net> Since I was not able to see the "Trash" even in that dialog, I created one called "__Cestino" (that's trash in Italian, plus the underscores to be sure it appeared on top) and now, after some hours I left it alone, it looks like it is recognised as "official" trash, with the special icon and the... renaming disabled. Anyway, I don't know what actually solved the problem, but thanks. Olaf On 10/10/2015 19:23, Hanno Hirschberger wrote: > On 10.10.2015 18:09, Olaf Marzocchi wrote: >> I'm not sure why I cannot see in Thunderbird any folder "Trash" but if I >> try to create one it fails with "Folder already existing", but I will >> find out > > Had the same problem before so let's give it a try! Are all the folders > subscribed in Thunderbird? Right click on the mail account name in the > mailbox list and go to "Subscribe...". See if the checkbox on the > "Trash" entry is activated. 
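Since this is Maildir, the subscription list Hanno refers to is also just a flat file on the server, so it can be inspected or fixed there too (path follows Olaf's new layout from earlier in the thread):

    grep -i trash /tank/mail/olaf/Maildir/subscriptions    # is Trash subscribed?
    echo Trash >> /tank/mail/olaf/Maildir/subscriptions    # subscribe it by hand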
> > Regards, > > Hanno > > > _______________________________________________ > OmniOS-discuss mailing list > OmniOS-discuss at lists.omniti.com > http://lists.omniti.com/mailman/listinfo/omnios-discuss > From jimklimov at cos.ru Sat Oct 10 16:55:43 2015 From: jimklimov at cos.ru (Jim Klimov) Date: Sat, 10 Oct 2015 18:55:43 +0200 Subject: [OmniOS-discuss] ISC-DHCPD in a zone In-Reply-To: <20151010185036.3d904430@coomera> References: <20151010094039.4fde856f@pimple.landy.net> <20151010185036.3d904430@coomera> Message-ID: <10558480-332F-49A3-993D-DF040DCD49C2@cos.ru> On 10 October 2015 at 10:50:36 CEST, Michael Mounteney wrote: >On Sat, 10 Oct 2015 07:33:29 +0200 >Jim Klimov wrote: > >> With the alias interfaces in play - do you use a shared-ip zone? That >> may be the limit; try switching to exclusive-ip with dedicated >> vnic(s). > >That would explain why my setup notes (this is a fresh installation) >have DHCP in its own zone and all other services (IMAP, version control >repositories, TFTP, rsync server etc.) in another. > >It's not the answer for which I was hoping. It would be neater to have >all services together in one zone and not have to run a second zone, >just for one service. Is there another way? Anything else I can try? > >> Also see if any zone or process rbac privileges seem suitable >> additions to the service (especially if it works from shell and fails >> from SMF even as root): things like promiscuity or not-owned file >> access are dropped by default. > >It's the same both from the command line and via a service. > >Thanks for your reply. > >______________ >Michael Mounteney You can try creating a vnic and delegating it to a zone (via device match rules). Hopefully then you'd get an owned device in the zone, but still not an owned stack where you can go promiscuous, change routes, etc. It may still be the limit... Maybe you can't even set an ip address on the delegated vnic from inside the zone. Hopefully someone better experienced with isc dhcpd can offer better ideas. Jim -- Typos courtesy of K-9 Mail on my Samsung Android From heinz at licenser.net Mon Oct 12 13:38:32 2015 From: heinz at licenser.net (Heinz Nikolaus Gies) Date: Mon, 12 Oct 2015 15:38:32 +0200 Subject: [OmniOS-discuss] Project-FiFo 0.7.0 release Message-ID: <8A3C13DD-6CC6-4090-926C-190554FFFB1B@licenser.net> Good news everyone! FiFo 0.7.0 is released today! There is a blog post [1] explaining the details, so I just want to go over the biggest news and keep this short. * A shiny new UI. * A complete overhaul of our documentation. * Historic metrics for the whole cloud with Tachyon and DalmatinerDB. * Accounting/usage information for VMs. * Full support for OAuth2. * Experimental support for OmniOS. If you want to update, we've a detailed update section in the docs [2]. [1] https://blog.project-fifo.net/the-biggest-news-yet-0-7-0-and-support [2] http://docs-new.project-fifo.net/docs/upgrading-fifo -------------- next part -------------- An HTML attachment was scrubbed...
URL: From rjahnel at ellipseinc.com Mon Oct 12 14:35:36 2015 From: rjahnel at ellipseinc.com (Richard Jahnel) Date: Mon, 12 Oct 2015 14:35:36 +0000 Subject: [OmniOS-discuss] Two panics now while writing eager zeros to zvols In-Reply-To: References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> Message-ID: <65DC5816D4BEE043885A89FD54E273FC6CF60B69@MAIL101.Ellipseinc.com> That would be the hourly snapshot destroy and create for VS20. -----Original Message----- From: Dan McDonald [mailto:danmcd at omniti.com] Sent: Friday, October 09, 2015 10:34 PM To: Richard Jahnel Cc: omnios-discuss at lists.omniti.com; imemo Subject: Re: [OmniOS-discuss] Two panics now while writing eager zeros to zvols That's just the "you had a kernel panic" message. Shoot. I was hoping for hardware problems. What is that process I mentioned -- VS20Vol20_snapcy ? What is it doing? It's driven from cron, but I can't tell much beyond that. (Most kernel dumps don't take in userspace text.) Dan ________________________________ The content of this e-mail (including any attachments) is strictly confidential and may be commercially sensitive. If you are not, or believe you may not be, the intended recipient, please advise the sender immediately by return e-mail, delete this e-mail and destroy any copies. From rjahnel at ellipseinc.com Mon Oct 12 14:46:03 2015 From: rjahnel at ellipseinc.com (Richard Jahnel) Date: Mon, 12 Oct 2015 14:46:03 +0000 Subject: [OmniOS-discuss] Two panics now while writing eager zeros to zvols In-Reply-To: References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> Message-ID: <65DC5816D4BEE043885A89FD54E273FC6CF60C61@MAIL101.Ellipseinc.com> Hmmm seems possible. Both panics included attempts to make eager zeroed volumes larger than 2 TB. From: wuffers [mailto:moo at wuffers.net] Sent: Saturday, October 10, 2015 2:19 AM To: Dan McDonald Cc: Richard Jahnel; imemo; omnios-discuss at lists.omniti.com Subject: Re: [OmniOS-discuss] Two panics now while writing eager zeros to zvols Is this the same bug I ran into in March? http://lists.omniti.com/pipermail/omnios-discuss/2015-March/004540.html I'm running a newer stmf_sbd that Dan made which solved my issue. It had something to do with the WRITE_SAME VAAI primitive, but I'm also running with COMSTAR. Dan was pretty busy preparing for R151014 at the time, so he hasn't had a chance to upstream it back. On Fri, Oct 9, 2015 at 11:33 PM, Dan McDonald > wrote: That's just the "you had a kernel panic" message. Shoot. I was hoping for hardware problems. What is that process I mentioned -- VS20Vol20_snapcy ? What is it doing? It's driven from cron, but I can't tell much beyond that. (Most kernel dumps don't take in userspace text.) Dan _______________________________________________ OmniOS-discuss mailing list OmniOS-discuss at lists.omniti.com http://lists.omniti.com/mailman/listinfo/omnios-discuss ________________________________ The content of this e-mail (including any attachments) is strictly confidential and may be commercially sensitive. If you are not, or believe you may not be, the intended recipient, please advise the sender immediately by return e-mail, delete this e-mail and destroy any copies. -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From rjahnel at ellipseinc.com Mon Oct 12 15:29:17 2015 From: rjahnel at ellipseinc.com (Richard Jahnel) Date: Mon, 12 Oct 2015 15:29:17 +0000 Subject: [OmniOS-discuss] Two panics now while writing eager zeros to zvols In-Reply-To: <65DC5816D4BEE043885A89FD54E273FC6CF60C61@MAIL101.Ellipseinc.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> <65DC5816D4BEE043885A89FD54E273FC6CF60C61@MAIL101.Ellipseinc.com> Message-ID: <65DC5816D4BEE043885A89FD54E273FC6CF60CA4@MAIL101.Ellipseinc.com>

Scratch that. Just panicked again on 250 GB disks.

From: Richard Jahnel Sent: Monday, October 12, 2015 9:46 AM To: wuffers; Dan McDonald Cc: imemo; omnios-discuss at lists.omniti.com Subject: RE: [OmniOS-discuss] Two panics now while writing eager zeros to zvols

Hmmm, seems possible. Both panics included attempts to make eager-zeroed volumes larger than 2 TB.

From: wuffers [mailto:moo at wuffers.net] Sent: Saturday, October 10, 2015 2:19 AM To: Dan McDonald Cc: Richard Jahnel; imemo; omnios-discuss at lists.omniti.com Subject: Re: [OmniOS-discuss] Two panics now while writing eager zeros to zvols

Is this the same bug I ran into in March?

http://lists.omniti.com/pipermail/omnios-discuss/2015-March/004540.html

[...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From richard at netbsd.org Tue Oct 13 03:24:15 2015 From: richard at netbsd.org (Richard PALO) Date: Tue, 13 Oct 2015 05:24:15 +0200 Subject: [OmniOS-discuss] usb printer debugging Message-ID:

Hi, me again. Trying to see why my multifunction printer doesn't show up correctly on OmniOS; a similar box works okay on OI.

I have a single-purpose DYMO label printer that is configured okay, but not the Epson.
> richard at omnis:/home/richard/src$ cfgadm -lv usb5/1 > Ap_Id Receptacle Occupant Condition Information > When Type Busy Phys_Id > usb5/1 connected unconfigured ok Mfg: DYMO Product: DYMO LabelWriter 450 NConfigs: 1 Config: 0 > unavailable usb-printer n /devices/pci at 0,0/pci15d9,a711 at 13:1 > richard at omnis:/home/richard/src$ cfgadm -lv usb4/1 > Ap_Id Receptacle Occupant Condition Information > When Type Busy Phys_Id > usb4/1 connected configured ok Mfg: EPSON Product: EPSON WP-4595 Series NConfigs: 1 Config: 0 : USB2.0 MFP(Hi-Speed) > unavailable usb-device n /devices/pci at 0,0/pci15d9,a711 at 12,2:1 > richard at omnis:/home/richard# echo ::prtusb |mdb -k > INDEX DRIVER INST NODE VID.PID PRODUCT > 1 ehci 0 pci15d9,a711 0000.0000 No Product String > 2 ehci 1 pci1002,4396 0000.0000 No Product String > 3 ohci 0 pci15d9,a711 0000.0000 No Product String > 4 ohci 1 pci15d9,a711 0000.0000 No Product String > 5 ohci 2 pci15d9,a711 0000.0000 No Product String > 6 ohci 3 pci15d9,a711 0000.0000 No Product String > 7 ohci 4 pci1002,4396 0000.0000 No Product String > 8 usb_mid 1 device 0557.2221 Hermon USB hidmouse Device > 9 usb_mid 4 device 046d.c52b USB Receiver > a usbprn 0 printer 0922.0020 DYMO LabelWriter 450 > b usb_mid 6 device 04b8.087e EPSON WP-4595 Series > richard at omnis:/home/richard# echo ::prtusb -v -ia |mdb -k > INDEX DRIVER INST NODE VID.PID PRODUCT > a usbprn 0 printer 0922.0020 DYMO LabelWriter 450 > > Device Descriptor > { > bLength = 0x12 > bDescriptorType = 0x1 > bcdUSB = 0x200 > bDeviceClass = 0 > bDeviceSubClass = 0 > bDeviceProtocol = 0 > bMaxPacketSize0 = 0x40 > idVendor = 0x922 > idProduct = 0x20 > bcdDevice = 0x112 > iManufacturer = 0x1 > iProduct = 0x2 > iSerialNumber = 0x3 > bNumConfigurations = 0x1 > } > -- Active Config Index 0 > Configuration Descriptor > { > bLength = 0x9 > bDescriptorType = 0x2 > wTotalLength = 0x20 > bNumInterfaces = 0x1 > bConfigurationValue = 0x1 > iConfiguration = 0x0 > bmAttributes = 0xc0 > bMaxPower = 0x1 > } > Interface Descriptor > { > bLength = 0x9 > bDescriptorType = 0x4 > bInterfaceNumber = 0x0 > bAlternateSetting = 0x0 > bNumEndpoints = 0x2 > bInterfaceClass = 0x7 > bInterfaceSubClass = 0x1 > bInterfaceProtocol = 0x2 > iInterface = 0x0 > } > Endpoint Descriptor > { > bLength = 0x7 > bDescriptorType = 0x5 > bEndpointAddress = 0x82 > bmAttributes = 0x2 > wMaxPacketSize = 0x40 > bInterval = 0x0 > } > Endpoint Descriptor > { > bLength = 0x7 > bDescriptorType = 0x5 > bEndpointAddress = 0x2 > bmAttributes = 0x2 > wMaxPacketSize = 0x40 > bInterval = 0x0 > } > > richard at omnis:/home/richard# echo ::prtusb -v -ib |mdb -k > INDEX DRIVER INST NODE VID.PID PRODUCT > b usb_mid 6 device 04b8.087e EPSON WP-4595 Series > > Device Descriptor > { > bLength = 0x12 > bDescriptorType = 0x1 > bcdUSB = 0x200 > bDeviceClass = 0 > bDeviceSubClass = 0 > bDeviceProtocol = 0 > bMaxPacketSize0 = 0x40 > idVendor = 0x4b8 > idProduct = 0x87e > bcdDevice = 0x100 > iManufacturer = 0x1 > iProduct = 0x2 > iSerialNumber = 0x3 > bNumConfigurations = 0x1 > } > -- Active Config Index 0 > Configuration Descriptor > { > bLength = 0x9 > bDescriptorType = 0x2 > wTotalLength = 0x4e > bNumInterfaces = 0x3 > bConfigurationValue = 0x1 > iConfiguration = 0x4 > bmAttributes = 0xc0 > bMaxPower = 0x1 > } > Interface Descriptor > { > bLength = 0x9 > bDescriptorType = 0x4 > bInterfaceNumber = 0x0 > bAlternateSetting = 0x0 > bNumEndpoints = 0x2 > bInterfaceClass = 0xff > bInterfaceSubClass = 0xff > bInterfaceProtocol = 0xff > iInterface = 0x5 > } > Endpoint Descriptor > { > 
bLength = 0x7 > bDescriptorType = 0x5 > bEndpointAddress = 0x1 > bmAttributes = 0x2 > wMaxPacketSize = 0x200 > bInterval = 0x0 > } > Endpoint Descriptor > { > bLength = 0x7 > bDescriptorType = 0x5 > bEndpointAddress = 0x82 > bmAttributes = 0x2 > wMaxPacketSize = 0x200 > bInterval = 0x0 > } > Interface Descriptor > { > bLength = 0x9 > bDescriptorType = 0x4 > bInterfaceNumber = 0x1 > bAlternateSetting = 0x0 > bNumEndpoints = 0x2 > bInterfaceClass = 0x7 > bInterfaceSubClass = 0x1 > bInterfaceProtocol = 0x2 > iInterface = 0x6 > } > Endpoint Descriptor > { > bLength = 0x7 > bDescriptorType = 0x5 > bEndpointAddress = 0x4 > bmAttributes = 0x2 > wMaxPacketSize = 0x200 > bInterval = 0x0 > } > Endpoint Descriptor > { > bLength = 0x7 > bDescriptorType = 0x5 > bEndpointAddress = 0x85 > bmAttributes = 0x2 > wMaxPacketSize = 0x200 > bInterval = 0x0 > } > Interface Descriptor > { > bLength = 0x9 > bDescriptorType = 0x4 > bInterfaceNumber = 0x2 > bAlternateSetting = 0x0 > bNumEndpoints = 0x2 > bInterfaceClass = 0x8 > bInterfaceSubClass = 0x6 > bInterfaceProtocol = 0x50 > iInterface = 0x7 > } > Endpoint Descriptor > { > bLength = 0x7 > bDescriptorType = 0x5 > bEndpointAddress = 0x7 > bmAttributes = 0x2 > wMaxPacketSize = 0x200 > bInterval = 0x0 > } > Endpoint Descriptor > { > bLength = 0x7 > bDescriptorType = 0x5 > bEndpointAddress = 0x88 > bmAttributes = 0x2 > wMaxPacketSize = 0x200 > bInterval = 0x0 > } >

The multifunction device only gets one configuration made (config 0), although the other two interfaces are certainly listed. On OI, I automagically get a printer and a fax device.

Any hints?

-- Richard PALO

From danmcd at omniti.com Tue Oct 13 11:39:42 2015 From: danmcd at omniti.com (Dan McDonald) Date: Tue, 13 Oct 2015 07:39:42 -0400 Subject: [OmniOS-discuss] Ang: Re: Two panics now while writing eager zeros to zvols In-Reply-To: References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> Message-ID: <1677B2AD-3051-4359-8381-C624157913DB@omniti.com>

> On Oct 10, 2015, at 4:49 AM, Johan Kragsterman wrote:
>
> Dan, is this upstreamed at all...?

No it's not. The following not-yet-upstreamed diff is a subset of a larger fix from illumos-nexenta:

diff --git a/usr/src/uts/common/io/comstar/lu/stmf_sbd/sbd_scsi.c b/usr/src/uts/common/io/comstar/lu/stmf_sbd/sbd_scsi.c
index cb6e115..7242d15 100644
--- a/usr/src/uts/common/io/comstar/lu/stmf_sbd/sbd_scsi.c
+++ b/usr/src/uts/common/io/comstar/lu/stmf_sbd/sbd_scsi.c
@@ -2347,6 +2347,7 @@ write_same_xfer_done:
 	if (scmd->flags & SBD_SCSI_CMD_XFER_FAIL) {
 		stmf_scsilib_send_status(task, STATUS_CHECK,
 		    STMF_SAA_WRITE_ERROR);
+		ret = (int)SBD_FAILURE;
 	} else {
 		ret = sbd_write_same_data(task, scmd);
 		if (ret != SBD_SUCCESS) {
@@ -2355,15 +2356,24 @@ write_same_xfer_done:
 		} else {
 			stmf_scsilib_send_status(task, STATUS_GOOD, 0);
 		}
+		if ((scmd->flags & SBD_SCSI_CMD_TRANS_DATA) &&
+		    scmd->trans_data != NULL) {
+			kmem_free(scmd->trans_data, scmd->trans_data_len);
+			scmd->trans_data = NULL;
+			scmd->trans_data_len = 0;
+			scmd->flags &= ~SBD_SCSI_CMD_TRANS_DATA;
+		}
 	}
 	/*
-	 * Only way we should get here is via handle_write_same(),
-	 * and that should make the following assertion always pass.
+	 * Do the send_status afterwards, because of a potential
+	 * double-free problem.
 	 */
-	ASSERT((scmd->flags & SBD_SCSI_CMD_TRANS_DATA) &&
-	    scmd->trans_data != NULL);
-	kmem_free(scmd->trans_data, scmd->trans_data_len);
-	scmd->flags &= ~SBD_SCSI_CMD_TRANS_DATA;
+	if (ret != SBD_SUCCESS) {
+		stmf_scsilib_send_status(task, STATUS_CHECK,
+		    STMF_SAA_WRITE_ERROR);
+	} else {
+		stmf_scsilib_send_status(task, STATUS_GOOD, 0);
+	}
 	return;
 }
 sbd_do_write_same_xfer(task, scmd, dbuf, dbuf_reusable);

It fixes a double-free in this path, which the eager-zeroes seem to tickle.

I'm attaching an stmf_sbd binary as well. People affected by this can try this binary by:

1.) beadm create test-be
2.) beadm mount test-be /mnt
3.) cp stmf_sbd /mnt/kernel/drv/amd64/stmf_sbd
4.) bootadm update-archive -R /mnt
5.) Reboot, watch for grub
6.) Select "test-be" from the grub menu. This will be a boot-once-into-the-new-BE test.
7.) Try your eager-zero test.

I look forward to seeing any experimental results.

Thanks,
Dan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: stmf_sbd Type: application/octet-stream Size: 164248 bytes Desc: not available URL:

From danmcd at omniti.com Tue Oct 13 11:43:17 2015 From: danmcd at omniti.com (Dan McDonald) Date: Tue, 13 Oct 2015 07:43:17 -0400 Subject: [OmniOS-discuss] ISC-DHCPD in a zone In-Reply-To: <10558480-332F-49A3-993D-DF040DCD49C2@cos.ru> References: <20151010094039.4fde856f@pimple.landy.net> <20151010185036.3d904430@coomera> <10558480-332F-49A3-993D-DF040DCD49C2@cos.ru> Message-ID: <9965A834-711E-4CD4-B8D3-11C57B239CC7@omniti.com>

> On Oct 10, 2015, at 12:55 PM, Jim Klimov wrote:
>
> You can try creating a vnic and delegating it to a zone (via device match rules). [...]
>
> Hopefully someone better experienced with isc dhcpd can offer better ideas.

Oh my...

> Zone# ifconfig -a
> lo0:2: flags=2001000849 mtu 8232 index 1
>         inet 127.0.0.1 netmask ff000000
> e1000g1:2: flags=1100843 mtu 1500 index 2
>         inet 192.168.1.2 netmask ffffff00 broadcast 192.168.1.255
> lo0:2: flags=2002000849 mtu 8252 index 1
>         inet6 ::1/128

You're using a shared-stack zone. I didn't know people still did that...

ISC DHCP needs full DLPI-ish access to the NIC in question. I run ISC DHCP in a zone, but it's an exclusive-stack zone.

I'd take Jim's advice first if you're really intent on using a shared-stack zone. I can't guarantee it'll work, but you certainly cannot run ISC DHCP without having a full NIC available.

Sorry,
Dan

From danmcd at omniti.com Tue Oct 13 12:16:19 2015 From: danmcd at omniti.com (Dan McDonald) Date: Tue, 13 Oct 2015 08:16:19 -0400 Subject: Re: [OmniOS-discuss] Two panics now while writing eager zeros to zvols In-Reply-To: <65DC5816D4BEE043885A89FD54E273FC6CF60CA4@MAIL101.Ellipseinc.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> <65DC5816D4BEE043885A89FD54E273FC6CF60C61@MAIL101.Ellipseinc.com> <65DC5816D4BEE043885A89FD54E273FC6CF60CA4@MAIL101.Ellipseinc.com> Message-ID:

See my other note on this subject. This may be a bug which is fixed in illumos-nexenta, but not upstreamed.
Dan

From danmcd at omniti.com Tue Oct 13 12:21:52 2015 From: danmcd at omniti.com (Dan McDonald) Date: Tue, 13 Oct 2015 08:21:52 -0400 Subject: [OmniOS-discuss] usb printer debugging In-Reply-To: References: Message-ID: <7477FABC-5ACE-47F6-A80B-102ECB716CF7@omniti.com>

> On Oct 12, 2015, at 11:24 PM, Richard PALO wrote:
>
> On OI, I get automagically a printer and a fax device.
>
> Any hints?

Sure - we don't support CUPS in OmniOS. That requires apache stuff, which was expunged to enforce the keep-your-stuff-to-yourself policies of OmniOS. (Apache is only available on the "omniti-ms/ms.omniti.com" publisher.)

Sorry,
Dan

From rjahnel at ellipseinc.com Tue Oct 13 14:30:07 2015 From: rjahnel at ellipseinc.com (Richard Jahnel) Date: Tue, 13 Oct 2015 14:30:07 +0000 Subject: [OmniOS-discuss] Ang: Re: Two panics now while writing eager zeros to zvols In-Reply-To: <1677B2AD-3051-4359-8381-C624157913DB@omniti.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> <1677B2AD-3051-4359-8381-C624157913DB@omniti.com> Message-ID: <65DC5816D4BEE043885A89FD54E273FC6CF60E45@MAIL101.Ellipseinc.com>

Will experiment with it this afternoon. If it hasn't panicked by tomorrow evening, odds are this will have identified and fixed the issue.

-----Original Message-----
From: Dan McDonald [mailto:danmcd at omniti.com]
Sent: Tuesday, October 13, 2015 6:40 AM
To: Johan Kragsterman; Dan McDonald
Cc: wuffers; Richard Jahnel; imemo; omnios-discuss at lists.omniti.com
Subject: Re: Ang: Re: [OmniOS-discuss] Two panics now while writing eager zeros to zvols

[...]

From rjahnel at ellipseinc.com Tue Oct 13 17:34:19 2015 From: rjahnel at ellipseinc.com (Richard Jahnel) Date: Tue, 13 Oct 2015 17:34:19 +0000 Subject: [OmniOS-discuss] Ang: Re: Two panics now while writing eager zeros to zvols In-Reply-To: <1677B2AD-3051-4359-8381-C624157913DB@omniti.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> <1677B2AD-3051-4359-8381-C624157913DB@omniti.com> Message-ID: <65DC5816D4BEE043885A89FD54E273FC6CF60EC9@MAIL101.Ellipseinc.com>

I'm probably doing it wrong, but I have failed to get this to work.

When attempting to boot into the test environment I got something along the lines of

stmf_sbd: undefined symbol '__stack_chk_fail'
stmf_sbd: undefined symbol '__stack_chk_guard'

unable to load module stmf_sbd

My version information below.

OmniOS 5.11 omnios-f090f73 September 2015

# cat /etc/release
  OmniOS v11 r151014
  Copyright 2015 OmniTI Computer Consulting, Inc. All rights reserved.
  Use is subject to license terms.

-----Original Message-----
From: Dan McDonald [mailto:danmcd at omniti.com]
Sent: Tuesday, October 13, 2015 6:40 AM
To: Johan Kragsterman; Dan McDonald
Cc: wuffers; Richard Jahnel; imemo; omnios-discuss at lists.omniti.com
Subject: Re: Ang: Re: [OmniOS-discuss] Two panics now while writing eager zeros to zvols

[...]

From danmcd at omniti.com Tue Oct 13 17:52:37 2015 From: danmcd at omniti.com (Dan McDonald) Date: Tue, 13 Oct 2015 13:52:37 -0400 Subject: [OmniOS-discuss] Ang: Re: Two panics now while writing eager zeros to zvols In-Reply-To: <65DC5816D4BEE043885A89FD54E273FC6CF60EC9@MAIL101.Ellipseinc.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> <1677B2AD-3051-4359-8381-C624157913DB@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF60EC9@MAIL101.Ellipseinc.com> Message-ID: <807F8D8C-BF81-4368-BCFA-2CCA0C2DE338@omniti.com>

My bad. I built this with bloody and forgot the flag day for modules. I'll need to build it for 014 specifically.

Dan

Sent from my iPhone (typos, autocorrect, and all)

> On Oct 13, 2015, at 1:34 PM, Richard Jahnel wrote:
>
> I'm probably doing it wrong, but I have failed to get this to work.
>
> When attempting to boot into the test environment I got something along the lines of
>
> stmf_sbd: undefined symbol '__stack_chk_fail'
> stmf_sbd: undefined symbol '__stack_chk_guard'
>
> unable to load module stmf_sbd
>
> My version information below.
>
> OmniOS 5.11 omnios-f090f73 September 2015
>
> [...]

From danmcd at omniti.com Tue Oct 13 18:13:44 2015 From: danmcd at omniti.com (Dan McDonald) Date: Tue, 13 Oct 2015 14:13:44 -0400 Subject: Re: [OmniOS-discuss] Ang: Re: Two panics now while writing eager zeros to zvols In-Reply-To: <807F8D8C-BF81-4368-BCFA-2CCA0C2DE338@omniti.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> <1677B2AD-3051-4359-8381-C624157913DB@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF60EC9@MAIL101.Ellipseinc.com> <807F8D8C-BF81-4368-BCFA-2CCA0C2DE338@omniti.com> Message-ID: <4AC0EC81-FEC5-4A5C-BA79-D93C0162597D@omniti.com>

> On Oct 13, 2015, at 1:52 PM, Dan McDonald wrote:
>
> My bad. I built this with bloody and forgot the flag day for modules. I'll need to build it for 014 specifically.

Try this one.

Dan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: stmf_sbd Type: application/octet-stream Size: 164072 bytes Desc: not available URL: From danmcd at omniti.com Tue Oct 13 18:15:32 2015 From: danmcd at omniti.com (Dan McDonald) Date: Tue, 13 Oct 2015 14:15:32 -0400 Subject: [OmniOS-discuss] Ang: Re: Two panics now while writing eager zeros to zvols In-Reply-To: <4AC0EC81-FEC5-4A5C-BA79-D93C0162597D@omniti.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> <1677B2AD-3051-4359-8381-C624157913DB@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF60EC9@MAIL101.Ellipseinc.com> <807F8D8C-BF81-4368-BCFA-2CCA0C2DE338@omniti.com> <4AC0EC81-FEC5-4A5C-BA79-D93C0162597D@omniti.com> Message-ID: <9A17713D-913B-4A87-B8CE-14315CFA82CA@omniti.com> > On Oct 13, 2015, at 2:13 PM, Dan McDonald wrote: > > Try this one. NO DON'T! Sorry. This one. Dan -------------- next part -------------- A non-text attachment was scrubbed... Name: stmf_sbd Type: application/octet-stream Size: 163240 bytes Desc: not available URL: From cks at cs.toronto.edu Tue Oct 13 18:35:08 2015 From: cks at cs.toronto.edu (Chris Siebenmann) Date: Tue, 13 Oct 2015 14:35:08 -0400 Subject: [OmniOS-discuss] Installing non-current kernels on OmniOS r151014? Message-ID: <20151013183508.6AEF07A0408@apps0.cs.toronto.edu> We have a situation where we would like to be able to install new r151014 machines with something other than the current r151014 kernel. (In the extreme case we'd like to be able to specify the exact package version for all packages, but kernels are the most important for us.) I *think* that the required older versions (both of the kernel and of drivers) are still available in the OmniOS repository. However, I can't seem to coax 'pkg' to show them to me (perhaps because they differ only in the timestamp, not in the version number that pkg stuff normally shows) and so I'm not sure I can get pkg to install them. In related news, is there an easy way to fish the full specific versions of installed packages out of a non-current boot environment? (Or for that matter from the current boot environment.) Is the OmniOS repo for r151014 going to keep copies of all old packages for the lifetime of r151014, or should we also be looking into creating our own copy of the r151014 repo so we can be sure the copies we need are preserved? Thanks in advance. (For the curious: we've been doing various testing of r151014 before upgrading production machines to it. The August 18th and September 14th updates were stable, but after the September 29th one our test machine has started experiencing kernel problems. It's possible that we're putting somewhat different test load on it, but we don't think we've particularly changed anything. While we're going to try to get crash dumps and so on, our first priority is stabilizing some version of r151014 for a production upgrade, which requires being able to specifically install *it* (at least as far as the kernel/NFS/etc goes), not 'the current r151014'.) 
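 (And if the answer to the repo question turns out to be 'mirror it yourself', I assume that would be the usual pkgrepo/pkgrecv dance, something along the lines of:

	pkgrepo create /export/r151014-mirror
	pkgrecv -s https://pkg.omniti.com/omnios/r151014/ -d /export/r151014-mirror -m all-timestamps '*'

 with '-m all-timestamps' so we keep every timestamped version rather than just the newest -- but corrections welcome; I haven't actually tried it yet.)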
- cks

From richard at netbsd.org Tue Oct 13 18:57:02 2015 From: richard at netbsd.org (Richard PALO) Date: Tue, 13 Oct 2015 20:57:02 +0200 Subject: [OmniOS-discuss] usb printer debugging In-Reply-To: <7477FABC-5ACE-47F6-A80B-102ECB716CF7@omniti.com> References: <7477FABC-5ACE-47F6-A80B-102ECB716CF7@omniti.com> Message-ID: <561D53FE.20105@netbsd.org>

On 13/10/15 14:21, Dan McDonald wrote:
>
>> On Oct 12, 2015, at 11:24 PM, Richard PALO wrote:
>>
>> On OI, I get automagically a printer and a fax device.
>>
>> Any hints?
>
> Sure - we don't support CUPS in OmniOS. That requires apache stuff, which was expunged to enforce the keep-your-stuff-to-yourself policies of OmniOS. (Apache is only available on the "omniti-ms/ms.omniti.com" publisher.)
>
> Sorry,
> Dan
>

Well, I'm building a gate with CUPS and APACHE just to see. But I'm still just a bit dubious; otherwise, why would usbprn still be provided? An omission? It still smells like an anomaly somewhere...

-- Richard PALO

From lotheac at iki.fi Tue Oct 13 19:10:20 2015 From: lotheac at iki.fi (Lauri Tirkkonen) Date: Tue, 13 Oct 2015 22:10:20 +0300 Subject: Re: [OmniOS-discuss] Installing non-current kernels on OmniOS r151014? In-Reply-To: <20151013183508.6AEF07A0408@apps0.cs.toronto.edu> References: <20151013183508.6AEF07A0408@apps0.cs.toronto.edu> Message-ID: <20151013191020.GD26977@gutsman.lotheac.fi>

On Tue, Oct 13 2015 14:35:08 -0400, Chris Siebenmann wrote:
> I *think* that the required older versions (both of the kernel and
> of drivers) are still available in the OmniOS repository. However,
> I can't seem to coax 'pkg' to show them to me (perhaps because they
> differ only in the timestamp, not in the version number that pkg stuff
> normally shows) and so I'm not sure I can get pkg to install them.

They are:

% pkg list -vfa kernel
FMRI                                                          IFO
pkg://omnios/system/kernel at 0.5.11-0.151014:20150929T225337Z  i--
pkg://omnios/system/kernel at 0.5.11-0.151014:20150914T195008Z  ---
pkg://omnios/system/kernel at 0.5.11-0.151014:20150913T201559Z  ---
pkg://omnios/system/kernel at 0.5.11-0.151014:20150818T161044Z  ---
pkg://omnios/system/kernel at 0.5.11-0.151014:20150727T054700Z  ---
pkg://omnios/system/kernel at 0.5.11-0.151014:20150417T182434Z  ---
pkg://omnios/system/kernel at 0.5.11-0.151014:20150402T175237Z  ---

Downgrading a kernel package might not be so trivial, though. pkg will generally refuse to downgrade packages unless you give version numbers in 'update', and the dependencies generally involve lots more packages than just system/kernel.

> In related news, is there an easy way to fish the full specific versions
> of installed packages out of a non-current boot environment? (Or for
> that matter from the current boot environment.)

Mount the BE (beadm mount) and run 'pkg -R <mountpoint> list -v'. For the current BE just 'pkg list -v'.
-- Lauri Tirkkonen | lotheac @ IRCnet

From rjahnel at ellipseinc.com Tue Oct 13 19:42:52 2015 From: rjahnel at ellipseinc.com (Richard Jahnel) Date: Tue, 13 Oct 2015 19:42:52 +0000 Subject: [OmniOS-discuss] Ang: Re: Two panics now while writing eager zeros to zvols In-Reply-To: <9A17713D-913B-4A87-B8CE-14315CFA82CA@omniti.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> <1677B2AD-3051-4359-8381-C624157913DB@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF60EC9@MAIL101.Ellipseinc.com> <807F8D8C-BF81-4368-BCFA-2CCA0C2DE338@omniti.com> <4AC0EC81-FEC5-4A5C-BA79-D93C0162597D@omniti.com> <9A17713D-913B-4A87-B8CE-14315CFA82CA@omniti.com> Message-ID: <65DC5816D4BEE043885A89FD54E273FC6CF60FA7@MAIL101.Ellipseinc.com>

While the system did not panic, VMware lost all communication with all zvols shortly after I attempted to add a new vmdk to one of them.

-----Original Message-----
From: Dan McDonald [mailto:danmcd at omniti.com]
Sent: Tuesday, October 13, 2015 1:16 PM
To: Richard Jahnel
Cc: Johan Kragsterman; wuffers; imemo; omnios-discuss at lists.omniti.com
Subject: Re: Ang: Re: [OmniOS-discuss] Two panics now while writing eager zeros to zvols

> On Oct 13, 2015, at 2:13 PM, Dan McDonald wrote:
>
> Try this one.

NO DON'T! Sorry. This one.

Dan

From danmcd at omniti.com Tue Oct 13 19:47:16 2015 From: danmcd at omniti.com (Dan McDonald) Date: Tue, 13 Oct 2015 15:47:16 -0400 Subject: Re: [OmniOS-discuss] Ang: Re: Two panics now while writing eager zeros to zvols In-Reply-To: <65DC5816D4BEE043885A89FD54E273FC6CF60FA7@MAIL101.Ellipseinc.com> References: <65DC5816D4BEE043885A89FD54E273FC6CF5FEBB@MAIL101.Ellipseinc.com> <2D9BCBB7-C736-400F-A149-9BD4CB44E5FD@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF608D8@MAIL101.Ellipseinc.com> <1677B2AD-3051-4359-8381-C624157913DB@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF60EC9@MAIL101.Ellipseinc.com> <807F8D8C-BF81-4368-BCFA-2CCA0C2DE338@omniti.com> <4AC0EC81-FEC5-4A5C-BA79-D93C0162597D@omniti.com> <9A17713D-913B-4A87-B8CE-14315CFA82CA@omniti.com> <65DC5816D4BEE043885A89FD54E273FC6CF60FA7@MAIL101.Ellipseinc.com> Message-ID: <4ECAE70E-6C02-499E-9568-08D4C83B2C3F@omniti.com>

> On Oct 13, 2015, at 3:42 PM, Richard Jahnel wrote:
>
> While the system did not panic, VMware lost all communication with all zvols shortly after I attempted to add a new vmdk to one of them.

Shoot. There's a LOT of COMSTAR goodies in NexentaStor they haven't yet upstreamed. VAAI-related ones are a big part of it.

Dan

From keith at paskett.org Wed Oct 14 04:07:04 2015 From: keith at paskett.org (Keith Paskett) Date: Tue, 13 Oct 2015 22:07:04 -0600 Subject: [OmniOS-discuss] subversion intentionally compiled without http(s) support? Message-ID:

After installing the following subversion package, I get an error accessing any subversion repository via http(s) protocols:

PACKAGE                                                      PUBLISHER
pkg:/omniti/developer/versioning/subversion at 1.9.2-0.151014  ms.omniti.com

The error I get is svn: E170000: Unrecognized URL scheme for 'https://...'

Online search suggests that it was compiled without the --with-ssl and/or --with-neon switches.
Subversion worked fine on a r151014 system I set up a couple of months ago.

Keith Paskett
KLP-Systems

From keith at paskett.org Wed Oct 14 04:32:14 2015 From: keith at paskett.org (Keith Paskett) Date: Tue, 13 Oct 2015 22:32:14 -0600 Subject: Re: [OmniOS-discuss] subversion intentionally compiled without http(s) support? In-Reply-To: References: Message-ID: <3B1BD7DE-7D7B-471F-A5DE-B402E145B65D@paskett.org>

subversion at 1.8.10-0.151014 is fine.

> On Oct 13, 2015, at 10:07 PM, Keith Paskett wrote:
>
> After installing the following subversion package, I get an error accessing any subversion repository via http(s) protocols:
>
> [...]

_______________________________________________
OmniOS-discuss mailing list
OmniOS-discuss at lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss

From rt at steait.net Wed Oct 14 05:45:37 2015 From: rt at steait.net (Rune Tipsmark) Date: Wed, 14 Oct 2015 05:45:37 +0000 Subject: [OmniOS-discuss] ZIL TXG commits happen very frequently - why? Message-ID:

Hi all.

Wondering if anyone could shed some light on why my ZFS pool would perform TXG commits up to 5 times per second. It's set to the default 5-second interval, and occasionally it does wait 5 seconds between commits, but only when nearly idle.

I'm not sure if this impacts my performance, but I would suspect it doesn't improve it. I force sync on all data.

I got 11 mirrors (7200 rpm SAS disks), two SLOG devices, two L2ARC devices and a pair of spare disks.

Each log device can hold 150GB of data, so plenty for 2 TXG commits. The system has 384GB memory.

Below is a bit of output from zilstat during a near-idle time this morning, so you won't see 4-5 commits per second, but during load later today it will happen.

root at zfs10:/tmp# ./zilstat.ksh -M -t -p pool01 txg
waiting for txg commit...
TIME                      txg   N-MB  N-MB/s  N-Max-Rate  B-MB  B-MB/s  B-Max-Rate   ops  <=4kB  4-32kB  >=32kB
2015 Oct 14 06:21:19  10872771     3       3           0    21      21           2   234     14      19     201
2015 Oct 14 06:21:22  10872772    10       3           3    70      23          24   806      0      84     725
2015 Oct 14 06:21:24  10872773    12       6           5    56      28          26   682     17     107     558
2015 Oct 14 06:21:25  10872774    13      13           2    75      75          14   651      0      10     641
2015 Oct 14 06:21:25  10872775     0       0           0     0       0           0     1      0       0       1
2015 Oct 14 06:21:26  10872776    11      11           6    53      53          29   645      2     136     507
2015 Oct 14 06:21:30  10872777    11       2           4    81      20          32   873     11      60     804
2015 Oct 14 06:21:30  10872778     0       0           0     0       0           0     1      0       1       0
2015 Oct 14 06:21:31  10872779    12      12          11    56      56          52   631      0       8     623
2015 Oct 14 06:21:33  10872780    11       5           4    74      37          27   858      0      44     814
2015 Oct 14 06:21:36  10872781    14       4           6    79      26          30   977     12      82     883
2015 Oct 14 06:21:39  10872782    11       3           4    78      26          25   957     18      55     884
2015 Oct 14 06:21:43  10872783    13       3           4    80      20          24   930      0     135     795
2015 Oct 14 06:21:46  10872784    13       4           4    81      27          29   965     13      95     857
2015 Oct 14 06:21:49  10872785    11       3           6    80      26          41  1077     12     215     850
2015 Oct 14 06:21:53  10872786     9       3           2    67      22          18   870      1      74     796
2015 Oct 14 06:21:56  10872787    12       3           5    72      18          26   909     17     163     729
2015 Oct 14 06:21:58  10872788    12       6           3    53      26          21   530      0      33     497
2015 Oct 14 06:21:59  10872789    26      26          24    72      72          62   882     12      60     810
2015 Oct 14 06:22:02  10872790     9       3           5    57      19          28   777      0      70     708
2015 Oct 14 06:22:07  10872791    11       2           3    96      24          22  1044     12      46     986
2015 Oct 14 06:22:10  10872792    13       3           4    78      19          22   911     12      38     862
2015 Oct 14 06:22:14  10872793    11       2           4    79      19          26   930     10      94     826
2015 Oct 14 06:22:17  10872794    11       3           5    73      24          26  1054     17     151     886
2015 Oct 14 06:22:17  10872795     0       0           0     0       0           0     2      0       0       2
2015 Oct 14 06:22:18  10872796    40      40          38    78      78          60   707      0      28     680
2015 Oct 14 06:22:22  10872797    10       3           3    66      22          21   937     14     164     759
2015 Oct 14 06:22:25  10872798     9       2           2    66      16          21   821     11      92     718
2015 Oct 14 06:22:28  10872799    24      12          14    80      40          43   750      0      23     727
2015 Oct 14 06:22:28  10872800     0       0           0     0       0           0     2      0       0       2
2015 Oct 14 06:22:29  10872801    15       7           9    49      24          24   526     11      25     490
2015 Oct 14 06:22:33  10872802    10       2           3    79      19          24   939      0      63     876
2015 Oct 14 06:22:36  10872803    10       5           3    59      29          18   756     11      65     682
2015 Oct 14 06:22:36  10872804     0       0           0     0       0           0     0      0       0       0
2015 Oct 14 06:22:36  10872805    13      13           2    58      58           9   500      0      29     471

--
-------------- An HTML attachment was scrubbed... URL: From chip at innovates.com Wed Oct 14 12:44:50 2015 From: chip at innovates.com (Schweiss, Chip) Date: Wed, 14 Oct 2015 07:44:50 -0500 Subject: [OmniOS-discuss] ZIL TXG commits happen very frequently - why? In-Reply-To: References: Message-ID: It all has to do with the write throttle and buffers filling. Here's a great blog post on how it works and how it's tuned: http://dtrace.org/blogs/ahl/2014/02/10/the-openzfs-write-throttle/ http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/ -Chip On Wed, Oct 14, 2015 at 12:45 AM, Rune Tipsmark wrote: > Hi all. > > > > Wondering if anyone could shed some light on why my ZFS pool would perform > TXG commits up to 5 times per second. It?s set to the default 5 second > interval and occasionally it does wait 5 seconds between commits, but only > when nearly idle. > > > > I?m not sure if this impacts my performance but I would suspect it doesn?t > improve it. I force sync on all data. > > > > I got 11 mirrors (7200rpm sas disks) two SLOG devices and two L2 ARC > devices and a pair of spare disks. > > > > Each log device can hold 150GB of data so plenty for 2 TXG commits. The > system has 384GB memory. > > > > > Below is a bit of output from zilstat during a near idle time this morning > so you wont see 4-5 commits per second, but during load later today it will > happen.. > > > > root at zfs10:/tmp# ./zilstat.ksh -M -t -p pool01 txg > > waiting for txg commit... > > TIME txg N-MB N-MB/s N-Max-Rate > B-MB B-MB/s B-Max-Rate ops <=4kB 4-32kB >=32kB > > 2015 Oct 14 06:21:19 10872771 3 3 0 > 21 21 2 234 14 19 201 > > 2015 Oct 14 06:21:22 10872772 10 3 3 > 70 23 24 806 0 84 725 > > 2015 Oct 14 06:21:24 10872773 12 6 5 > 56 28 26 682 17 107 558 > > 2015 Oct 14 06:21:25 10872774 13 13 2 > 75 75 14 651 0 10 641 > > 2015 Oct 14 06:21:25 10872775 0 0 0 > 0 0 0 1 0 0 1 > > 2015 Oct 14 06:21:26 10872776 11 11 6 > 53 53 29 645 2 136 507 > > 2015 Oct 14 06:21:30 10872777 11 2 4 > 81 20 32 873 11 60 804 > > 2015 Oct 14 06:21:30 10872778 0 0 0 > 0 0 0 1 0 1 0 > > 2015 Oct 14 06:21:31 10872779 12 12 11 > 56 56 52 631 0 8 623 > > 2015 Oct 14 06:21:33 10872780 11 5 4 > 74 37 27 858 0 44 814 > > 2015 Oct 14 06:21:36 10872781 14 4 6 > 79 26 30 977 12 82 883 > > 2015 Oct 14 06:21:39 10872782 11 3 4 > 78 26 25 957 18 55 884 > > 2015 Oct 14 06:21:43 10872783 13 3 4 > 80 20 24 930 0 135 795 > > 2015 Oct 14 06:21:46 10872784 13 4 4 > 81 27 29 965 13 95 857 > > 2015 Oct 14 06:21:49 10872785 11 3 6 > 80 26 41 1077 12 215 850 > > 2015 Oct 14 06:21:53 10872786 9 3 2 > 67 22 18 870 1 74 796 > > 2015 Oct 14 06:21:56 10872787 12 3 5 > 72 18 26 909 17 163 729 > > 2015 Oct 14 06:21:58 10872788 12 6 3 > 53 26 21 530 0 33 497 > > 2015 Oct 14 06:21:59 10872789 26 26 24 > 72 72 62 882 12 60 810 > > 2015 Oct 14 06:22:02 10872790 9 3 5 > 57 19 28 777 0 70 708 > > 2015 Oct 14 06:22:07 10872791 11 2 3 > 96 24 22 1044 12 46 986 > > 2015 Oct 14 06:22:10 10872792 13 3 4 > 78 19 22 911 12 38 862 > > 2015 Oct 14 06:22:14 10872793 11 2 4 > 79 19 26 930 10 94 826 > > 2015 Oct 14 06:22:17 10872794 11 3 5 > 73 24 26 1054 17 151 886 > > 2015 Oct 14 06:22:17 10872795 0 0 0 > 0 0 0 2 0 0 2 > > 2015 Oct 14 06:22:18 10872796 40 40 38 > 78 78 60 707 0 28 680 > > 2015 Oct 14 06:22:22 10872797 10 3 3 > 66 22 21 937 14 164 759 > > 2015 Oct 14 06:22:25 10872798 9 2 2 > 66 16 21 821 11 92 718 > > 2015 Oct 14 06:22:28 10872799 24 12 14 > 80 40 43 750 0 23 727 > > 2015 Oct 14 06:22:28 10872800 0 0 0 > 0 0 0 2 0 0 2 > > 2015 Oct 14 06:22:29 10872801 15 7 9 > 49 24 
24 526 11 25 490 > > 2015 Oct 14 06:22:33 10872802 10 2 3 > 79 19 24 939 0 63 876 > > 2015 Oct 14 06:22:36 10872803 10 5 3 > 59 29 18 756 11 65 682 > > 2015 Oct 14 06:22:36 10872804 0 0 0 > 0 0 0 0 0 0 0 > > 2015 Oct 14 06:22:36 10872805 13 13 2 > 58 58 9 500 0 29 471 > > > > -- > > > > root at zfs10:/tmp# zpool status pool01 > > pool: pool01 > > state: ONLINE > > scan: scrub repaired 0 in 7h53m with 0 errors on Sat Oct 3 06:53:43 2015 > > config: > > > > NAME STATE READ WRITE CKSUM > > pool01 ONLINE 0 0 0 > > mirror-0 ONLINE 0 0 0 > > c4t5000C50055FC9533d0 ONLINE 0 0 0 > > c4t5000C50055FE6A63d0 ONLINE 0 0 0 > > mirror-1 ONLINE 0 0 0 > > c4t5000C5005708296Fd0 ONLINE 0 0 0 > > c4t5000C5005708351Bd0 ONLINE 0 0 0 > > mirror-2 ONLINE 0 0 0 > > c4t5000C500570858EFd0 ONLINE 0 0 0 > > c4t5000C50057085A6Bd0 ONLINE 0 0 0 > > mirror-3 ONLINE 0 0 0 > > c4t5000C50057086307d0 ONLINE 0 0 0 > > c4t5000C50057086B67d0 ONLINE 0 0 0 > > mirror-4 ONLINE 0 0 0 > > c4t5000C500570870D3d0 ONLINE 0 0 0 > > c4t5000C50057089753d0 ONLINE 0 0 0 > > mirror-5 ONLINE 0 0 0 > > c4t5000C500625B7EA7d0 ONLINE 0 0 0 > > c4t5000C500625B8137d0 ONLINE 0 0 0 > > mirror-6 ONLINE 0 0 0 > > c4t5000C500625B8427d0 ONLINE 0 0 0 > > c4t5000C500625B86E3d0 ONLINE 0 0 0 > > mirror-7 ONLINE 0 0 0 > > c4t5000C500625B886Fd0 ONLINE 0 0 0 > > c4t5000C500625BB773d0 ONLINE 0 0 0 > > mirror-8 ONLINE 0 0 0 > > c4t5000C500625BC2C3d0 ONLINE 0 0 0 > > c4t5000C500625BD3EBd0 ONLINE 0 0 0 > > mirror-9 ONLINE 0 0 0 > > c4t5000C50062878C0Bd0 ONLINE 0 0 0 > > c4t5000C50062878C43d0 ONLINE 0 0 0 > > mirror-10 ONLINE 0 0 0 > > c4t5000C50062879687d0 ONLINE 0 0 0 > > c4t5000C50062879707d0 ONLINE 0 0 0 > > logs > > c11d0 ONLINE 0 0 0 > > c10d0 ONLINE 0 0 0 > > cache > > c14d0 ONLINE 0 0 0 > > c15d0 ONLINE 0 0 0 > > spares > > c4t5000C50062879723d0 AVAIL > > c4t5000C50062879787d0 AVAIL > > > > errors: No known data errors > > > > > > Br, > > Rune > > _______________________________________________ > OmniOS-discuss mailing list > OmniOS-discuss at lists.omniti.com > http://lists.omniti.com/mailman/listinfo/omnios-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From danmcd at omniti.com Wed Oct 14 14:50:10 2015 From: danmcd at omniti.com (Dan McDonald) Date: Wed, 14 Oct 2015 10:50:10 -0400 Subject: [OmniOS-discuss] NEW BLOODY - last or second-to-last before r151016 Message-ID: BIG update this time for OmniOS bloody as we head to r151016. If there is another bloody before r151016, it'll be bugfixes or upstream-illumos that I think require some bloody soak time. 
New with this update out of omnios-build (now at master revision 76d2785 in the install media, and one additional bugfix advancing to e75489a in the repo server):

- gcc51 now built with parallel make (shrinking build times noticeably)
- So many updates I had to automate the extraction of them:

  Update gnu-make to 4.1
  Update xz to 5.2.2
  Update wget to 1.16.3
  Update unixodbc to 2.3.2
  Update tmux to 2.0 (NOTE --> additional bugfix available only from "pkg update", thanks to Lauri "lotheac" Tirkkonen for the fix)
  Update tcsh to 6.19.0
  Update sqlite-3 to 3.8.11.1
  Update sigcpp to 2.6.1
  Update screen to 4.3.1
  Update simplejson-26 to 3.8.0
  Update pylint to 1.4.4
  Update ply to 3.8
  Update numpy-26 to 1.10.0
  Update lxml-26 to 3.4.4
  Update coverage-26 to 4.0
  Update pv (pipe-viewer) to 1.6.0
  Update pcre to 8.37
  Update pciutils to 3.4.0
  Update gnu-patch to 2.7.5
  Update netperf to 2.7.0
  Update Mercurial to 3.5.2
  Update libxml2 to 2.9.2
  Update libtool & libltdl to 2.4.6
  Update libpcap to 1.7.4
  Update libidn to 1.32
  Update iso-codes to 3.57
  Update ISC DHCP to 4.3.3
  Update intltool to 0.51.0
  Update groff to 1.22.3
  Update git to 2.6.1
  Update gawk to 4.1.3
  Update Amazon EC2 API to 1.7.5.1
  Update curl to 7.44.0
  Update bind to 9.10.3
  Update bash to 4.3p42
  Update automake to 1.15
  Update XML::Parser to 2.44
  Update coreutils to 8.24
  Update Mercurial to 3.5.1

And highlights of illumos-omnios progress (now at master revision 85fef88, meaning uname -v == omnios-85fef88) include:

- Resumable ZFS send/recv. Flag day here: http://www.listbox.com/member/archive/182191/2015/10/sort/time_rev/page/1/entry/2:18/20151012235207:C12B2C18-715D-11E5-A848-EAF6A2A023E1/
- New ZFS hash algorithms
- Other ZFS bugfixes
- Fix in link aggregations (aggrs) to be more reliable in the face of downed links (illumos 6274)
- strerror_l() for localized strerror(). (Translations welcome.)
- Updated hardware data (prelude to r151016).
- Assorted SMB/CIFS fixes from Nexenta.
- Slight increase in the number of concurrent BEs GRUB can cope with (up from 40 to ~55, but officially we still suggest you keep it at 40 or less).
- useradd/del/mod is now ZFS aware, fix from OpenIndiana.
- EOL of cachefs (I hope nobody is still using it...).
- New uuidgen(1) command.
- prtdiag(1M) improvements dealing with hardware in slots.
- SMBIOS 3.0 support, from Joyent.
- NVMe support (pardon the delay here, wanted it in '014 first).

Please give this one a spin folks -- it's essentially an r151016 preview.

Dan

From danmcd at omniti.com Wed Oct 14 14:51:50 2015 From: danmcd at omniti.com (Dan McDonald) Date: Wed, 14 Oct 2015 10:51:50 -0400 Subject: [OmniOS-discuss] EOSL for r151012 COMING VERY SOON Message-ID: <36615D8E-D774-4A0C-A1DB-7B0590BB5616@omniti.com>
I should have an updated one early tomorrow with libserf enabled to fix the http/https version. -Thanks, Colin On Wed, Oct 14, 2015 at 12:32 AM, Keith Paskett wrote: > subversion at 1.8.10-0.151014 is fine. > > > On Oct 13, 2015, at 10:07 PM, Keith Paskett wrote: > > > > After installing the following subversion package, I get an error > accessing any subversion repository via http(s) protocols: > > > > PACKAGE PUBLISHER > > pkg:/omniti/developer/versioning/subversion at 1.9.2-0.151014 ms.omniti.com > > > > The error I get is svn: E170000: Unrecognized URL scheme for ?https:// > ?' > > > > Online search suggests that it was compiled without the ?with-ssl and/or > ?with-neon switches. > > > > Subversion worked fine on a r151014 system I set up a couple of months > ago. > > > > Keith Paskett > > KLP-Systems > > > > > > > > > > > > _______________________________________________ > > OmniOS-discuss mailing list > > OmniOS-discuss at lists.omniti.com > > http://lists.omniti.com/mailman/listinfo/omnios-discuss > > _______________________________________________ > OmniOS-discuss mailing list > OmniOS-discuss at lists.omniti.com > http://lists.omniti.com/mailman/listinfo/omnios-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rt at steait.net Wed Oct 14 23:41:13 2015 From: rt at steait.net (Rune Tipsmark) Date: Wed, 14 Oct 2015 23:41:13 +0000 Subject: [OmniOS-discuss] ZIL TXG commits happen very frequently - why? In-Reply-To: References: , Message-ID: <1444866067065.866@steait.net> Thanks, that was helpful reading although I'm not 100% sure where to start. I did a bit of testing with the scripts and running IOmeter on a VM residing on a vSphere host connected to my ZFS box with 8Gbit Fibre channel. I noticed that txg commits rarely took over 1 second. root at zfs10:/tmp# dtrace -s duration.d pool01 dtrace: script 'duration.d' matched 2 probes CPU ID FUNCTION:NAME 7 17407 txg_sync_thread:txg-synced sync took 0.68 seconds 8 17407 txg_sync_thread:txg-synced sync took 0.71 seconds 12 17407 txg_sync_thread:txg-synced sync took 0.52 seconds 7 17407 txg_sync_thread:txg-synced sync took 0.29 seconds 22 17407 txg_sync_thread:txg-synced sync took 0.64 seconds 1 17407 txg_sync_thread:txg-synced sync took 0.34 seconds 5 17407 txg_sync_thread:txg-synced sync took 0.93 seconds 0 17407 txg_sync_thread:txg-synced sync took 0.46 seconds 9 17407 txg_sync_thread:txg-synced sync took 2.59 seconds 0 17407 txg_sync_thread:txg-synced sync took 0.29 seconds 7 17407 txg_sync_thread:txg-synced sync took 1.31 seconds 8 17407 txg_sync_thread:txg-synced sync took 0.71 seconds 10 17407 txg_sync_thread:txg-synced sync took 0.67 seconds 8 17407 txg_sync_thread:txg-synced sync took 0.29 seconds 12 17407 txg_sync_thread:txg-synced sync took 0.58 seconds 1 17407 txg_sync_thread:txg-synced sync took 0.46 seconds also I noticed that the default allocation of 4GB on the slog device was never used, the peak I saw was just over 1GB, most of the time half that. 
0 17408 txg_sync_thread:txg-syncing 1179MB of 4096MB used 9 17408 txg_sync_thread:txg-syncing 482MB of 4096MB used 20 17408 txg_sync_thread:txg-syncing 686MB of 4096MB used 0 17408 txg_sync_thread:txg-syncing 429MB of 4096MB used 14 17408 txg_sync_thread:txg-syncing 328MB of 4096MB used 10 17408 txg_sync_thread:txg-syncing 374MB of 4096MB used 8 17408 txg_sync_thread:txg-syncing 510MB of 4096MB used 12 17408 txg_sync_thread:txg-syncing 210MB of 4096MB used 1 17408 txg_sync_thread:txg-syncing 268MB of 4096MB used 0 17408 txg_sync_thread:txg-syncing 432MB of 4096MB used 16 17408 txg_sync_thread:txg-syncing 236MB of 4096MB used 18 17408 txg_sync_thread:txg-syncing 341MB of 4096MB used 9 17408 txg_sync_thread:txg-syncing 361MB of 4096MB used 14 17408 txg_sync_thread:txg-syncing 597MB of 4096MB used 10 17408 txg_sync_thread:txg-syncing 357MB of 4096MB used 21 17408 txg_sync_thread:txg-syncing 437MB of 4096MB used 18 17408 txg_sync_thread:txg-syncing 637MB of 4096MB used I did not see any significant write latency, but I did see a high read latency which is odd since it doesn't reflect what I experience on the VM or the vSphere host. root at zfs10:/tmp# dtrace -s rw.d -c 'sleep 60' write value ------------- Distribution ------------- count 8 | 0 16 | 27 32 |@ 11222 64 |@@@@@@@ 106215 128 |@@@@@@@@@@@@@@@@@@@@@@ 327807 256 |@@@@@@ 94605 512 |@ 20467 1024 | 6067 2048 |@ 7968 4096 |@ 10076 8192 | 3380 16384 | 249 32768 | 214 65536 | 219 131072 | 77 262144 | 1 524288 | 0 read value ------------- Distribution ------------- count 4 | 0 8 | 18 16 | 58 32 | 174 64 |@@@@@@ 4322 128 |@@@@@@@@@ 6278 256 |@@@ 2545 512 |@ 892 1024 |@ 1074 2048 |@@ 1171 4096 |@@@ 2222 8192 |@@@@@@ 4103 16384 |@@@ 2400 32768 |@@ 1401 65536 |@@ 1504 131072 |@ 897 262144 |@ 427 524288 | 39 1048576 | 1 2097152 | 0 avg latency stddev iops throughput write 496us 3136us 9809/s 450773k/s read 22633us 59917us 492/s 17405k/s I also happen to monitor how busy each disk is and I don't see any significant load there either... here is an example [cid:391d20a6-a7e7-4ec1-850c-2153d4eb4f64] so I'm a bit lost as what to do next, I don't see any stress on the system in terms of writes but I still cannot max out the 8gbit FC... reads however are doing fairly good, getting just over 700MB/sec which is acceptable over 8Gbit FC. Writes tend to be between 350 and 450 MB/sec... they should get up to 700MB/sec as well. Any ideas where to start? br, Rune ________________________________ From: Schweiss, Chip Sent: Wednesday, October 14, 2015 2:44 PM To: Rune Tipsmark Cc: omnios-discuss at lists.omniti.com Subject: Re: [OmniOS-discuss] ZIL TXG commits happen very frequently - why? It all has to do with the write throttle and buffers filling. Here's a great blog post on how it works and how it's tuned: http://dtrace.org/blogs/ahl/2014/02/10/the-openzfs-write-throttle/ http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/ -Chip On Wed, Oct 14, 2015 at 12:45 AM, Rune Tipsmark > wrote: Hi all. Wondering if anyone could shed some light on why my ZFS pool would perform TXG commits up to 5 times per second. It?s set to the default 5 second interval and occasionally it does wait 5 seconds between commits, but only when nearly idle. I?m not sure if this impacts my performance but I would suspect it doesn?t improve it. I force sync on all data. I got 11 mirrors (7200rpm sas disks) two SLOG devices and two L2 ARC devices and a pair of spare disks. Each log device can hold 150GB of data so plenty for 2 TXG commits. The system has 384GB memory. 
Below is a bit of output from zilstat during a near idle time this morning so you wont see 4-5 commits per second, but during load later today it will happen.. root at zfs10:/tmp# ./zilstat.ksh -M -t -p pool01 txg waiting for txg commit... TIME txg N-MB N-MB/s N-Max-Rate B-MB B-MB/s B-Max-Rate ops <=4kB 4-32kB >=32kB 2015 Oct 14 06:21:19 10872771 3 3 0 21 21 2 234 14 19 201 2015 Oct 14 06:21:22 10872772 10 3 3 70 23 24 806 0 84 725 2015 Oct 14 06:21:24 10872773 12 6 5 56 28 26 682 17 107 558 2015 Oct 14 06:21:25 10872774 13 13 2 75 75 14 651 0 10 641 2015 Oct 14 06:21:25 10872775 0 0 0 0 0 0 1 0 0 1 2015 Oct 14 06:21:26 10872776 11 11 6 53 53 29 645 2 136 507 2015 Oct 14 06:21:30 10872777 11 2 4 81 20 32 873 11 60 804 2015 Oct 14 06:21:30 10872778 0 0 0 0 0 0 1 0 1 0 2015 Oct 14 06:21:31 10872779 12 12 11 56 56 52 631 0 8 623 2015 Oct 14 06:21:33 10872780 11 5 4 74 37 27 858 0 44 814 2015 Oct 14 06:21:36 10872781 14 4 6 79 26 30 977 12 82 883 2015 Oct 14 06:21:39 10872782 11 3 4 78 26 25 957 18 55 884 2015 Oct 14 06:21:43 10872783 13 3 4 80 20 24 930 0 135 795 2015 Oct 14 06:21:46 10872784 13 4 4 81 27 29 965 13 95 857 2015 Oct 14 06:21:49 10872785 11 3 6 80 26 41 1077 12 215 850 2015 Oct 14 06:21:53 10872786 9 3 2 67 22 18 870 1 74 796 2015 Oct 14 06:21:56 10872787 12 3 5 72 18 26 909 17 163 729 2015 Oct 14 06:21:58 10872788 12 6 3 53 26 21 530 0 33 497 2015 Oct 14 06:21:59 10872789 26 26 24 72 72 62 882 12 60 810 2015 Oct 14 06:22:02 10872790 9 3 5 57 19 28 777 0 70 708 2015 Oct 14 06:22:07 10872791 11 2 3 96 24 22 1044 12 46 986 2015 Oct 14 06:22:10 10872792 13 3 4 78 19 22 911 12 38 862 2015 Oct 14 06:22:14 10872793 11 2 4 79 19 26 930 10 94 826 2015 Oct 14 06:22:17 10872794 11 3 5 73 24 26 1054 17 151 886 2015 Oct 14 06:22:17 10872795 0 0 0 0 0 0 2 0 0 2 2015 Oct 14 06:22:18 10872796 40 40 38 78 78 60 707 0 28 680 2015 Oct 14 06:22:22 10872797 10 3 3 66 22 21 937 14 164 759 2015 Oct 14 06:22:25 10872798 9 2 2 66 16 21 821 11 92 718 2015 Oct 14 06:22:28 10872799 24 12 14 80 40 43 750 0 23 727 2015 Oct 14 06:22:28 10872800 0 0 0 0 0 0 2 0 0 2 2015 Oct 14 06:22:29 10872801 15 7 9 49 24 24 526 11 25 490 2015 Oct 14 06:22:33 10872802 10 2 3 79 19 24 939 0 63 876 2015 Oct 14 06:22:36 10872803 10 5 3 59 29 18 756 11 65 682 2015 Oct 14 06:22:36 10872804 0 0 0 0 0 0 0 0 0 0 2015 Oct 14 06:22:36 10872805 13 13 2 58 58 9 500 0 29 471 -- root at zfs10:/tmp# zpool status pool01 pool: pool01 state: ONLINE scan: scrub repaired 0 in 7h53m with 0 errors on Sat Oct 3 06:53:43 2015 config: NAME STATE READ WRITE CKSUM pool01 ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c4t5000C50055FC9533d0 ONLINE 0 0 0 c4t5000C50055FE6A63d0 ONLINE 0 0 0 mirror-1 ONLINE 0 0 0 c4t5000C5005708296Fd0 ONLINE 0 0 0 c4t5000C5005708351Bd0 ONLINE 0 0 0 mirror-2 ONLINE 0 0 0 c4t5000C500570858EFd0 ONLINE 0 0 0 c4t5000C50057085A6Bd0 ONLINE 0 0 0 mirror-3 ONLINE 0 0 0 c4t5000C50057086307d0 ONLINE 0 0 0 c4t5000C50057086B67d0 ONLINE 0 0 0 mirror-4 ONLINE 0 0 0 c4t5000C500570870D3d0 ONLINE 0 0 0 c4t5000C50057089753d0 ONLINE 0 0 0 mirror-5 ONLINE 0 0 0 c4t5000C500625B7EA7d0 ONLINE 0 0 0 c4t5000C500625B8137d0 ONLINE 0 0 0 mirror-6 ONLINE 0 0 0 c4t5000C500625B8427d0 ONLINE 0 0 0 c4t5000C500625B86E3d0 ONLINE 0 0 0 mirror-7 ONLINE 0 0 0 c4t5000C500625B886Fd0 ONLINE 0 0 0 c4t5000C500625BB773d0 ONLINE 0 0 0 mirror-8 ONLINE 0 0 0 c4t5000C500625BC2C3d0 ONLINE 0 0 0 c4t5000C500625BD3EBd0 ONLINE 0 0 0 mirror-9 ONLINE 0 0 0 c4t5000C50062878C0Bd0 ONLINE 0 0 0 c4t5000C50062878C43d0 ONLINE 0 0 0 mirror-10 ONLINE 0 0 0 c4t5000C50062879687d0 ONLINE 0 0 0 
            c4t5000C50062879707d0  ONLINE       0     0     0
        logs
          c11d0                    ONLINE       0     0     0
          c10d0                    ONLINE       0     0     0
        cache
          c14d0                    ONLINE       0     0     0
          c15d0                    ONLINE       0     0     0
        spares
          c4t5000C50062879723d0    AVAIL
          c4t5000C50062879787d0    AVAIL

errors: No known data errors

Br,
Rune

[Attachment scrubbed: pastedImage.png (image/png, 51990 bytes) -- the per-disk
busy graph referenced above as cid:391d20a6-a7e7-4ec1-850c-2153d4eb4f64]

From danmcd at omniti.com  Wed Oct 14 23:45:19 2015
From: danmcd at omniti.com (Dan McDonald)
Date: Wed, 14 Oct 2015 19:45:19 -0400
Subject: [OmniOS-discuss] New metapackage --> illumos-tools
Message-ID: <7D40F53B-0F1C-4147-9E64-0C29E649EDCF@omniti.com>

IF you want to build illumos-omnios or illumos-gate on an OmniOS r151014 or
later installation, you can do so now simply by uttering:

    pkg install developer/illumos-tools

illumos-tools is a metapackage that brings in all of the required packages one
needs to build illumos-gate or illumos-omnios. You can then just download the
closed binaries, git clone your favorite illumos-gate child, construct a .env
file and get going.

I wanted to have this be a feature of r151016, but it was easy enough that I
backported it.

Sorry I didn't have something like this sooner. I've also updated the How to
Build illumos page on the illumos wiki.

Happy installing!
Dan

From ryan at zinascii.com  Thu Oct 15 00:59:58 2015
From: ryan at zinascii.com (Ryan Zezeski)
Date: Wed, 14 Oct 2015 20:59:58 -0400
Subject: [OmniOS-discuss] New metapackage --> illumos-tools
In-Reply-To: <7D40F53B-0F1C-4147-9E64-0C29E649EDCF@omniti.com>
References: <7D40F53B-0F1C-4147-9E64-0C29E649EDCF@omniti.com>
Message-ID:

Dan McDonald writes:

> IF you want to build illumos-omnios or illumos-gate on an OmniOS r151014 or
> later installation, you can do so now simply by uttering:
>
>     pkg install developer/illumos-tools
>
> illumos-tools is a metapackage that brings in all of the required packages
> one needs to build illumos-gate or illumos-omnios. You can then just
> download the closed binaries, git clone your favorite illumos-gate child,
> construct a .env file and get going.
>
> I wanted to have this be a feature of r151016, but it was easy enough that
> I backported it.
>
> Sorry I didn't have something like this sooner. I've also updated the How
> to Build illumos page on the illumos wiki.
>
> Happy installing!
> Dan

Thank you Dan! Changes like this may seem small but they make all the
difference to a beginner.

-Z

From danmcd at omniti.com  Thu Oct 15 01:23:03 2015
From: danmcd at omniti.com (Dan McDonald)
Date: Wed, 14 Oct 2015 21:23:03 -0400
Subject: [OmniOS-discuss] New metapackage --> illumos-tools
In-Reply-To:
References: <7D40F53B-0F1C-4147-9E64-0C29E649EDCF@omniti.com>
Message-ID:

> On Oct 14, 2015, at 8:59 PM, Ryan Zezeski wrote:
>
> Thank you Dan! Changes like this may seem small but they make all the
> difference to a beginner.

I wanted to have this done before FOSDEM, on the off chance I get to go.
There's one other thing in illumos-gate itself I think I can do to help the
newbie (merge bldenv and ws, probably implemented by adding goodies to bldenv,
and making ws a wrapper to said goodies), but any little bit will help.

Dan
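For a newcomer, the flow Dan describes condenses to something like the sketch
below. The repo URL and env-file steps come from the illumos build
documentation of this era rather than from Dan's mail, and the closed-binaries
download is left as a comment since its location varies; treat this as an
outline, not a recipe:

    pkg install developer/illumos-tools
    git clone https://github.com/illumos/illumos-gate.git  # or your favorite child
    cd illumos-gate
    # ...download and unpack the closed binaries per the wiki page...
    cp usr/src/tools/env/illumos.sh .   # the sample .env file; edit GATE,
    vi illumos.sh                       # CODEMGR_WS, NIGHTLY_OPTIONS, etc.
    ./usr/src/tools/scripts/nightly illumos.sh   # 'nightly.sh' on older gates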
From paladinemishakal at gmail.com  Thu Oct 15 07:49:48 2015
From: paladinemishakal at gmail.com (Lawrence Giam)
Date: Thu, 15 Oct 2015 15:49:48 +0800
Subject: [OmniOS-discuss] HP Proliant Gen9 server
Message-ID:

Hi All,

I am looking at getting an HP Proliant Gen9 server and running OmniOS on it.
Does anyone have any experience with this generation of server? Is the RAID
controller (either the B140i or the H240) supported by illumos? I did a search
and cannot find any result in the hardware compatibility list.

Thanks & Regards.

From danmcd at omniti.com  Thu Oct 15 11:25:18 2015
From: danmcd at omniti.com (Dan McDonald)
Date: Thu, 15 Oct 2015 07:25:18 -0400
Subject: [OmniOS-discuss] HP Proliant Gen9 server
In-Reply-To:
References:
Message-ID:

> On Oct 15, 2015, at 3:49 AM, Lawrence Giam wrote:
>
> Hi All,
>
> I am looking at getting an HP Proliant Gen9 server and running OmniOS on
> it. Does anyone have any experience with this generation of server? Is the
> RAID controller (either the B140i or the H240) supported by illumos? I did
> a search and cannot find any result in the hardware compatibility list.

There have been updates to cpqary3 for more modern HP Proliant HW. If you know
the PCI IDs, that'd be MOST helpful. I suspect the answer is "yes", but I
don't have the requisite experience.

Dan

From keith at paskett.org  Thu Oct 15 19:29:23 2015
From: keith at paskett.org (Keith Paskett)
Date: Thu, 15 Oct 2015 13:29:23 -0600
Subject: [OmniOS-discuss] HP Proliant Gen9 server
In-Reply-To:
References:
Message-ID: <647A48B5-2A1E-40AD-9B22-0BDD179EC098@paskett.org>

Proceed with caution.

A couple of months ago we tried with r151014 and never could get OmniOS to
recognize the drives. We tried a couple of different HBAs/array controllers.
The Gen8 systems have been great, but they are getting harder to find new.

Keith

> On Oct 15, 2015, at 5:25 AM, Dan McDonald wrote:
>
>> On Oct 15, 2015, at 3:49 AM, Lawrence Giam wrote:
>>
>> Hi All,
>>
>> I am looking at getting an HP Proliant Gen9 server and running OmniOS on
>> it. Does anyone have any experience with this generation of server? Is
>> the RAID controller (either the B140i or the H240) supported by illumos?
>> I did a search and cannot find any result in the hardware compatibility
>> list.
>
> There have been updates to cpqary3 for more modern HP Proliant HW. If you
> know the PCI IDs, that'd be MOST helpful. I suspect the answer is "yes",
> but I don't have the requisite experience.
>
> Dan

From danmcd at omniti.com  Thu Oct 15 19:37:05 2015
From: danmcd at omniti.com (Dan McDonald)
Date: Thu, 15 Oct 2015 15:37:05 -0400
Subject: [OmniOS-discuss] HP Proliant Gen9 server
In-Reply-To: <647A48B5-2A1E-40AD-9B22-0BDD179EC098@paskett.org>
References: <647A48B5-2A1E-40AD-9B22-0BDD179EC098@paskett.org>
Message-ID:

> On Oct 15, 2015, at 3:29 PM, Keith Paskett wrote:
>
> Proceed with caution.
> A couple of months ago we tried with r151014 and never could get OmniOS to
> recognize the drives. We tried a couple of different HBAs/array
> controllers.
> The Gen8 systems have been great, but they are getting harder to find new.

Did you try after this commit got backported into r151014?
commit d08e0e5199f47566c90482b1ef4f31ec3798228b
Author:     Robert Mustacchi
AuthorDate: Tue Aug 11 14:53:49 2015 -0700
Commit:     Dan McDonald
CommitDate: Tue Aug 18 11:39:33 2015 -0400

    6113 cpqary3: add support for hp gen9 smart array controllers
    Reviewed by: Garrett D'Amore
    Reviewed by: Igor Kozhukhov
    Reviewed by: Richard Lowe
    Approved by: Dan McDonald

Note the commit date of 18 August vs. when you tried. I can't recall if you
tried before or after. The most recent r151014 ISO should have that commit on
there.

Dan

From keith at paskett.org  Thu Oct 15 20:18:08 2015
From: keith at paskett.org (Keith Paskett)
Date: Thu, 15 Oct 2015 14:18:08 -0600
Subject: [OmniOS-discuss] HP Proliant Gen9 server
In-Reply-To:
References: <647A48B5-2A1E-40AD-9B22-0BDD179EC098@paskett.org>
Message-ID: <2AB16D7D-E4EC-4E4C-A4DD-8B23320ECEBA@paskett.org>

> On Oct 15, 2015, at 1:37 PM, Dan McDonald wrote:
>
>> On Oct 15, 2015, at 3:29 PM, Keith Paskett wrote:
>>
>> Proceed with caution.
>> A couple of months ago we tried with r151014 and never could get OmniOS
>> to recognize the drives. We tried a couple of different HBAs/array
>> controllers.
>> The Gen8 systems have been great, but they are getting harder to find
>> new.
>
> Did you try after this commit got backported into r151014?
>
> commit d08e0e5199f47566c90482b1ef4f31ec3798228b
> Author:     Robert Mustacchi
> AuthorDate: Tue Aug 11 14:53:49 2015 -0700
> Commit:     Dan McDonald
> CommitDate: Tue Aug 18 11:39:33 2015 -0400
>
>     6113 cpqary3: add support for hp gen9 smart array controllers
>     Reviewed by: Garrett D'Amore
>     Reviewed by: Igor Kozhukhov
>     Reviewed by: Richard Lowe
>     Approved by: Dan McDonald
>
> Note the commit date of 18 August vs. when you tried. I can't recall if
> you tried before or after. The most recent r151014 ISO should have that
> commit on there.
>
> Dan

We had already returned our gen9 servers by 18 August, so we never tried with
that patch. How quickly knowledge becomes obsolete. At least in this case,
it's bad news that is no longer true.

Keith

From henson at acm.org  Fri Oct 16 01:59:00 2015
From: henson at acm.org (Paul B. Henson)
Date: Thu, 15 Oct 2015 18:59:00 -0700
Subject: [OmniOS-discuss] HP Proliant Gen9 server
In-Reply-To:
References:
Message-ID: <20151016015900.GQ3405@bender.unx.cpp.edu>

On Thu, Oct 15, 2015 at 07:25:18AM -0400, Dan McDonald wrote:
> There have been updates to cpqary3 for more modern HP Proliant HW. If
> you know the PCI IDs, that'd be MOST helpful. I suspect the answer is
> "yes", but I don't have the requisite experience.

We've got some gen9 gear running linux; the H240 card shows up as:

0a:00.0 RAID bus controller [0104]: Hewlett-Packard Company Smart Array Gen9 Controllers [103c:3239] (rev 01)

I believe 103c:3239 is the PCI ID for the card. I can provide any other
hardware details on request (we've got a handful of DL160 units and a DL360
unit), but unfortunately at this time don't have a box I could actually boot
omnios on.

We're actually planning on buying some HP gear for basic zfs cifs/nfs storage,
but ideally will get it with just a SAS HBA rather than the HP raid card. I
know HP partners with Nexenta, so if you can find the right sales rep they
should be able to spec out illumos-friendly gear.
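Given that ID, a quick check on an existing OmniOS box will tell you whether
the shipped cpqary3 already claims it; a sketch using the standard illumos
pciVVVV,DDDD alias convention (the ID below is the one Paul reported, and
forcing an attach is strictly an at-your-own-risk experiment):

    # is the ID already among cpqary3's declared aliases?
    grep cpqary3 /etc/driver_aliases | grep -i '103c,3239'

    # if not, the classic experiment is to add the binding by hand:
    update_drv -a -i '"pci103c,3239"' cpqary3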
From trey at mailchimp.com  Fri Oct 16 04:38:49 2015
From: trey at mailchimp.com (Trey Palmer)
Date: Fri, 16 Oct 2015 00:38:49 -0400
Subject: [OmniOS-discuss] HP Proliant Gen9 server
In-Reply-To:
References:
Message-ID:

B140i is standard AHCI with a fakeraid mode per HP's docs, so it probably
works okay. B120i works fine, and it looks like the B140i is the same thing
but supports more disks.

We also run OmniOS on DL380 Gen8's, but with LSI 2x08 cards and Intel X520's
(ixgbe) instead of the onboard controllers. That setup works well. I have a
Gen9 with an LSI card racked in to test but haven't gotten to it yet.

One really nice thing about the DL380 Gen9 is that you can get a 24xSFF
version with no SAS expanders.

Somewhat relevant: I tested a 60-disk 4.3U SL4540 Gen8 earlier this year,
using the mezzanine H220 (which is an LSI mpt_sas chipset; the PCIe version
looks like a bog-standard 920x-8i). The box takes two mezzanine cards, which
reminded me a little of Sparc 20 SBus cards. There's a PCIe riser available
for the upper socket, but you can't use it for an HBA because the only
connection to the disks is through the mezzanine sockets.

The mezzanine H220 was the most pathological mpt_sas card I've ever
encountered. The disks would just disconnect completely. It could be the weird
connection or the SAS expanders or anything else (disks were He8 SAS). It just
felt risky, and we punted on the non-standard hardware after trying several
firmware versions even though we loved the form factor.

The "onboard 10GbE" turns out to be Mellanox ConnectX-3, with one QSFP and one
SFP.
It has two SFF AHCI SATA system drives (B120i) per server module. You can get
it with one or several nodes in the chassis: 1x60, 2x25 or 3x15. It seems
purpose-built for the HDFS/Ceph/GlusterFS communities.

-- Trey

On Thu, Oct 15, 2015 at 3:49 AM, Lawrence Giam wrote:
> Hi All,
>
> I am looking at getting an HP Proliant Gen9 server and running OmniOS on
> it. Does anyone have any experience with this generation of server? Is the
> RAID controller (either the B140i or the H240) supported by illumos? I did
> a search and cannot find any result in the hardware compatibility list.
>
> Thanks & Regards.

From johan.kragsterman at capvert.se  Fri Oct 16 12:48:06 2015
From: johan.kragsterman at capvert.se (Johan Kragsterman)
Date: Fri, 16 Oct 2015 14:48:06 +0200
Subject: [OmniOS-discuss] was: HP Proliant Gen9 server; now: Mezzanine H220 problems SL4540
In-Reply-To:
References:
Message-ID:

Hi!

-----"OmniOS-discuss" wrote: -----
To: Lawrence Giam
From: Trey Palmer
Sent by: "OmniOS-discuss"
Date: 2015-10-16 06:47
Cc: omnios-discuss
Subject: Re: [OmniOS-discuss] HP Proliant Gen9 server

Somewhat relevant: I tested a 60-disk 4.3U SL4540 Gen8 earlier this year,
using the mezzanine H220 (which is an LSI mpt_sas chipset; the PCIe version
looks like a bog-standard 920x-8i). The box takes two mezzanine cards, which
reminded me a little of Sparc 20 SBus cards. There's a PCIe riser available
for the upper socket, but you can't use it for an HBA because the only
connection to the disks is through the mezzanine sockets.

The mezzanine H220 was the most pathological mpt_sas card I've ever
encountered. The disks would just disconnect completely. It could be the weird
connection or the SAS expanders or anything else (disks were He8 SAS). It just
felt risky, and we punted on the non-standard hardware after trying several
firmware versions even though we loved the form factor.

The "onboard 10GbE" turns out to be Mellanox ConnectX-3, with one QSFP and one
SFP. It has two SFF AHCI SATA system drives (B120i) per server module. You can
get it with one or several nodes in the chassis: 1x60, 2x25 or 3x15. It seems
purpose-built for the HDFS/Ceph/GlusterFS communities.

-- Trey

Just want to comment on the Mezzanine H220 problems on the SL4540: could be a
PCI bridge problem. Perhaps there is a bridge between the PCIe bus and the
mezzanine card slot, and that one could definitely mess it up...

Rgrds Johan

From colin at omniti.com  Fri Oct 16 15:06:42 2015
From: colin at omniti.com (Colin Roche-Dutch)
Date: Fri, 16 Oct 2015 11:06:42 -0400
Subject: [OmniOS-discuss] subversion intentionally compiled without http(s) support?
In-Reply-To:
References: <3B1BD7DE-7D7B-471F-A5DE-B402E145B65D@paskett.org>
Message-ID:

Keith,

The updated pkg for subversion 1.9.2 is out with libserf for http/https
support.

-Thanks,
Colin

On Wed, Oct 14, 2015 at 7:24 PM, Colin Roche-Dutch wrote:
> Hi Keith,
>
> A bad package slipped into the ms.omniti.com repo. I should have an
> updated one early tomorrow with libserf enabled to fix the http/https
> version.
>
> -Thanks,
> Colin
>
> On Wed, Oct 14, 2015 at 12:32 AM, Keith Paskett wrote:
>
>> subversion at 1.8.10-0.151014 is fine.
>>
>> > On Oct 13, 2015, at 10:07 PM, Keith Paskett wrote:
>> >
>> > After installing the following subversion package, I get an error
>> accessing any subversion repository via http(s) protocols:
>> >
>> > PACKAGE PUBLISHER
>> > pkg:/omniti/developer/versioning/subversion at 1.9.2-0.151014
>> ms.omniti.com
>> >
>> > The error I get is svn: E170000: Unrecognized URL scheme for 'https://...'
>> >
>> > Online search suggests that it was compiled without the --with-ssl
>> and/or --with-neon switches.
>> >
>> > Subversion worked fine on an r151014 system I set up a couple of months
>> ago.
>> >
>> > Keith Paskett
>> > KLP-Systems

From kai at meder.info  Sat Oct 17 14:51:02 2015
From: kai at meder.info (Kai)
Date: Sat, 17 Oct 2015 14:51:02 +0000 (UTC)
Subject: [OmniOS-discuss] Upgrade from r151006 to r151014
Message-ID:

Hello,

I'm currently on OmniOS v11 r151006 and want to upgrade to the latest LTS and
stay there for a while.

I already did

    $ pkg refresh --full
    $ pkg unfreeze \
        entire \
        consolidation/osnet/osnet-incorporation \
        incorporation/jeos/illumos-gate \
        incorporation/jeos/omnios-userland

(although I can't remember having frozen anything in the past)

    $ pkg update -nv
    : No updates available for this image.

Doing an explicit

    $ pkg update -nv --be-name=omnios-r151014 entire at 11,5.11-0.151014
    : pkg update: 'entire at 11,5.11-0.151008' matches no installed packages

What can I do now?
Thank you very much,
Kai

From jdg117 at elvis.arl.psu.edu  Sat Oct 17 15:21:20 2015
From: jdg117 at elvis.arl.psu.edu (John D Groenveld)
Date: Sat, 17 Oct 2015 11:21:20 -0400
Subject: [OmniOS-discuss] Upgrade from r151006 to r151014
In-Reply-To: Your message of "Sat, 17 Oct 2015 14:51:02 -0000."
References:
Message-ID: <201510171521.t9HFLKm2006012@elvis.arl.psu.edu>

In message , Kai writes:
>I'm currently on OmniOS v11 r151006 and want to upgrade to the latest LTS
>and stay there for a while.

Did you reset your publisher?

John
groenveld at acm.org

From danmcd at omniti.com  Sat Oct 17 16:11:34 2015
From: danmcd at omniti.com (Dan McDonald)
Date: Sat, 17 Oct 2015 12:11:34 -0400
Subject: [OmniOS-discuss] Upgrade from r151006 to r151014
In-Reply-To:
References:
Message-ID: <15DBC5B2-D7E2-461A-800F-EE097646A03B@omniti.com>

We documented this pretty well, I think.

http://omnios.omniti.com/wiki.php/Upgrade_to_r151014

Dan

Sent from my iPhone (typos, autocorrect, and all)

> On Oct 17, 2015, at 10:51 AM, Kai wrote:
>
> Hello,
>
> I'm currently on OmniOS v11 r151006 and want to upgrade to the latest LTS
> and stay there for a while.
>
> I already did
> $ pkg refresh --full
> $ pkg unfreeze \
>     entire \
>     consolidation/osnet/osnet-incorporation \
>     incorporation/jeos/illumos-gate \
>     incorporation/jeos/omnios-userland
> (although I can't remember having frozen anything in the past)
>
> $ pkg update -nv
> : No updates available for this image.
>
> Doing an explicit
> $ pkg update -nv --be-name=omnios-r151014 entire at 11,5.11-0.151014
> : pkg update: 'entire at 11,5.11-0.151008' matches no installed packages
>
> What can I do now?
>
> Thank you very much,
> Kai

From richard at netbsd.org  Tue Oct 20 10:42:50 2015
From: richard at netbsd.org (Richard PALO)
Date: Tue, 20 Oct 2015 12:42:50 +0200
Subject: [OmniOS-discuss] omnios-build perl
Message-ID:

thought I'd try to rebuild perl (for grins) on bloody

> richard at omnis:/home/richard/src/omnios-build/build/perl$ tail -30 build.log
> Build Perl for SOCKS? [n]
> Try to use long doubles if available? [n]
> Checking for optional libraries...
> What libraries to use? [-lsocket -lnsl -lgdbm -ldl -lm -lpthread -lc]
> What optimizer/debugger flag should be used? [-O3]
> Any additional cc flags?
> [-D_REENTRANT -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_TS_ERRNO -DPTR_IS_LONG -fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include]
> Let me guess what the preprocessor flags are...
> Any additional ld flags (NOT including libraries)?
> [ -fstack-protector -L/opt/local/lib -L/usr/gnu/lib]
> Checking your choice of C compiler and flags for coherency...
> Configure: line 5358: 415512 Killed $sh -c "$run ./try " >> try.msg 2>&1
> I've tried to compile and run the following simple program:
>
> #include <stdio.h>
> int main() { printf("Ok\n"); return(0); }
>
> I used the command:
>
> gcc -o try -O3 -D_REENTRANT -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_TS_ERRNO -DPTR_IS_LONG -fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include -fstack-protector -L/opt/local/lib -L/usr/gnu/lib try.c -lsocket -lnsl -lgdbm -ldl -lm -lpthread -lc
> ./try
>
> and I got the following output:
>
> ld.so.1: try: fatal: libgdbm.so.4: open failed: No such file or directory
> The program compiled OK, but exited with status 137.
> You have a problem. Shall I abort Configure [y]
> Ok. Stopping Configure.
> --- Configure failed
> ===== Build aborted =====

quésaco? I don't believe the gate nor omnios provides gdbm...
-- Richard PALO

From sequoiamobil at gmx.net  Tue Oct 20 11:47:47 2015
From: sequoiamobil at gmx.net (Sebastian Gabler)
Date: Tue, 20 Oct 2015 13:47:47 +0200
Subject: [OmniOS-discuss] HP Proliant Gen9 server
In-Reply-To:
References:
Message-ID: <562629E3.9060004@gmx.net>

On 20.10.2015 12:43, omnios-discuss-request at lists.omniti.com wrote:
> Message: 1
> Date: Fri, 16 Oct 2015 00:38:49 -0400
> From: Trey Palmer
> To: Lawrence Giam
> Cc: omnios-discuss
> Subject: Re: [OmniOS-discuss] HP Proliant Gen9 server
> Message-ID:
> Content-Type: text/plain; charset="utf-8"

> One really nice thing about the DL380 Gen9 is that you can get a 24xSFF
> version with no SAS expanders.

I would be interested in how that would work using a single HP-branded HBA or
RAID controller. It may work using an H240ar and an H240 (PCIe) together
(two-port controllers each), but I am not sure that is a desirable
configuration. I'd rather go for the expander card option, or for external
JBODs entirely.

My expectation would be that the expander card would not have problems with
HP-branded drives, at least. Aftermarket drives are problematic anyhow in the
context of the HP boxes.

The B120i is no longer available with the G9 servers, BTW.

sebastian

From jdg117 at elvis.arl.psu.edu  Tue Oct 20 12:32:23 2015
From: jdg117 at elvis.arl.psu.edu (John D Groenveld)
Date: Tue, 20 Oct 2015 08:32:23 -0400
Subject: [OmniOS-discuss] omnios-build perl
In-Reply-To: Your message of "Tue, 20 Oct 2015 12:42:50 +0200."
References:
Message-ID: <201510201232.t9KCWNcY007546@elvis.arl.psu.edu>

In message , Richard PALO writes:
>thought I'd try to rebuild perl (for grins) on bloody
>
>> richard at omnis:/home/richard/src/omnios-build/build/perl$ tail -30 build.log
>> Build Perl for SOCKS? [n]
>> Try to use long doubles if available? [n]
>> Checking for optional libraries...
>> What libraries to use? [-lsocket -lnsl -lgdbm -ldl -lm -lpthread -lc]

Where's that -lgdbm come from?
Perl's not dependent on libgdbm.
And which gcc are you using?
AFAICT bloody allows you to choose between developer/gcc48 and
developer/gcc51.

John
groenveld at acm.org

From lotheac at iki.fi  Tue Oct 20 12:52:34 2015
From: lotheac at iki.fi (Lauri Tirkkonen)
Date: Tue, 20 Oct 2015 15:52:34 +0300
Subject: [OmniOS-discuss] omnios-build perl
In-Reply-To:
References:
Message-ID: <20151020125234.GA29305@gutsman.lotheac.fi>

On Tue, Oct 20 2015 12:42:50 +0200, Richard PALO wrote:
> thought I'd try to rebuild perl (for grins) on bloody
>
> > richard at omnis:/home/richard/src/omnios-build/build/perl$ tail -30 build.log
> > Build Perl for SOCKS? [n]
> > Try to use long doubles if available? [n]
> > Checking for optional libraries...
> > What libraries to use? [-lsocket -lnsl -lgdbm -ldl -lm -lpthread -lc]

Doesn't happen on my bloody box. Perhaps you have gdbm.h available somewhere
and perl picks it up? My perl build.log says:

~/omnios-build/build/perl % egrep '(gdbm\.h|What libraries)' build.log
What libraries to use? [-lsocket -lnsl -ldl -lm -lpthread -lc]
 NOT found.
What libraries to use? [-lsocket -lnsl -ldl -lm -lpthread -lc]
 NOT found.
-- Lauri Tirkkonen | lotheac @ IRCnet

From richard at netbsd.org  Tue Oct 20 13:01:21 2015
From: richard at netbsd.org (Richard PALO)
Date: Tue, 20 Oct 2015 15:01:21 +0200
Subject: [OmniOS-discuss] omnios-build perl
In-Reply-To: <201510201232.t9KCWNcY007546@elvis.arl.psu.edu>
References: <201510201232.t9KCWNcY007546@elvis.arl.psu.edu>
Message-ID: <56263B21.6030707@netbsd.org>

On 20/10/15 14:32, John D Groenveld wrote:
> In message , Richard PALO writes:
>> thought I'd try to rebuild perl (for grins) on bloody
>>
>>> richard at omnis:/home/richard/src/omnios-build/build/perl$ tail -30 build.log
>>> Build Perl for SOCKS? [n]
>>> Try to use long doubles if available? [n]
>>> Checking for optional libraries...
>>> What libraries to use? [-lsocket -lnsl -lgdbm -ldl -lm -lpthread -lc]
>
> Where's that -lgdbm come from?
> Perl's not dependent on libgdbm.

From INSTALL, perl seems to try autodetecting, which I presume is buggy here.
Perhaps something needs to be done in Configure or Makefile.SH?

> richard at omnis:/home/richard$ gcc -v
> Using built-in specs.
> COLLECT_GCC=gcc
> COLLECT_LTO_WRAPPER=/opt/gcc-5.1.0/libexec/gcc/i386-pc-solaris2.11/5.1.0/lto-wrapper
> Target: i386-pc-solaris2.11
> Configured with: ./configure --prefix=/opt/gcc-5.1.0 --host i386-pc-solaris2.11 --build i386-pc-solaris2.11 --target i386-pc-solaris2.11 --with-boot-ldflags=-R/opt/gcc-5.1.0/lib --with-gmp=/opt/gcc-5.1.0 --with-mpfr=/opt/gcc-5.1.0 --with-mpc=/opt/gcc-5.1.0 --enable-languages=c,c++,fortran,lto --without-gnu-ld --with-ld=/bin/ld --with-as=/usr/bin/gas --with-gnu-as --with-build-time-tools=/usr/gnu/i386-pc-solaris2.11/bin
> Thread model: posix
> gcc version 5.1.0 (GCC)

-- Richard PALO

From richard at netbsd.org  Tue Oct 20 13:19:10 2015
From: richard at netbsd.org (Richard PALO)
Date: Tue, 20 Oct 2015 15:19:10 +0200
Subject: [OmniOS-discuss] omnios-build perl
In-Reply-To: <20151020125234.GA29305@gutsman.lotheac.fi>
References: <20151020125234.GA29305@gutsman.lotheac.fi>
Message-ID: <56263F4E.9010709@netbsd.org>

On 20/10/15 14:52, Lauri Tirkkonen wrote:
> egrep '(gdbm\.h|What libraries)' build.log

bloody hell, perl is picking up my pkgsrc installation in /opt/local

I already checked my $PATH prior to launching build.sh, but I notice from
INSTALL:

> 1022 Again, this should all happen automatically. This should also work if
> 1023 you have gdbm installed in any of (/usr/local, /opt/local, /usr/gnu,
> 1024 /opt/gnu, /usr/GNU, or /opt/GNU).

I'll patch these out and see.
-- Richard PALO

From trey at mailchimp.com  Tue Oct 20 13:22:24 2015
From: trey at mailchimp.com (Trey Palmer)
Date: Tue, 20 Oct 2015 09:22:24 -0400
Subject: [OmniOS-discuss] HP Proliant Gen9 server
In-Reply-To: <562629E3.9060004@gmx.net>
References: <562629E3.9060004@gmx.net>
Message-ID:

The expectation by HP is to use a SAS expander card if you're using the HP
RAID hardware. I was thinking of the specific case of running ZFS on SATA
SSD's hooked up to mpt_sas HBA's.

For SATA drives on Illumos, SAS expanders should be avoided. Not a bad idea on
any platform, but imperative on Illumos.
-- Trey

On Tuesday, October 20, 2015, Sebastian Gabler wrote:
> On 20.10.2015 12:43, omnios-discuss-request at lists.omniti.com wrote:
>> Message: 1
>> Date: Fri, 16 Oct 2015 00:38:49 -0400
>> From: Trey Palmer
>> To: Lawrence Giam
>> Cc: omnios-discuss
>> Subject: Re: [OmniOS-discuss] HP Proliant Gen9 server
>> Message-ID:
>> <CADRROpUpP+E1XCAcayDWmOT4n9RJqiTabnnwz6G0rOEBNSDtCQ at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>
>> One really nice thing about the DL380 Gen9 is that you can get a 24xSFF
>> version with no SAS expanders.
>
> I would be interested in how that would work using a single HP-branded HBA
> or RAID controller. It may work using an H240ar and an H240 (PCIe)
> together (two-port controllers each), but I am not sure that is a
> desirable configuration.
> I'd rather go for the expander card option, or for external JBODs entirely.
> My expectation would be that the expander card would not have problems
> with HP-branded drives, at least. Aftermarket drives are problematic
> anyhow in the context of the HP boxes.
> The B120i is no longer available with the G9 servers, BTW.
>
> sebastian

From richard at netbsd.org  Tue Oct 20 14:13:58 2015
From: richard at netbsd.org (Richard PALO)
Date: Tue, 20 Oct 2015 16:13:58 +0200
Subject: [OmniOS-discuss] omnios-build perl
In-Reply-To: <20151020125234.GA29305@gutsman.lotheac.fi>
References: <20151020125234.GA29305@gutsman.lotheac.fi>
Message-ID: <56264C26.3090403@netbsd.org>

On 20/10/15 14:52, Lauri Tirkkonen wrote:
> On Tue, Oct 20 2015 12:42:50 +0200, Richard PALO wrote:
>> thought I'd try to rebuild perl (for grins) on bloody
>>
>>> richard at omnis:/home/richard/src/omnios-build/build/perl$ tail -30 build.log
>>> Build Perl for SOCKS? [n]
>>> Try to use long doubles if available? [n]
>>> Checking for optional libraries...
>>> What libraries to use? [-lsocket -lnsl -lgdbm -ldl -lm -lpthread -lc]
>
> Doesn't happen on my bloody box. Perhaps you have gdbm.h available
> somewhere and perl picks it up? My perl build.log says:
>
> ~/omnios-build/build/perl % egrep '(gdbm\.h|What libraries)' build.log
> What libraries to use? [-lsocket -lnsl -ldl -lm -lpthread -lc]
>  NOT found.
> What libraries to use? [-lsocket -lnsl -ldl -lm -lpthread -lc]
>  NOT found.

I was able to get it building okay by updating gcc-sunld.patch with the
following (note the added -e 's@ gdbm @ @' at the end of the + line):

> @@ -60,7 +60,7 @@ esac
>  # libmalloc.a may allocate memory that is only 4 byte aligned, but
>  # GNU CC on the Sparc assumes that doubles are 8 byte aligned.
>  # Thanks to Hallvard B. Furuseth
> -set `echo " $libswanted " | sed -e 's@ ld @ @' -e 's@ malloc @ @' -e 's@ ucb @ @' -e 's@ sec @ @' -e 's@ crypt @ @'`
> +set `echo " $libswanted " | sed -e 's@ ld @ @' -e 's@ malloc @ @' -e 's@ ucb @ @' -e 's@ sec @ @' -e 's@ crypt @ @' -e 's@ gdbm @ @'`
>  libswanted="$*"
>
>  # Look for architecture name. We want to suggest a useful default.

BTW, I needed to add it to patches/series with the argument -bz.orig to be
useful; this is probably not necessary if always building in a virgin dev
zone.

-- Richard PALO

From al.slater at scluk.com  Wed Oct 21 10:08:55 2015
From: al.slater at scluk.com (Al Slater)
Date: Wed, 21 Oct 2015 11:08:55 +0100
Subject: [OmniOS-discuss] ILB memory leak?
Message-ID: <56276437.2020109@scluk.com>

Hi,

I am running omnios r151014 on a couple of machines with a couple of zones
each. One zone runs apache as an SSL reverse proxy; the other runs ILB for
load balancing web-to-app-tier connections.

I noticed that in the ILB zone, the ilbd process memory grows to about 2GB.
Restarting ILB releases the memory, and then the memory usage gradually
increases again, with each memory increase approximately twice the size of the
previous one. I run a cronjob twice a day (8am and 8pm) which restarts the ilb
service and releases the memory.
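For reference, that workaround amounts to a root crontab entry along these
lines (the FMRI is the stock illumos ILB service; a sketch, not Al's actual
entry):

    # restart ilbd at 08:00 and 20:00 to release the leaked memory
    0 8,20 * * * /usr/sbin/svcadm restart svc:/network/loadbalancer/ilb:default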
-- Richard PALO From al.slater at scluk.com Wed Oct 21 10:08:55 2015 From: al.slater at scluk.com (Al Slater) Date: Wed, 21 Oct 2015 11:08:55 +0100 Subject: [OmniOS-discuss] ILB memory leak? Message-ID: <56276437.2020109@scluk.com> Hi, I am running omnios r151014 on a couple of machines with a couple of zones each. 1 zone runs apache as an SSL reverse proxy, the other runs ILB for load balancing web to app tier connections. I noticed that in the ILB zone, the ilbd process memory grows to about 2Gb. Restarting ILB releases the memory, and then the memory usage gradually increases again, with each memory increase approximately 2 * the size of the previous one. I run a cronjob twice a day ( 8am and 8pm) which restarts the ilb service and releases the memory. A graph of memory usage is available at https://www.dropbox.com/s/zaz51apxslnivlq/ILB_Memory_2_days.png?dl=0 There are currently 62 rules in the load balancer, with a total of 664 server/port pairs. Is there anything I can provide that would help track this down? -- Al Slater From danmcd at omniti.com Wed Oct 21 16:35:33 2015 From: danmcd at omniti.com (Dan McDonald) Date: Wed, 21 Oct 2015 12:35:33 -0400 Subject: [OmniOS-discuss] ILB memory leak? In-Reply-To: <56276437.2020109@scluk.com> References: <56276437.2020109@scluk.com> Message-ID: <00AE5FA3-E699-4C4F-8A94-AEEDAAED0856@omniti.com> > On Oct 21, 2015, at 6:08 AM, Al Slater wrote: > > Hi, > > I am running omnios r151014 on a couple of machines with a couple of zones each. 1 zone runs apache as an SSL reverse proxy, the other runs ILB for load balancing web to app tier connections. > > I noticed that in the ILB zone, the ilbd process memory grows to about 2Gb. Restarting ILB releases the memory, and then the memory usage gradually increases again, with each memory increase approximately 2 * the size of the previous one. I run a cronjob twice a day ( 8am and 8pm) which restarts the ilb service and releases the memory. > > A graph of memory usage is available at https://www.dropbox.com/s/zaz51apxslnivlq/ILB_Memory_2_days.png?dl=0 > > There are currently 62 rules in the load balancer, with a total of 664 server/port pairs. > > Is there anything I can provide that would help track this down? You can use svccfg(1M) to enable user-level memory debugging on ilb. It may cause the ilb daemon to dump core. (And you're just noticing this in the process, not kernel memory consumption, correct?) As root: svcadm disable -t ilb svccfg -s ilb setenv LD_PRELOAD libumem.so svccfg -s ilb setenv UMEM_DEBUG default svccfg -s ilb refresh svcadm enable ilb That should enable user-level memory debugging. If you get a coredump, save it and share it. If you don't and the ilb daemon keeps running, eventually please: gcore `pgrep ilbd` and share THAT corefile. You can also do this by youself: mdb > ::findleaks and share ::findleaks. Once you're done generating corefiles, repeat the steps above, but use "unsetenv LD_PRELOAD" and "unsetenv UMEM_DEBUG" instead of the setenv lines. Hope this helps, Dan From bfriesen at simple.dallas.tx.us Wed Oct 21 19:07:08 2015 From: bfriesen at simple.dallas.tx.us (Bob Friesenhahn) Date: Wed, 21 Oct 2015 14:07:08 -0500 (CDT) Subject: [OmniOS-discuss] ILB memory leak? In-Reply-To: <00AE5FA3-E699-4C4F-8A94-AEEDAAED0856@omniti.com> References: <56276437.2020109@scluk.com> <00AE5FA3-E699-4C4F-8A94-AEEDAAED0856@omniti.com> Message-ID: On Wed, 21 Oct 2015, Dan McDonald wrote: > > You can use svccfg(1M) to enable user-level memory debugging on ilb. 
> It may cause the ilb daemon to dump core. (And you're just noticing this
> in the process, not kernel memory consumption, correct?)
>
> As root:
>
> svcadm disable -t ilb
> svccfg -s ilb setenv LD_PRELOAD libumem.so
> svccfg -s ilb setenv UMEM_DEBUG default
> svccfg -s ilb refresh
> svcadm enable ilb

Is there a way to use ulimit to limit the data segment size (ulimit -d)? If
this is possible, then a dumped core (due to hitting the limit) may point
directly to the guilty party.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

From al.slater at scluk.com  Thu Oct 22 08:43:04 2015
From: al.slater at scluk.com (Al Slater)
Date: Thu, 22 Oct 2015 09:43:04 +0100
Subject: [OmniOS-discuss] ILB memory leak?
In-Reply-To: <00AE5FA3-E699-4C4F-8A94-AEEDAAED0856@omniti.com>
References: <56276437.2020109@scluk.com> <00AE5FA3-E699-4C4F-8A94-AEEDAAED0856@omniti.com>
Message-ID: <5628A198.5040808@scluk.com>

On 21/10/2015 17:35, Dan McDonald wrote:
>
>> On Oct 21, 2015, at 6:08 AM, Al Slater wrote:
>>
>> Hi,
>>
>> I am running omnios r151014 on a couple of machines with a couple
>> of zones each. One zone runs apache as an SSL reverse proxy; the
>> other runs ILB for load balancing web-to-app-tier connections.
>>
>> I noticed that in the ILB zone, the ilbd process memory grows to
>> about 2GB. Restarting ILB releases the memory, and then the
>> memory usage gradually increases again, with each memory increase
>> approximately twice the size of the previous one. I run a cronjob
>> twice a day (8am and 8pm) which restarts the ilb service and
>> releases the memory.
>>
>> A graph of memory usage is available at
>> https://www.dropbox.com/s/zaz51apxslnivlq/ILB_Memory_2_days.png?dl=0
>>
>> There are currently 62 rules in the load balancer, with a total
>> of 664 server/port pairs.
>>
>> Is there anything I can provide that would help track this down?
>
> You can use svccfg(1M) to enable user-level memory debugging on ilb.
> It may cause the ilb daemon to dump core. (And you're just noticing
> this in the process, not kernel memory consumption, correct?)

I am seeing kernel memory consumption increasing as well, but that may be a
different issue. The ilbd process memory is definitely growing.

> As root:
>
> svcadm disable -t ilb
> svccfg -s ilb setenv LD_PRELOAD libumem.so
> svccfg -s ilb setenv UMEM_DEBUG default
> svccfg -s ilb refresh
> svcadm enable ilb
>
> That should enable user-level memory debugging. If you get a
> coredump, save it and share it. If you don't and the ilb daemon
> keeps running, eventually please:
>
> gcore `pgrep ilbd`
>
> and share THAT corefile. You can also do this by yourself:
>
> mdb <corefile>
> > ::findleaks
>
> and share ::findleaks.
>
> Once you're done generating corefiles, repeat the steps above, but
> use "unsetenv LD_PRELOAD" and "unsetenv UMEM_DEBUG" instead of the
> setenv lines.

Thanks Dan. As we are talking about production boxes here, I will have to try
and reproduce on another box, and then I will give the process above a go and
see what we come up with.
--
Al Slater
Technical Director
SCL

Phone : +44 (0)1273 666607
Fax   : +44 (0)1273 666601
email : al.slater at scluk.com

Stanton Consultancy Ltd
Park Gate, 161 Preston Road, Brighton, East Sussex, BN1 6AU
Registered in England  Company number: 1957652  VAT number: GB 760 2433 55

From ryan at zinascii.com  Thu Oct 22 15:13:05 2015
From: ryan at zinascii.com (Ryan Zezeski)
Date: Thu, 22 Oct 2015 11:13:05 -0400
Subject: [OmniOS-discuss] ILB memory leak?
In-Reply-To: <5628A198.5040808@scluk.com>
References: <56276437.2020109@scluk.com> <00AE5FA3-E699-4C4F-8A94-AEEDAAED0856@omniti.com> <5628A198.5040808@scluk.com>
Message-ID:

Al Slater writes:
> On 21/10/2015 17:35, Dan McDonald wrote:
>>
>> That should enable user-level memory debugging. If you get a
>> coredump, save it and share it. If you don't and the ilb daemon
>> keeps running, eventually please:
>>
>> gcore `pgrep ilbd`
>>
>> and share THAT corefile. You can also do this by yourself:
>>
>> mdb <corefile>
>> > ::findleaks
>>
>> and share ::findleaks.
>>
>> Once you're done generating corefiles, repeat the steps above, but
>> use "unsetenv LD_PRELOAD" and "unsetenv UMEM_DEBUG" instead of the
>> setenv lines.
>
> Thanks Dan. As we are talking about production boxes here, I will have
> to try and reproduce on another box, and then I will give the process
> above a go and see what we come up with.

You can also use the DTrace pid provider to grab the user stack on every
malloc(3C) call, and the syscall provider to track mmap(2) calls. That poses
no harm to production and might make the cause of memory usage obvious.
Something like:

    dtrace -qn 'pid$target::malloc:entry { @[ustack()] = count(); }
    syscall::mmap*:entry /pid == $target/ { @[ustack()] = count(); }' -p <pid>

Let that run for a while as the memory grows, then Ctrl-C.

-Z

From jim at cos.ru  Thu Oct 22 16:59:15 2015
From: jim at cos.ru (Jim Klimov)
Date: Thu, 22 Oct 2015 19:59:15 +0300
Subject: [OmniOS-discuss] OmniOS backup box hanging regularly
Message-ID:

Hello all,

I have this HP-Z400 workstation with 16GB ECC (should be) RAM running OmniOS
bloody, which acts as a backup server for our production systems (regularly
rsync'ing large files off Linux boxes, and rotating ZFS auto-snapshots to keep
its space free). Sometimes it also runs replicas of infrastructure (DHCP, DNS)
and was set up as a VirtualBox + phpVirtualBox host to test that out, but no
VMs are running.

So the essential loads are ZFS snapshots and ZFS scrubs :)

And it freezes roughly every week. It stops responding to ping and to attempts
to log in via SSH or the physical console - it processes keypresses on the
latter, but does not present a login prompt. It used to be stable; such
regular hangs began around summertime.

My primary guess would be flaky disks, maybe timing out under load or going to
sleep or whatever... But I have yet to prove it, or any other theory. Maybe
the CPU is just overheating due to regular near-100% load with disk I/O... At
least I want to rule out OS errors and rule out (or point out) operator/box
errors as much as possible - which is something I can change to try and fix ;)

Before I proceed to TL;DR screenshots, I'd overview what I see:

* In the "top" output, processes owned by zfssnap lead most of the time...
But even the SSH shell is noticeably slow to respond (1 sec per line when just
pressing enter to clear the screen to prepare nice screenshots).
* SMART was not enabled on 3TB mirrored "pool" SATA disks (is now, long tests initiated), but was in place on the "rpool" SAS disk where it logged some corrected ECC errors - but none uncorrected. Maybe the cabling should be reseated. * iostat shows disks are generally not busy (they don't audibly rattle nor visibly blink all the time, either) * zpool scrubs return clean * there are partitions of the system rpool disk (10K RPM SAS) used as log and cache devices for the main data pool on 3TB SATA disks. The system disk is fast and underutilized, so what the heck ;) And it was not a problem for the first year of this system's honest and stable workouts. These devices are pretty empty at the moment. I have enabled deadman panics according to Wiki, but none have happened so far: # cat /etc/system | egrep -v '(^\*|^$)' set snooping=1 set pcplusmp:apic_panic_on_nmi=1 set apix:apic_panic_on_nmi = 1 In the "top" output, processes owned by zfssnap lead most of the time: last pid: 22599; load avg: 12.9, 12.2, 11.2; up 0+09:52:11 18:34:41 140 processes: 125 sleeping, 13 running, 2 on cpu CPU states: 0.0% idle, 22.9% user, 77.1% kernel, 0.0% iowait, 0.0% swap Memory: 16G phys mem, 1765M free mem, 2048M total swap, 2048M free swap Seconds to delay: PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND 21389 zfssnap 1 43 2 863M 860M run 5:04 35.61% zfs 22360 zfssnap 1 52 2 118M 115M run 0:37 16.50% zfs 21778 zfssnap 1 52 2 563M 560M run 3:15 13.17% zfs 21278 zfssnap 1 52 2 947M 944M run 5:32 6.91% zfs 21881 zfssnap 1 43 2 433M 431M run 2:31 5.41% zfs 21852 zfssnap 1 52 2 459M 456M run 2:39 5.16% zfs 21266 zfssnap 1 43 2 906M 903M run 5:18 3.95% zfs 21757 zfssnap 1 43 2 597M 594M run 3:26 2.91% zfs 21274 zfssnap 1 52 2 930M 927M cpu/0 5:27 2.78% zfs 22588 zfssnap 1 43 2 30M 27M run 0:08 2.48% zfs 22580 zfssnap 1 52 2 49M 46M run 0:14 0.71% zfs 22038 root 1 59 0 5312K 3816K cpu/1 0:01 0.10% top 22014 root 1 59 0 8020K 4988K sleep 0:00 0.02% sshd Average "iostats" are not that busy: # zpool iostat -Td 5 Thu Oct 22 18:24:59 CEST 2015 capacity operations bandwidth pool alloc free read write read write ---------- ----- ----- ----- ----- ----- ----- pool 2.52T 207G 802 116 28.3M 840K rpool 33.0G 118G 0 4 4.52K 58.7K ---------- ----- ----- ----- ----- ----- ----- Thu Oct 22 18:25:04 CEST 2015 pool 2.52T 207G 0 0 0 0 rpool 33.0G 118G 0 10 0 97.9K ---------- ----- ----- ----- ----- ----- ----- Thu Oct 22 18:25:09 CEST 2015 pool 2.52T 207G 0 0 0 0 rpool 33.0G 118G 0 0 0 0 ---------- ----- ----- ----- ----- ----- ----- Thu Oct 22 18:25:14 CEST 2015 pool 2.52T 207G 0 0 0 0 rpool 33.0G 118G 0 9 0 93.5K ---------- ----- ----- ----- ----- ----- ----- Thu Oct 22 18:25:19 CEST 2015 pool 2.52T 207G 0 0 0 0 rpool 33.0G 118G 0 0 0 0 ---------- ----- ----- ----- ----- ----- ----- Thu Oct 22 18:25:24 CEST 2015 pool 2.52T 207G 0 0 0 0 rpool 33.0G 118G 0 0 0 0 ---------- ----- ----- ----- ----- ----- ----- Thu Oct 22 18:25:29 CEST 2015 pool 2.52T 207G 0 0 0 0 rpool 33.0G 118G 0 0 0 0 ---------- ----- ----- ----- ----- ----- ----- Thu Oct 22 18:25:34 CEST 2015 pool 2.52T 207G 0 0 0 0 rpool 33.0G 118G 0 0 0 0 ---------- ----- ----- ----- ----- ----- ----- Thu Oct 22 18:25:39 CEST 2015 pool 2.52T 207G 0 0 0 0 rpool 33.0G 118G 0 16 0 374K ---------- ----- ----- ----- ----- ----- ----- ... 
Thu Oct 22 18:33:49 CEST 2015 pool 2.52T 207G 0 0 0 0 rpool 33.0G 118G 0 11 0 94.5K ---------- ----- ----- ----- ----- ----- ----- Thu Oct 22 18:33:54 CEST 2015 pool 2.52T 207G 0 13 819 80.0K rpool 33.0G 118G 0 0 0 0 ---------- ----- ----- ----- ----- ----- ----- Thu Oct 22 18:33:59 CEST 2015 pool 2.52T 207G 0 129 0 1.06M rpool 33.0G 118G 0 0 0 0 ---------- ----- ----- ----- ----- ----- ----- Thu Oct 22 18:34:04 CEST 2015 pool 2.52T 207G 0 55 0 503K rpool 33.0G 118G 0 11 0 97.9K ---------- ----- ----- ----- ----- ----- ----- ... just occasional bursts of work. I've now enabled SMART on the disks (2*3Tb mirror "pool" and 1*300Gb "rpool") and ran some short tests and triggered long tests (hopefully they'd succeed by tomorrow); current results are: # for D in /dev/rdsk/c0*s0; do echo "===== $D :"; smartctl -d sat,12 -a $D ; done ; for D in /dev/rdsk/c4*s0 ; do echo "===== $D :"; smartctl -d scsi -a $D ; done ===== /dev/rdsk/c0t3d0s0 : smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Device Model: WDC WD3003FZEX-00Z4SA0 Serial Number: WD-WCC5D1KKU0PA LU WWN Device Id: 5 0014ee 2610716b7 Firmware Version: 01.01A01 User Capacity: 3,000,592,982,016 bytes [3.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Device is: Not in smartctl database [for details use: -P showall] ATA Version is: ACS-2 (minor revision not indicated) SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s) Local Time is: Thu Oct 22 18:45:28 2015 CEST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 249) Self-test routine in progress... 90% of test remaining. Total time to complete Offline data collection: (32880) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 357) minutes. Conveyance self-test routine recommended polling time: ( 5) minutes. SCT capabilities: (0x7035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported. 
SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0027 246 154 021 Pre-fail Always - 6691 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 14 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0 9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 4869 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14 16 Unknown_Attribute 0x0022 130 070 000 Old_age Always - 2289651870502 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 12 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2 194 Temperature_Celsius 0x0022 117 111 000 Old_age Always - 35 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 4869 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. ===== /dev/rdsk/c0t5d0s0 : smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Seagate SV35 Device Model: ST3000VX000-1ES166 Serial Number: Z500S3L8 LU WWN Device Id: 5 000c50 079e3757b Firmware Version: CV26 User Capacity: 3,000,592,982,016 bytes [3.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Device is: In smartctl database [for details use: -P show] ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s) Local Time is: Thu Oct 22 18:45:28 2015 CEST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 249) Self-test routine in progress... 90% of test remaining. Total time to complete Offline data collection: ( 80) seconds. Offline data collection capabilities: (0x73) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. 
General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 325) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x10b9) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 105 099 006 Pre-fail Always - 8600880 3 Spin_Up_Time 0x0003 096 094 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 19 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 085 060 030 Pre-fail Always - 342685681 9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 4214 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 19 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 189 High_Fly_Writes 0x003a 028 028 000 Old_age Always - 72 190 Airflow_Temperature_Cel 0x0022 069 065 045 Old_age Always - 31 (Min/Max 29/32) 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 19 193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 28 194 Temperature_Celsius 0x0022 031 040 000 Old_age Always - 31 (0 20 0 0 0) 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Extended offline Self-test routine in progress 90% 4214 - # 2 Short offline Completed without error 00% 4214 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. 
===== /dev/rdsk/c4t5000CCA02A1292DDd0s0 : smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org Vendor: HITACHI Product: HUS156030VLS600 Revision: HPH1 User Capacity: 300,000,000,000 bytes [300 GB] Logical block size: 512 bytes Logical Unit id: 0x5000cca02a1292dc Serial number: LVVA6NHS Device type: disk Transport protocol: SAS Local Time is: Thu Oct 22 18:45:29 2015 CEST Device supports SMART and is Enabled Temperature Warning Enabled SMART Health Status: OK Current Drive Temperature: 45 C Drive Trip Temperature: 70 C Manufactured in week 14 of year 2012 Specified cycle count over device lifetime: 50000 Accumulated start-stop cycles: 80 Elements in grown defect list: 0 Vendor (Seagate) cache information Blocks sent to initiator = 2340336504406016 Error counter log: Errors Corrected by Total Correction Gigabytes Total ECC rereads/ errors algorithm processed uncorrected fast | delayed rewrites corrected invocations [10^9 bytes] errors read: 0 888890 0 888890 0 29326.957 0 write: 0 961315 0 961315 0 6277.560 0 Non-medium error count: 283 SMART Self-test log Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] Description number (hours) # 1 Background long Self test in progress ... - NOW - [- - -] # 2 Background long Aborted (device reset ?) - 14354 - [- - -] # 3 Background short Completed - 14354 - [- - -] # 4 Background long Aborted (device reset ?) - 14354 - [- - -] # 5 Background long Aborted (device reset ?) - 14354 - [- - -] Long (extended) Self Test duration: 2506 seconds [41.8 minutes] The zpool scrub results and general layout: # zpool status -v pool: pool state: ONLINE scan: scrub repaired 0 in 164h13m with 0 errors on Thu Oct 22 18:13:33 2015 config: NAME STATE READ WRITE CKSUM pool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c0t3d0 ONLINE 0 0 0 c0t5d0 ONLINE 0 0 0 logs c4t5000CCA02A1292DDd0p2 ONLINE 0 0 0 cache c4t5000CCA02A1292DDd0p3 ONLINE 0 0 0 errors: No known data errors pool: rpool state: ONLINE status: Some supported features are not enabled on the pool. The pool can still be used, but some features are unavailable. action: Enable all features using 'zpool upgrade'. Once this is done, the pool may no longer be accessible by software that does not support the features. See zpool-features(5) for details. scan: scrub repaired 0 in 3h3m with 0 errors on Thu Oct 8 04:12:35 2015 config: NAME STATE READ WRITE CKSUM rpool ONLINE 0 0 0 c4t5000CCA02A1292DDd0s0 ONLINE 0 0 0 errors: No known data errors # zpool list -v NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT pool 2.72T 2.52T 207G - 68% 92% 1.36x ONLINE / mirror 2.72T 2.52T 207G - 68% 92% c0t3d0 - - - - - - c0t5d0 - - - - - - log - - - - - - c4t5000CCA02A1292DDd0p2 8G 148K 8.00G - 0% 0% cache - - - - - - c4t5000CCA02A1292DDd0p3 120G 1.80G 118G - 0% 1% rpool 151G 33.0G 118G - 76% 21% 1.00x ONLINE - c4t5000CCA02A1292DDd0s0 151G 33.0G 118G - 76% 21% Note the long scrub time may have included the downtime while the system was frozen until it was rebooted. Thanks in advance for the fresh pairs of eyeballs, Jim Klimov -------------- next part -------------- An HTML attachment was scrubbed... URL: From danmcd at omniti.com Thu Oct 22 17:11:30 2015 From: danmcd at omniti.com (Dan McDonald) Date: Thu, 22 Oct 2015 13:11:30 -0400 Subject: [OmniOS-discuss] UPDATE NOW --> ntp to 4.2.8p4 Message-ID: The NTP software was updated to 4.2.8p4 yesterday. I've pushed out updates for r151006, r151014, and bloody. 
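To verify the fix landed, something like the following before and after
updating works (the "ntp" name suffix-matches the IPS package as shipped on
these releases; ntpq's readvar syntax is standard):

    # what IPS has installed:
    pkg info ntp | grep -i version
    # what the running daemon says about itself:
    ntpq -c "rv 0 version"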
As I mentioned earlier, r151012 users should update to r151014.

"pkg update" followed by "svcadm restart ntp" as a safety measure should
be sufficient.  No rebooting is needed.

NTP's advisory for this patch is here:

http://support.ntp.org/bin/view/Main/SecurityNotice#Recent_Vulnerabilities

Thanks,
Dan

From danmcd at omniti.com  Thu Oct 22 17:13:12 2015
From: danmcd at omniti.com (Dan McDonald)
Date: Thu, 22 Oct 2015 13:13:12 -0400
Subject: [OmniOS-discuss] UPDATE NOW --> ntp to 4.2.8p4
In-Reply-To:
References:
Message-ID: <36D4EF6A-C596-441B-828B-862D5EB9423E@omniti.com>

> On Oct 22, 2015, at 1:11 PM, Dan McDonald wrote:
>
> "pkg update" followed by "svcadm restart ntp" as a safety measure
> should be sufficient.  No rebooting is needed.

I made a small mistake here.  The "svcadm restart..." is not necessary;
the IPS package does the right thing here.

Sorry for the confusion,
Dan

From yavoritomov at gmail.com  Thu Oct 22 17:36:56 2015
From: yavoritomov at gmail.com (Yavor Tomov)
Date: Thu, 22 Oct 2015 12:36:56 -0500
Subject: [OmniOS-discuss] OmniOS backup box hanging regularly
In-Reply-To:
References:
Message-ID:

Hi Tovarishch Jim,

I had a similar issue with my box, and it was related to NFS locks. I
assume you are using NFS for the Linux backups. The solution was posted
by Chip on the mailing list; a copy of his solution is below:

"I've seen issues like this when you run out of NFS locks. NFSv3 in
Illumos is really slow at releasing locks.

On all my NFS servers I do:

sharectl set -p lockd_listen_backlog=256 nfs
sharectl set -p lockd_servers=2048 nfs

Everywhere I can, I use NFSv4 instead of v3. It handles locks much
better."
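To see what you are currently running with before changing anything,
sharectl should be able to read the properties back -- same property
names as above:

sharectl get -p lockd_listen_backlog nfs
sharectl get -p lockd_servers nfs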
All the Best
Yavor

On Thu, Oct 22, 2015 at 11:59 AM, Jim Klimov wrote:

> Hello all,
>
> [...]
From jimklimov at cos.ru  Thu Oct 22 18:51:32 2015
From: jimklimov at cos.ru (Jim Klimov)
Date: Thu, 22 Oct 2015 20:51:32 +0200
Subject: [OmniOS-discuss] OmniOS backup box hanging regularly
In-Reply-To:
References:
Message-ID:

22 October 2015 19:36:56 CEST, Yavor Tomov wrote:
>Hi Tovarishch Jim,
>
>I had a similar issue with my box, and it was related to NFS locks. I
>assume you are using NFS for the Linux backups. The solution was posted
>by Chip on the mailing list; a copy of his solution is below:
>
>"I've seen issues like this when you run out of NFS locks. NFSv3 in
>Illumos is really slow at releasing locks.
>
>On all my NFS servers I do:
>
>sharectl set -p lockd_listen_backlog=256 nfs
>sharectl set -p lockd_servers=2048 nfs
>
>Everywhere I can, I use NFSv4 instead of v3. It handles locks much
>better."
>
>All the Best
>Yavor
>
>On Thu, Oct 22, 2015 at 11:59 AM, Jim Klimov wrote:
>
>> Hello all,
>>
>> [...]
Thanks for the heads-up. I think all copies are rsync's, but will make
sure, just in case this bump helps.

Did anyone run into issues with many zfs-auto-snapshots (e.g. thousands -
many datasets and many snaps until they are killed to keep some 200gb
free) on a small number of spindles?

Jim
--
Typos courtesy of K-9 Mail on my Samsung Android

From vab at bb-c.de  Thu Oct 22 18:57:45 2015
From: vab at bb-c.de (Volker A. Brandt)
Date: Thu, 22 Oct 2015 20:57:45 +0200
Subject: [OmniOS-discuss] UPDATE NOW --> ntp to 4.2.8p4
In-Reply-To: <36D4EF6A-C596-441B-828B-862D5EB9423E@omniti.com>
References: <36D4EF6A-C596-441B-828B-862D5EB9423E@omniti.com>
Message-ID: <22057.12713.886330.955767@glaurung.bb-c.de>

Hi Dan!


Thanks for all the work you're doing on OmniOS!

> > On Oct 22, 2015, at 1:11 PM, Dan McDonald wrote:
> >
> > "pkg update" followed by "svcadm restart ntp" as a safety measure
> > should be sufficient.  No rebooting is needed.
>
> I made a small mistake here.  The "svcadm restart..." is not
> necessary; the IPS package does the right thing here.

Well, no, it doesn't. :-)  That's due to a design flaw in the interaction
between IPS and SMF (IMHO).  Even though the manifest object in the
package is properly tagged with restart_fmri, the service is never
restarted, because the manifest is not touched during the "pkg update",
as it has not changed since the last package version.

So if you change an SMF method in a package and want an "automatic"
restart, you need to also physically modify the SMF manifest.  I do that
by just incrementing a version counter or a timestamp, and noting the
fact in an XML comment in the manifest.  Otherwise, you need to manually
restart the service.

Unrelated: when I updated my local copy of the r151014 repo in
preparation of the pkg update for ntp, I got this:

Processing packages for publisher omnios ...
 Retrieving and evaluating 6161 package(s)...
Download Manifests ( 907/6161) -pkgrecv: http protocol error: code: 404 reason: Not Found
URL: 'http://pkg.omniti.com/omnios/r151014/omnios/manifest/0/developer%2Fillumos-tools@11%2C5.11-0.151014%3A20151016T122410Z'
(happened 4 times)

Processing packages for publisher omnios ...
 Retrieving and evaluating 1030 package(s)...
PROCESS                                         ITEMS     GET (MB)    SEND (MB)
Completed                                         1/1      2.4/2.4      4.9/4.9

So the recent addition of the illumos-tools pkg broke something in your
repo.  I worked around that by specifying the service/network/ntp pkg in
the pkgrecv invocation.
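i.e. something along these lines -- the destination directory here is
just an example, substitute your own local repo path:

pkgrecv -s http://pkg.omniti.com/omnios/r151014/ -d /path/to/local/repo service/network/ntp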
Regards -- Volker
--
------------------------------------------------------------------------
Volker A. Brandt               Consulting and Support for Oracle Solaris
Brandt & Brandt Computer GmbH                   WWW: http://www.bb-c.de/
Am Wiesenpfad 6, 53340 Meckenheim, GERMANY           Email: vab at bb-c.de
Handelsregister: Amtsgericht Bonn, HRB 10513             Schuhgröße: 46
Geschäftsführer: Rainer J.H. Brandt und Volker A. Brandt

"When logic and proportion have fallen sloppy dead"

From matej at zunaj.si  Thu Oct 22 19:02:49 2015
From: matej at zunaj.si (Matej Zerovnik)
Date: Thu, 22 Oct 2015 21:02:49 +0200
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
Message-ID: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>

Hello,

I'm building a new system and I'm having a bit of a performance problem.
Well, it's either that or I'm not getting the whole ZIL idea :)

My system is the following:
- IBM xServer 3550 M4 server (dual CPU with 160GB memory)
- LSI 9207 HBA (P19 firmware)
- Supermicro JBOD with SAS expander
- 4TB SAS3 drives
- ZeusRAM for ZIL
- LTS OmniOS (all patches applied)

If I benchmark the ZeusRAM on its own with random 4k sync writes, I can
get 48k IOPS out of it, no problem there.

If I create a new raidz2 pool with 10 hard drives, mirrored ZeusRAMs for
ZIL and set sync=always, I can only squeeze 14k IOPS out of the system.
Is that normal, or should I be getting 48k IOPS on the 2nd pool as well,
since this is the performance the ZeusRAM can deliver?

I'm testing with fio:
fio --filename=/pool0/test01 --size=5g --rw=randwrite --refill_buffers --norandommap --randrepeat=0 --ioengine=solarisaio --bs=4k --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest

thanks, Matej

From minkim1 at gmail.com  Thu Oct 22 19:15:37 2015
From: minkim1 at gmail.com (Min Kim)
Date: Thu, 22 Oct 2015 12:15:37 -0700
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
In-Reply-To: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
References: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
Message-ID: <68BCD34E-B3AE-4A23-A0A9-DD6A450DB892@gmail.com>

I believe this is a known issue with SAS expanders.  Please see here:

http://serverfault.com/questions/242336/sas-expanders-vs-direct-attached-sas

When you are stress-testing the ZeusRAM by itself, all the IOPS and
bandwidth of the expander are allocated to that device alone.  Once you
add all the other drives, you lose some of that as you have to share it
with the other disks.

Min Kim

> On Oct 22, 2015, at 12:02 PM, Matej Zerovnik wrote:
>
> Hello,
>
> [...]
From eric.sproul at circonus.com  Thu Oct 22 19:18:47 2015
From: eric.sproul at circonus.com (Eric Sproul)
Date: Thu, 22 Oct 2015 15:18:47 -0400
Subject: [OmniOS-discuss] UPDATE NOW --> ntp to 4.2.8p4
In-Reply-To: <22057.12713.886330.955767@glaurung.bb-c.de>
References: <36D4EF6A-C596-441B-828B-862D5EB9423E@omniti.com>
 <22057.12713.886330.955767@glaurung.bb-c.de>
Message-ID:

On Thu, Oct 22, 2015 at 2:57 PM, Volker A. Brandt wrote:
>> I made a small mistake here.  The "svcadm restart..." is not
>> necessary; the IPS package does the right thing here.
>
> Well, no, it doesn't. :-)  That's due to a design flaw in the interaction
> between IPS and SMF (IMHO).  Even though the manifest object in the
> package is properly tagged with restart_fmri, the service is never
> restarted, because the manifest is not touched during the "pkg update",
> as it has not changed since the last package version.

The service has restarted correctly for me on both 006 and 014 with this
update.  I'm not sure why that is though, because you're correct that the
ntp.xml file has not changed in all of the '014 versions published.  I
was under the impression that the restart_fmri actuator would only fire
when the associated action changed.

However, if we really *do* want to restart ntp when the *daemon* updates,
then we could add a restart_fmri actuator on the usr/lib/inet/ntpd file.
Thus, whenever that file is updated, svc:/network/ntp:default could be
restarted.
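The delivered action would look roughly like this -- owner/group/mode are
illustrative here, the path and the actuator are the point:

file usr/lib/inet/ntpd path=usr/lib/inet/ntpd owner=root group=bin mode=0555 restart_fmri=svc:/network/ntp:default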
Eric

From matej at zunaj.si  Thu Oct 22 19:26:12 2015
From: matej at zunaj.si (Matej Zerovnik)
Date: Thu, 22 Oct 2015 21:26:12 +0200
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
In-Reply-To: <1D3B7684-CBA0-408D-99E6-9D84639CB217@gmail.com>
References: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
 <1D3B7684-CBA0-408D-99E6-9D84639CB217@gmail.com>
Message-ID: <6B6E0336-CF33-4B2E-BB7A-1B1D6937E4FC@zunaj.si>

Interesting...

Although, I'm not sure if this is really the problem.

For a test, I booted up Linux, put both ZeusRAMs into a raid1 software
raid, and repeated the test.  I got the full 48k IOPS in the test,
meaning 96k IOPS were sent to the JBOD (48k IOPS for each drive).

On the OmniOS test bed, there are 28k IOPS sent to the ZIL and some
amount to the spindles when flushing the write cache, but no more than
1000 IOPS (100 iops/drive * 10).  Comparing that to the case above, IOPS
shouldn't be the limit.

Maybe I could try building my pools with hard drives that aren't near the
ZIL drive, which is in bay 0.  I could take hard drives from bays 4-15,
which probably use different SAS lanes.

lp, Matej

> On 22 Oct 2015, at 21:10, Min Kim wrote:
>
> I believe this is a known issue with SAS expanders.
>
> [...]

From lotheac at iki.fi  Thu Oct 22 19:28:33 2015
From: lotheac at iki.fi (Lauri Tirkkonen)
Date: Thu, 22 Oct 2015 22:28:33 +0300
Subject: [OmniOS-discuss] OmniOS backup box hanging regularly
In-Reply-To:
References:
Message-ID: <20151022192833.GB77@gutsman.lotheac.fi>

On Thu, Oct 22 2015 20:51:32 +0200, Jim Klimov wrote:
> Did anyone run into issues with many zfs-auto-snapshots (e.g.
> thousands - many datasets and many snaps until they are killed to keep
> some 200gb free) on a small number of spindles?

Not with that number of snapshots, but we had several thousand
filesystems with dozens (I think about 70) of snapshots per fs, and it
turned out to be a really bad idea due to the memory requirements:
things slowed down *a lot*.  We didn't see any hangs though.

I have a pool with around four thousand snapshots total at home and that
box is performing just fine, and it's just two spinning disks + two SSDs
for cache/slog.

-- 
Lauri Tirkkonen | lotheac @ IRCnet

From minkim1 at gmail.com  Thu Oct 22 19:36:50 2015
From: minkim1 at gmail.com (Min Kim)
Date: Thu, 22 Oct 2015 12:36:50 -0700
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
In-Reply-To: <6B6E0336-CF33-4B2E-BB7A-1B1D6937E4FC@zunaj.si>
References: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
 <1D3B7684-CBA0-408D-99E6-9D84639CB217@gmail.com>
 <6B6E0336-CF33-4B2E-BB7A-1B1D6937E4FC@zunaj.si>
Message-ID: <9D6C17D8-26E4-4F6B-837F-2A3FC0C6E882@gmail.com>

Are you using the same record size of 4K on your zfs pool as you used
with your linux test system?

If the record size for the zpool and slog is set at the default value of
128K, it will greatly reduce the measured IOPS relative to that measured
with a recordsize of 4K.
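If not, it may be worth a try.  Note that recordsize is a per-dataset
property and only applies to newly written files, so you would set it and
then rewrite the test file -- something like this, with the dataset name
taken from your fio command, so adjust:

zfs set recordsize=4k pool0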
Min Kim

> On Oct 22, 2015, at 12:26 PM, Matej Zerovnik wrote:
>
> Interesting...
>
> [...]

From lotheac at iki.fi  Thu Oct 22 19:41:34 2015
From: lotheac at iki.fi (Lauri Tirkkonen)
Date: Thu, 22 Oct 2015 22:41:34 +0300
Subject: [OmniOS-discuss] UPDATE NOW --> ntp to 4.2.8p4
In-Reply-To: <22057.12713.886330.955767@glaurung.bb-c.de>
References: <36D4EF6A-C596-441B-828B-862D5EB9423E@omniti.com>
 <22057.12713.886330.955767@glaurung.bb-c.de>
Message-ID: <20151022194134.GC77@gutsman.lotheac.fi>

On Thu, Oct 22 2015 20:57:45 +0200, Volker A. Brandt wrote:
> Well, no, it doesn't. :-)  That's due to a design flaw in the interaction
> between IPS and SMF (IMHO).  Even though the manifest object in the
> package is properly tagged with restart_fmri, the service is never
> restarted, because the manifest is not touched during the "pkg update",
> as it has not changed since the last package version.
>
> [...]

Well, that's not a design flaw.  Actuators are executed only when the
action (eg. file) specifying them changes -- in other words, the packager
should include restart_fmri actuators in file actions that are relevant
for the service in question (eg. the ntpd binary at minimum).  This ntp
package does not contain *any* restart_fmri actuators for the ntp
service:

% pkg contents -mr pkg://omnios/service/network/ntp@4.2.8.4-0.151014:20151022T170026Z | grep restart_fmri
file cb84fc718d7aa637c12641aed4405107b5659ab8 chash=8758e80d1b9738c35f2d29cdebceee2930cdfa3b group=bin mode=0444 owner=root path=lib/svc/manifest/network/ntp.xml pkg.csize=1649 pkg.size=4681 restart_fmri=svc:/system/manifest-import:default

(sidebar: the manifest-import service is what imports service manifests
into the SMF repository, which you need to do when you install a new
service)

If you wanted to restart ntp when any files in the ntp package change on
update, you would need an actuator like
'restart_fmri=svc:/network/ntp:default' on *all* file actions delivered
by the package.
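If you are building the package yourself, a pkgmogrify(1) transform is
the usual way to stamp that onto every file action -- roughly (untested):

<transform file -> default restart_fmri svc:/network/ntp:default>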
-- 
Lauri Tirkkonen | lotheac @ IRCnet

From chip at innovates.com  Thu Oct 22 19:47:53 2015
From: chip at innovates.com (Schweiss, Chip)
Date: Thu, 22 Oct 2015 14:47:53 -0500
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
In-Reply-To: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
References: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
Message-ID:

The ZIL on log devices suffers a bit from not filling queues well.

In order to get the queues to fill more, try running your test against
several zfs folders on the pool simultaneously and measure your total
I/O.

As I understand it, if you're writing to only one zfs folder, your queue
depth will stay at 1 on the log device and you become latency bound.
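Something like this sketch -- one job per zfs folder, flags borrowed from
the single-folder test above, paths made up:

fio --name=4ktest-fs1 --filename=/pool0/fs1/test01 --size=5g --rw=randwrite --ioengine=solarisaio --bs=4k --iodepth=16 --runtime=60 &
fio --name=4ktest-fs2 --filename=/pool0/fs2/test02 --size=5g --rw=randwrite --ioengine=solarisaio --bs=4k --iodepth=16 --runtime=60 &
wait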
-- 
Lauri Tirkkonen | lotheac @ IRCnet

From chip at innovates.com  Thu Oct 22 19:47:53 2015
From: chip at innovates.com (Schweiss, Chip)
Date: Thu, 22 Oct 2015 14:47:53 -0500
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
In-Reply-To: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
References: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
Message-ID:

The ZIL on log devices suffers a bit from not filling queues well. In order to get the queues to fill more, try running your test to several zfs folders on the pool simultaneously and measure your total I/O.

As I understand it, if you're writing to only one zfs folder, your queue depth will stay at 1 on the log device and you become latency-bound.

-Chip

On Thu, Oct 22, 2015 at 2:02 PM, Matej Zerovnik wrote:
> Hello,
>
> I'm building a new system and I'm having a bit of a performance problem. Well, it's either that or I'm not getting the whole ZIL idea :)
>
> My system is as follows:
> - IBM xServer 3550 M4 server (dual CPU with 160GB memory)
> - LSI 9207 HBA (P19 firmware)
> - Supermicro JBOD with SAS expander
> - 4TB SAS3 drives
> - ZeusRAM for ZIL
> - LTS OmniOS (all patches applied)
>
> If I benchmark the ZeusRAM on its own with random 4k sync writes, I can get 48k IOPS out of it, no problem there.
>
> If I create a new raidz2 pool with 10 hard drives, mirrored ZeusRAMs for ZIL and set sync=always, I can only squeeze 14k IOPS out of the system.
> Is that normal or should I be getting 48k IOPS on the 2nd pool as well, since this is the performance the ZeusRAM can deliver?
>
> I'm testing with fio:
> fio --filename=/pool0/test01 --size=5g --rw=randwrite --refill_buffers --norandommap --randrepeat=0 --ioengine=solarisaio --bs=4k --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest
>
> thanks, Matej
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss at lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss

From bfriesen at simple.dallas.tx.us  Thu Oct 22 19:58:45 2015
From: bfriesen at simple.dallas.tx.us (Bob Friesenhahn)
Date: Thu, 22 Oct 2015 14:58:45 -0500 (CDT)
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
In-Reply-To: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
References: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
Message-ID:

On Thu, 22 Oct 2015, Matej Zerovnik wrote:
>
> If I create a new raidz2 pool with 10 hard drives, mirrored ZeusRAMs for ZIL and set sync=always, I can only squeeze 14k IOPS out of the system.
> Is that normal or should I be getting 48k IOPS on the 2nd pool as well, since this is the performance the ZeusRAM can deliver?

Is your zfs filesystem using 4k blocks? Random writes may also require random reads due to COW. If the data is not perfectly aligned and does not perfectly fill the underlying zfs block, and the existing data is not already cached in the ARC, then it needs to be read from the underlying store so the existing data can be modified during the write.

I do see that you are using asynchronous I/O, which may add more factors.
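As an aside, if the intent is for fio itself to issue synchronous writes rather than forcing them with sync=always on the dataset, something along these lines should be close (a sketch, untested):

fio --filename=/pool0/test01 --size=5g --rw=randwrite --refill_buffers \
    --norandommap --randrepeat=0 --ioengine=psync --sync=1 --bs=4k \
    --iodepth=1 --numjobs=16 --runtime=60 --group_reporting --name=4ksync

With a synchronous engine the effective queue depth per job is 1, so parallelism comes from numjobs.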
Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

From matej at zunaj.si  Thu Oct 22 20:47:51 2015
From: matej at zunaj.si (Matej Zerovnik)
Date: Thu, 22 Oct 2015 22:47:51 +0200
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
In-Reply-To: <9D6C17D8-26E4-4F6B-837F-2A3FC0C6E882@gmail.com>
References: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
	<1D3B7684-CBA0-408D-99E6-9D84639CB217@gmail.com>
	<6B6E0336-CF33-4B2E-BB7A-1B1D6937E4FC@zunaj.si>
	<9D6C17D8-26E4-4F6B-837F-2A3FC0C6E882@gmail.com>
Message-ID:

I'm using the default value of 128K on linux and OmniOS. I tried with recordsize=4k, but there is no difference in IOPS...

Matej

> On 22 Oct 2015, at 21:36, Min Kim wrote:
>
> Are you using the same record size of 4K on your zfs pool as you used with your linux test system?
>
> If the record size for the zpool and slog is set at the default value of 128K, it will greatly reduce the measured IOPS relative to that measured with a recordsize of 4K.
>
> Min Kim
>
>> On Oct 22, 2015, at 12:26 PM, Matej Zerovnik wrote:
>>
>> Interesting...
>>
>> Although, I'm not sure if this is really the problem.
>>
>> For a test, I booted up linux, put both ZeusRAMs into a software RAID1 and repeated the test. I got the full 48k IOPS in the test, meaning there were 96k IOPS sent to the JBOD (48k IOPS for each drive).
>>
>> On the OmniOS test bed, there are 28k IOPS sent to ZIL and X amount to spindles when flushing the write cache, but no more than 1000 IOPS (100 IOPS/drive * 10). Comparing that to the case above, IOPS shouldn't be a limit.
>>
>> Maybe I could try building my pools with hard drives that aren't near the ZIL drive, which is in bay 0. I could take hard drives from bays 4-15, which probably use different SAS lanes.
>>
>> lp, Matej
>>
>>> On 22 Oct 2015, at 21:10, Min Kim wrote:
>>>
>>> I believe this is a known issue with SAS expanders.
>>>
>>> Please see here:
>>>
>>> http://serverfault.com/questions/242336/sas-expanders-vs-direct-attached-sas
>>>
>>> When you are stress-testing the ZeusRAM by itself, all the IOPS and bandwidth of the expander are allocated to that device alone. Once you add all the other drives, you lose some of that as you have to share it with the other disks.
>>>
>>> Min Kim
>>>
>>>> On Oct 22, 2015, at 12:02 PM, Matej Zerovnik wrote:
>>>>
>>>> Hello,
>>>>
>>>> I'm building a new system and I'm having a bit of a performance problem. Well, it's either that or I'm not getting the whole ZIL idea :)
>>>>
>>>> My system is as follows:
>>>> - IBM xServer 3550 M4 server (dual CPU with 160GB memory)
>>>> - LSI 9207 HBA (P19 firmware)
>>>> - Supermicro JBOD with SAS expander
>>>> - 4TB SAS3 drives
>>>> - ZeusRAM for ZIL
>>>> - LTS OmniOS (all patches applied)
>>>>
>>>> If I benchmark the ZeusRAM on its own with random 4k sync writes, I can get 48k IOPS out of it, no problem there.
>>>>
>>>> If I create a new raidz2 pool with 10 hard drives, mirrored ZeusRAMs for ZIL and set sync=always, I can only squeeze 14k IOPS out of the system.
>>>> Is that normal or should I be getting 48k IOPS on the 2nd pool as well, since this is the performance the ZeusRAM can deliver?
>>>>
>>>> I'm testing with fio:
>>>> fio --filename=/pool0/test01 --size=5g --rw=randwrite --refill_buffers --norandommap --randrepeat=0 --ioengine=solarisaio --bs=4k --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest
>>>>
>>>> thanks, Matej
>>>> _______________________________________________
>>>> OmniOS-discuss mailing list
>>>> OmniOS-discuss at lists.omniti.com
>>>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>>
>>

From bfriesen at simple.dallas.tx.us  Thu Oct 22 21:11:48 2015
From: bfriesen at simple.dallas.tx.us (Bob Friesenhahn)
Date: Thu, 22 Oct 2015 16:11:48 -0500 (CDT)
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
In-Reply-To:
References: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
	<1D3B7684-CBA0-408D-99E6-9D84639CB217@gmail.com>
	<6B6E0336-CF33-4B2E-BB7A-1B1D6937E4FC@zunaj.si>
	<9D6C17D8-26E4-4F6B-837F-2A3FC0C6E882@gmail.com>
Message-ID:

On Thu, 22 Oct 2015, Matej Zerovnik wrote:
> I'm using the default value of 128K on linux and OmniOS. I tried with recordsize=4k, but there is no difference in IOPS...

There should be a large difference unless the file data is already cached in the ARC. Even with caching, a block size of 128k means that 128k is written to the underlying store, although a useful purpose of your ZIL device is that an aggregation of multiple writes during the TXG interval to the same 128k block may be written as one write at the next TXG sync interval (rather than immediately).

Try umounting and re-mounting your zfs filesystem (or 'zfs destroy' followed by 'zfs create') to see how performance differs on a freshly mounted filesystem. The zfs ARC caching will be purged when the filesystem is unmounted.

Do you have compression enabled for this filesystem?

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

From matej at zunaj.si  Thu Oct 22 21:28:49 2015
From: matej at zunaj.si (Matej Zerovnik)
Date: Thu, 22 Oct 2015 23:28:49 +0200
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
In-Reply-To:
References: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
Message-ID: <95ADB53F-BBDD-4CCD-959F-0A174E7DA8F2@zunaj.si>

Chip: I tried running fio on multiple folders and it's a little better.

When I run 7x fio (iodepth=4, threads=4), I get 28k IOPS on average.
When I run 7x fio (iodepth=4, threads=16), I get 35k IOPS on average. iostat shows a transfer rate of 140-220MB/s with an average request size of 35kB.
When I run 7x fio (iodepth=1, threads=1), I get 24k IOPS on average.

There are still at least 10k IOPS left to use, I guess :)
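For reference, the 7-way run was along these lines; this is reconstructed from the single-folder command earlier in the thread, not the exact invocation, and it assumes datasets /pool0/fs1 .. /pool0/fs7 exist:

for i in 1 2 3 4 5 6 7; do
  fio --filename=/pool0/fs$i/test01 --size=2g --rw=randwrite \
      --refill_buffers --norandommap --randrepeat=0 --ioengine=solarisaio \
      --bs=4k --iodepth=4 --numjobs=4 --runtime=60 --group_reporting \
      --name=4ktest$i &
done; wait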
Bob: Yes, my ZFS is ashift=12, since all my drives report 4k blocks (is that what you meant?). The pool is completely empty, so there is enough place for writes, so write speed should not be limited because of COW. Looking at iostat, there are no reads on the drives at all. I'm not sure where fio gets its data, probably from /dev/zero or somewhere? I will try the sync engine instead of solarisaio to see if there is any difference.

I don't have compression enabled, since I want to test raw performance. I also disabled the ARC (primarycache=metadata), just so my read tests are also as real as possible (so I don't need to run tests with a 1TB test file).

> Try umounting and re-mounting your zfs filesystem (or 'zfs destroy' followed by 'zfs create') to see how performance differs on a freshly mounted filesystem. The zfs ARC caching will be purged when the filesystem is unmounted.

If I understand you correctly, you are saying I should destroy my folders, set recordsize=4k on my pool and then create zfs folders?

thanks, Matej

> On 22 Oct 2015, at 21:47, Schweiss, Chip wrote:
>
> The ZIL on log devices suffers a bit from not filling queues well. In order to get the queues to fill more, try running your test to several zfs folders on the pool simultaneously and measure your total I/O.
>
> As I understand it, if you're writing to only one zfs folder, your queue depth will stay at 1 on the log device and you become latency-bound.
>
> -Chip
>
> On Thu, Oct 22, 2015 at 2:02 PM, Matej Zerovnik wrote:
> Hello,
>
> I'm building a new system and I'm having a bit of a performance problem. Well, it's either that or I'm not getting the whole ZIL idea :)
>
> My system is as follows:
> - IBM xServer 3550 M4 server (dual CPU with 160GB memory)
> - LSI 9207 HBA (P19 firmware)
> - Supermicro JBOD with SAS expander
> - 4TB SAS3 drives
> - ZeusRAM for ZIL
> - LTS OmniOS (all patches applied)
>
> If I benchmark the ZeusRAM on its own with random 4k sync writes, I can get 48k IOPS out of it, no problem there.
>
> If I create a new raidz2 pool with 10 hard drives, mirrored ZeusRAMs for ZIL and set sync=always, I can only squeeze 14k IOPS out of the system.
> Is that normal or should I be getting 48k IOPS on the 2nd pool as well, since this is the performance the ZeusRAM can deliver?
>
> I'm testing with fio:
> fio --filename=/pool0/test01 --size=5g --rw=randwrite --refill_buffers --norandommap --randrepeat=0 --ioengine=solarisaio --bs=4k --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest
>
> thanks, Matej
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss at lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss

From bfriesen at simple.dallas.tx.us  Thu Oct 22 21:51:04 2015
From: bfriesen at simple.dallas.tx.us (Bob Friesenhahn)
Date: Thu, 22 Oct 2015 16:51:04 -0500 (CDT)
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
In-Reply-To: <95ADB53F-BBDD-4CCD-959F-0A174E7DA8F2@zunaj.si>
References: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
	<95ADB53F-BBDD-4CCD-959F-0A174E7DA8F2@zunaj.si>
Message-ID:

On Thu, 22 Oct 2015, Matej Zerovnik wrote:
> Bob:
> Yes, my ZFS is ashift=12, since all my drives report 4k blocks (is that what you meant?). The pool is completely empty, so there is enough place for
> writes, so write speed should not be limited because of COW. Looking at iostat, there are no reads on the drives at all.
> I'm not sure where fio gets its data, probably from /dev/zero or somewhere?

To be clear, zfs does not overwrite blocks. Instead zfs modifies (in memory) any prior data from a block, and then it writes the block data to a new location. This is called "copy on write". If the prior data would not be entirely overwritten and is not already cached in memory, then it needs to be read from the underlying disk.

It is interesting that you say there are no reads on the drives at all.

> Try umounting and re-mounting your zfs filesystem (or 'zfs destroy' followed by 'zfs create') to see how performance differs on a freshly
> mounted filesystem. The zfs ARC caching will be purged when the filesystem is unmounted.
>
> If I understand you correctly, you are saying I should destroy my folders, set recordsize=4k on my pool and then create zfs folders?

It should suffice to delete the file and set recordsize=4k on the filesystem, and then use fio to create a new file. The file retains its original recordsize after it has been created so you would need to create a new file.
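Concretely, a sketch of that sequence, using the pool and file names from earlier in the thread:

zfs set recordsize=4k pool0
rm /pool0/test01    # the existing file keeps its original 128k blocks
fio --filename=/pool0/test01 --size=5g --rw=randwrite --refill_buffers \
    --norandommap --randrepeat=0 --ioengine=solarisaio --bs=4k \
    --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest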
There is maximum performance for random write if the write blocksize matches the filesystem blocksize. There is still a catch though: if the random write is truly random, the writes may still not match up perfectly with the underlying blocks. For example (top is logical write data and bottom is file block data):

aligned "random write" data:

XXXXXXXXXXXXXX XXXXXXXXXXXXXX
XXXXXXXXXXXXXX XXXXXXXXXXXXXX

vs unaligned "random write" data:

       XXXXXXXXXXXXXX XXXXXXXXXXXXXX
XXXXXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXXX

The writes aligned to the start of the underlying block will be much faster.

A major benefit of your ZIL device is to help turn random writes into fewer random writes or even sequential writes when the TXG is written to your data drives.

It is very difficult to test raw hardware performance through zfs filesystem access.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

From vab at bb-c.de  Thu Oct 22 23:13:59 2015
From: vab at bb-c.de (Volker A. Brandt)
Date: Fri, 23 Oct 2015 01:13:59 +0200
Subject: [OmniOS-discuss] UPDATE NOW --> ntp to 4.2.8p4
In-Reply-To: <20151022194134.GC77@gutsman.lotheac.fi>
References: <36D4EF6A-C596-441B-828B-862D5EB9423E@omniti.com>
	<22057.12713.886330.955767@glaurung.bb-c.de>
	<20151022194134.GC77@gutsman.lotheac.fi>
Message-ID: <22057.28087.51441.685509@glaurung.bb-c.de>

Lauri Tirkkonen writes:
> > Well, no, it doesn't. :-) That's due to a design flaw in the
> > interaction between IPS and SMF (IMHO). [...]
> Well, that's not a design flaw. Actuators are executed only when the
> action (eg. file) specifying them changes

Yes, this is how IPS does it. IPS does not really know that the manifest-import service is special. There should have been an explicit "re-import this manifest now" actuator, much like users or groups are created.

[...]
> If you wanted to restart ntp when any files in the ntp package
> change on update, you would need an actuator like
> 'restart_fmri=svc:/network/ntp:default' on *all* file actions
> delivered by the package.

I know what you mean. That might work, but that is normally not what you do when you deliver an SMF manifest in your package. You just drop it and restart manifest-import, and hope that manifest-import will see your new manifest. This is quite a different thing.

Also, what you wrote is not quite true. What you wanted to write was "you would need an actuator on *at least one* file action *that has a different file hash*". If nothing is different, the action is not executed, and the attached actuator does not fire.

And it gets worse when you remove a package that contains a manifest for a running SMF service, because it is impossible to call the stop method of the service before removing the package. Lots of fun. :-)

Viele Grüße -- Volker A. Brandt

-- 
------------------------------------------------------------------------
Volker A. Brandt               Consulting and Support for Oracle Solaris
Brandt & Brandt Computer GmbH                   WWW: http://www.bb-c.de/
Am Wiesenpfad 6, 53340 Meckenheim, GERMANY            Email: vab at bb-c.de
Handelsregister: Amtsgericht Bonn, HRB 10513              Schuhgröße: 46
Geschäftsführer: Rainer J.H. Brandt und Volker A. Brandt
"When logic and proportion have fallen sloppy dead"

From vab at bb-c.de  Thu Oct 22 23:20:16 2015
From: vab at bb-c.de (Volker A. Brandt)
Date: Fri, 23 Oct 2015 01:20:16 +0200
Subject: [OmniOS-discuss] UPDATE NOW --> ntp to 4.2.8p4
In-Reply-To:
References: <36D4EF6A-C596-441B-828B-862D5EB9423E@omniti.com>
	<22057.12713.886330.955767@glaurung.bb-c.de>
Message-ID: <22057.28464.529688.787911@glaurung.bb-c.de>

Eric Sproul writes:
> The service has restarted correctly for me on both 006 and 014 with
> this update. I'm not sure why that is though, because you're
> correct that the ntp.xml file has not changed in all of the '014
> versions published. I was under the impression that the
> restart_fmri actuator would only fire when the associated action was
> triggered.

Yes, that's why it did not restart on my 014 box, hence I noticed. Maybe you were on an earlier rev where the manifest really did change? I guess it also restarted when Dan tested, or else he would not have mentioned that pkg DTRT.

> However, if we really *do* want to restart ntp when the *daemon*
> updates, then we could add a restart_fmri actuator on the
> usr/lib/inet/ntpd file. Thus, whenever that file is updated,
> svc:/network/ntp:default could be restarted.

It's been a while since I last tried, but I think this will not work, at least not in some corner cases, e.g. when the pkg is not installed at all, and the svc:/network/ntp:default service does not exist when the pkg is installed. ISTR that pkg install will error out. Hmmm, need to test that again sometime. :-)

Regards -- Volker
-- 
------------------------------------------------------------------------
Volker A. Brandt               Consulting and Support for Oracle Solaris
Brandt & Brandt Computer GmbH                   WWW: http://www.bb-c.de/
Am Wiesenpfad 6, 53340 Meckenheim, GERMANY            Email: vab at bb-c.de
Handelsregister: Amtsgericht Bonn, HRB 10513              Schuhgröße: 46
Geschäftsführer: Rainer J.H. Brandt und Volker A. Brandt
Brandt "When logic and proportion have fallen sloppy dead" From bhildebrandt at exegy.com Fri Oct 23 04:42:02 2015 From: bhildebrandt at exegy.com (Hildebrandt, Bill) Date: Fri, 23 Oct 2015 04:42:02 +0000 Subject: [OmniOS-discuss] OmniOS backup box hanging regularly In-Reply-To: References: Message-ID: Chip was responding to an issue that we have been having (since 9/18) with our online systems becoming unresponsive to NFS and CIFS. During these hangs we also see that doing an ?ls? works, but an ?ls ?l? hangs. NFS cannot be restarted, and we are forced to reboot. I was hopeful that the NFS settings would work, but on Tuesday of this week, our offline replication target experienced the same issue. We were about to perform a scrub, when we noticed in our Napp-it interface that ?ZFS Filesystems? was not displaying. After SSHing to the system, we saw that ?ls ?l? did not respond. This was a pure replication target with no NFS access, so I don?t see how the NFS lock tuning could be the solution. A ?reboot ?d? was performed, and we have a 3.9GB dump. If you have a preferred location to receive such dumps, I would be more than happy to share. I should note that we just started using OmniOS this summer, so we have always been at the r151014 version. These were newly created units that have performed perfectly for 2.5 months, and now we are having hangs every 1-2 weeks. Here is a timeline that I shared with Guenther Alka: (I know it?s not best practice, but we had to use ?export? as the name of our data pool) 9/14 ? notified of the L2Arc issue ? removed the L2Arc device from the export pool and stopped nightly replication jobs. Ran a scrub on the replicated NAS appliance, and found no issues 9/15 ? ran ?zdb ?bbccsv export? on the replicated unit and it came up clean 9/16 ? updated OmniOS and Napp-it on the replicated unit; re-added the L2Arc device; re-enabled the replication jobs 9/18 ? in the morning, we were notified that the main NAS unit had stopped responding to CIFS and ?ls ?l? was not working ? NFS continued to function. So this was prior to any rebuild of the system disk, upgrade of OmniOS, or re-enabling the L2Arc device. That night the system disk was rebuilt from scratch from the original CD, with a pkg update performed prior to importing the data pool. 9/19-9/21 ? everything appeared to be working well. The replication jobs that run at 1, 2, and 3am this morning completed just fine. Around 5:40am this morning is when we were notified that NFS had stopped serving up files. After logging in, we found that the ?ls ?l? issue had returned, and CIFS was non-functional as well. Also, the ?Pools? tab in Napp-it worked, but the ?ZFS Filesystems? tab did not. I find it interesting that an ?ls ?l? failed while I was in /root ? the rpool is brand new, with updated OS, and no L2Arc device. I have performed scrubs and ?zdb ?bbccvs? on both units shortly after this, and no errors were found. For the system that hung on Tuesday, a new scrub was clean, but the zdb showed leaks, and over 33k checksum errors (full disclosure ? several replication jobs were ran during the zdb). A subsequent scrub and ?zpool status? showed no problem. I performed another round of replications and all completed without errors. I have now exported that pool, and am performing another zdb, which is still running but showing clean so far. Just in case I have done something insanely stupid with regard to the system configuration, here is the config (both units are identical except for the OS ? 
OS: omnios-f090f73 (the online unit is at omnios-cffff65)
Mobo: Supermicro H8DG6-H8DGi
RAM: 128GB ECC (Hynix, which is really Samsung I think)
Controllers: 2x LSI 9201-16i with SAS-SATA break-out cables going to a SAS846TQ backplane
System disk: mirrored Innodisk 128GB SATADOMs (SATADOM-MV 3ME) off mobo
ZIL: mirrored Samsung SSD 845D (400GB) off mobo
L2ARC: SAMSUNG MZHPU512 (512GB PCIe)
NICs: Intel 10G (X520)
Data disks: 24x WD 1TB Enterprise SATA (WDC WD1002FBYS)

Any help/insight is greatly appreciated.

Thanks,
Bill

From: OmniOS-discuss [mailto:omnios-discuss-bounces at lists.omniti.com] On Behalf Of Yavor Tomov
Sent: Thursday, October 22, 2015 12:37 PM
To: jimklimov at cos.ru
Cc: OmniOS-discuss
Subject: Re: [OmniOS-discuss] OmniOS backup box hanging regularly

Hi Tovarishch Jim,

I had a similar issue with my box and it was related to the NFS locks. I assume you are using it due to the Linux backups. The solution was posted by Chip on the mailing list. Copy of his solution below:

"I've seen issues like this when you run out of NFS locks. NFSv3 in Illumos is really slow at releasing locks. On all my NFS servers I do:

sharectl set -p lockd_listen_backlog=256 nfs
sharectl set -p lockd_servers=2048 nfs

Everywhere I can, I use NFSv4 instead of v3. It handles locks much better."
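To check that the settings took effect, something like this should print the current values (a sketch):

sharectl get -p lockd_listen_backlog -p lockd_servers nfs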
All the Best
Yavor

On Thu, Oct 22, 2015 at 11:59 AM, Jim Klimov wrote:

Hello all,

I have this HP-Z400 workstation with 16Gb ECC (should be) RAM running OmniOS bloody, which acts as a backup server for our production systems (regularly rsync'ing large files off Linux boxes, and rotating ZFS auto-snapshots to keep its space free). Sometimes it also runs replicas of infrastructure (DHCP, DNS) and was set up as a VirtualBox + phpVirtualBox host to test that out, but no VMs running.

So the essential loads are ZFS snapshots and ZFS scrubs :)

And it freezes roughly every week. Stops responding to ping, attempts to log in via SSH or physical console - it processes keypresses on the latter, but does not present a login prompt. It used to be stable, and such regular hangs began around summertime.

My primary guess would be for flaky disks, maybe timing out under load or going to sleep or whatever... But I have yet to prove it, or any other theory. Maybe just CPU is overheating due to regular near-100% load with disk I/O... At least I want to rule out OS errors and rule out (or point out) operator/box errors as much as possible - which is something I can change to try and fix ;)

Before I proceed to TL;DR screenshots, I'd overview what I see:
* In the "top" output, processes owned by zfssnap lead most of the time... But even the SSH shell is noticeably slow to respond (1 sec per line when just pressing enter to clear the screen to prepare nice screenshots).
* SMART was not enabled on the 3TB mirrored "pool" SATA disks (is now, long tests initiated), but was in place on the "rpool" SAS disk where it logged some corrected ECC errors - but none uncorrected. Maybe the cabling should be reseated.
* iostat shows disks are generally not busy (they don't audibly rattle nor visibly blink all the time, either)
* zpool scrubs return clean
* there are partitions of the system rpool disk (10K RPM SAS) used as log and cache devices for the main data pool on 3TB SATA disks. The system disk is fast and underutilized, so what the heck ;) And it was not a problem for the first year of this system's honest and stable workouts. These devices are pretty empty at the moment.

I have enabled deadman panics according to the Wiki, but none have happened so far:

# cat /etc/system | egrep -v '(^\*|^$)'
set snooping=1
set pcplusmp:apic_panic_on_nmi=1
set apix:apic_panic_on_nmi = 1

In the "top" output, processes owned by zfssnap lead most of the time:

last pid: 22599;  load avg: 12.9, 12.2, 11.2;  up 0+09:52:11    18:34:41
140 processes: 125 sleeping, 13 running, 2 on cpu
CPU states:  0.0% idle, 22.9% user, 77.1% kernel, 0.0% iowait, 0.0% swap
Memory: 16G phys mem, 1765M free mem, 2048M total swap, 2048M free swap

Seconds to delay:
   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 21389 zfssnap    1  43    2  863M  860M run      5:04 35.61% zfs
 22360 zfssnap    1  52    2  118M  115M run      0:37 16.50% zfs
 21778 zfssnap    1  52    2  563M  560M run      3:15 13.17% zfs
 21278 zfssnap    1  52    2  947M  944M run      5:32  6.91% zfs
 21881 zfssnap    1  43    2  433M  431M run      2:31  5.41% zfs
 21852 zfssnap    1  52    2  459M  456M run      2:39  5.16% zfs
 21266 zfssnap    1  43    2  906M  903M run      5:18  3.95% zfs
 21757 zfssnap    1  43    2  597M  594M run      3:26  2.91% zfs
 21274 zfssnap    1  52    2  930M  927M cpu/0    5:27  2.78% zfs
 22588 zfssnap    1  43    2   30M   27M run      0:08  2.48% zfs
 22580 zfssnap    1  52    2   49M   46M run      0:14  0.71% zfs
 22038 root       1  59    0 5312K 3816K cpu/1    0:01  0.10% top
 22014 root       1  59    0 8020K 4988K sleep    0:00  0.02% sshd

Average "iostats" are not that busy:

# zpool iostat -Td 5
Thu Oct 22 18:24:59 CEST 2015
              capacity     operations    bandwidth
pool       alloc   free   read  write   read  write
---------- -----  -----  -----  -----  -----  -----
pool       2.52T   207G    802    116  28.3M   840K
rpool      33.0G   118G      0      4  4.52K  58.7K
---------- -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:04 CEST 2015
pool       2.52T   207G      0      0      0      0
rpool      33.0G   118G      0     10      0  97.9K
---------- -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:09 CEST 2015
pool       2.52T   207G      0      0      0      0
rpool      33.0G   118G      0      0      0      0
---------- -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:14 CEST 2015
pool       2.52T   207G      0      0      0      0
rpool      33.0G   118G      0      9      0  93.5K
---------- -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:19 CEST 2015
pool       2.52T   207G      0      0      0      0
rpool      33.0G   118G      0      0      0      0
---------- -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:24 CEST 2015
pool       2.52T   207G      0      0      0      0
rpool      33.0G   118G      0      0      0      0
---------- -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:29 CEST 2015
pool       2.52T   207G      0      0      0      0
rpool      33.0G   118G      0      0      0      0
---------- -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:34 CEST 2015
pool       2.52T   207G      0      0      0      0
rpool      33.0G   118G      0      0      0      0
---------- -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:39 CEST 2015
pool       2.52T   207G      0      0      0      0
rpool      33.0G   118G      0     16      0   374K
---------- -----  -----  -----  -----  -----  -----
...
Thu Oct 22 18:33:49 CEST 2015
pool       2.52T   207G      0      0      0      0
rpool      33.0G   118G      0     11      0  94.5K
---------- -----  -----  -----  -----  -----  -----
Thu Oct 22 18:33:54 CEST 2015
pool       2.52T   207G      0     13    819  80.0K
rpool      33.0G   118G      0      0      0      0
---------- -----  -----  -----  -----  -----  -----
Thu Oct 22 18:33:59 CEST 2015
pool       2.52T   207G      0    129      0  1.06M
rpool      33.0G   118G      0      0      0      0
---------- -----  -----  -----  -----  -----  -----
Thu Oct 22 18:34:04 CEST 2015
pool       2.52T   207G      0     55      0   503K
rpool      33.0G   118G      0     11      0  97.9K
---------- -----  -----  -----  -----  -----  -----
...

just occasional bursts of work.
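One cheap way to record where the memory and pages go until the next freeze is a periodic log of the stock mdb ::memstat dcmd, left running in a detached shell (a rough sketch):

# while sleep 600; do date; echo ::memstat | mdb -k; done >> /var/tmp/memstat.log &

The tail of that log after a hang then shows how the page counts were trending.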
I've now enabled SMART on the disks (2*3Tb mirror "pool" and 1*300Gb "rpool") and ran some short tests and triggered long tests (hopefully they'd succeed by tomorrow); current results are:

# for D in /dev/rdsk/c0*s0; do echo "===== $D :"; smartctl -d sat,12 -a $D ; done ; for D in /dev/rdsk/c4*s0 ; do echo "===== $D :"; smartctl -d scsi -a $D ; done
===== /dev/rdsk/c0t3d0s0 :
smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD3003FZEX-00Z4SA0
Serial Number:    WD-WCC5D1KKU0PA
LU WWN Device Id: 5 0014ee 2610716b7
Firmware Version: 01.01A01
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Oct 22 18:45:28 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (32880) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 357) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME           FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate      0x002f  200   200   051    Pre-fail Always      -       0
  3 Spin_Up_Time             0x0027  246   154   021    Pre-fail Always      -       6691
  4 Start_Stop_Count         0x0032  100   100   000    Old_age  Always      -       14
  5 Reallocated_Sector_Ct    0x0033  200   200   140    Pre-fail Always      -       0
  7 Seek_Error_Rate          0x002e  200   200   000    Old_age  Always      -       0
  9 Power_On_Hours           0x0032  094   094   000    Old_age  Always      -       4869
 10 Spin_Retry_Count         0x0032  100   253   000    Old_age  Always      -       0
 11 Calibration_Retry_Count  0x0032  100   253   000    Old_age  Always      -       0
 12 Power_Cycle_Count        0x0032  100   100   000    Old_age  Always      -       14
 16 Unknown_Attribute        0x0022  130   070   000    Old_age  Always      -       2289651870502
192 Power-Off_Retract_Count  0x0032  200   200   000    Old_age  Always      -       12
193 Load_Cycle_Count         0x0032  200   200   000    Old_age  Always      -       2
194 Temperature_Celsius      0x0022  117   111   000    Old_age  Always      -       35
196 Reallocated_Event_Count  0x0032  200   200   000    Old_age  Always      -       0
197 Current_Pending_Sector   0x0032  200   200   000    Old_age  Always      -       0
198 Offline_Uncorrectable    0x0030  200   200   000    Old_age  Offline     -       0
199 UDMA_CRC_Error_Count     0x0032  200   200   000    Old_age  Always      -       0
200 Multi_Zone_Error_Rate    0x0008  200   200   000    Old_age  Offline     -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%             4869  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

===== /dev/rdsk/c0t5d0s0 :
smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate SV35
Device Model:     ST3000VX000-1ES166
Serial Number:    Z500S3L8
LU WWN Device Id: 5 000c50 079e3757b
Firmware Version: CV26
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Oct 22 18:45:28 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (   80) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 325) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME           FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate      0x000f  105   099   006    Pre-fail Always      -       8600880
  3 Spin_Up_Time             0x0003  096   094   000    Pre-fail Always      -       0
  4 Start_Stop_Count         0x0032  100   100   020    Old_age  Always      -       19
  5 Reallocated_Sector_Ct    0x0033  100   100   010    Pre-fail Always      -       0
  7 Seek_Error_Rate          0x000f  085   060   030    Pre-fail Always      -       342685681
  9 Power_On_Hours           0x0032  096   096   000    Old_age  Always      -       4214
 10 Spin_Retry_Count         0x0013  100   100   097    Pre-fail Always      -       0
 12 Power_Cycle_Count        0x0032  100   100   020    Old_age  Always      -       19
184 End-to-End_Error         0x0032  100   100   099    Old_age  Always      -       0
187 Reported_Uncorrect       0x0032  100   100   000    Old_age  Always      -       0
188 Command_Timeout          0x0032  100   100   000    Old_age  Always      -       0
189 High_Fly_Writes          0x003a  028   028   000    Old_age  Always      -       72
190 Airflow_Temperature_Cel  0x0022  069   065   045    Old_age  Always      -       31 (Min/Max 29/32)
191 G-Sense_Error_Rate       0x0032  100   100   000    Old_age  Always      -       0
192 Power-Off_Retract_Count  0x0032  100   100   000    Old_age  Always      -       19
193 Load_Cycle_Count         0x0032  100   100   000    Old_age  Always      -       28
194 Temperature_Celsius      0x0022  031   040   000    Old_age  Always      -       31 (0 20 0 0 0)
197 Current_Pending_Sector   0x0012  100   100   000    Old_age  Always      -       0
198 Offline_Uncorrectable    0x0010  100   100   000    Old_age  Offline     -       0
199 UDMA_CRC_Error_Count     0x003e  200   200   000    Old_age  Always      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                         Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress        90%             4214  -
# 2  Short offline       Completed without error              00%             4214  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
===== /dev/rdsk/c4t5000CCA02A1292DDd0s0 :
smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

Vendor:               HITACHI
Product:              HUS156030VLS600
Revision:             HPH1
User Capacity:        300,000,000,000 bytes [300 GB]
Logical block size:   512 bytes
Logical Unit id:      0x5000cca02a1292dc
Serial number:        LVVA6NHS
Device type:          disk
Transport protocol:   SAS
Local Time is:        Thu Oct 22 18:45:29 2015 CEST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     45 C
Drive Trip Temperature:        70 C
Manufactured in week 14 of year 2012
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  80
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 2340336504406016

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0   888890         0     888890            0      29326.957         0
write:         0   961315         0     961315            0       6277.560         0

Non-medium error count:      283

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...    -      NOW   - [-   -    -]
# 2  Background long   Aborted (device reset ?)     -    14354   - [-   -    -]
# 3  Background short  Completed                    -    14354   - [-   -    -]
# 4  Background long   Aborted (device reset ?)     -    14354   - [-   -    -]
# 5  Background long   Aborted (device reset ?)     -    14354   - [-   -    -]

Long (extended) Self Test duration: 2506 seconds [41.8 minutes]

The zpool scrub results and general layout:

# zpool status -v
  pool: pool
 state: ONLINE
  scan: scrub repaired 0 in 164h13m with 0 errors on Thu Oct 22 18:13:33 2015
config:

        NAME                       STATE     READ WRITE CKSUM
        pool                       ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c0t3d0                 ONLINE       0     0     0
            c0t5d0                 ONLINE       0     0     0
        logs
          c4t5000CCA02A1292DDd0p2  ONLINE       0     0     0
        cache
          c4t5000CCA02A1292DDd0p3  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not
        support the features. See zpool-features(5) for details.
  scan: scrub repaired 0 in 3h3m with 0 errors on Thu Oct  8 04:12:35 2015
config:

        NAME                       STATE     READ WRITE CKSUM
        rpool                      ONLINE       0     0     0
          c4t5000CCA02A1292DDd0s0  ONLINE       0     0     0

errors: No known data errors

# zpool list -v
NAME                        SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
pool                       2.72T  2.52T   207G         -    68%    92%  1.36x  ONLINE  /
  mirror                   2.72T  2.52T   207G         -    68%    92%
    c0t3d0                     -      -      -         -      -      -
    c0t5d0                     -      -      -         -      -      -
log                            -      -      -         -      -      -
  c4t5000CCA02A1292DDd0p2      8G   148K  8.00G        -     0%     0%
cache                          -      -      -         -      -      -
  c4t5000CCA02A1292DDd0p3    120G  1.80G   118G        -     0%     1%
rpool                       151G  33.0G   118G         -    76%    21%  1.00x  ONLINE  -
  c4t5000CCA02A1292DDd0s0    151G  33.0G   118G        -    76%    21%

Note the long scrub time may have included the downtime while the system was frozen until it was rebooted.

Thanks in advance for the fresh pairs of eyeballs,
Jim Klimov
_______________________________________________
OmniOS-discuss mailing list
OmniOS-discuss at lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss

________________________________

This e-mail and any documents accompanying it may contain legally privileged and/or confidential information belonging to Exegy, Inc. Such information may be protected from disclosure by law.
The information is intended for use by only the addressee. If you are not the intended recipient, you are hereby notified that any disclosure or use of the information is strictly prohibited. If you have received this e-mail in error, please immediately contact the sender by e-mail or phone regarding instructions for return or destruction and do not use or disclose the content to others.

From lotheac at iki.fi  Fri Oct 23 06:32:03 2015
From: lotheac at iki.fi (Lauri Tirkkonen)
Date: Fri, 23 Oct 2015 09:32:03 +0300
Subject: [OmniOS-discuss] UPDATE NOW --> ntp to 4.2.8p4
In-Reply-To: <22057.28087.51441.685509@glaurung.bb-c.de>
References: <36D4EF6A-C596-441B-828B-862D5EB9423E@omniti.com>
	<22057.12713.886330.955767@glaurung.bb-c.de>
	<20151022194134.GC77@gutsman.lotheac.fi>
	<22057.28087.51441.685509@glaurung.bb-c.de>
Message-ID: <20151023063203.GD77@gutsman.lotheac.fi>

We may be getting a bit off topic here :)

On Fri, Oct 23 2015 01:13:59 +0200, Volker A. Brandt wrote:
> Lauri Tirkkonen writes:
> > Well, that's not a design flaw. Actuators are executed only when the
> > action (eg. file) specifying them changes
>
> Yes, this is how IPS does it. IPS does not really know that the manifest-import service is special. There should have been an explicit "re-import this manifest now" actuator, much like users or groups are created.

But it isn't really that special. If you ship a different service manifest, it gets reimported with 'svccfg import' -- but this does *not* affect whether the service is running or restart it (IME). Reimporting a manifest does, however, cause a service refresh, but the refresh action doesn't necessarily restart any processes:

mail ~ # svcs spamd
STATE          STIME    FMRI
online         Sep_30   svc:/mail/spamassassin/spamd:default
mail ~ # svccfg import /lib/svc/manifest/mail/spamassassin-spamd.xml
mail ~ # svcs -vp spamd
STATE          NSTATE        STIME    CTID   FMRI
online         -             6:15:41    341 svc:/mail/spamassassin/spamd:default
               Sep_30         3619 spamd
               Oct_20        19590 spamd
               Oct_16        29484 spamd
mail ~ # tail -2 $(svcs -L spamd)
[ Oct 23 06:15:41 Rereading configuration. ]
[ Oct 23 06:15:41 No 'refresh' method defined. Treating as :true. ]

I don't know what you mean with the bit about users and groups; AFAICT the logic is similar to other actions.

> [...]
> > If you wanted to restart ntp when any files in the ntp package
> > change on update, you would need an actuator like
> > 'restart_fmri=svc:/network/ntp:default' on *all* file actions
> > delivered by the package.
>
> I know what you mean. That might work, but that is normally not what you do when you deliver an SMF manifest in your package. You just drop it and restart manifest-import, and hope that manifest-import will see your new manifest. This is quite a different thing.

It is actually what I normally do, because no other way works :) See for example the following mog file for ISC bind:
https://github.com/niksula/omnios-build-scripts/blob/master/bind/local.mog#L6
Even the manifest-import restart is an actuator added by omnios-build (from global.mog), nothing happens automatically.

> Also, what you wrote is not quite true. What you wanted to write was "you would need an actuator on *at least one* file action *that has a different file hash*". If nothing is different, the action is not executed, and the attached actuator does not fire.

We're both correct. Note that I said "when _any_ files in the ntp package change on update" (emphasis added).

> And it gets worse when you remove a package that contains a manifest for a running SMF service, because it is impossible to call the stop method of the service before removing the package. Lots of fun. :-)

I don't know why it would be impossible before or even after removing the package. After the manifest has been imported into the SMF repository, the service will not go away even if you remove the xml manifest it came from. If you meant that pkg can't automatically stop your service when removing a package, maybe disable_fmri is something you want? From pkg(5):

    disable_fmri    causes the given FMRI to be disabled prior to action
                    removal, per the disable subcommand to svcadm(1M).
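As a pkgmogrify transform, that could look roughly like this (a sketch, untested; the ntp FMRI is reused from the earlier discussion, not something any package here is claimed to ship):

# disable the service before its files are removed on 'pkg uninstall'
<transform file -> default disable_fmri svc:/network/ntp:default>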
-- 
Lauri Tirkkonen | lotheac @ IRCnet

From lotheac at iki.fi  Fri Oct 23 06:44:58 2015
From: lotheac at iki.fi (Lauri Tirkkonen)
Date: Fri, 23 Oct 2015 09:44:58 +0300
Subject: [OmniOS-discuss] UPDATE NOW --> ntp to 4.2.8p4
In-Reply-To: <22057.28464.529688.787911@glaurung.bb-c.de>
References: <36D4EF6A-C596-441B-828B-862D5EB9423E@omniti.com>
	<22057.12713.886330.955767@glaurung.bb-c.de>
	<22057.28464.529688.787911@glaurung.bb-c.de>
Message-ID: <20151023064458.GE77@gutsman.lotheac.fi>

On Fri, Oct 23 2015 01:20:16 +0200, Volker A. Brandt wrote:
> Eric Sproul writes:
> > The service has restarted correctly for me on both 006 and 014 with
> > this update. I'm not sure why that is though, because you're
> > correct that the ntp.xml file has not changed in all of the '014
> > versions published. I was under the impression that the
> > restart_fmri actuator would only fire when the associated action was
> > triggered.
>
> Yes, that's why it did not restart on my 014 box, hence I noticed.
> Maybe you were on an earlier rev where the manifest really did change?
> I guess it also restarted when Dan tested, or else he would not have
> mentioned that pkg DTRT.

I don't know why the service would restart even if the manifest did change, because reimporting the manifest only triggers a refresh (see my other mail). I find it strange that Eric mentions it did.

> > However, if we really *do* want to restart ntp when the *daemon*
> > updates, then we could add a restart_fmri actuator on the
> > usr/lib/inet/ntpd file. Thus, whenever that file is updated,
> > svc:/network/ntp:default could be restarted.
>
> It's been a while since I last tried, but I think this will not work,
> at least not in some corner cases, e.g. when the pkg is not installed
> at all, and the svc:/network/ntp:default service does not exist when
> the pkg is installed. ISTR that pkg install will error out. Hmmm,
> need to test that again sometime. :-)

I don't know about when pkg is not installed (how would you even update or install the package then?), but on a package containing both the manifest-import restart actuator on the manifest file as well as a restart_fmri for the new service on the other files, 'pkg install -v' does mention that both services will be restarted, even though one of them doesn't exist at install time. It has never caused any errors for me though, and we've been shipping several different packages containing services for quite a while now.
-- 
Lauri Tirkkonen | lotheac @ IRCnet

From lotheac at iki.fi  Fri Oct 23 07:00:38 2015
From: lotheac at iki.fi (Lauri Tirkkonen)
Date: Fri, 23 Oct 2015 10:00:38 +0300
Subject: [OmniOS-discuss] UPDATE NOW --> ntp to 4.2.8p4
In-Reply-To: <20151023064458.GE77@gutsman.lotheac.fi>
References: <36D4EF6A-C596-441B-828B-862D5EB9423E@omniti.com>
	<22057.12713.886330.955767@glaurung.bb-c.de>
	<22057.28464.529688.787911@glaurung.bb-c.de>
	<20151023064458.GE77@gutsman.lotheac.fi>
Message-ID: <20151023070038.GF77@gutsman.lotheac.fi>

On Fri, Oct 23 2015 09:44:58 +0300, Lauri Tirkkonen wrote:
> On Fri, Oct 23 2015 01:20:16 +0200, Volker A. Brandt wrote:
> > It's been a while since I last tried, but I think this will not work,
> > at least not in some corner cases, e.g. when the pkg is not installed
> > at all
>
> I don't know about when pkg is not installed (how would you even update
> or install the package then?)

Sorry, looks like I failed at reading comprehension (I missed the "the" in "the pkg").

-- 
Lauri Tirkkonen | lotheac @ IRCnet

From richard at netbsd.org  Fri Oct 23 07:25:14 2015
From: richard at netbsd.org (Richard PALO)
Date: Fri, 23 Oct 2015 09:25:14 +0200
Subject: [OmniOS-discuss] ILB memory leak?
In-Reply-To: <5628A198.5040808@scluk.com>
References: <56276437.2020109@scluk.com>
	<00AE5FA3-E699-4C4F-8A94-AEEDAAED0856@omniti.com>
	<5628A198.5040808@scluk.com>
Message-ID: <5629E0DA.5070402@netbsd.org>

On 22/10/15 10:43, Al Slater wrote:
> I am seeing kernel memory consumption increasing as well, but that may
> be a different issue. The ilbd process memory is definitely growing.

this is indeed probably a different issue, but it would be useful to create a thread on illumos discuss as I'm seeing it as well (not using ILB).. for example, running a number of rather intensive builds I see kernel steadily going up to ~40%!!:

> richard at omnis:/home/richard$ swap -hs ; echo ::memstat |pfexec mdb -k
> total: 1,8G allocated + 311M reserved = 2,1G used, 40G available
> Page Summary                Pages                MB  %Tot
> ------------     ----------------  ----------------  ----
> Kernel                    3231113             12621   39%
> ZFS File Data             2944763             11502   35%
> Anon                       452803              1768    5%
> Exec and libs                5088                19    0%
> Page cache                  65892               257    1%
> Free (cachelist)            70820               276    1%
> Free (freelist)           1614595              6307   19%
>
> Total                     8385074             32754
> Physical                  8385072             32754

-- 
Richard PALO

From matej at zunaj.si  Fri Oct 23 11:55:03 2015
From: matej at zunaj.si (Matej Zerovnik)
Date: Fri, 23 Oct 2015 13:55:03 +0200
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
In-Reply-To:
References: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
	<95ADB53F-BBDD-4CCD-959F-0A174E7DA8F2@zunaj.si>
Message-ID: <792FFED6-445A-41E6-9FB3-ADD6E8F7310E@zunaj.si>

> On 22 Oct 2015, at 23:51, Bob Friesenhahn wrote:
>
> On Thu, 22 Oct 2015, Matej Zerovnik wrote:
>> Bob:
>> Yes, my ZFS is ashift=12, since all my drives report 4k blocks (is that what you meant?). The pool is completely empty, so there is enough place for
>> writes, so write speed should not be limited because of COW. Looking at iostat, there are no reads on the drives at all.
>> I'm not sure where fio gets its data, probably from /dev/zero or somewhere?
>
> To be clear, zfs does not overwrite blocks. Instead zfs modifies (in memory) any prior data from a block, and then it writes the block data to a new location. This is called "copy on write". If the prior data would not be entirely overwritten and is not already cached in memory, then it needs to be read from the underlying disk.
>
> It is interesting that you say there are no reads on the drives at all.

I think I got it. I think in my case, there are no reads because the whole pool is empty and fresh data is written to the pool. So there is no rewrite, just pure writes...

I did some more testing with recordsize=4k (need to repeat it with 128k as well) and it looks like I can get up to 48k IOPS when doing sequential 4k writes running 6x dd (dd if=/dev/zero of=/pool/folder/file1 bs=4k count=100000).

When I switch to random writes (this time I tried iozone instead of fio), I can only get up to 58 MB/s, which translates to around 14,500 IOPS (although iostat is showing higher values).

iostat during random write:
   r/s      w/s    kr/s      kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
   0.0  29805.4     0.0  119209.6   0.0   3.0     0.0     0.1   3  83  c9t5000A72A300B3D9Dd0
   0.0  29807.4     0.0  119217.6   0.0   4.7     0.0     0.2   4  83  c10t5000A72A300B3D7Ed0
   0.0    589.7     0.0   62794.9   0.0   0.7     0.0     1.2   1  62  c10t5000C500837549F9d0
   0.0    622.6     0.0   66173.8   0.0   0.6     0.0     1.0   1  54  c10t5000C50083759089d0
   0.0    609.6     0.0   66173.9   0.0   0.6     0.0     1.0   1  54  c10t5000C500837557EDd0

How come I can see 30k IOPS flowing to the ZeusRAMs, but I only see 58 MB/s being written to the hard drives?
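For what it's worth, the average write size in that sample is kw/s divided by w/s:

  ZeusRAMs:  119217.6 / 29807.4 ~= 4.0 kB per write
  spindles:   62794.9 /   589.7 ~= 106 kB per write

so the slog devices are taking the 4k writes roughly one-for-one, while the TXG sync aggregates them into much larger writes to the data disks.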
I tried running the scripts from http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning: a txg flush takes around 0.3s, and average txg usage is:

  3  12645  txg_sync_thread:txg-syncing  742MB of 4096MB used
 15  12645  txg_sync_thread:txg-syncing 1848MB of 4096MB used
 21  12645  txg_sync_thread:txg-syncing 1467MB of 4096MB used
 20  12645  txg_sync_thread:txg-syncing 2231MB of 4096MB used
 16  12645  txg_sync_thread:txg-syncing 1237MB of 4096MB used
 14  12645  txg_sync_thread:txg-syncing 1624MB of 4096MB used
  1  12645  txg_sync_thread:txg-syncing 1130MB of 4096MB used
  9  12645  txg_sync_thread:txg-syncing 1750MB of 4096MB used
  9  12645  txg_sync_thread:txg-syncing 1300MB of 4096MB used
 18  12645  txg_sync_thread:txg-syncing 2396MB of 4096MB used

If I understand that correctly, the spindles have no problem writing data from the write cache to disk.

I have a feeling I have a real problem with understanding how things work in ZFS :) I always used a simple explanation: as fast as the system can write to the ZIL, that is the speed at which a program can write to the filesystem. I guess not?

Matej

From danmcd at omniti.com  Fri Oct 23 15:24:23 2015
From: danmcd at omniti.com (Dan McDonald)
Date: Fri, 23 Oct 2015 11:24:23 -0400
Subject: [OmniOS-discuss] OmniOS backup box hanging regularly
In-Reply-To:
References:
Message-ID: <74C4EFF5-B1D7-49DE-81D5-54CE8787C258@omniti.com>

> On Oct 23, 2015, at 12:42 AM, Hildebrandt, Bill wrote:
>
> Controllers: 2x LSI 9201-16i with SAS-SATA break-out cables going to a SAS846TQ backplane
> System disk: mirrored Innodisk 128GB SATADOMs (SATADOM-MV 3ME) off mobo
> ZIL: mirrored Samsung SSD 845D (400GB) off mobo
> L2ARC: SAMSUNG MZHPU512 (512GB PCIe)
> NICs: Intel 10G (X520)
> Data disks: 24x WD 1TB Enterprise SATA (WDC WD1002FBYS)

SATA disks attached to a SAS expander (this is the SAS846TQ, an expander, right?) are known to be a dangerous deployment. SATA doesn't have the reporting capabilities SAS does, and many SAS expanders don't relay SATA well enough.

We do not support paying customers who deploy like this.

FYI,
Dan

From richard.elling at richardelling.com  Fri Oct 23 16:42:10 2015
From: richard.elling at richardelling.com (Richard Elling)
Date: Fri, 23 Oct 2015 09:42:10 -0700
Subject: [OmniOS-discuss] Slow performance with ZeusRAM?
In-Reply-To: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
References: <10FE1CC1-F9F5-433A-9A2D-6570C4EE6CCF@zunaj.si>
Message-ID:

additional insight below...

> On Oct 22, 2015, at 12:02 PM, Matej Zerovnik wrote:
>
> Hello,
>
> I'm building a new system and I'm having a bit of a performance problem. Well, it's either that or I'm not getting the whole ZIL idea :)
>
> My system is as follows:
> - IBM xServer 3550 M4 server (dual CPU with 160GB memory)
> - LSI 9207 HBA (P19 firmware)
> - Supermicro JBOD with SAS expander
> - 4TB SAS3 drives
> - ZeusRAM for ZIL
> - LTS OmniOS (all patches applied)
>
> If I benchmark the ZeusRAM on its own with random 4k sync writes, I can get 48k IOPS out of it, no problem there.

Do not assume writes to the slog for a 4k random write workload are only 4k in size. You'll want to measure to be sure, but the worst case here is 8k written to the slog:

  4k data + 4k chain pointer = 8k physical write

There are cases where multiple 4k data blocks get coalesced, so the above is the worst case. Measure to be sure. A quick back-of-the-napkin measurement can be done from iostat -x output. More detailed measurements can be done with zilstat or other specific dtracing.
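For example, assuming the zilstat script is present on the box (a sketch, untested):

# ./zilstat.ksh 1 10    # one-second samples of ZIL bytes and ops

The per-write size buckets in its output make the 4k-vs-8k question above directly visible.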
-- richard

> If I create a new raidz2 pool with 10 hard drives, mirrored ZeusRAMs for
> ZIL and set sync=always, I can only squeeze 14k IOPS out of the system.
> Is that normal or should I be getting 48k IOPS on the 2nd pool as well, since this is the performance the ZeusRAM can deliver?
>
> I'm testing with fio:
> fio --filename=/pool0/test01 --size=5g --rw=randwrite --refill_buffers --norandommap --randrepeat=0 --ioengine=solarisaio --bs=4k --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest
>
> thanks, Matej
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss at lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss

From jimklimov at cos.ru  Fri Oct 23 16:54:27 2015
From: jimklimov at cos.ru (Jim Klimov)
Date: Fri, 23 Oct 2015 18:54:27 +0200
Subject: [OmniOS-discuss] OmniOS backup box hanging regularly
In-Reply-To:
References:
Message-ID: <02DF3A33-F955-4F86-A478-0D639CB500F1@cos.ru>

On 23 October 2015 11:23:28 CEST, Jim Klimov wrote:
> A new bit of info came in. I left the box running along with an SSH session running various tracers overnight, and it seems that the system dumbly ran out of memory.
>
> However it never logged any forking errors, etc. which were typical for similar cases before, and did not recover by time or "magic" (e.g. processes dying on ENOMEM and so freeing it up). There is some swap free, too (which wouldn't help if it ran out of kernel memory). The 64Mb free RAM (or 32Mb in some of my older experiences) is the empiric minimum under which illumos is good as dead ;)
>
> The box is pingable after all, but no SSH nor local usability. The "top" listings froze at 4:25am, that's some 7 hours ago.
>
> last pid: 26331;  load avg: 6.65, 5.61, 6.88;  up 0+19:42:47    04:25:17
> 208 processes: 178 sleeping, 28 running, 1 zombie, 1 on cpu
> CPU states: 21.5% idle, 3.5% user, 75.0% kernel, 0.0% iowait, 0.0% swap
> Memory: 16G phys mem, 64M free mem, 2048M total swap, 1420M free swap
>   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
> 24910 zfssnap    1  60    2  243M  238M run      1:24  1.61% zfs
> 25620 zfssnap    1  53    2  220M  215M run      1:18  1.17% zfs
> 25619 zfssnap    1  53    2  220M  215M run      1:18  1.15% zfs
> 24753 zfssnap    1  53    2  243M  238M run      1:26  1.12% zfs
> 25864 zfssnap    1  53    2  220M  215M run      1:19  0.93% zfs
> 25861 zfssnap    1  53    2   13M   10M sleep    0:03  0.87% zfs
> 22380 zfssnap    1  60    2  764M  721M run      4:34  0.83% zfs
> 22546 zfssnap    1  53    2  698M  672M sleep    4:08  0.79% zfs
> 25857 zfssnap    1  53    2  220M  215M run      1:19  0.78% zfs
> 24224 zfssnap    1  60    2  548M  536M run      3:13  0.76% zfs
> 22901 zfssnap    1  60    2  698M  672M run      4:09  0.75% zfs
> 22551 zfssnap    1  60    2  698M  672M sleep    4:08  0.73% zfs
> 22373 zfssnap    1  60    2  767M  729M run      4:33  0.73% zfs
> 22212 zfssnap    1  60    2  767M  730M sleep    4:36  0.71% zfs
> 24215 zfssnap    1  60    2  549M  537M sleep    3:13  0.69% zfs
>
> Heh, by sheer coincidence, it froze 10 hours 1 second after I logged in (which in my profile also prints a few lines from top):
>
> last pid: 22036;  load avg: 10.7, 10.3, 10.4;  up 0+09:42:46    18:25:16
> 126 processes: 115 sleeping, 9 running, 2 on cpu
> CPU states: 0.0% idle, 21.4% user, 78.6% kernel, 0.0% iowait, 0.0% swap
> Memory: 16G phys mem, 1713M free mem, 2048M total swap, 2048M free swap
>   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
> 21266 zfssnap    1  46    2  660M  657M run      3:50 28.20% zfs
> 21274 zfssnap    1  45    2  672M  669M run      3:55 16.70% zfs
> 21757 zfssnap    1  50    2  314M  311M run      1:48  8.95% zfs
> 21389 zfssnap    1  52    2  586M  584M run      3:24  8.82% zfs
> 21173 zfssnap    1  53    2  762M  759M cpu/1    4:28  8.76% zfs
>
> So at the moment it seems there is some issue with zfs-auto-snapshots on OmniOS that I haven't seen in SXCE, OI nor Hipster. Possibly I had different implementations of the service in these different OSes (shell vs python, and all that at different versions).
>
> I'll see if toning down the frequency of autosnaps (e.g. disable "frequent" or "hourly" schedules) would help improve stability. If it does - I'd still call it a bug. System should not die like that. And the actual load (as in I/O ops) is seemingly not that gigantic.
>
> Jim
>
> ----- Original Message -----
> From: Jim Klimov
> Date: Thursday, October 22, 2015 20:02
> Subject: [OmniOS-discuss] OmniOS backup box hanging regularly
> To: OmniOS-discuss
>
>> Hello all,
>> I have this HP-Z400 workstation with 16Gb ECC (should be) RAM running OmniOS bloody, which acts as a backup server for our production systems (regularly rsync'ing large files off Linux boxes, and rotating ZFS auto-snapshots to keep its space free). Sometimes it also runs replicas of infrastructure (DHCP, DNS) and was set up as a VirtualBox + phpVirtualBox host to test that out, but no VMs running.
>> So the essential loads are ZFS snapshots and ZFS scrubs :)
>> And it freezes roughly every week. Stops responding to ping, attempts to log in via SSH or physical console - it processes keypresses on the latter, but does not present a login prompt. It used to be stable, and such regular hangs began around summertime.
>>
>> My primary guess would be for flaky disks, maybe timing out under load or going to sleep or whatever... But I have yet to prove it, or any other theory. Maybe just CPU is overheating due to regular near-100% load with disk I/O... At least I want to rule out OS errors and rule out (or point out) operator/box errors as much as possible - which is something I can change to try and fix ;)
Maybe just CPU is overheating due to regular >near-100% load with disk I/O... At least I want to rule out OS errors >and rule out (or point out) operator/box errors as much as possible - >which is something I can change to try and fix ;) >> Before I proceed to TL;DR screenshots, I'd overview what I see: >> * In the "top" output, processes owned by zfssnap lead most of the >time... But even the SSH shell is noticeably slow to respond (1 sec per >line when just pressing enter to clear the screen to prepare nice >screenshots). >> * SMART was not enabled on 3TB mirrored "pool" SATA disks (is now, >long tests initiated), but was in place on the "rpool" SAS disk where >it logged some corrected ECC errors - but none uncorrected. >> Maybe the cabling should be reseated. >> * iostat shows disks are generally not busy (they don't audibly >rattle nor visibly blink all the time, either) >> * zpool scrubs return clean >> * there are partitions of the system rpool disk (10K RPM SAS) used as >log and cache devices for the main data pool on 3TB SATA disks. The >system disk is fast and underutilized, so what the heck ;) And it was >not a problem for the first year of this system's honest and stable >workouts. These devices are pretty empty at the moment. >> >> I have enabled deadman panics according to Wiki, but none have >happened so far: >> # cat /etc/system | egrep -v '(^\*|^$)' >> set snooping=1 >> set pcplusmp:apic_panic_on_nmi=1 >> set apix:apic_panic_on_nmi = 1 > >> >> >> In the "top" output, processes owned by zfssnap lead most of the >time: >> >> last pid: 22599; load avg: 12.9, 12.2, 11.2; up 0+09:52:11 > 18:34:41 >> 140 processes: 125 sleeping, 13 running, 2 on cpu >> CPU states: 0.0% idle, 22.9% user, 77.1% kernel, 0.0% iowait, 0.0% >swap >> Memory: 16G phys mem, 1765M free mem, 2048M total swap, 2048M free >swap >> Seconds to delay: >> PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND >> 21389 zfssnap 1 43 2 863M 860M run 5:04 35.61% zfs >> 22360 zfssnap 1 52 2 118M 115M run 0:37 16.50% zfs >> 21778 zfssnap 1 52 2 563M 560M run 3:15 13.17% zfs >> 21278 zfssnap 1 52 2 947M 944M run 5:32 6.91% zfs >> 21881 zfssnap 1 43 2 433M 431M run 2:31 5.41% zfs >> 21852 zfssnap 1 52 2 459M 456M run 2:39 5.16% zfs >> 21266 zfssnap 1 43 2 906M 903M run 5:18 3.95% zfs >> 21757 zfssnap 1 43 2 597M 594M run 3:26 2.91% zfs >> 21274 zfssnap 1 52 2 930M 927M cpu/0 5:27 2.78% zfs >> 22588 zfssnap 1 43 2 30M 27M run 0:08 2.48% zfs >> 22580 zfssnap 1 52 2 49M 46M run 0:14 0.71% zfs >> 22038 root 1 59 0 5312K 3816K cpu/1 0:01 0.10% top >> 22014 root 1 59 0 8020K 4988K sleep 0:00 0.02% sshd > >> >> Average "iostats" are not that busy: >> >> # zpool iostat -Td 5 >> Thu Oct 22 18:24:59 CEST 2015 >> capacity operations bandwidth >> pool alloc free read write read write >> ---------- ----- ----- ----- ----- ----- ----- >> pool 2.52T 207G 802 116 28.3M 840K >> rpool 33.0G 118G 0 4 4.52K 58.7K >> ---------- ----- ----- ----- ----- ----- ----- >> Thu Oct 22 18:25:04 CEST 2015 >> pool 2.52T 207G 0 0 0 0 >> rpool 33.0G 118G 0 10 0 97.9K >> ---------- ----- ----- ----- ----- ----- ----- >> Thu Oct 22 18:25:09 CEST 2015 >> pool 2.52T 207G 0 0 0 0 >> rpool 33.0G 118G 0 0 0 0 >> ---------- ----- ----- ----- ----- ----- ----- >> Thu Oct 22 18:25:14 CEST 2015 >> pool 2.52T 207G 0 0 0 0 >> rpool 33.0G 118G 0 9 0 93.5K >> ---------- ----- ----- ----- ----- ----- ----- >> Thu Oct 22 18:25:19 CEST 2015 >> pool 2.52T 207G 0 0 0 0 >> rpool 33.0G 118G 0 0 0 0 >> ---------- ----- ----- ----- ----- ----- ----- >> Thu Oct 22 18:25:24 
CEST 2015 >> pool 2.52T 207G 0 0 0 0 >> rpool 33.0G 118G 0 0 0 0 >> ---------- ----- ----- ----- ----- ----- ----- >> Thu Oct 22 18:25:29 CEST 2015 >> pool 2.52T 207G 0 0 0 0 >> rpool 33.0G 118G 0 0 0 0 >> ---------- ----- ----- ----- ----- ----- ----- >> Thu Oct 22 18:25:34 CEST 2015 >> pool 2.52T 207G 0 0 0 0 >> rpool 33.0G 118G 0 0 0 0 >> ---------- ----- ----- ----- ----- ----- ----- >> Thu Oct 22 18:25:39 CEST 2015 >> pool 2.52T 207G 0 0 0 0 >> rpool 33.0G 118G 0 16 0 374K >> ---------- ----- ----- ----- ----- ----- ----- >> ... >> Thu Oct 22 18:33:49 CEST 2015 >> pool 2.52T 207G 0 0 0 0 >> rpool 33.0G 118G 0 11 0 94.5K >> ---------- ----- ----- ----- ----- ----- ----- >> Thu Oct 22 18:33:54 CEST 2015 >> pool 2.52T 207G 0 13 819 80.0K >> rpool 33.0G 118G 0 0 0 0 >> ---------- ----- ----- ----- ----- ----- ----- >> Thu Oct 22 18:33:59 CEST 2015 >> pool 2.52T 207G 0 129 0 1.06M >> rpool 33.0G 118G 0 0 0 0 >> ---------- ----- ----- ----- ----- ----- ----- >> Thu Oct 22 18:34:04 CEST 2015 >> pool 2.52T 207G 0 55 0 503K >> rpool 33.0G 118G 0 11 0 97.9K >> ---------- ----- ----- ----- ----- ----- ----- >> ... >> just occasional bursts of work. >> I've now enabled SMART on the disks (2*3Tb mirror "pool" and 1*300Gb >"rpool") and ran some short tests and triggered long tests (hopefully >they'd succeed by tomorrow); current results are: >> >> >> # for D in /dev/rdsk/c0*s0; do echo "===== $D :"; smartctl -d sat,12 >-a $D ; done ; for D in /dev/rdsk/c4*s0 ; do echo "===== $D :"; >smartctl -d scsi -a $D ; done >> ===== /dev/rdsk/c0t3d0s0 : >> smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build) >> Copyright (C) 2002-12, Bruce Allen, Christian Franke, >www.smartmontools.org >> === START OF INFORMATION SECTION === >> Device Model: WDC WD3003FZEX-00Z4SA0 >> Serial Number: WD-WCC5D1KKU0PA >> LU WWN Device Id: 5 0014ee 2610716b7 >> Firmware Version: 01.01A01 >> User Capacity: 3,000,592,982,016 bytes [3.00 TB] >> Sector Sizes: 512 bytes logical, 4096 bytes physical >> Rotation Rate: 7200 rpm >> Device is: Not in smartctl database [for details use: -P >showall] >> ATA Version is: ACS-2 (minor revision not indicated) >> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s) >> Local Time is: Thu Oct 22 18:45:28 2015 CEST >> SMART support is: Available - device has SMART capability. >> SMART support is: Enabled >> === START OF READ SMART DATA SECTION === >> SMART overall-health self-assessment test result: PASSED >> General SMART Values: >> Offline data collection status: (0x82) Offline data collection >activity >> was completed without error. >> Auto Offline Data Collection: >Enabled. >> Self-test execution status: ( 249) Self-test routine in >progress... >> 90% of test remaining. >> Total time to complete Offline >> data collection: (32880) seconds. >> Offline data collection >> capabilities: (0x7b) SMART execute Offline >immediate. >> Auto Offline data collection >on/off support. >> Suspend Offline collection >upon new >> command. >> Offline surface scan >supported. >> Self-test supported. >> Conveyance Self-test >supported. >> Selective Self-test >supported. >> SMART capabilities: (0x0003) Saves SMART data before >entering >> power-saving mode. >> Supports SMART auto save >timer. >> Error logging capability: (0x01) Error logging supported. >> General Purpose Logging >supported. >> Short self-test routine >> recommended polling time: ( 2) minutes. >> Extended self-test routine >> recommended polling time: ( 357) minutes. 
>> Conveyance self-test routine >> recommended polling time: ( 5) minutes. >> SCT capabilities: (0x7035) SCT Status supported. >> SCT Feature Control >supported. >> SCT Data Table supported. >> SMART Attributes Data Structure revision number: 16 >> Vendor Specific SMART Attributes with Thresholds: >> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE >UPDATED WHEN_FAILED RAW_VALUE >> 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail >Always - 0 >> 3 Spin_Up_Time 0x0027 246 154 021 Pre-fail >Always - 6691 >> 4 Start_Stop_Count 0x0032 100 100 000 Old_age >Always - 14 >> 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail >Always - 0 >> 7 Seek_Error_Rate 0x002e 200 200 000 Old_age >Always - 0 >> 9 Power_On_Hours 0x0032 094 094 000 Old_age >Always - 4869 >> 10 Spin_Retry_Count 0x0032 100 253 000 Old_age >Always - 0 >> 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age >Always - 0 >> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age >Always - 14 >> 16 Unknown_Attribute 0x0022 130 070 000 Old_age >Always - 2289651870502 >> 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age >Always - 12 >> 193 Load_Cycle_Count 0x0032 200 200 000 Old_age >Always - 2 >> 194 Temperature_Celsius 0x0022 117 111 000 Old_age >Always - 35 >> 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age >Always - 0 >> 197 Current_Pending_Sector 0x0032 200 200 000 Old_age >Always - 0 >> 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age >Offline - 0 >> 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age >Always - 0 >> 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age >Offline - 0 >> SMART Error Log Version: 1 >> No Errors Logged >> SMART Self-test log structure revision number 1 >> Num Test_Description Status Remaining >LifeTime(hours) LBA_of_first_error >> # 1 Short offline Completed without error 00% 4869 > - >> SMART Selective self-test log data structure revision number 1 >> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS >> 1 0 0 Not_testing >> 2 0 0 Not_testing >> 3 0 0 Not_testing >> 4 0 0 Not_testing >> 5 0 0 Not_testing >> Selective self-test flags (0x0): >> After scanning selected spans, do NOT read-scan remainder of disk. >> If Selective self-test is pending on power-up, resume after 0 minute >delay. >> ===== /dev/rdsk/c0t5d0s0 : >> smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build) >> Copyright (C) 2002-12, Bruce Allen, Christian Franke, >www.smartmontools.org >> === START OF INFORMATION SECTION === >> Model Family: Seagate SV35 >> Device Model: ST3000VX000-1ES166 >> Serial Number: Z500S3L8 >> LU WWN Device Id: 5 000c50 079e3757b >> Firmware Version: CV26 >> User Capacity: 3,000,592,982,016 bytes [3.00 TB] >> Sector Sizes: 512 bytes logical, 4096 bytes physical >> Rotation Rate: 7200 rpm >> Device is: In smartctl database [for details use: -P show] >> ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b >> SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s) >> Local Time is: Thu Oct 22 18:45:28 2015 CEST >> SMART support is: Available - device has SMART capability. >> SMART support is: Enabled >> === START OF READ SMART DATA SECTION === >> SMART overall-health self-assessment test result: PASSED >> General SMART Values: >> Offline data collection status: (0x00) Offline data collection >activity >> was never started. >> Auto Offline Data Collection: >Disabled. >> Self-test execution status: ( 249) Self-test routine in >progress... >> 90% of test remaining. >> Total time to complete Offline >> data collection: ( 80) seconds. 
>> Offline data collection >> capabilities: (0x73) SMART execute Offline >immediate. >> Auto Offline data collection >on/off support. >> Suspend Offline collection >upon new >> command. >> No Offline surface scan >supported. >> Self-test supported. >> Conveyance Self-test >supported. >> Selective Self-test >supported. >> SMART capabilities: (0x0003) Saves SMART data before >entering >> power-saving mode. >> Supports SMART auto save >timer. >> Error logging capability: (0x01) Error logging supported. >> General Purpose Logging >supported. >> Short self-test routine >> recommended polling time: ( 1) minutes. >> Extended self-test routine >> recommended polling time: ( 325) minutes. >> Conveyance self-test routine >> recommended polling time: ( 2) minutes. >> SCT capabilities: (0x10b9) SCT Status supported. >> SCT Error Recovery Control >supported. >> SCT Feature Control >supported. >> SCT Data Table supported. >> SMART Attributes Data Structure revision number: 10 >> Vendor Specific SMART Attributes with Thresholds: >> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE >UPDATED WHEN_FAILED RAW_VALUE >> 1 Raw_Read_Error_Rate 0x000f 105 099 006 Pre-fail >Always - 8600880 >> 3 Spin_Up_Time 0x0003 096 094 000 Pre-fail >Always - 0 >> 4 Start_Stop_Count 0x0032 100 100 020 Old_age >Always - 19 >> 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail >Always - 0 >> 7 Seek_Error_Rate 0x000f 085 060 030 Pre-fail >Always - 342685681 >> 9 Power_On_Hours 0x0032 096 096 000 Old_age >Always - 4214 >> 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail >Always - 0 >> 12 Power_Cycle_Count 0x0032 100 100 020 Old_age >Always - 19 >> 184 End-to-End_Error 0x0032 100 100 099 Old_age >Always - 0 >> 187 Reported_Uncorrect 0x0032 100 100 000 Old_age >Always - 0 >> 188 Command_Timeout 0x0032 100 100 000 Old_age >Always - 0 >> 189 High_Fly_Writes 0x003a 028 028 000 Old_age >Always - 72 >> 190 Airflow_Temperature_Cel 0x0022 069 065 045 Old_age >Always - 31 (Min/Max 29/32) >> 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age >Always - 0 >> 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age >Always - 19 >> 193 Load_Cycle_Count 0x0032 100 100 000 Old_age >Always - 28 >> 194 Temperature_Celsius 0x0022 031 040 000 Old_age >Always - 31 (0 20 0 0 0) >> 197 Current_Pending_Sector 0x0012 100 100 000 Old_age >Always - 0 >> 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age >Offline - 0 >> 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age >Always - 0 >> SMART Error Log Version: 1 >> No Errors Logged >> SMART Self-test log structure revision number 1 >> Num Test_Description Status Remaining >LifeTime(hours) LBA_of_first_error >> # 1 Extended offline Self-test routine in progress 90% 4214 > - >> # 2 Short offline Completed without error 00% 4214 > - >> SMART Selective self-test log data structure revision number 1 >> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS >> 1 0 0 Not_testing >> 2 0 0 Not_testing >> 3 0 0 Not_testing >> 4 0 0 Not_testing >> 5 0 0 Not_testing >> Selective self-test flags (0x0): >> After scanning selected spans, do NOT read-scan remainder of disk. >> If Selective self-test is pending on power-up, resume after 0 minute >delay. 
>> ===== /dev/rdsk/c4t5000CCA02A1292DDd0s0 : >> smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build) >> Copyright (C) 2002-12, Bruce Allen, Christian Franke, >www.smartmontools.org >> Vendor: HITACHI >> Product: HUS156030VLS600 >> Revision: HPH1 >> User Capacity: 300,000,000,000 bytes [300 GB] >> Logical block size: 512 bytes >> Logical Unit id: 0x5000cca02a1292dc >> Serial number: LVVA6NHS >> Device type: disk >> Transport protocol: SAS >> Local Time is: Thu Oct 22 18:45:29 2015 CEST >> Device supports SMART and is Enabled >> Temperature Warning Enabled >> SMART Health Status: OK >> Current Drive Temperature: 45 C >> Drive Trip Temperature: 70 C >> Manufactured in week 14 of year 2012 >> Specified cycle count over device lifetime: 50000 >> Accumulated start-stop cycles: 80 >> Elements in grown defect list: 0 >> Vendor (Seagate) cache information >> Blocks sent to initiator = 2340336504406016 >> Error counter log: >> Errors Corrected by Total Correction >Gigabytes Total >> ECC rereads/ errors algorithm >processed uncorrected >> fast | delayed rewrites corrected invocations [10^9 >bytes] errors >> read: 0 888890 0 888890 0 >29326.957 0 >> write: 0 961315 0 961315 0 >6277.560 0 >> Non-medium error count: 283 >> SMART Self-test log >> Num Test Status segment LifeTime >LBA_first_err [SK ASC ASQ] >> Description number (hours) >> # 1 Background long Self test in progress ... - NOW > - [- - -] >> # 2 Background long Aborted (device reset ?) - 14354 > - [- - -] >> # 3 Background short Completed - 14354 > - [- - -] >> # 4 Background long Aborted (device reset ?) - 14354 > - [- - -] >> # 5 Background long Aborted (device reset ?) - 14354 > - [- - -] >> Long (extended) Self Test duration: 2506 seconds [41.8 minutes] > >> >> The zpool scrub results and general layout: >> >> # zpool status -v >> pool: pool >> state: ONLINE >> scan: scrub repaired 0 in 164h13m with 0 errors on Thu Oct 22 >18:13:33 2015 >> config: >> NAME STATE READ WRITE CKSUM >> pool ONLINE 0 0 0 >> mirror-0 ONLINE 0 0 0 >> c0t3d0 ONLINE 0 0 0 >> c0t5d0 ONLINE 0 0 0 >> logs >> c4t5000CCA02A1292DDd0p2 ONLINE 0 0 0 >> cache >> c4t5000CCA02A1292DDd0p3 ONLINE 0 0 0 >> errors: No known data errors >> pool: rpool >> state: ONLINE >> status: Some supported features are not enabled on the pool. The pool >can >> still be used, but some features are unavailable. >> action: Enable all features using 'zpool upgrade'. Once this is done, >> the pool may no longer be accessible by software that does >not support >> the features. See zpool-features(5) for details. >> scan: scrub repaired 0 in 3h3m with 0 errors on Thu Oct 8 04:12:35 >2015 >> config: >> NAME STATE READ WRITE CKSUM >> rpool ONLINE 0 0 0 >> c4t5000CCA02A1292DDd0s0 ONLINE 0 0 0 >> errors: No known data errors > >> # zpool list -v >> NAME SIZE ALLOC FREE EXPANDSZ FRAG >CAP DEDUP HEALTH ALTROOT >> pool 2.72T 2.52T 207G - 68% >92% 1.36x ONLINE / >> mirror 2.72T 2.52T 207G - 68% >92% >> c0t3d0 - - - - - >- >> c0t5d0 - - - - - >- >> log - - - - - >- >> c4t5000CCA02A1292DDd0p2 8G 148K 8.00G - 0% >0% >> cache - - - - - >- >> c4t5000CCA02A1292DDd0p3 120G 1.80G 118G - 0% >1% >> rpool 151G 33.0G 118G - 76% >21% 1.00x ONLINE - >> c4t5000CCA02A1292DDd0s0 151G 33.0G 118G - 76% >21% > >> Note the long scrub time may have included the downtime while the >system was frozen until it was rebooted. 
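>>
>> Next time it wedges I'd like to know whether it is kernel or userland memory
>> that vanishes, so I'll try to keep a log running beforehand. A rough sketch
>> (::memstat is an mdb -k dcmd; this assumes the box stays responsive enough
>> to keep appending, and the log path is just an example):
>>
>> # while sleep 60; do date; echo ::memstat | mdb -k; done > /var/tmp/memstat.log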
>> >> Thanks in advance for the fresh pairs of eyeballs, >> Jim Klimov >> _______________________________________________ >> OmniOS-discuss mailing list >> OmniOS-discuss at lists.omniti.com >> http://lists.omniti.com/mailman/listinfo/omnios-discuss Mail apparently got bounced, reposting... -- Typos courtesy of K-9 Mail on my Samsung Android From bhildebrandt at exegy.com Fri Oct 23 16:57:55 2015 From: bhildebrandt at exegy.com (Hildebrandt, Bill) Date: Fri, 23 Oct 2015 16:57:55 +0000 Subject: [OmniOS-discuss] OmniOS backup box hanging regularly In-Reply-To: <74C4EFF5-B1D7-49DE-81D5-54CE8787C258@omniti.com> References: <74C4EFF5-B1D7-49DE-81D5-54CE8787C258@omniti.com> Message-ID: It's a 4U direct-attached backplane . . . not an expander. Each drive requires its own connection. The zdb -bbccsv of my exported pool just finished, without error this time. The dump is still available if you would like to review it. -Bill -----Original Message----- From: Dan McDonald [mailto:danmcd at omniti.com] Sent: Friday, October 23, 2015 10:24 AM To: Hildebrandt, Bill Cc: Yavor Tomov; jimklimov at cos.ru; OmniOS-discuss; Dan McDonald Subject: Re: [OmniOS-discuss] OmniOS backup box hanging regularly > On Oct 23, 2015, at 12:42 AM, Hildebrandt, Bill wrote: > > Controllers: 2x LSI 9201-16i with SAS-SATA break-out cables going to a SAS846TQ backplane > System disk: mirrored Innodisk 128GB SATADOMs (SATADOM-MV 3ME) off mobo > ZIL: mirrored Samsung SSD 845D (400GB) off mobo > L2ARC: SAMSUNG MZHPU512 (512GB PCIe) > NICs: Intel 10G (X520) > Data disks: 24x WD 1TB Enterprise SATA (WDC WD1002FBYS) SATA disks attached to a SAS expander (this is the SAS846TQ, an expander, right?) are known to be a dangerous deployment. SATA doesn't have the reporting capabilities SAS does, and many SAS expanders don't relay SATA well enough. We do not support paying customers who deploy like this. FYI, Dan From lotheac at iki.fi Fri Oct 23 17:10:47 2015 From: lotheac at iki.fi (Lauri Tirkkonen) Date: Fri, 23 Oct 2015 20:10:47 +0300 Subject: [OmniOS-discuss] OmniOS backup box hanging regularly In-Reply-To: <02DF3A33-F955-4F86-A478-0D639CB500F1@cos.ru> References: <02DF3A33-F955-4F86-A478-0D639CB500F1@cos.ru> Message-ID: <20151023171047.GA25370@gutsman.lotheac.fi> On Fri, Oct 23 2015 18:54:27 +0200, Jim Klimov wrote: > On 23 October 2015 11:23:28 CEST, Jim Klimov wrote: > >So at the moment it seems there is some issue with zfs-auto-snapshots > >on OmniOS that I haven't seen in SXCE, OI nor Hipster. Possibly I had > >different implementations of the service in these different OSes (shell > >vs python, and all that at different versions). I highly recommend znapzend for automatic snapshotting. In the past we used an implementation called zfs-auto-snapshot (I wonder how many there are :), but it was doing dumb things like listing all existing snapshots quite frequently, and that ate up memory.
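For what it's worth, znapzend keeps its whole plan in ZFS properties on the source dataset, so there is no cron entry or state file to maintain. A setup along these lines (syntax quoted from memory, so double-check it against the docs at the site below; 'pool' here just stands for your data pool) would keep hourly snapshots for a week, 4-hourly ones for a month and dailies for 90 days, pruning old ones automatically:

# znapzendzetup create --recursive SRC '7d=>1h,30d=>4h,90d=>1d' pool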
http://www.znapzend.org/ -- Lauri Tirkkonen | lotheac @ IRCnet From carlb at flamewarestudios.com Fri Oct 23 18:01:04 2015 From: carlb at flamewarestudios.com (Carl Brunning) Date: Fri, 23 Oct 2015 18:01:04 +0000 Subject: [OmniOS-discuss] OmniOS backup box hanging regularly In-Reply-To: <74C4EFF5-B1D7-49DE-81D5-54CE8787C258@omniti.com> References: <74C4EFF5-B1D7-49DE-81D5-54CE8787C258@omniti.com> Message-ID: Hey, just so you know, this backplane is not an expander but a direct connection, so SATA should be fine. -----Original Message----- From: OmniOS-discuss [mailto:omnios-discuss-bounces at lists.omniti.com] On Behalf Of Dan McDonald Sent: 23 October 2015 16:24 To: Hildebrandt, Bill Cc: OmniOS-discuss Subject: Re: [OmniOS-discuss] OmniOS backup box hanging regularly > On Oct 23, 2015, at 12:42 AM, Hildebrandt, Bill wrote: > > Controllers: 2x LSI 9201-16i with SAS-SATA break-out cables going to a SAS846TQ backplane > System disk: mirrored Innodisk 128GB SATADOMs (SATADOM-MV 3ME) off mobo > ZIL: mirrored Samsung SSD 845D (400GB) off mobo > L2ARC: SAMSUNG MZHPU512 (512GB PCIe) > NICs: Intel 10G (X520) > Data disks: 24x WD 1TB Enterprise SATA (WDC WD1002FBYS) SATA disks attached to a SAS expander (this is the SAS846TQ, an expander, right?) are known to be a dangerous deployment. SATA doesn't have the reporting capabilities SAS does, and many SAS expanders don't relay SATA well enough. We do not support paying customers who deploy like this. FYI, Dan _______________________________________________ OmniOS-discuss mailing list OmniOS-discuss at lists.omniti.com http://lists.omniti.com/mailman/listinfo/omnios-discuss From nagele at wildbit.com Fri Oct 23 20:03:59 2015 From: nagele at wildbit.com (Chris Nagele) Date: Fri, 23 Oct 2015 16:03:59 -0400 Subject: [OmniOS-discuss] Pool degraded after inserting disk Message-ID: Hi all. I had an issue come up today when we added some new disks to a server. After physically inserting the disk, the pool immediately degraded and any zfs commands would hang. We rebooted and the pool was completely fine. Is this a common issue? I've never run into it before. To clarify, these servers are running on the X9DRD-7LN4F-JBOD board with SATA SSDs attached directly to the onboard controller. Chris From danmcd at omniti.com Mon Oct 26 15:17:46 2015 From: danmcd at omniti.com (Dan McDonald) Date: Mon, 26 Oct 2015 11:17:46 -0400 Subject: [OmniOS-discuss] HEADS UP: r151006 EOSL on March 31st, 2016 Message-ID: <0D2ECE15-BC7F-4246-B41E-C595712058F5@omniti.com> With r151014, the new LTS, in the field for over six months now, and r151016, the next stable, arriving within the next two weeks, it is time to start the countdown timer on support for the old LTS, r151006. March 31st, 2016 is the earliest date the following stable, r151018, will be released. Coincident with that is the end of service life (EOSL) for r151006, the old LTS. If you are still on r151006, PLEASE start your migration to r151014 NOW. Until then, r151006 will continue to receive security updates as before; after March 31st, it will not. The release cadence listed here: http://omnios.omniti.com/wiki.php/ReleaseCycle documents this, and has for some time.
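For most boxes the mechanics of the move are just repointing the omnios publisher at the new release repo and updating into a fresh boot environment. Roughly (this is a sketch from memory - follow the upgrade notes on the wiki if they differ, and substitute your own mirror if you use one):

# pkg set-publisher -G http://pkg.omniti.com/omnios/r151006/ \
    -g http://pkg.omniti.com/omnios/r151014/ omnios
# pkg update

Then boot into the new BE and verify things work before destroying the old one.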
Thank you, Dan McDonald -- OmniOS Engineering From skeltonr at btconnect.com Mon Oct 26 20:46:34 2015 From: skeltonr at btconnect.com (Richard Skelton) Date: Mon, 26 Oct 2015 20:46:34 +0000 Subject: [OmniOS-discuss] pkgrecv -s http://pkg.omniti.com/omnios/r151014/ -d /tank/repo '*' seems to be broken Message-ID: <562E912A.80808@btconnect.com> Hi, I am trying to make a local copy of the stable repo :- pkgrecv -s http://pkg.omniti.com/omnios/r151014/ -d /tank/repo '*' but it fails :-( root at hp:/root/fio-2.1.10# pkgrecv -s http://pkg.omniti.com/omnios/r151014/ -d /tank/repo '*' Processing packages for publisher omnios ... Retrieving and evaluating 6161 package(s)... Download Manifests ( 907/6161) -pkgrecv: http protocol error: code: 404 reason: Not Found URL: 'http://pkg.omniti.com/omnios/r151014/omnios/manifest/0/developer%2Fillumos-tools at 11%2C5.11-0.151014%3A20151016T122410Z' (happened 4 times) root at hp:/root/fio-2.1.10# Cheers Richard From danmcd at omniti.com Mon Oct 26 21:14:23 2015 From: danmcd at omniti.com (Dan McDonald) Date: Mon, 26 Oct 2015 17:14:23 -0400 Subject: [OmniOS-discuss] pkgrecv -s http://pkg.omniti.com/omnios/r151014/ -d /tank/repo '*' seems to be broken In-Reply-To: <562E912A.80808@btconnect.com> References: <562E912A.80808@btconnect.com> Message-ID: <43AF2ECC-D806-47BD-B313-23DAFE7F4563@omniti.com> > On Oct 26, 2015, at 4:46 PM, Richard Skelton wrote: > > Hi, > I am trying to make a local copy of the stable repo :- > pkgrecv -s http://pkg.omniti.com/omnios/r151014/ -d /tank/repo '*' > > but it fails :-( Two things. 1.) Assuming you're ON r151014 already, use "-m latest" for less transfers, unless you REALLY WANT all of the historical r151014 packages. pkgrecv -s http://pkg.omniti.com/omnios/r151014/ -d /tank/repo -m latest '*' 2.) I just rebuilt r151014's repo index. Please try again (with the -m latest to prevent extra transfers if needed). Dan From lotheac at iki.fi Tue Oct 27 09:56:42 2015 From: lotheac at iki.fi (Lauri Tirkkonen) Date: Tue, 27 Oct 2015 11:56:42 +0200 Subject: [OmniOS-discuss] OmniOS backup box hanging regularly In-Reply-To: <54A6F787-8791-4C94-85AB-BB3615077387@cos.ru> References: <02DF3A33-F955-4F86-A478-0D639CB500F1@cos.ru> <20151023171047.GA25370@gutsman.lotheac.fi> <54A6F787-8791-4C94-85AB-BB3615077387@cos.ru> Message-ID: <20151027095642.GA13407@gutsman.lotheac.fi> On Tue, Oct 27 2015 09:49:40 +0100, Jim Klimov wrote: > So far I use a mix of 'standard' time-slider and additionally my script that kills oldest snapshot groups (chosen by pattern of automatic snaps) to keep a specified watermark of free space. Yeah, we were previously using zfs-auto-snap from OpenSolaris before it became time-slider (with one or two local patches). > Something in this simple activity is enough to bring the box down into swapping until the deadman knocks to interrupt the infinite loop looking for a free page, and I've got a screenshot to prove this theory ;) In your previous mail you have a 'top' listing with way too many 'zfs' processes owned by zfssnap, and all are hundreds of megabytes in RSS. That sounds like a problem. IIRC, one problematic configuration that caused issues like this was a single filesystem setting a zfs-auto-snapshot property locally in a large tree where it also inherited it from the parent. My memory on this is a bit hazy though. > I wonder why the offending process doesn't die on some failed malloc... Good question.
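If you want to rule that configuration in or out, listing datasets that set the property locally rather than inheriting it should be enough. Assuming your scripts use the same com.sun:auto-snapshot property (and its per-schedule variants) that the OpenSolaris service did, something like:

# zfs get -r -s local -o name,property,value com.sun:auto-snapshot pool

Any dataset listed sets the property locally; if its parent also sets it, that's the doubled-up pattern I remember causing trouble.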
-- Lauri Tirkkonen | lotheac @ IRCnet From jimklimov at cos.ru Tue Oct 27 11:05:31 2015 From: jimklimov at cos.ru (Jim Klimov) Date: Tue, 27 Oct 2015 12:05:31 +0100 Subject: [OmniOS-discuss] OmniOS backup box hanging regularly In-Reply-To: <54A6F787-8791-4C94-85AB-BB3615077387@cos.ru> References: <02DF3A33-F955-4F86-A478-0D639CB500F1@cos.ru> <20151023171047.GA25370@gutsman.lotheac.fi> <54A6F787-8791-4C94-85AB-BB3615077387@cos.ru> Message-ID: <31B5A10C-1A68-4FC4-82EF-A887518349B8@cos.ru> On 27 October 2015 9:49:40 CET, Jim Klimov wrote: >On 23 October 2015 19:10:47 CEST, Lauri Tirkkonen wrote: >>On Fri, Oct 23 2015 18:54:27 +0200, Jim Klimov wrote: >>> On 23 October 2015 11:23:28 CEST, Jim Klimov wrote: >>> >So at the moment it seems there is some issue with >>zfs-auto-snapshots >>> >on OmniOS that I haven't seen in SXCE, OI nor Hipster. Possibly I >>had >>> >different implementations of the service in these different OSes >>(shell >>> >vs python, and all that at different versions). >> >>I highly recommend znapzend for automatic snapshotting. In the past we used >>an implementation called zfs-auto-snapshot (I wonder how many there are :), >>but it was doing dumb things like listing all existing >>snapshots quite frequently, and that ate up memory. >>http://www.znapzend.org/ >> >>-- >>Lauri Tirkkonen | lotheac @ IRCnet >>_______________________________________________ >>OmniOS-discuss mailing list >>OmniOS-discuss at lists.omniti.com >>http://lists.omniti.com/mailman/listinfo/omnios-discuss > >Thanks, I'll try to take a look whenever I have time ;) > >So far I use a mix of 'standard' time-slider and additionally my script >that kills oldest snapshot groups (chosen by pattern of automatic >snaps) to keep a specified watermark of free space. > >Something in this simple activity is enough to bring the box down into >swapping until the deadman knocks to interrupt the infinite loop >looking for a free page, and I've got a screenshot to prove this theory >;) > >I wonder why the offending process doesn't die on some failed malloc... > >Jim >-- >Typos courtesy of K-9 Mail on my Samsung Android Heh, in fact this OmniOS installation does not offer a time-slider, but rather the ksh93-based scripts for 'zfs/autosnapshot'. Now gotta verify what I run elsewhere ;) -- Typos courtesy of K-9 Mail on my Samsung Android From lotheac at iki.fi Tue Oct 27 11:07:56 2015 From: lotheac at iki.fi (Lauri Tirkkonen) Date: Tue, 27 Oct 2015 13:07:56 +0200 Subject: [OmniOS-discuss] OmniOS backup box hanging regularly In-Reply-To: <31B5A10C-1A68-4FC4-82EF-A887518349B8@cos.ru> References: <02DF3A33-F955-4F86-A478-0D639CB500F1@cos.ru> <20151023171047.GA25370@gutsman.lotheac.fi> <54A6F787-8791-4C94-85AB-BB3615077387@cos.ru> <31B5A10C-1A68-4FC4-82EF-A887518349B8@cos.ru> Message-ID: <20151027110756.GB13407@gutsman.lotheac.fi> On Tue, Oct 27 2015 12:05:31 +0100, Jim Klimov wrote: > Heh, in fact this OmniOS installation does not offer a time-slider, but rather the ksh93-based scripts for 'zfs/autosnapshot'. Now gotta verify what I run elsewhere ;) I think OmniOS ships neither time-slider nor zfs-auto-snapshot.
When we used it I had to dig it out from somewhere in the interwebs and package it :) -- Lauri Tirkkonen | lotheac @ IRCnet From gmason at msu.edu Tue Oct 27 12:37:25 2015 From: gmason at msu.edu (Greg Mason) Date: Tue, 27 Oct 2015 08:37:25 -0400 Subject: [OmniOS-discuss] OmniOS backup box hanging regularly In-Reply-To: <20151027110756.GB13407@gutsman.lotheac.fi> References: <02DF3A33-F955-4F86-A478-0D639CB500F1@cos.ru> <20151023171047.GA25370@gutsman.lotheac.fi> <54A6F787-8791-4C94-85AB-BB3615077387@cos.ru> <31B5A10C-1A68-4FC4-82EF-A887518349B8@cos.ru> <20151027110756.GB13407@gutsman.lotheac.fi> Message-ID: <778F99AB-C38F-489A-ACFA-2A031EEC80D2@msu.edu> We've been using this, fired off by cron: https://github.com/MSU-iCER/puppet-zfs-auto-snapshot/blob/master/files/zfs-auto-snapshot.pl We manage it via puppet. It's a bit of an older puppet module, but should still work. -Greg > On Oct 27, 2015, at 7:07 AM, Lauri Tirkkonen wrote: > > On Tue, Oct 27 2015 12:05:31 +0100, Jim Klimov wrote: >> Heh, in fact this OmniOS installation does not offer a time-slider, but rather the ksh93-based scripts for 'zfs/autosnapshot'. Now gotta verify what I run elsewhere ;) > > I think OmniOS ships neither time-slider nor zfs-auto-snapshot. When we > used it I had to dig it out from somewhere in the interwebs and package > it :) > > -- > Lauri Tirkkonen | lotheac @ IRCnet > _______________________________________________ > OmniOS-discuss mailing list > OmniOS-discuss at lists.omniti.com > http://lists.omniti.com/mailman/listinfo/omnios-discuss From skeltonr at btconnect.com Tue Oct 27 17:03:34 2015 From: skeltonr at btconnect.com (Richard Skelton) Date: Tue, 27 Oct 2015 17:03:34 +0000 Subject: [OmniOS-discuss] pkgrecv -s http://pkg.omniti.com/omnios/r151014/ -d /tank/repo '*' seems to be broken In-Reply-To: <43AF2ECC-D806-47BD-B313-23DAFE7F4563@omniti.com> References: <562E912A.80808@btconnect.com> <43AF2ECC-D806-47BD-B313-23DAFE7F4563@omniti.com> Message-ID: <562FAE66.8080704@btconnect.com> Hi Dan, Now I get :- root at hp:/root/fio-2.1.10# pkgrecv -s http://pkg.omniti.com/omnios/r151014/ -d /tank/repo '*' Processing packages for publisher omnios ... Retrieving and evaluating 6160 package(s)...
PROCESS ITEMS GET (MB) SEND (MB) developer/gcc48 0/3464 43/1853 0/5108pkgrecv: 1: Framework error: code: 18 reason: transfer closed with 26579845 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/81dded17f29bf94296d580fe3d197f1c650a7f98' 2: Framework error: code: 18 reason: transfer closed with 24763267 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/399f01f216d9bf29a3549821df79572a9e120401' 3: Framework error: code: 18 reason: transfer closed with 26681841 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/4c40e09a64075f6b145cc747fdd5cbc621984d16' 4: Framework error: code: 18 reason: transfer closed with 22826291 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/d371f83d2055ab737eaacb0c70f3b8e958f896ac' 5: Framework error: code: 18 reason: transfer closed with 20496851 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/399f01f216d9bf29a3549821df79572a9e120401' 6: Framework error: code: 18 reason: transfer closed with 20129541 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/81dded17f29bf94296d580fe3d197f1c650a7f98' 7: Framework error: code: 18 reason: transfer closed with 22444385 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/4c40e09a64075f6b145cc747fdd5cbc621984d16' 8: Framework error: code: 18 reason: transfer closed with 21467763 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/d371f83d2055ab737eaacb0c70f3b8e958f896ac' 9: Framework error: code: 18 reason: transfer closed with 21192525 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/81dded17f29bf94296d580fe3d197f1c650a7f98' 10: Framework error: code: 18 reason: transfer closed with 15680195 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/399f01f216d9bf29a3549821df79572a9e120401' 11: Framework error: code: 18 reason: transfer closed with 21010865 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/4c40e09a64075f6b145cc747fdd5cbc621984d16' 12: Framework error: code: 18 reason: transfer closed with 19564821 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/81dded17f29bf94296d580fe3d197f1c650a7f98' 13: Framework error: code: 18 reason: transfer closed with 21120761 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/4c40e09a64075f6b145cc747fdd5cbc621984d16' 14: Framework error: code: 18 reason: transfer closed with 19370755 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/d371f83d2055ab737eaacb0c70f3b8e958f896ac' 15: Framework error: code: 18 reason: transfer closed with 15824995 bytes remaining to read URL: 'http://pkg.omniti.com/omnios/r151014/omnios/file/1/399f01f216d9bf29a3549821df79572a9e120401' pkgrecv: Cached files were preserved in the following directory: /var/tmp/pkgrecv-BpcFZb Use pkgrecv -c to resume the interrupted download. root at hp:/root/fio-2.1.10# Dan McDonald wrote: >> On Oct 26, 2015, at 4:46 PM, Richard Skelton wrote: >> >> Hi, >> I am trying to make a local copy of the stable repo :- >> pkgrecv -s http://pkg.omniti.com/omnios/r151014/ -d /tank/repo '*' seems >> to be broken >> >> but it fails :-( >> > > Two things. > > 1.) Assuming you're ON r151014 already, use "-m latest" for less transfers, unless you REALLY WANT all of the historical r151014 packages. 
> > pkgrecv -s http://pkg.omniti.com/omnios/r151014/ -d /tank/repo -m latest '*' > 2.) I just rebuilt r151014's repo index. Please try again (with the -m latest to prevent extra transfers if needed). > > > Dan > > From danmcd at omniti.com Tue Oct 27 17:47:26 2015 From: danmcd at omniti.com (Dan McDonald) Date: Tue, 27 Oct 2015 13:47:26 -0400 Subject: [OmniOS-discuss] pkgrecv -s http://pkg.omniti.com/omnios/r151014/ -d /tank/repo '*' seems to be broken In-Reply-To: <562FAE66.8080704@btconnect.com> References: <562E912A.80808@btconnect.com> <43AF2ECC-D806-47BD-B313-23DAFE7F4563@omniti.com> <562FAE66.8080704@btconnect.com> Message-ID: Try -m latest... it could just be the sheer number of packages you're transferring. We still use the tiny CherryPy webserver at the repo-box end. Dan From danmcd at omniti.com Thu Oct 29 18:03:07 2015 From: danmcd at omniti.com (Dan McDonald) Date: Thu, 29 Oct 2015 14:03:07 -0400 Subject: [OmniOS-discuss] Small update --> UNZIP Message-ID: <3F49B4C5-3871-4496-B9F9-A803C60BA6A7@omniti.com> Testers found UNZIP 6.0 didn't handle certain fuzz situations as well as it should've. To that end, r151006 and r151014 have been updated with new compression/unzip packages. Please "pkg update". Thanks, Dan From danmcd at omniti.com Fri Oct 30 19:26:33 2015 From: danmcd at omniti.com (Dan McDonald) Date: Fri, 30 Oct 2015 15:26:33 -0400 Subject: [OmniOS-discuss] Fwd: [discuss] HEADS UP: Java kerberos GUI (gkadmin) gone References: Message-ID: <5F829969-C9C6-475F-A256-25E213A4B349@omniti.com> THIS DOES NOT AFFECT THE UPCOMING r151016, but it will affect 018 and beyond. Does anyone in the audience use the Kerberos Java GUI to administer Kerberos? If so, consider this an EOL warning for spring of 2016 and r151018. Thanks, Dan > Begin forwarded message: > > From: "Garrett D'Amore" > Date: October 30, 2015 at 3:24:32 PM EDT > To: "discuss at lists.illumos.org" > Subject: [discuss] HEADS UP: Java kerberos GUI (gkadmin) gone > > FYI, with the push I just did on behalf of Dan McDonald and Josef Sipek (who actually took most of this from work I did ages ago in illumos-core), the Java-based GUI for administering Kerberos is gone from illumos-gate. > > It's unclear if anyone was able to use gkadmin successfully recently, or has been using it. > > As distributions pick this up, you may notice its absence. You can still administer Kerberos using the command line interface, kadmin. > > The push in question is this: > > commit dd3293375033eaa6f009722670ffa191b992ffd9 > Author: Garrett D'Amore > Date: Thu Oct 29 12:33:18 2015 -0400 > > 6407 kerberos Java GUI should go away > Portions contributed by: Josef 'Jeff' Sipek > Reviewed by: Josef 'Jeff' Sipek > Reviewed by: Toomas Soome > Reviewed by: Andy Stormont > Reviewed by: Albert Lee > Reviewed by: Peter Tribble > Reviewed by: Richard PALO > Approved by: Dan McDonald > > - Garrett
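For anyone who has only ever used the GUI: kadmin covers the same operations interactively. A rough example session (the admin principal name here is made up, and the command spellings are the MIT-derived ones - check kadmin(1M) for the authoritative list):

$ kadmin -p admin/admin
kadmin: list_principals
kadmin: addprinc someuser
kadmin: quit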