[OmniOS-discuss] OmniOS backup box hanging regularly
Yavor Tomov
yavoritomov at gmail.com
Thu Oct 22 17:36:56 UTC 2015
Hi Tovarishch Jim,
I had a similar issue with my box, and it was related to NFS locks. I
assume you are using NFS for the Linux backups. The solution was posted
by Chip on the mailing list; a copy of his solution is below:
"I've seen issues like this when you run out of NFS locks. NFSv3 in
Illumos is really slow at releasing locks.
On all my NFS servers I do:
sharectl set -p lockd_listen_backlog=256 nfs
sharectl set -p lockd_servers=2048 nfs
Everywhere I can, I use NFSv4 instead of v3. It handles locks much better."
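
In case it helps, here is a rough sketch of how I check and apply those
settings on the server, plus forcing v4 from a Linux client (the export
path and mount point below are just placeholders - adjust to your setup):

# check the current lockd tunables before changing them
sharectl get -p lockd_listen_backlog nfs
sharectl get -p lockd_servers nfs

# apply the larger values from Chip's suggestion
sharectl set -p lockd_listen_backlog=256 nfs
sharectl set -p lockd_servers=2048 nfs

# on a Linux client, request NFSv4 explicitly at mount time
mount -t nfs -o vers=4 backupserver:/pool/backups /mnt/backups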
All the Best
Yavor
On Thu, Oct 22, 2015 at 11:59 AM, Jim Klimov <jim at cos.ru> wrote:
> Hello all,
>
> I have this HP Z400 workstation with 16 GB of (supposedly) ECC RAM running
> OmniOS bloody, which acts as a backup server for our production systems
> (regularly rsync'ing large files off Linux boxes and rotating ZFS
> auto-snapshots to keep its space free). It sometimes also runs replicas of
> infrastructure services (DHCP, DNS) and was set up as a VirtualBox +
> phpVirtualBox host to test that out, but no VMs are currently running.
>
> So the essential loads are ZFS snapshots and ZFS scrubs :)
>
> And it freezes roughly every week. It stops responding to ping and to login
> attempts via SSH or the physical console - it processes keypresses on the
> latter, but does not present a login prompt. It used to be stable; these
> regular hangs began around summertime.
>
>
>
> My primary guess would be flaky disks, maybe timing out under load or
> going to sleep or some such... But I have yet to prove that, or any other
> theory. Maybe the CPU is simply overheating under the regular near-100% load
> combined with disk I/O... At the very least I want to rule out OS errors, and
> rule out (or pinpoint) operator/box errors as much as possible - those being
> something I can actually change to try and fix ;)
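>
> (For the flaky-disk theory, the per-device error counters and FMA telemetry
> are also worth a look - a quick sketch using standard illumos tools:)
>
> # soft/hard/transport error counters per device since boot
> iostat -En
> # kernel error reports (timeouts, resets, retries land here)
> fmdump -e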
>
> Before I proceed to the TL;DR screenshots, here is an overview of what I see:
>
> * In the "top" output, processes owned by zfssnap lead most of the time...
> But even the SSH shell is noticeably slow to respond (1 sec per line when
> just pressing enter to clear the screen to prepare nice screenshots).
>
> * SMART was not enabled on the 3TB mirrored "pool" SATA disks (it is now,
> and long tests have been initiated - see the command sketch after this
> list), but it was in place on the "rpool" SAS disk, where it logged some
> corrected ECC errors - but none uncorrected.
>
> Maybe the cabling should be reseated.
>
> * iostat shows the disks are generally not busy (they don't audibly rattle
> or visibly blink all the time, either)
>
> * zpool scrubs return clean
>
> * partitions of the system rpool disk (10K RPM SAS) are used as log and
> cache devices for the main data pool on the 3TB SATA disks. The system disk
> is fast and underutilized, so why not ;) This was not a problem during the
> first year of this system's honest and stable workouts. These devices are
> pretty empty at the moment.
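>
> Roughly how I enabled SMART and kicked off the long tests (a sketch - the
> -d flags match the smartctl invocations shown further below):
>
> # enable SMART and start a long self-test on the SATA mirror members
> for D in /dev/rdsk/c0t3d0s0 /dev/rdsk/c0t5d0s0; do
>   smartctl -d sat,12 -s on $D
>   smartctl -d sat,12 -t long $D
> done
> # and on the SAS rpool disk
> smartctl -d scsi -s on /dev/rdsk/c4t5000CCA02A1292DDd0s0
> smartctl -d scsi -t long /dev/rdsk/c4t5000CCA02A1292DDd0s0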
>
>
>
> I have enabled deadman panics according to the Wiki, but none have happened
> so far:
>
> # cat /etc/system | egrep -v '(^\*|^$)'
> set snooping=1
> set pcplusmp:apic_panic_on_nmi=1
> set apix:apic_panic_on_nmi = 1
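>
> To double-check that these took effect after reboot, the live kernel values
> can be queried with mdb (a sketch; if I recall correctly, snoop_interval is
> the deadman interval in microseconds, and apic_panic_on_nmi lives in the
> apix/pcplusmp module):
>
> echo "snooping/D" | mdb -k
> echo "snoop_interval/D" | mdb -k
> echo 'apix`apic_panic_on_nmi/D' | mdb -k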
>
>
>
>
>
> In the "top" output, processes owned by zfssnap lead most of the time:
>
>
>
> last pid: 22599;  load avg: 12.9, 12.2, 11.2;  up 0+09:52:11        18:34:41
> 140 processes: 125 sleeping, 13 running, 2 on cpu
> CPU states: 0.0% idle, 22.9% user, 77.1% kernel, 0.0% iowait, 0.0% swap
> Memory: 16G phys mem, 1765M free mem, 2048M total swap, 2048M free swap
> Seconds to delay:
> PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
> 21389 zfssnap 1 43 2 863M 860M run 5:04 35.61% zfs
> 22360 zfssnap 1 52 2 118M 115M run 0:37 16.50% zfs
> 21778 zfssnap 1 52 2 563M 560M run 3:15 13.17% zfs
> 21278 zfssnap 1 52 2 947M 944M run 5:32 6.91% zfs
> 21881 zfssnap 1 43 2 433M 431M run 2:31 5.41% zfs
> 21852 zfssnap 1 52 2 459M 456M run 2:39 5.16% zfs
> 21266 zfssnap 1 43 2 906M 903M run 5:18 3.95% zfs
> 21757 zfssnap 1 43 2 597M 594M run 3:26 2.91% zfs
> 21274 zfssnap 1 52 2 930M 927M cpu/0 5:27 2.78% zfs
> 22588 zfssnap 1 43 2 30M 27M run 0:08 2.48% zfs
> 22580 zfssnap 1 52 2 49M 46M run 0:14 0.71% zfs
> 22038 root 1 59 0 5312K 3816K cpu/1 0:01 0.10% top
> 22014 root 1 59 0 8020K 4988K sleep 0:00 0.02% sshd
>
>
>
> Average "iostats" are not that busy:
>
>
>
> # zpool iostat -Td 5
> Thu Oct 22 18:24:59 CEST 2015
> capacity operations bandwidth
> pool alloc free read write read write
> ---------- ----- ----- ----- ----- ----- -----
> pool 2.52T 207G 802 116 28.3M 840K
> rpool 33.0G 118G 0 4 4.52K 58.7K
> ---------- ----- ----- ----- ----- ----- -----
>
> Thu Oct 22 18:25:04 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 10 0 97.9K
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:09 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:14 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 9 0 93.5K
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:19 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:24 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:29 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:34 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:39 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 16 0 374K
> ---------- ----- ----- ----- ----- ----- -----
> ...
>
> Thu Oct 22 18:33:49 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 11 0 94.5K
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:33:54 CEST 2015
> pool 2.52T 207G 0 13 819 80.0K
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:33:59 CEST 2015
> pool 2.52T 207G 0 129 0 1.06M
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:34:04 CEST 2015
> pool 2.52T 207G 0 55 0 503K
> rpool 33.0G 118G 0 11 0 97.9K
> ---------- ----- ----- ----- ----- ----- -----
> ...
>
> just occasional bursts of work.
>
> I've now enabled SMART on the disks (2*3TB mirror "pool" and 1*300GB
> "rpool"), ran some short tests, and triggered long tests (hopefully
> they'll complete by tomorrow); the current results are:
>
>
> # for D in /dev/rdsk/c0*s0; do echo "===== $D :"; smartctl -d sat,12 -a $D
> ; done ; for D in /dev/rdsk/c4*s0 ; do echo "===== $D :"; smartctl -d scsi
> -a $D ; done
> ===== /dev/rdsk/c0t3d0s0 :
> smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
> Copyright (C) 2002-12, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Device Model: WDC WD3003FZEX-00Z4SA0
> Serial Number: WD-WCC5D1KKU0PA
> LU WWN Device Id: 5 0014ee 2610716b7
> Firmware Version: 01.01A01
> User Capacity: 3,000,592,982,016 bytes [3.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 7200 rpm
> Device is: Not in smartctl database [for details use: -P showall]
> ATA Version is: ACS-2 (minor revision not indicated)
> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is: Thu Oct 22 18:45:28 2015 CEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection:
> Enabled.
> Self-test execution status: ( 249) Self-test routine in progress...
> 90% of test remaining.
> Total time to complete Offline
> data collection: (32880) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection
> on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 357) minutes.
> Conveyance self-test routine
> recommended polling time: ( 5) minutes.
> SCT capabilities: (0x7035) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   246   154   021    Pre-fail  Always       -       6691
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       14
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4869
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
>  16 Unknown_Attribute       0x0022   130   070   000    Old_age   Always       -       2289651870502
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       12
> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2
> 194 Temperature_Celsius     0x0022   117   111   000    Old_age   Always       -       35
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%             4869  -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> ===== /dev/rdsk/c0t5d0s0 :
> smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
> Copyright (C) 2002-12, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family: Seagate SV35
> Device Model: ST3000VX000-1ES166
> Serial Number: Z500S3L8
> LU WWN Device Id: 5 000c50 079e3757b
> Firmware Version: CV26
> User Capacity: 3,000,592,982,016 bytes [3.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 7200 rpm
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
> SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is: Thu Oct 22 18:45:28 2015 CEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x00) Offline data collection activity
> was never started.
> Auto Offline Data Collection:
> Disabled.
> Self-test execution status: ( 249) Self-test routine in progress...
> 90% of test remaining.
> Total time to complete Offline
> data collection: ( 80) seconds.
> Offline data collection
> capabilities: (0x73) SMART execute Offline immediate.
> Auto Offline data collection
> on/off support.
> Suspend Offline collection upon new
> command.
> No Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 1) minutes.
> Extended self-test routine
> recommended polling time: ( 325) minutes.
> Conveyance self-test routine
> recommended polling time: ( 2) minutes.
> SCT capabilities: (0x10b9) SCT Status supported.
> SCT Error Recovery Control
> supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       8600880
>   3 Spin_Up_Time            0x0003   096   094   000    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       19
>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   085   060   030    Pre-fail  Always       -       342685681
>   9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       4214
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       19
> 184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
> 189 High_Fly_Writes         0x003a   028   028   000    Old_age   Always       -       72
> 190 Airflow_Temperature_Cel 0x0022   069   065   045    Old_age   Always       -       31 (Min/Max 29/32)
> 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
> 192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       19
> 193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       28
> 194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 20 0 0 0)
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                         Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Self-test routine in progress        90%             4214  -
> # 2  Short offline       Completed without error              00%             4214  -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> ===== /dev/rdsk/c4t5000CCA02A1292DDd0s0 :
> smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
> Copyright (C) 2002-12, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> Vendor: HITACHI
> Product: HUS156030VLS600
> Revision: HPH1
> User Capacity: 300,000,000,000 bytes [300 GB]
> Logical block size: 512 bytes
> Logical Unit id: 0x5000cca02a1292dc
> Serial number: LVVA6NHS
> Device type: disk
> Transport protocol: SAS
> Local Time is: Thu Oct 22 18:45:29 2015 CEST
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature: 45 C
> Drive Trip Temperature: 70 C
> Manufactured in week 14 of year 2012
> Specified cycle count over device lifetime: 50000
> Accumulated start-stop cycles: 80
> Elements in grown defect list: 0
> Vendor (Seagate) cache information
> Blocks sent to initiator = 2340336504406016
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/      errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:          0   888890         0     888890            0      29326.957           0
> write:         0   961315         0     961315            0       6277.560           0
>
> Non-medium error count: 283
>
> SMART Self-test log
> Num  Test               Status                     segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                                   number   (hours)
> # 1  Background long    Self test in progress ...        -       NOW              - [-   -    -]
> # 2  Background long    Aborted (device reset ?)         -     14354              - [-   -    -]
> # 3  Background short   Completed                        -     14354              - [-   -    -]
> # 4  Background long    Aborted (device reset ?)         -     14354              - [-   -    -]
> # 5  Background long    Aborted (device reset ?)         -     14354              - [-   -    -]
>
> Long (extended) Self Test duration: 2506 seconds [41.8 minutes]
>
>
>
> The zpool scrub results and general layout:
>
>
>
> # zpool status -v
> pool: pool
> state: ONLINE
> scan: scrub repaired 0 in 164h13m with 0 errors on Thu Oct 22 18:13:33
> 2015
> config:
>
> NAME STATE READ WRITE CKSUM
> pool ONLINE 0 0 0
> mirror-0 ONLINE 0 0 0
> c0t3d0 ONLINE 0 0 0
> c0t5d0 ONLINE 0 0 0
> logs
> c4t5000CCA02A1292DDd0p2 ONLINE 0 0 0
> cache
> c4t5000CCA02A1292DDd0p3 ONLINE 0 0 0
>
> errors: No known data errors
>
> pool: rpool
> state: ONLINE
> status: Some supported features are not enabled on the pool. The pool can
> still be used, but some features are unavailable.
> action: Enable all features using 'zpool upgrade'. Once this is done,
>         the pool may no longer be accessible by software that does not
>         support the features. See zpool-features(5) for details.
> scan: scrub repaired 0 in 3h3m with 0 errors on Thu Oct 8 04:12:35 2015
> config:
>
> NAME STATE READ WRITE CKSUM
> rpool ONLINE 0 0 0
> c4t5000CCA02A1292DDd0s0 ONLINE 0 0 0
>
> errors: No known data errors
>
> # zpool list -v
> NAME                         SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
> pool                        2.72T  2.52T   207G         -    68%    92%  1.36x  ONLINE  /
>   mirror                    2.72T  2.52T   207G         -    68%    92%
>     c0t3d0                      -      -      -         -      -      -
>     c0t5d0                      -      -      -         -      -      -
> log                             -      -      -         -      -      -
>   c4t5000CCA02A1292DDd0p2      8G   148K  8.00G         -     0%     0%
> cache                           -      -      -         -      -      -
>   c4t5000CCA02A1292DDd0p3    120G  1.80G   118G         -     0%     1%
> rpool                        151G  33.0G   118G         -    76%    21%  1.00x  ONLINE  -
>   c4t5000CCA02A1292DDd0s0    151G  33.0G   118G         -    76%    21%
>
> Note that the long scrub time may have included the downtime while the
> system was frozen, until it was rebooted.
>
>
>
> Thanks in advance for the fresh pairs of eyeballs,
> Jim Klimov
>