[OmniOS-discuss] OmniOS backup box hanging regularly
Yavor Tomov
yavoritomov at gmail.com
Thu Oct 22 17:36:56 UTC 2015
Hi Tovarishch Jim,
I had a similar issue with my box, and it was related to NFS locks. I
assume you are using NFS for the Linux backups. The solution was posted
by Chip on the mailing list; a copy of his solution is below:
"I've seen issues like this when you run out of NFS locks. NFSv3 in
Illumos is really slow at releasing locks.
On all my NFS servers I do:
sharectl set -p lockd_listen_backlog=256 nfs
sharectl set -p lockd_servers=2048 nfs
Everywhere I can, I use NFSv4 instead of v3. It handles locks much better."
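
In case it helps, here is a rough sketch of how I check and apply those
settings on the server, plus forcing v4 from a Linux client (the export
path and mount point below are just placeholders - adjust to your setup):

# check the current lockd tunables before changing them
sharectl get -p lockd_listen_backlog nfs
sharectl get -p lockd_servers nfs

# apply the larger values from Chip's suggestion
sharectl set -p lockd_listen_backlog=256 nfs
sharectl set -p lockd_servers=2048 nfs

# on a Linux client, request NFSv4 explicitly at mount time
mount -t nfs -o vers=4 backupserver:/pool/backups /mnt/backups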
All the Best
Yavor
On Thu, Oct 22, 2015 at 11:59 AM, Jim Klimov <jim at cos.ru> wrote:
> Hello all,
>
> I have this HP Z400 workstation with 16 GB of (supposedly) ECC RAM running
> OmniOS bloody, which acts as a backup server for our production systems
> (regularly rsync'ing large files off Linux boxes and rotating ZFS
> auto-snapshots to keep its space free). It sometimes also runs replicas of
> infrastructure services (DHCP, DNS) and was set up as a VirtualBox +
> phpVirtualBox host to test that out, but no VMs are currently running.
>
> So the essential loads are ZFS snapshots and ZFS scrubs :)
>
> And it freezes roughly every week. It stops responding to ping and to login
> attempts via SSH or the physical console - it processes keypresses on the
> latter, but does not present a login prompt. It used to be stable; these
> regular hangs began around summertime.
>
>
>
> My primary guess would be flaky disks, maybe timing out under load or
> going to sleep or some such... But I have yet to prove that, or any other
> theory. Maybe the CPU is simply overheating under the regular near-100% load
> combined with disk I/O... At the very least I want to rule out OS errors, and
> rule out (or pinpoint) operator/box errors as much as possible - those being
> something I can actually change to try and fix ;)
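>
> (For the flaky-disk theory, the per-device error counters and FMA telemetry
> are also worth a look - a quick sketch using standard illumos tools:)
>
> # soft/hard/transport error counters per device since boot
> iostat -En
> # kernel error reports (timeouts, resets, retries land here)
> fmdump -e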
>
> Before I proceed to the TL;DR screenshots, here is an overview of what I see:
>
> * In the "top" output, processes owned by zfssnap lead most of the time...
> But even the SSH shell is noticeably slow to respond (1 sec per line when
> just pressing enter to clear the screen to prepare nice screenshots).
>
> * SMART was not enabled on the 3TB mirrored "pool" SATA disks (it is now,
> and long tests have been initiated - see the command sketch after this
> list), but it was in place on the "rpool" SAS disk, where it logged some
> corrected ECC errors - but none uncorrected.
>
> Maybe the cabling should be reseated.
>
> * iostat shows the disks are generally not busy (they don't audibly rattle
> or visibly blink all the time, either)
>
> * zpool scrubs return clean
>
> * partitions of the system rpool disk (10K RPM SAS) are used as log and
> cache devices for the main data pool on the 3TB SATA disks. The system disk
> is fast and underutilized, so why not ;) This was not a problem during the
> first year of this system's honest and stable workouts. These devices are
> pretty empty at the moment.
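>
> Roughly how I enabled SMART and kicked off the long tests (a sketch - the
> -d flags match the smartctl invocations shown further below):
>
> # enable SMART and start a long self-test on the SATA mirror members
> for D in /dev/rdsk/c0t3d0s0 /dev/rdsk/c0t5d0s0; do
>   smartctl -d sat,12 -s on $D
>   smartctl -d sat,12 -t long $D
> done
> # and on the SAS rpool disk
> smartctl -d scsi -s on /dev/rdsk/c4t5000CCA02A1292DDd0s0
> smartctl -d scsi -t long /dev/rdsk/c4t5000CCA02A1292DDd0s0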
>
>
>
> I have enabled deadman panics according to the Wiki, but none have happened
> so far:
>
> # cat /etc/system | egrep -v '(^\*|^$)'
> set snooping=1
> set pcplusmp:apic_panic_on_nmi=1
> set apix:apic_panic_on_nmi = 1
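>
> To double-check that these took effect after reboot, the live kernel values
> can be queried with mdb (a sketch; if I recall correctly, snoop_interval is
> the deadman interval in microseconds, and apic_panic_on_nmi lives in the
> apix/pcplusmp module):
>
> echo "snooping/D" | mdb -k
> echo "snoop_interval/D" | mdb -k
> echo 'apix`apic_panic_on_nmi/D' | mdb -k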
>
>
>
>
>
> In the "top" output, processes owned by zfssnap lead most of the time:
>
>
>
> last pid: 22599;  load avg: 12.9, 12.2, 11.2;  up 0+09:52:11        18:34:41
> 140 processes: 125 sleeping, 13 running, 2 on cpu
> CPU states: 0.0% idle, 22.9% user, 77.1% kernel, 0.0% iowait, 0.0% swap
> Memory: 16G phys mem, 1765M free mem, 2048M total swap, 2048M free swap
> Seconds to delay:
> PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
> 21389 zfssnap 1 43 2 863M 860M run 5:04 35.61% zfs
> 22360 zfssnap 1 52 2 118M 115M run 0:37 16.50% zfs
> 21778 zfssnap 1 52 2 563M 560M run 3:15 13.17% zfs
> 21278 zfssnap 1 52 2 947M 944M run 5:32 6.91% zfs
> 21881 zfssnap 1 43 2 433M 431M run 2:31 5.41% zfs
> 21852 zfssnap 1 52 2 459M 456M run 2:39 5.16% zfs
> 21266 zfssnap 1 43 2 906M 903M run 5:18 3.95% zfs
> 21757 zfssnap 1 43 2 597M 594M run 3:26 2.91% zfs
> 21274 zfssnap 1 52 2 930M 927M cpu/0 5:27 2.78% zfs
> 22588 zfssnap 1 43 2 30M 27M run 0:08 2.48% zfs
> 22580 zfssnap 1 52 2 49M 46M run 0:14 0.71% zfs
> 22038 root 1 59 0 5312K 3816K cpu/1 0:01 0.10% top
> 22014 root 1 59 0 8020K 4988K sleep 0:00 0.02% sshd
>
>
>
> Average "iostats" are not that busy:
>
>
>
> # zpool iostat -Td 5
> Thu Oct 22 18:24:59 CEST 2015
> capacity operations bandwidth
> pool alloc free read write read write
> ---------- ----- ----- ----- ----- ----- -----
> pool 2.52T 207G 802 116 28.3M 840K
> rpool 33.0G 118G 0 4 4.52K 58.7K
> ---------- ----- ----- ----- ----- ----- -----
>
> Thu Oct 22 18:25:04 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 10 0 97.9K
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:09 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:14 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 9 0 93.5K
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:19 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:24 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:29 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:34 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:25:39 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 16 0 374K
> ---------- ----- ----- ----- ----- ----- -----
> ...
>
> Thu Oct 22 18:33:49 CEST 2015
> pool 2.52T 207G 0 0 0 0
> rpool 33.0G 118G 0 11 0 94.5K
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:33:54 CEST 2015
> pool 2.52T 207G 0 13 819 80.0K
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:33:59 CEST 2015
> pool 2.52T 207G 0 129 0 1.06M
> rpool 33.0G 118G 0 0 0 0
> ---------- ----- ----- ----- ----- ----- -----
> Thu Oct 22 18:34:04 CEST 2015
> pool 2.52T 207G 0 55 0 503K
> rpool 33.0G 118G 0 11 0 97.9K
> ---------- ----- ----- ----- ----- ----- -----
> ...
>
> just occasional bursts of work.
>
> I've now enabled SMART on the disks (2*3TB mirror "pool" and 1*300GB
> "rpool"), ran some short tests, and triggered long tests (hopefully
> they'll complete by tomorrow); the current results are:
>
>
> # for D in /dev/rdsk/c0*s0; do echo "===== $D :"; smartctl -d sat,12 -a $D
> ; done ; for D in /dev/rdsk/c4*s0 ; do echo "===== $D :"; smartctl -d scsi
> -a $D ; done
> ===== /dev/rdsk/c0t3d0s0 :
> smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
> Copyright (C) 2002-12, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Device Model: WDC WD3003FZEX-00Z4SA0
> Serial Number: WD-WCC5D1KKU0PA
> LU WWN Device Id: 5 0014ee 2610716b7
> Firmware Version: 01.01A01
> User Capacity: 3,000,592,982,016 bytes [3.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 7200 rpm
> Device is: Not in smartctl database [for details use: -P showall]
> ATA Version is: ACS-2 (minor revision not indicated)
> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is: Thu Oct 22 18:45:28 2015 CEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection:
> Enabled.
> Self-test execution status: ( 249) Self-test routine in progress...
> 90% of test remaining.
> Total time to complete Offline
> data collection: (32880) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection
> on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 357) minutes.
> Conveyance self-test routine
> recommended polling time: ( 5) minutes.
> SCT capabilities: (0x7035) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   246   154   021    Pre-fail  Always       -       6691
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       14
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4869
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
>  16 Unknown_Attribute       0x0022   130   070   000    Old_age   Always       -       2289651870502
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       12
> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2
> 194 Temperature_Celsius     0x0022   117   111   000    Old_age   Always       -       35
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%             4869  -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> ===== /dev/rdsk/c0t5d0s0 :
> smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
> Copyright (C) 2002-12, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family: Seagate SV35
> Device Model: ST3000VX000-1ES166
> Serial Number: Z500S3L8
> LU WWN Device Id: 5 000c50 079e3757b
> Firmware Version: CV26
> User Capacity: 3,000,592,982,016 bytes [3.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 7200 rpm
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
> SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is: Thu Oct 22 18:45:28 2015 CEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x00) Offline data collection activity
> was never started.
> Auto Offline Data Collection:
> Disabled.
> Self-test execution status: ( 249) Self-test routine in progress...
> 90% of test remaining.
> Total time to complete Offline
> data collection: ( 80) seconds.
> Offline data collection
> capabilities: (0x73) SMART execute Offline immediate.
> Auto Offline data collection
> on/off support.
> Suspend Offline collection upon new
> command.
> No Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 1) minutes.
> Extended self-test routine
> recommended polling time: ( 325) minutes.
> Conveyance self-test routine
> recommended polling time: ( 2) minutes.
> SCT capabilities: (0x10b9) SCT Status supported.
> SCT Error Recovery Control
> supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       8600880
>   3 Spin_Up_Time            0x0003   096   094   000    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       19
>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   085   060   030    Pre-fail  Always       -       342685681
>   9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       4214
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       19
> 184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
> 189 High_Fly_Writes         0x003a   028   028   000    Old_age   Always       -       72
> 190 Airflow_Temperature_Cel 0x0022   069   065   045    Old_age   Always       -       31 (Min/Max 29/32)
> 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
> 192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       19
> 193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       28
> 194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 20 0 0 0)
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                         Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Self-test routine in progress        90%             4214  -
> # 2  Short offline       Completed without error              00%             4214  -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> ===== /dev/rdsk/c4t5000CCA02A1292DDd0s0 :
> smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
> Copyright (C) 2002-12, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> Vendor: HITACHI
> Product: HUS156030VLS600
> Revision: HPH1
> User Capacity: 300,000,000,000 bytes [300 GB]
> Logical block size: 512 bytes
> Logical Unit id: 0x5000cca02a1292dc
> Serial number: LVVA6NHS
> Device type: disk
> Transport protocol: SAS
> Local Time is: Thu Oct 22 18:45:29 2015 CEST
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature: 45 C
> Drive Trip Temperature: 70 C
> Manufactured in week 14 of year 2012
> Specified cycle count over device lifetime: 50000
> Accumulated start-stop cycles: 80
> Elements in grown defect list: 0
> Vendor (Seagate) cache information
> Blocks sent to initiator = 2340336504406016
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/      errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:          0   888890         0     888890            0      29326.957           0
> write:         0   961315         0     961315            0       6277.560           0
>
> Non-medium error count: 283
>
> SMART Self-test log
> Num  Test               Status                     segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                                   number   (hours)
> # 1  Background long    Self test in progress ...        -       NOW              - [-   -    -]
> # 2  Background long    Aborted (device reset ?)         -     14354              - [-   -    -]
> # 3  Background short   Completed                        -     14354              - [-   -    -]
> # 4  Background long    Aborted (device reset ?)         -     14354              - [-   -    -]
> # 5  Background long    Aborted (device reset ?)         -     14354              - [-   -    -]
>
> Long (extended) Self Test duration: 2506 seconds [41.8 minutes]
>
>
>
> The zpool scrub results and general layout:
>
>
>
> # zpool status -v
> pool: pool
> state: ONLINE
> scan: scrub repaired 0 in 164h13m with 0 errors on Thu Oct 22 18:13:33
> 2015
> config:
>
> NAME STATE READ WRITE CKSUM
> pool ONLINE 0 0 0
> mirror-0 ONLINE 0 0 0
> c0t3d0 ONLINE 0 0 0
> c0t5d0 ONLINE 0 0 0
> logs
> c4t5000CCA02A1292DDd0p2 ONLINE 0 0 0
> cache
> c4t5000CCA02A1292DDd0p3 ONLINE 0 0 0
>
> errors: No known data errors
>
> pool: rpool
> state: ONLINE
> status: Some supported features are not enabled on the pool. The pool can
> still be used, but some features are unavailable.
> action: Enable all features using 'zpool upgrade'. Once this is done,
>         the pool may no longer be accessible by software that does not
>         support the features. See zpool-features(5) for details.
> scan: scrub repaired 0 in 3h3m with 0 errors on Thu Oct 8 04:12:35 2015
> config:
>
> NAME STATE READ WRITE CKSUM
> rpool ONLINE 0 0 0
> c4t5000CCA02A1292DDd0s0 ONLINE 0 0 0
>
> errors: No known data errors
>
> # zpool list -v
> NAME                         SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
> pool                        2.72T  2.52T   207G         -    68%    92%  1.36x  ONLINE  /
>   mirror                    2.72T  2.52T   207G         -    68%    92%
>     c0t3d0                      -      -      -         -      -      -
>     c0t5d0                      -      -      -         -      -      -
> log                             -      -      -         -      -      -
>   c4t5000CCA02A1292DDd0p2      8G   148K  8.00G         -     0%     0%
> cache                           -      -      -         -      -      -
>   c4t5000CCA02A1292DDd0p3    120G  1.80G   118G         -     0%     1%
> rpool                        151G  33.0G   118G         -    76%    21%  1.00x  ONLINE  -
>   c4t5000CCA02A1292DDd0s0    151G  33.0G   118G         -    76%    21%
>
> Note that the long scrub time may have included the downtime while the
> system was frozen, until it was rebooted.
>
>
>
> Thanks in advance for the fresh pairs of eyeballs,
> Jim Klimov
>