[OmniOS-discuss] 4kn or 512e with ashift=12
Fred Liu
Fred_Liu at issi.com
Mon Mar 28 09:06:46 UTC 2016
> -----Original Message-----
> From: Fred Liu
> Sent: Monday, March 28, 2016 16:57
> To: 'Chris Siebenmann'; 'Richard Jahnel'
> Cc: 'omnios-discuss at lists.omniti.com'
> Subject: RE: [OmniOS-discuss] 4kn or 512e with ashift=12
>
>
>
> > -----Original Message-----
> > From: Fred Liu
> > Sent: Thursday, March 24, 2016 18:26
> > To: 'Chris Siebenmann'; Richard Jahnel
> > Cc: omnios-discuss at lists.omniti.com
> > Subject: RE: [OmniOS-discuss] 4kn or 512e with ashift=12
> >
> >
> >
> > > -----Original Message-----
> > > From: Chris Siebenmann [mailto:cks at cs.toronto.edu]
> > > Sent: Wednesday, March 23, 2016 23:33
> > > To: Richard Jahnel
> > > Cc: Chris Siebenmann; Fred Liu; omnios-discuss at lists.omniti.com
> > > Subject: Re: [OmniOS-discuss] 4kn or 512e with ashift=12
> > >
> > > > It should be noted that using a 512e disk as a 512n disk subjects
> > > > you to a significant risk of silent corruption in the event of power loss.
> > > > Because a 512e disk does a read-modify-write operation to modify a
> > > > 512-byte chunk of a 4k physical sector, ZFS won't know about the
> > > > other seven corrupted 512e sectors if power is lost during a write
> > > > operation. So when ZFS discards the incomplete txg on reboot, it
> > > > won't do anything about the other seven 512e sectors it doesn't
> > > > know were affected.
> > >
> > > This is true; under normal circumstances you do not want to use a
> > > 512e drive in an ashift=9 vdev. However, if you have a dead 512n
> > > drive and you have no remaining 512n spares, your choices are to run
> > > without redundancy, to wedge in a 512e drive and accept the
> > > potential problems on power failure (problems that can likely be
> > > fixed by scrubbing the pool afterwards), or obtain enough additional
> > > drives (and perhaps
> > > server(s)) to entirely rebuild the pool on 512e drives with ashift=12.
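> > >
> > > A quick sketch of that scrub step, with "tank" standing in for the
> > > pool name (run it once the pool is back up after the power loss):
> > >
> > >   zpool scrub tank        # re-read every block; redundancy repairs bad copies
> > >   zpool status -v tank    # watch progress and check for permanent errors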
> > >
> > > In this situation, running with a 512e drive and accepting the
> > > performance issues and potential exposure to power failures is
> > > basically the lesser evil. (I wish ZFS was willing to accept this,
> > > but it isn't.)
> > >
> > [Fred Liu]: I did a similar test here:
> >
> > [root at 00-25-90-74-f5-04 ~]# zpool status
> >   pool: tank
> >  state: ONLINE
> >   scan: resilvered 187G in 21h9m with 0 errors on Thu Jan 15 08:05:16 2015
> > config:
> >
> >         NAME                       STATE     READ WRITE CKSUM
> >         tank                       ONLINE       0     0     0
> >           raidz2-0                 ONLINE       0     0     0
> >             c2t45d0                ONLINE       0     0     0
> >             c2t46d0                ONLINE       0     0     0
> >             c2t47d0                ONLINE       0     0     0
> >             c2t48d0                ONLINE       0     0     0
> >             c2t49d0                ONLINE       0     0     0
> >             c2t52d0                ONLINE       0     0     0
> >             c2t53d0                ONLINE       0     0     0
> >             c2t44d0                ONLINE       0     0     0
> >         spares
> >           c0t5000CCA6A0C791CBd0    AVAIL
> >
> > errors: No known data errors
> >
> >   pool: zones
> >  state: ONLINE
> >   scan: scrub repaired 0 in 2h45m with 0 errors on Tue Aug 12 20:24:30 2014
> > config:
> >
> >         NAME                       STATE     READ WRITE CKSUM
> >         zones                      ONLINE       0     0     0
> >           raidz2-0                 ONLINE       0     0     0
> >             c0t5000C500584AC07Bd0  ONLINE       0     0     0
> >             c0t5000C500584AC557d0  ONLINE       0     0     0
> >             c0t5000C500584ACB1Fd0  ONLINE       0     0     0
> >             c0t5000C500584AD7B3d0  ONLINE       0     0     0
> >             c0t5000C500584C30DBd0  ONLINE       0     0     0
> >             c0t5000C500586E54A3d0  ONLINE       0     0     0
> >             c0t5000C500586EF0CBd0  ONLINE       0     0     0
> >             c0t5000C50058426A0Fd0  ONLINE       0     0     0
> >         logs
> >           c4t0d0                   ONLINE       0     0     0
> >           c4t1d0                   ONLINE       0     0     0
> >         cache
> >           c0t55CD2E404BE9CB7Ed0    ONLINE       0     0     0
> >
> > errors: No known data errors
> >
> > [root at 00-25-90-74-f5-04 ~]# format
> > Searching for disks...done
> >
> >
> > AVAILABLE DISK SELECTIONS:
> >        0. c0t55CD2E404BE9CB7Ed0 <ATA-INTEL SSDSC2BW18-DC32-167.68GB>
> >           /scsi_vhci/disk at g55cd2e404be9cb7e
> >        1. c0t5000C500584AC07Bd0 <SEAGATE-ST91000640SS-0004-931.51GB>
> >           /scsi_vhci/disk at g5000c500584ac07b
> >        2. c0t5000C500584AC557d0 <SEAGATE-ST91000640SS-0004-931.51GB>
> >           /scsi_vhci/disk at g5000c500584ac557
> >        3. c0t5000C500584ACB1Fd0 <SEAGATE-ST91000640SS-0004-931.51GB>
> >           /scsi_vhci/disk at g5000c500584acb1f
> >        4. c0t5000C500584AD7B3d0 <SEAGATE-ST91000640SS-0004-931.51GB>
> >           /scsi_vhci/disk at g5000c500584ad7b3
> >        5. c0t5000C500584C30DBd0 <SEAGATE-ST91000640SS-0004-931.51GB>
> >           /scsi_vhci/disk at g5000c500584c30db
> >        6. c0t5000C500586E54A3d0 <SEAGATE-ST91000640SS-0004-931.51GB>
> >           /scsi_vhci/disk at g5000c500586e54a3
> >        7. c0t5000C500586EF0CBd0 <SEAGATE-ST91000640SS-0004-931.51GB>
> >           /scsi_vhci/disk at g5000c500586ef0cb
> >        8. c0t5000C50058426A0Fd0 <SEAGATE-ST91000640SS-0004-931.51GB>
> >           /scsi_vhci/disk at g5000c50058426a0f
> >        9. c0t5000CCA6A0C791CBd0 <ATA-Hitachi HTS54101-A480-931.51GB>
> >           /scsi_vhci/disk at g5000cca6a0c791cb
> >       10. c0t50000F0056425331d0 <ATA-SAMSUNG MMCRE28G-AS1Q-119.24GB>
> >           /scsi_vhci/disk at g50000f0056425331
> >       11. c2t44d0 <ATA-Hitachi HTS54101-A480-931.51GB>
> >           /pci at 0,0/pci8086,1c10 at 1c/pci1000,3140 at 0/sd at 2c,0
> >       12. c2t45d0 <ATA-Hitachi HTS54101-A480-931.51GB>
> >           /pci at 0,0/pci8086,1c10 at 1c/pci1000,3140 at 0/sd at 2d,0
> >       13. c2t46d0 <ATA-ST1000LM024 HN-M-0002-931.51GB>
> >           /pci at 0,0/pci8086,1c10 at 1c/pci1000,3140 at 0/sd at 2e,0
> >       14. c2t47d0 <ATA-ST1000LM024 HN-M-0002-931.51GB>
> >           /pci at 0,0/pci8086,1c10 at 1c/pci1000,3140 at 0/sd at 2f,0
> >       15. c2t48d0 <ATA-WDC WD10JPVT-08A-1A01-931.51GB>
> >           /pci at 0,0/pci8086,1c10 at 1c/pci1000,3140 at 0/sd at 30,0
> >       16. c2t49d0 <ATA-WDC WD10JPVT-75A-1A01-931.51GB>
> >           /pci at 0,0/pci8086,1c10 at 1c/pci1000,3140 at 0/sd at 31,0
> >       17. c2t52d0 <ATA-ST1000LM024 HN-M-0001-931.51GB>
> >           /pci at 0,0/pci8086,1c10 at 1c/pci1000,3140 at 0/sd at 34,0
> >       18. c2t53d0 <ATA-ST1000LM024 HN-M-0001-931.51GB>
> >           /pci at 0,0/pci8086,1c10 at 1c/pci1000,3140 at 0/sd at 35,0
> >       19. c4t0d0 <ATA-ANS9010_2NNN2NNN-_200-1.78GB>
> >           /pci at 0,0/pci15d9,624 at 1f,2/disk at 0,0
> >       20. c4t1d0 <ATA-ANS9010_2NNN2NNN-_200-1.78GB>
> >           /pci at 0,0/pci15d9,624 at 1f,2/disk at 1,0
> >
> >
> >
> > [root at 00-25-90-74-f5-04 ~]# zpool replace tank c2t44d0 c0t5000CCA6A0C791CBd0
> > cannot replace c2t44d0 with c0t5000CCA6A0C791CBd0: devices have
> > different sector alignment
> >
> > But in fact "c2t44d0" and "c0t5000CCA6A0C791CBd0" are the same
> > model -- ATA-Hitachi HTS54101-A480-931.51GB. That is the HTS541010A9E680
> > (https://www.hgst.com/sites/default/files/resources/TS5K1000_ds.pdf),
> > which is a 512e HDD.
> > The *only* difference is that "c2t44d0" is attached to an LSI 1068 HBA
> > and "c0t5000CCA6A0C791CBd0" is attached to an LSI 2308 HBA.
> >
> > [root at 00-25-90-74-f5-04 ~]# zdb -l /dev/dsk/c2t44d0s0 | grep ashift
> > ashift: 9
> > ashift: 9
> > ashift: 9
> > ashift: 9
> >
> > [root at 00-25-90-74-f5-04 ~]# zdb -l /dev/dsk/c0t5000CCA6A0C791CBd0s0 | grep ashift
> > ashift: 12
> > ashift: 12
> > ashift: 12
> > ashift: 12
> > format> inq
> > Vendor: ATA
> > Product: Hitachi HTS54101
> > Revision: A480
> > format> q
> >
> > adding " "ATA Hitachi HTS54101", "physical-block-size:512"," into
> > sd.conf
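> >
> > For reference, a sketch of how the full stanza in /kernel/drv/sd.conf
> > might need to look -- the usual convention is the INQUIRY vendor ID
> > padded to 8 characters followed by the product ID, and if an
> > sd-config-list already exists the new entry has to be merged into it,
> > with "," between entries and ";" only after the last one:
> >
> >   sd-config-list =
> >       "ATA     Hitachi HTS54101", "physical-block-size:512";
> >
> > A vendor field that is not padded out to 8 characters is one common
> > reason such an entry fails to match.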
> >
> > [root at 00-25-90-74-f5-04 ~]# update_drv -vf sd
> > Cannot unload module: sd
> > Will be unloaded upon reboot.
> > Forcing update of sd.conf.
> > sd.conf updated in the kernel.
> >
> > I rebooted the server, since "cfgadm -c unconfigure" doesn't work here.
> >
> > [root at 00-25-90-74-f5-04 ~]# zdb -l /dev/dsk/c0t5000CCA6A0C791CBd0s0 | grep ashift
> > [root at 00-25-90-74-f5-04 ~]#
> >
> > No ashift in output now.
> >
> > [root at 00-25-90-74-f5-04 ~]# zdb -l /dev/dsk/c2t44d0s0 | grep ashift
> > ashift: 9
> > ashift: 9
> > ashift: 9
> > ashift: 9
> >
> > Same as before.
> >
> > [root at 00-25-90-74-f5-04 ~]# zpool replace tank c2t44d0 c0t5000CCA6A0C791CBd0
> > cannot replace c2t44d0 with c0t5000CCA6A0C791CBd0: devices have
> > different sector alignment
> >
> > Remove the spare:
> > [root at 00-25-90-74-f5-04 ~]# zpool remove tank c0t5000CCA6A0C791CBd0
> > [root at 00-25-90-74-f5-04 ~]#
> >
> > Add it back:
> > [root at 00-25-90-74-f5-04 ~]# zpool add tank spare c0t5000CCA6A0C791CBd0
> > [root at 00-25-90-74-f5-04 ~]#
> >
> > [root at 00-25-90-74-f5-04 ~]# zpool status tank
> >   pool: tank
> >  state: ONLINE
> >   scan: resilvered 187G in 21h9m with 0 errors on Thu Jan 15 08:05:16 2015
> > config:
> >
> >         NAME                       STATE     READ WRITE CKSUM
> >         tank                       ONLINE       0     0     0
> >           raidz2-0                 ONLINE       0     0     0
> >             c2t45d0                ONLINE       0     0     0
> >             c2t46d0                ONLINE       0     0     0
> >             c2t47d0                ONLINE       0     0     0
> >             c2t48d0                ONLINE       0     0     0
> >             c2t49d0                ONLINE       0     0     0
> >             c2t52d0                ONLINE       0     0     0
> >             c2t53d0                ONLINE       0     0     0
> >             c2t44d0                ONLINE       0     0     0
> >         spares
> >           c0t5000CCA6A0C791CBd0    AVAIL
> >
> > errors: No known data errors
> >
> > Still not working:
> >
> > [root at 00-25-90-74-f5-04 ~]# zpool replace tank c2t44d0 c0t5000CCA6A0C791CBd0
> > cannot replace c2t44d0 with c0t5000CCA6A0C791CBd0: devices have
> > different sector alignment
> >
> >
> > Maybe the sd.conf update is not correct.
> >
> >
>
> It looks like ZoL is a really helpful tool for overriding ashift when sd.conf
> doesn't work, even though it produced the following errors:
>
> PANIC: blkptr at ffff8807e4b34000 has invalid CHECKSUM 10
> Showing stack for process 11419
> Pid: 11419, comm: z_zvol Tainted: P -- ------------ 2.6.32-573.3.1.el6.x86_64 #1
> Call Trace:
>  [<ffffffffa0472e9d>] ? spl_dumpstack+0x3d/0x40 [spl]
>  [<ffffffffa0472f2d>] ? vcmn_err+0x8d/0xf0 [spl]
>  [<ffffffff815391da>] ? schedule_timeout+0x19a/0x2e0
>  [<ffffffff81089c10>] ? process_timeout+0x0/0x10
>  [<ffffffff810a1697>] ? finish_wait+0x67/0x80
>  [<ffffffffa046e4bf>] ? spl_kmem_cache_alloc+0x38f/0x8c0 [spl]
>  [<ffffffffa0526e62>] ? zfs_panic_recover+0x52/0x60 [zfs]
>  [<ffffffffa04c7220>] ? arc_read_done+0x0/0x320 [zfs]
>  [<ffffffffa0577283>] ? zfs_blkptr_verify+0x83/0x420 [zfs]
>  [<ffffffff810a14b0>] ? autoremove_wake_function+0x0/0x40
>  [<ffffffffa0578292>] ? zio_read+0x42/0x100 [zfs]
>  [<ffffffff81178cbd>] ? __kmalloc_node+0x4d/0x60
>  [<ffffffffa04c7220>] ? arc_read_done+0x0/0x320 [zfs]
>  [<ffffffffa04c9721>] ? arc_read+0x341/0xa70 [zfs]
>  [<ffffffffa04d1b34>] ? dbuf_prefetch+0x1f4/0x2e0 [zfs]
>  [<ffffffffa04d892a>] ? dmu_prefetch+0x1da/0x210 [zfs]
>  [<ffffffff8127e51d>] ? alloc_disk_node+0xad/0x110
>  [<ffffffffa0584ce7>] ? zvol_create_minor_impl+0x607/0x630 [zfs]
>  [<ffffffffa0585298>] ? zvol_create_minors_cb+0x88/0xf0 [zfs]
>  [<ffffffffa04dac36>] ? dmu_objset_find_impl+0x106/0x420 [zfs]
>  [<ffffffffa0585210>] ? zvol_create_minors_cb+0x0/0xf0 [zfs]
>  [<ffffffffa04dacfa>] ? dmu_objset_find_impl+0x1ca/0x420 [zfs]
>  [<ffffffffa0585210>] ? zvol_create_minors_cb+0x0/0xf0 [zfs]
>  [<ffffffffa04dacfa>] ? dmu_objset_find_impl+0x1ca/0x420 [zfs]
>  [<ffffffffa0585210>] ? zvol_create_minors_cb+0x0/0xf0 [zfs]
>  [<ffffffffa0585210>] ? zvol_create_minors_cb+0x0/0xf0 [zfs]
>  [<ffffffffa04dafa2>] ? dmu_objset_find+0x52/0x80 [zfs]
>  [<ffffffffa046dd26>] ? spl_kmem_alloc+0x96/0x1a0 [spl]
>  [<ffffffffa05850a2>] ? zvol_task_cb+0x392/0x3b0 [zfs]
>  [<ffffffffa0470ebf>] ? taskq_thread+0x25f/0x540 [spl]
>  [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
>  [<ffffffffa0470c60>] ? taskq_thread+0x0/0x540 [spl]
>  [<ffffffff810a101e>] ? kthread+0x9e/0xc0
>  [<ffffffff8100c28a>] ? child_rip+0xa/0x20
>  [<ffffffff810a0f80>] ? kthread+0x0/0xc0
>  [<ffffffff8100c280>] ? child_rip+0x0/0x20
> INFO: task z_zvol:11419 blocked for more than 120 seconds.
>       Tainted: P -- ------------ 2.6.32-573.3.1.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> z_zvol        D 0000000000000003     0 11419      2 0x00000000
>  ffff8807e5daf610 0000000000000046 0000000000000000 0000000000000000
>  0000000000000000 ffff8807e5daf5e8 000000b57f764afb 0000000000020000
>  ffff8807e5daf5b0 0000000100074cde ffff8807e68bbad8 ffff8
>
> The pool can still be imported by ZoL, and the replacement can then be done
> with zpool replace -o ashift=9:
>
> [root at livecd ~]# zpool status
>   pool: tank
>  state: ONLINE
> status: One or more devices is currently being resilvered. The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>   scan: resilver in progress since Mon Mar 28 08:36:30 2016
>     1.95G scanned out of 1.47T at 1.76M/s, 242h30m to go
>     246M resilvered, 0.13% done
> config:
>
>         NAME                                            STATE     READ WRITE CKSUM
>         tank                                            ONLINE       0     0     0
>           raidz2-0                                      ONLINE       0     0     0
>             ata-Hitachi_HTS541010A9E680_J8400076GJU97C  ONLINE       0     0     0
>             ata-ST1000LM024_HN-M101MBB_S2R8J9BC502817   ONLINE       0     0     0
>             ata-ST1000LM024_HN-M101MBB_S2R8J9KC505621   ONLINE       0     0     0
>             ata-WDC_WD10JPVT-08A1YT2_WD-WXD1A4355927    ONLINE       0     0     0
>             ata-WDC_WD10JPVT-75A1YT0_WXP1EA2KFK12       ONLINE       0     0     0
>             ata-ST1000LM024_HN-M101MBB_S318J9AF191087   ONLINE       0     0     0
>             ata-ST1000LM024_HN-M101MBB_S318J9AF191090   ONLINE       0     0     0
>             spare-7                                     ONLINE       0     0     0
>               ata-Hitachi_HTS541010A9E680_J8400076GJ0KZD  ONLINE     0     0     0
>               sdw                                         ONLINE     0     0     0  (resilvering)
>         spares
>           sdw                                           INUSE     currently in use
>
> It can be a good remedy for old storage systems that were built with 512n
> HDDs and no longer have any 512n spares.
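>
> Roughly, the steps under ZoL were something like this (a sketch; the
> device names are the by-id names shown in the status above, with "sdw"
> being the spare):
>
>   # import the pool under ZoL using stable by-id device names
>   zpool import -d /dev/disk/by-id tank
>   # replace the old member with the spare, forcing the replacement vdev
>   # to be labelled ashift=9 so it matches the rest of the raidz2
>   zpool replace -o ashift=9 tank ata-Hitachi_HTS541010A9E680_J8400076GJ0KZD sdw
>
> ZoL honours the -o ashift override on zpool replace, which is exactly
> what illumos refused to do above.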
>
>
Switching back to illumos:
[root at 00-25-90-74-f5-04 ~]# zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Mar 28 20:36:30 2016
    2.29G scanned out of 1.47T at 1.63M/s, 261h27m to go
    288M resilvered, 0.15% done
config:

        NAME                         STATE     READ WRITE CKSUM
        tank                         ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            c2t45d0                  ONLINE       0     0     0
            c2t46d0                  ONLINE       0     0     0
            c2t47d0                  ONLINE       0     0     0
            c2t48d0                  ONLINE       0     0     0
            c2t49d0                  ONLINE       0     0     0
            c2t52d0                  ONLINE       0     0     0
            c2t53d0                  ONLINE       0     0     0
            spare-7                  ONLINE       0     0     0
              c2t44d0                ONLINE       0     0     0
              c0t5000CCA6A0C791CBd0  ONLINE       0     0     0  (resilvering)
        spares
          c0t5000CCA6A0C791CBd0      INUSE     currently in use

errors: No known data errors
Thanks.
Fred