[OmniOS-discuss] bad smf manifest hosed system :(
Paul B. Henson
henson at acm.org
Wed Jan 21 01:45:51 UTC 2015
So I was working on updating openntpd in pkgsrc to the new portable release
and adding an smf manifest for it. Thanks to Joyent-sponsored work, pkgsrc
supports smf and can automatically install a service for a package.
So I went to install the package, and surprisingly the import of the
manifest failed:
svccfg_libscf.c:7750: fmri_to_entity() failed with unexpected error 1007.
Aborting.
/var/opt/pkg/db/pkg/openntpd-5.7p2/+INSTALL: line 1550: 8281 Abort
(core dumped) /usr/sbin/svccfg import
${PKG_PREFIX}/lib/svc/manifest/openntpd.xml
Much MUCH MUCH more surprisingly, it totally hosed my system 8-/. I was
working remotely, and after the above error showed up, I lost connectivity.
Fortunately, I have a remote serial console, and was able to get to the box,
where I found:
SUNW-MSG-ID: SMF-8000-YX, TYPE: defect, VER: 1, SEVERITY: major
EVENT-TIME: Fri Jan 16 17:54:09 PST 2015
PLATFORM: X9SRE-X9SRE-3F-X9SRi-X9SRi-3F, CSN: 0123456789, HOSTNAME: storage
SOURCE: software-diagnosis, REV: 0.1
EVENT-ID: c4c93f8b-8c63-6195-c1de-8b864f1f6433
DESC: A service failed - a start, stop or refresh method failed.
Refer to http://illumos.org/msg/SMF-8000-YX for more information.
AUTO-RESPONSE: The service has been placed into the maintenance state.
IMPACT: svc:/pkgsrc/openntpd:default is unavailable.
REC-ACTION: Run 'svcs -xv svc:/pkgsrc/openntpd:default' to determine the
generic reason why the service failed, the location of any logfiles, and a
list of other services impacted.
And after that, the following repeated over and over:
Assertion failed: r == 0, file libscf.c, line 58
[ network/physical:default starting (physical network interfaces) ]
[ network/ipfilter:simple starting (IP Filter) ]
[ milestone/network:default starting (Network milestone) ]
Assertion failed: r == 0, file libscf.c, line 58
[ network/physical:default starting (physical network interfaces) ]
[ network/ipfilter:simple starting (IP Filter) ]
[ milestone/network:default starting (Network milestone) ]
Assertion failed: r == 0, file libscf.c, line 58
I was able to log in to the maintenance console, and evidently my manifest
was poorly formed or something, as the FMRI wasn't quite right:
# svcs -x
svc:/svc:/network/ntp:pkgsrc-openntpd has no "restarter" property group;
ignoring.
Unfortunately, it would not let me delete it:
# svccfg delete svc:/svc:/network/ntp:pkgsrc-openntpd
# svcs -a | grep ntp
disabled Dec_20 svc:/network/ntp:default
- - svc:/svc:/network/ntp:pkgsrc-openntpd
# svccfg delete svc:/svc:/network/ntp:pkgsrc-openntpd
# svccfg delete svc:/network/ntp:pkgsrc-openntpd
# svcs -a | grep ntp
disabled Dec_20 svc:/network/ntp:default
- - svc:/svc:/network/ntp:pkgsrc-openntpd
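The FMRI that svcs reports above carries the svc:/ scheme twice. The manifest that produced it is gone, so the exact cause is unknown. As a rough sketch only, a check along these lines would flag that shape of FMRI before anything touches the repository. The function name has_doubled_scheme is made up for illustration and is not part of SMF or pkgsrc:

```shell
#!/bin/sh
# Hypothetical helper (not an SMF or pkgsrc tool): succeed when an FMRI
# starts with the "svc:/" scheme prefix repeated twice.
has_doubled_scheme() {
    case "$1" in
        svc:/svc:/*) return 0 ;;   # scheme repeated: looks malformed
        *)           return 1 ;;   # anything else passes this one check
    esac
}

# The two FMRIs shown in the svcs output above.
for fmri in 'svc:/network/ntp:default' \
            'svc:/svc:/network/ntp:pkgsrc-openntpd'; do
    if has_doubled_scheme "$fmri"; then
        echo "suspicious FMRI: $fmri"
    else
        echo "ok: $fmri"
    fi
done
```

This only catches the one symptom seen here; it says nothing about whether the rest of a manifest is sane.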
Finally, I had to resort to running /lib/svc/bin/restore_repository and
restoring the backup of the repository from the last boot <sigh>. Fortunately,
after doing that and rebooting, the box seems happy again, albeit with an
unexpected production outage in the middle of the day 8-/. I suppose maybe I
should have tested the package on some other system first, but the last
thing I expected was for a manifest import to cause a complete network
outage :(.
I no longer have the manifest that caused the problem, as I was careful to
delete all traces of it before rebooting the system to make sure it didn't
somehow get introduced again, but I do have a crap load of cores:
-rw------- 1 root root 6945929 Jan 20 15:58 core.svccfg.8281
-rw------- 1 root root 8804517 Jan 20 15:58 core.svc.startd.10
-rw------- 1 root root 9134877 Jan 20 15:58 core.svc.startd.8286
-rw------- 1 root root 9126589 Jan 20 15:58 core.svc.startd.8342
-rw------- 1 root root 9126589 Jan 20 16:01 core.svc.startd.8890
-rw------- 1 root root 9138973 Jan 20 16:01 core.svc.startd.8946
-rw------- 1 root root 9122493 Jan 20 16:01 core.svc.startd.9002
-rw------- 1 root root 9126589 Jan 20 16:02 core.svc.startd.9064
-rw------- 1 root root 9126589 Jan 20 16:02 core.svc.startd.9120
-rw------- 1 root root 9130717 Jan 20 16:02 core.svc.startd.9176
-rw------- 1 root root 9122525 Jan 20 16:02 core.svc.startd.9234
-rw------- 1 root root 9306813 Jan 20 16:03 core.svc.startd.9296
-rw------- 1 root root 9018857 Jan 20 16:03 core.svc.startd.9307
-rw------- 1 root root 9018857 Jan 20 16:03 core.svc.startd.9309
I could probably re-create the manifest I had that failed to import. This
seems like a pretty nasty bug; if there were any syntax or other issues with
the manifest, svccfg should have failed cleanly and not corrupted the
repository :(.
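Until that is fixed, one defensive habit is to run the manifest through "svccfg validate" and import only if that succeeds. A minimal sketch follows, with the caveat that it is not known whether validation would have rejected the manifest in this incident. The wrapper name safe_import and the SVCCFG override variable are assumptions made up for this example:

```shell
#!/bin/sh
# SVCCFG may be overridden (for example for a dry run with a stub);
# it defaults to the real tool. The variable itself is hypothetical.
: "${SVCCFG:=/usr/sbin/svccfg}"

# Hypothetical wrapper: validate a manifest first, import only on success.
# Returns 2 if the file is unreadable, 1 if validation fails, otherwise
# whatever "svccfg import" returns.
safe_import() {
    manifest=$1
    if [ ! -r "$manifest" ]; then
        echo "cannot read manifest: $manifest" >&2
        return 2
    fi
    if ! "$SVCCFG" validate "$manifest"; then
        echo "validation failed, not importing: $manifest" >&2
        return 1
    fi
    "$SVCCFG" import "$manifest"
}
```

It would be called as safe_import "${PKG_PREFIX}/lib/svc/manifest/openntpd.xml", ideally on a test box first.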