Sunday 22 May 2016

Unusual disk failures

Drives always seem to be failing or having problems, more often than I gave them credit for. I had three drives fail at the same time during a simple archive creation process. I wonder if I had a bad batch; then again, the more drives you have, the higher the chance something will fail.


        NAME                         STATE     READ WRITE CKSUM
        data                         DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c0t50004CF210AD1C22d0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0   249
              c0t50004CF210BE51F1d0  DEGRADED     0     0    70
              c4t0d0                 ONLINE       0     0     0
            spare-2                  DEGRADED     1     0     2
              c0t50004CF210BE51F3d0  UNAVAIL      0     0     0
              c4t1d0                 ONLINE       0     0     0
            c0t50004CF210BE5214d0    ONLINE       0     0     0
            c5t3d0                   ONLINE       0     0     0
            c4t3d0                   ONLINE       0     0     0
        spares
          c4t1d0                     INUSE  
          c4t0d0                     INUSE  
     
        NAME                       STATE     READ WRITE CKSUM
        rpool                      DEGRADED     0     0     0
          mirror-0                 DEGRADED     0     0     0
            c0t500A0751F0096E9Ed0  DEGRADED     0     0   196
            c0t500A0751F0097DA7d0  ONLINE       0     0     0

I attempted reading some more data back, and the checksum errors kept climbing:

        NAME                       STATE     READ WRITE CKSUM
        rpool                      DEGRADED     0     0     0
          mirror-0                 DEGRADED     0     0     0
            c0t500A0751F0096E9Ed0  DEGRADED     0     0 1.00K
            c0t500A0751F0097DA7d0  ONLINE       0     0     0


So two drives are degraded due to checksum errors when attempting to read data back, while the other drive just seems to not be powered on at all (the LED on the front is inactive). Why?

I'll re-architect the data pool, but first I'll test out autoreplace and find out how it works (I assume you simply take the disk out, put in the new one, and you're done). It will depend on HW support as well, so best to test this. Made a comment in the Oracle Community - https://community.oracle.com/message/13836284#13836284

        zpool get autoreplace data
        NAME  PROPERTY     VALUE  SOURCE
        data  autoreplace  on     local
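
While I test it, here's a minimal sketch of how I understand the flow, for comparison with the manual path (the replacement device c4t2d0 is hypothetical; the other names come from the pool above):

        # autoreplace=on should mean a new disk found in the same physical
        # slot gets formatted and resilvered automatically (needs HW support)
        zpool set autoreplace=on data

        # the manual equivalent: swap the dead disk for a new one by hand
        # (c4t2d0 is a hypothetical replacement device)
        zpool replace data c0t50004CF210BE51F3d0 c4t2d0

        # watch the resilver; once it completes, the in-use spares should
        # detach back to the spares list
        zpool status -v data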

In the end the mobo was simply faulty (it went on fire beside some small chips by the LSI SAS controller, next to the heat-sink).

Wednesday 11 May 2016

OpenStack meeting

Came back from Fujitsu HQ in London, having heard some of the problems faced by the community and made a few contacts.

A couple of presentations were made by various customers/companies and it looks to be used in quite varied ways. An Ubuntu guy showed it running on top of his laptop with KVM + LXD (also known as "LXC 2.0") with ZFS underneath. I noticed he had 28% fragmentation on his zpool (only 1 vdev), which seems a little odd to me, and I'm wondering why that is. Running many of these LXD "containers", he used an lxc command to take a snapshot (a ZFS snapshot underneath), then ran something like rm -rf / inside the container, and was afterwards able to recover it from the snapshot. https://insights.ubuntu.com/2016/03/22/lxd-2-0-your-first-lxd-container/
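
If I followed the demo correctly, the snapshot-and-recover flow was roughly this (container and snapshot names are my own guesses):

        # take a snapshot of the container (a ZFS snapshot underneath)
        lxc snapshot demo before-wipe

        # ...rm -rf / inside the container...

        # roll the container back to the snapshot
        lxc restore demo before-wipe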

So: Ubuntu - KVM - Systemd - LXD - ZFS - OpenStack

An unusual viewpoint from this Ubuntu guy that seems backwards to me: create LXD containers, each running a different OpenStack service, and then run OpenStack on top of this. The reasoning was that if you have many systems running and want to do a distribution upgrade, you can migrate the OpenStack services inside those containers to another machine, do the upgrade, then migrate everything back... likewise if a disk fails, memory fails, etc.

I get the feeling this is probably being over-complicated, for one. Maybe it can be done more simply, but I haven't really played with how Ubuntu is doing these things.

http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators could be handy for discussing with some of the developers and looking into more use cases; many people around the world are working on this and contributing. I reckon it'll be easiest to see what they've done and go from there.

I heard of problems after a mass deployment across 1,000+ Linux hosts with KVM when attempting upgrades across systems. From what was described, it could be people rushing to use and implement this without first thinking longer term. What problems could we encounter? How do we manage upgrades? How will this scale?

It appears it might be possible to use something other than Neutron for SDN (software-defined networking). We were shown a demo but not told much (it was more about a pretty, bubbly GUI). http://www.nuagenetworks.net/  https://www.youtube.com/watch?v=OjXII11hYwc&feature=youtu.be

Apparently it was a big problem that there is no clean, direct upgrade path from nova-network to Neutron; it required an "entire rebuild". nova-network is deprecated - http://docs.openstack.org/openstack-ops/content/nova-network-deprecation.html

Also, Fujitsu is creating a new type of software based on OpenStack, called "K5", doing "IaaS + PaaS" with 200+ more APIs than base OpenStack, and taking it on as an internal global product to attempt to save millions in the process. (All done on Red Hat & CentOS, nothing else.)

It looks like everyone around is thinking in the same areas in this cloud and container space, and then about how to make money out of it at larger scale across various customers. If so much focus is directed at this, are we missing other aspects that are happening around us? I will be following up on this more.

Monday 2 May 2016

OpenStack beginnings

Lately I've been looking into using OpenStack. From what I have gathered, it looks to be better integrated into Solaris than Linux, although more up-to-date versions are easier to get hold of on Linux.

Part of the OpenStack service "Glance" requires .uar (Unified Archive) files for host deployments, so it is probably the preferred method to use .uar for installation of zones/kernel zones across systems as well, to keep this the same everywhere.

I'm thinking it'd be good to practise the re-install a few times and reverse-engineer what is inside the publicly available .uar from Oracle, which we're using as a test bed. It was generated using 11.3 GA and further steps haven't been described in much detail. I want to customise the install to be lightweight and contain only what we need, to make deployments faster and easier to scale at the same time, if and when we get further down that road.
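For the rebuild side, something like this is what I have in mind, a minimal sketch (the zone name and archive path are hypothetical):

        # create a clone archive from a customised zone, excluding the AI
        # boot media to keep it lightweight (-e / --exclude-media)
        archiveadm create -z oscn-kz -e /export/archives/oscn-kz.uar
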

I'll also have to get a bit more used to the front-end interface and think about what kind of "flavours" we could configure for use (type of zone + resources). I was surprised to discover a bug that no-one seems to have found when using Archives for installation: in the manifest file to install, you have a section like

<software_data action="install">
  <name>{deployable system name}</name>
</software_data>

I was unsure what this <name> tag is for; I figured maybe the zone name, but it turns out it is the deployable system name. The bug means this must match the name embedded in the archive, otherwise the install fails with no useful output in the install log for why, just "list index out of range". It should work with any name, but it is recommended to match the same as the .uar file, which is simple to check with archiveadm info <name of .uar archive file>. I also found other documents with minor typos on archives that I've since had Oracle correct.
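
For example (the archive path is hypothetical):

        # the deployable system name printed here is what the manifest's
        # <name> element has to match exactly
        archiveadm info /export/archives/oscn-kz.uar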

Need to find out what different packages and services are required, how to configure these for the different node types, and how to have this prepared out of the box. Also want to set some options on the FS, like compression on and atime off where possible. One simple problem atm: trying to install from an archive I get "ERROR: Archived zone oscn-uar-kz has no AI media", hmm... yet creating without -e fails with "Archive creation failed: Failed to locate AI media, --exclude-media may be used", so I cannot create without -e... tried another zone and got the same error...
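
For the FS options, this is the kind of thing I mean, a sketch (the dataset name is hypothetical):

        # compression is cheap here, and atime updates are extra writes
        # we don't need
        zfs set compression=on data/archives
        zfs set atime=off data/archives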