Events

Event Types

The OpenSVC daemon generates two types of events.

Event kind “patch”

Changes in the cluster monitors data, presented as json patches.

Those are generated upon merging cluster nodes data. They are the most frequent kind of payload exchanged through the heartbeats.

Example:

{
  "nodename": "aubergine",                            # cluster node
  "kind": "patch",
  "data": [
    [
      ["services", "status", "tstscaler", "monitor"], # key
      {                                               # new value
        "status": "scaling",
        "status_updated": 1539074932.4393582,
        "global_expect_updated": 1539074931.857869,
        "global_expect": null,
        "placement": "leader"
      }
    ]
  ]
}

Event kind “event”

These events are not cluster-wide. They are generated by the daemon threads on critical state changes and orchestration decisions on its local objects.

These events have a dictionnary in the “data” key, with the following sub-keys:

  • id: The event id

  • reason: An event id can be triggered for different reasons, in which case the reason key might be provided to explain the situation.

  • svcname: if specified, the event concerns a service, in which case a snapshot of the service data is also provided in the “service” and “instance” keys.

  • monitor: a snapshot of the node monitor states

Example:

{
  "nodename": "aubergine",
  "kind": "event",
  "data": {
    "id": "instance_thaw",                       # event id
    "reason": "target",                          # event reason
    "svcname": "ha1",
    "monitor": {                                 # node monitor states
      "status": "idle",
      "status_updated": 1539074255.1265483
    },
    "service": {                                 # service aggregated states
      "avail": "up",
      "frozen": "frozen",
      "overall": "warn",
      "placement": "optimal",
      "provisioned": true
    },
    "instance": {                                # service instance states
      "updated": "2018-10-09T08:59:00.317291Z",
      "mtime": 1539075540.317291,
      "app": "default",
      "env": "DEV",
      "placement": "spread",
      "topology": "flex",
      "provisioned": true,
      "running": [],
      "flex_min_nodes": 1,
      "flex_max_nodes": 2,
      "frozen": true,
      "orchestrate": "ha",
      "status_group": {
        "fs": "n/a",
        "ip": "up",
        "task": "n/a",
        "app": "n/a",
        "sync": "n/a",
        "disk": "n/a",
        "container": "n/a",
        "share": "n/a"
      },
      "overall": "warn",
      "avail": "up",
      "optional": "n/a",
      "csum": "95b8b5a953d16be504999612d0159949",
      "monitor": {                               # service instance monitor states
        "status": "idle",
        "status_updated": 1539074254.7616527,
        "global_expect_updated": 1539075568.6204853,
        "local_expect": "started",
        "global_expect": "thawed",
        "placement": ""
      }
    }
  }
}

Daemon Events

Id arbitrator_up

arbitrator {arbitrator} is now reachable

Id arbitrator_down

arbitrator {arbitrator} is no longer reachable

Id blacklist_add

sender {sender} blacklisted

Id hb_beating

node {nodename} hb status stale => beating

Id hb_stale

node {nodename} hb status beating => stale

Id node_config_change

node config change

Id node_thaw

thaw node

Id max_resource_restart

max restart ({restart}) reached for resource {rid} ({resource.label})

Id max_stdby_resource_restart

max restart ({restart}) reached for standby resource {rid} ({resource.label})

Id monitor_started

monitor started

Id resource_toc

toc for resource {rid} ({resource.label}) {resource.status} {resource.log}

Id resource_degraded

resource {rid} ({resource.label}) degraded to {resource.status} {resource.log}

Id resource_restart

restart resource {rid} ({resource.label}) {resource.status} {resource.log}, try {try}/{restart}

Id stdby_resource_restart

start standby resource {rid} ({resource.label}) {resource.status} {resource.log}, try {try}/{restart}

Id service_config_installed

config fetched from node {from} is now installed

Id scale_up

misses {delta} instance to reach scale target {instance.scale}

Id scale_down

exceeds {delta} instance to reach scale target {instance.scale}

Id crash, Reason split

cluster is split, we don’t have quorum: {node_votes}+{arbitrator_votes}/{voting} votes {pro_voters}

Id forget_peer, Reason no_rx

no rx thread still receive from node {peer} and maintenance grace period expired. flush its data

Id instance_abort, Reason target

abort {instance.topology} {instance.avail} instance {instance.monitor.local_expect} action to satisfy the {instance.monitor.global_expect} target

Id instance_delete, Reason target

delete {instance.topology} {instance.avail} instance to satisfy the {instance.monitor.global_expect} target

Id instance_freeze, Reason install

freeze instance on install

Id instance_freeze, Reason merge_frozen

freeze instance on rejoin because instance on {peer} is frozen

Id instance_freeze, Reason target

freeze instance to satisfy the {instance.monitor.global_expect} target

Id instance_provision, Reason target

provision {instance.topology} {instance.avail} instance to satisfy the {instance.monitor.global_expect} target

Id instance_purge, Reason target

purge {instance.topology} {instance.avail} instance to satisfy the {instance.monitor.global_expect} target

Id instance_start, Reason from_ready

start {instance.topology} {instance.avail} instance ready for {since} seconds

Id instance_start, Reason single_node

start idle single node {instance.avail} instance

Id instance_start, Reason target

start {instance.topology} {instance.avail} instance to satisfy the {instance.monitor.global_expect} target

Id instance_stop, Reason flex_threshold

stop {instance.topology} {instance.avail} instance to meet threshold constraints: {up}/{instance.flex_target}

Id instance_stop, Reason target

stop {instance.topology} {instance.avail} instance to satisfy the {instance.monitor.global_expect} target

Id instance_thaw, Reason target

thaw instance to satisfy the {instance.monitor.global_expect} target

Id instance_unprovision, Reason target

unprovision {instance.topology} {instance.avail} instance to satisfy the {instance.monitor.global_expect} target

Id node_freeze, Reason kern_freeze

freeze node due to kernel cmdline flag.

Id node_freeze, Reason merge_frozen

freeze node, node {peer} was frozen while we were down

Id node_freeze, Reason rejoin_expire

freeze node, the cluster is not complete on rejoin grace period expiration

Id node_freeze, Reason target

freeze node

Id node_freeze, Reason upgrade

freeze node for upgrade until the cluster is complete

Id node_thaw, Reason upgrade

thaw node after upgrade, the cluster is complete

Id reboot, Reason split

cluster is split, we don’t have quorum: {node_votes}+{arbitrator_votes}/{voting} votes {pro_voters}

Id resource_would_toc, Reason no_candidate

would toc for resource {rid} ({resource.label}) {resource.status} {resource.log}, but no node is candidate for takeover.

Hooks

Custom scripts can be executed on events. These hooks are defined in the node configuration file.

Example:

[hook#1]
events = all
command = /root/on_event

Events are specified by id only. The keyword accepts multiple ids formatted as a comma-separated list, or the all special value.

The script referenced by the command keyword can get the whole event data on stdin.

Hooks executions are logged in the node log.

Watching Events

In human readable format:

om node events

In machine readable format:

om node events --format json

Waiting for data change events

# test the filter
$ om daemon status --filter "monitor.nodes.nuc-cva.frozen"
0

# already on target => return immediately
$ om node wait --filter "monitor.nuc-cva.frozen=0" --duration 1s

$ echo $?
0

# not going on target => timeout
$ om node wait --filter "monitor.nuc-cva.frozen" --duration 1s
timeout

$ echo $?
1