HP 3PAR replication
Introduction
HP 3Par arrays implement block-level data replication between 2 or more arrays. This product is called “HP 3Par Remote Copy Software”. OpenSVC uses this feature to drive services backed by a remote copy group: it monitors the replication state at the array level, handles the remote copy group “reverse” action, and triggers incremental updates when needed. This kind of service is often used to build metrocluster or geocluster systems. It drastically lowers the time needed to relocate production to another location, with application-level granularity, and improves overall IT availability. The following documentation presents the configuration of such a service. This setup only requires the OpenSVC agent, which is free of charge.
Configuration
Pre-requisites
- 2 Unix servers with OpenSVC agents. This tutorial uses 2 Debian Linux servers named node1 and node2, located at site1 (production) and site2 (disaster recovery) respectively. If you are new to OpenSVC, please consider reading this tutorial. The setup is much the same, except that the nodes are in 2 different sites and the storage is HP 3Par, with remote replication enabled.
- 2 HP 3Par arrays, configured with “HP 3Par Remote Copy Software”. The arrays are named hp3par1.opensvc.com and hp3par2.opensvc.com, located at site1 and site2 respectively.
- 1 OpenSVC service. This tutorial uses service svc1.opensvc.com, which is composed of 1 LVM2 volume group vg_svc1, 1 LVM2 logical volume /dev/mapper/vg_svc1-rootdisk, 1 ext4 filesystem built on that logical volume and mounted on /opt/svc1.opensvc.com, and 1 Linux LXC container located in that filesystem.
- Storage volumes replicated through an HP 3Par remote copy group. The most efficient implementation is to create one HP 3Par remote copy group per application or per OpenSVC service. This is a key factor in being able to relocate at the application level. By convention in this document, we refer to an HP 3Par remote copy group by the acronym RCG:
  - MYRCG is the RCG name in the local HP 3Par array
  - MYRCG.r12345 is the RCG name in the remote HP 3Par array
Server node1 is zoned to array hp3par1.opensvc.com and sees the primary storage volumes. Server node2 is zoned to array hp3par2.opensvc.com and sees the secondary, replicated storage volumes.
Because the agent pilots the storage volume replication, the nodes need to connect to the HP 3Par arrays. You have to choose between two connection methods:
- ssh: the node connects directly to the array over ssh to issue commands
- proxy: the node submits commands through a proxy (see the Proxy configuration section)
The HP 3Par CLI software commands must be installed in the standard location on the nodes running this service resource type.
HP 3Par configuration
To make this tutorial easier, we will use the direct ssh connection to the arrays.
We recommend that you create a dedicated user for the OpenSVC software to run HP 3Par commands in the array. This gives you dedicated logs describing the commands passed by the OpenSVC agent.
When created in the HP 3Par arrays (createuser command), the user is expected to have the Super role, because of the commands used by the OpenSVC agent, setrcopygroup for example.
Once the ssh key pair is created (if needed, use ssh-keygen -b 1024 -t dsa, and be sure to keep an empty passphrase), the public ssh key of the opensvc user has to be registered on the HP 3Par Inserv Storage Server, using the setsshkey -add command.
At this point, you must be able to connect to both arrays, from both nodes, as the dedicated opensvc user, and issue any command in the arrays without a password.
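To make the passwordless requirement concrete, here is a minimal sketch of how one array command could be wrapped in an ssh call. It is an illustration only, not the agent's implementation: the helper name build_array_ssh_argv is hypothetical, and the host, user and key path are the values used in this tutorial.

```python
import shlex

def build_array_ssh_argv(manager, username, key, array_command):
    """Return the argv list for running one array CLI command over ssh."""
    return [
        "ssh",
        "-i", key,              # private key whose public half was registered on the array
        "-o", "BatchMode=yes",  # fail instead of prompting if key authentication does not work
        f"{username}@{manager}",
        array_command,
    ]

argv = build_array_ssh_argv(
    "hp3par1.opensvc.com",
    "opensvc",
    "/home/opensvc/.ssh/id_dsa",
    "showrcopy groups MYRCG",
)
# Print a copy-pastable command line for a manual check from each node.
print(shlex.join(argv))
```

Running the printed command by hand from both nodes, against both arrays, is a quick way to verify the setup before configuring the agent.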
Agent configuration file
First of all, we need to tell the OpenSVC agents how to connect to the HP 3Par arrays. The node or cluster configuration file can describe the connection method to each array. The configuration shown after this list describes our 2 arrays, using the following keywords:
- The name between square brackets [] is the array alias used by OpenSVC. It has to be unique on a per-agent basis. It is also the array name reported in the Target column of the showrcopy groups command output
- type tells OpenSVC that the declared resource is an HP 3Par array
- username is the username used to connect to the HP 3Par array. It must have enough rights in the array to manage the replicated volumes
- manager contains the IP address or fully qualified domain name of the HP 3Par Inserv Storage Server
- key contains the absolute path to the private ssh key used to authenticate username on the array
#
root@node1:~ # om cluster print config
...
[array#hp3par1.opensvc.com]
type = hp3par
username = opensvc
manager = hp3par1.opensvc.com
key = /home/opensvc/.ssh/id_dsa
[array#hp3par2.opensvc.com]
type = hp3par
username = opensvc
manager = 192.168.100.199
key = /home/opensvc/.ssh/id_dsa
With this simple setup, the OpenSVC agent connects to the HP 3Par arrays over ssh, with /home/opensvc/.ssh/id_dsa as the private key file, as the opensvc array user.
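The array sections above follow an ini-like layout, so their content can be inspected with a few lines of Python. The sketch below is only an illustration of how the keywords fit together, assuming the configuration text is available as a string; load_hp3par_arrays is a hypothetical helper, not part of the OpenSVC agent.

```python
import configparser

CONFIG_TEXT = """
[array#hp3par1.opensvc.com]
type = hp3par
username = opensvc
manager = hp3par1.opensvc.com
key = /home/opensvc/.ssh/id_dsa

[array#hp3par2.opensvc.com]
type = hp3par
username = opensvc
manager = 192.168.100.199
key = /home/opensvc/.ssh/id_dsa
"""

def load_hp3par_arrays(text):
    """Return {alias: {keyword: value}} for every array section of type hp3par."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    arrays = {}
    for section in parser.sections():
        kind, _, alias = section.partition("#")
        if kind == "array" and parser.get(section, "type", fallback="") == "hp3par":
            arrays[alias] = dict(parser[section])
    return arrays

arrays = load_hp3par_arrays(CONFIG_TEXT)
for alias, conf in arrays.items():
    # Show which user and endpoint would be used for each array alias.
    print(alias, "->", conf["username"] + "@" + conf["manager"], "with key", conf["key"])
```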
Service configuration file
Now it’s time to tell the OpenSVC software that the service relies on HP 3Par storage volumes.
If you are in a hurry, the config section to append to the service.env file is:
[sync#1]
type = hp3par
mode = async
array@node1 = hp3par1.opensvc.com
array@node2 = hp3par2.opensvc.com
rcg@node1 = MYRCG
rcg@node2 = MYRCG.r12345
sync_max_delay = 5
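The @node1 and @node2 suffixes give each node its own value for the same keyword. As a rough illustration of how such a section reads from each node's point of view (a sketch only; the resolve helper is hypothetical and is not the agent's code):

```python
SYNC_SECTION = {
    "type": "hp3par",
    "mode": "async",
    "array@node1": "hp3par1.opensvc.com",
    "array@node2": "hp3par2.opensvc.com",
    "rcg@node1": "MYRCG",
    "rcg@node2": "MYRCG.r12345",
    "sync_max_delay": "5",
}

def resolve(section, keyword, nodename):
    """Prefer the node-scoped value (keyword@nodename), else the unscoped one."""
    scoped = f"{keyword}@{nodename}"
    if scoped in section:
        return section[scoped]
    return section.get(keyword)

for node in ("node1", "node2"):
    # Each node sees its own array and RCG name, but the same replication mode.
    print(node,
          resolve(SYNC_SECTION, "array", node),
          resolve(SYNC_SECTION, "rcg", node),
          resolve(SYNC_SECTION, "mode", node))
```

With this section, node1 works on MYRCG in hp3par1.opensvc.com while node2 works on MYRCG.r12345 in hp3par2.opensvc.com.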
Keywords
- sync.hp3par
- array
- method
- mode
- rcg
- blocking_post_provision
- blocking_post_start
- blocking_post_startstandby
- blocking_post_stop
- blocking_post_sync_drp
- blocking_post_sync_nodes
- blocking_post_sync_restore
- blocking_post_sync_resync
- blocking_post_sync_update
- blocking_post_unprovision
- blocking_pre_provision
- blocking_pre_start
- blocking_pre_startstandby
- blocking_pre_stop
- blocking_pre_sync_drp
- blocking_pre_sync_nodes
- blocking_pre_sync_restore
- blocking_pre_sync_resync
- blocking_pre_sync_update
- blocking_pre_unprovision
- comment
- disable
- encap
- monitor
- optional
- pg_blkio_weight
- pg_cpu_quota
- pg_cpu_shares
- pg_cpus
- pg_mem_limit
- pg_mem_oom_control
- pg_mem_swappiness
- pg_mems
- pg_vmem_limit
- post_provision
- post_start
- post_startstandby
- post_stop
- post_sync_drp
- post_sync_nodes
- post_sync_restore
- post_sync_resync
- post_sync_update
- post_unprovision
- pre_provision
- pre_start
- pre_startstandby
- pre_stop
- pre_sync_drp
- pre_sync_nodes
- pre_sync_restore
- pre_sync_resync
- pre_sync_update
- pre_unprovision
- provision
- provision_requires
- restart
- restart_delay
- schedule
- shared
- standby
- start_requires
- stop_requires
- subset
- sync_break_requires
- sync_drp_requires
- sync_max_delay
- sync_nodes_requires
- sync_restore_requires
- sync_resync_requires
- sync_update_requires
- tags
- unprovision
- unprovision_requires
OpenSVC Operations
Query service status
On node1 (production side):
root@node1:~ # svc1.opensvc.com print status
svc1.opensvc.com
overall up
|- avail up
| |- container#0 .... up svc1.opensvc.com
| | '- ip#1 ...E up svc1.opensvc.com@eth0
| |- vg#1pr .... up /dev/sdgq, /dev/sdax, /dev/sden, /dev/sdgi
| |- vg#1 .... up vg_svc1
| '- fs#1 .... up /dev/mapper/vg_svc1-rootdisk@/opt/svc1.opensvc.com
|- sync up
| |- sync#i0 .... up rsync svc config to drpnodes, nodes
| '- sync#1 .... up hp3par async MYRCG
'- hb n/a
All resources are up, except hb, which is not used here because the optional OpenHA software is not handling high availability for this service.
On node2 (disaster recovery side):
root@node2:~ # svc1.opensvc.com print status
svc1.opensvc.com
overall down
|- avail down
| |- container#0 .... down svc1.opensvc.com
| | '- ip#1 ...E down svc1.opensvc.com@eth0
| |- vg#1pr .... down /dev/sdfi, /dev/sdej, /dev/sddk, /dev/sdgh
| |- vg#1 .... down vg_svc1
| '- fs#1 .... down /dev/mapper/vg_svc1-rootdisk@/opt/svc1.opensvc.com
|- sync up
| |- sync#i0 .... up rsync svc config to drpnodes, nodes
| '- sync#1 .... up hp3par async MYRCG.r12345
'- hb n/a
All resources are down, except the ones dedicated to synchronisation:
- sync#i0 = up means that node1 and node2 are in sync from the OpenSVC service point of view
- sync#1 = up means that the storage volumes in the HP 3Par RCG named MYRCG.r12345 are in the expected state (async mode, replicating with a 5-minute period)
Service Relocation
High level steps
Some events require that you relocate your production from one site to another (server downtime, power supply downtime, disaster recovery test plan, …). Those events are often painful to plan and to execute. This is where the OpenSVC software helps, making the operation much easier and less stressful for the people involved.
In short, our service is relocated from one datacenter to the other as easily as running the commands below:
Production Side:
svc1.opensvc.com stop
Disaster Recovery Side:
svc1.opensvc.com start
In case of a real disaster, we won’t be able to issue the first command; the second one is enough to start production at the disaster recovery site.
Detailed steps
This chapter details each step needed, with checks and status gathering, to fully understand what happens.
Let’s begin our service relocation by first checking that the production is running fine on the production site:
Production Side : node1@site1:
root@node1:~ # svc1.opensvc.com print status
svc1.opensvc.com
overall up
|- avail up
| |- container#0 .... up svc1.opensvc.com
| | '- ip#1 ...E up svc1.opensvc.com@eth0
| |- vg#1pr .... up /dev/sdgq, /dev/sdax, /dev/sden, /dev/sdgi
| |- vg#1 .... up vg_svc1
| '- fs#1 .... up /dev/mapper/vg_svc1-rootdisk@/opt/svc1.opensvc.com
|- sync up
| |- sync#i0 .... up rsync svc config to drpnodes, nodes
| '- sync#1 .... up hp3par async MYRCG
'- hb n/a
As service is running fine (overall status is up), we can proceed and stop the service.
Production Side : node1@site1:
root@node1:~ # svc1.opensvc.com stop
13:29:15 INFO SVC1.OPENSVC.COM logs from svc1.opensvc.com child service:
13:29:15 INFO SVC1.OPENSVC.COM.CONTAINER#0 lxc-stop -n svc1.opensvc.com -o /var/tmp/svc_svc1.opensvc.com_lxc_stop.log
13:29:16 INFO SVC1.OPENSVC.COM.CONTAINER#0 stop done in 0:00:00.686984 - ret 0 - logs in /var/tmp/svc_svc1.opensvc.com_lxc_stop.log
13:29:16 INFO SVC1.OPENSVC.COM.CONTAINER#0 wait for container down status
13:29:16 INFO SVC1.OPENSVC.COM.FS#1 umount /opt/svc1.opensvc.com
13:29:18 INFO SVC1.OPENSVC.COM.VG#1 vgchange --deltag @node1.opensvc.com vg_svc1
13:29:18 INFO SVC1.OPENSVC.COM.VG#1 output:
Volume group "vg_svc1" successfully changed
13:29:19 INFO SVC1.OPENSVC.COM.VG#1 kpartx -d /dev/vg_svc1/rootdisk
13:29:19 INFO SVC1.OPENSVC.COM.VG#1 vgchange -a n vg_svc1
13:29:19 INFO SVC1.OPENSVC.COM.VG#1 output:
0 logical volume(s) in volume group "vg_svc1" now active
13:29:21 INFO SVC1.OPENSVC.COM.VG#1PR sg_persist -n --out --release --param-rk=0x238170552475005 --prout-type=5 /dev/sdgq
13:29:22 INFO SVC1.OPENSVC.COM.VG#1PR sg_persist -n --out --register-ignore --param-rk=0x238170552475005 /dev/sdgq
13:29:22 INFO SVC1.OPENSVC.COM.VG#1PR sg_persist -n --out --register-ignore --param-rk=0x238170552475005 /dev/sdax
13:29:22 INFO SVC1.OPENSVC.COM.VG#1PR sg_persist -n --out --register-ignore --param-rk=0x238170552475005 /dev/sden
OpenSVC stops the service by shutting down the LXC container, unmounting the filesystem, removing the LVM tag, deleting the logical volume partition mappings, deactivating the LVM volume group, and releasing the SCSI reservations on the HP 3Par array volumes.
We check the service status: every resource is now down, except the replication ones, which is the expected state.
Production Side : node1@site1:
root@node1:~ # svc1.opensvc.com print status
svc1.opensvc.com
overall down
|- avail down
| |- container#0 .... down svc1.opensvc.com
| |- vg#1pr .... down /dev/sdgq, /dev/sdax, /dev/sden, /dev/sdgi
| |- vg#1 .... down vg_svc1
| '- fs#1 .... down /dev/mapper/vg_svc1-rootdisk@/opt/svc1.opensvc.com
|- sync up
| |- sync#i0 .... up rsync svc config to drpnodes, nodes
| '- sync#1 .... up hp3par async MYRCG
'- hb n/a
As replication is asynchronous, we make sure that the same data image is present on both sides (site1 and site2):
Production Side : node1@site1:
root@node1:~ # svc1.opensvc.com sync update
13:30:35 INFO SVC1.OPENSVC.COM.SYNC#I0 won't sync this resource for a service not up
13:30:35 INFO SVC1.OPENSVC.COM.SYNC#1 syncrcopy -w MYRCG
13:30:37 INFO SVC1.OPENSVC.COM.SYNC#1 Completed synchronization for group MYRCG
Note
We are now sure that the same data is physically located in both arrays. We can safely start production at site2 with a guarantee of no data loss (RPO=0).
Disaster Recovery Side : node2@site2:
root@node2:~ # svc1.opensvc.com start
13:32:10 INFO SVC1.OPENSVC.COM.SYNC#1 we are joined with hp3par1.opensvc.com array
13:32:10 INFO SVC1.OPENSVC.COM.SYNC#1 stoprcopygroup -f MYRCG (on hp3par1.opensvc.com)
13:32:11 INFO SVC1.OPENSVC.COM.SYNC#1 setrcopygroup reverse -f -waittask MYRCG.r12345
13:32:16 INFO SVC1.OPENSVC.COM.SYNC#1 setrcopygroup for reverse MYRCG.r12345
reverse started with tasks: 2576
Waiting for tasks to complete
Task 2576 done
13:32:17 INFO SVC1.OPENSVC.COM.VG#1PR sg_persist -n --out --register-ignore --param-sark=0x238170551488311 /dev/sdfi
13:32:17 INFO SVC1.OPENSVC.COM.VG#1PR sg_persist -n --out --register-ignore --param-sark=0x238170551488311 /dev/sdej
13:32:17 INFO SVC1.OPENSVC.COM.VG#1PR sg_persist -n --out --register-ignore --param-sark=0x238170551488311 /dev/sddk
13:32:17 INFO SVC1.OPENSVC.COM.VG#1PR sg_persist -n --out --register-ignore --param-sark=0x238170551488311 /dev/sdgh
13:32:17 INFO SVC1.OPENSVC.COM.VG#1PR sg_persist -n --out --reserve --param-rk=0x238170551488311 --prout-type=5 /dev/sdfi
13:32:22 INFO SVC1.OPENSVC.COM.VG#1 vgchange --addtag @node2.opensvc.com vg_svc1
13:32:23 INFO SVC1.OPENSVC.COM.VG#1 output:
Volume group "vg_svc1" successfully changed
13:32:23 INFO SVC1.OPENSVC.COM.VG#1 vgchange -a y vg_svc1
13:32:23 INFO SVC1.OPENSVC.COM.VG#1 output:
1 logical volume(s) in volume group "vg_svc1" now active
13:32:24 INFO SVC1.OPENSVC.COM.FS#1 e2fsck -p /dev/mapper/vg_svc1-rootdisk
13:32:24 INFO SVC1.OPENSVC.COM.FS#1 output:
/dev/mapper/vg_svc1-rootdisk: clean, 21958/1310720 files, 2799240/5238784 blocks
13:32:24 INFO SVC1.OPENSVC.COM.FS#1 mount -t ext4 -o defaults,discard /dev/mapper/vg_svc1-rootdisk /opt/svc1.opensvc.com
13:32:24 INFO SVC1.OPENSVC.COM.CONTAINER#0 lxc-start -d -n svc1.opensvc.com -o /var/tmp/svc_svc1.opensvc.com_lxc_start.log -f /var/lib/lxc/svc1.opensvc.com/config
13:32:24 INFO SVC1.OPENSVC.COM.CONTAINER#0 start done in 0:00:00.006283 - ret 0 - logs in /var/tmp/svc_svc1.opensvc.com_lxc_start.log
13:32:24 INFO SVC1.OPENSVC.COM.CONTAINER#0 wait for container up status
13:32:24 INFO SVC1.OPENSVC.COM.CONTAINER#0 wait for container ping
13:32:25 INFO SVC1.OPENSVC.COM.CONTAINER#0 wait for container operational
13:32:30 INFO SVC1.OPENSVC.COM logs from svc1.opensvc.com child service:
Note
The first lines of the log show the HP 3Par operations. The OpenSVC agent on node2 confirms the replication relationship with the array on site1 (hp3par1.opensvc.com). It stops the RCG and reverses it, so as to promote the site2 storage volumes to read/write. Once the HP 3Par task is done, node2 puts SCSI reservations on the hp3par2.opensvc.com volumes, adds the LVM tag on vg_svc1, activates the LVM volume group, mounts the filesystem, and starts the LXC container. As you can see in the logs, the whole start takes about 20 seconds (13:32:10 to 13:32:30).
Disaster Recovery Side : node2@site2:
root@node2:~ # svc1.opensvc.com sync resume
13:33:43 INFO SVC1.OPENSVC.COM.SYNC#1 startrcopygroup MYRCG.r12345
Note
Although the service is now running fine on node2@site2, the data replication is not restarted (the HP 3Par RCG is still stopped). That’s why we need to restart the RCG. The OpenSVC sync resume action is made for that, and triggers a startrcopygroup in the HP 3Par array.
Let’s check the service state after relocation at site2:
Disaster Recovery Side : node2@site2:
root@node2:~ # svc1.opensvc.com print status
svc1.opensvc.com
overall up
|- avail up
| |- container#0 .... up svc1.opensvc.com
| | '- ip#1 ...E up svc1.opensvc.com@eth0
| |- vg#1pr .... up /dev/sdfi, /dev/sdej, /dev/sddk, /dev/sdgh
| |- vg#1 .... up vg_svc1
| '- fs#1 .... up /dev/mapper/vg_svc1-rootdisk@/opt/svc1.opensvc.com
|- sync up
| |- sync#i0 .... up rsync svc config to drpnodes, nodes
| '- sync#1 .... up hp3par async MYRCG.r12345
'- hb n/a
If you need to roll back to site1, just use the same commands with the roles of the two nodes swapped. Feel free to contact admin@opensvc.com if you run into trouble implementing this solution.
Note
These actions can be triggered either from the command line or from the OpenSVC collector portal. For massive operations (like tens of services hosted on a single server), you can use “catchall commands” like allupservices/alldownservices/allservices/allprimaryservices/allsecondaryservices to relocate multiple services at once.
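The relocation walked through above boils down to an ordered list of commands per node. The sketch below only restates that sequence as data, for example to feed a runbook. The relocation_plan function and its source_reachable flag are hypothetical names; the commands themselves are the ones used in this section.

```python
def relocation_plan(service, source, target, source_reachable=True):
    """Return the ordered (node, command) steps described in this section."""
    steps = []
    if source_reachable:
        # Planned relocation: stop cleanly, then push the last changes.
        steps.append((source, f"{service} stop"))
        steps.append((source, f"{service} sync update"))
    # Starting on the target stops and reverses the RCG, then starts the service.
    steps.append((target, f"{service} start"))
    if source_reachable:
        # Replication is not restarted by start, so resume it explicitly.
        steps.append((target, f"{service} sync resume"))
    return steps

planned = relocation_plan("svc1.opensvc.com", "node1", "node2")
disaster = relocation_plan("svc1.opensvc.com", "node1", "node2", source_reachable=False)

for node, command in planned:
    print(f"{node}: {command}")
```

In the disaster case the plan shrinks to the single start command on the surviving site, as stated in the high level steps.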
Proxy configuration
Introduction
Consider an infrastructure where servers are segregated in 2 zones, internal and dmz. Every host in the internal LAN can connect to the HP 3Par array, but servers located in the dmz are a problem: ssh traffic needs to be opened from every dmz host to the HP 3Par array, which is located in the internal network. Add the fact that the default role for the opensvc user in the HP 3Par array is very permissive, and this setup is not secure: it greatly increases the risk of data loss if someone manages to get access to the HP 3Par array from inside the dmz.
OpenSVC developed a software called “HP 3Par Proxy” (Source tracked here) to increase the level of security and lower the risk of compromise. This software is provided and maintained by OpenSVC. It is written in Python and works as follows: it listens for incoming connections from OpenSVC agents and checks whether each request is allowed. It denies access if the request does not match a config file entry, or forwards the command to the HP 3Par array if access is allowed, then sends the array answer back to the OpenSVC agent as a JSON data structure.
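The check, deny or forward, then answer flow can be sketched in a few lines. This is not the proxy's source code: handle_request and fake_run are hypothetical names, the array call is replaced by a stand-in, and the rule table uses the (service, uuid, array) key layout of the config.py example in this document.

```python
import json

def handle_request(creds, run, service, node_uuid, array, command):
    """Check the request against creds, then deny it or forward it through run()."""
    allowed = creds.get((service, node_uuid, array), [])
    if command not in allowed:
        answer = {
            "err": "this command is not allowed for this node-service-array id",
            "ret": 1,
            "out": "",
        }
    else:
        ret, out, err = run(array, command)
        answer = {"err": err, "ret": ret, "out": out}
    return json.dumps(answer)

def fake_run(array, command):
    """Stand-in for the real array call: returns (ret, out, err)."""
    return 0, f"output of '{command}' on {array}", ""

creds = {
    ("dmzsvc1.dmz.opensvc.com", "3b2c325d-4321-6789-b32f-b987654cb092874a", "hp3par1.opensvc.com"): [
        "showrcopy groups RCG.SVC1",
        "showrcopy links",
    ],
}
key = ("dmzsvc1.dmz.opensvc.com", "3b2c325d-4321-6789-b32f-b987654cb092874a", "hp3par1.opensvc.com")

print(handle_request(creds, fake_run, *key, "showrcopy links"))
print(handle_request(creds, fake_run, *key, "stoprcopygroup -f RCG.SVC1"))
```

Matching on the exact command string keeps the rule table easy to audit: a dmz node can only run the commands listed for its own service and array.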
Prerequisites
- dmz/firewalled servers installed with the OpenSVC agent, and OpenSVC services relying on HP 3Par storage volumes
- a firewall rule allowing every dmz server to connect over https to the proxy service IP address on the internal LAN
- HP 3Par Proxy software (provided by OpenSVC), integrated as an OpenSVC service somewhere on the internal LAN
- HP 3Par command line utilities, installed on the node where the proxy is running
Configuration
Below is an example of config.py:
cli = "/opt/3PAR/inform_cli_3.1.2/bin/cli"
ssl_key = "/srv/svcproxy.opensvc.com/ssl/server.key"
ssl_crt = "/srv/svcproxy.opensvc.com/ssl/server.crt"
access_log = "/srv/svcproxy.opensvc.com/log/access.log"
error_log = "/srv/svcproxy.opensvc.com/log/error.log"
pwf = {
    "hp3par1.opensvc.com": "/path/to/hp3par1.opensvc.com.credentials",
    "hp3par2.opensvc.com": "/path/to/hp3par2.opensvc.com.credentials",
}
creds = {
    ("dmzsvc1.dmz.opensvc.com", "3b2c325d-4321-6789-b32f-b987654cb092874a", "hp3par1.opensvc.com"): [
        "showrcopy groups RCG.SVC1",
        "showrcopy links"
    ],
    ("svc2.dmz.opensvc.com", "2a3b4e5d-9876-1234-b32r-d12349dca099812b", "hp3par2.opensvc.com"): [
        "showrcopy groups RCG.SVC2",
        "showrcopy links"
    ]
}
The first keyword, cli, tells the proxy software the full path to the HP 3Par cli command.
The ssl_key and ssl_crt parameters specify the SSL key and certificate presented to the https clients embedded in the OpenSVC agents.
The access_log and error_log keywords set the files where accesses and errors of the HP 3Par proxy are logged.
The pwf section lists all the HP 3Par arrays known to the proxy software. Each key is the fully qualified domain name of the HP 3Par Inserv host. Each value is the full path to the credentials file used to make a passwordless connection to the array. (You can generate this file with the command setpassword -saveonly -file /path/to/hp3par1.opensvc.com.credentials user1, assuming you want the proxy software to use the user1 user in the array.)
The creds section lists all authorized commands. The previous example has 2 rules:
- the server identified by OpenSVC uuid 3b2c325d-4321-6789-b32f-b987654cb092874a is allowed to run showrcopy groups RCG.SVC1 and showrcopy links on array hp3par1.opensvc.com for OpenSVC service dmzsvc1.dmz.opensvc.com
- the server identified by OpenSVC uuid 2a3b4e5d-9876-1234-b32r-d12349dca099812b is allowed to run showrcopy groups RCG.SVC2 and showrcopy links on array hp3par2.opensvc.com for OpenSVC service svc2.dmz.opensvc.com
This file is deliberately simple and is not enough to make the OpenSVC agent work with HP 3Par arrays. Instead, use the template file available in the tar.gz archive.
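Since every creds rule names an array, and the proxy can only reach arrays that have a credentials file in pwf, a small consistency check can catch typos before the proxy is started. The helpers below are hypothetical and not shipped with the proxy; they simply walk the two dictionaries of the example.

```python
pwf = {
    "hp3par1.opensvc.com": "/path/to/hp3par1.opensvc.com.credentials",
    "hp3par2.opensvc.com": "/path/to/hp3par2.opensvc.com.credentials",
}
creds = {
    ("dmzsvc1.dmz.opensvc.com", "3b2c325d-4321-6789-b32f-b987654cb092874a", "hp3par1.opensvc.com"): [
        "showrcopy groups RCG.SVC1",
        "showrcopy links",
    ],
    ("svc2.dmz.opensvc.com", "2a3b4e5d-9876-1234-b32r-d12349dca099812b", "hp3par2.opensvc.com"): [
        "showrcopy groups RCG.SVC2",
        "showrcopy links",
    ],
}

def arrays_missing_credentials(pwf, creds):
    """Return the arrays named in creds rules that have no pwf entry."""
    return sorted({array for (_service, _uuid, array) in creds if array not in pwf})

def rules_by_array(creds):
    """Group the authorized commands by array, then by (service, uuid)."""
    grouped = {}
    for (service, node_uuid, array), commands in creds.items():
        grouped.setdefault(array, {})[(service, node_uuid)] = list(commands)
    return grouped

print("arrays without credentials:", arrays_missing_credentials(pwf, creds))
for array, rules in rules_by_array(creds).items():
    print(array, "->", len(rules), "rule(s)")
```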
Example of refused command
The proxy directly tells the requesting agent that the operation failed. Return code = 1
{"err": "this command is not allowed for this node-service-array id", "ret": 1, "out": ""}
Example of allowed command
After authorizing a request from an agent, the proxy runs the command on the array and sends the answer back to the OpenSVC agent. Return code = 0
{"err": "", "ret": 0, "out": "RCG.SVC1,hp3par1.opensvc.com,Started,Primary,Periodic,\"Last-Sync 2014-03-25 16:12:47 CET , Period 5m, auto_recover,over_per_alert\"\n ,VV_SVC1_ROOT,31110,VV_SVC1_ROOT,31079,Synced,2014-03-25 16:12:48 CET\n\n"}
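On the agent side, an answer like the one above can be decoded with the standard json and csv modules. The sketch below is an illustration under assumptions: the field names (name, target, status, role, mode, options, then local and remote volume, state and last sync time) are inferred from this single sample, and parse_answer is a hypothetical helper.

```python
import csv
import json

# The allowed-command answer shown above, kept as raw text.
ANSWER = r'''{"err": "", "ret": 0, "out": "RCG.SVC1,hp3par1.opensvc.com,Started,Primary,Periodic,\"Last-Sync 2014-03-25 16:12:47 CET , Period 5m, auto_recover,over_per_alert\"\n ,VV_SVC1_ROOT,31110,VV_SVC1_ROOT,31079,Synced,2014-03-25 16:12:48 CET\n\n"}'''

def parse_answer(text):
    """Decode a proxy answer into a group description and its volume rows."""
    answer = json.loads(text)
    if answer["ret"] != 0:
        raise RuntimeError(answer["err"])
    # Drop the empty rows produced by the trailing blank lines.
    rows = [row for row in csv.reader(answer["out"].splitlines()) if row]
    group_row, volume_rows = rows[0], rows[1:]
    group = {
        "name": group_row[0],
        "target": group_row[1],
        "status": group_row[2],
        "role": group_row[3],
        "mode": group_row[4],
        "options": [opt.strip() for opt in group_row[5].split(",")],
    }
    volumes = [
        {"local": row[1], "remote": row[3], "state": row[5], "last_sync": row[6]}
        for row in volume_rows
    ]
    return group, volumes

group, volumes = parse_answer(ANSWER)
print(group["name"], group["status"], group["role"], group["mode"])
print(volumes)
```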
Command set
- start
Checks whether the local array is primary or secondary.
* If primary, just activate the replication state monitoring.
* If secondary, break and reverse the data replication. Equivalent to stoprcopygroup -f RCG.local and setrcopygroup reverse -f -waittask RCG.remote. The devices are promoted to read-write access. Replication is not restarted; you need to use sync resume for that purpose (we want to be able to test data at the secondary site without impacting data on the primary site).
- sync update
While in asynchronous replication mode, triggers an immediate incremental data replication to the remote array. Equivalent to syncrcopy -w RCG in the array. For example, it can be useful to ensure data consistency on the remote array before triggering snapshots. Useless in synchronous mode.
- sync break
Stops the RCG. Equivalent to stoprcopygroup -f RCG.local.
- sync resume
Starts the RCG. Equivalent to startrcopygroup RCG.local.
- sync swap
Only allowed on the secondary array. It stops, then reverses, then starts the RCG. You are strongly advised to use this command only when the application is stopped.
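The equivalences listed above can be summarized as a lookup from action to array commands. This sketch only restates the text and the relocation log shown earlier. The function name, the "start on secondary" label and the rcg/peer_rcg parameters are hypothetical; sync swap is left out because the text does not spell out its exact commands.

```python
def array_commands(action, rcg, peer_rcg=None):
    """Return the array CLI commands given as equivalents of an OpenSVC action.

    rcg is the RCG name in the array local to the node, peer_rcg its name in
    the other array (only needed when starting on the secondary side).
    """
    if action == "sync update":
        return [f"syncrcopy -w {rcg}"]
    if action == "sync break":
        return [f"stoprcopygroup -f {rcg}"]
    if action == "sync resume":
        return [f"startrcopygroup {rcg}"]
    if action == "start on secondary":
        if peer_rcg is None:
            raise ValueError("peer_rcg is required to start on the secondary side")
        # Same order as in the relocation log: stop the group in the other
        # array, then reverse the group in the array local to the node.
        return [
            f"stoprcopygroup -f {peer_rcg}",
            f"setrcopygroup reverse -f -waittask {rcg}",
        ]
    raise ValueError(f"no equivalence documented for action: {action}")

for command in array_commands("start on secondary", "MYRCG.r12345", peer_rcg="MYRCG"):
    print(command)
```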
Status
- up
The last replication occurred less than sync_max_delay minutes ago. The replication is in the expected mode (async or sync).
- warn
Any of the following:
* The last replication occurred more than sync_max_delay minutes ago.
* The RCG is not in “Started” state.
* The RCG is “async” and not defined as “Periodic”.
* The RCG is “async”, defined as “Periodic”, without any “Period” set in the array.
* The RCG option “auto_recover” is not set.
* One or more volumes are not in the “Synced” state.
- down
The RCG is in an unexpected state or not present in the array.
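The status rules above can be read as a small decision function. The following is a sketch of that reading, not the agent's implementation: sync_status and the dictionary layout of group and volumes are assumptions made for the example, and only the "not present in the array" case of down is modelled.

```python
from datetime import datetime, timedelta

def sync_status(group, volumes, now, sync_max_delay, mode="async"):
    """Return (status, reasons) following the up/warn/down rules listed above."""
    if group is None:
        return "down", ["RCG not present in the array"]
    reasons = []
    if now - group["last_sync"] > timedelta(minutes=sync_max_delay):
        reasons.append("last replication older than sync_max_delay")
    if group["status"] != "Started":
        reasons.append("RCG is not Started")
    if mode == "async":
        if group["mode"] != "Periodic":
            reasons.append("async RCG is not Periodic")
        elif not any(opt.startswith("Period ") for opt in group["options"]):
            reasons.append("Periodic RCG has no Period set")
    if "auto_recover" not in group["options"]:
        reasons.append("auto_recover is not set")
    if any(vol["state"] != "Synced" for vol in volumes):
        reasons.append("at least one volume is not Synced")
    return ("warn" if reasons else "up"), reasons

group = {
    "status": "Started",
    "mode": "Periodic",
    "options": ["Period 5m", "auto_recover", "over_per_alert"],
    "last_sync": datetime(2014, 3, 25, 16, 12, 47),
}
volumes = [{"state": "Synced"}]

fresh = sync_status(group, volumes, datetime(2014, 3, 25, 16, 15, 0), sync_max_delay=5)
stale = sync_status(group, volumes, datetime(2014, 3, 25, 16, 30, 0), sync_max_delay=5)
missing = sync_status(None, [], datetime(2014, 3, 25, 16, 15, 0), sync_max_delay=5)
print(fresh, stale, missing, sep="\n")
```

Returning the list of reasons alongside the status makes a warn state easier to diagnose than a bare status word.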