This document introduces the Sipxcom high availability (HA) capabilities and describes how to deploy them in a production network. The work was done using the recent 16.12 ISO release available from download.sipxcom.com. Users are cautioned to plan and test thoroughly, because you are working with replicated, real-time Mongo databases. In preparing this document, voice servers were rebuilt several times from fresh installs to eliminate configuration and other issues encountered while testing this service. If you wish to reset the system without a re-installation, then follow the procedure at http://wiki.sipxcom.org/display/sipXcom/Reset+Entire+System precisely. This primer assumes that the administrator is experienced with the standalone version of Sipxcom and knowledgeable about the details of Polycom phone configurations.
The following diagram provides an overview of the test network configuration. In the phase 1 testing, DNS is configured so that all phone registrations land on pbx3. The primary pbx is failed first, and then incoming and outgoing internal and external calls are tested along with basic features. Note that services such as voicemail can only run on the primary server. The secondary pbx2 server is then failed and the tests are repeated. The pbx1 and pbx2 servers are then brought back online to confirm that the high availability system recovers and all feature capability is restored.
For phase 2, pbx3 with IP address 10.20.2.33 is replaced with a server with IP address 10.10.17.10 on the separate subnetwork. The same test strategy for phase 1 is then used for phase 2. The router is instrumented with a bandwidth measurement tool to estimate the amount of traffic used to replicate state information from the voice servers in the primary 10.20.2.x subnetwork to the secondary 10.10.17.x subnetwork. This insight is important to understand the WAN bandwidth impact if the intent is to geographically distribute some of the secondary voice servers in multiple locations.
By design, Sipxcom always uses its built-in DNS server for querying SRV records. The Sipxcom unmanaged DNS server option, when enabled, allows an unmanaged DNS server to be queried for phone registrations. In testing the Sipxcom HA solution in release 16.12, difficulties were encountered using the managed DNS tools of Sipxcom. This document http://wiki.sipxcom.org/display/sipXcom/DNS+Management provides a good overview of the Sipxcom managed DNS capabilities and how they should work. Significant investments were made to make the Sipxcom DNS management tools work with the high availability solution. The HA solution performed best in the lab when Sipxcom used a DNS server that was completely separate from Sipxcom, i.e. DNS services on each of the Sipxcom servers were disabled. The first part of this document assumes that Sipxcom uses a separate DNS server for all SRV record processing. To do this, careful attention must be paid to how the Sipxcom primary and secondary servers are built, and the DNS service must be prevented from being enabled at startup. The second part of this document describes how the HA solution works using the managed DNS tools available in Sipxcom.
Note: eZuce recommends DNS be configured on each server. Separate DNS servers require manual management and can degrade Proxy & Registrar performance. Where phones get DNS from is a wholly different issue than where the servers get DNS from.
These next few subsections assume that a new Sipxcom server is being built from scratch, a separate DNS server is used for all Sipxcom DNS processing, and that DNS is turned off on all Sipxcom servers.
Build a standalone DNS server using this document as a guide http://wiki.sipxcom.org/display/sipXcom/DNS+Concepts+for+sipXcom. To simplify effort, do the following:
Test from another Unix client machine to confirm that the Sipxcom SRV records are returned and that pings to pbx1.lvtest.com, pbx2.lvtest.com, pbx3.lvtest.com, and www.google.com resolve to the correct addresses.
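The address check above can also be scripted. The following is a minimal Python sketch, not part of the original lab procedure: the function name `find_mismatches` is hypothetical, and the only address taken from this document is the phase 1 pbx3 address (10.20.2.33). Add your own pbx1 and pbx2 addresses to the table before using it.

```python
import socket
from typing import Callable, Dict

# Expected addresses. Only the phase 1 pbx3 address comes from this document;
# extend the table with the addresses used in your own network.
EXPECTED = {
    "pbx3.lvtest.com": "10.20.2.33",
}


def find_mismatches(
    expected: Dict[str, str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, str]:
    """Return {hostname: what_we_got} for every host that does not resolve
    to its expected address. An empty dict means every lookup matched."""
    mismatches = {}
    for host, want in expected.items():
        try:
            got = resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            got = "lookup failed: %s" % exc
        if got != want:
            mismatches[host] = got
    return mismatches


if __name__ == "__main__":
    # Uses whatever resolver the client machine is configured with.
    for host, got in find_mismatches(EXPECTED).items():
        print("%s: expected %s, got %s" % (host, EXPECTED[host], got))
```

The `resolve` parameter exists only so the check can be exercised without a live DNS server; on a real client the default resolver is used.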
Build the primary Sipxcom server using the 10.20.2.35 IP address as the primary DNS address in the Ethernet network settings. Otherwise, Sipxcom will override any manual entries in /etc/resolv.conf whenever the primary Sipxcom server is restarted. When the fresh Sipxcom system is started, check the primary External DNS setting to confirm that it points to the standalone DNS server built in the previous step; otherwise, /etc/resolv.conf will need to be updated to point to the 10.20.2.35 standalone DNS server after every reboot. Start all necessary Sipxcom services except Sipxbridge, which is not supported in the HA configuration.
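Because /etc/resolv.conf can be rewritten on restart, it is worth checking after each reboot. The sketch below is an illustration only: the helper names `nameservers` and `uses_standalone_dns` are hypothetical, and it assumes the standard resolv.conf layout of `nameserver <address>` lines.

```python
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Extract nameserver addresses, in file order, from resolv.conf text."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def uses_standalone_dns(resolv_conf_text: str, dns_ip: str = "10.20.2.35") -> bool:
    """True if the first nameserver listed is the standalone DNS server."""
    servers = nameservers(resolv_conf_text)
    return bool(servers) and servers[0] == dns_ip


if __name__ == "__main__":
    with open("/etc/resolv.conf") as handle:
        print("standalone DNS is first:", uses_standalone_dns(handle.read()))
```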
Go to the System - > NAT Traversal - > Settings menu and disable the Enable NAT Traversal and Server Behind NAT functionality. The Sipxcom high availability solution is only certified to use unmanaged gateways such as SBCs and ISDN/SIP gateways - Sipxbridge does not work properly in this environment.
In a high availability configuration, the Sipxcom active phone registrations page does not indicate which voice server the phones are registered to, and the DNS server in Sipxcom by default spreads the phone registrations across all 3 servers. For phase 2 testing, where replication bandwidth to a remote server is measured, the best approach is to have all phones registered onto pbx3. In the lab setup, the _sip._tcp, _sip._udp, and _sips._tcp DNS SRV records are weighted so that phones register to pbx3 first, then pbx2, and then the pbx primary proxy. Run dig queries for the SRV records from a client machine pointed to the 10.20.2.35 DNS server to confirm that pbx3 appears first in the response. Also weight the SRVs for resource records and test.
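To see why the SRV values determine the order in which phones try the servers, the sketch below follows the selection procedure described in RFC 2782: lower priority values are tried first, and weight only influences the order among records that share a priority. The priority and weight numbers in the example are hypothetical, since the lab's actual values are not listed here, and the function name `order_srv_targets` is invented for illustration.

```python
import random
from typing import List, Optional, Tuple

# (priority, weight, target)
SrvRecord = Tuple[int, int, str]


def order_srv_targets(records: List[SrvRecord],
                      rng: Optional[random.Random] = None) -> List[str]:
    """Return targets in the order an RFC 2782 client would try them."""
    rng = rng or random.Random()
    ordered = []
    for priority in sorted({rec[0] for rec in records}):
        group = [rec for rec in records if rec[0] == priority]
        while group:
            # Zero-weight records go first so they are rarely selected.
            group.sort(key=lambda rec: rec[1] != 0)
            total = sum(rec[1] for rec in group)
            pick = rng.randint(0, total)
            running = 0
            for index, rec in enumerate(group):
                running += rec[1]
                if running >= pick:
                    ordered.append(group.pop(index)[2])
                    break
    return ordered


if __name__ == "__main__":
    # Hypothetical values: distinct priorities force a strict failover order.
    lab = [
        (30, 10, "pbx1.lvtest.com"),
        (10, 10, "pbx3.lvtest.com"),
        (20, 10, "pbx2.lvtest.com"),
    ]
    print(order_srv_targets(lab))
```

With distinct priorities the order is deterministic. If all three records shared one priority, the weights would only bias a random choice, and registrations would still spread across the servers.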
Validate that the registry rr IN records in the /var/named/dnsfile directory are correctly configured and that the DNS A records have the correct IP address for each pbx. For phase 1 testing, the pbx3 IP address is 10.20.2.33, and the phase 2 pbx3 IP address is 10.10.17.10.
Go into the System - > Servers - > Core menu, and turn off the DNS service on the primary Sipxcom proxy.
SSH to the primary server, and perform the following:
Go into System - > Servers and click on the top right hand side to add a new server. Fill in the Hostname, IP Address, and Description fields and click OK. The server will show up in the list of servers with the Status field initially set to Uninitialized. Use the administration ID assigned by Sipxcom to build your secondary voice servers from the ISO. As in the previous step, confirm that the nameserver address used in the CentOS Ethernet interface definition menu points to 10.20.2.35; otherwise /etc/resolv.conf will need to be updated after every secondary Sipxcom reboot.
After the ISO is installed, sipxecs-setup is automatically invoked. Point the primary server to 10.10.17.10 and provision the administration ID (in this case 6 for pbx3) in the setup script. After the script completes, go back to the System - > Servers menu. The Status field should change from Uninitialized to Configured. Repeat this process for each secondary server or arbiter assigned to the system.
Go into the System - > Servers - > Telephony section and turn on the SIP Proxy and SIP Registrar services for each secondary server added to the HA system.
Go to System - > Databases and add secondary servers to the list of global databases - it will take 60-90 seconds for each database to be added and correctly synchronized. If you are seeing multiple errors or having difficulties getting the server added to the list of global databases, try upgrading the computing platforms being used for the servers.
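One way to confirm that the databases have finished synchronizing is to inspect the replica set status from the mongo shell. The sketch below assumes that the global databases form a MongoDB replica set and that the output of `rs.status()` has been captured as JSON and loaded into a Python dict. The function name and the member names in the example are hypothetical; the `members`, `name`, `health`, and `stateStr` fields are those reported by MongoDB.

```python
from typing import Any, Dict, List

# States that indicate a member has finished its initial sync.
READY_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}


def members_not_ready(status: Dict[str, Any]) -> List[str]:
    """Return the names of replica set members that are unhealthy or
    still in a transitional state such as STARTUP2 or RECOVERING."""
    return [
        member.get("name", "<unnamed>")
        for member in status.get("members", [])
        if member.get("health") != 1 or member.get("stateStr") not in READY_STATES
    ]


if __name__ == "__main__":
    # Hypothetical snapshot taken while pbx3 is still synchronizing.
    snapshot = {
        "members": [
            {"name": "pbx1.lvtest.com:27017", "health": 1, "stateStr": "PRIMARY"},
            {"name": "pbx2.lvtest.com:27017", "health": 1, "stateStr": "SECONDARY"},
            {"name": "pbx3.lvtest.com:27017", "health": 1, "stateStr": "STARTUP2"},
        ]
    }
    print(members_not_ready(snapshot))
```

An empty list suggests that every member has joined; a member that stays in the list well beyond the 60-90 seconds mentioned above is worth investigating.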
In the lab setup, phones are manually provisioned with IP address, TFTP, SNTP, and DNS service addresses - a lab phone group was defined on the voice server and assigned to the test phones. The settings are as follows:
These next few subsections assume that a new Sipxcom server is being built from scratch, and that the managed DNS tools and onboard DNS servers are enabled in Sipxcom. The standalone DNS services in the test setup (10.20.2.35 and 10.10.17.35) will still be used as an unmanaged DNS service for site phone registrations.
Build the standalone DNS servers 10.20.2.35 and 10.10.17.35 as per the previous section. Adjust the weights of the tcp, udp, and rr service records on the 10.20.2.35 DNS server so that phones on the 10.20.2.x subnet register to the pbx2 server while phones on the 10.10.17.x subnet register to the pbx3 server (tcp records illustrated below).
Build primary Sipxcom server with a valid upstream DNS forwarder address (e.g. 188.8.131.52). Once the primary Sipxcom server has been built, turn on all services except for Sipxbridge and DHCP (in the lab phones were statically provisioned). Sipxcom builds the following DNS settings in /etc/named.conf, /etc/resolv.conf, and /var/named/default.view.lvtest.com.zone.
Set NAT traversal settings exactly like NAT traversal settings in previous section with standalone DNS servers.
Add secondary servers and roles exactly like the previous section with standalone DNS servers.
Add secondary servers to Global Databases exactly like the previous section with standalone DNS servers.
Once pbx2 and pbx3 are built and successfully added as secondary servers to the Global Databases, then perform the following:
After the high availability cluster is configured and services defined, the DNS configuration on each server should look as follows:
The System - > Regions and System - > DNS - > Record View features within Sipxcom create separate DNS zone files for each subnetwork. The following architecture and registration rules will be used to build DNS regions and failover rules within Sipxcom.
The first step is to define two regions within Sipxcom - one is called Main1020 with an IP address range of 10.20.2.x/24 and the other region is called Local1010 with an IP address range of 10.10.17.x/24. The System - > DNS - > Record View menu will map the region to the failover plan.
Go to System - > DNS Fail-over Plans and create two plans:
Now go into the System - > DNS - > Record View and build two plans
The Sipxcom DNS tools build the following DNS configuration in the /etc/named.conf file. DNS queries from the 10.20.2.x subnetwork use the pbx1020 zone file, which always returns pbx2 SRV records, while DNS queries from the 10.10.17.x subnetwork use the pbx1010 zone file, which always returns pbx3 SRV records.
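The effect of this per-subnetwork arrangement can be modeled in a few lines. The following Python sketch is an illustration of the selection logic only, not the generated named.conf: it maps a client address to the region and registrar described above, using the region names and ranges defined earlier in this section. The table layout and function name are hypothetical.

```python
import ipaddress
from typing import Optional, Tuple

# (region name, client subnet, registrar returned in SRV answers)
REGIONS = [
    ("Main1020", ipaddress.ip_network("10.20.2.0/24"), "pbx2"),
    ("Local1010", ipaddress.ip_network("10.10.17.0/24"), "pbx3"),
]


def registrar_for(client_ip: str) -> Optional[Tuple[str, str]]:
    """Return (region, registrar) for a querying client, or None when the
    client falls outside both regions and would get the default view."""
    address = ipaddress.ip_address(client_ip)
    for region, subnet, registrar in REGIONS:
        if address in subnet:
            return region, registrar
    return None


if __name__ == "__main__":
    for ip in ("10.20.2.50", "10.10.17.50", "192.0.2.10"):
        print(ip, "->", registrar_for(ip))
```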
ssh into each primary and secondary server, and double-check the following:
The StarTrinity SIP Tester tool can be downloaded from https://startrinity.com/VoIP/SipTester/SipTester.aspx and is a Windows-based application that can perform a variety of tests against a voice server. In the phase 1 and 2 testing, the key Siptester feature used was the Registration (UAC) batch capability. It simulates the flood of registration requests that phones send to a SIP proxy after a catastrophic failure, such as a proxy reboot or a managed WAN connection restart, where the phones are in one location but the voice server proxies are in a centralized data center. The assumption is that this burst of registration requests generates peak replication traffic from the primary proxy to the secondary servers.
The Siptester steps used in the phase 2 measurements include the following:
When the Delete All symbolic link in Siptester is selected, the tool will instruct Sipxcom to un-register each line. The Sipxcom Active registrations page may still show active or expired registrations. In that case, ssh to the Sipxcom primary server and use the following procedure to clear out all active registrations.
An initial round of phase 1 test calls was conducted. With all lab phones configured to register to pbx3 running as a secondary server, pbx1 was brought down. Internal calls between phones registered on pbx3 were successfully placed, as was internal and external call forwarding. External calls were also successfully placed. An incoming external call was placed but failed because a SIP 404 Not Found was returned from the pbx3 proxy. As expected, voicemail, MWI, and auto attendant functionality were not available when the primary pbx proxy was unavailable.
One issue encountered when all 3 HA servers were operational involved an external call. When the Polycom phone ended the call, the secondary pbx3 proxy issued a 403 noaccess response to the primary pbx proxy, and the pbx never issued a BYE to the ITSP to release the call. The bridged line appearance (BLA) issue identified in the standalone version of Sipxcom also appears in the HA configuration. Provisioning the SIP server field on the phones gets BLA working again, but those phones then do not work when the primary proxy fails. A JIRA has been opened on this issue.
A three server Sipxcom HA solution was configured, with pbx1 and pbx2 on one subnetwork, and pbx3 on a second subnetwork. All three servers were defined as global databases. When regions were defined and pbx3 was defined as a local database, the registrar process on pbx3 could not be started. The StarTrinity SIP tester tool https://startrinity.com/VoIP/SipTester/SipTester.aspx was used to flood the pbx3 server with Sipxcom user registration requests for 25, 50, 100, and 500 users. The registration requests were placed from a desktop server in two locations:
In scenario 1, when phones are registered to pbx3 from the SIP tester on the separate 10.20.2.x subnetwork, 10-40 Kbps of bandwidth is generated in replication traffic and phone registrations every few seconds. When 100 or 500 phones immediately register to pbx3, 1-3 megabits of traffic per second is generated for several seconds, consisting of phone registrations and state replication from pbx3 to pbx2 and pbx1. In the attached graph, there are two peaks in each test. The first peak represents the phones registering to pbx3 and the second peak represents the traffic to de-register the phones. The Siptester tool has the capability to delete all phone registrations simultaneously.
In scenario 2, when phones are registered to pbx3 from the SIP tester on the same 10.10.17.x subnetwork, 5-10 Kbps of bandwidth is generated in replication traffic to the primary and secondary servers in the 10.20.2.x subnetwork. When 100 or 500 phones immediately register to pbx3, approximately 1 megabit per second of bandwidth (roughly 125 kilobytes per second) is generated for several seconds, destined for the primary and secondary servers on the 10.20.2.x subnetwork. This is replication traffic only and not user registrations.
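When sizing a WAN link from these measurements, it helps to convert the quoted bit rates into bytes and into total volume for a burst. The sketch below is simple arithmetic using decimal units (1 megabit = 1,000,000 bits, 1 kilobyte = 1,000 bytes). The function names are hypothetical, and the 5-second burst duration in the example is an assumption standing in for "several seconds".

```python
def mbps_to_kilobytes_per_second(mbps: float) -> float:
    """Convert megabits per second to kilobytes per second (decimal units)."""
    return mbps * 1_000_000 / 8 / 1_000


def kbps_to_kilobytes_per_second(kbps: float) -> float:
    """Convert kilobits per second to kilobytes per second (decimal units)."""
    return kbps / 8


def burst_volume_kilobytes(mbps: float, seconds: float) -> float:
    """Total data moved by a burst of the given rate and duration."""
    return mbps_to_kilobytes_per_second(mbps) * seconds


if __name__ == "__main__":
    # Scenario 2 peak: about 1 Mbps of replication traffic.
    print(mbps_to_kilobytes_per_second(1))   # 125.0
    # Scenario 1 peak: up to 3 Mbps of registrations plus replication.
    print(mbps_to_kilobytes_per_second(3))   # 375.0
    # Scenario 1 background: up to 40 Kbps.
    print(kbps_to_kilobytes_per_second(40))  # 5.0
    # Assumed 5-second burst at 1 Mbps.
    print(burst_volume_kilobytes(1, 5))      # 625.0
```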