This document introduces the Sipxcom high availability (HA) capabilities and describes how to deploy them in a production network. The work was done using the recent 16.12 ISO release available from download.sipxcom.com. Plan and test thoroughly, because you are working with replicated, real-time Mongo databases. In preparing this document, the voice servers were rebuilt several times from fresh installs to eliminate configuration and other issues encountered while testing this service. If you wish to reset the system without a re-installation, follow this procedure precisely: http://wiki.sipxcom.org/display/sipXcom/Reset+Entire+System. This primer assumes that the administrator is experienced with the standalone version of Sipxcom and familiar with the details of Polycom phone configurations.
Test Environment and Approach
The following diagram provides an overview of the test network configuration. In the phase 1 testing, DNS is configured so that all phone registrations land on pbx3. The primary pbx is failed first, and then incoming and outgoing internal and external calls are tested along with basic features. Note that services such as voicemail can only run on the primary server. The secondary pbx2 server is then failed and the tests are repeated. The pbx1 and pbx2 servers are then brought back online to confirm that the high availability system recovers and that all features are restored.
For phase 2, pbx3 with IP address 10.20.2.33 is replaced with a server with IP address 10.10.17.10 on a separate subnetwork. The same test strategy used for phase 1 is then applied to phase 2. The router is instrumented with a bandwidth measurement tool to estimate the amount of traffic used to replicate state information between the voice servers in the primary 10.20.2.x subnetwork and the secondary 10.10.17.x subnetwork. This insight is important for understanding the WAN bandwidth impact if the intent is to distribute some of the secondary voice servers geographically across multiple locations.
By design, Sipxcom always uses its built-in DNS server for querying SRV records. The Sipxcom unmanaged DNS server option, when enabled, allows an unmanaged DNS server to be queried for phone registrations. In testing the Sipxcom HA solution in release 16.12, difficulties were encountered using the managed DNS tools of Sipxcom. This document http://wiki.sipxcom.org/display/sipXcom/DNS+Management provides a good overview of the Sipxcom managed DNS capabilities and how they should work. Despite significant effort to make the Sipxcom DNS management tools work with the high availability solution, the HA solution performed best in the lab when Sipxcom used a DNS server that was completely separate from Sipxcom, i.e. the DNS service on each of the Sipxcom servers was disabled. The remainder of this document assumes that Sipxcom uses a separate DNS server for all SRV record processing. To do this, careful attention must be paid to how the Sipxcom primary and secondary servers are built, and the DNS service must be prevented from starting at boot.
Configuring Sipxcom for High Availability
The next few subsections assume that a new Sipxcom server is being built from scratch, that a separate DNS server is used for all Sipxcom DNS processing, and that DNS is turned off on all Sipxcom servers.
Build Standalone Unix DNS Server
Build a standalone DNS server using this document as a guide: http://wiki.sipxcom.org/display/sipXcom/DNS+Concepts+for+sipXcom. To simplify the effort, do the following:
- Install CentOS minimal
- Install bind using the following command: yum install bind* -y
- Copy the /etc/named.conf and /var/named/default.view.sip.domain.zone files from a working standalone Sipxcom system, modify the zone file to add the pbx2 and pbx3 servers, and adjust the priorities so that phones register to pbx3.
- Follow steps 4/5 in this document https://www.unixmen.com/dns-server-installation-step-by-step-using-centos-6-3/ to open the firewall ports needed to process DNS packets.
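The zone-file change in the list above can be sketched as a small generator for the SRV lines. This is only an illustration, not the file Sipxcom produces: the priority values 10/20/30, the weight of 0, and the ports 5060/5061 are assumptions chosen for the example, and lower priority values are preferred by SRV-aware clients.

```python
# Hypothetical preference order: the lowest SRV priority value is tried first.
SERVERS = [("pbx3", 10), ("pbx2", 20), ("pbx1", 30)]

# Assumed service labels and ports (5060 for SIP, 5061 for SIP over TLS).
SERVICES = [("_sip._tcp", 5060), ("_sip._udp", 5060), ("_sips._tcp", 5061)]


def srv_records(domain, servers=SERVERS, services=SERVICES, weight=0):
    """Return zone-file SRV lines: name IN SRV priority weight port target."""
    lines = []
    for service, port in services:
        for host, priority in servers:
            lines.append(
                f"{service}.{domain}. IN SRV {priority} {weight} {port} {host}.{domain}."
            )
    return lines


if __name__ == "__main__":
    # Print the lines so they can be compared against the edited zone file.
    print("\n".join(srv_records("lvtest.com")))
```

The output is meant to be compared by eye against the records already in the copied zone file rather than pasted in blindly, since the real file may use different TTLs and naming.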
Test from another Unix client machine to confirm that the Sipxcom SRV records are returned and that pings to pbx1.lvtest.com, pbx2.lvtest.com, pbx3.lvtest.com, and www.google.com resolve to the correct addresses.
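The address check described above could be scripted roughly as follows. This is a sketch under assumptions: the pbx1 and pbx3 addresses come from this document, the pbx2 address is a hypothetical placeholder, and the resolver is passed in as a parameter so the function can be exercised without a live DNS server.

```python
import socket

# pbx2's address is NOT given in this document; 10.20.2.32 is a placeholder.
EXPECTED = {
    "pbx1.lvtest.com": "10.20.2.31",
    "pbx2.lvtest.com": "10.20.2.32",
    "pbx3.lvtest.com": "10.20.2.33",
}


def check_hosts(expected, resolve=socket.gethostbyname):
    """Return {hostname: (expected_ip, resolved_ip_or_None)} for each mismatch."""
    mismatches = {}
    for host, want in expected.items():
        try:
            got = resolve(host)
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            got = None
        if got != want:
            mismatches[host] = (want, got)
    return mismatches
```

Calling check_hosts(EXPECTED) on a client whose resolver points at the standalone DNS server should return an empty dictionary when every A record is correct.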
Build Sipxcom Primary Server
Build the primary Sipxcom server using the 10.20.2.35 IP address as the primary DNS address in the Ethernet network settings. Otherwise, Sipxcom will override any manual entries in /etc/resolv.conf whenever the primary Sipxcom server is restarted. When the fresh Sipxcom system is started, check the primary External DNS setting to confirm that it points to the standalone DNS server built in the previous step; otherwise, /etc/resolv.conf will need to be updated to point to the 10.20.2.35 standalone DNS server after every reboot. Start all necessary Sipxcom services except Sipxbridge, which is not supported in the HA configuration.
NAT Traversal Settings
Go to the System -> NAT Traversal -> Settings menu and disable the Enable NAT Traversal and Server Behind NAT options. The Sipxcom high availability solution is only certified to use unmanaged gateways such as SBCs and ISDN/SIP gateways. Sipxbridge does not work properly in this environment.
The Sipxcom active phone registrations page does not indicate which voice server the phones are registered to in a high availability configuration, and the DNS server in Sipxcom by default spreads the phone registrations across all 3 servers. For phase 2 testing, where replication bandwidth to a remote server is measured, the simplest approach is to have all phones register to pbx3. In the lab setup, the _sip._tcp, _sip._udp, and _sips._tcp DNS SRV records are weighted so that phones register to pbx3 first, then pbx2, and then the pbx1 primary proxy. Run dig queries for the SRV records from a client machine pointed at the 10.20.2.35 DNS server to confirm that pbx3 appears first in the response. Also weight the SRV entries for the resource records and test them.
Validate that the registry rr IN records in the /var/named/dnsfile directory are correctly configured and that the DNS A records have the correct IP address for each pbx. For phase 1 testing, the pbx3 IP address is 10.20.2.33, and the phase 2 pbx3 IP address is 10.10.17.10.
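The SRV ordering check can be automated with a short parser for the output of a command such as dig @10.20.2.35 +short SRV _sip._tcp.lvtest.com, where each line has the form "priority weight port target". The sketch below is a simplification: it sorts by priority and then by descending weight, whereas real clients pick among equal-priority records probabilistically. The sample output is hypothetical.

```python
def srv_order(dig_short_output):
    """Return SRV targets, most preferred first, from `dig +short SRV` text."""
    records = []
    for line in dig_short_output.splitlines():
        parts = line.split()
        # Skip anything that is not "priority weight port target".
        if len(parts) != 4 or not all(p.isdigit() for p in parts[:3]):
            continue
        priority, weight, _port, target = parts
        records.append((int(priority), -int(weight), target.rstrip(".")))
    # Lower priority wins; within a priority, higher weight is listed first.
    return [target for _, _, target in sorted(records)]


# Hypothetical sample, not captured from the lab DNS server.
SAMPLE = "\n".join([
    "30 0 5060 pbx1.lvtest.com.",
    "10 0 5060 pbx3.lvtest.com.",
    "20 0 5060 pbx2.lvtest.com.",
])

if __name__ == "__main__":
    print(srv_order(SAMPLE))
```

Sorting the records explicitly avoids relying on the order in which the DNS server happens to return them.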
Go into the System -> Servers -> Core menu, and turn off the DNS service on the primary Sipxcom proxy.
SSH to the primary server, and perform the following:
- Run chkconfig --list named. By default, the Sipxcom installation turns on the DNS service regardless of the core server settings. Run chkconfig named off to prevent the DNS service from starting when the primary server restarts.
- Confirm that the DNS service is not running with service named status. If the service is running, stop it with service named stop.
- Check the /etc/resolv.conf file to ascertain that it points to the standalone DNS server defined in the previous step.
- Reboot the primary server and validate that the DNS server is indeed stopped and that Sipxcom has not rewritten the /etc/resolv.conf file to point to a different external DNS server.
- Run a dig query against one of the SRV records and validate the response.
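The chkconfig step in the list above can be checked programmatically by parsing one line of chkconfig --list named output. The sketch assumes the usual SysV layout of a service name followed by runlevel:state pairs; the sample lines are illustrative rather than captured from the lab servers.

```python
def enabled_runlevels(chkconfig_line):
    """Return the runlevels marked 'on' in one line of `chkconfig --list` output."""
    levels = []
    # The first field is the service name; the rest look like "3:on" or "3:off".
    for field in chkconfig_line.split()[1:]:
        level, _, state = field.partition(":")
        if state == "on" and level.isdigit():
            levels.append(int(level))
    return levels


# Illustrative samples: before and after running `chkconfig named off`.
BEFORE = "named          \t0:off\t1:off\t2:on\t3:on\t4:on\t5:on\t6:off"
AFTER = "named          \t0:off\t1:off\t2:off\t3:off\t4:off\t5:off\t6:off"

if __name__ == "__main__":
    print("before:", enabled_runlevels(BEFORE))
    print("after:", enabled_runlevels(AFTER))
```

An empty list means the DNS service will not start at boot in any runlevel, which is the state this procedure aims for.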
Add Server and Role
Go into System -> Servers and click on the top right-hand side to add a new server. Fill in the Hostname, IP Address, and Description fields and click OK. The server will show up in the list of servers with its Status field initially set to Uninitialized. Use the administration ID assigned by Sipxcom to build your secondary voice servers from the ISO. As in the previous step, confirm that the nameserver address used in the CentOS Ethernet interface definition menu points to 10.20.2.35; otherwise /etc/resolv.conf will need to be updated after every reboot of the secondary Sipxcom server.
After the ISO is installed, sipxecs-setup is automatically invoked. Point the primary server to 10.10.17.10 and enter the administration ID (in this case 6 for pbx3) in the setup script. After the script completes, go back to the System -> Servers menu. The Status field should change from Uninitialized to Configured. Repeat this process for each secondary server or arbiter assigned to the system.
Go into the System -> Servers -> Telephony section and turn on the SIP Proxy and SIP Registrar services for each secondary server added to the HA system.
Add Secondary Servers to Global Databases
Go to System -> Databases and add the secondary servers to the list of global databases. It will take 60-90 seconds for each database to be added and correctly synchronized. If you see multiple errors or have difficulty getting a server added to the list of global databases, try running the servers on more capable hardware.
In the lab setup, phones are manually provisioned with IP address, TFTP, SNTP, and DNS service addresses - a lab phone group was defined on the voice server and assigned to the test phones. The settings are as follows:
- TFTP and SNTP servers point to the primary 10.20.2.31 proxy
- The DNS server for the phones in the phase 1 tests points to the unmanaged DNS server at 10.20.2.35. In phase 2 testing, where phones are simulated in a separate location, the DNS address is 10.10.17.35 and the SRV weights are adjusted so that phones in that location register to the secondary server.
- The Lines -> Registration -> Primary Registration Server -> Expires value is reduced from 3600 seconds to 120 seconds.
- The registrations GUI in Sipxcom does not provide any information on which proxy the phones are registered to. A custom configuration file is configured for the Polycom phones that allows remote PCAPs to be enabled. Combined with the shortened SIP Expires setting (previous point), Wireshark is used to validate which proxy the phones are registering to. A way to pull this information from Mongo using Mongo commands in a Unix script is being explored.
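The Wireshark check in the last point can be summarized by counting REGISTER requests per destination address. The sketch below assumes the capture has been exported as CSV with Wireshark's default packet-list columns (including Destination and Info) and that REGISTER requests show "Request: REGISTER" in the Info column. The phone addresses in the sample are hypothetical.

```python
import csv
import io
from collections import Counter


def register_destinations(csv_text):
    """Count SIP REGISTER requests per destination IP in a Wireshark CSV export."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        info = row.get("Info") or ""
        dest = row.get("Destination")
        # Responses such as "Status: 200 OK" are ignored; only requests count.
        if dest and "Request: REGISTER" in info:
            counts[dest] += 1
    return counts


# Hypothetical export: two phones registering to pbx3 at 10.20.2.33.
SAMPLE_CSV = "\n".join([
    '"No.","Time","Source","Destination","Protocol","Length","Info"',
    '"1","0.000","10.20.2.101","10.20.2.33","SIP","650","Request: REGISTER sip:lvtest.com"',
    '"2","0.010","10.20.2.33","10.20.2.101","SIP","500","Status: 200 OK"',
    '"3","1.000","10.20.2.102","10.20.2.33","SIP","650","Request: REGISTER sip:lvtest.com"',
])

if __name__ == "__main__":
    print(dict(register_destinations(SAMPLE_CSV)))
```

If every phone is registering to pbx3, the result should contain a single key equal to the pbx3 address for the test phase in use.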
Double-Check Lab Configuration
SSH into each primary and secondary server, and double-check the following:
- DNS service is turned off
- The /etc/resolv.conf file points to the unmanaged DNS server at 10.20.2.35
- The SRV records on the unmanaged DNS service point to pbx3 first, then pbx2, and then pbx1. Run dig SRV _sip._tcp.lvtest.com to check.
- All Sipxcom processes are running. Run service sipxecs status to check.
- Using Wireshark, double-check that phones are registering to pbx3.
- Place internal and external calls on the system to validate that everything is working properly.
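The resolv.conf item in the checklist above lends itself to a small helper that can be run against the file contents collected from each server. This is a sketch only; how the file text is gathered (for example over SSH) is left out, and the function names are made up for this illustration.

```python
def nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they appear in resolv.conf."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop trailing comments, then look for "nameserver <address>".
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def points_to(resolv_conf_text, dns_ip="10.20.2.35"):
    """True when the first nameserver entry is the standalone DNS server."""
    servers = nameservers(resolv_conf_text)
    return bool(servers) and servers[0] == dns_ip


if __name__ == "__main__":
    with open("/etc/resolv.conf") as handle:
        print("OK" if points_to(handle.read()) else "resolv.conf was rewritten")
```

Checking only the first entry is a deliberate choice here, since the resolver consults nameservers in the order listed.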
StarTrinity SIP Tester Tool
The StarTrinity SIP Tester tool can be downloaded from https://startrinity.com/VoIP/SipTester/SipTester.aspx. It is a Windows-based application that can perform a variety of tests against a voice server. In the phase 1 and 2 testing, the key Siptester feature used was the Registration (UAC) batch capability, which simulates the flood of registration requests that phones send to a SIP proxy after a catastrophic failure, such as a proxy reboot or the restart of a managed WAN connection where the phones are in one location but the voice server proxies are in a centralized data center. The assumption is that this burst of registration requests generates peak replication traffic from the primary proxy to the secondary servers.
The Siptester steps used in the phase 2 measurements include the following:
- Use the Excel Import capability of Sipxcom to pre-populate a large number of users with the same SIP password.
- Go into the Registration (UAC) section of Siptester and pull down the Add Batch menu.
- Enter the first user name, the expiry field (shortened here from 3600 to 300 seconds), the number of user registrations to create, the SIP password, and the IP address of the registrar that the users should register to. In this case, all phones are registered to the 10.10.17.10 Sipxcom proxy.
- Click the Add link. Double-check the status field to confirm that the users connected correctly, and click on the trace link to confirm that users are registering to the correct proxy. Go to the Sipxcom Diagnostics -> Registrations page to validate that the users have registered correctly to Sipxcom.
When the Delete All link in Siptester is selected, the tool instructs Sipxcom to un-register each line. The Sipxcom Active Registrations page may still show active or expired registrations. SSH to the Sipxcom primary server and use the following procedure to clear out all active registrations.
Preliminary Phase 1 Test Results
An initial round of phase 1 test calls was conducted. With all lab phones configured to register to pbx3 running as a secondary server, pbx1 was brought down. Internal calls between phones registered on pbx3 were placed successfully, as were internal and external call forwarding. External calls were also placed successfully. An incoming external call was placed but failed because the pbx3 proxy returned a SIP 404 Not Found. As expected, voicemail, MWI, and auto attendant functionality were not available when the primary pbx proxy was unavailable.
One issue encountered when all 3 HA servers were operational involved an external call. When the Polycom phone ended the call, the secondary pbx3 proxy issued a 403 noaccess response to the primary pbx proxy, and the pbx never issued a BYE to the ITSP to release the call. The bridged line appearance (BLA) issue identified in the standalone version of Sipxcom also appears in the HA configuration. Provisioning the SIP server field on the phones gets BLA working again, but those phones then do not work when the primary proxy fails. A JIRA has been opened on this issue.
Preliminary Phase 2 Test Results
A three-server Sipxcom HA solution was configured, with pbx1 and pbx2 on one subnetwork and pbx3 on a second subnetwork. All three servers were defined as global databases. When regions were defined and pbx3 was defined as a local database, the registrar process on pbx3 could not be started. The StarTrinity SIP tester tool https://startrinity.com/VoIP/SipTester/SipTester.aspx was used to flood the pbx3 server with Sipxcom user registration requests for 25, 50, 100, and 500 users. The registration requests were placed from a desktop machine in two locations:
- Windows desktop on the 10.20.2.x subnetwork where pbx1 and pbx2 are located. The registrations go through the router to the 10.10.17.x subnetwork where pbx3 is located. The router measures the registration traffic to the phones plus the replication of state information back to pbx1 and pbx2. This measurement matters for the case where pbx1 and pbx2 go down and phones on that subnetwork must re-register to pbx3 on the 10.10.17.x subnetwork.
- Windows desktop on the 10.10.17.x subnetwork where pbx3 is located - in this scenario, the router measures the state replication traffic from pbx3 to pbx2 and pbx1.
In scenario 1, when phones are registered to pbx3 from the SIP tester on the separate 10.20.2.x subnetwork, 10-40 kbps of bandwidth is generated in replication traffic and phone registrations every few seconds. When 100 or 500 phones register to pbx3 at once, 1-3 megabits per second of traffic is generated for several seconds, consisting of phone registrations and state replication from pbx3 to pbx2 and pbx1. In the attached graph, there are two peaks in each test: the first peak represents the phones registering to pbx3 and the second peak represents the traffic to de-register the phones. The Siptester tool can delete all phone registrations simultaneously.
In scenario 2, when phones are registered to pbx3 from the SIP tester on the same 10.10.17.x subnetwork, 5-10 kbps of bandwidth is generated in replication traffic to the primary and secondary servers in the 10.20.2.x subnetwork. When 100 or 500 phones register to pbx3 at once, approximately 1 megabit per second of bandwidth (roughly 125 kilobytes per second) is generated for several seconds, destined for the primary and secondary servers on the 10.20.2.x subnetwork. This is replication traffic only and does not include user registrations.
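For WAN planning, the measured rates can be converted into byte rates and rough burst sizes. The sketch below uses decimal units (1 megabit = 1,000,000 bits, 1 kilobyte = 1,000 bytes). The 5-second burst duration in the example is an assumption standing in for "several seconds", not a measured value.

```python
def mbps_to_kilobytes_per_s(mbps):
    """Convert megabits per second to kilobytes per second (decimal units)."""
    return mbps * 1_000_000 / 8 / 1_000


def burst_megabytes(mbps, seconds):
    """Total data moved by a burst of `mbps` lasting `seconds`, in megabytes."""
    return mbps * seconds / 8


if __name__ == "__main__":
    # Scenario 2 peak: about 1 Mbps of replication traffic.
    print(mbps_to_kilobytes_per_s(1), "KB/s")
    # Scenario 1 upper bound: 3 Mbps sustained for an assumed 5 seconds.
    print(burst_megabytes(3, 5), "MB")
```

These figures only bound the registration burst itself; steady-state replication between bursts is much smaller, as the measurements above show.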