Mastering FreeSWITCH

HA deployment

"The contemporary form of Murphy's law goes back as far as 1952, as an epigraph to a mountaineering book by John Sack, who described it as an "ancient mountaineering adage": Anything that can possibly go wrong, does."

--Wikipedia

Like mountaineers, we survive because we respect the environment where we live and thrive: we know it is full of perils, and we know that ignoring those dangers can be deadly (at least for our business/career).

People are used to the concept of communication as a utility: you pick up the phone and you hear the dial tone… Only in the case of a major disaster will a user experience an interruption to her voice calls. Barring hurricanes and the like, the telecommunications industry has an outstanding history of reliability, one of the very few fields where we can experience those magic figures of 99.999% uptime (the mythical five nines).

So, we have users with very high expectations when it comes to their ability to place a call and be reached from outside; completely the opposite of what they expect from, say, their office PC operating system and applications, where crashes and malfunctions are known daily nuisances.

How can we cope with such expected reliability? We must plan for failures, not try to avoid them. Failures will come, that's a certainty; we can lower their frequency, but we cannot avoid them. We need to ensure that no individual point of failure can result in a system failure; for example, we need multiple paths for everything: network connections, electrical power, disk storage, FreeSWITCH servers, and more.

As a bonus, when we design HA into our operation, we are most of the way to horizontal scalability: our operation will not only be unbreakable, it will also grow linearly with the number of users, simply by adding more of the base elements that compose our solution.

Storage, network, switches, power supply

As a rule of thumb, you must duplicate each individual path, so that none of them can become a single point of failure.

So, at the wire level, each machine will have at least two different physical network cards in different PCI slots (that is, avoid a single card containing multiple adapters). You bond these two cards together as one virtual adapter serving the same IP address(es), and you run each network cable to a different switch. This way you also ensure physical connectivity in case of failure of a network card, switch, or cable (many failures are not caused by failed hardware, but by human errors such as cutting a cable or wrongful disconnections at patch panels).
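As a quick sanity check that the bond really has two healthy legs, the Linux kernel exposes the bonding state under /proc/net/bonding/. A minimal sketch follows; it assumes the bond is named bond0, so adjust to your naming:

    # Minimal sketch: print the health of a Linux NIC bond.
    # Assumes the bonding driver is loaded and the bond is named "bond0".
    from pathlib import Path

    status = Path("/proc/net/bonding/bond0").read_text()
    for line in status.splitlines():
        # Keep only the lines showing the bonding mode, which slave is
        # currently active, and whether each slave's link (MII) is up.
        if line.startswith(("Bonding Mode", "Currently Active Slave",
                            "Slave Interface", "MII Status")):
            print(line.strip())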

You want to be connected through at least two routers linked to at least two network carriers (so you hopefully survive the severing of the cable down the road), and then to two or more ITSPs.

Each physical machine will be built with dual power supplies (power supplies are the most easily broken pieces of hardware, because of the fan) and dual system (boot) hard disks in RAID 1 (mirroring); hard disks routinely break well before their Mean Time Between Failures (MTBF).

Storage (for common files, configurations, recorded prompts, voice mail, and more) will preferably be on a redundant SAN with redundant fiber connections, or alternatively (and far cheaper) on a cluster filesystem or an HA NFS server. A minimal but reliable HA NFS server is composed of two machines in an active-passive setup, accessed via a virtual IP address that is assigned by HeartBeat to the active machine. Each machine has some disk space (typically several entire disks in RAID 5 or 10) that mirrors the corresponding disk space on the other machine via DRBD. This solution (NFS + HeartBeat + DRBD + dual network cards and switches) is well documented and effective. In case of a machine or filesystem failure, the virtual IP address will be moved by HeartBeat to the other machine, and clients will continue to access the files that were mirrored in real time, at block level, by DRBD. The devil is in the details, so be sure to follow each step of a tried and true industry solution description (such as the Linbit ones), not a page from a casual tech blog.

Virtualization

HA requires multiple machines, and more and more operations are using virtualization because of consolidation and manageability considerations. Let's have a look at FreeSWITCH virtualization best practices.

Real hardware machines (that is, non-virtual), running only FreeSWITCH on top of a clean operating system and a known kernel revision, are the best solution for reliably delivering quality. Voice and video operations are real-time operations. Delays of more than 150 milliseconds are perceived by users, and jitter (variation in delay) adds complexity to delay management. Quality is all about constant timing of the transmitted and received packets.

The original source of timing is the IRQs of the physical machine, derived by the BIOS from the motherboard's quartz oscillator. Those IRQs are the basis for the operating system's timers, which allow for nanosecond formal accuracy and millisecond actual accuracy. The vast majority of packets are 20 milliseconds long (packet durations, or ptime, of 10 or 30 ms are much rarer), so accuracy in the millisecond range actually matters. The scheduler then decides, at kernel level, which process gets CPU access, and therefore the chance to check the timers.
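To get a feel for how much the millisecond range matters, here is a minimal sketch (plain Python, not FreeSWITCH code) that tries to wake up every 20 ms, as an RTP stream would, and reports how far the actual intervals drift. Run it on bare metal and inside a loaded VM and compare the spread:

    # Measure how accurately this host can wake up every 20 ms (the typical
    # RTP ptime). A large or irregular spread is exactly the jitter source
    # discussed above, and it gets worse under traditional virtualization.
    import time

    PTIME = 0.020      # 20 ms, the most common RTP packetization interval
    SAMPLES = 500

    deltas = []
    last = time.monotonic()
    for _ in range(SAMPLES):
        time.sleep(PTIME)
        now = time.monotonic()
        deltas.append((now - last) * 1000.0)   # actual interval, milliseconds
        last = now

    avg = sum(deltas) / len(deltas)
    spread = max(deltas) - min(deltas)
    print(f"average interval: {avg:.3f} ms, worst: {max(deltas):.3f} ms, spread: {spread:.3f} ms")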

So far, the critical aspects are the timers and the scheduler in the kernel (hence the need for a known kernel, preferably the same one on which FreeSWITCH is developed and tested by the core team). On an actual, physical machine that's complex, but predictable and reliable. FreeSWITCH's core team goes to great lengths to ensure constant timing, by automatically using the most advanced time source available from the kernel and by deriving all internal timings from a single source.

Virtual machines were not designed for real-time traffic: although it may serve millions of hits or queries, a web or database server is not time sensitive. As long as it delivers throughput, the exact millisecond at which one item is delivered, and its timing relationship with previous and later items, does not matter at all (in DB and web operations there is no previous or later; each item is usually unrelated to the other items delivered).

So, virtualization builds an emulation of many virtual machines on top of one hardware machine, each virtual machine simulating hardware that runs its own kernel, timers, scheduler, IRQs, BIOS, operating system, and more. This allows for maximum saturation of CPU and I/O usage, maximum exploitation of the hardware, and minimization of power consumption and datacenter space, by multiplexing CPU and hardware access, in a more or less random order, among the various kernels, operating systems, and applications concurrently running on the actual, physical machine. The distribution is statistically fair, and for most workloads that's enough. It is not good at all for real-time communication quality.

Delivering decent real-time communication quality and traffic throughput from a virtual machine is more a black art than a science, and you can find, on mailing lists, blog pages, and other broscience outlets, a quantity of incantations that can be of help.

You may succeed, and run a carrier-class operation, but you need to be really careful and dedicated. The orthodox word on this matter is: don't. No matter whether it is Xen, KVM, Amazon EC2, or VMware: simply don't do it for production systems under any sizeable load (as opposed to development and prototyping, where quality under load is not a concern, and virtualization is handy).

So, are we stuck with a one-FreeSWITCH-server-per-hardware-machine correspondence? Not exactly; in fact, not at all.

In growing levels of overhead, you can: run multiple profiles within the same FreeSWITCH process, run multiple FreeSWITCH instances concurrently on the same operating system, or use containers to create separate virtual environments in which to run the various FreeSWITCH instances.

The first two solutions need separate IP ports for each profile or FS instance, and share the same environment, that is, the same userspace. The third solution, containers, allows for complete separation of the execution environments: each container has its own IP address with a full range of ports in exclusive use, and is completely separated from the other containers as far as security is concerned.

From the admin's point of view, a container looks and feels like a virtual machine. But it is actually a chroot on steroids: there is only one kernel, only one scheduler, only one set of IRQs, only one set of timers; no emulation, no indirection, no contention. All processes of all containers are just regular processes, and CPU and I/O access is given to them by way of a quota system. If all quotas are set to the same value, resources are distributed as if all processes were running in one single container, or in the native (host) environment.

So, containers are the most efficient and best-performing virtualization technology, because they take a completely different approach from traditional virtualization. A container is a secured group of related processes that appears as a separate machine. This gives each container direct access to timing, and performance indistinguishable from that of a native, bare-metal operating system.

The drawback of containers versus other virtualization technologies is that you can run only Linux, and only one kernel revision (that of the host, bare-metal operating system). Everything else is possible: your distro of choice, multiple concurrent distros, multiple versions of the same distro, and more. You choose the userland you prefer (for FS, you typically choose between Debian and CentOS). Then you can start, stop, back up, or migrate your container just as you would a traditional virtual machine.

The most mature technology in this field used to be OpenVZ, a series of patches to the regular Linux kernel plus a set of tools to manage containers (which will appear as VMs to you). For ease of installation, OpenVZ is also distributed as packages for RHEL-CentOS and for Debian: you add the repositories, then install the packages, and after a reboot you are ready to build your new containers. OpenVZ is the basis of Virtuozzo and Parallels Cloud Server, commercial virtualization solutions offered by Parallels, complete with a management GUI and the ability to manage other kinds of VMs, such as KVM.

LXC (Linux Containers) is arguably the future of containers. It is actively developed and present in both RHEL-CentOS and Ubuntu-Debian, but Canonical-Ubuntu is clearly the champion of LXC.

LXC can also be leveraged by installing Proxmox, a container and VM management system that offers a complete, very well engineered open source solution, bringing ease of use to all the components of an HA virtualization platform: LXC, KVM, redundant storage, VM migration, and more. Proxmox offers a web-based interface for all container and storage administration, with redundancy and high availability. You can buy a support contract and be entitled to privileged access to the best-tested repositories.

Load balancing and integration with Kamailio and OpenSIPS

You attain HA by duplicating (at least) your communication paths and your servers, building a resilient system without a single point of failure. You want to eliminate the risk that the malfunction of one element can stop or degrade the functioning of the whole system.

In the Web world

In standard HA web operations you run multiple web servers behind a load balancer that faces the clients. So, the flow is: the client asks for a page from the web server at the http://www.example.com/ address. That server is actually not a real web server but a load balancer, which proxies the request to one of the real web servers, and then proxies the answer from that web server back to the client.

The load balancer (proxy) is a light process, requiring little hardware to just move bits back and forth. The real web servers are beefy machines where actual, heavy lifting is done.

We have reached the goal of avoiding the risk of web service interruption (if one web server goes down, the load balancer will sense it and send the incoming traffic to the other web servers). We have also achieved horizontal scalability: as traffic increases we just add more web server machines, and the load balancer will, ahem, balance the load between them.
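Conceptually, the balancing logic is as simple as the following hedged sketch (the backend host names are hypothetical, and a real deployment would use a proven balancer rather than hand-rolled code): skip the backends that fail a health check, and hand out the remaining ones in round-robin order.

    # Toy round-robin load balancing with a crude TCP health check.
    # Backend host names are hypothetical placeholders.
    import itertools
    import socket

    BACKENDS = [("web1.example.com", 80), ("web2.example.com", 80), ("web3.example.com", 80)]
    _ring = itertools.cycle(BACKENDS)

    def is_alive(host, port, timeout=1.0):
        """Return True if we can open a TCP connection to the backend."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def pick_backend():
        """Return the next healthy backend, or None if the whole pool is down."""
        for _ in range(len(BACKENDS)):
            host, port = next(_ring)
            if is_alive(host, port):
                return host, port
        return None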

But we just shifted the single point of failure from a single web server to a single load balancer. If the load balancer goes down, service is interrupted, clients will have nobody to connect to and to receive answers from. Back to square one? Actually, no.

We've done the right thing by shifting the single point of failure from a heavy and complex environment (a web server requiring lots of RAM, CPU power, connections to DBs, application logic, updates, and more) to a very light and lean entity (a load balancer that needs no intelligence, requires almost no resources, and uses a static configuration; a kind of fire-and-forget service).

So, how do you duplicate your load balancer? You cannot load balance it (that would only kick the can down the road; that is, you would end up with a single point of failure located in the balancer of load balancers).

To duplicate the load balancer, you use a different strategy: you have a second machine with the same resources, software, and configuration as the first one (you can actually think of it as a copy of the first machine), running live side by side with the first machine, using a different IP address.

In normal operation, the first machine is on the http://www.example.com/ IP address and gets all the traffic, in and out, while the second machine does nothing, just sitting there humming idly (the second machine is on a different IP address, so no traffic directed to http://www.example.com/ will reach it). Remember that load balancing is a very light process, so it is very cheap to have a second machine standing by idle, as a spare.

When the first machine malfunctions (because of a software or hardware failure), we shut it down, then change the IP address and identity of the second machine, and traffic begins to flow in and out of the second machine, which is now answering at the http://www.example.com/ Internet address.

The last touch to our solution is to make this procedure of malfunction sensing, shutdown of the failed machine, impersonation by the second machine, and resumption of traffic flow an automated, very fast, repeatable, and reliable process. This is usually done by accessory software called HeartBeat.
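In spirit, what HeartBeat automates looks like the sketch below, to be imagined running on the standby machine. The addresses, interface name, and thresholds are illustrative assumptions; in production you would rely on HeartBeat (or Pacemaker/Keepalived), not a hand-rolled loop.

    # Illustrative failover loop: ping the active balancer and, after a few
    # missed heartbeats, claim the shared virtual IP on this standby machine.
    # Peer address, virtual IP, and interface name are made-up examples.
    import subprocess
    import time

    PEER = "192.0.2.10"        # current active load balancer (hypothetical)
    VIP = "192.0.2.100/24"     # shared virtual service address (hypothetical)
    IFACE = "eth0"

    def peer_alive():
        """Send one ICMP echo; True if the active machine answered within 1 s."""
        return subprocess.call(["ping", "-c", "1", "-W", "1", PEER],
                               stdout=subprocess.DEVNULL) == 0

    missed = 0
    while True:
        missed = 0 if peer_alive() else missed + 1
        if missed >= 3:
            # Three missed heartbeats: declare the peer dead and take the VIP.
            # Real tools also fence the failed node and send gratuitous ARPs
            # so that neighbors update their ARP caches quickly.
            subprocess.check_call(["ip", "addr", "add", VIP, "dev", IFACE])
            break
        time.sleep(1)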

We have reached our goal: we present a single address to our web users, so to them a failure in our system will be transparent; just an instant's delay in the display of the requested web page.

In the FreeSWITCH world

Building an HA FreeSWITCH service requires the same components as the previous web example, with some added complexity due to the dual nature of SIP calls: signaling and media (that is, audio).

The SIP load balancer in our case will be provided by the Kamailio or OpenSIPS open source software (using their dispatcher module).

In SIP, the establishment, tear-down, and modification of the call are carried out by the exchange of specific signaling network packets that describe who the caller is, who they are looking for, how the call has to be established, how to encode audio so it is understandable by both sides, and also the actions of calling, ringing, answering, hanging up, and more. Those signaling packets (the SIP packets proper) are sent between known ports at known IP addresses, and define the communication flow in its entirety.

But this is only a description of the flow; that is, these packets contain no audio, only indications of how to exchange the audio packets.

Audio is exchanged using a completely different protocol, auxiliary to SIP, called RTP. Those audio packets will use completely different and previously unknown IP ports (which ports to use is negotiated between caller and callee via SIP packets).

Of those two packet exchanges, only the RTP (audio) stream is time sensitive. SIP signaling packets can be delayed by almost any amount of time without significant communication disruption: maybe your call will be answered one whole second (1,000 milliseconds!) later, so what? For audio, on the contrary, a delay of more than 150 milliseconds is perceived as bothersome, and can you imagine talking on the phone with audio that comes and goes with one-second gaps every now and then? Beyond outright gaps, the other audio annoyances are delay, jitter, and packet loss in RTP, which can dramatically lower the quality of the reconstructed sound.
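A back-of-the-envelope illustration of jitter (the arrival times below are made up, not measured data): packets leave the sender every 20 ms, but the network delivers them with irregular gaps, and the spread of those gaps is what the receiver's jitter buffer has to absorb, at the cost of extra latency.

    # Toy jitter calculation on hypothetical RTP arrival times (milliseconds).
    SEND_INTERVAL_MS = 20
    arrival_ms = [0, 21, 39, 58, 95, 101, 120, 141]   # made-up example data

    gaps = [later - earlier for earlier, later in zip(arrival_ms, arrival_ms[1:])]
    deviation = [abs(g - SEND_INTERVAL_MS) for g in gaps]
    print("interarrival gaps (ms): ", gaps)
    print("deviation from 20 ms:   ", deviation)
    print("worst-case jitter (ms): ", max(deviation))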

Those two flows (signaling and media) are completely independent, and this often results in completely different paths taken by SIP packets and by RTP packets.

For example, caller and callee can have their SIP packets routed through a number of SIP proxies in between, while RTP packets go directly from caller to callee and vice-versa (that's very often what happens when calls are between phones connected to the same LAN). Or RTP can be routed by a single proxy between caller and callee, while SIP packets traverse a much more tortuous path (that's what often happens when caller and callee are in different LANs). Or any other mix and match of communication paths.

Let's see how we can build an HA FreeSWITCH service in the simplest (but real and robust) of topologies: all machines (load balancers and FreeSWITCH servers) sit on public IP addresses, directly reachable from the Internet. In our example, HeartBeat will assign the active load balancer the public IP address corresponding to sip.example.com, in addition to its own IP address (that is, it ends up with two IP addresses). In case of failure of the active load balancer, HeartBeat will reassign the sip.example.com IP address to the standby load balancer (again, in addition to its own IP address); that is, the standby machine becomes the active one and traffic begins to flow in and out of it.

SIP (signaling) packets will flow from the caller to the active load balancer. The load balancer will route each packet to one of the FreeSWITCH servers (chosen randomly, or by some other algorithm).

All FreeSWITCH servers will be configured exactly the same way (they will only differ by their IP address), and will have their own guts (the internal data structures representing phones, calls, SIP details, and more) residing in a PostgreSQL database (this FreeSWITCH setup is called PostgreSQL in the core).

This will make all FreeSWITCHes completely interchangeable: they will all know where each registered phone is to be reached (all registrar information will be in the database). Whichever server a phone originally registered on, its registrar information will be common to all FreeSWITCHes.
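As a hedged sketch of what "shared registrar information" means in practice: any node, or an external tool, can look the registrations up in the common PostgreSQL database. The table and column names below (sip_registrations, sip_user, sip_host, contact) are typical of FreeSWITCH's Sofia-SIP schema but should be verified against your installation; the connection string is a placeholder, and the third-party psycopg2 package is assumed.

    # Hedged sketch: list SIP registrations from the shared PostgreSQL database.
    # Verify table/column names against your FreeSWITCH schema before use.
    import psycopg2

    conn = psycopg2.connect("dbname=freeswitch user=freeswitch host=db.example.com")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT sip_user, sip_host, contact FROM sip_registrations")
        for user, host, contact in cur.fetchall():
            # Every FreeSWITCH node sees the same rows, whichever node the
            # phone originally registered on.
            print(f"{user}@{host} -> {contact}")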

So, SIP signaling packets coming from the caller will be routed by the load balancer to one of the multiple FreeSWITCHes. Each one of them will be able to connect the call to one of the registered phones, or to provide special services to the caller (voice mail, conferencing, and more).

FreeSWITCH's answers to the incoming SIP signaling packets from the caller will first go to the load balancer (from FS's point of view, the call is coming from the LB). The load balancer will then route the SIP signaling packets coming from FS on to the caller.

The SIP signaling path for a voicemail access will be: caller->LB->FS->LB->caller. For an outgoing call (for example, to the PSTN) it will be: caller->LB->FS->ITSP->FS->LB->caller. For an in-LAN call it will be: caller->LB->FS->callee->FS->LB->caller.

Audio (RTP) will instead always flow from the caller (and the callee, if any) to FS, and from FS to the caller (and the callee, if any). If the call terminates on the FS box itself (as in a voicemail access), audio packets will simply go back and forth between the caller and FS. If there actually is a callee (that is, if the incoming call generates an outgoing leg toward another phone), audio packets will flow back and forth between the caller and FS, and between the callee and FS. FS will internally route those audio (RTP) packets from caller to callee and from callee to caller, joining the two legs inside itself into a complete, end-to-end call.

SIP signaling packets will be routed from the load balancer to an FS server, and that FS server will insert its own RTP port and IP address into the part of the SIP answer packet that defines the audio (RTP) path (the SDP body). That SIP answer packet containing the RTP address will go to the load balancer, and from there to the caller. The caller will then start sending audio (RTP) packets directly to that FS server, and vice versa (while SIP signaling packets continue to pass through the load balancer). In the case of a two-leg call, the four streams (caller to FS, FS to caller, callee to FS, FS to callee) will be cross-routed (that is, switched) inside FS itself.
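For the curious, the RTP address and port travel inside the SDP body of the SIP answer, in its "c=" (connection) and "m=" (media) lines. A minimal sketch follows, using a shortened, hypothetical SDP of the kind a FreeSWITCH server could send back:

    # Pull the negotiated RTP endpoint out of a (shortened, hypothetical) SDP body.
    sdp = """v=0
    o=FreeSWITCH 1 1 IN IP4 198.51.100.20
    s=FreeSWITCH
    c=IN IP4 198.51.100.20
    t=0 0
    m=audio 24862 RTP/AVP 0 101
    a=rtpmap:0 PCMU/8000
    """

    rtp_ip = rtp_port = None
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("c="):
            rtp_ip = line.split()[-1]          # connection address for the media
        elif line.startswith("m=audio"):
            rtp_port = int(line.split()[1])    # negotiated RTP port for audio
    print(f"caller will send RTP directly to {rtp_ip}:{rtp_port}")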

Until now we have left the database as our single point of failure. Fortunately, there are proven technologies to achieve database HA: master-slave replication, clustering, and many more. The most popular and simplest one is the active-passive configuration, similar to the one we applied to the load balancers and very similar to the one described earlier for the DRBD NFS servers.

One machine is the active one and gets all the traffic, while the other one is passive, sitting idle and replicating the active machine's data onto itself (so as to contain, at every moment, an exact copy of the active machine's database). The database is accessed at the published IP address, assigned by HeartBeat to the active machine. In case of failure, HeartBeat will reassign that official IP address to the standby machine, thus making it the active one.

This database HA topology has the advantage of being conceptually simple (you must just ensure that the standby machine is able to replicate the active machine's data). The main drawback is that the data replication process can fail or lag behind, and you end up with a standby machine that does not contain an exact copy of the active machine's data, or that contains data that is not even consistent. The other big drawback is that, to serve a big database, a machine needs to be huge, powerful, full of RAM, and equipped with multiple, big, fast disks. So, you end up with a very costly big bad box just sitting idle almost all the time, passive, waiting to take over in case of failure of the active machine. A new solution to both of those problems is being made available for PostgreSQL: BDR (Bi-Directional Replication). BDR allows the use of both machines at the same time, with each machine guaranteed to be consistent within itself at any moment, and eventually consistent with the other machine. BDR also allows for database replication between different datacenters, to achieve geographical distribution and resiliency to datacenter failures.

We have just described a very basic and easy to implement HA FreeSWITCH service. The main drawback of this FreeSWITCH HA solution is the exposure on the public Internet of the various FreeSWITCH servers' addresses, which will not be shielded from, for example, DDoS attacks and other security dangers.

Kamailio and OpenSIPS (the software we use to implement the load balancer) are particularly apt at, and proven in, defending VoIP services from attacks, floods, and DDoS.

A different topology, and indeed one that is often used by telecom carriers, is one that exposes only the load balancers to the Internet. The LBs act as registrars too, and use rtpproxy processes to route audio in and out of the system. In this topology, the FreeSWITCH servers' addresses are unreachable from the public Internet (for example, they may be private addresses), and all RTP audio flows via the rtpproxy processes.

DNS SRV records for geographical distribution and HA

So, we have achieved a system without a single point of failure; we have attained High Availability for our customer base. No calls will be dropped!

We have customers on both coasts of the USA, in Europe, and in Asia too. They all access our solution hosted in a New York datacenter. Our customers in Philadelphia and London are getting perfect quality, while from Tokyo and San Diego they lament occasional delays and latency. Small problems; nuances of a well engineered, failure-resistant service.

Then a flood, a power outage or another disaster strikes the datacenter that hosts our solution. The datacenter is no longer reachable from the Internet. Our entire service is wiped out, our customer base will be completely unable to make or receive calls until we find a different datacenter and we rebuild our solution from the most recent offsite backup media.

Ugh!

SRV records in the Domain Name System are used to describe which addresses and ports a service can be accessed at, and in which order those address/port pairs should be tried by clients. SRV records are often used to identify which SIP servers a client needs to connect to in order to reach the desired destination.

The interesting property of SRV records is that, just like MX records for the mail (SMTP) service, a DNS server can return multiple records for one query. Each of those records carries a priority and a weight: the priority tells the client in which order the records have to be tried (lowest priority value first), while the weight distributes the load among records of equal priority. If the most preferred record does not work, the client tries the next one, and so on; records with the same priority are tried in a weighted, random order.
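As an illustration of the client side, here is a hedged sketch of how one could fetch and order the SRV records for a SIP service. It assumes the third-party dnspython package is installed and that an SRV record set actually exists for the queried (placeholder) name:

    # Sketch: look up and order SIP SRV records (requires the dnspython package).
    import dns.resolver

    # "_sip._udp.example.com" is a placeholder service name.
    answers = dns.resolver.resolve("_sip._udp.example.com", "SRV")

    # Lowest priority value is preferred; among equal priorities, higher weight
    # should be chosen proportionally more often (here we just sort for display).
    for r in sorted(answers, key=lambda rec: (rec.priority, -rec.weight)):
        print(f"priority={r.priority} weight={r.weight} -> {r.target}:{r.port}")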

We can use SRV records to optimize traffic patterns and delays, and to achieve datacenter disaster survival. Let's say we deploy our solution (load balancers, FreeSWITCHes, databases, and NFS servers) on both coasts, in the New York and San Francisco datacenters. We'll use a DNS server with a geolocation feature, so that European and East Coast customers receive a set of records in which the most preferred one points at the New York datacenter address, while Asian and West Coast users receive a set of records in which the most preferred one points at the San Francisco datacenter address.

This ensures both our goals: everyone will use the closest site, and if the closest site is unreachable or not answering, they will fall back to the other one. Yay!