Finding performance problems

(:toc-float Table of Content:)

Background

This document covers the design and behaviour of OpenSIPS 1.X only. Keep in mind that OpenSIPS 2.x has been redesigned to improve the multithreading / multiprocessing situation and to allow switching the processing context while a long-running request is handled in the background. That change invalidates most of the issues mentioned here.

In OpenSIPS 1.X message processing, each SIP message blocks one worker process until the resulting action is determined. This usually means until the SIP message is forwarded to some destination or dropped. Any action that blocks the processing of the current SIP message affects overall performance. This is not limited to blocking I/O, as in many scripting languages, but covers any action which needs to be completed before the script can continue.

As an example, suppose DNS lookups fail during packet processing, the DNS timeout takes a full 1s to trigger, and the OpenSIPS server runs with 8 children. Then just 8 packets can block the whole server for a second. If more traffic of the same kind gets queued up, a moderately busy server will spend most of its time locked up waiting.
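
The arithmetic behind this example can be sketched in a few lines of Python. This is an illustration only: it assumes every message blocks its worker for the same amount of time, and the function name and the 2 ms figure are made up for the example, not taken from OpenSIPS.

```python
def max_throughput(children: int, blocking_seconds: float) -> float:
    """Upper bound on messages per second when every message blocks
    one worker process for blocking_seconds (simplifying assumption)."""
    if children <= 0 or blocking_seconds <= 0:
        raise ValueError("children and blocking_seconds must be positive")
    return children / blocking_seconds


if __name__ == "__main__":
    # 8 children, each stuck for a 1 s DNS timeout.
    print(max_throughput(8, 1.0))
    # The same 8 children when a message needs an assumed 2 ms.
    print(max_throughput(8, 0.002))
```

Under these assumptions the ceiling drops from thousands of messages per second to single digits as soon as each message waits for the full timeout.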

Eliminating the biggest delays will improve the server performance (throughput) significantly.

Common sources of delays

Explicit database operations

Database operations take a (relatively) long time, at least compared to normal computation tasks. This includes the time needed to send the query to the database (if applicable), to process the query, and to return the results over the network (again, if applicable). If the tables being queried are not properly indexed, for example, query times may become very long. With a MySQL database, such queries can be spotted in the slow query log if it is enabled. Alternatively, long-running queries can be logged by setting a suitable threshold in the exec_query_threshold parameter of the db_mysql module.

Database queries are triggered directly by commands like avp_db_query, lookup, save, etc.

Ideas:

  1. Remove the feature from the script if it is not needed.
  2. Find out which indexes / storage engines affect the queries and optimize the database tables.
  3. If possible, change the storage mode to either delayed writeback or in-memory only operation.
  4. If the feature is not fully needed, it might be possible to replace it with similar memcache operations.
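
To see whether an index is actually used by a query, ask the database for its query plan. The sketch below uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere; on MySQL the equivalent check is done with EXPLAIN. The table and index names are hypothetical and only loosely resemble a location table.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple) -> str:
    """Return the planner's description of how sql would be executed."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable detail text.
    return "; ".join(row[-1] for row in rows)


def demo() -> tuple:
    conn = sqlite3.connect(":memory:")
    # Hypothetical table, invented for this example.
    conn.execute("CREATE TABLE location (username TEXT, contact TEXT)")
    conn.executemany(
        "INSERT INTO location VALUES (?, ?)",
        [(f"user{i}", f"sip:user{i}@192.0.2.1") for i in range(1000)],
    )
    sql = "SELECT contact FROM location WHERE username = ?"
    before = query_plan(conn, sql, ("user42",))
    # Add an index on the column used in the WHERE clause.
    conn.execute("CREATE INDEX idx_location_username ON location (username)")
    after = query_plan(conn, sql, ("user42",))
    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("without index:", before)
    print("with index:   ", after)
```

Without the index the plan reports a full table scan; with it, the plan names the index. The exact wording of the plan text varies between SQLite versions.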

Implicit database operations

Database queries are also run from places where they are not directly requested. For example, accounting and dialog operations may be triggered on many packets just because a flag was set at some point in the script.

Ideas:

  1. Check how the performance compares with different modules enabled and see if your profile matches the standard ones.
  2. See other solutions in the previous section.

Excessive logging

As configured by default in many cases, message logging involves waiting until the syslog daemon writes the debug messages to the disk. If there are many messages to be written, this will slow down the processing of SIP messages.

There are two ways to approach this issue. First, the number of messages logged can be limited. As a rule, try not to run production deployments with the debug parameter set higher than 2. If some long term debugging is needed on the system, this can be done either by turning up the debug level using the MI interface, or by changing the debug level locally using the setdebug() function (described in the functions list).

The second approach is to make the message logging asynchronous on the server side. This way, the logging application does not have to wait until the message is flushed. If you are using the standard syslog daemon, putting a "-" before the filename where you log the OpenSIPS messages may improve the overall filesystem performance.

DNS resolution

Resolving a domain name may happen in many unexpected places. Even worse, the place and the resolved name may be controlled by the user. Every time a message is sent out with a domain name in the destination address, rather than an IP address, that name will be resolved. If the DNS server does not respond to such a request quickly enough, the overall server throughput may suffer. The situation gets especially bad if many domain name resolutions start to time out instead of returning a response.

This will happen even in everyday operation on a server with a typical configuration. UDP packets are retransmitted only by the application, so a lost DNS query may go without a retry for a second or more.

Ideas:

  1. Install a local, fast, caching DNS server. Dnsmasq, for example, will query all servers at the same time instead of one by one if it is run with --all-servers. This creates more network traffic, but limits the number of delays caused by lost DNS packets.
  2. Prevent usage of DNS names as much as possible. Drop non-IP contacts, record-routes, etc.
  3. Make sure local resources are available via /etc/hosts and don't need a remote query.
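
To check how long name resolution takes from the server host, a small probe such as the following can help. It is a minimal sketch that uses the system resolver through Python's standard socket module, not anything OpenSIPS-specific, and the host names are placeholders.

```python
import socket
import time


def timed_resolve(host: str) -> tuple:
    """Resolve host with the system resolver.

    Returns (elapsed_seconds, resolved_ok)."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, None)
        ok = True
    except socket.gaierror:
        ok = False
    return time.monotonic() - start, ok


if __name__ == "__main__":
    # Placeholder names: replace with the domains seen in your traffic.
    for name in ("127.0.0.1", "localhost"):
        elapsed, ok = timed_resolve(name)
        print(f"{name}: ok={ok} elapsed={elapsed * 1000:.1f} ms")
```

An IP literal or a name listed in /etc/hosts should resolve almost instantly; names that regularly take hundreds of milliseconds or fail are candidates for the measures listed above.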

Excessive branches

Retrying a request which is known to fail, over a lot of branches, can lead to a lot of useless traffic and wasted processing time. For example, many phones will register a new contact after a reboot without removing the previous one. This may easily lead to >100 contacts per device in some extremely broken cases. After a lookup() on that user, OpenSIPS will generate as many branches as its limit allows. Sending all those messages, which are likely to fail, is going to impact the server. If the device also requested some presence subscriptions, the problem grows even more.

A similar problem affects situations where many gateways are tried as a failover. Some response codes indicate that retrying is not a good idea, for example when the PSTN number on the other side is busy. This depends on the actual call scenario, and the correct way to handle it may vary from one deployment to another.

Ideas:

  1. Limit the number of branches and allowed registrations as much as possible.
  2. Watch out for broken UAs and possibly introduce more restrictions based on the user agent identification header.
  3. Consider which situations are worth a retry / failover and which cannot be recovered from.
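
The idea of limiting contacts can be illustrated outside of OpenSIPS. The sketch below keeps only the most recently registered contacts for a user, so that stale registrations left behind by rebooted phones never turn into branches. The data layout and function name are assumptions made for this example and do not reflect how OpenSIPS stores contacts.

```python
from typing import List, Tuple

# (contact URI, time of last registration as a Unix timestamp)
Contact = Tuple[str, float]


def newest_contacts(contacts: List[Contact], limit: int) -> List[Contact]:
    """Return at most `limit` contacts, most recently registered first."""
    if limit < 0:
        raise ValueError("limit must not be negative")
    ordered = sorted(contacts, key=lambda c: c[1], reverse=True)
    return ordered[:limit]


if __name__ == "__main__":
    registered = [
        ("sip:alice@192.0.2.10:5060", 1000.0),
        ("sip:alice@192.0.2.10:5062", 3000.0),
        ("sip:alice@192.0.2.10:5061", 2000.0),
    ]
    for uri, ts in newest_contacts(registered, 2):
        print(uri, ts)
```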

Other operations

Use the benchmark module to time the processing of a whole request or of a specific part of the routing script. It may help to figure out where the delays are occurring.

You can feed the benchmark results into a statistics / graphing system like munin. To do this, set the granularity parameter to 0 and use the bm_poll_results command to pull each new batch of results. This shows the system performance not only when something goes wrong, but also gives reference points from previous days for comparison.

Number of message processing children

If you have many delayed operations, involving database access for example, but the hardware is not busy, you may consider adding more children to the OpenSIPS server. This may improve the situation where the server is waiting for a response rather than actually doing the database lookups (for example, when the lookups happen on a remote machine).

To do this, adjust the number in the children core parameter. Keep in mind, however, that this changes the memory requirements of the server. The server requires one shared memory allocation plus a private memory allocation for each of the children. Additionally, if disable_tcp is not enabled and tcp_children is not set, twice the number of children will be forked: one set for handling UDP and one for handling TCP messages.
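
A rough estimate of the memory footprint follows directly from this description. The sketch below is a simplification: it counts only the message processing children, ignores any other processes the server starts, and assumes a single listening interface. The function and argument names are invented for the example.

```python
from typing import Optional


def estimate_memory_mb(
    children: int,
    shm_mb: int,
    pkg_mb: int,
    tcp_enabled: bool = True,
    tcp_children: Optional[int] = None,
) -> int:
    """Estimate memory use: one shared pool plus private memory per worker.

    If TCP is enabled and no separate TCP child count is given, the number
    of TCP workers is assumed to equal the number of UDP children.
    """
    workers = children
    if tcp_enabled:
        workers += children if tcp_children is None else tcp_children
    return shm_mb + workers * pkg_mb


if __name__ == "__main__":
    # Assumed figures: 64 MB shared memory, 4 MB private memory per process.
    print(estimate_memory_mb(8, 64, 4))                     # TCP on, default
    print(estimate_memory_mb(8, 64, 4, tcp_enabled=False))  # UDP only
```

With these assumed figures, raising children from 8 to 32 with TCP left at its defaults adds 48 workers, which is 192 MB of extra private memory.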

More modules means more work

Including more modules with more functionality means that OpenSIPS needs to do more work on some packets (a generalisation, but statistically true). Try stripping some functionality from the configuration to see how the performance changes. Consider which functionality needs to be enabled and when it needs to be called.


Page last modified on May 09, 2013, at 01:49 PM