On a distributed platform with multiple nodes, we often need a way to group OpenSIPS instances that collaborate for the same purpose into separate clusters. An example is the scenario discussed during the last public meeting, where some nodes were designated to handle NAT pinging for one set of subscribers, while other nodes handled a different set.
When designing such a mechanism, there are a few problems that need to be addressed:
There are several directions that we have to take into account:
18:00 ---| bogdan_vs has changed the topic to: OpenSIPS monthly meeting
18:01 < razvanc>| here's the meeting's page: http://www.opensips.org/Community/IRCmeeting20150624
18:02 < razvanc>| so today we'll be talking about Clustering
18:02 < razvanc>| we'll continue the last meeting's topic :)
18:03 < lirakis>| is the general "thought" at least from the openips projects perspective - to more or less come up with a "cluster" module - that would implement some ... discovery / exchange mechanism?
18:04 <@ bogdan_vs>| first, to be sure everybody is on the same page - let's put in 2 words why we need clustering feature
18:04 <@ bogdan_vs>| in 2.2, we already have 2 features allowing multiple OpenSIPS servers to exchange/share data directly
18:04 <@ bogdan_vs>| and not via DB (SQL or noSQL) as we had so far
18:04 < lirakis>| this is ... binary replicaation and ?
18:05 <@ bogdan_vs>| in 2.2 we can do distributed dialog profiling and call rating based on BIN
18:05 <@ bogdan_vs>| yes binary replication directly between the OpenSIPS instances
18:06 <@ bogdan_vs>| and during last meeting we touched in the similar way the topic of distributed USRLOC
18:06 <@ bogdan_vs>| and distributing via BIN (versus noSQL) may be an option
18:07 < lirakis>| is binary replication TCP ?
18:07 <@ bogdan_vs>| so we already have 3 functionalities which need to know all the OpenSIPS peers involved in the sharing....basically the cluster
18:07 < liviuc>| of course, we are talking about a TCP-based replication mechanism
18:07 < lirakis>| just want to make sure ;)
18:07 <@ bogdan_vs>| lirakis: right now on UDP only, but liviuc promised to add TCP
18:07 < lirakis>| oh - ok
18:08 < lirakis>| yea so in that case binary replication can not really be trusted to be consistent in any but the most local clusters
18:08 < lirakis>| at any rate - yes we have 3 functionalities which need to know all the opensips peers involved
18:09 <@ bogdan_vs>| indeed
18:09 <@ bogdan_vs>| and we are looking at the option to have a "cluster" module
18:09 < Hydrosine>| So what kind of input are you looking for from us?
18:09 <@ bogdan_vs>| to know the peers, abd eventually their state
18:09 < lirakis>| i think you could have a "cluster" module ... as described that overlays these things is a good idea ... in that .. it becomes the discovery layer
18:10 < lirakis>| and then other modules that do some type of distributed state can hook into the cluster module to get the list of proxies
18:10 < lirakis>| so the discovery is decoupled from the state exchange
18:10 <@ bogdan_vs>| Hydrosine: 2 things - 1) if it make sense and 2) if yes, what other things it can be used for
18:10 <@ bogdan_vs>| (aside dialog, ratelimit, usrloc)
18:11 < lirakis>| i think that it COULD be used also as a state exchange piece too ... like it has some generic "state" object that is more or less a hash, which it could propegate to all known nodes
18:11 < Hydrosine>| LoadBalancing, Sharing the load of destinations with other load balancers
18:11 <@ bogdan_vs>| Hydrosine> indeed
18:12 < liviuc>| @lirakis - doesn't that sound like a heartbeat mechanism? or did I misunderstood completely?
18:12 < lirakis>| no not a heart beat at all
18:12 < liviuc>| s/ood/and/
18:12 < lirakis>| so im basically talking about some thing similar to how cassandra does discovery and "gossip"
18:13 < lirakis>| each "node" has an ip of at least 1 other node in the cluster, and when it comes up it queries that node for the "State" of the cluster
18:13 < carrar>| bogdan_vs: Could this clustering work in a anycast type setup, were all the public SIP IP's are are the same, and then a unique IP on each server for the communications between the servers?
18:13 < lirakis>| this state includes information about other nodes in the ring, as well as what state they have
18:14 < lirakis>| the "other state" they have could be all sorts of things ... number of dialogs, pike events .. who knows
18:14 < lirakis>| its an opaque object
18:14 < Hydrosine>| addon on lirakis, Galera cluster also works with this principe. you have to know only one node in the cluster to get the information about all of them.
18:16 <@ bogdan_vs>| carrar: what you say touches a different aspect - how to access the cluster from outside
18:17 <@ bogdan_vs>| right now, we more look into building the cluster itself...in terms of inter node communication
18:17 < carrar>| bogdan_vs: sip connections to customers would use the anycast IP
18:17 < carrar>| which is the same across all servers
18:17 <@ bogdan_vs>| or some DNS balancing
18:17 < carrar>| they get routed to the closest server
18:17 < lirakis>| carrar: we arent talking about load balancing requests
18:17 < carrar>| nore am I
18:18 < lirakis>| we are talking about node discovery, and data exchange between nodes
18:18 < liviuc>| also possibly smooth and transparent addition of extra nodes
18:19 < lirakis>| right - that would happen with the periodic "gossip" between nodes
18:19 < lirakis>| https://wiki.apache.org/cassandra/ArchitectureGossip
18:20 < Hydrosine>| will every node know about every transaction happening within the cluster? because most modules are built upon transaction information or dialog info (ie Load_balancer). So if all opensipses knew about all dialogs, they also know the load?
18:20 < lirakis>| so .. this is some thing that i think does not belong in the "cluster" module
18:20 < lirakis>| realtime data
18:20 < lirakis>| should be replicated some other way
18:20 < lirakis>| IMO
18:20 <@ bogdan_vs>| <Hydrosine> : sharing dialog info may be too much
18:20 <@ bogdan_vs>| the idea is to replicate only relevant data
18:21 <@ bogdan_vs>| like for shared dialog profiles, we share the value of local profile counters
18:21 < lirakis>| node down ... or too many dialogs ... etc. more "state" information
18:21 <@ bogdan_vs>| no need to share the whole dialog info
18:21 < Hydrosine>| ok
18:21 < jarrod>| hmm
18:23 < lirakis>| in that sense - you can do some thing like indicate which "nat ping flag" a proxy is responsible for, and if that proxy goes down, another node could take over ownership of the flag and start pinging
18:23 < lirakis>| i mean there are complexities about race/glare conditions for changing that ownership state
18:24 < lirakis>| but .. the framework would be there
18:25 < Hydrosine>| i don't think a clustering module is needed. we have all the events to work with the 'relevant data', we can puth those events wherever we want, mysql,couchbase,webapi's. Why develop a cluster module?
18:25 < Hydrosine>| but some modules do need to share some more information, but i think not on the opensips lvl
18:25 < lirakis>| Hydrosine: how do other proxies know about each other?
18:25 < lirakis>| how do they exchange the data full mesh?
18:25 < lirakis>| the idea is to exchange information WITHOUT using a DB backend
18:27 < Hydrosine>| for some modules they need to now eachother yes, the nat pinging. or uac_registrant!!
18:28 < lirakis>| yeah i agree not every module needs it - but just trying to figure out if there is a general framework for "cluster" that would be useful to hook into
18:28 < jarrod>| i guess im not understanding the type of information that would be relevant for the cluster module
18:28 < Hydrosine>| ^
18:28 < razvanc>| following the distributed usrloc problem last time
18:28 < jarrod>| now THAT i need/want
18:28 < lirakis>| jarrod: not sure if yourecall the nat pinging issue that was discussed.
18:29 < lirakis>| proxy X is responsible for pinging clients with bflag foo set
18:29 < lirakis>| proxy X goes down
18:29 < lirakis>| how does proxy Y know to take over pinging clients with bflag foo set
18:29 < jarrod>| well
18:29 < jarrod>| thats a good example
18:29 < lirakis>| this could all easily be done in a distribtued key store (cassandra, mongo, redis) etc
18:29 < razvanc>| exactly, I think the idea is that some modules need to know the entire topology to be able to take decisions
18:30 < lirakis>| just trying to figure out if there are other use cases too
18:30 < jarrod>| so instead of building it into the individual modules, create a more general module for them to exchange information
18:30 < lirakis>| and .. if there is some generalized mechanism that would be useful
18:30 < lirakis>| jarrod: yea thats kinda of the idea
18:30 < razvanc>| lirakis: I think there is
18:30 < razvanc>| imagine the dialog replication
18:30 < lirakis>| jarrod: node discovery, and basic state exchange
18:30 < Hydrosine>| uac_registrant is the same. share records among a clusters. but if one goes down the registers that node served have to be taken over.
18:31 < razvanc>| let's say you have a platform with 3 nodes, but you only want to replicate the dialogs to a single more instance
18:31 < jarrod>| yea, i use elasticsearch in this way
18:32 < razvanc>| I mean I see this useful for many scenarios where you want to group two or more instances to do the same thing
18:33 < jarrod>| a clustering module that does discovery with state and provides general hooks for exchanging data between X nodes to individual modules
18:33 < jarrod>| that sounds like a great idea
18:33 < Hydrosine>| gtg, i read up tommorow ;)
18:33 < razvanc>| Hydrosine: sure, thanks for attending
18:34 < razvanc>| I'll publish the logs
18:34 < lirakis>| jarrod: yea - that basically sounds like what im thinking of
18:34 < lirakis>| ALMOST like a cachedb backend
18:34 < lirakis>| with some extra sauce
18:35 < lirakis>| hehe
18:35 < jarrod>| sauce is always good
18:35 < liviuc>| also, the module should be seen as a mere performance optimizer, with the "distributed modules" easily being able to use NoSQL backends as well
18:36 < jarrod>| this may be too specific, but i wonder what happens on network partition
18:36 < jarrod>| i guess they store revisions and are brought up to speed by other nodes?
18:36 < lirakis>| so .. i am suggesting a cassandra gossip like exchange
18:37 < lirakis>| which automatically recovers from a network partition
18:37 < lirakis>| based on gossip digests, timestamps and built in hearbeat sequence
18:38 < lirakis>| got to drop for 5 min. back in a few
18:39 < jarrod>| yea, i guess so many database/key stores support great clustering, replicating, now even write anywhere environments, and this is just going to get better and better
18:40 < jarrod>| im always leery of reinventing something
18:40 < razvanc>| indeed, there are some mechanisms that solve this problem
18:40 < liviuc>| ^
18:40 < razvanc>| I'm going to look into some of them to find the best solution
18:41 < razvanc>| but for start, we were thinking to specify all nodes in a database
18:41 < jarrod>| cassandra, while it does some things pretty well, is just a heavy layer
18:42 < razvanc>| indeed, the next step would be to make the nodes auto-discoverable
18:42 < jarrod>| yea, discovery could be added later... i do like how C* has the concept of datacenters / racks
18:42 < jarrod>| for more geodistributed environments
18:42 < razvanc>| C*?
18:42 < jarrod>| C* == cassandra
18:42 < jarrod>| its just such a bulky project (and java eeek)
18:43 < razvanc>| oh, yeah, I know
18:43 < razvanc>| :)
18:44 < razvanc>| so, getting back, we were thinking we could specify the nodes in DB
18:45 < razvanc>| each instance queires the table
18:45 < razvanc>| and finds out all the other nodes
18:46 < liviuc>| this way, you must do a rolling restart when adding a new node, right?
18:47 < liviuc>| or MI command - never mind :)
18:47 < jarrod>| or mi
18:47 < jarrod>| yea
18:47 < razvanc>| yes, but again, that's the initial solution
18:47 < razvanc>| the next step, the servers could comunicate between them
18:47 < lirakis>| ok back
18:48 < razvanc>| and using a heartbeat mechanism disable the servers that are down
18:48 < razvanc>| because let's be honest, servers do no pop up that often :)
18:48 < razvanc>| when a new server apperas, you can do a rolling mi command
18:48 < razvanc>| :)
18:49 < liviuc>| the initial list _has_ to be file-system persistent - whether if it's a DB (like in MongoDB) or config files (like Percona Cluster)
18:49 < liviuc>| actually Mongo uses files+own collections - my bad
18:49 < lirakis>| so ... the idea is to put all the nodes in a DB?
18:50 < lirakis>| so we are still reliant on some distributed data store
18:50 < jarrod>| well, i think initially it would be each node each in their local db?
18:50 < razvanc>| lirakis: yes, for the begining
18:51 < jarrod>| oh
18:51 < jarrod>| hmm
18:51 < lirakis>| yeah i mean the whole idea of a cluster module (for me at least) would be to not have to rely on, or set up a heavy weight distributed data store
18:51 < lirakis>| ESPECIALLY for some thing that is "simple" like node discovery
18:52 < liviuc>| @lirakis: I like your p2p-oriented thinking :)
18:52 < razvanc>| lirakis: I agree
18:52 < lirakis>| i think for node discovery the gossip style thing on startup is really light weight, and not complicated
18:52 < liviuc>| so, lirakis is suggesting simply having to specify _1_ neighbour node on each OpenSIPS cluster node
18:53 < liviuc>| correct me if I'm wrong
18:53 < lirakis>| that is correct
18:53 < razvanc>| node discovery is just one issue
18:53 < lirakis>| right i understand
18:53 < lirakis>| but im saying .. that the node discovery part is not reinventing the wheel
18:53 < lirakis>| unlike distributing some shared state
18:53 < lirakis>| that IS reinventing the wheel
18:54 < liviuc>| heartbeats and split-brain problems are ... :(
18:54 < razvanc>| and I think we have it covered by 2 means: either DB provisioning (not very nice) and auto-learning (nice but more complex)
18:54 < razvanc>| I am now trying to address other issues
18:54 < lirakis>| ok
18:55 < liviuc>| maybe we should assess the performance gain of localizing the distributed data storage
18:55 < lirakis>| well ... i think if we are going to internalize/localize the distribtued data element within opensips, it HAS to be light weight
18:55 < lirakis>| if its not, then there is no point - just go setup NOSQLDB
18:56 < lirakis>| and i know its a complicated problem
18:56 < liviuc>| lightweight/heavyweight has nothing to do with that
18:56 < lirakis>| so ... is there really a way to do it clean
18:56 < liviuc>| it's just the matter of dealing with all the associated issues - exactly
18:57 < lirakis>| to quote bogdan_vs, i dont want to make opensips a "Frankenstein" of a proxy + some weird distributed data store project
18:57 < liviuc>| ^
18:57 < jarrod>| ^
18:58 < lirakis>| so ... im not certain if we can get some type of modular distribtued data store that .... isnt going to make it such a thing
18:58 < lirakis>| jarrod: back to magic database ;)
18:58 < jarrod>| thats my thinking... the databases that exist kinda accomplish this and more and more are going to come out
18:58 < jarrod>| and features are going to be added... the world is going distributed
18:59 < lirakis>| so ... are we just going to say ... its not worth it, and wait for a light weight distributed key store to be developed?
18:59 < jarrod>| i can see the need though, for communicating interal datastructures and information between proxies
18:59 < jarrod>| but similar to the binary replication?
18:59 < liviuc>| @lirakis: all it needs is to support async ops, and we're golden
19:00 < lirakis>| i can too ... im just not sure if the whole CAP theorem thing is not a simple problem to solve
19:00 < lirakis>| arg... magic database you torment me!
19:01 *| lirakis had a dream about writing a light weight distributed key value store thus deemed "magic database"
19:01 < razvanc>| :)
19:01 *| jarrod volunteered to help
19:01 < razvanc>| ok guys\
19:01 < razvanc>| let's wrap up
19:01 *| jarrod waves his wand
19:01 *| liviuc hops on his unicorn
19:01 < lirakis>| lol
19:02 < lirakis>| so ... conclusions - razvanc ?
19:03 < razvanc>| so I think we still need this module for a couple of reasons
19:03 < jarrod>| was the thought about distributed usrloc postponed hoping to be accomplished by a cluster module?
19:03 < lirakis>| i think they are complimentary
19:03 < razvanc>| jarrod: no, not at all
19:04 < razvanc>| lirakis: not really
19:04 < razvanc>| I mean if everybody pings everybody, then yes, they are complementary
19:04 < razvanc>| but if you want to assing specific clients to specific registrars
19:05 < razvanc>| then you need a mechnism to know the topology
19:05 < lirakis>| if you use an edge ... and the registrars do the pinging then you are good
19:05 < razvanc>| and basically that's what I was trying to find out today
19:05 < razvanc>| even if it uses an edge
19:05 < razvanc>| what happens if one registrar goes down?
19:06 < lirakis>| right - so each registrar uses a different natbflag ... and says in its state what edge its behind, and what blfag its using
19:06 < lirakis>| so if it goes down - and is detected via this gossip mechanism
19:06 < razvanc>| yup, that's the idea
19:06 < razvanc>| I mean even if it is not a gossip mechanism
19:06 < lirakis>| a registrar behind the same edge will take over that natblfag as well as its own
19:06 < liviuc>| @razvanc: can't they make use of the atomicity and availability of the NoSQL cluster to publish / retrieve state information?
19:06 < razvanc>| there's no way to control that now
19:06 < lirakis>| sure this is all "Theoretical"
19:07 < jarrod>| lunch, brb
19:07 < razvanc>| liviuc yes, it could
19:07 < razvanc>| but that's not there now
19:07 < razvanc>| and even if you have to take that info from SQL, NoSQL or gossiping, this mechanism has to be implemented
19:08 *| liviuc writes down a new task for Vlad
19:08 < lirakis>| heh
19:08 < razvanc>| :)
19:10 < razvanc>| ok guys, thank you for attending for this meeting
19:10 < razvanc>| I will write down the conclusions for today
19:10 < razvanc>| and keep you updated
19:11 < lirakis>| great
19:11 < lirakis>| thanks again for having the meeting!
19:11 < liviuc>| thank you for attending and sharing valuable ideas :)
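
The seed-node idea raised in the log (each node knows at least one peer, asks it for the cluster view, and later notices peers whose heartbeat stopped advancing) can be illustrated with a rough sketch. This is not OpenSIPS code and not a design the meeting agreed on: the `Node` and `PeerState` classes, their fields, and the timeout handling are all hypothetical names chosen for illustration, and the sketch runs in-process with no networking. It also ignores the race/glare problem mentioned in the discussion, since it only reports which flags are orphaned and does not decide who takes them over.

```python
from dataclasses import dataclass, field


@dataclass
class PeerState:
    heartbeat: int = 0              # counter advanced only by the peer itself
    last_seen: float = 0.0          # local time when we last saw it advance
    flags: frozenset = frozenset()  # e.g. NAT-ping bflags the peer handles


@dataclass
class Node:
    """Hypothetical cluster member holding its own view of the cluster."""
    name: str
    flags: frozenset = frozenset()
    view: dict = field(default_factory=dict)  # peer name -> PeerState

    def __post_init__(self):
        # A node always knows about itself.
        self.view[self.name] = PeerState(flags=self.flags)

    def tick(self, now):
        """Advance our own heartbeat so peers can tell we are alive."""
        me = self.view[self.name]
        me.heartbeat += 1
        me.last_seen = now

    def merge(self, remote_view, now):
        """Keep, per peer, whichever entry carries the higher heartbeat."""
        for peer, remote in remote_view.items():
            local = self.view.get(peer)
            if local is None or remote.heartbeat > local.heartbeat:
                self.view[peer] = PeerState(remote.heartbeat, now, remote.flags)

    def gossip_with(self, other, now):
        """One gossip round: both sides exchange and merge their views."""
        mine = dict(self.view)  # snapshot taken before we merge
        self.merge(other.view, now)
        other.merge(mine, now)

    def dead_peers(self, now, timeout):
        """Peers whose heartbeat has not advanced within `timeout`."""
        return sorted(p for p, s in self.view.items()
                      if p != self.name and now - s.last_seen > timeout)

    def orphaned_flags(self, now, timeout):
        """Flags owned by peers considered down (takeover candidates)."""
        orphaned = set()
        for peer in self.dead_peers(now, timeout):
            orphaned |= self.view[peer].flags
        return orphaned
```

In this sketch, a node `c` that only ever gossips with seed `a` still learns about `b` (and the bflag `b` advertises), because `a` passes along its merged view. If `b` then stops ticking, `c.dead_peers(now, timeout)` eventually lists it and `c.orphaned_flags(now, timeout)` returns the flags that some surviving node would need to pick up.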