Zabbix configuration syncer processes more than 75% busy


I am getting alerts from all proxy servers: "Zabbix configuration syncer processes more than 75% busy".

Zabbix server, DB MySQL (with SSD) and WEB are installed on separate servers,
and 7 proxies (active).

vps (server) — 1045.04 (+ many trappers)
vps per proxy — 336.68, 50.35, 180.84, 178.35, 58.88, 224.74, 6.93

Zabbix server conf:
CacheSize=2G
CacheUpdateFrequency=5
HistoryCacheSize=1G
TrendCacheSize=1G
ValueCacheSize=2G
StartDBSyncers=4
HistoryTextCacheSize=1G

Zabbix proxy conf:

ConfigFrequency=5
CacheSize=1G
StartDBSyncers=4
HistoryCacheSize=512M
HistoryTextCacheSize=512M

What could be the problem and how can I fix it?

Thanks for help!
Natalia

Zabbix server, DB MySQL (with SSD) and WEB are installed on separate servers,
and 7 proxies (active).

vps (server) — 1045.04 (+ many trappers)
vps per proxy — 336.68, 50.35, 180.84, 178.35, 58.88, 224.74, 6.93

Zabbix server conf:
CacheSize=2G
CacheUpdateFrequency=5
HistoryCacheSize=1G
TrendCacheSize=1G
ValueCacheSize=2G
StartDBSyncers=4
HistoryTextCacheSize=1G

You have 7 proxies and only 4 DB syncers.
This is the cause.

BTW: your caches are way too big.
I have 2.5k nvps and 250k items, and on the server, even with 64MB for HistoryCacheSize and TrendCacheSize, those caches are almost constantly 100% free.

CacheSize=2G means you are reserving an item configuration cache for something like 2 million items. I have 386MB, and only because I have quite a big in/out flow of hosts and must keep about 150% more CacheSize than I strictly need to cover not-monitored hosts (the configuration of not-monitored hosts stays in the config cache for escalations and other operations).
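Putting that together, something like the sketch below could be a starting point for your server with roughly 286k items, 1k nvps and 7 active proxies. The numbers are rough guesses based on the figures quoted in this thread, not tested values, and I am not touching ValueCacheSize here:

# zabbix_server.conf - illustrative sizing sketch only
CacheSize=512M            # config cache; I use 386M for ~250k items
CacheUpdateFrequency=5
HistoryCacheSize=64M      # 64M stays almost 100% free at 2.5k nvps in my setup
HistoryTextCacheSize=64M
TrendCacheSize=64M
StartDBSyncers=8          # at least one per active proxy (you have 7)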

Zabbix proxy conf:

ConfigFrequency=5
CacheSize=1G
StartDBSyncers=4
HistoryCacheSize=512M
HistoryTextCacheSize=512M

The same here. Everything can be reduced by a factor of 10, if not more. My biggest proxy, with 89k items, has CacheSize=80M.
If you have active agents you may want to increase StartDBSyncers.
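For example, something like this could be a reasonable starting point for your proxies (again, rough guesses rather than tested values):

# zabbix_proxy.conf - illustrative sizing sketch only
ConfigFrequency=5
CacheSize=128M            # my biggest proxy (89k items) runs fine with 80M
HistoryCacheSize=64M
HistoryTextCacheSize=64M
StartDBSyncers=6          # raise from 4 if many active agents push data at the same time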

If you are using MySQL >= 5.5 you should consider setting innodb_buffer_pool_instances=N, where N is not bigger than the number of CPU cores * 2 and not lower than the number of DB syncers. It noticeably improves the latency of concurrent selects and, to a smaller degree, of writes (which are the biggest problem in the DB workload generated by Zabbix).
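In my.cnf that would be something like the lines below; the pool size is only an example figure and depends on how much RAM you can dedicate to the DB backend:

# my.cnf (mysqld section) - sketch only
innodb_buffer_pool_size=16G        # example figure; size it to your available RAM
innodb_buffer_pool_instances=8     # <= 2 * CPU cores and >= StartDBSyncers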

Comment

Many thanks for the reply and very useful information/suggestion!

Where should I reduce them: on the proxy, on the server, or on both?

Should I reduce all Cache* parameters?
CacheSize=2G
HistoryCacheSize=1G
TrendCacheSize=1G
ValueCacheSize=2G
HistoryTextCacheSize=1G

I have the following details in the dashboard:

Number of hosts: 4639
Number of items: 285906
Number of triggers: 259122
Number of users (online): 126 (8)
Required server performance, new values per second: 1057.44

4600 active agents; each proxy has 300-2500 hosts.
Where should I increase StartDBSyncers: on the proxy, on the server, or on both?
The proxies are active, so what is the purpose of StartDBSyncers on the server side?

I have MySQL 5.6 with partitioning on all history* and trends* tables.
Should I increase innodb_buffer_pool_size as well? By how much?

Thanks a lot for the help!

Comment

Many thanks for the reply and very useful information/suggestion!

Where should I reduce them: on the proxy, on the server, or on both?

Should I reduce all Cache* parameters?

Sometimes more means too much.
A few GB of RAM allocated and not used means that memory is effectively wasted.
You probably run some external scripts on the proxies or on the server; if so, that memory will work more effectively even as page cache than sitting unused in Zabbix caches.

In optimization and testing the phrase "death by a thousand cuts" is sometimes used. If you cut an elephant's skin once, such a big animal may not even notice it; a thousand such cuts, however, may kill even an elephant.
That is why it is so important to keep complicated systems as close as possible to their sweet spot.
If you stop caring about details here and there, at some point the system will start behaving randomly, and no single or simple modification will improve or fix it. Why? Because the system will be suffering from only one issue: lack of care for many details.

I have the following details in the dashboard:

Number of hosts: 4639
Number of items: 285906
Number of triggers: 259122
Number of users (online): 126 (8)
Required server performance, new values per second: 1057.44

So your Zabbix is almost the same size as mine. I have fewer hosts, but for Zabbix the number of monitored items matters more.

4600 active agents; each proxy has 300-2500 hosts.
Where should I increase StartDBSyncers: on the proxy, on the server, or on both?
The proxies are active, so what is the purpose of StartDBSyncers on the server side?

On both. I'm assuming that you are not monitoring any hosts directly from the server and that only (active) proxies connect to it. In that case the number of DB syncers on the server should not be lower than the number of active proxies. Why? Because with active proxies there is some probability that all of them will push their data to the server in exactly the same period of time. Again: this is true for active proxies; with passive ones the server reads data from the proxies sequentially, one by one.

That depends on which OS runs the DB backend.
I have Solaris, and empirically I found that ZFS caching works better than the MySQL InnoDB cache, so I have only 16GB for the InnoDB pool and 32GB for the ZFS ARC.
A funny consequence of this architecture is that a restart of the DB backend is quite lightweight: mysqld only needs enough memory to hold the indexes, and the data itself is served from the ZFS ARC.
With more memory used by the ARC and its combination of MRU/MFU algorithms, the ZFS ARC hit/miss ratio is something like 4-19k/0-20 per second (the whole-day average is 5.5k/5.5 per second).
In the attachment at the bottom you can take a peek at my daily IO graph at the physical disk layer (below the ZFS pool).

On Linux (as long as you are not using ZFS) it may be different. I never had time to compare which is more effective: the page cache or the InnoDB pool. Probably the InnoDB pool will be more effective, but with more memory dedicated to it, warming up the DB caches may take longer, so it can be a kind of double-edged sword.
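For reference, on Solaris the 16GB/32GB split above is usually enforced by capping the ARC in /etc/system and sizing the InnoDB pool in my.cnf. Treat the exact lines below as an assumption, since the tunable name and syntax can differ between releases:

* /etc/system - cap the ZFS ARC at 32GB (value in bytes, reboot required)
set zfs:zfs_arc_max = 34359738368

# my.cnf - keep the InnoDB pool small, mostly for indexes
innodb_buffer_pool_size=16G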

You have almost the same number of items as I have, so the daily volume of data written to the DB should be close to the amount of memory used by the various caches in and underneath the DB engine. Why? Because most people look at graphs on a one-day scale or less; with all of the last 24h of data cached, it is very likely that the data needed for someone's graph(s) will come from memory instead of storage.
As long as you have partitioned history* tables it is very easy to calculate how much memory needs to be spent on caching: something like the average size of the daily partitions should be enough.
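On MySQL you can read those partition sizes straight from information_schema; a quick query like the one below gives you the figure to aim for (the schema name 'zabbix' is an assumption, adjust it to your installation):

-- size of the daily history partitions, in MB
SELECT TABLE_NAME, PARTITION_NAME,
       ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS size_mb
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = 'zabbix'
  AND TABLE_NAME LIKE 'history%'
ORDER BY TABLE_NAME, PARTITION_NAME;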

PS. If anyone in the London area is interested in the details of my Zabbix server setup, I'll be giving a presentation at the Solaris SIG meeting in September at the Oracle office. Those meetings are on the second Wednesday of every month and usually start at 20:00 (SIG meetings are free, you only need to register).

Comment

Could you post your server and proxy conf?

Why increase DB syncers on the proxy? Is 4 not enough?
Regarding the server, I understand 🙂
What is the limit on a proxy? (How many items or hosts?)

One more question: I have partitioning only for the history and trends tables, so I still need housekeeping to run. How often do you run it?
Should it run only on the server side or on the proxy as well?
Why would I need it on the proxy?

Comment

With active agents, in the worst-case scenario you may have all agents connecting to the proxy and pushing their data into the proxy DB at the same time.
An active-agent configuration scales better than the passive variant, but you must have enough connection channels from the proxy to the DB backend to push the data over more than a single connection.

I have no idea where the real limits are.
So far 100k items with about 1k nvps is no big deal when using MySQL as the DB backend. An SQLite backend works well enough with up to a few tens of monitored hosts; the problem with SQLite is that every insert or update rewrites the whole file, so at some point it becomes an IO bottleneck.
IMO even for small proxies a MySQL DB backend is the absolute minimum. A 1GB InnoDB pool is enough for good speed with 100k items and 1k nvps.
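In my.cnf terms on the proxy's DB backend that boils down to a single line (the figure is simply the one quoted above):

# my.cnf on a proxy DB backend - sized for ~100k items / ~1k nvps
innodb_buffer_pool_size=1G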

Remember that the proxy DB is not used in the same way as the server's. A proxy generally only stores data into the DB, and a cache of a few tens of MB is enough to hold all the data before it is sent to the server, so the proxy does not have to read that data back from the DB. The DB content is used in the second scenario, when the proxy loses sync with the server and pushes all the accumulated data once connectivity is restored.
Maybe, once the proxy's buffered data queue gets big enough, it would be better to drop the oldest data, for example by dropping the oldest of hourly-created partitions. However, so far I've not been able to reach a big enough data flow through a proxy to justify investing time in such experiments.

In my setup all proxies hold only the last 4h of data. That is enough for the typical disconnections which we have from time to time in our environment, or for planned server downtime, for example during a major upgrade (in my case a minor Zabbix upgrade is so rock solid that it is performed as a BAU change).
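In the proxy configuration that retention is normally controlled by the buffer parameters below; the post does not name them explicitly, so take this as an assumption of how the 4h figure maps onto zabbix_proxy.conf:

# zabbix_proxy.conf - data retention sketch
ProxyLocalBuffer=0      # do not keep data locally once it has been sent to the server
ProxyOfflineBuffer=4    # keep at most 4 hours of data while the server is unreachable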

With HK on the server the story is different.
A few weeks ago it seems we hit some server HK limits.

The general problem is that even if you have disabled HK on the trends and history tables, deleting a host still writes housekeeper tasks to the housekeeper table. In my environment we have a relatively big flow of hosts going in and out (caused by auto scaling in AWS).
After a few months of working with HK disabled on trends and history we had more than 50 million housekeeper entries (at the moment almost 60 million).
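If you want to check how big that backlog is on your own installation, a simple count on the housekeeper table is enough (run it against the server DB):

-- pending housekeeper tasks, grouped by the table they refer to
SELECT tablename, COUNT(*) FROM housekeeper GROUP BY tablename;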

The first problem I hit when trying to flush the HK records for the trends table was the HK process itself: after finishing a cycle it deletes all completed HK entries with a single DELETE query (with a few million rows that is not going to happen).
Additionally, a failure of that query makes HK retry the same DELETE in an infinite loop. So that is the second bug.
The third and biggest bug in HK is how the process prepares the list of data before it starts the delete queries on the trends and history tables.
Preparing that data is done by a very long-running SELECT that does a full scan of the history or trends table.
The trends table is usually much smaller than history, and in my case a query like select itemid,min(clock) from trends group by itemid takes about 3h. Doing the same on history is far worse.
After the query select itemid,min(clock) from history group by itemid had been running for 22h on almost 2 billion rows, I stopped HK.

The housekeeper in its current implementation has serious scalability limitations. No matter how many HK records need to be flushed, the duration of an HK cycle scales not with the size of the HK queue but with the size of the history and trends tables.

The problem with long-running selects like the ones above, in an environment where housekeeping is done by creating a few daily partitions in advance and dropping the oldest one, is that with a query running for more than 20h you have a very high chance of a collision with creating new partitions and dropping the oldest data.
Those selects hold a backlog of not-fully-committed transactions in the DB, and as long as those transactions are not fully committed you cannot run ALTER queries.
This causes another bad consequence: the ALTER query waits indefinitely to obtain a lock on the table, and that blocks all read and write queries.
I'm not sure, but ALTER queries should give up after some timeout if they cannot obtain the lock, so it is even possible that we are also talking about a MySQL bug here (I'm using MySQL 5.5 and will probably try to discuss this with Oracle support as well).

To unblock this you must kill the ALTER query or stop/kill the long-running select.
This cascade of bad steps happens when using MySQL, but I'm pretty sure something very similar happens with PostgreSQL or other SQL engines.
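On MySQL, finding and killing the offending statements comes down to the standard process list commands; the session ids below are of course just placeholders:

-- find the long-running SELECT and the ALTER stuck in "Waiting for table metadata lock"
SHOW FULL PROCESSLIST;
-- kill by the Id column from the output above (example ids)
KILL 12345;    -- the housekeeper's full-scan SELECT
KILL 12346;    -- the blocked ALTER, if it does not resume on its own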

However, I have good news: Zabbix support has already identified all those issues.
Anyone interested in a solution to the HK issues should monitor the publicly available case https://support.zabbix.com/browse/ZBXNEXT-2860

General advice to everyone using partitioned trends/history tables: disable HK on the tables whose content is maintained by dropping partitions, because enabling it will freeze the whole DB engine when partition maintenance overlaps with the HK preparation select query; or at least do not try to enable HK.
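For reference, the partition-based rotation mentioned above boils down to statements like the ones below; the partition names and the timestamp are assumptions for illustration (most setups generate them from a stored procedure or cron job):

-- add tomorrow's daily history partition in advance (upper bound is a unix timestamp)
ALTER TABLE history ADD PARTITION (PARTITION p2016_08_20 VALUES LESS THAN (1471737600));
-- drop the oldest daily partition instead of letting the housekeeper DELETE rows
ALTER TABLE history DROP PARTITION p2016_08_13;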

IMO the immediate fix that should be applied in the item deletion code is to stop writing HK records to the housekeeper table.
However, the Zabbix maintainers may have a different opinion on how to handle this.

PS. At the end I must say a big Thank You to the whole Zabbix Support Team. These guys IMO are doing a ReallyGoodJob(tm).
The money spent on paid Zabbix support is IMO well worth what is provided under it.


Source: www.zabbix.com