Roee, have you found a resolution? If not, please bring it up on the dev list.

From: Roee Shenberg <[EMAIL PROTECTED]>
Date: Wednesday, August 30, 2017 at 5:43 AM
Subject: Re: Performance issues on idle trident topology

Hi Roshan,

The only unusual setting I'm using is topology.disruptor.batch.timeout.millis, set to 20 ms instead of the default 1 ms, following my previous performance issues.

On Wed, Aug 30, 2017 at 3:50 AM, Roshan Naik <[EMAIL PROTECTED]> wrote:
Is it running with default settings? If not, you may want to share any non-default settings being used. Somebody more familiar with this than I am might be able to chime in.

From: Roee Shenberg <[EMAIL PROTECTED]>
Date: Tuesday, August 29, 2017 at 9:03 AM
Subject: Re: Performance issues on idle trident topology

I'm running Storm 1.0.2 + Trident and I have an issue with idle topologies consuming many CPU cycles (around four entire cores), after having processed some messages.

I'm not quite sure what the issue is, because it requires some activity on the topology to trigger said behavior.

My investigation into the cause has reached a dead end, since the issue is not easy to reproduce on demand (it happens reliably after a few days of idle time).

I managed to make the issue happen on a 1-machine cluster with only one topology running on it, and saw the following:

sudo strace -f -c -p 24996
strace: Process 24996 attached with 242 threads
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 77.96  228.512766         260    878277           epoll_wait
 17.78   52.106948       76628       680       180 futex
  3.73   10.924000     1365500         8         8 restart_syscall
  0.20    0.584934        2600       225           epoll_ctl
  0.20    0.574205        1704       337           read
  0.14    0.416201        1850       225           write
  0.00    0.000000           0         1           stat
  0.00    0.000000           0         1           mmap
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1           rt_sigreturn
  0.00    0.000000           0         1           madvise
------ ----------- ----------- --------- --------- ----------------
100.00  293.119054                879757       188 total
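
To put the summary above in proportion, here is a minimal Python sketch that parses the strace -c rows and ranks syscalls by call count. The function name parse_strace_summary is made up for illustration, and the parser assumes the column layout shown above (5 fields when the errors column is empty, 6 otherwise). On these numbers, epoll_wait accounts for roughly 99.8% of all calls.

```python
# Sketch: parse an `strace -c` summary and rank syscalls by call count.
# The rows below are copied from the output above; the parser itself is
# illustrative and assumes this exact column layout.
STRACE_SUMMARY = """\
 77.96  228.512766         260    878277           epoll_wait
 17.78   52.106948       76628       680       180 futex
  3.73   10.924000     1365500         8         8 restart_syscall
  0.20    0.584934        2600       225           epoll_ctl
  0.20    0.574205        1704       337           read
  0.14    0.416201        1850       225           write
  0.00    0.000000           0         1           stat
  0.00    0.000000           0         1           mmap
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1           rt_sigreturn
  0.00    0.000000           0         1           madvise
"""


def parse_strace_summary(text):
    """Return a list of (syscall, seconds, calls) tuples from strace -c rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields (empty errors column) or 6 (with errors).
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            seconds = float(fields[1])
            calls = int(fields[3])
        except ValueError:
            continue  # header or separator line
        rows.append((fields[-1], seconds, calls))
    return rows


rows = parse_strace_summary(STRACE_SUMMARY)
total_calls = sum(calls for _, _, calls in rows)

# Print the three most frequently called syscalls with their share of calls.
for name, seconds, calls in sorted(rows, key=lambda r: r[2], reverse=True)[:3]:
    share = 100.0 * calls / total_calls
    print(f"{name:16s} calls={calls:7d} share={share:6.2f}% time={seconds:.1f}s")
```

The summed call count matches the 879757 total reported by strace, which is a quick sanity check that no row was dropped by the parser.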

That's an unreasonable number of epoll_wait calls, and I also noticed a large number of ZooKeeper connections: nc -n | grep 2181 | wc -l returned 25 connections.
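
A plain grep for 2181 also matches local ports and longer port numbers such as 21810, so the count can be inflated. Below is a hedged Python sketch of a stricter count that only looks at the remote port and groups by TCP state. It assumes Linux netstat -n style lines (protocol, two queue columns, local address, remote address, state); the sample addresses and the helper name count_connections are invented for illustration and are not taken from the cluster in question.

```python
from collections import Counter


def count_connections(netstat_lines, port=2181):
    """Count TCP connections whose remote port equals `port`, grouped by state.

    Assumes `netstat -n` style lines:
    proto recv-q send-q local_addr remote_addr state
    """
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip headers and non-TCP sockets
        remote, state = fields[4], fields[5]
        # rsplit keeps this working for IPv6 addresses that contain colons.
        if remote.rsplit(":", 1)[-1] == str(port):
            counts[state] += 1
    return counts


# Hypothetical sample input; the last line would match `grep 2181`
# but is not a ZooKeeper connection.
sample = [
    "tcp        0      0 10.0.0.5:51234          10.0.0.9:2181           ESTABLISHED",
    "tcp        0      0 10.0.0.5:51240          10.0.0.9:2181           ESTABLISHED",
    "tcp        0      0 10.0.0.5:51301          10.0.0.9:2181           TIME_WAIT",
    "tcp        0      0 10.0.0.5:21810          10.0.0.7:6700           ESTABLISHED",
]

result = count_connections(sample)
print(dict(result))
```

Grouping by state helps distinguish 25 live sessions from a handful of live sessions plus sockets lingering in TIME_WAIT.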

The topology statistics were ~2400 tuples emitted and ~3000 tuples transferred in the 10-minute window. Basically idle.
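
As a quick check on "basically idle", the window statistics convert to per-second rates as follows. This is only arithmetic on the approximate figures quoted above; the variable names are illustrative.

```python
# Approximate figures from the 10-minute statistics window quoted above.
WINDOW_SECONDS = 10 * 60
emitted = 2400
transferred = 3000

emitted_per_sec = emitted / WINDOW_SECONDS
transferred_per_sec = transferred / WINDOW_SECONDS

print(f"emitted:     {emitted_per_sec:.1f} tuples/s")
print(f"transferred: {transferred_per_sec:.1f} tuples/s")
```

That works out to about 4 emitted and 5 transferred tuples per second, which is very little tuple traffic for a topology occupying around four cores.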

Next, I produced a flamegraph including kernel times while tracing the process, which I couldn't attach as a fully-functional, clickable SVG so I took a screenshot of it:
[line image 1]

The flamegraph shows the entire time is spent in org.apache.zookeeper.ClientCnxn$

Split 50/50 between sys_epoll_wait in the kernel and Java code, mostly the Java side of select, with a non-negligible amount of time spent in org/apache/zookeeper/ClientCnxn$SendThread:::clientTunneledAuthenticationInProgress.

This suggests the ZooKeeper client is malfunctioning somehow, but I'm not sure how.

Do you have any advice regarding knobs to tweak in the Storm or ZK config, or additional diagnostics to run? (The relevant ZK config, AFAIK, is the tick interval, which is at the default of 2 seconds.)