This repository has been archived by the owner on Dec 4, 2018. It is now read-only.

Concurrently query datagrepper during start-up #32

Merged
nateyazdani merged 6 commits into fedora-infra:develop from concurrent_datagrepper on Aug 27, 2015

Conversation

nateyazdani
Contributor

Employ the Python 2 backport of concurrent.futures to simulate non-blocking network I/O on top of the requests library in statscache.utils.datagrep().

Prior testing showed that statscache achieved a throughput of 125-150 messages per second during the datagrepper backprocessing phase. Since a real deployment will need to sift through the entire 32-million-plus message history (about 3 days at that rate), I've been looking intensely at ways to improve this performance, primarily by multiplexing network I/O.

Moksha already brings in Twisted as a dependency, but the StatsConsumer must be fully initialized before the reactor runs, so that was out. (Even if this weren't the case, using Twisted would mean interlacing the processing of old and new messages, adding another layer of complexity.) After some experimentation, I found that concurrent.futures increases backprocessing throughput to 250-350 messages per second, at which rate the entire datagrepper stream could be processed in around 24 hours. I also tried gevent, but it surprisingly decreased throughput at every thread pool size I tried. Performance considerations aside, concurrent.futures is the better fit, since it interferes minimally with Twisted when the reactor is run later.
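For reference, here is a minimal sketch of the technique, not the actual patch: it assumes datagrepper's paginated /raw endpoint, and the page size and MAX_WORKERS constant are chosen purely for illustration.

```python
import requests
from concurrent.futures import ThreadPoolExecutor  # 'futures' backport on Python 2

MAX_WORKERS = 16  # illustrative; a later commit moves this into the hub config
DATAGREPPER_URL = 'https://apps.fedoraproject.org/datagrepper/raw'

def datagrep(start, end, rows_per_page=100, max_workers=MAX_WORKERS):
    """Yield historical messages from datagrepper, fetching pages concurrently."""
    params = {'start': start, 'end': end, 'rows_per_page': rows_per_page}
    # One initial, synchronous request to learn how many pages there are.
    first = requests.get(DATAGREPPER_URL, params=dict(params, page=1)).json()
    for message in first['raw_messages']:
        yield message
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # A one-line lambda issues each page request on a worker thread;
        # executor.map keeps the pages in order while the requests overlap.
        fetch = lambda page: requests.get(DATAGREPPER_URL,
                                          params=dict(params, page=page))
        for response in executor.map(fetch, range(2, first['pages'] + 1)):
            for message in response.json()['raw_messages']:
                yield message
```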

Nathaniel Yazdani added 3 commits August 24, 2015 15:07
Use the Python 2 backport of concurrent.futures to issue many requests to
datagrepper concurrently. This method was specifically chosen to minimize
potential interference with the Twisted reactor system, although it should be
completely cleaned up by the time the reactor runs.
Use a one-line lambda instead of a nested function definition and use a
constant for the number of worker threads in the
concurrent.futures.ThreadPoolExecutor.
Include code to profile the statscache.utils.datagrep() generator for later
reuse, but leave it commented out to avoid flooding the log with otherwise
useless information.
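The profiling mentioned in that last commit could look something like the following sketch: a hypothetical wrapper (the name and logging interval are mine, not the patch's) that reports the generator's throughput.

```python
import logging
import time

log = logging.getLogger(__name__)

def profiled(generator, log_every=1000):
    """Wrap a message generator, logging throughput in messages per second.

    Hypothetical helper for illustration; the actual commit inlined a similar
    measurement in datagrep() and initially left it commented out.
    """
    start = time.time()
    for count, message in enumerate(generator, 1):
        yield message
        if count % log_every == 0:
            log.info("processed %d messages (%.1f msg/s)",
                     count, count / (time.time() - start))
```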
@rtnpro
Contributor

rtnpro commented Aug 25, 2015

👍

@ralphbean
Contributor

Looks good!

Also good news -- python-futures is packaged for Fedora and EPEL, so no extra work for us there.

Nathaniel Yazdani added 2 commits August 25, 2015 13:37
Draw the number of worker threads used in the statscache.utils.datagrep()
generator from the statscache hub configuration.
Enable or disable profiling measurement of the statscache.utils.datagrep()
generator based on a statscache hub configuration value.
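Combining those two commits with the earlier sketches, the wiring might look roughly like this; the config key names are assumptions, not the ones actually used in the statscache hub config.

```python
def datagrep_from_hub_config(config, start, end):
    """Build the backprocessing generator from hub config (hypothetical keys)."""
    workers = config.get('statscache.datagrep.workers', 16)
    generator = datagrep(start, end, max_workers=workers)
    if config.get('statscache.datagrep.profile', False):
        generator = profiled(generator)
    return generator
```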
@nateyazdani
Contributor Author

> OK for now. It'd be great if we could move MAX_WORKERS to config/settings.

Agreed, I'll push that change here shortly and wait for re-review by somebody.

> Also good news -- python-futures is packaged for Fedora and EPEL, so no extra work for us there.

Right on, I love it when things work out like that!

Thanks for the reviews, guys. I'm also going to push a change to activate the profiling code based on a configuration value, instead of leaving it commented out (that just felt sloppy...).

@ralphbean
Contributor

👍 much cleaner!

nateyazdani added a commit that referenced this pull request Aug 27, 2015
Concurrently query datagrepper during start-up
@nateyazdani nateyazdani merged commit 4869d8b into fedora-infra:develop Aug 27, 2015
@nateyazdani nateyazdani deleted the concurrent_datagrepper branch August 27, 2015 18:22