This repository has been archived by the owner on Dec 4, 2018. It is now read-only.

Concurrently query datagrepper during start-up #32

Merged
nateyazdani merged 6 commits into fedora-infra:develop from concurrent_datagrepper on Aug 27, 2015

Conversation

nateyazdani
Contributor

Employ the Python 2 backport of concurrent.futures to simulate non-blocking network I/O on top of the requests library in statscache.utils.datagrep().

Prior testing showed that statscache achieved a throughput of 125-150 messages per second during the datagrepper backprocessing phase. Since a real deployment will need to sift through the entire 32-million-plus message history (about 3 days at that rate), I've been looking intensely at ways to improve this performance, primarily by multiplexing network I/O.

Moksha already brings in Twisted as a dependency, but the StatsConsumer must be fully initialized before the reactor runs, so that was out. (Even if this weren't the case, using Twisted would mean interlacing the processing of old and new messages, adding another layer of complexity.) After some experimentation, I found that concurrent.futures increases backprocessing throughput to 250-350 messages per second, at which rate the entire datagrepper stream could be processed in around 24 hours. I also tried gevent, but it surprisingly decreased throughput at every thread pool size I tried. Performance considerations aside, concurrent.futures is the better fit, since it interferes minimally with Twisted when the reactor is run later.
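For reference, here is a minimal sketch of the technique, not the actual patch: it assumes datagrepper's paginated /raw endpoint, and the page size and MAX_WORKERS constant are chosen purely for illustration.

```python
import requests
from concurrent.futures import ThreadPoolExecutor  # 'futures' backport on Python 2

MAX_WORKERS = 16  # illustrative; a later commit moves this into the hub config
DATAGREPPER_URL = 'https://apps.fedoraproject.org/datagrepper/raw'

def datagrep(start, end, rows_per_page=100, max_workers=MAX_WORKERS):
    """Yield historical messages from datagrepper, fetching pages concurrently."""
    params = {'start': start, 'end': end, 'rows_per_page': rows_per_page}
    # One initial, synchronous request to learn how many pages there are.
    first = requests.get(DATAGREPPER_URL, params=dict(params, page=1)).json()
    for message in first['raw_messages']:
        yield message
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # A one-line lambda issues each page request on a worker thread;
        # executor.map keeps the pages in order while the requests overlap.
        fetch = lambda page: requests.get(DATAGREPPER_URL,
                                          params=dict(params, page=page))
        for response in executor.map(fetch, range(2, first['pages'] + 1)):
            for message in response.json()['raw_messages']:
                yield message
```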

Nathaniel Yazdani added 3 commits August 24, 2015 15:07
Use the Python 2 backport of concurrent.futures to issue many requests to
datagrepper concurrently. This method was specifically chosen to minimize
potential interference with the Twisted reactor system, although it should be
completely cleaned up by the time the reactor runs.
Use a one-line lambda instead of a nested function definition and use a
constant for the number of worker threads in the
concurrent.futures.ThreadPoolExecutor.
Include code to profile the statscache.utils.datagrep() generator for later
reuse, but leave it commented out to avoid flooding the log with otherwise
useless information.
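The profiling mentioned in that last commit could look something like the following sketch: a hypothetical wrapper (the name and logging interval are mine, not the patch's) that reports the generator's throughput.

```python
import logging
import time

log = logging.getLogger(__name__)

def profiled(generator, log_every=1000):
    """Wrap a message generator, logging throughput in messages per second.

    Hypothetical helper for illustration; the actual commit inlined a similar
    measurement in datagrep() and initially left it commented out.
    """
    start = time.time()
    for count, message in enumerate(generator, 1):
        yield message
        if count % log_every == 0:
            log.info("processed %d messages (%.1f msg/s)",
                     count, count / (time.time() - start))
```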
@rtnpro
Contributor

rtnpro commented Aug 25, 2015

👍

@ralphbean
Contributor

Looks good!

Also good news -- python-futures is packaged for Fedora and EPEL, so no extra work for us there.

Nathaniel Yazdani added 2 commits August 25, 2015 13:37
Draw the number of worker threads used in the statscache.utils.datagrep()
generator from the statscache hub configuration.
Enable or disable profiling measurement of the statscache.utils.datagrep()
generator based on a statscache hub configuration value.
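Combining those two commits with the earlier sketches, the wiring might look roughly like this; the config key names are assumptions, not the ones actually used in the statscache hub config.

```python
def datagrep_from_hub_config(config, start, end):
    """Build the backprocessing generator from hub config (hypothetical keys)."""
    workers = config.get('statscache.datagrep.workers', 16)
    generator = datagrep(start, end, max_workers=workers)
    if config.get('statscache.datagrep.profile', False):
        generator = profiled(generator)
    return generator
```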
@nateyazdani
Contributor Author

> OK for now. It'd be great if we could move MAX_WORKERS to config/settings.

Agreed, I'll push that change here shortly and wait for re-review by somebody.

> Also good news -- python-futures is packaged for Fedora and EPEL, so no extra work for us there.

Right on, I love it when things work out like that!

Thanks for the reviews, guys. I'm also going to push a change to activate the profiling code based on a configuration value, instead of leaving it commented out (that just felt sloppy...).

@ralphbean
Contributor

👍 much cleaner!

nateyazdani added a commit that referenced this pull request Aug 27, 2015
Concurrently query datagrepper during start-up
@nateyazdani nateyazdani merged commit 4869d8b into fedora-infra:develop Aug 27, 2015
@nateyazdani nateyazdani deleted the concurrent_datagrepper branch August 27, 2015 18:22