This repository has been archived by the owner on Dec 4, 2018. It is now read-only.
Concurrently query datagrepper during start-up #32
Employ the Python 2 backport of `concurrent.futures` to simulate non-blocking network I/O on top of the `requests` library in `statscache.utils.datagrep()`.

Prior testing of statscache showed that it was able to achieve a throughput of 125-150 messages per second during the datagrepper backprocessing phase. Since a real deployment will need to sift through the entire 32-million-plus message history (which would take about 3 days at that rate), I've been looking pretty intensely at ways to improve this performance, primarily through multiplexing network I/O. Moksha already brings in Twisted as a dependency, but the `StatsConsumer` must be fully initialized before the reactor runs, so that was out. (Even if this weren't the case, using Twisted would mean interleaving the processing of old and new messages, adding another layer of complexity.) After some experimentation, I found that `concurrent.futures` increases backprocessing throughput to 250-350 messages per second, at which rate the entire datagrepper stream could be processed in around 24 hours. I also tried out `gevent`, but it surprisingly decreased throughput at every thread pool size that I used. Performance considerations aside, `concurrent.futures` is more suitable because it interferes minimally with Twisted when the reactor is run later.
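For reference, the approach can be sketched roughly as follows. This is a minimal illustration of the thread-pool pattern, not the actual `statscache.utils.datagrep()` implementation; `fetch_page` is a hypothetical stand-in for the real `requests.get` call against the datagrepper API, and the function names and parameters are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_page(page):
    # Hypothetical placeholder for the real blocking network call,
    # e.g. requests.get(DATAGREPPER_URL, params={'page': page}).json()
    return {'page': page, 'messages': ['msg-%d' % page]}

def datagrep_concurrent(pages, pool_size=8):
    """Fetch datagrepper pages concurrently; yield results as they complete.

    Each worker thread blocks on its own HTTP request, so the pool
    multiplexes network I/O without an event loop (and without touching
    the Twisted reactor, which is started only after backprocessing).
    """
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Submit all page requests up front; the pool runs pool_size at a time.
        futures = [pool.submit(fetch_page, p) for p in pages]
        for future in as_completed(futures):
            yield future.result()

results = list(datagrep_concurrent(range(5)))
```

Under Python 2 this requires the `futures` backport package (`pip install futures`); on Python 3 `concurrent.futures` is in the standard library with the same interface. Note that `as_completed` yields results out of page order, so callers that need ordered messages must re-sort or use `executor.map` instead.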