Use defined timeout also for HTTP/FTP connections. #53
Conversation
The timeout check was previously performed only between HTTP/FTP connections. The crawler could therefore still wait for a very long time on a single open connection. Now the timeout is also specified for each opened HTTP/FTP connection, to make sure the crawler times out after the configured time.
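A minimal sketch of the idea, assuming Python's `http.client` and `ftplib` are used for the connections (the host name, constant name, and timeout value below are placeholders, not taken from the actual crawler code):

```python
from ftplib import FTP
from http.client import HTTPConnection

# Hypothetical example: reuse the crawler's per-host timeout (120 minutes
# by default) as the timeout for each individual connection as well.
CRAWL_TIMEOUT = 120 * 60  # seconds

# Every socket operation on this connection now raises socket.timeout
# instead of blocking indefinitely on an unresponsive host.
conn = HTTPConnection("mirror.example.org", timeout=CRAWL_TIMEOUT)

# ftplib accepts the same keyword; connect(), login(), and all
# transfers inherit the timeout.
ftp = FTP(timeout=CRAWL_TIMEOUT)
```

Both constructors pass the value through to the underlying socket, so a hung GET/HEAD or directory listing fails after at most `CRAWL_TIMEOUT` seconds instead of stalling the whole crawl.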
Looks good to me, but do we really want the timeout that long?
About the timeout: this is really complex, as we have different timeouts at different parts of the crawler, or rather, the timeouts have different meanings. The default configuration of the crawler has a timeout of 120 minutes per host. Whether the timeout has been reached is tested at different places in the code. The enforcement of the timeout, however, also depends on the protocol used for crawling.

Hosts crawled via rsync are basically free of the timeout. We have specified a timeout of 4 hours for the rsync command, but the rsync timeout has a different meaning: if no network I/O happens for more than 4 hours, the process is aborted. What we do not have is an overall timeout for the rsync command. If the crawling via rsync takes 6 hours, we currently have to wait until it finishes, as long as there is continuous network I/O. Once the crawling via rsync has finished, there is no further timeout check. If the host takes longer than 120 minutes (the timeout mentioned at the beginning) in four consecutive crawls, it will be disabled. So we kind of have an indirect way of excluding slow rsync mirrors.

If rsync is not available, FTP or HTTP is used. FTP crawls whole directories in one go and HTTP crawls each file separately. After each crawled entity (directory or file) the timeout is checked, and the crawl is aborted if it has reached two hours.

Now to the point this patch is trying to address: I have seen hosts which hang for longer than 2 hours in a single HTTP GET/HEAD (for whatever reason). During this blocking network I/O the crawler does not check whether the timeout has been reached. Therefore I wanted to limit the duration of a single HTTP/FTP connection to the crawler's timeout. As you have mentioned, there is also keepalive involved: either we open a new HTTP connection for each HTTP HEAD/GET, or keepalive is supported. Without keepalive the behaviour is again different. In addition, the timeout is per host.

It would probably make more sense to have a timeout per category, as a mirror which carries archive and secondary will take much more time than a host which only has EPEL. So the timeout handling is difficult and complicated, and providing the same timeout behaviour for all parts of the code sounds like a lot of work. It would be great if it could be properly changed/fixed/resolved. This simple patch is more of a band-aid to help detect very slow network connections and not have to wait forever until the OOM killer finishes the crawl. I am also not 100% sure what effect the timeout value actually has (especially in combination with or without keepalive). Rather a long comment for such a 'simple' patch.
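To illustrate the difference between the two rsync timeouts described above, here is a hedged sketch (the constant values, flag choice, and helper name are hypothetical, not the crawler's actual code): rsync's own `--timeout` only fires on network inactivity, while wrapping the process in `subprocess.run(..., timeout=...)` would additionally bound the total wall-clock run time.

```python
import subprocess

IO_TIMEOUT = 4 * 60 * 60    # rsync --timeout: abort only after this much network inactivity
WALL_TIMEOUT = 2 * 60 * 60  # hypothetical overall cap, matching the 120-minute crawler timeout

def crawl_rsync(url, destdir):
    """Run rsync with both an inactivity timeout and a wall-clock timeout."""
    cmd = ["rsync", "--timeout=%d" % IO_TIMEOUT, "--recursive", url, destdir]
    try:
        # subprocess.run's timeout kills the child after WALL_TIMEOUT seconds
        # of wall-clock time, even if network I/O is still happening.
        subprocess.run(cmd, timeout=WALL_TIMEOUT, check=True)
        return True
    except subprocess.TimeoutExpired:
        return False
```

With only `--timeout`, a slow but steadily transferring mirror can run for many hours; the outer `timeout=` is what would turn the per-host limit into a hard cap.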
Thanks for the long and detailed reply. ;) Makes sense, so I am +1 to this band-aid for now, and perhaps down the road we can rework the timeouts so they are all configurable/make more sense. :)
RFE: Could we document this somewhere under the |
I will write something similar and store it in the |