Use defined timeout also for HTTP/FTP connections. #53
Conversation
The timeout check was previously performed only between HTTP/FTP connections. The crawler could therefore still wait for a very long time on a single open connection. Now the timeout is also specified for each opened HTTP/FTP connection, to make sure the crawler times out after the configured time.
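A minimal sketch of the idea, assuming Python's `http.client` and `ftplib` are used for the connections (the host name, constant name, and timeout value below are placeholders, not taken from the actual crawler code):

```python
from ftplib import FTP
from http.client import HTTPConnection

# Hypothetical example: reuse the crawler's per-host timeout (120 minutes
# by default) as the timeout for each individual connection as well.
CRAWL_TIMEOUT = 120 * 60  # seconds

# Every socket operation on this connection now raises socket.timeout
# instead of blocking indefinitely on an unresponsive host.
conn = HTTPConnection("mirror.example.org", timeout=CRAWL_TIMEOUT)

# ftplib accepts the same keyword; connect(), login(), and all
# transfers inherit the timeout.
ftp = FTP(timeout=CRAWL_TIMEOUT)
```

Both constructors pass the value through to the underlying socket, so a hung GET/HEAD or directory listing fails after at most `CRAWL_TIMEOUT` seconds instead of stalling the whole crawl.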
Looks good to me, but do we really want the timeout that long?
About the timeout: this is really complex, as we have different timeouts at different parts of the crawler, or rather, the timeouts have different meanings. The default configuration of the crawler has a timeout of 120 minutes per host. Whether the timeout has been reached is tested at different places in the code. The enforcement of the timeout, however, also depends on the protocol used for crawling.

Hosts crawled via rsync are basically free of the timeout. We have specified a timeout of 4 hours for the rsync command, but the rsync timeout has a different meaning: if no network I/O happens for more than 4 hours, the process is aborted. What we do not have is an overall timeout for the rsync command. If the crawling via rsync takes 6 hours, we currently have to wait until it finishes, as long as there is continuous network I/O. Once the crawling via rsync has finished, there is no further timeout check. If the host takes longer than 120 minutes (the timeout mentioned at the beginning) in four consecutive crawls, it will be disabled. So we kind of have an indirect way of excluding slow rsync mirrors.

If rsync is not available, FTP or HTTP is used. FTP crawls whole directories in one go and HTTP crawls each file separately. After each crawled entity (directory or file) the timeout is checked, and the crawl is aborted if it has reached two hours.

Now to the point this patch is trying to address: I have seen hosts which hang for longer than 2 hours in a single HTTP GET/HEAD (for whatever reason). During this blocking network I/O the crawler does not check whether the timeout has been reached. Therefore I wanted to limit the duration of a single HTTP/FTP connection to the crawler's timeout. As you have mentioned, there is also keepalive involved: either we open a new HTTP connection for each HTTP HEAD/GET, or keepalive is supported. Without keepalive the behaviour is again different. In addition, the timeout is per host.

It would probably make more sense to have a timeout per category, as a mirror which carries archive and secondary will take much more time than a host which only has EPEL. So the timeout handling is difficult and complicated, and providing the same timeout behaviour for all parts of the code sounds like a lot of work. It would be great if it could be properly changed/fixed/resolved. This simple patch is more of a band-aid to help detect very slow network connections and not have to wait forever until the OOM killer finishes the crawl. I am also not 100% sure what effect the timeout value actually has (especially in combination with or without keepalive). Rather a long comment for such a 'simple' patch.
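To illustrate the difference between the two rsync timeouts described above, here is a hedged sketch (the constant values, flag choice, and helper name are hypothetical, not the crawler's actual code): rsync's own `--timeout` only fires on network inactivity, while wrapping the process in `subprocess.run(..., timeout=...)` would additionally bound the total wall-clock run time.

```python
import subprocess

IO_TIMEOUT = 4 * 60 * 60    # rsync --timeout: abort only after this much network inactivity
WALL_TIMEOUT = 2 * 60 * 60  # hypothetical overall cap, matching the 120-minute crawler timeout

def crawl_rsync(url, destdir):
    """Run rsync with both an inactivity timeout and a wall-clock timeout."""
    cmd = ["rsync", "--timeout=%d" % IO_TIMEOUT, "--recursive", url, destdir]
    try:
        # subprocess.run's timeout kills the child after WALL_TIMEOUT seconds
        # of wall-clock time, even if network I/O is still happening.
        subprocess.run(cmd, timeout=WALL_TIMEOUT, check=True)
        return True
    except subprocess.TimeoutExpired:
        return False
```

With only `--timeout`, a slow but steadily transferring mirror can run for many hours; the outer `timeout=` is what would turn the per-host limit into a hard cap.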
Thanks for the long and detailed reply. ;) Makes sense, so I am +1 to this band-aid for now, and perhaps down the road we can rework the timeouts so they are all configurable/make more sense. :)
RFE: Could we document this somewhere under the |
I will write something similar and store it in the |