Class: PHPCrawler

The main class of PHPCrawl.
Description:

-

Members:

Constructor
PHPCrawler() - Initiates a new crawler.

Public Methods
Basic settings
getProcessReport - Returns summarizing report information about the crawling process after it has finished.
go - Starts the crawling process in single-process mode.
goMultiProcessed - Starts the crawler using multiple processes.
setFollowMode - Sets the basic follow-mode of the crawler.
setHTTPProtocolVersion - Sets the HTTP protocol version the crawler should use for requests.
setPort - Sets the port to connect to for crawling the starting-URL set in setUrl().
setURL - Sets the URL of the first page the crawler should crawl (root-page).
setUrlCacheType - Defines what type of cache will be used internally for caching URLs.
setWorkingDirectory - Sets the working directory the crawler should use for storing temporary data.
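The basic settings above combine into a minimal crawl. A sketch (the start URL is a placeholder, and the include-path depends on where the library is installed):

```php
<?php
// Minimal single-process crawl (sketch; URL and include-path are placeholders).
require("libs/PHPCrawler.class.php");

$crawler = new PHPCrawler();

$crawler->setURL("www.example.com"); // root-page to start from
$crawler->setFollowMode(2);          // stay within the same host
$crawler->go();                      // start in single-process mode

// Summarize the finished process.
$report = $crawler->getProcessReport();
echo "Files received: " . $report->files_received . "\n";
echo "Bytes received: " . $report->bytes_received . "\n";
?>
```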
Filter-settings
addContentTypeReceiveRule - Adds a rule to the list of rules that decide which pages or files should be received, based on their content-type.
addURLFilterRule - Adds a rule to the list of rules that decide which URLs found on a page should be ignored by the crawler.
addURLFollowRule - Adds a rule to the list of rules that decide which URLs found on a page should be followed explicitly.
obeyNoFollowTags - Decides whether the crawler should obey "nofollow" tags.
obeyRobotsTxt - Decides whether the crawler should parse and obey robots.txt files.
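The filter rules take Perl-compatible regular expressions. A sketch that restricts the crawl to HTML documents and skips image links (the patterns are examples, not defaults):

```php
<?php
// Filter sketch: only receive HTML, skip image URLs (regexes are examples).
$crawler = new PHPCrawler();
$crawler->setURL("www.example.com");

// Only download documents whose content-type is text/html.
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore links pointing to common image files (case-insensitive).
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

$crawler->obeyRobotsTxt(true);
$crawler->go();
?>
```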
Overridable methods / User data-processing
handleDocumentInfo - Override this method to get access to all information about a page or file the crawler found and received.
handleHeaderInfo - Overridable method that will be called after the header of a document was received and BEFORE the content is received.
initChildProcess - Overridable method that will be called by every child process used, just before it starts the crawling procedure.
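The usual pattern for user data-processing is to subclass the crawler and override handleDocumentInfo(). A sketch (the subclass name is ours):

```php
<?php
// Sketch: process received documents by overriding handleDocumentInfo().
class MyCrawler extends PHPCrawler
{
    // Called once for every page or file the crawler received.
    function handleDocumentInfo($DocInfo)
    {
        echo $DocInfo->url . " (" . $DocInfo->http_status_code . ")\n";
        // Returning a negative value here would abort the crawling process.
    }
}

$crawler = new MyCrawler();
$crawler->setURL("www.example.com");
$crawler->go();
?>
```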
Limit-settings
setContentSizeLimit - Sets the content-size limit for content the crawler should receive from documents.
setPageLimit - Sets a limit on the number of pages/files the crawler should follow.
setRequestDelay - Sets a delay for every HTTP request the crawler executes.
setTrafficLimit - Sets a limit on the number of bytes the crawler should receive altogether during the crawling process.
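The limit settings can be combined to keep a crawl polite and bounded. A sketch with placeholder numbers:

```php
<?php
// Limit sketch: cap page count, document size, total traffic and request rate.
// All numbers below are example values, not defaults.
$crawler = new PHPCrawler();
$crawler->setURL("www.example.com");

$crawler->setPageLimit(100);                 // follow at most 100 pages/files
$crawler->setContentSizeLimit(500000);       // receive at most ~500 KB per document
$crawler->setTrafficLimit(10 * 1024 * 1024); // stop after ~10 MB altogether
$crawler->setRequestDelay(0.5);              // wait 0.5 seconds between requests

$crawler->go();
?>
```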
Linkfinding settings
addLinkSearchContentType - Adds a rule to the list of rules that decide in what kinds of documents the crawler should search for links, based on their content-type.
enableAggressiveLinkSearch - Enables or disables aggressive link-searching.
setLinkExtractionTags - Sets the list of HTML tags the crawler should search for links in.
Process resumption
enableResumption - Prepares the crawler for process resumption.
getCrawlerId - Returns the unique ID of the instance of the crawler.
resume - Resumes the crawling process with the given crawler-ID.
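The three resumption methods work together: enable resumption, remember the crawler-ID, and pass that ID back to resume() after an abort. A sketch (the ID file name is ours):

```php
<?php
// Resumption sketch: persist the crawler-ID so an aborted crawl can continue.
$crawler = new PHPCrawler();
$crawler->setURL("www.example.com");

$crawler->enableResumption(); // must be called before go()

// First run: remember the ID. Later run: resume with the stored ID.
if (!file_exists("crawlerid.tmp"))
{
    file_put_contents("crawlerid.tmp", $crawler->getCrawlerId());
}
else
{
    $crawler->resume(file_get_contents("crawlerid.tmp"));
}

$crawler->go();
?>
```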
Other settings
addBasicAuthentication - Adds a basic authentication (username and password) to the list of basic authentications that will be sent with requests.
addLinkPriority - Adds a regular expression together with a priority-level to the list of rules that decide which links should be preferred.
addPostData - Adds post-data together with a URL-rule to the list of post-data to send with requests.
addStreamToFileContentType - Adds a rule to the list of rules that decide which types of content should be streamed directly to a temporary file.
enableCookieHandling - Enables or disables cookie-handling.
requestGzipContent - Enables support/requests for gzip-encoded content.
setConnectionTimeout - Sets the timeout in seconds for connection attempts to hosting webservers.
setFollowRedirects - Defines whether the crawler should follow redirects sent with headers by a webserver or not.
setFollowRedirectsTillContent - Defines whether the crawler should follow HTTP redirects until first content is found, regardless of defined filter-rules and follow-modes.
setProxy - Assigns a proxy-server the crawler should use for all HTTP requests.
setStreamTimeout - Sets the timeout in seconds for waiting for data on an established server connection.
setUserAgentString - Sets the "User-Agent" identification string that will be sent with HTTP requests.
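A sketch combining several of these settings (the credentials, URL-regex and user-agent string are placeholders):

```php
<?php
// Sketch of the "other settings" (all concrete values are placeholders).
$crawler = new PHPCrawler();
$crawler->setURL("www.example.com");

// Send basic-auth credentials for URLs matching the regex.
$crawler->addBasicAuthentication("#http://www\.example\.com/protected/#", "user", "pass");

$crawler->setUserAgentString("MyCrawler/1.0");
$crawler->setConnectionTimeout(10); // seconds to wait for a connection
$crawler->setStreamTimeout(20);     // seconds to wait for data on the stream
$crawler->enableCookieHandling(true);
$crawler->requestGzipContent(true);

$crawler->go();
?>
```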
Deprecated
addFollowMatch - Alias for addURLFollowRule(). (deprecated!)
addLinkExtractionTags - Sets the list of HTML tags from which links should be extracted. (deprecated!)
addNonFollowMatch - Alias for addURLFilterRule(). (deprecated!)
addReceiveContentType - Alias for addContentTypeReceiveRule(). (deprecated!)
addReceiveToMemoryMatch - Has no function anymore! (deprecated!)
addReceiveToTmpFileMatch - Alias for addStreamToFileContentType(). (deprecated!)
disableExtendedLinkInfo - Has no function anymore. (deprecated!)
getReport - Returns an array with summarizing report information after the crawling process has finished. (deprecated!)
setAggressiveLinkExtraction - Alias for enableAggressiveLinkSearch(). (deprecated!)
setCookieHandling - Alias for enableCookieHandling(). (deprecated!)
setTmpFile - Has no function anymore. (deprecated!)

Public Properties
class_version