PyTask

PyTask is our revolutionary new technology for scalable crawling, capable of processing an arbitrarily large number of URLs per second. Its infrastructure is cloud based, supporting practically instant automatic deployment on any number of nodes running Debian Linux. Nodes automatically distribute work among themselves, making optimal use of the available resources. PyTask runs modular tasks, which allows for rapid development through modular testing and for increasingly complex tasks. Tasks can be deployed automatically on any number of available nodes and, based on constantly gathered performance metrics, scaled to any desired speed, letting us decide how long a task should take instead of being constrained by a predetermined run time.

Tasks have a tree structure in which each step can be chained to subsequent steps. Processing starts from an initial seed, with each step's results serving as inputs for the subsequent steps. Each step can also produce partial results, which are transmitted to the next step and used in its processing as needed. The final result of the processing is then stored in a typical NoSQL database instance or cluster.
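
PyTask itself is not shown here, but a minimal single-process sketch of the chained-step idea might look like the following. The class and method names are hypothetical stand-ins, not PyTask's actual API, and the distributed scheduling across nodes is deliberately left out.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class StepOutput:
    seeds: List[str]                               # inputs for the next step
    partial: Dict = field(default_factory=dict)    # data carried toward the final result


@dataclass
class Task:
    steps: List[Callable[[str, Dict], StepOutput]]  # one function per step

    def run(self, root_seed: str) -> List[Dict]:
        """Feed the root seed through every step and collect the final results."""
        frontier: List[Tuple[str, Dict]] = [(root_seed, {})]
        results: List[Dict] = []
        for index, step in enumerate(self.steps):
            is_last = index == len(self.steps) - 1
            next_frontier = []
            for seed, partial in frontier:
                output = step(seed, partial)
                merged = {**partial, **output.partial}
                if is_last:
                    results.append(merged)          # final step: keep the merged result
                else:
                    next_frontier.extend((s, merged) for s in output.seeds)
            frontier = next_frontier
        return results

In the real system each element of the frontier can be handed to any available node, which is what allows a task to be scaled to the desired speed.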

Qtie

PyTask uses a distributed queue called Qtie as a supporting technology. There is significant communication between task steps during processing, and handling it depends on a fast queue. However, since tasks involve millions of seeds, even storing them temporarily is a problem. We implemented Qtie to work around a limitation we perceived in Redis, which at this time does not support mixed-storage (both memory and disk) data structures. Because Qtie does a single thing and does it well, it is also extremely fast.
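
Qtie's interface is not described here, but the mixed-storage idea it addresses can be illustrated with a rough sketch: keep the head of the queue in memory and spill overflow to disk. The class below is purely illustrative and is not Qtie's implementation.

import collections
import os
import pickle
import tempfile


class SpillQueue:
    """Illustrative FIFO queue: bounded in-memory head, overflow spilled to disk."""

    def __init__(self, max_in_memory: int = 100_000):
        self.max_in_memory = max_in_memory
        self.memory = collections.deque()
        self.spill_dir = tempfile.mkdtemp(prefix="spillq-")
        self.spill_files = collections.deque()     # oldest spilled item first
        self._counter = 0

    def put(self, item) -> None:
        # Once anything has been spilled, keep appending to disk so FIFO order holds.
        if len(self.memory) < self.max_in_memory and not self.spill_files:
            self.memory.append(item)
            return
        path = os.path.join(self.spill_dir, "%012d.pkl" % self._counter)
        self._counter += 1
        with open(path, "wb") as f:
            pickle.dump(item, f)
        self.spill_files.append(path)

    def get(self):
        if self.memory:
            return self.memory.popleft()
        if self.spill_files:
            path = self.spill_files.popleft()
            with open(path, "rb") as f:
                item = pickle.load(f)
            os.remove(path)
            return item
        raise IndexError("queue is empty")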

Use case

Assume we want to process all the products present on the website of an online store. The root seed would be the address of the site itself. Processing this initial seed yields the inputs for the next step: the URLs of the product categories on the site. Let's assume we need a description of each category in the final result; this description is placed in the partial result and passed to the next step. The second step processes the category URLs and extracts the URLs of all the pages listing products. Since the category description is not used here, it is passed on to the next step along with the page URLs. The third step extracts all the product URLs on a particular page. Again, since we are not at the final step, the category description is passed forward. Assume, however, that some information available here cannot be extracted from the product page itself; this too is added to the partial result we already have and passed forward. Our fourth and final step handles product details. It processes a product's details page and extracts, for example, the name and the price. Since we already have a partial result, we add what we have just extracted to it, forming the final result, and save it. A sketch of these four steps appears below.
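
The following Python sketch walks through those four steps. The URLs, field names, and extract_* helpers are invented for illustration (the helpers are stubs standing in for real page fetching and parsing), and the steps are chained by hand rather than through PyTask.

def extract_category_links(url):
    # Placeholder for real HTML parsing of the store's front page.
    return [(url + "/category/books", "Printed and electronic books")]

def extract_listing_pages(url):
    # Placeholder: a category usually paginates into several listing pages.
    return [url + "?page=1", url + "?page=2"]

def extract_product_links(url):
    # Placeholder: product URLs plus data only visible on the listing page.
    return [(url + "&product=42", True)]

def extract_product_details(url):
    # Placeholder for parsing the product page itself.
    return {"name": "Example product", "price": 9.99}


def step_site(url, partial):
    """Step 1: root seed -> category URLs; keep each category's description
    as a partial result for the final record."""
    return [(cat_url, {"category_description": desc})
            for cat_url, desc in extract_category_links(url)]

def step_category(url, partial):
    """Step 2: category -> listing pages; the description is not used here,
    so the partial result is passed along unchanged."""
    return [(page_url, partial) for page_url in extract_listing_pages(url)]

def step_listing(url, partial):
    """Step 3: listing page -> product URLs; add listing-only data
    (e.g. stock status) to the partial result."""
    return [(prod_url, {**partial, "in_stock": in_stock})
            for prod_url, in_stock in extract_product_links(url)]

def step_product(url, partial):
    """Step 4 (final): product page -> complete record, merging everything
    gathered so far; in PyTask this record would go to the NoSQL store."""
    return {**partial, **extract_product_details(url), "url": url}


# Chain the steps by hand for illustration.
frontier = [("https://shop.example.com", {})]
for step in (step_site, step_category, step_listing):
    frontier = [out for url, partial in frontier for out in step(url, partial)]
records = [step_product(url, partial) for url, partial in frontier]
print(records)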