Popularity: 1.5 (growing)
Activity: 7.4

Description

Avoid wasting bandwidth and processing time on web pages that are probably not worth the effort. This library provides an additional brain for web crawling, scraping and management of Internet archives. Specific functionality for crawlers: stay away from pages with little text content, or explicitly target synoptic pages (overview or index pages) to gather links.
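
The kind of triage described above can be illustrated with a minimal sketch. It uses only the Python standard library and is not coURLan's own API: the function name `classify_url` and both heuristic lists are hypothetical, chosen for illustration. The assumption is that file extensions and path segments give a rough hint about whether a page carries text, serves as a link hub, or should be skipped.

```python
from urllib.parse import urlsplit

# Hypothetical heuristics for illustration; a real crawler would tune these.
SKIPPED_EXTENSIONS = (
    ".jpg", ".jpeg", ".png", ".gif", ".pdf",
    ".zip", ".mp3", ".mp4", ".css", ".js",
)
NAVIGATION_SEGMENTS = {
    "tag", "tags", "category", "categories",
    "archive", "archives", "page",
}


def classify_url(url):
    """Return 'skip', 'navigation' or 'content' for a URL string."""
    parts = urlsplit(url)
    # Only absolute HTTP(S) URLs are candidates for crawling.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return "skip"
    path = parts.path.lower()
    # Media and asset files are unlikely to contain usable text.
    if path.endswith(SKIPPED_EXTENSIONS):
        return "skip"
    # Overview pages are poor in text but useful for gathering links.
    segments = [segment for segment in path.split("/") if segment]
    if any(segment in NAVIGATION_SEGMENTS for segment in segments):
        return "navigation"
    return "content"
```

A crawler could then fetch "content" URLs for text extraction, visit "navigation" URLs only to harvest links, and drop the rest before spending any bandwidth on them.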

This navigation aid targets text-based documents (currently, web pages expected to be in HTML format) and tries to guess the language of pages to allow for language-focused collection. Additional functions include straightforward domain name extraction and URL sampling.
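
To make domain name extraction and URL sampling concrete, here is a rough standard-library sketch. It does not reproduce coURLan's functions: `host_of` and `sample_per_host` are hypothetical names, and the sketch groups URLs by hostname rather than by registered domain, which would require a public suffix list.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname without a leading 'www.', or None."""
    hostname = urlsplit(url).hostname
    if not hostname:
        return None
    return hostname[4:] if hostname.startswith("www.") else hostname


def sample_per_host(urls, max_per_host, seed=0):
    """Keep at most max_per_host URLs for each host, chosen at random."""
    by_host = defaultdict(list)
    for url in urls:
        host = host_of(url)
        if host is not None:
            by_host[host].append(url)
    # A seeded generator keeps the sample reproducible between runs.
    rng = random.Random(seed)
    sampled = []
    for host in sorted(by_host):
        group = by_host[host]
        if len(group) > max_per_host:
            group = rng.sample(group, max_per_host)
        sampled.extend(group)
    return sampled
```

Capping the number of URLs per host keeps a few very large sites from dominating a collection, which is the usual motivation for sampling before a crawl.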

Programming language: Python
License: Apache License 2.0
Latest version: v0.6.0

