⚡️ Much update!
- 🐧 Confirmed and added support for OS X and Linux, thanks to michellemorales and j-setiawan.
- 📚 Updated documentation to the current state of things. Still work to be done there.
- ✂ Removed 'bad file' functionality as it wasn't working as intended and wasn't important anyway. That's what error logs are for.
- ➕ Now parses `<base>` tags to grab links that wouldn't have been recognized before. Thanks lxml!
- ➕ Added an optional (on by default) check for file size. Won't download any files larger than 500 MB, assuming the site returns a valid Content-Length header.
- ➕ Added Firefox (on Ubuntu) as an option for browser spoofing.
- ➕ Added domain restrictions. Crawling can now be limited to a certain domain, such as https://www.wsj.com/article. This can be set when entering configuration settings or in the config files.
- 🛠 Also more bug fixes and more recognized MIME types, because those are cool.
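The file-size check above boils down to a guard on the size a site reports before the download starts. Here is a minimal sketch of that idea; `MAX_FILE_SIZE` and `small_enough` are illustrative names, not spidy's actual identifiers, and in practice the value would come from a response's Content-Length header:

```python
# Illustrative sketch of the 500 MB file-size guard (not spidy's real code).
MAX_FILE_SIZE = 500 * 1024 * 1024  # 500 MB cap

def small_enough(content_length, limit=MAX_FILE_SIZE):
    """True if the reported Content-Length (a string, or None when the
    site didn't send one) is at or under the cap."""
    if content_length is None:
        return True  # no size reported, so there is nothing to check against
    return int(content_length) <= limit
```

When the header is missing, this sketch errs on the side of downloading; a crawler could just as reasonably skip such files instead.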
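The domain restriction can be pictured as a simple prefix filter on candidate links: anything not under the configured prefix never enters the crawl queue. A hedged sketch, where `RESTRICT_TO` and `in_scope` are illustrative names rather than spidy's actual configuration keys:

```python
# Illustrative sketch of domain-restricted crawling (not spidy's real code).
RESTRICT_TO = 'https://www.wsj.com/article'

def in_scope(url):
    """True if the crawler is allowed to follow this link."""
    return url.startswith(RESTRICT_TO)

links = ['https://www.wsj.com/article/some-story',
         'https://example.com/elsewhere']
allowed = [u for u in links if in_scope(u)]
```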
🚀 The first official release of spidy!
A GUI is in the works, as well as many more awesome features.
spidy.zip contains only the files necessary to run the crawler, plus config/, while the source code downloads contain all the things.