Stats: 280 members, 5,908 topics. Date: January 23, 2017, 2:31 am
How long a crawl takes depends on how targeted you make your crawler. You need to define both where you want it to go and – more importantly – where you don’t want it to go. For example, if you are only interested in men’s clothes, there is no point letting your crawler visit pages with women’s clothes. It’s a waste of your time and an unnecessary load on the website. It will also bring back a lot of data you don’t actually want, which means more post-processing work for you.
Here are some of the controls you should look for:
Crawl depth – How many clicks from the start page you want the crawler to travel. For most websites, a crawl depth of 5 should be more than enough.
Crawl exclusions – The parts of the site you do NOT want the crawler to visit.
Simultaneous pages – The number of pages the crawler will attempt to visit at the same time.
Pause between pages – The length of time (in seconds) the crawler will pause before moving on to the next page.
Crawl URL templates – These tell the crawler which pages you want data from, so it’s important to make them as specific as possible.
Save log – A crawl can take a long time and you don’t want to lose your work if something goes wrong along the way. A save log lets you see which URLs were visited and which were converted into data, which helps you troubleshoot the crawler if the extraction goes wrong. In addition, the URLs that were converted to data can be fed directly to an Extractor next time, so you don’t have to re-crawl the site.
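To make the controls above a bit more concrete, here is a rough Python sketch of how they might fit together. It is not how any particular crawling tool works internally. The `CrawlSettings` and `crawl` names, the regex-based exclusions and templates, and the in-memory `SITE` dictionary are all assumptions made up for illustration. The `get_links` argument stands in for real page fetching, and simultaneous pages are left out (pages are visited one at a time) to keep it short.

```python
import re
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class CrawlSettings:
    """Hypothetical settings object; the field names are illustrative only."""
    max_depth: int = 5          # crawl depth: clicks from the start page
    exclusions: tuple = ()      # regexes for URLs the crawler must not visit
    url_templates: tuple = ()   # regexes for pages you want data from
    pause_seconds: float = 1.0  # pause between pages


def crawl(start_url, get_links, settings, sleep=time.sleep):
    """Breadth-first crawl returning a save log of (url, depth, converted).

    get_links(url) is supplied by the caller and returns the URLs linked
    from a page; it replaces real fetching and HTML parsing in this sketch.
    """
    excluded = [re.compile(p) for p in settings.exclusions]
    wanted = [re.compile(p) for p in settings.url_templates]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    log = []
    while queue:
        url, depth = queue.popleft()
        # A page is "converted into data" only if it matches a URL template.
        converted = any(p.search(url) for p in wanted)
        log.append((url, depth, converted))
        # Follow links only while we are still inside the crawl depth.
        if depth < settings.max_depth:
            for link in get_links(url):
                if link in seen or any(p.search(link) for p in excluded):
                    continue
                seen.add(link)
                queue.append((link, depth + 1))
        if queue:
            sleep(settings.pause_seconds)
    return log


# Made-up site map: each key is a page, each value the links found on it.
SITE = {
    "/": ["/men", "/women"],
    "/men": ["/men/shirts/1", "/men/shoes/2"],
    "/women": ["/women/dresses/3"],
    "/men/shirts/1": ["/"],
}

settings = CrawlSettings(
    max_depth=2,
    exclusions=(r"^/women",),
    url_templates=(r"^/men/\w+/\d+$",),
    pause_seconds=0,
)
log = crawl("/", lambda url: SITE.get(url, []), settings)
data_urls = [url for url, _, converted in log if converted]
```

Here `log` plays the role of the save log, and `data_urls` is the list of pages you could hand straight to an extraction step next time instead of re-crawling. Against a real site you would use a non-zero pause.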
TnLounge - Copyright © 2016 TnLounge. All rights reserved.