
Crawling Issue in Web Application Scanning (WAS)

Question asked by Riz dplex on Dec 10, 2013
Latest reply on Dec 10, 2013 by jkent


For our web application we have set the URL to "https://abcd.example.com/" and enabled the option "Limit to content at or below URL subdirectory".

On our site we use a robots.txt file to prevent web crawlers from crawling or indexing our pages. In the robots.txt file we have set "Disallow: /" as well as directory-specific rules such as "Disallow: /OurSiteFolderName/".

For example:

User-agent: *
Disallow: /

# Disallow directories:
Disallow: /Product/
Disallow: /Orders/

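To sanity-check what these rules mean to a standards-compliant crawler, the file can be fed to Python's built-in urllib.robotparser. This is a minimal sketch, with abcd.example.com standing in as the placeholder host from above:

# Sketch: how a compliant crawler interprets the robots.txt above.
# Standard library only; abcd.example.com is a placeholder host.
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Feed the rules directly instead of fetching them over the network.
rp.parse("""\
User-agent: *
Disallow: /

# Disallow directories:
Disallow: /Product/
Disallow: /Orders/
""".splitlines())

for path in ("/", "/Product/", "/Orders/", "/index.html"):
    verdict = "allowed" if rp.can_fetch("*", "https://abcd.example.com" + path) else "disallowed"
    print(path, verdict)

# Because "Disallow: /" is present, every path is disallowed for all user
# agents, so the two directory-specific rules are effectively redundant.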

We have added the lines above to our robots.txt file. WAS also has a setting related to robots.txt: for our QualysGuard web application we have checked the "Crawl all links and directories found in the robots.txt file, if present" checkbox in the "Crawling Hints" section.

After running a WAS discovery scan, the "Links Crawled" section of the report lists "/Product/" and "/Orders/", the directories we disallowed in robots.txt, but the report shows no pages from those folders as actually crawled.
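As a related sanity check, the file the scanner receives can be confirmed by fetching it directly. A minimal sketch, again using the placeholder host from above:

# Sketch: fetch robots.txt directly to see exactly what a scanner receives.
# abcd.example.com is the placeholder host used earlier in this post.
from urllib.request import urlopen

with urlopen("https://abcd.example.com/robots.txt") as resp:
    print(resp.status)           # expect 200 if the file is being served
    print(resp.read().decode())  # the rules exactly as returned by the server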

1) Why have the pages from those folders not been crawled by the scanner?

2) Is the robots.txt file preventing the scanner from crawling the pages in those folders, even though we have checked the "Crawl all links and directories found in the robots.txt file, if present" checkbox?

3) The scan report shows that only links from the default page were crawled; no links from other pages or directories were scanned. What may be the reason for this?

4) What settings will help the scanner crawl and scan the maximum number of links on our site?
