Karl Bernard

RegEx for avoiding documents

Discussion created by Karl Bernard on Oct 23, 2012
Latest reply on Oct 24, 2012 by Karl Bernard

In case you need to exclude URLs in a WAS scan based on file extensions...

 

We're currently scanning a large CMS-based site, and I noticed that a large number (552) of documents (.pdf, .doc, etc.) came up in the scan. I don't see any real value in scanning them, so I analyzed which extensions were present by running grep over the links listed in QID: 150009/Information Gathered/Links Crawled (grep -Eo '\.([a-zA-Z]{3,4})$' cms_host_files_crawled.txt | sort | uniq -c). Note that the pattern needs to be quoted so the shell doesn't interpret the parentheses and the $. This gave me a list and count of all extension-looking endings.
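
For anyone who wants to try the extension count before exporting a real link list, here is a minimal sketch of the same pipeline. The sample URLs and the temporary file are made up for illustration; in practice the input would be the Links Crawled list saved from the scan report.

```shell
# Hypothetical sample of crawled links; a real run would use the list
# exported from the Links Crawled section of the scan report instead.
links_file=$(mktemp)
cat > "$links_file" <<'EOF'
http://cms.example.com/files/report.pdf
http://cms.example.com/files/minutes.doc
http://cms.example.com/files/budget.xlsx
http://cms.example.com/files/summary.pdf
http://cms.example.com/about/
EOF

# -E: extended regex, -o: print only the matched part (the extension).
# sort | uniq -c counts each distinct extension; sort -rn puts the
# most frequent extensions first.
ext_counts=$(grep -Eo '\.[a-zA-Z]{3,4}$' "$links_file" | sort | uniq -c | sort -rn)
printf '%s\n' "$ext_counts"

rm -f "$links_file"
```

One caveat: a bare host URL such as one ending in .com would also be counted as an "extension", so the output is a starting point for review rather than a final list.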

 

Based on these findings, I created the following RegEx which I tested with grep -f:

\.(doc|docx|pdf|ppt|pptx|wav|wmv|xls|xlsx|zip)$
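
A rough sketch of how the pattern can be checked with grep -f before putting it in the scan configuration. This assumes -E is passed along with -f, since the pattern uses extended-regex alternation; the file names and sample URLs are hypothetical.

```shell
# The pattern is the one from this post; everything else is sample data.
pattern_file=$(mktemp)
links_file=$(mktemp)
printf '%s\n' '\.(doc|docx|pdf|ppt|pptx|wav|wmv|xls|xlsx|zip)$' > "$pattern_file"
cat > "$links_file" <<'EOF'
http://cms.example.com/files/report.pdf
http://cms.example.com/files/slides.pptx
http://cms.example.com/news/index.html
http://cms.example.com/pdf/overview
EOF

# URLs the pattern matches, i.e. the ones that would be excluded.
excluded=$(grep -E -f "$pattern_file" "$links_file")
# URLs the pattern does not match (-v inverts the match).
kept=$(grep -E -v -f "$pattern_file" "$links_file")

printf 'Excluded:\n%s\n\nKept:\n%s\n' "$excluded" "$kept"

rm -f "$pattern_file" "$links_file"
```

Because the pattern is anchored with $, a path that merely contains "pdf" (like /pdf/overview) is kept. As written, grep matches case-sensitively, so a link ending in .PDF would not be caught in this test; adding -i to grep shows what a case-insensitive match would catch.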

 

I ran another discovery scan with this RegEx in the blacklist section (Edit Application -> Crawl Exclusion Lists -> Regular Expressions) and no URLs with these extensions were listed.
