The Best Screaming Frog Settings for Almost Any Kind of Site Audit


In my opinion, Screaming Frog is the most used SEO tool in the industry.

On the official site, there is plenty of documentation covering all the different settings, and there are a lot of them. There are also plenty of guides on the different ways the tool can be used.

However, I have yet to come across a resource that outlines the ideal settings for performing entire site audits. 

I've been training people on how to use Screaming Frog since 2018, and I've seen that new users struggle to work out what the ideal settings are for performing audits.

Issues are frequently missed as a result of the default settings.

The default settings provided on installation are not enough for a comprehensive audit and need to be modified to get the best results.

I'm quite confident I've nailed the ideal settings for most site audits after using Screaming Frog since 2018.

This isn't a tutorial covering every single setting in the tool; it only covers the ones I change from the defaults.

So, for any options that aren't listed, I recommend sticking with the defaults.

Here are some of the best bits from my experience:

1. Storage Mode.

If you have an SSD, use Database mode because:
  • It continually saves to the database, so if the Frog or your machine crashes, the crawl is autosaved.
  • You can crawl much bigger sites than in RAM mode.

2. Memory Allocation.

Allocate as much RAM as you can, but always leave at least 2GB of your total RAM free.

I have 16GB of RAM, so I allocate 12GB.

3. Spider Settings Tab - Crawl.

By default, there are a few boxes unticked here that I tick:

A. Tick "Pagination(Rel/Prev)"
There could be URLs only linked from deep paginated pages. Such as PDPs on ecommerce categories, or articles on a publishers site. We don't want to miss any pages, so tick this.
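For reference, paginated pages typically declare their neighbours in the <head> like this (example.com is a placeholder):

  <link rel="prev" href="https://example.com/category?page=1">
  <link rel="next" href="https://example.com/category?page=3">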

B. Tick "Href lang".
The alternate version of URLs may not be linked in the HTML body of the page, only in the href lang tags. We want to discover all URLs and be able to audit multilingual/local setups, so tick this.
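As an example, hreflang annotations in the <head> look like this (example.com is a placeholder):

  <link rel="alternate" hreflang="en-gb" href="https://example.com/en-gb/" />
  <link rel="alternate" hreflang="de-de" href="https://example.com/de-de/" />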

C. Tick "AMP."
  • A site could also be using AMP, but you might not realise it.
  • The Frog checks for lots of AMP issues!
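If you're not sure whether a site uses AMP, the canonical page usually points to its AMP version with a tag like this (hypothetical URL):

  <link rel="amphtml" href="https://example.com/page/amp/">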

4. Spider Settings Tab - Crawl Behaviour.


By default, there are a few boxes unticked here that I tick:

A. Tick "Crawl All Subdomains"
Leaving this unticked won’t crawl any subdomains the Frog my encounter linked. I always have this ticked, because if I’m doing a complete audit of a site, I also want to know about any subdomains there are.

B. Tick "Follow Internal Nofollow".
  • I want to discover as many URLs as possible.
  • I want to know if a site is using "nofollow" on internal links, so I can investigate and understand why.

C. Tick "Follow External Nofollow".
  • I want the Frog to crawl all possible URLs.
  • Otherwise I might miss external URLs that are 404s, or miss pages that are participating in link spam or have been hacked.
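For context, a nofollowed link is just a normal anchor with a rel attribute, for example (hypothetical URL):

  <a href="https://example.com/some-page" rel="nofollow">anchor text</a>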

5. Spider Settings Tab - XML Sitemaps.

By default, all 3 options in this section are unticked; I tick them all:

A. Tick "Crawl Linked XML Sitemaps."
  • Check that all important pages are included in the sitemaps.
  • Check that only valid pages are included: no 404s, redirects, noindexed or canonicalised URLs.
  • Discover any URLs that are linked in XML sitemaps but aren't linked on the site (orphan pages).

B. Tick "Auto Discover XML Sitemaps via robots.txt"
  • As many sites include a link to their XML sitemaps in robots.txt, it's a no-brainer to tick this, so you don't have to manually add the sitemap URL.
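The sitemap reference in robots.txt is a single line, for example (hypothetical URL):

  Sitemap: https://example.com/sitemap.xml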

C. Tick "Crawl These Sitemaps."
  • Add any sitemaps you know about that aren't listed in the robots.txt.


6. Extraction Settings Tab - Page Details

By default, all these elements are ticked and that’s how I recommend you keep them for most audits.


7. Extraction Settings Tab - URL Details

I tick one option here on top of the default settings:

Tick "HTTP Headers."
  • Lots of interesting things can be found in the headers.
  • e.g. if a site uses dynamic serving to show different content to desktop and mobile users, it should use the Vary HTTP header.
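For example, a dynamically served page should respond with a header like this:

  Vary: User-Agent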

8. Extraction Settings Tab - Structured Data

All the elements in this section are unticked by default; I tick them all:
  • JSON-LD
  • Microdata
  • RDFa

Ticking all three means I can fully audit the site's schema, no matter how it's implemented.

Tick "Schema org Validation".

A great feature that checks whether all schema validates against the official recommended implementation.

Tick "Google Rich Results Feature Validation."

Validates the mark-up against Google’s own documentation.

Select both options here, as Google has some specific requirements that aren't included in the Schema.org guidelines.
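For reference, JSON-LD mark-up sits inside a script tag; here's a minimal, made-up example:

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Company",
    "url": "https://example.com/"
  }
  </script>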


9. Extraction Settings Tab - HTML

Both the options in this section are unticked by default; I always tick them:

Tick "Store HTML."
  • The Frog will save the source HTML for every page.
  • This is extremely useful for double-checking any elements the Frog reports on.

Tick "Store Rendered HTML."
  • This is useful when auditing JavaScript sites, to see the difference between the HTML sent from the server and what is actually rendered client-side in the browser.
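As a simplified, made-up illustration of that difference, the server might only send an empty container:

  <div id="app"></div>

while the rendered HTML contains the actual content:

  <div id="app"><h1>Product name</h1><p>Product description...</p></div>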

10. Limits Settings Tab.

Change "Max Redirects to Follow".

This is the only option I change in this tab for most crawls. It's set to 5 by default; I set it to the max, 20. Setting the maximum helps me find the final destination in most redirect chains.
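As a made-up example of the kind of chain this helps untangle:

  /old-page → /old-page/ → /products/old-page/ → /new-page/

With a low limit, the Frog could stop following a long chain before it reaches the final URL; with 20, you'll almost always see the true end point.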


11. Advanced Settings Tab.

In this settings tab, I tick and untick a few boxes from the default settings:

  • Untick "Ignore Non-indexable URLs for On-Page Filters".
  • Even if a URL is already non-indexable, I still want to see if there are issues with the page. There are often times where pages have been set to noindex or canonicalised, but this has been done in error.
  • Tick "Always Follow Redirects" & "Always Follow Canonicals"
  • I tick both of these, as I want to ensure the Frog discovers all URLs on the site. There could be URLs that aren’t linked in the HTML of the code but are only linked via a redirect, or a canonical tag.
  • Tick "Extract images from img srcset Attribute."
  • Google can crawl images implemented in the srcset attribute, so I tick this to ensure the Frog is extracting the same images Google would be. I can then check how they are optimised. (image file names, alt tags, size)
  • The following options are unticked by default and I also keep them that way.
These settings are quite important, so I’ll explain the reasoning behind keeping them unticked:
  • Respect Noindex
  • Respect Canonicals
  • Respect next/prev
As I want to get a full picture of all the URLs on the site, whether they are indexable or not, I don’t want to tick the options above.

If I did tick them, it means any URLs set to noindex, or canonicalised to a different URL, would not be reported in the Frog.
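Here's what a srcset implementation typically looks like (file names are placeholders):

  <img src="cat-800.jpg"
       srcset="cat-480.jpg 480w, cat-800.jpg 800w, cat-1600.jpg 1600w"
       sizes="(max-width: 600px) 480px, 800px"
       alt="A cat sitting on a sofa">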


12. Content > Duplicates Settings:

Untick "Only Check Indexable Pages for Duplicates."

Even if pages are currently set to noindex, I still want to know if they are duplicating content, in case they should be set to index.

13. Robots.txt Settings:


Default setting here is: "Respect robots.txt"

I have this set to: "Ignore robots.txt but report status."

I can then audit the pages that are blocked, make sure they aren't blocked in error, and report whether the URLs need to be removed from robots.txt.
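As a made-up example, a rule like this might block more than intended:

  User-agent: *
  Disallow: /search

With "Ignore robots.txt but report status" set, the Frog still crawls those URLs and flags them as blocked, so you can check whether the rule is correct.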

14. CDNs Settings:


This is very useful if a site uses a CDN for hosting images that is not part of the domain you are crawling.

e.g. cdn.not-your-domain.com/photos-of-cats…

You can add that domain here so the Frog treats images hosted on the CDN domain as internal, and then you can audit them.

15. User-Agent Settings:

Set this to Googlebot (Smartphone) so we see the same code Google does.

Sites may change things depending on the user agent.

It could be done for valid reasons, it could be done for black hat reasons, or because the site has been hacked.


16. API Access Settings:

Using API access, we can enrich the crawl data with traffic or backlink data. I connect these for two main reasons:

a. Using GSC/GA data is another way to find orphan URLs.
b. Traffic data can help you prioritise the issues in your report.


17. Saving Your Configuration!

Once you have everything set up, it's time to save your configuration. Otherwise, you will lose it on restart!
Go to File > Configuration > Save Current Configuration As Default.


So, there you have it. 

If you're a new user, I hope this information is helpful.

If you've been a long-time user, I hope you found my observations useful.

Please leave a comment whether you agree or disagree with my choices, or if you have any questions.
