Anomaly Detection

Complexity: 2/5
Date published: 2020-02-12
Author: Vincent Terrasi
Prerequisites: Google API Token

Detect Over- and Under-Performance on Any OnCrawl Metric

Context

One purpose of an SEO audit is to find metrics or KPIs where the website does not perform as expected.

These metrics, such as page speed, number of impressions on a Google SERP, or the number of internal links that point to a URL, vary naturally from crawl to crawl.

Anomaly detection allows you to know whether a change is within the "normal" range for the website, or whether the change represents an unusual event that needs to be addressed.

Using machine learning to find anomalies revealed by crawls also means that you can take seasonal events into account, along with gradual changes to the website over time.

Examining anomalies can also reveal which metrics are key to a website's SEO and which are only incidental.

This project uses the Robust Random Cut Forest (RRCF) algorithm. Before beginning, you should make sure you have a basic understanding of how this algorithm works.

Objectives
  • Use multiple crawls with the same crawl profile to establish a baseline for what constitutes a normal value on a given website for selected metrics
  • Identify when a crawl analysis produces results that fall outside the normal range of values for these metrics
Method

The Robust Random Cut Forest (RRCF) algorithm is an ensemble method for detecting outliers in streaming data. RRCF offers a number of features that many competing anomaly detection algorithms lack: its trees can be updated incrementally as new observations arrive and old ones expire, and each point receives a principled anomaly score, its collusive displacement (CoDisp), which measures how much the point distorts the trees that contain it.
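As a concrete illustration, here is a minimal sketch of that scoring loop using the open-source `rrcf` Python package, with synthetic data standing in for a real OnCrawl metric. The metric values, forest size, and window length are illustrative assumptions; the anomaly score for each point is simply its CoDisp averaged over the forest.

```python
import numpy as np
import rrcf

# Synthetic stream standing in for one OnCrawl metric measured over many crawls
# (for example, average load time): a stable baseline plus an injected spike.
rng = np.random.default_rng(42)
metric = rng.normal(loc=500, scale=20, size=200)  # "normal" behaviour
metric[150:155] = 900                             # anomalous crawls

num_trees = 40   # number of trees in the ensemble
tree_size = 64   # maximum number of points kept per tree
forest = [rrcf.RCTree() for _ in range(num_trees)]

avg_codisp = {}  # anomaly score (CoDisp) averaged over the forest

for index, value in enumerate(metric):
    point = np.array([value])
    for tree in forest:
        # Keep each tree at a fixed size by forgetting the oldest points (sliding window).
        if len(tree.leaves) > tree_size:
            tree.forget_point(index - tree_size)
        tree.insert_point(point, index=index)
        # CoDisp: how strongly the tree's structure changes if this point is removed.
        avg_codisp[index] = avg_codisp.get(index, 0) + tree.codisp(index) / num_trees

# The points with the highest average CoDisp are the most anomalous.
top = sorted(avg_codisp, key=avg_codisp.get, reverse=True)[:5]
print("Most anomalous indices:", top)
```

In a real audit, each value in the stream would be one crawl's aggregated metric, so the size of the sliding window determines how far back "normal" reaches.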

To get all aggregated data from OnCrawl, you will first need to select a specific crawl profile and unarchive the crawls run with this profile over a long enough period of time to establish a basis for "normal". Then, you can use RRCF on the most recent crawls to find anomalies in your site's performance, as sketched below.
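The sketch below shows one way the data-collection step could look in Python: list the crawls run with a chosen profile through the OnCrawl REST API, aggregate one metric per crawl, and build the ordered series that feeds the RRCF loop above. The endpoint paths, request payload, response fields, and the profile name "weekly-full-crawl" are assumptions made for illustration; refer to the OnCrawl API documentation for the exact calls and field names available to your account.

```python
import requests

API_TOKEN = "YOUR_ONCRAWL_API_TOKEN"         # placeholder: your own API token
BASE_URL = "https://app.oncrawl.com/api/v2"  # assumed base URL; check the API docs
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

def crawls_for_profile(profile_name):
    """List crawls and keep only those run with the chosen crawl profile.
    The /crawls endpoint and the response fields used here are assumptions."""
    resp = requests.get(f"{BASE_URL}/crawls", headers=HEADERS)
    resp.raise_for_status()
    return [c for c in resp.json().get("crawls", [])
            if c.get("crawl_config", {}).get("name") == profile_name]

def avg_load_time(crawl_id):
    """Aggregate a single metric (average load time) over all pages of one crawl.
    The aggregation payload and response shape are assumptions for illustration."""
    payload = {"aggs": [{"value": "load_time:avg"}]}
    resp = requests.post(f"{BASE_URL}/data/crawl/{crawl_id}/pages/aggs",
                         headers=HEADERS, json=payload)
    resp.raise_for_status()
    return resp.json()["aggs"][0]["rows"][0]["columns"][0]

# One value per crawl, oldest first: this ordered series is the stream fed to the
# RRCF forest from the previous sketch, so the most recent crawls are scored last.
crawls = sorted(crawls_for_profile("weekly-full-crawl"), key=lambda c: c["created_at"])
series = [avg_load_time(c["id"]) for c in crawls]
```

Crawls scored well above the baseline CoDisp range then point to the metrics, and the dates, that deserve a closer look.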