For those who aren’t familiar, TF-IDF (term frequency–inverse document frequency) is a basic natural language processing metric used to gauge how relevant a term is to a document. The calculation is simple: count the frequency of a term within a given piece of content, then multiply it by the inverse of how frequently that term appears across a large corpus of documents. In some cases the two may be weighted with a multiplier or other metric to account for other variables within the corpus or the document itself. But largely this metric just indicates where a specific document falls on a “scale of normalcy” for a target term, within a large corpus of documents. I assume (no one really knows) that if Google uses this type of metric, it’s likely to develop a baseline for things like spam or over-optimization.
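To make that concrete, here is a minimal sketch of the calculation in Python. The function name, the toy corpus, and the log-based IDF formula are my own illustration, not anything Google or any particular tool is confirmed to use; real implementations normalize and smooth these numbers in different ways.

```python
import math

def tf_idf(term, document, corpus):
    """Score how relevant `term` is to `document`, relative to `corpus`.

    `document` is a list of tokens; `corpus` is a list of such documents.
    This is a bare-bones illustration, not a production formula.
    """
    # Term frequency: share of this document's tokens that are the term.
    tf = document.count(term) / len(document)
    # Document frequency: how many corpus documents contain the term at all.
    docs_containing = sum(1 for doc in corpus if term in doc)
    if docs_containing == 0:
        return 0.0
    # Inverse document frequency: rare terms in the corpus score higher.
    idf = math.log(len(corpus) / docs_containing)
    return tf * idf

corpus = [
    "the cat sat on the mat".split(),
    "the dog barked at the mailman".split(),
    "the market closed higher today".split(),
]

# "cat" appears in only one corpus document, so it scores above zero;
# "the" appears in every document, so its IDF (and score) is zero.
print(tf_idf("cat", corpus[0], corpus))
print(tf_idf("the", corpus[0], corpus))
```

Notice that a term appearing in every corpus document scores zero no matter how often it occurs on the page — that is the “scale of normalcy” idea in action.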
Several years ago I had a plan for a new SEO tool that relied heavily on correlation data, similar to what Moz and others provide. One of the metrics that I wanted to get correlation data on was TF-IDF.
I ended up not including TF-IDF as a metric within my tool set because I felt it was impossible at the time for me to calculate such a metric without access to a massive corpus of documents like those Google and other search engines utilize. In fact, I remember having a very brief conversation with Matthew Peters at Moz about how they calculate TF-IDF for their correlation studies, and he mentioned that they pulled from one of Moz’s crawlers, which at the time covered 25 million URLs. He also recommended that I consider using Wikipedia, which is easy to download and extract data from. Wikipedia would have been a good choice; however, the style and type of language used on Wikipedia often doesn’t represent normalcy across the rest of the internet.
I came away from that experience understanding that if you want to use TF-IDF in a meaningful way that relates to SEO, you need a corpus of documents as close to the size of Google’s index as possible. Otherwise your metric will be way off or carry deep inherent biases. Unfortunately, I ended up not developing the tool at that time, putting it on the back burner for other projects that were more pressing.
However, the issue of TF-IDF came back up later when OnPage.org launched (it has since rebranded to ryte.com). At the time I heard claims that TF-IDF was some sort of magic metric that could be used to optimize content perfectly. I was suspicious when I heard those claims, especially since they kept referring to “homonyms,” which have nothing to do with TF-IDF.
Then a few months later I went to MozCon and had some very nerdy conversations with some very smart folks who claimed that what I thought about TF-IDF was wrong and that I really needed to look at OnPage.org. So when I had time, I signed up for their free account and gave it a shot. More recently, another smart SEO, Nick Eubanks, released an almost identical tool that makes many of the same magic claims.
Do these tools help SEOs and content developers optimize content? Yes! But it has very little to do with TF-IDF. A major problem with these tools’ approach is the extremely biased dataset they rely on. Both tools look at only around 10 or 20 documents as their “corpus.” That alone means they aren’t getting an accurate metric as it applies to SEO, where a search engine will likely pull from billions if not trillions of pages in its index. Another problem is that those 10 or 20 documents are taken from the top results: within the first 20 results, other ranking factors such as links likely carry far more significance than any basic NLP metric.
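One way to see why a 10-to-20-document “corpus” is shaky: with so few documents, a single extra page containing a term moves that term’s IDF by a huge relative amount, while the same single-page change in a 25-million-page crawl is noise. The sketch below uses the plain log-ratio IDF from earlier; the document counts are hypothetical numbers chosen for illustration.

```python
import math

def idf(corpus_size, docs_with_term):
    """Plain inverse document frequency: log of (corpus size / doc frequency)."""
    return math.log(corpus_size / docs_with_term)

# Tiny corpus: the term goes from appearing in 2 of 10 pages to 3 of 10.
small_before = idf(10, 2)
small_after = idf(10, 3)

# Large corpus: the same one-page change against 25 million pages.
large_before = idf(25_000_000, 2)
large_after = idf(25_000_000, 3)

# The absolute drop is identical (both are log(3/2)), but relative to the
# starting IDF, the tiny corpus swings roughly ten times harder.
rel_change_small = (small_before - small_after) / small_before
rel_change_large = (large_before - large_after) / large_before

print(rel_change_small)
print(rel_change_large)
```

So any score built on a 10-document sample is extremely sensitive to which 10 documents happened to rank that day — which is a sampling problem, not a property of the content being scored.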
Which is why the brilliant Cyrus Shepard stated in 2014 “In other words, generating a high TF-IDF score by itself generally isn’t enough to expect much of an SEO boost.”
This misunderstanding tends to feed the confirmation bias prevalent among many inexperienced SEO analysts. These types of tools can help optimize content for SEO, but not because of TF-IDF — simply because they provide guidance and encouragement to rewrite content using more natural, commonly used language. The same tools could be built on other metrics, like “keyword density” or just total term counts, compared against each other.
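Those simpler metrics are trivial to compute, which is part of the point. A quick sketch of both — the function names and sample sentences are mine, purely for illustration:

```python
from collections import Counter

def keyword_density(term, text):
    """Share of the page's tokens that are exactly `term` (naive whitespace split)."""
    tokens = text.lower().split()
    return tokens.count(term) / len(tokens)

def term_counts(text):
    """Raw total term counts, comparable across competing pages."""
    return Counter(text.lower().split())

page = "seo tools promise magic but good writing wins"
competitor = "good writing beats seo tools and seo tricks"

print(keyword_density("seo", page))       # fraction of tokens that are "seo"
print(term_counts(competitor)["seo"])     # how often the competitor uses it
```

Comparing numbers like these across pages gives you most of the same “write more like the pages that rank” guidance, without any pretense of a statistically meaningful corpus.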
In my opinion, the best (and only) secret weapon for optimizing content for SEO is learning how to write well and quitting the chase after algorithms. If you are not building a search engine, then focusing on TF-IDF is likely a waste of time. However, if these types of tools help you create better content, then don’t let me get in your way. There are always plenty of different ways to rank web pages.