TF-IDF (Term Frequency – Inverse Document Frequency) is a technique used in Natural Language Processing to convert text into numerical features, weighting each word by how informative it is. Unlike a plain Bag of Words count, TF-IDF lowers the weight of words that are common across the whole collection of documents and raises the weight of words that are distinctive to a particular document.

Why TF-IDF is Important

Identifies important words in a document
Reduces the impact of common words like “the”, “is”
Often improves the performance of text-based Machine Learning models compared with raw word counts
Widely used in search engines and document ranking

Key Concepts

1. Term Frequency (TF)

Measures how often a word appears in a document
Formula:
TF = (Number of times a word appears in a document) / (Total number of words in the document)

2. Inverse Document Frequency (IDF)

Measures how rare, and therefore how informative, a word is across all documents
Words that appear in many documents get lower importance
Formula:
IDF = log (Total number of documents / Number of documents containing the word)

3. TF-IDF Score

Combines TF and IDF to assign a weight to each word
Formula:
TF-IDF = TF × IDF
Higher score means the word is more important in that document
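
The three formulas above translate almost line for line into code. The following is a minimal sketch, assuming lowercased, whitespace-split tokens and the natural logarithm (the formulas above do not fix a log base). The function names and the three-sentence toy corpus are made up for illustration.

```python
import math

def term_frequency(term, tokens):
    """TF: occurrences of `term` divided by the total number of tokens."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    """IDF: log(total documents / documents containing `term`)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    """TF-IDF: the product of the two quantities above."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# Hypothetical toy corpus (not from the article), tokenized on whitespace.
corpus = [text.lower().split() for text in [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]]

first = corpus[0]
print(tf_idf("the", first, corpus))  # 0.0: "the" occurs in every document
print(tf_idf("mat", first, corpus))  # (1/6) * ln(3/1), about 0.183
print(tf_idf("cat", first, corpus))  # (1/6) * ln(3/2), about 0.068
```

Note that "the" has the highest TF in the first document yet scores zero, because its IDF is log(3/3) = 0.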

Example

Text Data:

Document 1: “I love machine learning”
Document 2: “Machine learning is powerful”
Words like “machine” and “learning” appear in both documents, so they get lower weight (with the IDF formula above, log(2/2) = 0)
Words like “love” and “powerful” appear in only one document, so they get higher weight (IDF = log(2/1))
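
To make the example concrete, the sketch below applies the TF, IDF, and TF-IDF formulas from Key Concepts to these two documents. It assumes lowercasing, whitespace tokenization, and the natural logarithm; the variable names are illustrative.

```python
import math
from collections import Counter

documents = [
    "I love machine learning",
    "Machine learning is powerful",
]

# Assumption: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in documents]

# Document frequency: in how many documents each word occurs.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
n_docs = len(tokenized)

scores = []
for tokens in tokenized:
    counts = Counter(tokens)
    scores.append({
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    })

for number, doc_scores in enumerate(scores, start=1):
    for word, score in sorted(doc_scores.items()):
        print(f"Document {number}  {word:<9} {score:.3f}")
```

Under these assumptions "machine" and "learning" score 0.000 in both documents, while every word that occurs in only one document scores (1/4) × ln(2), about 0.173.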

Advantages

Highlights important and unique words
Reduces noise from common words
Often improves model performance compared to simple word counts

Limitations

Still ignores word order and context
Does not capture semantic meaning
Can create large feature spaces
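
The first limitation is easy to see in code. TF depends only on how often each word occurs, and IDF is a per-word weight, so two documents built from the same words in a different order get identical vectors. A minimal sketch with two made-up sentences (not from the article):

```python
from collections import Counter

def term_frequencies(text):
    """Map each word to count / total words (lowercased, whitespace split)."""
    tokens = text.lower().split()
    return {word: count / len(tokens) for word, count in Counter(tokens).items()}

# Hypothetical sentences with opposite meanings but the same words.
a = term_frequencies("dog bites man")
b = term_frequencies("man bites dog")

print(a == b)  # True: the representation cannot tell the two apart
```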

Implementation Example (Python using Scikit-learn)

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
documents = [
    "I love machine learning",
    "Machine learning is powerful"
]

# Create TF-IDF model
vectorizer = TfidfVectorizer()

# Transform text into TF-IDF features
X = vectorizer.fit_transform(documents)

# Vocabulary
print(vectorizer.get_feature_names_out())

# TF-IDF matrix
print(X.toarray())
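
The numbers printed by the snippet above will not match the plain TF × IDF formula from Key Concepts. With its default settings, scikit-learn documents that TfidfVectorizer uses raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. Its default tokenizer also lowercases the text and drops one-character tokens such as "I". The sketch below reproduces that weighting in plain Python as an illustration of the documented defaults, not as the library's actual code; the helper name `sklearn_style_tfidf` is made up here.

```python
import math
import re
from collections import Counter

def sklearn_style_tfidf(documents):
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so it is never zero.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[w] * idf[w] for w in vocabulary]
        # Scale the row to unit Euclidean length (guarding empty documents).
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocabulary, rows

documents = [
    "I love machine learning",
    "Machine learning is powerful",
]
vocabulary, rows = sklearn_style_tfidf(documents)
print(vocabulary)
for row in rows:
    print([round(v, 3) for v in row])
```

Because of the smoothing, "machine" and "learning" keep a nonzero weight even though they occur in every document; they are simply weighted lower than "love" and "powerful".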

Applications

Search engines (ranking relevant documents)
Text classification and clustering
Spam detection
Sentiment analysis

Best Practices

Combine with text preprocessing steps like stopword removal
Limit vocabulary size to reduce computation
Use n-grams to capture more context
Compare performance with Bag of Words for your dataset

Conclusion

TF-IDF is a simple and widely used text representation technique that improves upon Bag of Words by weighting each word according to its frequency within a document and its rarity across the collection. It is a common starting point for NLP tasks such as search, classification, and clustering.
