TF-IDF

TF-IDF (Term Frequency – Inverse Document Frequency) is a technique used in Natural Language Processing to convert text into numerical features that weight each word by how informative it is. Unlike Bag of Words, which relies on raw word counts, TF-IDF reduces the weight of words that are common across documents and highlights words that are distinctive to a particular document.

Why TF-IDF is Important

  • Identifies important words in a document
  • Reduces the impact of common words like “the”, “is”
  • Often improves the performance of text-based Machine Learning models
  • Widely used in search engines and document ranking

Key Concepts

1. Term Frequency (TF)

  • Measures how often a word appears in a document
  • Formula:
    TF = (Number of times a word appears in a document) / (Total number of words in the document)

2. Inverse Document Frequency (IDF)

  • Measures how rare, and therefore how informative, a word is across all documents
  • Words that appear in many documents get lower importance
  • Formula:
    IDF = log (Total number of documents / Number of documents containing the word)

3. TF-IDF Score

  • Combines TF and IDF to assign a weight to each word
  • Formula:
    TF-IDF = TF × IDF
  • Higher score means the word is more important in that document
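
The three formulas above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names and the three-sentence corpus are made up for the example, tokens are produced by lowercasing and splitting on whitespace, and the natural logarithm is assumed because the formula does not specify a base.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    # Occurrences of the term divided by the total number of words in the document
    return Counter(doc_tokens)[term] / len(doc_tokens)


def inverse_document_frequency(term, corpus_tokens):
    # log(total documents / documents containing the term)
    # Assumes the term occurs in at least one document; otherwise this divides by zero.
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)


def tf_idf(term, doc_tokens, corpus_tokens):
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)


# Hypothetical toy corpus, used only for illustration
corpus = ["the cat sat", "the dog barked", "the cat purred"]
tokens = [doc.lower().split() for doc in corpus]

for term in ("the", "cat", "sat"):
    print(term, round(tf_idf(term, tokens[0], tokens), 3))
```

In this sketch "the" appears in every document, so its IDF and its TF-IDF score are zero, while "sat" appears in only one document and receives the highest score of the three.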

Example

Text Data:

  • Document 1: “I love machine learning”
  • Document 2: “Machine learning is powerful”
  • Words like “machine” and “learning” appear in both documents, so they get a lower weight
  • Words like “love” and “powerful” each appear in only one document, so they get a higher weight
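
To make the weights concrete, here is the arithmetic for two words in Document 1, using the formulas from the Key Concepts section. It assumes the natural logarithm and case-insensitive matching (so “Machine” and “machine” count as the same word); Document 1 is written as D_1 and contains four words.

```latex
\begin{align*}
\mathrm{TF}(\text{love}, D_1) &= \tfrac{1}{4} = 0.25 \\
\mathrm{IDF}(\text{love}) &= \ln\tfrac{2}{1} \approx 0.693 \\
\text{TF-IDF}(\text{love}, D_1) &\approx 0.25 \times 0.693 \approx 0.173 \\[6pt]
\mathrm{TF}(\text{machine}, D_1) &= \tfrac{1}{4} = 0.25 \\
\mathrm{IDF}(\text{machine}) &= \ln\tfrac{2}{2} = 0 \\
\text{TF-IDF}(\text{machine}, D_1) &= 0.25 \times 0 = 0
\end{align*}
```

With the unsmoothed formula, a word that appears in every document gets a weight of exactly zero, which is why many libraries use a smoothed variant of IDF.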

Advantages

  • Highlights important and unique words
  • Reduces noise from common words
  • Often improves model performance compared to simple word counts

Limitations

  • Still ignores word order and context
  • Does not capture semantic meaning
  • Can create large feature spaces

Implementation Example (Python using Scikit-learn)

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
documents = [
    "I love machine learning",
    "Machine learning is powerful",
]

# Create TF-IDF model
vectorizer = TfidfVectorizer()

# Transform text into TF-IDF features
X = vectorizer.fit_transform(documents)

# Vocabulary
print(vectorizer.get_feature_names_out())

# TF-IDF matrix
print(X.toarray())

Applications

  • Search engines (ranking relevant documents)
  • Text classification and clustering
  • Spam detection
  • Sentiment analysis
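
As an illustration of the document-ranking use case, the sketch below scores documents against a query using cosine similarity between TF-IDF vectors. The third document and the query are invented for the example, and real search engines typically use more elaborate ranking functions than this.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus and query, for illustration only
documents = [
    "I love machine learning",
    "Machine learning is powerful",
    "The weather is sunny today",
]
query = "powerful machine learning"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

# The query must be transformed with the vocabulary learned from the documents
query_vec = vectorizer.transform([query])

# One similarity score per document; higher means more relevant to the query
scores = cosine_similarity(query_vec, doc_matrix).ravel()
ranking = np.argsort(-scores)

for idx in ranking:
    print(round(float(scores[idx]), 3), documents[idx])
```

The second document shares all three query words and ranks first, while the weather sentence shares none and scores zero.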

Best Practices

  • Combine with text preprocessing steps like stopword removal
  • Limit vocabulary size to reduce computation
  • Use n-grams to capture more context
  • Compare performance with Bag of Words for your dataset

Conclusion

TF-IDF is a text representation technique that improves upon Bag of Words by weighting each word by how often it appears in a document and how rare it is across the whole collection. It is widely used in NLP tasks and often leads to better-performing Machine Learning models than raw word counts.
