TF-IDF (Term Frequency-Inverse Document Frequency) is a technique used in Natural Language Processing to convert text into numerical features that weight each word by how informative it is. Unlike Bag of Words, which relies on raw word counts, TF-IDF reduces the weight of words that are common across documents and highlights words that are more distinctive.
Why TF-IDF is Important
- Identifies important words in a document
- Reduces the impact of common words like "the" and "is"
- Often improves the performance of text-based Machine Learning models
- Widely used in search engines and document ranking
Key Concepts
1. Term Frequency (TF)
- Measures how often a word appears in a document
- Formula:
TF = (Number of times a word appears in a document) / (Total number of words in the document)
2. Inverse Document Frequency (IDF)
- Measures how rare, and therefore how informative, a word is across all documents
- Words that appear in many documents get lower importance
- Formula:
IDF = log (Total number of documents / Number of documents containing the word)
3. TF-IDF Score
- Combines TF and IDF to assign a weight to each word
- Formula:
TF-IDF = TF × IDF
- A higher score means the word is more important in that document
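The three formulas above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names and the toy corpus are assumptions made for the example, documents are assumed to be pre-tokenized lists of words, and the natural logarithm is used because the formula does not fix a base.

```python
import math

def term_frequency(term, tokens):
    """Share of the document's tokens that equal `term`."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    """log(total documents / documents containing `term`)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        # The plain formula is undefined for a term that never occurs
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    """Product of the two measures for one term in one document."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# Hypothetical toy corpus, already tokenized
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

print(tf_idf("the", corpus[0], corpus))  # 0.0, because "the" occurs in every document
print(tf_idf("dog", corpus[1], corpus))  # (1/3) * log(3)
```

A term that occurs in every document gets an IDF of log(1) = 0, so its TF-IDF score is zero no matter how often it appears.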
Example
Text Data:
- Document 1: "I love machine learning"
- Document 2: "Machine learning is powerful"
- Words like "machine" and "learning" appear in both documents, so they get lower weight
- Words like "love" and "powerful" appear in only one document, so they get higher weight
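The example above can be worked through numerically with the formulas from the Key Concepts section. The sketch below assumes lowercasing and whitespace splitting as a stand-in for real tokenization, and uses the natural logarithm; the helper name tf_idf is chosen for illustration.

```python
import math

documents = [
    "I love machine learning",
    "Machine learning is powerful",
]

# Assumption: lowercase and split on whitespace instead of a real tokenizer
corpus = [doc.lower().split() for doc in documents]

def tf_idf(term, tokens, corpus):
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    return tf * math.log(len(corpus) / df)

# One dictionary of rounded scores per document
scores = [
    {term: round(tf_idf(term, tokens, corpus), 3) for term in tokens}
    for tokens in corpus
]

for i, doc_scores in enumerate(scores, start=1):
    print(f"Document {i}: {doc_scores}")
```

With this unsmoothed formula, "machine" and "learning" score exactly 0.0 because they occur in both documents, while "love" and "powerful" each score 0.25 × log(2) ≈ 0.173.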
Advantages
- Highlights important and unique words
- Reduces noise from common words
- Often improves model performance compared to simple word counts, although the gain depends on the dataset
Limitations
- Still ignores word order and context
- Does not capture semantic meaning
- Can create large feature spaces
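The first limitation can be demonstrated directly. In the sketch below, two sentences with opposite meanings contain the same words, so they receive identical TF-IDF vectors. The corpus and the helper tfidf_vector are hypothetical and use the plain formulas from the Key Concepts section.

```python
import math
from collections import Counter

def tfidf_vector(tokens, corpus, vocabulary):
    """TF-IDF weights for one document, ordered by `vocabulary`."""
    counts = Counter(tokens)
    vector = []
    for term in vocabulary:
        tf = counts[term] / len(tokens)
        df = sum(1 for doc in corpus if term in doc)
        vector.append(tf * math.log(len(corpus) / df))
    return vector

# Hypothetical corpus: the first two documents differ only in word order
corpus = [
    "dog bites man".split(),
    "man bites dog".split(),
    "cat sleeps".split(),
]
vocabulary = sorted({term for doc in corpus for term in doc})

v0 = tfidf_vector(corpus[0], corpus, vocabulary)
v1 = tfidf_vector(corpus[1], corpus, vocabulary)
print(v0 == v1)  # True: word order has no effect on the vector
```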
Implementation Example (Python using Scikit-learn)
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
documents = [
    "I love machine learning",
    "Machine learning is powerful"
]

# Create TF-IDF model
vectorizer = TfidfVectorizer()

# Transform text into TF-IDF features
X = vectorizer.fit_transform(documents)

# Vocabulary
print(vectorizer.get_feature_names_out())
# TF-IDF matrix
print(X.toarray())
Applications
- Search engines (ranking relevant documents)
- Text classification and clustering
- Spam detection
- Sentiment analysis
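As an illustration of the document ranking use case, the sketch below scores documents against a query by cosine similarity between TF-IDF vectors. The query string is hypothetical, and real search engines use considerably more elaborate ranking. This only shows the basic idea.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "I love machine learning",
    "Machine learning is powerful",
]
query = "powerful learning"  # hypothetical search query

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Reuse the fitted vocabulary and IDF weights for the query
query_vector = vectorizer.transform([query])

# One similarity score per document, then indices from best to worst match
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]

for idx in ranking:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```

The second document ranks first because it contains both query words, including "powerful", which appears in only one document and so carries more weight.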
Best Practices
- Combine with text preprocessing steps like stopword removal
- Limit vocabulary size to reduce computation
- Use n-grams to capture more context
- Compare performance with Bag of Words for your dataset
Conclusion
TF-IDF is a text representation technique that improves upon Bag of Words by weighting each word by its frequency within a document and its rarity across the collection. It is widely used in NLP tasks and often leads to better-performing Machine Learning models than raw word counts.