Problem

Word sense disambiguation is the task of determining which sense a word with multiple meanings takes on in a particular context. For example, in the sentence “I went to get money from the bank”, bank most likely means the place where people deposit money, not the land beside a river or lake.

Suppose you are given a list of meanings for several words, formatted like so:

{
    "word_1": ["meaning one", "meaning two", ...],
    ...
    "word_n": ["meaning one", "meaning two", ...]
}

Given a sentence, most of whose words are contained in the meaning list above, design an algorithm that determines the most likely sense of each potentially ambiguous word.

Examples

Example 1

Input: 
sentence = "I went to get money from the bank"
meanings = {
    "bank": ["financial institution where money is kept", "land beside a river or lake"],
    "money": ["currency used for transactions", "wealth or riches"]
}
Output: {
    "bank": "financial institution where money is kept",
    "money": "currency used for transactions"
}
Explanation:
Context words "get", "money", and "from" suggest financial transaction, making "financial institution" more likely for "bank".
Word "money" in context of "get...from bank" suggests "currency" rather than abstract wealth.

Example 2

Input:
sentence = "The bank of the river was muddy"
meanings = {
    "bank": ["financial institution where money is kept", "land beside a river or lake"],
    "river": ["flowing body of water", "large stream of something"]
}
Output: {
    "bank": "land beside a river or lake",
    "river": "flowing body of water"
}
Explanation:
Context words "river", "muddy" clearly indicate geographical/natural setting.
"Bank" in this context means the riverside, not financial institution.

Example 3

Input:
sentence = "She played bass guitar in the band"
meanings = {
    "bass": ["low-frequency musical notes", "type of fish", "lowest male singing voice"],
    "band": ["musical group", "strip of material", "frequency range"]
}
Output: {
    "bass": "low-frequency musical notes",
    "band": "musical group"
}
Explanation:
Musical context with "played", "guitar", "band" makes musical meanings most appropriate.
"Bass" refers to musical notes/instrument, "band" refers to musical group.

Example 4

Input:
sentence = "The pitcher threw a fast ball"
meanings = {
    "pitcher": ["baseball player who throws", "container for liquids"],
    "ball": ["spherical object used in sports", "formal dancing event"]
}
Output: {
    "pitcher": "baseball player who throws",
    "ball": "spherical object used in sports"
}
Explanation:
Sports context with "threw", "fast" indicates baseball setting.
Both words take their sports-related meanings.

Example 5

Input:
sentence = "The mouse clicked on the screen"
meanings = {
    "mouse": ["small rodent", "computer input device"],
    "screen": ["display monitor", "mesh barrier", "movie projection surface"]
}
Output: {
    "mouse": "computer input device",
    "screen": "display monitor"
}
Explanation:
Technology context with "clicked" indicates computer setting.
"Mouse" refers to input device, "screen" refers to computer monitor.

Solution

Method 1 - Context Similarity Scoring

Intuition

We can determine word sense by measuring how well each possible meaning fits with the context of the sentence. We extract context words from the sentence and calculate similarity scores between each meaning and the context words. The meaning with the highest contextual similarity is most likely correct.

Approach

  1. Parse the input sentence and extract all words as context
  2. For each ambiguous word in the sentence, get all possible meanings
  3. For each meaning, calculate a similarity score with context words
  4. Score each meaning by exact word overlap with the context, giving partial credit to substring matches
  5. Select the meaning with the highest combined score
  6. Handle edge cases: unknown words, single meanings, empty context

Code

#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
using namespace std;

class Solution {
private:
    vector<string> tokenize(const string& text) {
        vector<string> tokens;
        stringstream ss(text);
        string word;
        while (ss >> word) {
            // Remove punctuation
            word.erase(remove_if(word.begin(), word.end(), ::ispunct), word.end());
            transform(word.begin(), word.end(), word.begin(), ::tolower);
            if (!word.empty()) {
                tokens.push_back(word);
            }
        }
        return tokens;
    }
    
    double calculateSimilarity(const string& meaning, const vector<string>& context) {
        vector<string> meaningWords = tokenize(meaning);
        double score = 0.0;
        
        // Direct word overlap
        for (const string& meaningWord : meaningWords) {
            for (const string& contextWord : context) {
                if (meaningWord == contextWord) {
                    score += 2.0; // High weight for exact matches
                }
                // Partial similarity (simple substring check)
                else if (meaningWord.find(contextWord) != string::npos || 
                         contextWord.find(meaningWord) != string::npos) {
                    score += 0.5;
                }
            }
        }
        
        // Boost score based on meaning relevance
        if (meaningWords.size() > 0) {
            score = score / meaningWords.size(); // Normalize by meaning length
        }
        
        return score;
    }
    
public:
    unordered_map<string, string> disambiguateWords(const string& sentence, 
                                                   const unordered_map<string, vector<string>>& meanings) {
        unordered_map<string, string> result;
        vector<string> words = tokenize(sentence);
        
        // Build context (all words in sentence)
        unordered_set<string> contextSet(words.begin(), words.end());
        vector<string> context(contextSet.begin(), contextSet.end());
        
        for (const string& word : words) {
            auto it = meanings.find(word);
            if (it != meanings.end()) {
                const vector<string>& senses = it->second;
                
                if (senses.size() == 1) {
                    result[word] = senses[0];
                } else {
                    // Find best sense
                    string bestSense = senses[0];
                    double bestScore = calculateSimilarity(senses[0], context);
                    
                    for (size_t i = 1; i < senses.size(); i++) {
                        double score = calculateSimilarity(senses[i], context);
                        if (score > bestScore) {
                            bestScore = score;
                            bestSense = senses[i];
                        }
                    }
                    
                    result[word] = bestSense;
                }
            }
        }
        
        return result;
    }
};
import (
    "strings"
    "unicode"
)

func disambiguateWords(sentence string, meanings map[string][]string) map[string]string {
    result := make(map[string]string)
    
    tokenize := func(text string) []string {
        words := strings.Fields(strings.ToLower(text))
        var tokens []string
        for _, word := range words {
            // Remove punctuation
            cleaned := strings.Map(func(r rune) rune {
                if unicode.IsPunct(r) {
                    return -1
                }
                return r
            }, word)
            if cleaned != "" {
                tokens = append(tokens, cleaned)
            }
        }
        return tokens
    }
    
    calculateSimilarity := func(meaning string, context []string) float64 {
        meaningWords := tokenize(meaning)
        score := 0.0
        
        for _, meaningWord := range meaningWords {
            for _, contextWord := range context {
                if meaningWord == contextWord {
                    score += 2.0
                } else if strings.Contains(meaningWord, contextWord) || strings.Contains(contextWord, meaningWord) {
                    score += 0.5
                }
            }
        }
        
        if len(meaningWords) > 0 {
            score = score / float64(len(meaningWords))
        }
        
        return score
    }
    
    words := tokenize(sentence)
    
    // Build context
    contextSet := make(map[string]bool)
    for _, word := range words {
        contextSet[word] = true
    }
    
    var context []string
    for word := range contextSet {
        context = append(context, word)
    }
    
    for _, word := range words {
        if senses, exists := meanings[word]; exists {
            if len(senses) == 1 {
                result[word] = senses[0]
            } else {
                bestSense := senses[0]
                bestScore := calculateSimilarity(senses[0], context)
                
                for i := 1; i < len(senses); i++ {
                    score := calculateSimilarity(senses[i], context)
                    if score > bestScore {
                        bestScore = score
                        bestSense = senses[i]
                    }
                }
                
                result[word] = bestSense
            }
        }
    }
    
    return result
}
import java.util.*;

class Solution {
    private List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String[] words = text.toLowerCase().split("\\s+");
        
        for (String word : words) {
            // Remove punctuation
            String cleaned = word.replaceAll("[^a-zA-Z0-9]", "");
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        
        return tokens;
    }
    
    private double calculateSimilarity(String meaning, List<String> context) {
        List<String> meaningWords = tokenize(meaning);
        double score = 0.0;
        
        for (String meaningWord : meaningWords) {
            for (String contextWord : context) {
                if (meaningWord.equals(contextWord)) {
                    score += 2.0;
                } else if (meaningWord.contains(contextWord) || contextWord.contains(meaningWord)) {
                    score += 0.5;
                }
            }
        }
        
        if (meaningWords.size() > 0) {
            score = score / meaningWords.size();
        }
        
        return score;
    }
    
    public Map<String, String> disambiguateWords(String sentence, Map<String, List<String>> meanings) {
        Map<String, String> result = new HashMap<>();
        List<String> words = tokenize(sentence);
        
        // Build context
        Set<String> contextSet = new HashSet<>(words);
        List<String> context = new ArrayList<>(contextSet);
        
        for (String word : words) {
            if (meanings.containsKey(word)) {
                List<String> senses = meanings.get(word);
                
                if (senses.size() == 1) {
                    result.put(word, senses.get(0));
                } else {
                    String bestSense = senses.get(0);
                    double bestScore = calculateSimilarity(senses.get(0), context);
                    
                    for (int i = 1; i < senses.size(); i++) {
                        double score = calculateSimilarity(senses.get(i), context);
                        if (score > bestScore) {
                            bestScore = score;
                            bestSense = senses.get(i);
                        }
                    }
                    
                    result.put(word, bestSense);
                }
            }
        }
        
        return result;
    }
}
import re
from typing import Dict, List


class Solution:
    def _tokenize(self, text: str) -> List[str]:
        # \w+ already excludes punctuation and never yields empty strings
        return re.findall(r'\b\w+\b', text.lower())
    
    def _calculate_similarity(self, meaning: str, context: List[str]) -> float:
        meaning_words = self._tokenize(meaning)
        score = 0.0
        
        for meaning_word in meaning_words:
            for context_word in context:
                if meaning_word == context_word:
                    score += 2.0
                elif meaning_word in context_word or context_word in meaning_word:
                    score += 0.5
        
        if meaning_words:
            score = score / len(meaning_words)
        
        return score
    
    def disambiguate_words(self, sentence: str, meanings: Dict[str, List[str]]) -> Dict[str, str]:
        result = {}
        words = self._tokenize(sentence)
        
        # Build context
        context = list(set(words))
        
        for word in words:
            if word in meanings:
                senses = meanings[word]
                
                if len(senses) == 1:
                    result[word] = senses[0]
                else:
                    best_sense = senses[0]
                    best_score = self._calculate_similarity(senses[0], context)
                    
                    for i in range(1, len(senses)):
                        score = self._calculate_similarity(senses[i], context)
                        if score > best_score:
                            best_score = score
                            best_sense = senses[i]
                    
                    result[word] = best_sense
        
        return result
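
As a quick sanity check, here is a condensed, self-contained version of the same scoring logic (the standalone function names are illustrative, not part of the solution class) run on Example 1:

```python
import re

def tokenize(text):
    # Lowercase and keep only word characters
    return re.findall(r"\b\w+\b", text.lower())

def similarity(meaning, context):
    # Exact match: 2.0; substring overlap: 0.5; normalized by meaning length
    words = tokenize(meaning)
    score = sum(
        2.0 if m == c else 0.5 if (m in c or c in m) else 0.0
        for m in words for c in context
    )
    return score / len(words) if words else 0.0

def disambiguate(sentence, meanings):
    context = list(set(tokenize(sentence)))
    # On ties, max() keeps the first sense, matching the strict > in the class
    return {
        w: max(meanings[w], key=lambda s: similarity(s, context))
        for w in tokenize(sentence) if w in meanings
    }

sentence = "I went to get money from the bank"
meanings = {
    "bank": ["financial institution where money is kept",
             "land beside a river or lake"],
    "money": ["currency used for transactions", "wealth or riches"],
}
print(disambiguate(sentence, meanings)["bank"])
# financial institution where money is kept
```

Note that the crude substring heuristic is fragile: it resolves "bank" correctly here because the sense text contains the context word "money", but for words like "money" itself, whose sense texts share almost nothing with the sentence, the scores are near-ties and can go either way. This is part of the motivation for the domain weighting in Method 2.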

Complexity

  • ⏰ Time complexity: O(n * m * k * l), where n is number of words in sentence, m is average number of senses per word, k is average length of each sense, and l is context size
  • 🧺 Space complexity: O(n + m), for storing context words and intermediate results

Method 2 - Enhanced Context Analysis with Semantic Clustering

Intuition

We can improve disambiguation by using semantic clustering and weighted context analysis. Instead of simple word overlap, we can group related words and use domain-specific scoring to better identify the correct sense based on thematic coherence.

Approach

  1. Create semantic clusters for different domains (financial, natural, technology, etc.)
  2. Extract context words and classify them into semantic domains
  3. For each word sense, calculate domain alignment scores
  4. Use weighted scoring based on proximity and semantic relatedness
  5. Apply heuristics for common disambiguation patterns

Code

#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
using namespace std;

class Solution {
private:
    unordered_map<string, string> domainKeywords = {
        {"financial", "money bank deposit withdraw account credit loan"},
        {"natural", "river lake water mud shore flow stream"},
        {"technology", "computer click screen mouse keyboard software"},
        {"music", "play band guitar song note instrument sound"},
        {"sports", "ball throw pitch catch game team player"}
    };
    
    vector<string> tokenize(const string& text) {
        vector<string> tokens;
        stringstream ss(text);
        string word;
        while (ss >> word) {
            word.erase(remove_if(word.begin(), word.end(), ::ispunct), word.end());
            transform(word.begin(), word.end(), word.begin(), ::tolower);
            if (!word.empty()) {
                tokens.push_back(word);
            }
        }
        return tokens;
    }
    
    string identifyDomain(const vector<string>& context) {
        unordered_map<string, double> domainScores;
        
        for (const auto& domain : domainKeywords) {
            vector<string> domainWords = tokenize(domain.second);
            double score = 0.0;
            
            for (const string& contextWord : context) {
                for (const string& domainWord : domainWords) {
                    if (contextWord == domainWord) {
                        score += 1.0;
                    }
                }
            }
            
            domainScores[domain.first] = score;
        }
        
        string bestDomain = "general";
        double bestScore = 0.0;
        for (const auto& pair : domainScores) {
            if (pair.second > bestScore) {
                bestScore = pair.second;
                bestDomain = pair.first;
            }
        }
        
        return bestDomain;
    }
    
    double calculateEnhancedSimilarity(const string& meaning, const vector<string>& context, const string& domain) {
        vector<string> meaningWords = tokenize(meaning);
        double score = 0.0;
        
        // Domain keywords bonus
        if (domainKeywords.count(domain)) {
            vector<string> domainWords = tokenize(domainKeywords[domain]);
            for (const string& meaningWord : meaningWords) {
                for (const string& domainWord : domainWords) {
                    if (meaningWord == domainWord) {
                        score += 3.0; // Higher weight for domain match
                    }
                }
            }
        }
        
        // Context similarity
        for (const string& meaningWord : meaningWords) {
            for (const string& contextWord : context) {
                if (meaningWord == contextWord) {
                    score += 2.0;
                } else if (meaningWord.find(contextWord) != string::npos || 
                           contextWord.find(meaningWord) != string::npos) {
                    score += 0.8;
                }
            }
        }
        
        return score;
    }
    
public:
    unordered_map<string, string> disambiguateWordsEnhanced(const string& sentence, 
                                                           const unordered_map<string, vector<string>>& meanings) {
        unordered_map<string, string> result;
        vector<string> words = tokenize(sentence);
        
        unordered_set<string> contextSet(words.begin(), words.end());
        vector<string> context(contextSet.begin(), contextSet.end());
        
        string domain = identifyDomain(context);
        
        for (const string& word : words) {
            auto it = meanings.find(word);
            if (it != meanings.end()) {
                const vector<string>& senses = it->second;
                
                if (senses.size() == 1) {
                    result[word] = senses[0];
                } else {
                    string bestSense = senses[0];
                    double bestScore = calculateEnhancedSimilarity(senses[0], context, domain);
                    
                    for (size_t i = 1; i < senses.size(); i++) {
                        double score = calculateEnhancedSimilarity(senses[i], context, domain);
                        if (score > bestScore) {
                            bestScore = score;
                            bestSense = senses[i];
                        }
                    }
                    
                    result[word] = bestSense;
                }
            }
        }
        
        return result;
    }
};
import (
    "strings"
    "unicode"
)

func disambiguateWordsEnhanced(sentence string, meanings map[string][]string) map[string]string {
    domainKeywords := map[string]string{
        "financial":  "money bank deposit withdraw account credit loan",
        "natural":    "river lake water mud shore flow stream",
        "technology": "computer click screen mouse keyboard software",
        "music":      "play band guitar song note instrument sound",
        "sports":     "ball throw pitch catch game team player",
    }
    
    tokenize := func(text string) []string {
        words := strings.Fields(strings.ToLower(text))
        var tokens []string
        for _, word := range words {
            cleaned := strings.Map(func(r rune) rune {
                if unicode.IsPunct(r) {
                    return -1
                }
                return r
            }, word)
            if cleaned != "" {
                tokens = append(tokens, cleaned)
            }
        }
        return tokens
    }
    
    identifyDomain := func(context []string) string {
        domainScores := make(map[string]float64)
        
        for domain, keywords := range domainKeywords {
            domainWords := tokenize(keywords)
            score := 0.0
            
            for _, contextWord := range context {
                for _, domainWord := range domainWords {
                    if contextWord == domainWord {
                        score += 1.0
                    }
                }
            }
            
            domainScores[domain] = score
        }
        
        bestDomain := "general"
        bestScore := 0.0
        for domain, score := range domainScores {
            if score > bestScore {
                bestScore = score
                bestDomain = domain
            }
        }
        
        return bestDomain
    }
    
    calculateEnhancedSimilarity := func(meaning string, context []string, domain string) float64 {
        meaningWords := tokenize(meaning)
        score := 0.0
        
        // Domain keywords bonus
        if keywords, exists := domainKeywords[domain]; exists {
            domainWords := tokenize(keywords)
            for _, meaningWord := range meaningWords {
                for _, domainWord := range domainWords {
                    if meaningWord == domainWord {
                        score += 3.0
                    }
                }
            }
        }
        
        // Context similarity
        for _, meaningWord := range meaningWords {
            for _, contextWord := range context {
                if meaningWord == contextWord {
                    score += 2.0
                } else if strings.Contains(meaningWord, contextWord) || strings.Contains(contextWord, meaningWord) {
                    score += 0.8
                }
            }
        }
        
        return score
    }
    
    result := make(map[string]string)
    words := tokenize(sentence)
    
    contextSet := make(map[string]bool)
    for _, word := range words {
        contextSet[word] = true
    }
    
    var context []string
    for word := range contextSet {
        context = append(context, word)
    }
    
    domain := identifyDomain(context)
    
    for _, word := range words {
        if senses, exists := meanings[word]; exists {
            if len(senses) == 1 {
                result[word] = senses[0]
            } else {
                bestSense := senses[0]
                bestScore := calculateEnhancedSimilarity(senses[0], context, domain)
                
                for i := 1; i < len(senses); i++ {
                    score := calculateEnhancedSimilarity(senses[i], context, domain)
                    if score > bestScore {
                        bestScore = score
                        bestSense = senses[i]
                    }
                }
                
                result[word] = bestSense
            }
        }
    }
    
    return result
}
import java.util.*;

class Solution {
    private Map<String, String> domainKeywords = Map.of(
        "financial", "money bank deposit withdraw account credit loan",
        "natural", "river lake water mud shore flow stream", 
        "technology", "computer click screen mouse keyboard software",
        "music", "play band guitar song note instrument sound",
        "sports", "ball throw pitch catch game team player"
    );
    
    private List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String[] words = text.toLowerCase().split("\\s+");
        
        for (String word : words) {
            String cleaned = word.replaceAll("[^a-zA-Z0-9]", "");
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        
        return tokens;
    }
    
    private String identifyDomain(List<String> context) {
        Map<String, Double> domainScores = new HashMap<>();
        
        for (Map.Entry<String, String> entry : domainKeywords.entrySet()) {
            List<String> domainWords = tokenize(entry.getValue());
            double score = 0.0;
            
            for (String contextWord : context) {
                for (String domainWord : domainWords) {
                    if (contextWord.equals(domainWord)) {
                        score += 1.0;
                    }
                }
            }
            
            domainScores.put(entry.getKey(), score);
        }
        
        String bestDomain = "general";
        double bestScore = 0.0;
        for (Map.Entry<String, Double> entry : domainScores.entrySet()) {
            if (entry.getValue() > bestScore) {
                bestScore = entry.getValue();
                bestDomain = entry.getKey();
            }
        }
        
        return bestDomain;
    }
    
    private double calculateEnhancedSimilarity(String meaning, List<String> context, String domain) {
        List<String> meaningWords = tokenize(meaning);
        double score = 0.0;
        
        // Domain keywords bonus
        if (domainKeywords.containsKey(domain)) {
            List<String> domainWords = tokenize(domainKeywords.get(domain));
            for (String meaningWord : meaningWords) {
                for (String domainWord : domainWords) {
                    if (meaningWord.equals(domainWord)) {
                        score += 3.0;
                    }
                }
            }
        }
        
        // Context similarity
        for (String meaningWord : meaningWords) {
            for (String contextWord : context) {
                if (meaningWord.equals(contextWord)) {
                    score += 2.0;
                } else if (meaningWord.contains(contextWord) || contextWord.contains(meaningWord)) {
                    score += 0.8;
                }
            }
        }
        
        return score;
    }
    
    public Map<String, String> disambiguateWordsEnhanced(String sentence, Map<String, List<String>> meanings) {
        Map<String, String> result = new HashMap<>();
        List<String> words = tokenize(sentence);
        
        Set<String> contextSet = new HashSet<>(words);
        List<String> context = new ArrayList<>(contextSet);
        
        String domain = identifyDomain(context);
        
        for (String word : words) {
            if (meanings.containsKey(word)) {
                List<String> senses = meanings.get(word);
                
                if (senses.size() == 1) {
                    result.put(word, senses.get(0));
                } else {
                    String bestSense = senses.get(0);
                    double bestScore = calculateEnhancedSimilarity(senses.get(0), context, domain);
                    
                    for (int i = 1; i < senses.size(); i++) {
                        double score = calculateEnhancedSimilarity(senses.get(i), context, domain);
                        if (score > bestScore) {
                            bestScore = score;
                            bestSense = senses.get(i);
                        }
                    }
                    
                    result.put(word, bestSense);
                }
            }
        }
        
        return result;
    }
}
import re
from typing import Dict, List


class Solution:
    def __init__(self):
        self.domain_keywords = {
            "financial": "money bank deposit withdraw account credit loan",
            "natural": "river lake water mud shore flow stream",
            "technology": "computer click screen mouse keyboard software", 
            "music": "play band guitar song note instrument sound",
            "sports": "ball throw pitch catch game team player"
        }
    
    def _tokenize(self, text: str) -> List[str]:
        import re
        words = re.findall(r'\b\w+\b', text.lower())
        return [word for word in words if word]
    
    def _identify_domain(self, context: List[str]) -> str:
        domain_scores = {}
        
        for domain, keywords in self.domain_keywords.items():
            domain_words = self._tokenize(keywords)
            score = 0.0
            
            for context_word in context:
                for domain_word in domain_words:
                    if context_word == domain_word:
                        score += 1.0
            
            domain_scores[domain] = score
        
        best_domain = "general"
        best_score = 0.0
        for domain, score in domain_scores.items():
            if score > best_score:
                best_score = score
                best_domain = domain
        
        return best_domain
    
    def _calculate_enhanced_similarity(self, meaning: str, context: List[str], domain: str) -> float:
        meaning_words = self._tokenize(meaning)
        score = 0.0
        
        # Domain keywords bonus
        if domain in self.domain_keywords:
            domain_words = self._tokenize(self.domain_keywords[domain])
            for meaning_word in meaning_words:
                for domain_word in domain_words:
                    if meaning_word == domain_word:
                        score += 3.0
        
        # Context similarity
        for meaning_word in meaning_words:
            for context_word in context:
                if meaning_word == context_word:
                    score += 2.0
                elif meaning_word in context_word or context_word in meaning_word:
                    score += 0.8
        
        return score
    
    def disambiguate_words_enhanced(self, sentence: str, meanings: Dict[str, List[str]]) -> Dict[str, str]:
        result = {}
        words = self._tokenize(sentence)
        
        context = list(set(words))
        domain = self._identify_domain(context)
        
        for word in words:
            if word in meanings:
                senses = meanings[word]
                
                if len(senses) == 1:
                    result[word] = senses[0]
                else:
                    best_sense = senses[0]
                    best_score = self._calculate_enhanced_similarity(senses[0], context, domain)
                    
                    for i in range(1, len(senses)):
                        score = self._calculate_enhanced_similarity(senses[i], context, domain)
                        if score > best_score:
                            best_score = score
                            best_sense = senses[i]
                    
                    result[word] = best_sense
        
        return result
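
A condensed, self-contained sketch of the domain-weighted scoring (standalone function names are illustrative) reproduces Example 5:

```python
import re

DOMAINS = {
    "financial": "money bank deposit withdraw account credit loan",
    "natural": "river lake water mud shore flow stream",
    "technology": "computer click screen mouse keyboard software",
    "music": "play band guitar song note instrument sound",
    "sports": "ball throw pitch catch game team player",
}

def tokenize(text):
    return re.findall(r"\b\w+\b", text.lower())

def identify_domain(context):
    # Count exact overlaps between the context and each domain's keyword list
    scores = {d: len(set(tokenize(kw)) & set(context)) for d, kw in DOMAINS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

def enhanced_similarity(meaning, context, domain):
    words = tokenize(meaning)
    domain_words = set(tokenize(DOMAINS.get(domain, "")))
    score = 3.0 * sum(1 for m in words if m in domain_words)  # domain bonus
    for m in words:
        for c in context:
            score += 2.0 if m == c else 0.8 if (m in c or c in m) else 0.0
    return score

def disambiguate(sentence, meanings):
    context = list(set(tokenize(sentence)))
    domain = identify_domain(context)
    return {
        w: max(meanings[w], key=lambda s: enhanced_similarity(s, context, domain))
        for w in tokenize(sentence) if w in meanings
    }

meanings = {
    "mouse": ["small rodent", "computer input device"],
    "screen": ["display monitor", "mesh barrier", "movie projection surface"],
}
print(disambiguate("The mouse clicked on the screen", meanings))
# {'mouse': 'computer input device', 'screen': 'display monitor'}
```

Here "mouse" and "screen" both appear in the technology keyword list, so that domain wins, and the +3.0 bonus for "computer" pushes "computer input device" past "small rodent" even though no context word overlaps either sense directly.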

Complexity

  • ⏰ Time complexity: O(n * m * k * d), where n is sentence length, m is average senses per word, k is sense length, and d is domain keywords size
  • 🧺 Space complexity: O(n + d + m), for context storage, domain keywords, and results

Notes

  • Method 1 provides basic context similarity scoring using word overlap and partial matching
  • Method 2 enhances disambiguation with semantic domain identification and weighted scoring
  • Both methods handle edge cases like single-sense words and unknown words gracefully
  • The algorithm works best with rich context and well-defined word senses
  • Real-world implementations often use pre-trained word embeddings or neural networks for better semantic understanding
  • Domain identification helps bias toward contextually appropriate meanings
  • The scoring system can be further refined with linguistic features like part-of-speech tags
  • Performance can be improved with caching and preprocessing of domain keywords
  • The approach is extensible to support more sophisticated NLP techniques like dependency parsing
  • Consider using external knowledge bases like WordNet for more comprehensive word sense definitions
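
The caching point above can be sketched like this: a hypothetical preprocessing step that tokenizes each domain's keyword string once, so every later domain lookup is a cheap set intersection instead of nested loops over raw strings:

```python
DOMAIN_KEYWORDS = {
    "financial": "money bank deposit withdraw account credit loan",
    "technology": "computer click screen mouse keyboard software",
}

# Tokenize each keyword string once, up front, instead of on every call
DOMAIN_SETS = {d: frozenset(kw.split()) for d, kw in DOMAIN_KEYWORDS.items()}

def identify_domain(context_words):
    # Domain score = size of the intersection with the precomputed set
    scores = {d: len(s & set(context_words)) for d, s in DOMAIN_SETS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

print(identify_domain({"the", "mouse", "clicked", "screen"}))  # technology
```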