This repository was archived by the owner on Oct 26, 2023. It is now read-only.

Designing a disambiguation model #8

@pudo

Description


Progress

StoryWeb can now load articles, run them through spaCy NER and store the extracted entity tags (e.g. John Doe) in a database. Each tag is scoped to one article, i.e. identified as (article_id, tag, count_of_mentions, ...). There's also a database model that describes a link between two tags: A and B are the same, unrelated, or have some semantic link (e.g. family). Once two tags are connected by a same link, they are considered a cluster, i.e. they become essentially the same node in the graph.
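A minimal sketch of that data model in plain Python (the names Tag, Link and the judgement values are illustrative; the actual StoryWeb schema may differ). Clustering over same links then reduces to union-find:

```python
from dataclasses import dataclass

# Hypothetical model, not StoryWeb's actual schema.
@dataclass(frozen=True)
class Tag:
    article_id: str   # tags are scoped to one article for disambiguation
    label: str        # surface form, e.g. "John Doe"
    mentions: int     # count of mentions within that article

@dataclass(frozen=True)
class Link:
    source: Tag
    target: Tag
    judgement: str    # "same", "unrelated", or a semantic link like "family"

def same_cluster(links: list[Link], a: Tag, b: Tag) -> bool:
    """Two tags form one cluster if 'same' links (transitively) connect them."""
    parent: dict[Tag, Tag] = {}

    def find(t: Tag) -> Tag:
        # union-find with path halving
        parent.setdefault(t, t)
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for link in links:
        if link.judgement == "same":
            parent[find(link.source)] = find(link.target)
    return find(a) == find(b)
```

The point of modelling links rather than merges directly is that a "same" judgement is reversible: recomputing the clusters is just re-running union-find over the surviving links.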

There is also a small UI that lets users make those links manually, either between different tags in the same article or between tags with the same surface form across different articles.

[Screenshot: manual tag-linking UI, 2022-10-17]

The rationale for keeping tags constrained to one article is disambiguation: John Doe in article A may refer to a different individual than John Doe in article B.

Challenge

While disambiguation between different tags with the same surface form (e.g. two John Does) is needed, doing all of this manually is intensely annoying and not even good enough for a prototype. What I'd like to do is find a way to auto-decide the unambiguous cases, show the rest to the user, and refine further merges based on their input.

In my mind, the core evidence for making these decisions is co-occurrence: John Doe A co-occurs with Jane Doe and Italy; John Doe B co-occurs with MegaCorp Ltd. and State Prosecutor. I'm aware that this leaves a lot of signal - the body of the documents - on the table. This goes back to wanting to build an interactive, human-in-the-loop system for better precision: keep it down to something that is explainable, and where we can re-compute clustering proposals while the user is providing input (active learning).
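As a baseline for that co-occurrence evidence, a Jaccard overlap of the neighbour sets of two candidate tags is easy to compute and to explain to a user. This is a sketch, not StoryWeb code:

```python
def cooccurrence_similarity(neighbours_a: set[str], neighbours_b: set[str]) -> float:
    """Jaccard overlap of the tags each candidate co-occurs with.

    1.0 means identical co-occurrence context, 0.0 means no shared context.
    """
    if not neighbours_a or not neighbours_b:
        return 0.0
    shared = neighbours_a & neighbours_b
    total = neighbours_a | neighbours_b
    return len(shared) / len(total)
```

For the example above, John Doe A ({Jane Doe, Italy}) and John Doe B ({MegaCorp Ltd., State Prosecutor}) share nothing, so the score is 0.0 and the pair would not be auto-merged.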

But I'm kind of stuck on this: how do I take the co-occurrence sets, model them into an input for some very simple machine learning model, and get back both a set of judgements and a confidence score for each? I would then a) accept the decisions the system is confident about, b) show the most informative uncertain ones to a user to judge by hand, and c) re-train the model with these additional judgements.
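The a)/b)/c) loop could be sketched as a triage step: given any pairwise scoring function that returns a same-entity probability, auto-decide the confident pairs and queue the rest for human review, most uncertain first (plain uncertainty sampling). Function names and thresholds here are illustrative assumptions:

```python
from typing import Callable, Iterable, TypeVar

Pair = TypeVar("Pair")

def triage(
    pairs: Iterable[Pair],
    score: Callable[[Pair], float],  # P(same entity), from any model
    hi: float = 0.9,
    lo: float = 0.1,
) -> tuple[list[Pair], list[Pair], list[Pair]]:
    """Split candidate pairs into auto-same, auto-different, and a review queue."""
    auto_same, auto_diff, uncertain = [], [], []
    for pair in pairs:
        s = score(pair)
        if s >= hi:
            auto_same.append(pair)        # a) confident merges
        elif s <= lo:
            auto_diff.append(pair)        # a) confident non-merges
        else:
            uncertain.append((abs(s - 0.5), pair))
    # b) most uncertain (closest to 0.5) first: most informative to label
    uncertain.sort(key=lambda item: item[0])
    return auto_same, auto_diff, [pair for _, pair in uncertain]
```

After the user labels items from the review queue (step c), the scoring model is re-fit on the enlarged judgement set and triage runs again, so the queue shrinks as the model improves.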

Some things I've pondered:

  • Using tf/idf on the tags to down-weight the super common ones (Russia in my corpus is not signal, it's a stopword). I tried implementing this, and it leads to more interesting co-occurrence patterns - but not impressively so. Especially for common names (like "Vladimir Putin"), the co-occurrence ends up being generic as well.
  • Maybe this could be a simple naive Bayes classifier (given co-occurring tags A, B, C, what's the likelihood of this being entity X)?
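The tf/idf idea in the first bullet could look roughly like this: compute an idf weight per tag over all articles, then use it to weight the co-occurrence overlap so that ubiquitous tags like Russia contribute nothing. A sketch under those assumptions, not StoryWeb code:

```python
import math
from collections import Counter

def idf_weights(article_tag_sets: list[set[str]]) -> dict[str, float]:
    """idf per tag across articles: a tag in every article gets weight 0."""
    n = len(article_tag_sets)
    df = Counter(tag for tags in article_tag_sets for tag in tags)
    return {tag: math.log(n / df[tag]) for tag in df}

def weighted_overlap(
    neighbours_a: set[str], neighbours_b: set[str], idf: dict[str, float]
) -> float:
    """idf-weighted Jaccard: shared rare co-occurrences count, stopword tags don't."""
    shared = neighbours_a & neighbours_b
    total = neighbours_a | neighbours_b
    denom = sum(idf.get(t, 0.0) for t in total)
    if denom == 0.0:
        return 0.0
    return sum(idf.get(t, 0.0) for t in shared) / denom
```

This matches the observation in the bullet: if two John Does only share Russia as a neighbour, the weighted score is 0, because a tag that occurs in every article has idf log(n/n) = 0.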

Stuff I want to avoid
