This repository was archived by the owner on Oct 26, 2023. It is now read-only.

Designing a disambiguation model #8

@pudo

Description


Progress

StoryWeb can now load articles, run them through spaCy NER and store the extracted entity tags (e.g. John Doe) in a database. Each tag is scoped to one article, i.e. identified as (article_id, tag, count_of_mentions, ...). There's also a database model that describes a link between two tags: A and B are the same, unrelated, or have some semantic link (e.g. family). Once two tags are connected by a same link, they are considered a cluster, i.e. they become essentially the same node in the graph.
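A minimal sketch of that data model in plain Python (the names Tag, Link and the judgement values are illustrative; the actual StoryWeb schema may differ). Clustering over same links then reduces to union-find:

```python
from dataclasses import dataclass

# Hypothetical model, not StoryWeb's actual schema.
@dataclass(frozen=True)
class Tag:
    article_id: str   # tags are scoped to one article for disambiguation
    label: str        # surface form, e.g. "John Doe"
    mentions: int     # count of mentions within that article

@dataclass(frozen=True)
class Link:
    source: Tag
    target: Tag
    judgement: str    # "same", "unrelated", or a semantic link like "family"

def same_cluster(links: list[Link], a: Tag, b: Tag) -> bool:
    """Two tags form one cluster if 'same' links (transitively) connect them."""
    parent: dict[Tag, Tag] = {}

    def find(t: Tag) -> Tag:
        # union-find with path halving
        parent.setdefault(t, t)
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for link in links:
        if link.judgement == "same":
            parent[find(link.source)] = find(link.target)
    return find(a) == find(b)
```

The point of modelling links rather than merges directly is that a "same" judgement is reversible: recomputing the clusters is just re-running union-find over the surviving links.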

There is also a small UI that lets users make those links manually, either between different tags in the same article or between tags with the same surface form across different articles.

[Screenshot: manual tag-linking UI, 2022-10-17]

The rationale for keeping tags constrained to one article is disambiguation: John Doe in article A may refer to a different individual than John Doe in article B.

Challenge

While disambiguation between different tags with the same surface form (e.g. two John Does) is needed, doing all of this manually is intensely annoying and not even good enough for a prototype. What I'd like to do is find a way to auto-decide the unambiguous cases, show the rest to the user, and refine further merges based on their input.

In my mind, the core evidence for making these decisions is co-occurrence: John Doe A co-occurs with Jane Doe and Italy; John Doe B co-occurs with MegaCorp Ltd. and State Prosecutor. I'm aware that this leaves a lot of signal - the body of the documents - on the table. This goes back to wanting to build an interactive, human-in-the-loop system for better precision: keep it down to something that is explainable, and where we can re-compute clustering proposals while the user is providing input (active learning).
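As a baseline for that co-occurrence evidence, a Jaccard overlap of the neighbour sets of two candidate tags is easy to compute and to explain to a user. This is a sketch, not StoryWeb code:

```python
def cooccurrence_similarity(neighbours_a: set[str], neighbours_b: set[str]) -> float:
    """Jaccard overlap of the tags each candidate co-occurs with.

    1.0 means identical co-occurrence context, 0.0 means no shared context.
    """
    if not neighbours_a or not neighbours_b:
        return 0.0
    shared = neighbours_a & neighbours_b
    total = neighbours_a | neighbours_b
    return len(shared) / len(total)
```

For the example above, John Doe A ({Jane Doe, Italy}) and John Doe B ({MegaCorp Ltd., State Prosecutor}) share nothing, so the score is 0.0 and the pair would not be auto-merged.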

But I'm kind of stuck on this: how do I take the co-occurrence sets, model them into an input for some very simple machine learning model, and get back both a set of judgements and a confidence score for each? I would then a) accept the decisions the system is confident about, b) show the most informative uncertain ones to a user to judge by hand, and c) re-train the model with these additional judgements.
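The a)/b)/c) loop could be sketched as a triage step: given any pairwise scoring function that returns a same-entity probability, auto-decide the confident pairs and queue the rest for human review, most uncertain first (plain uncertainty sampling). Function names and thresholds here are illustrative assumptions:

```python
from typing import Callable, Iterable, TypeVar

Pair = TypeVar("Pair")

def triage(
    pairs: Iterable[Pair],
    score: Callable[[Pair], float],  # P(same entity), from any model
    hi: float = 0.9,
    lo: float = 0.1,
) -> tuple[list[Pair], list[Pair], list[Pair]]:
    """Split candidate pairs into auto-same, auto-different, and a review queue."""
    auto_same, auto_diff, uncertain = [], [], []
    for pair in pairs:
        s = score(pair)
        if s >= hi:
            auto_same.append(pair)        # a) confident merges
        elif s <= lo:
            auto_diff.append(pair)        # a) confident non-merges
        else:
            uncertain.append((abs(s - 0.5), pair))
    # b) most uncertain (closest to 0.5) first: most informative to label
    uncertain.sort(key=lambda item: item[0])
    return auto_same, auto_diff, [pair for _, pair in uncertain]
```

After the user labels items from the review queue (step c), the scoring model is re-fit on the enlarged judgement set and triage runs again, so the queue shrinks as the model improves.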

Some things I've pondered:

  • Using tf/idf on the tags to down-weight the super common ones (Russia in my corpus is not signal, it's a stopword). I tried implementing this, and it leads to more interesting co-occurrence patterns - but not impressively so. Especially for common names (like "Vladimir Putin"), the co-occurrence ends up being generic as well.
  • Maybe this could be a simple naive Bayes classifier (given co-occurring tags A, B, C, what's the likelihood of this being entity X)?
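The tf/idf idea in the first bullet could look roughly like this: compute an idf weight per tag over all articles, then use it to weight the co-occurrence overlap so that ubiquitous tags like Russia contribute nothing. A sketch under those assumptions, not StoryWeb code:

```python
import math
from collections import Counter

def idf_weights(article_tag_sets: list[set[str]]) -> dict[str, float]:
    """idf per tag across articles: a tag in every article gets weight 0."""
    n = len(article_tag_sets)
    df = Counter(tag for tags in article_tag_sets for tag in tags)
    return {tag: math.log(n / df[tag]) for tag in df}

def weighted_overlap(
    neighbours_a: set[str], neighbours_b: set[str], idf: dict[str, float]
) -> float:
    """idf-weighted Jaccard: shared rare co-occurrences count, stopword tags don't."""
    shared = neighbours_a & neighbours_b
    total = neighbours_a | neighbours_b
    denom = sum(idf.get(t, 0.0) for t in total)
    if denom == 0.0:
        return 0.0
    return sum(idf.get(t, 0.0) for t in shared) / denom
```

This matches the observation in the bullet: if two John Does only share Russia as a neighbour, the weighted score is 0, because a tag that occurs in every article has idf log(n/n) = 0.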

Stuff I want to avoid
