Progress
StoryWeb can now load articles, run them through spaCy NER, and store the extracted entity tags (e.g. John Doe) in a database. In that database, each tag is identified per article, i.e. (article_id, tag, count_of_mentions, ...). There's also a database model that describes a link between two tags (A and B are the same, unrelated, or have some semantic link, e.g. family). Once two tags are joined by a same link, they are considered a cluster, i.e. they become essentially the same node in the graph.
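To make the data model concrete, here's a minimal sketch of the two tables described above as plain dataclasses. All names (`Tag`, `TagLink`, the judgement labels) are hypothetical; the actual StoryWeb schema may differ.

```python
from dataclasses import dataclass

@dataclass
class Tag:
    article_id: str         # tags are scoped to a single article
    tag: str                # surface form, e.g. "John Doe"
    count_of_mentions: int  # how often the entity is mentioned in the article

@dataclass
class TagLink:
    left: tuple[str, str]   # (article_id, tag) of tag A
    right: tuple[str, str]  # (article_id, tag) of tag B
    judgement: str          # "same" | "unrelated" | "family" | ...

# Two tags joined by a "same" link collapse into one cluster node:
link = TagLink(("a1", "John Doe"), ("a2", "John Doe"), "same")
```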
There is also a small UI that lets users make those links manually - both between different tags in the same article, or tags with the same surface form across different articles.

The rationale for keeping tags constrained to one article is disambiguation: John Doe in article A may refer to a different individual than John Doe in article B.
Challenge
While disambiguation between different tags with the same surface form (e.g. two John Does) is needed, doing the whole thing manually is intensely annoying and not good enough even for a prototype. What I'd like to do is find a way to auto-decide the unambiguous cases, then show the rest to the user and refine further merges based on their input.
In my mind, the core evidence for making these decisions is co-occurrence: John Doe A co-occurs with Jane Doe and Italy; John Doe B co-occurs with MegaCorp Ltd. and State Prosecutor. I'm aware that this would leave a lot of signal - the body of the documents - on the table. This goes back to wanting to build an interactive, human-in-the-loop system for better precision: keep it down to something that's explainable and where we can re-compute clustering proposals while the user is providing input (active learning).
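The co-occurrence idea can be sketched directly from the (article_id, tag) rows. This is a toy illustration, not the StoryWeb code: it builds each tag's per-article neighbourhood and compares two candidate tags by Jaccard overlap of those sets.

```python
from collections import defaultdict

# Toy rows of (article_id, tag), mirroring the example above.
rows = [
    ("a1", "John Doe"), ("a1", "Jane Doe"), ("a1", "Italy"),
    ("a2", "John Doe"), ("a2", "MegaCorp Ltd."), ("a2", "State Prosecutor"),
]

by_article = defaultdict(set)
for article_id, tag in rows:
    by_article[article_id].add(tag)

def cooccurrence(article_id, tag):
    """All other tags mentioned in the same article."""
    return by_article[article_id] - {tag}

def jaccard(a, b):
    """Overlap of two co-occurrence sets: 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

a = cooccurrence("a1", "John Doe")  # {"Jane Doe", "Italy"}
b = cooccurrence("a2", "John Doe")  # {"MegaCorp Ltd.", "State Prosecutor"}
score = jaccard(a, b)               # 0.0: no shared context, likely different people
```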
But I'm kind of stuck on this: how do I take the co-occurrence sets, turn them into input for some very simple machine learning model, and get back both a set of judgements and a confidence score for each, so that I can then a) auto-decide the ones the system is confident about, b) show the most informative uncertain ones to a user to judge by hand, and c) re-train the model with these additional judgements?
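The decide/ask/retrain loop itself is simple to sketch, whatever model ends up producing the probabilities. Here's a minimal, hypothetical triage step: assume some `predict_proba` that returns P(same) for a candidate pair, auto-decide outside two assumed thresholds, and queue the rest for review, most uncertain first (probability closest to 0.5).

```python
AUTO_HI, AUTO_LO = 0.9, 0.1  # assumed confidence thresholds, to be tuned

def triage(pairs, predict_proba):
    """Split candidate pairs into auto-decided and to-review queues."""
    auto, review = [], []
    for pair in pairs:
        p = predict_proba(pair)
        if p >= AUTO_HI:
            auto.append((pair, "same"))
        elif p <= AUTO_LO:
            auto.append((pair, "unrelated"))
        else:
            review.append((pair, p))
    # Most informative first: probability closest to 0.5.
    review.sort(key=lambda item: abs(item[1] - 0.5))
    return auto, review

# Toy model: a lookup table standing in for a real classifier.
probs = {"p1": 0.95, "p2": 0.55, "p3": 0.05}
auto, review = triage(["p1", "p2", "p3"], probs.get)
# p1 and p3 are auto-decided; p2 goes to the user first
```

After the user judges the queued pairs, their answers go back into the training set and the loop runs again, which is the active-learning part.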
Some things I've pondered:
- Using tf/idf on the tags to score down the super common ones (Russia in my corpus is not signal, it's a stopword). I tried implementing that, and it leads to more interesting co-occurrence patterns - but not impressively so. Especially for common names (like "Vladimir Putin") the co-occurrence ends up generic as well.
- Maybe this could be a simple Bayes classifier (given that a tag co-occurs with A, B, and C, what's the likelihood of it being entity X?)?
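The two ideas above compose naturally: down-weight ubiquitous tags by inverse document frequency, then score the overlap of two candidate tags' contexts by summed IDF rather than raw counts (that weighted score could in turn become a feature for a Bayes classifier). A toy sketch over a hypothetical three-article corpus:

```python
import math
from collections import Counter

# Hypothetical corpus: article_id -> set of tags.
articles = {
    "a1": {"John Doe", "Jane Doe", "Russia"},
    "a2": {"John Doe", "MegaCorp Ltd.", "Russia"},
    "a3": {"Russia", "Kremlin"},
}

df = Counter(tag for tags in articles.values() for tag in tags)
n_docs = len(articles)

def idf(tag):
    """Rare tags score high; corpus-wide 'stopword' tags score ~0."""
    return math.log(n_docs / df[tag])

def weighted_overlap(article_a, article_b, tag):
    """IDF-weighted shared context of the same surface form in two articles."""
    shared = (articles[article_a] - {tag}) & (articles[article_b] - {tag})
    return sum(idf(t) for t in shared)

score = weighted_overlap("a1", "a2", "John Doe")
# "Russia" appears in every article, so idf("Russia") == 0.0 and it
# contributes no signal - exactly the stopword behaviour described above.
```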
Stuff I want to avoid
- I'd really like to avoid some sort of article-content-based mystery vectorisation (e.g. BERT), unless there's a really nice and reproducible way of productising this. However, that's what a lot of the literature is pushing.
- Very keen to avoid using an external knowledge base to disambiguate, because the entities we're most interested in are the ones that would not yet be recorded and identified in a KB like Wikidata - the people who work for oligarchs, kleptocrats, etc.