Wikipedia Recommender System

Recommender systems are a cornerstone of modern information retrieval, providing personalized suggestions to users. This project implements a recommender system for Wikipedia, one of the largest and most frequently accessed online encyclopedias.

The system recommends relevant Wikipedia articles based on the content of a selected page. By leveraging Wikipedia's hyperlink structure and textual content, it identifies articles with thematic or topical similarities, making it a valuable tool for education, research, and knowledge exploration.

The system scrapes article text, preprocesses it, and represents each article numerically with the Term Frequency-Inverse Document Frequency (TF-IDF) method. Recommendations are generated by computing the cosine similarity between the input article and a dataset of pre-scraped pages.
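
The TF-IDF and cosine-similarity idea can be illustrated with a small, self-contained sketch. This is plain Python standing in for the package's internal pipeline, not its actual code; the documents and helper names are purely illustrative:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenised documents into sparse TF-IDF vectors (term -> weight dicts)."""
    n = len(docs)
    df = Counter()                       # in how many documents each term appears
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    ["python", "programming", "language"],    # the "input" article
    ["python", "code", "programming"],        # thematically close
    ["wikipedia", "online", "encyclopedia"],  # unrelated
]
vectors = tfidf_vectors(docs)
sims = [cosine(vectors[0], v) for v in vectors]
best = max(range(1, len(docs)), key=lambda i: sims[i])
print(best, sims)  # the second document shares terms with the first, so it ranks closest
```

The third document shares no terms with the first, so its similarity is exactly zero; the second shares "python" and "programming" and comes out on top.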

This repository provides the full implementation, enabling users to explore and extend the recommender system for their needs. For more information, please refer to our report.

Setup

To get started with the Wikipedia Recommender System project, follow these steps:

1. Clone the repository

First, clone the repository to your local machine:

git clone https://github.com/MichalRedm/wikipedia-recommender-system.git
cd wikipedia-recommender-system

2. Create a Python Virtual Environment

It’s recommended to use a virtual environment to manage dependencies and avoid conflicts with other projects. Here’s how to create and activate a virtual environment:

On Windows:

python -m venv venv
venv\Scripts\activate

On macOS/Linux:

python3 -m venv venv
source venv/bin/activate

3. Install the package

Since the package is not published on PyPI, install it directly from the local source:

pip install .

This will install the Wikipedia Recommender System package along with its dependencies.

4. Verify installation

To verify that the package has been installed correctly, try running Python and importing the package:

python
>>> from wikirecommender import WikipediaRecommender
>>> print(WikipediaRecommender)

You should see output like `<class 'wikirecommender.WikipediaRecommender'>`, indicating the package is ready to use.

Usage

Here’s a guide on how to use the WikipediaRecommender class to load articles, generate recommendations, and save or load the recommender system.

Basic Example

from wikirecommender import WikipediaRecommender

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Instantiate the recommender
recommender = WikipediaRecommender()

# Load articles into the recommender (default: 20 articles from the Wikipedia Popular Pages)
recommender.load_articles()

# Compare a new article to the dataset
recommendations = recommender.recommend(url)

# Show top 5 recommendations
print(recommendations.head())

Customizing Article Loading

You can control the number of pages to scrape and the starting point for article scraping.

# Load 100 articles starting from a specific URL
recommender.load_articles(page_count=100, start_link="https://en.wikipedia.org/wiki/Main_Page")

Handling Multiple URLs

If you want to recommend articles for multiple Wikipedia pages simultaneously:

urls = [
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://en.wikipedia.org/wiki/Artificial_intelligence"
]

# Generate recommendations for multiple pages
recommendations = recommender.recommend(urls)

# Display the top 10 recommendations
print(recommendations.head(10))

Including or Excluding Provided URLs

By default, the recommendations exclude the URLs provided as input. You can include them if needed:

# Include the provided URL in the recommendations
recommendations = recommender.recommend(url, include_provided_urls=True)

# Display the top 5 recommendations, including the provided URL
print(recommendations.head(5))

Saving and Loading the Recommender System

You can save the recommender system to a file and reload it later for reuse.

Saving to a File

# Save the current state to a file
recommender.save_to_file("recommender.csv")

Loading from a File

from wikirecommender import WikipediaRecommender

# Load a previously saved recommender system
recommender = WikipediaRecommender.load_from_file("recommender.csv")

Advanced Use: Scraping and Processing Articles

The WikipediaRecommender class uses internal methods for scraping and processing articles. For example:

  • load_articles scrapes Wikipedia articles starting from a given URL.
  • stemmer processes article text into stemmed tokens.
  • TF-IDF Representation is used to calculate similarities between articles.

Together, these steps turn raw article text into comparable numeric vectors, which is what makes the recommendations content-based.
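
The stemming step can be sketched as follows. This is a toy suffix-stripper for illustration only, not the NLTK stemmer the package actually relies on:

```python
import re

def simple_stem(token: str) -> str:
    # Toy suffix stripper standing in for a real stemmer (e.g. NLTK's Porter
    # stemmer): drop a common suffix when enough of the word remains.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenise on alphabetic runs, and stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [simple_stem(t) for t in tokens]

result = preprocess("Recommending articles using stemmed tokens")
print(result)  # ['recommend', 'articl', 'using', 'stemm', 'token']
```

Collapsing inflected forms this way means "recommend", "recommending", and "recommended" all count as the same term when the TF-IDF vectors are built.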

Scripts

The ./scripts directory contains a few command-line scripts built on top of WikipediaRecommender for more general use.

create_recommender.py

Creates an instance of WikipediaRecommender and saves it into a .csv file. The user must provide the name of the output file. By default, it loads articles from 1000 pages starting from the Wikipedia Popular Pages. The starting URL and the number of pages to load can be customized using optional arguments. Usage:

python -m scripts.create_recommender <output_file_name> [--page-count <number_of_pages>] [--start-link <starting_url>]

  • <output_file_name>: Required. The name of the output file to save the recommender.
  • --page-count: Optional. The number of pages to load for the recommender (default: 1000).
  • --start-link: Optional. The starting Wikipedia article URL (default: https://en.wikipedia.org/wiki/Wikipedia:Popular_pages).

Example:

python -m scripts.create_recommender recommender.csv --page-count 500 --start-link https://en.wikipedia.org/wiki/Main_Page

recommend.py

Loads an instance of WikipediaRecommender from a .csv file (name provided by the user) and uses it to find the most similar documents to the one(s) for which a URL is provided. By default, it returns the top 5 recommendations but allows customization. Users can optionally include the provided URLs in the recommendations. Usage:

python -m scripts.recommend <recommender_file_name> <wikipedia_url> [<wikipedia_url> ...] [--top <number_of_recommendations>] [--include-provided]

  • <recommender_file_name>: Required. The name of the file containing the recommender system.
  • <wikipedia_url>: Required. One or more Wikipedia article URLs to generate recommendations for.
  • --top: Optional. The number of recommendations to display (default: 5).
  • --include-provided: Optional. Include the provided URLs in the recommendations.

Example:

python -m scripts.recommend recommender.csv https://en.wikipedia.org/wiki/Python_(programming_language) --top 10 --include-provided

Troubleshooting

While testing the recommender system, we ran into one issue: the NLTK resources used when processing scraped articles should download automatically when the package is loaded, but occasionally this does not happen. In that case, the following code can be used:

from wikirecommender.utils import download_nltk_resources

download_nltk_resources()
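
If even that helper is unavailable, the resources can usually be fetched directly through NLTK itself. The resource names below (`punkt` and `stopwords`) are an assumption on our part; check `wikirecommender.utils` for the list the package actually uses:

```python
import nltk

# Fetch the corpora/models the recommender needs.
# The names here are assumed, not confirmed by the package; adjust as required.
nltk.download("punkt")
nltk.download("stopwords")
```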
