This project is my final-year major project: a chatbot that answers user queries based on articles, reports, and news scraped from user-provided URLs. Multilingual support and graphical analysis of tabular data are currently in development.
The primary objective of this project is to significantly improve article analysis by extracting critical insights quickly and accurately. The project demonstrates advanced capabilities in web scraping, natural language processing (NLP), and data visualization.
- Efficient Web Scraping:
  - Scrapes content from user-provided URLs, including JavaScript-heavy websites, using Selenium.
  - Extracts text from images in articles using OCR with Tesseract.
- Data Preprocessing:
  - Applies basic preprocessing to the scraped data before analysis.
  - Splits text recursively with LangChain to keep chunk sizes manageable.
- Generative AI for Querying:
  - Uses Google Generative AI embeddings to convert text into fixed-length vectors.
  - Stores processed content in a FAISS Vector Store for efficient similarity searches.
  - Uses a LLaMA text-generation model to produce context-relevant answers from the similarity search results.
- Multilingual Support (in development):
  - Provides responses in multiple languages, making the chatbot accessible to a wider range of users.
- Graphical Analysis (in development):
  - Analyzes tabular data and generates graphical visualizations to present insights in an intuitive format.
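To make the recursive text splitting feature above more concrete, here is a minimal standard-library sketch of the idea. It is not this project's code and not LangChain's implementation: the function name `recursive_split`, the separator list, and the default chunk size are assumptions chosen for illustration, and chunk overlap is left out for brevity.

```python
from typing import List, Tuple


def recursive_split(
    text: str,
    chunk_size: int = 200,
    separators: Tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Coarse separators (paragraphs) are tried first; finer ones are used
    only for pieces that are still too long.
    """
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, chunk_size, rest)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


sample = "alpha beta gamma delta\n\nepsilon zeta eta theta iota kappa"
sample_chunks = recursive_split(sample, chunk_size=25)
```

Splitting on paragraph boundaries first tends to keep related sentences in the same chunk, which is the usual motivation for the recursive approach over a fixed-width cut.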
This project was designed with the following considerations:
- Enable efficient and comprehensive data extraction from user-provided URLs, including:
  - Dynamic content (JavaScript-heavy websites).
  - Embedded image-based text using OCR techniques.
- Enhance the quality of responses by preprocessing the scraped text and using LangChain for effective text chunking.
- Use Google Generative AI embeddings for robust similarity search and LLaMA for context-aware responses.
- Support multilingual interaction and graphical insights for tabular data, making the chatbot a versatile tool for analysis.
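The similarity-search step mentioned above can be sketched with the standard library alone. This is an illustration under loud assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for the Google Generative AI embeddings (it produces sparse word-count vectors, not fixed-length dense ones), and the brute-force `top_k` scan stands in for a FAISS index. Only the overall flow (embed the chunks, embed the query, rank by similarity) reflects the pipeline described in this README.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for an embedding model: unit-length word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {w: v / norm for w, v in counts.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def top_k(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    "The central bank raised interest rates in March.",
    "The football team won the league title.",
    "Inflation and interest rates affect mortgage costs.",
]
query = "What happened to interest rates?"
results = top_k(query, chunks, k=2)
```

In the real pipeline the retrieved chunks would then be passed to the text generation model as context; a vector index matters mainly once the number of chunks is too large to scan one by one.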
- Python
- Streamlit (for hosting and front-end interface)
- Web Scraping: Selenium, BeautifulSoup
- OCR: Tesseract
- NLP: LangChain, LLaMA, FAISS Vector Store, Google Generative AI
- Data Visualization: Matplotlib, Seaborn, Plotly
- GitHub (Version Control)
- Streamlit Cloud (Hosting)
- Clone the Repository:
  git clone https://github.com/SMPY2002/CurioVeda---Powered-by-AI.git
  cd CurioVeda---Powered-by-AI
- Install Dependencies:
  pip install -r requirements.txt
- Run the Application:
  streamlit run app.py
- Usage:
  - Input the URLs containing articles/reports/news.
  - Query the chatbot in your preferred language.
  - View graphical insights for any tabular data provided.
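As a rough picture of what happens when a query is submitted, the sketch below assembles a prompt from retrieved chunks and a preferred answer language. The README does not show the project's actual prompt, so the function name `build_prompt`, the template wording, and the `answer_language` parameter are all hypothetical.

```python
from typing import List


def build_prompt(
    question: str,
    context_chunks: List[str],
    answer_language: str = "English",  # hypothetical parameter
) -> str:
    """Combine retrieved chunks and the user's question into one prompt string."""
    numbered = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n"
        f"Respond in {answer_language}.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What did the report conclude?",
    ["First retrieved chunk.", "Second retrieved chunk."],
    answer_language="Hindi",
)
```

Numbering the chunks makes it easier to ask the model to point at the passage it relied on, and instructing it to admit when the context is insufficient is a common way to reduce unsupported answers.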
- Add support for more advanced AI models and embedding techniques.
- Expand multilingual capabilities to include more languages.
- Integrate real-time streaming data analysis.
- Enhance graphical analysis features to include predictive insights.
- Optimize the backend for faster query responses and lower resource usage.
This project is licensed under the MIT License. Feel free to use, modify, and distribute this project as per the license terms.
For any queries or suggestions, please reach out via:
- Email: smpy1405@gmail.com
- LinkedIn: Shivam Pandey

