A Java-based application that uses LLMs (via LangChain4j and Ollama) to analyze data files (.csv and .tab) and identify variables containing location and time/date information.
The application scans a specified directory (recursively), reads each data file, and uses an LLM to determine if any of the columns represent:
- Location: Cities, countries, coordinates, addresses, latitude, longitude, etc.
- Time/Date: Years, months, timestamps, dates, durations, etc.
It outputs whether each file meets the requirements (contains both a location and a time variable).
- Java 17 or higher.
- Maven for building the project.
- Ollama installed and running locally (or accessible via network).
- Ensure you have the llama3.2 model (or your preferred model) pulled in Ollama:
      ollama pull llama3.2
The application is configured via src/main/resources/application.properties:
- ollama.url: The base URL for the Ollama API (default: http://localhost:11434).
- ollama.model: The LLM model to use (default: llama3.2).
- analyzer.search-root: The root directory to scan for data files (default: data).
    mvn clean compile
    mvn exec:java

Alternatively, if you've already compiled:

    mvn exec:java -Dexec.mainClass="edu.harvard.iq.datacommons.analyzer.Application"

You can override any property defined in application.properties by passing it as a system property (using -D) on the command line:

    java -Dollama.model=llama3 -Danalyzer.search-root=/path/to/data -jar target/data-commons-datafile-filter-1.0-SNAPSHOT-jar-with-dependencies.jar

Or when running via Maven:

    mvn exec:java -Dollama.model=llama3 -Danalyzer.search-root=/path/to/data

The supported properties are:
- ollama.url
- ollama.model
- analyzer.search-root
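
The override behavior described above can be sketched in plain Java. This is a minimal illustration, not the application's actual code: the class name ConfigSketch and the helper resolve are hypothetical, and it assumes a -D system property simply takes precedence over the value in application.properties, with the documented default as the last fallback.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical sketch: how a -D override could take precedence over application.properties.
class ConfigSketch {

    /** Resolves a key: system property first, then the properties file, then the default. */
    static String resolve(Properties fileProps, String key, String defaultValue) {
        String override = System.getProperty(key);
        if (override != null && !override.isBlank()) {
            return override;
        }
        return fileProps.getProperty(key, defaultValue);
    }

    /** Loads application.properties from the classpath; returns empty properties if it is absent. */
    static Properties loadFromClasspath() throws IOException {
        Properties props = new Properties();
        try (InputStream in = ConfigSketch.class.getResourceAsStream("/application.properties")) {
            if (in != null) {
                props.load(in);
            }
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = loadFromClasspath();
        // Defaults mirror the ones documented in this README.
        System.out.println("ollama.url = " + resolve(props, "ollama.url", "http://localhost:11434"));
        System.out.println("ollama.model = " + resolve(props, "ollama.model", "llama3.2"));
        System.out.println("analyzer.search-root = " + resolve(props, "analyzer.search-root", "data"));
    }
}
```

Running this sketch with -Dollama.model=llama3 would print llama3 for ollama.model while the other two keys fall back to the file or the defaults.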
- Scanning: The AnalyzerService walks the directory tree starting from analyzer.search-root.
- Parsing: For each .csv or .tab file, it reads the header and the first 5 rows of data.
- LLM Analysis: For each column, it sends a prompt to the Ollama model (using LangChain4j) containing the column label and sample values.
- Classification: The LLM responds with YES or NO to classify if the column represents a location or time/date.
- Results: The application prints the analysis results for each file to the console.
- Copying: If a file is identified as having both a location and a time/date variable, it is copied to a new directory named DataCommonsReady-<timestamp> (e.g. DataCommonsReady-20240315-103000).
  - The original directory structure is preserved within this directory.
  - For example, if data/subdir/file.csv meets the requirements, it will be copied to DataCommonsReady-<timestamp>/subdir/file.csv.
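
The scanning, parsing, and classification steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in AnalyzerService: the class PipelineSketch and all of its methods are hypothetical, the prompt wording is invented, .tab files are assumed to be tab-delimited, the split is naive (no quoted-field handling), and the LLM is represented by a plain Function<String, String> rather than a real LangChain4j model so the sketch stays self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the scan -> sample -> classify flow.
class PipelineSketch {

    static final int SAMPLE_ROWS = 5; // header plus the first 5 data rows

    /** Recursively collects .csv and .tab files under the search root. */
    static List<Path> findDataFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                        return name.endsWith(".csv") || name.endsWith(".tab");
                    })
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Reads the header and up to SAMPLE_ROWS rows (assumes tab-delimited .tab, comma-delimited .csv). */
    static List<String[]> readSample(Path file) throws IOException {
        boolean tab = file.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".tab");
        String delimiter = tab ? "\t" : ",";
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while (rows.size() < SAMPLE_ROWS + 1 && (line = reader.readLine()) != null) {
                rows.add(line.split(delimiter, -1));
            }
        }
        return rows;
    }

    /** Collects the sample values of one column, skipping the header row. */
    static List<String> columnValues(List<String[]> rows, int col) {
        List<String> values = new ArrayList<>();
        for (int r = 1; r < rows.size(); r++) {
            String[] row = rows.get(r);
            if (col < row.length) {
                values.add(row[col].trim());
            }
        }
        return values;
    }

    /** Builds an illustrative prompt; the real prompt text is not documented here. */
    static String buildPrompt(String kind, String label, List<String> samples) {
        return "Does the column '" + label + "' with sample values " + samples
                + " represent " + kind + " information? Answer YES or NO.";
    }

    /** Treats any reply starting with YES (case-insensitive) as a positive classification. */
    static boolean isYes(String reply) {
        return reply != null && reply.trim().toUpperCase(Locale.ROOT).startsWith("YES");
    }

    /** A file qualifies when some column is a location and some column is a time/date. */
    static boolean meetsRequirements(List<String[]> rows, Function<String, String> model) {
        if (rows.isEmpty()) {
            return false;
        }
        boolean hasLocation = false;
        boolean hasTime = false;
        String[] header = rows.get(0);
        for (int c = 0; c < header.length; c++) {
            String label = header[c].trim();
            List<String> samples = columnValues(rows, c);
            hasLocation |= isYes(model.apply(buildPrompt("location", label, samples)));
            hasTime |= isYes(model.apply(buildPrompt("time/date", label, samples)));
        }
        return hasLocation && hasTime;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real LLM call: always answers NO, so no file qualifies.
        Function<String, String> stubModel = prompt -> "NO";
        Path root = Path.of(args.length > 0 ? args[0] : "data"); // directory must exist
        for (Path file : findDataFiles(root)) {
            System.out.println(file + " -> " + meetsRequirements(readSample(file), stubModel));
        }
    }
}
```

Passing the model in as a function keeps the file handling testable without a running Ollama instance; in the real application that function would wrap the LangChain4j call.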
The DataCommonsReady-<timestamp> directory will contain:
- All files that are deemed "compliant" (containing both Location and Time data). The original directory structure is preserved.
- selection_log.txt: A log file detailing, for each selected file, all variables that contributed to its selection (location variables and time/date variables).
- DataCommonsReady-<timestamp>.log: The full application execution log.
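
How a compliant file might land in the output directory with its structure preserved can be sketched as follows. The class CopySketch and its methods are hypothetical, and the timestamp pattern yyyyMMdd-HHmmss is an assumption inferred from the example name DataCommonsReady-20240315-103000, not a confirmed detail of the application.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: copy a file into DataCommonsReady-<timestamp>, keeping its relative path.
class CopySketch {

    // Assumed pattern, inferred from the example DataCommonsReady-20240315-103000.
    static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    /** Builds the output directory name for a given moment. */
    static String outputDirName(LocalDateTime now) {
        return "DataCommonsReady-" + now.format(STAMP);
    }

    /** Maps a source file to its destination by keeping its path relative to the search root. */
    static Path targetFor(Path searchRoot, Path outputDir, Path sourceFile) {
        return outputDir.resolve(searchRoot.relativize(sourceFile));
    }

    /** Creates any missing parent directories, then copies the file. */
    static Path copyPreservingStructure(Path searchRoot, Path outputDir, Path sourceFile) throws IOException {
        Path target = targetFor(searchRoot, outputDir, sourceFile);
        Files.createDirectories(target.getParent());
        return Files.copy(sourceFile, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) {
        Path outputDir = Path.of(outputDirName(LocalDateTime.now()));
        // Only computes the destination; nothing is copied here.
        System.out.println(targetFor(Path.of("data"), outputDir, Path.of("data", "subdir", "file.csv")));
    }
}
```

The key step is Path.relativize, which turns data/subdir/file.csv into subdir/file.csv before it is resolved against the output directory.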
- src/main/java/edu/harvard/iq/datacommons/analyzer/Application.java: Main entry point.
- src/main/java/edu/harvard/iq/datacommons/analyzer/AnalyzerService.java: Core analysis logic.
- src/main/resources/application.properties: Configuration settings.
- pom.xml: Maven dependencies and build configuration.
- data/: Sample data directory.