Merged
75 changes: 75 additions & 0 deletions docs/audit/andalucia-geo-source-locator-v1071.md
@@ -0,0 +1,75 @@
# Andalucía geo source locator v0.10.7.1

This research report locates repository files that mention Andalucía/andalucia.

It does not import distributor data and does not modify datasets.

## Summary

- Candidate files found: **33**

## Candidate files by extension

| extension | files |
|---|---:|
| `.csv` | 8 |
| `.geojson` | 1 |
| `.html` | 2 |
| `.js` | 1 |
| `.json` | 2 |
| `.md` | 9 |
| `.py` | 7 |
| `.txt` | 3 |

## Candidate files

| file | bytes | dataset_id mentions | zone_id mentions | parsed feature/list count |
|---|---:|---:|---:|---:|
| `frontend/public/changelog.html` | 68096 | 2 | 7 | |
| `frontend/public/cobertura-distribuidoras.html` | 14273 | 0 | 0 | |
| `frontend/public/data/andalucia_municipios.geojson` | 7989369 | 786 | 786 | 786 |
| `frontend/public/data/distributor_hints.json` | 2625218 | 2610 | 2610 | |
| `frontend/src/data/distributor_hints.json` | 2625218 | 2610 | 2610 | |
| `frontend/src/geo/datasets.js` | 6515 | 6 | 0 | |
| `CHANGELOG.md` | 54870 | 3 | 9 | |
| `README.md` | 9069 | 1 | 1 | |
| `docs/audit/andalucia-distributor-pending-audit-v1070.md` | 1586 | 0 | 0 | |
| `docs/audit/distributor-coverage-snapshot-v1068.md` | 3345 | 1 | 0 | |
| `docs/audit/distributor-next-targets-v1069.md` | 3341 | 0 | 0 | |
| `docs/audit/distributor_hint_quality_audit.md` | 1742 | 0 | 0 | |
| `docs/research/distributor_coverage_matrix.md` | 4083 | 0 | 0 | |
| `docs/research/distributor_import_batches/andalucia_edistribucion_strong_lineowner_import.md` | 1638 | 0 | 0 | |
| `docs/research/distributor_import_batches/next_batches_plan.md` | 1751 | 0 | 0 | |
| `docs/research/distributor_regional_audits/edistribucion_coverage_candidates.csv` | 2381181 | 1 | 0 | |
| `docs/research/distributor_regional_audits/edistribucion_local_exception_hunt_by_dataset/andalucia.csv` | 479192 | 1 | 1 | |
| `docs/research/distributor_regional_audits/edistribucion_local_exception_hunt_by_province_v1024.csv` | 1782 | 1 | 0 | |
| `docs/research/distributor_regional_audits/edistribucion_local_exception_hunt_v1024.csv` | 1755408 | 1 | 1 | |
| `docs/research/distributor_regional_audits/edistribucion_local_exception_hunt_v1024_summary.txt` | 764 | 0 | 0 | |
| `docs/research/distributor_regional_audits/edistribucion_local_exception_search_queries_v1024.csv` | 28083 | 1 | 0 | |
| `docs/research/distributor_regional_audits/edistribucion_review_queue_by_dataset/andalucia.csv` | 349374 | 1 | 0 | |
| `docs/research/distributor_regional_audits/edistribucion_review_summary.txt` | 812 | 0 | 0 | |
| `docs/research/distributor_regional_audits/remaining_regional_distributor_candidates_v1023.csv` | 3323720 | 1 | 1 | |
| `docs/research/distributor_regional_audits/remaining_regional_distributor_candidates_v1023_summary.txt` | 1464 | 0 | 0 | |
| `docs/research/distributor_regional_audits/remaining_regional_review_queue_by_dataset/andalucia.csv` | 391875 | 1 | 1 | |
| `scripts/audit_geo_datasets.py` | 6221 | 18 | 16 | |
| `scripts/check_geo_dataset_provinces.py` | 4100 | 5 | 0 | |
| `scripts/check_spain_geo_coverage.py` | 6133 | 13 | 18 | |
| `scripts/generate_distributor_coverage_matrix.py` | 15919 | 23 | 6 | |
| `scripts/locate_andalucia_geo_sources.py` | 4842 | 4 | 3 | |
| `scripts/report_andalucia_distributor_pending.py` | 4160 | 1 | 0 | |
| `scripts/report_distributor_hint_coverage.py` | 4526 | 9 | 0 | |
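
Note that the `dataset_id mentions` column counts every occurrence of the string `dataset_id` in a file, not only the ones bound to Andalucía (see the `re.findall(r"dataset_id", ...)` call in the locator script). A narrower count could be sketched as follows, reusing the script's own binding pattern. The helper name `count_andalucia_dataset_ids` and the sample string are illustrative assumptions, not part of the repository.

```python
import re

# Same expression the locator script uses to detect an Andalucía dataset binding.
ANDALUCIA_DATASET_RE = re.compile(
    r"dataset_id['\"]?\s*[:=]\s*['\"]andalucia['\"]", re.IGNORECASE
)


def count_andalucia_dataset_ids(text: str) -> int:
    """Count dataset_id assignments whose value is 'andalucia' (hypothetical helper)."""
    return len(ANDALUCIA_DATASET_RE.findall(text))


# Illustrative input only: two Andalucía bindings and one unrelated dataset.
sample = (
    '{"dataset_id": "andalucia"}, {"dataset_id": "madrid"}, '
    "dataset_id = 'andalucia'"
)
print(count_andalucia_dataset_ids(sample))
```

A count like this would separate files that merely discuss `dataset_id` (such as the audit scripts) from files that actually carry Andalucía records.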

## Next step

Use this locator to identify the real Andalucía municipal source file before
building a sanitized pending review CSV.
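
One way to confirm which candidate is the municipal source is to inspect the property keys of a GeoJSON file without touching its geometry. The sketch below is a minimal example under stated assumptions: the function name `summarize_feature_properties` is hypothetical, and the sample property names (`zone_id`, `name`) and values are placeholders, not values read from the repository.

```python
import json
from collections import Counter


def summarize_feature_properties(data: dict) -> tuple[int, Counter]:
    """Return the feature count and how often each property key appears.

    Geometry is never read, so no coordinates leave this function.
    """
    features = data.get("features", [])
    keys: Counter = Counter()
    for feature in features:
        props = feature.get("properties") or {}
        keys.update(props.keys())
    return len(features), keys


# Placeholder FeatureCollection; a real run would json.load() the candidate file.
sample = json.loads(
    """
    {
      "type": "FeatureCollection",
      "features": [
        {"type": "Feature", "properties": {"zone_id": "a", "name": "A"}, "geometry": null},
        {"type": "Feature", "properties": {"zone_id": "b"}, "geometry": null}
      ]
    }
    """
)

count, keys = summarize_feature_properties(sample)
print(count, dict(keys))
```

A file whose feature count equals the number of features carrying a stable identifier key is a reasonable candidate for the municipal source, though that still needs manual confirmation.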

A future queue builder must only use repository-local public geography and
must keep these constraints:

- No CUPS.
- No addresses.
- No exact coordinates in the generated review queue.
- No customer data.
- No private grid inventory.
- No raw external API responses.
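
A future queue builder could enforce part of the constraints above with a header check before writing any CSV. This is only a sketch: the column names in `FORBIDDEN_COLUMNS` are assumptions chosen for illustration (the repository does not define such a list), and a name check cannot detect sensitive values hidden under an innocuous header.

```python
import csv
import io

# Assumed, illustrative header names; adjust to the real queue schema.
FORBIDDEN_COLUMNS = {
    "cups", "address", "direccion", "lat", "lon",
    "latitude", "longitude", "customer_id",
}


def find_forbidden_columns(fieldnames: list[str]) -> list[str]:
    """Return the headers whose normalized name is on the forbidden list."""
    return [
        name for name in fieldnames
        if name.strip().lower() in FORBIDDEN_COLUMNS
    ]


# Hypothetical review-queue content used only to exercise the check.
raw = "zone_id,municipio,CUPS,lat\nz1,Example,x,0\n"
reader = csv.DictReader(io.StringIO(raw))
bad = find_forbidden_columns(list(reader.fieldnames or []))
print(bad)
```

Exact matching on normalized names is used rather than substring matching, so a harmless header such as `population` is not flagged just because it contains `lat`.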
153 changes: 153 additions & 0 deletions scripts/locate_andalucia_geo_sources.py
@@ -0,0 +1,153 @@
#!/usr/bin/env python3
from __future__ import annotations

import json
import re
from collections import Counter
from pathlib import Path

OUT = Path("docs/audit/andalucia-geo-source-locator-v1071.md")

TEXT_EXTS = {
    ".js", ".jsx", ".ts", ".tsx", ".json", ".geojson", ".md",
    ".html", ".csv", ".txt", ".py",
}

SKIP_DIRS = {
    ".git", "node_modules", "dist", "build", ".venv", "venv",
    "__pycache__", ".pytest_cache",
}


def iter_files(root: Path):
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        if any(part in SKIP_DIRS for part in path.parts):
            continue
        if path.suffix.lower() not in TEXT_EXTS:
            continue
        yield path


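def safe_read(path: Path) -> str | None:
    """Read a file as UTF-8, dropping undecodable bytes; return None if unreadable."""
    try:
        return path.read_text(encoding="utf-8", errors="ignore")
    except OSError:
        # errors="ignore" already prevents decode failures, so only I/O errors remain.
        return None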


def count_geojson_features(path: Path, text: str) -> int | None:
    """Return the feature count of a GeoJSON-like dict, or the length of a JSON list.

    Returns None for non-JSON extensions, unparseable content, or other shapes.
    """
    if path.suffix.lower() not in {".json", ".geojson"}:
        return None
    try:
        data = json.loads(text)
    except Exception:
        return None

    if isinstance(data, dict) and isinstance(data.get("features"), list):
        return len(data["features"])
    if isinstance(data, list):
        return len(data)
    return None


def main() -> int:
    root = Path(".")
    hits = []

    patterns = [
        re.compile(r"andalucia", re.IGNORECASE),
        re.compile(r"andalucía", re.IGNORECASE),
        re.compile(r"dataset_id['\"]?\s*[:=]\s*['\"]andalucia['\"]", re.IGNORECASE),
    ]

    for path in iter_files(root):
        text = safe_read(path)
        if text is None:
            continue

        lower = text.lower()
        if "andalucia" not in lower and "andalucía" not in lower:
            continue

        matched = []
        for pat in patterns:
            if pat.search(text):
                matched.append(pat.pattern)

        dataset_mentions = len(re.findall(r"dataset_id", text, flags=re.IGNORECASE))
        zone_mentions = len(re.findall(r"zone_id", text, flags=re.IGNORECASE))
        feature_count = count_geojson_features(path, text)

        hits.append({
            "path": str(path),
            "size": path.stat().st_size,
            "matched": matched,
            "dataset_mentions": dataset_mentions,
            "zone_mentions": zone_mentions,
            "feature_count": feature_count,
        })

    hits.sort(key=lambda h: (0 if "frontend" in h["path"] else 1, h["path"]))

    by_ext = Counter(Path(h["path"]).suffix.lower() or "(none)" for h in hits)

    lines = []
    lines.append("# Andalucía geo source locator v0.10.7.1")
    lines.append("")
    lines.append("This research report locates repository files that mention Andalucía/andalucia.")
    lines.append("")
    lines.append("It does not import distributor data and does not modify datasets.")
    lines.append("")
    lines.append("## Summary")
    lines.append("")
    lines.append(f"- Candidate files found: **{len(hits)}**")
    lines.append("")
    lines.append("## Candidate files by extension")
    lines.append("")
    lines.append("| extension | files |")
    lines.append("|---|---:|")
    for ext, count in sorted(by_ext.items()):
        lines.append(f"| `{ext}` | {count} |")
    lines.append("")
    lines.append("## Candidate files")
    lines.append("")
    lines.append("| file | bytes | dataset_id mentions | zone_id mentions | parsed feature/list count |")
    lines.append("|---|---:|---:|---:|---:|")

    for h in hits[:200]:
        fc = "" if h["feature_count"] is None else str(h["feature_count"])
        lines.append(
            f"| `{h['path']}` | {h['size']} | {h['dataset_mentions']} | "
            f"{h['zone_mentions']} | {fc} |"
        )

    lines.append("")
    lines.append("## Next step")
    lines.append("")
    lines.append("Use this locator to identify the real Andalucía municipal source file before")
    lines.append("building a sanitized pending review CSV.")
    lines.append("")
    lines.append("A future queue builder must only use repository-local public geography and")
    lines.append("must keep these constraints:")
    lines.append("")
    lines.append("- No CUPS.")
    lines.append("- No addresses.")
    lines.append("- No exact coordinates in the generated review queue.")
    lines.append("- No customer data.")
    lines.append("- No private grid inventory.")
    lines.append("- No raw external API responses.")
    lines.append("")

    # Create docs/audit/ if it is missing so a fresh checkout does not fail here.
    OUT.parent.mkdir(parents=True, exist_ok=True)
    OUT.write_text("\n".join(lines), encoding="utf-8")

    print(f"OK wrote {OUT}")
    print(f"candidate_files={len(hits)}")
    for h in hits[:30]:
        print(
            f"- {h['path']} size={h['size']} dataset_id={h['dataset_mentions']} "
            f"zone_id={h['zone_mentions']} features={h['feature_count']}"
        )
    return 0


if __name__ == "__main__":
    raise SystemExit(main())