Skip to content

Extract Van Gogh Paintings#391

Open
aryrabelo wants to merge 15 commits into
serpapi:masterfrom
aryrabelo:solution-3
Open

Extract Van Gogh Paintings#391
aryrabelo wants to merge 15 commits into
serpapi:masterfrom
aryrabelo:solution-3

Conversation

@aryrabelo

@aryrabelo aryrabelo commented Jun 18, 2026

Copy link
Copy Markdown

Solution to the "Extract Van Gogh Paintings" challenge. Full writeup in SOLUTION.md.

lib/carousel_parser.rb parses the provided files/van-gogh-paintings.html with Nokogiri — no extra HTTP — and reproduces files/expected-array.json 47/47, field-for-field (name, extensions, link, image; inline base64 + gstatic thumbnails). Run it with bundle exec rspec (the suite includes the oracle check) or bin/extract files/van-gogh-paintings.html.

Approach. The carousel is scoped by Google's durable knowledge-graph data-attrid (not the hashed CSS classes that rotate per query). Each name comes from the image's alt text — Google's screen-reader label, which equals the title verbatim and survives Google reflowing the cell's div nesting; the structural leaf-text / aria-labelledby labels are the fallback. extensions holds the date when the item has one.

Tested against other carousels (per "test against 2 other similar result pages"): Monet (50), Picasso (45), Leonardo da Vinci (47), a pt-BR page (Tarsila do Amaral, 42), and a non-:works films carousel (Quentin Tarantino, 9) whose cells differ structurally (empty anchor, aria-labelledby title/year, sibling <img>), plus a no-carousel page that returns an empty array.

Ruby 3.x + Nokogiri + RSpec; suite is green (25 examples, 0 failures).

aryrabelo added 15 commits June 18, 2026 09:49
CarouselParser parses files/van-gogh-paintings.html and reproduces
expected-array.json 47/47 (name, extensions, link, image), with the inline
base64 thumbnails decoded. Structural, class-agnostic scoping generalizes across
carousels (Monet, Picasso, pt-BR Tarsila, a films carousel, and a no-carousel
page). Ruby + RSpec, as suggested.

See SOLUTION.md for the full writeup: CLI, an in-repo headless fetcher (Ferrum)
that captures real SERPs into VCR cassettes, anti-block rate guard + VPN/proxy,
security hardening, and Docker / mise / Lefthook / CI.
… runner

The "Offline oracle check" step ran `ruby bin/verify` with the system Ruby,
but bin/verify loads CarouselParser, which requires Nokogiri. On a clean CI
runner Nokogiri lives only in the cached bundle, so the bare invocation died
with `cannot load such file -- nokogiri (LoadError)` and the job exited 1
before RSpec even ran. Run the oracle through bundler so the gem resolves.
README.md is the upstream challenge prompt and GitHub renders it as the repo
landing page, so a reviewer saw the spec, not the solution. Add a one-line
banner linking SOLUTION.md. Also correct SOLUTION.md's status line and Layout
map, which undercounted the test fixtures (Tarsila pt-BR and the synthetic
films/no-carousel pages were missing).
picasso-paintings.html parsed identically to the already-committed
serp_fetcher/picasso_paintings VCR cassette, so the same SERP was stored twice.
Re-source the two specs that used the page (the cross-layout exact assertion and
the generalization example) from the cassette via VCR — same deterministic facts
(45 items, first "Guernica"), one copy of the HTML. Suite stays green
(43 examples, 0 failures, 1 pending); oracle 47/47. ~3.06MB removed.
…by layout)

The generic fallback was proven only by a synthetic films fixture; on a real
Google SERP a person's films carousel (kc:/people/person:movies) uses a
different DOM than the artist :works carousel — the <a> is empty, the title and
year are aria-labelledby spans outside it, and the thumbnail is a sibling <img>
(often with the data-URI already in src). The parser extracted 0 items from it.

Make extraction additive: carousel_anchor? also accepts an empty stick anchor
carrying aria-labelledby; item_labels reads name/subtitle from those spans when
the anchor has no inner leaf-text; image_of finds the thumbnail in the single
-item cell (bounded climb that can't borrow a neighbour's image). The artist
:works path is unchanged — Van Gogh stays 47/47.

Pin a real fetched Tarantino movies carousel (9 items, first "Pulp Fiction"
[1994]) in cross_layout_spec. Suite: 45 examples, 0 failures, 1 pending.
Synthesis of the structural and ponytail branches: keeps full coverage (Van Gogh
47/47, the Monet/Picasso/Tarsila works carousels, AND the real non-:works
aria-labelledby films carousel) but trims the 232-line parser to 109 by inlining
the scoping/security helpers and dropping the YAGNI scaffolding. Two small cell
branches — leaf-text + inner <img> for paintings, aria-labelledby + sibling <img>
for films — are the only complexity the films layout needs. Suite: 45 examples,
0 failures, 1 pending; oracle PERFECT 47/47.
A three-lens DRY/KISS/ponytail audit found the parser already near-minimal. Safe
wins applied: factor the inner-vs-cell image source resolution into one
candidate-list pick (the inner path still excludes the placeholder `src`, so the
47/47 oracle holds), collapse `clean` to an endless method, and drop stale
meta-comments. Confirmed irreducible: the image-branch asymmetry (merging is a
false-DRY trap that would emit the 1x1 placeholder GIF for all 47 paintings), the
3-tier container scoping (distinct predicates + priority), and the two label
sources. Oracle PERFECT 47/47; suite 45 examples, 0 failures, 1 pending.
…Api lens)

Reposition the writeup around the two competencies SerpApi actually screens for:
layout-resilient extraction (durable data-attrid scoping, exact 47/47 output,
real cross-layout generalization incl. a non-:works films carousel) and the
end-to-end fetch-and-serve pipeline (bin/extract --browser renders live Google
and serves the same {"artworks": [...]} JSON SerpApi serves clients), while the
graded core stays a no-extra-HTTP file parse. Also correct the stale 109 -> 100
line count after the DRY/KISS pass.
Drops the weakest part of the acquisition stack — the plain-HTTP SerpFetcher
(which never even returns the JS-rendered carousel) plus VCR/webmock and the
~3MB of cassettes the panel flagged as scope inflation against the "no extra
HTTP" rule. Keeps BrowserFetcher: the headless-Chrome render that serves live
Google end-to-end (bin/extract --browser) — the SerpApi competency worth showing.

The two cassette-backed generalization fixtures (Picasso 45/Guernica, da Vinci
47/Salvator Mundi) are extracted to committed HTML page fixtures and the specs
re-pointed at them, so cross-layout coverage is preserved without VCR. Removed
serp_fetcher_spec + serp_queries_spec, the cli --url path, and the SerpFetcher
SSRF block in robustness_spec (the parser's own output-safety tests stay). The
capture script now saves HTML page fixtures. Suite: 37 examples, 0 failures,
1 pending; oracle PERFECT 47/47.

The removed code stays in git history and lives on the `solution` branch.
The carousel <img> alt is Google's screen-reader label for the cell and
equals the artwork title verbatim (47/47 against the oracle, and present on
every real :works carousel). It is semantic, so it survives Google reflowing
the cell's div nesting, where the structural leaf-text label would break.

Use alt as the primary name source; fall back to the structural leaf-text /
aria-labelledby labels for cells whose <img> has no alt (the films carousel).
Date/subtitle still comes from the structural labels.
The graded core only needs to parse the saved SERP file (the brief forbids
extra HTTP). Move the optional live-fetch layer — BrowserFetcher (Ferrum
headless Chrome), RateGuard cross-process throttle, VPN/proxy support, and the
capture script — onto the solution-3-live-fetch branch, leaving the core a
small Nokogiri-only parser.

- remove lib/browser_fetcher.rb, lib/rate_guard.rb, script/capture_serps.rb
  and their specs; drop the ferrum dev dependency and refresh Gemfile.lock
- CLI is now local-file-only; drop the --browser/--url paths and RateGuard
  spec hook
- document the split in EXTRAS.md; trim SOLUTION.md/README accordingly

The cross-layout fixtures the script produced stay committed, so the suite
still exercises real Monet/Picasso/da Vinci/Tarsila/Tarantino pages offline.
The brief is a small parser plus its own tests against the given page and 2
other carousel pages. Remove scaffolding it never asked for: CI workflow,
Docker, mise, lefthook, gem packaging (gemspec + version + namespace), and the
extra bin tooling / EXTRAS pointer. Restore README.md to the upstream prompt
(my solution banner does not belong in the challenge's own file).

- de-gemify: SerpapiCodeChallenge::CarouselParser -> plain CarouselParser;
  Gemfile declares nokogiri + rspec directly; bin/extract calls the class
- readability pass (no behaviour change): split labels/image into named
  helpers (text_divs/leaf_text?/aria_labels, inner_thumbnail/sibling_thumbnail),
  replace the dense cell_image while-condition with wraps_other_cell?
- rubocop Metrics 3 offenses -> 0; flog avg 10.7 -> 6.5/method

Oracle still PERFECT 47/47; suite 25 examples, 0 failures.
A single command to verify correctness and complexity without reading the code:
- bundle exec rake   -> RSpec (incl. the 47/47 oracle) + RuboCop
- bundle exec rubocop-> complexity (Metrics) + likely bugs (Lint) only
- .rubocop.yml keeps the gate about substance (cosmetic Style/Layout off)

Gate is green: 25 examples, 0 failures; 8 files, no offenses.
CarouselParser was one class with several reasons to change. Extract cohesive
collaborators, leaving the parser as a thin orchestrator (public API unchanged,
so the suite is untouched):

- SectionLocator    — locate the carousel by its data-attrid priority
- ThumbnailResolver — inline base64 index + sanitized thumbnail URL resolution
- Cell + NestedCell / LinkedCell — one anchor -> one artwork, with the two DOM
  shapes as template-method subclasses (OCP: a new layout is a new subclass)

Adds an isolated ThumbnailResolver spec (the testability payoff). Gate green:
29 examples, 0 failures; RuboCop 12 files, 0 offenses.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant