Extract Van Gogh Paintings by aryrabelo · Pull Request #391 · serpapi/code-challenge

aryrabelo · 2026-06-18T18:09:42Z

Solution to the "Extract Van Gogh Paintings" challenge. Full writeup in SOLUTION.md.

lib/carousel_parser.rb parses the provided files/van-gogh-paintings.html with Nokogiri — no extra HTTP — and reproduces files/expected-array.json 47/47, field-for-field (name, extensions, link, image; inline base64 + gstatic thumbnails). Run it with bundle exec rspec (the suite includes the oracle check) or bin/extract files/van-gogh-paintings.html.

Approach. The carousel is scoped by Google's durable knowledge-graph data-attrid (not the hashed CSS classes that rotate per query). Each name comes from the image's alt text — Google's screen-reader label, which equals the title verbatim and survives Google reflowing the cell's div nesting; the structural leaf-text / aria-labelledby labels are the fallback. extensions holds the date when the item has one.

Tested against other carousels (per "test against 2 other similar result pages"): Monet (50), Picasso (45), Leonardo da Vinci (47), a pt-BR page (Tarsila do Amaral, 42), and a non-:works films carousel (Quentin Tarantino, 9) whose cells differ structurally (empty anchor, aria-labelledby title/year, sibling <img>), plus a no-carousel page that returns an empty array.

Ruby 3.x + Nokogiri + RSpec; suite is green (25 examples, 0 failures).

CarouselParser parses files/van-gogh-paintings.html and reproduces expected-array.json 47/47 (name, extensions, link, image), with the inline base64 thumbnails decoded. Structural, class-agnostic scoping generalizes across carousels (Monet, Picasso, pt-BR Tarsila, a films carousel, and a no-carousel page). Ruby + RSpec, as suggested. See SOLUTION.md for the full writeup: CLI, an in-repo headless fetcher (Ferrum) that captures real SERPs into VCR cassettes, anti-block rate guard + VPN/proxy, security hardening, and Docker / mise / Lefthook / CI.

… runner The "Offline oracle check" step ran `ruby bin/verify` with the system Ruby, but bin/verify loads CarouselParser, which requires Nokogiri. On a clean CI runner Nokogiri lives only in the cached bundle, so the bare invocation died with `cannot load such file -- nokogiri (LoadError)` and the job exited 1 before RSpec even ran. Run the oracle through bundler so the gem resolves.

README.md is the upstream challenge prompt and GitHub renders it as the repo landing page, so a reviewer saw the spec, not the solution. Add a one-line banner linking SOLUTION.md. Also correct SOLUTION.md's status line and Layout map, which undercounted the test fixtures (Tarsila pt-BR and the synthetic films/no-carousel pages were missing).

picasso-paintings.html parsed identically to the already-committed serp_fetcher/picasso_paintings VCR cassette, so the same SERP was stored twice. Re-source the two specs that used the page (the cross-layout exact assertion and the generalization example) from the cassette via VCR — same deterministic facts (45 items, first "Guernica"), one copy of the HTML. Suite stays green (43 examples, 0 failures, 1 pending); oracle 47/47. ~3.06MB removed.

…by layout) The generic fallback was proven only by a synthetic films fixture; on a real Google SERP a person's films carousel (kc:/people/person:movies) uses a different DOM than the artist :works carousel — the <a> is empty, the title and year are aria-labelledby spans outside it, and the thumbnail is a sibling <img> (often with the data-URI already in src). The parser extracted 0 items from it. Make extraction additive: carousel_anchor? also accepts an empty stick anchor carrying aria-labelledby; item_labels reads name/subtitle from those spans when the anchor has no inner leaf-text; image_of finds the thumbnail in the single -item cell (bounded climb that can't borrow a neighbour's image). The artist :works path is unchanged — Van Gogh stays 47/47. Pin a real fetched Tarantino movies carousel (9 items, first "Pulp Fiction" [1994]) in cross_layout_spec. Suite: 45 examples, 0 failures, 1 pending.

Synthesis of the structural and ponytail branches: keeps full coverage (Van Gogh 47/47, the Monet/Picasso/Tarsila works carousels, AND the real non-:works aria-labelledby films carousel) but trims the 232-line parser to 109 by inlining the scoping/security helpers and dropping the YAGNI scaffolding. Two small cell branches — leaf-text + inner <img> for paintings, aria-labelledby + sibling <img> for films — are the only complexity the films layout needs. Suite: 45 examples, 0 failures, 1 pending; oracle PERFECT 47/47.

A three-lens DRY/KISS/ponytail audit found the parser already near-minimal. Safe wins applied: factor the inner-vs-cell image source resolution into one candidate-list pick (the inner path still excludes the placeholder `src`, so the 47/47 oracle holds), collapse `clean` to an endless method, and drop stale meta-comments. Confirmed irreducible: the image-branch asymmetry (merging is a false-DRY trap that would emit the 1x1 placeholder GIF for all 47 paintings), the 3-tier container scoping (distinct predicates + priority), and the two label sources. Oracle PERFECT 47/47; suite 45 examples, 0 failures, 1 pending.

…Api lens) Reposition the writeup around the two competencies SerpApi actually screens for: layout-resilient extraction (durable data-attrid scoping, exact 47/47 output, real cross-layout generalization incl. a non-:works films carousel) and the end-to-end fetch-and-serve pipeline (bin/extract --browser renders live Google and serves the same {"artworks": [...]} JSON SerpApi serves clients), while the graded core stays a no-extra-HTTP file parse. Also correct the stale 109 -> 100 line count after the DRY/KISS pass.

Drops the weakest part of the acquisition stack — the plain-HTTP SerpFetcher (which never even returns the JS-rendered carousel) plus VCR/webmock and the ~3MB of cassettes the panel flagged as scope inflation against the "no extra HTTP" rule. Keeps BrowserFetcher: the headless-Chrome render that serves live Google end-to-end (bin/extract --browser) — the SerpApi competency worth showing. The two cassette-backed generalization fixtures (Picasso 45/Guernica, da Vinci 47/Salvator Mundi) are extracted to committed HTML page fixtures and the specs re-pointed at them, so cross-layout coverage is preserved without VCR. Removed serp_fetcher_spec + serp_queries_spec, the cli --url path, and the SerpFetcher SSRF block in robustness_spec (the parser's own output-safety tests stay). The capture script now saves HTML page fixtures. Suite: 37 examples, 0 failures, 1 pending; oracle PERFECT 47/47. The removed code stays in git history and lives on the `solution` branch.

The carousel <img> alt is Google's screen-reader label for the cell and equals the artwork title verbatim (47/47 against the oracle, and present on every real :works carousel). It is semantic, so it survives Google reflowing the cell's div nesting, where the structural leaf-text label would break. Use alt as the primary name source; fall back to the structural leaf-text / aria-labelledby labels for cells whose <img> has no alt (the films carousel). Date/subtitle still comes from the structural labels.

The graded core only needs to parse the saved SERP file (the brief forbids extra HTTP). Move the optional live-fetch layer — BrowserFetcher (Ferrum headless Chrome), RateGuard cross-process throttle, VPN/proxy support, and the capture script — onto the solution-3-live-fetch branch, leaving the core a small Nokogiri-only parser. - remove lib/browser_fetcher.rb, lib/rate_guard.rb, script/capture_serps.rb and their specs; drop the ferrum dev dependency and refresh Gemfile.lock - CLI is now local-file-only; drop the --browser/--url paths and RateGuard spec hook - document the split in EXTRAS.md; trim SOLUTION.md/README accordingly The cross-layout fixtures the script produced stay committed, so the suite still exercises real Monet/Picasso/da Vinci/Tarsila/Tarantino pages offline.

The brief is a small parser plus its own tests against the given page and 2 other carousel pages. Remove scaffolding it never asked for: CI workflow, Docker, mise, lefthook, gem packaging (gemspec + version + namespace), and the extra bin tooling / EXTRAS pointer. Restore README.md to the upstream prompt (my solution banner does not belong in the challenge's own file). - de-gemify: SerpapiCodeChallenge::CarouselParser -> plain CarouselParser; Gemfile declares nokogiri + rspec directly; bin/extract calls the class - readability pass (no behaviour change): split labels/image into named helpers (text_divs/leaf_text?/aria_labels, inner_thumbnail/sibling_thumbnail), replace the dense cell_image while-condition with wraps_other_cell? - rubocop Metrics 3 offenses -> 0; flog avg 10.7 -> 6.5/method Oracle still PERFECT 47/47; suite 25 examples, 0 failures.

A single command to verify correctness and complexity without reading the code: - bundle exec rake -> RSpec (incl. the 47/47 oracle) + RuboCop - bundle exec rubocop-> complexity (Metrics) + likely bugs (Lint) only - .rubocop.yml keeps the gate about substance (cosmetic Style/Layout off) Gate is green: 25 examples, 0 failures; 8 files, no offenses.

CarouselParser was one class with several reasons to change. Extract cohesive collaborators, leaving the parser as a thin orchestrator (public API unchanged, so the suite is untouched): - SectionLocator — locate the carousel by its data-attrid priority - ThumbnailResolver — inline base64 index + sanitized thumbnail URL resolution - Cell + NestedCell / LinkedCell — one anchor -> one artwork, with the two DOM shapes as template-method subclasses (OCP: a new layout is a new subclass) Adds an isolated ThumbnailResolver spec (the testability payoff). Gate green: 29 examples, 0 failures; RuboCop 12 files, 0 offenses.

aryrabelo added 15 commits June 18, 2026 09:49

docs: align SOLUTION.md with the trimmed scope

4eba2e0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extract Van Gogh Paintings#391

Extract Van Gogh Paintings#391
aryrabelo wants to merge 15 commits into
serpapi:masterfrom
aryrabelo:solution-3

aryrabelo commented Jun 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aryrabelo commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aryrabelo commented Jun 18, 2026 •

edited

Loading