Extract Van Gogh Paintings#391
Open
aryrabelo wants to merge 15 commits into
Open
Conversation
CarouselParser parses files/van-gogh-paintings.html and reproduces expected-array.json 47/47 (name, extensions, link, image), with the inline base64 thumbnails decoded. Structural, class-agnostic scoping generalizes across carousels (Monet, Picasso, pt-BR Tarsila, a films carousel, and a no-carousel page). Ruby + RSpec, as suggested. See SOLUTION.md for the full writeup: CLI, an in-repo headless fetcher (Ferrum) that captures real SERPs into VCR cassettes, anti-block rate guard + VPN/proxy, security hardening, and Docker / mise / Lefthook / CI.
… runner The "Offline oracle check" step ran `ruby bin/verify` with the system Ruby, but bin/verify loads CarouselParser, which requires Nokogiri. On a clean CI runner Nokogiri lives only in the cached bundle, so the bare invocation died with `cannot load such file -- nokogiri (LoadError)` and the job exited 1 before RSpec even ran. Run the oracle through bundler so the gem resolves.
README.md is the upstream challenge prompt and GitHub renders it as the repo landing page, so a reviewer saw the spec, not the solution. Add a one-line banner linking SOLUTION.md. Also correct SOLUTION.md's status line and Layout map, which undercounted the test fixtures (Tarsila pt-BR and the synthetic films/no-carousel pages were missing).
picasso-paintings.html parsed identically to the already-committed serp_fetcher/picasso_paintings VCR cassette, so the same SERP was stored twice. Re-source the two specs that used the page (the cross-layout exact assertion and the generalization example) from the cassette via VCR — same deterministic facts (45 items, first "Guernica"), one copy of the HTML. Suite stays green (43 examples, 0 failures, 1 pending); oracle 47/47. ~3.06MB removed.
…by layout) The generic fallback was proven only by a synthetic films fixture; on a real Google SERP a person's films carousel (kc:/people/person:movies) uses a different DOM than the artist :works carousel — the <a> is empty, the title and year are aria-labelledby spans outside it, and the thumbnail is a sibling <img> (often with the data-URI already in src). The parser extracted 0 items from it. Make extraction additive: carousel_anchor? also accepts an empty stick anchor carrying aria-labelledby; item_labels reads name/subtitle from those spans when the anchor has no inner leaf-text; image_of finds the thumbnail in the single -item cell (bounded climb that can't borrow a neighbour's image). The artist :works path is unchanged — Van Gogh stays 47/47. Pin a real fetched Tarantino movies carousel (9 items, first "Pulp Fiction" [1994]) in cross_layout_spec. Suite: 45 examples, 0 failures, 1 pending.
Synthesis of the structural and ponytail branches: keeps full coverage (Van Gogh 47/47, the Monet/Picasso/Tarsila works carousels, AND the real non-:works aria-labelledby films carousel) but trims the 232-line parser to 109 by inlining the scoping/security helpers and dropping the YAGNI scaffolding. Two small cell branches — leaf-text + inner <img> for paintings, aria-labelledby + sibling <img> for films — are the only complexity the films layout needs. Suite: 45 examples, 0 failures, 1 pending; oracle PERFECT 47/47.
A three-lens DRY/KISS/ponytail audit found the parser already near-minimal. Safe wins applied: factor the inner-vs-cell image source resolution into one candidate-list pick (the inner path still excludes the placeholder `src`, so the 47/47 oracle holds), collapse `clean` to an endless method, and drop stale meta-comments. Confirmed irreducible: the image-branch asymmetry (merging is a false-DRY trap that would emit the 1x1 placeholder GIF for all 47 paintings), the 3-tier container scoping (distinct predicates + priority), and the two label sources. Oracle PERFECT 47/47; suite 45 examples, 0 failures, 1 pending.
…Api lens)
Reposition the writeup around the two competencies SerpApi actually screens for:
layout-resilient extraction (durable data-attrid scoping, exact 47/47 output,
real cross-layout generalization incl. a non-:works films carousel) and the
end-to-end fetch-and-serve pipeline (bin/extract --browser renders live Google
and serves the same {"artworks": [...]} JSON SerpApi serves clients), while the
graded core stays a no-extra-HTTP file parse. Also correct the stale 109 -> 100
line count after the DRY/KISS pass.
Drops the weakest part of the acquisition stack — the plain-HTTP SerpFetcher (which never even returns the JS-rendered carousel) plus VCR/webmock and the ~3MB of cassettes the panel flagged as scope inflation against the "no extra HTTP" rule. Keeps BrowserFetcher: the headless-Chrome render that serves live Google end-to-end (bin/extract --browser) — the SerpApi competency worth showing. The two cassette-backed generalization fixtures (Picasso 45/Guernica, da Vinci 47/Salvator Mundi) are extracted to committed HTML page fixtures and the specs re-pointed at them, so cross-layout coverage is preserved without VCR. Removed serp_fetcher_spec + serp_queries_spec, the cli --url path, and the SerpFetcher SSRF block in robustness_spec (the parser's own output-safety tests stay). The capture script now saves HTML page fixtures. Suite: 37 examples, 0 failures, 1 pending; oracle PERFECT 47/47. The removed code stays in git history and lives on the `solution` branch.
The carousel <img> alt is Google's screen-reader label for the cell and equals the artwork title verbatim (47/47 against the oracle, and present on every real :works carousel). It is semantic, so it survives Google reflowing the cell's div nesting, where the structural leaf-text label would break. Use alt as the primary name source; fall back to the structural leaf-text / aria-labelledby labels for cells whose <img> has no alt (the films carousel). Date/subtitle still comes from the structural labels.
The graded core only needs to parse the saved SERP file (the brief forbids extra HTTP). Move the optional live-fetch layer — BrowserFetcher (Ferrum headless Chrome), RateGuard cross-process throttle, VPN/proxy support, and the capture script — onto the solution-3-live-fetch branch, leaving the core a small Nokogiri-only parser. - remove lib/browser_fetcher.rb, lib/rate_guard.rb, script/capture_serps.rb and their specs; drop the ferrum dev dependency and refresh Gemfile.lock - CLI is now local-file-only; drop the --browser/--url paths and RateGuard spec hook - document the split in EXTRAS.md; trim SOLUTION.md/README accordingly The cross-layout fixtures the script produced stay committed, so the suite still exercises real Monet/Picasso/da Vinci/Tarsila/Tarantino pages offline.
The brief is a small parser plus its own tests against the given page and 2 other carousel pages. Remove scaffolding it never asked for: CI workflow, Docker, mise, lefthook, gem packaging (gemspec + version + namespace), and the extra bin tooling / EXTRAS pointer. Restore README.md to the upstream prompt (my solution banner does not belong in the challenge's own file). - de-gemify: SerpapiCodeChallenge::CarouselParser -> plain CarouselParser; Gemfile declares nokogiri + rspec directly; bin/extract calls the class - readability pass (no behaviour change): split labels/image into named helpers (text_divs/leaf_text?/aria_labels, inner_thumbnail/sibling_thumbnail), replace the dense cell_image while-condition with wraps_other_cell? - rubocop Metrics 3 offenses -> 0; flog avg 10.7 -> 6.5/method Oracle still PERFECT 47/47; suite 25 examples, 0 failures.
A single command to verify correctness and complexity without reading the code: - bundle exec rake -> RSpec (incl. the 47/47 oracle) + RuboCop - bundle exec rubocop-> complexity (Metrics) + likely bugs (Lint) only - .rubocop.yml keeps the gate about substance (cosmetic Style/Layout off) Gate is green: 25 examples, 0 failures; 8 files, no offenses.
CarouselParser was one class with several reasons to change. Extract cohesive collaborators, leaving the parser as a thin orchestrator (public API unchanged, so the suite is untouched): - SectionLocator — locate the carousel by its data-attrid priority - ThumbnailResolver — inline base64 index + sanitized thumbnail URL resolution - Cell + NestedCell / LinkedCell — one anchor -> one artwork, with the two DOM shapes as template-method subclasses (OCP: a new layout is a new subclass) Adds an isolated ThumbnailResolver spec (the testability payoff). Gate green: 29 examples, 0 failures; RuboCop 12 files, 0 offenses.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Solution to the "Extract Van Gogh Paintings" challenge. Full writeup in SOLUTION.md.
lib/carousel_parser.rbparses the providedfiles/van-gogh-paintings.htmlwith Nokogiri — no extra HTTP — and reproducesfiles/expected-array.json47/47, field-for-field (name, extensions, link, image; inline base64 + gstatic thumbnails). Run it withbundle exec rspec(the suite includes the oracle check) orbin/extract files/van-gogh-paintings.html.Approach. The carousel is scoped by Google's durable knowledge-graph
data-attrid(not the hashed CSS classes that rotate per query). Each name comes from the image'salttext — Google's screen-reader label, which equals the title verbatim and survives Google reflowing the cell's div nesting; the structural leaf-text /aria-labelledbylabels are the fallback.extensionsholds the date when the item has one.Tested against other carousels (per "test against 2 other similar result pages"): Monet (50), Picasso (45), Leonardo da Vinci (47), a pt-BR page (Tarsila do Amaral, 42), and a non-
:worksfilms carousel (Quentin Tarantino, 9) whose cells differ structurally (empty anchor,aria-labelledbytitle/year, sibling<img>), plus a no-carousel page that returns an empty array.Ruby 3.x + Nokogiri + RSpec; suite is green (25 examples, 0 failures).