4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -49,3 +49,7 @@ build-iPhoneSimulator/
# unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
.rvmrc
.DS_Store

# editor / local toolchain
*.swp
.mise.local.toml
3 changes: 3 additions & 0 deletions .rspec
@@ -0,0 +1,3 @@
--require spec_helper
--format documentation
--color
27 changes: 27 additions & 0 deletions .rubocop.yml
@@ -0,0 +1,27 @@
# A runnable quality gate, not a cosmetic style nitpicker. It enforces the two
# things worth failing a build over: complexity (Metrics) and likely bugs (Lint).
# Run it with `bundle exec rubocop` (or `bundle exec rake`, which also runs RSpec).
AllCops:
  TargetRubyVersion: 3.3
  NewCops: disable
  SuggestExtensions: false

# Cosmetic layers are off on purpose; keep the gate about substance.
Style:
  Enabled: false
Layout:
  Enabled: false
Naming:
  Enabled: false

Metrics:
  Enabled: true

# RSpec describe/context blocks are legitimately long; don't count them.
Metrics/BlockLength:
  Exclude:
    - "spec/**/*"
    - "Rakefile"

Lint:
  Enabled: true
11 changes: 11 additions & 0 deletions Gemfile
@@ -0,0 +1,11 @@
# frozen_string_literal: true

source "https://rubygems.org"

gem "nokogiri", ">= 1.13"

group :test do
  gem "rake"
  gem "rspec", "~> 3.13"
  gem "rubocop", require: false
end
83 changes: 83 additions & 0 deletions Gemfile.lock
@@ -0,0 +1,83 @@
GEM
  remote: https://rubygems.org/
  specs:
    ast (2.4.3)
    diff-lcs (1.6.2)
    json (2.19.9)
    language_server-protocol (3.17.0.5)
    lint_roller (1.1.0)
    nokogiri (1.19.4-aarch64-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.19.4-aarch64-linux-musl)
      racc (~> 1.4)
    nokogiri (1.19.4-arm-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.19.4-arm-linux-musl)
      racc (~> 1.4)
    nokogiri (1.19.4-arm64-darwin)
      racc (~> 1.4)
    nokogiri (1.19.4-x86_64-darwin)
      racc (~> 1.4)
    nokogiri (1.19.4-x86_64-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.19.4-x86_64-linux-musl)
      racc (~> 1.4)
    parallel (2.1.0)
    parser (3.3.11.1)
      ast (~> 2.4.1)
      racc
    prism (1.9.0)
    racc (1.8.1)
    rainbow (3.1.1)
    rake (13.4.2)
    regexp_parser (2.12.0)
    rspec (3.13.2)
      rspec-core (~> 3.13.0)
      rspec-expectations (~> 3.13.0)
      rspec-mocks (~> 3.13.0)
    rspec-core (3.13.6)
      rspec-support (~> 3.13.0)
    rspec-expectations (3.13.5)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.13.0)
    rspec-mocks (3.13.8)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.13.0)
    rspec-support (3.13.7)
    rubocop (1.88.0)
      json (~> 2.3)
      language_server-protocol (~> 3.17.0.2)
      lint_roller (~> 1.1.0)
      parallel (>= 1.10)
      parser (>= 3.3.0.2)
      rainbow (>= 2.2.2, < 4.0)
      regexp_parser (>= 2.9.3, < 3.0)
      rubocop-ast (>= 1.49.0, < 2.0)
      ruby-progressbar (~> 1.7)
      unicode-display_width (>= 2.4.0, < 4.0)
    rubocop-ast (1.49.1)
      parser (>= 3.3.7.2)
      prism (~> 1.7)
    ruby-progressbar (1.13.0)
    unicode-display_width (3.2.0)
      unicode-emoji (~> 4.1)
    unicode-emoji (4.2.0)

PLATFORMS
  aarch64-linux-gnu
  aarch64-linux-musl
  arm-linux-gnu
  arm-linux-musl
  arm64-darwin
  x86_64-darwin
  x86_64-linux-gnu
  x86_64-linux-musl

DEPENDENCIES
  nokogiri (>= 1.13)
  rake
  rspec (~> 3.13)
  rubocop

BUNDLED WITH
   2.5.22
11 changes: 11 additions & 0 deletions Rakefile
@@ -0,0 +1,11 @@
# frozen_string_literal: true

require "rspec/core/rake_task"
require "rubocop/rake_task"

RSpec::Core::RakeTask.new(:spec)
RuboCop::RakeTask.new

# `rake` runs the whole gate: correctness (incl. the 47/47 oracle spec) + the
# complexity/bug checks.
task default: %i[spec rubocop]
90 changes: 90 additions & 0 deletions SOLUTION.md
@@ -0,0 +1,90 @@
# serpapi-code-challenge

Solution to the SerpApi "Extract Van Gogh Paintings" challenge.

> Challenge: <https://github.com/serpapi/code-challenge>
> Parse a saved Google SERP HTML page (no extra HTTP) and return the
> knowledge-graph artworks carousel as an array of `{ name, extensions, link, image }`.

`lib/carousel_parser.rb` parses the saved page with Nokogiri. It reproduces the
official `expected-array.json` exactly, 47 of 47, every field. It also runs on
other artists: Monet (50), Picasso (45), Leonardo da Vinci (47), a pt-BR page for
Tarsila do Amaral (42), and a person's films carousel (Tarantino, 9), which has a
different cell shape (empty anchor, title and year in `aria-labelledby` spans,
thumbnail in a sibling `<img>`).

## Run

```bash
bundle install
bundle exec rake # RSpec (incl. the 47/47 oracle) + RuboCop
bundle exec rspec # just the tests
bundle exec rubocop # just the complexity / bug check
bin/extract files/van-gogh-paintings.html # print the {"artworks": [...]} JSON
```

Ruby 3.x, Nokogiri, RSpec, RuboCop. Nothing hits the network; it reads the file.
The RuboCop config (`.rubocop.yml`) only checks complexity (Metrics) and likely
bugs (Lint), not cosmetic style.

## Approach

The carousel is found by its knowledge-graph `data-attrid`, not by CSS class
names, which are hashed and rotate per query. `SectionLocator` tries the exact
artist-works attrid (`kc:/visual_art/visual_artist:works`), then any `:works`,
then the first `kc:/<domain>/<type>:<collection>` section that holds `stick=`
anchors (a person's films or books). Picking exactly one section keeps the
carousels on a multi-carousel page from bleeding into each other.
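
The fallback order can be sketched without a DOM. This is only an illustration of the "most specific rule first" idea, not the real `SectionLocator`: the `Section` struct, the `COLLECTION` pattern, and `pick_section` are hypothetical names, and each section is modelled as just its `data-attrid` plus the hrefs of the anchors inside it.

```ruby
# Hypothetical stand-in for a DOM section: its data-attrid plus the hrefs of
# the anchors it contains. The real locator works on Nokogiri nodes instead.
Section = Struct.new(:attrid, :hrefs)

EXACT_WORKS = "kc:/visual_art/visual_artist:works"
# Assumed shape of a generic attrid: kc:/<domain>/<type>:<collection>
COLLECTION = %r{\Akc:/[^/:]+/[^/:]+:[^/:]+\z}

# True when at least one anchor in the section carries a stick= parameter.
def stick_anchors?(section)
  section.hrefs.any? { |href| href.include?("stick=") }
end

# Try the most specific rule first and fall through to looser ones.
# Returns nil when no section qualifies.
def pick_section(sections)
  sections.find { |s| s.attrid == EXACT_WORKS } ||
    sections.find { |s| s.attrid.end_with?(":works") } ||
    sections.find { |s| s.attrid.match?(COLLECTION) && stick_anchors?(s) }
end

sections = [
  Section.new("kc:/people/person:born", []),
  Section.new("kc:/people/person:movies", ["/search?q=Pulp+Fiction&stick=H4sI"]),
  Section.new("kc:/people/person:books", ["/search?q=Some+Book&stick=H4sI"])
]
puts pick_section(sections).attrid
```

Here the `born` section matches the generic pattern but holds no `stick=` anchors, so the first qualifying section is the films one, and the later books section is ignored.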

Each anchor becomes one item:

- name: the `<img>` `alt`. Google writes the title there as the screen-reader
  text, so it matches what's shown and doesn't move when Google reshuffles the
  surrounding divs. When an image has no `alt` (the films cell), the name falls
  back to the leaf-text divs or the `aria-labelledby` spans.
- extensions: the date next to the name (second leaf-text div, or second aria
  span). Omitted when the item has no date.
- link: the anchor's root-relative `/...` href with `https://www.google.com`
  prepended. Anything that isn't a root-relative Google link is dropped
  (`javascript:`, `data:`, and so on).
- image: the gstatic URL in `data-src`, or the base64 Google injects through
  `_setImagesSrc` scripts (the `\x3d`-style escapes get decoded back to `=`), or
  the data-URI in a films cell's sibling `<img>`. Only `https` URLs and raster
  data-URIs make it out; SVG and JavaScript URLs are dropped.
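
The link and image rules above can be sketched with plain string handling. This is a minimal illustration under assumptions, not the code in `ThumbnailResolver`: the helper names `absolute_link`, `unescape_hex`, and `safe_image` are hypothetical, and the list of raster types (jpeg, png, gif, webp) is a guess at what "raster" covers.

```ruby
GOOGLE = "https://www.google.com"

# Assumed allow-list of raster data-URI types; the real list may differ.
RASTER_DATA_URI = %r{\Adata:image/(?:jpeg|png|gif|webp);base64,}

# Root-relative hrefs get the Google origin; anything else yields nil.
def absolute_link(href)
  href = href.to_s
  GOOGLE + href if href.start_with?("/")
end

# Turn JavaScript-style \xNN escapes (for example \x3d) back into characters.
def unescape_hex(str)
  str.gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr(Encoding::UTF_8) }
end

# Allow-list: https URLs and raster data-URIs only; everything else is nil.
def safe_image(url)
  url = url.to_s
  url if url.start_with?("https://") || url.match?(RASTER_DATA_URI)
end

puts absolute_link("/search?q=The+Starry+Night&stick=H4sI")
puts absolute_link("javascript:alert(1)").inspect
puts unescape_hex('data:image/jpeg;base64,AAAA\x3d\x3d')
puts safe_image("data:image/svg+xml;base64,PHN2Zz4=").inspect
```

Both filters are allow-lists rather than block-lists, so an unexpected scheme is dropped by default instead of needing its own rule.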

A few defensive bits, since the input is a real Google page: bad UTF-8 is
scrubbed before parsing, libxml2's node-size cap is raised (`huge`) so a big node
ahead of the carousel can't cut the DOM short, and the inline-image regex uses
possessive quantifiers so it won't backtrack on adversarial input.
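
Two of these defences are easy to show in isolation. The sketch below is illustrative only: `clean_utf8` is a hypothetical helper name, and `INLINE_IMAGE` is an assumed pattern written to demonstrate possessive quantifiers, not the regex the parser actually uses.

```ruby
# Invalid bytes are replaced with U+FFFD so later string operations and the
# HTML parser see a valid UTF-8 string.
def clean_utf8(raw)
  raw.to_s.dup.force_encoding("UTF-8").scrub
end

# Possessive quantifiers (++ and *+) never give matched characters back, so a
# long run that cannot complete the pattern fails without retrying at every
# shorter length. Assumed shape: a base64 data-URI whose padding may still be
# written as \x3d.
INLINE_IMAGE = %r{data:image/[a-z]++;base64,[A-Za-z0-9+/]++(?:\\x3d|=)*+}

dirty = "caf\xC3 ok".b # a lone lead byte, invalid as UTF-8
puts clean_utf8(dirty).valid_encoding?

script = 'var s="data:image/jpeg;base64,/9j/4AAQ\x3d\x3d";'
puts script[INLINE_IMAGE]
```

Scrubbing happens on a copy, so the caller's string is left untouched.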

The work is split into small pieces: `SectionLocator` finds the section,
`ThumbnailResolver` builds the inline-base64 index and sanitizes thumbnail URLs,
and `Cell` (with `NestedCell` and `LinkedCell`) turns one anchor into one item.
`CarouselParser` wires them together.

## Tested against other carousels

The challenge asks for two other pages; the specs cover more: real fetched pages
for Monet (50), Picasso (45), Leonardo da Vinci (47), and Tarsila do Amaral in
pt-BR (42); a real films carousel for Quentin Tarantino
(`kc:/people/person:movies`, 9), which exercises the empty-anchor /
aria-labelledby / sibling-image shape; a synthetic films carousel whose subtitle
isn't a year, to check that `extensions` doesn't pick up the title; and a page
with no carousel, which returns an empty array.

## Layout

```
lib/
  carousel_parser.rb      # orchestrator
  section_locator.rb      # find the carousel section by data-attrid
  thumbnail_resolver.rb   # inline base64 index + thumbnail URL
  cell.rb                 # Cell + NestedCell / LinkedCell (one anchor -> one item)
bin/extract               # print the JSON for a saved SERP file
spec/                     # RSpec: oracle (47/47) + cross-layout + unit specs
  fixtures/pages/*.html   # other real carousels (Monet, da Vinci, Picasso, Tarsila, Tarantino)
  fixtures/*.html         # synthetic edge cases (films-carousel, no-carousel)
files/                    # the original challenge files (inputs + oracle)
```

License: MIT (see `LICENSE`).
9 changes: 9 additions & 0 deletions bin/extract
@@ -0,0 +1,9 @@
#!/usr/bin/env ruby
# frozen_string_literal: true

# Print the artworks carousel from a saved Google SERP as JSON.
# bin/extract path/to/serp.html
require_relative "../lib/carousel_parser"

abort "usage: bin/extract <file.html>" if ARGV.empty?
puts JSON.pretty_generate(CarouselParser.from_file(ARGV[0]).to_h)
34 changes: 34 additions & 0 deletions lib/carousel_parser.rb
@@ -0,0 +1,34 @@
# frozen_string_literal: true

require "nokogiri"
require "json"
require_relative "section_locator"
require_relative "thumbnail_resolver"
require_relative "cell"

# Extracts a Google knowledge-graph carousel from a saved SERP page into
# { name:, extensions:, link:, image: } items. Orchestrates three collaborators:
# SectionLocator (where the carousel is), ThumbnailResolver (image urls), and
# Cell (one anchor -> one artwork).
class CarouselParser
  def self.from_file(path) = new(File.read(path))

  def initialize(html)
    html = html.to_s.dup.force_encoding("UTF-8").scrub
    @doc = Nokogiri::HTML(html) { |c| c.huge } # huge: keep libxml from truncating a big DOM
    @thumbnails = ThumbnailResolver.new(html)
  end

  def artworks
    anchors.filter_map { |anchor| Cell.for(anchor, @doc, @thumbnails).artwork }
  end

  def to_h = { "artworks" => artworks.map { |art| art.transform_keys(&:to_s) } }
  def to_json(*args) = JSON.generate(to_h, *args)

  private

  def anchors
    SectionLocator.new(@doc).container.css('a[href*="/search"][href*="stick="]')
  end
end
105 changes: 105 additions & 0 deletions lib/cell.rb
@@ -0,0 +1,105 @@
# frozen_string_literal: true

# One carousel cell: turns a single <a> anchor into an artwork hash.
#
# Two DOM shapes share the same algorithm (name = image alt or first label, then
# date / link / image) and differ only in WHERE the labels and image live, so
# they are template-method subclasses:
#   NestedCell: labels + <img> nested inside the anchor (artist works, and the
#               inline films/books carousels).
#   LinkedCell: empty anchor whose labels are aria-labelledby <span>s and whose
#               thumbnail is a sibling <img> (a person's films carousel).
class Cell
  GOOGLE = "https://www.google.com"

  # Pick the cell shape: a nested <img> means the labels are nested too.
  def self.for(anchor, doc, thumbnails)
    shape = anchor.at_css("img") ? NestedCell : LinkedCell
    shape.new(anchor, doc, thumbnails)
  end

  def initialize(anchor, doc, thumbnails)
    @anchor = anchor
    @doc = doc
    @thumbnails = thumbnails
  end

  # The artwork hash, or nil when the cell has no name (not a real item).
  def artwork
    return unless name

    art = { name: name }
    art[:extensions] = [date] if date
    art[:link] = link if link
    art[:image] = image
    art
  end

  private

  attr_reader :anchor, :doc, :thumbnails

  # The image alt is Google's screen-reader title and the most durable name; the
  # structural label is the fallback when the <img> carries no alt.
  def name = norm(image_node&.[]("alt")) || labels[0]
  def date = labels[1]

  # Root-relative hrefs get the Google origin; anything else yields nil.
  def link
    href = anchor["href"].to_s
    GOOGLE + href if href.start_with?("/")
  end

  def image = thumbnails.resolve(image_node, allow_src: allow_src?)

  # Subclass hooks.
  def labels = raise(NotImplementedError)
  def image_node = raise(NotImplementedError)
  def allow_src? = raise(NotImplementedError)

  # Replace non-breaking spaces and trim; nil or blank input becomes nil.
  def norm(str)
    text = str&.gsub("\u00A0", " ")&.strip
    text unless text.nil? || text.empty?
  end

  def clean(node) = norm(node&.text)
end

# Labels and <img> nested inside the anchor.
class NestedCell < Cell
  private

  def image_node = anchor.at_css("img")

  # A nested <img> shows a 1x1 placeholder in src (the real bytes arrive via the
  # _setImagesSrc script), so its src must be ignored.
  def allow_src? = false

  def labels
    anchor.css("div").select { |div| leaf_text?(div) }.map { |div| clean(div) }
  end

  def leaf_text?(div)
    div.children.any? && div.children.all?(&:text?) && !div.text.strip.empty?
  end
end

# Empty anchor: labels are aria-labelledby <span>s; thumbnail is a sibling <img>.
class LinkedCell < Cell
  private

  def allow_src? = true # the sibling <img> carries the data-URI directly in src

  def labels
    anchor["aria-labelledby"].to_s.split.first(2).map { |id| clean(doc.at_css("##{id}")) }
  end

  # Climb to the nearest ancestor holding an <img>, stopping before one that wraps
  # another cell, so we never borrow a neighbour's thumbnail.
  def image_node
    node = anchor.parent
    until node.nil? || wraps_other_cell?(node)
      img = node.at_css("img")
      return img if img

      node = node.parent
    end
  end

  def wraps_other_cell?(node) = node.css('a[href*="stick="]').size > 1
end