Extract Van Gogh Paintings Carousel #389

Open
Simar-malhotra09 wants to merge 8 commits into serpapi:master from Simar-malhotra09:master

Extract Van Gogh Paintings Carousel

Ruby parser that reads files/van-gogh-paintings.html and extracts each painting's:

  • Name
  • Extensions
  • Link
  • Image

The output matches the provided expected-array.json exactly.

All data is parsed from the local HTML file. No network requests are made.

Approach

Identifying carousel items

The main challenge is determining which elements belong to the carousel.

Google's CSS class names are generated and obfuscated (pgNMRc, iELo6, etc.). They do not carry meaning and can change at any time, so the parser does not rely on them.

Instead, carousel items are identified using the stick= query parameter in the link URL. Any <a> element with a /search?...&stick=... URL that wraps an <img> is treated as a carousel item.

This relies on Google's search URL structure rather than presentation-specific CSS classes, making it a more stable signal.
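As a rough illustration of the URL check described above, here is a minimal sketch using only Ruby's standard library. The helper name `carousel_href?` is hypothetical and not taken from this PR, and the separate condition that the anchor wraps an `<img>` is not covered here.

```ruby
require "uri"

# Hypothetical helper (not from the PR): returns true when an href looks like
# a carousel link, i.e. a /search URL carrying a non-empty stick= parameter.
def carousel_href?(href)
  uri = URI.parse(href.to_s)
  return false unless uri.path == "/search" && uri.query

  # Decode the query string into a Hash of parameter name => value.
  params = URI.decode_www_form(uri.query).to_h
  !params["stick"].to_s.empty?
rescue URI::InvalidURIError, ArgumentError
  # Malformed hrefs are treated as "not a carousel item".
  false
end
```

Parsing the query string instead of matching on the raw href avoids false positives where `stick=` only appears inside another parameter's value.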

Extracting metadata

The painting name and year are read from the innermost text nodes within each anchor element.

The year is only added to extensions when it matches a four-digit year:

/\A\d{4}\z/

This prevents title fragments from being incorrectly included in the extensions field.
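A minimal sketch of how that guard might be applied. The method name `extensions_for`, the whitespace stripping, and returning an empty array for non-matching text are assumptions for illustration, not the PR's actual code.

```ruby
# Pattern from the description: the whole string must be exactly four digits.
YEAR_PATTERN = /\A\d{4}\z/

# Hypothetical helper: returns the extensions array for a candidate text node.
# Anything that is not a bare four-digit year yields an empty array.
def extensions_for(candidate)
  text = candidate.to_s.strip # stripping whitespace is an assumption
  text.match?(YEAR_PATTERN) ? [text] : []
end
```

Because the pattern is anchored with `\A` and `\z`, a fragment such as "c. 1889" or a title containing digits is rejected rather than partially matched.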

Image extraction

The HTML contains two thumbnail formats.

Images embedded in scripts

The first set of paintings stores image data inside inline <script> blocks through _setImagesSrc(ii, s) calls.

During initialization, the parser:

  1. Extracts the image data from the scripts
  2. Decodes escaped characters such as \x3d to =
  3. Builds an ID-to-image lookup table
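The three steps above can be sketched as follows. This assumes each inline script has the shape `var s='<data>';var ii=['<id>', ...];_setImagesSrc(ii,s);`, which is an assumption about the page markup rather than something stated in this PR. The names `SCRIPT_IMAGE_PATTERN`, `decode_js_escapes`, and `build_image_lookup` are hypothetical.

```ruby
# Assumed script shape (not confirmed by the PR):
#   var s='data:image/jpeg;base64,...';var ii=['some_id'];_setImagesSrc(ii,s);
SCRIPT_IMAGE_PATTERN =
  /var s='([^']*)';\s*var ii=\[([^\]]*)\];\s*_setImagesSrc\(ii,\s*s\)/

# Step 2: turn JavaScript \xNN escapes (e.g. \x3d) into the characters they encode.
def decode_js_escapes(text)
  text.gsub(/\\x([0-9a-fA-F]{2})/) { Regexp.last_match(1).hex.chr(Encoding::UTF_8) }
end

# Steps 1 and 3: scan the script text and map every listed ID to its image data.
def build_image_lookup(script_text)
  lookup = {}
  script_text.scan(SCRIPT_IMAGE_PATTERN) do |data, ids|
    decoded = decode_js_escapes(data)
    ids.scan(/'([^']+)'/).flatten.each { |id| lookup[id] = decoded }
  end
  lookup
end
```

Building the table once during initialization keeps each per-item image lookup a constant-time hash access.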

Images from data-src

The remaining paintings are lazy-loaded and expose the image URL directly through the data-src attribute.

These images are read directly from the corresponding <img> elements.
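One way the two image sources could be combined is sketched below. The `<img>` attributes are modelled as a plain Hash so the example stays standard-library only. The method name `resolve_image`, the use of the element's `id` as the lookup key, and the order of precedence are all assumptions for illustration.

```ruby
# Hypothetical resolver: prefer a data-src URL when present, otherwise fall
# back to the script-derived lookup table keyed by the element's id (assumed).
# Returns nil when neither source provides an image.
def resolve_image(img_attrs, script_images)
  data_src = img_attrs["data-src"]
  return data_src unless data_src.nil? || data_src.empty?

  script_images[img_attrs["id"]]
end
```

Returning nil for an unresolved image, rather than an empty string, makes a missing thumbnail easy to detect when the output is compared against the expected JSON.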

Validation

The parser was tested against three carousel layouts:

| File | Items | Content Type |
| --- | --- | --- |
| files/van-gogh-paintings.html | 47 | Artworks, exact match with provided JSON |
| spec/fixtures/deniro-movies.html | 12 | Movies |
| spec/fixtures/shinkai-books.html | 12 | Books |

Running

bin/setup
# Verifies Ruby >= 3.1 and installs dependencies

bundle exec bin/parse files/van-gogh-paintings.html
# Outputs JSON to stdout

bundle exec rspec
# 16 examples, 0 failures

Signed-off-by: Simar Malhotra <malhotrasimar009@gmail.com>