domain-skills/facebook: add page-archival.md (full-preservation scraping) #385
Open
thetimechain wants to merge 1 commit into
…raping
Third Facebook domain skill alongside groups.md and pages.md. Where
those two are tuned for monitoring (top-N recent posts + outbound URLs),
this one is tuned for full preservation of a Page:
- walk every reachable post permalink
- visit each one in permalink view (not feed view)
- recursively expand every comment and reply thread
- download every image
- emit one wiki-compatible markdown file per post
Captures patterns from a live archival session against a long-running
community Page (~63 posts, every comment + reply + image preserved,
zero account checkpoints). Most operationally important findings:
- Two-phase manifest-then-scrape — feed virtualization makes comment
expansion impossible in feed view; permalink view is required.
- Vanity-scoped permalink filter (a[href*="/{vanity}/posts/pfbid"]):
unscoped pfbid matches let in ~30% notification/recommendation pollution.
- Comment depth from DOM nesting — each comment/reply is its own
div[role="article"]; indentation pixels are unreliable.
- 30s pause every 50 scrolls — pacing floor that kept the test account
un-checkpointed through a 581-URL run.
No pixel coordinates, no user-specific narration, no secrets.
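As an illustration of the vanity-scoped filter described above, here is a minimal sketch of how a phase-1 manifest might be built from harvested `href` values. The function name `filterPermalinks`, the sample URLs, and the choice to strip query strings before de-duplicating are assumptions made for this example, not taken from the skill file.

```javascript
// Hypothetical helper: keep only post permalinks that belong to one Page.
// `vanity` is the Page's URL slug; the /{vanity}/posts/pfbid shape comes from
// the selector quoted in the description.
function filterPermalinks(hrefs, vanity) {
  const needle = `/${vanity}/posts/pfbid`;
  const seen = new Set();
  const manifest = [];
  for (const href of hrefs) {
    if (!href || !href.includes(needle)) continue;
    // Assumption: query strings and fragments only carry tracking params,
    // so dropping them lets the same post de-duplicate cleanly.
    const clean = href.split(/[?#]/)[0];
    if (seen.has(clean)) continue;
    seen.add(clean);
    manifest.push(clean);
  }
  return manifest;
}

// Made-up sample input: two variants of one post, one foreign Page, one non-post link.
const sampleHrefs = [
  'https://www.facebook.com/examplepage/posts/pfbid0abc?__cft__=1',
  'https://www.facebook.com/examplepage/posts/pfbid0abc?__tn__=2',
  'https://www.facebook.com/otherpage/posts/pfbid0xyz',
  'https://www.facebook.com/notifications',
];
const manifest = filterPermalinks(sampleHrefs, 'examplepage');
console.log(manifest);
```

With the sample input, only the single `examplepage` permalink survives; the foreign-Page and notification links are dropped.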
1 issue found across 1 file
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="agent-workspace/domain-skills/facebook/page-archival.md">
<violation number="1" location="agent-workspace/domain-skills/facebook/page-archival.md:526">
P2: Single-block comments silently lose body text in the full end-to-end example. The standalone section earlier in the same file correctly handles this with a fallback `(blocks[0] && blocks.length === 1 ? blocks[0] : null)`, but the full example only uses `blocks.slice(1).join('\\n') || null`, which drops comment text when a comment renders with only one `div[dir="auto"]` block. Mirror the fallback from the standalone extraction snippet to prevent data loss.</violation>
</file>
<file context>
@@ -0,0 +1,630 @@
+ const t = el.querySelector('a[href*="comment_id="]');
+ comments.push({
+ depth, author: blocks[0] || null, author_url: a?.href || null,
+ text: blocks.slice(1).join('\\n') || null,
+ time_hint: t?.innerText || null,
+ });
</file context>
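A minimal sketch of the fallback the review asks for, written as a standalone function so the three cases can be checked in isolation. The name `splitCommentBlocks` is hypothetical, and the plain `'\n'` separator assumes the code is not embedded in a Python string, unlike the quoted `'\\n'`.

```javascript
// Hypothetical helper: derive author and body text from the
// div[dir="auto"] text blocks found inside one comment article.
function splitCommentBlocks(blocks) {
  const author = blocks[0] || null;
  const body = blocks.slice(1).join('\n');
  // Fallback quoted in the review: when there is exactly one block,
  // keep that block as the text instead of returning null.
  const text = body || (blocks[0] && blocks.length === 1 ? blocks[0] : null);
  return { author, text };
}

const multi = splitCommentBlocks(['Alice', 'Hello', 'World']);
const single = splitCommentBlocks(['Only block']);
const empty = splitCommentBlocks([]);
console.log(multi, single, empty);
```

In the single-block case `author` and `text` hold the same string, so a consumer that needs to tell them apart would have to decide which role that block plays.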
Summary
Adds a third Facebook domain skill, `page-archival.md`, alongside the existing `groups.md` and `pages.md`. The existing two are tuned for monitoring (top-N recent posts + outbound URLs). This new one is tuned for full preservation of a Page: walk every reachable post permalink, visit each in permalink view, recursively expand every comment and reply thread, download every image, and emit one wiki-compatible markdown file per post.

Why this is a separate skill
`pages.md` and `groups.md` quite reasonably stop at feed-view harvesting since their workflows don't need comments + replies + images. But the moment the goal is preservation (admin's content might disappear), feed-view falls apart:
- `<img src>` in feed-view returns thumbnails; full-res is in the wrapping `<a href>`

So a Page archive needs a two-phase architecture: phase 1 walks the feed and captures only permalinks, phase 2 visits each permalink and does the deep extraction. Keeping that as its own skill avoids muddling the simpler `pages.md` flow.

Patterns captured
All field-tested against a long-running community Page (~63 posts, ~268 comments, ~108 images preserved, zero account checkpoints through a 581-URL run):
- Vanity-scoped permalink filter (`a[href*="/{vanity}/posts/pfbid"]`): unscoped pfbid matches let in ~30% notification/recommendation pollution
- `<h1>` misdirection: FB's chat widget renders its own `<h1>`, so a global `h1` returns "Chats" instead of the Page name; filter inside `[role="main"]`
- `expand_all_once()` loop: tight regex covering "See more / View N comments / View N replies / Show more replies"; idempotent passes until zero clicks, capped at 30
- … `div[role="article"]` between comment and post article (indentation pixels are unreliable)
- … `emoji.php`, drop `width < 60` to avoid avatars + reaction icons + tracking pixels
- … `ensure_real_tab()` between ops as the lighter-weight option
- … `/posts` → `/` 302 quirk documented; use the bare Page URL

What I deliberately didn't include

Verification
Run `rg --files agent-workspace/domain-skills/facebook` and confirm the three skills coexist. The new file mirrors the style + section ordering of `pages.md` (URL patterns table, DOM anchors table, scrolling pattern, decoder helpers, rate-limit discipline, self-inspection block, full Python example, gotchas log).

Happy to iterate on tone or trim sections if the maintainers prefer a leaner skill.
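To make the "idempotent passes until zero clicks, capped at 30" idea concrete, here is a rough sketch of such a loop. It is not the skill's actual `expand_all_once()`: the regex, the `{ label, click }` button shape, and the helper names `expandAllOnce` / `expandUntilStable` are assumptions, and the page is simulated so the logic can run outside a browser.

```javascript
// Assumed pattern covering the labels named in the description; the real regex may differ.
const EXPAND_RE = /^(?:See more|View \d+ (?:more )?(?:comments?|repl(?:y|ies))|Show more replies)$/i;

// One pass: click every button whose label matches, return how many were clicked.
function expandAllOnce(buttons) {
  let clicks = 0;
  for (const button of buttons) {
    if (EXPAND_RE.test(button.label.trim())) {
      button.click();
      clicks += 1;
    }
  }
  return clicks;
}

// Repeat passes until one produces zero clicks, or the cap is reached.
function expandUntilStable(getButtons, maxPasses = 30) {
  let total = 0;
  for (let pass = 0; pass < maxPasses; pass += 1) {
    const clicks = expandAllOnce(getButtons());
    if (clicks === 0) break;
    total += clicks;
  }
  return total;
}

// Simulated page: clicking a button removes it and may reveal a nested one.
let pending = ['View 3 comments', 'Like', 'See more'];
const revealedBy = { 'View 3 comments': 'View 2 replies' };
function getButtons() {
  return pending.map((label) => ({
    label,
    click() {
      pending = pending.filter((other) => other !== label);
      if (revealedBy[label]) pending.push(revealedBy[label]);
    },
  }));
}

const totalClicks = expandUntilStable(getButtons);
console.log(totalClicks, pending);
```

In a real session each pass would also need a short wait for newly loaded comments to render before the next query; that timing is left out here because it depends on the automation harness.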
Summary by cubic
Adds the Facebook domain skill `page-archival.md` to fully preserve a Page by building a permalink manifest, scraping each post in permalink view, expanding all comments/replies, downloading images, and writing one markdown file per post. Complements `pages.md` and `groups.md`, which focus on monitoring.
- … `agent-workspace/domain-skills/facebook/page-archival.md` documenting end-to-end Page archival.
- … `emoji.php`, width ≥ 60) and per-post markdown output.
- … `ensure_real_tab()`), plus a self-inspection block and a full Python example.

Written for commit 303ae0f. Summary will update on new commits.
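The image rule mentioned in the summary (skip `emoji.php`, keep width ≥ 60) could look roughly like the sketch below. The function name `keepContentImages`, the `{ src, width }` record shape, and the sample URLs are assumptions for illustration; the skill file may collect and filter images differently.

```javascript
// Hypothetical filter over { src, width } records collected from <img> elements.
// Drops emoji sprites and anything narrower than minWidth (avatars, reaction
// icons, tracking pixels), per the thresholds named in the summary.
function keepContentImages(images, minWidth = 60) {
  return images.filter((img) =>
    Boolean(img.src) && !img.src.includes('emoji.php') && img.width >= minWidth);
}

// Made-up sample records.
const sampleImages = [
  { src: 'https://example.invalid/photo_full.jpg', width: 960 },
  { src: 'https://example.invalid/images/emoji.php/v9/heart.png', width: 64 },
  { src: 'https://example.invalid/avatar.jpg', width: 40 },
  { src: 'https://example.invalid/pixel.gif', width: 1 },
];
const kept = keepContentImages(sampleImages);
console.log(kept.map((img) => img.src));
```

Only the 960-pixel photo passes; the emoji image is rejected by URL even though it is wide enough, and the avatar and pixel are rejected by width.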