Methodology

Data Collection

  • 4 attractions: Empire State Building, Edge Hudson Yards, Summit One Vanderbilt, Top of the Rock.
  • 1 scrape per day. Scraper enters each attraction's public booking widget and reads every tour time and price. Public data only, no private API endpoints.
  • Playwright (Microsoft browser automation) drives a real Chromium browser. Standard clicks, keystrokes, waits. Same behavior as a guest booking manually.
  • TOR uses bot protection. Scraper presents a standard browser identity to clear it.
  • Forward windows vary by attraction. ESB sells 173 days out (~6 months). Summit sells 200 days (~6.5 months). Edge and TOR sell 265 days (~9 months).
  • Apr 12 2026 scrape: ESB 9,568 rows across 172 dates. Edge 20,868 rows across 267 dates. Summit 5,371 rows across 199 dates. TOR 22,929 rows across 264 dates.
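A minimal sketch of the collection step described above, assuming Playwright's Python sync API. The source does not state the scraper's language, and the names parse_price_cents and scrape_tour_times, the url argument, and the slot_selector argument are all hypothetical. Each real widget needs its own URL and selector, plus the clicks and waits that a guest would perform.

```python
import re


def parse_price_cents(text):
    """Convert displayed text such as "$44" or "$44.50" to integer cents."""
    match = re.search(r"\$\s*(\d+)(?:\.(\d{2}))?", text)
    if match is None:
        return None  # no price shown, e.g. a sold-out tour time
    dollars = match.group(1)
    cents = match.group(2) or "00"
    return int(dollars) * 100 + int(cents)


def scrape_tour_times(url, slot_selector):
    """Open a booking widget in Chromium and return the text of each slot.

    url and slot_selector are placeholders (assumptions), not real values.
    """
    # Imported here so the parsing helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(slot_selector)  # wait for the widget to render
        texts = page.locator(slot_selector).all_inner_texts()
        browser.close()
    return texts
```

Keeping the parsing separate from the browser session lets the price logic be checked without touching a live site.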

Schema & Key Fields

  • attraction: attraction key (esb, edge, summit, totr). totr is Top of the Rock, abbreviated TOR elsewhere in this document.
  • travel_date: guest visit date (YYYY-MM-DD)
  • tour_time: tour time (e.g. "10:30 AM")
  • price_cents: base price in cents, before booking fee. Null if unavailable or sold out.
  • status: available, sold_out, going_fast
  • scrape_date: date row was captured
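The fields above could be expressed as a SQLite table along these lines. This is a sketch, not the pilot's actual DDL: the table name tour_prices, the column types, the CHECK constraints, and the primary key are assumptions, and the inserted rows are illustrative rather than scraped values.

```python
import sqlite3

# Hypothetical DDL built from the field list; only the column names and the
# allowed attraction/status values come from the document.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tour_prices (
    attraction  TEXT NOT NULL
        CHECK (attraction IN ('esb', 'edge', 'summit', 'totr')),
    travel_date TEXT NOT NULL,   -- YYYY-MM-DD
    tour_time   TEXT NOT NULL,   -- e.g. '10:30 AM'
    price_cents INTEGER,         -- nullable
    status      TEXT NOT NULL
        CHECK (status IN ('available', 'sold_out', 'going_fast')),
    scrape_date TEXT NOT NULL,   -- YYYY-MM-DD
    PRIMARY KEY (attraction, travel_date, tour_time, scrape_date)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Illustrative rows only.
sample_rows = [
    ("esb", "2026-07-04", "10:30 AM", 4400, "available", "2026-04-12"),
    ("esb", "2026-07-04", "10:45 AM", None, "sold_out", "2026-04-12"),
]
conn.executemany("INSERT INTO tour_prices VALUES (?, ?, ?, ?, ?, ?)", sample_rows)
row_count = conn.execute("SELECT COUNT(*) FROM tour_prices").fetchone()[0]
```

Including scrape_date in the key keeps one row per tour time per daily run, which is what lets history grow one day at a time.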

Making Prices Comparable

  • Base price only is stored. Booking fees not in the JSON. All-in price computed at display time.
  • Standard: single adult GA, same travel date.
  • Tour time intervals vary: ESB every 15 min, Summit every 30 min, Edge and TOR every 10 min. Cross-attraction comparisons match to nearest available tour time within ±30 min.
  • Booking fees: ESB $5 flat per order. Edge $2 per ticket. Summit $3 flat per order. TOR fee embedded in displayed price.
  • 1-ticket all-in (Apr 2026): ESB $49, Edge $42, Summit $47, TOR $42. ESB is $7 above Edge and TOR and $2 above Summit at 1 ticket. Fee math shifts at larger party sizes.
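The fee rules above can be sketched as a display-time calculation. The function name all_in_cents and the FEES layout are hypothetical. The base prices are back-calculated from the 1-ticket all-in figures and the stated fees (for example, ESB $49 minus the $5 order fee gives $44); they are not stored values from the dataset.

```python
# Fee rules from the list above, in cents.
FEES = {
    "esb": ("per_order", 500),
    "edge": ("per_ticket", 200),
    "summit": ("per_order", 300),
    "totr": ("embedded", 0),  # fee already inside the displayed price
}

# Assumed base prices in cents, back-calculated from the 1-ticket all-in figures.
BASE_CENTS = {"esb": 4400, "edge": 4000, "summit": 4400, "totr": 4200}


def all_in_cents(attraction, base_cents, tickets=1):
    """Total order cost in cents for a party of `tickets` adults."""
    kind, fee = FEES[attraction]
    subtotal = base_cents * tickets
    if kind == "per_order":
        return subtotal + fee
    if kind == "per_ticket":
        return subtotal + fee * tickets
    return subtotal  # embedded fee: nothing to add


one_ticket = {a: all_in_cents(a, b, 1) for a, b in BASE_CENTS.items()}
four_tickets = {a: all_in_cents(a, b, 4) for a, b in BASE_CENTS.items()}
```

Under these assumed base prices, a 4-ticket order works out to $45.25 per ticket at ESB and $42.00 at Edge, so the per-ticket gap narrows from $7.00 to $3.25 because the flat order fee is spread across the party.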

Anomaly Handling

  • Sold-out tour times carry forward the last known price. Shown as "Sold Out" on tracker, excluded from averages, retained for same-date comparisons.
  • Zero-row scrapes block publish. Yesterday's data stays live until the next successful run.
  • Failure alerts fire via Gmail (OAuth) within seconds. Alert includes attraction name, row counts, and traceback.
  • Outliers flagged, not deleted. Threshold: price more than 3 standard deviations from 30-day rolling mean. Real price spikes (holidays, competitive moves) appear as outliers on day one but are retained.
  • Known closures (Summit private-event days) render as "Closed" in the tracker.
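A sketch of two of the checks above: the 3-standard-deviation outlier flag and the zero-row publish gate. The function names are hypothetical. The source does not say whether the standard deviation is the sample or population form, so the sample form is assumed here. It also does not say whether one attraction returning zero rows blocks the whole publish, which this sketch assumes.

```python
from statistics import mean, stdev


def is_outlier(price_cents, history_cents, k=3.0):
    """Flag a price more than k standard deviations from the rolling mean.

    history_cents is the trailing window of prices (30 days in the text).
    Flagged rows are kept, not deleted; this only returns the flag.
    """
    if len(history_cents) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history_cents)
    sigma = stdev(history_cents)  # sample standard deviation (assumption)
    return abs(price_cents - mu) > k * sigma


def can_publish(row_counts):
    """Block publish if the run is empty or any attraction returned zero rows."""
    return bool(row_counts) and all(n > 0 for n in row_counts.values())


# Illustrative 30-day history alternating between $44 and $45.
history = [4400, 4500] * 15
```

With this illustrative history the mean is $44.50 and the threshold sits a little over $1.50 either side, so a $47 price would be flagged while $45.50 would not.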

Sunset Detection

  • Sunset premium = peak evening price minus same-date noon price. Noon is the baseline because every attraction sells a noon tour time and noon prices are never sunset-inflated.
  • ESB: Tour times labeled "Sunset" or "Twilight" in the booking widget. Peak price in that window minus noon price.
  • TOR: Tour times labeled "Sunset" in the booking widget (e.g. "5:00 PM Sunset"). 9 price tiers at $3 intervals ($42 to $71). Dynamic by date.
  • Edge: Tour times labeled "Sunset" in the booking widget. Peak price in 4:00 PM to 9:00 PM window minus noon. 26 distinct prices observed, $1 steps.
  • Summit: No sunset label. Two price bands per date: floor and peak. $13 step on 199 of 200 dates (99.5%). Peak band = sunset window.
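The premium calculation can be sketched for the window-based case (Edge's 4:00 PM to 9:00 PM window is used as the default). The function names and the row layout are hypothetical, and the prices in the example are illustrative. Label-based detection for ESB and TOR would filter on the widget label instead of the clock window.

```python
from datetime import datetime, time


def parse_tour_time(text):
    """Parse a tour time such as "10:30 AM" into a datetime.time."""
    return datetime.strptime(text, "%I:%M %p").time()


def sunset_premium_cents(rows, start=time(16, 0), end=time(21, 0)):
    """Peak evening price minus the same-date noon price, in cents.

    rows: (tour_time, price_cents) pairs for one attraction and travel date.
    Returns None when either the noon price or an evening price is missing.
    """
    priced = [(parse_tour_time(t), p) for t, p in rows if p is not None]
    noon = [p for t, p in priced if t == time(12, 0)]
    evening = [p for t, p in priced if start <= t <= end]
    if not noon or not evening:
        return None
    return max(evening) - noon[0]


# Illustrative rows for one travel date.
rows = [
    ("12:00 PM", 4000),
    ("4:00 PM", 4400),
    ("7:30 PM", 5200),
    ("8:00 PM", None),   # no price available
    ("9:30 PM", 4100),   # outside the window
]
premium = sunset_premium_cents(rows)
```

Here the peak in-window price is $52 against a $40 noon price, giving a $12 premium.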

AI Utilization

  • Claude Code (Anthropic): Full system architecture, scraper implementation, schema design, data pipeline, static site (5 pages), sunset detection algorithm, case study document.
  • 48-hour build. Manual estimate: 5 to 6 weeks with a 2-person team.
  • Competitor API discovery: Some booking platforms expose API endpoints queryable at scale. This approach likely violates terms of service. Production collection uses only public booking widget interaction, equivalent to manual guest behavior.
  • AI limitations: Cannot predict competitor pricing intent, cannot capture login-gated rates, cannot forecast site structure changes.

Limitations

  • Prices reflect scrape-time widget display. Fees layered in UI, not stored. Checkout totals can differ if fee structure changes between runs.
  • No pre-launch baseline. Tool created April 2026. History grows one day per run.
  • TOR uses bot protection. Scraper clears it with a standard browser identity. Site-side changes may require re-tuning.
  • IP throttling or site rebuilds can break a scraper without warning. Validation gate prevents empty runs from overwriting good data. Gaps possible until patched.
  • Reseller pricing (Viator, GetYourGuide) not captured. Direct ticket pages only. Adding resellers: estimated 1-month build.
  • Forward windows differ. ESB 173 days (~6 months), Summit 200 days (~6.5 months), Edge and TOR 265 days (~9 months). Comparative analysis limited to the ESB window, the shortest of the four.

Assumptions

  • GA only: All comparisons use general-admission pricing. Premium tiers (Express, VIP, Sunrise) excluded.
  • Single adult: One ticket, one adult. No child/senior rates, no group discounts.
  • Base price: Price before booking fees. All-in computed at display time using known fee structures.
  • Same travel date: Cross-attraction comparisons always use the same calendar date.
  • Nearest tour time: When comparing sunset pricing, each ESB tour time matched to nearest competitor tour time within ±30 min.
  • Noon baseline: Sunset premium = peak evening price minus noon price. Noon chosen because all attractions sell it and none price it as sunset.
  • Sold-out carry-forward: When a tour time sells out, last known price retained for trend analysis.
  • No reseller pricing: Direct booking widget only. OTA prices (Viator, GetYourGuide) not captured.
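The nearest-tour-time assumption can be sketched as follows. The function names are hypothetical. Because tour times run on different intervals (15, 30, and 10 minutes), exact ties are possible; this sketch breaks ties in favor of the first candidate listed, which is an assumption and not a rule stated in the source.

```python
from datetime import datetime


def minutes_since_midnight(text):
    """Convert a tour time such as "7:15 PM" to minutes after midnight."""
    t = datetime.strptime(text, "%I:%M %p")
    return t.hour * 60 + t.minute


def nearest_tour_time(target, candidates, tolerance_min=30):
    """Return the candidate closest to target within the tolerance, else None."""
    target_min = minutes_since_midnight(target)
    best, best_gap = None, None
    for candidate in candidates:
        gap = abs(minutes_since_midnight(candidate) - target_min)
        # Strict "<" keeps the first candidate on a tie (assumption).
        if gap <= tolerance_min and (best_gap is None or gap < best_gap):
            best, best_gap = candidate, gap
    return best


# An ESB 7:15 PM tour time matched against 30-minute competitor tour times.
match = nearest_tour_time("7:15 PM", ["6:30 PM", "7:00 PM", "7:30 PM"])
```

In this example 7:00 PM and 7:30 PM are both 15 minutes away, so the earlier-listed 7:00 PM is returned; a target with no candidate inside the ±30 min tolerance returns None and would be left out of the comparison.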

Infrastructure & Deployment

  • Current build (pilot): Local machine, SQLite, JSON export, GitHub Pages, shared password. Adequate for 48-hour demo. Not production-grade.
  • Help chatbot (bottom-right corner): Plain-language Q&A on pricing data, sunset logic, and methodology.
  • Production build: Azure tenant. Container Apps Jobs (daily cron). PostgreSQL (audit, backups). Blob Storage + CDN (JSON). Static Web Apps + Entra ID SSO (IT-provisioned access). Key Vault (secrets). Monitor (alerting).
  • Cost: $58 to $99/month.
  • GitHub Actions: Workflow in repo, dormant. Enable in settings to activate daily scrape.