Commit Graph

19 Commits (9c4d10609f6f8500d6cd7cb5d381c6191e492028)

Author SHA1 Message Date
Haewon Kam 2cda26a649 feat: per-URL clinic folder — auto-save all scraped data to Storage
Each analysis run now creates a dedicated folder in Supabase Storage:
  clinics/{domain}/{reportId}/
    ├── scrape_data.json    (discover-channels: website scrape + Perplexity)
    ├── channel_data.json   (collect-channel-data: all channel API results)
    └── report.json         (generate-report: final AI-generated report)

Screenshots also moved from {reportId}/{id}.png to:
  clinics/{domain}/{reportId}/screenshots/{id}.png

Migration: 20260407_clinic_data_storage.sql creates 'clinic-data' bucket
(private, 10MB/file, JSON only). All writes are non-fatal — pipeline
continues even if Storage upload fails.
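
A minimal sketch of the folder layout and the non-fatal write described above. The helper names (`clinicStoragePaths`, `saveNonFatal`) and the injected `upload` callback are hypothetical and stand in for whatever Storage client the Edge Functions actually use.

```typescript
// Hypothetical helper: builds the object keys for one analysis run.
interface ClinicStoragePaths {
  scrapeData: string;
  channelData: string;
  report: string;
  screenshot: (id: string) => string;
}

function clinicStoragePaths(domain: string, reportId: string): ClinicStoragePaths {
  const base = `clinics/${domain}/${reportId}`;
  return {
    scrapeData: `${base}/scrape_data.json`,
    channelData: `${base}/channel_data.json`,
    report: `${base}/report.json`,
    screenshot: (id: string) => `${base}/screenshots/${id}.png`,
  };
}

// Non-fatal write: a failed upload is logged and reported as `false`,
// so the caller can continue the pipeline instead of aborting.
async function saveNonFatal(
  upload: (path: string, body: string) => Promise<void>, // assumed uploader signature
  path: string,
  data: unknown,
): Promise<boolean> {
  try {
    await upload(path, JSON.stringify(data));
    return true;
  } catch (err) {
    console.warn(`[storage] upload failed for ${path}:`, err);
    return false;
  }
}
```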

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 10:04:52 +09:00
Haewon Kam ae87953fa0 feat: Registry-verified badge + registryData data flow + V3 error recording
- ClinicSnapshot.tsx: 'Registry 검증' (Registry-verified) badge (ShieldCheck icon), district/branches/brandGroup pills, external links (강남언니 / Naver Place / Google Maps) when source=registry
- report.ts: add source and registryData fields to ClinicSnapshot type
- transformReport.ts: ApiMetadata now accepts source/registryData; passes to clinicSnapshot
- useReport.ts: DB load path extracts scrape_data.source + scrape_data.registryData → transformApiReport
- V3 dual-write error recording: discover-channels, collect-channel-data, generate-report now write error_message + error status to analysis_runs on catch instead of silently swallowing
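
The error-recording change in the last bullet could look roughly like the wrapper below. This is a sketch under assumptions: `withErrorRecording`, the `record` callback, and the literal status value `"error"` are illustrative, not taken from the repository.

```typescript
interface RunErrorUpdate {
  status: string;
  error_message: string;
}

// Runs one pipeline phase; on failure, records the error on the run row
// (via the injected `record` callback) and rethrows instead of swallowing it.
async function withErrorRecording<T>(
  runId: string,
  record: (runId: string, update: RunErrorUpdate) => Promise<void>,
  phase: () => Promise<T>,
): Promise<T> {
  try {
    return await phase();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    try {
      await record(runId, { status: "error", error_message: message });
    } catch (recordErr) {
      // Recording is best-effort; the original error is still the one surfaced.
      console.warn("[analysis_runs] failed to record error:", recordErr);
    }
    throw err;
  }
}
```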

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 10:01:19 +09:00
Haewon Kam d5f7f24e0a feat: clinic registry DB + pipeline audit P0 fixes
## Clinic Registry
- data/clinic-registry/clinic_registry_working.csv: channel master DB for 91 clinics
- data/clinic-registry/INFINITH_Outbound_List.csv: BD team outbound list (17 columns)
- data/clinic-registry/update_csv.py: safe CSV update script (fills empty fields only)
- data/clinic-registry/extract_place_ids.py: Naver Place ID extractor
- scripts/import-registry.ts: imports the CSV into the Supabase clinic_registry table
- supabase/migrations/20260406_clinic_registry.sql: clinic_registry table schema

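## Pipeline P0 Bug Fixes (after a full audit)
- fix(collect-channel-data): remove the erroneous 0-10 scale conversion of the 강남언니 rating
  - Before: any rating ≤ 5 was multiplied by 2, so a genuine 4.8/10 was wrongly converted to 9.6/10
  - After: the Firecrawl prompt already asks for a 0-10 value, so rawValue is trusted as-is
- fix(generate-report): replace the single Perplexity fetch with fetchWithRetry
  - maxRetries:2, backoffMs:[5000,15000], timeoutMs:90000 (90s)
  - Before: a timeout or 429 failed the entire report generation
  - After: automatic retries recover from transient API errors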
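
For illustration, a retry helper with the parameters listed above might be shaped like this. It is a sketch, not the repository's `fetchWithRetry`: the name `withRetry`, the injected `attempt` and `wait` callbacks, and the choice to retry on any thrown error are assumptions.

```typescript
interface RetryOptions {
  maxRetries: number;   // retries after the first attempt
  backoffMs: number[];  // delay before retry 1, retry 2, ...
  timeoutMs: number;    // per-attempt timeout
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  attempt: (signal: AbortSignal) => Promise<T>,
  opts: RetryOptions,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i <= opts.maxRetries; i++) {
    // Each attempt gets its own abort signal so a hung request is cut off.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), opts.timeoutMs);
    try {
      return await attempt(controller.signal);
    } catch (err) {
      lastError = err;
    } finally {
      clearTimeout(timer);
    }
    if (i < opts.maxRetries) {
      // Reuse the last backoff value if more retries than delays are configured.
      const delay = opts.backoffMs[Math.min(i, opts.backoffMs.length - 1)] ?? 0;
      await wait(delay);
    }
  }
  throw lastError;
}
```

The caller would pass an `attempt` that performs the request with the given `signal` and throws on a non-OK response (for example a 429), so that both timeouts and rate limits trigger a retry.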

## Docs
- docs/PIPELINE_IMPROVEMENT_PLAN.md: mark Sprint 0/1/2 complete + add full audit results
- docs/REGISTRY_FUNCTIONAL_SPECS.md, DB_SCHEMA_V3.md, and several other planning documents added

## New Components & Features
- supabase/functions/generate-content-plan, adjust-strategy: content plan / strategy adjustment
- src/components/plan/EditEntryModal, StrategyAdjustmentSection: plan editing UI
- supabase/functions/_shared/dataQuality, foundingYearExtractor, urlClassifier: data quality utilities

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 09:33:25 +09:00
Haewon Kam 2ca9ec0306 fix: YouTube name matching + Facebook domain fallback in channel discovery
YouTube now verifies all candidates and picks best match by channel title.
Facebook tries all candidates with domain-name fallback when Firecrawl returns empty.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 12:15:37 +09:00
Haewon Kam 7fe3ff82c9 feat: DB V3 dual-write — clinics + analysis_runs + channel_snapshots
Phase 2-4 of SaaS schema migration. All Edge Functions now write to
BOTH legacy marketing_reports AND new V3 tables:

discover-channels:
  - UPSERT clinics (url-based dedup)
  - INSERT analysis_runs (status: discovering)

collect-channel-data:
  - INSERT channel_snapshots (one per channel — time-series!)
  - INSERT screenshots (evidence rows)
  - UPDATE analysis_runs (raw_channel_data, vision_analysis)

generate-report:
  - UPDATE analysis_runs (report, status: complete)
  - UPDATE clinics (last_analyzed_at, established_year)

Frontend passes clinicId + runId through all 3 phases.
Legacy marketing_reports still written for backward compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:51:11 +09:00
Haewon Kam ed37f23f78 feat: extract social links from JS-rendered buttons on clinic website
Added A4 parallel Firecrawl call with actions: [wait 3s, scrape]
to execute JavaScript and extract social button href URLs from
header/footer. This is the most reliable source — most Korean
clinics have Facebook/Instagram/YouTube/Blog icons in their nav.

Results merged as Source 3 (buttonHandles) alongside HTML links,
JSON extraction, and API searches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:41:27 +09:00
Haewon Kam 80c57147e7 feat: Sprint 1 — 7 data quality quick wins
WP-1: YouTube channel ID regex {20,} → {22} (UC prefix + 22 chars = exactly 24 chars)
WP-2: Naver Place category filtering in enrich-channels (성형/피부, i.e. plastic surgery/dermatology)
WP-3: Google Maps stores mapsUrl separately from clinicWebsite
WP-4: Naver Blog separates officialBlogUrl from search results
WP-5: 강남언니 rawRating + normalized rating (≤5 → ×2), Firecrawl
      prompt explicitly states "out of 10, NOT out of 5"
WP-6: Perplexity model centralized in _shared/config.ts (env override)
WP-7: Apify Instagram timeout 30s → 45s
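
To make WP-1 concrete, here is a sketch of a strict channel ID match. The function name and the exact pattern are illustrative; the only facts relied on are that channel IDs start with `UC` and are 24 characters long.

```typescript
// "UC" + exactly 22 URL-safe characters = 24 characters total.
// The negative lookahead rejects longer runs that an open-ended {20,} would accept.
const CHANNEL_ID_IN_URL = /\/channel\/(UC[A-Za-z0-9_-]{22})(?![A-Za-z0-9_-])/;

function extractChannelId(url: string): string | null {
  const match = CHANNEL_ID_IN_URL.exec(url);
  return match && match[1] !== undefined ? match[1] : null;
}
```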

Frontend: transformReport uses mapsUrl and officialBlogUrl when available

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:35:40 +09:00
Haewon Kam 087f65eec1 fix: revert to single Perplexity query with proven prompt pattern
Split queries performed worse. The proven working pattern is:
- Single query with Korean+English clinic name
- "검색해서 찾아줘. 검색 결과에서 발견된 계정을 모두 알려줘" phrasing
  ("Search and find them. List every account found in the search results.")
- All channels in one request
- English name in parentheses helps Perplexity find international accounts

Tested: "그랜드성형외과 (Grand Plastic Surgery)" → finds Instagram,
YouTube, Facebook, TikTok, Naver Blog all in one call.
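
As a rough illustration of the single-query pattern, the builder below combines the Korean name, the English name in parentheses, and the quoted phrasing. The function name, the channel list, and the connecting words around the quoted phrasing are assumptions; the real prompt may differ.

```typescript
// Hypothetical query builder for the single-request pattern.
function buildPresenceQuery(koreanName: string, englishName?: string): string {
  // English name in parentheses, as in "그랜드성형외과 (Grand Plastic Surgery)".
  const name = englishName ? `${koreanName} (${englishName})` : koreanName;
  const channels = ["Instagram", "YouTube", "Facebook", "TikTok", "Naver Blog"];
  return (
    `${name}의 ${channels.join(", ")} 계정을 ` +
    "검색해서 찾아줘. 검색 결과에서 발견된 계정을 모두 알려줘"
  );
}
```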

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:36:36 +09:00
Haewon Kam 5157cf446a fix: split Perplexity into 3 focused queries matching research methodology
Single mega-query returns empty results. Split into:
B4a. Instagram + YouTube (most important, focused search)
B4b. Facebook + TikTok + Naver Blog + Kakao
B4c. 강남언니 + review platforms

Each query is short and focused — matches the proven pattern of
2-5 keyword searches that Perplexity handles well.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:34:45 +09:00
Haewon Kam ac2da7a4ac fix: simplify Perplexity prompt — short system + direct user query
Long system prompt caused sonar-pro to return empty results.
Reverted to sonar model with short, proven prompt pattern that
matches the user's successful manual test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:32:54 +09:00
Haewon Kam e64d168d34 feat: Perplexity sonar-pro research agent with structured online presence analysis
Replaced simple "find handles" prompt with comprehensive research agent:
- Model: sonar → sonar-pro (advanced multi-step web search)
- System prompt: full research methodology with 2-3 keyword searches,
  URL fetching, quantitative data extraction
- Output: structured JSON with channels (handles + follower counts +
  subscriber counts) + platforms (강남언니 rating, reviews)
- Research results saved to scrape_data.onlinePresenceResearch for
  downstream use in collect-channel-data and generate-report

Added _shared/researchPrompt.ts with prompt template + builder.
Updated agent documentation in doc/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:31:00 +09:00
Haewon Kam c74832d764 feat: Perplexity comprehensive Online Presence analysis + Apify Instagram search
B4 Perplexity: rewrote from narrow "find social accounts" to a broad
"comprehensive Online Presence analysis" prompt that finds Instagram, YouTube, Facebook,
TikTok, Naver, Kakao, 강남언니, 바비톡 in one query.

B5 Apify Instagram: generates handle candidates from clinic name
(english name, domain, _official, _ps, _clinic variants) and directly
checks each via Apify instagram-profile-scraper. Finds accounts that
web search misses.
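
The candidate generation in B5 can be sketched as follows. The suffix list comes from the message above; the function name, the slug rules, and the example domain in the usage note are assumptions.

```typescript
// Builds Instagram handle candidates from the English clinic name and the site domain.
function instagramHandleCandidates(englishName: string, domain: string): string[] {
  // Lowercase and drop everything except a-z and 0-9.
  const slug = (s: string): string => s.toLowerCase().replace(/[^a-z0-9]/g, "");
  const domainLabel = domain.replace(/^www\./, "").split(".")[0] ?? "";
  const bases = [slug(englishName), slug(domainLabel)].filter((b) => b.length > 0);
  const suffixes = ["", "_official", "_ps", "_clinic"];

  const candidates = new Set<string>(); // Set removes duplicates, keeps insertion order
  for (const base of bases) {
    for (const suffix of suffixes) {
      candidates.add(base + suffix);
    }
  }
  return Array.from(candidates);
}
```

For a clinic named "Grand Plastic Surgery" on the made-up domain `www.grandsurgery.com`, this yields eight candidates, each of which would then be checked against the profile scraper.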

Removed redundant B4b (platform presence) — now merged into B4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:24:56 +09:00
Haewon Kam f224d1788c feat: API-first channel discovery — YouTube API + Naver API + Firecrawl Search + Perplexity
Replaced Perplexity-only approach with 5 parallel direct API searches:

B1. YouTube Data API: search?type=channel&q={clinicName} → find channel
B2a. Naver Blog API: search blog.json → find official Naver blog
B2b. Naver Web API: search webkr.json → find Instagram/YouTube/Facebook URLs
B3. Firecrawl Search: web search → extract social URLs from results
B4. Perplexity: supplement — catch what direct APIs missed

All 5 sources run in parallel after Stage A (Firecrawl scrape for clinicName).
Results merged + deduplicated + verified. Perplexity is now a fallback,
not the primary source.
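
A sketch of the parallel run and merge step, under the assumption that each source can be modelled as a named async function returning platform/handle pairs. The type and function names are hypothetical, and verification is left out.

```typescript
interface Found { platform: string; handle: string }
interface Source { name: string; run: () => Promise<Found[]> }
interface Merged extends Found { source: string }

// Runs all sources in parallel; a failing source contributes nothing
// instead of failing the whole discovery step.
async function discoverInParallel(sources: Source[]): Promise<Merged[]> {
  const results = await Promise.all(
    sources.map(async (s): Promise<Found[]> => {
      try {
        return await s.run();
      } catch {
        return [];
      }
    }),
  );

  // Deduplicate by platform + normalized handle. Sources listed first win,
  // so direct APIs can be placed ahead of the Perplexity fallback.
  const byKey = new Map<string, Merged>();
  results.forEach((found, i) => {
    const source = sources[i]?.name ?? "unknown";
    for (const f of found) {
      const handle = f.handle.trim().replace(/^@/, "");
      const key = `${f.platform}:${handle.toLowerCase()}`;
      if (handle.length > 0 && !byKey.has(key)) {
        byKey.set(key, { platform: f.platform, handle, source });
      }
    }
  });
  return Array.from(byKey.values());
}
```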

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:15:49 +09:00
Haewon Kam 25aece2366 fix: Perplexity prompt rewrite + clinicName fallback via AI
Perplexity prompts changed from "find verified accounts" (returns all
null) to "search and report what you find" (returns actual handles).
Added clinicName resolution: Firecrawl Korean → English → Perplexity
URL-to-name lookup → domain fallback.
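
The fallback chain can be sketched as an ordered list of resolvers with the domain as the last resort. `resolveClinicName` and the resolver type are hypothetical; the actual lookups (Firecrawl Korean name, English name, Perplexity URL-to-name) would be supplied by the caller in that order.

```typescript
type NameResolver = () => Promise<string | null>;

// Tries each resolver in order and returns the first non-empty name.
// If all fail or return nothing, falls back to the first label of the hostname.
async function resolveClinicName(resolvers: NameResolver[], url: string): Promise<string> {
  for (const resolve of resolvers) {
    try {
      const name = (await resolve())?.trim();
      if (name) return name;
    } catch {
      // A failing resolver is skipped; the next one in the chain is tried.
    }
  }
  const host = new URL(url).hostname.replace(/^www\./, "");
  return host.split(".")[0] ?? host;
}
```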

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:07:04 +09:00
Haewon Kam 122b1915f0 fix: 2-stage discovery — Firecrawl first for clinicName, then Perplexity
Previously Firecrawl and Perplexity ran in parallel, so Perplexity
received raw URL instead of clinic name → poor search results.

Now:
Stage A: Firecrawl scrape+map (parallel) → extract clinicName from HTML
Stage B: Perplexity searches using extracted clinicName → finds Instagram,
  YouTube, Facebook handles that Firecrawl HTML parsing missed
Stage C: Merge 3 sources + verify all handles
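
The ordering constraint is the point of this change: the search must wait for the name. A sketch with injected dependencies follows; all names are hypothetical, and handles are simplified to plain strings.

```typescript
interface ScrapeResult { clinicName: string | null; htmlHandles: string[] }

interface DiscoveryDeps {
  scrape: (url: string) => Promise<ScrapeResult>;   // Stage A
  map: (url: string) => Promise<string[]>;          // Stage A, handles found via site map (assumed shape)
  search: (clinicName: string) => Promise<string[]>; // Stage B
  verify: (handle: string) => Promise<boolean>;     // Stage C
}

async function discoverTwoStage(url: string, deps: DiscoveryDeps): Promise<string[]> {
  // Stage A: scrape and map run in parallel with each other.
  const [scraped, mapped] = await Promise.all([deps.scrape(url), deps.map(url)]);

  // Stage B: starts only after Stage A, so it can search by clinic name.
  // The raw URL is used only if no name was extracted.
  const searched = await deps.search(scraped.clinicName ?? url);

  // Stage C: merge the three sources, drop duplicates, keep verified handles.
  const merged = Array.from(new Set([...scraped.htmlHandles, ...mapped, ...searched]));
  const checks = await Promise.all(merged.map((h) => deps.verify(h)));
  return merged.filter((_, i) => checks[i] === true);
}
```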

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:02:30 +09:00
Haewon Kam df8f84c3b9 fix: YouTube channel ID (UC...) handling + handle-to-channelId resolution
discover-channels: extractHandle('youtube') now detects UC* channel IDs
and returns them without @ prefix (previously @UC... caused verify fail)

verifyHandles: verifyYouTube uses cleanHandle for UC* check, requests
part=id,snippet for richer data

collect-channel-data: if channelId missing but handle present, resolves
via forHandle/forUsername lookup or direct UC* detection before skipping
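
A sketch of the UC* handling: a value that looks like a channel ID is kept without the `@` prefix and looked up by `id`, anything else is treated as a handle and looked up via `forHandle`. The function names are hypothetical, and the strict 22-character pattern reflects the later Sprint 1 fix rather than this commit.

```typescript
type YouTubeRef =
  | { kind: "channelId"; value: string }
  | { kind: "handle"; value: string };

const UC_CHANNEL_ID = /^UC[A-Za-z0-9_-]{22}$/;

function normalizeYouTubeRef(raw: string): YouTubeRef | null {
  const cleaned = raw.trim().replace(/^@/, ""); // "@UC..." becomes "UC..."
  if (cleaned.length === 0) return null;
  return UC_CHANNEL_ID.test(cleaned)
    ? { kind: "channelId", value: cleaned }
    : { kind: "handle", value: cleaned };
}

// Query parameters for a YouTube Data API channels.list request.
function channelLookupParams(ref: YouTubeRef): URLSearchParams {
  const params = new URLSearchParams({ part: "id,snippet" });
  if (ref.kind === "channelId") {
    params.set("id", ref.value);
  } else {
    params.set("forHandle", `@${ref.value}`);
  }
  return params;
}
```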

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:00:21 +09:00
Haewon Kam f65f0e85b3 fix: robust handle extraction — reject non-platform URLs, fix type safety
discover-channels: new extractHandle() validates each handle belongs to
its platform (rejects hospital-internal URLs like /idtube/view being
treated as YouTube). Extracts handles from full URLs correctly.

collect-channel-data: explicit Record<string,unknown> typing for DB JSON
fields — fixes TypeScript property access on VerifiedChannels from DB.

verifyHandles: fix TikTok double-URL concatenation.
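
The platform check can be illustrated with the sketch below. It is not the repository's `extractHandle`: the host table and path rules are assumptions, and it does not filter non-profile paths such as `/watch`.

```typescript
// Assumed host table; the real function may cover more platforms.
const PLATFORM_HOSTS: Record<string, string[]> = {
  youtube: ["youtube.com"],
  instagram: ["instagram.com"],
  facebook: ["facebook.com"],
  tiktok: ["tiktok.com"],
};

function extractHandle(platform: string, input: string): string | null {
  const hosts = PLATFORM_HOSTS[platform];
  if (!hosts) return null;
  const value = input.trim();

  if (!/^https?:\/\//i.test(value)) {
    // Bare handle: no path separators allowed, so "/idtube/view" is rejected.
    return /^@?[\w.-]+$/.test(value) ? value.replace(/^@/, "") : null;
  }

  let url: URL;
  try {
    url = new URL(value);
  } catch {
    return null;
  }
  // The URL must live on the platform's own domain (or a subdomain of it).
  const host = url.hostname.toLowerCase().replace(/^(www|m)\./, "");
  if (!hosts.some((h) => host === h || host.endsWith(`.${h}`))) return null;

  const segments = url.pathname.split("/").filter((s) => s.length > 0);
  const first = segments[0];
  if (first === undefined) return null;
  // "/channel/UC...", "/c/name" and "/user/name" carry the handle in the second segment.
  const isPrefix = first === "channel" || first === "c" || first === "user";
  const handle = isPrefix ? segments[1] : first;
  return handle ? handle.replace(/^@/, "") : null;
}
```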

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 00:03:26 +09:00
Haewon Kam 5239ad7382 chore: add deno.json for new Edge Functions
Required by Supabase deploy to resolve @supabase/functions-js import.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 22:04:13 +09:00
Haewon Kam 7557ef774c feat: Pipeline V2 — 3-phase analysis with verified channel discovery
Restructured the entire analysis pipeline from AI-guessing social
handles to deterministic 3-phase discovery + collection + generation.

Phase 1 (discover-channels): 3-source channel discovery
  - Firecrawl scrape: extract social links from HTML
  - Perplexity search: find handles via web search
  - URL regex parsing: deterministic link extraction
  - Handle verification: HEAD requests + YouTube API
  - DB: creates row with verified_channels + scrape_data

Phase 2 (collect-channel-data): 9 parallel data collectors
  - Instagram (Apify), YouTube (Data API v3), Facebook (Apify)
  - 강남언니 (Firecrawl), Naver Blog + Place (Naver API)
  - Google Maps (Apify), Market analysis (Perplexity 4x parallel)
  - DB: stores ALL raw data in channel_data column

Phase 3 (generate-report): AI report from real data
  - Reads channel_data + analysis_data from DB
  - Builds channel summary with real metrics
  - AI generates report using only verified data
  - V1 backwards compatibility preserved (url-based flow)

Supporting changes:
  - DB migration: status, verified_channels, channel_data columns
  - _shared/extractSocialLinks.ts: regex-based social link parser
  - _shared/verifyHandles.ts: multi-platform handle verifier
  - AnalysisLoadingPage: real 3-phase progress + channel panel
  - useReport: channel_data column support + V2 enrichment merge
  - 강남언니 rating: auto-correct 5→10 scale + search fallback
  - KPIDashboard: navigate() instead of <a href>
  - Loading text: 20-30 seconds → 1-2 minutes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 21:49:13 +09:00