Commit Graph

88 Commits (cd2463fb2dece7361ffd0cb42f9a6f6f6ddc0ed8)

Author SHA1 Message Date
Haewon Kam cd2463fb2d fix: clinic_registry CSV 임포트 + NaverPlace 검색 개선
- VerifiedChannels에 naverPlace 필드 추가 (registry URL → placeId 전달)
- registryToVerifiedChannels: naver_place_url → placeId 추출하여 포함
- collect-channel-data NaverPlace 매칭 완화: exact match → contains match, 의원/병원 suffix 제거 shortName 사용, placeId 힌트로 검색 보강
- clinic_registry에 73개 병원 CSV 데이터 임포트 (올바른 YouTube/Blog/GangnamUnni/NaverPlace URL)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 14:22:59 +09:00
Haewon Kam e81c4cfee9 Merge branch 'claude/bold-hawking': 파이프라인 3대 버그 수정 + CLAUDE.md 업데이트 2026-04-10 13:52:40 +09:00
Haewon Kam 742c0f1bcc docs: CLAUDE.md 백엔드 파이프라인 실제 구현 반영
mock 데이터 기반이라는 잘못된 설명을 제거하고 실제 구현 상태로 업데이트:
- 4단계 Edge Functions 파이프라인 (discover→collect→generate-report→generate-content-plan)
- 실제 연동 API 목록 (YouTube/Apify/Naver/Firecrawl/Perplexity)
- DB 테이블 구조, _shared 유틸리티, 환경변수 정리
- 배포 방법 (Vercel 수동 + Supabase Functions)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 13:41:53 +09:00
Haewon Kam c753d8593f fix: 파이프라인 3대 핵심 버그 수정
- generate-report: Harness 4 groundTruth 주입 레이어 추가 (IG/YT/FB/NaverBlog/NaverPlace/GangnamUnni 필드 강제 주입, diagnosis 폴백, qualityReport DB 저장)
- discover-channels: CLINIC_NOT_REGISTERED 조기 종료 제거 + clinics 캐시 fast-path 추가 (14일 TTL, Firecrawl fallback 재활성화)
- collect-channel-data: silent skip → {status, reason, attemptedAt} 구조적 기록 (naverBlog/naverPlace/googleMaps/gangnamUnni)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 13:41:05 +09:00
Haewon Kam aabba1534b fix: 로딩 화면 분석 프로세스 텍스트 영어 통일
- planning 단계 'AI 콘텐츠 전략 생성 중...' → 'Generating AI content strategy...'
- labelDone '콘텐츠 플랜 생성 완료' → 'Content plan generated'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 17:24:13 +09:00
Haewon Kam c0c37b84de fix: naverBlog RSS 전환 + naverPlace DB-first 패턴 + 핵심문제진단 JSON 렌더링 버그 수정
- collect-channel-data: naverBlog 실시간 검색 제거 → verified handle 기반 RSS 직접 fetch
- collect-channel-data: naverPlace DB-first 패턴 (verified_channels에 저장된 데이터 우선 사용, 없을 때만 URL도메인 매칭 검색 후 DB에 저장)
- transformReport: ch.issues 배열 항목이 {issue, severity} 객체일 때 JSON.stringify 대신 .issue 문자열 추출
- ProblemDiagnosis: Lucide 아이콘 제거 → FilledIcons(ShieldFilled, FileTextFilled, LinkExternalFilled), 항목 구분자 ' ' → ' — '

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 17:17:22 +09:00
Haewon Kam 5ed35bc4cd fix: naverPlace 오매핑 수정 + naverBlog RSS 공식 블로그 스크래핑
- naverPlace: 매칭 우선순위 재정의 — 전체 클리닉명 정확 매칭 > 짧은명+성형 카테고리 > 성형 카테고리 순으로 변경 (기존 로직은 피부과까지 매칭되어 데이뷰의원 오매핑 발생)
- naverBlog: Firecrawl 공식 블로그 스크래핑 → Naver RSS 피드로 교체 (blog.naver.com은 Firecrawl 차단, RSS는 공개 엔드포인트)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 17:03:36 +09:00
Haewon Kam 2027ae9b64 feat: 마케팅 플랜 Phase 1~3 완성
- ContentCalendar: 드래그앤드롭(주차 내 요일 간 이동) + 엔트리 추가 버튼 + iCal Export
- BrandingGuide: 색상 팔레트 인라인 편집(스와치 클릭 → hex 팝오버) + DO/DON'T 2컬럼
- WorkflowTracker: 콘텐츠 제작 파이프라인(기획→AI초안→검토→승인→배포), 동영상/이미지+텍스트 분류
- RepurposingProposal: YouTube 인기 영상 리퍼포징 제안 아코디언 섹션
- AssetDetailModal: 에셋 카드 클릭 시 상세 모달
- 디자인 시스템 감사: Lucide 라인 아이콘 제거, 원색(pink/indigo/purple) 제거, 이모지 UI 제거
- "My Assets" → "나의 소재" 일관성 변경
- FilledIcons: DownloadFilled, RocketFilled 추가

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 16:44:21 +09:00
Haewon Kam b84410341f chore: archive-screenshots supabase-py 리팩터 + Storage 버킷 자동 생성
- sb_secret_* 신형 키 형식 지원을 위해 urllib → supabase-py 클라이언트로 전환
- ensure_bucket(): screenshots 버킷 없으면 public으로 자동 생성
- 41개 GCS 임시 스크린샷 → Supabase Storage 영구 URL로 아카이브 완료

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 15:32:40 +09:00
Haewon Kam 9c4d10609f feat: 스크린샷 리포트 반영 + 영구 저장 인프라 강화
- transformReport: channel_data.screenshots → report.screenshots 자동 매핑
- transformReport: youtubeAudit/instagramAudit/facebookAudit diagnosis에 evidenceIds 자동 연결 (채널별 스크린샷 → 진단 항목 연결)
- collect-channel-data: 스크린샷 아카이브를 병렬→순차로 변경 (rate-limit 방지), 실패 시 상세 로그
- scripts/archive-screenshots.py: 기존 GCS 임시 URL → Supabase Storage 일괄 재아카이브 스크립트 추가
- TypeScript 기존 에러 3개 수정 (SectionErrorBoundary, ClinicSnapshot, reviewCount 유니언 타입)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 14:43:25 +09:00
Haewon Kam 2d1937944a fix: 리포트 데이터 정확도 개선 + 강남언니·인스타그램 스크래핑 데이터 반영
- ClinicSnapshot: 내부 관리용 배지(Registry 검증·분점·등급) 병원 리포트에서 제거
- transformReport: Facebook 리뷰수 파싱 ("Not yet rated (3 Reviews)" 정규식 추출)
- transformReport: 네이버 플레이스 KPI 목표가 현재값보다 낮은 오류 수정 (동적 계산)
- transformReport: 네이버 블로그 방문자 "0(미운영)" → "검색 노출 N건 (방문자 비공개)"
- transformReport: 웹사이트+SNS 유입 "0%" → "측정 불가 (트래킹 미설치)"
- clinic_registry_working.csv: gangnam_unni_badges, gangnam_unni_procedures 컬럼 추가 (60개 병원)
- clinic_registry_working.csv: instagram_followers, instagram_posts 컬럼 추가 (64개 병원)
- INFINITH_Outbound_List.csv: 인스타그램 팔로워·게시물수 컬럼 추가 (64개 병원)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 14:18:31 +09:00
Haewon Kam 9991c672a1 feat: seed-clinic-data.sql — registry + gangnamUnni + doctor data
Single SQL file runnable in Supabase SQL Editor that:
1. Creates clinic_registry table with RLS
2. Inserts top 13 premium clinics from CSV (UPSERT on domain)
3. Patches 뷰성형외과 channel_data with gangnamUnni (9.1/10, 18840 reviews, 5 doctors)
4. Patches report.clinicInfo with leadDoctor (최순우) + staffCount (28)
5. Patches scrape_data with registry source metadata

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 10:32:05 +09:00
Haewon Kam 6e8f6940bf fix: gangnamUnni always-try + leadDoctor in Perplexity prompt
- collect-channel-data: gangnamUnni scraping no longer requires
  verified=true. Fallback: Firecrawl search for gangnamunni.com URL
  when discover-channels failed to verify. Solves empty ratings/reviews.
- generate-report: Perplexity prompt now explicitly requests leadDoctor
  (name, specialty, rating, reviewCount) and staffCount in clinicInfo.
- transformReport: clinicInfo type extended with leadDoctor + staffCount;
  transformation prefers clinic.leadDoctor over doctors[0] fallback.

Root cause: clinic_registry table not yet in DB → discover-channels
always falls back to API search → gangnamUnni URL not found →
collect-channel-data skips gangnamUnni → all clinic metrics empty.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 10:29:10 +09:00
Haewon Kam 2cda26a649 feat: per-URL clinic folder — auto-save all scraped data to Storage
Each analysis run now creates a dedicated folder in Supabase Storage:
  clinics/{domain}/{reportId}/
    ├── scrape_data.json    (discover-channels: website scrape + Perplexity)
    ├── channel_data.json   (collect-channel-data: all channel API results)
    └── report.json         (generate-report: final AI-generated report)

Screenshots also moved from {reportId}/{id}.png to:
  clinics/{domain}/{reportId}/screenshots/{id}.png

Migration: 20260407_clinic_data_storage.sql creates 'clinic-data' bucket
(private, 10MB/file, JSON only). All writes are non-fatal — pipeline
continues even if Storage upload fails.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 10:04:52 +09:00
Haewon Kam ae87953fa0 feat: Registry-verified badge + registryData data flow + V3 error recording
- ClinicSnapshot.tsx: 'Registry 검증' badge (ShieldCheck icon), district/branches/brandGroup pills, external links (강남언니/네이버플레이스/구글맵) when source=registry
- report.ts: add source and registryData fields to ClinicSnapshot type
- transformReport.ts: ApiMetadata now accepts source/registryData; passes to clinicSnapshot
- useReport.ts: DB load path extracts scrape_data.source + scrape_data.registryData → transformApiReport
- V3 dual-write error recording: discover-channels, collect-channel-data, generate-report now write error_message + error status to analysis_runs on catch instead of silently swallowing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 10:01:19 +09:00
Haewon Kam 36d2f1cf49 feat: archive Firecrawl screenshots to Supabase Storage (permanent URLs)
## 문제
Firecrawl이 반환하는 스크린샷 URL은 GCS Signed URL로 7일 후 만료.
리포트에 저장된 이미지 URL이 일주일 후 전부 깨짐 (403 Access Denied).

## 해결
collect-channel-data의 Vision 단계에 아카이빙 스텝 추가.
캡처 직후 base64(이미 메모리에 있음)를 Supabase Storage에 영구 업로드.

### 처리 흐름 (변경 후)
1. captureAllScreenshots() → GCS URL + base64 반환 (기존)
2. [신규] archiveTasks: base64 → Supabase Storage 업로드 (병렬)
   - 경로: screenshots/{reportId}/{screenshotId}.png
   - 성공 시 ss.url을 영구 Supabase URL로 in-place 교체
   - 실패 시 non-fatal — GCS URL fallback으로 Vision 분석 계속 진행
3. runVisionAnalysis() — base64 여전히 메모리에 있어 정상 실행 (기존)
4. channelData.screenshots 저장 시 영구 URL 사용 (자동)
   - archived: true/false 플래그 추가 (모니터링용)

### 비용/성능
- 추가 API 호출 없음 (base64 이미 캡처 시 다운로드됨)
- 업로드: ~1-3초/장 (병렬), 5MB limit, PNG/JPEG/WebP 허용
- 버킷: public (URL만 있으면 열람) + 서비스 역할만 업로드 가능

## 마이그레이션
supabase/migrations/20260407_screenshots_storage.sql
- screenshots 버킷 생성 (public, 5MB limit)
- RLS: public read / service_role write
- delete_old_screenshots() 함수: 90일 이상 된 파일 정리 (pg_cron 연동 가능)

## 타입
ScreenshotResult.archived?: boolean 필드 추가 (영구 vs GCS fallback 구분)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 09:51:31 +09:00
Haewon Kam bcc0b6ea5e fix: pipeline P0/P1 — rating bug, retry, health score, blog scrape
## P0 버그 수정 (즉시 영향)

### fix(collect-channel-data): 강남언니 rating 오변환 제거
- 기존: `rating ≤ 5 → ×2` 로직으로 4.8/10을 9.6/10으로 잘못 변환
- Firecrawl 프롬프트가 이미 0-10 반환 지시 → rawValue 직접 신뢰

### fix(generate-report): Perplexity 단일 fetch → fetchWithRetry
- maxRetries:2, backoffMs:[5000,15000], timeoutMs:90s 설정
- 기존: 일시적 429/타임아웃 시 리포트 생성 전체 실패

## P1 기능 추가 (데이터 품질)

### feat(collect-channel-data): channel_snapshots health_score 계산
- `computeHealthScore(channel, data)` 함수 추가 (채널별 0-100 스코어)
- Instagram: followers 기반 선형 보간 + posts bonus
- YouTube: subscribers 기반 + video count bonus
- 강남언니: rating×7 + reviews bonus (max 30pt)
- Google Maps: rating×12 + reviews bonus (max 40pt)
- Naver Blog: presence (50pt) + 언급 수 bonus (max 30pt)
- 모든 channel_snapshots INSERT에 health_score 포함

### feat(collect-channel-data): 네이버 블로그 공식 컨텐츠 스크랩 추가
- 기존: Naver Search API로 3rd-party 언급만 수집
- 추가: Registry에서 확인된 공식 블로그 URL을 Firecrawl로 직접 스크랩
  - 총 게시글 수, 최근 게시물 (제목/날짜/요약), 카테고리 추출
  - 실패 시 non-critical — 기존 Naver Search 결과는 항상 유지

## docs: PIPELINE_IMPROVEMENT_PLAN 감사 결과 반영
- Sprint 0 (Vision), Sprint 1, Sprint 2 완료 표시
- WP-10, WP-11 완료 표시
- 2026-04-07 전수 감사 섹션 추가 (구현 완료/수정/남은 Gap 표)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 09:35:20 +09:00
Haewon Kam d5f7f24e0a feat: clinic registry DB + pipeline audit P0 fixes
## Clinic Registry
- data/clinic-registry/clinic_registry_working.csv — 91개 병원 채널 마스터 DB
- data/clinic-registry/INFINITH_Outbound_List.csv — BD팀 아웃바운드 리스트 (17컬럼)
- data/clinic-registry/update_csv.py — 안전 CSV 업데이트 스크립트 (빈 필드만 채움)
- data/clinic-registry/extract_place_ids.py — 네이버 플레이스 ID 추출기
- scripts/import-registry.ts — CSV → Supabase clinic_registry 테이블 임포트
- supabase/migrations/20260406_clinic_registry.sql — clinic_registry 테이블 스키마

## Pipeline P0 Bug Fixes (전수 감사 후)
- fix(collect-channel-data): 강남언니 rating 0-10 스케일 오변환 제거
  - 기존: rating ≤ 5이면 ×2 → 4.8/10을 9.6/10으로 잘못 변환
  - 수정: Firecrawl 프롬프트가 이미 0-10 지시 → rawValue 직접 신뢰
- fix(generate-report): Perplexity 단일 fetch → fetchWithRetry 교체
  - maxRetries:2, backoffMs:[5000,15000], timeoutMs:90s
  - 기존: 타임아웃/429 시 리포트 생성 전체 실패
  - 수정: 자동 재시도로 일시적 API 오류 극복

## Docs
- docs/PIPELINE_IMPROVEMENT_PLAN.md — Sprint 0/1/2 완료 표시 + 전수 감사 결과 추가
- docs/REGISTRY_FUNCTIONAL_SPECS.md, DB_SCHEMA_V3.md 외 기획 문서 다수 추가

## New Components & Features
- supabase/functions/generate-content-plan, adjust-strategy — 콘텐츠 플랜/전략 조정
- src/components/plan/EditEntryModal, StrategyAdjustmentSection — 플랜 편집 UI
- supabase/functions/_shared/dataQuality, foundingYearExtractor, urlClassifier — 데이터 품질 유틸

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 09:33:25 +09:00
Haewon Kam ec991057e6 feat: add API Dashboard + filled icons + pipeline improvements
- Add /api-dashboard page with API connection status, env var checker,
  pipeline flow diagram, and cost estimator
- Add 15 new filled SVG icons (Shield, Database, Server, Bolt, Eye,
  Copy, Check, Cross, Warning, Refresh, Flow, Coin, LinkExternal etc.)
- Follow INFINITH design system: no emoji, no line icons, semantic
  status colors, diagonal shadows, brand gradients, font-serif headings
- Improve Vision Analysis with base64 encoding fix
- Add SectionErrorBoundary for graceful section-level error handling
- Add Google Places API utility (prepared for future migration)
- Fix Edge Function auth headers and report generation pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 14:59:31 +09:00
Haewon Kam 2ca9ec0306 fix: YouTube name matching + Facebook domain fallback in channel discovery
YouTube now verifies all candidates and picks best match by channel title.
Facebook tries all candidates with domain-name fallback when Firecrawl returns empty.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 12:15:37 +09:00
Haewon Kam d1157da39c fix: restore button active state + inject gangnamunni Vision data into report
- Hero button: gray when empty, accent gradient when URL entered
- generate-report: force-inject gangnamUnniStats from Vision Analysis
  into channelAnalysis (score, rating, reviews, doctors)
- Add gangnamunni Vision data to prompt context

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 11:53:33 +09:00
Haewon Kam 82e9ec6cc0 fix: correct base64 encoding for Vision Analysis screenshots
- Previous chunked btoa approach encoded each chunk independently,
  producing corrupted base64 that Gemini couldn't parse (returned {})
- Now builds complete binary string first, then encodes once with btoa
- Added screenshot debug info to channel errors for diagnostics
- Confirmed: foundingYear 2004, doctors, gangnamunni data all extracted

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 11:38:16 +09:00
Haewon Kam 79950925a1 fix: add Authorization header to all Edge Function calls + fix Vision Analysis
- All fetch calls to Supabase Edge Functions now include
  Authorization: Bearer <anon_key> (was missing → 401 errors)
- Fix Firecrawl screenshot API: remove invalid screenshotOptions,
  use "screenshot@fullPage" format (v2 API compatibility)
- Fix screenshot response handling: v2 returns URL not base64,
  now downloads and converts to base64 for Gemini Vision
- Add about page to Vision Analysis capture targets
- Add retry utility, channel error tracking, pipeline resume,
  enrichment retry, EmptyState improvements (Sprint 2-3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 10:08:03 +09:00
Haewon Kam 2f2aa5a5b6 docs: update DB V3 checklist — Phase 1-4 implemented
Phase 1:  DB migration (9 tables + 2 views)
Phase 2:  discover-channels dual-write
Phase 3:  collect-channel-data dual-write
Phase 4:  generate-report dual-write
Phase 5:  Performance loop (pending)
Phase 6:  Frontend transition (pending)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:54:52 +09:00
Haewon Kam 7fe3ff82c9 feat: DB V3 dual-write — clinics + analysis_runs + channel_snapshots
Phase 2-4 of SaaS schema migration. All Edge Functions now write to
BOTH legacy marketing_reports AND new V3 tables:

discover-channels:
  - UPSERT clinics (url-based dedup)
  - INSERT analysis_runs (status: discovering)

collect-channel-data:
  - INSERT channel_snapshots (one per channel — time-series!)
  - INSERT screenshots (evidence rows)
  - UPDATE analysis_runs (raw_channel_data, vision_analysis)

generate-report:
  - UPDATE analysis_runs (report, status: complete)
  - UPDATE clinics (last_analyzed_at, established_year)

Frontend passes clinicId + runId through all 3 phases.
Legacy marketing_reports still written for backward compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:51:11 +09:00
Haewon Kam e011ef7357 docs: DB Schema V3 — SaaS multi-tenant with time-series + performance loop
9 tables: clinics, analysis_runs, channel_snapshots, screenshots,
content_plans, channel_configs, performance_metrics, content_performance,
strategy_adjustments

2 views: channel_latest, channel_weekly_delta

Key features:
- Clinic-centric (1 hospital = 1 row, multiple analyses)
- Time-series channel metrics (INSERT-only snapshots)
- Performance → Strategy loop (weekly KPI tracking → auto-adjust plans)
- Content performance tracking (individual post metrics)
- 8-phase implementation checklist with verification criteria

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:45:34 +09:00
Haewon Kam 6a3390840d feat: Vision Analysis — screenshot capture + Gemini Vision extraction
WP-V1: Multi-page screenshot capture via Firecrawl
  - Captures 6+ pages: main, doctors, surgery, YouTube, Instagram, 강남언니
  - Runs in parallel within collect-channel-data Phase 2

WP-V2: Gemini Vision analysis per screenshot
  - Page-specific prompts (main page OCR, doctor profiles, channel stats)
  - Extracts: founding year, doctors, certifications, services, social icons,
    brand colors, slogans, YouTube/Instagram stats from screenshots

WP-V3: Vision data pipeline integration
  - channel_data.visionAnalysis: merged structured data
  - channel_data.screenshots[]: evidence for report EvidenceGallery
  - generate-report embeds screenshots as report.screenshots[]
  - buildChannelSummary includes Vision data in AI prompt

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:59:19 +09:00
Haewon Kam 494dc186c5 docs: update Sprint 0 Vision — multi-page screenshots + channel evidence in report
WP-V1: 6+ page screenshots (main, doctors, surgery, YouTube landing,
  Instagram profile, 강남언니) stored in Supabase Storage
WP-V2: Gemini Vision analysis per screenshot (OCR + structured extraction)
WP-V3: Screenshots as evidence in report (connected to existing
  EvidenceGallery/Lightbox system)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:55:40 +09:00
Haewon Kam 3f1a25e298 docs: add Sprint 0 Vision Analysis to pipeline improvement plan
Vision analysis addresses the critical gap that text-only scraping
misses ~40% of clinic website information (founding year in banners,
doctor photos, certification marks, social icons in images).

Sprint 0 adds: Firecrawl screenshot → Gemini Vision → structured
data extraction for founding year, doctors, certifications, services,
social icons, floating buttons, brand colors, slogans.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:46:56 +09:00
Haewon Kam ed37f23f78 feat: extract social links from JS-rendered buttons on clinic website
Added A4 parallel Firecrawl call with actions: [wait 3s, scrape]
to execute JavaScript and extract social button href URLs from
header/footer. This is the most reliable source — most Korean
clinics have Facebook/Instagram/YouTube/Blog icons in their nav.

Results merged as Source 3 (buttonHandles) alongside HTML links,
JSON extraction, and API searches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:41:27 +09:00
Haewon Kam 80c57147e7 feat: Sprint 1 — 7 data quality quick wins
WP-1: YouTube channel ID regex {20,} → {22} (exactly 24 chars)
WP-2: Naver Place category filtering in enrich-channels (성형/피부)
WP-3: Google Maps stores mapsUrl separately from clinicWebsite
WP-4: Naver Blog separates officialBlogUrl from search results
WP-5: 강남언니 rawRating + normalized rating (≤5 → ×2), Firecrawl
      prompt explicitly states "out of 10, NOT out of 5"
WP-6: Perplexity model centralized in _shared/config.ts (env override)
WP-7: Apify Instagram timeout 30s → 45s

Frontend: transformReport uses mapsUrl and officialBlogUrl when available

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:35:40 +09:00
Haewon Kam 1071328574 docs: pipeline improvement plan with 15 work packages and verification checklist
Comprehensive audit of discover→collect→generate pipeline found:
- 16 silent failures, 8 data quality issues, 0 error recovery, 6 API issues
- Organized into 4 sprints (15 WPs, ~11h total)
- Each WP has file locations, changes, and verification criteria
- Checkbox format for progress tracking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:26:22 +09:00
Haewon Kam 29c1faf49e fix: correct OtherChannels URLs — Google Maps, Naver Blog, Naver Place
Google Maps: was using gm.website (clinic's own site) → now always
generates maps.google.com/search URL

Naver Blog: was linking to first search result post (random personal
blog) → now links to Naver blog search results page

Naver Place: np.link was the clinic's own website, not Naver Place →
now generates map.naver.com search URL. Also fixed collect-channel-data
to search with "성형외과" suffix and match by category (성형/피부) to
avoid same-name dental clinics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 21:25:26 +09:00
Haewon Kam a02c83155e fix: remove duplicate Facebook link at bottom — now only on page name
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 16:28:46 +09:00
Haewon Kam 4928d24ace feat: consistent ExternalLink on all channel names — Facebook + OtherChannels
Facebook pageName now links to facebook.com/{url} with ExternalLink icon.
OtherChannels: moved ExternalLink from right-end to inline with channel
name, matching the Instagram/YouTube pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 16:23:29 +09:00
Haewon Kam 2a35108149 fix: '의료진 3명' → '전문의 3명' — staffCount는 강남언니 등록 의사 수
staffCount is the number of registered doctors from 강남언니, not total
staff. Label changed from 의료진(medical staff) to 전문의(specialists)
to accurately reflect the data source. Diagnosis message also updated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 16:21:10 +09:00
Haewon Kam 66b4826f55 feat: add clickable source links to all report sections
YouTubeAudit: handle links to youtube.com/@{handle} with ExternalLink icon
InstagramAudit: handle links to instagram.com/{handle} with ExternalLink icon
ClinicSnapshot: domain is now clickable link, phone is tel: link
OtherChannels: Google Maps generates search URL, Naver Blog links to
  first blog post or search results (previously empty string)
transformReport: fills missing URL fields for Google Maps and Naver Blog

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 11:02:35 +09:00
Haewon Kam bb7b08e35c docs: comprehensive AI prompts catalog v2 — all pipeline prompts with engineering learnings
Includes every prompt in production across 3 pipeline phases:
- Phase 1 (discover): 7 data sources with exact prompts
- Phase 2 (collect): 5 API calls + 4 market analysis queries
- Phase 3 (generate): full report generation prompt template

Added prompt engineering learnings:
- Short prompts outperform long ones on Perplexity sonar
- sonar > sonar-pro for channel search
- English clinic name in parentheses improves international results
- Verify strategy: keep unverified handles as candidates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 10:01:23 +09:00
Haewon Kam 1fb1de8303 fix: keep unverified Instagram handles as candidates for collection
Instagram HEAD requests often fail (rate limiting, blocking) causing
valid handles to be dropped. Now all discovered handles are kept
(verified or not) and Apify attempts collection on all of them.
Apify's own scraper validates existence more reliably than HEAD requests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:38:12 +09:00
Haewon Kam 087f65eec1 fix: revert to single Perplexity query with proven prompt pattern
Split queries performed worse. The proven working pattern is:
- Single query with Korean+English clinic name
- "검색해서 찾아줘. 검색 결과에서 발견된 계정을 모두 알려줘" phrasing
- All channels in one request
- English name in parentheses helps Perplexity find international accounts

Tested: "그랜드성형외과 (Grand Plastic Surgery)" → finds Instagram,
YouTube, Facebook, TikTok, Naver Blog all in one call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:36:36 +09:00
Haewon Kam 5157cf446a fix: split Perplexity into 3 focused queries matching research methodology
Single mega-query returns empty results. Split into:
B4a. Instagram + YouTube (most important, focused search)
B4b. Facebook + TikTok + Naver Blog + Kakao
B4c. 강남언니 + review platforms

Each query is short and focused — matches the proven pattern of
2-5 keyword searches that Perplexity handles well.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:34:45 +09:00
Haewon Kam ac2da7a4ac fix: simplify Perplexity prompt — short system + direct user query
Long system prompt caused sonar-pro to return empty results.
Reverted to sonar model with short, proven prompt pattern that
matches the user's successful manual test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:32:54 +09:00
Haewon Kam e64d168d34 feat: Perplexity sonar-pro research agent with structured online presence analysis
Replaced simple "find handles" prompt with comprehensive research agent:
- Model: sonar → sonar-pro (advanced multi-step web search)
- System prompt: full research methodology with 2-3 keyword searches,
  URL fetching, quantitative data extraction
- Output: structured JSON with channels (handles + follower counts +
  subscriber counts) + platforms (강남언니 rating, reviews)
- Research results saved to scrape_data.onlinePresenceResearch for
  downstream use in collect-channel-data and generate-report

Added _shared/researchPrompt.ts with prompt template + builder.
Updated agent documentation in doc/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:31:00 +09:00
Haewon Kam c74832d764 feat: Perplexity Online Presence 종합 분석 + Apify Instagram 검색
B4 Perplexity: rewrote from narrow "find social accounts" to broad
"Online Presence 종합 분석" — finds Instagram, YouTube, Facebook,
TikTok, Naver, Kakao, 강남언니, 바비톡 in one query.

B5 Apify Instagram: generates handle candidates from clinic name
(english name, domain, _official, _ps, _clinic variants) and directly
checks each via Apify instagram-profile-scraper. Finds accounts that
web search misses.

Removed redundant B4b (platform presence) — now merged into B4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:24:56 +09:00
Haewon Kam 64669888c2 fix: type-safe string handling in extractSocialLinks/mergeSocialLinks
API results may contain null, numbers, or objects instead of strings.
Now coerces all values to strings before processing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:17:49 +09:00
Haewon Kam f224d1788c feat: API-first channel discovery — YouTube API + Naver API + Firecrawl Search + Perplexity
Replaced Perplexity-only approach with 5 parallel direct API searches:

B1. YouTube Data API: search?type=channel&q={clinicName} → find channel
B2a. Naver Blog API: search blog.json → find official Naver blog
B2b. Naver Web API: search webkr.json → find Instagram/YouTube/Facebook URLs
B3. Firecrawl Search: web search → extract social URLs from results
B4. Perplexity: supplement — catch what direct APIs missed

All 5 sources run in parallel after Stage A (Firecrawl scrape for clinicName).
Results merged + deduplicated + verified. Perplexity is now a fallback,
not the primary source.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:15:49 +09:00
Haewon Kam 159de36e38 docs: add AI prompts catalog for all pipeline functions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:09:33 +09:00
Haewon Kam 25aece2366 fix: Perplexity prompt rewrite + clinicName fallback via AI
Perplexity prompts changed from "find verified accounts" (returns all
null) to "search and report what you find" (returns actual handles).
Added clinicName resolution: Firecrawl Korean → English → Perplexity
URL-to-name lookup → domain fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:07:04 +09:00
Haewon Kam 122b1915f0 fix: 2-stage discovery — Firecrawl first for clinicName, then Perplexity
Previously Firecrawl and Perplexity ran in parallel, so Perplexity
received raw URL instead of clinic name → poor search results.

Now:
Stage A: Firecrawl scrape+map (parallel) → extract clinicName from HTML
Stage B: Perplexity searches using extracted clinicName → finds Instagram,
  YouTube, Facebook handles that Firecrawl HTML parsing missed
Stage C: Merge 3 sources + verify all handles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:02:30 +09:00
Haewon Kam df8f84c3b9 fix: YouTube channel ID (UC...) handling + handle-to-channelId resolution
discover-channels: extractHandle('youtube') now detects UC* channel IDs
and returns them without @ prefix (previously @UC... caused verify fail)

verifyHandles: verifyYouTube uses cleanHandle for UC* check, requests
part=id,snippet for richer data

collect-channel-data: if channelId missing but handle present, resolves
via forHandle/forUsername lookup or direct UC* detection before skipping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:00:21 +09:00