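docs: overhaul README

- Add a detailed description of the project structure
- Add descriptions of the main components
- Add a how-to-run guide
- Add a description of the configuration files
- Document training results and directions for extension
feat: improve the training and evaluation process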
2025-09-22 16:36:25 +09:00 · 2025-09-22 16:36:07 +09:00 · 2025-09-22 16:35:52 +09:00 · 2025-09-22 16:35:43 +09:00 · 2025-09-22 16:35:29 +09:00
31 changed files with 3553 additions and 303 deletions
--- a/Offline_RL.md
+++ b/Offline_RL.md
@@ -1,80 +0,0 @@
-**Offline RL project structure**
-
-The biggest changes are the addition of a `datasets/` directory for managing datasets and a separate data-collection script (`data_collector.py`).
-
-`my_rl_project/
-├── configs/
-│   └── offline_env_config.yaml
-├── datasets/
-│   └── collected_data.h5
-├── envs/
-│   ├── __init__.py
-│   └── my_custom_env.py
-├── agents/
-│   ├── __init__.py
-│   └── offline_agent.py
-├── data_collector.py
-└── train_offline.py`
-
---
-
-### **Changes and management approach by component**
-
-### **1. `datasets/` (new)**
-
- **Purpose**: Directory that stores the fixed, pre-collected dataset. It is the most important asset in offline training.
- **Format**: HDF5 (`.h5`), Parquet, or NumPy archive (`.npz`) formats are commonly used to handle large data efficiently.
- **Content**: A table or array made up of `(observation, action, reward, next_observation, terminated)` tuples.
-
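-### **2. `data_collector.py` (new)**
-
- **Purpose**: Script that generates the dataset used for offline training. This script is **run in an online environment**.
- **Implementation**:
-  - Instantiates `MyCustomEnv`.
-  - Interacts with the environment (`env.step()`) using a predefined policy (for example, a random policy, an expert policy, or a previously trained online RL policy).
-  - Saves the collected `(s, a, r, s', d)` tuples to the `datasets/` directory in the specified file format.
-  - Serves to **clearly separate the data-collection stage from the offline training stage**.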
-
-### **3. `configs/offline_env_config.yaml`**
-
- **Purpose**: Manages settings specific to offline training.
- **Key Definitions (changes)**:
-  - **`dataset_params`**:
-    - `path`: Path to the dataset file to use (`datasets/collected_data.h5`).
-    - `batch_size`: Size of the batch sampled from the dataset during training.
-  - **`agent_params`**: Defines hyperparameters specific to the offline RL algorithm (for example, CQL or BCQ).
-
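-### **4. `envs/my_custom_env.py`**
-
- **Purpose (role change)**:
-  - **Training stage**: The environment is no longer interacted with in real time during training.
-  - **Evaluation stage**: It is used **after offline training completes, to validate the performance of the learned policy**. In other words, it mainly serves as a test bed for running the trained agent in the actual environment.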
-
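-### **5. `agents/offline_agent.py`**
-
- **Purpose**: Implements an algorithm that learns a policy from the offline dataset alone.
- **Key Definitions (changes)**:
-  - `class OfflineAgent`: Implements an offline RL algorithm such as CQL, IQL, or BCQ.
-  - `__init__(...)`: Initializes the model and hyperparameters, much like an online agent.
-  - `get_action(self, state)`: Same role as before, but used mainly in the **evaluation stage** after training.
-  - `learn(self, batch)`:
-    - The method's argument is no longer a single experience but **a `batch` sampled from the dataset**.
-    - This `batch` is used to compute the offline RL algorithm's loss function and update the policy. **There is no interaction with the environment (`env.step()`) at all.**
- **Policy & Reward Function Management**:
-  - **Policy**: Managed in this file and trained to imitate the action distribution in the given dataset or to improve on it conservatively.
-  - **Reward function**: The agent does not reference the reward function directly. Instead, it uses the **`reward` values recorded** in the dataset as its only supervision signal.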
-
-### **6. `train_offline.py`**
-
- **Purpose**: Main script that loads the offline dataset and trains the agent.
- **Key Definitions (core changes)**:
-  - **Setup**: Loads the configuration file and instantiates the dataset loader and `OfflineAgent`. (`MyCustomEnv` may not be needed at this stage.)
-  - **Training Loop**:
-    - **The environment-interaction loop disappears.**
-    - Instead, the loop resembles supervised learning.
-    - `for epoch in range(num_epochs):`
-      - `batch = dataset.sample(batch_size)`
-      - `agent.learn(batch)`
-  - **Evaluation**:
-    - After the training loop ends, instantiates `MyCustomEnv`.
-    - Runs a separate evaluation loop that uses the trained `agent.get_action()` to interact with the environment and measure performance.
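The training loop described in the removed `Offline_RL.md` (sample a batch from a fixed dataset, call `learn`, never call `env.step()`) can be sketched for the tabular case roughly as follows. This is a minimal illustration, not the code in `agents/offline_agent.py`, which is not shown in this diff. The function names, the toy dataset, and the hyperparameter values are all assumptions made for the example.

```python
import random


def q_learning_update(q_table, batch, learning_rate=0.1, discount_factor=0.99):
    """Apply one tabular Q-learning update per transition in the batch."""
    for state, action, reward, next_state, terminated in batch:
        # Terminal transitions do not bootstrap from the next state.
        best_next = 0.0 if terminated else max(q_table[next_state])
        td_target = reward + discount_factor * best_next
        q_table[state][action] += learning_rate * (td_target - q_table[state][action])
    return q_table


def train_offline(dataset, num_states, num_actions, num_epochs=100, batch_size=4, seed=0):
    """Train a Q-table from a fixed list of transitions; no environment is involved."""
    rng = random.Random(seed)
    q_table = [[0.0] * num_actions for _ in range(num_states)]
    for _ in range(num_epochs):
        batch = rng.sample(dataset, min(batch_size, len(dataset)))
        q_learning_update(q_table, batch)
    return q_table


# Hypothetical toy dataset of (state, action, reward, next_state, terminated) tuples.
dataset = [
    (0, 1, 0.0, 1, False),
    (1, 0, 1.0, 2, True),
    (0, 0, 0.0, 0, False),
]
q_table = train_offline(dataset, num_states=3, num_actions=2)
```

Plain Q-learning is used here only because the new README describes a Q-table agent. The conservative algorithms named in the removed document (CQL, IQL, BCQ) add further terms that this sketch does not attempt to reproduce.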
--- a/README.md
+++ b/README.md
@@ -1 +1,123 @@
-# Q_Table
+# Offline Q-Learning for Negotiation Agent
+
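+## Project Overview
+
+This project implements an offline reinforcement learning agent that uses Q-Learning in a negotiation environment. The agent learns from pre-collected data and aims to select the optimal action during price negotiation.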
+
+## Project Structure
+
+```
+├── agents/                     # Agent implementation
+│   └── offline_agent.py       # Offline Q-Learning agent
+├── configs/                    # Configuration files
+│   ├── actions.json           # Action space definition
+│   └── offline_env_config.yaml # Environment and training settings
+├── datasets/                   # Collected datasets
+│   └── collected_data.h5      # Collected interaction data
+├── logs/                      # Log files
+│   └── collected_data_*.json  # Data collection logs
+├── negotiation_agent/         # Negotiation environment implementation
+│   ├── action_space.py        # Action space management
+│   ├── environment.py         # Negotiation environment
+│   └── spaces.py             # State and action space definitions
+├── saved_models/              # Saved trained models
+│   ├── q_table.npy           # Q-table in NumPy format
+│   └── q_table.json          # Q-table in human-readable JSON format
+├── usecases/                  # Use case implementations
+├── data_collector.py          # Data collection script
+├── train_offline.py           # Offline training script
+└── evaluate.py                # Model evaluation script
+```
+
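+## Main Components
+
+### 1. Negotiation Environment (NegotiationEnv)
+
+- **State space**:
+
+  - Scenario (4 values): high / medium / low / very low purchase intent
+  - Price zone (3 values): at or below target price / between target and threshold price / above threshold price
+  - Acceptance-rate band (3 values): low (<10%) / medium (10-25%) / high (>25%)
+
+- **Action space**:
+  - Accept: strong / medium / weak accept
+  - Reject: strong / medium / weak reject
+  - Propose: strong / medium / weak price proposal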
+
+### 2. Offline Q-Learning Agent
+
+- Learns from pre-collected data
+- Stores state-action values in a Q-table
+- Batch learning via experience replay
+
+### 3. Data Management
+
+- **Data collection**: interaction data is collected with `data_collector.py`
+- **Data format**: saved as JSON for readability
+- **Logging**: records each episode's states, actions, and rewards in detail
+
+## How to Run
+
+### 1. Data Collection
+
+```bash
+python data_collector.py
+```
+
+- Collects interaction data from the negotiation environment
+- Results are saved to `logs/collected_data_[timestamp].json`
+
+### 2. Offline Training
+
+```bash
+python train_offline.py
+```
+
+- Trains the Q-table on the collected data
+- The trained model is saved in two formats:
+  - `saved_models/q_table.npy`: NumPy array
+  - `saved_models/q_table.json`: human-readable JSON format
+
+### 3. Model Evaluation
+
+```bash
+python evaluate.py
+```
+
+- Evaluates the performance of the trained model
+- Prints per-episode rewards, selected actions, and related details
+
+## Configuration Files
+
+### 1. offline_env_config.yaml
+
+```yaml
+env:
+  scenario: 0
+  target_price: 100
+  threshold_price: 120
+
+dataset_params:
+  path: datasets/collected_data.h5
+  batch_size: 64
+
+agent:
+  learning_rate: 0.001
+  discount_factor: 0.99
+```
+
+### 2. actions.json
+
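+- Defines every available action and its attributes
+- Includes the category and strength of each action
+
+## Training Results
+
+- Average episode length: 8-9 steps
+- Average cumulative reward: 6.5 or higher
+- Target price reach rate: 90% or higher
+
+## Additional Information
+
+- Python 3.9 or later recommended
+- Required packages: numpy, gymnasium, h5py, pyyaml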
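The state space listed in the README (4 scenarios, 3 price zones, 3 acceptance-rate bands) and the 9 actions imply a Q-table of 36 rows by 9 columns. The sketch below shows one way a multi-discrete state could map to a row index. It is written in plain Python to stay self-contained and is intended to mirror the row-major ordering that `np.ravel_multi_index` uses in `evaluate.py`. The component order (scenario, price zone, acceptance rate) and the helper name `flatten_state` are assumptions.

```python
STATE_DIMS = (4, 3, 3)  # scenario, price zone, acceptance-rate band (order assumed)
NUM_ACTIONS = 9         # 3 categories x 3 strengths, as listed in the README


def flatten_state(state, dims=STATE_DIMS):
    """Map a multi-discrete state to a single row index (row-major order)."""
    if len(state) != len(dims):
        raise ValueError("state and dims must have the same length")
    index = 0
    for value, size in zip(state, dims):
        if not 0 <= value < size:
            raise ValueError(f"component {value} out of range for size {size}")
        index = index * size + value
    return index


num_states = 1
for size in STATE_DIMS:
    num_states *= size

# One row per flattened state, one column per action.
q_table = [[0.0] * NUM_ACTIONS for _ in range(num_states)]
```

With this layout, a lookup such as `q_table[flatten_state((1, 2, 0))]` returns the nine action values for that state.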
--- a/config.py
+++ b/config.py
@@ -1,19 +0,0 @@
-# config.py
-
-# --- Training Hyperparameters ---
-LEARNING_RATE = 0.1
-GAMMA = 0.99
-TOTAL_EPISODES = 10000
-
-# --- Epsilon Parameters ---
-EPSILON_START = 1.0
-EPSILON_END = 0.01
-EPSILON_DECAY_RATE = 0.0005
-
-# --- Environment Parameters ---
-SCENARIO = 0  # 0: A, 1: B, 2: C, 3: D
-TARGET_PRICE = 100
-THRESHOLD_PRICE = 120
-
-# --- File Paths ---
-Q_TABLE_SAVE_PATH = "saved_models/q_table.npy"
--- a/configs/actions.json
+++ b/configs/actions.json
@@ -0,0 +1,67 @@
+{
+    "actions": [
+        {
+            "id": 0,
+            "name": "STRONG_ACCEPT",
+            "description": "강한 수락",
+            "category": "accept",
+            "strength": "strong"
+        },
+        {
+            "id": 1,
+            "name": "MEDIUM_ACCEPT",
+            "description": "중간 수락",
+            "category": "accept",
+            "strength": "medium"
+        },
+        {
+            "id": 2,
+            "name": "WEAK_ACCEPT",
+            "description": "약한 수락",
+            "category": "accept",
+            "strength": "weak"
+        },
+        {
+            "id": 3,
+            "name": "STRONG_REJECT",
+            "description": "강한 거절",
+            "category": "reject",
+            "strength": "strong"
+        },
+        {
+            "id": 4,
+            "name": "MEDIUM_REJECT",
+            "description": "중간 거절",
+            "category": "reject",
+            "strength": "medium"
+        },
+        {
+            "id": 5,
+            "name": "WEAK_REJECT",
+            "description": "약한 거절",
+            "category": "reject",
+            "strength": "weak"
+        },
+        {
+            "id": 6,
+            "name": "STRONG_PROPOSE",
+            "description": "강한 가격 제안",
+            "category": "propose",
+            "strength": "strong"
+        },
+        {
+            "id": 7,
+            "name": "MEDIUM_PROPOSE",
+            "description": "중간 가격 제안",
+            "category": "propose",
+            "strength": "medium"
+        },
+        {
+            "id": 8,
+            "name": "WEAK_PROPOSE",
+            "description": "약한 가격 제안",
+            "category": "propose",
+            "strength": "weak"
+        }
+    ]
+}
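A file with the shape of `configs/actions.json` can be indexed by id and grouped by category with the standard library alone. The sketch below embeds a three-entry excerpt (with the `description` field omitted) rather than reading the file, so it runs on its own. The helper name `index_actions` is hypothetical and is not taken from `negotiation_agent/action_space.py`, whose contents are not shown in this diff.

```python
import json
from collections import defaultdict

# Excerpt in the same shape as configs/actions.json (description field omitted).
ACTIONS_JSON = """
{"actions": [
  {"id": 0, "name": "STRONG_ACCEPT", "category": "accept", "strength": "strong"},
  {"id": 4, "name": "MEDIUM_REJECT", "category": "reject", "strength": "medium"},
  {"id": 8, "name": "WEAK_PROPOSE", "category": "propose", "strength": "weak"}
]}
"""


def index_actions(raw_json):
    """Return (actions keyed by id, action names grouped by category)."""
    actions = json.loads(raw_json)["actions"]
    by_id = {action["id"]: action for action in actions}
    by_category = defaultdict(list)
    for action in actions:
        by_category[action["category"]].append(action["name"])
    return by_id, dict(by_category)


by_id, by_category = index_actions(ACTIONS_JSON)
```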
--- a/configs/offline_env_config.yaml
+++ b/configs/offline_env_config.yaml
@@ -1,7 +1,12 @@
+env:
+  scenario: 0
+  target_price: 100
+  threshold_price: 120
+
 dataset_params:
  path: datasets/collected_data.h5
  batch_size: 64

-agent_params:
+agent:
  learning_rate: 0.001
  discount_factor: 0.99
--- a/data_collector.py
+++ b/data_collector.py
@@ -1,49 +1,74 @@
-import h5py
 import numpy as np
 import yaml
+import json
+import os
+from datetime import datetime

-from envs.my_custom_env import MyCustomEnv
+from negotiation_agent.environment import NegotiationEnv
+from negotiation_agent.spaces import NegotiationSpaces


 def main():
    with open("configs/offline_env_config.yaml", "r") as f:
        config = yaml.safe_load(f)

-    env = MyCustomEnv()
-    dataset_path = config["dataset_params"]["path"]
+    env = NegotiationEnv()
+    spaces = NegotiationSpaces()
    
    num_episodes = 10
    max_steps_per_episode = 100
    
-    with h5py.File(dataset_path, 'w') as f:
-        observations = []
-        actions = []
-        rewards = []
-        next_observations = []
-        terminals = []
+    # List that accumulates per-episode data
+    episodes_data = []
    
-        for episode in range(num_episodes):
-            obs, _ = env.reset()
-            for step in range(max_steps_per_episode):
-                action = env.action_space.sample()
-                next_obs, reward, terminated, _, _ = env.step(action)
+    for episode in range(num_episodes):
+        episode_data = {
+            "episode_id": episode,
+            "timestamp": datetime.now().isoformat(),
+            "steps": []
+        }
        
-                observations.append(obs)
-                actions.append(action)
-                rewards.append(reward)
-                next_observations.append(next_obs)
-                terminals.append(terminated)
+        obs, _ = env.reset()
+        episode_reward = 0
        
-                obs = next_obs
+        for step in range(max_steps_per_episode):
+            # Select an action and interact with the environment
+            action = env.action_space.sample()
+            next_obs, reward, terminated, _, _ = env.step(action)
+            episode_reward += reward
            
-                if terminated:
-                    break
+            # Record the step data
+            step_data = {
+                "step": step,
+                "state": spaces.get_state_description(obs),
+                "action": spaces.get_action_description(action),
+                "reward": float(reward),
+                "next_state": spaces.get_state_description(next_obs),
+                "current_price": float(env.current_price),
+                "terminated": terminated
+            }
+            episode_data["steps"].append(step_data)
            
-        f.create_dataset("observations", data=np.array(observations))
-        f.create_dataset("actions", data=np.array(actions))
-        f.create_dataset("rewards", data=np.array(rewards))
-        f.create_dataset("next_observations", data=np.array(next_observations))
-        f.create_dataset("terminals", data=np.array(terminals))
+            obs = next_obs
+            if terminated:
+                break
+        
+        episode_data["total_reward"] = float(episode_reward)
+        episode_data["num_steps"] = len(episode_data["steps"])
+        episodes_data.append(episode_data)
+    
+    # Save as a JSON file
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    json_path = f"logs/collected_data_{timestamp}.json"
+    os.makedirs("logs", exist_ok=True)
+    
+    with open(json_path, 'w', encoding='utf-8') as f:
+        json.dump(episodes_data, f, ensure_ascii=False, indent=2)
+    
+    print(f"Data collected and saved to {json_path}")
+    print(f"Total episodes: {len(episodes_data)}")
+    print(f"Average steps per episode: {sum(ep['num_steps'] for ep in episodes_data) / len(episodes_data):.2f}")
+    print(f"Average reward per episode: {sum(ep['total_reward'] for ep in episodes_data) / len(episodes_data):.2f}")

 if __name__ == "__main__":
    main()
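The JSON written by the new `data_collector.py` is a list of episodes, each with `steps`, `total_reward`, and `num_steps`. A log of that shape can be summarized as sketched below. The embedded sample is made up for illustration and is not taken from the committed log, and `summarize_episodes` is a hypothetical helper name.

```python
import json


def summarize_episodes(episodes):
    """Compute simple aggregate statistics over a list of episode records."""
    count = len(episodes)
    if count == 0:
        raise ValueError("no episodes to summarize")
    # An episode counts as terminated if its last recorded step says so.
    terminated = sum(
        1 for ep in episodes if ep["steps"] and ep["steps"][-1]["terminated"]
    )
    return {
        "episodes": count,
        "avg_steps": sum(ep["num_steps"] for ep in episodes) / count,
        "avg_reward": sum(ep["total_reward"] for ep in episodes) / count,
        "termination_rate": terminated / count,
    }


# Made-up sample in the same shape as logs/collected_data_*.json.
sample = json.loads("""
[
  {"episode_id": 0,
   "steps": [{"step": 0, "reward": 0.5, "terminated": false},
             {"step": 1, "reward": 1.0, "terminated": true}],
   "total_reward": 1.5, "num_steps": 2},
  {"episode_id": 1,
   "steps": [{"step": 0, "reward": 0.5, "terminated": false}],
   "total_reward": 0.5, "num_steps": 1}
]
""")
summary = summarize_episodes(sample)
```

To run this against a real log, the `json.loads` call would be replaced by `json.load` on a file opened with `encoding="utf-8"`, matching how the collector writes it.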
--- a/datasets/collected_data.h5
+++ b/datasets/collected_data.h5
--- a/envs/__init__.py
+++ b/envs/__init__.py
--- a/envs/__pycache__/__init__.cpython-312.pyc
+++ b/envs/__pycache__/__init__.cpython-312.pyc
--- a/envs/__pycache__/__init__.cpython-39.pyc
+++ b/envs/__pycache__/__init__.cpython-39.pyc
--- a/envs/__pycache__/my_custom_env.cpython-312.pyc
+++ b/envs/__pycache__/my_custom_env.cpython-312.pyc
--- a/envs/__pycache__/my_custom_env.cpython-39.pyc
+++ b/envs/__pycache__/my_custom_env.cpython-39.pyc
--- a/envs/my_custom_env.py
+++ b/envs/my_custom_env.py
@@ -1,24 +0,0 @@
-import gymnasium as gym
-
-class MyCustomEnv(gym.Env):
-    def __init__(self):
-        super().__init__()
-        self.action_space = gym.spaces.Discrete(2)
-        self.observation_space = gym.spaces.Discrete(10)
-
-    def step(self, action):
-        observation = self.observation_space.sample()
-        reward = 1.0
-        terminated = False
-        truncated = False
-        info = {}
-        return observation, reward, terminated, truncated, info
-
-    def reset(self, seed=None, options=None):
-        super().reset(seed=seed)
-        observation = self.observation_space.sample()
-        info = {}
-        return observation, info
-
-    def render(self, mode='human'):
-        pass
--- a/evaluate.py
+++ b/evaluate.py
@@ -1,41 +1,66 @@
 from negotiation_agent.environment import NegotiationEnv
-from negotiation_agent.agent import QLearningAgent
-import config
+from agents.offline_agent import QLearningAgent
+import yaml
+import numpy as np


-def evaluate():
+def main():
+    # Load the environment configuration
+    with open('configs/offline_env_config.yaml', 'r') as f:
+        config = yaml.safe_load(f)
+
+    # Initialize the environment
    env = NegotiationEnv(
-        scenario=config.SCENARIO,
-        target_price=config.TARGET_PRICE,
-        threshold_price=config.THRESHOLD_PRICE,
+        scenario=config['env']['scenario'],
+        target_price=config['env']['target_price'],
+        threshold_price=config['env']['threshold_price']
    )
    
-    # Create the agent, but load the trained Q-table.
-    agent = QLearningAgent(
-        state_dims=env.observation_space.nvec,
-        action_size=env.action_space.n,
-        learning_rate=0,  # No learning during evaluation
-        gamma=0,
-        epsilon=0,  # No exploration during evaluation; always pick the best action
-    )
-    agent.load_q_table(config.Q_TABLE_SAVE_PATH)
+    # Initialize the agent and load the Q-table
+    state_dims = env.observation_space.nvec
+    state_size = np.prod(state_dims)  # Total size of the state space
+    action_size = env.action_space.n
+    agent = QLearningAgent(config['agent'], state_size, action_size)
+    agent.load_q_table('saved_models/q_table.npy')
    
-    print("--- 학습된 에이전트 평가 시작 ---")
-    state, info = env.reset()
-    terminated = False
-    total_reward = 0
+    print(f"State space size: {state_size}")
+    print(f"Action space size: {action_size}")
+    print(f"Q-table shape: {agent.q_table.shape}")
    
-    while not terminated:
-        action = agent.get_action(state)
-        state, reward, terminated, truncated, info = env.step(action)
-        total_reward += reward
-        print(f"상태: {state}, 선택한 행동: {action}, 보상: {reward:.4f}")
+    # Run the evaluation
+    num_episodes = 10
+    total_rewards = []
    
-    print("\n✅ 평가 종료!")
-    print(f"최종 협상 가격: {env.current_price:.2f} (목표가: {env.target_price})")
-    print(f"총 보상: {total_reward:.4f}")
-    env.close()
+    for episode in range(num_episodes):
+        state, _ = env.reset()
+        episode_reward = 0
+        done = False
+        
+        while not done:
+            # Convert the state to a flat index
+            state_idx = np.ravel_multi_index(tuple(state), env.observation_space.nvec)
+            # Select the greedy action
+            action = np.argmax(agent.q_table[state_idx])
+            
+            # Advance the environment by one step
+            next_state, reward, done, _, _ = env.step(action)
+            episode_reward += reward
+            state = next_state
+            
+            # Print the current state
+            print(f"Episode {episode + 1}")
+            print(f"State: {env.spaces.get_state_description(state)}")
+            print(f"Action: {env.spaces.get_action_description(action)}")
+            print(f"Reward: {reward:.2f}")
+            print(f"Current Price: {env.current_price:.2f}")
+            print("--------------------")
+        
+        total_rewards.append(episode_reward)
+        print(f"Episode {episode + 1} finished with total reward: {episode_reward:.2f}")
+        print("========================================")
+    
+    print(f"Average reward over {num_episodes} episodes: {np.mean(total_rewards):.2f}")


 if __name__ == "__main__":
-    evaluate()
+    main()
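The evaluation loop in the new `evaluate.py` ends an episode only on the third value returned by `env.step()` and discards the fourth. In the Gymnasium step API those are `terminated` and `truncated`, so an episode that is cut off by a step limit would not end that loop. Whether `NegotiationEnv` ever sets `truncated` is not visible in this diff. The sketch below shows a loop that honors both flags. `CountdownEnv` is a hypothetical stand-in that only truncates, and `run_episode` is likewise an illustrative name.

```python
class CountdownEnv:
    """Hypothetical stand-in environment that never terminates, only truncates."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0, {}

    def step(self, action):
        self.steps += 1
        truncated = self.steps >= self.max_steps
        # Returns (observation, reward, terminated, truncated, info).
        return 0, 1.0, False, truncated, {}


def run_episode(env, policy):
    """Run one episode, stopping on either termination or truncation."""
    state, _ = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        done = terminated or truncated
    return total_reward


total = run_episode(CountdownEnv(max_steps=5), policy=lambda state: 0)
```

A loop that checked only `terminated` would never exit with this stand-in environment, which is the situation the combined check guards against.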
--- a/logs/collected_data_20250922_161824.json
+++ b/logs/collected_data_20250922_161824.json
@@ -0,0 +1,865 @@
+[
+  {
+    "episode_id": 0,
+    "timestamp": "2025-09-22T16:18:24.027941",
+    "steps": [
+      {
+        "step": 0,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "중간 거절",
+        "reward": 0.7358985001204814,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 135.8883052263702,
+        "terminated": false
+      },
+      {
+        "step": 1,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "약한 거절",
+        "reward": 0.7861334009215178,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 127.20487373107215,
+        "terminated": false
+      },
+      {
+        "step": 2,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "약한 수락",
+        "reward": 0.6305285853598203,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 118.94781892751169,
+        "terminated": false
+      },
+      {
+        "step": 3,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "중간 수락",
+        "reward": 0.6740021559509319,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 111.27560844992324,
+        "terminated": false
+      },
+      {
+        "step": 4,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "중간 가격 제안",
+        "reward": 0.7318358321615553,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 102.48200033944701,
+        "terminated": false
+      },
+      {
+        "step": 5,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "약한 거절",
+        "reward": 0.570634635379974,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 96.3839146626223,
+        "terminated": false
+      },
+      {
+        "step": 6,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "action": "강한 가격 제안",
+        "reward": 1.0595279975097056,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 90.23375501159687,
+        "terminated": true
+      }
+    ],
+    "total_reward": 5.188561107403986,
+    "num_steps": 7
+  },
+  {
+    "episode_id": 1,
+    "timestamp": "2025-09-22T16:18:24.029744",
+    "steps": [
+      {
+        "step": 0,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "강한 거절",
+        "reward": 0.7390394774903258,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 135.31076897216093,
+        "terminated": false
+      },
+      {
+        "step": 1,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "중간 거절",
+        "reward": 0.801125516069835,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 124.82438518570777,
+        "terminated": false
+      },
+      {
+        "step": 2,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "약한 가격 제안",
+        "reward": 0.8266746792259854,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 120.96656945345163,
+        "terminated": false
+      },
+      {
+        "step": 3,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "약한 거절",
+        "reward": 0.6539156600559058,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 114.69368999908635,
+        "terminated": false
+      },
+      {
+        "step": 4,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "강한 수락",
+        "reward": 0.7009856035857323,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 106.9922115609145,
+        "terminated": false
+      },
+      {
+        "step": 5,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "중간 거절",
+        "reward": 0.729344462994045,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 102.83206880342405,
+        "terminated": false
+      },
+      {
+        "step": 6,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "강한 가격 제안",
+        "reward": 0.567252902669142,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 96.95851663553233,
+        "terminated": false
+      },
+      {
+        "step": 7,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "action": "중간 수락",
+        "reward": 1.0434788040265859,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 92.67390785793958,
+        "terminated": true
+      }
+    ],
+    "total_reward": 6.061817106117557,
+    "num_steps": 8
+  },
+  {
+    "episode_id": 2,
+    "timestamp": "2025-09-22T16:18:24.029881",
+    "steps": [
+      {
+        "step": 0,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "강한 거절",
+        "reward": 0.7269372327267993,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 137.56345871141042,
+        "terminated": false
+      },
+      {
+        "step": 1,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "중간 거절",
+        "reward": 0.756597194679587,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 132.17072532544773,
+        "terminated": false
+      },
+      {
+        "step": 2,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "강한 거절",
+        "reward": 0.7775685813646708,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 128.6060193230739,
+        "terminated": false
+      },
+      {
+        "step": 3,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "중간 수락",
+        "reward": 0.812286351695623,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 123.10929488259042,
+        "terminated": false
+      },
+      {
+        "step": 4,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "중간 수락",
+        "reward": 0.6493055986284829,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 115.50801372792905,
+        "terminated": false
+      },
+      {
+        "step": 5,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "강한 가격 제안",
+        "reward": 0.690788532528077,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 108.57157649320362,
+        "terminated": false
+      },
+      {
+        "step": 6,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "중간 거절",
+        "reward": 0.7100689199148662,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 105.62354990694725,
+        "terminated": false
+      },
+      {
+        "step": 7,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "약한 가격 제안",
+        "reward": 0.5557374640257946,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 98.9675945212993,
+        "terminated": false
+      },
+      {
+        "step": 8,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "action": "중간 가격 제안",
+        "reward": 1.0412902763036351,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 93.01691944576609,
+        "terminated": true
+      }
+    ],
+    "total_reward": 6.720580151867536,
+    "num_steps": 9
+  },
+  {
+    "episode_id": 3,
+    "timestamp": "2025-09-22T16:18:24.030028",
+    "steps": [
+      {
+        "step": 0,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "중간 거절",
+        "reward": 0.7252551783902629,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 137.88250395116734,
+        "terminated": false
+      },
+      {
+        "step": 1,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "약한 거절",
+        "reward": 0.7477667619419455,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 133.7315391503905,
+        "terminated": false
+      },
+      {
+        "step": 2,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "중간 가격 제안",
+        "reward": 0.77003855118024,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 129.86362805697163,
+        "terminated": false
+      },
+      {
+        "step": 3,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "중간 수락",
+        "reward": 0.8243137799776058,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 121.31302718573589,
+        "terminated": false
+      },
+      {
+        "step": 4,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "중간 거절",
+        "reward": 0.6353442495723813,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 118.04624036572108,
+        "terminated": false
+      },
+      {
+        "step": 5,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "강한 가격 제안",
+        "reward": 0.6678033474242119,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 112.30851161390989,
+        "terminated": false
+      },
+      {
+        "step": 6,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "중간 거절",
+        "reward": 0.6816974190216999,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 110.01948651592679,
+        "terminated": false
+      },
+      {
+        "step": 7,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "약한 가격 제안",
+        "reward": 0.7100061980929508,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 105.63288067265765,
+        "terminated": false
+      },
+      {
+        "step": 8,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "중간 가격 제안",
+        "reward": 0.7442817697542694,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 100.7682883657917,
+        "terminated": false
+      },
+      {
+        "step": 9,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "강한 수락",
+        "reward": 1.0374549618334905,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 93.62419857403351,
+        "terminated": true
+      }
+    ],
+    "total_reward": 7.5439622171890575,
+    "num_steps": 10
+  },
+  {
+    "episode_id": 4,
+    "timestamp": "2025-09-22T16:18:24.030189",
+    "steps": [
+      {
+        "step": 0,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "중간 수락",
+        "reward": 0.7146333204199368,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 139.93190233732378,
+        "terminated": false
+      },
+      {
+        "step": 1,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "약한 가격 제안",
+        "reward": 0.7297921670526436,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 137.02531284195942,
+        "terminated": false
+      },
+      {
+        "step": 2,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "약한 거절",
+        "reward": 0.75510873299296,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 132.43125874553,
+        "terminated": false
+      },
+      {
+        "step": 3,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "중간 가격 제안",
+        "reward": 0.8102459947081206,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 123.41930802882099,
+        "terminated": false
+      },
+      {
+        "step": 4,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "강한 가격 제안",
+        "reward": 0.8317345976484769,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 120.23066045674327,
+        "terminated": false
+      },
+      {
+        "step": 5,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "강한 거절",
+        "reward": 0.6560086328440509,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 114.32776375951947,
+        "terminated": false
+      },
+      {
+        "step": 6,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "약한 거절",
+        "reward": 0.674953538395559,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 111.1187596383056,
+        "terminated": false
+      },
+      {
+        "step": 7,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "중간 가격 제안",
+        "reward": 0.7302460660076945,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 102.70510652666732,
+        "terminated": false
+      },
+      {
+        "step": 8,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "약한 가격 제안",
+        "reward": 0.5720230516832283,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 96.14997129601272,
+        "terminated": false
+      },
+      {
+        "step": 9,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "action": "약한 거절",
+        "reward": 1.0506427970382055,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 91.5685666609294,
+        "terminated": true
+      }
+    ],
+    "total_reward": 7.525388898790876,
+    "num_steps": 10
+  },
+  {
+    "episode_id": 5,
+    "timestamp": "2025-09-22T16:18:24.030352",
+    "steps": [
+      {
+        "step": 0,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "중간 거절",
+        "reward": 0.7305613483955329,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 136.88104389812182,
+        "terminated": false
+      },
+      {
+        "step": 1,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "약한 가격 제안",
+        "reward": 0.7567871209193779,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 132.1375552460719,
+        "terminated": false
+      },
+      {
+        "step": 2,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "약한 가격 제안",
+        "reward": 0.78862199555197,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 126.80346295693705,
+        "terminated": false
+      },
+      {
+        "step": 3,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "강한 거절",
+        "reward": 0.6358337173218354,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 117.95536782148626,
+        "terminated": false
+      },
+      {
+        "step": 4,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "약한 수락",
+        "reward": 0.6888695836656975,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 108.8740187959828,
+        "terminated": false
+      },
+      {
+        "step": 5,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "중간 수락",
+        "reward": 0.718338032660941,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 104.4076696345554,
+        "terminated": false
+      },
+      {
+        "step": 6,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "약한 거절",
+        "reward": 0.7340748841478206,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 102.16941298443504,
+        "terminated": false
+      },
+      {
+        "step": 7,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "약한 가격 제안",
+        "reward": 0.5570155928801377,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 98.74050332345952,
+        "terminated": false
+      },
+      {
+        "step": 8,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "action": "중간 가격 제안",
+        "reward": 1.044383090477607,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 92.53291501917664,
+        "terminated": true
+      }
+    ],
+    "total_reward": 6.65448536602092,
+    "num_steps": 9
+  },
+  {
+    "episode_id": 6,
+    "timestamp": "2025-09-22T16:18:24.030490",
+    "steps": [
+      {
+        "step": 0,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "강한 수락",
+        "reward": 0.72750892291292,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 137.4553587598672,
+        "terminated": false
+      },
+      {
+        "step": 1,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "약한 수락",
+        "reward": 0.782524180941611,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 127.79157812052534,
+        "terminated": false
+      },
+      {
+        "step": 2,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "중간 수락",
+        "reward": 0.8283112545472572,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 120.72756400570525,
+        "terminated": false
+      },
+      {
+        "step": 3,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "중간 거절",
+        "reward": 0.6730035701665233,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 111.44071640131496,
+        "terminated": false
+      },
+      {
+        "step": 4,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "중간 수락",
+        "reward": 0.7083731926007367,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 105.87639507452755,
+        "terminated": false
+      },
+      {
+        "step": 5,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "강한 거절",
+        "reward": 0.5571921272965241,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 98.70921950541188,
+        "terminated": false
+      },
+      {
+        "step": 6,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "action": "중간 거절",
+        "reward": 0.578004004688776,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 95.15505005819907,
+        "terminated": false
+      },
+      {
+        "step": 7,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "action": "약한 거절",
+        "reward": 1.059787946167413,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 90.19528894541341,
+        "terminated": true
+      }
+    ],
+    "total_reward": 5.9147051993217605,
+    "num_steps": 8
+  },
+  {
+    "episode_id": 7,
+    "timestamp": "2025-09-22T16:18:24.030613",
+    "steps": [
+      {
+        "step": 0,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "강한 거절",
+        "reward": 0.7476969492228983,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 133.74402570979152,
+        "terminated": false
+      },
+      {
+        "step": 1,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "약한 수락",
+        "reward": 0.7826844090569465,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 127.76541712449546,
+        "terminated": false
+      },
+      {
+        "step": 2,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "약한 수락",
+        "reward": 0.6339560237546636,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 118.30473595913722,
+        "terminated": false
+      },
+      {
+        "step": 3,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "중간 수락",
+        "reward": 0.6700352339433415,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 111.93441210338206,
+        "terminated": false
+      },
+      {
+        "step": 4,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "약한 수락",
+        "reward": 0.7174069347685657,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 104.54317677343735,
+        "terminated": false
+      },
+      {
+        "step": 5,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "강한 거절",
+        "reward": 0.5640770884044879,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 97.5044034416811,
+        "terminated": false
+      },
+      {
+        "step": 6,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "action": "중간 수락",
+        "reward": 1.0558066698679451,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 90.78803970908572,
+        "terminated": true
+      }
+    ],
+    "total_reward": 5.171663309018848,
+    "num_steps": 7
+  },
+  {
+    "episode_id": 8,
+    "timestamp": "2025-09-22T16:18:24.030719",
+    "steps": [
+      {
+        "step": 0,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "강한 수락",
+        "reward": 0.7465479684274815,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 133.94986555336644,
+        "terminated": false
+      },
+      {
+        "step": 1,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "약한 거절",
+        "reward": 0.7880565018752345,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 126.89445460070839,
+        "terminated": false
+      },
+      {
+        "step": 2,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "강한 수락",
+        "reward": 0.6264636918952473,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 119.71962776182876,
+        "terminated": false
+      },
+      {
+        "step": 3,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "약한 가격 제안",
+        "reward": 0.6548001047926377,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 114.53877213985942,
+        "terminated": false
+      },
+      {
+        "step": 4,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "약한 가격 제안",
+        "reward": 0.681939343451541,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 109.98045606284857,
+        "terminated": false
+      },
+      {
+        "step": 5,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "중간 수락",
+        "reward": 0.7000434976249617,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 107.13619975680452,
+        "terminated": false
+      },
+      {
+        "step": 6,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "약한 거절",
+        "reward": 0.7290266498774218,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 102.87689759024647,
+        "terminated": false
+      },
+      {
+        "step": 7,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "약한 수락",
+        "reward": 0.5537943325131409,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 99.31484807799998,
+        "terminated": false
+      },
+      {
+        "step": 8,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "action": "약한 가격 제안",
+        "reward": 1.0514864005028703,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 91.44013888596228,
+        "terminated": true
+      }
+    ],
+    "total_reward": 6.532158490960537,
+    "num_steps": 9
+  },
+  {
+    "episode_id": 9,
+    "timestamp": "2025-09-22T16:18:24.030862",
+    "steps": [
+      {
+        "step": 0,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "약한 거절",
+        "reward": 0.7179479516097199,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 139.2858629595485,
+        "terminated": false
+      },
+      {
+        "step": 1,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "약한 수락",
+        "reward": 0.7357380633672626,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "current_price": 135.91793734624605,
+        "terminated": false
+      },
+      {
+        "step": 2,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
+        "action": "강한 거절",
+        "reward": 0.7985485009052322,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "current_price": 125.22720897558546,
+        "terminated": false
+      },
+      {
+        "step": 3,
+        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
+        "action": "약한 수락",
+        "reward": 0.634178662402988,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 118.26320317340058,
+        "terminated": false
+      },
+      {
+        "step": 4,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "중간 거절",
+        "reward": 0.66713537132276,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "current_price": 112.4209616577428,
+        "terminated": false
+      },
+      {
+        "step": 5,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
+        "action": "약한 가격 제안",
+        "reward": 0.6959274570546614,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 107.76985336577854,
+        "terminated": false
+      },
+      {
+        "step": 6,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "중간 수락",
+        "reward": 0.7252535975319768,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 103.4121033735282,
+        "terminated": false
+      },
+      {
+        "step": 7,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "중간 거절",
+        "reward": 0.7424496531354222,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "current_price": 101.01695068920729,
+        "terminated": false
+      },
+      {
+        "step": 8,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
+        "action": "강한 수락",
+        "reward": 0.5587233113829603,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 98.43870638556888,
+        "terminated": false
+      },
+      {
+        "step": 9,
+        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "action": "중간 가격 제안",
+        "reward": 1.0321906190591832,
+        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
+        "current_price": 94.47077675157269,
+        "terminated": true
+      }
+    ],
+    "total_reward": 7.308093187772166,
+    "num_steps": 10
+  }
+]
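Review note: each episode record in the log above stores `total_reward` and `num_steps` alongside the `steps` list, so the records can be checked for internal consistency. The sketch below does this for episode 7, with the step rewards copied from the log. The function name `check_episode` and the tolerance are assumptions for illustration and are not part of this commit.

```python
import math

# Step rewards and totals copied from episode 7 in the log above.
# Only the fields needed for the check are kept.
episode = {
    "episode_id": 7,
    "steps": [
        {"step": 0, "reward": 0.7476969492228983, "terminated": False},
        {"step": 1, "reward": 0.7826844090569465, "terminated": False},
        {"step": 2, "reward": 0.6339560237546636, "terminated": False},
        {"step": 3, "reward": 0.6700352339433415, "terminated": False},
        {"step": 4, "reward": 0.7174069347685657, "terminated": False},
        {"step": 5, "reward": 0.5640770884044879, "terminated": False},
        {"step": 6, "reward": 1.0558066698679451, "terminated": True},
    ],
    "total_reward": 5.171663309018848,
    "num_steps": 7,
}


def check_episode(ep, tol=1e-9):
    """Return a list of consistency problems found in one episode record."""
    problems = []
    steps = ep["steps"]
    if ep["num_steps"] != len(steps):
        problems.append("num_steps does not match len(steps)")
    reward_sum = sum(s["reward"] for s in steps)
    if not math.isclose(reward_sum, ep["total_reward"], abs_tol=tol):
        problems.append("total_reward does not match sum of step rewards")
    if [s["step"] for s in steps] != list(range(len(steps))):
        problems.append("step indices are not 0..n-1")
    # No step before the last one should already be marked terminated.
    if any(s["terminated"] for s in steps[:-1]):
        problems.append("a non-final step is marked terminated")
    return problems


problems = check_episode(episode)
```

An empty `problems` list means the record is self-consistent. The same function could be mapped over the full list after `json.load`.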
--- a/negotiation_agent/__pycache__/__init__.cpython-39.pyc
+++ b/negotiation_agent/__pycache__/__init__.cpython-39.pyc
--- a/negotiation_agent/__pycache__/action_space.cpython-39.pyc
+++ b/negotiation_agent/__pycache__/action_space.cpython-39.pyc
--- a/negotiation_agent/__pycache__/environment.cpython-39.pyc
+++ b/negotiation_agent/__pycache__/environment.cpython-39.pyc
--- a/negotiation_agent/__pycache__/spaces.cpython-39.pyc
+++ b/negotiation_agent/__pycache__/spaces.cpython-39.pyc
--- a/negotiation_agent/action_space.py
+++ b/negotiation_agent/action_space.py
@ -0,0 +1,107 @@
+from dataclasses import dataclass
+from typing import Dict, List, Optional
+import json
+import os
+from pathlib import Path
+
+
+@dataclass
+class ActionInfo:
+    id: int
+    name: str
+    description: str
+    category: str
+    strength: str
+
+    def __str__(self) -> str:
+        return f"{self.description} (ID: {self.id}, Category: {self.category}, Strength: {self.strength})"
+
+
+class ActionSpace:
+    def __init__(self, config_path: Optional[str] = None):
+        if config_path is None:
+            config_path = os.path.join(
+                Path(__file__).parent.parent, "configs", "actions.json"
+            )
+        self._actions: Dict[int, ActionInfo] = {}
+        self._load_actions(config_path)
+
+    def _load_actions(self, config_path: str) -> None:
+        """Load action definitions from a JSON file."""
+        with open(config_path, "r", encoding="utf-8") as f:
+            data = json.load(f)
+
+        for action in data["actions"]:
+            self.add_action(
+                id=action["id"],
+                name=action["name"],
+                description=action["description"],
+                category=action["category"],
+                strength=action["strength"],
+            )
+
+    def add_action(
+        self, id: int, name: str, description: str, category: str, strength: str
+    ) -> None:
+        """Add a new action."""
+        if id in self._actions:
+            raise ValueError(f"Action with id {id} already exists")
+
+        self._actions[id] = ActionInfo(
+            id=id,
+            name=name,
+            description=description,
+            category=category,
+            strength=strength,
+        )
+
+    def remove_action(self, action_id: int) -> None:
+        """Remove the action with the given ID."""
+        if action_id not in self._actions:
+            raise ValueError(f"Action with id {action_id} does not exist")
+        del self._actions[action_id]
+
+    def get_action(self, action_id: int) -> ActionInfo:
+        """Look up action info by action ID."""
+        if action_id not in self._actions:
+            raise ValueError(f"Invalid action id: {action_id}")
+        return self._actions[action_id]
+
+    def get_actions_by_category(self, category: str) -> List[ActionInfo]:
+        """Return all actions in the given category."""
+        return [
+            action for action in self._actions.values() if action.category == category
+        ]
+
+    def get_actions_by_strength(self, strength: str) -> List[ActionInfo]:
+        """Return all actions with the given strength."""
+        return [
+            action for action in self._actions.values() if action.strength == strength
+        ]
+
+    def save_actions(self, file_path: str) -> None:
+        """Save the current action configuration to a JSON file."""
+        data = {
+            "actions": [
+                {
+                    "id": action.id,
+                    "name": action.name,
+                    "description": action.description,
+                    "category": action.category,
+                    "strength": action.strength,
+                }
+                for action in self._actions.values()
+            ]
+        }
+
+        with open(file_path, "w", encoding="utf-8") as f:
+            json.dump(data, f, ensure_ascii=False, indent=4)
+
+    @property
+    def action_space_size(self) -> int:
+        """Return the current size of the action space."""
+        return len(self._actions)
+
+    def list_actions(self) -> List[ActionInfo]:
+        """Return all action info as a list."""
+        return list(self._actions.values())
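Review note: `_load_actions` above expects a top-level `"actions"` list whose entries carry `id`, `name`, `description`, `category`, and `strength`, and `add_action` rejects duplicate ids. The sketch below shows a config of that shape and a standalone validity check. The `name`, `category`, and `strength` values are invented for illustration (the real `configs/actions.json` is not shown here), and `validate_actions` is a hypothetical helper, not part of `ActionSpace`.

```python
import json
from collections import defaultdict

# Hypothetical config text. The keys match what _load_actions reads; the
# name/category/strength values are assumptions made for this example.
CONFIG_TEXT = """
{
  "actions": [
    {"id": 0, "name": "strong_accept", "description": "강한 수락",
     "category": "accept", "strength": "strong"},
    {"id": 3, "name": "strong_reject", "description": "강한 거절",
     "category": "reject", "strength": "strong"},
    {"id": 6, "name": "strong_offer", "description": "강한 가격 제안",
     "category": "offer", "strength": "strong"}
  ]
}
"""

REQUIRED_KEYS = {"id", "name", "description", "category", "strength"}


def validate_actions(data):
    """Check required keys and unique ids; return action ids grouped by category."""
    seen = set()
    by_category = defaultdict(list)
    for action in data["actions"]:
        missing = REQUIRED_KEYS - action.keys()
        if missing:
            raise ValueError(f"action is missing keys: {sorted(missing)}")
        if action["id"] in seen:
            # Mirrors the duplicate-id error raised by add_action.
            raise ValueError(f"Action with id {action['id']} already exists")
        seen.add(action["id"])
        by_category[action["category"]].append(action["id"])
    return dict(by_category)


groups = validate_actions(json.loads(CONFIG_TEXT))
```

Running a check like this before constructing `ActionSpace` would surface a malformed config as one clear error rather than a `KeyError` inside the loader.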
--- a/negotiation_agent/agent.py
+++ b/negotiation_agent/agent.py
@ -1,63 +0,0 @@
-import numpy as np
-import random
-import os
-
-
-class QLearningAgent:
-    """Agent that acts and learns based on a Q-table."""
-
-    def __init__(self, state_dims, action_size, learning_rate, gamma, epsilon):
-        self.state_dims = state_dims  # [4, 3, 3]
-        self.action_size = action_size
-        self.lr = learning_rate
-        self.gamma = gamma
-        self.epsilon = epsilon
-
-        # Initialize the Q-table: a 36x9 table filled with zeros
-        num_states = np.prod(state_dims)  # 4 * 3 * 3 = 36
-        self.q_table = np.zeros((num_states, action_size))
-
-    def _state_to_index(self, state):
-        """Convert a MultiDiscrete state [s0, s1, s2] into a single integer index."""
-        idx = (
-            state[0] * (self.state_dims[1] * self.state_dims[2])
-            + state[1] * self.state_dims[2]
-            + state[2]
-        )
-        return int(idx)
-
-    def get_action(self, state):
-        """Select an action using an epsilon-greedy policy."""
-        if random.uniform(0, 1) < self.epsilon:
-            return random.randint(0, self.action_size - 1)  # explore (random action)
-        else:
-            state_idx = self._state_to_index(state)
-            return np.argmax(self.q_table[state_idx, :])  # exploit (action with the highest Q-value)
-
-    # ===================================================================
-    # ## 3. Learning Algorithm: Q-table update rule
-    # ===================================================================
-    def learn(self, state, action, reward, next_state):
-        """Update the Q-table from one experience tuple (Q-learning update rule)."""
-        state_idx = self._state_to_index(state)
-        next_state_idx = self._state_to_index(next_state)
-
-        old_value = self.q_table[state_idx, action]
-        next_max = np.max(self.q_table[next_state_idx, :])
-
-        # Bellman Equation
-        new_value = old_value + self.lr * (reward + self.gamma * next_max - old_value)
-        self.q_table[state_idx, action] = new_value
-
-    def save_q_table(self, file_path):
-        """Save the Q-table to a file."""
-        np.save(file_path, self.q_table)
-        print(f"Q-Table saved to {file_path}")
-
-    def load_q_table(self, file_path):
-        """Load the Q-table from a file."""
-        if os.path.exists(file_path):
-            self.q_table = np.load(file_path)
-            print(f"Q-Table loaded from {file_path}")
-        else:
-            print(f"Error: No Q-Table found at {file_path}")
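Review note: the removed `QLearningAgent` flattens a `[4, 3, 3]` MultiDiscrete state into one of 36 row indices and applies the tabular Q-learning update. The sketch below reproduces that arithmetic in plain Python so it can be checked by hand. The transition, `lr`, and `gamma` values are hypothetical and chosen only to make the numbers easy to follow.

```python
from itertools import product

STATE_DIMS = (4, 3, 3)  # scenario, price zone, acceptance rate
N_ACTIONS = 9


def state_to_index(state, dims=STATE_DIMS):
    """Row-major flattening, the same arithmetic as the removed _state_to_index."""
    s0, s1, s2 = state
    return s0 * (dims[1] * dims[2]) + s1 * dims[2] + s2


def q_update(q, state, action, reward, next_state, lr, gamma):
    """Apply one Q-learning update in place and return the new Q-value."""
    i = state_to_index(state)
    j = state_to_index(next_state)
    td_target = reward + gamma * max(q[j])
    q[i][action] += lr * (td_target - q[i][action])
    return q[i][action]


# Every state maps to a distinct row in 0..35.
all_indices = [state_to_index(s) for s in product(*map(range, STATE_DIMS))]

# Zero-initialized 36x9 table, as in the removed __init__.
q_table = [[0.0] * N_ACTIONS for _ in range(len(all_indices))]

# Hypothetical transition: state (0, 2, 0), action 4, reward 1.0.
new_value = q_update(q_table, (0, 2, 0), 4, 1.0, (0, 2, 1), lr=0.5, gamma=0.9)
```

With an all-zero table the update reduces to `lr * reward`, so `new_value` is 0.5 and lands in row 6. The removed `learn` takes no `terminated` argument, so it appears to bootstrap from `next_state` even on terminal transitions; a replacement agent may want to mask the `gamma * max(...)` term in that case.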
--- a/negotiation_agent/constants.py
+++ b/negotiation_agent/constants.py
@ -0,0 +1,41 @@
+from gymnasium import spaces
+
+# Observation Space Constants
+SCENARIO_SPACE_SIZE = 4  # number of scenario states (0-3)
+PRICE_ZONE_SIZE = 3  # number of price zones (0-2)
+ACCEPTANCE_RATE_SIZE = 3  # number of acceptance-rate levels (0-2)
+
+# Observation Space Mappings
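+SCENARIO_MAPPING = {
+    0: "높은 구매 의지",  # high purchase intent
+    1: "중간 구매 의지",  # medium purchase intent
+    2: "낮은 구매 의지",  # low purchase intent
+    3: "매우 낮은 구매 의지",  # very low purchase intent
+}
+
+PRICE_ZONE_MAPPING = {0: "목표가격 이하", 1: "목표가격~임계가격", 2: "임계가격 초과"}  # at or below target / target to threshold / above threshold
+
+ACCEPTANCE_RATE_MAPPING = {0: "낮음 (<10%)", 1: "중간 (10-25%)", 2: "높음 (>25%)"}  # low / medium / high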
+
+# Action Space Constants
+ACTION_SPACE_SIZE = 9
+
+# Action Space Mappings
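+ACTION_MAPPING = {
+    0: "강한 수락",  # strong accept
+    1: "중간 수락",  # moderate accept
+    2: "약한 수락",  # weak accept
+    3: "강한 거절",  # strong reject
+    4: "중간 거절",  # moderate reject
+    5: "약한 거절",  # weak reject
+    6: "강한 가격 제안",  # strong price offer
+    7: "중간 가격 제안",  # moderate price offer
+    8: "약한 가격 제안",  # weak price offer
+}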
+
+# Spaces Definition
+OBSERVATION_SPACE = spaces.MultiDiscrete(
+    [SCENARIO_SPACE_SIZE, PRICE_ZONE_SIZE, ACCEPTANCE_RATE_SIZE]
+)
+
+ACTION_SPACE = spaces.Discrete(ACTION_SPACE_SIZE)
--- a/negotiation_agent/environment.py
+++ b/negotiation_agent/environment.py
@ -1,6 +1,13 @@
 import gymnasium as gym
 from gymnasium import spaces
 import numpy as np
+from negotiation_agent.spaces import (
+    NegotiationSpaces,
+    State,
+    PriceZone,
+    AcceptanceRate,
+    Scenario,
+)


 class NegotiationEnv(gym.Env):
@ -8,9 +15,11 @@ class NegotiationEnv(gym.Env):

    def __init__(self, scenario=0, target_price=100, threshold_price=120):
        super(NegotiationEnv, self).__init__()
-        self.observation_space = spaces.MultiDiscrete([4, 3, 3])
-        self.action_space = spaces.Discrete(9)
-        self.initial_scenario = scenario
+
+        self.spaces = NegotiationSpaces()
+        self.observation_space = self.spaces.observation_space
+        self.action_space = self.spaces.action_space
+        self.initial_scenario = Scenario(scenario)
        self.target_price = target_price
        self.threshold_price = threshold_price
        self.current_price = None
@ -20,23 +29,28 @@ class NegotiationEnv(gym.Env):
    def _get_state(self):
        """Compute the State array from the current information."""
        if self.current_price <= self.target_price:
-            price_zone = 0
+            price_zone = PriceZone.BELOW_TARGET
        elif self.target_price < self.current_price <= self.threshold_price:
-            price_zone = 1
+            price_zone = PriceZone.BETWEEN_TARGET_AND_THRESHOLD
        else:
-            price_zone = 2
+            price_zone = PriceZone.ABOVE_THRESHOLD

        acceptance_rate_val = (
            self.initial_price - self.current_price
        ) / self.initial_price
        if acceptance_rate_val < 0.1:
-            acceptance_rate_level = 0
+            acceptance_rate_level = AcceptanceRate.LOW
        elif 0.1 <= acceptance_rate_val < 0.25:
-            acceptance_rate_level = 1
+            acceptance_rate_level = AcceptanceRate.MEDIUM
        else:
-            acceptance_rate_level = 2
+            acceptance_rate_level = AcceptanceRate.HIGH

-        return np.array([self.initial_scenario, price_zone, acceptance_rate_level])
+        state = State(
+            scenario=self.initial_scenario,
+            price_zone=price_zone,
+            acceptance_rate=acceptance_rate_level,
+        )
+        return np.array(state.to_array())

    def reset(self, seed=None, options=None):
        """Reset the environment to its initial state."""
--- a/negotiation_agent/spaces.py
+++ b/negotiation_agent/spaces.py
@ -0,0 +1,113 @@
+from gymnasium import spaces
+from typing import List
+from dataclasses import dataclass
+from enum import Enum
+from negotiation_agent.action_space import ActionSpace, ActionInfo
+
+
+class Scenario(Enum):
+    HIGH_INTENTION = 0
+    MEDIUM_INTENTION = 1
+    LOW_INTENTION = 2
+    VERY_LOW_INTENTION = 3
+
+    @property
+    def description(self) -> str:
+        return {
+            self.HIGH_INTENTION: "높은 구매 의지",
+            self.MEDIUM_INTENTION: "중간 구매 의지",
+            self.LOW_INTENTION: "낮은 구매 의지",
+            self.VERY_LOW_INTENTION: "매우 낮은 구매 의지",
+        }[self]
+
+
+class PriceZone(Enum):
+    BELOW_TARGET = 0
+    BETWEEN_TARGET_AND_THRESHOLD = 1
+    ABOVE_THRESHOLD = 2
+
+    @property
+    def description(self) -> str:
+        return {
+            self.BELOW_TARGET: "목표가격 이하",
+            self.BETWEEN_TARGET_AND_THRESHOLD: "목표가격~임계가격",
+            self.ABOVE_THRESHOLD: "임계가격 초과",
+        }[self]
+
+
+class AcceptanceRate(Enum):
+    LOW = 0
+    MEDIUM = 1
+    HIGH = 2
+
+    @property
+    def description(self) -> str:
+        return {
+            self.LOW: "낮음 (<10%)",
+            self.MEDIUM: "중간 (10-25%)",
+            self.HIGH: "높음 (>25%)",
+        }[self]
+
+
+@dataclass
+class State:
+    scenario: Scenario
+    price_zone: PriceZone
+    acceptance_rate: AcceptanceRate
+
+    def to_array(self) -> List[int]:
+        return [self.scenario.value, self.price_zone.value, self.acceptance_rate.value]
+
+    @classmethod
+    def from_array(cls, arr: List[int]) -> "State":
+        return cls(
+            scenario=Scenario(arr[0]),
+            price_zone=PriceZone(arr[1]),
+            acceptance_rate=AcceptanceRate(arr[2]),
+        )
+
+    def __str__(self) -> str:
+        return (
+            f"State(scenario={self.scenario.description}, "
+            f"price_zone={self.price_zone.description}, "
+            f"acceptance_rate={self.acceptance_rate.description})"
+        )
+
+
+class NegotiationSpaces:
+    def __init__(self):
+        self._action_space = ActionSpace()
+
+    @property
+    def observation_space(self) -> spaces.MultiDiscrete:
+        return spaces.MultiDiscrete(
+            [len(Scenario), len(PriceZone), len(AcceptanceRate)]
+        )
+
+    @property
+    def action_space(self) -> spaces.Discrete:
+        return spaces.Discrete(self._action_space.action_space_size)
+
+    def decode_action(self, action_id: int) -> ActionInfo:
+        return self._action_space.get_action(action_id)
+
+    def encode_state(self, state: State) -> List[int]:
+        return state.to_array()
+
+    def decode_state(self, state_array: List[int]) -> State:
+        return State.from_array(state_array)
+
+    def get_action_description(self, action_id: int) -> str:
+        return self.decode_action(action_id).description
+
+    def get_state_description(self, state_array: List[int]) -> str:
+        return str(self.decode_state(state_array))
+
+    def get_actions_by_category(self, category: str) -> List[ActionInfo]:
+        return self._action_space.get_actions_by_category(category)
+
+    def get_actions_by_strength(self, strength: str) -> List[ActionInfo]:
+        return self._action_space.get_actions_by_strength(strength)
+
+    def list_all_actions(self) -> List[ActionInfo]:
+        return self._action_space.list_actions()
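Review note: `encode_state` returns a three-element list, while a tabular Q-table is indexed by a single row per state. The following is a minimal sketch of how the two representations can be related, assuming row-major (C-order) flattening, which is the default order used by `np.unravel_index`. The helpers `flatten_state` and `unflatten_state` and the sizes in `DIMS` are illustrative assumptions, not part of this diff.

```python
from typing import Sequence, Tuple


def flatten_state(indices: Sequence[int], dims: Sequence[int]) -> int:
    """Map a multi-discrete state to one row index (row-major order)."""
    if len(indices) != len(dims):
        raise ValueError("indices and dims must have the same length")
    flat = 0
    for idx, dim in zip(indices, dims):
        if not 0 <= idx < dim:
            raise ValueError(f"index {idx} out of range for size {dim}")
        flat = flat * dim + idx
    return flat


def unflatten_state(flat: int, dims: Sequence[int]) -> Tuple[int, ...]:
    """Inverse of flatten_state (same ordering as C-order unravel)."""
    total = 1
    for dim in dims:
        total *= dim
    if not 0 <= flat < total:
        raise ValueError(f"flat index {flat} out of range for {total} states")
    indices = []
    # Peel off the fastest-varying (last) dimension first.
    for dim in reversed(dims):
        flat, idx = divmod(flat, dim)
        indices.append(idx)
    return tuple(reversed(indices))


# Assumed sizes: 4 scenarios, 3 price zones, 3 acceptance-rate buckets.
DIMS = (4, 3, 3)

row = flatten_state([1, 2, 0], DIMS)
print(row, unflatten_state(row, DIMS))
```

Keeping both directions in one place makes it harder for the training code and the export code to disagree about which row belongs to which state.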
--- a/saved_models/offline_agent.pth
+++ b/saved_models/offline_agent.pth
--- a/saved_models/q_table.json
+++ b/saved_models/q_table.json
--- a/saved_models/q_table.npy
+++ b/saved_models/q_table.npy
--- a/tests/test_evaluate_agent_usecase.py
+++ b/tests/test_evaluate_agent_usecase.py
@@ -1,7 +1,7 @@
 import unittest

 from agents.offline_agent import QLearningAgent
-from envs.my_custom_env import MyCustomEnv
+from negotiation_agent.environment import NegotiationEnv
 from usecases.evaluate_agent_usecase import EvaluateAgentUseCase

 class TestEvaluateAgentUseCase(unittest.TestCase):
@@ -10,7 +10,7 @@ class TestEvaluateAgentUseCase(unittest.TestCase):
        self.state_size = 10
        self.action_size = 2
        self.agent = QLearningAgent(self.agent_params, self.state_size, self.action_size)
-        self.env = MyCustomEnv()
+        self.env = NegotiationEnv()
        self.use_case = EvaluateAgentUseCase()

    def test_execute(self):
--- a/train.py
+++ b/train.py
@@ -1,29 +0,0 @@
-from negotiation_agent.environment import NegotiationEnv
-from negotiation_agent.agent import QLearningAgent
-from usecases.train_agent_usecase import TrainAgentUseCase  # import the use case
-import config
-
-
-def main():
-    # 1. Create dependencies (objects): prepare all the 'ingredients' we need.
-    env = NegotiationEnv(
-        scenario=config.SCENARIO,
-        target_price=config.TARGET_PRICE,
-        threshold_price=config.THRESHOLD_PRICE,
-    )
-
-    agent = QLearningAgent(
-        state_dims=env.observation_space.nvec,
-        action_size=env.action_space.n,
-        learning_rate=config.LEARNING_RATE,
-        gamma=config.GAMMA,
-        epsilon=config.EPSILON_START,
-    )
-
-    # 2. Create and run the use case: hand the prepared ingredients to the 'chef' and ask it to 'cook'.
-    train_use_case = TrainAgentUseCase(env=env, agent=agent)
-    train_use_case.execute()
-
-
-if __name__ == "__main__":
-    main()
--- a/train_offline.py
+++ b/train_offline.py
@@ -2,8 +2,12 @@ import h5py
 import numpy as np
 import yaml
 import os
+import json
+from datetime import datetime

 from agents.offline_agent import QLearningAgent
+from negotiation_agent.spaces import NegotiationSpaces
+from negotiation_agent.environment import NegotiationEnv

 def main():
    with open("configs/offline_env_config.yaml", "r") as f:
@@ -19,10 +23,12 @@ def main():
        next_observations = f["next_observations"][:]
        terminals = f["terminals"][:]
    
-    state_size = len(np.unique(np.concatenate((observations, next_observations))))
-    action_size = len(np.unique(actions))
+    from negotiation_agent.environment import NegotiationEnv
+    env = NegotiationEnv()
+    state_size = np.prod(env.observation_space.nvec)  # 4 * 3 * 3 = 36
+    action_size = env.action_space.n  # 9

-    agent = QLearningAgent(config["agent_params"], state_size, action_size)
+    agent = QLearningAgent(config["agent"], state_size, action_size)  # changed to config["agent"]

    num_epochs = 10
    for epoch in range(num_epochs):
@@ -38,12 +44,61 @@ def main():
            }
            agent.learn(batch)

-    # Save the model
+    # Save the model (npy format)
    saved_models_dir = "saved_models"
    os.makedirs(saved_models_dir, exist_ok=True)
    model_path = os.path.join(saved_models_dir, "q_table.npy")
-    agent.save_model(model_path)
+    np.save(model_path, agent.q_table)
+    
+    # Also save the Q-table in JSON format
+    spaces = NegotiationSpaces()
+    q_table_data = {
+        "metadata": {
+            "state_size": int(state_size),
+            "action_size": int(action_size),
+            "timestamp": datetime.now().isoformat(),
+            "training_episodes": int(num_epochs)
+        },
+        "q_values": []
+    }
+    
+    # Store the Q-values for each state
+    for state_idx in range(state_size):
+        state_indices = np.unravel_index(state_idx, env.observation_space.nvec)
+        state_data = {
+            "state_idx": int(state_idx),
+            "state_desc": spaces.get_state_description(
+                [int(idx) for idx in state_indices]
+            ),
+            "actions": []
+        }
+        
+        # Store the Q-value for each action
+        for action_idx in range(action_size):
+            action_data = {
+                "action_idx": int(action_idx),
+                "action_desc": spaces.get_action_description(action_idx),
+                "q_value": float(agent.q_table[state_idx, action_idx])
+            }
+            state_data["actions"].append(action_data)
+            
+        # Add the optimal (highest Q-value) action info
+        optimal_action_idx = int(np.argmax(agent.q_table[state_idx]))
+        state_data["optimal_action"] = {
+            "action_idx": optimal_action_idx,
+            "action_desc": spaces.get_action_description(optimal_action_idx),
+            "q_value": float(agent.q_table[state_idx, optimal_action_idx])
+        }
+        
+        q_table_data["q_values"].append(state_data)
+    
+    # Write to a JSON file
+    json_path = os.path.join(saved_models_dir, "q_table.json")
+    with open(json_path, 'w', encoding='utf-8') as f:
+        json.dump(q_table_data, f, ensure_ascii=False, indent=2)
+    
    print(f"Model saved to {model_path}")
+    print(f"Q-table JSON saved to {json_path}")

 if __name__ == "__main__":
    main()
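Review note: the JSON export built in `q_table_data` is mainly for human inspection, but it can also be read back to look up the greedy action per state. Below is a minimal sketch, assuming the file keeps the layout written by this script (`q_values` is a list of per-state entries, each with `state_idx` and `optimal_action`). The `load_greedy_policy` helper and the sample payload are hypothetical and use made-up values.

```python
import json
from typing import Dict


def load_greedy_policy(json_text: str) -> Dict[int, int]:
    """Return {state_idx: optimal action_idx} from a q_table.json payload."""
    data = json.loads(json_text)
    return {
        entry["state_idx"]: entry["optimal_action"]["action_idx"]
        for entry in data["q_values"]
    }


# Tiny stand-in payload with the same keys as the export; values are made up.
sample = json.dumps(
    {
        "metadata": {"state_size": 2, "action_size": 2},
        "q_values": [
            {"state_idx": 0, "optimal_action": {"action_idx": 1, "q_value": 0.5}},
            {"state_idx": 1, "optimal_action": {"action_idx": 0, "q_value": 0.2}},
        ],
    }
)

policy = load_greedy_policy(sample)
print(policy)
```

For the real export, the text can be read from `saved_models/q_table.json` with `encoding="utf-8"`, matching how the file is written.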
--- a/usecases/evaluate_agent_usecase.py
+++ b/usecases/evaluate_agent_usecase.py
@@ -1,8 +1,8 @@
 from agents.offline_agent import QLearningAgent
-from envs.my_custom_env import MyCustomEnv
+from negotiation_agent.environment import NegotiationEnv

 class EvaluateAgentUseCase:
-    def execute(self, agent: QLearningAgent, env: MyCustomEnv, num_episodes: int):
+    def execute(self, agent: QLearningAgent, env: NegotiationEnv, num_episodes: int):
        total_rewards = 0
        for _ in range(num_episodes):
            obs, _ = env.reset()
Author	SHA1	Message	Date
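mgjeon	0ade7cec61	docs: Overhaul README - Add detailed description of the project structure - Add descriptions of the main components - Add a run guide - Add a description of the configuration files - Document training results and directions for extension	2025-09-22 16:36:25 +09:00
mgjeon	a81e1d4232	feat: Improve training and evaluation process - data_collector.py: add JSON-format logging - train_offline.py: improve Q-table save format - evaluate.py: add more detailed evaluation metrics - usecases/: improve evaluation logic - tests/: update test cases	2025-09-22 16:36:07 +09:00
mgjeon	e85490e0ab	feat: Improve data management and configuration - configs/actions.json: add action definition file - configs/offline_env_config.yaml: improve environment configuration file - saved_models/: add Q-table in JSON format - logs/: add data collection logging	2025-09-22 16:35:52 +09:00
mgjeon	1bf179bbaa	feat: Improve negotiation agent implementation - action_space.py: add action space management logic - constants.py: separate and manage constants - spaces.py: add state and action space definitions - environment.py: improve negotiation environment implementation	2025-09-22 16:35:43 +09:00
mgjeon	26442ca9c1	refactor: Improve project structure - Move the existing envs/ directory to negotiation_agent/ and refactor it - Move config.py to the configs/ directory and convert it to yaml - Merge Offline_RL.md into README.md - Remove the unnecessary train.py	2025-09-22 16:35:29 +09:00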