31 changed files with 303 additions and 3553 deletions
--- a/Offline_RL.md
+++ b/Offline_RL.md
@ -0,0 +1,80 @@
+**Offline RL Project Structure**
+
+The biggest changes are the new `datasets/` directory, which holds the dataset, and a separate data-collection script (`data_collector.py`).
+
+```
+my_rl_project/
+├── configs/
+│   └── offline_env_config.yaml
+├── datasets/
+│   └── collected_data.h5
+├── envs/
+│   ├── __init__.py
+│   └── my_custom_env.py
+├── agents/
+│   ├── __init__.py
+│   └── offline_agent.py
+├── data_collector.py
+└── train_offline.py
+```
+
+---
+
+### **Changes and Management Approach by Component**
+
+### **1. `datasets/` (new)**
+
+- **Purpose**: Directory that stores the fixed, pre-collected dataset. It is the most important asset in offline training.
+- **Format**: HDF5 (`.h5`), Parquet, or NumPy archive (`.npz`) formats are commonly used to handle large datasets efficiently.
+- **Content**: A table or set of arrays made up of `(observation, action, reward, next_observation, terminated)` tuples.
+
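+### **2. `data_collector.py` (new)**
+
+- **Purpose**: Script that generates the dataset used for offline training. It **runs against the online environment**.
+- **Implementation**:
+  - Instantiates `MyCustomEnv`.
+  - Interacts with the environment (`env.step()`) using a predefined policy (for example a random policy, an expert policy, or a previously trained online RL policy).
+  - Saves the collected `(s, a, r, s', d)` tuples to the `datasets/` directory in the configured file format.
+  - **Keeps the data-collection phase clearly separate from the offline-training phase.**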
+
+### **3. `configs/offline_env_config.yaml`**
+
+- **Purpose**: Manages settings specific to offline training.
+- **Key Definitions (changes)**:
+  - **`dataset_params`**:
+    - `path`: Path to the dataset file to use (`datasets/collected_data.h5`).
+    - `batch_size`: Number of transitions sampled from the dataset per training batch.
+  - **`agent_params`**: Hyperparameters specific to the offline RL algorithm (e.g. CQL, BCQ).
+
+### **4. `envs/my_custom_env.py`**
+
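+- **Purpose (changed role)**:
+  - **Training phase**: The environment is no longer interacted with in real time during training.
+  - **Evaluation phase**: Used **to validate the trained policy after offline training has finished**. Its main role becomes a testbed for running the trained agent in the actual environment.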
+
+### **5. `agents/offline_agent.py`**
+
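+- **Purpose**: Implements an algorithm that learns a policy from the offline dataset alone.
+- **Key Definitions (changes)**:
+  - `class OfflineAgent`: Implements an offline RL algorithm such as CQL, IQL, or BCQ.
+  - `__init__(...)`: Initializes the model and hyperparameters, much like an online agent.
+  - `get_action(self, state)`: Same role as before, but used mainly in the **evaluation phase** after training.
+  - `learn(self, batch)`:
+    - The argument is no longer a single experience but **a `batch` sampled from the dataset**.
+    - The `batch` is used to compute the offline RL algorithm's loss and update the policy. **There is no interaction with the environment (`env.step()`) at all.**
+- **Policy & Reward Function Management**:
+  - **Policy**: Managed in this file. It is trained either to imitate the action distribution in the dataset or to improve on it conservatively.
+  - **Reward Function**: The agent does not reference the reward function directly. Instead, it uses the **`reward` values recorded in the dataset** as its only supervision signal.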
+
+### **6. `train_offline.py`**
+
+- **Purpose**: Main script that loads the offline dataset and trains the agent.
+- **Key Definitions (key changes)**:
+  - **Setup**: Loads the config file and instantiates the dataset loader and `OfflineAgent`. (`MyCustomEnv` may not be needed at this stage.)
+  - **Training Loop**:
+    - **The environment-interaction loop is gone.**
+    - Instead, the loop resembles supervised learning.
+    - `for epoch in range(num_epochs):`
+      - `batch = dataset.sample(batch_size)`
+      - `agent.learn(batch)`
+  - **Evaluation**:
+    - After the training loop finishes, instantiates `MyCustomEnv`.
+    - Runs a separate evaluation loop that interacts with the environment via the trained `agent.get_action()` and measures performance.
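Review note: the training loop described in section 6 of `Offline_RL.md` can be sketched as below. This is a minimal illustration under assumptions, not the project's `train_offline.py`: `FixedDataset` and `TabularOfflineAgent` are hypothetical names, the dataset is an in-memory list rather than the HDF5 file, and the update rule is plain tabular Q-learning rather than CQL, IQL, or BCQ.

```python
import random
from collections import defaultdict


class FixedDataset:
    """Hypothetical in-memory stand-in for the pre-collected dataset."""

    def __init__(self, transitions, seed=0):
        self.transitions = list(transitions)
        self.rng = random.Random(seed)

    def sample(self, batch_size):
        # Sample with replacement, so batch_size may exceed the dataset size.
        return [self.rng.choice(self.transitions) for _ in range(batch_size)]


class TabularOfflineAgent:
    """Plain tabular Q-learning over logged transitions (an assumption)."""

    def __init__(self, n_actions, learning_rate=0.1, gamma=0.99):
        self.n_actions = n_actions
        self.lr = learning_rate
        self.gamma = gamma
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def learn(self, batch):
        # Only logged (s, a, r, s', d) tuples are used; env.step() is never called.
        for s, a, r, s_next, terminated in batch:
            target = r if terminated else r + self.gamma * max(self.q[s_next])
            self.q[s][a] += self.lr * (target - self.q[s][a])

    def get_action(self, state):
        values = self.q[state]
        return max(range(self.n_actions), key=values.__getitem__)


def train(dataset, agent, num_epochs, batch_size):
    # Supervised-learning-style loop: sample a batch, update, repeat.
    for _ in range(num_epochs):
        agent.learn(dataset.sample(batch_size))
    return agent


# Toy dataset: in state 0, action 1 is rewarded and action 0 is not.
transitions = [
    (0, 0, 0.0, 0, True),
    (0, 1, 1.0, 1, True),
]
agent = train(
    FixedDataset(transitions),
    TabularOfflineAgent(n_actions=2),
    num_epochs=200,
    batch_size=8,
)
print(agent.get_action(0))
```

The environment appears nowhere in `train`; it would only be needed afterwards, in the separate evaluation loop.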
--- a/README.md
+++ b/README.md
@ -1,123 +1 @@
-# Offline Q-Learning for Negotiation Agent
-
-## Project Overview
-
-An offline reinforcement learning agent that uses Q-learning in a negotiation environment. The agent learns from pre-collected data and aims to choose the best action during price negotiation.
-
-## Project Structure
-
-```
-├── agents/                     # Agent implementations
-│   └── offline_agent.py       # Offline Q-learning agent
-├── configs/                    # Configuration files
-│   ├── actions.json           # Action space definition
-│   └── offline_env_config.yaml# Environment and training settings
-├── datasets/                   # Collected datasets
-│   └── collected_data.h5      # Collected interaction data
-├── logs/                      # Log files
-│   └── collected_data_*.json  # Data-collection logs
-├── negotiation_agent/         # Negotiation environment
-│   ├── action_space.py        # Action space management
-│   ├── environment.py         # Negotiation environment implementation
-│   └── spaces.py             # State and action space definitions
-├── saved_models/              # Saved trained models
-│   ├── q_table.npy           # Q-table in NumPy format
-│   └── q_table.json          # Q-table in human-readable JSON
-├── usecases/                  # Use-case implementations
-├── data_collector.py          # Data-collection script
-├── train_offline.py           # Offline training script
-└── evaluate.py                # Model evaluation script
-```
-
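-## Main Components
-
-### 1. Negotiation Environment (NegotiationEnv)
-
- **State space**:
-
-  - Scenario (4 values): high / medium / low / very low purchase intent
-  - Price zone (3 values): at or below target price / between target and threshold price / above threshold price
-  - Acceptance-rate zone (3 values): low (<10%) / medium (10-25%) / high (>25%)
-
- **Action space**:
-  - Accept: strong / medium / weak accept
-  - Reject: strong / medium / weak reject
-  - Propose: strong / medium / weak price proposal
-
-### 2. Offline Q-Learning Agent
-
- Learns from pre-collected data
- Stores state-action values in a Q-table
- Batch learning via experience replay
-
-### 3. Data Management
-
- **Data collection**: interaction data is collected with `data_collector.py`
- **Data format**: stored as JSON for readability
- **Logging**: records each episode's states, actions, and rewards in detail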
-
-## How to Run
-
-### 1. Data Collection
-
-```bash
-python data_collector.py
-```
-
- Collects interaction data from the negotiation environment
- Results are saved to `logs/collected_data_[timestamp].json`
-
-### 2. Offline Training
-
-```bash
-python train_offline.py
-```
-
- Trains the Q-table on the collected data
- The trained model is saved in two formats:
-  - `saved_models/q_table.npy`: NumPy array
-  - `saved_models/q_table.json`: human-readable JSON
-
-### 3. Model Evaluation
-
-```bash
-python evaluate.py
-```
-
- Evaluates the trained model
- Prints per-episode rewards, chosen actions, and related details
-
-## Configuration Files
-
-### 1. offline_env_config.yaml
-
-```yaml
-env:
-  scenario: 0
-  target_price: 100
-  threshold_price: 120
-
-dataset_params:
-  path: datasets/collected_data.h5
-  batch_size: 64
-
-agent:
-  learning_rate: 0.001
-  discount_factor: 0.99
-```
-
-### 2. actions.json
-
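- Defines every available action and its attributes
- Includes each action's category and strength
-
-## Training Results
-
- Average episode length: 8-9 steps
- Average cumulative reward: 6.5 or higher
- Target-price reach rate: 90% or higher
-
-## Additional Information
-
- Python 3.9 or later recommended
- Required packages: numpy, gymnasium, h5py, pyyaml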
+# Q_Table
--- a/config.py
+++ b/config.py
@ -0,0 +1,19 @@
+# config.py
+
+# --- Training Hyperparameters ---
+LEARNING_RATE = 0.1
+GAMMA = 0.99
+TOTAL_EPISODES = 10000
+
+# --- Epsilon Parameters ---
+EPSILON_START = 1.0
+EPSILON_END = 0.01
+EPSILON_DECAY_RATE = 0.0005
+
+# --- Environment Parameters ---
+SCENARIO = 0  # 0: A, 1: B, 2: C, 3: D
+TARGET_PRICE = 100
+THRESHOLD_PRICE = 120
+
+# --- File Paths ---
+Q_TABLE_SAVE_PATH = "saved_models/q_table.npy"
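Review note: `config.py` defines `EPSILON_START`, `EPSILON_END`, and `EPSILON_DECAY_RATE`, but the schedule that consumes them is not part of this diff. One common choice is exponential decay per episode. The sketch below assumes that form, and `epsilon_at` is a hypothetical helper, so the training code may well compute epsilon differently.

```python
import math

# Values copied from config.py in this diff.
EPSILON_START = 1.0
EPSILON_END = 0.01
EPSILON_DECAY_RATE = 0.0005
TOTAL_EPISODES = 10000


def epsilon_at(episode):
    """Assumed schedule: exponential decay from EPSILON_START toward EPSILON_END."""
    span = EPSILON_START - EPSILON_END
    return EPSILON_END + span * math.exp(-EPSILON_DECAY_RATE * episode)


for episode in (0, 1000, 5000, TOTAL_EPISODES):
    print(f"episode {episode:>5}: epsilon = {epsilon_at(episode):.4f}")
```

Under this assumed schedule, epsilon is still roughly 0.017 at the last episode, so it approaches `EPSILON_END` without reaching it within `TOTAL_EPISODES`.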
--- a/configs/actions.json
+++ b/configs/actions.json
@ -1,67 +0,0 @@
-{
-    "actions": [
-        {
-            "id": 0,
-            "name": "STRONG_ACCEPT",
-            "description": "강한 수락",
-            "category": "accept",
-            "strength": "strong"
-        },
-        {
-            "id": 1,
-            "name": "MEDIUM_ACCEPT",
-            "description": "중간 수락",
-            "category": "accept",
-            "strength": "medium"
-        },
-        {
-            "id": 2,
-            "name": "WEAK_ACCEPT",
-            "description": "약한 수락",
-            "category": "accept",
-            "strength": "weak"
-        },
-        {
-            "id": 3,
-            "name": "STRONG_REJECT",
-            "description": "강한 거절",
-            "category": "reject",
-            "strength": "strong"
-        },
-        {
-            "id": 4,
-            "name": "MEDIUM_REJECT",
-            "description": "중간 거절",
-            "category": "reject",
-            "strength": "medium"
-        },
-        {
-            "id": 5,
-            "name": "WEAK_REJECT",
-            "description": "약한 거절",
-            "category": "reject",
-            "strength": "weak"
-        },
-        {
-            "id": 6,
-            "name": "STRONG_PROPOSE",
-            "description": "강한 가격 제안",
-            "category": "propose",
-            "strength": "strong"
-        },
-        {
-            "id": 7,
-            "name": "MEDIUM_PROPOSE",
-            "description": "중간 가격 제안",
-            "category": "propose",
-            "strength": "medium"
-        },
-        {
-            "id": 8,
-            "name": "WEAK_PROPOSE",
-            "description": "약한 가격 제안",
-            "category": "propose",
-            "strength": "weak"
-        }
-    ]
-}
--- a/configs/offline_env_config.yaml
+++ b/configs/offline_env_config.yaml
@ -1,12 +1,7 @@
-env:
-  scenario: 0
-  target_price: 100
-  threshold_price: 120
-
 dataset_params:
  path: datasets/collected_data.h5
  batch_size: 64

-agent:
+agent_params:
  learning_rate: 0.001
  discount_factor: 0.99
--- a/data_collector.py
+++ b/data_collector.py
@ -1,74 +1,49 @@
+import h5py
 import numpy as np
 import yaml
-import json
-import os
-from datetime import datetime

-from negotiation_agent.environment import NegotiationEnv
-from negotiation_agent.spaces import NegotiationSpaces
+from envs.my_custom_env import MyCustomEnv


 def main():
    with open("configs/offline_env_config.yaml", "r") as f:
        config = yaml.safe_load(f)

-    env = NegotiationEnv()
-    spaces = NegotiationSpaces()
+    env = MyCustomEnv()
+    dataset_path = config["dataset_params"]["path"]

    num_episodes = 10
    max_steps_per_episode = 100

-    # Lists that hold the collected data
-    episodes_data = []
+    with h5py.File(dataset_path, 'w') as f:
+        observations = []
+        actions = []
+        rewards = []
+        next_observations = []
+        terminals = []

-    for episode in range(num_episodes):
-        episode_data = {
-            "episode_id": episode,
-            "timestamp": datetime.now().isoformat(),
-            "steps": []
-        }
+        for episode in range(num_episodes):
+            obs, _ = env.reset()
+            for step in range(max_steps_per_episode):
+                action = env.action_space.sample()
+                next_obs, reward, terminated, _, _ = env.step(action)

-        obs, _ = env.reset()
-        episode_reward = 0
+                observations.append(obs)
+                actions.append(action)
+                rewards.append(reward)
+                next_observations.append(next_obs)
+                terminals.append(terminated)

-        for step in range(max_steps_per_episode):
-            # Select an action and interact with the environment
-            action = env.action_space.sample()
-            next_obs, reward, terminated, _, _ = env.step(action)
-            episode_reward += reward
+                obs = next_obs

-            # Store the step data
-            step_data = {
-                "step": step,
-                "state": spaces.get_state_description(obs),
-                "action": spaces.get_action_description(action),
-                "reward": float(reward),
-                "next_state": spaces.get_state_description(next_obs),
-                "current_price": float(env.current_price),
-                "terminated": terminated
-            }
-            episode_data["steps"].append(step_data)
+                if terminated:
+                    break
        
-            obs = next_obs
-            if terminated:
-                break
-        
-        episode_data["total_reward"] = float(episode_reward)
-        episode_data["num_steps"] = len(episode_data["steps"])
-        episodes_data.append(episode_data)
-    
-    # Save as a JSON file
-    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-    json_path = f"logs/collected_data_{timestamp}.json"
-    os.makedirs("logs", exist_ok=True)
-    
-    with open(json_path, 'w', encoding='utf-8') as f:
-        json.dump(episodes_data, f, ensure_ascii=False, indent=2)
-    
-    print(f"Data collected and saved to {json_path}")
-    print(f"Total episodes: {len(episodes_data)}")
-    print(f"Average steps per episode: {sum(ep['num_steps'] for ep in episodes_data) / len(episodes_data):.2f}")
-    print(f"Average reward per episode: {sum(ep['total_reward'] for ep in episodes_data) / len(episodes_data):.2f}")
+        f.create_dataset("observations", data=np.array(observations))
+        f.create_dataset("actions", data=np.array(actions))
+        f.create_dataset("rewards", data=np.array(rewards))
+        f.create_dataset("next_observations", data=np.array(next_observations))
+        f.create_dataset("terminals", data=np.array(terminals))

 if __name__ == "__main__":
    main()
--- a/datasets/collected_data.h5
+++ b/datasets/collected_data.h5
--- a/envs/__init__.py
+++ b/envs/__init__.py
--- a/envs/__pycache__/__init__.cpython-312.pyc
+++ b/envs/__pycache__/__init__.cpython-312.pyc
--- a/envs/__pycache__/__init__.cpython-39.pyc
+++ b/envs/__pycache__/__init__.cpython-39.pyc
--- a/envs/__pycache__/my_custom_env.cpython-312.pyc
+++ b/envs/__pycache__/my_custom_env.cpython-312.pyc
--- a/envs/__pycache__/my_custom_env.cpython-39.pyc
+++ b/envs/__pycache__/my_custom_env.cpython-39.pyc
--- a/envs/my_custom_env.py
+++ b/envs/my_custom_env.py
@ -0,0 +1,24 @@
+import gymnasium as gym
+
+class MyCustomEnv(gym.Env):
+    def __init__(self):
+        super().__init__()
+        self.action_space = gym.spaces.Discrete(2)
+        self.observation_space = gym.spaces.Discrete(10)
+
+    def step(self, action):
+        observation = self.observation_space.sample()
+        reward = 1.0
+        terminated = False
+        truncated = False
+        info = {}
+        return observation, reward, terminated, truncated, info
+
+    def reset(self, seed=None, options=None):
+        super().reset(seed=seed)
+        observation = self.observation_space.sample()
+        info = {}
+        return observation, info
+
+    def render(self):
+        pass
--- a/evaluate.py
+++ b/evaluate.py
@ -1,66 +1,41 @@
 from negotiation_agent.environment import NegotiationEnv
-from agents.offline_agent import QLearningAgent
-import yaml
-import numpy as np
+from negotiation_agent.agent import QLearningAgent
+import config


-def main():
-    # Load the environment config
-    with open('configs/offline_env_config.yaml', 'r') as f:
-        config = yaml.safe_load(f)
-
-    # Initialize the environment
+def evaluate():
    env = NegotiationEnv(
-        scenario=config['env']['scenario'],
-        target_price=config['env']['target_price'],
-        threshold_price=config['env']['threshold_price']
+        scenario=config.SCENARIO,
+        target_price=config.TARGET_PRICE,
+        threshold_price=config.THRESHOLD_PRICE,
    )

-    # Initialize the agent and load the Q-table
-    state_dims = env.observation_space.nvec
-    state_size = np.prod(state_dims)  # Total size of the state space
-    action_size = env.action_space.n
-    agent = QLearningAgent(config['agent'], state_size, action_size)
-    agent.load_q_table('saved_models/q_table.npy')
+    # Create the agent, then load the trained Q-table.
+    agent = QLearningAgent(
+        state_dims=env.observation_space.nvec,
+        action_size=env.action_space.n,
+        learning_rate=0,  # No learning during evaluation
+        gamma=0,
+        epsilon=0,  # No exploration during evaluation; always pick the greedy action
+    )
+    agent.load_q_table(config.Q_TABLE_SAVE_PATH)

-    print(f"State space size: {state_size}")
-    print(f"Action space size: {action_size}")
-    print(f"Q-table shape: {agent.q_table.shape}")
+    print("--- Starting evaluation of the trained agent ---")
+    state, info = env.reset()
+    terminated = False
+    total_reward = 0

-    # Run the evaluation
-    num_episodes = 10
-    total_rewards = []
+    truncated = False
+    while not (terminated or truncated):
+        action = agent.get_action(state)
+        state, reward, terminated, truncated, info = env.step(action)
+        total_reward += reward
+        print(f"State: {state}, chosen action: {action}, reward: {reward:.4f}")

-    for episode in range(num_episodes):
-        state, _ = env.reset()
-        episode_reward = 0
-        done = False
-        
-        while not done:
-            # Convert the state to an index
-            state_idx = np.ravel_multi_index(tuple(state), env.observation_space.nvec)
-            # Select the greedy action
-            action = np.argmax(agent.q_table[state_idx])
-            
-            # Take one step in the environment
-            next_state, reward, done, _, _ = env.step(action)
-            episode_reward += reward
-            state = next_state
-            
-            # Print the current state
-            print(f"Episode {episode + 1}")
-            print(f"State: {env.spaces.get_state_description(state)}")
-            print(f"Action: {env.spaces.get_action_description(action)}")
-            print(f"Reward: {reward:.2f}")
-            print(f"Current Price: {env.current_price:.2f}")
-            print("--------------------")
-        
-        total_rewards.append(episode_reward)
-        print(f"Episode {episode + 1} finished with total reward: {episode_reward:.2f}")
-        print("========================================")
-    
-    print(f"Average reward over {num_episodes} episodes: {np.mean(total_rewards):.2f}")
+    print("\n✅ Evaluation finished!")
+    print(f"Final negotiated price: {env.current_price:.2f} (target: {env.target_price})")
+    print(f"Total reward: {total_reward:.4f}")
+    env.close()


 if __name__ == "__main__":
-    main()
+    evaluate()
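Review note: an evaluation loop over a Gymnasium-style environment generally needs to stop on either `terminated` or `truncated`, and a step cap guards against an environment that signals neither. A minimal sketch, assuming the five-value `step()` return used elsewhere in this diff. `CountdownEnv` and `run_episode` are hypothetical stand-ins, not `NegotiationEnv` or the project's `evaluate()`.

```python
class CountdownEnv:
    """Hypothetical stub that mimics the Gymnasium reset/step signatures."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self, seed=None, options=None):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        # This stub only ever hits a time limit; it never sets terminated.
        truncated = self.t >= self.horizon
        return self.t, 1.0, False, truncated, {}


def run_episode(env, policy, max_steps=1000):
    """Run one episode and return (total_reward, steps)."""
    state, _ = env.reset()
    total_reward, steps = 0.0, 0
    done = False
    while not done and steps < max_steps:
        action = policy(state)
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        steps += 1
        done = terminated or truncated
    return total_reward, steps


total, steps = run_episode(CountdownEnv(horizon=5), policy=lambda s: 0)
print(total, steps)
```

A loop that checked only `terminated` would never exit against this stub, which is why both flags are combined into `done`.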
--- a/logs/collected_data_20250922_161824.json
+++ b/logs/collected_data_20250922_161824.json
@ -1,865 +0,0 @@
-[
-  {
-    "episode_id": 0,
-    "timestamp": "2025-09-22T16:18:24.027941",
-    "steps": [
-      {
-        "step": 0,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "중간 거절",
-        "reward": 0.7358985001204814,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 135.8883052263702,
-        "terminated": false
-      },
-      {
-        "step": 1,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "약한 거절",
-        "reward": 0.7861334009215178,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 127.20487373107215,
-        "terminated": false
-      },
-      {
-        "step": 2,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "약한 수락",
-        "reward": 0.6305285853598203,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 118.94781892751169,
-        "terminated": false
-      },
-      {
-        "step": 3,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "중간 수락",
-        "reward": 0.6740021559509319,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 111.27560844992324,
-        "terminated": false
-      },
-      {
-        "step": 4,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "중간 가격 제안",
-        "reward": 0.7318358321615553,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 102.48200033944701,
-        "terminated": false
-      },
-      {
-        "step": 5,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "약한 거절",
-        "reward": 0.570634635379974,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 96.3839146626223,
-        "terminated": false
-      },
-      {
-        "step": 6,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "action": "강한 가격 제안",
-        "reward": 1.0595279975097056,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 90.23375501159687,
-        "terminated": true
-      }
-    ],
-    "total_reward": 5.188561107403986,
-    "num_steps": 7
-  },
-  {
-    "episode_id": 1,
-    "timestamp": "2025-09-22T16:18:24.029744",
-    "steps": [
-      {
-        "step": 0,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "강한 거절",
-        "reward": 0.7390394774903258,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 135.31076897216093,
-        "terminated": false
-      },
-      {
-        "step": 1,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "중간 거절",
-        "reward": 0.801125516069835,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 124.82438518570777,
-        "terminated": false
-      },
-      {
-        "step": 2,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "약한 가격 제안",
-        "reward": 0.8266746792259854,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 120.96656945345163,
-        "terminated": false
-      },
-      {
-        "step": 3,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "약한 거절",
-        "reward": 0.6539156600559058,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 114.69368999908635,
-        "terminated": false
-      },
-      {
-        "step": 4,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "강한 수락",
-        "reward": 0.7009856035857323,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 106.9922115609145,
-        "terminated": false
-      },
-      {
-        "step": 5,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "중간 거절",
-        "reward": 0.729344462994045,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 102.83206880342405,
-        "terminated": false
-      },
-      {
-        "step": 6,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "강한 가격 제안",
-        "reward": 0.567252902669142,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 96.95851663553233,
-        "terminated": false
-      },
-      {
-        "step": 7,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "action": "중간 수락",
-        "reward": 1.0434788040265859,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 92.67390785793958,
-        "terminated": true
-      }
-    ],
-    "total_reward": 6.061817106117557,
-    "num_steps": 8
-  },
-  {
-    "episode_id": 2,
-    "timestamp": "2025-09-22T16:18:24.029881",
-    "steps": [
-      {
-        "step": 0,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "강한 거절",
-        "reward": 0.7269372327267993,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 137.56345871141042,
-        "terminated": false
-      },
-      {
-        "step": 1,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "중간 거절",
-        "reward": 0.756597194679587,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 132.17072532544773,
-        "terminated": false
-      },
-      {
-        "step": 2,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "강한 거절",
-        "reward": 0.7775685813646708,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 128.6060193230739,
-        "terminated": false
-      },
-      {
-        "step": 3,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "중간 수락",
-        "reward": 0.812286351695623,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 123.10929488259042,
-        "terminated": false
-      },
-      {
-        "step": 4,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "중간 수락",
-        "reward": 0.6493055986284829,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 115.50801372792905,
-        "terminated": false
-      },
-      {
-        "step": 5,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "강한 가격 제안",
-        "reward": 0.690788532528077,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 108.57157649320362,
-        "terminated": false
-      },
-      {
-        "step": 6,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "중간 거절",
-        "reward": 0.7100689199148662,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 105.62354990694725,
-        "terminated": false
-      },
-      {
-        "step": 7,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "약한 가격 제안",
-        "reward": 0.5557374640257946,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 98.9675945212993,
-        "terminated": false
-      },
-      {
-        "step": 8,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "action": "중간 가격 제안",
-        "reward": 1.0412902763036351,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 93.01691944576609,
-        "terminated": true
-      }
-    ],
-    "total_reward": 6.720580151867536,
-    "num_steps": 9
-  },
-  {
-    "episode_id": 3,
-    "timestamp": "2025-09-22T16:18:24.030028",
-    "steps": [
-      {
-        "step": 0,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "중간 거절",
-        "reward": 0.7252551783902629,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 137.88250395116734,
-        "terminated": false
-      },
-      {
-        "step": 1,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "약한 거절",
-        "reward": 0.7477667619419455,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 133.7315391503905,
-        "terminated": false
-      },
-      {
-        "step": 2,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "중간 가격 제안",
-        "reward": 0.77003855118024,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 129.86362805697163,
-        "terminated": false
-      },
-      {
-        "step": 3,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "중간 수락",
-        "reward": 0.8243137799776058,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 121.31302718573589,
-        "terminated": false
-      },
-      {
-        "step": 4,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "중간 거절",
-        "reward": 0.6353442495723813,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 118.04624036572108,
-        "terminated": false
-      },
-      {
-        "step": 5,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "강한 가격 제안",
-        "reward": 0.6678033474242119,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 112.30851161390989,
-        "terminated": false
-      },
-      {
-        "step": 6,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "중간 거절",
-        "reward": 0.6816974190216999,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 110.01948651592679,
-        "terminated": false
-      },
-      {
-        "step": 7,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "약한 가격 제안",
-        "reward": 0.7100061980929508,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 105.63288067265765,
-        "terminated": false
-      },
-      {
-        "step": 8,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "중간 가격 제안",
-        "reward": 0.7442817697542694,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 100.7682883657917,
-        "terminated": false
-      },
-      {
-        "step": 9,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "강한 수락",
-        "reward": 1.0374549618334905,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 93.62419857403351,
-        "terminated": true
-      }
-    ],
-    "total_reward": 7.5439622171890575,
-    "num_steps": 10
-  },
-  {
-    "episode_id": 4,
-    "timestamp": "2025-09-22T16:18:24.030189",
-    "steps": [
-      {
-        "step": 0,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "중간 수락",
-        "reward": 0.7146333204199368,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 139.93190233732378,
-        "terminated": false
-      },
-      {
-        "step": 1,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "약한 가격 제안",
-        "reward": 0.7297921670526436,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 137.02531284195942,
-        "terminated": false
-      },
-      {
-        "step": 2,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "약한 거절",
-        "reward": 0.75510873299296,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 132.43125874553,
-        "terminated": false
-      },
-      {
-        "step": 3,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "중간 가격 제안",
-        "reward": 0.8102459947081206,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 123.41930802882099,
-        "terminated": false
-      },
-      {
-        "step": 4,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "강한 가격 제안",
-        "reward": 0.8317345976484769,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 120.23066045674327,
-        "terminated": false
-      },
-      {
-        "step": 5,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "강한 거절",
-        "reward": 0.6560086328440509,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 114.32776375951947,
-        "terminated": false
-      },
-      {
-        "step": 6,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "약한 거절",
-        "reward": 0.674953538395559,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 111.1187596383056,
-        "terminated": false
-      },
-      {
-        "step": 7,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "중간 가격 제안",
-        "reward": 0.7302460660076945,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 102.70510652666732,
-        "terminated": false
-      },
-      {
-        "step": 8,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "약한 가격 제안",
-        "reward": 0.5720230516832283,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 96.14997129601272,
-        "terminated": false
-      },
-      {
-        "step": 9,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "action": "약한 거절",
-        "reward": 1.0506427970382055,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 91.5685666609294,
-        "terminated": true
-      }
-    ],
-    "total_reward": 7.525388898790876,
-    "num_steps": 10
-  },
-  {
-    "episode_id": 5,
-    "timestamp": "2025-09-22T16:18:24.030352",
-    "steps": [
-      {
-        "step": 0,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "중간 거절",
-        "reward": 0.7305613483955329,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 136.88104389812182,
-        "terminated": false
-      },
-      {
-        "step": 1,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "약한 가격 제안",
-        "reward": 0.7567871209193779,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 132.1375552460719,
-        "terminated": false
-      },
-      {
-        "step": 2,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "약한 가격 제안",
-        "reward": 0.78862199555197,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 126.80346295693705,
-        "terminated": false
-      },
-      {
-        "step": 3,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "강한 거절",
-        "reward": 0.6358337173218354,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 117.95536782148626,
-        "terminated": false
-      },
-      {
-        "step": 4,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "약한 수락",
-        "reward": 0.6888695836656975,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 108.8740187959828,
-        "terminated": false
-      },
-      {
-        "step": 5,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "중간 수락",
-        "reward": 0.718338032660941,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 104.4076696345554,
-        "terminated": false
-      },
-      {
-        "step": 6,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "약한 거절",
-        "reward": 0.7340748841478206,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 102.16941298443504,
-        "terminated": false
-      },
-      {
-        "step": 7,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "약한 가격 제안",
-        "reward": 0.5570155928801377,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 98.74050332345952,
-        "terminated": false
-      },
-      {
-        "step": 8,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "action": "중간 가격 제안",
-        "reward": 1.044383090477607,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 92.53291501917664,
-        "terminated": true
-      }
-    ],
-    "total_reward": 6.65448536602092,
-    "num_steps": 9
-  },
-  {
-    "episode_id": 6,
-    "timestamp": "2025-09-22T16:18:24.030490",
-    "steps": [
-      {
-        "step": 0,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "강한 수락",
-        "reward": 0.72750892291292,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 137.4553587598672,
-        "terminated": false
-      },
-      {
-        "step": 1,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "약한 수락",
-        "reward": 0.782524180941611,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 127.79157812052534,
-        "terminated": false
-      },
-      {
-        "step": 2,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "중간 수락",
-        "reward": 0.8283112545472572,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 120.72756400570525,
-        "terminated": false
-      },
-      {
-        "step": 3,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "중간 거절",
-        "reward": 0.6730035701665233,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 111.44071640131496,
-        "terminated": false
-      },
-      {
-        "step": 4,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "중간 수락",
-        "reward": 0.7083731926007367,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 105.87639507452755,
-        "terminated": false
-      },
-      {
-        "step": 5,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "강한 거절",
-        "reward": 0.5571921272965241,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 98.70921950541188,
-        "terminated": false
-      },
-      {
-        "step": 6,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "action": "중간 거절",
-        "reward": 0.578004004688776,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 95.15505005819907,
-        "terminated": false
-      },
-      {
-        "step": 7,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "action": "약한 거절",
-        "reward": 1.059787946167413,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 90.19528894541341,
-        "terminated": true
-      }
-    ],
-    "total_reward": 5.9147051993217605,
-    "num_steps": 8
-  },
-  {
-    "episode_id": 7,
-    "timestamp": "2025-09-22T16:18:24.030613",
-    "steps": [
-      {
-        "step": 0,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "강한 거절",
-        "reward": 0.7476969492228983,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 133.74402570979152,
-        "terminated": false
-      },
-      {
-        "step": 1,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "약한 수락",
-        "reward": 0.7826844090569465,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 127.76541712449546,
-        "terminated": false
-      },
-      {
-        "step": 2,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "약한 수락",
-        "reward": 0.6339560237546636,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 118.30473595913722,
-        "terminated": false
-      },
-      {
-        "step": 3,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "중간 수락",
-        "reward": 0.6700352339433415,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 111.93441210338206,
-        "terminated": false
-      },
-      {
-        "step": 4,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "약한 수락",
-        "reward": 0.7174069347685657,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 104.54317677343735,
-        "terminated": false
-      },
-      {
-        "step": 5,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "강한 거절",
-        "reward": 0.5640770884044879,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 97.5044034416811,
-        "terminated": false
-      },
-      {
-        "step": 6,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "action": "중간 수락",
-        "reward": 1.0558066698679451,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 90.78803970908572,
-        "terminated": true
-      }
-    ],
-    "total_reward": 5.171663309018848,
-    "num_steps": 7
-  },
-  {
-    "episode_id": 8,
-    "timestamp": "2025-09-22T16:18:24.030719",
-    "steps": [
-      {
-        "step": 0,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "강한 수락",
-        "reward": 0.7465479684274815,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 133.94986555336644,
-        "terminated": false
-      },
-      {
-        "step": 1,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "약한 거절",
-        "reward": 0.7880565018752345,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 126.89445460070839,
-        "terminated": false
-      },
-      {
-        "step": 2,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "강한 수락",
-        "reward": 0.6264636918952473,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 119.71962776182876,
-        "terminated": false
-      },
-      {
-        "step": 3,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "약한 가격 제안",
-        "reward": 0.6548001047926377,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 114.53877213985942,
-        "terminated": false
-      },
-      {
-        "step": 4,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "약한 가격 제안",
-        "reward": 0.681939343451541,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 109.98045606284857,
-        "terminated": false
-      },
-      {
-        "step": 5,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "중간 수락",
-        "reward": 0.7000434976249617,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 107.13619975680452,
-        "terminated": false
-      },
-      {
-        "step": 6,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "약한 거절",
-        "reward": 0.7290266498774218,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 102.87689759024647,
-        "terminated": false
-      },
-      {
-        "step": 7,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "약한 수락",
-        "reward": 0.5537943325131409,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 99.31484807799998,
-        "terminated": false
-      },
-      {
-        "step": 8,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "action": "약한 가격 제안",
-        "reward": 1.0514864005028703,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 91.44013888596228,
-        "terminated": true
-      }
-    ],
-    "total_reward": 6.532158490960537,
-    "num_steps": 9
-  },
-  {
-    "episode_id": 9,
-    "timestamp": "2025-09-22T16:18:24.030862",
-    "steps": [
-      {
-        "step": 0,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "약한 거절",
-        "reward": 0.7179479516097199,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 139.2858629595485,
-        "terminated": false
-      },
-      {
-        "step": 1,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "약한 수락",
-        "reward": 0.7357380633672626,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "current_price": 135.91793734624605,
-        "terminated": false
-      },
-      {
-        "step": 2,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=낮음 (<10%))",
-        "action": "강한 거절",
-        "reward": 0.7985485009052322,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "current_price": 125.22720897558546,
-        "terminated": false
-      },
-      {
-        "step": 3,
-        "state": "State(scenario=높은 구매 의지, price_zone=임계가격 초과, acceptance_rate=중간 (10-25%))",
-        "action": "약한 수락",
-        "reward": 0.634178662402988,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 118.26320317340058,
-        "terminated": false
-      },
-      {
-        "step": 4,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "중간 거절",
-        "reward": 0.66713537132276,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "current_price": 112.4209616577428,
-        "terminated": false
-      },
-      {
-        "step": 5,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=중간 (10-25%))",
-        "action": "약한 가격 제안",
-        "reward": 0.6959274570546614,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 107.76985336577854,
-        "terminated": false
-      },
-      {
-        "step": 6,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "중간 수락",
-        "reward": 0.7252535975319768,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 103.4121033735282,
-        "terminated": false
-      },
-      {
-        "step": 7,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "중간 거절",
-        "reward": 0.7424496531354222,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "current_price": 101.01695068920729,
-        "terminated": false
-      },
-      {
-        "step": 8,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격~임계가격, acceptance_rate=높음 (>25%))",
-        "action": "강한 수락",
-        "reward": 0.5587233113829603,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 98.43870638556888,
-        "terminated": false
-      },
-      {
-        "step": 9,
-        "state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "action": "중간 가격 제안",
-        "reward": 1.0321906190591832,
-        "next_state": "State(scenario=높은 구매 의지, price_zone=목표가격 이하, acceptance_rate=높음 (>25%))",
-        "current_price": 94.47077675157269,
-        "terminated": true
-      }
-    ],
-    "total_reward": 7.308093187772166,
-    "num_steps": 10
-  }
-]
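The removed log stores each episode as a list of step records plus `total_reward` and `num_steps` summary fields. Below is a minimal sketch of a consistency check for logs of this shape. The sample data and the `check_episode` helper are hypothetical. The rule that only the final step is marked `terminated` is an assumption read off the episodes visible in this diff.

```python
import json
import math

# Illustrative sample in the same shape as the removed log; the values are made up.
SAMPLE = """
[
  {
    "episode_id": 0,
    "steps": [
      {"step": 0, "action": "a", "reward": 0.5, "terminated": false},
      {"step": 1, "action": "b", "reward": 1.0, "terminated": true}
    ],
    "total_reward": 1.5,
    "num_steps": 2
  }
]
"""


def check_episode(episode):
    """Return a list of problems found in one episode record (hypothetical helper)."""
    steps = episode["steps"]
    problems = []
    # The summary step count should match the number of logged steps.
    if episode["num_steps"] != len(steps):
        problems.append("num_steps mismatch")
    # The summary reward should equal the sum of the per-step rewards.
    if not math.isclose(episode["total_reward"], sum(s["reward"] for s in steps)):
        problems.append("total_reward mismatch")
    # Assumption: only the last step of an episode is marked terminated.
    expected_flags = [False] * (len(steps) - 1) + [True]
    if [s["terminated"] for s in steps] != expected_flags:
        problems.append("unexpected terminated flags")
    return problems


episodes = json.loads(SAMPLE)
for ep in episodes:
    print(ep["episode_id"], check_episode(ep))
```

A check like this could run once before a log is converted into `(s, a, r, s', d)` tuples for offline training, so that truncated or hand-edited episodes are caught early.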
--- a/negotiation_agent/pycache/init.cpython-39.pyc
+++ b/negotiation_agent/pycache/init.cpython-39.pyc
--- a/negotiation_agent/pycache/action_space.cpython-39.pyc
+++ b/negotiation_agent/pycache/action_space.cpython-39.pyc
--- a/negotiation_agent/pycache/environment.cpython-39.pyc
+++ b/negotiation_agent/pycache/environment.cpython-39.pyc
--- a/negotiation_agent/pycache/spaces.cpython-39.pyc
+++ b/negotiation_agent/pycache/spaces.cpython-39.pyc
--- a/negotiation_agent/action_space.py
+++ b/negotiation_agent/action_space.py
@ -1,107 +0,0 @@
-from dataclasses import dataclass
-from typing import Dict, List, Optional
-import json
-import os
-from pathlib import Path
-
-
-@dataclass
-class ActionInfo:
-    id: int
-    name: str
-    description: str
-    category: str
-    strength: str
-
-    def __str__(self) -> str:
-        return f"{self.description} (ID: {self.id}, Category: {self.category}, Strength: {self.strength})"
-
-
-class ActionSpace:
-    def __init__(self, config_path: Optional[str] = None):
-        if config_path is None:
-            config_path = os.path.join(
-                Path(__file__).parent.parent, "configs", "actions.json"
-            )
-        self._actions: Dict[int, ActionInfo] = {}
-        self._load_actions(config_path)
-
-    def _load_actions(self, config_path: str) -> None:
-        """Load action definitions from a JSON file."""
-        with open(config_path, "r", encoding="utf-8") as f:
-            data = json.load(f)
-
-        for action in data["actions"]:
-            self.add_action(
-                id=action["id"],
-                name=action["name"],
-                description=action["description"],
-                category=action["category"],
-                strength=action["strength"],
-            )
-
-    def add_action(
-        self, id: int, name: str, description: str, category: str, strength: str
-    ) -> None:
-        """Add a new action."""
-        if id in self._actions:
-            raise ValueError(f"Action with id {id} already exists")
-
-        self._actions[id] = ActionInfo(
-            id=id,
-            name=name,
-            description=description,
-            category=category,
-            strength=strength,
-        )
-
-    def remove_action(self, action_id: int) -> None:
-        """Remove the action with the given ID."""
-        if action_id not in self._actions:
-            raise ValueError(f"Action with id {action_id} does not exist")
-        del self._actions[action_id]
-
-    def get_action(self, action_id: int) -> ActionInfo:
-        """Look up an action's information by its ID."""
-        if action_id not in self._actions:
-            raise ValueError(f"Invalid action id: {action_id}")
-        return self._actions[action_id]
-
-    def get_actions_by_category(self, category: str) -> List[ActionInfo]:
-        """Return all actions in the given category."""
-        return [
-            action for action in self._actions.values() if action.category == category
-        ]
-
-    def get_actions_by_strength(self, strength: str) -> List[ActionInfo]:
-        """Return all actions with the given strength."""
-        return [
-            action for action in self._actions.values() if action.strength == strength
-        ]
-
-    def save_actions(self, file_path: str) -> None:
-        """Save the current action configuration to a JSON file."""
-        data = {
-            "actions": [
-                {
-                    "id": action.id,
-                    "name": action.name,
-                    "description": action.description,
-                    "category": action.category,
-                    "strength": action.strength,
-                }
-                for action in self._actions.values()
-            ]
-        }
-
-        with open(file_path, "w", encoding="utf-8") as f:
-            json.dump(data, f, ensure_ascii=False, indent=4)
-
-    @property
-    def action_space_size(self) -> int:
-        """Return the current size of the action space."""
-        return len(self._actions)
-
-    def list_actions(self) -> List[ActionInfo]:
-        """Return all action information as a list."""
-        return list(self._actions.values())
--- a/negotiation_agent/agent.py
+++ b/negotiation_agent/agent.py
@ -0,0 +1,63 @@
+import numpy as np
+import random
+import os
+
+
+class QLearningAgent:
+    """Agent that acts and learns using a Q-table."""
+
+    def __init__(self, state_dims, action_size, learning_rate, gamma, epsilon):
+        self.state_dims = state_dims  # [4, 3, 3]
+        self.action_size = action_size
+        self.lr = learning_rate
+        self.gamma = gamma
+        self.epsilon = epsilon
+
+        # Initialize the Q-table to zeros: 36 states x 9 actions
+        num_states = np.prod(state_dims)  # 4 * 3 * 3 = 36
+        self.q_table = np.zeros((num_states, action_size))
+
+    def _state_to_index(self, state):
+        """Convert a MultiDiscrete state [s0, s1, s2] into a single integer index."""
+        idx = (
+            state[0] * (self.state_dims[1] * self.state_dims[2])
+            + state[1] * self.state_dims[2]
+            + state[2]
+        )
+        return int(idx)
+
+    def get_action(self, state):
+        """Select an action according to the epsilon-greedy policy."""
+        if random.uniform(0, 1) < self.epsilon:
+            return random.randint(0, self.action_size - 1)  # explore (random action)
+        else:
+            state_idx = self._state_to_index(state)
+            return int(np.argmax(self.q_table[state_idx, :]))  # exploit (highest Q-value)
+
+    # ===================================================================
+    # ## 3. Learning Algorithm: Q-table update rule
+    # ===================================================================
+    def learn(self, state, action, reward, next_state):
+        """Update the Q-table from one experience tuple (Q-learning rule)."""
+        state_idx = self._state_to_index(state)
+        next_state_idx = self._state_to_index(next_state)
+
+        old_value = self.q_table[state_idx, action]
+        next_max = np.max(self.q_table[next_state_idx, :])
+
+        # Q-learning temporal-difference update (from the Bellman optimality equation)
+        new_value = old_value + self.lr * (reward + self.gamma * next_max - old_value)
+        self.q_table[state_idx, action] = new_value
+
+    def save_q_table(self, file_path):
+        """Save the Q-table to a file."""
+        np.save(file_path, self.q_table)
+        print(f"Q-Table saved to {file_path}")
+
+    def load_q_table(self, file_path):
+        """Load the Q-table from a file."""
+        if os.path.exists(file_path):
+            self.q_table = np.load(file_path)
+            print(f"Q-Table loaded from {file_path}")
+        else:
+            print(f"Error: No Q-Table found at {file_path}")
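The `QLearningAgent` added above flattens a `[s0, s1, s2]` state into a row index and applies the tabular Q-learning update. The sketch below restates both steps as standalone functions and checks the flattening formula against `np.ravel_multi_index`. The `terminated` flag is a hypothetical addition that is not part of the class in this diff: it skips the bootstrap term on terminal transitions, which is a common variation rather than what the committed `learn` method does.

```python
import numpy as np

STATE_DIMS = (4, 3, 3)  # same sizes as the MultiDiscrete space in this diff
N_ACTIONS = 9


def state_to_index(state, dims=STATE_DIMS):
    """Row-major flattening, same formula as QLearningAgent._state_to_index."""
    return int(state[0] * dims[1] * dims[2] + state[1] * dims[2] + state[2])


def q_update(q_table, s_idx, action, reward, next_idx, lr, gamma, terminated=False):
    """Apply one tabular Q-learning step in place and return the TD error.

    `terminated` is a hypothetical extra argument: when True, the value of the
    next state is not bootstrapped.
    """
    bootstrap = 0.0 if terminated else gamma * np.max(q_table[next_idx])
    td_error = reward + bootstrap - q_table[s_idx, action]
    q_table[s_idx, action] += lr * td_error
    return td_error


# The hand-written formula agrees with NumPy's row-major flattening for every state.
for state in np.ndindex(*STATE_DIMS):
    assert state_to_index(state) == np.ravel_multi_index(state, STATE_DIMS)

q = np.zeros((int(np.prod(STATE_DIMS)), N_ACTIONS))
s = state_to_index((0, 2, 0))
s_next = state_to_index((0, 2, 1))
q[s_next, 3] = 2.0  # pretend the next state already has a known value
td = q_update(q, s, action=1, reward=1.0, next_idx=s_next, lr=0.5, gamma=0.9)
print(td, q[s, 1])
```

Keeping the update as a pure function over the table makes it straightforward to replay stored transitions from a fixed dataset, which is the direction the offline setup in this change points toward.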
--- a/negotiation_agent/constants.py
+++ b/negotiation_agent/constants.py
@ -1,41 +0,0 @@
-from gymnasium import spaces
-
-# Observation Space Constants
-SCENARIO_SPACE_SIZE = 4  # number of scenario states (0-3)
-PRICE_ZONE_SIZE = 3  # number of price zones (0-2)
-ACCEPTANCE_RATE_SIZE = 3  # number of acceptance-rate levels (0-2)
-
-# Observation Space Mappings
-SCENARIO_MAPPING = {
-    0: "높은 구매 의지",
-    1: "중간 구매 의지",
-    2: "낮은 구매 의지",
-    3: "매우 낮은 구매 의지",
-}
-
-PRICE_ZONE_MAPPING = {0: "목표가격 이하", 1: "목표가격~임계가격", 2: "임계가격 초과"}
-
-ACCEPTANCE_RATE_MAPPING = {0: "낮음 (<10%)", 1: "중간 (10-25%)", 2: "높음 (>25%)"}
-
-# Action Space Constants
-ACTION_SPACE_SIZE = 9
-
-# Action Space Mappings
-ACTION_MAPPING = {
-    0: "강한 수락",
-    1: "중간 수락",
-    2: "약한 수락",
-    3: "강한 거절",
-    4: "중간 거절",
-    5: "약한 거절",
-    6: "강한 가격 제안",
-    7: "중간 가격 제안",
-    8: "약한 가격 제안",
-}
-
-# Spaces Definition
-OBSERVATION_SPACE = spaces.MultiDiscrete(
-    [SCENARIO_SPACE_SIZE, PRICE_ZONE_SIZE, ACCEPTANCE_RATE_SIZE]
-)
-
-ACTION_SPACE = spaces.Discrete(ACTION_SPACE_SIZE)
--- a/negotiation_agent/environment.py
+++ b/negotiation_agent/environment.py
@ -1,13 +1,6 @@
 import gymnasium as gym
 from gymnasium import spaces
 import numpy as np
-from negotiation_agent.spaces import (
-    NegotiationSpaces,
-    State,
-    PriceZone,
-    AcceptanceRate,
-    Scenario,
-)


 class NegotiationEnv(gym.Env):
@ -15,11 +8,9 @@ class NegotiationEnv(gym.Env):

    def __init__(self, scenario=0, target_price=100, threshold_price=120):
        super(NegotiationEnv, self).__init__()
-
-        self.spaces = NegotiationSpaces()
-        self.observation_space = self.spaces.observation_space
-        self.action_space = self.spaces.action_space
-        self.initial_scenario = Scenario(scenario)
+        self.observation_space = spaces.MultiDiscrete([4, 3, 3])
+        self.action_space = spaces.Discrete(9)
+        self.initial_scenario = scenario
        self.target_price = target_price
        self.threshold_price = threshold_price
        self.current_price = None
@ -29,28 +20,23 @@ class NegotiationEnv(gym.Env):
    def _get_state(self):
         """Compute the state array from the current information."""
        if self.current_price <= self.target_price:
-            price_zone = PriceZone.BELOW_TARGET
+            price_zone = 0
        elif self.target_price < self.current_price <= self.threshold_price:
-            price_zone = PriceZone.BETWEEN_TARGET_AND_THRESHOLD
+            price_zone = 1
        else:
-            price_zone = PriceZone.ABOVE_THRESHOLD
+            price_zone = 2

        acceptance_rate_val = (
            self.initial_price - self.current_price
        ) / self.initial_price
        if acceptance_rate_val < 0.1:
-            acceptance_rate_level = AcceptanceRate.LOW
+            acceptance_rate_level = 0
        elif 0.1 <= acceptance_rate_val < 0.25:
-            acceptance_rate_level = AcceptanceRate.MEDIUM
+            acceptance_rate_level = 1
        else:
-            acceptance_rate_level = AcceptanceRate.HIGH
+            acceptance_rate_level = 2

-        state = State(
-            scenario=self.initial_scenario,
-            price_zone=price_zone,
-            acceptance_rate=acceptance_rate_level,
-        )
-        return np.array(state.to_array())
+        return np.array([self.initial_scenario, price_zone, acceptance_rate_level])

    def reset(self, seed=None, options=None):
         """Reset the environment to its initial state."""
--- a/negotiation_agent/spaces.py
+++ b/negotiation_agent/spaces.py
@ -1,113 +0,0 @@
-from gymnasium import spaces
-from typing import Dict, List, Any
-from dataclasses import dataclass
-from enum import Enum, auto
-from negotiation_agent.action_space import ActionSpace, ActionInfo
-
-
-class Scenario(Enum):
-    HIGH_INTENTION = 0
-    MEDIUM_INTENTION = 1
-    LOW_INTENTION = 2
-    VERY_LOW_INTENTION = 3
-
-    @property
-    def description(self) -> str:
-        return {
-            self.HIGH_INTENTION: "높은 구매 의지",
-            self.MEDIUM_INTENTION: "중간 구매 의지",
-            self.LOW_INTENTION: "낮은 구매 의지",
-            self.VERY_LOW_INTENTION: "매우 낮은 구매 의지",
-        }[self]
-
-
-class PriceZone(Enum):
-    BELOW_TARGET = 0
-    BETWEEN_TARGET_AND_THRESHOLD = 1
-    ABOVE_THRESHOLD = 2
-
-    @property
-    def description(self) -> str:
-        return {
-            self.BELOW_TARGET: "목표가격 이하",
-            self.BETWEEN_TARGET_AND_THRESHOLD: "목표가격~임계가격",
-            self.ABOVE_THRESHOLD: "임계가격 초과",
-        }[self]
-
-
-class AcceptanceRate(Enum):
-    LOW = 0
-    MEDIUM = 1
-    HIGH = 2
-
-    @property
-    def description(self) -> str:
-        return {
-            self.LOW: "낮음 (<10%)",
-            self.MEDIUM: "중간 (10-25%)",
-            self.HIGH: "높음 (>25%)",
-        }[self]
-
-
-@dataclass
-class State:
-    scenario: Scenario
-    price_zone: PriceZone
-    acceptance_rate: AcceptanceRate
-
-    def to_array(self) -> List[int]:
-        return [self.scenario.value, self.price_zone.value, self.acceptance_rate.value]
-
-    @classmethod
-    def from_array(cls, arr: List[int]) -> "State":
-        return cls(
-            scenario=Scenario(arr[0]),
-            price_zone=PriceZone(arr[1]),
-            acceptance_rate=AcceptanceRate(arr[2]),
-        )
-
-    def __str__(self) -> str:
-        return (
-            f"State(scenario={self.scenario.description}, "
-            f"price_zone={self.price_zone.description}, "
-            f"acceptance_rate={self.acceptance_rate.description})"
-        )
-
-
-class NegotiationSpaces:
-    def __init__(self):
-        self._action_space = ActionSpace()
-
-    @property
-    def observation_space(self) -> spaces.MultiDiscrete:
-        return spaces.MultiDiscrete(
-            [len(Scenario), len(PriceZone), len(AcceptanceRate)]
-        )
-
-    @property
-    def action_space(self) -> spaces.Discrete:
-        return spaces.Discrete(self._action_space.action_space_size)
-
-    def decode_action(self, action_id: int) -> ActionInfo:
-        return self._action_space.get_action(action_id)
-
-    def encode_state(self, state: State) -> List[int]:
-        return state.to_array()
-
-    def decode_state(self, state_array: List[int]) -> State:
-        return State.from_array(state_array)
-
-    def get_action_description(self, action_id: int) -> str:
-        return self.decode_action(action_id).description
-
-    def get_state_description(self, state_array: List[int]) -> str:
-        return str(self.decode_state(state_array))
-
-    def get_actions_by_category(self, category: str) -> List[ActionInfo]:
-        return self._action_space.get_actions_by_category(category)
-
-    def get_actions_by_strength(self, strength: str) -> List[ActionInfo]:
-        return self._action_space.get_actions_by_strength(strength)
-
-    def list_all_actions(self) -> List[ActionInfo]:
-        return self._action_space.list_actions()
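
Note on the removed `NegotiationSpaces`: its observation space is a `MultiDiscrete` over three enums, while the tabular agent addresses a state by a single row index (the old `train_offline.py` used `np.unravel_index(state_idx, env.observation_space.nvec)` for the reverse mapping). The round trip can be sketched roughly as below. The sizes in `NVEC` are taken from the `# 4 * 3 * 3 = 36` comment in `train_offline.py`; the helper names `flatten_state` and `unflatten_state` are hypothetical and do not exist in the repository. The sketch also assumes the enum values are zero-based, which this diff does not show.

```python
import numpy as np

# Sizes of (Scenario, PriceZone, AcceptanceRate). Assumed from the
# "4 * 3 * 3 = 36" comment in train_offline.py, not read from the enums.
NVEC = (4, 3, 3)


def flatten_state(state_array):
    """Map a [scenario, price_zone, acceptance_rate] triple to one Q-table row index.

    Assumes every component is a zero-based index within NVEC.
    """
    return int(np.ravel_multi_index(tuple(state_array), NVEC))


def unflatten_state(state_idx):
    """Inverse of flatten_state: recover the three component indices."""
    return [int(i) for i in np.unravel_index(state_idx, NVEC)]


if __name__ == "__main__":
    idx = flatten_state([1, 2, 0])
    print(idx, unflatten_state(idx))  # 15 [1, 2, 0]
```

If the enum values turn out to start at 1, each component would need an offset of one before flattening.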
--- a/saved_models/offline_agent.pth
+++ b/saved_models/offline_agent.pth
--- a/saved_models/q_table.json
+++ b/saved_models/q_table.json
--- a/saved_models/q_table.npy
+++ b/saved_models/q_table.npy
--- a/tests/test_evaluate_agent_usecase.py
+++ b/tests/test_evaluate_agent_usecase.py
@ -1,7 +1,7 @@
 import unittest

 from agents.offline_agent import QLearningAgent
-from negotiation_agent.environment import NegotiationEnv
+from envs.my_custom_env import MyCustomEnv
 from usecases.evaluate_agent_usecase import EvaluateAgentUseCase

 class TestEvaluateAgentUseCase(unittest.TestCase):
@ -10,7 +10,7 @@ class TestEvaluateAgentUseCase(unittest.TestCase):
        self.state_size = 10
        self.action_size = 2
        self.agent = QLearningAgent(self.agent_params, self.state_size, self.action_size)
-        self.env = NegotiationEnv()
+        self.env = MyCustomEnv()
        self.use_case = EvaluateAgentUseCase()

    def test_execute(self):
--- a/train.py
+++ b/train.py
@ -0,0 +1,29 @@
+from negotiation_agent.environment import NegotiationEnv
+from negotiation_agent.agent import QLearningAgent
+from usecases.train_agent_usecase import TrainAgentUseCase  # import the use case
+import config
+
+
+def main():
+    # 1. Create the dependencies (objects): prepare all the required 'ingredients'.
+    env = NegotiationEnv(
+        scenario=config.SCENARIO,
+        target_price=config.TARGET_PRICE,
+        threshold_price=config.THRESHOLD_PRICE,
+    )
+
+    agent = QLearningAgent(
+        state_dims=env.observation_space.nvec,
+        action_size=env.action_space.n,
+        learning_rate=config.LEARNING_RATE,
+        gamma=config.GAMMA,
+        epsilon=config.EPSILON_START,
+    )
+
+    # 2. Create and run the use case: hand the prepared ingredients to the 'chef' and tell it to 'cook'.
+    train_use_case = TrainAgentUseCase(env=env, agent=agent)
+    train_use_case.execute()
+
+
+if __name__ == "__main__":
+    main()
--- a/train_offline.py
+++ b/train_offline.py
@ -2,12 +2,8 @@ import h5py
 import numpy as np
 import yaml
 import os
-import json
-from datetime import datetime

 from agents.offline_agent import QLearningAgent
-from negotiation_agent.spaces import NegotiationSpaces
-from negotiation_agent.environment import NegotiationEnv

 def main():
    with open("configs/offline_env_config.yaml", "r") as f:
@ -23,12 +19,10 @@ def main():
        next_observations = f["next_observations"][:]
        terminals = f["terminals"][:]
    
-    from negotiation_agent.environment import NegotiationEnv
-    env = NegotiationEnv()
-    state_size = np.prod(env.observation_space.nvec)  # 4 * 3 * 3 = 36
-    action_size = env.action_space.n  # 9
+    state_size = len(np.unique(np.concatenate((observations, next_observations))))
+    action_size = len(np.unique(actions))

-    agent = QLearningAgent(config["agent"], state_size, action_size)  # changed to config["agent"]
+    agent = QLearningAgent(config["agent_params"], state_size, action_size)

    num_epochs = 10
    for epoch in range(num_epochs):
@ -44,61 +38,12 @@ def main():
            }
            agent.learn(batch)

-    # Save the model (npy format)
+    # Save the model
    saved_models_dir = "saved_models"
    os.makedirs(saved_models_dir, exist_ok=True)
    model_path = os.path.join(saved_models_dir, "q_table.npy")
-    np.save(model_path, agent.q_table)
-    
-    # Also save the Q-table in JSON format
-    spaces = NegotiationSpaces()
-    q_table_data = {
-        "metadata": {
-            "state_size": int(state_size),
-            "action_size": int(action_size),
-            "timestamp": datetime.now().isoformat(),
-            "training_episodes": int(num_epochs)
-        },
-        "q_values": []
-    }
-    
-    # Store the Q-values for each state
-    for state_idx in range(state_size):
-        state_indices = np.unravel_index(state_idx, env.observation_space.nvec)
-        state_data = {
-            "state_idx": int(state_idx),
-            "state_desc": spaces.get_state_description(
-                [int(idx) for idx in state_indices]
-            ),
-            "actions": []
-        }
-        
-        # Store the Q-value for each action
-        for action_idx in range(action_size):
-            action_data = {
-                "action_idx": int(action_idx),
-                "action_desc": spaces.get_action_description(action_idx),
-                "q_value": float(agent.q_table[state_idx, action_idx])
-            }
-            state_data["actions"].append(action_data)
-            
-        # Add the optimal (greedy) action info
-        optimal_action_idx = int(np.argmax(agent.q_table[state_idx]))
-        state_data["optimal_action"] = {
-            "action_idx": optimal_action_idx,
-            "action_desc": spaces.get_action_description(optimal_action_idx),
-            "q_value": float(agent.q_table[state_idx, optimal_action_idx])
-        }
-        
-        q_table_data["q_values"].append(state_data)
-    
-    # Save as a JSON file
-    json_path = os.path.join(saved_models_dir, "q_table.json")
-    with open(json_path, 'w', encoding='utf-8') as f:
-        json.dump(q_table_data, f, ensure_ascii=False, indent=2)
-    
+    agent.save_model(model_path)
    print(f"Model saved to {model_path}")
-    print(f"Q-table JSON saved to {json_path}")

 if __name__ == "__main__":
    main()
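
Note on the new sizing lines in `train_offline.py`: `len(np.unique(...))` counts the distinct values that happen to appear in the dataset. That equals the size of the index space only if every state and action index occurs at least once and the indices are contiguous from zero. Otherwise the Q-table may come out smaller than the largest index it is asked to address. A minimal sketch of a max-based alternative follows, assuming states and actions are stored as flat, non-negative integer indices (the dataset layout is not visible in this diff). The helper name `infer_table_shape` is hypothetical.

```python
import numpy as np


def infer_table_shape(observations, next_observations, actions):
    """Return (state_size, action_size) large enough to index every sample.

    Assumes states and actions are flat, non-negative integer indices;
    this layout is an assumption, not something the diff confirms.
    """
    states = np.concatenate((np.ravel(observations), np.ravel(next_observations)))
    state_size = int(states.max()) + 1
    action_size = int(np.max(actions)) + 1
    return state_size, action_size


if __name__ == "__main__":
    # Toy data: state 2 and action 1 never occur, so counting unique values undercounts.
    obs = np.array([0, 1, 3])
    next_obs = np.array([1, 3, 3])
    acts = np.array([0, 2, 2])

    print(len(np.unique(np.concatenate((obs, next_obs)))), len(np.unique(acts)))  # 3 2
    print(infer_table_shape(obs, next_obs, acts))  # (4, 3)
```

Even the max-based sizes undercount when the highest index never appears in the data, so sizes read from the environment's spaces or from the config file are likely the more reliable source when they are available.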
--- a/usecases/evaluate_agent_usecase.py
+++ b/usecases/evaluate_agent_usecase.py
@ -1,8 +1,8 @@
 from agents.offline_agent import QLearningAgent
-from negotiation_agent.environment import NegotiationEnv
+from envs.my_custom_env import MyCustomEnv

 class EvaluateAgentUseCase:
-    def execute(self, agent: QLearningAgent, env: NegotiationEnv, num_episodes: int):
+    def execute(self, agent: QLearningAgent, env: MyCustomEnv, num_episodes: int):
        total_rewards = 0
        for _ in range(num_episodes):
            obs, _ = env.reset()