Improve the Chinese descriptions and revise more details

2025-08-17 17:14:16 +08:00 · 2025-08-17 17:14:16 +08:00 · 6d5b489f9e
parent 96b61d34d8
commit 6d5b489f9e
5 changed files with 986 additions and 91 deletions
--- a/README.md
+++ b/README.md
@@ -1,7 +1,8 @@
 # FScanpy
 ## A Machine Learning-Based Framework for Programmed Ribosomal Frameshifting Prediction

-[![Python](https://img.shields.io/badge/Python-3.7%2B-blue.svg)](https://www.python.org/)
+[![中文](https://img.shields.io/badge/Language-中文-red.svg)](README_zh.md)
+[![Python](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://www.python.org/)
 [![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

 FScanpy is a comprehensive Python package designed for the prediction of [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. By integrating advanced machine learning approaches (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework, FScanpy provides robust and accurate PRF site predictions.
@@ -36,7 +37,7 @@ FScanpy is a comprehensive Python package designed for the prediction of [Progra
 ## 🔧 Installation

 ### Prerequisites
-- Python ≥ 3.7
+- Python ≥ 3.9
 - All dependencies are automatically installed

 ### Install via pip (Recommended)
@@ -104,24 +105,28 @@ results = predictor.predict_sequence(

 ## 🎛️ Ensemble Weight Configuration

-The `ensemble_weight` parameter controls the contribution of each model:
+The `ensemble_weight` parameter controls the weight ratio between HistGB and BiLSTM-CNN models:

-| ensemble_weight | Short Model | Long Model | Best For |
-|----------------|-------------|------------|----------|
-| **0.2-0.3** | 20-30% | 70-80% | **High sensitivity**, detecting subtle sites |
-| **0.4-0.5** | 40-50% | 50-60% | **Balanced detection** (recommended) |
-| **0.6-0.7** | 60-70% | 30-40% | **Fast screening**, high specificity |
+| ensemble_weight | HistGB Model | BiLSTM-CNN Model | Characteristics | Best For |
+|----------------|-------------|------------------|-----------------|----------|
+| **0.2-0.3** | 20-30% | 70-80% | **High specificity**, reduces false positives | Precise validation, clinical applications |
+| **0.4** | 40% | 60% | **Optimal balance**, highest AUC | Standard analysis (recommended) |
+| **0.6-0.8** | 60-80% | 20-40% | **High sensitivity**, captures more sites | High-throughput screening, exploratory research |
+
+### Model Characteristics
+- **HistGB Model**: Excels at identifying true negatives, conservative predictions, low false positive rate
+- **BiLSTM-CNN Model**: Excels at identifying true positives, sensitive predictions, captures more potential sites

 ### Weight Selection Examples
 ```python
-# High sensitivity (Long model dominant)
-sensitive_results = predict_prf(sequence, ensemble_weight=0.2)
+# High specificity configuration per the table above (25% HistGB, 75% BiLSTM-CNN)
+precise_results = predict_prf(sequence, ensemble_weight=0.25)

-# Balanced approach (recommended)
+# Optimal balance configuration (4:6 ratio)
 balanced_results = predict_prf(sequence, ensemble_weight=0.4)

-# Fast screening (Short model dominant)  
-screening_results = predict_prf(sequence, ensemble_weight=0.7)
+# High sensitivity configuration per the table above (70% HistGB, 30% BiLSTM-CNN)
+sensitive_results = predict_prf(sequence, ensemble_weight=0.7)
 ```
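
The percentages in the table can be read as a linear blend of the two model outputs. The sketch below assumes exactly that: a weighted average in which `ensemble_weight` is the HistGB (short model) share. The function name `blend_probabilities` is hypothetical, the inputs are illustrative values rather than real model outputs, and this is not necessarily how FScanpy combines the models internally.

```python
def blend_probabilities(short_prob: float, long_prob: float,
                        ensemble_weight: float = 0.4) -> float:
    """Weighted average of two probabilities (assumed combination rule).

    ensemble_weight is the share given to the short (HistGB) model;
    the long (BiLSTM-CNN) model receives the remainder.
    """
    if not 0.0 <= ensemble_weight <= 1.0:
        raise ValueError("ensemble_weight must be in [0.0, 1.0]")
    return ensemble_weight * short_prob + (1.0 - ensemble_weight) * long_prob


# Illustrative inputs only, not real model outputs.
short_prob, long_prob = 0.2, 0.9
for weight in (0.25, 0.4, 0.7):
    blended = blend_probabilities(short_prob, long_prob, weight)
    print(f"ensemble_weight={weight}: blended probability = {blended:.3f}")
```

Under this assumption, raising `ensemble_weight` pulls the blended score toward the HistGB output and lowering it pulls the score toward the BiLSTM-CNN output.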

 ## 📊 Core Functions
@@ -210,67 +215,17 @@ final_predictions = predict_prf(data=prf_regions, ensemble_weight=0.4)
 ## 📚 Documentation

 - **[Complete Tutorial](tutorial/tutorial.md)**: Comprehensive usage guide with examples
-- **[Demo Notebook](FScanpy_Demo.ipynb)**: Interactive examples and workflows
-- **[Example Scripts](example_plot_prediction.py)**: Ready-to-run code examples
-
-## 🎯 Use Cases
-
-### 1. **Viral Genome Analysis**
-```python
-# Scan viral genome for PRF sites
-viral_sequence = load_viral_genome()
-prf_sites = predict_prf(viral_sequence, ensemble_weight=0.3)
-high_confidence = prf_sites[prf_sites['Ensemble_Probability'] > 0.8]
-```
-
-### 2. **Comparative Genomics**
-```python
-# Compare PRF patterns across species
-species_data = pd.DataFrame({
-    'Species': ['Virus_A', 'Virus_B'],
-    'Long_Sequence': [seq_a_399bp, seq_b_399bp]
-})
-comparative_results = predict_prf(data=species_data)
-```
-
-### 3. **High-Throughput Screening**
-```python
-# Fast screening of large sequence datasets
-sequences = load_large_dataset()
-screening_results = predict_prf(
-    sequence=sequences,
-    ensemble_weight=0.7,  # Fast screening mode
-    short_threshold=0.3
-)
-```
-
-## 🤝 Contributing
-
-We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
+- **[Demo Notebook](FScanpy_Demo.ipynb)**: Practical usage of each function in the library and demonstration of analysis workflow results
+- **[Predict Sample Interpretation](tutorial/predict_sample.ipynb)**: Detailed interpretation of FScanpy's plotting results and signal analysis

 ## 📝 Citation

 If you use FScanpy in your research, please cite:

 ```bibtex
-@software{fscanpy2024,
-  title={FScanpy: A Machine Learning Framework for Programmed Ribosomal Frameshifting Prediction},
-  author={[Author names]},
-  year={2024},
-  url={https://github.com/your-org/FScanpy}
-}
+
 ```

-## 📄 License
-
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
-
-## 🆘 Support
-
-- **Documentation**: [Tutorial](tutorial/tutorial.md)
-- **Usage Example**: [Demo Notebook](FScanpy_Demo.ipynb)
-- **Predict Result Explain**: [Predict Result Explain](tutorial/predict_sample.ipynb)
-
 ## 🏗️ Dependencies

 FScanpy automatically installs all required dependencies:
--- a/README_zh.md
+++ b/README_zh.md
@@ -0,0 +1,248 @@
+# FScanpy
+## A Machine Learning-Based Framework for Programmed Ribosomal Frameshifting Prediction
+
+[![English](https://img.shields.io/badge/Language-English-blue.svg)](README.md)
+[![Python](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://www.python.org/)
+[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
+
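+FScanpy is a comprehensive Python package designed for the prediction of [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. By integrating advanced machine learning approaches (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework, FScanpy provides robust and accurate PRF site predictions.
+
+![FScanpy architecture](/tutorial/image/structure.jpeg)
+
+## 🌟 Key Features
+
+### 🎯 **Dual-Model Architecture**
+- **Short model** (`HistGradientBoosting`): fast screening on 33 bp sequences
+- **Long model** (`BiLSTM-CNN`): in-depth analysis on 399 bp sequences
+- **Ensemble prediction**: customizable model weights for best performance
+
+### 🚀 **Versatile Input Support**
+- **Single/multiple sequences**: sliding-window prediction over full sequences
+- **Region-based analysis**: direct prediction on pre-extracted 399 bp regions
+- **BLASTX integration**: seamless hand-off from the FScanR pipeline
+- **Cross-species compatibility**: built-in databases for viruses, marine phages, Euplotes, and more
+
+### 📊 **Advanced Visualization**
+- **Interactive heatmaps**: visualization of FS site probabilities
+- **Prediction plots**: combined probability and confidence display
+- **Customizable thresholds**: separate filtering criteria for each model
+- **Export options**: PNG, PDF, and interactive formats
+
+### ⚡ **High Performance**
+- **Optimized algorithms**: efficient sliding-window scanning
+- **Batch processing**: handle multiple sequences at once
+- **Flexible thresholds**: tunable sensitivity for different use cases
+- **Memory efficient**: optimized for large-scale genomic data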
+
+## 🔧 Installation
+
+### Prerequisites
+- Python ≥ 3.9
+- All dependencies are automatically installed
+
+### Install via pip (Recommended)
+```bash
+pip install FScanpy
+```
+
+### Install from Source
+```bash
+git clone https://github.com/your-org/FScanpy-package.git
+cd FScanpy-package
+pip install -e .
+```
+
+## 🚀 Quick Start
+
+### Basic Usage
+```python
+from FScanpy import predict_prf
+
+# Simple sequence prediction
+sequence = "ATGCGTACGTTAGC..." # Your DNA sequence
+results = predict_prf(sequence=sequence)
+
+# View the first ten predictions
+print(results[['Position', 'Ensemble_Probability', 'Short_Probability', 'Long_Probability']].head(10))
+```
+
+### Visualization
+```python
+from FScanpy import plot_prf_prediction
+
+# Generate a prediction plot
+results, fig = plot_prf_prediction(
+    sequence=sequence,
+    short_threshold=0.65,    # HistGB threshold
+    long_threshold=0.8,      # BiLSTM-CNN threshold
+    ensemble_weight=0.4,     # 40% short model, 60% long model
+    title="PRF Prediction Results"
+)
+```
+
+### Advanced Usage
+```python
+from FScanpy import PRFPredictor
+import pandas as pd
+
+# Create a predictor instance
+predictor = PRFPredictor()
+
+# Batch prediction on pre-extracted regions
+data = pd.DataFrame({
+    'Long_Sequence': [('ATGCGT' * 67)[:399], 'GCTATAG' * 57]  # 399 bp placeholder sequences
+})
+results = predictor.predict_regions(data, ensemble_weight=0.4)
+
+# Sequence-level prediction with custom parameters
+results = predictor.predict_sequence(
+    sequence=sequence,
+    window_size=1,           # sliding-window step
+    ensemble_weight=0.3,     # model weight
+    short_threshold=0.5      # filtering threshold
+)
+```
+
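+## 🎛️ Ensemble Weight Configuration
+
+The `ensemble_weight` parameter controls the weight ratio between the HistGB and BiLSTM-CNN models:
+
+| ensemble_weight | HistGB Model | BiLSTM-CNN Model | Characteristics | Best For |
+|----------------|------------|----------------|------|----------|
+| **0.2-0.3** | 20-30% | 70-80% | **High specificity**, reduces false positives | Precise validation, clinical applications |
+| **0.4** | 40% | 60% | **Optimal balance**, highest AUC | Standard analysis (recommended) |
+| **0.6-0.8** | 60-80% | 20-40% | **High sensitivity**, captures more sites | High-throughput screening, exploratory research |
+
+### Model Characteristics
+- **HistGB Model**: Excels at identifying true negatives, conservative predictions, low false positive rate
+- **BiLSTM-CNN Model**: Excels at identifying true positives, sensitive predictions, captures more potential sites
+
+### Weight Selection Examples
+```python
+# High specificity configuration per the table above (25% HistGB, 75% BiLSTM-CNN)
+precise_results = predict_prf(sequence, ensemble_weight=0.25)
+
+# Optimal balance configuration (4:6 ratio)
+balanced_results = predict_prf(sequence, ensemble_weight=0.4)
+
+# High sensitivity configuration per the table above (70% HistGB, 30% BiLSTM-CNN)
+sensitive_results = predict_prf(sequence, ensemble_weight=0.7)
+```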
+
+## 📊 Core Functions
+
+### Main Prediction Interface
+```python
+predict_prf(
+    sequence=None,           # single/multiple sequences, or None
+    data=None,              # DataFrame with 399 bp sequences, or None
+    window_size=3,          # sliding-window step
+    short_threshold=0.1,    # short-model filtering threshold
+    ensemble_weight=0.4,    # short-model weight (0.0-1.0)
+    model_dir=None         # custom model directory
+)
+```
+
+### Visualization Function
+```python
+plot_prf_prediction(
+    sequence,               # input DNA sequence
+    window_size=3,          # scanning step
+    short_threshold=0.65,   # short-model plotting threshold
+    long_threshold=0.8,     # long-model plotting threshold
+    ensemble_weight=0.4,    # model weight
+    title=None,            # plot title
+    save_path=None,        # output file path
+    figsize=(12,8),        # figure size
+    dpi=300               # resolution of the saved figure
+)
+```
+
+### PRFPredictor Class Methods
+```python
+predictor = PRFPredictor()
+
+# Sequence prediction (sliding window)
+predictor.predict_sequence(sequence, ensemble_weight=0.4)
+
+# Region prediction (batch)
+predictor.predict_regions(dataframe, ensemble_weight=0.4)
+
+# Feature extraction
+predictor.extract_features(sequences)
+
+# Model information
+predictor.get_model_info()
+```
+
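+## 📈 Output Fields
+
+### Prediction Results
+- **`Position`**: position in the original sequence
+- **`Ensemble_Probability`**: final ensemble prediction (primary result)
+- **`Short_Probability`**: HistGradientBoosting prediction (0-1)
+- **`Long_Probability`**: BiLSTM-CNN prediction (0-1)
+- **`Ensemble_Weights`**: model weight configuration that was used
+
+### Sequence Information
+- **`Short_Sequence`**: 33 bp sequence used by the short model
+- **`Long_Sequence`**: 399 bp sequence used by the long model
+- **`Codon`**: 3 bp codon at the predicted position
+- **`Sequence_ID`**: identifier for multi-sequence input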
+
+## 🔬 Integration with FScanR
+
+FScanpy works seamlessly with the FScanR pipeline to provide comprehensive PRF analysis:
+
+```python
+from FScanpy import fscanr, extract_prf_regions, predict_prf
+
+# Step 1: BLASTX analysis with FScanR
+blastx_results = fscanr(
+    blastx_data,
+    mismatch_cutoff=10,
+    evalue_cutoff=1e-5,
+    frameDist_cutoff=10
+)
+
+# Step 2: extract candidate PRF regions
+prf_regions = extract_prf_regions(original_sequence, blastx_results)
+
+# Step 3: predict with FScanpy
+final_predictions = predict_prf(data=prf_regions, ensemble_weight=0.4)
+```
+
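+## 📚 Documentation
+
+- **[Complete Tutorial](tutorial/tutorial_zh.md)**: Comprehensive usage guide with examples
+- **[Demo Notebook](FScanpy_Demo.ipynb)**: Practical usage of each function in the library and demonstration of analysis workflow results
+- **[Predict Sample Interpretation](tutorial/predict_sample.ipynb)**: Detailed interpretation of FScanpy's plotting results and signal analysis
+
+## 📝 Citation
+
+If you use FScanpy in your research, please cite:
+
+```bibtex
+@software{fscanpy2024,
+  title={FScanpy: A Machine Learning Framework for Programmed Ribosomal Frameshifting Prediction},
+  author={[Author names]},
+  year={2024},
+  url={https://github.com/your-org/FScanpy}
+}
+```
+
+## 🏗️ Dependencies
+
+FScanpy automatically installs all required dependencies: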
+-  `numpy>=1.24.3`
+-  `pandas>=2.2.3`
+-  `tensorflow>=2.10.1`
+-  `scikit-learn>=1.6.0`
+-  `matplotlib>=3.9.4`
+-  `joblib>=1.4.2`
+-  `biopython>=1.85`
+-  `wrapt>=1.17.0`
+
+---
+
+**FScanpy** - Advancing programmed ribosomal frameshifting research through machine learning 🧬
--- a/tutorial/predict_sample.ipynb
+++ b/tutorial/predict_sample.ipynb
@@ -1,5 +1,17 @@
 {
 "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# FScanpy Prediction Results Interpretation\n",
+    "\n",
+    "[![中文](https://img.shields.io/badge/Language-中文-red.svg)](predict_sample_zh.ipynb)\n",
+    "\n",
+    "This notebook provides detailed interpretation of FScanpy prediction results, explaining output fields and how to analyze predictions.\n",
+    "\n"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 1,
@@ -154,11 +166,10 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## 结果解读：Sequence0\n",
-    "### 真实情况\n",
-    "该序列核糖体程序性移码发生于第113nt处\n",
-    "### 图上信息\n",
-    "在该处我们可以看到一个显著的最高峰，并且明显较粗。"
+    "## Sequence0: High-Confidence, Unambiguous Signal\n",
+    "- Known Ground Truth: The programmed ribosomal frameshifting (PRF) event for this sequence occurs at nucleotide 113.\n",
+    "\n",
+    "- Plot Interpretation: The FScanpy analysis shows a prominent probability peak at the 113 nt position. The magnitude of this peak significantly exceeds other regions in the sequence, and the surrounding bases also exhibit elevated frameshifting probabilities, forming a concentrated and well-defined signal. This indicates a high-confidence prediction from the model that corresponds precisely with the known PRF event location."
   ]
  },
  {
@@ -208,12 +219,10 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## 结果解读：Sequence1\n",
+    "## Sequence1: Ambiguous Signal at Low Resolution\n",
+    "- Known Ground Truth: The PRF event for this sequence occurs at nucleotide 1794.\n",
    "\n",
-    "### 真实情况\n",
-    "该序列核糖体程序性移码发生于第1794nt处。\n",
-    "### 图上信息\n",
-    "在该处我们可以看到一个显著的高峰，但是肉眼难以分辨改高峰与其他位置的高峰的差异，因此需要提高分辨率。将window size参数调整更小，查看每个高峰周围碱基的移码概率，基于高概率位点的集中程度判断其移码可能性。\n"
+    "- Plot Interpretation: An initial low-resolution scan of Sequence1 identifies a probability peak near the known 1794 nt site. However, the plot also reveals multiple potential signal peaks of comparable intensity elsewhere, making it difficult to definitively identify the true event from this view alone. To resolve this ambiguity, a high-resolution analysis is necessary. By reducing the `window_size` parameter, the analysis can focus on the local vicinity of each peak to assess the concentration of high-probability predictions, which helps distinguish a true signal from background noise."
   ]
  },
  {
@@ -263,13 +272,10 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## 高分辨率的结果解读：Sequence1\n",
+    "## Sequence1: Signal Confirmation at High Resolution\n",
+    "- Known Ground Truth: The PRF event for this sequence occurs at nucleotide 1794.\n",
    "\n",
-    "### 真实情况\n",
-    "该序列核糖体程序性移码发生于第1794nt处。\n",
-    "### 图上信息\n",
-    "在该位置存在大量的高概率碱基集中，但是其他高峰周围并不存在，因此其PRF可能性高于其余高峰。\n",
-    "\n"
+    "- Plot Interpretation: Following the high-resolution analysis, the results clearly show a dense cluster of high-probability bases centered around the 1794 nt position. In contrast, the other peaks identified in the initial scan do not show a similar concentration and appear as isolated, singular data points. This consolidation of the signal confirms that the 1794 nt site is the more reliable PRF event, with a much higher likelihood than other potential sites in the sequence."
   ]
  },
  {
@@ -405,11 +411,10 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## 结果解读：Sequence4\n",
-    "### 真实情况\n",
-    "该序列核糖体程序性移码发生于第216nt处\n",
-    "### 图上信息\n",
-    "我们的算法并不能总是解决问题：在该处我们可以看到三个显著的高峰，其中80nt左右和216nt左右的高峰并无明显差距。我们需要通过湿实验验证位点的真实性。"
+    "## Sequence4: Ambiguous Result Requiring Experimental Validation\n",
+    "- Known Ground Truth: The PRF event for this sequence occurs at nucleotide 216.\n",
+    "\n",
+    "- Plot Interpretation: This analysis demonstrates the inherent limitations of computational methods. FScanpy identifies three significant probability peaks, with the peaks at ~80 nt and the true site at 216 nt showing no decisive difference in magnitude. In such cases, the computational result alone is insufficient to distinguish the true PRF event. The model successfully narrows down the potential candidates to a few key regions, but subsequent biological (wet-lab) experiments are essential to validate which site is functionally active."
   ]
  }
 ],
--- a/tutorial/tutorial.md
+++ b/tutorial/tutorial.md
@@ -1,5 +1,7 @@
 # FScanpy Tutorial - Complete Usage Guide

+[![中文](https://img.shields.io/badge/Language-中文-red.svg)](tutorial_zh.md)
+
 ## Abstract
 FScanpy is a Python package designed to predict Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. This package integrates machine learning models, sequence feature analysis, and visualization capabilities to help researchers rapidly locate potential PRF sites.

@@ -612,7 +614,4 @@ for i in range(0, len(large_dataset), chunk_size):
 ```

 ## Citation
-If you use FScanpy, please cite our paper: [Paper Link]
-
-## Support
-For questions and issues, please visit our GitHub repository or contact the development team. 
+If you use FScanpy, please cite our paper: [Paper Link]
--- a/tutorial/tutorial_zh.md
+++ b/tutorial/tutorial_zh.md
@@ -0,0 +1,688 @@
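+# FScanpy Tutorial - Complete Usage Guide
+
+[![English](https://img.shields.io/badge/Language-English-blue.svg)](tutorial.md)
+
+## Abstract
+FScanpy is a Python package designed to predict Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. This package integrates machine learning models, sequence feature analysis, and visualization capabilities to help researchers rapidly locate potential PRF sites.
+
+## Introduction
+![FScanpy structure](/image/structure.jpeg)
+
+FScanpy is a Python package dedicated to predicting PRF sites in DNA sequences. It integrates machine learning models (gradient boosting and BiLSTM-CNN) with the FScanR package to provide accurate PRF predictions. Users can supply three types of input: the full cDNA/mRNA sequence to be scanned, the nucleotide sequence around a suspected frameshift site, or BLASTX results against a peptide library of the same or a related species. FScanpy expects input sequences to be on the + strand and can be combined with FScanR to improve accuracy.
+
+![Machine learning models](/image/ML.png)
+
+For whole-sequence prediction, FScanpy uses a sliding-window approach to scan the entire sequence and predict PRF sites. For region prediction, it works on the 33 bp and 399 bp sequences in reading frame 0 around the suspected frameshift site. The short model (HistGradientBoosting) first scores potential PRF sites within the scanning window. If the predicted probability exceeds the threshold, the long model (BiLSTM-CNN) then scores the corresponding 399 bp sequence. Finally, the ensemble weight combines the two models' outputs into the final prediction.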
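
The two-stage flow described above can be sketched as a small cascade. This is an illustration under stated assumptions, not FScanpy's implementation: `cascade_predict` and the two stand-in scoring functions are hypothetical, the models are replaced by plain callables, the combination rule is assumed to be a linear blend, and returning `None` when the long model is skipped is an arbitrary choice for the sketch.

```python
from typing import Callable, Dict, Optional


def cascade_predict(
    short_seq: str,
    long_seq: str,
    short_model: Callable[[str], float],
    long_model: Callable[[str], float],
    short_threshold: float = 0.1,
    ensemble_weight: float = 0.4,
) -> Dict[str, Optional[float]]:
    """Score with the short model first; run the long model only above the threshold."""
    short_prob = short_model(short_seq)
    long_prob: Optional[float] = None
    ensemble_prob: Optional[float] = None
    if short_prob > short_threshold:
        long_prob = long_model(long_seq)
        # Assumed combination rule: linear blend, ensemble_weight = short-model share.
        ensemble_prob = ensemble_weight * short_prob + (1.0 - ensemble_weight) * long_prob
    return {
        "Short_Probability": short_prob,
        "Long_Probability": long_prob,
        "Ensemble_Probability": ensemble_prob,
    }


def toy_short_model(seq: str) -> float:
    """Stand-in scorer (hypothetical): fraction of T bases in the 33 bp window."""
    return seq.count("T") / len(seq)


def toy_long_model(seq: str) -> float:
    """Stand-in scorer (hypothetical): fraction of A bases in the 399 bp window."""
    return seq.count("A") / len(seq)


result = cascade_predict("TTTAAAC" + "G" * 26, "A" * 200 + "C" * 199,
                         toy_short_model, toy_long_model)
print(result)
```

Skipping the long model for low-scoring windows is what makes `short_threshold` a speed control: the more windows it filters out, the fewer 399 bp sequences the second stage has to score.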
+
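+To detect PRF from BLASTX output, [FScanR](https://github.com/seanchen607/FScanR.git) identifies potential PRF sites in the BLASTX alignment results: it takes pairs of hits for the same query sequence and filters them with frameDist_cutoff, mismatch_cutoff, and evalue_cutoff. FScanpy is then used to predict the probability of each PRF site.
+
+### Background
+[Ribosomal frameshifting](https://en.wikipedia.org/wiki/Ribosomal_frameshift), also known as translational frameshifting or translational recoding, is a biological phenomenon that occurs during translation and results in the production of multiple unique proteins from a single mRNA. The process can be programmed by the nucleotide sequence of the mRNA and is sometimes affected by the secondary or three-dimensional mRNA structure. It has been described mainly in viruses (especially retroviruses), retrotransposons, and bacterial insertion elements, and also occurs in some cellular genes.
+
+### Key features of FScanpy:
+
+- Integration of two predictive models:
+  - Short model (HistGradientBoosting): analyzes local sequence features (33 bp) centered on the potential frameshift site.
+  - Long model (BiLSTM-CNN): analyzes broader sequence features (399 bp).
+- Supports PRF prediction across a wide range of species.
+- Can be combined with [FScanR](https://github.com/seanchen607/FScanR.git) to improve accuracy.
+
+## Installation (python>=3.9)
+
+### 1. Using pip
+```bash
+pip install FScanpy
+```
+
+### 2. Clone from GitHub
+```bash
+git clone https://github.com/.../FScanpy.git
+cd FScanpy
+pip install -e .
+```
+
+## Complete Function Reference
+
+### 1. Core Prediction Functions
+
+#### 1.1 `predict_prf()` - Main Prediction Interface
+
+**Function signature:**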
+```python
+def predict_prf(
+    sequence: Union[str, List[str], None] = None,
+    data: Union[pd.DataFrame, None] = None,
+    window_size: int = 3,
+    short_threshold: float = 0.1,
+    ensemble_weight: float = 0.4,
+    model_dir: str = None
+) -> pd.DataFrame
+```
+
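+**Parameters:**
+- `sequence`: one or more DNA sequences for sliding-window prediction
+- `data`: DataFrame for region prediction; must contain a 'Long_Sequence' or '399bp' column
+- `window_size`: sliding-window step size (default: 3, recommended: 1-10)
+- `short_threshold`: probability threshold for the short model (HistGB) (default: 0.1, range: 0.0-1.0)
+- `ensemble_weight`: weight of the short model in the ensemble (default: 0.4, range: 0.0-1.0)
+- `model_dir`: path to the model directory (optional; the built-in models are used if None)
+
+**Returns:**
+- `pd.DataFrame`: prediction results with the following columns:
+  - `Short_Probability`: probability predicted by the short model
+  - `Long_Probability`: probability predicted by the long model
+  - `Ensemble_Probability`: ensemble probability (primary result)
+  - `Position`: position in the sequence (sliding-window mode)
+  - `Codon`: codon at that position (sliding-window mode)
+  - `Ensemble_Weights`: weight configuration information
+
+**Usage examples:**
+
+```python
+from FScanpy import predict_prf
+
+# 1. Sliding-window prediction on a single sequence
+sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
+results = predict_prf(sequence=sequence)
+
+# 2. Multi-sequence prediction
+sequences = ["ATGCGTACGT...", "GCTATAGCAT..."]
+results = predict_prf(sequence=sequences)
+
+# 3. Custom parameters
+results = predict_prf(
+    sequence=sequence,
+    window_size=1,           # scan every position
+    short_threshold=0.2,     # higher threshold
+    ensemble_weight=0.3      # 3:7 ratio (short:long)
+)
+
+# 4. Region prediction from a DataFrame
+import pandas as pd
+data = pd.DataFrame({
+    'Long_Sequence': ['ATGCGT...', 'GCTATAG...'],  # or use '399bp'
+    'sample_id': ['sample1', 'sample2']
+})
+results = predict_prf(data=data)
+```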
+
+#### 1.2 `plot_prf_prediction()` - Prediction with Visualization
+
+**Function signature:**
+```python
+def plot_prf_prediction(
+    sequence: str,
+    window_size: int = 3,
+    short_threshold: float = 0.65,
+    long_threshold: float = 0.8,
+    ensemble_weight: float = 0.4,
+    title: str = None,
+    save_path: str = None,
+    figsize: tuple = (12, 8),
+    dpi: int = 300,
+    model_dir: str = None
+) -> tuple
+```
+
+**Parameters:**
+- `sequence`: input DNA sequence (string)
+- `window_size`: sliding-window step size (default: 3)
+- `short_threshold`: short-model filtering threshold for the heatmap display (default: 0.65)
+- `long_threshold`: long-model filtering threshold for the heatmap display (default: 0.8)
+- `ensemble_weight`: weight of the short model in the ensemble (default: 0.4)
+- `title`: plot title (optional; generated automatically if None)
+- `save_path`: output path (optional; the plot is saved if provided)
+- `figsize`: figure size tuple (default: (12, 8))
+- `dpi`: figure resolution (default: 300)
+- `model_dir`: path to the model directory (optional)
+
+**Returns:**
+- `tuple`: (prediction_results: pd.DataFrame, figure: matplotlib.figure.Figure)
+
+**Usage examples:**
+
+```python
+from FScanpy import plot_prf_prediction
+import matplotlib.pyplot as plt
+
+# 1. Basic plot
+sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
+results, fig = plot_prf_prediction(sequence)
+plt.show()
+
+# 2. Custom thresholds and weight
+results, fig = plot_prf_prediction(
+    sequence,
+    short_threshold=0.7,     # higher display threshold
+    long_threshold=0.85,     # higher display threshold
+    ensemble_weight=0.3,     # 3:7 weight ratio
+    title="Custom analysis results",
+    save_path="analysis.png",
+    figsize=(15, 10),
+    dpi=150
+)
+
+# 3. High-resolution analysis
+results, fig = plot_prf_prediction(
+    sequence,
+    window_size=1,           # scan every position
+    ensemble_weight=0.5,     # equal weights
+    dpi=600                  # high resolution
+)
+```
+
+### 2. PRFPredictor Class Methods
+
+#### 2.1 Class Initialization
+
+```python
+from FScanpy import PRFPredictor
+
+# Initialize with the default models
+predictor = PRFPredictor()
+
+# Initialize with a custom model directory
+predictor = PRFPredictor(model_dir='/path/to/models')
+```
+
+#### 2.2 `predict_sequence()` - Sliding-Window Prediction
+
+**Method signature:**
+```python
+def predict_sequence(self, sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)
+```
+
+**Parameters:**
+- `sequence`: input DNA sequence
+- `window_size`: sliding-window step size (default: 3)
+- `short_threshold`: short-model probability threshold (default: 0.1)
+- `ensemble_weight`: weight of the short model in the ensemble (default: 0.4)
+
+**Usage:**
+```python
+predictor = PRFPredictor()
+results = predictor.predict_sequence(
+    sequence="ATGCGTACGT...",
+    window_size=1,
+    short_threshold=0.15,
+    ensemble_weight=0.35
+)
+```
+
+#### 2.3 `predict_regions()` - Region-Based Prediction
+
+**Method signature:**
+```python
+def predict_regions(self, sequences, short_threshold=0.1, ensemble_weight=0.4)
+```
+
+**Parameters:**
+- `sequences`: list or Series of 399 bp sequences
+- `short_threshold`: short-model probability threshold (default: 0.1)
+- `ensemble_weight`: weight of the short model in the ensemble (default: 0.4)
+
+**Usage:**
+```python
+predictor = PRFPredictor()
+sequences = ["ATGCGT...", "GCTATAG..."]  # 399 bp sequences
+results = predictor.predict_regions(
+    sequences=sequences,
+    short_threshold=0.1,
+    ensemble_weight=0.4
+)
+```
+
+#### 2.4 `predict_single_position()` - Single-Position Prediction
+
+**Method signature:**
+```python
+def predict_single_position(self, fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)
+```
+
+**Parameters:**
+- `fs_period`: 33 bp sequence around the frameshift site
+- `full_seq`: 399 bp sequence used by the long model
+- `short_threshold`: short-model probability threshold (default: 0.1)
+- `ensemble_weight`: weight of the short model in the ensemble (default: 0.4)
+
+**Usage:**
+```python
+predictor = PRFPredictor()
+result = predictor.predict_single_position(
+    fs_period="ATGCGTACGTATGCGTACGTATGCGTACGTATG",  # 33 bp
+    full_seq="ATG" * 133,  # 399 bp
+    short_threshold=0.1,
+    ensemble_weight=0.4
+)
+```
+
+#### 2.5 `plot_sequence_prediction()` - Plotting via the Class Method
+
+**Method signature:**
+```python
+def plot_sequence_prediction(self, sequence, window_size=3, short_threshold=0.65, 
+                           long_threshold=0.8, ensemble_weight=0.4, title=None, 
+                           save_path=None, figsize=(12, 8), dpi=300)
+```
+
+**Usage:**
+```python
+predictor = PRFPredictor()
+results, fig = predictor.plot_sequence_prediction(
+    sequence="ATGCGTACGT...",
+    window_size=3,
+    ensemble_weight=0.4
+)
+```
+
+### 3. Utility Functions
+
+#### 3.1 `fscanr()` - Detect PRF Sites from BLASTX
+
+**Function signature:**
+```python
+def fscanr(
+    blastx_output: pd.DataFrame,
+    mismatch_cutoff: float = 10,
+    evalue_cutoff: float = 1e-5,
+    frameDist_cutoff: float = 10
+) -> pd.DataFrame
+```
+
+**Parameters:**
+- `blastx_output`: BLASTX output DataFrame with the required columns:
+  - `qseqid`, `sseqid`, `pident`, `length`, `mismatch`, `gapopen`
+  - `qstart`, `qend`, `sstart`, `send`, `evalue`, `bitscore`, `qframe`, `sframe`
+- `mismatch_cutoff`: maximum number of mismatches allowed (default: 10)
+- `evalue_cutoff`: E-value threshold (default: 1e-5)
+- `frameDist_cutoff`: frame distance threshold (default: 10)
+
+**Returns:**
+- `pd.DataFrame`: PRF sites with the following columns:
+  - `DNA_seqid`: sequence identifier
+  - `FS_start`, `FS_end`: frameshift start and end positions
+  - `Pep_seqid`: peptide sequence identifier
+  - `Pep_FS_start`, `Pep_FS_end`: frameshift positions in the peptide
+  - `FS_type`: frameshift type (-2, -1, 1, 2)
+  - `Strand`: strand orientation (+, -)
+
+**Usage:**
+```python
+from FScanpy.utils import fscanr
+import pandas as pd
+
+# Load BLASTX results
+blastx_data = pd.read_excel('blastx_results.xlsx')
+
+# Detect PRF sites
+prf_sites = fscanr(
+    blastx_output=blastx_data,
+    mismatch_cutoff=5,       # stricter mismatch filter
+    evalue_cutoff=1e-6,      # stricter E-value filter
+    frameDist_cutoff=15      # allow a larger frame distance
+)
+```
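
To make the role of the cutoffs more concrete, here is a deliberately simplified sketch of the first filtering idea: drop hits that fail the mismatch or E-value cutoff, then keep only queries whose surviving hits fall in at least two different reading frames. It is not the FScanR algorithm; the helper name `candidate_queries` and the toy hits are hypothetical, and the `frameDist_cutoff` check on the distance between paired hits is omitted entirely.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def candidate_queries(
    hits: Iterable[dict],
    mismatch_cutoff: float = 10,
    evalue_cutoff: float = 1e-5,
) -> List[str]:
    """Return query IDs whose filtered hits span at least two reading frames."""
    frames_by_query: Dict[str, Set[int]] = defaultdict(set)
    for hit in hits:
        # Keep a hit only if it passes both quality cutoffs.
        if hit["mismatch"] <= mismatch_cutoff and hit["evalue"] <= evalue_cutoff:
            frames_by_query[hit["qseqid"]].add(hit["qframe"])
    return sorted(q for q, frames in frames_by_query.items() if len(frames) >= 2)


# Toy hits using a subset of the BLASTX column names listed above.
toy_hits = [
    {"qseqid": "q1", "qframe": 1, "mismatch": 2, "evalue": 1e-20},
    {"qseqid": "q1", "qframe": 3, "mismatch": 4, "evalue": 1e-12},
    {"qseqid": "q2", "qframe": 2, "mismatch": 1, "evalue": 1e-30},
    {"qseqid": "q3", "qframe": 1, "mismatch": 3, "evalue": 1e-15},
    {"qseqid": "q3", "qframe": 2, "mismatch": 25, "evalue": 1e-15},  # fails mismatch cutoff
]
print(candidate_queries(toy_hits))
```

Tightening either cutoff can only shrink the candidate list, which is why stricter values trade recall for fewer spurious frame changes.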
+
+#### 3.2 `extract_prf_regions()` - Extract Sequences Around PRF Sites
+
+**Function signature:**
+```python
+def extract_prf_regions(mrna_file: str, prf_data: pd.DataFrame) -> pd.DataFrame
+```
+
+**Parameters:**
+- `mrna_file`: path to the mRNA sequence file (FASTA format)
+- `prf_data`: DataFrame produced by `fscanr()`
+
+**Returns:**
+- `pd.DataFrame`: extracted sequences with the following columns:
+  - `DNA_seqid`: sequence identifier
+  - `FS_start`, `FS_end`: frameshift positions
+  - `Strand`: strand orientation
+  - `399bp`: extracted 399 bp sequence
+  - `FS_type`: frameshift type
+
+**Usage:**
+```python
+from FScanpy import PRFPredictor
+from FScanpy.utils import extract_prf_regions
+
+# Extract sequences around the PRF sites
+prf_sequences = extract_prf_regions(
+    mrna_file='sequences.fasta',
+    prf_data=prf_sites
+)
+
+# Predict PRF probabilities
+predictor = PRFPredictor()
+results = predictor.predict_regions(prf_sequences['399bp'])
+```
+
+### 4. Data Access Functions
+
+#### 4.1 Accessing Test Data
+
+```python
+from FScanpy.data import get_test_data_path, list_test_data
+
+# List the available test data
+list_test_data()
+
+# Get test data paths
+blastx_file = get_test_data_path('blastx_example.xlsx')
+mrna_file = get_test_data_path('mrna_example.fasta')
+region_file = get_test_data_path('region_example.csv')
+seq_file = get_test_data_path('full_seq.xlsx')
+```
+
+## Complete Workflow Examples
+
+### Workflow 1: Full-Sequence Analysis
+
+```python
+from FScanpy import predict_prf, plot_prf_prediction
+from FScanpy.data import get_test_data_path
+import matplotlib.pyplot as plt
+import pandas as pd
+
+# Load the bundled example table and take the first sequence
+full_seq = pd.read_excel(get_test_data_path('full_seq.xlsx'))
+sequence = full_seq.iloc[0]['full_seq']
+
+# Method 1: simple prediction
+results = predict_prf(sequence=sequence)
+print(f"Found {len(results)} potential sites")
+
+# Method 2: prediction with visualization
+results, fig = plot_prf_prediction(
+    sequence=sequence,
+    window_size=1,              # scan every position
+    short_threshold=0.3,        # show sites with probability > 0.3
+    long_threshold=0.4,         # show sites with probability > 0.4
+    ensemble_weight=0.4,        # 4:6 weight ratio
+    title="PRF analysis results",
+    save_path="prf_analysis.png"
+)
+plt.show()
+
+# Inspect the top predictions
+top_sites = results.nlargest(5, 'Ensemble_Probability')
+print("Top 5 predicted sites:")
+for _, site in top_sites.iterrows():
+    print(f"Position {site['Position']}: {site['Ensemble_Probability']:.3f}")
+```
+
+### Workflow 2: Region-Based Prediction
+
+```python
+from FScanpy import predict_prf
+import pandas as pd
+
+# Prepare the region data
+region_data = pd.DataFrame({
+    'sample_id': ['sample1', 'sample2', 'sample3'],
+    'Long_Sequence': [
+        'ATGCGT...',  # 399 bp sequence 1
+        'GCTATAG...',  # 399 bp sequence 2
+        'TTACGGA...'   # 399 bp sequence 3
+    ],
+    'known_label': [1, 0, 1]  # optional: known labels for validation
+})
+
+# Predict PRF probabilities
+results = predict_prf(
+    data=region_data,
+    ensemble_weight=0.3  # favor the long model (3:7 ratio)
+)
+
+# Evaluate the results
+if 'known_label' in results.columns:
+    threshold = 0.5
+    predictions = (results['Ensemble_Probability'] > threshold).astype(int)
+    accuracy = (predictions == results['known_label']).mean()
+    print(f"Accuracy at threshold {threshold}: {accuracy:.3f}")
+```
+
+### Workflow 3: BLASTX-Based Analysis Pipeline
+
+```python
+from FScanpy import PRFPredictor, predict_prf
+from FScanpy.data import get_test_data_path
+from FScanpy.utils import fscanr, extract_prf_regions
+import pandas as pd
+
+# Step 1: load the BLASTX data
+blastx_data = pd.read_excel(get_test_data_path('blastx_example.xlsx'))
+print(f"Loaded {len(blastx_data)} BLASTX hits")
+
+# Step 2: detect PRF sites with FScanR
+prf_sites = fscanr(
+    blastx_output=blastx_data,
+    mismatch_cutoff=10,
+    evalue_cutoff=1e-5,
+    frameDist_cutoff=10
+)
+print(f"Detected {len(prf_sites)} potential PRF sites")
+
+# Step 3: extract sequences around the PRF sites
+mrna_file = get_test_data_path('mrna_example.fasta')
+prf_sequences = extract_prf_regions(
+    mrna_file=mrna_file,
+    prf_data=prf_sites
+)
+print(f"Extracted {len(prf_sequences)} sequences")
+
+# Step 4: predict PRF probabilities
+predictor = PRFPredictor()
+results = predictor.predict_regions(
+    sequences=prf_sequences['399bp'],
+    ensemble_weight=0.4
+)
+
+# Step 5: combine the results with the metadata
+final_results = pd.concat([
+    prf_sequences.reset_index(drop=True),
+    results.reset_index(drop=True)
+], axis=1)
+
+# Step 6: analyze the results
+high_prob_sites = final_results[
+    final_results['Ensemble_Probability'] > 0.7
+]
+print(f"High-probability PRF sites: {len(high_prob_sites)}")
+
+# Show the top results
+print("\nTop PRF predictions:")
+top_results = final_results.nlargest(3, 'Ensemble_Probability')
+for _, row in top_results.iterrows():
+    print(f"Sequence {row['DNA_seqid']}: {row['Ensemble_Probability']:.3f}")
+```
+
+### Workflow 4: Custom Multi-Sequence Analysis
+
+```python
+from FScanpy import predict_prf, plot_prf_prediction
+import matplotlib.pyplot as plt
+
+# Multi-sequence analysis
+sequences = [
+    "ATGCGTACGTATGCGTACGTATGCGTACGTAAGCCCTTTGAACCCAAAGGG",
+    "GCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCAT",
+    "TTACGGATTACGGATTACGGATTACGGATTACGGATTACGGATTACGGAT"
+]
+
+# Batch prediction
+results = predict_prf(
+    sequence=sequences,
+    window_size=2,
+    ensemble_weight=0.5  # equal weights
+)
+
+# Analyze per sequence
+for seq_id in results['Sequence_ID'].unique():
+    seq_results = results[results['Sequence_ID'] == seq_id]
+    max_prob = seq_results['Ensemble_Probability'].max()
+    print(f"{seq_id}: max probability = {max_prob:.3f}")
+
+# Visualize the first sequence
+first_seq_results, fig = plot_prf_prediction(
+    sequence=sequences[0],
+    ensemble_weight=0.5,
+    title="First sequence analysis"
+)
+plt.show()
+```
+
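+## Parameter Optimization Guide
+
+### 1. Ensemble Weight Configuration in Detail
+
+#### 1.1 Why the Models Complement Each Other
+
+FScanpy integrates two machine learning models with complementary strengths:
+
+- **HistGB model (short model)**:
+  - Excels at identifying true negatives (correctly excluding non-PRF sites)
+  - Based on local features of the 33 bp sequence
+  - More conservative predictions, lower false positive rate
+  - Suited to scenarios that require high specificity
+
+- **BiLSTM-CNN model (long model)**:
+  - Excels at identifying true positives (correctly recognizing PRF sites)
+  - Based on long-range features of the 399 bp sequence
+  - More sensitive predictions, captures more potential sites
+  - Suited to scenarios that require high sensitivity
+
+#### 1.2 Optimal Weight Ratio (4:6)
+
+Through extensive testing, we determined the optimal weight split to be HistGB:BiLSTM-CNN = 4:6 (`ensemble_weight = 0.4`):
+
+- **Highest AUC**: achieves the best area under the curve on the test set
+- **Balanced performance**: best trade-off between sensitivity and specificity
+- **Fewer extreme errors**: reduces the risk of over-prediction by the BiLSTM-CNN model
+- **Avoids conservative bias**: prevents the ambiguity caused by an overly conservative HistGB model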
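
A weight sweep of the kind described above can be sketched in a few lines. Everything here is illustrative: the labels and probabilities are made-up toy values (not FScanpy results), the helper names `auc` and `sweep_weights` are hypothetical, and the blend is assumed to be linear with `ensemble_weight` as the short-model share.

```python
from typing import Dict, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sweep_weights(labels: Sequence[int], short_probs: Sequence[float],
                  long_probs: Sequence[float],
                  weights: Sequence[float]) -> Dict[float, float]:
    """AUC of the blended score for each candidate ensemble_weight."""
    results = {}
    for w in weights:
        blended = [w * s + (1.0 - w) * l for s, l in zip(short_probs, long_probs)]
        results[w] = auc(labels, blended)
    return results


# Toy validation set (hypothetical values).
labels = [1, 1, 1, 0, 0, 0]
short_probs = [0.60, 0.30, 0.70, 0.20, 0.40, 0.10]
long_probs = [0.90, 0.80, 0.40, 0.70, 0.20, 0.30]

for weight, score in sweep_weights(labels, short_probs, long_probs,
                                   [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]).items():
    print(f"ensemble_weight={weight:.1f}: AUC={score:.3f}")
```

Running the same sweep on a held-out labeled set is one way to check whether 0.4 is also the best choice for a particular organism or dataset.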
+
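+#### 1.3 Weight Selection Strategy
+
+Choose a weight configuration that matches your research goal:
+
+**High-throughput screening (favoring sensitivity)**:
+- **Weight setting**: `ensemble_weight = 0.6-0.8`
+- **Use cases**: initial screening, candidate site discovery, exploratory research
+- **Advantage**: retains more candidate sites and lowers the risk of missed detections
+- **Trade-off**: may produce more false positives that need follow-up validation
+
+```python
+# High sensitivity configuration example
+results = predict_prf(
+    sequence=sequence,
+    ensemble_weight=0.7,  # 70% HistGB, 30% BiLSTM-CNN
+    short_threshold=0.05  # lower short-model threshold
+)
+```
+
+**Precise validation (favoring specificity)**:
+- **Weight setting**: `ensemble_weight = 0.2-0.3`
+- **Use cases**: candidate validation, clinical applications, high-confidence prediction
+- **Advantage**: fewer false positives, more reliable predictions
+- **Trade-off**: may miss some true positive sites
+
+```python
+# High specificity configuration example
+results = predict_prf(
+    sequence=sequence,
+    ensemble_weight=0.25,  # 25% HistGB, 75% BiLSTM-CNN
+    short_threshold=0.2    # higher short-model threshold
+)
+```
+
+**Balanced analysis (recommended default)**:
+- **Weight setting**: `ensemble_weight = 0.4-0.6`
+- **Use cases**: routine analysis, comprehensive assessment, standard research
+- **Advantage**: best trade-off between sensitivity and specificity
+- **Recommendation**: the first choice for most research scenarios
+
+```python
+# Balanced configuration example
+results = predict_prf(
+    sequence=sequence,
+    ensemble_weight=0.4,   # optimal balance ratio
+    short_threshold=0.1    # standard threshold
+)
+```
+
+
+### 2. Threshold Selection
+
+- **Short threshold**: typically 0.1-0.3; controls computational efficiency
+- **Display thresholds**: 0.3-0.8; control what the visualization shows
+- **Classification threshold**: 0.5 (standard); adjust based on validation data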
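
Adjusting the classification threshold "based on validation data" usually means checking sensitivity and specificity at a few candidate cutoffs. The sketch below shows that calculation on made-up labels and probabilities; `sensitivity_specificity` is a hypothetical helper, not an FScanpy function.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(labels: Sequence[int], probs: Sequence[float],
                            threshold: float = 0.5) -> Tuple[float, float]:
    """Sensitivity (TP rate) and specificity (TN rate) at a probability cutoff."""
    tp = fn = tn = fp = 0
    for label, prob in zip(labels, probs):
        predicted_positive = prob > threshold
        if label == 1:
            tp += predicted_positive
            fn += not predicted_positive
        else:
            fp += predicted_positive
            tn += not predicted_positive
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy validation data (hypothetical values, e.g. an Ensemble_Probability column).
labels = [1, 1, 0, 0]
probs = [0.9, 0.4, 0.6, 0.1]
for cutoff in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(labels, probs, cutoff)
    print(f"threshold={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Lowering the cutoff can only keep or raise sensitivity and keep or lower specificity, so the choice depends on which kind of error is more costly for the study.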
+
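+### 3. Window Size Selection
+
+- **Fine-grained analysis**: `window_size = 1` (every position)
+- **Standard analysis**: `window_size = 3` (every third position, default)
+- **Coarse analysis**: `window_size = 6-9` (faster, less detail)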
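
The cost difference between these settings comes from how many positions are scored. The sketch below counts scanned positions for a given step; it assumes scanning starts at position 0 and advances by `window_size`, which may differ from FScanpy's actual start offset and edge handling, and `scan_positions` is a hypothetical helper.

```python
from typing import List


def scan_positions(sequence_length: int, window_size: int = 3) -> List[int]:
    """0-based positions visited when stepping through a sequence by window_size."""
    if window_size < 1:
        raise ValueError("window_size must be a positive integer")
    return list(range(0, sequence_length, window_size))


# Number of positions scored for a 3000 nt sequence at different step sizes.
for step in (1, 3, 9):
    print(f"window_size={step}: {len(scan_positions(3000, step))} positions")
```

Because the number of scored positions falls roughly in proportion to the step, a coarse first pass followed by a `window_size = 1` rescan around the strongest peaks keeps the total work low.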
+
+## Troubleshooting
+
+### Common Issues and Solutions
+
+1. **Model loading errors**
+   ```python
+   # Check the model directory
+   from FScanpy import PRFPredictor
+   predictor = PRFPredictor(model_dir='/custom/path')
+   ```
+
+2. **Memory issues with large sequences**
+   ```python
+   # Use a larger window size to reduce the computational load
+   results = predict_prf(sequence=large_seq, window_size=9)
+   ```
+
+3. **Visualization issues**
+   ```python
+   # Adjust the figure parameters
+   results, fig = plot_prf_prediction(
+       sequence=seq,
+       figsize=(20, 10),  # larger figure
+       dpi=150            # lower resolution
+   )
+   ```
+
+4. **Input format issues**
+   ```python
+   # Make sure the DataFrame has the right format
+   data = pd.DataFrame({
+       'Long_Sequence': sequences,  # use 'Long_Sequence' or '399bp'
+       'sample_id': ids
+   })
+   ```
+
+## Performance Optimization
+
+### 1. Batch Processing
+```python
+# Process multiple sequences efficiently
+sequences = ["seq1", "seq2", "seq3", ...]
+results = predict_prf(sequence=sequences, window_size=3)
+```
+
+### 2. Threshold Optimization
+```python
+# Use an appropriate short_threshold to skip unnecessary long-model calls
+results = predict_prf(
+    sequence=sequence,
+    short_threshold=0.2  # higher threshold = faster processing
+)
+```
+
+### 3. Memory Management
+```python
+# For very large datasets, process in chunks
+chunk_size = 100
+for i in range(0, len(large_dataset), chunk_size):
+    chunk = large_dataset[i:i+chunk_size]
+    chunk_results = predict_prf(data=chunk)
+    # process chunk_results
+```
+
+## Citation
+If you use FScanpy, please cite our paper: [Paper Link]