Improve the Chinese documentation and revise further details

This commit is contained in:
ChenLab54 2025-08-17 17:14:16 +08:00
parent 96b61d34d8
commit 6d5b489f9e
5 changed files with 986 additions and 91 deletions


@ -1,7 +1,8 @@
# FScanpy
## A Machine Learning-Based Framework for Programmed Ribosomal Frameshifting Prediction
[![Python](https://img.shields.io/badge/Python-3.7%2B-blue.svg)](https://www.python.org/)
[![中文](https://img.shields.io/badge/Language-中文-red.svg)](README_zh.md)
[![Python](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
FScanpy is a comprehensive Python package designed for the prediction of [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. By integrating advanced machine learning approaches (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework, FScanpy provides robust and accurate PRF site predictions.
@ -36,7 +37,7 @@ FScanpy is a comprehensive Python package designed for the prediction of [Progra
## 🔧 Installation
### Prerequisites
- Python ≥ 3.7
- Python ≥ 3.9
- All dependencies are automatically installed
### Install via pip (Recommended)
@ -104,24 +105,28 @@ results = predictor.predict_sequence(
## 🎛️ Ensemble Weight Configuration
The `ensemble_weight` parameter controls the contribution of each model:
The `ensemble_weight` parameter controls the weight ratio between HistGB and BiLSTM-CNN models:
| ensemble_weight | Short Model | Long Model | Best For |
|----------------|-------------|------------|----------|
| **0.2-0.3** | 20-30% | 70-80% | **High sensitivity**, detecting subtle sites |
| **0.4-0.5** | 40-50% | 50-60% | **Balanced detection** (recommended) |
| **0.6-0.7** | 60-70% | 30-40% | **Fast screening**, high specificity |
| ensemble_weight | HistGB Model | BiLSTM-CNN Model | Characteristics | Best For |
|----------------|-------------|------------------|-----------------|----------|
| **0.2-0.3** | 20-30% | 70-80% | **High specificity**, reduces false positives | Precise validation, clinical applications |
| **0.4** | 40% | 60% | **Optimal balance**, highest AUC | Standard analysis (recommended) |
| **0.6-0.8** | 60-80% | 20-40% | **High sensitivity**, captures more sites | High-throughput screening, exploratory research |
### Model Characteristics
- **HistGB Model**: Excels at identifying true negatives, conservative predictions, low false positive rate
- **BiLSTM-CNN Model**: Excels at identifying true positives, sensitive predictions, captures more potential sites
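How `ensemble_weight` trades the two models off can be sketched as a weighted average. This is a minimal illustration that assumes a simple linear blend, with `ensemble_weight` as the share given to the HistGB (short) model; the helper name `ensemble_probability` is hypothetical and is not part of the FScanpy API.

```python
def ensemble_probability(short_prob: float, long_prob: float,
                         ensemble_weight: float = 0.4) -> float:
    """Blend two probabilities (assumed linear blend, not FScanpy's confirmed formula).

    ensemble_weight is the share given to the short (HistGB) model;
    the long (BiLSTM-CNN) model receives the remaining 1 - ensemble_weight.
    """
    if not 0.0 <= ensemble_weight <= 1.0:
        raise ValueError("ensemble_weight must be between 0.0 and 1.0")
    return ensemble_weight * short_prob + (1.0 - ensemble_weight) * long_prob


# Illustrative inputs only; these are not real model outputs.
for weight in (0.25, 0.4, 0.7):
    blended = ensemble_probability(short_prob=0.30, long_prob=0.90,
                                   ensemble_weight=weight)
    print(f"ensemble_weight={weight:.2f} -> {blended:.3f}")
```

Under this assumption, a larger `ensemble_weight` pulls the result toward the HistGB probability and a smaller one pulls it toward the BiLSTM-CNN probability.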
### Weight Selection Examples
```python
# High sensitivity (Long model dominant)
sensitive_results = predict_prf(sequence, ensemble_weight=0.2)
# High specificity configuration per the table above (25% HistGB, 75% BiLSTM-CNN)
precise_results = predict_prf(sequence, ensemble_weight=0.25)
# Balanced approach (recommended)
# Optimal balance configuration (4:6 ratio)
balanced_results = predict_prf(sequence, ensemble_weight=0.4)
# Fast screening (Short model dominant)
screening_results = predict_prf(sequence, ensemble_weight=0.7)
# High sensitivity configuration per the table above (70% HistGB, 30% BiLSTM-CNN)
sensitive_results = predict_prf(sequence, ensemble_weight=0.7)
```
## 📊 Core Functions
@ -210,67 +215,17 @@ final_predictions = predict_prf(data=prf_regions, ensemble_weight=0.4)
## 📚 Documentation
- **[Complete Tutorial](tutorial/tutorial.md)**: Comprehensive usage guide with examples
- **[Demo Notebook](FScanpy_Demo.ipynb)**: Interactive examples and workflows
- **[Example Scripts](example_plot_prediction.py)**: Ready-to-run code examples
## 🎯 Use Cases
### 1. **Viral Genome Analysis**
```python
# Scan viral genome for PRF sites
viral_sequence = load_viral_genome()
prf_sites = predict_prf(viral_sequence, ensemble_weight=0.3)
high_confidence = prf_sites[prf_sites['Ensemble_Probability'] > 0.8]
```
### 2. **Comparative Genomics**
```python
# Compare PRF patterns across species
species_data = pd.DataFrame({
'Species': ['Virus_A', 'Virus_B'],
'Long_Sequence': [seq_a_399bp, seq_b_399bp]
})
comparative_results = predict_prf(data=species_data)
```
### 3. **High-Throughput Screening**
```python
# Fast screening of large sequence datasets
sequences = load_large_dataset()
screening_results = predict_prf(
sequence=sequences,
ensemble_weight=0.7, # Fast screening mode
short_threshold=0.3
)
```
## 🤝 Contributing
We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
- **[Demo Notebook](FScanpy_Demo.ipynb)**: Practical usage of each function in the library and demonstration of analysis workflow results
- **[Predict Sample Interpretation](tutorial/predict_sample.ipynb)**: Detailed interpretation of FScanpy's plotting results and signal analysis
## 📝 Citation
If you use FScanpy in your research, please cite:
```bibtex
@software{fscanpy2024,
title={FScanpy: A Machine Learning Framework for Programmed Ribosomal Frameshifting Prediction},
author={[Author names]},
year={2024},
url={https://github.com/your-org/FScanpy}
}
```
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🆘 Support
- **Documentation**: [Tutorial](tutorial/tutorial.md)
- **Usage Example**: [Demo Notebook](FScanpy_Demo.ipynb)
- **Prediction Result Interpretation**: [Prediction Result Interpretation](tutorial/predict_sample.ipynb)
## 🏗️ Dependencies
FScanpy automatically installs all required dependencies:

248
README_zh.md Normal file

@ -0,0 +1,248 @@
# FScanpy
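## A Machine Learning-Based Framework for Programmed Ribosomal Frameshifting Prediction
[![English](https://img.shields.io/badge/Language-English-blue.svg)](README.md)
[![Python](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
FScanpy is a comprehensive Python package designed to predict [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. By combining advanced machine learning methods (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework, FScanpy provides robust and accurate PRF site predictions.
![FScanpy architecture](/tutorial/image/structure.jpeg)
## 🌟 Key Features
### 🎯 **Dual-Model Architecture**
- **Short model** (`HistGradientBoosting`): fast screening on 33 bp sequences
- **Long model** (`BiLSTM-CNN`): in-depth analysis on 399 bp sequences
- **Ensemble prediction**: customizable model weights for best performance
### 🚀 **Versatile Input Support**
- **Single or multiple sequences**: sliding-window prediction over full sequences
- **Region-based analysis**: direct prediction on pre-extracted 399 bp regions
- **BLASTX integration**: works seamlessly with the FScanR pipeline
- **Cross-species compatibility**: built-in databases for viruses, marine phages, Euplotes, and more
### 📊 **Advanced Visualization**
- **Interactive heatmaps**: visualization of FS site probabilities
- **Prediction plots**: combined probability and confidence display
- **Customizable thresholds**: separate filtering criteria for each model
- **Export options**: PNG, PDF, and interactive formats
### ⚡ **High Performance**
- **Optimized algorithms**: efficient sliding-window scanning
- **Batch processing**: handle multiple sequences at once
- **Flexible thresholds**: adjustable sensitivity for different use cases
- **Memory efficient**: optimized for large-scale genomic data
## 🔧 Installation
### Prerequisites
- Python ≥ 3.9
- All dependencies are installed automatically
### Install via pip (Recommended)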
```bash
pip install FScanpy
```
### Install from Source
```bash
git clone https://github.com/your-org/FScanpy-package.git
cd FScanpy-package
pip install -e .
```
## 🚀 Quick Start
### Basic Usage
```python
from FScanpy import predict_prf
# Simple sequence prediction
sequence = "ATGCGTACGTTAGC..."  # your DNA sequence
results = predict_prf(sequence=sequence)
# Show the first ten predictions
print(results[['Position', 'Ensemble_Probability', 'Short_Probability', 'Long_Probability']].head(10))
```
### Visualization
```python
from FScanpy import plot_prf_prediction
# Generate the prediction plot
results, fig = plot_prf_prediction(
    sequence=sequence,
    short_threshold=0.65,  # HistGB threshold
    long_threshold=0.8,    # BiLSTM-CNN threshold
    ensemble_weight=0.4,   # 40% short model, 60% long model
    title="PRF Prediction Results"
)
```
### Advanced Usage
```python
from FScanpy import PRFPredictor
import pandas as pd
# Create a predictor instance
predictor = PRFPredictor()
# Batch prediction on pre-extracted regions
data = pd.DataFrame({
    'Long_Sequence': ['ATG' * 133, 'GCT' * 133]  # 399 bp placeholder sequences
})
# predict_regions is documented in the tutorial as taking a list or Series of sequences
results = predictor.predict_regions(data['Long_Sequence'], ensemble_weight=0.4)
# Sequence-level prediction with custom parameters
results = predictor.predict_sequence(
    sequence=sequence,
    window_size=1,        # sliding-window step size
    ensemble_weight=0.3,  # model weight
    short_threshold=0.5   # filtering threshold
)
```
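## 🎛️ Ensemble Weight Configuration
The `ensemble_weight` parameter controls the weight ratio between the HistGB and BiLSTM-CNN models:
| ensemble_weight | HistGB Model | BiLSTM-CNN Model | Characteristics | Best For |
|----------------|------------|----------------|------|----------|
| **0.2-0.3** | 20-30% | 70-80% | **High specificity**, reduces false positives | Precise validation, clinical applications |
| **0.4** | 40% | 60% | **Optimal balance**, highest AUC | Standard analysis (recommended) |
| **0.6-0.8** | 60-80% | 20-40% | **High sensitivity**, captures more sites | High-throughput screening, exploratory research |
### Model Characteristics
- **HistGB Model**: excels at identifying true negatives; conservative predictions with a low false positive rate
- **BiLSTM-CNN Model**: excels at identifying true positives; sensitive predictions that capture more potential sites
### Weight Selection Examples
```python
# High specificity configuration per the table above (25% HistGB, 75% BiLSTM-CNN)
precise_results = predict_prf(sequence, ensemble_weight=0.25)
# Optimal balance configuration (4:6 ratio)
balanced_results = predict_prf(sequence, ensemble_weight=0.4)
# High sensitivity configuration per the table above (70% HistGB, 30% BiLSTM-CNN)
sensitive_results = predict_prf(sequence, ensemble_weight=0.7)
```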
## 📊 Core Functions
### Main Prediction Interface
```python
predict_prf(
    sequence=None,        # single or multiple sequences, or None
    data=None,            # DataFrame containing 399 bp sequences, or None
    window_size=3,        # sliding-window step size
    short_threshold=0.1,  # short-model filtering threshold
    ensemble_weight=0.4,  # short-model weight (0.0-1.0)
    model_dir=None        # custom model directory
)
```
### Visualization Function
```python
plot_prf_prediction(
    sequence,              # input DNA sequence
    window_size=3,         # scanning step size
    short_threshold=0.65,  # short-model plotting threshold
    long_threshold=0.8,    # long-model plotting threshold
    ensemble_weight=0.4,   # model weight
    title=None,            # plot title
    save_path=None,        # output file path
    figsize=(12,8),        # figure size
    dpi=300                # resolution of the saved figure
)
```
### PRFPredictor Class Methods
```python
predictor = PRFPredictor()
# Sequence prediction (sliding window)
predictor.predict_sequence(sequence, ensemble_weight=0.4)
# Region prediction (batch)
predictor.predict_regions(dataframe, ensemble_weight=0.4)
# Feature extraction
predictor.extract_features(sequences)
# Model information
predictor.get_model_info()
```
## 📈 Output Fields
### Prediction Results
- **`Position`**: position in the original sequence
- **`Ensemble_Probability`**: final ensemble prediction (primary result)
- **`Short_Probability`**: HistGradientBoosting prediction (0-1)
- **`Long_Probability`**: BiLSTM-CNN prediction (0-1)
- **`Ensemble_Weights`**: model weight configuration used
### Sequence Information
- **`Short_Sequence`**: 33 bp sequence used by the short model
- **`Long_Sequence`**: 399 bp sequence used by the long model
- **`Codon`**: 3 bp codon at the predicted position
- **`Sequence_ID`**: identifier for multi-sequence input
## 🔬 Integration with FScanR
FScanpy works seamlessly with the FScanR pipeline to provide comprehensive PRF analysis:
```python
from FScanpy import fscanr, extract_prf_regions, predict_prf
# Step 1: BLASTX analysis with FScanR
blastx_results = fscanr(
    blastx_data,
    mismatch_cutoff=10,
    evalue_cutoff=1e-5,
    frameDist_cutoff=10
)
# Step 2: extract PRF candidate regions
prf_regions = extract_prf_regions(original_sequence, blastx_results)
# Step 3: predict with FScanpy
final_predictions = predict_prf(data=prf_regions, ensemble_weight=0.4)
```
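## 📚 Documentation
- **[Complete Tutorial](tutorial/tutorial_zh.md)**: comprehensive usage guide with examples
- **[Demo Notebook](FScanpy_Demo.ipynb)**: practical usage of each function in the library and a demonstration of analysis workflow results
- **[Prediction Result Interpretation](tutorial/predict_sample.ipynb)**: detailed interpretation of FScanpy's plotting results and signal analysis
## 📝 Citation
If you use FScanpy in your research, please cite:
```bibtex
@software{fscanpy2024,
title={FScanpy: A Machine Learning Framework for Programmed Ribosomal Frameshifting Prediction},
author={[Author names]},
year={2024},
url={https://github.com/your-org/FScanpy}
}
```
## 🏗️ Dependencies
FScanpy automatically installs all required dependencies: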
- `numpy>=1.24.3`
- `pandas>=2.2.3`
- `tensorflow>=2.10.1`
- `scikit-learn>=1.6.0`
- `matplotlib>=3.9.4`
- `joblib>=1.4.2`
- `biopython>=1.85`
- `wrapt>=1.17.0`
---
**FScanpy** - Advancing programmed ribosomal frameshifting research through machine learning 🧬


@ -1,5 +1,17 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# FScanpy Prediction Results Interpretation\n",
"\n",
"[![中文](https://img.shields.io/badge/Language-中文-red.svg)](predict_sample_zh.ipynb)\n",
"\n",
"This notebook provides detailed interpretation of FScanpy prediction results, explaining output fields and how to analyze predictions.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
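@ -154,11 +166,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Result Interpretation: Sequence0\n",
"### Ground Truth\n",
"The programmed ribosomal frameshifting event in this sequence occurs at nucleotide 113.\n",
"### What the Plot Shows\n",
"A prominent highest peak is visible at this position, and it is noticeably broad."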
"## Sequence0: High-Confidence, Unambiguous Signal\n",
"- Known Ground Truth: The programmed ribosomal frameshifting (PRF) event for this sequence occurs at nucleotide 113.\n",
"\n",
"- Plot Interpretation: The FScanpy analysis shows a prominent probability peak at the 113 nt position. The magnitude of this peak significantly exceeds other regions in the sequence, and the surrounding bases also exhibit elevated frameshifting probabilities, forming a concentrated and well-defined signal. This indicates a high-confidence prediction from the model that corresponds precisely with the known PRF event location."
]
},
{
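@ -208,12 +219,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Result Interpretation: Sequence1\n",
"## Sequence1: Ambiguous Signal at Low Resolution\n",
"- Known Ground Truth: The PRF event for this sequence occurs at nucleotide 1794.\n",
"\n",
"### Ground Truth\n",
"The programmed ribosomal frameshifting event in this sequence occurs at nucleotide 1794.\n",
"### What the Plot Shows\n",
"A prominent peak is visible at this position, but by eye it is hard to tell this peak apart from the peaks at other positions, so the resolution needs to be increased. Set the window size parameter to a smaller value, inspect the frameshifting probabilities of the bases around each peak, and judge how likely each site is from how densely the high-probability positions cluster.\n"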
"- Plot Interpretation: An initial low-resolution scan of Sequence1 identifies a probability peak near the known 1794 nt site. However, the plot also reveals multiple potential signal peaks of comparable intensity elsewhere, making it difficult to definitively identify the true event from this view alone. To resolve this ambiguity, a high-resolution analysis is necessary. By reducing the `window_size` parameter, the analysis can focus on the local vicinity of each peak to assess the concentration of high-probability predictions, which helps distinguish a true signal from background noise."
]
},
{
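@ -263,13 +272,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## High-Resolution Result Interpretation: Sequence1\n",
"## Sequence1: Signal Confirmation at High Resolution\n",
"- Known Ground Truth: The PRF event for this sequence occurs at nucleotide 1794.\n",
"\n",
"### Ground Truth\n",
"The programmed ribosomal frameshifting event in this sequence occurs at nucleotide 1794.\n",
"### What the Plot Shows\n",
"A large number of high-probability bases cluster at this position, whereas no such cluster exists around the other peaks, so this site is more likely to be a PRF site than the other peaks.\n",
"\n"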
"- Plot Interpretation: Following the high-resolution analysis, the results clearly show a dense cluster of high-probability bases centered around the 1794 nt position. In contrast, the other peaks identified in the initial scan do not show a similar concentration and appear as isolated, singular data points. This consolidation of the signal confirms that the 1794 nt site is the more reliable PRF event, with a much higher likelihood than other potential sites in the sequence."
]
},
{
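@ -405,11 +411,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Result Interpretation: Sequence4\n",
"### Ground Truth\n",
"The programmed ribosomal frameshifting event in this sequence occurs at nucleotide 216.\n",
"### What the Plot Shows\n",
"Our algorithm cannot always resolve the question. Three prominent peaks are visible here, and the peaks around 80 nt and around 216 nt show no clear difference. Wet-lab experiments are needed to verify which site is real."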
"## Sequence4: Ambiguous Result Requiring Experimental Validation\n",
"- Known Ground Truth: The PRF event for this sequence occurs at nucleotide 216.\n",
"\n",
"- Plot Interpretation: This analysis demonstrates the inherent limitations of computational methods. FScanpy identifies three significant probability peaks, with the peaks at ~80 nt and the true site at 216 nt showing no decisive difference in magnitude. In such cases, the computational result alone is insufficient to distinguish the true PRF event. The model successfully narrows down the potential candidates to a few key regions, but subsequent biological (wet-lab) experiments are essential to validate which site is functionally active."
]
}
],


@ -1,5 +1,7 @@
# FScanpy Tutorial - Complete Usage Guide
[![中文](https://img.shields.io/badge/Language-中文-red.svg)](tutorial_zh.md)
## Abstract
FScanpy is a Python package designed to predict Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. This package integrates machine learning models, sequence feature analysis, and visualization capabilities to help researchers rapidly locate potential PRF sites.
@ -612,7 +614,4 @@ for i in range(0, len(large_dataset), chunk_size):
```
## Citation
If you use FScanpy, please cite our paper: [Paper Link]
## Support
For questions and issues, please visit our GitHub repository or contact the development team.
If you use FScanpy, please cite our paper: [Paper Link]

688
tutorial/tutorial_zh.md Normal file

@ -0,0 +1,688 @@
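# FScanpy Tutorial - Complete Usage Guide
[![English](https://img.shields.io/badge/Language-English-blue.svg)](tutorial.md)
## Abstract
FScanpy is a Python package designed to predict Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. It integrates machine learning models, sequence feature analysis, and visualization to help researchers quickly locate potential PRF sites.
## Introduction
![FScanpy structure](/image/structure.jpeg)
FScanpy is a Python package dedicated to predicting PRF sites in DNA sequences. It integrates machine learning models (gradient boosting and BiLSTM-CNN) with the FScanR package to provide accurate PRF predictions. Users can supply three types of input: a full cDNA/mRNA sequence to be scanned, the nucleotide sequence around a suspected frameshift site, or BLASTX results against a peptide library of the same or a related species. FScanpy expects input sequences to be on the + strand and can be combined with FScanR to improve accuracy.
![Machine learning models](/image/ML.png)
For whole-sequence prediction, FScanpy uses a sliding-window approach to scan the entire sequence and predict PRF sites. For region prediction, it relies on the 33 bp and 399 bp sequences in reading frame 0 around the suspected frameshift site. First, the short model (HistGradientBoosting) predicts potential PRF sites within the scanning window. If the predicted probability exceeds the threshold, the long model (BiLSTM-CNN) then predicts on the 399 bp sequence. Finally, the ensemble weight combines the two models into the final prediction.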
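The two-stage scan described above can be sketched roughly as follows. This is an illustration under stated assumptions, not FScanpy's actual implementation: the helper names (`centered_window`, `scan_sequence`), the `N` padding at sequence edges, the choice to drop positions that fail the short-model threshold, and the linear blend are all hypothetical, and the two models are passed in as plain callables.

```python
from typing import Callable, Dict, List


def centered_window(sequence: str, center: int, length: int, pad: str = "N") -> str:
    """Return `length` bases centered on `center`, padded with `pad` at the edges."""
    start = center - length // 2
    end = start + length
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(sequence))
    return left_pad + sequence[max(0, start):min(len(sequence), end)] + right_pad


def scan_sequence(sequence: str,
                  short_model: Callable[[str], float],
                  long_model: Callable[[str], float],
                  window_size: int = 3,
                  short_threshold: float = 0.1,
                  ensemble_weight: float = 0.4) -> List[Dict[str, float]]:
    """Two-stage scan: cheap short model first, long model only on survivors."""
    results = []
    for position in range(0, len(sequence) - 2, window_size):
        short_prob = short_model(centered_window(sequence, position, 33))
        if short_prob < short_threshold:
            continue  # assumed behavior: skip the expensive long model here
        long_prob = long_model(centered_window(sequence, position, 399))
        results.append({
            "Position": position,
            "Short_Probability": short_prob,
            "Long_Probability": long_prob,
            # assumed linear blend; ensemble_weight is the short-model share
            "Ensemble_Probability": ensemble_weight * short_prob
                                    + (1.0 - ensemble_weight) * long_prob,
        })
    return results


# Placeholder "models" for demonstration only; they are not trained predictors.
toy_short = lambda window: window.count("T") / len(window)
toy_long = lambda window: 0.5
hits = scan_sequence("ATGCGTACGT" * 5, toy_short, toy_long, window_size=3)
print(f"{len(hits)} positions passed the short-model threshold")
```

The gating step is what makes `short_threshold` a speed control: the higher it is, the fewer 399 bp windows reach the long model.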
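For PRF detection from BLASTX output, [FScanR](https://github.com/seanchen607/FScanR.git) identifies potential PRF sites from the BLASTX alignment results: it takes two hits of the same query sequence and then filters the hits using frameDist_cutoff, mismatch_cutoff, and evalue_cutoff. Finally, FScanpy predicts the probability of each PRF site.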
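To make the filtering idea concrete, here is a deliberately simplified sketch. It is not the FScanR algorithm: the pairing rule (adjacent hits of the same query and subject in different query frames), the gap definition, and the function name `candidate_frameshifts` are assumptions made for illustration. Only the column names (`qseqid`, `sseqid`, `qstart`, `qend`, `qframe`, `mismatch`, `evalue`) and the three cutoffs come from the text.

```python
from collections import defaultdict
from typing import Dict, List


def candidate_frameshifts(hits: List[Dict],
                          mismatch_cutoff: float = 10,
                          evalue_cutoff: float = 1e-5,
                          frame_dist_cutoff: float = 10) -> List[Dict]:
    """Pair filtered hits of the same query/subject that lie in different frames."""
    # 1. Drop low-quality hits.
    kept = [h for h in hits
            if h["mismatch"] <= mismatch_cutoff and h["evalue"] <= evalue_cutoff]

    # 2. Group the remaining hits by query and subject.
    grouped = defaultdict(list)
    for hit in kept:
        grouped[(hit["qseqid"], hit["sseqid"])].append(hit)

    # 3. Report neighbouring hits whose frame changes within a short distance.
    candidates = []
    for (qseqid, sseqid), group in grouped.items():
        group.sort(key=lambda h: h["qstart"])
        for first, second in zip(group, group[1:]):
            gap = abs(second["qstart"] - first["qend"])
            if first["qframe"] != second["qframe"] and gap <= frame_dist_cutoff:
                candidates.append({"DNA_seqid": qseqid, "Pep_seqid": sseqid,
                                   "FS_start": first["qend"],
                                   "FS_end": second["qstart"]})
    return candidates


# Made-up hits for illustration; not real BLASTX output.
example_hits = [
    {"qseqid": "q1", "sseqid": "p1", "qstart": 1, "qend": 300,
     "qframe": 1, "mismatch": 2, "evalue": 1e-30},
    {"qseqid": "q1", "sseqid": "p1", "qstart": 302, "qend": 600,
     "qframe": 3, "mismatch": 3, "evalue": 1e-25},
    {"qseqid": "q2", "sseqid": "p2", "qstart": 1, "qend": 300,
     "qframe": 1, "mismatch": 50, "evalue": 1e-30},
]
print(candidate_frameshifts(example_hits))
```

For real analyses, use the `fscanr()` function documented later in this tutorial rather than this sketch.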
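### Background
[Ribosomal frameshifting](https://en.wikipedia.org/wiki/Ribosomal_frameshift), also known as translational frameshifting or translational recoding, is a biological phenomenon that occurs during translation and results in the production of multiple unique proteins from a single mRNA. The process can be programmed by the nucleotide sequence of the mRNA and is sometimes affected by the secondary or three-dimensional structure of the mRNA. It has been described mainly in viruses (especially retroviruses), retrotransposons, and bacterial insertion elements, and also occurs in some cellular genes.
### Key features of FScanpy
- Integration of two prediction models:
  - Short model (HistGradientBoosting): analyzes local sequence features centered on the potential frameshift site (33 bp).
  - Long model (BiLSTM-CNN): analyzes broader sequence features (399 bp).
- Supports PRF prediction across a variety of species.
- Can be combined with [FScanR](https://github.com/seanchen607/FScanR.git) to improve accuracy.
## Installation (python>=3.9)
### 1. Using pip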
```bash
pip install FScanpy
```
### 2. Clone from GitHub
```bash
git clone https://github.com/.../FScanpy.git
cd FScanpy
pip install -e .
```
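## Complete Function Reference
### 1. Core Prediction Functions
#### 1.1 `predict_prf()` - Main Prediction Interface
**Function signature:**
```python
def predict_prf(
    sequence: Union[str, List[str], None] = None,
    data: Union[pd.DataFrame, None] = None,
    window_size: int = 3,
    short_threshold: float = 0.1,
    ensemble_weight: float = 0.4,
    model_dir: str = None
) -> pd.DataFrame
```
**Parameters:**
- `sequence`: one or more DNA sequences for sliding-window prediction
- `data`: DataFrame that must contain a 'Long_Sequence' or '399bp' column, for region prediction
- `window_size`: sliding window size (default: 3, recommended: 1-10)
- `short_threshold`: probability threshold for the short model (HistGB) (default: 0.1, range: 0.0-1.0)
- `ensemble_weight`: weight of the short model in the ensemble (default: 0.4, range: 0.0-1.0)
- `model_dir`: model directory path (optional; the built-in models are used if None)
**Returns:**
- `pd.DataFrame`: prediction results with the following columns:
  - `Short_Probability`: short-model predicted probability
  - `Long_Probability`: long-model predicted probability
  - `Ensemble_Probability`: ensemble predicted probability (primary result)
  - `Position`: position in the sequence (sliding-window mode)
  - `Codon`: codon at the position (sliding-window mode)
  - `Ensemble_Weights`: weight configuration information
**Usage examples:**
```python
from FScanpy import predict_prf
# 1. Sliding-window prediction on a single sequence
sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
results = predict_prf(sequence=sequence)
# 2. Multi-sequence prediction
sequences = ["ATGCGTACGT...", "GCTATAGCAT..."]
results = predict_prf(sequence=sequences)
# 3. Custom parameters
results = predict_prf(
    sequence=sequence,
    window_size=1,        # scan every position
    short_threshold=0.2,  # higher threshold
    ensemble_weight=0.3   # 3:7 ratio (short:long)
)
# 4. Region prediction from a DataFrame
import pandas as pd
data = pd.DataFrame({
    'Long_Sequence': ['ATGCGT...', 'GCTATAG...'],  # or use '399bp'
    'sample_id': ['sample1', 'sample2']
})
results = predict_prf(data=data)
```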
#### 1.2 `plot_prf_prediction()` - Prediction with Visualization
**Function signature:**
```python
def plot_prf_prediction(
    sequence: str,
    window_size: int = 3,
    short_threshold: float = 0.65,
    long_threshold: float = 0.8,
    ensemble_weight: float = 0.4,
    title: str = None,
    save_path: str = None,
    figsize: tuple = (12, 8),
    dpi: int = 300,
    model_dir: str = None
) -> tuple
```
**Parameters:**
- `sequence`: input DNA sequence (string)
- `window_size`: sliding window size (default: 3)
- `short_threshold`: short-model filtering threshold for the heatmap display (default: 0.65)
- `long_threshold`: long-model filtering threshold for the heatmap display (default: 0.8)
- `ensemble_weight`: weight of the short model in the ensemble (default: 0.4)
- `title`: plot title (optional; generated automatically if None)
- `save_path`: output path (optional; the plot is saved if provided)
- `figsize`: figure size tuple (default: (12, 8))
- `dpi`: figure resolution (default: 300)
- `model_dir`: model directory path (optional)
**Returns:**
- `tuple`: (prediction_results: pd.DataFrame, figure: matplotlib.figure.Figure)
**Usage examples:**
```python
from FScanpy import plot_prf_prediction
import matplotlib.pyplot as plt
# 1. Basic plot
sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
results, fig = plot_prf_prediction(sequence)
plt.show()
# 2. Custom thresholds and weight
results, fig = plot_prf_prediction(
    sequence,
    short_threshold=0.7,   # higher display threshold
    long_threshold=0.85,   # higher display threshold
    ensemble_weight=0.3,   # 3:7 weight ratio
    title="Custom Analysis Results",
    save_path="analysis.png",
    figsize=(15, 10),
    dpi=150
)
# 3. High-resolution analysis
results, fig = plot_prf_prediction(
    sequence,
    window_size=1,        # scan every position
    ensemble_weight=0.5,  # equal weights
    dpi=600               # high resolution
)
```
### 2. PRFPredictor Class Methods
#### 2.1 Class Initialization
```python
from FScanpy import PRFPredictor
# Initialize with the default models
predictor = PRFPredictor()
# Initialize with a custom model directory
predictor = PRFPredictor(model_dir='/path/to/models')
```
#### 2.2 `predict_sequence()` - Sliding-Window Prediction
**Method signature:**
```python
def predict_sequence(self, sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)
```
**Parameters:**
- `sequence`: input DNA sequence
- `window_size`: sliding window size (default: 3)
- `short_threshold`: short-model probability threshold (default: 0.1)
- `ensemble_weight`: weight of the short model in the ensemble (default: 0.4)
**Usage:**
```python
predictor = PRFPredictor()
results = predictor.predict_sequence(
    sequence="ATGCGTACGT...",
    window_size=1,
    short_threshold=0.15,
    ensemble_weight=0.35
)
```
#### 2.3 `predict_regions()` - Region-Based Prediction
**Method signature:**
```python
def predict_regions(self, sequences, short_threshold=0.1, ensemble_weight=0.4)
```
**Parameters:**
- `sequences`: list or Series of 399 bp sequences
- `short_threshold`: short-model probability threshold (default: 0.1)
- `ensemble_weight`: weight of the short model in the ensemble (default: 0.4)
**Usage:**
```python
predictor = PRFPredictor()
sequences = ["ATGCGT...", "GCTATAG..."]  # 399 bp sequences
results = predictor.predict_regions(
    sequences=sequences,
    short_threshold=0.1,
    ensemble_weight=0.4
)
```
#### 2.4 `predict_single_position()` - Single-Position Prediction
**Method signature:**
```python
def predict_single_position(self, fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)
```
**Parameters:**
- `fs_period`: 33 bp sequence around the frameshift site
- `full_seq`: 399 bp sequence used by the long model
- `short_threshold`: short-model probability threshold (default: 0.1)
- `ensemble_weight`: weight of the short model in the ensemble (default: 0.4)
**Usage:**
```python
predictor = PRFPredictor()
result = predictor.predict_single_position(
    fs_period="ATGCGTACGTATGCGTACGTATGCGTACGTATG",  # 33 bp placeholder
    full_seq="ATG" * 133,                            # 399 bp placeholder
    short_threshold=0.1,
    ensemble_weight=0.4
)
```
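Because the two inputs have fixed lengths (33 bp and 399 bp), it can help to check them before calling the predictor. The helpers below are a hypothetical convenience sketch, not part of FScanpy; whether the library itself validates or pads inputs is not stated in this tutorial.

```python
def make_placeholder(length: int, unit: str = "ATG") -> str:
    """Build a placeholder sequence of exactly `length` bases by repeating `unit`."""
    repeats = -(-length // len(unit))  # ceiling division
    return (unit * repeats)[:length]


def check_inputs(fs_period: str, full_seq: str) -> None:
    """Raise ValueError if the inputs do not have the documented lengths."""
    if len(fs_period) != 33:
        raise ValueError(f"fs_period must be 33 bp, got {len(fs_period)}")
    if len(full_seq) != 399:
        raise ValueError(f"full_seq must be 399 bp, got {len(full_seq)}")


fs_period = make_placeholder(33, unit="ATGCGTACGT")
full_seq = make_placeholder(399)
check_inputs(fs_period, full_seq)  # passes silently
print(len(fs_period), len(full_seq))
```

Note that a string such as `"ATGCGT..." * 133` is not 399 bp long: the literal dots are repeated as well, which is why an explicit length check is useful.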
#### 2.5 `plot_sequence_prediction()` - Plotting via the Class Method
**Method signature:**
```python
def plot_sequence_prediction(self, sequence, window_size=3, short_threshold=0.65,
                             long_threshold=0.8, ensemble_weight=0.4, title=None,
                             save_path=None, figsize=(12, 8), dpi=300)
```
**Usage:**
```python
predictor = PRFPredictor()
results, fig = predictor.plot_sequence_prediction(
    sequence="ATGCGTACGT...",
    window_size=3,
    ensemble_weight=0.4
)
```
### 3. Utility Functions
#### 3.1 `fscanr()` - Detect PRF Sites from BLASTX
**Function signature:**
```python
def fscanr(
    blastx_output: pd.DataFrame,
    mismatch_cutoff: float = 10,
    evalue_cutoff: float = 1e-5,
    frameDist_cutoff: float = 10
) -> pd.DataFrame
```
**Parameters:**
- `blastx_output`: BLASTX output DataFrame containing the required columns:
  - `qseqid`, `sseqid`, `pident`, `length`, `mismatch`, `gapopen`
  - `qstart`, `qend`, `sstart`, `send`, `evalue`, `bitscore`, `qframe`, `sframe`
- `mismatch_cutoff`: maximum number of mismatches allowed (default: 10)
- `evalue_cutoff`: E-value threshold (default: 1e-5)
- `frameDist_cutoff`: frame distance threshold (default: 10)
**Returns:**
- `pd.DataFrame`: PRF sites with the following columns:
  - `DNA_seqid`: sequence identifier
  - `FS_start`, `FS_end`: frameshift start and end positions
  - `Pep_seqid`: peptide sequence identifier
  - `Pep_FS_start`, `Pep_FS_end`: frameshift positions on the peptide
  - `FS_type`: frameshift type (-2, -1, 1, 2)
  - `Strand`: strand orientation (+, -)
**Usage:**
```python
from FScanpy.utils import fscanr
import pandas as pd
# Load the BLASTX results
blastx_data = pd.read_excel('blastx_results.xlsx')
# Detect PRF sites
prf_sites = fscanr(
    blastx_output=blastx_data,
    mismatch_cutoff=5,    # stricter mismatch filtering
    evalue_cutoff=1e-6,   # stricter E-value filtering
    frameDist_cutoff=15   # allow a larger frame distance
)
```
#### 3.2 `extract_prf_regions()` - Extract Sequences Around PRF Sites
**Function signature:**
```python
def extract_prf_regions(mrna_file: str, prf_data: pd.DataFrame) -> pd.DataFrame
```
**Parameters:**
- `mrna_file`: path to the mRNA sequence file (FASTA format)
- `prf_data`: DataFrame produced by `fscanr()`
**Returns:**
- `pd.DataFrame`: extracted sequences with the following columns:
  - `DNA_seqid`: sequence identifier
  - `FS_start`, `FS_end`: frameshift positions
  - `Strand`: strand orientation
  - `399bp`: extracted 399 bp sequence
  - `FS_type`: frameshift type
**Usage:**
```python
from FScanpy import PRFPredictor
from FScanpy.utils import extract_prf_regions
# Extract sequences around the PRF sites (prf_sites comes from fscanr())
prf_sequences = extract_prf_regions(
    mrna_file='sequences.fasta',
    prf_data=prf_sites
)
# Predict PRF probabilities
predictor = PRFPredictor()
results = predictor.predict_regions(prf_sequences['399bp'])
```
### 4. Data Access Functions
#### 4.1 Test Data Access
```python
from FScanpy.data import get_test_data_path, list_test_data
# List the available test data
list_test_data()
# Get test data paths
blastx_file = get_test_data_path('blastx_example.xlsx')
mrna_file = get_test_data_path('mrna_example.fasta')
region_file = get_test_data_path('region_example.csv')
seq_file = get_test_data_path('full_seq.xlsx')
```
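## Complete Workflow Examples
### Workflow 1: Full Sequence Analysis
```python
import pandas as pd
import matplotlib.pyplot as plt
from FScanpy import predict_prf, plot_prf_prediction
from FScanpy.data import get_test_data_path
# Load the example sequences; the 'full_seq' column holds the sequence strings
seq_file = get_test_data_path('full_seq.xlsx')
full_seq = pd.read_excel(seq_file)
sequence = full_seq['full_seq'].iloc[0]
# Method 1: simple prediction
results = predict_prf(sequence=sequence)
print(f"Found {len(results)} potential sites")
# Method 2: prediction with visualization
results, fig = plot_prf_prediction(
    sequence=sequence,
    window_size=1,        # scan every position
    short_threshold=0.3,  # show sites with probability > 0.3
    long_threshold=0.4,   # show sites with probability > 0.4
    ensemble_weight=0.4,  # 4:6 weight ratio
    title="PRF Analysis Results",
    save_path="prf_analysis.png"
)
plt.show()
# Inspect the top predictions
top_sites = results.nlargest(5, 'Ensemble_Probability')
print("Top 5 predicted sites:")
for _, site in top_sites.iterrows():
    print(f"Position {site['Position']}: {site['Ensemble_Probability']:.3f}")
```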
### Workflow 2: Region-Based Prediction
```python
from FScanpy import predict_prf
import pandas as pd
# Prepare the region data
region_data = pd.DataFrame({
    'sample_id': ['sample1', 'sample2', 'sample3'],
    'Long_Sequence': [
        'ATGCGT...',   # 399 bp sequence 1
        'GCTATAG...',  # 399 bp sequence 2
        'TTACGGA...'   # 399 bp sequence 3
    ],
    'known_label': [1, 0, 1]  # optional: known labels for validation
})
# Predict PRF probabilities
results = predict_prf(
    data=region_data,
    ensemble_weight=0.3  # favors the long model (3:7 ratio)
)
# Evaluate the results
if 'known_label' in results.columns:
    threshold = 0.5
    predictions = (results['Ensemble_Probability'] > threshold).astype(int)
    accuracy = (predictions == results['known_label']).mean()
    print(f"Accuracy at threshold {threshold}: {accuracy:.3f}")
```
### Workflow 3: BLASTX-Based Analysis Pipeline
```python
from FScanpy import PRFPredictor, predict_prf
from FScanpy.data import get_test_data_path
from FScanpy.utils import fscanr, extract_prf_regions
import pandas as pd
# Step 1: load the BLASTX data
blastx_data = pd.read_excel(get_test_data_path('blastx_example.xlsx'))
print(f"Loaded {len(blastx_data)} BLASTX hits")
# Step 2: detect PRF sites with FScanR
prf_sites = fscanr(
    blastx_output=blastx_data,
    mismatch_cutoff=10,
    evalue_cutoff=1e-5,
    frameDist_cutoff=10
)
print(f"Detected {len(prf_sites)} potential PRF sites")
# Step 3: extract sequences around the PRF sites
mrna_file = get_test_data_path('mrna_example.fasta')
prf_sequences = extract_prf_regions(
    mrna_file=mrna_file,
    prf_data=prf_sites
)
print(f"Extracted {len(prf_sequences)} sequences")
# Step 4: predict PRF probabilities
predictor = PRFPredictor()
results = predictor.predict_regions(
    sequences=prf_sequences['399bp'],
    ensemble_weight=0.4
)
# Step 5: combine the results with the metadata
final_results = pd.concat([
    prf_sequences.reset_index(drop=True),
    results.reset_index(drop=True)
], axis=1)
# Step 6: analyze the results
high_prob_sites = final_results[
    final_results['Ensemble_Probability'] > 0.7
]
print(f"High-probability PRF sites: {len(high_prob_sites)}")
# Show the top results
print("\nTop PRF predictions:")
top_results = final_results.nlargest(3, 'Ensemble_Probability')
for _, row in top_results.iterrows():
    print(f"Sequence {row['DNA_seqid']}: {row['Ensemble_Probability']:.3f}")
```
### Workflow 4: Custom Multi-Sequence Analysis
```python
from FScanpy import predict_prf, plot_prf_prediction
import matplotlib.pyplot as plt
# Multi-sequence analysis
sequences = [
    "ATGCGTACGTATGCGTACGTATGCGTACGTAAGCCCTTTGAACCCAAAGGG",
    "GCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCAT",
    "TTACGGATTACGGATTACGGATTACGGATTACGGATTACGGATTACGGAT"
]
# Batch prediction
results = predict_prf(
    sequence=sequences,
    window_size=2,
    ensemble_weight=0.5  # equal weights
)
# Analyze per sequence
for seq_id in results['Sequence_ID'].unique():
    seq_results = results[results['Sequence_ID'] == seq_id]
    max_prob = seq_results['Ensemble_Probability'].max()
    print(f"{seq_id}: max probability = {max_prob:.3f}")
# Visualize the first sequence
first_seq_results, fig = plot_prf_prediction(
    sequence=sequences[0],
    ensemble_weight=0.5,
    title="First Sequence Analysis"
)
plt.show()
```
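## Parameter Optimization Guide
### 1. Ensemble Weight Configuration in Detail
#### 1.1 Why the Models Complement Each Other
FScanpy combines two machine learning models with complementary strengths:
- **HistGB model (short model)**
  - Excels at identifying true negatives (correctly ruling out non-PRF sites)
  - Based on local features of the 33 bp sequence
  - More conservative predictions with a lower false positive rate
  - Suited to scenarios that require high specificity
- **BiLSTM-CNN model (long model)**
  - Excels at identifying true positives (correctly identifying PRF sites)
  - Based on long-range features of the 399 bp sequence
  - More sensitive predictions that capture more potential sites
  - Suited to scenarios that require high sensitivity
#### 1.2 Optimal Weight Ratio (4:6)
Through extensive testing, we determined the optimal weight distribution to be HistGB:BiLSTM-CNN = 4:6 (`ensemble_weight = 0.4`):
- **Highest AUC**: achieves the best area under the curve on the test set
- **Balanced prediction performance**: the best balance between sensitivity and specificity
- **Fewer extreme errors**: reduces the risk of over-prediction by the BiLSTM-CNN model
- **Avoids a conservative bias**: prevents the ambiguity caused by an overly conservative HistGB model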
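A weight such as 0.4 is typically found by scoring a grid of candidate weights on labeled validation data. The sketch below shows one way to do that with a rank-based AUC in plain Python. It is an illustration only: the data are synthetic, the linear blend is an assumption, and the names `roc_auc` and `best_weight` are hypothetical rather than FScanpy functions.

```python
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def best_weight(labels: Sequence[int],
                short_probs: Sequence[float],
                long_probs: Sequence[float],
                weights: Sequence[float]) -> Tuple[float, float]:
    """Return (weight, auc) for the candidate weight with the highest AUC."""
    scored: List[Tuple[float, float]] = []
    for weight in weights:
        blended = [weight * s + (1.0 - weight) * l
                   for s, l in zip(short_probs, long_probs)]
        scored.append((weight, roc_auc(labels, blended)))
    return max(scored, key=lambda pair: pair[1])  # first maximum wins on ties


# Synthetic validation data for illustration; not FScanpy results.
labels = [1, 1, 0, 0]
short_probs = [0.9, 0.2, 0.1, 0.3]
long_probs = [0.2, 0.9, 0.3, 0.1]
weight, auc = best_weight(labels, short_probs, long_probs, [0.0, 0.5, 1.0])
print(f"best weight: {weight}, AUC: {auc}")
```

In this toy example each model alone misranks one pair, while the blend ranks every pair correctly, which is the kind of complementarity the 4:6 ratio is meant to exploit.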
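#### 1.3 Weight Selection Strategy
Choose a weight configuration that matches your research goal:
**High-throughput screening (favoring sensitivity)**
- **Weight configuration**: `ensemble_weight = 0.6-0.8`
- **Use cases**: initial screening, candidate site discovery, exploratory research
- **Advantage**: retains more candidate sites and lowers the risk of missed detections
- **Trade-off**: may produce more false positives that need follow-up validation
```python
# High sensitivity configuration example
results = predict_prf(
    sequence=sequence,
    ensemble_weight=0.7,  # 70% HistGB, 30% BiLSTM-CNN
    short_threshold=0.05  # lower the short-model threshold
)
```
**Precise validation (favoring specificity)**
- **Weight configuration**: `ensemble_weight = 0.2-0.3`
- **Use cases**: candidate validation, clinical applications, high-confidence predictions
- **Advantage**: fewer false positives and more reliable predictions
- **Trade-off**: may miss some true positive sites
```python
# High specificity configuration example
results = predict_prf(
    sequence=sequence,
    ensemble_weight=0.25,  # 25% HistGB, 75% BiLSTM-CNN
    short_threshold=0.2    # raise the short-model threshold
)
```
**Balanced analysis (recommended default)**
- **Weight configuration**: `ensemble_weight = 0.4-0.6`
- **Use cases**: routine analysis, comprehensive evaluation, standard research
- **Advantage**: the best balance between sensitivity and specificity
- **Recommendation**: the first choice for most research scenarios
```python
# Balanced configuration example
results = predict_prf(
    sequence=sequence,
    ensemble_weight=0.4,  # optimal balance ratio
    short_threshold=0.1   # standard threshold
)
```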
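### 2. Threshold Selection
- **Short threshold**: typically 0.1-0.3; controls computational efficiency
- **Display thresholds**: 0.3-0.8; control what the visualization shows
- **Classification threshold**: 0.5 (standard); adjust based on validation data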
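Adjusting the classification threshold on validation data usually means checking how sensitivity and specificity move as the cutoff changes. The following is a small stdlib-only sketch with made-up probabilities; the function name `sensitivity_specificity` is hypothetical and the strict `>` comparison mirrors the accuracy check in Workflow 2.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(labels: Sequence[int],
                            probabilities: Sequence[float],
                            threshold: float = 0.5) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when calling a site positive if p > threshold."""
    tp = fn = tn = fp = 0
    for label, probability in zip(labels, probabilities):
        predicted = 1 if probability > threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up validation labels and ensemble probabilities, for illustration only.
labels = [1, 1, 0, 0]
probabilities = [0.9, 0.4, 0.6, 0.1]
for threshold in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(labels, probabilities, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Lowering the threshold can only keep or raise sensitivity and keep or lower specificity, so the right cutoff depends on which kind of error is more costly for the study.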
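### 3. Window Size Selection
- **Fine-grained analysis**: `window_size = 1` (every position)
- **Standard analysis**: `window_size = 3` (every third position, default)
- **Coarse analysis**: `window_size = 6-9` (faster, less detail)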
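The cost of each setting can be estimated by counting how many positions a scan would evaluate. This sketch assumes the scan starts at index 0, advances by `window_size`, and stops once a full 3 bp codon no longer fits; the exact boundaries used by FScanpy are not stated here, and `scan_positions` is a hypothetical helper.

```python
def scan_positions(sequence_length: int, window_size: int = 3) -> range:
    """Positions evaluated by a step-`window_size` scan (assumed boundaries)."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    return range(0, sequence_length - 2, window_size)


# Compare the workload for a hypothetical 3000 bp sequence.
for step in (1, 3, 9):
    print(f"window_size={step}: {len(scan_positions(3000, step))} positions")
```

Under these assumptions the number of short-model calls falls roughly in proportion to `window_size`, which is why larger steps are faster but can skip over a narrow peak.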
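## Troubleshooting
### Common Problems and Solutions
1. **Model loading errors**
```python
# Check the model directory
from FScanpy import PRFPredictor
predictor = PRFPredictor(model_dir='/custom/path')
```
2. **Memory problems with large sequences**
```python
# Use a larger window size to reduce the computational load
results = predict_prf(sequence=large_seq, window_size=9)
```
3. **Visualization problems**
```python
# Adjust the figure parameters
results, fig = plot_prf_prediction(
    sequence=seq,
    figsize=(20, 10),  # larger figure
    dpi=150            # lower resolution
)
```
4. **Input format problems**
```python
# Make sure the DataFrame has the correct format
import pandas as pd
data = pd.DataFrame({
    'Long_Sequence': sequences,  # use 'Long_Sequence' or '399bp'
    'sample_id': ids
})
```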
## Performance Optimization
### 1. Batch Processing
```python
# Process multiple sequences efficiently ("seq1" etc. are placeholders)
sequences = ["seq1", "seq2", "seq3", ...]
results = predict_prf(sequence=sequences, window_size=3)
```
### 2. Threshold Optimization
```python
# Use an appropriate short_threshold to skip unnecessary long-model calls
results = predict_prf(
    sequence=sequence,
    short_threshold=0.2  # higher threshold = faster processing
)
```
### 3. Memory Management
```python
# For very large datasets, process in chunks
chunk_size = 100
for i in range(0, len(large_dataset), chunk_size):
    chunk = large_dataset[i:i+chunk_size]
    chunk_results = predict_prf(data=chunk)
    # process chunk_results
```
## Citation
If you use FScanpy, please cite our paper: [Paper Link]