# FScanpy
## A Machine Learning-Based Framework for Programmed Ribosomal Frameshifting Prediction
[Chinese README](README_zh.md)
[Python](https://www.python.org/)
[License](LICENSE)

FScanpy is a Python package for predicting [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. It combines two machine learning models, HistGradientBoosting (HistGB) and BiLSTM-CNN, with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework.

## 🔧 Installation
### Prerequisites
- Python ≥ 3.9
- All dependencies are automatically installed
### Install via pip (Recommended)
```bash
pip install FScanpy
```
### Install from Source
```bash
git clone https://github.com/your-org/FScanpy-package.git
cd FScanpy-package
pip install -e .
```
## 🚀 Quick Start
### Basic Usage
```python
from FScanpy import predict_prf

# Simple sequence prediction
sequence = "ATGCGTACGTTAGC..."  # Your DNA sequence
results = predict_prf(sequence=sequence)

# View the first 10 predictions
print(results[['Position', 'Ensemble_Probability', 'Short_Probability', 'Long_Probability']].head(10))
```
### Visualization
```python
from FScanpy import plot_prf_prediction

# Generate prediction plot
results, fig = plot_prf_prediction(
    sequence=sequence,
    short_threshold=0.65,  # HistGB threshold
    long_threshold=0.8,    # BiLSTM-CNN threshold
    ensemble_weight=0.4,   # 40% Short, 60% Long
    title="PRF Prediction Results"
)
```
### Advanced Usage
```python
from FScanpy import PRFPredictor
import pandas as pd

# Create predictor instance
predictor = PRFPredictor()

# Batch prediction on pre-extracted regions
# (placeholder 399 bp sequences; substitute your own candidate regions)
data = pd.DataFrame({
    'Long_Sequence': ['ATGCGT' * 66 + 'ATG', 'GCTATAG' * 57]
})
results = predictor.predict_regions(data, ensemble_weight=0.4)

# Sequence-level prediction with custom parameters
results = predictor.predict_sequence(
    sequence=sequence,
    window_size=1,        # Step size for sliding window
    ensemble_weight=0.3,  # Model weighting
    short_threshold=0.5   # Filtering threshold
)
```
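
Because batch prediction expects 399 bp regions in a `Long_Sequence` column, it can help to check the inputs before building the DataFrame. The following is a minimal sketch in plain Python. The helper name `check_long_sequences` and the accepted alphabet (`A`, `C`, `G`, `T`, `U`, `N`) are assumptions made for illustration; this README does not state which characters FScanpy accepts.

```python
EXPECTED_LENGTH = 399            # region length stated in this README
VALID_BASES = set("ACGTUN")      # assumed alphabet, not confirmed by FScanpy


def check_long_sequences(sequences, expected_length=EXPECTED_LENGTH):
    """Return a list of (index, reason) pairs for unsuitable sequences."""
    problems = []
    for index, seq in enumerate(sequences):
        seq = seq.upper()
        if len(seq) != expected_length:
            problems.append((index, f"length {len(seq)} != {expected_length}"))
            continue
        unexpected = set(seq) - VALID_BASES
        if unexpected:
            chars = "".join(sorted(unexpected))
            problems.append((index, f"unexpected characters: {chars}"))
    return problems


# Placeholder regions: one valid, one too short
candidates = ["GCTATAG" * 57, "ATGCGT"]
for index, reason in check_long_sequences(candidates):
    print(f"sequence {index}: {reason}")
```

Sequences that pass the check can then be placed in the `Long_Sequence` column and passed to `predict_regions`.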
## 🎛️ Ensemble Weight Configuration
The `ensemble_weight` parameter sets the share of the final score contributed by the HistGB (Short) model; the BiLSTM-CNN (Long) model receives the remaining `1 - ensemble_weight`:

| ensemble_weight | HistGB Model | BiLSTM-CNN Model | Characteristics | Best For |
|----------------|-------------|------------------|-----------------|----------|
| **0.2-0.3** | 20-30% | 70-80% | **High specificity**, reduces false positives | Precise validation, clinical applications |
| **0.4** | 40% | 60% | **Optimal balance**, highest AUC | Standard analysis (recommended) |
| **0.6-0.8** | 60-80% | 20-40% | **High sensitivity**, captures more sites | High-throughput screening, exploratory research |
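
As a rough illustration of how the weight shifts the final score, the sketch below assumes the ensemble is a simple weighted average, with `ensemble_weight` applied to the HistGB (Short) probability and the remainder applied to the BiLSTM-CNN (Long) probability. The function name `ensemble_probability` and the linear form are assumptions made for illustration; FScanpy's internal combination rule may differ.

```python
def ensemble_probability(short_prob, long_prob, ensemble_weight=0.4):
    """Weighted average of the two model outputs (assumed linear form)."""
    if not 0.0 <= ensemble_weight <= 1.0:
        raise ValueError("ensemble_weight must be between 0.0 and 1.0")
    return ensemble_weight * short_prob + (1.0 - ensemble_weight) * long_prob


# Made-up probabilities for one site where the two models disagree
short_prob, long_prob = 0.30, 0.90
for weight in (0.25, 0.4, 0.7):
    score = ensemble_probability(short_prob, long_prob, weight)
    print(f"ensemble_weight={weight}: {score:.3f}")
```

Under this assumption, the same pair of model outputs yields a lower combined score as more weight moves to the model that gave the lower probability.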
### Model Characteristics
- **HistGB Model**: Excels at identifying true negatives, conservative predictions, low false positive rate
- **BiLSTM-CNN Model**: Excels at identifying true positives, sensitive predictions, captures more potential sites
### Weight Selection Examples
```python
from FScanpy import predict_prf

# High specificity configuration (25% HistGB, 75% BiLSTM-CNN)
precise_results = predict_prf(sequence, ensemble_weight=0.25)

# Optimal balance configuration (40% HistGB, 60% BiLSTM-CNN)
balanced_results = predict_prf(sequence, ensemble_weight=0.4)

# High sensitivity configuration (70% HistGB, 30% BiLSTM-CNN)
sensitive_results = predict_prf(sequence, ensemble_weight=0.7)
```
## 📊 Core Functions
### Main Prediction Interface
```python
predict_prf(
    sequence=None,        # Single/multiple sequences or None
    data=None,            # DataFrame with 399bp sequences or None
    window_size=3,        # Sliding window step size
    short_threshold=0.1,  # Short model filtering threshold
    ensemble_weight=0.4,  # Short model weight (0.0-1.0)
    model_dir=None        # Custom model directory
)
```
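
To make the role of `window_size` concrete, here is a minimal sketch of a sliding-window scan in plain Python. It is not FScanpy's implementation: the helper name `scan_windows`, the 0-based positions, centering each window on the scanned position, and padding the sequence ends with `N` are all assumptions chosen for illustration. Only the step-size meaning of `window_size` and the 33 bp Short-model input length come from this README.

```python
def scan_windows(sequence, step=3, length=33, pad="N"):
    """Yield (position, window) pairs, moving `step` nucleotides at a time.

    Each window has `length` characters and is centered on `position`
    (0-based). Ends are padded with `pad` so every window is full length.
    Centering and padding are illustrative assumptions.
    """
    if step < 1 or length < 1 or length % 2 == 0:
        raise ValueError("step must be >= 1 and length must be a positive odd number")
    half = length // 2
    padded = pad * half + sequence + pad * half
    for position in range(0, len(sequence), step):
        # Index `position` in the original sequence sits at `position + half`
        # in the padded string, so this slice is centered on it.
        yield position, padded[position:position + length]


# Small demonstration with a short window so the output is easy to read
for position, window in scan_windows("ACGTACGTAC", step=3, length=5):
    print(position, window)
```

With a step of 1 every position is scanned; a step of 3 scans every third position and so evaluates roughly a third as many windows.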
### Visualization Function
```python
plot_prf_prediction(
    sequence,              # Input DNA sequence
    window_size=3,         # Scanning step size
    short_threshold=0.65,  # Short model threshold for plotting
    long_threshold=0.8,    # Long model threshold for plotting
    ensemble_weight=0.4,   # Model weighting
    title=None,            # Plot title
    save_path=None,        # Save file path
    figsize=(12, 8),       # Figure size
    dpi=300                # Resolution for saved plots
)
```
### PRFPredictor Class Methods
```python
predictor = PRFPredictor()

# Sequence prediction (sliding window)
predictor.predict_sequence(sequence, ensemble_weight=0.4)

# Region prediction (batch processing)
predictor.predict_regions(dataframe, ensemble_weight=0.4)

# Feature extraction
predictor.extract_features(sequences)

# Model information
predictor.get_model_info()
```
## 📈 Output Fields
### Prediction Results
- **`Position`**: Position in the original sequence
- **`Ensemble_Probability`**: Final ensemble prediction (main result)
- **`Short_Probability`**: HistGradientBoosting prediction (0-1)
- **`Long_Probability`**: BiLSTM-CNN prediction (0-1)
- **`Ensemble_Weights`**: Model weight configuration used
### Sequence Information
- **`Short_Sequence`**: 33bp sequence for Short model
- **`Long_Sequence`**: 399bp sequence for Long model
- **`Codon`**: 3bp codon at the prediction position
- **`Sequence_ID`**: Identifier for multi-sequence inputs
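
The fields above can be post-processed like any tabular data. The sketch below ranks rows by `Ensemble_Probability` using plain Python dictionaries, such as those produced by calling pandas' `DataFrame.to_dict("records")` on a results table. The helper name `top_candidates`, the 0.5 cutoff, and the example rows are illustrative assumptions, not FScanpy output or recommended settings.

```python
def top_candidates(records, min_probability=0.5, limit=10):
    """Keep rows at or above the cutoff, highest ensemble score first."""
    kept = [row for row in records if row["Ensemble_Probability"] >= min_probability]
    kept.sort(key=lambda row: row["Ensemble_Probability"], reverse=True)
    return kept[:limit]


# Made-up rows using the field names listed above
records = [
    {"Position": 12, "Ensemble_Probability": 0.31},
    {"Position": 45, "Ensemble_Probability": 0.88},
    {"Position": 78, "Ensemble_Probability": 0.57},
]

for row in top_candidates(records):
    print(row["Position"], row["Ensemble_Probability"])
```

Raising `min_probability` trades fewer candidates for higher confidence in each one.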
## 🔬 Integration with FScanR
FScanpy can score PRF candidate regions identified by the FScanR pipeline, combining BLASTX-based detection with model-based prediction:
```python
from FScanpy import fscanr, extract_prf_regions, predict_prf

# Step 1: BLASTX analysis with FScanR
blastx_results = fscanr(
    blastx_data,
    mismatch_cutoff=10,
    evalue_cutoff=1e-5,
    frameDist_cutoff=10
)

# Step 2: Extract PRF candidate regions
prf_regions = extract_prf_regions(original_sequence, blastx_results)

# Step 3: Predict with FScanpy
final_predictions = predict_prf(data=prf_regions, ensemble_weight=0.4)
```
## 📚 Documentation
- **[Complete Tutorial](tutorial/tutorial.md)**: Comprehensive usage guide with examples
- **[Demo Notebook](FScanpy_Demo.ipynb)**: Worked examples of each library function and a demonstration of the analysis workflow
- **[Predict Sample Interpretation](tutorial/predict_sample.ipynb)**: Detailed interpretation of FScanpy's plots and signal analysis
## 📝 Citation
If you use FScanpy in your research, please cite:
```bibtex
```
## 🏗️ Dependencies
FScanpy automatically installs all required dependencies:
- `numpy>=1.24.3`
- `pandas>=2.2.3`
- `tensorflow>=2.10.1`
- `scikit-learn>=1.6.0`
- `matplotlib>=3.9.4`
- `joblib>=1.4.2`
- `biopython>=1.85`
- `wrapt>=1.17.0`
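
If an existing environment behaves unexpectedly, the installed versions can be compared against the minimums listed above. This is a minimal sketch using only the standard library's `importlib.metadata`. The helper names `version_tuple` and `find_outdated` are hypothetical, and the comparison looks only at the leading numeric release segment, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata

# Minimum versions copied from the list above
MINIMUM_VERSIONS = {
    "numpy": "1.24.3",
    "pandas": "2.2.3",
    "tensorflow": "2.10.1",
    "scikit-learn": "1.6.0",
    "matplotlib": "3.9.4",
    "joblib": "1.4.2",
    "biopython": "1.85",
    "wrapt": "1.17.0",
}


def version_tuple(version, width=3):
    """Convert '1.24.3' to (1, 24, 3), padding short versions with zeros."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    parts = [int(p) for p in match.group(0).split(".")] if match else []
    parts = parts[:width]
    return tuple(parts + [0] * (width - len(parts)))


def find_outdated(minimums, get_version=metadata.version):
    """Return {package: reason} for missing or too-old packages."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if version_tuple(installed) < version_tuple(minimum):
            report[name] = f"{installed} < {minimum}"
    return report


print(find_outdated(MINIMUM_VERSIONS))
```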
---
**FScanpy** - Advancing programmed ribosomal frameshifting research through machine learning 🧬