# FScanpy
## A Machine Learning-Based Framework for Programmed Ribosomal Frameshifting Prediction
FScanpy is a Python package for predicting [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. It combines two machine learning models (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework to predict PRF sites. Input sequences must be in the positive (5' to 3') orientation.

For detailed documentation and usage examples, please refer to our [tutorial](tutorial/tutorial.md).
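Because input must be in the positive (5' to 3') orientation, a sequence taken from the minus strand needs to be reverse-complemented before prediction. The helper below is a minimal sketch of that step; `reverse_complement` is a hypothetical name for illustration and is not part of the FScanpy API.

```python
# Hypothetical helper (not part of FScanpy): convert a minus-strand DNA
# sequence into its plus-strand, 5'->3' equivalent.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (N is kept as N)."""
    return seq.translate(_COMPLEMENT)[::-1]

minus_strand = "ACGTACGCAT"
plus_strand = reverse_complement(minus_strand)
print(plus_strand)  # ATGCGTACGT
```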
## 🚀 What's New in v0.3.0
### Model Naming Optimization
- **Short Model** (`short.pkl`): HistGradientBoosting model for rapid screening
- **Long Model** (`long.pkl`): BiLSTM-CNN model for detailed analysis
- **Unified Interface**: Consistent parameter naming and clearer output fields
### Performance Improvements
- **Faster Prediction**: Optimized model type detection and reduced redundant operations
- **Better Error Handling**: More informative error messages and robust exception handling
- **Code Quality**: Reduced code duplication and improved maintainability
### 🎨 New Visualization Features
- **Sequence Plotting**: Built-in function for visualizing PRF prediction results
- **Dual Threshold Filtering**: Separate filtering for Short and Long models
- **Interactive Graphics**: Heatmap and bar chart visualization
- **Export Options**: Support for PNG and PDF output formats
### ⚖️ Ensemble Weighting System
- **Flexible Ensemble**: Control the contribution of Short and Long models
- **Weight Validation**: Automatic parameter validation and error handling
- **Clear Naming**: `ensemble_weight` parameter for intuitive usage
- **Visual Feedback**: Weight ratios displayed in plots and results
### 🔧 API Improvements
- **Method Renaming**: More intuitive method names
  - `predict_sequence()`: Replaces `predict_full()` for sequence prediction
  - `predict_regions()`: Replaces `predict_region()` for batch prediction
- **Field Standardization**: Consistent output field naming
  - `Ensemble_Probability`: Main prediction result (replaces `Voting_Probability`)
  - `Short_Sequence` / `Long_Sequence`: Clear sequence field names
- **Backward Compatibility**: Deprecated methods still work with warnings
## Core Features
- **Sequence Feature Extraction**: Extracts base composition, k-mer, and positional features from nucleic acid sequences to support prediction.
- **Frameshift Hotspot Region Prediction**: Predicts potential PRF sites in nucleotide sequences using machine learning models.
- **Cross-Species Support**: Built-in databases for viruses, marine phages, Euplotes, etc., enabling PRF prediction across various species.
- **Visualization Tools**: Built-in plotting functions for result visualization and analysis.
- **Ensemble Modeling**: Customizable ensemble weights for different prediction strategies.
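As an illustration of the base composition and k-mer features mentioned above, the following sketch computes both for a short window. The function names are hypothetical, and the features FScanpy actually extracts internally may differ.

```python
from collections import Counter

def base_composition(seq: str) -> dict:
    """Fraction of each canonical base (A, C, G, T) in the sequence."""
    seq = seq.upper()
    n = len(seq)
    return {base: (seq.count(base) / n if n else 0.0) for base in "ACGT"}

def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

window = "ATGCGTACGT"
print(base_composition(window))
print(kmer_counts(window, k=3).most_common(1))  # [('CGT', 2)]
```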
## Main Advantages
- **High Accuracy**: Integrates multiple machine learning models to provide accurate PRF site predictions.
- **Efficiency**: Utilizes a sliding window approach and feature extraction techniques to rapidly scan sequences.
- **Versatility**: Supports PRF prediction across various species and can be combined with the [FScanR](https://github.com/seanchen607/FScanR.git) framework for enhanced accuracy.
- **User-Friendly**: Comes with detailed documentation and usage examples, making it easy for researchers to use.
- **Flexible**: Offers multiple prediction resolutions to suit different use cases.
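The sliding window approach mentioned under Efficiency can be pictured with the sketch below. It assumes windows are centered on each scanned position and padded with `N` at the sequence ends; the step size, centering, and padding that FScanpy actually uses are not stated here, so treat all three as assumptions. The 33 bp size matches the Short model input described in this README, and the same function works for 399 bp Long model windows.

```python
def sliding_windows(seq: str, size: int, step: int = 1, pad: str = "N"):
    """Yield (position, window) pairs, each window centered on `position`.

    Assumes an odd `size` so that the center is well defined.
    """
    half = size // 2
    padded = pad * half + seq.upper() + pad * half
    for pos in range(0, len(seq), step):
        # padded[pos + half] is seq[pos], so the slice is centered on pos
        yield pos, padded[pos:pos + size]

seq = "ATGCGTACGT"
for pos, window in sliding_windows(seq, size=33, step=3):
    print(pos, window)
```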
## Quick Start
### Basic Prediction
```python
from FScanpy import predict_prf
# Single sequence prediction with default ensemble weights (0.4:0.6)
sequence = "ATGCGTACGT..."
results = predict_prf(sequence=sequence)
print(results[['Position', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']].head())
```
### Custom Ensemble Weighting
```python
# Adjust model weights for different prediction strategies
results_long_dominant = predict_prf(sequence=sequence, ensemble_weight=0.3) # 3:7 ratio (Long dominant)
results_equal_weight = predict_prf(sequence=sequence, ensemble_weight=0.5) # 5:5 ratio (Equal weight)
results_short_dominant = predict_prf(sequence=sequence, ensemble_weight=0.7) # 7:3 ratio (Short dominant)
# Compare ensemble probabilities
print("Long dominant:", results_long_dominant['Ensemble_Probability'].mean())
print("Equal weight:", results_equal_weight['Ensemble_Probability'].mean())
print("Short dominant:", results_short_dominant['Ensemble_Probability'].mean())
```
### Visualization with Custom Weights
```python
from FScanpy import plot_prf_prediction
import matplotlib.pyplot as plt
# Generate prediction plot with custom ensemble weighting
sequence = "ATGCGTACGT..."
results, fig = plot_prf_prediction(
    sequence=sequence,
    short_threshold=0.65,  # HistGB threshold
    long_threshold=0.8,    # BiLSTM-CNN threshold
    ensemble_weight=0.3,   # Custom weight: 30% Short, 70% Long
    title="Long-Dominant Ensemble PRF Prediction (3:7)",
    save_path="prediction_result.png"
)
plt.show()
```
### Advanced Usage with New API
```python
from FScanpy import PRFPredictor
import matplotlib.pyplot as plt
# Create predictor instance
predictor = PRFPredictor()
sequence = "ATGCGTACGT..."
# Use new sequence prediction method
results = predictor.predict_sequence(
    sequence=sequence,
    ensemble_weight=0.4
)
# Compare different ensemble configurations
weights = [0.2, 0.4, 0.6, 0.8]
weight_names = ["Long 80%", "Balanced", "Short 60%", "Short 80%"]
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()
# One subplot per ensemble weight
for i, (weight, name) in enumerate(zip(weights, weight_names)):
    results = predictor.predict_sequence(sequence=sequence, ensemble_weight=weight)
    ax = axes[i]
    ax.bar(results['Position'], results['Ensemble_Probability'], alpha=0.7)
    ax.set_title(f'{name} (Weight: {weight:.1f}:{1-weight:.1f})')
    ax.set_ylabel('Probability')
plt.tight_layout()
plt.show()
```
### Batch Region Prediction
```python
# Predict multiple 399bp sequences
import pandas as pd
from FScanpy import predict_prf
data = pd.DataFrame({
    # Placeholder 399bp sequences built by repeating a short motif
    'Long_Sequence': [('ATGCGT' * 67)[:399], 'GCTATAG' * 57]
})
results = predict_prf(data=data, ensemble_weight=0.4)
print(results[['Ensemble_Probability', 'Ensemble_Weights']].head())
```
## Installation Requirements
- Python ≥ 3.7
- Dependencies are automatically handled during installation
### Option 1: Install via pip
```bash
pip install FScanpy
```
### Option 2: Install from source
```bash
git clone git@60.204.158.188:yyh/FScanpy-package.git
cd FScanpy-package
pip install -e .
```
## 🔄 Migration from Previous Versions
### API Changes Summary
```python
# OLD API (deprecated but still works)
results = predict_prf(sequence="ATGC...", short_weight=0.4)
results = predictor.predict_full(sequence, short_weight=0.4)
results = predictor.predict_region(sequences, short_weight=0.4)
# NEW API (recommended)
results = predict_prf(sequence="ATGC...", ensemble_weight=0.4)
results = predictor.predict_sequence(sequence, ensemble_weight=0.4)
results = predictor.predict_regions(sequences, ensemble_weight=0.4)
# Output field changes
# OLD: 'Voting_Probability', 'Weight_Info', '33bp', '399bp'
# NEW: 'Ensemble_Probability', 'Ensemble_Weights', 'Short_Sequence', 'Long_Sequence'
# Visualization with ensemble weights
results, fig = plot_prf_prediction(
    sequence="ATGC...",
    short_threshold=0.65,
    long_threshold=0.8,
    ensemble_weight=0.3  # 30% Short, 70% Long
)
```
### Backward Compatibility
- All old methods still work but will show deprecation warnings
- Old field names are automatically added for compatibility
- Gradual migration is supported
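The general pattern behind "deprecated methods still work with warnings" can be sketched with Python's standard `warnings` module. The functions below are simplified stand-ins with hypothetical bodies, not FScanpy's actual implementation; they only show how an old name and an old parameter can be forwarded to the new ones.

```python
import warnings

def predict_sequence(sequence, ensemble_weight=0.4):
    """Stand-in for the new-style method (hypothetical body)."""
    return {"sequence": sequence, "ensemble_weight": ensemble_weight}

def predict_full(sequence, short_weight=0.4):
    """Deprecated alias: warn, then forward to predict_sequence()."""
    warnings.warn(
        "predict_full() is deprecated; use predict_sequence(ensemble_weight=...)",
        DeprecationWarning,
        stacklevel=2,
    )
    return predict_sequence(sequence, ensemble_weight=short_weight)

# Record warnings so the deprecation notice can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = predict_full("ATGCGTACGT", short_weight=0.3)

print(result["ensemble_weight"], caught[0].category.__name__)
```

During migration, running a script with `python -W error::DeprecationWarning` turns such warnings into errors, which makes remaining uses of the old API easy to locate.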
## Ensemble Weight Configuration Guide
### Recommended Weights for Different Scenarios:
| Scenario | ensemble_weight | Description | Use Case |
|----------|----------------|-------------|----------|
| **High Sensitivity** | 0.2-0.3 | Long model dominant | Detecting subtle PRF sites |
| **Balanced Detection** | 0.4-0.5 | Balanced ensemble (recommended) | General purpose prediction |
| **Fast Screening** | 0.6-0.7 | Short model dominant | Rapid initial screening |
| **Equal Contribution** | 0.5 | Equal weight to both models | Comparative analysis |
### Weight Selection Guidelines:
- **Low ensemble_weight (0.2-0.3)**:
  - Emphasizes Long model (BiLSTM-CNN)
  - Better for detecting complex patterns
  - Higher sensitivity, may have more false positives
- **High ensemble_weight (0.6-0.8)**:
  - Emphasizes Short model (HistGB)
  - Faster computation
  - Good for initial screening
  - Higher specificity, may miss subtle sites
- **Balanced (0.4-0.5)**:
  - Recommended for most applications
  - Good balance of sensitivity and specificity
  - Suitable for comprehensive analysis
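To make the effect of `ensemble_weight` concrete, the sketch below assumes the ensemble is a plain weighted average in which `ensemble_weight` is the share given to the Short model and the remainder goes to the Long model. This reading is consistent with the "30% Short, 70% Long" comments in this README, but the exact formula used by FScanpy is an assumption here, and `ensemble_probability` is a hypothetical helper.

```python
def ensemble_probability(short_prob: float, long_prob: float,
                         ensemble_weight: float = 0.4) -> float:
    """Weighted average of the two model outputs (assumed formula)."""
    if not 0.0 <= ensemble_weight <= 1.0:
        raise ValueError("ensemble_weight must be between 0 and 1")
    return ensemble_weight * short_prob + (1.0 - ensemble_weight) * long_prob

# Illustrative probabilities only: Short model says 0.9, Long model says 0.5
for weight in (0.3, 0.5, 0.7):
    print(weight, round(ensemble_probability(0.9, 0.5, weight), 2))
```

With these illustrative inputs the combined value moves from 0.62 to 0.70 to 0.78 as the weight shifts toward the Short model.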
## Output Field Reference
### Main Prediction Fields
- **`Short_Probability`**: HistGradientBoosting model prediction (0-1)
- **`Long_Probability`**: BiLSTM-CNN model prediction (0-1)
- **`Ensemble_Probability`**: Final ensemble prediction (primary result)
- **`Ensemble_Weights`**: Weight configuration information
### Sequence Fields
- **`Short_Sequence`**: 33bp sequence used by Short model
- **`Long_Sequence`**: 399bp sequence used by Long model
- **`Position`**: Position in the original sequence
- **`Codon`**: 3bp codon at the position
### Metadata Fields
- **`Sequence_ID`**: Identifier for multi-sequence predictions
- Additional fields from input DataFrame (for region predictions)
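The fields above can be combined to filter results by both model thresholds. The sketch below uses plain dictionaries with made-up values in place of a results DataFrame, and it assumes a position is kept only when it passes both thresholds; how `plot_prf_prediction` applies `short_threshold` and `long_threshold` internally is not specified here.

```python
# Made-up rows standing in for a results table (illustration only)
rows = [
    {"Position": 12, "Short_Probability": 0.70, "Long_Probability": 0.85},
    {"Position": 45, "Short_Probability": 0.90, "Long_Probability": 0.40},
    {"Position": 78, "Short_Probability": 0.30, "Long_Probability": 0.95},
]

def passes_both(row, short_threshold=0.65, long_threshold=0.8):
    """True when the row meets both the Short and Long model thresholds."""
    return (row["Short_Probability"] >= short_threshold
            and row["Long_Probability"] >= long_threshold)

hits = [row["Position"] for row in rows if passes_both(row)]
print(hits)  # [12]
```

On a pandas DataFrame the same condition can be written as a boolean mask over the `Short_Probability` and `Long_Probability` columns.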
## Examples
See `example_plot_prediction.py` for comprehensive examples of:
- Basic prediction plotting
- Custom threshold configuration
- Ensemble weight parameter usage and comparison
- New API method demonstrations
- Saving plots to files
- Advanced visualization options
## Authors
## Citation
If you utilize FScanpy in your research, please cite our work:
```bibtex
[Citation details will be added upon publication]
```