260 lines
10 KiB
Markdown
260 lines
10 KiB
Markdown
# FScanpy
|
|
## A Machine Learning-Based Framework for Programmed Ribosomal Frameshifting Prediction
|
|
|
|
FScanpy is a comprehensive Python package designed for the prediction of [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. By integrating advanced machine learning approaches (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework, FScanpy provides robust and accurate PRF site predictions. The package requires input sequences to be in the positive (5' to 3') orientation.
|
|
|
|

|
|
|
|
For detailed documentation and usage examples, please refer to our [tutorial](tutorial/tutorial.md).
|
|
|
|
## 🚀 What's New in v0.3.0
|
|
|
|
### Model Naming Optimization
|
|
- **Short Model** (`short.pkl`): HistGradientBoosting model for rapid screening
|
|
- **Long Model** (`long.pkl`): BiLSTM-CNN model for detailed analysis
|
|
- **Unified Interface**: Consistent parameter naming and clearer output fields
|
|
|
|
### Performance Improvements
|
|
- **Faster Prediction**: Optimized model type detection and reduced redundant operations
|
|
- **Better Error Handling**: More informative error messages and robust exception handling
|
|
- **Code Quality**: Reduced code duplication and improved maintainability
|
|
|
|
### 🎨 New Visualization Features
|
|
- **Sequence Plotting**: Built-in function for visualizing PRF prediction results
|
|
- **Dual Threshold Filtering**: Separate filtering for Short and Long models
|
|
- **Interactive Graphics**: Heatmap and bar chart visualization
|
|
- **Export Options**: Support for PNG and PDF output formats
|
|
|
|
### ⚖️ Ensemble Weighting System
|
|
- **Flexible Ensemble**: Control the contribution of Short and Long models
|
|
- **Weight Validation**: Automatic parameter validation and error handling
|
|
- **Clear Naming**: `ensemble_weight` parameter for intuitive usage
|
|
- **Visual Feedback**: Weight ratios displayed in plots and results
|
|
|
|
### 🔧 API Improvements
|
|
- **Method Renaming**: More intuitive method names
|
|
- `predict_sequence()`: Replaces `predict_full()` for sequence prediction
|
|
- `predict_regions()`: Replaces `predict_region()` for batch prediction
|
|
- **Field Standardization**: Consistent output field naming
|
|
- `Ensemble_Probability`: Main prediction result (replaces `Voting_Probability`)
|
|
- `Short_Sequence` / `Long_Sequence`: Clear sequence field names
|
|
- **Backward Compatibility**: Deprecated methods still work with warnings
|
|
|
|
## Core Features
|
|
- **Sequence Feature Extraction**: Support for extracting features from nucleic acid sequences, including base composition, k-mer features, and positional features.
|
|
- **Frameshift Hotspot Region Prediction**: Predict potential PRF sites in nucleotide sequences using machine learning models.
|
|
- **Feature Extraction**: Extract relevant features from sequences to assist in prediction.
|
|
- **Cross-Species Support**: Built-in databases for viruses, marine phages, Euplotes, etc., enabling PRF prediction across various species.
|
|
- **Visualization Tools**: Built-in plotting functions for result visualization and analysis.
|
|
- **Ensemble Modeling**: Customizable ensemble weights for different prediction strategies.
|
|
|
|
## Main Advantages
|
|
- **High Accuracy**: Integrates multiple machine learning models to provide accurate PRF site predictions.
|
|
- **Efficiency**: Utilizes a sliding window approach and feature extraction techniques to rapidly scan sequences.
|
|
- **Versatility**: Supports PRF prediction across various species and can be combined with the [FScanR](https://github.com/seanchen607/FScanR.git) framework for enhanced accuracy.
|
|
- **User-Friendly**: Comes with detailed documentation and usage examples, making it easy for researchers to use.
|
|
- **Flexible**: Provides different resolutions to suit different using situations.
|
|
|
|
## Quick Start
|
|
|
|
### Basic Prediction
|
|
```python
|
|
from FScanpy import predict_prf
|
|
|
|
# Single sequence prediction with default ensemble weights (0.4:0.6)
|
|
sequence = "ATGCGTACGT..."
|
|
results = predict_prf(sequence=sequence)
|
|
print(results[['Position', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']].head())
|
|
```
|
|
|
|
### Custom Ensemble Weighting
|
|
```python
|
|
# Adjust model weights for different prediction strategies
|
|
results_long_dominant = predict_prf(sequence=sequence, ensemble_weight=0.3) # 3:7 ratio (Long dominant)
|
|
results_equal_weight = predict_prf(sequence=sequence, ensemble_weight=0.5) # 5:5 ratio (Equal weight)
|
|
results_short_dominant = predict_prf(sequence=sequence, ensemble_weight=0.7) # 7:3 ratio (Short dominant)
|
|
|
|
# Compare ensemble probabilities
|
|
print("Long dominant:", results_long_dominant['Ensemble_Probability'].mean())
|
|
print("Equal weight:", results_equal_weight['Ensemble_Probability'].mean())
|
|
print("Short dominant:", results_short_dominant['Ensemble_Probability'].mean())
|
|
```
|
|
|
|
### Visualization with Custom Weights
|
|
```python
|
|
from FScanpy import plot_prf_prediction
|
|
import matplotlib.pyplot as plt
|
|
|
|
# Generate prediction plot with custom ensemble weighting
|
|
sequence = "ATGCGTACGT..."
|
|
results, fig = plot_prf_prediction(
|
|
sequence=sequence,
|
|
short_threshold=0.65, # HistGB threshold
|
|
long_threshold=0.8, # BiLSTM-CNN threshold
|
|
ensemble_weight=0.3, # Custom weight: 30% Short, 70% Long
|
|
title="Long-Dominant Ensemble PRF Prediction (3:7)",
|
|
save_path="prediction_result.png"
|
|
)
|
|
|
|
plt.show()
|
|
```
|
|
|
|
### Advanced Usage with New API
|
|
```python
|
|
from FScanpy import PRFPredictor
|
|
import matplotlib.pyplot as plt
|
|
|
|
# Create predictor instance
|
|
predictor = PRFPredictor()
|
|
|
|
# Use new sequence prediction method
|
|
results = predictor.predict_sequence(
|
|
sequence=sequence,
|
|
ensemble_weight=0.4
|
|
)
|
|
|
|
# Compare different ensemble configurations
|
|
weights = [0.2, 0.4, 0.6, 0.8]
|
|
weight_names = ["Long 80%", "Balanced", "Short 60%", "Short 80%"]
|
|
|
|
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
|
|
axes = axes.flatten()
|
|
|
|
for i, (weight, name) in enumerate(zip(weights, weight_names)):
|
|
results = predictor.predict_sequence(sequence=sequence, ensemble_weight=weight)
|
|
ax = axes[i]
|
|
ax.bar(results['Position'], results['Ensemble_Probability'], alpha=0.7)
|
|
ax.set_title(f'{name} (Weight: {weight:.1f}:{1-weight:.1f})')
|
|
ax.set_ylabel('Probability')
|
|
|
|
plt.tight_layout()
|
|
plt.show()
|
|
```
|
|
|
|
### Batch Region Prediction
|
|
```python
|
|
# Predict multiple 399bp sequences
|
|
import pandas as pd
|
|
|
|
data = pd.DataFrame({
|
|
'Long_Sequence': ['ATGCGT...' * 60, 'GCTATAG...' * 57] # 399bp sequences
|
|
})
|
|
|
|
results = predict_prf(data=data, ensemble_weight=0.4)
|
|
print(results[['Ensemble_Probability', 'Ensemble_Weights']].head())
|
|
```
|
|
|
|
## Installation Requirements
|
|
- Python ≥ 3.7
|
|
- Dependencies are automatically handled during installation
|
|
|
|
### Option 1: Install via pip
|
|
```bash
|
|
pip install FScanpy
|
|
```
|
|
|
|
### Option 2: Install from source
|
|
```bash
|
|
git clone git@60.204.158.188:yyh/FScanpy-package.git
|
|
cd FScanpy-package
|
|
pip install -e .
|
|
```
|
|
|
|
## 🔄 Migration from Previous Versions
|
|
|
|
### API Changes Summary
|
|
```python
|
|
# OLD API (deprecated but still works)
|
|
results = predict_prf(sequence="ATGC...", short_weight=0.4)
|
|
results = predictor.predict_full(sequence, short_weight=0.4)
|
|
results = predictor.predict_region(sequences, short_weight=0.4)
|
|
|
|
# NEW API (recommended)
|
|
results = predict_prf(sequence="ATGC...", ensemble_weight=0.4)
|
|
results = predictor.predict_sequence(sequence, ensemble_weight=0.4)
|
|
results = predictor.predict_regions(sequences, ensemble_weight=0.4)
|
|
|
|
# Output field changes
|
|
# OLD: 'Voting_Probability', 'Weight_Info', '33bp', '399bp'
|
|
# NEW: 'Ensemble_Probability', 'Ensemble_Weights', 'Short_Sequence', 'Long_Sequence'
|
|
|
|
# Visualization with ensemble weights
|
|
results, fig = plot_prf_prediction(
|
|
sequence="ATGC...",
|
|
short_threshold=0.65,
|
|
long_threshold=0.8,
|
|
ensemble_weight=0.3 # 30% Short, 70% Long
|
|
)
|
|
```
|
|
|
|
### Backward Compatibility
|
|
- All old methods still work but will show deprecation warnings
|
|
- Old field names are automatically added for compatibility
|
|
- Gradual migration is supported
|
|
|
|
## Ensemble Weight Configuration Guide
|
|
|
|
### Recommended Weights for Different Scenarios:
|
|
|
|
| Scenario | ensemble_weight | Description | Use Case |
|
|
|----------|----------------|-------------|----------|
|
|
| **High Sensitivity** | 0.2-0.3 | Long model dominant | Detecting subtle PRF sites |
|
|
| **Balanced Detection** | 0.4-0.5 | Balanced ensemble (recommended) | General purpose prediction |
|
|
| **Fast Screening** | 0.6-0.7 | Short model dominant | Rapid initial screening |
|
|
| **Equal Contribution** | 0.5 | Equal weight to both models | Comparative analysis |
|
|
|
|
### Weight Selection Guidelines:
|
|
- **Low ensemble_weight (0.2-0.3)**:
|
|
- Emphasizes Long model (BiLSTM-CNN)
|
|
- Better for detecting complex patterns
|
|
- Higher sensitivity, may have more false positives
|
|
|
|
- **High ensemble_weight (0.6-0.8)**:
|
|
- Emphasizes Short model (HistGB)
|
|
- Faster computation
|
|
- Good for initial screening
|
|
- Higher specificity, may miss subtle sites
|
|
|
|
- **Balanced (0.4-0.5)**:
|
|
- Recommended for most applications
|
|
- Good balance of sensitivity and specificity
|
|
- Suitable for comprehensive analysis
|
|
|
|
## Output Field Reference
|
|
|
|
### Main Prediction Fields
|
|
- **`Short_Probability`**: HistGradientBoosting model prediction (0-1)
|
|
- **`Long_Probability`**: BiLSTM-CNN model prediction (0-1)
|
|
- **`Ensemble_Probability`**: Final ensemble prediction (primary result)
|
|
- **`Ensemble_Weights`**: Weight configuration information
|
|
|
|
### Sequence Fields
|
|
- **`Short_Sequence`**: 33bp sequence used by Short model
|
|
- **`Long_Sequence`**: 399bp sequence used by Long model
|
|
- **`Position`**: Position in the original sequence
|
|
- **`Codon`**: 3bp codon at the position
|
|
|
|
### Metadata Fields
|
|
- **`Sequence_ID`**: Identifier for multi-sequence predictions
|
|
- Additional fields from input DataFrame (for region predictions)
|
|
|
|
## Examples
|
|
|
|
See `example_plot_prediction.py` for comprehensive examples of:
|
|
- Basic prediction plotting
|
|
- Custom threshold configuration
|
|
- Ensemble weight parameter usage and comparison
|
|
- New API method demonstrations
|
|
- Saving plots to files
|
|
- Advanced visualization options
|
|
|
|
## Authors
|
|
|
|
|
|
## Citation
|
|
If you utilize FScanpy in your research, please cite our work:
|
|
|
|
```bibtex
|
|
[Citation details will be added upon publication]
|
|
```
|