# FScanpy
## A Machine Learning-Based Framework for Programmed Ribosomal Frameshifting Prediction

FScanpy is a comprehensive Python package designed for the prediction of [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. By integrating advanced machine learning approaches (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework, FScanpy provides robust and accurate PRF site predictions. The package requires input sequences to be in the positive (5' to 3') orientation.

![FScanpy Architecture](/tutorial/image/structure.jpeg)

For detailed documentation and usage examples, please refer to our [tutorial](tutorial/tutorial.md).

## 🚀 What's New in v0.3.0

### Model Naming Optimization
- **Short Model** (`short.pkl`): HistGradientBoosting model for rapid screening
- **Long Model** (`long.pkl`): BiLSTM-CNN model for detailed analysis
- **Unified Interface**: Consistent parameter naming and clearer output fields

### Performance Improvements
- **Faster Prediction**: Optimized model type detection and reduced redundant operations
- **Better Error Handling**: More informative error messages and robust exception handling
- **Code Quality**: Reduced code duplication and improved maintainability

### 🎨 New Visualization Features
- **Sequence Plotting**: Built-in function for visualizing PRF prediction results
- **Dual Threshold Filtering**: Separate filtering for Short and Long models
- **Interactive Graphics**: Heatmap and bar chart visualization
- **Export Options**: Support for PNG and PDF output formats

### ⚖️ Ensemble Weighting System
- **Flexible Ensemble**: Control the contribution of Short and Long models
- **Weight Validation**: Automatic parameter validation and error handling
- **Clear Naming**: `ensemble_weight` parameter for intuitive usage
- **Visual Feedback**: Weight ratios displayed in plots and results

### 🔧 API Improvements
- **Method Renaming**: More intuitive method names
  - `predict_sequence()`: Replaces `predict_full()` for sequence prediction
  - `predict_regions()`: Replaces `predict_region()` for batch prediction
- **Field Standardization**: Consistent output field naming
  - `Ensemble_Probability`: Main prediction result (replaces `Voting_Probability`)
  - `Short_Sequence` / `Long_Sequence`: Clear sequence field names
- **Backward Compatibility**: Deprecated methods still work with warnings

## Core Features
- **Sequence Feature Extraction**: Extracts features from nucleic acid sequences, including base composition, k-mer features, and positional features, for use in prediction.
- **Frameshift Hotspot Region Prediction**: Predicts potential PRF sites in nucleotide sequences using machine learning models.
- **Cross-Species Support**: Built-in databases for viruses, marine phages, Euplotes, etc., enabling PRF prediction across various species.
- **Visualization Tools**: Built-in plotting functions for result visualization and analysis.
- **Ensemble Modeling**: Customizable ensemble weights for different prediction strategies.
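
To make the feature types named above concrete, the following is a minimal standalone sketch of base composition and k-mer counting. It is not FScanpy's internal feature extractor; the function names `base_composition` and `kmer_counts` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product


def base_composition(seq):
    """Return the fraction of A, C, G and T in seq (other symbols are ignored)."""
    seq = seq.upper()
    counts = Counter(base for base in seq if base in "ACGT")
    total = sum(counts.values())
    return {base: (counts[base] / total if total else 0.0) for base in "ACGT"}


def kmer_counts(seq, k=3):
    """Count overlapping k-mers; every possible ACGT k-mer gets a key."""
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing non-ACGT symbols
            counts[kmer] += 1
    return counts


composition = base_composition("ATGCGTACGT")
trimers = kmer_counts("ATGCGTACGT", k=3)
```

Giving every possible k-mer a key keeps the feature vector the same length for every input sequence, which is what most tabular models expect.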

## Main Advantages
- **High Accuracy**: Integrates multiple machine learning models to provide accurate PRF site predictions.
- **Efficiency**: Utilizes a sliding window approach and feature extraction techniques to rapidly scan sequences.
- **Versatility**: Supports PRF prediction across various species and can be combined with the [FScanR](https://github.com/seanchen607/FScanR.git) framework for enhanced accuracy.
- **User-Friendly**: Comes with detailed documentation and usage examples, making it easy for researchers to use.
- **Flexible**: Offers different prediction resolutions to suit different use cases.

## Quick Start

### Basic Prediction
```python
from FScanpy import predict_prf

# Single sequence prediction with default ensemble weights (0.4:0.6)
sequence = "ATGCGTACGT..."
results = predict_prf(sequence=sequence)
print(results[['Position', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']].head())
```

### Custom Ensemble Weighting
```python
from FScanpy import predict_prf

sequence = "ATGCGTACGT..."

# Adjust model weights for different prediction strategies
results_long_dominant = predict_prf(sequence=sequence, ensemble_weight=0.3)   # 3:7 ratio (Long dominant)
results_equal_weight = predict_prf(sequence=sequence, ensemble_weight=0.5)    # 5:5 ratio (Equal weight)
results_short_dominant = predict_prf(sequence=sequence, ensemble_weight=0.7)  # 7:3 ratio (Short dominant)

# Compare ensemble probabilities
print("Long dominant:", results_long_dominant['Ensemble_Probability'].mean())
print("Equal weight:", results_equal_weight['Ensemble_Probability'].mean())
print("Short dominant:", results_short_dominant['Ensemble_Probability'].mean())
```

### Visualization with Custom Weights
```python
from FScanpy import plot_prf_prediction
import matplotlib.pyplot as plt

# Generate prediction plot with custom ensemble weighting
sequence = "ATGCGTACGT..."
results, fig = plot_prf_prediction(
    sequence=sequence,
    short_threshold=0.65,     # HistGB threshold
    long_threshold=0.8,       # BiLSTM-CNN threshold
    ensemble_weight=0.3,      # Custom weight: 30% Short, 70% Long
    title="Long-Dominant Ensemble PRF Prediction (3:7)",
    save_path="prediction_result.png"
)

plt.show()
```
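
The two thresholds can also be applied to a result table by hand. The sketch below assumes a site is kept only when both model probabilities reach their thresholds; the rule FScanpy applies internally may differ. `filter_sites` is a hypothetical helper, and the records are made-up values, not real predictions.

```python
def filter_sites(records, short_threshold=0.65, long_threshold=0.8):
    """Keep records whose Short and Long probabilities both reach their thresholds."""
    return [
        r for r in records
        if r["Short_Probability"] >= short_threshold
        and r["Long_Probability"] >= long_threshold
    ]


# Made-up values purely for illustration; these are not real predictions.
records = [
    {"Position": 12, "Short_Probability": 0.70, "Long_Probability": 0.85},
    {"Position": 45, "Short_Probability": 0.90, "Long_Probability": 0.40},
    {"Position": 78, "Short_Probability": 0.30, "Long_Probability": 0.95},
]
kept = filter_sites(records)
```

If the results are held in a pandas DataFrame, `results.to_dict("records")` produces a list of this shape.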

### Advanced Usage with New API
```python
from FScanpy import PRFPredictor
import matplotlib.pyplot as plt

# Input sequence (placeholder; supply a full nucleotide sequence)
sequence = "ATGCGTACGT..."

# Create predictor instance
predictor = PRFPredictor()

# Use new sequence prediction method
results = predictor.predict_sequence(
    sequence=sequence,
    ensemble_weight=0.4
)

# Compare different ensemble configurations
weights = [0.2, 0.4, 0.6, 0.8]
weight_names = ["Long 80%", "Balanced", "Short 60%", "Short 80%"]

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()

for i, (weight, name) in enumerate(zip(weights, weight_names)):
    results = predictor.predict_sequence(sequence=sequence, ensemble_weight=weight)
    ax = axes[i]
    ax.bar(results['Position'], results['Ensemble_Probability'], alpha=0.7)
    ax.set_title(f'{name} (Weight: {weight:.1f}:{1-weight:.1f})')
    ax.set_ylabel('Probability')

plt.tight_layout()
plt.show()
```

### Batch Region Prediction
```python
# Predict multiple 399 bp sequences
import pandas as pd
from FScanpy import predict_prf

# Placeholder 399-nt sequences built from repeated motifs; replace with real regions
data = pd.DataFrame({
    'Long_Sequence': [('ATGCGT' * 67)[:399], 'GCTATAG' * 57]
})

results = predict_prf(data=data, ensemble_weight=0.4)
print(results[['Ensemble_Probability', 'Ensemble_Weights']].head())
```
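
Since region prediction works on 399 bp inputs, it can help to check sequences before submitting them. The sketch below is a hypothetical pre-check, not part of FScanpy; it assumes that exactly 399 nucleotides are expected and that only `A`, `C`, `G`, `T`, and `N` are acceptable symbols.

```python
def check_regions(sequences, expected_length=399, alphabet="ACGTN"):
    """Return (index, reason) pairs for sequences that look unsuitable."""
    problems = []
    for i, seq in enumerate(sequences):
        if len(seq) != expected_length:
            problems.append((i, f"length {len(seq)} != {expected_length}"))
        elif set(seq.upper()) - set(alphabet):
            problems.append((i, "unexpected characters"))
    return problems


# Placeholder sequences: one valid, one too short, one containing '.'
regions = ["GCTATAG" * 57, "ATGCGT" * 60, "GCTATA." * 57]
problems = check_regions(regions)
```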

## Installation Requirements
- Python ≥ 3.7
- Dependencies are automatically handled during installation

### Option 1: Install via pip
```bash
pip install FScanpy
```

### Option 2: Install from source
```bash
git clone git@60.204.158.188:yyh/FScanpy-package.git
cd FScanpy-package
pip install -e .
```

## 🔄 Migration from Previous Versions

### API Changes Summary
```python
# OLD API (deprecated but still works)
results = predict_prf(sequence="ATGC...", short_weight=0.4)
results = predictor.predict_full(sequence, short_weight=0.4)
results = predictor.predict_region(sequences, short_weight=0.4)

# NEW API (recommended)
results = predict_prf(sequence="ATGC...", ensemble_weight=0.4)
results = predictor.predict_sequence(sequence, ensemble_weight=0.4)
results = predictor.predict_regions(sequences, ensemble_weight=0.4)

# Output field changes
# OLD: 'Voting_Probability', 'Weight_Info', '33bp', '399bp'
# NEW: 'Ensemble_Probability', 'Ensemble_Weights', 'Short_Sequence', 'Long_Sequence'

# Visualization with ensemble weights
results, fig = plot_prf_prediction(
    sequence="ATGC...",
    short_threshold=0.65,
    long_threshold=0.8,
    ensemble_weight=0.3  # 30% Short, 70% Long
)
```

### Backward Compatibility
- All old methods still work but will show deprecation warnings
- Old field names are automatically added for compatibility
- Gradual migration is supported
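
Result tables saved with an earlier version can be brought in line with the new field names using the mapping listed above. `rename_fields` is a hypothetical helper that operates on a single record, and the example row is invented for illustration.

```python
# Old-to-new output field names, taken from the summary above
OLD_TO_NEW_FIELDS = {
    "Voting_Probability": "Ensemble_Probability",
    "Weight_Info": "Ensemble_Weights",
    "33bp": "Short_Sequence",
    "399bp": "Long_Sequence",
}


def rename_fields(record):
    """Return a copy of record with old field names replaced by the new ones."""
    return {OLD_TO_NEW_FIELDS.get(key, key): value for key, value in record.items()}


# Invented row for illustration only
old_row = {"Position": 12, "Voting_Probability": 0.5, "33bp": "ATG"}
new_row = rename_fields(old_row)
```

For a pandas DataFrame, the same dictionary can be passed to `DataFrame.rename(columns=OLD_TO_NEW_FIELDS)`.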

## Ensemble Weight Configuration Guide

### Recommended Weights for Different Scenarios

| Scenario | ensemble_weight | Description | Use Case |
|----------|----------------|-------------|----------|
| **High Sensitivity** | 0.2-0.3 | Long model dominant | Detecting subtle PRF sites |
| **Balanced Detection** | 0.4-0.5 | Balanced ensemble (recommended) | General purpose prediction |
| **Fast Screening** | 0.6-0.7 | Short model dominant | Rapid initial screening |
| **Equal Contribution** | 0.5 | Equal weight to both models | Comparative analysis |

### Weight Selection Guidelines
- **Low ensemble_weight (0.2-0.3)**:
  - Emphasizes Long model (BiLSTM-CNN)
  - Better for detecting complex patterns
  - Higher sensitivity, may have more false positives

- **High ensemble_weight (0.6-0.8)**:
  - Emphasizes Short model (HistGB)
  - Faster computation
  - Good for initial screening
  - Higher specificity, may miss subtle sites

- **Balanced (0.4-0.5)**:
  - Recommended for most applications
  - Good balance of sensitivity and specificity
  - Suitable for comprehensive analysis
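
One way to read the weight is as the Short model's share in a weighted average of the two probabilities. The sketch below assumes that form, which matches the "30% Short, 70% Long" description of `ensemble_weight=0.3` but is not confirmed against the package source; `combine_probabilities` is a hypothetical name.

```python
def combine_probabilities(short_prob, long_prob, ensemble_weight=0.4):
    """Weighted average: ensemble_weight goes to Short, the remainder to Long."""
    if not 0.0 <= ensemble_weight <= 1.0:
        raise ValueError("ensemble_weight must be between 0 and 1")
    return ensemble_weight * short_prob + (1.0 - ensemble_weight) * long_prob


# With a weight of 0.25, the Long model contributes three quarters of the score
long_dominant = combine_probabilities(0.5, 1.0, ensemble_weight=0.25)
```

Under this reading, a weight of 0 uses only the Long model and a weight of 1 uses only the Short model.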

## Output Field Reference

### Main Prediction Fields
- **`Short_Probability`**: HistGradientBoosting model prediction (0-1)
- **`Long_Probability`**: BiLSTM-CNN model prediction (0-1)
- **`Ensemble_Probability`**: Final ensemble prediction (primary result)
- **`Ensemble_Weights`**: Weight configuration information

### Sequence Fields
- **`Short_Sequence`**: 33bp sequence used by Short model
- **`Long_Sequence`**: 399bp sequence used by Long model
- **`Position`**: Position in the original sequence
- **`Codon`**: 3bp codon at the position

### Metadata Fields
- **`Sequence_ID`**: Identifier for multi-sequence predictions
- Additional fields from input DataFrame (for region predictions)

## Examples

See `example_plot_prediction.py` for comprehensive examples of:
- Basic prediction plotting
- Custom threshold configuration
- Ensemble weight parameter usage and comparison
- New API method demonstrations
- Saving plots to files
- Advanced visualization options

## Authors


## Citation
If you utilize FScanpy in your research, please cite our work:

```bibtex
[Citation details will be added upon publication]
```