# FScanpy
## A Machine Learning-Based Framework for Programmed Ribosomal Frameshifting Prediction

FScanpy is a Python package for predicting [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. It combines two machine learning models (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework. Input sequences must be in the positive (5' to 3') orientation.

![FScanpy Architecture](/tutorial/image/structure.jpeg)

For detailed documentation and usage examples, please refer to our [tutorial](tutorial/tutorial.md).

## 🚀 What's New in v0.3.0

### Model Naming Optimization
- **Short Model** (`short.pkl`): HistGradientBoosting model for rapid screening
- **Long Model** (`long.pkl`): BiLSTM-CNN model for detailed analysis
- **Unified Interface**: Consistent parameter naming and clearer output fields

### Performance Improvements
- **Faster Prediction**: Optimized model type detection and reduced redundant operations
- **Better Error Handling**: More informative error messages and robust exception handling
- **Code Quality**: Reduced code duplication and improved maintainability

### 🎨 New Visualization Features
- **Sequence Plotting**: Built-in function for visualizing PRF prediction results
- **Dual Threshold Filtering**: Separate filtering for Short and Long models
- **Interactive Graphics**: Heatmap and bar chart visualization
- **Export Options**: Support for PNG and PDF output formats

### ⚖️ Ensemble Weighting System
- **Flexible Ensemble**: Control the contribution of Short and Long models
- **Weight Validation**: Automatic parameter validation and error handling
- **Clear Naming**: `ensemble_weight` parameter for intuitive usage
- **Visual Feedback**: Weight ratios displayed in plots and results

### 🔧 API Improvements
- **Method Renaming**: More intuitive method names
  - `predict_sequence()`: Replaces `predict_full()` for sequence prediction
  - `predict_regions()`: Replaces `predict_region()` for batch prediction
- **Field Standardization**: Consistent output field naming
  - `Ensemble_Probability`: Main prediction result (replaces `Voting_Probability`)
  - `Short_Sequence` / `Long_Sequence`: Clear sequence field names
- **Backward Compatibility**: Deprecated methods still work with warnings

## Core Features
- **Sequence Feature Extraction**: Support for extracting features from nucleic acid sequences, including base composition, k-mer features, and positional features.
- **Frameshift Hotspot Region Prediction**: Predict potential PRF sites in nucleotide sequences using machine learning models.
- **Feature Extraction**: Extract relevant features from sequences to assist in prediction.
- **Cross-Species Support**: Built-in databases for viruses, marine phages, Euplotes, etc., enabling PRF prediction across various species.
- **Visualization Tools**: Built-in plotting functions for result visualization and analysis.
- **Ensemble Modeling**: Customizable ensemble weights for different prediction strategies.

## Main Advantages
- **High Accuracy**: Integrates multiple machine learning models to provide accurate PRF site predictions.
- **Efficiency**: Utilizes a sliding window approach and feature extraction techniques to rapidly scan sequences.
- **Versatility**: Supports PRF prediction across various species and can be combined with the [FScanR](https://github.com/seanchen607/FScanR.git) framework for enhanced accuracy.
- **User-Friendly**: Comes with detailed documentation and usage examples, making it easy for researchers to use.
- **Flexible**: Provides different resolutions to suit different use cases.

## Quick Start

### Basic Prediction
```python
from FScanpy import predict_prf

# Single sequence prediction with default ensemble weights (0.4:0.6)
sequence = "ATGCGTACGT..."
results = predict_prf(sequence=sequence)
print(results[['Position', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']].head())
```

### Custom Ensemble Weighting
```python
# Adjust model weights for different prediction strategies
results_long_dominant = predict_prf(sequence=sequence, ensemble_weight=0.3)   # 3:7 ratio (Long dominant)
results_equal_weight = predict_prf(sequence=sequence, ensemble_weight=0.5)    # 5:5 ratio (Equal weight)
results_short_dominant = predict_prf(sequence=sequence, ensemble_weight=0.7)  # 7:3 ratio (Short dominant)

# Compare ensemble probabilities
print("Long dominant:", results_long_dominant['Ensemble_Probability'].mean())
print("Equal weight:", results_equal_weight['Ensemble_Probability'].mean())
print("Short dominant:", results_short_dominant['Ensemble_Probability'].mean())
```

### Visualization with Custom Weights
```python
from FScanpy import plot_prf_prediction
import matplotlib.pyplot as plt

# Generate prediction plot with custom ensemble weighting
sequence = "ATGCGTACGT..."
results, fig = plot_prf_prediction(
    sequence=sequence,
    short_threshold=0.65,     # HistGB threshold
    long_threshold=0.8,       # BiLSTM-CNN threshold
    ensemble_weight=0.3,      # Custom weight: 30% Short, 70% Long
    title="Long-Dominant Ensemble PRF Prediction (3:7)",
    save_path="prediction_result.png"
)

plt.show()
```
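
The two thresholds passed to `plot_prf_prediction` can be pictured as a row filter over the prediction table. The sketch below is an illustration only. It uses plain dictionaries in place of the returned DataFrame, and it assumes a site is kept only when both model scores reach their thresholds; the package may combine them differently. `filter_sites` is a hypothetical helper, not part of FScanpy, and the scores are made up.

```python
def filter_sites(rows, short_threshold=0.65, long_threshold=0.8):
    """Keep rows whose Short and Long scores both reach their thresholds.

    Hypothetical helper for illustration; not an FScanpy function.
    """
    return [
        row for row in rows
        if row["Short_Probability"] >= short_threshold
        and row["Long_Probability"] >= long_threshold
    ]


# Made-up scores, for illustration only
rows = [
    {"Position": 12, "Short_Probability": 0.70, "Long_Probability": 0.85},
    {"Position": 45, "Short_Probability": 0.90, "Long_Probability": 0.40},
    {"Position": 78, "Short_Probability": 0.30, "Long_Probability": 0.95},
]

kept = filter_sites(rows)
print([row["Position"] for row in kept])
```

Lowering either threshold widens the set of reported sites, so the two values can be tuned independently for each model.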

### Advanced Usage with New API
```python
from FScanpy import PRFPredictor
import matplotlib.pyplot as plt

# Create predictor instance
predictor = PRFPredictor()

# Placeholder input; replace with a real nucleotide sequence
sequence = "ATGCGTACGT..."

# Use new sequence prediction method
results = predictor.predict_sequence(
    sequence=sequence,
    ensemble_weight=0.4
)

# Compare different ensemble configurations
weights = [0.2, 0.4, 0.6, 0.8]
weight_names = ["Long 80%", "Balanced", "Short 60%", "Short 80%"]

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# One subplot per weight; keep `results` from the call above intact
for ax, weight, name in zip(axes.flatten(), weights, weight_names):
    weighted = predictor.predict_sequence(sequence=sequence, ensemble_weight=weight)
    ax.bar(weighted['Position'], weighted['Ensemble_Probability'], alpha=0.7)
    ax.set_title(f'{name} (Weight: {weight:.1f}:{1-weight:.1f})')
    ax.set_ylabel('Probability')

plt.tight_layout()
plt.show()
```

### Batch Region Prediction
```python
# Predict multiple 399bp sequences
import pandas as pd

data = pd.DataFrame({
    # Placeholder sequences, each exactly 399 bp; replace with real regions
    'Long_Sequence': ['ATG' * 133, 'GCT' * 133]
})

results = predict_prf(data=data, ensemble_weight=0.4)
print(results[['Ensemble_Probability', 'Ensemble_Weights']].head())
```
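
Before building the input DataFrame, it can help to check that each candidate region really is 399 bp and contains only unambiguous bases. The sketch below is a hypothetical pre-check, not part of FScanpy; whether the package itself accepts other lengths or ambiguity codes such as `N` is not stated here, so treat the rules as assumptions.

```python
VALID_BASES = set("ACGT")


def check_region(seq, expected_length=399):
    """Return a list of problems found in one candidate region (empty if none).

    Hypothetical helper; the 399 bp length and ACGT-only alphabet are assumptions.
    """
    problems = []
    if len(seq) != expected_length:
        problems.append(f"length {len(seq)} != {expected_length}")
    unexpected = sorted(set(seq.upper()) - VALID_BASES)
    if unexpected:
        problems.append("unexpected characters: " + "".join(unexpected))
    return problems


print(check_region("ATG" * 133))  # a 399 bp placeholder region
print(check_region("ATGNNN"))     # too short and contains N
```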

## Installation Requirements
- Python ≥ 3.7
- Dependencies are automatically handled during installation

### Option 1: Install via pip
```bash
pip install FScanpy
```

### Option 2: Install from source
```bash
git clone git@60.204.158.188:yyh/FScanpy-package.git
cd FScanpy-package
pip install -e .
```

## 🔄 Migration from Previous Versions

### API Changes Summary
```python
# OLD API (deprecated but still works)
results = predict_prf(sequence="ATGC...", short_weight=0.4)
results = predictor.predict_full(sequence, short_weight=0.4)
results = predictor.predict_region(sequences, short_weight=0.4)

# NEW API (recommended)
results = predict_prf(sequence="ATGC...", ensemble_weight=0.4)
results = predictor.predict_sequence(sequence, ensemble_weight=0.4)
results = predictor.predict_regions(sequences, ensemble_weight=0.4)

# Output field changes
# OLD: 'Voting_Probability', 'Weight_Info', '33bp', '399bp'
# NEW: 'Ensemble_Probability', 'Ensemble_Weights', 'Short_Sequence', 'Long_Sequence'

# Visualization with ensemble weights
results, fig = plot_prf_prediction(
    sequence="ATGC...", 
    short_threshold=0.65, 
    long_threshold=0.8,
    ensemble_weight=0.3  # 30% Short, 70% Long
)
```

### Backward Compatibility
- All old methods still work but will show deprecation warnings
- Old field names are automatically added for compatibility
- Gradual migration is supported
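
A common way to keep an old method working while steering users to the new one is a thin wrapper that emits a `DeprecationWarning`. The sketch below shows that general pattern with a hypothetical `DemoPredictor` class; it is not FScanpy's actual implementation, and the return value is a stand-in for real predictions.

```python
import warnings


class DemoPredictor:
    """Hypothetical stand-in used only to illustrate the pattern."""

    def predict_sequence(self, sequence, ensemble_weight=0.4):
        # Stand-in result; a real predictor would return a table of sites
        return {"length": len(sequence), "ensemble_weight": ensemble_weight}

    def predict_full(self, sequence, short_weight=0.4):
        # Old name and old parameter forward to the new method
        warnings.warn(
            "predict_full() is deprecated; use predict_sequence()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.predict_sequence(sequence, ensemble_weight=short_weight)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = DemoPredictor().predict_full("ATGCGT", short_weight=0.3)

print(result)
print([str(w.message) for w in caught])
```

Python hides `DeprecationWarning` in some contexts by default, so running scripts with `python -W default` makes such warnings visible during migration.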

## Ensemble Weight Configuration Guide

### Recommended Weights for Different Scenarios:

| Scenario | ensemble_weight | Description | Use Case |
|----------|----------------|-------------|----------|
| **High Sensitivity** | 0.2-0.3 | Long model dominant | Detecting subtle PRF sites |
| **Balanced Detection** | 0.4-0.5 | Balanced ensemble (recommended) | General purpose prediction |
| **Fast Screening** | 0.6-0.7 | Short model dominant | Rapid initial screening |
| **Equal Contribution** | 0.5 | Equal weight to both models | Comparative analysis |

### Weight Selection Guidelines:
- **Low ensemble_weight (0.2-0.3)**: 
  - Emphasizes Long model (BiLSTM-CNN)
  - Better for detecting complex patterns
  - Higher sensitivity, may have more false positives
  
- **High ensemble_weight (0.6-0.8)**: 
  - Emphasizes Short model (HistGB)
  - Faster computation
  - Good for initial screening
  - Higher specificity, may miss subtle sites
  
- **Balanced (0.4-0.5)**: 
  - Recommended for most applications
  - Good balance of sensitivity and specificity
  - Suitable for comprehensive analysis
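
To see how the weight shifts the result, the sketch below treats the ensemble as a simple weighted average, with `ensemble_weight` as the Short model's share and the remainder going to the Long model. This matches the "30% Short, 70% Long" reading used in the examples, but the exact formula inside FScanpy is an assumption here, and `combine_probabilities` is a hypothetical helper.

```python
def combine_probabilities(short_prob, long_prob, ensemble_weight=0.4):
    """Weighted average of two model scores (assumed formula, for illustration)."""
    if not 0.0 <= ensemble_weight <= 1.0:
        raise ValueError("ensemble_weight must be between 0 and 1")
    return ensemble_weight * short_prob + (1.0 - ensemble_weight) * long_prob


# Made-up scores: the Short model is confident, the Long model is not
for weight in (0.3, 0.5, 0.7):
    print(weight, round(combine_probabilities(0.9, 0.5, weight), 3))
```

Under this assumption, a site on which the two models disagree moves toward whichever model holds the larger share.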

## Output Field Reference

### Main Prediction Fields
- **`Short_Probability`**: HistGradientBoosting model prediction (0-1)
- **`Long_Probability`**: BiLSTM-CNN model prediction (0-1)
- **`Ensemble_Probability`**: Final ensemble prediction (primary result)
- **`Ensemble_Weights`**: Weight configuration information

### Sequence Fields
- **`Short_Sequence`**: 33bp sequence used by Short model
- **`Long_Sequence`**: 399bp sequence used by Long model
- **`Position`**: Position in the original sequence
- **`Codon`**: 3bp codon at the position
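
The 33bp and 399bp fields can be thought of as two windows cut around the same position. The sketch below shows one way to cut such windows. It assumes 0-based positions, windows centred on the position, and `N` padding at the sequence ends; none of these details is confirmed for FScanpy, and `extract_window` is a hypothetical helper.

```python
def extract_window(sequence, center, width, pad="N"):
    """Return `width` bases centred on 0-based index `center`, padded at the ends.

    Assumes an odd `width`, so the window is symmetric around `center`.
    """
    half = width // 2
    start, end = center - half, center + half + 1
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(sequence))
    return left_pad + sequence[max(0, start):min(len(sequence), end)] + right_pad


genome = "ATG" * 200  # 600 bp placeholder sequence
short_window = extract_window(genome, 300, 33)
long_window = extract_window(genome, 300, 399)
print(len(short_window), len(long_window))
```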

### Metadata Fields
- **`Sequence_ID`**: Identifier for multi-sequence predictions
- Additional fields from input DataFrame (for region predictions)

## Examples

See `example_plot_prediction.py` for comprehensive examples of:
- Basic prediction plotting
- Custom threshold configuration
- Ensemble weight parameter usage and comparison
- New API method demonstrations
- Saving plots to files
- Advanced visualization options

## Authors


## Citation
If you utilize FScanpy in your research, please cite our work:

```bibtex
[Citation details will be added upon publication]
```