
# FScanpy
## A Machine Learning-Based Framework for Programmed Ribosomal Frameshifting Prediction

[![Python](https://img.shields.io/badge/Python-3.7%2B-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

FScanpy is a Python package for predicting [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. It combines two machine learning models (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework, so candidate sites can be scored either by scanning full sequences or from BLASTX-derived regions.

![FScanpy Architecture](/tutorial/image/structure.jpeg)

## 🌟 Key Features

### 🎯 **Dual-Model Architecture**
- **Short Model** (`HistGradientBoosting`): Fast screening with 33bp sequences
- **Long Model** (`BiLSTM-CNN`): Deep analysis with 399bp sequences
- **Ensemble Prediction**: Customizable model weights for optimal performance

### 🚀 **Versatile Input Support**
- **Single/Multiple Sequences**: Sliding window prediction across full sequences
- **Region-Based Analysis**: Direct prediction on pre-extracted 399bp regions
- **BLASTX Integration**: Seamless workflow with FScanR pipeline
- **Cross-Species Compatibility**: Built-in databases for viruses, marine phages, Euplotes, etc.

### 📊 **Advanced Visualization**
- **Interactive Heatmaps**: Frameshift (FS) site probability visualization
- **Prediction Plots**: Combined probability and confidence displays
- **Customizable Thresholds**: Separate filtering for each model
- **Export Options**: PNG, PDF, and interactive formats

### ⚡ **High Performance**
- **Optimized Algorithms**: Efficient sliding window scanning
- **Batch Processing**: Handle multiple sequences simultaneously
- **Flexible Thresholds**: Tunable sensitivity for different use cases
- **Memory Efficient**: Optimized for large-scale genomic data

## 🔧 Installation

### Prerequisites
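- Python ≥ 3.7 (note that several pinned dependencies listed under Dependencies, such as `pandas>=2.2.3` and `scikit-learn>=1.6.0`, only support Python 3.9 or newer)
- All dependencies are installed automatically by `pip`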

### Install via pip (Recommended)
```bash
pip install FScanpy
```

### Install from Source
```bash
git clone https://github.com/your-org/FScanpy-package.git
cd FScanpy-package
pip install -e .
```

## 🚀 Quick Start

### Basic Usage
```python
from FScanpy import predict_prf

# Simple sequence prediction
sequence = "ATGCGTACGTTAGC..."  # Placeholder: replace with your full DNA sequence
results = predict_prf(sequence=sequence)

# View top predictions
print(results[['Position', 'Ensemble_Probability', 'Short_Probability', 'Long_Probability']].head(10))
```

### Visualization
```python
from FScanpy import plot_prf_prediction

# Generate prediction plot
results, fig = plot_prf_prediction(
    sequence=sequence,
    short_threshold=0.65,    # HistGB threshold
    long_threshold=0.8,      # BiLSTM-CNN threshold
    ensemble_weight=0.4,     # 40% Short, 60% Long
    title="PRF Prediction Results"
)
```

### Advanced Usage
```python
from FScanpy import PRFPredictor
import pandas as pd

# Create predictor instance
predictor = PRFPredictor()

# Batch prediction on pre-extracted regions
# Placeholder inputs: short motifs repeated (and trimmed) to exactly 399bp
data = pd.DataFrame({
    'Long_Sequence': [('ATGCGT' * 67)[:399], 'GCTATAG' * 57]
})
results = predictor.predict_regions(data, ensemble_weight=0.4)

# Sequence-level prediction with custom parameters
results = predictor.predict_sequence(
    sequence=sequence,
    window_size=1,           # Step size for sliding window
    ensemble_weight=0.3,     # Model weighting
    short_threshold=0.5      # Filtering threshold
)
```
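
`predict_regions` is described as taking 399bp entries in `Long_Sequence`, so a small pre-flight check can catch truncated or non-nucleotide entries before prediction. The sketch below is not part of FScanpy: the helper name `find_invalid_regions` and the accepted alphabet (A, C, G, T, U, N) are assumptions made for illustration.

```python
LONG_WINDOW = 399  # length of the Long model input, as stated in this README
ALLOWED_BASES = set("ACGTUN")  # assumed alphabet; adjust to your data


def find_invalid_regions(sequences):
    """Return (index, reason) pairs for entries that are not valid 399bp regions."""
    problems = []
    for index, seq in enumerate(sequences):
        if len(seq) != LONG_WINDOW:
            problems.append((index, f"length {len(seq)} != {LONG_WINDOW}"))
        elif not set(seq.upper()) <= ALLOWED_BASES:
            problems.append((index, "contains non-nucleotide characters"))
    return problems


# The first entry is a valid 399bp placeholder; the second is too long
# and still contains literal dots.
regions = ["ATGCGT" * 66 + "ATG", "ATGCGT..." * 60]
print(find_invalid_regions(regions))
```

With a DataFrame, the same check can be run on `data['Long_Sequence']`, since iterating a pandas Series yields its values.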

## 🎛️ Ensemble Weight Configuration

The `ensemble_weight` parameter sets the Short model's share of the ensemble score; the Long model receives the remainder (`1 - ensemble_weight`):

| ensemble_weight | Short Model | Long Model | Best For |
|----------------|-------------|------------|----------|
| **0.2-0.3** | 20-30% | 70-80% | **High sensitivity**, detecting subtle sites |
| **0.4-0.5** | 40-50% | 50-60% | **Balanced detection** (recommended) |
| **0.6-0.7** | 60-70% | 30-40% | **Fast screening**, high specificity |
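
The table implies a simple linear blend of the two model outputs. A minimal sketch of that arithmetic follows, assuming the combination is a plain weighted average; the helper name `blend_probabilities` is illustrative and not part of the FScanpy API.

```python
def blend_probabilities(short_prob: float, long_prob: float,
                        ensemble_weight: float = 0.4) -> float:
    """Weighted average of the two model outputs.

    Assumption: the models are combined linearly, with `ensemble_weight`
    applied to the Short model and the remainder to the Long model,
    as the table above suggests.
    """
    if not 0.0 <= ensemble_weight <= 1.0:
        raise ValueError("ensemble_weight must be between 0.0 and 1.0")
    return ensemble_weight * short_prob + (1.0 - ensemble_weight) * long_prob


# The same pair of model outputs under three weighting regimes.
for weight in (0.2, 0.4, 0.7):
    print(weight, round(blend_probabilities(0.9, 0.5, weight), 2))
```

Under this assumption, a site the Short model scores highly (0.9) but the Long model rates as borderline (0.5) moves from 0.58 to 0.78 as the weight shifts from 0.2 to 0.7, which is why the weight matters most where the two models disagree.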

### Weight Selection Examples
```python
# High sensitivity (Long model dominant)
sensitive_results = predict_prf(sequence, ensemble_weight=0.2)

# Balanced approach (recommended)
balanced_results = predict_prf(sequence, ensemble_weight=0.4)

# Fast screening (Short model dominant)
screening_results = predict_prf(sequence, ensemble_weight=0.7)
```

## 📊 Core Functions

### Main Prediction Interface
```python
predict_prf(
    sequence=None,           # Single/multiple sequences or None
    data=None,              # DataFrame with 399bp sequences or None
    window_size=3,          # Sliding window step size
    short_threshold=0.1,    # Short model filtering threshold
    ensemble_weight=0.4,    # Short model weight (0.0-1.0)
    model_dir=None         # Custom model directory
)
```

### Visualization Function
```python
plot_prf_prediction(
    sequence,               # Input DNA sequence
    window_size=3,          # Scanning step size
    short_threshold=0.65,   # Short model threshold for plotting
    long_threshold=0.8,     # Long model threshold for plotting
    ensemble_weight=0.4,    # Model weighting
    title=None,            # Plot title
    save_path=None,        # Save file path
    figsize=(12,8),        # Figure size
    dpi=300               # Resolution for saved plots
)
```

### PRFPredictor Class Methods
```python
predictor = PRFPredictor()

# Sequence prediction (sliding window)
predictor.predict_sequence(sequence, ensemble_weight=0.4)

# Region prediction (batch processing)
predictor.predict_regions(dataframe, ensemble_weight=0.4)

# Feature extraction
predictor.extract_features(sequences)

# Model information
predictor.get_model_info()
```

## 📈 Output Fields

### Prediction Results
- **`Position`**: Position in the original sequence
- **`Ensemble_Probability`**: Final ensemble prediction (main result)
- **`Short_Probability`**: HistGradientBoosting prediction (0-1)
- **`Long_Probability`**: BiLSTM-CNN prediction (0-1)
- **`Ensemble_Weights`**: Model weight configuration used
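
Downstream filtering usually combines these columns. The sketch below assumes the results have been converted to plain records with pandas' `DataFrame.to_dict('records')`; the helper name `top_sites` is hypothetical and the sample rows are made up, not real predictions.

```python
def top_sites(records, min_ensemble=0.8, limit=10):
    """Keep records at or above the ensemble cutoff, highest probability first."""
    kept = [r for r in records if r["Ensemble_Probability"] >= min_ensemble]
    kept.sort(key=lambda r: r["Ensemble_Probability"], reverse=True)
    return kept[:limit]


# Made-up rows using the column names documented above.
records = [
    {"Position": 120, "Short_Probability": 0.71,
     "Long_Probability": 0.93, "Ensemble_Probability": 0.84},
    {"Position": 453, "Short_Probability": 0.40,
     "Long_Probability": 0.35, "Ensemble_Probability": 0.37},
    {"Position": 981, "Short_Probability": 0.88,
     "Long_Probability": 0.95, "Ensemble_Probability": 0.92},
]
for row in top_sites(records):
    print(row["Position"], row["Ensemble_Probability"])
```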

### Sequence Information
- **`Short_Sequence`**: 33bp sequence for Short model
- **`Long_Sequence`**: 399bp sequence for Long model
- **`Codon`**: 3bp codon at the prediction position
- **`Sequence_ID`**: Identifier for multi-sequence inputs
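
To make the relationship between these fields concrete, the sketch below cuts a codon, a 33bp window, and a 399bp window around a 0-based position. It assumes both windows are centered on the codon and padded with `N` at sequence edges; FScanpy's actual extraction and coordinate convention may differ, and `centered_window` and `extract_windows` are hypothetical helpers.

```python
def centered_window(sequence: str, center: int, width: int, pad: str = "N") -> str:
    """Return `width` bases centered on `center`, padding past either end."""
    half = width // 2
    start, end = center - half, center + half + 1
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(sequence))
    return left_pad + sequence[max(0, start):min(len(sequence), end)] + right_pad


def extract_windows(sequence: str, position: int) -> dict:
    """Build the three sequence fields for one codon start (0-based)."""
    center = position + 1  # middle base of the codon (assumed centering)
    return {
        "Codon": sequence[position:position + 3],
        "Short_Sequence": centered_window(sequence, center, 33),
        "Long_Sequence": centered_window(sequence, center, 399),
    }


# A 560bp placeholder sequence built from a repeated motif.
fields = extract_windows("ATGCGTACGTTAGC" * 40, position=300)
print(len(fields["Short_Sequence"]), len(fields["Long_Sequence"]), fields["Codon"])
```

Padding keeps every window at a fixed length, so positions near either end of a sequence can still be represented; whether FScanpy pads, skips, or truncates such positions is not stated here.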

## 🔬 Integration with FScanR

FScanpy can score candidate regions produced by the FScanR pipeline. The workflow has three steps: filter BLASTX hits, extract candidate regions, then predict:

```python
from FScanpy import fscanr, extract_prf_regions, predict_prf

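# blastx_data and original_sequence are placeholders for your own
# BLASTX results and query sequence
# Step 1: BLASTX analysis with FScanR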
blastx_results = fscanr(
    blastx_data,
    mismatch_cutoff=10,
    evalue_cutoff=1e-5,
    frameDist_cutoff=10
)

# Step 2: Extract PRF candidate regions
prf_regions = extract_prf_regions(original_sequence, blastx_results)

# Step 3: Predict with FScanpy
final_predictions = predict_prf(data=prf_regions, ensemble_weight=0.4)
```

## 📚 Documentation

- **[Complete Tutorial](tutorial/tutorial.md)**: Comprehensive usage guide with examples
- **[Demo Notebook](FScanpy_Demo.ipynb)**: Interactive examples and workflows
- **[Example Scripts](example_plot_prediction.py)**: Ready-to-run code examples

## 🎯 Use Cases

### 1. **Viral Genome Analysis**
```python
from FScanpy import predict_prf

# Scan a viral genome for PRF sites
viral_sequence = load_viral_genome()  # placeholder: supply your own loader
prf_sites = predict_prf(viral_sequence, ensemble_weight=0.3)
high_confidence = prf_sites[prf_sites['Ensemble_Probability'] > 0.8]
```

### 2. **Comparative Genomics**
```python
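import pandas as pd
from FScanpy import predict_prf

# Compare PRF patterns across species
# seq_a_399bp and seq_b_399bp are placeholders for your own 399bp regions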
species_data = pd.DataFrame({
    'Species': ['Virus_A', 'Virus_B'],
    'Long_Sequence': [seq_a_399bp, seq_b_399bp]
})
comparative_results = predict_prf(data=species_data)
```

### 3. **High-Throughput Screening**
```python
from FScanpy import predict_prf

# Fast screening of large sequence datasets
sequences = load_large_dataset()  # placeholder: supply your own loader
screening_results = predict_prf(
    sequence=sequences,
    ensemble_weight=0.7,  # Fast screening mode
    short_threshold=0.3
)
```
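
`load_viral_genome()` and `load_large_dataset()` in the examples above stand in for user-supplied loaders; they are not FScanpy functions. One possible loader, sketched here with the standard library only, reads FASTA text into a list of sequences. The file name `genomes.fasta` is hypothetical, and whether `predict_prf` accepts a list in exactly this form should be checked against the tutorial.

```python
import io


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from an open FASTA text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            record_id, chunks = (line[1:].split() or [""])[0], []
        elif record_id is not None:
            chunks.append(line.upper())
    if record_id is not None:
        yield record_id, "".join(chunks)


# In practice, pass open("genomes.fasta") instead of this in-memory example.
example = io.StringIO(">virus_A demo\nATGCGT\nACGTTA\n>virus_B\nGCTATAG\n")
records = list(read_fasta(example))
sequences = [seq for _, seq in records]
print(records)
```

Since `biopython` is already a dependency, `Bio.SeqIO.parse(path, "fasta")` is an alternative that also handles less common FASTA variants.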

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.

## 📝 Citation

If you use FScanpy in your research, please cite:

```bibtex
@software{fscanpy2024,
  title={FScanpy: A Machine Learning Framework for Programmed Ribosomal Frameshifting Prediction},
  author={[Author names]},
  year={2024},
  url={https://github.com/your-org/FScanpy}
}
```

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🆘 Support

- **Documentation**: [Tutorial](tutorial/tutorial.md)
- **Examples**: [Demo Notebook](FScanpy_Demo.ipynb)

## 🏗️ Dependencies

FScanpy automatically installs all required dependencies:
- `numpy>=1.24.3`
- `pandas>=2.2.3`
- `tensorflow>=2.10.1`
- `scikit-learn>=1.6.0`
- `matplotlib>=3.9.4`
- `joblib>=1.4.2`
- `biopython>=1.85`
- `wrapt>=1.17.0`

---

**FScanpy** - Advancing programmed ribosomal frameshifting research through machine learning 🧬