FScanpy-package/tutorial/tutorial.md

# FScanpy Tutorial - Complete Usage Guide

[![中文](https://img.shields.io/badge/Language-中文-red.svg)](tutorial_zh.md)

## Abstract
FScanpy is a Python package designed to predict Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. This package integrates machine learning models, sequence feature analysis, and visualization capabilities to help researchers rapidly locate potential PRF sites.

## Introduction
![FScanpy structure](/image/structure.jpeg)

FScanpy is a Python package dedicated to predicting Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. It integrates machine learning models (Gradient Boosting and BiLSTM-CNN) along with the FScanR package to furnish precise PRF predictions. Users are capable of employing three types of data as input: the entire cDNA/mRNA sequence that requires prediction, the nucleotide sequence in the vicinity of the suspected frameshift site, and the peptide library blastx results of the species or related species. It anticipates the input sequence to be in the + strand and can be integrated with FScanR to augment the accuracy.

![Machine learning models](/image/ML.png)

For the prediction of the entire sequence, FScanpy adopts a sliding window approach to scan the entire sequence and predict the PRF sites. For regional prediction, it is based on the 33-bp and 399-bp sequences in the 0 reading frame around the suspected frameshift site. Initially, the Short model (HistGradientBoosting) will predict the potential PRF sites within the scanning window. If the predicted probability exceeds the threshold, the Long model (BiLSTM-CNN) will predict the PRF sites in the 399bp sequence. Then, ensemble weighting combines the two models to make the final prediction.

For PRF detection from BLASTX output, [FScanR](https://github.com/seanchen607/FScanR.git) identifies potential PRF sites from BLASTX alignment results, acquires the two hits of the same query sequence, and then utilizes frameDist_cutoff, mismatch_cutoff, and evalue_cutoff to filter the hits. Finally, FScanpy is utilized to predict the probability of PRF sites.

### Background
[Ribosomal frameshifting](https://en.wikipedia.org/wiki/Ribosomal_frameshift), also known as translational frameshifting or translational recoding, is a biological phenomenon that occurs during translation that results in the production of multiple, unique proteins from a single mRNA. The process can be programmed by the nucleotide sequence of the mRNA and is sometimes affected by the secondary, 3-dimensional mRNA structure. It has been described mainly in viruses (especially retroviruses), retrotransposons and bacterial insertion elements, and also in some cellular genes.

### Key features of FScanpy include:

- Integration of two predictive models:
  - Short Model (HistGradientBoosting): Analyzes local sequence features centered around potential frameshift sites (33bp).
  - Long Model (BiLSTM-CNN): Analyzes broader sequence features (399bp).
- Supports PRF prediction across various species.
- Can be combined with [FScanR](https://github.com/seanchen607/FScanR.git) for enhanced accuracy.

## Installation (python>=3.7)

### 1. Use pip
```bash
pip install FScanpy
```

### 2. Clone from GitHub
```bash
git clone https://github.com/.../FScanpy.git
cd FScanpy
pip install -e .
```

## Complete Function Reference

### 1. Core Prediction Functions

#### 1.1 `predict_prf()` - Main Prediction Interface

**Function Signature:**
```python
def predict_prf(
    sequence: Union[str, List[str], None] = None,
    data: Union[pd.DataFrame, None] = None,
    window_size: int = 3,
    short_threshold: float = 0.1,
    ensemble_weight: float = 0.4,
    model_dir: str = None
) -> pd.DataFrame
```

**Parameters:**
- `sequence`: Single or multiple DNA sequences for sliding window prediction
- `data`: DataFrame data, must contain 'Long_Sequence' or '399bp' column for region prediction
- `window_size`: Sliding window size (default: 3, recommended: 1-10)
- `short_threshold`: Short model (HistGB) probability threshold (default: 0.1, range: 0.0-1.0)
- `ensemble_weight`: Weight of short model in ensemble (default: 0.4, range: 0.0-1.0)
- `model_dir`: Model directory path (optional, uses built-in models if None)

**Returns:**
- `pd.DataFrame`: Prediction results with columns:
  - `Short_Probability`: Short model prediction probability
  - `Long_Probability`: Long model prediction probability
  - `Ensemble_Probability`: Ensemble prediction probability (main result)
  - `Position`: Position in sequence (for sliding window mode)
  - `Codon`: Codon at position (for sliding window mode)
  - `Ensemble_Weights`: Weight configuration information

**Usage Examples:**

```python
from FScanpy import predict_prf

# 1. Single sequence sliding window prediction
sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
results = predict_prf(sequence=sequence)

# 2. Multiple sequences prediction
sequences = ["ATGCGTACGT...", "GCTATAGCAT..."]
results = predict_prf(sequence=sequences)

# 3. Custom parameters
results = predict_prf(
    sequence=sequence,
    window_size=1,           # Scan every position
    short_threshold=0.2,     # Higher threshold
    ensemble_weight=0.3      # 3:7 ratio (short:long)
)

# 4. DataFrame region prediction
import pandas as pd
data = pd.DataFrame({
    'Long_Sequence': ['ATGCGT...', 'GCTATAG...'],  # or use '399bp'
    'sample_id': ['sample1', 'sample2']
})
results = predict_prf(data=data)
```

#### 1.2 `plot_prf_prediction()` - Prediction with Visualization

**Function Signature:**
```python
def plot_prf_prediction(
    sequence: str,
    window_size: int = 3,
    short_threshold: float = 0.65,
    long_threshold: float = 0.8,
    ensemble_weight: float = 0.4,
    title: str = None,
    save_path: str = None,
    figsize: tuple = (12, 8),
    dpi: int = 300,
    model_dir: str = None
) -> tuple
```

**Parameters:**
- `sequence`: Input DNA sequence (string)
- `window_size`: Sliding window size (default: 3)
- `short_threshold`: Short model filtering threshold for heatmap display (default: 0.65)
- `long_threshold`: Long model filtering threshold for heatmap display (default: 0.8)
- `ensemble_weight`: Weight of short model in ensemble (default: 0.4)
- `title`: Plot title (optional, auto-generated if None)
- `save_path`: Save path (optional, saves plot if provided)
- `figsize`: Figure size tuple (default: (12, 8))
- `dpi`: Figure resolution (default: 300)
- `model_dir`: Model directory path (optional)

**Returns:**
- `tuple`: (prediction_results: pd.DataFrame, figure: matplotlib.figure.Figure)

**Usage Examples:**

```python
from FScanpy import plot_prf_prediction
import matplotlib.pyplot as plt

# 1. Basic plotting
sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
results, fig = plot_prf_prediction(sequence)
plt.show()

# 2. Custom thresholds and weights
results, fig = plot_prf_prediction(
    sequence,
    short_threshold=0.7,     # Higher display threshold
    long_threshold=0.85,     # Higher display threshold
    ensemble_weight=0.3,     # 3:7 weight ratio
    title="Custom Analysis Results",
    save_path="analysis.png",
    figsize=(15, 10),
    dpi=150
)

# 3. High-resolution analysis
results, fig = plot_prf_prediction(
    sequence,
    window_size=1,           # Scan every position
    ensemble_weight=0.5,     # Equal weights
    dpi=600                  # High resolution
)
```

### 2. PRFPredictor Class Methods

#### 2.1 Class Initialization

```python
from FScanpy import PRFPredictor

# Initialize with default models
predictor = PRFPredictor()

# Initialize with custom model directory
predictor = PRFPredictor(model_dir='/path/to/models')
```

#### 2.2 `predict_sequence()` - Sliding Window Prediction

**Method Signature:**
```python
def predict_sequence(self, sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)
```

**Parameters:**
- `sequence`: Input DNA sequence
- `window_size`: Sliding window size (default: 3)
- `short_threshold`: Short model probability threshold (default: 0.1)
- `ensemble_weight`: Short model weight in ensemble (default: 0.4)

**Usage:**
```python
predictor = PRFPredictor()
results = predictor.predict_sequence(
    sequence="ATGCGTACGT...",
    window_size=1,
    short_threshold=0.15,
    ensemble_weight=0.35
)
```

#### 2.3 `predict_regions()` - Region-based Prediction

**Method Signature:**
```python
def predict_regions(self, sequences, short_threshold=0.1, ensemble_weight=0.4)
```

**Parameters:**
- `sequences`: List or Series of 399bp sequences
- `short_threshold`: Short model probability threshold (default: 0.1)
- `ensemble_weight`: Short model weight in ensemble (default: 0.4)

**Usage:**
```python
predictor = PRFPredictor()
sequences = ["ATGCGT...", "GCTATAG..."]  # 399bp sequences
results = predictor.predict_regions(
    sequences=sequences,
    short_threshold=0.1,
    ensemble_weight=0.4
)
```

#### 2.4 `predict_single_position()` - Single Position Prediction

**Method Signature:**
```python
def predict_single_position(self, fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)
```

**Parameters:**
- `fs_period`: 33bp sequence around frameshift site
- `full_seq`: 399bp sequence for long model
- `short_threshold`: Short model probability threshold (default: 0.1)
- `ensemble_weight`: Short model weight in ensemble (default: 0.4)

**Usage:**
```python
predictor = PRFPredictor()
result = predictor.predict_single_position(
    fs_period="ATGCGTACGTATGCGTACGTATGCGTACGTA",  # 33bp
    full_seq="ATGCGT..." * 133,  # 399bp
    short_threshold=0.1,
    ensemble_weight=0.4
)
```

#### 2.5 `plot_sequence_prediction()` - Class Method for Plotting

**Method Signature:**
```python
def plot_sequence_prediction(self, sequence, window_size=3, short_threshold=0.65,
                           long_threshold=0.8, ensemble_weight=0.4, title=None,
                           save_path=None, figsize=(12, 8), dpi=300)
```

**Usage:**
```python
predictor = PRFPredictor()
results, fig = predictor.plot_sequence_prediction(
    sequence="ATGCGTACGT...",
    window_size=3,
    ensemble_weight=0.4
)
```

### 3. Utility Functions

#### 3.1 `fscanr()` - PRF Site Detection from BLASTX

**Function Signature:**
```python
def fscanr(
    blastx_output: pd.DataFrame,
    mismatch_cutoff: float = 10,
    evalue_cutoff: float = 1e-5,
    frameDist_cutoff: float = 10
) -> pd.DataFrame
```

**Parameters:**
- `blastx_output`: BLASTX output DataFrame with required columns:
  - `qseqid`, `sseqid`, `pident`, `length`, `mismatch`, `gapopen`
  - `qstart`, `qend`, `sstart`, `send`, `evalue`, `bitscore`, `qframe`, `sframe`
- `mismatch_cutoff`: Maximum allowed mismatches (default: 10)
- `evalue_cutoff`: E-value threshold (default: 1e-5)
- `frameDist_cutoff`: Frame distance threshold (default: 10)

**Returns:**
- `pd.DataFrame`: PRF sites with columns:
  - `DNA_seqid`: Sequence identifier
  - `FS_start`, `FS_end`: Frameshift start and end positions
  - `Pep_seqid`: Peptide sequence identifier
  - `Pep_FS_start`, `Pep_FS_end`: Peptide frameshift positions
  - `FS_type`: Type of frameshift (-2, -1, 1, 2)
  - `Strand`: Strand orientation (+, -)

**Usage:**
```python
from FScanpy.utils import fscanr
import pandas as pd

# Load BLASTX results
blastx_data = pd.read_excel('blastx_results.xlsx')

# Detect PRF sites
prf_sites = fscanr(
    blastx_output=blastx_data,
    mismatch_cutoff=5,       # Stricter mismatch filter
    evalue_cutoff=1e-6,      # Stricter E-value filter
    frameDist_cutoff=15      # Allow larger frame distances
)
```

#### 3.2 `extract_prf_regions()` - Extract Sequences Around PRF Sites

**Function Signature:**
```python
def extract_prf_regions(mrna_file: str, prf_data: pd.DataFrame) -> pd.DataFrame
```

**Parameters:**
- `mrna_file`: Path to mRNA sequences file (FASTA format)
- `prf_data`: DataFrame from `fscanr()` output

**Returns:**
- `pd.DataFrame`: Extracted sequences with columns:
  - `DNA_seqid`: Sequence identifier
  - `FS_start`, `FS_end`: Frameshift positions
  - `Strand`: Strand orientation
  - `399bp`: Extracted 399bp sequence
  - `FS_type`: Frameshift type

**Usage:**
```python
from FScanpy.utils import extract_prf_regions

# Extract sequences around PRF sites
prf_sequences = extract_prf_regions(
    mrna_file='sequences.fasta',
    prf_data=prf_sites
)

# Predict PRF probabilities
predictor = PRFPredictor()
results = predictor.predict_regions(prf_sequences['399bp'])
```

### 4. Data Access Functions

#### 4.1 Test Data Access

```python
from FScanpy.data import get_test_data_path, list_test_data

# List available test data
list_test_data()

# Get test data paths
blastx_file = get_test_data_path('blastx_example.xlsx')
mrna_file = get_test_data_path('mrna_example.fasta')
region_file = get_test_data_path('region_example.csv')
seq_file = get_test_data_path('full_seq.xlsx')
```

## Complete Workflow Examples

### Workflow 1: Full Sequence Analysis

```python
from FScanpy import predict_prf, plot_prf_prediction
import matplotlib.pyplot as plt

# Define sequence
full_seq = pd.read.excel(seq_file)

# Method 1: Simple prediction
results = predict_prf(sequence=full_seq[0]['full_seq'])
print(f"Found {len(results)} potential sites")

# Method 2: Prediction with visualization
results, fig = plot_prf_prediction(
    sequence=sequence=full_seq[0]['full_seq'],
    window_size=1,              # Scan every position
    short_threshold=0.3,        # Display sites above 0.3
    long_threshold=0.4,         # Display sites above 0.4
    ensemble_weight=0.4,        # 4:6 weight ratio
    title="PRF Analysis Results",
    save_path="prf_analysis.png"
)
plt.show()

# Analyze top predictions
top_sites = results.nlargest(5, 'Ensemble_Probability')
print("Top 5 predicted sites:")
for _, site in top_sites.iterrows():
    print(f"Position {site['Position']}: {site['Ensemble_Probability']:.3f}")
```

### Workflow 2: Region-based Prediction

```python
from FScanpy import predict_prf
import pandas as pd

# Prepare region data
region_data = pd.DataFrame({
    'sample_id': ['sample1', 'sample2', 'sample3'],
    'Long_Sequence': [
        'ATGCGT...',  # 399bp sequence 1
        'GCTATAG...',  # 399bp sequence 2
        'TTACGGA...'   # 399bp sequence 3
    ],
    'known_label': [1, 0, 1]  # Optional: known labels for validation
})

# Predict PRF probabilities
results = predict_prf(
    data=region_data,
    ensemble_weight=0.3  # Favor long model (3:7 ratio)
)

# Evaluate results
if 'known_label' in results.columns:
    threshold = 0.5
    predictions = (results['Ensemble_Probability'] > threshold).astype(int)
    accuracy = (predictions == results['known_label']).mean()
    print(f"Accuracy at threshold {threshold}: {accuracy:.3f}")
```

### Workflow 3: BLASTX-based Analysis Pipeline

```python
from FScanpy import PRFPredictor, predict_prf
from FScanpy.data import get_test_data_path
from FScanpy.utils import fscanr, extract_prf_regions
import pandas as pd

# Step 1: Load BLASTX data
blastx_data = pd.read_excel(get_test_data_path('blastx_example.xlsx'))
print(f"Loaded {len(blastx_data)} BLASTX hits")

# Step 2: Detect PRF sites using FScanR
prf_sites = fscanr(
    blastx_output=blastx_data,
    mismatch_cutoff=10,
    evalue_cutoff=1e-5,
    frameDist_cutoff=10
)
print(f"Detected {len(prf_sites)} potential PRF sites")

# Step 3: Extract sequences around PRF sites
mrna_file = get_test_data_path('mrna_example.fasta')
prf_sequences = extract_prf_regions(
    mrna_file=mrna_file,
    prf_data=prf_sites
)
print(f"Extracted {len(prf_sequences)} sequences")

# Step 4: Predict PRF probabilities
predictor = PRFPredictor()
results = predictor.predict_regions(
    sequences=prf_sequences['399bp'],
    ensemble_weight=0.4
)

# Step 5: Combine results with metadata
final_results = pd.concat([
    prf_sequences.reset_index(drop=True),
    results.reset_index(drop=True)
], axis=1)

# Step 6: Analyze results
high_prob_sites = final_results[
    final_results['Ensemble_Probability'] > 0.7
]
print(f"High probability PRF sites: {len(high_prob_sites)}")

# Display top results
print("\nTop PRF predictions:")
top_results = final_results.nlargest(3, 'Ensemble_Probability')
for _, row in top_results.iterrows():
    print(f"Sequence {row['DNA_seqid']}: {row['Ensemble_Probability']:.3f}")
```

### Workflow 4: Custom Analysis with Multiple Sequences

```python
from FScanpy import predict_prf, plot_prf_prediction
import matplotlib.pyplot as plt

# Multiple sequence analysis
sequences = [
    "ATGCGTACGTATGCGTACGTATGCGTACGTAAGCCCTTTGAACCCAAAGGG",
    "GCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCAT",
    "TTACGGATTACGGATTACGGATTACGGATTACGGATTACGGATTACGGAT"
]

# Batch prediction
results = predict_prf(
    sequence=sequences,
    window_size=2,
    ensemble_weight=0.5  # Equal weights
)

# Analyze per sequence
for seq_id in results['Sequence_ID'].unique():
    seq_results = results[results['Sequence_ID'] == seq_id]
    max_prob = seq_results['Ensemble_Probability'].max()
    print(f"{seq_id}: Max probability = {max_prob:.3f}")

# Visualize first sequence
first_seq_results, fig = plot_prf_prediction(
    sequence=sequences[0],
    ensemble_weight=0.5,
    title="First Sequence Analysis"
)
plt.show()
```

## Parameter Optimization Guidelines

### 1. Ensemble Weight Selection

- **Conservative (favor specificity)**: `ensemble_weight = 0.2-0.3` (favor long model)
- **Balanced**: `ensemble_weight = 0.4-0.6` (default recommended)
- **Sensitive (favor sensitivity)**: `ensemble_weight = 0.7-0.8` (favor short model)

### 2. Threshold Selection

- **Short threshold**: Usually 0.1-0.3, controls computational efficiency
- **Display thresholds**: 0.3-0.8, controls visualization display
- **Classification threshold**: 0.5 (standard), adjust based on validation data

### 3. Window Size Selection

- **Fine-grained analysis**: `window_size = 1` (every position)
- **Standard analysis**: `window_size = 3` (every 3rd position, default)
- **Coarse analysis**: `window_size = 6-9` (faster, less detailed)

## Troubleshooting

### Common Issues and Solutions

1. **Model Loading Errors**
   ```python
   # Check model directory
   import FScanpy
   predictor = PRFPredictor(model_dir='/custom/path')
   ```

2. **Memory Issues with Large Sequences**
   ```python
   # Use larger window size to reduce computational load
   results = predict_prf(sequence=large_seq, window_size=9)
   ```

3. **Visualization Issues**
   ```python
   # Adjust figure parameters
   results, fig = plot_prf_prediction(
       sequence=seq,
       figsize=(20, 10),  # Larger figure
       dpi=150            # Lower resolution
   )
   ```

4. **Input Format Issues**
   ```python
   # Ensure proper DataFrame format
   data = pd.DataFrame({
       'Long_Sequence': sequences,  # Use 'Long_Sequence' or '399bp'
       'sample_id': ids
   })
   ```

## Performance Optimization

### 1. Batch Processing
```python
# Process multiple sequences efficiently
sequences = ["seq1", "seq2", "seq3", ...]
results = predict_prf(sequence=sequences, window_size=3)
```

### 2. Threshold Optimization
```python
# Use appropriate short_threshold to skip unnecessary long model calls
results = predict_prf(
    sequence=sequence,
    short_threshold=0.2  # Higher threshold = faster processing
)
```

### 3. Memory Management
```python
# For very large datasets, process in chunks
chunk_size = 100
for i in range(0, len(large_dataset), chunk_size):
    chunk = large_dataset[i:i+chunk_size]
    chunk_results = predict_prf(data=chunk)
    # Process chunk_results
```

## Citation
If you use FScanpy, please cite our paper: [Paper Link]