20 KiB
FScanpy Tutorial - Complete Usage Guide
Abstract
FScanpy is a Python package designed to predict Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. This package integrates machine learning models, sequence feature analysis, and visualization capabilities to help researchers rapidly locate potential PRF sites.
Introduction
FScanpy is a Python package dedicated to predicting Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. It integrates machine learning models (Gradient Boosting and BiLSTM-CNN) along with the FScanR package to furnish precise PRF predictions. Users are capable of employing three types of data as input: the entire cDNA/mRNA sequence that requires prediction, the nucleotide sequence in the vicinity of the suspected frameshift site, and the peptide library blastx results of the species or related species. It anticipates the input sequence to be in the + strand and can be integrated with FScanR to augment the accuracy.
For the prediction of the entire sequence, FScanpy adopts a sliding window approach to scan the entire sequence and predict the PRF sites. For regional prediction, it is based on the 33-bp and 399-bp sequences in the 0 reading frame around the suspected frameshift site. Initially, the Short model (HistGradientBoosting) will predict the potential PRF sites within the scanning window. If the predicted probability exceeds the threshold, the Long model (BiLSTM-CNN) will predict the PRF sites in the 399bp sequence. Then, ensemble weighting combines the two models to make the final prediction.
For PRF detection from BLASTX output, FScanR identifies potential PRF sites from BLASTX alignment results, acquires the two hits of the same query sequence, and then utilizes frameDist_cutoff, mismatch_cutoff, and evalue_cutoff to filter the hits. Finally, FScanpy is utilized to predict the probability of PRF sites.
Background
Ribosomal frameshifting, also known as translational frameshifting or translational recoding, is a biological phenomenon that occurs during translation that results in the production of multiple, unique proteins from a single mRNA. The process can be programmed by the nucleotide sequence of the mRNA and is sometimes affected by the secondary, 3-dimensional mRNA structure. It has been described mainly in viruses (especially retroviruses), retrotransposons and bacterial insertion elements, and also in some cellular genes.
Key features of FScanpy include:
- Integration of two predictive models:
- Short Model (HistGradientBoosting): Analyzes local sequence features centered around potential frameshift sites (33bp).
- Long Model (BiLSTM-CNN): Analyzes broader sequence features (399bp).
- Supports PRF prediction across various species.
- Can be combined with FScanR for enhanced accuracy.
Installation (python>=3.7)
1. Use pip
pip install FScanpy
2. Clone from GitHub
git clone https://github.com/.../FScanpy.git
cd FScanpy
pip install -e .
Complete Function Reference
1. Core Prediction Functions
1.1 predict_prf()
- Main Prediction Interface
Function Signature:
def predict_prf(
sequence: Union[str, List[str], None] = None,
data: Union[pd.DataFrame, None] = None,
window_size: int = 3,
short_threshold: float = 0.1,
ensemble_weight: float = 0.4,
model_dir: str = None
) -> pd.DataFrame
Parameters:
sequence
: Single or multiple DNA sequences for sliding window predictiondata
: DataFrame data, must contain 'Long_Sequence' or '399bp' column for region predictionwindow_size
: Sliding window size (default: 3, recommended: 1-10)short_threshold
: Short model (HistGB) probability threshold (default: 0.1, range: 0.0-1.0)ensemble_weight
: Weight of short model in ensemble (default: 0.4, range: 0.0-1.0)model_dir
: Model directory path (optional, uses built-in models if None)
Returns:
pd.DataFrame
: Prediction results with columns:Short_Probability
: Short model prediction probabilityLong_Probability
: Long model prediction probabilityEnsemble_Probability
: Ensemble prediction probability (main result)Position
: Position in sequence (for sliding window mode)Codon
: Codon at position (for sliding window mode)Ensemble_Weights
: Weight configuration information
Usage Examples:
from FScanpy import predict_prf
# 1. Single sequence sliding window prediction
sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
results = predict_prf(sequence=sequence)
# 2. Multiple sequences prediction
sequences = ["ATGCGTACGT...", "GCTATAGCAT..."]
results = predict_prf(sequence=sequences)
# 3. Custom parameters
results = predict_prf(
sequence=sequence,
window_size=1, # Scan every position
short_threshold=0.2, # Higher threshold
ensemble_weight=0.3 # 3:7 ratio (short:long)
)
# 4. DataFrame region prediction
import pandas as pd
data = pd.DataFrame({
'Long_Sequence': ['ATGCGT...', 'GCTATAG...'], # or use '399bp'
'sample_id': ['sample1', 'sample2']
})
results = predict_prf(data=data)
1.2 plot_prf_prediction()
- Prediction with Visualization
Function Signature:
def plot_prf_prediction(
sequence: str,
window_size: int = 3,
short_threshold: float = 0.65,
long_threshold: float = 0.8,
ensemble_weight: float = 0.4,
title: str = None,
save_path: str = None,
figsize: tuple = (12, 8),
dpi: int = 300,
model_dir: str = None
) -> tuple
Parameters:
sequence
: Input DNA sequence (string)window_size
: Sliding window size (default: 3)short_threshold
: Short model filtering threshold for heatmap display (default: 0.65)long_threshold
: Long model filtering threshold for heatmap display (default: 0.8)ensemble_weight
: Weight of short model in ensemble (default: 0.4)title
: Plot title (optional, auto-generated if None)save_path
: Save path (optional, saves plot if provided)figsize
: Figure size tuple (default: (12, 8))dpi
: Figure resolution (default: 300)model_dir
: Model directory path (optional)
Returns:
tuple
: (prediction_results: pd.DataFrame, figure: matplotlib.figure.Figure)
Usage Examples:
from FScanpy import plot_prf_prediction
import matplotlib.pyplot as plt
# 1. Basic plotting
sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
results, fig = plot_prf_prediction(sequence)
plt.show()
# 2. Custom thresholds and weights
results, fig = plot_prf_prediction(
sequence,
short_threshold=0.7, # Higher display threshold
long_threshold=0.85, # Higher display threshold
ensemble_weight=0.3, # 3:7 weight ratio
title="Custom Analysis Results",
save_path="analysis.png",
figsize=(15, 10),
dpi=150
)
# 3. High-resolution analysis
results, fig = plot_prf_prediction(
sequence,
window_size=1, # Scan every position
ensemble_weight=0.5, # Equal weights
dpi=600 # High resolution
)
2. PRFPredictor Class Methods
2.1 Class Initialization
from FScanpy import PRFPredictor
# Initialize with default models
predictor = PRFPredictor()
# Initialize with custom model directory
predictor = PRFPredictor(model_dir='/path/to/models')
2.2 predict_sequence()
- Sliding Window Prediction
Method Signature:
def predict_sequence(self, sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)
Parameters:
sequence
: Input DNA sequencewindow_size
: Sliding window size (default: 3)short_threshold
: Short model probability threshold (default: 0.1)ensemble_weight
: Short model weight in ensemble (default: 0.4)
Usage:
predictor = PRFPredictor()
results = predictor.predict_sequence(
sequence="ATGCGTACGT...",
window_size=1,
short_threshold=0.15,
ensemble_weight=0.35
)
2.3 predict_regions()
- Region-based Prediction
Method Signature:
def predict_regions(self, sequences, short_threshold=0.1, ensemble_weight=0.4)
Parameters:
sequences
: List or Series of 399bp sequencesshort_threshold
: Short model probability threshold (default: 0.1)ensemble_weight
: Short model weight in ensemble (default: 0.4)
Usage:
predictor = PRFPredictor()
sequences = ["ATGCGT...", "GCTATAG..."] # 399bp sequences
results = predictor.predict_regions(
sequences=sequences,
short_threshold=0.1,
ensemble_weight=0.4
)
2.4 predict_single_position()
- Single Position Prediction
Method Signature:
def predict_single_position(self, fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)
Parameters:
fs_period
: 33bp sequence around frameshift sitefull_seq
: 399bp sequence for long modelshort_threshold
: Short model probability threshold (default: 0.1)ensemble_weight
: Short model weight in ensemble (default: 0.4)
Usage:
predictor = PRFPredictor()
result = predictor.predict_single_position(
fs_period="ATGCGTACGTATGCGTACGTATGCGTACGTA", # 33bp
full_seq="ATGCGT..." * 133, # 399bp
short_threshold=0.1,
ensemble_weight=0.4
)
2.5 plot_sequence_prediction()
- Class Method for Plotting
Method Signature:
def plot_sequence_prediction(self, sequence, window_size=3, short_threshold=0.65,
long_threshold=0.8, ensemble_weight=0.4, title=None,
save_path=None, figsize=(12, 8), dpi=300)
Usage:
predictor = PRFPredictor()
results, fig = predictor.plot_sequence_prediction(
sequence="ATGCGTACGT...",
window_size=3,
ensemble_weight=0.4
)
3. Utility Functions
3.1 fscanr()
- PRF Site Detection from BLASTX
Function Signature:
def fscanr(
blastx_output: pd.DataFrame,
mismatch_cutoff: float = 10,
evalue_cutoff: float = 1e-5,
frameDist_cutoff: float = 10
) -> pd.DataFrame
Parameters:
blastx_output
: BLASTX output DataFrame with required columns:qseqid
,sseqid
,pident
,length
,mismatch
,gapopen
qstart
,qend
,sstart
,send
,evalue
,bitscore
,qframe
,sframe
mismatch_cutoff
: Maximum allowed mismatches (default: 10)evalue_cutoff
: E-value threshold (default: 1e-5)frameDist_cutoff
: Frame distance threshold (default: 10)
Returns:
pd.DataFrame
: PRF sites with columns:DNA_seqid
: Sequence identifierFS_start
,FS_end
: Frameshift start and end positionsPep_seqid
: Peptide sequence identifierPep_FS_start
,Pep_FS_end
: Peptide frameshift positionsFS_type
: Type of frameshift (-2, -1, 1, 2)Strand
: Strand orientation (+, -)
Usage:
from FScanpy.utils import fscanr
import pandas as pd
# Load BLASTX results
blastx_data = pd.read_excel('blastx_results.xlsx')
# Detect PRF sites
prf_sites = fscanr(
blastx_output=blastx_data,
mismatch_cutoff=5, # Stricter mismatch filter
evalue_cutoff=1e-6, # Stricter E-value filter
frameDist_cutoff=15 # Allow larger frame distances
)
3.2 extract_prf_regions()
- Extract Sequences Around PRF Sites
Function Signature:
def extract_prf_regions(mrna_file: str, prf_data: pd.DataFrame) -> pd.DataFrame
Parameters:
mrna_file
: Path to mRNA sequences file (FASTA format)prf_data
: DataFrame fromfscanr()
output
Returns:
pd.DataFrame
: Extracted sequences with columns:DNA_seqid
: Sequence identifierFS_start
,FS_end
: Frameshift positionsStrand
: Strand orientation399bp
: Extracted 399bp sequenceFS_type
: Frameshift type
Usage:
from FScanpy.utils import extract_prf_regions
# Extract sequences around PRF sites
prf_sequences = extract_prf_regions(
mrna_file='sequences.fasta',
prf_data=prf_sites
)
# Predict PRF probabilities
predictor = PRFPredictor()
results = predictor.predict_regions(prf_sequences['399bp'])
4. Data Access Functions
4.1 Test Data Access
from FScanpy.data import get_test_data_path, list_test_data
# List available test data
list_test_data()
# Get test data paths
blastx_file = get_test_data_path('blastx_example.xlsx')
mrna_file = get_test_data_path('mrna_example.fasta')
region_file = get_test_data_path('region_example.csv')
seq_file = get_test_data_path('full_seq.xlsx')
Complete Workflow Examples
Workflow 1: Full Sequence Analysis
from FScanpy import predict_prf, plot_prf_prediction
import matplotlib.pyplot as plt
# Define sequence
full_seq = pd.read.excel(seq_file)
# Method 1: Simple prediction
results = predict_prf(sequence=full_seq[0]['full_seq'])
print(f"Found {len(results)} potential sites")
# Method 2: Prediction with visualization
results, fig = plot_prf_prediction(
sequence=sequence=full_seq[0]['full_seq'],
window_size=1, # Scan every position
short_threshold=0.3, # Display sites above 0.3
long_threshold=0.4, # Display sites above 0.4
ensemble_weight=0.4, # 4:6 weight ratio
title="PRF Analysis Results",
save_path="prf_analysis.png"
)
plt.show()
# Analyze top predictions
top_sites = results.nlargest(5, 'Ensemble_Probability')
print("Top 5 predicted sites:")
for _, site in top_sites.iterrows():
print(f"Position {site['Position']}: {site['Ensemble_Probability']:.3f}")
Workflow 2: Region-based Prediction
from FScanpy import predict_prf
import pandas as pd
# Prepare region data
region_data = pd.DataFrame({
'sample_id': ['sample1', 'sample2', 'sample3'],
'Long_Sequence': [
'ATGCGT...', # 399bp sequence 1
'GCTATAG...', # 399bp sequence 2
'TTACGGA...' # 399bp sequence 3
],
'known_label': [1, 0, 1] # Optional: known labels for validation
})
# Predict PRF probabilities
results = predict_prf(
data=region_data,
ensemble_weight=0.3 # Favor long model (3:7 ratio)
)
# Evaluate results
if 'known_label' in results.columns:
threshold = 0.5
predictions = (results['Ensemble_Probability'] > threshold).astype(int)
accuracy = (predictions == results['known_label']).mean()
print(f"Accuracy at threshold {threshold}: {accuracy:.3f}")
Workflow 3: BLASTX-based Analysis Pipeline
from FScanpy import PRFPredictor, predict_prf
from FScanpy.data import get_test_data_path
from FScanpy.utils import fscanr, extract_prf_regions
import pandas as pd
# Step 1: Load BLASTX data
blastx_data = pd.read_excel(get_test_data_path('blastx_example.xlsx'))
print(f"Loaded {len(blastx_data)} BLASTX hits")
# Step 2: Detect PRF sites using FScanR
prf_sites = fscanr(
blastx_output=blastx_data,
mismatch_cutoff=10,
evalue_cutoff=1e-5,
frameDist_cutoff=10
)
print(f"Detected {len(prf_sites)} potential PRF sites")
# Step 3: Extract sequences around PRF sites
mrna_file = get_test_data_path('mrna_example.fasta')
prf_sequences = extract_prf_regions(
mrna_file=mrna_file,
prf_data=prf_sites
)
print(f"Extracted {len(prf_sequences)} sequences")
# Step 4: Predict PRF probabilities
predictor = PRFPredictor()
results = predictor.predict_regions(
sequences=prf_sequences['399bp'],
ensemble_weight=0.4
)
# Step 5: Combine results with metadata
final_results = pd.concat([
prf_sequences.reset_index(drop=True),
results.reset_index(drop=True)
], axis=1)
# Step 6: Analyze results
high_prob_sites = final_results[
final_results['Ensemble_Probability'] > 0.7
]
print(f"High probability PRF sites: {len(high_prob_sites)}")
# Display top results
print("\nTop PRF predictions:")
top_results = final_results.nlargest(3, 'Ensemble_Probability')
for _, row in top_results.iterrows():
print(f"Sequence {row['DNA_seqid']}: {row['Ensemble_Probability']:.3f}")
Workflow 4: Custom Analysis with Multiple Sequences
from FScanpy import predict_prf, plot_prf_prediction
import matplotlib.pyplot as plt
# Multiple sequence analysis
sequences = [
"ATGCGTACGTATGCGTACGTATGCGTACGTAAGCCCTTTGAACCCAAAGGG",
"GCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCAT",
"TTACGGATTACGGATTACGGATTACGGATTACGGATTACGGATTACGGAT"
]
# Batch prediction
results = predict_prf(
sequence=sequences,
window_size=2,
ensemble_weight=0.5 # Equal weights
)
# Analyze per sequence
for seq_id in results['Sequence_ID'].unique():
seq_results = results[results['Sequence_ID'] == seq_id]
max_prob = seq_results['Ensemble_Probability'].max()
print(f"{seq_id}: Max probability = {max_prob:.3f}")
# Visualize first sequence
first_seq_results, fig = plot_prf_prediction(
sequence=sequences[0],
ensemble_weight=0.5,
title="First Sequence Analysis"
)
plt.show()
Parameter Optimization Guidelines
1. Ensemble Weight Selection
- Conservative (favor specificity):
ensemble_weight = 0.2-0.3
(favor long model) - Balanced:
ensemble_weight = 0.4-0.6
(default recommended) - Sensitive (favor sensitivity):
ensemble_weight = 0.7-0.8
(favor short model)
2. Threshold Selection
- Short threshold: Usually 0.1-0.3, controls computational efficiency
- Display thresholds: 0.3-0.8, controls visualization display
- Classification threshold: 0.5 (standard), adjust based on validation data
3. Window Size Selection
- Fine-grained analysis:
window_size = 1
(every position) - Standard analysis:
window_size = 3
(every 3rd position, default) - Coarse analysis:
window_size = 6-9
(faster, less detailed)
Troubleshooting
Common Issues and Solutions
-
Model Loading Errors
# Check model directory import FScanpy predictor = PRFPredictor(model_dir='/custom/path')
-
Memory Issues with Large Sequences
# Use larger window size to reduce computational load results = predict_prf(sequence=large_seq, window_size=9)
-
Visualization Issues
# Adjust figure parameters results, fig = plot_prf_prediction( sequence=seq, figsize=(20, 10), # Larger figure dpi=150 # Lower resolution )
-
Input Format Issues
# Ensure proper DataFrame format data = pd.DataFrame({ 'Long_Sequence': sequences, # Use 'Long_Sequence' or '399bp' 'sample_id': ids })
Performance Optimization
1. Batch Processing
# Process multiple sequences efficiently
sequences = ["seq1", "seq2", "seq3", ...]
results = predict_prf(sequence=sequences, window_size=3)
2. Threshold Optimization
# Use appropriate short_threshold to skip unnecessary long model calls
results = predict_prf(
sequence=sequence,
short_threshold=0.2 # Higher threshold = faster processing
)
3. Memory Management
# For very large datasets, process in chunks
chunk_size = 100
for i in range(0, len(large_dataset), chunk_size):
chunk = large_dataset[i:i+chunk_size]
chunk_results = predict_prf(data=chunk)
# Process chunk_results
Citation
If you use FScanpy, please cite our paper: [Paper Link]