Improve package introduction

Chenlab 2025-06-11 21:44:29 +08:00
parent 089df9c4a6
commit a2eefd1902
3 changed files with 1022 additions and 269 deletions

@ -22,6 +22,84 @@
"- **region_example.csv**: Sample for individual site prediction"
]
},
{
"cell_type": "raw",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"## 📚 FScanpy Function Usage Guide\n",
"\n",
"### Core Functions Overview\n",
"\n",
"FScanpy provides several main functions for PRF prediction:\n",
"\n",
"#### 1. `predict_prf()` - Universal Prediction Function\n",
"```python\n",
"# Single sequence prediction\n",
"results = predict_prf(sequence=\"ATGCGT...\", window_size=3, ensemble_weight=0.4)\n",
"\n",
"# Multiple sequences prediction \n",
"results = predict_prf(sequence=[\"seq1\", \"seq2\"], window_size=3)\n",
"\n",
"# DataFrame region prediction\n",
"results = predict_prf(data=df_with_399bp_column, ensemble_weight=0.4)\n",
"```\n",
"\n",
"#### 2. `plot_prf_prediction()` - Prediction with Visualization\n",
"```python\n",
"# Basic plotting\n",
"results, fig = plot_prf_prediction(sequence=\"ATGCGT...\")\n",
"\n",
"# Custom parameters\n",
"results, fig = plot_prf_prediction(\n",
" sequence=\"ATGCGT...\",\n",
" window_size=1,\n",
" short_threshold=0.65,\n",
" long_threshold=0.8,\n",
" ensemble_weight=0.4,\n",
" save_path=\"plot.png\"\n",
")\n",
"```\n",
"\n",
"#### 3. `PRFPredictor` Class Methods\n",
"```python\n",
"predictor = PRFPredictor()\n",
"\n",
"# Sliding window prediction\n",
"results = predictor.predict_sequence(sequence, window_size=3, ensemble_weight=0.4)\n",
"\n",
"# Region prediction\n",
"results = predictor.predict_regions(sequences_399bp, ensemble_weight=0.4)\n",
"\n",
"# Single position prediction\n",
"result = predictor.predict_single_position(fs_period_33bp, full_seq_399bp)\n",
"\n",
"# Plot prediction\n",
"results, fig = predictor.plot_sequence_prediction(sequence)\n",
"```\n",
"\n",
"#### 4. Utility Functions\n",
"```python\n",
"from FScanpy.utils import fscanr, extract_prf_regions\n",
"\n",
"# Detect PRF sites from BLASTX\n",
"prf_sites = fscanr(blastx_df, mismatch_cutoff=10, evalue_cutoff=1e-5)\n",
"\n",
"# Extract sequences around PRF sites\n",
"prf_sequences = extract_prf_regions(mrna_file, prf_sites)\n",
"```\n",
"\n",
"### Parameter Guidelines\n",
"\n",
"- **ensemble_weight**: 0.4 (default, balanced), 0.2-0.3 (conservative), 0.7-0.8 (sensitive)\n",
"- **window_size**: 1 (detailed), 3 (standard), 6-9 (fast)\n",
"- **short_threshold**: 0.1 (default), 0.2-0.3 (stricter filtering)\n",
"- **Display thresholds**: 0.3-0.8 for visualization filtering\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -224,13 +302,13 @@
"source": [
"# Run FScanR analysis\n",
"print(\"🔍 Running FScanR analysis...\")\n",
"print(\"Parameter settings: mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=100\")\n",
"print(\"Parameter settings: mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10\")\n",
"\n",
"fscanr_results = fscanr(\n",
" blastx_data,\n",
" mismatch_cutoff=10,\n",
" evalue_cutoff=1e-5,\n",
" frameDist_cutoff=100\n",
" frameDist_cutoff=10\n",
")\n",
"\n",
"print(f\"\\n✅ FScanR analysis complete!\")\n",
@ -644,6 +722,170 @@
"print(\"The chart contains heatmaps and bar charts showing the PRF prediction probability distribution across the entire sequence.\")"
]
},
{
"cell_type": "raw",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"## 📖 Complete Function Reference\n",
"\n",
"### All Available Functions and Methods\n",
"\n",
"#### Core Prediction Functions\n",
"\n",
"**1. `predict_prf(sequence=None, data=None, window_size=3, short_threshold=0.1, ensemble_weight=0.4, model_dir=None)`**\n",
"- **Purpose**: Universal prediction function for both sliding window and region-based analysis\n",
"- **Input modes**: \n",
" - Single/multiple sequences → sliding window prediction\n",
" - DataFrame with 'Long_Sequence'/'399bp' column → region prediction\n",
"- **Key parameters**:\n",
" - `ensemble_weight`: Short model weight (0.0-1.0, default: 0.4)\n",
" - `window_size`: Scanning step size (default: 3)\n",
" - `short_threshold`: Filtering threshold (default: 0.1)\n",
"\n",
"**2. `plot_prf_prediction(sequence, window_size=3, short_threshold=0.65, long_threshold=0.8, ensemble_weight=0.4, title=None, save_path=None, figsize=(12,8), dpi=300)`**\n",
"- **Purpose**: Prediction with built-in visualization (3-subplot layout: FS site heatmap, prediction heatmap, bar chart)\n",
"- **Returns**: (prediction_results_df, matplotlib_figure)\n",
"- **Visualization features**: \n",
" - Black bars with alpha=0.6\n",
" - 'Reds' colormap for heatmaps\n",
" - Height ratios [0.1, 0.1, 1] for subplots\n",
"\n",
"#### PRFPredictor Class Methods\n",
"\n",
"**3. Class initialization: `PRFPredictor(model_dir=None)`**\n",
"- Loads HistGradientBoosting (short, 33bp) and BiLSTM-CNN (long, 399bp) models\n",
"- Uses ensemble weighting for final predictions\n",
"\n",
"**4. `predictor.predict_sequence(sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)`**\n",
"- **Purpose**: Sliding window analysis of complete sequences\n",
"- **Process**: Scans sequence with specified window size, applies both models\n",
"\n",
"**5. `predictor.predict_regions(sequences, short_threshold=0.1, ensemble_weight=0.4)`**\n",
"- **Purpose**: Batch prediction for pre-defined 399bp regions\n",
"- **Input**: List/Series of 399bp sequences\n",
"- **Efficient**: Direct region analysis without sliding window\n",
"\n",
"**6. `predictor.predict_single_position(fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)`**\n",
"- **Purpose**: Single position analysis\n",
"- **Inputs**: 33bp sequence (fs_period) + 399bp sequence (full_seq)\n",
"- **Returns**: Dictionary with individual and ensemble probabilities\n",
"\n",
"**7. `predictor.plot_sequence_prediction(...)`** \n",
"- **Purpose**: Class method version of plot_prf_prediction()\n",
"- **Same parameters** as standalone function\n",
"\n",
"#### Utility Functions\n",
"\n",
"**8. `fscanr(blastx_output, mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10)`**\n",
"- **Purpose**: Detect PRF sites from BLASTX alignment results\n",
"- **Input**: DataFrame with BLASTX columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore, qframe, sframe)\n",
"- **Output**: PRF sites with FS_start, FS_end, FS_type, Strand information\n",
"\n",
"**9. `extract_prf_regions(mrna_file, prf_data)`**\n",
"- **Purpose**: Extract 399bp sequences around detected PRF sites\n",
"- **Inputs**: FASTA file path + FScanR results DataFrame\n",
"- **Handles**: Strand orientation (reverse complement for '-' strand)\n",
"\n",
"#### Data Access Functions\n",
"\n",
"**10. `get_test_data_path(filename)`**\n",
"- **Purpose**: Get path to built-in test data files\n",
"- **Available files**: 'blastx_example.xlsx', 'mrna_example.fasta', 'region_example.csv'\n",
"\n",
"**11. `list_test_data()`**\n",
"- **Purpose**: Display all available test data files\n",
"\n",
"### Usage Pattern Examples\n",
"\n",
"#### Pattern 1: Quick Single Sequence Analysis\n",
"```python\n",
"from FScanpy import predict_prf, plot_prf_prediction\n",
"\n",
"# Simple prediction\n",
"results = predict_prf(sequence=\"ATGCGT...\")\n",
"\n",
"# With visualization \n",
"results, fig = plot_prf_prediction(sequence=\"ATGCGT...\")\n",
"```\n",
"\n",
"#### Pattern 2: Batch Sequence Analysis\n",
"```python\n",
"sequences = [\"seq1\", \"seq2\", \"seq3\"]\n",
"results = predict_prf(sequence=sequences, ensemble_weight=0.5)\n",
"```\n",
"\n",
"#### Pattern 3: BLASTX Pipeline\n",
"```python\n",
"from FScanpy.utils import fscanr, extract_prf_regions\n",
"\n",
"# Step 1: Detect PRF sites\n",
"prf_sites = fscanr(blastx_df)\n",
"\n",
"# Step 2: Extract sequences\n",
"prf_sequences = extract_prf_regions(fasta_file, prf_sites)\n",
"\n",
"# Step 3: Predict probabilities\n",
"results = predict_prf(data=prf_sequences)\n",
"```\n",
"\n",
"#### Pattern 4: Custom Analysis with PRFPredictor\n",
"```python\n",
"from FScanpy import PRFPredictor\n",
"\n",
"predictor = PRFPredictor()\n",
"\n",
"# Method chaining for different analysis types\n",
"seq_results = predictor.predict_sequence(sequence)\n",
"region_results = predictor.predict_regions(sequences_399bp)\n",
"single_result = predictor.predict_single_position(seq_33bp, seq_399bp)\n",
"```\n",
"\n",
"### Parameter Optimization Guide\n",
"\n",
"**Ensemble Weight Selection:**\n",
"- `0.2-0.3`: Conservative (high specificity, favor long model)\n",
"- `0.4-0.6`: Balanced (recommended default)\n",
"- `0.7-0.8`: Sensitive (high sensitivity, favor short model)\n",
"\n",
"**Window Size Selection:**\n",
"- `1`: High resolution, every position (slow but detailed)\n",
"- `3`: Standard resolution (balanced speed/detail) \n",
"- `6-9`: Low resolution, faster analysis\n",
"\n",
"**Threshold Guidelines:**\n",
"- `short_threshold`: 0.1-0.3 (controls efficiency by filtering low-probability candidates)\n",
"- Display thresholds: 0.3-0.8 (controls visualization, higher = cleaner plots)\n",
"- Classification threshold: 0.5 (standard binary classification cutoff)\n",
"\n",
"### Output Interpretation\n",
"\n",
"**Main Result Columns:**\n",
"- `Short_Probability`: HistGradientBoosting model prediction (0-1)\n",
"- `Long_Probability`: BiLSTM-CNN model prediction (0-1)\n",
"- `Ensemble_Probability`: **Final prediction** (weighted combination)\n",
"- `Position`: Sequence position (sliding window mode)\n",
"- `Codon`: Codon at position (sliding window mode)\n",
"\n",
"**Ensemble Probability Interpretation:**\n",
"- `> 0.8`: High confidence PRF site\n",
"- `0.5-0.8`: Moderate confidence PRF site \n",
"- `0.3-0.5`: Low confidence, worth investigating\n",
"- `< 0.3`: Unlikely to be PRF site\n",
"\n",
"### Best Practices\n",
"\n",
"1. **For exploration**: Use `window_size=1, ensemble_weight=0.4`\n",
"2. **For screening**: Use `window_size=3, ensemble_weight=0.4, short_threshold=0.2`\n",
"3. **For validation**: Use region-based prediction with known sequences\n",
"4. **For visualization**: Adjust `short_threshold` and `long_threshold` in plotting functions to control display density\n",
"\n",
"This demo covers all major FScanpy functionalities. For detailed parameter descriptions and advanced usage, please refer to the complete tutorial documentation.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},

447
README.md

@ -1,259 +1,286 @@
# FScanpy
## A Machine Learning-Based Framework for Programmed Ribosomal Frameshifting Prediction
FScanpy is a comprehensive Python package designed for the prediction of [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. By integrating advanced machine learning approaches (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework, FScanpy provides robust and accurate PRF site predictions. The package requires input sequences to be in the positive (5' to 3') orientation.
[![Python](https://img.shields.io/badge/Python-3.7%2B-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
FScanpy is a comprehensive Python package designed for the prediction of [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. By integrating advanced machine learning approaches (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework, FScanpy provides robust and accurate PRF site predictions.
![FScanpy Architecture](/tutorial/image/structure.jpeg)
For detailed documentation and usage examples, please refer to our [tutorial](tutorial/tutorial.md).
## 🌟 Key Features
## 🚀 What's New in v0.3.0
### 🎯 **Dual-Model Architecture**
- **Short Model** (`HistGradientBoosting`): Fast screening with 33bp sequences
- **Long Model** (`BiLSTM-CNN`): Deep analysis with 399bp sequences
- **Ensemble Prediction**: Customizable model weights for optimal performance
### Model Naming Optimization
- **Short Model** (`short.pkl`): HistGradientBoosting model for rapid screening
- **Long Model** (`long.pkl`): BiLSTM-CNN model for detailed analysis
- **Unified Interface**: Consistent parameter naming and clearer output fields
### 🚀 **Versatile Input Support**
- **Single/Multiple Sequences**: Sliding window prediction across full sequences
- **Region-Based Analysis**: Direct prediction on pre-extracted 399bp regions
- **BLASTX Integration**: Seamless workflow with FScanR pipeline
- **Cross-Species Compatibility**: Built-in databases for viruses, marine phages, Euplotes, etc.
### Performance Improvements
- **Faster Prediction**: Optimized model type detection and reduced redundant operations
- **Better Error Handling**: More informative error messages and robust exception handling
- **Code Quality**: Reduced code duplication and improved maintainability
### 📊 **Advanced Visualization**
- **Interactive Heatmaps**: FS site probability visualization
- **Prediction Plots**: Combined probability and confidence displays
- **Customizable Thresholds**: Separate filtering for each model
- **Export Options**: PNG, PDF, and interactive formats
### 🎨 New Visualization Features
- **Sequence Plotting**: Built-in function for visualizing PRF prediction results
- **Dual Threshold Filtering**: Separate filtering for Short and Long models
- **Interactive Graphics**: Heatmap and bar chart visualization
- **Export Options**: Support for PNG and PDF output formats
### ⚡ **High Performance**
- **Optimized Algorithms**: Efficient sliding window scanning
- **Batch Processing**: Handle multiple sequences simultaneously
- **Flexible Thresholds**: Tunable sensitivity for different use cases
- **Memory Efficient**: Optimized for large-scale genomic data
### ⚖️ Ensemble Weighting System
- **Flexible Ensemble**: Control the contribution of Short and Long models
- **Weight Validation**: Automatic parameter validation and error handling
- **Clear Naming**: `ensemble_weight` parameter for intuitive usage
- **Visual Feedback**: Weight ratios displayed in plots and results
## 🔧 Installation
### 🔧 API Improvements
- **Method Renaming**: More intuitive method names
- `predict_sequence()`: Replaces `predict_full()` for sequence prediction
- `predict_regions()`: Replaces `predict_region()` for batch prediction
- **Field Standardization**: Consistent output field naming
- `Ensemble_Probability`: Main prediction result (replaces `Voting_Probability`)
- `Short_Sequence` / `Long_Sequence`: Clear sequence field names
- **Backward Compatibility**: Deprecated methods still work with warnings
## Core Features
- **Sequence Feature Extraction**: Support for extracting features from nucleic acid sequences, including base composition, k-mer features, and positional features.
- **Frameshift Hotspot Region Prediction**: Predict potential PRF sites in nucleotide sequences using machine learning models.
- **Feature Extraction**: Extract relevant features from sequences to assist in prediction.
- **Cross-Species Support**: Built-in databases for viruses, marine phages, Euplotes, etc., enabling PRF prediction across various species.
- **Visualization Tools**: Built-in plotting functions for result visualization and analysis.
- **Ensemble Modeling**: Customizable ensemble weights for different prediction strategies.
## Main Advantages
- **High Accuracy**: Integrates multiple machine learning models to provide accurate PRF site predictions.
- **Efficiency**: Utilizes a sliding window approach and feature extraction techniques to rapidly scan sequences.
- **Versatility**: Supports PRF prediction across various species and can be combined with the [FScanR](https://github.com/seanchen607/FScanR.git) framework for enhanced accuracy.
- **User-Friendly**: Comes with detailed documentation and usage examples, making it easy for researchers to use.
- **Flexible**: Offers several scanning resolutions to suit different use cases.
## Quick Start
### Basic Prediction
```python
from FScanpy import predict_prf
# Single sequence prediction with default ensemble weights (0.4:0.6)
sequence = "ATGCGTACGT..."
results = predict_prf(sequence=sequence)
print(results[['Position', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']].head())
```
### Custom Ensemble Weighting
```python
# Adjust model weights for different prediction strategies
results_long_dominant = predict_prf(sequence=sequence, ensemble_weight=0.3) # 3:7 ratio (Long dominant)
results_equal_weight = predict_prf(sequence=sequence, ensemble_weight=0.5) # 5:5 ratio (Equal weight)
results_short_dominant = predict_prf(sequence=sequence, ensemble_weight=0.7) # 7:3 ratio (Short dominant)
# Compare ensemble probabilities
print("Long dominant:", results_long_dominant['Ensemble_Probability'].mean())
print("Equal weight:", results_equal_weight['Ensemble_Probability'].mean())
print("Short dominant:", results_short_dominant['Ensemble_Probability'].mean())
```
### Visualization with Custom Weights
```python
from FScanpy import plot_prf_prediction
import matplotlib.pyplot as plt
# Generate prediction plot with custom ensemble weighting
sequence = "ATGCGTACGT..."
results, fig = plot_prf_prediction(
sequence=sequence,
short_threshold=0.65, # HistGB threshold
long_threshold=0.8, # BiLSTM-CNN threshold
ensemble_weight=0.3, # Custom weight: 30% Short, 70% Long
title="Long-Dominant Ensemble PRF Prediction (3:7)",
save_path="prediction_result.png"
)
plt.show()
```
### Advanced Usage with New API
```python
from FScanpy import PRFPredictor
import matplotlib.pyplot as plt
# Create predictor instance
predictor = PRFPredictor()
# Use new sequence prediction method
results = predictor.predict_sequence(
sequence=sequence,
ensemble_weight=0.4
)
# Compare different ensemble configurations
weights = [0.2, 0.4, 0.6, 0.8]
weight_names = ["Long 80%", "Balanced", "Short 60%", "Short 80%"]
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()
for i, (weight, name) in enumerate(zip(weights, weight_names)):
    results = predictor.predict_sequence(sequence=sequence, ensemble_weight=weight)
    ax = axes[i]
    ax.bar(results['Position'], results['Ensemble_Probability'], alpha=0.7)
    ax.set_title(f'{name} (Weight: {weight:.1f}:{1-weight:.1f})')
    ax.set_ylabel('Probability')
plt.tight_layout()
plt.show()
```
### Batch Region Prediction
```python
# Predict multiple 399bp sequences
import pandas as pd
data = pd.DataFrame({
'Long_Sequence': ['ATGCGT...' * 60, 'GCTATAG...' * 57] # illustrative placeholders; each real entry must be a 399bp sequence
})
results = predict_prf(data=data, ensemble_weight=0.4)
print(results[['Ensemble_Probability', 'Ensemble_Weights']].head())
```
## Installation Requirements
### Prerequisites
- Python ≥ 3.7
- Dependencies are automatically handled during installation
- All dependencies are automatically installed
### Option 1: Install via pip
### Install via pip (Recommended)
```bash
pip install FScanpy
```
### Option 2: Install from source
### Install from Source
```bash
git clone git@60.204.158.188:yyh/FScanpy-package.git
git clone https://github.com/your-org/FScanpy-package.git
cd FScanpy-package
pip install -e .
```
## 🔄 Migration from Previous Versions
## 🚀 Quick Start
### API Changes Summary
### Basic Usage
```python
# OLD API (deprecated but still works)
results = predict_prf(sequence="ATGC...", short_weight=0.4)
results = predictor.predict_full(sequence, short_weight=0.4)
results = predictor.predict_region(sequences, short_weight=0.4)
from FScanpy import predict_prf
# NEW API (recommended)
results = predict_prf(sequence="ATGC...", ensemble_weight=0.4)
results = predictor.predict_sequence(sequence, ensemble_weight=0.4)
results = predictor.predict_regions(sequences, ensemble_weight=0.4)
# Simple sequence prediction
sequence = "ATGCGTACGTTAGC..." # Your DNA sequence
results = predict_prf(sequence=sequence)
# Output field changes
# OLD: 'Voting_Probability', 'Weight_Info', '33bp', '399bp'
# NEW: 'Ensemble_Probability', 'Ensemble_Weights', 'Short_Sequence', 'Long_Sequence'
# View top predictions
print(results[['Position', 'Ensemble_Probability', 'Short_Probability', 'Long_Probability']].head(10))
```
# Visualization with ensemble weights
### Visualization
```python
from FScanpy import plot_prf_prediction
# Generate prediction plot
results, fig = plot_prf_prediction(
sequence="ATGC...",
short_threshold=0.65,
long_threshold=0.8,
ensemble_weight=0.3 # 30% Short, 70% Long
sequence=sequence,
short_threshold=0.65, # HistGB threshold
long_threshold=0.8, # BiLSTM-CNN threshold
ensemble_weight=0.4, # 40% Short, 60% Long
title="PRF Prediction Results"
)
```
### Backward Compatibility
- All old methods still work but will show deprecation warnings
- Old field names are automatically added for compatibility
- Gradual migration is supported
### Advanced Usage
```python
from FScanpy import PRFPredictor
import pandas as pd
## Ensemble Weight Configuration Guide
# Create predictor instance
predictor = PRFPredictor()
### Recommended Weights for Different Scenarios:
# Batch prediction on pre-extracted regions
data = pd.DataFrame({
'Long_Sequence': ['ATGCGT...' * 60, 'GCTATAG...' * 57] # illustrative placeholders; each real entry must be a 399bp sequence
})
results = predictor.predict_regions(data, ensemble_weight=0.4)
| Scenario | ensemble_weight | Description | Use Case |
|----------|----------------|-------------|----------|
| **High Sensitivity** | 0.2-0.3 | Long model dominant | Detecting subtle PRF sites |
| **Balanced Detection** | 0.4-0.5 | Balanced ensemble (recommended) | General purpose prediction |
| **Fast Screening** | 0.6-0.7 | Short model dominant | Rapid initial screening |
| **Equal Contribution** | 0.5 | Equal weight to both models | Comparative analysis |
# Sequence-level prediction with custom parameters
results = predictor.predict_sequence(
sequence=sequence,
window_size=1, # Step size for sliding window
ensemble_weight=0.3, # Model weighting
short_threshold=0.5 # Filtering threshold
)
```
### Weight Selection Guidelines:
- **Low ensemble_weight (0.2-0.3)**:
- Emphasizes Long model (BiLSTM-CNN)
- Better for detecting complex patterns
- Higher sensitivity, may have more false positives
- **High ensemble_weight (0.6-0.8)**:
- Emphasizes Short model (HistGB)
- Faster computation
- Good for initial screening
- Higher specificity, may miss subtle sites
- **Balanced (0.4-0.5)**:
- Recommended for most applications
- Good balance of sensitivity and specificity
- Suitable for comprehensive analysis
## 🎛️ Ensemble Weight Configuration
## Output Field Reference
The `ensemble_weight` parameter controls the contribution of each model:
### Main Prediction Fields
- **`Short_Probability`**: HistGradientBoosting model prediction (0-1)
- **`Long_Probability`**: BiLSTM-CNN model prediction (0-1)
- **`Ensemble_Probability`**: Final ensemble prediction (primary result)
- **`Ensemble_Weights`**: Weight configuration information
| ensemble_weight | Short Model | Long Model | Best For |
|----------------|-------------|------------|----------|
| **0.2-0.3** | 20-30% | 70-80% | **High sensitivity**, detecting subtle sites |
| **0.4-0.5** | 40-50% | 50-60% | **Balanced detection** (recommended) |
| **0.6-0.7** | 60-70% | 30-40% | **Fast screening**, high specificity |
### Sequence Fields
- **`Short_Sequence`**: 33bp sequence used by Short model
- **`Long_Sequence`**: 399bp sequence used by Long model
### Weight Selection Examples
```python
# High sensitivity (Long model dominant)
sensitive_results = predict_prf(sequence, ensemble_weight=0.2)
# Balanced approach (recommended)
balanced_results = predict_prf(sequence, ensemble_weight=0.4)
# Fast screening (Short model dominant)
screening_results = predict_prf(sequence, ensemble_weight=0.7)
```
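
The weighting itself can be sketched in a few lines. This is a minimal illustration that assumes the ensemble is a plain linear blend, with `ensemble_weight` applied to the Short model and the remainder to the Long model, as the percentages in the table suggest. `blend_probabilities` is a hypothetical helper, not part of the FScanpy API, and the internal formula may differ.

```python
def blend_probabilities(short_p: float, long_p: float,
                        ensemble_weight: float = 0.4) -> float:
    """Hypothetical helper: linear blend of the two model probabilities.

    Assumes ensemble = w * short + (1 - w) * long, where w is the
    Short model weight. This is an assumption, not FScanpy's code.
    """
    if not 0.0 <= ensemble_weight <= 1.0:
        raise ValueError("ensemble_weight must be between 0.0 and 1.0")
    return ensemble_weight * short_p + (1.0 - ensemble_weight) * long_p


# Same pair of model outputs under three weight settings.
for w in (0.2, 0.4, 0.7):
    print(w, round(blend_probabilities(0.9, 0.5, w), 2))
```

Under this assumption, a lower `ensemble_weight` pulls the result toward the Long model's probability and a higher one toward the Short model's.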
## 📊 Core Functions
### Main Prediction Interface
```python
predict_prf(
sequence=None, # Single/multiple sequences or None
data=None, # DataFrame with 399bp sequences or None
window_size=3, # Sliding window step size
short_threshold=0.1, # Short model filtering threshold
ensemble_weight=0.4, # Short model weight (0.0-1.0)
model_dir=None # Custom model directory
)
```
### Visualization Function
```python
plot_prf_prediction(
sequence, # Input DNA sequence
window_size=3, # Scanning step size
short_threshold=0.65, # Short model threshold for plotting
long_threshold=0.8, # Long model threshold for plotting
ensemble_weight=0.4, # Model weighting
title=None, # Plot title
save_path=None, # Save file path
figsize=(12,8), # Figure size
dpi=300 # Resolution for saved plots
)
```
### PRFPredictor Class Methods
```python
predictor = PRFPredictor()
# Sequence prediction (sliding window)
predictor.predict_sequence(sequence, ensemble_weight=0.4)
# Region prediction (batch processing)
predictor.predict_regions(dataframe, ensemble_weight=0.4)
# Feature extraction
predictor.extract_features(sequences)
# Model information
predictor.get_model_info()
```
## 📈 Output Fields
### Prediction Results
- **`Position`**: Position in the original sequence
- **`Codon`**: 3bp codon at the position
- **`Ensemble_Probability`**: Final ensemble prediction (main result)
- **`Short_Probability`**: HistGradientBoosting prediction (0-1)
- **`Long_Probability`**: BiLSTM-CNN prediction (0-1)
- **`Ensemble_Weights`**: Model weight configuration used
### Metadata Fields
- **`Sequence_ID`**: Identifier for multi-sequence predictions
- Additional fields from input DataFrame (for region predictions)
### Sequence Information
- **`Short_Sequence`**: 33bp sequence for Short model
- **`Long_Sequence`**: 399bp sequence for Long model
- **`Codon`**: 3bp codon at the prediction position
- **`Sequence_ID`**: Identifier for multi-sequence inputs
## Examples
## 🔬 Integration with FScanR
See `example_plot_prediction.py` for comprehensive examples of:
- Basic prediction plotting
- Custom threshold configuration
- Ensemble weight parameter usage and comparison
- New API method demonstrations
- Saving plots to files
- Advanced visualization options
FScanpy works seamlessly with the FScanR pipeline for comprehensive PRF analysis:
## Authors
```python
from FScanpy import fscanr, extract_prf_regions, predict_prf
# Step 1: BLASTX analysis with FScanR
blastx_results = fscanr(
blastx_data,
mismatch_cutoff=10,
evalue_cutoff=1e-5,
frameDist_cutoff=10
)
## Citation
If you utilize FScanpy in your research, please cite our work:
# Step 2: Extract PRF candidate regions
prf_regions = extract_prf_regions(original_sequence, blastx_results)
# Step 3: Predict with FScanpy
final_predictions = predict_prf(data=prf_regions, ensemble_weight=0.4)
```
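
Step 2 is described elsewhere in this commit as extracting 399bp around each detected site and reverse-complementing hits on the '-' strand. A rough sketch of that idea follows. The function name `extract_region`, the 0-based `site` coordinate, centring the window on the site, and padding with `N` are all assumptions made for illustration; they are not the documented behaviour of `extract_prf_regions`.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(mrna: str, site: int, strand: str = "+",
                   length: int = 399) -> str:
    """Hypothetical helper: return `length` bases centred on `site`.

    `site` is a 0-based index on the + strand. For '-' strand sites the
    sequence is reverse-complemented first and the index is mirrored.
    Positions past either end are padded with 'N' (an assumption).
    """
    if length % 2 == 0:
        raise ValueError("length must be odd so the window has a centre")
    if strand == "-":
        mrna = reverse_complement(mrna)
        site = len(mrna) - 1 - site
    half = length // 2
    start, end = site - half, site + half + 1
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(mrna))
    return left_pad + mrna[max(0, start):min(len(mrna), end)] + right_pad


region = extract_region("ACGT" * 200, site=400)
print(len(region))
```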
## 📚 Documentation
- **[Complete Tutorial](tutorial/tutorial.md)**: Comprehensive usage guide with examples
- **[Demo Notebook](FScanpy_Demo.ipynb)**: Interactive examples and workflows
- **[Example Scripts](example_plot_prediction.py)**: Ready-to-run code examples
## 🎯 Use Cases
### 1. **Viral Genome Analysis**
```python
# Scan viral genome for PRF sites
viral_sequence = load_viral_genome()
prf_sites = predict_prf(viral_sequence, ensemble_weight=0.3)
high_confidence = prf_sites[prf_sites['Ensemble_Probability'] > 0.8]
```
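
The demo notebook in this commit reads `Ensemble_Probability` in four bands (above 0.8, 0.5-0.8, 0.3-0.5, below 0.3). A small helper along these lines could label results accordingly. `confidence_band` is a hypothetical name, the plain dictionaries stand in for result rows, and the treatment of values exactly on a boundary is an assumption.

```python
def confidence_band(probability: float) -> str:
    """Hypothetical helper: map an ensemble probability to a label.

    Band edges follow the demo notebook; which band a boundary value
    (exactly 0.8, 0.5 or 0.3) falls into is an assumption.
    """
    if probability > 0.8:
        return "high"
    if probability >= 0.5:
        return "moderate"
    if probability >= 0.3:
        return "low"
    return "unlikely"


# Stand-in rows; real results come back as a DataFrame.
rows = [
    {"Position": 12, "Ensemble_Probability": 0.91},
    {"Position": 45, "Ensemble_Probability": 0.62},
    {"Position": 78, "Ensemble_Probability": 0.12},
]
for row in rows:
    print(row["Position"], confidence_band(row["Ensemble_Probability"]))
```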
### 2. **Comparative Genomics**
```python
# Compare PRF patterns across species
species_data = pd.DataFrame({
'Species': ['Virus_A', 'Virus_B'],
'Long_Sequence': [seq_a_399bp, seq_b_399bp]
})
comparative_results = predict_prf(data=species_data)
```
### 3. **High-Throughput Screening**
```python
# Fast screening of large sequence datasets
sequences = load_large_dataset()
screening_results = predict_prf(
sequence=sequences,
ensemble_weight=0.7, # Fast screening mode
short_threshold=0.3
)
```
## 🤝 Contributing
We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
## 📝 Citation
If you use FScanpy in your research, please cite:
```bibtex
[Citation details will be added upon publication]
@software{fscanpy2024,
title={FScanpy: A Machine Learning Framework for Programmed Ribosomal Frameshifting Prediction},
author={[Author names]},
year={2024},
url={https://github.com/your-org/FScanpy}
}
```
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🆘 Support
- **Issues**: [GitHub Issues](https://github.com/your-org/FScanpy/issues)
- **Documentation**: [Tutorial](tutorial/tutorial.md)
- **Examples**: [Demo Notebook](FScanpy_Demo.ipynb)
## 🏗️ Dependencies
FScanpy automatically installs all required dependencies:
- `numpy>=1.19.0`
- `pandas>=1.2.0`
- `scikit-learn>=1.0.0`
- `tensorflow>=2.6.0`
- `matplotlib>=3.3.0`
- `seaborn>=0.11.0`
---
**FScanpy** - Advancing programmed ribosomal frameshifting research through machine learning 🧬

View File


@ -1,3 +1,5 @@
# FScanpy Tutorial - Complete Usage Guide
## Abstract
FScanpy is a Python package designed to predict Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. This package integrates machine learning models, sequence feature analysis, and visualization capabilities to help researchers rapidly locate potential PRF sites.
@ -7,6 +9,7 @@ FScanpy is a Python package designed to predict Programmed Ribosomal Frameshifti
FScanpy is a Python package for predicting Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. It integrates machine learning models (Gradient Boosting and BiLSTM-CNN) with the FScanR package to provide precise PRF predictions. It accepts three types of input: the full cDNA/mRNA sequence to be scanned, the nucleotide sequence surrounding a suspected frameshift site, or BLASTX results against a peptide library from the same or a related species. Input sequences are expected to be on the + strand, and FScanpy can be combined with FScanR to improve accuracy.
![Machine learning models](/image/ML.png)
For whole-sequence prediction, FScanpy scans the sequence with a sliding window. Regional prediction is based on the 33-bp and 399-bp sequences in the 0 reading frame around the suspected frameshift site. The Short model (HistGradientBoosting) first scores candidate sites within the scanning window. If the predicted probability exceeds the threshold, the Long model (BiLSTM-CNN) then scores the corresponding 399-bp sequence. Finally, ensemble weighting combines the two model outputs into the final prediction.
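
The two-stage flow described above can be sketched as follows. This is a minimal illustration, not FScanpy's implementation: the models are passed in as plain callables, `window_around` and `scan_sequence` are hypothetical names, and centring the 33-bp and 399-bp windows on each position, padding with `N`, dropping positions that fail the Short-model threshold, and blending linearly are all assumptions.

```python
from typing import Callable, Dict, List


def window_around(seq: str, center: int, length: int) -> str:
    """Return `length` bases centred on `center`, padded with N at the ends."""
    half = length // 2
    start, end = center - half, center + half + 1
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(seq))
    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad


def scan_sequence(seq: str,
                  short_model: Callable[[str], float],
                  long_model: Callable[[str], float],
                  window_size: int = 3,
                  short_threshold: float = 0.1,
                  ensemble_weight: float = 0.4) -> List[Dict[str, object]]:
    """Hypothetical two-stage scan: Short model first, Long model on survivors."""
    rows = []
    for pos in range(0, len(seq) - 2, window_size):
        short_p = short_model(window_around(seq, pos, 33))
        if short_p <= short_threshold:
            continue  # Long model is only run when the threshold is exceeded
        long_p = long_model(window_around(seq, pos, 399))
        rows.append({
            "Position": pos,
            "Codon": seq[pos:pos + 3],
            "Short_Probability": short_p,
            "Long_Probability": long_p,
            # Assumed linear blend; the weight applies to the Short model.
            "Ensemble_Probability": ensemble_weight * short_p
                                    + (1.0 - ensemble_weight) * long_p,
        })
    return rows


# Stand-in scorers so the sketch runs without trained models.
def gc_fraction(window: str) -> float:
    return sum(base in "GC" for base in window) / len(window)


hits = scan_sequence("ATGCGTACGTTAGC" * 5, gc_fraction, lambda window: 0.5)
print(len(hits))
```

The early `continue` is the point of the design: the cheaper Short model acts as a filter, so the Long model only runs on positions that pass `short_threshold`.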
For PRF detection from BLASTX output, [FScanR](https://github.com/seanchen607/FScanR.git) identifies potential PRF sites from BLASTX alignment results: it takes the two hits of the same query sequence and filters them with frameDist_cutoff, mismatch_cutoff, and evalue_cutoff. FScanpy is then used to predict the probability of the resulting PRF sites.
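
As a rough illustration of this kind of filtering, the sketch below pairs adjacent BLASTX hits from the same query. It is not FScanR's algorithm: `candidate_frameshifts` is a hypothetical function, hits are plain dictionaries keyed by standard BLASTX column names, only forward-frame hits are considered, and treating `frameDist_cutoff` as the maximum gap between the end of one hit and the start of the next is an assumption.

```python
from typing import Dict, List, Tuple

Hit = Dict[str, object]


def candidate_frameshifts(hits: List[Hit],
                          mismatch_cutoff: int = 10,
                          evalue_cutoff: float = 1e-5,
                          frameDist_cutoff: int = 10) -> List[Tuple[Hit, Hit]]:
    """Hypothetical sketch: pair nearby hits that lie in different frames."""
    # Stage 1: drop hits with too many mismatches or a weak e-value.
    kept = [h for h in hits
            if h["mismatch"] <= mismatch_cutoff and h["evalue"] <= evalue_cutoff]
    # Stage 2: order hits so neighbours on the same query/subject are adjacent.
    kept.sort(key=lambda h: (h["qseqid"], h["sseqid"], h["qstart"]))
    pairs = []
    for first, second in zip(kept, kept[1:]):
        same_pair = (first["qseqid"] == second["qseqid"]
                     and first["sseqid"] == second["sseqid"])
        if not same_pair or first["qframe"] == second["qframe"]:
            continue
        # Assumed meaning of frameDist_cutoff: maximum gap between the hits.
        if abs(second["qstart"] - first["qend"]) <= frameDist_cutoff:
            pairs.append((first, second))
    return pairs


example_hits = [
    {"qseqid": "q1", "sseqid": "s1", "mismatch": 2, "evalue": 1e-20,
     "qstart": 1, "qend": 100, "qframe": 1},
    {"qseqid": "q1", "sseqid": "s1", "mismatch": 3, "evalue": 1e-15,
     "qstart": 103, "qend": 300, "qframe": 2},
    {"qseqid": "q2", "sseqid": "s1", "mismatch": 1, "evalue": 1.0,
     "qstart": 1, "qend": 90, "qframe": 1},
]
print(len(candidate_frameshifts(example_hits)))
```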
@ -32,102 +35,583 @@ pip install FScanpy
### 2. Clone from GitHub
```bash
git clone https://github.com/.../FScanpy.git
cd your_project_directory
cd FScanpy
pip install -e .
```
## Methods and Usage
## Complete Function Reference
### 1. Core Prediction Functions
#### 1.1 `predict_prf()` - Main Prediction Interface
**Function Signature:**
```python
def predict_prf(
sequence: Union[str, List[str], None] = None,
data: Union[pd.DataFrame, None] = None,
window_size: int = 3,
short_threshold: float = 0.1,
ensemble_weight: float = 0.4,
model_dir: str = None
) -> pd.DataFrame
```
**Parameters:**
- `sequence`: Single or multiple DNA sequences for sliding window prediction
- `data`: DataFrame data, must contain 'Long_Sequence' or '399bp' column for region prediction
- `window_size`: Sliding window size (default: 3, recommended: 1-10)
- `short_threshold`: Short model (HistGB) probability threshold (default: 0.1, range: 0.0-1.0)
- `ensemble_weight`: Weight of short model in ensemble (default: 0.4, range: 0.0-1.0)
- `model_dir`: Model directory path (optional, uses built-in models if None)
**Returns:**
- `pd.DataFrame`: Prediction results with columns:
- `Short_Probability`: Short model prediction probability
- `Long_Probability`: Long model prediction probability
- `Ensemble_Probability`: Ensemble prediction probability (main result)
- `Position`: Position in sequence (for sliding window mode)
- `Codon`: Codon at position (for sliding window mode)
- `Ensemble_Weights`: Weight configuration information
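How the three probability columns relate can be sketched as a weighted average. This assumes `Ensemble_Probability` is a linear combination in which `ensemble_weight` is the share given to the Short model, as the parameter description suggests; the helper name `ensemble_probability` is hypothetical and the exact formula inside FScanpy may differ.

```python
def ensemble_probability(short_p, long_p, ensemble_weight=0.4):
    """Combine two probabilities as a weighted average.

    Assumption: ensemble_weight is the Short model's share and
    (1 - ensemble_weight) is the Long model's share.
    """
    if not 0.0 <= ensemble_weight <= 1.0:
        raise ValueError("ensemble_weight must be between 0.0 and 1.0")
    return ensemble_weight * short_p + (1.0 - ensemble_weight) * long_p


# Short model says 0.5, Long model says 1.0, default 0.4:0.6 weighting
print(round(ensemble_probability(0.5, 1.0), 3))
```

With `ensemble_weight=1.0` the result equals the Short model's probability, and with `0.0` it equals the Long model's.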
**Usage Examples:**
```python
from FScanpy import predict_prf
# 1. Single sequence sliding window prediction
sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
results = predict_prf(sequence=sequence)
# 2. Multiple sequences prediction
sequences = ["ATGCGTACGT...", "GCTATAGCAT..."]
results = predict_prf(sequence=sequences)
# 3. Custom parameters
results = predict_prf(
sequence=sequence,
window_size=1, # Scan every position
short_threshold=0.2, # Higher threshold
ensemble_weight=0.3 # 3:7 ratio (short:long)
)
# 4. DataFrame region prediction
import pandas as pd
data = pd.DataFrame({
'Long_Sequence': ['ATGCGT...', 'GCTATAG...'], # or use '399bp'
'sample_id': ['sample1', 'sample2']
})
results = predict_prf(data=data)
```
#### 1.2 `plot_prf_prediction()` - Prediction with Visualization
**Function Signature:**
```python
def plot_prf_prediction(
sequence: str,
window_size: int = 3,
short_threshold: float = 0.65,
long_threshold: float = 0.8,
ensemble_weight: float = 0.4,
title: str = None,
save_path: str = None,
figsize: tuple = (12, 8),
dpi: int = 300,
model_dir: str = None
) -> tuple
```
**Parameters:**
- `sequence`: Input DNA sequence (string)
- `window_size`: Sliding window size (default: 3)
- `short_threshold`: Short model filtering threshold for heatmap display (default: 0.65)
- `long_threshold`: Long model filtering threshold for heatmap display (default: 0.8)
- `ensemble_weight`: Weight of short model in ensemble (default: 0.4)
- `title`: Plot title (optional, auto-generated if None)
- `save_path`: Save path (optional, saves plot if provided)
- `figsize`: Figure size tuple (default: (12, 8))
- `dpi`: Figure resolution (default: 300)
- `model_dir`: Model directory path (optional)
**Returns:**
- `tuple`: (prediction_results: pd.DataFrame, figure: matplotlib.figure.Figure)
**Usage Examples:**
```python
from FScanpy import plot_prf_prediction
import matplotlib.pyplot as plt
# 1. Basic plotting
sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
results, fig = plot_prf_prediction(sequence)
plt.show()
# 2. Custom thresholds and weights
results, fig = plot_prf_prediction(
sequence,
short_threshold=0.7, # Higher display threshold
long_threshold=0.85, # Higher display threshold
ensemble_weight=0.3, # 3:7 weight ratio
title="Custom Analysis Results",
save_path="analysis.png",
figsize=(15, 10),
dpi=150
)
# 3. High-resolution analysis
results, fig = plot_prf_prediction(
sequence,
window_size=1, # Scan every position
ensemble_weight=0.5, # Equal weights
dpi=600 # High resolution
)
```
### 2. PRFPredictor Class Methods
#### 2.1 Class Initialization
### 1. Load model and test data
```python
from FScanpy import PRFPredictor
from FScanpy.data import get_test_data_path, list_test_data
predictor = PRFPredictor() # load model
list_test_data() # list all the test data
blastx_file = get_test_data_path('blastx_example.xlsx')
mrna_file = get_test_data_path('mrna_example.fasta')
region_example = get_test_data_path('region_example.csv')
# Initialize with default models
predictor = PRFPredictor()
# Initialize with custom model directory
predictor = PRFPredictor(model_dir='/path/to/models')
```
### 2. Predict PRF Sites in a Full Sequence
Use the `predict_sequence()` method to scan the entire sequence:
#### 2.2 `predict_sequence()` - Sliding Window Prediction
**Method Signature:**
```python
def predict_sequence(self, sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)
```
**Parameters:**
- `sequence`: Input DNA sequence
- `window_size`: Sliding window size (default: 3)
- `short_threshold`: Short model probability threshold (default: 0.1)
- `ensemble_weight`: Short model weight in ensemble (default: 0.4)
**Usage:**
```python
predictor = PRFPredictor()
results = predictor.predict_sequence(
    sequence='ATGCGTACGTATGCGTACGTATGCGTACGT',
    window_size=3,        # Scanning window size
    short_threshold=0.1,  # Short model threshold
    ensemble_weight=0.4   # Ensemble weight (Short:Long = 0.4:0.6)
)
# With visualization
results, fig = predictor.plot_sequence_prediction(
    sequence='ATGCGTACGTATGCGTACGTATGCGTACGT',
    ensemble_weight=0.4
)
# Custom parameters: scan every position
results = predictor.predict_sequence(
    sequence="ATGCGTACGT...",
    window_size=1,
    short_threshold=0.15,
    ensemble_weight=0.35
)
```
### 3. Predict PRF in Specific Regions
Use the `predict_regions()` method to predict PRF in known regions of interest:
#### 2.3 `predict_regions()` - Region-based Prediction
**Method Signature:**
```python
import pandas as pd
region_example = pd.read_csv(get_test_data_path('region_example.csv'))
def predict_regions(self, sequences, short_threshold=0.1, ensemble_weight=0.4)
```
**Parameters:**
- `sequences`: List or Series of 399bp sequences
- `short_threshold`: Short model probability threshold (default: 0.1)
- `ensemble_weight`: Short model weight in ensemble (default: 0.4)
**Usage:**
```python
predictor = PRFPredictor()
sequences = ["ATGCGT...", "GCTATAG..."]  # 399bp sequences
results = predictor.predict_regions(
    sequences=sequences,  # or a DataFrame column such as region_example['399bp']
    short_threshold=0.1,
    ensemble_weight=0.4
)
```
### 4. Identify PRF Sites from BLASTX Output
BLASTX Output should contain the following columns: `qseqid`, `sseqid`, `pident`, `length`, `mismatch`, `gapopen`, `qstart`, `qend`, `sstart`, `send`, `evalue`, `bitscore`, `qframe`, `sframe`.
#### 2.4 `predict_single_position()` - Single Position Prediction
Use the FScanR function to identify potential PRF sites from BLASTX alignment results:
**Method Signature:**
```python
def predict_single_position(self, fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)
```
**Parameters:**
- `fs_period`: 33bp sequence around frameshift site
- `full_seq`: 399bp sequence for long model
- `short_threshold`: Short model probability threshold (default: 0.1)
- `ensemble_weight`: Short model weight in ensemble (default: 0.4)
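Because this method expects fixed-length inputs, a quick length check before calling it can catch mistakes early. The sketch below is a hypothetical helper (`check_inputs` is not part of FScanpy), and the accepted alphabet `ACGTN` is an assumption.

```python
VALID_BASES = set("ACGTN")  # assumed alphabet; adjust if your data differs


def check_inputs(fs_period, full_seq, short_len=33, long_len=399):
    """Validate lengths and characters, returning upper-cased sequences."""
    for name, seq, expected in (
        ("fs_period", fs_period, short_len),
        ("full_seq", full_seq, long_len),
    ):
        if len(seq) != expected:
            raise ValueError(f"{name} must be {expected} nt, got {len(seq)}")
        unexpected = set(seq.upper()) - VALID_BASES
        if unexpected:
            raise ValueError(
                f"{name} contains unexpected characters: {sorted(unexpected)}"
            )
    return fs_period.upper(), full_seq.upper()


fs_period, full_seq = check_inputs("atg" * 11, "ATG" * 133)
print(len(fs_period), len(full_seq))
```

The validated pair can then be passed on as `fs_period` and `full_seq`.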
**Usage:**
```python
predictor = PRFPredictor()
result = predictor.predict_single_position(
fs_period="ATGCGTACGTATGCGTACGTATGCGTACGTATG",  # 33bp
full_seq="ATG" * 133,                            # 399bp placeholder sequence
short_threshold=0.1,
ensemble_weight=0.4
)
```
#### 2.5 `plot_sequence_prediction()` - Class Method for Plotting
**Method Signature:**
```python
def plot_sequence_prediction(self, sequence, window_size=3, short_threshold=0.65,
long_threshold=0.8, ensemble_weight=0.4, title=None,
save_path=None, figsize=(12, 8), dpi=300)
```
**Usage:**
```python
predictor = PRFPredictor()
results, fig = predictor.plot_sequence_prediction(
sequence="ATGCGTACGT...",
window_size=3,
ensemble_weight=0.4
)
```
### 3. Utility Functions
#### 3.1 `fscanr()` - PRF Site Detection from BLASTX
**Function Signature:**
```python
def fscanr(
blastx_output: pd.DataFrame,
mismatch_cutoff: float = 10,
evalue_cutoff: float = 1e-5,
frameDist_cutoff: float = 10
) -> pd.DataFrame
```
**Parameters:**
- `blastx_output`: BLASTX output DataFrame with required columns:
- `qseqid`, `sseqid`, `pident`, `length`, `mismatch`, `gapopen`
- `qstart`, `qend`, `sstart`, `send`, `evalue`, `bitscore`, `qframe`, `sframe`
- `mismatch_cutoff`: Maximum allowed mismatches (default: 10)
- `evalue_cutoff`: E-value threshold (default: 1e-5)
- `frameDist_cutoff`: Frame distance threshold (default: 10)
**Returns:**
- `pd.DataFrame`: PRF sites with columns:
- `DNA_seqid`: Sequence identifier
- `FS_start`, `FS_end`: Frameshift start and end positions
- `Pep_seqid`: Peptide sequence identifier
- `Pep_FS_start`, `Pep_FS_end`: Peptide frameshift positions
- `FS_type`: Type of frameshift (-2, -1, 1, 2)
- `Strand`: Strand orientation (+, -)
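The cutoff filtering can be illustrated on a few toy BLASTX hits. This is a deliberately simplified sketch and not the FScanR algorithm: it only applies the mismatch and E-value cutoffs and then reports queries that still have hits to the same subject in more than one query frame. The frame-distance check is omitted, the helper `candidate_queries` is hypothetical, and the hit values are made up.

```python
from collections import defaultdict


def candidate_queries(hits, mismatch_cutoff=10, evalue_cutoff=1e-5):
    """Return query IDs with hits to one subject in more than one frame."""
    # Assumption: a hit is kept when it is at or below both cutoffs.
    kept = [
        h for h in hits
        if h["mismatch"] <= mismatch_cutoff and h["evalue"] <= evalue_cutoff
    ]
    frames = defaultdict(set)
    for h in kept:
        frames[(h["qseqid"], h["sseqid"])].add(h["qframe"])
    return sorted({q for (q, _s), f in frames.items() if len(f) > 1})


toy_hits = [  # invented values for illustration
    {"qseqid": "q1", "sseqid": "p1", "mismatch": 2, "evalue": 1e-20, "qframe": 1},
    {"qseqid": "q1", "sseqid": "p1", "mismatch": 3, "evalue": 1e-15, "qframe": 3},
    {"qseqid": "q2", "sseqid": "p2", "mismatch": 1, "evalue": 1e-30, "qframe": 2},
    {"qseqid": "q3", "sseqid": "p3", "mismatch": 25, "evalue": 1e-10, "qframe": 1},
    {"qseqid": "q3", "sseqid": "p3", "mismatch": 4, "evalue": 1e-12, "qframe": 2},
]
print(candidate_queries(toy_hits))
```

Here `q3` drops out because one of its two hits exceeds the mismatch cutoff, leaving a single frame.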
**Usage:**
```python
import pandas as pd
from FScanpy.data import get_test_data_path
from FScanpy.utils import fscanr
# Bundled example data with the default cutoffs
blastx_output = pd.read_excel(get_test_data_path('blastx_example.xlsx'))
fscanr_result = fscanr(blastx_output,
                       mismatch_cutoff=10,   # Allowed mismatches
                       evalue_cutoff=1e-5,   # E-value threshold
                       frameDist_cutoff=10)  # Frame distance threshold
# Load BLASTX results
blastx_data = pd.read_excel('blastx_results.xlsx')
# Detect PRF sites
prf_sites = fscanr(
    blastx_output=blastx_data,
    mismatch_cutoff=5,    # Stricter mismatch filter
    evalue_cutoff=1e-6,   # Stricter E-value filter
    frameDist_cutoff=15   # Allow larger frame distances
)
```
### 5. Extract PRF Sites and Evaluate
Use the `extract_prf_regions()` method to extract PRF site sequences from mRNA sequences:
#### 3.2 `extract_prf_regions()` - Extract Sequences Around PRF Sites
**Function Signature:**
```python
def extract_prf_regions(mrna_file: str, prf_data: pd.DataFrame) -> pd.DataFrame
```
**Parameters:**
- `mrna_file`: Path to mRNA sequences file (FASTA format)
- `prf_data`: DataFrame from `fscanr()` output
**Returns:**
- `pd.DataFrame`: Extracted sequences with columns:
- `DNA_seqid`: Sequence identifier
- `FS_start`, `FS_end`: Frameshift positions
- `Strand`: Strand orientation
- `399bp`: Extracted 399bp sequence
- `FS_type`: Frameshift type
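The idea of cutting a fixed-length region around a frameshift position can be sketched as follows. This is not the `extract_prf_regions()` implementation: the helper `extract_window`, the 1-based coordinate convention, the centring on `fs_start`, and the handling of the minus strand by reverse complement are all assumptions for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def extract_window(seq, fs_start, strand="+", length=399):
    """Return `length` nt centred on a 1-based position, or None if it
    does not fit inside the sequence (assumed behaviour)."""
    half = length // 2
    center = fs_start - 1                       # 1-based to 0-based
    start, end = center - half, center + half + 1
    if start < 0 or end > len(seq):
        return None
    window = seq[start:end].upper()
    if strand == "-":
        window = window.translate(_COMPLEMENT)[::-1]  # reverse complement
    return window


mrna = "ACGT" * 150  # 600-nt toy transcript
region = extract_window(mrna, fs_start=300)
print(len(region))
```

Positions too close to either end return `None` here; a real pipeline might pad or skip such sites instead.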
**Usage:**
```python
from FScanpy import PRFPredictor
from FScanpy.data import get_test_data_path
from FScanpy.utils import extract_prf_regions
# Bundled example data (fscanr_result is the output of fscanr())
prf_regions = extract_prf_regions(
    mrna_file=get_test_data_path('mrna_example.fasta'),
    prf_data=fscanr_result
)
# Extract sequences around PRF sites
prf_sequences = extract_prf_regions(
    mrna_file='sequences.fasta',
    prf_data=prf_sites
)
# Predict PRF probabilities
predictor = PRFPredictor()
prf_results = predictor.predict_regions(prf_regions['399bp'])
results = predictor.predict_regions(prf_sequences['399bp'])
```
## Complete Workflow Example
### 4. Data Access Functions
#### 4.1 Test Data Access
```python
from FScanpy import PRFPredictor, predict_prf, plot_prf_prediction
from FScanpy.data import get_test_data_path, list_test_data
# List available test data
list_test_data()
# Get test data paths
blastx_file = get_test_data_path('blastx_example.xlsx')
mrna_file = get_test_data_path('mrna_example.fasta')
region_file = get_test_data_path('region_example.csv')
```
## Complete Workflow Examples
### Workflow 1: Full Sequence Analysis
```python
from FScanpy import predict_prf, plot_prf_prediction
import matplotlib.pyplot as plt
# Define sequence
sequence = "ATGCGTACGTATGCGTACGTATGCGTACGTAAGCCCTTTGAACCCAAAGGG"
# Method 1: Simple prediction
results = predict_prf(sequence=sequence)
print(f"Found {len(results)} potential sites")
# Method 2: Prediction with visualization
results, fig = plot_prf_prediction(
sequence=sequence,
window_size=1, # Scan every position
short_threshold=0.3, # Display sites above 0.3
long_threshold=0.4, # Display sites above 0.4
ensemble_weight=0.4, # 4:6 weight ratio
title="PRF Analysis Results",
save_path="prf_analysis.png"
)
plt.show()
# Analyze top predictions
top_sites = results.nlargest(5, 'Ensemble_Probability')
print("Top 5 predicted sites:")
for _, site in top_sites.iterrows():
    print(f"Position {site['Position']}: {site['Ensemble_Probability']:.3f}")
```
### Workflow 2: Region-based Prediction
```python
from FScanpy import predict_prf
import pandas as pd
# Prepare region data
region_data = pd.DataFrame({
'sample_id': ['sample1', 'sample2', 'sample3'],
'Long_Sequence': [
'ATGCGT...', # 399bp sequence 1
'GCTATAG...', # 399bp sequence 2
'TTACGGA...' # 399bp sequence 3
],
'known_label': [1, 0, 1] # Optional: known labels for validation
})
# Predict PRF probabilities
results = predict_prf(
data=region_data,
ensemble_weight=0.3 # Favor long model (3:7 ratio)
)
# Evaluate results
if 'known_label' in results.columns:
    threshold = 0.5
    predictions = (results['Ensemble_Probability'] > threshold).astype(int)
    accuracy = (predictions == results['known_label']).mean()
    print(f"Accuracy at threshold {threshold}: {accuracy:.3f}")
```
### Workflow 3: BLASTX-based Analysis Pipeline
```python
from FScanpy import PRFPredictor
from FScanpy.data import get_test_data_path
from FScanpy.utils import fscanr, extract_prf_regions
import pandas as pd
# Step 1: Load BLASTX data
blastx_data = pd.read_excel(get_test_data_path('blastx_example.xlsx'))
print(f"Loaded {len(blastx_data)} BLASTX hits")
# Step 2: Detect PRF sites using FScanR
prf_sites = fscanr(
    blastx_output=blastx_data,
    mismatch_cutoff=10,
    evalue_cutoff=1e-5,
    frameDist_cutoff=10
)
print(f"Detected {len(prf_sites)} potential PRF sites")
# Step 3: Extract sequences around PRF sites
mrna_file = get_test_data_path('mrna_example.fasta')
prf_sequences = extract_prf_regions(
    mrna_file=mrna_file,
    prf_data=prf_sites
)
print(f"Extracted {len(prf_sequences)} sequences")
# Step 4: Predict PRF probabilities
predictor = PRFPredictor()
results = predictor.predict_regions(
    sequences=prf_sequences['399bp'],
    ensemble_weight=0.4
)
# Step 5: Combine results with metadata
final_results = pd.concat([
    prf_sequences.reset_index(drop=True),
    results.reset_index(drop=True)
], axis=1)
# Step 6: Analyze results
high_prob_sites = final_results[
    final_results['Ensemble_Probability'] > 0.7
]
print(f"High probability PRF sites: {len(high_prob_sites)}")
# Display top results
print("\nTop PRF predictions:")
top_results = final_results.nlargest(3, 'Ensemble_Probability')
for _, row in top_results.iterrows():
    print(f"Sequence {row['DNA_seqid']}: {row['Ensemble_Probability']:.3f}")
```
### Workflow 4: Custom Analysis with Multiple Sequences
```python
from FScanpy import predict_prf, plot_prf_prediction
import matplotlib.pyplot as plt
# Multiple sequence analysis
sequences = [
"ATGCGTACGTATGCGTACGTATGCGTACGTAAGCCCTTTGAACCCAAAGGG",
"GCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCAT",
"TTACGGATTACGGATTACGGATTACGGATTACGGATTACGGATTACGGAT"
]
# Batch prediction
results = predict_prf(
sequence=sequences,
window_size=2,
ensemble_weight=0.5 # Equal weights
)
# Analyze per sequence
for seq_id in results['Sequence_ID'].unique():
    seq_results = results[results['Sequence_ID'] == seq_id]
    max_prob = seq_results['Ensemble_Probability'].max()
    print(f"{seq_id}: Max probability = {max_prob:.3f}")
# Visualize first sequence
first_seq_results, fig = plot_prf_prediction(
sequence=sequences[0],
ensemble_weight=0.5,
title="First Sequence Analysis"
)
plt.show()
```
## Parameter Optimization Guidelines
### 1. Ensemble Weight Selection
- **Conservative (favor specificity)**: `ensemble_weight = 0.2-0.3` (favor long model)
- **Balanced**: `ensemble_weight = 0.4-0.6` (default recommended)
- **Sensitive (favor sensitivity)**: `ensemble_weight = 0.7-0.8` (favor short model)
### 2. Threshold Selection
- **Short threshold**: Usually 0.1-0.3, controls computational efficiency
- **Display thresholds**: 0.3-0.8, controls visualization display
- **Classification threshold**: 0.5 (standard), adjust based on validation data
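Adjusting the classification threshold on validation data can be done with a simple sweep. The sketch below picks the candidate threshold with the highest F1 score; the choice of F1 as the criterion, the helper names, and the toy probabilities and labels are all assumptions for illustration.

```python
def f1_at(threshold, probs, labels):
    """F1 score when predicting positive for probabilities >= threshold."""
    pred = [p >= threshold for p in probs]
    tp = sum(1 for y, yhat in zip(labels, pred) if y == 1 and yhat)
    fp = sum(1 for y, yhat in zip(labels, pred) if y == 0 and yhat)
    fn = sum(1 for y, yhat in zip(labels, pred) if y == 1 and not yhat)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probs, labels, candidates=None):
    """Return the candidate threshold with the highest F1 (first on ties)."""
    if candidates is None:
        candidates = [i / 10 for i in range(1, 10)]  # 0.1 ... 0.9
    return max(candidates, key=lambda t: f1_at(t, probs, labels))


# Invented validation set: ensemble probabilities and known labels
val_probs = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2]
val_labels = [0, 0, 1, 1, 1, 0]
chosen = best_threshold(val_probs, val_labels)
print(chosen, round(f1_at(chosen, val_probs, val_labels), 3))
```

In practice the probabilities would come from the `Ensemble_Probability` column of a labelled validation set.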
### 3. Window Size Selection
- **Fine-grained analysis**: `window_size = 1` (every position)
- **Standard analysis**: `window_size = 3` (every 3rd position, default)
- **Coarse analysis**: `window_size = 6-9` (faster, less detailed)
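The cost trade-off between these settings can be estimated by counting scanned positions. The sketch assumes `window_size` acts as the step between scanned positions, as the "every 3rd position" description suggests, and ignores any trimming at the sequence ends; `positions_scanned` is a hypothetical helper.

```python
def positions_scanned(seq_len, window_size):
    """Number of start positions visited when stepping by window_size."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    return len(range(0, seq_len, window_size))


for size in (1, 3, 9):
    print(size, positions_scanned(3000, size))
```

For a 3000-nt sequence, moving from `window_size=1` to `window_size=9` cuts the number of scanned positions by roughly a factor of nine.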
## Troubleshooting
### Common Issues and Solutions
1. **Model Loading Errors**
```python
# Check model directory
from FScanpy import PRFPredictor
predictor = PRFPredictor(model_dir='/custom/path')
```
2. **Memory Issues with Large Sequences**
```python
# Use larger window size to reduce computational load
results = predict_prf(sequence=large_seq, window_size=9)
```
3. **Visualization Issues**
```python
# Adjust figure parameters
results, fig = plot_prf_prediction(
sequence=seq,
figsize=(20, 10), # Larger figure
dpi=150 # Lower resolution
)
```
4. **Input Format Issues**
```python
# Ensure proper DataFrame format
data = pd.DataFrame({
'Long_Sequence': sequences, # Use 'Long_Sequence' or '399bp'
'sample_id': ids
})
```
## Performance Optimization
### 1. Batch Processing
```python
# Process multiple sequences efficiently
sequences = ["seq1", "seq2", "seq3", ...]
results = predict_prf(sequence=sequences, window_size=3)
```
### 2. Threshold Optimization
```python
# Use appropriate short_threshold to skip unnecessary long model calls
results = predict_prf(
sequence=sequence,
short_threshold=0.2 # Higher threshold = faster processing
)
```
### 3. Memory Management
```python
# For very large datasets, process in chunks
chunk_size = 100
for i in range(0, len(large_dataset), chunk_size):
    chunk = large_dataset[i:i+chunk_size]
    chunk_results = predict_prf(data=chunk)
    # Process chunk_results
```
## Citation
If you use FScanpy, please cite our paper: [Paper Link]
## Support
For questions and issues, please visit our GitHub repository or contact the development team.