617 lines
20 KiB
Markdown
617 lines
20 KiB
Markdown
# FScanpy Tutorial - Complete Usage Guide
|
|
|
|
[](tutorial_zh.md)
|
|
|
|
## Abstract
|
|
FScanpy is a Python package designed to predict Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. This package integrates machine learning models, sequence feature analysis, and visualization capabilities to help researchers rapidly locate potential PRF sites.
|
|
|
|
## Introduction
|
|

|
|
|
|
FScanpy is a Python package dedicated to predicting Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. It integrates machine learning models (Gradient Boosting and BiLSTM-CNN) along with the FScanR package to furnish precise PRF predictions. Users are capable of employing three types of data as input: the entire cDNA/mRNA sequence that requires prediction, the nucleotide sequence in the vicinity of the suspected frameshift site, and the peptide library blastx results of the species or related species. It anticipates the input sequence to be in the + strand and can be integrated with FScanR to augment the accuracy.
|
|
|
|

|
|
|
|
For the prediction of the entire sequence, FScanpy adopts a sliding window approach to scan the entire sequence and predict the PRF sites. For regional prediction, it is based on the 33-bp and 399-bp sequences in the 0 reading frame around the suspected frameshift site. Initially, the Short model (HistGradientBoosting) will predict the potential PRF sites within the scanning window. If the predicted probability exceeds the threshold, the Long model (BiLSTM-CNN) will predict the PRF sites in the 399bp sequence. Then, ensemble weighting combines the two models to make the final prediction.
|
|
|
|
For PRF detection from BLASTX output, [FScanR](https://github.com/seanchen607/FScanR.git) identifies potential PRF sites from BLASTX alignment results, acquires the two hits of the same query sequence, and then utilizes frameDist_cutoff, mismatch_cutoff, and evalue_cutoff to filter the hits. Finally, FScanpy is utilized to predict the probability of PRF sites.
|
|
|
|
### Background
|
|
[Ribosomal frameshifting](https://en.wikipedia.org/wiki/Ribosomal_frameshift), also known as translational frameshifting or translational recoding, is a biological phenomenon that occurs during translation that results in the production of multiple, unique proteins from a single mRNA. The process can be programmed by the nucleotide sequence of the mRNA and is sometimes affected by the secondary, 3-dimensional mRNA structure. It has been described mainly in viruses (especially retroviruses), retrotransposons and bacterial insertion elements, and also in some cellular genes.
|
|
|
|
### Key features of FScanpy include:
|
|
|
|
- Integration of two predictive models:
|
|
- Short Model (HistGradientBoosting): Analyzes local sequence features centered around potential frameshift sites (33bp).
|
|
- Long Model (BiLSTM-CNN): Analyzes broader sequence features (399bp).
|
|
- Supports PRF prediction across various species.
|
|
- Can be combined with [FScanR](https://github.com/seanchen607/FScanR.git) for enhanced accuracy.
|
|
|
|
## Installation (python>=3.7)
|
|
|
|
### 1. Use pip
|
|
```bash
|
|
pip install FScanpy
|
|
```
|
|
|
|
### 2. Clone from GitHub
|
|
```bash
|
|
git clone https://github.com/.../FScanpy.git
|
|
cd FScanpy
|
|
pip install -e .
|
|
```
|
|
|
|
## Complete Function Reference
|
|
|
|
### 1. Core Prediction Functions
|
|
|
|
#### 1.1 `predict_prf()` - Main Prediction Interface
|
|
|
|
**Function Signature:**
|
|
```python
|
|
def predict_prf(
|
|
sequence: Union[str, List[str], None] = None,
|
|
data: Union[pd.DataFrame, None] = None,
|
|
window_size: int = 3,
|
|
short_threshold: float = 0.1,
|
|
ensemble_weight: float = 0.4,
|
|
model_dir: str = None
|
|
) -> pd.DataFrame
|
|
```
|
|
|
|
**Parameters:**
|
|
- `sequence`: Single or multiple DNA sequences for sliding window prediction
|
|
- `data`: DataFrame data, must contain 'Long_Sequence' or '399bp' column for region prediction
|
|
- `window_size`: Sliding window size (default: 3, recommended: 1-10)
|
|
- `short_threshold`: Short model (HistGB) probability threshold (default: 0.1, range: 0.0-1.0)
|
|
- `ensemble_weight`: Weight of short model in ensemble (default: 0.4, range: 0.0-1.0)
|
|
- `model_dir`: Model directory path (optional, uses built-in models if None)
|
|
|
|
**Returns:**
|
|
- `pd.DataFrame`: Prediction results with columns:
|
|
- `Short_Probability`: Short model prediction probability
|
|
- `Long_Probability`: Long model prediction probability
|
|
- `Ensemble_Probability`: Ensemble prediction probability (main result)
|
|
- `Position`: Position in sequence (for sliding window mode)
|
|
- `Codon`: Codon at position (for sliding window mode)
|
|
- `Ensemble_Weights`: Weight configuration information
|
|
|
|
**Usage Examples:**
|
|
|
|
```python
|
|
from FScanpy import predict_prf
|
|
|
|
# 1. Single sequence sliding window prediction
|
|
sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
|
|
results = predict_prf(sequence=sequence)
|
|
|
|
# 2. Multiple sequences prediction
|
|
sequences = ["ATGCGTACGT...", "GCTATAGCAT..."]
|
|
results = predict_prf(sequence=sequences)
|
|
|
|
# 3. Custom parameters
|
|
results = predict_prf(
|
|
sequence=sequence,
|
|
window_size=1, # Scan every position
|
|
short_threshold=0.2, # Higher threshold
|
|
ensemble_weight=0.3 # 3:7 ratio (short:long)
|
|
)
|
|
|
|
# 4. DataFrame region prediction
|
|
import pandas as pd
|
|
data = pd.DataFrame({
|
|
'Long_Sequence': ['ATGCGT...', 'GCTATAG...'], # or use '399bp'
|
|
'sample_id': ['sample1', 'sample2']
|
|
})
|
|
results = predict_prf(data=data)
|
|
```
|
|
|
|
#### 1.2 `plot_prf_prediction()` - Prediction with Visualization
|
|
|
|
**Function Signature:**
|
|
```python
|
|
def plot_prf_prediction(
|
|
sequence: str,
|
|
window_size: int = 3,
|
|
short_threshold: float = 0.65,
|
|
long_threshold: float = 0.8,
|
|
ensemble_weight: float = 0.4,
|
|
title: str = None,
|
|
save_path: str = None,
|
|
figsize: tuple = (12, 8),
|
|
dpi: int = 300,
|
|
model_dir: str = None
|
|
) -> tuple
|
|
```
|
|
|
|
**Parameters:**
|
|
- `sequence`: Input DNA sequence (string)
|
|
- `window_size`: Sliding window size (default: 3)
|
|
- `short_threshold`: Short model filtering threshold for heatmap display (default: 0.65)
|
|
- `long_threshold`: Long model filtering threshold for heatmap display (default: 0.8)
|
|
- `ensemble_weight`: Weight of short model in ensemble (default: 0.4)
|
|
- `title`: Plot title (optional, auto-generated if None)
|
|
- `save_path`: Save path (optional, saves plot if provided)
|
|
- `figsize`: Figure size tuple (default: (12, 8))
|
|
- `dpi`: Figure resolution (default: 300)
|
|
- `model_dir`: Model directory path (optional)
|
|
|
|
**Returns:**
|
|
- `tuple`: (prediction_results: pd.DataFrame, figure: matplotlib.figure.Figure)
|
|
|
|
**Usage Examples:**
|
|
|
|
```python
|
|
from FScanpy import plot_prf_prediction
|
|
import matplotlib.pyplot as plt
|
|
|
|
# 1. Basic plotting
|
|
sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
|
|
results, fig = plot_prf_prediction(sequence)
|
|
plt.show()
|
|
|
|
# 2. Custom thresholds and weights
|
|
results, fig = plot_prf_prediction(
|
|
sequence,
|
|
short_threshold=0.7, # Higher display threshold
|
|
long_threshold=0.85, # Higher display threshold
|
|
ensemble_weight=0.3, # 3:7 weight ratio
|
|
title="Custom Analysis Results",
|
|
save_path="analysis.png",
|
|
figsize=(15, 10),
|
|
dpi=150
|
|
)
|
|
|
|
# 3. High-resolution analysis
|
|
results, fig = plot_prf_prediction(
|
|
sequence,
|
|
window_size=1, # Scan every position
|
|
ensemble_weight=0.5, # Equal weights
|
|
dpi=600 # High resolution
|
|
)
|
|
```
|
|
|
|
### 2. PRFPredictor Class Methods
|
|
|
|
#### 2.1 Class Initialization
|
|
|
|
```python
|
|
from FScanpy import PRFPredictor
|
|
|
|
# Initialize with default models
|
|
predictor = PRFPredictor()
|
|
|
|
# Initialize with custom model directory
|
|
predictor = PRFPredictor(model_dir='/path/to/models')
|
|
```
|
|
|
|
#### 2.2 `predict_sequence()` - Sliding Window Prediction
|
|
|
|
**Method Signature:**
|
|
```python
|
|
def predict_sequence(self, sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)
|
|
```
|
|
|
|
**Parameters:**
|
|
- `sequence`: Input DNA sequence
|
|
- `window_size`: Sliding window size (default: 3)
|
|
- `short_threshold`: Short model probability threshold (default: 0.1)
|
|
- `ensemble_weight`: Short model weight in ensemble (default: 0.4)
|
|
|
|
**Usage:**
|
|
```python
|
|
predictor = PRFPredictor()
|
|
results = predictor.predict_sequence(
|
|
sequence="ATGCGTACGT...",
|
|
window_size=1,
|
|
short_threshold=0.15,
|
|
ensemble_weight=0.35
|
|
)
|
|
```
|
|
|
|
#### 2.3 `predict_regions()` - Region-based Prediction
|
|
|
|
**Method Signature:**
|
|
```python
|
|
def predict_regions(self, sequences, short_threshold=0.1, ensemble_weight=0.4)
|
|
```
|
|
|
|
**Parameters:**
|
|
- `sequences`: List or Series of 399bp sequences
|
|
- `short_threshold`: Short model probability threshold (default: 0.1)
|
|
- `ensemble_weight`: Short model weight in ensemble (default: 0.4)
|
|
|
|
**Usage:**
|
|
```python
|
|
predictor = PRFPredictor()
|
|
sequences = ["ATGCGT...", "GCTATAG..."] # 399bp sequences
|
|
results = predictor.predict_regions(
|
|
sequences=sequences,
|
|
short_threshold=0.1,
|
|
ensemble_weight=0.4
|
|
)
|
|
```
|
|
|
|
#### 2.4 `predict_single_position()` - Single Position Prediction
|
|
|
|
**Method Signature:**
|
|
```python
|
|
def predict_single_position(self, fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)
|
|
```
|
|
|
|
**Parameters:**
|
|
- `fs_period`: 33bp sequence around frameshift site
|
|
- `full_seq`: 399bp sequence for long model
|
|
- `short_threshold`: Short model probability threshold (default: 0.1)
|
|
- `ensemble_weight`: Short model weight in ensemble (default: 0.4)
|
|
|
|
**Usage:**
|
|
```python
|
|
predictor = PRFPredictor()
|
|
result = predictor.predict_single_position(
|
|
fs_period="ATGCGTACGTATGCGTACGTATGCGTACGTA", # 33bp
|
|
full_seq="ATGCGT..." * 133, # 399bp
|
|
short_threshold=0.1,
|
|
ensemble_weight=0.4
|
|
)
|
|
```
|
|
|
|
#### 2.5 `plot_sequence_prediction()` - Class Method for Plotting
|
|
|
|
**Method Signature:**
|
|
```python
|
|
def plot_sequence_prediction(self, sequence, window_size=3, short_threshold=0.65,
|
|
long_threshold=0.8, ensemble_weight=0.4, title=None,
|
|
save_path=None, figsize=(12, 8), dpi=300)
|
|
```
|
|
|
|
**Usage:**
|
|
```python
|
|
predictor = PRFPredictor()
|
|
results, fig = predictor.plot_sequence_prediction(
|
|
sequence="ATGCGTACGT...",
|
|
window_size=3,
|
|
ensemble_weight=0.4
|
|
)
|
|
```
|
|
|
|
### 3. Utility Functions
|
|
|
|
#### 3.1 `fscanr()` - PRF Site Detection from BLASTX
|
|
|
|
**Function Signature:**
|
|
```python
|
|
def fscanr(
|
|
blastx_output: pd.DataFrame,
|
|
mismatch_cutoff: float = 10,
|
|
evalue_cutoff: float = 1e-5,
|
|
frameDist_cutoff: float = 10
|
|
) -> pd.DataFrame
|
|
```
|
|
|
|
**Parameters:**
|
|
- `blastx_output`: BLASTX output DataFrame with required columns:
|
|
- `qseqid`, `sseqid`, `pident`, `length`, `mismatch`, `gapopen`
|
|
- `qstart`, `qend`, `sstart`, `send`, `evalue`, `bitscore`, `qframe`, `sframe`
|
|
- `mismatch_cutoff`: Maximum allowed mismatches (default: 10)
|
|
- `evalue_cutoff`: E-value threshold (default: 1e-5)
|
|
- `frameDist_cutoff`: Frame distance threshold (default: 10)
|
|
|
|
**Returns:**
|
|
- `pd.DataFrame`: PRF sites with columns:
|
|
- `DNA_seqid`: Sequence identifier
|
|
- `FS_start`, `FS_end`: Frameshift start and end positions
|
|
- `Pep_seqid`: Peptide sequence identifier
|
|
- `Pep_FS_start`, `Pep_FS_end`: Peptide frameshift positions
|
|
- `FS_type`: Type of frameshift (-2, -1, 1, 2)
|
|
- `Strand`: Strand orientation (+, -)
|
|
|
|
**Usage:**
|
|
```python
|
|
from FScanpy.utils import fscanr
|
|
import pandas as pd
|
|
|
|
# Load BLASTX results
|
|
blastx_data = pd.read_excel('blastx_results.xlsx')
|
|
|
|
# Detect PRF sites
|
|
prf_sites = fscanr(
|
|
blastx_output=blastx_data,
|
|
mismatch_cutoff=5, # Stricter mismatch filter
|
|
evalue_cutoff=1e-6, # Stricter E-value filter
|
|
frameDist_cutoff=15 # Allow larger frame distances
|
|
)
|
|
```
|
|
|
|
#### 3.2 `extract_prf_regions()` - Extract Sequences Around PRF Sites
|
|
|
|
**Function Signature:**
|
|
```python
|
|
def extract_prf_regions(mrna_file: str, prf_data: pd.DataFrame) -> pd.DataFrame
|
|
```
|
|
|
|
**Parameters:**
|
|
- `mrna_file`: Path to mRNA sequences file (FASTA format)
|
|
- `prf_data`: DataFrame from `fscanr()` output
|
|
|
|
**Returns:**
|
|
- `pd.DataFrame`: Extracted sequences with columns:
|
|
- `DNA_seqid`: Sequence identifier
|
|
- `FS_start`, `FS_end`: Frameshift positions
|
|
- `Strand`: Strand orientation
|
|
- `399bp`: Extracted 399bp sequence
|
|
- `FS_type`: Frameshift type
|
|
|
|
**Usage:**
|
|
```python
|
|
from FScanpy.utils import extract_prf_regions
|
|
|
|
# Extract sequences around PRF sites
|
|
prf_sequences = extract_prf_regions(
|
|
mrna_file='sequences.fasta',
|
|
prf_data=prf_sites
|
|
)
|
|
|
|
# Predict PRF probabilities
|
|
predictor = PRFPredictor()
|
|
results = predictor.predict_regions(prf_sequences['399bp'])
|
|
```
|
|
|
|
### 4. Data Access Functions
|
|
|
|
#### 4.1 Test Data Access
|
|
|
|
```python
|
|
from FScanpy.data import get_test_data_path, list_test_data
|
|
|
|
# List available test data
|
|
list_test_data()
|
|
|
|
# Get test data paths
|
|
blastx_file = get_test_data_path('blastx_example.xlsx')
|
|
mrna_file = get_test_data_path('mrna_example.fasta')
|
|
region_file = get_test_data_path('region_example.csv')
|
|
seq_file = get_test_data_path('full_seq.xlsx')
|
|
```
|
|
|
|
## Complete Workflow Examples
|
|
|
|
### Workflow 1: Full Sequence Analysis
|
|
|
|
```python
|
|
from FScanpy import predict_prf, plot_prf_prediction
|
|
import matplotlib.pyplot as plt
|
|
|
|
# Define sequence
|
|
full_seq = pd.read.excel(seq_file)
|
|
|
|
# Method 1: Simple prediction
|
|
results = predict_prf(sequence=full_seq[0]['full_seq'])
|
|
print(f"Found {len(results)} potential sites")
|
|
|
|
# Method 2: Prediction with visualization
|
|
results, fig = plot_prf_prediction(
|
|
sequence=sequence=full_seq[0]['full_seq'],
|
|
window_size=1, # Scan every position
|
|
short_threshold=0.3, # Display sites above 0.3
|
|
long_threshold=0.4, # Display sites above 0.4
|
|
ensemble_weight=0.4, # 4:6 weight ratio
|
|
title="PRF Analysis Results",
|
|
save_path="prf_analysis.png"
|
|
)
|
|
plt.show()
|
|
|
|
# Analyze top predictions
|
|
top_sites = results.nlargest(5, 'Ensemble_Probability')
|
|
print("Top 5 predicted sites:")
|
|
for _, site in top_sites.iterrows():
|
|
print(f"Position {site['Position']}: {site['Ensemble_Probability']:.3f}")
|
|
```
|
|
|
|
### Workflow 2: Region-based Prediction
|
|
|
|
```python
|
|
from FScanpy import predict_prf
|
|
import pandas as pd
|
|
|
|
# Prepare region data
|
|
region_data = pd.DataFrame({
|
|
'sample_id': ['sample1', 'sample2', 'sample3'],
|
|
'Long_Sequence': [
|
|
'ATGCGT...', # 399bp sequence 1
|
|
'GCTATAG...', # 399bp sequence 2
|
|
'TTACGGA...' # 399bp sequence 3
|
|
],
|
|
'known_label': [1, 0, 1] # Optional: known labels for validation
|
|
})
|
|
|
|
# Predict PRF probabilities
|
|
results = predict_prf(
|
|
data=region_data,
|
|
ensemble_weight=0.3 # Favor long model (3:7 ratio)
|
|
)
|
|
|
|
# Evaluate results
|
|
if 'known_label' in results.columns:
|
|
threshold = 0.5
|
|
predictions = (results['Ensemble_Probability'] > threshold).astype(int)
|
|
accuracy = (predictions == results['known_label']).mean()
|
|
print(f"Accuracy at threshold {threshold}: {accuracy:.3f}")
|
|
```
|
|
|
|
### Workflow 3: BLASTX-based Analysis Pipeline
|
|
|
|
```python
|
|
from FScanpy import PRFPredictor, predict_prf
|
|
from FScanpy.data import get_test_data_path
|
|
from FScanpy.utils import fscanr, extract_prf_regions
|
|
import pandas as pd
|
|
|
|
# Step 1: Load BLASTX data
|
|
blastx_data = pd.read_excel(get_test_data_path('blastx_example.xlsx'))
|
|
print(f"Loaded {len(blastx_data)} BLASTX hits")
|
|
|
|
# Step 2: Detect PRF sites using FScanR
|
|
prf_sites = fscanr(
|
|
blastx_output=blastx_data,
|
|
mismatch_cutoff=10,
|
|
evalue_cutoff=1e-5,
|
|
frameDist_cutoff=10
|
|
)
|
|
print(f"Detected {len(prf_sites)} potential PRF sites")
|
|
|
|
# Step 3: Extract sequences around PRF sites
|
|
mrna_file = get_test_data_path('mrna_example.fasta')
|
|
prf_sequences = extract_prf_regions(
|
|
mrna_file=mrna_file,
|
|
prf_data=prf_sites
|
|
)
|
|
print(f"Extracted {len(prf_sequences)} sequences")
|
|
|
|
# Step 4: Predict PRF probabilities
|
|
predictor = PRFPredictor()
|
|
results = predictor.predict_regions(
|
|
sequences=prf_sequences['399bp'],
|
|
ensemble_weight=0.4
|
|
)
|
|
|
|
# Step 5: Combine results with metadata
|
|
final_results = pd.concat([
|
|
prf_sequences.reset_index(drop=True),
|
|
results.reset_index(drop=True)
|
|
], axis=1)
|
|
|
|
# Step 6: Analyze results
|
|
high_prob_sites = final_results[
|
|
final_results['Ensemble_Probability'] > 0.7
|
|
]
|
|
print(f"High probability PRF sites: {len(high_prob_sites)}")
|
|
|
|
# Display top results
|
|
print("\nTop PRF predictions:")
|
|
top_results = final_results.nlargest(3, 'Ensemble_Probability')
|
|
for _, row in top_results.iterrows():
|
|
print(f"Sequence {row['DNA_seqid']}: {row['Ensemble_Probability']:.3f}")
|
|
```
|
|
|
|
### Workflow 4: Custom Analysis with Multiple Sequences
|
|
|
|
```python
|
|
from FScanpy import predict_prf, plot_prf_prediction
|
|
import matplotlib.pyplot as plt
|
|
|
|
# Multiple sequence analysis
|
|
sequences = [
|
|
"ATGCGTACGTATGCGTACGTATGCGTACGTAAGCCCTTTGAACCCAAAGGG",
|
|
"GCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCAT",
|
|
"TTACGGATTACGGATTACGGATTACGGATTACGGATTACGGATTACGGAT"
|
|
]
|
|
|
|
# Batch prediction
|
|
results = predict_prf(
|
|
sequence=sequences,
|
|
window_size=2,
|
|
ensemble_weight=0.5 # Equal weights
|
|
)
|
|
|
|
# Analyze per sequence
|
|
for seq_id in results['Sequence_ID'].unique():
|
|
seq_results = results[results['Sequence_ID'] == seq_id]
|
|
max_prob = seq_results['Ensemble_Probability'].max()
|
|
print(f"{seq_id}: Max probability = {max_prob:.3f}")
|
|
|
|
# Visualize first sequence
|
|
first_seq_results, fig = plot_prf_prediction(
|
|
sequence=sequences[0],
|
|
ensemble_weight=0.5,
|
|
title="First Sequence Analysis"
|
|
)
|
|
plt.show()
|
|
```
|
|
|
|
## Parameter Optimization Guidelines
|
|
|
|
### 1. Ensemble Weight Selection
|
|
|
|
- **Conservative (favor specificity)**: `ensemble_weight = 0.2-0.3` (favor long model)
|
|
- **Balanced**: `ensemble_weight = 0.4-0.6` (default recommended)
|
|
- **Sensitive (favor sensitivity)**: `ensemble_weight = 0.7-0.8` (favor short model)
|
|
|
|
### 2. Threshold Selection
|
|
|
|
- **Short threshold**: Usually 0.1-0.3, controls computational efficiency
|
|
- **Display thresholds**: 0.3-0.8, controls visualization display
|
|
- **Classification threshold**: 0.5 (standard), adjust based on validation data
|
|
|
|
### 3. Window Size Selection
|
|
|
|
- **Fine-grained analysis**: `window_size = 1` (every position)
|
|
- **Standard analysis**: `window_size = 3` (every 3rd position, default)
|
|
- **Coarse analysis**: `window_size = 6-9` (faster, less detailed)
|
|
|
|
## Troubleshooting
|
|
|
|
### Common Issues and Solutions
|
|
|
|
1. **Model Loading Errors**
|
|
```python
|
|
# Check model directory
|
|
import FScanpy
|
|
predictor = PRFPredictor(model_dir='/custom/path')
|
|
```
|
|
|
|
2. **Memory Issues with Large Sequences**
|
|
```python
|
|
# Use larger window size to reduce computational load
|
|
results = predict_prf(sequence=large_seq, window_size=9)
|
|
```
|
|
|
|
3. **Visualization Issues**
|
|
```python
|
|
# Adjust figure parameters
|
|
results, fig = plot_prf_prediction(
|
|
sequence=seq,
|
|
figsize=(20, 10), # Larger figure
|
|
dpi=150 # Lower resolution
|
|
)
|
|
```
|
|
|
|
4. **Input Format Issues**
|
|
```python
|
|
# Ensure proper DataFrame format
|
|
data = pd.DataFrame({
|
|
'Long_Sequence': sequences, # Use 'Long_Sequence' or '399bp'
|
|
'sample_id': ids
|
|
})
|
|
```
|
|
|
|
## Performance Optimization
|
|
|
|
### 1. Batch Processing
|
|
```python
|
|
# Process multiple sequences efficiently
|
|
sequences = ["seq1", "seq2", "seq3", ...]
|
|
results = predict_prf(sequence=sequences, window_size=3)
|
|
```
|
|
|
|
### 2. Threshold Optimization
|
|
```python
|
|
# Use appropriate short_threshold to skip unnecessary long model calls
|
|
results = predict_prf(
|
|
sequence=sequence,
|
|
short_threshold=0.2 # Higher threshold = faster processing
|
|
)
|
|
```
|
|
|
|
### 3. Memory Management
|
|
```python
|
|
# For very large datasets, process in chunks
|
|
chunk_size = 100
|
|
for i in range(0, len(large_dataset), chunk_size):
|
|
chunk = large_dataset[i:i+chunk_size]
|
|
chunk_results = predict_prf(data=chunk)
|
|
# Process chunk_results
|
|
```
|
|
|
|
## Citation
|
|
If you use FScanpy, please cite our paper: [Paper Link] |