2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# FScanpy Tutorial - Complete Usage Guide
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								## Abstract
  
						 
					
						
							
								
									
										
										
										
											2025-03-20 15:18:06 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								FScanpy is a Python package designed to predict Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. This package integrates machine learning models, sequence feature analysis, and visualization capabilities to help researchers rapidly locate potential PRF sites.
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								## Introduction
  
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:32:06 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								FScanpy is a Python package dedicated to predicting Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. It integrates machine learning models (Gradient Boosting and BiLSTM-CNN) along with the FScanR package to furnish precise PRF predictions. Users are capable of employing three types of data as input: the entire cDNA/mRNA sequence that requires prediction, the nucleotide sequence in the vicinity of the suspected frameshift site, and the peptide library blastx results of the species or related species. It anticipates the input sequence to be in the + strand and can be integrated with FScanR to augment the accuracy.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:32:06 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-05-29 17:58:48 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								For the prediction of the entire sequence, FScanpy adopts a sliding window approach to scan the entire sequence and predict the PRF sites. For regional prediction, it is based on the 33-bp and 399-bp sequences in the 0 reading frame around the suspected frameshift site. Initially, the Short model (HistGradientBoosting) will predict the potential PRF sites within the scanning window. If the predicted probability exceeds the threshold, the Long model (BiLSTM-CNN) will predict the PRF sites in the 399bp sequence. Then, ensemble weighting combines the two models to make the final prediction.
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-03-20 15:31:14 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								For PRF detection from BLASTX output, [FScanR ](https://github.com/seanchen607/FScanR.git ) identifies potential PRF sites from BLASTX alignment results, acquires the two hits of the same query sequence, and then utilizes frameDist_cutoff, mismatch_cutoff, and evalue_cutoff to filter the hits. Finally, FScanpy is utilized to predict the probability of PRF sites.
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### Background
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								[Ribosomal frameshifting ](https://en.wikipedia.org/wiki/Ribosomal_frameshift ), also known as translational frameshifting or translational recoding, is a biological phenomenon that occurs during translation that results in the production of multiple, unique proteins from a single mRNA. The process can be programmed by the nucleotide sequence of the mRNA and is sometimes affected by the secondary, 3-dimensional mRNA structure. It has been described mainly in viruses (especially retroviruses), retrotransposons and bacterial insertion elements, and also in some cellular genes.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### Key features of FScanpy include:
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  Integration of two predictive models: 
						 
					
						
							
								
									
										
										
										
											2025-05-29 17:58:48 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								  -  Short Model (HistGradientBoosting): Analyzes local sequence features centered around potential frameshift sites (33bp).
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  Long Model (BiLSTM-CNN): Analyzes broader sequence features (399bp).
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								-  Supports PRF prediction across various species. 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  Can be combined with [FScanR ](https://github.com/seanchen607/FScanR.git ) for enhanced accuracy. 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								## Installation (python>=3.7)
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### 1. Use pip
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```bash
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								pip install FScanpy
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-05-29 17:58:48 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								### 2. Clone from GitHub
  
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								```bash
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								git clone https://github.com/.../FScanpy.git
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								cd FScanpy
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								pip install -e .
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								## Complete Function Reference
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### 1. Core Prediction Functions
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								#### 1.1 `predict_prf()` - Main Prediction Interface
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Function Signature:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								def predict_prf(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    sequence: Union[str, List[str], None] = None,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    data: Union[pd.DataFrame, None] = None,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    window_size: int = 3,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    short_threshold: float = 0.1,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ensemble_weight: float = 0.4,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    model_dir: str = None
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								) -> pd.DataFrame
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Parameters:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `sequence` : Single or multiple DNA sequences for sliding window prediction 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `data` : DataFrame data, must contain 'Long_Sequence' or '399bp' column for region prediction   
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `window_size` : Sliding window size (default: 3, recommended: 1-10) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `short_threshold` : Short model (HistGB) probability threshold (default: 0.1, range: 0.0-1.0) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `ensemble_weight` : Weight of short model in ensemble (default: 0.4, range: 0.0-1.0) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `model_dir` : Model directory path (optional, uses built-in models if None) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Returns:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `pd.DataFrame` : Prediction results with columns: 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `Short_Probability` : Short model prediction probability
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `Long_Probability` : Long model prediction probability  
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `Ensemble_Probability` : Ensemble prediction probability (main result)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `Position` : Position in sequence (for sliding window mode)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `Codon` : Codon at position (for sliding window mode)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `Ensemble_Weights` : Weight configuration information
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Usage Examples:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								from FScanpy import predict_prf
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# 1. Single sequence sliding window prediction
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results = predict_prf(sequence=sequence)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# 2. Multiple sequences prediction
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								sequences = ["ATGCGTACGT...", "GCTATAGCAT..."]
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results = predict_prf(sequence=sequences)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# 3. Custom parameters
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results = predict_prf(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    sequence=sequence, 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    window_size=1,           # Scan every position
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    short_threshold=0.2,     # Higher threshold
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ensemble_weight=0.3      # 3:7 ratio (short:long)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# 4. DataFrame region prediction
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								import pandas as pd
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								data = pd.DataFrame({
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    'Long_Sequence': ['ATGCGT...', 'GCTATAG...'],  # or use '399bp'
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    'sample_id': ['sample1', 'sample2']
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								})
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results = predict_prf(data=data)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								#### 1.2 `plot_prf_prediction()` - Prediction with Visualization
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Function Signature:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								def plot_prf_prediction(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    sequence: str,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    window_size: int = 3,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    short_threshold: float = 0.65,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    long_threshold: float = 0.8,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ensemble_weight: float = 0.4,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    title: str = None,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    save_path: str = None,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    figsize: tuple = (12, 8),
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    dpi: int = 300,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    model_dir: str = None
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								) -> tuple
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Parameters:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `sequence` : Input DNA sequence (string) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `window_size` : Sliding window size (default: 3) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `short_threshold` : Short model filtering threshold for heatmap display (default: 0.65) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `long_threshold` : Long model filtering threshold for heatmap display (default: 0.8) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `ensemble_weight` : Weight of short model in ensemble (default: 0.4) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `title` : Plot title (optional, auto-generated if None) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `save_path` : Save path (optional, saves plot if provided) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `figsize` : Figure size tuple (default: (12, 8)) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `dpi` : Figure resolution (default: 300) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `model_dir` : Model directory path (optional) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Returns:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `tuple` : (prediction_results: pd.DataFrame, figure: matplotlib.figure.Figure) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Usage Examples:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								from FScanpy import plot_prf_prediction
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								import matplotlib.pyplot as plt
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# 1. Basic plotting
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results, fig = plot_prf_prediction(sequence)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								plt.show()
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# 2. Custom thresholds and weights
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results, fig = plot_prf_prediction(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    sequence, 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    short_threshold=0.7,     # Higher display threshold
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    long_threshold=0.85,     # Higher display threshold
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ensemble_weight=0.3,     # 3:7 weight ratio
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    title="Custom Analysis Results",
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    save_path="analysis.png",
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    figsize=(15, 10),
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    dpi=150
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# 3. High-resolution analysis
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results, fig = plot_prf_prediction(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    sequence,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    window_size=1,           # Scan every position
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ensemble_weight=0.5,     # Equal weights
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    dpi=600                  # High resolution
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### 2. PRFPredictor Class Methods
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								#### 2.1 Class Initialization
  
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								from FScanpy import PRFPredictor
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Initialize with default models
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								predictor = PRFPredictor()
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Initialize with custom model directory
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								predictor = PRFPredictor(model_dir='/path/to/models')
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								#### 2.2 `predict_sequence()` - Sliding Window Prediction
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Method Signature:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								def predict_sequence(self, sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Parameters:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `sequence` : Input DNA sequence 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `window_size` : Sliding window size (default: 3) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `short_threshold` : Short model probability threshold (default: 0.1) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `ensemble_weight` : Short model weight in ensemble (default: 0.4) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Usage:**
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								predictor = PRFPredictor()
							 
						 
					
						
							
								
									
										
										
										
											2025-05-29 17:58:48 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								results = predictor.predict_sequence(
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								    sequence="ATGCGTACGT...",
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    window_size=1,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    short_threshold=0.15,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ensemble_weight=0.35
							 
						 
					
						
							
								
									
										
										
										
											2025-05-29 17:58:48 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
									
										
										
										
											2025-05-29 17:58:48 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								#### 2.3 `predict_regions()` - Region-based Prediction
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Method Signature:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								def predict_regions(self, sequences, short_threshold=0.1, ensemble_weight=0.4)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Parameters:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `sequences` : List or Series of 399bp sequences 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `short_threshold` : Short model probability threshold (default: 0.1) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `ensemble_weight` : Short model weight in ensemble (default: 0.4) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Usage:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								predictor = PRFPredictor()
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								sequences = ["ATGCGT...", "GCTATAG..."]  # 399bp sequences
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results = predictor.predict_regions(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    sequences=sequences,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    short_threshold=0.1,
							 
						 
					
						
							
								
									
										
										
										
											2025-05-29 17:58:48 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								    ensemble_weight=0.4
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								#### 2.4 `predict_single_position()` - Single Position Prediction
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Method Signature:**
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								def predict_single_position(self, fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Parameters:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `fs_period` : 33bp sequence around frameshift site 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `full_seq` : 399bp sequence for long model 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `short_threshold` : Short model probability threshold (default: 0.1) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `ensemble_weight` : Short model weight in ensemble (default: 0.4) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Usage:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								predictor = PRFPredictor()
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								result = predictor.predict_single_position(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    fs_period="ATGCGTACGTATGCGTACGTATGCGTACGTA",  # 33bp
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    full_seq="ATGCGT..." * 133,  # 399bp
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    short_threshold=0.1,
							 
						 
					
						
							
								
									
										
										
										
											2025-05-29 17:58:48 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								    ensemble_weight=0.4
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								#### 2.5 `plot_sequence_prediction()` - Class Method for Plotting
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Method Signature:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								def plot_sequence_prediction(self, sequence, window_size=3, short_threshold=0.65, 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								                           long_threshold=0.8, ensemble_weight=0.4, title=None, 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								                           save_path=None, figsize=(12, 8), dpi=300)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								**Usage:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								predictor = PRFPredictor()
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results, fig = predictor.plot_sequence_prediction(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    sequence="ATGCGTACGT...",
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    window_size=3,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ensemble_weight=0.4
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### 3. Utility Functions
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								#### 3.1 `fscanr()` - PRF Site Detection from BLASTX
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Function Signature:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								def fscanr(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    blastx_output: pd.DataFrame,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    mismatch_cutoff: float = 10,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    evalue_cutoff: float = 1e-5,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    frameDist_cutoff: float = 10
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								) -> pd.DataFrame
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Parameters:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `blastx_output` : BLASTX output DataFrame with required columns: 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `qseqid` , `sseqid` , `pident` , `length` , `mismatch` , `gapopen` 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `qstart` , `qend` , `sstart` , `send` , `evalue` , `bitscore` , `qframe` , `sframe` 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `mismatch_cutoff` : Maximum allowed mismatches (default: 10) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `evalue_cutoff` : E-value threshold (default: 1e-5) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `frameDist_cutoff` : Frame distance threshold (default: 10) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Returns:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `pd.DataFrame` : PRF sites with columns: 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `DNA_seqid` : Sequence identifier
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `FS_start` , `FS_end` : Frameshift start and end positions
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `Pep_seqid` : Peptide sequence identifier
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `Pep_FS_start` , `Pep_FS_end` : Peptide frameshift positions
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `FS_type` : Type of frameshift (-2, -1, 1, 2)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `Strand` : Strand orientation (+, -)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Usage:**
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								from FScanpy.utils import fscanr
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								import pandas as pd
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Load BLASTX results
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								blastx_data = pd.read_excel('blastx_results.xlsx')
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Detect PRF sites
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								prf_sites = fscanr(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    blastx_output=blastx_data,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    mismatch_cutoff=5,       # Stricter mismatch filter
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    evalue_cutoff=1e-6,      # Stricter E-value filter
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    frameDist_cutoff=15      # Allow larger frame distances
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								#### 3.2 `extract_prf_regions()` - Extract Sequences Around PRF Sites
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Function Signature:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								def extract_prf_regions(mrna_file: str, prf_data: pd.DataFrame) -> pd.DataFrame
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Parameters:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `mrna_file` : Path to mRNA sequences file (FASTA format) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `prf_data` : DataFrame from `fscanr()`  output 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Returns:**
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  `pd.DataFrame` : Extracted sequences with columns: 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `DNA_seqid` : Sequence identifier
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `FS_start` , `FS_end` : Frameshift positions
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `Strand` : Strand orientation
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `399bp` : Extracted 399bp sequence
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								  -  `FS_type` : Frameshift type
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								**Usage:**
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								from FScanpy.utils import extract_prf_regions
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Extract sequences around PRF sites
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								prf_sequences = extract_prf_regions(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    mrna_file='sequences.fasta',
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    prf_data=prf_sites
							 
						 
					
						
							
								
									
										
										
										
											2025-05-29 17:58:48 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Predict PRF probabilities
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								predictor = PRFPredictor()
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results = predictor.predict_regions(prf_sequences['399bp'])
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								### 4. Data Access Functions
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								#### 4.1 Test Data Access
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								from FScanpy.data import get_test_data_path, list_test_data
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# List available test data
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								list_test_data()
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Get test data paths
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								blastx_file = get_test_data_path('blastx_example.xlsx')
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								mrna_file = get_test_data_path('mrna_example.fasta')
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								region_file = get_test_data_path('region_example.csv')
							 
						 
					
						
							
								
									
										
										
										
											2025-06-12 02:52:35 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								seq_file = get_test_data_path('full_seq.xlsx')
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								## Complete Workflow Examples
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### Workflow 1: Full Sequence Analysis
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								from FScanpy import predict_prf, plot_prf_prediction
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								import matplotlib.pyplot as plt
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Define sequence
  
						 
					
						
							
								
									
										
										
										
											2025-06-12 02:52:35 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								full_seq = pd.read.excel(seq_file)
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Method 1: Simple prediction
  
						 
					
						
							
								
									
										
										
										
											2025-06-12 02:52:35 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								results = predict_prf(sequence=full_seq[0]['full_seq'])
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								print(f"Found {len(results)} potential sites")
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Method 2: Prediction with visualization
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results, fig = plot_prf_prediction(
							 
						 
					
						
							
								
									
										
										
										
											2025-06-12 02:52:35 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								    sequence=sequence=full_seq[0]['full_seq'],
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								    window_size=1,              # Scan every position
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    short_threshold=0.3,        # Display sites above 0.3
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    long_threshold=0.4,         # Display sites above 0.4
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ensemble_weight=0.4,        # 4:6 weight ratio
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    title="PRF Analysis Results",
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    save_path="prf_analysis.png"
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								plt.show()
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Analyze top predictions
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								top_sites = results.nlargest(5, 'Ensemble_Probability')
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								print("Top 5 predicted sites:")
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								for _, site in top_sites.iterrows():
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    print(f"Position {site['Position']}: {site['Ensemble_Probability']:.3f}")
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### Workflow 2: Region-based Prediction
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								from FScanpy import predict_prf
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								import pandas as pd
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Prepare region data
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								region_data = pd.DataFrame({
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    'sample_id': ['sample1', 'sample2', 'sample3'],
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    'Long_Sequence': [
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								        'ATGCGT...',  # 399bp sequence 1
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								        'GCTATAG...',  # 399bp sequence 2  
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								        'TTACGGA...'   # 399bp sequence 3
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ],
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    'known_label': [1, 0, 1]  # Optional: known labels for validation
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								})
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Predict PRF probabilities
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results = predict_prf(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    data=region_data,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ensemble_weight=0.3  # Favor long model (3:7 ratio)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Evaluate results
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								if 'known_label' in results.columns:
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    threshold = 0.5
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    predictions = (results['Ensemble_Probability'] > threshold).astype(int)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    accuracy = (predictions == results['known_label']).mean()
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    print(f"Accuracy at threshold {threshold}: {accuracy:.3f}")
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### Workflow 3: BLASTX-based Analysis Pipeline
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								from FScanpy import PRFPredictor, predict_prf
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								from FScanpy.data import get_test_data_path
							 
						 
					
						
							
								
									
										
										
										
											2025-05-29 17:58:48 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								from FScanpy.utils import fscanr, extract_prf_regions
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								import pandas as pd
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# Step 1: Load BLASTX data
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								blastx_data = pd.read_excel(get_test_data_path('blastx_example.xlsx'))
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								print(f"Loaded {len(blastx_data)} BLASTX hits")
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Step 2: Detect PRF sites using FScanR
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								prf_sites = fscanr(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    blastx_output=blastx_data,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    mismatch_cutoff=10,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    evalue_cutoff=1e-5,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    frameDist_cutoff=10
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								print(f"Detected {len(prf_sites)} potential PRF sites")
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Step 3: Extract sequences around PRF sites
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								mrna_file = get_test_data_path('mrna_example.fasta')
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								prf_sequences = extract_prf_regions(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    mrna_file=mrna_file,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    prf_data=prf_sites
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								print(f"Extracted {len(prf_sequences)} sequences")
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Step 4: Predict PRF probabilities
  
						 
					
						
							
								
									
										
										
										
											2025-05-29 17:58:48 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								predictor = PRFPredictor()
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								results = predictor.predict_regions(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    sequences=prf_sequences['399bp'],
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ensemble_weight=0.4
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Step 5: Combine results with metadata
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								final_results = pd.concat([
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    prf_sequences.reset_index(drop=True),
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    results.reset_index(drop=True)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								], axis=1)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Step 6: Analyze results
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								high_prob_sites = final_results[
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    final_results['Ensemble_Probability'] > 0.7
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								]
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								print(f"High probability PRF sites: {len(high_prob_sites)}")
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Display top results
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								print("\nTop PRF predictions:")
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								top_results = final_results.nlargest(3, 'Ensemble_Probability')
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								for _, row in top_results.iterrows():
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    print(f"Sequence {row['DNA_seqid']}: {row['Ensemble_Probability']:.3f}")
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### Workflow 4: Custom Analysis with Multiple Sequences
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								from FScanpy import predict_prf, plot_prf_prediction
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								import matplotlib.pyplot as plt
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Multiple sequence analysis
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								sequences = [
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    "ATGCGTACGTATGCGTACGTATGCGTACGTAAGCCCTTTGAACCCAAAGGG",
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    "GCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCAT",
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    "TTACGGATTACGGATTACGGATTACGGATTACGGATTACGGATTACGGAT"
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								]
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Batch prediction
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results = predict_prf(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    sequence=sequences,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    window_size=2,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ensemble_weight=0.5  # Equal weights
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Analyze per sequence
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								for seq_id in results['Sequence_ID'].unique():
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    seq_results = results[results['Sequence_ID'] == seq_id]
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    max_prob = seq_results['Ensemble_Probability'].max()
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    print(f"{seq_id}: Max probability = {max_prob:.3f}")
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Visualize first sequence
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								first_seq_results, fig = plot_prf_prediction(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    sequence=sequences[0],
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    ensemble_weight=0.5,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    title="First Sequence Analysis"
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								plt.show()
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								## Parameter Optimization Guidelines
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### 1. Ensemble Weight Selection
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  **Conservative (favor specificity)**: `ensemble_weight = 0.2-0.3`  (favor long model) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  **Balanced**: `ensemble_weight = 0.4-0.6`  (default recommended) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  **Sensitive (favor sensitivity)**: `ensemble_weight = 0.7-0.8`  (favor short model) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### 2. Threshold Selection
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  **Short threshold**: Usually 0.1-0.3, controls computational efficiency 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  **Display thresholds**: 0.3-0.8, controls visualization display 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  **Classification threshold**: 0.5 (standard), adjust based on validation data 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### 3. Window Size Selection
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  **Fine-grained analysis**: `window_size = 1`  (every position) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  **Standard analysis**: `window_size = 3`  (every 3rd position, default) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								-  **Coarse analysis**: `window_size = 6-9`  (faster, less detailed) 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								## Troubleshooting
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### Common Issues and Solutions
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								1.  **Model Loading Errors**  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   ```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   # Check model directory
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   import FScanpy
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   predictor = PRFPredictor(model_dir='/custom/path')
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   ```
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								2.  **Memory Issues with Large Sequences**  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   ```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   # Use larger window size to reduce computational load
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   results = predict_prf(sequence=large_seq, window_size=9)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   ```
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								3.  **Visualization Issues**  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   ```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   # Adjust figure parameters
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   results, fig = plot_prf_prediction(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								       sequence=seq,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								       figsize=(20, 10),  # Larger figure
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								       dpi=150            # Lower resolution
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   )
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   ```
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								4.  **Input Format Issues**  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   ```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   # Ensure proper DataFrame format
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   data = pd.DataFrame({
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								       'Long_Sequence': sequences,  # Use 'Long_Sequence' or '399bp'
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								       'sample_id': ids
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   })
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   ```
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								## Performance Optimization
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### 1. Batch Processing
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Process multiple sequences efficiently
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								sequences = ["seq1", "seq2", "seq3", ...]
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results = predict_prf(sequence=sequences, window_size=3)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### 2. Threshold Optimization
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# Use appropriate short_threshold to skip unnecessary long model calls
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								results = predict_prf(
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    sequence=sequence,
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    short_threshold=0.2  # Higher threshold = faster processing
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								### 3. Memory Management
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								```python
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# For very large datasets, process in chunks
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								chunk_size = 100
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								for i in range(0, len(large_dataset), chunk_size):
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    chunk = large_dataset[i:i+chunk_size]
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    chunk_results = predict_prf(data=chunk)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								    # Process chunk_results
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:28:40 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								```
							 
						 
					
						
							
								
									
										
										
										
											2025-03-18 11:21:54 +08:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								## Citation
  
						 
					
						
							
								
									
										
										
										
											2025-06-11 21:44:29 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								If you use FScanpy, please cite our paper: [Paper Link]
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								## Support
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								For questions and issues, please visit our GitHub repository or contact the development team.