Improve package introduction

2025-06-11 21:44:29 +08:00 · 2025-06-11 21:44:29 +08:00 · a2eefd1902
parent 089df9c4a6
commit a2eefd1902
3 changed files with 1022 additions and 269 deletions
--- a/FScanpy_Demo.ipynb
+++ b/FScanpy_Demo.ipynb
@@ -22,6 +22,84 @@
    "- **region_example.csv**: Sample for individual site prediction"
   ]
  },
+  {
+   "cell_type": "raw",
+   "metadata": {
+    "vscode": {
+     "languageId": "raw"
+    }
+   },
+   "source": [
+    "## 📚 FScanpy Function Usage Guide\n",
+    "\n",
+    "### Core Functions Overview\n",
+    "\n",
+    "FScanpy provides several main functions for PRF prediction:\n",
+    "\n",
+    "#### 1. `predict_prf()` - Universal Prediction Function\n",
+    "```python\n",
+    "# Single sequence prediction\n",
+    "results = predict_prf(sequence=\"ATGCGT...\", window_size=3, ensemble_weight=0.4)\n",
+    "\n",
+    "# Multiple sequences prediction  \n",
+    "results = predict_prf(sequence=[\"seq1\", \"seq2\"], window_size=3)\n",
+    "\n",
+    "# DataFrame region prediction\n",
+    "results = predict_prf(data=df_with_399bp_column, ensemble_weight=0.4)\n",
+    "```\n",
+    "\n",
+    "#### 2. `plot_prf_prediction()` - Prediction with Visualization\n",
+    "```python\n",
+    "# Basic plotting\n",
+    "results, fig = plot_prf_prediction(sequence=\"ATGCGT...\")\n",
+    "\n",
+    "# Custom parameters\n",
+    "results, fig = plot_prf_prediction(\n",
+    "    sequence=\"ATGCGT...\",\n",
+    "    window_size=1,\n",
+    "    short_threshold=0.65,\n",
+    "    long_threshold=0.8,\n",
+    "    ensemble_weight=0.4,\n",
+    "    save_path=\"plot.png\"\n",
+    ")\n",
+    "```\n",
+    "\n",
+    "#### 3. `PRFPredictor` Class Methods\n",
+    "```python\n",
+    "predictor = PRFPredictor()\n",
+    "\n",
+    "# Sliding window prediction\n",
+    "results = predictor.predict_sequence(sequence, window_size=3, ensemble_weight=0.4)\n",
+    "\n",
+    "# Region prediction\n",
+    "results = predictor.predict_regions(sequences_399bp, ensemble_weight=0.4)\n",
+    "\n",
+    "# Single position prediction\n",
+    "result = predictor.predict_single_position(fs_period_33bp, full_seq_399bp)\n",
+    "\n",
+    "# Plot prediction\n",
+    "results, fig = predictor.plot_sequence_prediction(sequence)\n",
+    "```\n",
+    "\n",
+    "#### 4. Utility Functions\n",
+    "```python\n",
+    "from FScanpy.utils import fscanr, extract_prf_regions\n",
+    "\n",
+    "# Detect PRF sites from BLASTX\n",
+    "prf_sites = fscanr(blastx_df, mismatch_cutoff=10, evalue_cutoff=1e-5)\n",
+    "\n",
+    "# Extract sequences around PRF sites\n",
+    "prf_sequences = extract_prf_regions(mrna_file, prf_sites)\n",
+    "```\n",
+    "\n",
+    "### Parameter Guidelines\n",
+    "\n",
+    "- **ensemble_weight**: 0.4 (default, balanced), 0.2-0.3 (conservative), 0.7-0.8 (sensitive)\n",
+    "- **window_size**: 1 (detailed), 3 (standard), 6-9 (fast)\n",
+    "- **short_threshold**: 0.1 (default), 0.2-0.3 (stricter filtering)\n",
+    "- **Display thresholds**: 0.3-0.8 for visualization filtering\n"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -224,13 +302,13 @@
   "source": [
    "# Run FScanR analysis\n",
    "print(\"🔍 Running FScanR analysis...\")\n",
-    "print(\"Parameter settings: mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=100\")\n",
+    "print(\"Parameter settings: mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10\")\n",
    "\n",
    "fscanr_results = fscanr(\n",
    "    blastx_data,\n",
    "    mismatch_cutoff=10,\n",
    "    evalue_cutoff=1e-5,\n",
-    "    frameDist_cutoff=100\n",
+    "    frameDist_cutoff=10\n",
    ")\n",
    "\n",
    "print(f\"\\n✅ FScanR analysis complete!\")\n",
@@ -644,6 +722,170 @@
    "print(\"The chart contains heatmaps and bar charts showing the PRF prediction probability distribution across the entire sequence.\")"
   ]
  },
+  {
+   "cell_type": "raw",
+   "metadata": {
+    "vscode": {
+     "languageId": "raw"
+    }
+   },
+   "source": [
+    "## 📖 Complete Function Reference\n",
+    "\n",
+    "### All Available Functions and Methods\n",
+    "\n",
+    "#### Core Prediction Functions\n",
+    "\n",
+    "**1. `predict_prf(sequence=None, data=None, window_size=3, short_threshold=0.1, ensemble_weight=0.4, model_dir=None)`**\n",
+    "- **Purpose**: Universal prediction function for both sliding window and region-based analysis\n",
+    "- **Input modes**: \n",
+    "  - Single/multiple sequences → sliding window prediction\n",
+    "  - DataFrame with 'Long_Sequence'/'399bp' column → region prediction\n",
+    "- **Key parameters**:\n",
+    "  - `ensemble_weight`: Short model weight (0.0-1.0, default: 0.4)\n",
+    "  - `window_size`: Scanning step size (default: 3)\n",
+    "  - `short_threshold`: Filtering threshold (default: 0.1)\n",
+    "\n",
+    "**2. `plot_prf_prediction(sequence, window_size=3, short_threshold=0.65, long_threshold=0.8, ensemble_weight=0.4, title=None, save_path=None, figsize=(12,8), dpi=300)`**\n",
+    "- **Purpose**: Prediction with built-in visualization (3-subplot layout: FS site heatmap, prediction heatmap, bar chart)\n",
+    "- **Returns**: (prediction_results_df, matplotlib_figure)\n",
+    "- **Visualization features**: \n",
+    "  - Black bars with alpha=0.6\n",
+    "  - 'Reds' colormap for heatmaps\n",
+    "  - Height ratios [0.1, 0.1, 1] for subplots\n",
+    "\n",
+    "#### PRFPredictor Class Methods\n",
+    "\n",
+    "**3. Class initialization: `PRFPredictor(model_dir=None)`**\n",
+    "- Loads HistGradientBoosting (short, 33bp) and BiLSTM-CNN (long, 399bp) models\n",
+    "- Uses ensemble weighting for final predictions\n",
+    "\n",
+    "**4. `predictor.predict_sequence(sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)`**\n",
+    "- **Purpose**: Sliding window analysis of complete sequences\n",
+    "- **Process**: Scans sequence with specified window size, applies both models\n",
+    "\n",
+    "**5. `predictor.predict_regions(sequences, short_threshold=0.1, ensemble_weight=0.4)`**\n",
+    "- **Purpose**: Batch prediction for pre-defined 399bp regions\n",
+    "- **Input**: List/Series of 399bp sequences\n",
+    "- **Efficient**: Direct region analysis without sliding window\n",
+    "\n",
+    "**6. `predictor.predict_single_position(fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)`**\n",
+    "- **Purpose**: Single position analysis\n",
+    "- **Inputs**: 33bp sequence (fs_period) + 399bp sequence (full_seq)\n",
+    "- **Returns**: Dictionary with individual and ensemble probabilities\n",
+    "\n",
+    "**7. `predictor.plot_sequence_prediction(...)`** \n",
+    "- **Purpose**: Class method version of plot_prf_prediction()\n",
+    "- **Same parameters** as standalone function\n",
+    "\n",
+    "#### Utility Functions\n",
+    "\n",
+    "**8. `fscanr(blastx_output, mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10)`**\n",
+    "- **Purpose**: Detect PRF sites from BLASTX alignment results\n",
+    "- **Input**: DataFrame with BLASTX columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore, qframe, sframe)\n",
+    "- **Output**: PRF sites with FS_start, FS_end, FS_type, Strand information\n",
+    "\n",
+    "**9. `extract_prf_regions(mrna_file, prf_data)`**\n",
+    "- **Purpose**: Extract 399bp sequences around detected PRF sites\n",
+    "- **Inputs**: FASTA file path + FScanR results DataFrame\n",
+    "- **Handles**: Strand orientation (reverse complement for '-' strand)\n",
+    "\n",
+    "#### Data Access Functions\n",
+    "\n",
+    "**10. `get_test_data_path(filename)`**\n",
+    "- **Purpose**: Get path to built-in test data files\n",
+    "- **Available files**: 'blastx_example.xlsx', 'mrna_example.fasta', 'region_example.csv'\n",
+    "\n",
+    "**11. `list_test_data()`**\n",
+    "- **Purpose**: Display all available test data files\n",
+    "\n",
+    "### Usage Pattern Examples\n",
+    "\n",
+    "#### Pattern 1: Quick Single Sequence Analysis\n",
+    "```python\n",
+    "from FScanpy import predict_prf, plot_prf_prediction\n",
+    "\n",
+    "# Simple prediction\n",
+    "results = predict_prf(sequence=\"ATGCGT...\")\n",
+    "\n",
+    "# With visualization  \n",
+    "results, fig = plot_prf_prediction(sequence=\"ATGCGT...\")\n",
+    "```\n",
+    "\n",
+    "#### Pattern 2: Batch Sequence Analysis\n",
+    "```python\n",
+    "sequences = [\"seq1\", \"seq2\", \"seq3\"]\n",
+    "results = predict_prf(sequence=sequences, ensemble_weight=0.5)\n",
+    "```\n",
+    "\n",
+    "#### Pattern 3: BLASTX Pipeline\n",
+    "```python\n",
+    "from FScanpy.utils import fscanr, extract_prf_regions\n",
+    "\n",
+    "# Step 1: Detect PRF sites\n",
+    "prf_sites = fscanr(blastx_df)\n",
+    "\n",
+    "# Step 2: Extract sequences\n",
+    "prf_sequences = extract_prf_regions(fasta_file, prf_sites)\n",
+    "\n",
+    "# Step 3: Predict probabilities\n",
+    "results = predict_prf(data=prf_sequences)\n",
+    "```\n",
+    "\n",
+    "#### Pattern 4: Custom Analysis with PRFPredictor\n",
+    "```python\n",
+    "from FScanpy import PRFPredictor\n",
+    "\n",
+    "predictor = PRFPredictor()\n",
+    "\n",
+    "# Separate calls for the different analysis types\n",
+    "seq_results = predictor.predict_sequence(sequence)\n",
+    "region_results = predictor.predict_regions(sequences_399bp)\n",
+    "single_result = predictor.predict_single_position(seq_33bp, seq_399bp)\n",
+    "```\n",
+    "\n",
+    "### Parameter Optimization Guide\n",
+    "\n",
+    "**Ensemble Weight Selection:**\n",
+    "- `0.2-0.3`: Conservative (high specificity, favor long model)\n",
+    "- `0.4-0.6`: Balanced (recommended default)\n",
+    "- `0.7-0.8`: Sensitive (high sensitivity, favor short model)\n",
+    "\n",
+    "**Window Size Selection:**\n",
+    "- `1`: High resolution, every position (slow but detailed)\n",
+    "- `3`: Standard resolution (balanced speed/detail)  \n",
+    "- `6-9`: Low resolution, faster analysis\n",
+    "\n",
+    "**Threshold Guidelines:**\n",
+    "- `short_threshold`: 0.1-0.3 (controls efficiency by filtering low-probability candidates)\n",
+    "- Display thresholds: 0.3-0.8 (controls visualization, higher = cleaner plots)\n",
+    "- Classification threshold: 0.5 (standard binary classification cutoff)\n",
+    "\n",
+    "### Output Interpretation\n",
+    "\n",
+    "**Main Result Columns:**\n",
+    "- `Short_Probability`: HistGradientBoosting model prediction (0-1)\n",
+    "- `Long_Probability`: BiLSTM-CNN model prediction (0-1)\n",
+    "- `Ensemble_Probability`: **Final prediction** (weighted combination)\n",
+    "- `Position`: Sequence position (sliding window mode)\n",
+    "- `Codon`: Codon at position (sliding window mode)\n",
+    "\n",
+    "**Ensemble Probability Interpretation:**\n",
+    "- `> 0.8`: High confidence PRF site\n",
+    "- `0.5-0.8`: Moderate confidence PRF site  \n",
+    "- `0.3-0.5`: Low confidence, worth investigating\n",
+    "- `< 0.3`: Unlikely to be PRF site\n",
+    "\n",
+    "### Best Practices\n",
+    "\n",
+    "1. **For exploration**: Use `window_size=1, ensemble_weight=0.4`\n",
+    "2. **For screening**: Use `window_size=3, ensemble_weight=0.4, short_threshold=0.2`\n",
+    "3. **For validation**: Use region-based prediction with known sequences\n",
+    "4. **For visualization**: Adjust `short_threshold` and `long_threshold` in plotting functions to control display density\n",
+    "\n",
+    "This demo covers all major FScanpy functionalities. For detailed parameter descriptions and advanced usage, please refer to the complete tutorial documentation.\n"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
--- a/README.md
+++ b/README.md
@@ -1,259 +1,286 @@
 # FScanpy
 ## A Machine Learning-Based Framework for Programmed Ribosomal Frameshifting Prediction

-FScanpy is a comprehensive Python package designed for the prediction of [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. By integrating advanced machine learning approaches (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework, FScanpy provides robust and accurate PRF site predictions. The package requires input sequences to be in the positive (5' to 3') orientation.
+[![Python](https://img.shields.io/badge/Python-3.7%2B-blue.svg)](https://www.python.org/)
+[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
+
+FScanpy is a comprehensive Python package designed for the prediction of [Programmed Ribosomal Frameshifting (PRF)](https://en.wikipedia.org/wiki/Ribosomal_frameshift) sites in nucleotide sequences. By integrating advanced machine learning approaches (HistGradientBoosting and BiLSTM-CNN) with the established [FScanR](https://github.com/seanchen607/FScanR.git) framework, FScanpy provides robust and accurate PRF site predictions.

 ![FScanpy Architecture](/tutorial/image/structure.jpeg)

-For detailed documentation and usage examples, please refer to our [tutorial](tutorial/tutorial.md).
+## 🌟 Key Features

-## 🚀 What's New in v0.3.0
+### 🎯 **Dual-Model Architecture**
+- **Short Model** (`HistGradientBoosting`): Fast screening with 33bp sequences
+- **Long Model** (`BiLSTM-CNN`): Deep analysis with 399bp sequences  
+- **Ensemble Prediction**: Customizable model weights for optimal performance

-### Model Naming Optimization
- **Short Model** (`short.pkl`): HistGradientBoosting model for rapid screening
- **Long Model** (`long.pkl`): BiLSTM-CNN model for detailed analysis
- **Unified Interface**: Consistent parameter naming and clearer output fields
+### 🚀 **Versatile Input Support**
+- **Single/Multiple Sequences**: Sliding window prediction across full sequences
+- **Region-Based Analysis**: Direct prediction on pre-extracted 399bp regions
+- **BLASTX Integration**: Seamless workflow with FScanR pipeline
+- **Cross-Species Compatibility**: Built-in databases for viruses, marine phages, Euplotes, etc.

-### Performance Improvements
- **Faster Prediction**: Optimized model type detection and reduced redundant operations
- **Better Error Handling**: More informative error messages and robust exception handling
- **Code Quality**: Reduced code duplication and improved maintainability
+### 📊 **Advanced Visualization**
+- **Interactive Heatmaps**: FS site probability visualization
+- **Prediction Plots**: Combined probability and confidence displays
+- **Customizable Thresholds**: Separate filtering for each model
+- **Export Options**: PNG, PDF, and interactive formats

-### 🎨 New Visualization Features
- **Sequence Plotting**: Built-in function for visualizing PRF prediction results
- **Dual Threshold Filtering**: Separate filtering for Short and Long models
- **Interactive Graphics**: Heatmap and bar chart visualization
- **Export Options**: Support for PNG and PDF output formats
+### ⚡ **High Performance**
+- **Optimized Algorithms**: Efficient sliding window scanning
+- **Batch Processing**: Handle multiple sequences simultaneously
+- **Flexible Thresholds**: Tunable sensitivity for different use cases
+- **Memory Efficient**: Optimized for large-scale genomic data

-### ⚖️ Ensemble Weighting System
- **Flexible Ensemble**: Control the contribution of Short and Long models
- **Weight Validation**: Automatic parameter validation and error handling
- **Clear Naming**: `ensemble_weight` parameter for intuitive usage
- **Visual Feedback**: Weight ratios displayed in plots and results
+## 🔧 Installation

-### 🔧 API Improvements
- **Method Renaming**: More intuitive method names
-  - `predict_sequence()`: Replaces `predict_full()` for sequence prediction
-  - `predict_regions()`: Replaces `predict_region()` for batch prediction
- **Field Standardization**: Consistent output field naming
-  - `Ensemble_Probability`: Main prediction result (replaces `Voting_Probability`)
-  - `Short_Sequence` / `Long_Sequence`: Clear sequence field names
- **Backward Compatibility**: Deprecated methods still work with warnings
-
-## Core Features
- **Sequence Feature Extraction**: Support for extracting features from nucleic acid sequences, including base composition, k-mer features, and positional features.
- **Frameshift Hotspot Region Prediction**: Predict potential PRF sites in nucleotide sequences using machine learning models.
- **Feature Extraction**: Extract relevant features from sequences to assist in prediction.
- **Cross-Species Support**: Built-in databases for viruses, marine phages, Euplotes, etc., enabling PRF prediction across various species.
- **Visualization Tools**: Built-in plotting functions for result visualization and analysis.
- **Ensemble Modeling**: Customizable ensemble weights for different prediction strategies.
-
-## Main Advantages
- **High Accuracy**: Integrates multiple machine learning models to provide accurate PRF site predictions.
- **Efficiency**: Utilizes a sliding window approach and feature extraction techniques to rapidly scan sequences.
- **Versatility**: Supports PRF prediction across various species and can be combined with the [FScanR](https://github.com/seanchen607/FScanR.git) framework for enhanced accuracy.
- **User-Friendly**: Comes with detailed documentation and usage examples, making it easy for researchers to use.
- **Flexible**: Provides different resolutions to suit different using situations.
-
-## Quick Start
-
-### Basic Prediction
-```python
-from FScanpy import predict_prf
-
-# Single sequence prediction with default ensemble weights (0.4:0.6)
-sequence = "ATGCGTACGT..."
-results = predict_prf(sequence=sequence)
-print(results[['Position', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']].head())
-```
-
-### Custom Ensemble Weighting
-```python
-# Adjust model weights for different prediction strategies
-results_long_dominant = predict_prf(sequence=sequence, ensemble_weight=0.3)   # 3:7 ratio (Long dominant)
-results_equal_weight = predict_prf(sequence=sequence, ensemble_weight=0.5)    # 5:5 ratio (Equal weight)
-results_short_dominant = predict_prf(sequence=sequence, ensemble_weight=0.7)  # 7:3 ratio (Short dominant)
-
-# Compare ensemble probabilities
-print("Long dominant:", results_long_dominant['Ensemble_Probability'].mean())
-print("Equal weight:", results_equal_weight['Ensemble_Probability'].mean())
-print("Short dominant:", results_short_dominant['Ensemble_Probability'].mean())
-```
-
-### Visualization with Custom Weights
-```python
-from FScanpy import plot_prf_prediction
-import matplotlib.pyplot as plt
-
-# Generate prediction plot with custom ensemble weighting
-sequence = "ATGCGTACGT..."
-results, fig = plot_prf_prediction(
-    sequence=sequence,
-    short_threshold=0.65,     # HistGB threshold
-    long_threshold=0.8,       # BiLSTM-CNN threshold
-    ensemble_weight=0.3,      # Custom weight: 30% Short, 70% Long
-    title="Long-Dominant Ensemble PRF Prediction (3:7)",
-    save_path="prediction_result.png"
-)
-
-plt.show()
-```
-
-### Advanced Usage with New API
-```python
-from FScanpy import PRFPredictor
-import matplotlib.pyplot as plt
-
-# Create predictor instance
-predictor = PRFPredictor()
-
-# Use new sequence prediction method
-results = predictor.predict_sequence(
-    sequence=sequence,
-    ensemble_weight=0.4
-)
-
-# Compare different ensemble configurations
-weights = [0.2, 0.4, 0.6, 0.8]
-weight_names = ["Long 80%", "Balanced", "Short 60%", "Short 80%"]
-
-fig, axes = plt.subplots(2, 2, figsize=(15, 10))
-axes = axes.flatten()
-
-for i, (weight, name) in enumerate(zip(weights, weight_names)):
-    results = predictor.predict_sequence(sequence=sequence, ensemble_weight=weight)
-    ax = axes[i]
-    ax.bar(results['Position'], results['Ensemble_Probability'], alpha=0.7)
-    ax.set_title(f'{name} (Weight: {weight:.1f}:{1-weight:.1f})')
-    ax.set_ylabel('Probability')
-
-plt.tight_layout()
-plt.show()
-```
-
-### Batch Region Prediction
-```python
-# Predict multiple 399bp sequences
-import pandas as pd
-
-data = pd.DataFrame({
-    'Long_Sequence': ['ATGCGT...' * 60, 'GCTATAG...' * 57]  # 399bp sequences
-})
-
-results = predict_prf(data=data, ensemble_weight=0.4)
-print(results[['Ensemble_Probability', 'Ensemble_Weights']].head())
-```
-
-## Installation Requirements
+### Prerequisites
 - Python ≥ 3.7
- Dependencies are automatically handled during installation
+- All dependencies are automatically installed

-### Option 1: Install via pip
+### Install via pip (Recommended)
 ```bash
 pip install FScanpy
 ```

-### Option 2: Install from source
+### Install from Source
 ```bash
-git clone git@60.204.158.188:yyh/FScanpy-package.git
+git clone https://github.com/your-org/FScanpy-package.git
 cd FScanpy-package
 pip install -e .
 ```

-## 🔄 Migration from Previous Versions
+## 🚀 Quick Start

-### API Changes Summary
+### Basic Usage
 ```python
-# OLD API (deprecated but still works)
-results = predict_prf(sequence="ATGC...", short_weight=0.4)
-results = predictor.predict_full(sequence, short_weight=0.4)
-results = predictor.predict_region(sequences, short_weight=0.4)
+from FScanpy import predict_prf

-# NEW API (recommended)
-results = predict_prf(sequence="ATGC...", ensemble_weight=0.4)
-results = predictor.predict_sequence(sequence, ensemble_weight=0.4)
-results = predictor.predict_regions(sequences, ensemble_weight=0.4)
+# Simple sequence prediction
+sequence = "ATGCGTACGTTAGC..." # Your DNA sequence
+results = predict_prf(sequence=sequence)

-# Output field changes
-# OLD: 'Voting_Probability', 'Weight_Info', '33bp', '399bp'
-# NEW: 'Ensemble_Probability', 'Ensemble_Weights', 'Short_Sequence', 'Long_Sequence'
+# View top predictions
+print(results[['Position', 'Ensemble_Probability', 'Short_Probability', 'Long_Probability']].head(10))
+```

-# Visualization with ensemble weights
+### Visualization
+```python
+from FScanpy import plot_prf_prediction
+
+# Generate prediction plot
 results, fig = plot_prf_prediction(
-    sequence="ATGC...", 
-    short_threshold=0.65, 
-    long_threshold=0.8,
-    ensemble_weight=0.3  # 30% Short, 70% Long
+    sequence=sequence,
+    short_threshold=0.65,    # HistGB threshold
+    long_threshold=0.8,      # BiLSTM-CNN threshold
+    ensemble_weight=0.4,     # 40% Short, 60% Long
+    title="PRF Prediction Results"
 )
 ```

-### Backward Compatibility
- All old methods still work but will show deprecation warnings
- Old field names are automatically added for compatibility
- Gradual migration is supported
+### Advanced Usage
+```python
+from FScanpy import PRFPredictor
+import pandas as pd

-## Ensemble Weight Configuration Guide
+# Create predictor instance
+predictor = PRFPredictor()

-### Recommended Weights for Different Scenarios:
+# Batch prediction on pre-extracted regions
+data = pd.DataFrame({
+    'Long_Sequence': ['ATG' * 133, 'GCT' * 133]  # placeholder 399 bp sequences (3 nt x 133)
+})
+results = predictor.predict_regions(data, ensemble_weight=0.4)

-| Scenario | ensemble_weight | Description | Use Case |
-|----------|----------------|-------------|----------|
-| **High Sensitivity** | 0.2-0.3 | Long model dominant | Detecting subtle PRF sites |
-| **Balanced Detection** | 0.4-0.5 | Balanced ensemble (recommended) | General purpose prediction |
-| **Fast Screening** | 0.6-0.7 | Short model dominant | Rapid initial screening |
-| **Equal Contribution** | 0.5 | Equal weight to both models | Comparative analysis |
+# Sequence-level prediction with custom parameters
+results = predictor.predict_sequence(
+    sequence=sequence,
+    window_size=1,           # Step size for sliding window
+    ensemble_weight=0.3,     # Model weighting
+    short_threshold=0.5      # Filtering threshold
+)
+```

-### Weight Selection Guidelines:
- **Low ensemble_weight (0.2-0.3)**: 
-  - Emphasizes Long model (BiLSTM-CNN)
-  - Better for detecting complex patterns
-  - Higher sensitivity, may have more false positives
-  
- **High ensemble_weight (0.6-0.8)**: 
-  - Emphasizes Short model (HistGB)
-  - Faster computation
-  - Good for initial screening
-  - Higher specificity, may miss subtle sites
-  
- **Balanced (0.4-0.5)**: 
-  - Recommended for most applications
-  - Good balance of sensitivity and specificity
-  - Suitable for comprehensive analysis
+## 🎛️ Ensemble Weight Configuration

-## Output Field Reference
+The `ensemble_weight` parameter controls the contribution of each model:

-### Main Prediction Fields
- **`Short_Probability`**: HistGradientBoosting model prediction (0-1)
- **`Long_Probability`**: BiLSTM-CNN model prediction (0-1)
- **`Ensemble_Probability`**: Final ensemble prediction (primary result)
- **`Ensemble_Weights`**: Weight configuration information
+| ensemble_weight | Short Model | Long Model | Best For |
+|----------------|-------------|------------|----------|
+| **0.2-0.3** | 20-30% | 70-80% | **High sensitivity**, detecting subtle sites |
+| **0.4-0.5** | 40-50% | 50-60% | **Balanced detection** (recommended) |
+| **0.6-0.7** | 60-70% | 30-40% | **Fast screening**, high specificity |

-### Sequence Fields
- **`Short_Sequence`**: 33bp sequence used by Short model
- **`Long_Sequence`**: 399bp sequence used by Long model
+### Weight Selection Examples
+```python
+# High sensitivity (Long model dominant)
+sensitive_results = predict_prf(sequence, ensemble_weight=0.2)
+
+# Balanced approach (recommended)
+balanced_results = predict_prf(sequence, ensemble_weight=0.4)
+
+# Fast screening (Short model dominant)  
+screening_results = predict_prf(sequence, ensemble_weight=0.7)
+```
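
As a rough illustration of what the weights in the table above mean, the sketch below blends two probabilities linearly, with `ensemble_weight` as the Short model's share. The linear formula and the helper name `ensemble_probability` are assumptions made for illustration; this README does not show the package's actual combination logic.

```python
def ensemble_probability(short_prob: float, long_prob: float,
                         ensemble_weight: float = 0.4) -> float:
    """Blend two model probabilities; ensemble_weight is the Short model's share.

    Assumed formula (not taken from the FScanpy source):
        ensemble = w * short + (1 - w) * long
    """
    if not 0.0 <= ensemble_weight <= 1.0:
        raise ValueError("ensemble_weight must be between 0.0 and 1.0")
    return ensemble_weight * short_prob + (1.0 - ensemble_weight) * long_prob


# The same pair of made-up model outputs under three weightings from the table.
short_prob, long_prob = 0.9, 0.5
for weight in (0.2, 0.4, 0.7):
    blended = ensemble_probability(short_prob, long_prob, weight)
    print(f"ensemble_weight={weight}: {blended:.2f}")
```

Under this assumption, raising `ensemble_weight` pulls the result toward the Short model's score and lowering it pulls the result toward the Long model's score.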
+
+## 📊 Core Functions
+
+### Main Prediction Interface
+```python
+predict_prf(
+    sequence=None,           # Single/multiple sequences or None
+    data=None,              # DataFrame with 399bp sequences or None
+    window_size=3,          # Sliding window step size
+    short_threshold=0.1,    # Short model filtering threshold
+    ensemble_weight=0.4,    # Short model weight (0.0-1.0)
+    model_dir=None         # Custom model directory
+)
+```
+
+### Visualization Function
+```python
+plot_prf_prediction(
+    sequence,               # Input DNA sequence
+    window_size=3,          # Scanning step size
+    short_threshold=0.65,   # Short model threshold for plotting
+    long_threshold=0.8,     # Long model threshold for plotting
+    ensemble_weight=0.4,    # Model weighting
+    title=None,            # Plot title
+    save_path=None,        # Save file path
+    figsize=(12,8),        # Figure size
+    dpi=300               # Resolution for saved plots
+)
+```
+
+### PRFPredictor Class Methods
+```python
+predictor = PRFPredictor()
+
+# Sequence prediction (sliding window)
+predictor.predict_sequence(sequence, ensemble_weight=0.4)
+
+# Region prediction (batch processing)
+predictor.predict_regions(dataframe, ensemble_weight=0.4)
+
+# Feature extraction
+predictor.extract_features(sequences)
+
+# Model information
+predictor.get_model_info()
+```
+
+## 📈 Output Fields
+
+### Prediction Results
 - **`Position`**: Position in the original sequence
- **`Codon`**: 3bp codon at the position
+- **`Ensemble_Probability`**: Final ensemble prediction (main result)
+- **`Short_Probability`**: HistGradientBoosting prediction (0-1)
+- **`Long_Probability`**: BiLSTM-CNN prediction (0-1)
+- **`Ensemble_Weights`**: Model weight configuration used

-### Metadata Fields
- **`Sequence_ID`**: Identifier for multi-sequence predictions
- Additional fields from input DataFrame (for region predictions)
+### Sequence Information
+- **`Short_Sequence`**: 33bp sequence for Short model
+- **`Long_Sequence`**: 399bp sequence for Long model  
+- **`Codon`**: 3bp codon at the prediction position
+- **`Sequence_ID`**: Identifier for multi-sequence inputs
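
The demo notebook in this commit suggests reading `Ensemble_Probability` in four bands (above 0.8, 0.5-0.8, 0.3-0.5, below 0.3). A minimal sketch of that bucketing follows. The function name `confidence_tier` and the handling of values that fall exactly on a boundary are assumptions, not part of the FScanpy API.

```python
def confidence_tier(ensemble_probability: float) -> str:
    """Map an ensemble probability to the interpretation bands from the notebook.

    Boundary handling is an assumption: exactly 0.8 counts as "moderate",
    exactly 0.5 as "moderate", and exactly 0.3 as "low".
    """
    if ensemble_probability > 0.8:
        return "high"
    if ensemble_probability >= 0.5:
        return "moderate"
    if ensemble_probability >= 0.3:
        return "low"
    return "unlikely"


# Made-up probabilities, one per band.
for p in (0.92, 0.61, 0.35, 0.10):
    print(p, confidence_tier(p))
```

With a results DataFrame, the same function could be applied to the `Ensemble_Probability` column, for example via `results['Ensemble_Probability'].map(confidence_tier)`.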

-## Examples
+## 🔬 Integration with FScanR

-See `example_plot_prediction.py` for comprehensive examples of:
- Basic prediction plotting
- Custom threshold configuration
- Ensemble weight parameter usage and comparison
- New API method demonstrations
- Saving plots to files
- Advanced visualization options
+FScanpy works seamlessly with the FScanR pipeline for comprehensive PRF analysis:

-## Authors
+```python
+from FScanpy import fscanr, extract_prf_regions, predict_prf

+# Step 1: BLASTX analysis with FScanR
+blastx_results = fscanr(
+    blastx_data,
+    mismatch_cutoff=10,
+    evalue_cutoff=1e-5,
+    frameDist_cutoff=10
+)

-## Citation
-If you utilize FScanpy in your research, please cite our work:
+# Step 2: Extract PRF candidate regions
+prf_regions = extract_prf_regions(original_sequence, blastx_results)
+
+# Step 3: Predict with FScanpy
+final_predictions = predict_prf(data=prf_regions, ensemble_weight=0.4)
+```
+
+## 📚 Documentation
+
+- **[Complete Tutorial](tutorial/tutorial.md)**: Comprehensive usage guide with examples
+- **[Demo Notebook](FScanpy_Demo.ipynb)**: Interactive examples and workflows
+- **[Example Scripts](example_plot_prediction.py)**: Ready-to-run code examples
+
+## 🎯 Use Cases
+
+### 1. **Viral Genome Analysis**
+```python
+# Scan viral genome for PRF sites
+viral_sequence = load_viral_genome()
+prf_sites = predict_prf(viral_sequence, ensemble_weight=0.3)
+high_confidence = prf_sites[prf_sites['Ensemble_Probability'] > 0.8]
+```
+
+### 2. **Comparative Genomics**
+```python
+# Compare PRF patterns across species
+species_data = pd.DataFrame({
+    'Species': ['Virus_A', 'Virus_B'],
+    'Long_Sequence': [seq_a_399bp, seq_b_399bp]
+})
+comparative_results = predict_prf(data=species_data)
+```
+
+### 3. **High-Throughput Screening**
+```python
+# Fast screening of large sequence datasets
+sequences = load_large_dataset()
+screening_results = predict_prf(
+    sequence=sequences,
+    ensemble_weight=0.7,  # Fast screening mode
+    short_threshold=0.3
+)
+```
+
+## 🤝 Contributing
+
+We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
+
+## 📝 Citation
+
+If you use FScanpy in your research, please cite:

 ```bibtex
-[Citation details will be added upon publication]
+@software{fscanpy2024,
+  title={FScanpy: A Machine Learning Framework for Programmed Ribosomal Frameshifting Prediction},
+  author={[Author names]},
+  year={2024},
+  url={https://github.com/your-org/FScanpy}
+}
 ```
+
+## 📄 License
+
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+## 🆘 Support
+
+- **Issues**: [GitHub Issues](https://github.com/your-org/FScanpy/issues)
+- **Documentation**: [Tutorial](tutorial/tutorial.md)
+- **Examples**: [Demo Notebook](FScanpy_Demo.ipynb)
+
+## 🏗️ Dependencies
+
+FScanpy automatically installs all required dependencies:
+- `numpy>=1.19.0`
+- `pandas>=1.2.0`
+- `scikit-learn>=1.0.0`
+- `tensorflow>=2.6.0`
+- `matplotlib>=3.3.0`
+- `seaborn>=0.11.0`
+
+---
+
+**FScanpy** - Advancing programmed ribosomal frameshifting research through machine learning 🧬
--- a/tutorial/tutorial.md
+++ b/tutorial/tutorial.md
@@ -1,3 +1,5 @@
+# FScanpy Tutorial - Complete Usage Guide
+
 ## Abstract
 FScanpy is a Python package designed to predict Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. This package integrates machine learning models, sequence feature analysis, and visualization capabilities to help researchers rapidly locate potential PRF sites.

@@ -7,6 +9,7 @@ FScanpy is a Python package designed to predict Programmed Ribosomal Frameshifti
 FScanpy is a Python package for predicting Programmed Ribosomal Frameshifting (PRF) sites in DNA sequences. It integrates two machine learning models (HistGradientBoosting and BiLSTM-CNN) with the FScanR package to provide PRF predictions. It accepts three types of input: a complete cDNA/mRNA sequence to scan, the nucleotide sequence around a suspected frameshift site, and BLASTX results against a peptide library of the same or a related species. Input sequences are expected to be on the + strand, and FScanpy can be combined with FScanR to improve accuracy.

 ![Machine learning models](/image/ML.png)
+
 For the prediction of the entire sequence, FScanpy adopts a sliding window approach to scan the entire sequence and predict the PRF sites. For regional prediction, it is based on the 33-bp and 399-bp sequences in the 0 reading frame around the suspected frameshift site. Initially, the Short model (HistGradientBoosting) will predict the potential PRF sites within the scanning window. If the predicted probability exceeds the threshold, the Long model (BiLSTM-CNN) will predict the PRF sites in the 399bp sequence. Then, ensemble weighting combines the two models to make the final prediction.

 For PRF detection from BLASTX output, [FScanR](https://github.com/seanchen607/FScanR.git) identifies potential PRF sites from the BLASTX alignment results: it takes the two hits of the same query sequence and filters them with `frameDist_cutoff`, `mismatch_cutoff`, and `evalue_cutoff`. FScanpy then predicts the PRF probability of the resulting sites.
@ -32,102 +35,583 @@ pip install FScanpy
 ### 2. Clone from GitHub
 ```bash
 git clone https://github.com/.../FScanpy.git
-cd your_project_directory
+cd FScanpy
 pip install -e .
 ```

-## Methods and Usage
+## Complete Function Reference
+
+### 1. Core Prediction Functions
+
+#### 1.1 `predict_prf()` - Main Prediction Interface
+
+**Function Signature:**
+```python
+def predict_prf(
+    sequence: Union[str, List[str], None] = None,
+    data: Union[pd.DataFrame, None] = None,
+    window_size: int = 3,
+    short_threshold: float = 0.1,
+    ensemble_weight: float = 0.4,
+    model_dir: str = None
+) -> pd.DataFrame
+```
+
+**Parameters:**
+- `sequence`: Single or multiple DNA sequences for sliding window prediction
+- `data`: DataFrame data, must contain 'Long_Sequence' or '399bp' column for region prediction  
+- `window_size`: Sliding window size (default: 3, recommended: 1-10)
+- `short_threshold`: Short model (HistGB) probability threshold; only positions scoring above it are passed to the long model (default: 0.1, range: 0.0-1.0)
+- `ensemble_weight`: Weight of short model in ensemble (default: 0.4, range: 0.0-1.0)
+- `model_dir`: Model directory path (optional, uses built-in models if None)
+
+**Returns:**
+- `pd.DataFrame`: Prediction results with columns:
+  - `Short_Probability`: Short model prediction probability
+  - `Long_Probability`: Long model prediction probability  
+  - `Ensemble_Probability`: Ensemble prediction probability (main result)
+  - `Position`: Position in sequence (for sliding window mode)
+  - `Codon`: Codon at position (for sliding window mode)
+  - `Ensemble_Weights`: Weight configuration information
+
+**Usage Examples:**
+
+```python
+from FScanpy import predict_prf
+
+# 1. Single sequence sliding window prediction
+sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
+results = predict_prf(sequence=sequence)
+
+# 2. Multiple sequences prediction
+sequences = ["ATGCGTACGT...", "GCTATAGCAT..."]
+results = predict_prf(sequence=sequences)
+
+# 3. Custom parameters
+results = predict_prf(
+    sequence=sequence, 
+    window_size=1,           # Scan every position
+    short_threshold=0.2,     # Higher threshold
+    ensemble_weight=0.3      # 3:7 ratio (short:long)
+)
+
+# 4. DataFrame region prediction
+import pandas as pd
+data = pd.DataFrame({
+    'Long_Sequence': ['ATGCGT...', 'GCTATAG...'],  # or use '399bp'
+    'sample_id': ['sample1', 'sample2']
+})
+results = predict_prf(data=data)
+```
+
+#### 1.2 `plot_prf_prediction()` - Prediction with Visualization
+
+**Function Signature:**
+```python
+def plot_prf_prediction(
+    sequence: str,
+    window_size: int = 3,
+    short_threshold: float = 0.65,
+    long_threshold: float = 0.8,
+    ensemble_weight: float = 0.4,
+    title: str = None,
+    save_path: str = None,
+    figsize: tuple = (12, 8),
+    dpi: int = 300,
+    model_dir: str = None
+) -> tuple
+```
+
+**Parameters:**
+- `sequence`: Input DNA sequence (string)
+- `window_size`: Sliding window size (default: 3)
+- `short_threshold`: Short model filtering threshold for heatmap display (default: 0.65)
+- `long_threshold`: Long model filtering threshold for heatmap display (default: 0.8)
+- `ensemble_weight`: Weight of short model in ensemble (default: 0.4)
+- `title`: Plot title (optional, auto-generated if None)
+- `save_path`: Save path (optional, saves plot if provided)
+- `figsize`: Figure size tuple (default: (12, 8))
+- `dpi`: Figure resolution (default: 300)
+- `model_dir`: Model directory path (optional)
+
+**Returns:**
+- `tuple`: (prediction_results: pd.DataFrame, figure: matplotlib.figure.Figure)
+
+**Usage Examples:**
+
+```python
+from FScanpy import plot_prf_prediction
+import matplotlib.pyplot as plt
+
+# 1. Basic plotting
+sequence = "ATGCGTACGTATGCGTACGTATGCGTACGT"
+results, fig = plot_prf_prediction(sequence)
+plt.show()
+
+# 2. Custom thresholds and weights
+results, fig = plot_prf_prediction(
+    sequence, 
+    short_threshold=0.7,     # Higher display threshold
+    long_threshold=0.85,     # Higher display threshold
+    ensemble_weight=0.3,     # 3:7 weight ratio
+    title="Custom Analysis Results",
+    save_path="analysis.png",
+    figsize=(15, 10),
+    dpi=150
+)
+
+# 3. High-resolution analysis
+results, fig = plot_prf_prediction(
+    sequence,
+    window_size=1,           # Scan every position
+    ensemble_weight=0.5,     # Equal weights
+    dpi=600                  # High resolution
+)
+```
+
+### 2. PRFPredictor Class Methods
+
+#### 2.1 Class Initialization

-### 1. Load model and test data
 ```python
 from FScanpy import PRFPredictor
-from FScanpy.data import get_test_data_path, list_test_data
-predictor = PRFPredictor() # load model
-list_test_data() # list all the test data
-blastx_file = get_test_data_path('blastx_example.xlsx')
-mrna_file = get_test_data_path('mrna_example.fasta')
-region_example = get_test_data_path('region_example.xlsx')
+
+# Initialize with default models
+predictor = PRFPredictor()
+
+# Initialize with custom model directory
+predictor = PRFPredictor(model_dir='/path/to/models')
 ```

-### 2. Predict PRF Sites in a Full Sequence
-Use the `predict_sequence()` method to scan the entire sequence:
+#### 2.2 `predict_sequence()` - Sliding Window Prediction
+
+**Method Signature:**
 ```python
+def predict_sequence(self, sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)
+```
+
+**Parameters:**
+- `sequence`: Input DNA sequence
+- `window_size`: Sliding window size (default: 3)
+- `short_threshold`: Short model probability threshold (default: 0.1)
+- `ensemble_weight`: Short model weight in ensemble (default: 0.4)
+
+**Usage:**
+```python
+predictor = PRFPredictor()
 results = predictor.predict_sequence(
-    sequence='ATGCGTACGTATGCGTACGTATGCGTACGT',
-    window_size=3,           # Scanning window size
-    short_threshold=0.1,     # Short model threshold
-    ensemble_weight=0.4      # Ensemble weight (Short:Long = 0.4:0.6)
-)
-
-# With visualization
-results, fig = predictor.plot_sequence_prediction(
-    sequence='ATGCGTACGTATGCGTACGTATGCGTACGT',
-    ensemble_weight=0.4
+    sequence="ATGCGTACGT...",
+    window_size=1,
+    short_threshold=0.15,
+    ensemble_weight=0.35
 )
 ```

-### 3. Predict PRF in Specific Regions
-Use the `predict_regions()` method to predict PRF in known regions of interest:
+#### 2.3 `predict_regions()` - Region-based Prediction
+
+**Method Signature:**
 ```python
-import pandas as pd
-region_example = pd.read_excel(get_test_data_path('region_example.xlsx'))
+def predict_regions(self, sequences, short_threshold=0.1, ensemble_weight=0.4)
+```
+
+**Parameters:**
+- `sequences`: List or Series of 399bp sequences
+- `short_threshold`: Short model probability threshold (default: 0.1)
+- `ensemble_weight`: Short model weight in ensemble (default: 0.4)
+
+**Usage:**
+```python
+predictor = PRFPredictor()
+sequences = ["ATGCGT...", "GCTATAG..."]  # 399bp sequences
 results = predictor.predict_regions(
-    sequences=region_example['399bp'],
+    sequences=sequences,
+    short_threshold=0.1,
    ensemble_weight=0.4
 )
 ```

-### 4. Identify PRF Sites from BLASTX Output
-BLASTX Output should contain the following columns: `qseqid`, `sseqid`, `pident`, `length`, `mismatch`, `gapopen`, `qstart`, `qend`, `sstart`, `send`, `evalue`, `bitscore`, `qframe`, `sframe`.
+#### 2.4 `predict_single_position()` - Single Position Prediction

-Use the FScanR function to identify potential PRF sites from BLASTX alignment results:
+**Method Signature:**
+```python
+def predict_single_position(self, fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)
+```
+
+**Parameters:**
+- `fs_period`: 33bp sequence around frameshift site
+- `full_seq`: 399bp sequence for long model
+- `short_threshold`: Short model probability threshold (default: 0.1)
+- `ensemble_weight`: Short model weight in ensemble (default: 0.4)
+
+**Usage:**
+```python
+predictor = PRFPredictor()
+result = predictor.predict_single_position(
+    fs_period="ATGCGTACGTATGCGTACGTATGCGTACGTATG",  # 33bp
+    full_seq="ATG" * 133,  # 399bp placeholder sequence
+    short_threshold=0.1,
+    ensemble_weight=0.4
+)
+```
+
+#### 2.5 `plot_sequence_prediction()` - Class Method for Plotting
+
+**Method Signature:**
+```python
+def plot_sequence_prediction(self, sequence, window_size=3, short_threshold=0.65, 
+                           long_threshold=0.8, ensemble_weight=0.4, title=None, 
+                           save_path=None, figsize=(12, 8), dpi=300)
+```
+
+**Usage:**
+```python
+predictor = PRFPredictor()
+results, fig = predictor.plot_sequence_prediction(
+    sequence="ATGCGTACGT...",
+    window_size=3,
+    ensemble_weight=0.4
+)
+```
+
+### 3. Utility Functions
+
+#### 3.1 `fscanr()` - PRF Site Detection from BLASTX
+
+**Function Signature:**
+```python
+def fscanr(
+    blastx_output: pd.DataFrame,
+    mismatch_cutoff: float = 10,
+    evalue_cutoff: float = 1e-5,
+    frameDist_cutoff: float = 10
+) -> pd.DataFrame
+```
+
+**Parameters:**
+- `blastx_output`: BLASTX output DataFrame with required columns:
+  - `qseqid`, `sseqid`, `pident`, `length`, `mismatch`, `gapopen`
+  - `qstart`, `qend`, `sstart`, `send`, `evalue`, `bitscore`, `qframe`, `sframe`
+- `mismatch_cutoff`: Maximum allowed mismatches (default: 10)
+- `evalue_cutoff`: E-value threshold (default: 1e-5)
+- `frameDist_cutoff`: Frame distance threshold (default: 10)
+
+**Returns:**
+- `pd.DataFrame`: PRF sites with columns:
+  - `DNA_seqid`: Sequence identifier
+  - `FS_start`, `FS_end`: Frameshift start and end positions
+  - `Pep_seqid`: Peptide sequence identifier
+  - `Pep_FS_start`, `Pep_FS_end`: Peptide frameshift positions
+  - `FS_type`: Type of frameshift (-2, -1, 1, 2)
+  - `Strand`: Strand orientation (+, -)
+
+**Usage:**
 ```python
 from FScanpy.utils import fscanr
-blastx_output = pd.read_excel(get_test_data_path('blastx_example.xlsx'))
-fscanr_result = fscanr(blastx_output, 
-                      mismatch_cutoff=10,    # Allowed mismatches
-                      evalue_cutoff=1e-5,    # E-value threshold
-                      frameDist_cutoff=10)   # Frame distance threshold
+import pandas as pd
+
+# Load BLASTX results
+blastx_data = pd.read_excel('blastx_results.xlsx')
+
+# Detect PRF sites
+prf_sites = fscanr(
+    blastx_output=blastx_data,
+    mismatch_cutoff=5,       # Stricter mismatch filter
+    evalue_cutoff=1e-6,      # Stricter E-value filter
+    frameDist_cutoff=15      # Allow larger frame distances
+)
 ```

-### 5. Extract PRF Sites and Evaluate
-Use the `extract_prf_regions()` method to extract PRF site sequences from mRNA sequences:
+#### 3.2 `extract_prf_regions()` - Extract Sequences Around PRF Sites
+
+**Function Signature:**
+```python
+def extract_prf_regions(mrna_file: str, prf_data: pd.DataFrame) -> pd.DataFrame
+```
+
+**Parameters:**
+- `mrna_file`: Path to mRNA sequences file (FASTA format)
+- `prf_data`: DataFrame from `fscanr()` output
+
+**Returns:**
+- `pd.DataFrame`: Extracted sequences with columns:
+  - `DNA_seqid`: Sequence identifier
+  - `FS_start`, `FS_end`: Frameshift positions
+  - `Strand`: Strand orientation
+  - `399bp`: Extracted 399bp sequence
+  - `FS_type`: Frameshift type
+
+**Usage:**
 ```python
 from FScanpy.utils import extract_prf_regions
-prf_regions = extract_prf_regions(
-    mrna_file=get_test_data_path('mrna_example.fasta'),
-    prf_data=fscanr_result
+
+# Extract sequences around PRF sites
+prf_sequences = extract_prf_regions(
+    mrna_file='sequences.fasta',
+    prf_data=prf_sites
 )
-prf_results = predictor.predict_regions(prf_regions['399bp'])
+
+# Predict PRF probabilities (prf_sites is the fscanr() output from section 3.1)
+from FScanpy import PRFPredictor
+predictor = PRFPredictor()
+results = predictor.predict_regions(prf_sequences['399bp'])
 ```

-## Complete Workflow Example
+### 4. Data Access Functions
+
+#### 4.1 Test Data Access
+
 ```python
-from FScanpy import PRFPredictor, predict_prf, plot_prf_prediction
 from FScanpy.data import get_test_data_path, list_test_data
+
+# List available test data
+list_test_data()
+
+# Get test data paths
+blastx_file = get_test_data_path('blastx_example.xlsx')
+mrna_file = get_test_data_path('mrna_example.fasta')
+region_file = get_test_data_path('region_example.csv')
+```
+
+## Complete Workflow Examples
+
+### Workflow 1: Full Sequence Analysis
+
+```python
+from FScanpy import predict_prf, plot_prf_prediction
+import matplotlib.pyplot as plt
+
+# Define sequence
+sequence = "ATGCGTACGTATGCGTACGTATGCGTACGTAAGCCCTTTGAACCCAAAGGG"
+
+# Method 1: Simple prediction
+results = predict_prf(sequence=sequence)
+print(f"Found {len(results)} potential sites")
+
+# Method 2: Prediction with visualization
+results, fig = plot_prf_prediction(
+    sequence=sequence,
+    window_size=1,              # Scan every position
+    short_threshold=0.3,        # Display sites above 0.3
+    long_threshold=0.4,         # Display sites above 0.4
+    ensemble_weight=0.4,        # 4:6 weight ratio
+    title="PRF Analysis Results",
+    save_path="prf_analysis.png"
+)
+plt.show()
+
+# Analyze top predictions
+top_sites = results.nlargest(5, 'Ensemble_Probability')
+print("Top 5 predicted sites:")
+for _, site in top_sites.iterrows():
+    print(f"Position {site['Position']}: {site['Ensemble_Probability']:.3f}")
+```
+
+### Workflow 2: Region-based Prediction
+
+```python
+from FScanpy import predict_prf
+import pandas as pd
+
+# Prepare region data
+region_data = pd.DataFrame({
+    'sample_id': ['sample1', 'sample2', 'sample3'],
+    'Long_Sequence': [
+        'ATGCGT...',  # 399bp sequence 1
+        'GCTATAG...',  # 399bp sequence 2  
+        'TTACGGA...'   # 399bp sequence 3
+    ],
+    'known_label': [1, 0, 1]  # Optional: known labels for validation
+})
+
+# Predict PRF probabilities
+results = predict_prf(
+    data=region_data,
+    ensemble_weight=0.3  # Favor long model (3:7 ratio)
+)
+
+# Evaluate results
+if 'known_label' in results.columns:
+    threshold = 0.5
+    predictions = (results['Ensemble_Probability'] > threshold).astype(int)
+    accuracy = (predictions == results['known_label']).mean()
+    print(f"Accuracy at threshold {threshold}: {accuracy:.3f}")
+```
+
+### Workflow 3: BLASTX-based Analysis Pipeline
+
+```python
+from FScanpy import PRFPredictor
+from FScanpy.data import get_test_data_path
 from FScanpy.utils import fscanr, extract_prf_regions
 import pandas as pd

-# Initialize predictor
+# Step 1: Load BLASTX data
+blastx_data = pd.read_excel(get_test_data_path('blastx_example.xlsx'))
+print(f"Loaded {len(blastx_data)} BLASTX hits")
+
+# Step 2: Detect PRF sites using FScanR
+prf_sites = fscanr(
+    blastx_output=blastx_data,
+    mismatch_cutoff=10,
+    evalue_cutoff=1e-5,
+    frameDist_cutoff=10
+)
+print(f"Detected {len(prf_sites)} potential PRF sites")
+
+# Step 3: Extract sequences around PRF sites
+mrna_file = get_test_data_path('mrna_example.fasta')
+prf_sequences = extract_prf_regions(
+    mrna_file=mrna_file,
+    prf_data=prf_sites
+)
+print(f"Extracted {len(prf_sequences)} sequences")
+
+# Step 4: Predict PRF probabilities
 predictor = PRFPredictor()
+results = predictor.predict_regions(
+    sequences=prf_sequences['399bp'],
+    ensemble_weight=0.4
+)

-# Method 1: Sequence prediction
-sequence = 'ATGCGTACGTATGCGTACGTATGCGTACGT'
-results = predict_prf(sequence=sequence, ensemble_weight=0.4)
+# Step 5: Combine results with metadata
+final_results = pd.concat([
+    prf_sequences.reset_index(drop=True),
+    results.reset_index(drop=True)
+], axis=1)

-# Method 2: Region prediction
-region_data = pd.read_excel(get_test_data_path('region_example.xlsx'))
-results = predict_prf(data=region_data, ensemble_weight=0.4)
+# Step 6: Analyze results
+high_prob_sites = final_results[
+    final_results['Ensemble_Probability'] > 0.7
+]
+print(f"High probability PRF sites: {len(high_prob_sites)}")

-# Method 3: BLASTX pipeline
-blastx_output = pd.read_excel(get_test_data_path('blastx_example.xlsx'))
-fscanr_result = fscanr(blastx_output, mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10)
-prf_regions = extract_prf_regions(get_test_data_path('mrna_example.fasta'), fscanr_result)
-prf_results = predictor.predict_regions(prf_regions['399bp'])
+# Display top results
+print("\nTop PRF predictions:")
+top_results = final_results.nlargest(3, 'Ensemble_Probability')
+for _, row in top_results.iterrows():
+    print(f"Sequence {row['DNA_seqid']}: {row['Ensemble_Probability']:.3f}")
+```

-# Visualization
-results, fig = plot_prf_prediction(sequence, ensemble_weight=0.4, save_path='prediction.png')
+### Workflow 4: Custom Analysis with Multiple Sequences
+
+```python
+from FScanpy import predict_prf, plot_prf_prediction
+import matplotlib.pyplot as plt
+
+# Multiple sequence analysis
+sequences = [
+    "ATGCGTACGTATGCGTACGTATGCGTACGTAAGCCCTTTGAACCCAAAGGG",
+    "GCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCATGCTATAGCAT",
+    "TTACGGATTACGGATTACGGATTACGGATTACGGATTACGGATTACGGAT"
+]
+
+# Batch prediction
+results = predict_prf(
+    sequence=sequences,
+    window_size=2,
+    ensemble_weight=0.5  # Equal weights
+)
+
+# Analyze per sequence
+for seq_id in results['Sequence_ID'].unique():
+    seq_results = results[results['Sequence_ID'] == seq_id]
+    max_prob = seq_results['Ensemble_Probability'].max()
+    print(f"{seq_id}: Max probability = {max_prob:.3f}")
+
+# Visualize first sequence
+first_seq_results, fig = plot_prf_prediction(
+    sequence=sequences[0],
+    ensemble_weight=0.5,
+    title="First Sequence Analysis"
+)
+plt.show()
+```
+
+## Parameter Optimization Guidelines
+
+### 1. Ensemble Weight Selection
+
+- **Conservative (favor specificity)**: `ensemble_weight = 0.2-0.3` (favor long model)
+- **Balanced**: `ensemble_weight = 0.4-0.6` (recommended; the default is 0.4)
+- **Sensitive (favor sensitivity)**: `ensemble_weight = 0.7-0.8` (favor short model)
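+
+The tutorial does not state the combination rule explicitly. Assuming the ensemble is a weighted average in which `ensemble_weight` (written $w$ here) is the short model's share, consistent with comments such as "3:7 ratio (short:long)", the final score would be:
+
+$$
+P_{\text{ensemble}} = w \cdot P_{\text{short}} + (1 - w) \cdot P_{\text{long}}
+$$
+
+As a worked example with made-up values $P_{\text{short}} = 0.8$ and $P_{\text{long}} = 0.5$, the default $w = 0.4$ gives $0.4 \cdot 0.8 + 0.6 \cdot 0.5 = 0.62$, while $w = 0.2$ gives $0.2 \cdot 0.8 + 0.8 \cdot 0.5 = 0.56$, so lowering the weight pulls the result toward the long model.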
+
+### 2. Threshold Selection
+
+- **Short threshold**: Usually 0.1-0.3; higher values skip more long model calls and run faster
+- **Display thresholds**: 0.3-0.8, controls visualization display
+- **Classification threshold**: 0.5 (standard), adjust based on validation data
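+
+One way to adjust the classification threshold from validation data is to scan candidate cut-offs and keep the one with the best F1 score. The sketch below is not part of FScanpy: `best_threshold`, `val_probs`, and `val_labels` are hypothetical names, and the values stand in for `Ensemble_Probability` scores on sequences with known labels.
+
+```python
+import numpy as np
+
+def best_threshold(probabilities, labels, candidates=None):
+    """Return (threshold, f1) for the candidate cut-off with the highest F1 score."""
+    probabilities = np.asarray(probabilities, dtype=float)
+    labels = np.asarray(labels, dtype=int)
+    if candidates is None:
+        candidates = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
+    best_t, best_f1 = 0.5, -1.0
+    for t in candidates:
+        predicted = probabilities > t
+        tp = np.sum(predicted & (labels == 1))   # true positives
+        fp = np.sum(predicted & (labels == 0))   # false positives
+        fn = np.sum(~predicted & (labels == 1))  # false negatives
+        denom = 2 * tp + fp + fn
+        f1 = 2 * tp / denom if denom else 0.0
+        if f1 > best_f1:  # strict comparison keeps the lowest best cut-off
+            best_t, best_f1 = float(t), float(f1)
+    return best_t, best_f1
+
+# Hypothetical validation set: ensemble probabilities and known labels
+val_probs = [0.92, 0.15, 0.71, 0.40, 0.66, 0.08]
+val_labels = [1, 0, 1, 0, 1, 0]
+threshold, f1 = best_threshold(val_probs, val_labels)
+print(f"Best threshold: {threshold:.2f} (F1 = {f1:.2f})")
+```
+
+F1 is only one possible criterion; if false positives are costlier than missed sites, precision at a fixed recall may be a better target.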
+
+### 3. Window Size Selection
+
+- **Fine-grained analysis**: `window_size = 1` (every position)
+- **Standard analysis**: `window_size = 3` (every 3rd position, default)
+- **Coarse analysis**: `window_size = 6-9` (faster, less detailed)
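+
+To gauge the cost of a scan, one can assume (this is not stated in the tutorial) that the scan starts at the first position and advances by `window_size` nucleotides, written $s$ here, so a sequence of length $L$ yields roughly
+
+$$
+N = \left\lceil \frac{L}{s} \right\rceil
+$$
+
+candidate positions. For a hypothetical 3,000-nt mRNA this gives $N = 3000$ at `window_size = 1`, $N = 1000$ at `window_size = 3`, and $N = \lceil 3000 / 9 \rceil = 334$ at `window_size = 9`.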
+
+## Troubleshooting
+
+### Common Issues and Solutions
+
+1. **Model Loading Errors**
+   ```python
+   # Point the predictor at an explicit model directory
+   from FScanpy import PRFPredictor
+   predictor = PRFPredictor(model_dir='/custom/path')
+   ```
+
+2. **Memory Issues with Large Sequences**
+   ```python
+   # Use larger window size to reduce computational load
+   results = predict_prf(sequence=large_seq, window_size=9)
+   ```
+
+3. **Visualization Issues**
+   ```python
+   # Adjust figure parameters
+   results, fig = plot_prf_prediction(
+       sequence=seq,
+       figsize=(20, 10),  # Larger figure
+       dpi=150            # Lower resolution
+   )
+   ```
+
+4. **Input Format Issues**
+   ```python
+   # Ensure proper DataFrame format
+   data = pd.DataFrame({
+       'Long_Sequence': sequences,  # Use 'Long_Sequence' or '399bp'
+       'sample_id': ids
+   })
+   ```
+
+## Performance Optimization
+
+### 1. Batch Processing
+```python
+# Process multiple sequences efficiently
+sequences = ["seq1", "seq2", "seq3"]  # placeholders for real DNA sequences
+results = predict_prf(sequence=sequences, window_size=3)
+```
+
+### 2. Threshold Optimization
+```python
+# Use appropriate short_threshold to skip unnecessary long model calls
+results = predict_prf(
+    sequence=sequence,
+    short_threshold=0.2  # Higher threshold = faster processing
+)
+```
+
+### 3. Memory Management
+```python
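+# For very large datasets (a DataFrame here), process in chunks and combine
+import pandas as pd
+
+chunk_size = 100
+chunk_results = []
+for i in range(0, len(large_dataset), chunk_size):
+    chunk = large_dataset.iloc[i:i + chunk_size]
+    chunk_results.append(predict_prf(data=chunk))
+results = pd.concat(chunk_results, ignore_index=True)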
 ```

 ## Citation
-If you use FScanpy, please cite our paper: [Paper Link] 
+If you use FScanpy, please cite our paper: [Paper Link]
+
+## Support
+For questions and issues, please visit our GitHub repository or contact the development team.