FScanpy-package/FScanpy_Demo.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# FScanpy \n",
    "\n",
    "This notebook demonstrates how to use FScanpy with real test data for complete PRF site prediction analysis, including:\n",
    "\n",
    "## 🎯 Complete Workflow\n",
    "1. **Load Test Data** - Use built-in real test data\n",
    "2. **FScanR Analysis** - Identify potential PRF sites from BLASTX results\n",
    "3. **Sequence Extraction** - Extract sequences around PRF sites\n",
    "4. **FScanpy Prediction** - Use machine learning models to predict probabilities\n",
    "5. **Results Visualization** - Generate prediction result plots using built-in plotting functions\n",
    "6. **Sequence-level Prediction Demo** - Sliding window analysis of complete sequences\n",
    "\n",
    "## 📊 Data Description\n",
    "- **blastx_example.xlsx**: Real BLASTX alignment results\n",
    "- **mrna_example.fasta**: Real mRNA sequence data\n",
    "- **region_example.csv**: Sample for individual site prediction"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## 📚 FScanpy Function Usage Guide\n",
    "\n",
    "### Core Functions Overview\n",
    "\n",
    "FScanpy provides several main functions for PRF prediction:\n",
    "\n",
    "#### 1. `predict_prf()` - Universal Prediction Function\n",
    "```python\n",
    "# Single sequence prediction\n",
    "results = predict_prf(sequence=\"ATGCGT...\", window_size=3, ensemble_weight=0.4)\n",
    "\n",
    "# Multiple sequences prediction  \n",
    "results = predict_prf(sequence=[\"seq1\", \"seq2\"], window_size=3)\n",
    "\n",
    "# DataFrame region prediction\n",
    "results = predict_prf(data=df_with_399bp_column, ensemble_weight=0.4)\n",
    "```\n",
    "\n",
    "#### 2. `plot_prf_prediction()` - Prediction with Visualization\n",
    "```python\n",
    "# Basic plotting\n",
    "results, fig = plot_prf_prediction(sequence=\"ATGCGT...\")\n",
    "\n",
    "# Custom parameters\n",
    "results, fig = plot_prf_prediction(\n",
    "    sequence=\"ATGCGT...\",\n",
    "    window_size=1,\n",
    "    short_threshold=0.65,\n",
    "    long_threshold=0.8,\n",
    "    ensemble_weight=0.4,\n",
    "    save_path=\"plot.png\"\n",
    ")\n",
    "```\n",
    "\n",
    "#### 3. `PRFPredictor` Class Methods\n",
    "```python\n",
    "predictor = PRFPredictor()\n",
    "\n",
    "# Sliding window prediction\n",
    "results = predictor.predict_sequence(sequence, window_size=3, ensemble_weight=0.4)\n",
    "\n",
    "# Region prediction\n",
    "results = predictor.predict_regions(sequences_399bp, ensemble_weight=0.4)\n",
    "\n",
    "# Single position prediction\n",
    "result = predictor.predict_single_position(fs_period_33bp, full_seq_399bp)\n",
    "\n",
    "# Plot prediction\n",
    "results, fig = predictor.plot_sequence_prediction(sequence)\n",
    "```\n",
    "\n",
    "#### 4. Utility Functions\n",
    "```python\n",
    "from FScanpy.utils import fscanr, extract_prf_regions\n",
    "\n",
    "# Detect PRF sites from BLASTX\n",
    "prf_sites = fscanr(blastx_df, mismatch_cutoff=10, evalue_cutoff=1e-5)\n",
    "\n",
    "# Extract sequences around PRF sites\n",
    "prf_sequences = extract_prf_regions(mrna_file, prf_sites)\n",
    "```\n",
    "\n",
    "### Parameter Guidelines\n",
    "\n",
    "- **ensemble_weight**: 0.4 (default, balanced), 0.2-0.3 (conservative), 0.7-0.8 (sensitive)\n",
    "- **window_size**: 1 (detailed), 3 (standard), 6-9 (fast)\n",
    "- **short_threshold**: 0.1 (default), 0.2-0.3 (stricter filtering)\n",
    "- **Display thresholds**: 0.3-0.8 for visualization filtering\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📦 Environment Setup and Data Loading"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "ename": "ImportError",
     "evalue": "cannot import name 'PRFPredictor' from 'FScanpy' (unknown location)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mImportError\u001b[0m                               Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[3], line 6\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mmatplotlib\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpyplot\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mplt\u001b[39;00m\n\u001b[1;32m      5\u001b[0m \u001b[38;5;66;03m# Import FScanpy related modules\u001b[39;00m\n\u001b[0;32m----> 6\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mFScanpy\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m PRFPredictor, predict_prf, plot_prf_prediction\n\u001b[1;32m      7\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mFScanpy\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mdata\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m get_test_data_path, list_test_data\n\u001b[1;32m      8\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mFScanpy\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mutils\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m fscanr, extract_prf_regions\n",
      "\u001b[0;31mImportError\u001b[0m: cannot import name 'PRFPredictor' from 'FScanpy' (unknown location)"
     ]
    }
   ],
   "source": [
    "# Import necessary libraries\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Import FScanpy related modules\n",
    "from FScanpy import PRFPredictor, predict_prf, plot_prf_prediction\n",
    "from FScanpy.data import get_test_data_path, list_test_data\n",
    "from FScanpy.utils import fscanr, extract_prf_regions\n",
    "\n",
    "print(\"✅ Environment setup complete!\")\n",
    "print(\"📋 Available test data:\")\n",
    "list_test_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Load and Explore Test Data\n",
    "\n",
    "First, load the real test data provided by FScanpy to understand the data structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📁 数据文件路径:\n",
      "  BLASTX数据: /mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/data/test_data/blastx_example.xlsx\n",
      "  mRNA序列: /mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/data/test_data/mrna_example.fasta\n",
      "  验证区域: /mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/data/test_data/region_example.csv\n",
      "\n",
      "🧬 BLASTX数据概览:\n",
      "  数据形状: (1000, 14)\n",
      "  列名: ['DNA_seqid', 'Pep_seqid', 'pident', 'length', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore', 'qframe', 'sframe']\n",
      "  唯一序列数: 704\n",
      "\n",
      "📊 BLASTX数据示例:\n",
      "      DNA_seqid          Pep_seqid  pident  length        evalue  qframe\n",
      "0  MSTRG.9998.1  CAMPEP_0196994412   68.27     104  1.000000e-33       2\n",
      "1  MSTRG.9996.1  CAMPEP_0197017426   49.16     297  3.000000e-79       2\n",
      "2  MSTRG.9994.1  CAMPEP_0197009206   98.31     354  0.000000e+00       2\n",
      "3  MSTRG.9993.1  CAMPEP_0168331218   51.67      60  2.000000e-37       2\n",
      "4  MSTRG.9993.1  CAMPEP_0168331218   45.45      88  2.000000e-37       3\n"
     ]
    }
   ],
   "source": [
    "# Get test data paths\n",
    "blastx_file = get_test_data_path('blastx_example.xlsx')\n",
    "mrna_file = get_test_data_path('mrna_example.fasta')\n",
    "region_file = get_test_data_path('region_example.csv')\n",
    "\n",
    "print(f\"📁 Data file paths:\")\n",
    "print(f\"  BLASTX data: {blastx_file}\")\n",
    "print(f\"  mRNA sequences: {mrna_file}\")\n",
    "print(f\"  Validation regions: {region_file}\")\n",
    "\n",
    "# Load BLASTX data\n",
    "blastx_data = pd.read_excel(blastx_file)\n",
    "print(f\"\\n🧬 BLASTX data overview:\")\n",
    "print(f\"  Data shape: {blastx_data.shape}\")\n",
    "print(f\"  Column names: {list(blastx_data.columns)}\")\n",
    "print(f\"  Unique sequences: {blastx_data['DNA_seqid'].nunique()}\")\n",
    "\n",
    "# Display first few rows\n",
    "print(\"\\n📊 BLASTX data examples:\")\n",
    "display_cols = ['DNA_seqid', 'Pep_seqid', 'pident', 'length', 'evalue', 'qframe']\n",
    "print(blastx_data[display_cols].head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🎯 验证区域数据概览:\n",
      "  数据形状: (3, 8)\n",
      "  列名: ['FS_period', '399bp', 'fs_position', 'DNA_seqid', 'label', 'source', 'FS_type', 'dataset']\n",
      "  数据来源: {'EUPLOTES': 3}\n",
      "\n",
      "📋 验证区域数据示例:\n",
      "   fs_position      DNA_seqid  label    source   FS_type\n",
      "0         16.0  MSTRG.18491.1      0  EUPLOTES  negative\n",
      "1         16.0   MSTRG.4662.1      0  EUPLOTES  negative\n",
      "2         16.0  MSTRG.14742.1      0  EUPLOTES  negative\n",
      "\n",
      "📈 标签分布:\n",
      "label\n",
      "0    3\n",
      "Name: count, dtype: int64\n",
      "\n",
      "🔬 FS类型分布:\n",
      "FS_type\n",
      "negative    3\n",
      "Name: count, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "# Load validation region data\n",
    "region_data = pd.read_csv(region_file)\n",
    "print(f\"🎯 Validation region data overview:\")\n",
    "print(f\"  Data shape: {region_data.shape}\")\n",
    "print(f\"  Column names: {list(region_data.columns)}\")\n",
    "print(f\"  Data sources: {region_data['source'].value_counts().to_dict()}\")\n",
    "\n",
    "print(\"\\n📋 Validation region data examples:\")\n",
    "display_cols = ['fs_position', 'DNA_seqid', 'label', 'source', 'FS_type']\n",
    "print(region_data[display_cols].head())\n",
    "\n",
    "# Statistical analysis\n",
    "print(f\"\\n📈 Label distribution:\")\n",
    "print(region_data['label'].value_counts())\n",
    "print(f\"\\n🔬 FS type distribution:\")\n",
    "print(region_data['FS_type'].value_counts())"
   ]
  },
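  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before passing a region table to the predictor, it can help to confirm that every row has the expected sequence lengths. The sketch below assumes, based on the column names printed above and the `fs_period_33bp` / `full_seq_399bp` argument names in the usage guide, that `FS_period` holds 33 nt and `399bp` holds 399 nt. `check_region_lengths` and `EXPECTED_LENGTHS` are hypothetical names, not part of FScanpy.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Assumed lengths (hypothetical defaults, not taken from the FScanpy docs)\n",
    "EXPECTED_LENGTHS = {'FS_period': 33, '399bp': 399}\n",
    "\n",
    "def check_region_lengths(df, expected=EXPECTED_LENGTHS):\n",
    "    # Count, per column, the rows whose sequence length deviates from the expectation\n",
    "    report = {}\n",
    "    for column, length in expected.items():\n",
    "        lengths = df[column].astype(str).str.len()\n",
    "        report[column] = int((lengths != length).sum())\n",
    "    return report\n",
    "\n",
    "# Toy table: the second row carries a truncated 399bp sequence\n",
    "toy = pd.DataFrame({\n",
    "    'FS_period': ['A' * 33, 'C' * 33],\n",
    "    '399bp': ['A' * 399, 'C' * 120],\n",
    "})\n",
    "print(check_region_lengths(toy))\n",
    "```\n",
    "\n",
    "In this notebook, `check_region_lengths(region_data)` would be expected to report zero deviations if the assumed lengths hold."
   ]
  },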
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. FScanR Analysis - Identify Potential PRF Sites from BLASTX\n",
    "\n",
    "Use the FScanR algorithm to analyze BLASTX results and identify potential programmed ribosomal frameshift sites."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🔍 运行FScanR分析...\n",
      "参数设置: mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10\n",
      "\n",
      "✅ FScanR分析完成！\n",
      "检测到的潜在PRF位点数量: 24\n",
      "\n",
      "📊 FScanR结果概览:\n",
      "  列名: ['DNA_seqid', 'FS_start', 'FS_end', 'Pep_seqid', 'Pep_FS_start', 'Pep_FS_end', 'FS_type', 'Strand']\n",
      "  涉及的序列数: 16\n",
      "  链方向分布: {'+': 16, '-': 8}\n",
      "  FS类型分布: {1: 16, -1: 7, -2: 1}\n",
      "\n",
      "🎯 FScanR结果示例:\n",
      "      DNA_seqid  FS_start  FS_end          Pep_seqid  Pep_FS_start  \\\n",
      "0  MSTRG.9380.1      3797    3802  CAMPEP_0197017206          1137   \n",
      "1  MSTRG.9431.1      4136    4192  CAMPEP_0197016790           657   \n",
      "3  MSTRG.9432.1       848     904  CAMPEP_0197016790           753   \n",
      "4  MSTRG.9582.1       302     304  CAMPEP_0197003180           214   \n",
      "5   MSTRG.961.1      1536    1533  CAMPEP_0197017908           590   \n",
      "\n",
      "   Pep_FS_end  FS_type Strand  \n",
      "0        1138        1      +  \n",
      "1         675        1      +  \n",
      "3           2        1      -  \n",
      "4         214        1      +  \n",
      "5          19       -1      -  \n"
     ]
    }
   ],
   "source": [
    "# Run FScanR analysis\n",
    "print(\"🔍 Running FScanR analysis...\")\n",
    "print(\"Parameter settings: mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10\")\n",
    "\n",
    "fscanr_results = fscanr(\n",
    "    blastx_data,\n",
    "    mismatch_cutoff=10,\n",
    "    evalue_cutoff=1e-5,\n",
    "    frameDist_cutoff=10\n",
    ")\n",
    "\n",
    "print(f\"\\n✅ FScanR analysis complete!\")\n",
    "print(f\"Number of potential PRF sites detected: {len(fscanr_results)}\")\n",
    "\n",
    "if len(fscanr_results) > 0:\n",
    "    print(f\"\\n📊 FScanR results overview:\")\n",
    "    print(f\"  Column names: {list(fscanr_results.columns)}\")\n",
    "    print(f\"  Number of sequences involved: {fscanr_results['DNA_seqid'].nunique()}\")\n",
    "    print(f\"  Strand orientation distribution: {fscanr_results['Strand'].value_counts().to_dict()}\")\n",
    "    print(f\"  FS type distribution: {fscanr_results['FS_type'].value_counts().to_dict()}\")\n",
    "    \n",
    "    print(\"\\n🎯 FScanR results examples:\")\n",
    "    print(fscanr_results.head())\n",
    "else:\n",
    "    print(\"⚠️ No PRF sites detected, may need to adjust parameters\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Sequence Extraction - Extract Sequences Around PRF Sites\n",
    "\n",
    "Extract sequence fragments around PRF sites identified by FScanR from mRNA sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📝 从mRNA序列中提取PRF位点周围序列...\n",
      "\n",
      "✅ 序列提取完成！\n",
      "成功提取的序列数量: 24\n",
      "\n",
      "📏 序列长度验证:\n",
      "  399bp序列长度分布: {399: 24}\n",
      "  平均长度: 399.0\n",
      "\n",
      "🧬 提取序列示例:\n",
      "序列 1: MSTRG.9380.1\n",
      "  FS位置: 3797-3802\n",
      "  链方向: +\n",
      "  FS类型: 1\n",
      "  序列片段: AAGGAGTTTGAAGAAGAACAGGAAAAACAAGAGAAAGAGAGAAAGGAGAA...NNNNNNNNNNNNNNNNNNNN\n",
      "\n",
      "序列 2: MSTRG.9431.1\n",
      "  FS位置: 4136-4192\n",
      "  链方向: +\n",
      "  FS类型: 1\n",
      "  序列片段: CAAGTATCTGAGTGGGAGGGAGACACAGGTGTTGATCAAACCCCATTCCC...ATAATGACGGAGGCTTCAGA\n",
      "\n",
      "序列 3: MSTRG.9432.1\n",
      "  FS位置: 848-904\n",
      "  链方向: -\n",
      "  FS类型: 1\n",
      "  序列片段: AGAAAGGATGGTACTGAAAATCAACGAAGTACTTTCACATTTTAGAAAGA...GCTGAGAACGATATTGACAA\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Extract sequences around PRF sites\n",
    "if len(fscanr_results) > 0:\n",
    "    print(\"📝 Extracting sequences around PRF sites from mRNA sequences...\")\n",
    "    \n",
    "    prf_sequences = extract_prf_regions(\n",
    "        mrna_file=mrna_file,\n",
    "        prf_data=fscanr_results\n",
    "    )\n",
    "    \n",
    "    print(f\"\\n✅ Sequence extraction complete!\")\n",
    "    print(f\"Number of successfully extracted sequences: {len(prf_sequences)}\")\n",
    "    \n",
    "    if len(prf_sequences) > 0:\n",
    "        print(f\"\\n📏 Sequence length validation:\")\n",
    "        seq_lengths = prf_sequences['399bp'].str.len()\n",
    "        print(f\"  399bp sequence length distribution: {seq_lengths.value_counts().to_dict()}\")\n",
    "        print(f\"  Average length: {seq_lengths.mean():.1f}\")\n",
    "        \n",
    "        print(\"\\n🧬 Extracted sequence examples:\")\n",
    "        for i, row in prf_sequences.head(3).iterrows():\n",
    "            print(f\"Sequence {i+1}: {row['DNA_seqid']}\")\n",
    "            print(f\"  FS position: {row['FS_start']}-{row['FS_end']}\")\n",
    "            print(f\"  Strand orientation: {row['Strand']}\")\n",
    "            print(f\"  FS type: {row['FS_type']}\")\n",
    "            print(f\"  Sequence fragment: {row['399bp'][:50]}...{row['399bp'][-20:]}\")\n",
    "            print()\n",
    "    else:\n",
    "        print(\"❌ Sequence extraction failed\")\n",
    "else:\n",
    "    print(\"⚠️ Skipping sequence extraction - no FScanR results\")\n",
    "    prf_sequences = pd.DataFrame()"
   ]
  },
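  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first extracted fragment above ends in a run of `N`, which suggests (an assumption, not confirmed by the FScanpy documentation) that windows extending past the end of a transcript are padded. A small hypothetical helper such as `n_fraction` can flag heavily padded fragments before prediction:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "def n_fraction(sequences):\n",
    "    # Fraction of N characters in each sequence (case-insensitive)\n",
    "    seqs = pd.Series(sequences).astype(str).str.upper()\n",
    "    return seqs.str.count('N') / seqs.str.len()\n",
    "\n",
    "# Toy input: one clean fragment and one that is half padding\n",
    "toy = ['ACGT' * 5, 'ACGTACGTAC' + 'N' * 10]\n",
    "print(n_fraction(toy).tolist())\n",
    "```\n",
    "\n",
    "In this notebook, `n_fraction(prf_sequences['399bp'])` would return one value per extracted site, which could then be compared against a tolerance of your choosing."
   ]
  },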
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. FScanpy Prediction - Machine Learning Model Analysis\n",
    "\n",
    "Use FScanpy's machine learning models to predict PRF probabilities for the extracted sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🤖 FScanpy预测器初始化完成\n",
      "\n",
      "🎯 对 24 个FScanR识别的序列进行预测...\n",
      "\n",
      "📊 FScanR+FScanpy预测结果:\n",
      "      DNA_seqid  FS_start  FS_type  Short_Probability  Long_Probability  \\\n",
      "0  MSTRG.9380.1      3797        1           0.239192          0.087024   \n",
      "1  MSTRG.9431.1      4136        1           0.326807          0.356356   \n",
      "2  MSTRG.9432.1       848        1           0.310908          0.159746   \n",
      "3  MSTRG.9582.1       302        1           0.272451          0.223354   \n",
      "4   MSTRG.961.1      1536       -1           0.263269          0.046773   \n",
      "\n",
      "   Ensemble_Probability  \n",
      "0              0.147891  \n",
      "1              0.344536  \n",
      "2              0.220211  \n",
      "3              0.242993  \n",
      "4              0.133372  \n"
     ]
    }
   ],
   "source": [
    "# Initialize predictor\n",
    "predictor = PRFPredictor()\n",
    "print(\"🤖 FScanpy predictor initialization complete\")\n",
    "\n",
    "# Predict FScanR identified sequences\n",
    "if len(prf_sequences) > 0:\n",
    "    print(f\"\\n🎯 Predicting {len(prf_sequences)} sequences identified by FScanR...\")\n",
    "    \n",
    "    fscanr_predictions = predictor.predict_regions(\n",
    "        sequences=prf_sequences['399bp'],\n",
    "        ensemble_weight=0.4  # Balanced configuration\n",
    "    )\n",
    "    \n",
    "    # Merge results\n",
    "    fscanr_predictions = pd.concat([\n",
    "        prf_sequences.reset_index(drop=True),\n",
    "        fscanr_predictions.reset_index(drop=True)\n",
    "    ], axis=1)\n",
    "    \n",
    "    print(\"\\n📊 FScanR+FScanpy prediction results:\")\n",
    "    result_cols = ['DNA_seqid', 'FS_start', 'FS_type', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']\n",
    "    print(fscanr_predictions[result_cols].head())"
   ]
  },
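  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The probabilities printed above are consistent with the ensemble score being a weighted average of the two model outputs, `ensemble_weight * Short + (1 - ensemble_weight) * Long`. This is inferred from the displayed numbers rather than taken from the FScanpy source, so treat both the formula and the helper name `weighted_ensemble` as assumptions:\n",
    "\n",
    "```python\n",
    "def weighted_ensemble(short_prob, long_prob, ensemble_weight=0.4):\n",
    "    # Assumed formula: weighted average of short- and long-model probabilities\n",
    "    return ensemble_weight * short_prob + (1.0 - ensemble_weight) * long_prob\n",
    "\n",
    "# First two rows of the results table above: (Short, Long, reported Ensemble)\n",
    "rows = [\n",
    "    (0.239192, 0.087024, 0.147891),\n",
    "    (0.326807, 0.356356, 0.344536),\n",
    "]\n",
    "for short_p, long_p, reported in rows:\n",
    "    estimate = weighted_ensemble(short_p, long_p)\n",
    "    print(f'{estimate:.6f} vs reported {reported:.6f}')\n",
    "```\n",
    "\n",
    "Under this reading, a larger `ensemble_weight` shifts the score toward the short-sequence model, which fits the guide's description of higher weights as more sensitive."
   ]
  },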
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "🧪 对 3 个验证区域进行预测...\n",
      "\n",
      "📊 验证区域预测结果:\n",
      "       DNA_seqid  label    source  Short_Probability  Long_Probability  \\\n",
      "0  MSTRG.18491.1      0  EUPLOTES           0.368610          0.144442   \n",
      "1   MSTRG.4662.1      0  EUPLOTES           0.229811          0.053352   \n",
      "2  MSTRG.14742.1      0  EUPLOTES           0.454152          0.345118   \n",
      "\n",
      "   Ensemble_Probability  \n",
      "0              0.234109  \n",
      "1              0.123936  \n",
      "2              0.388732  \n"
     ]
    }
   ],
   "source": [
    "# Predict validation region data\n",
    "print(f\"\\n🧪 Predicting {len(region_data)} validation regions...\")\n",
    "\n",
    "validation_predictions = predict_prf(\n",
    "    data=region_data.rename(columns={'399bp': 'Long_Sequence'}),\n",
    "    ensemble_weight=0.4\n",
    ")\n",
    "\n",
    "print(\"\\n📊 Validation region prediction results:\")\n",
    "result_cols = ['DNA_seqid', 'label', 'source', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']\n",
    "print(validation_predictions[result_cols].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Sequence-level Prediction and Visualization\n",
    "\n",
    "Select a specific mRNA sequence and use the built-in plot_prf_prediction function for complete sliding window prediction and visualization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🧬 选择演示序列: MSTRG.9127.1\n",
      "序列长度: 256 bp\n",
      "序列前100bp: TGGCCTTCTTACTTGGAAGTCCCCAAGGATCATCTTGGCCATCCTTGCTTTCTTCATGGCTAGATTCTACCTCCTCCCATAATTGTGTGAAACAAGTAAC...\n",
      "\n",
      "🎯 使用plot_prf_prediction进行序列预测和可视化...\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAABRcAAALmCAYAAADYLKN3AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8ekN5oAAAACXBIWXMAAA9hAAAPYQGoP6dpAAB3P0lEQVR4nOzdd5hV1b0/4M8MvYmAHUUS7IgRG/ZCxBqCKd6AsSCxBcHY9UaNJvYoMYJ6FQtKNGo0Yosltti9RpMIErFiQ6IIDIq0YWZ+f/DjXMcBxA0yDL7v8+TJnHXWPuu798w65/hh7b3LampqagIAAAAA8BWV13cBAAAAAEDDJFwEAAAAAAoRLgIAAAAAhQgXAQAAAIBChIsAAAAAQCHCRQAAAACgEOEiAAAAAFCIcBEAAAAAKES4CAAAAAAU0ri+CwAAGpa//e1vufnmm2u1XXXVVRk9enQuv/zyWu3nnXdeOnTosMDXOeigg/L8889n0KBBGTx48AL79OnTJ+PGjcv555+fH/7wh6X2v/zlL/nTn/6UN954IxUVFWndunU23XTTHHnkkdlmm22SJD179syECRMWuS/zx55fy+eVlZWlbdu22WabbXLMMcdk/fXXr7P9gw8+mD/96U8ZO3ZsPv3007Rq1Spdu3bNj3/84+y7776LHDtJ/vnPf+aSSy7JSy+9lKZNm2bnnXfOL3/5yzrHbMqUKTnllFPyxBNP5Oqrr87OO+9c6/nKysrccMMNGTVqVCZMmJBVV101u+++e44++ui0bt36S4/HD37wg1xwwQWLrPX666/PxRdfnF69euWSSy5ZZN/bb789Dz30UOlxhw4dct555yVJTj311IwaNapW/yZNmmTttddO7969c/jhh6dp06ZJkmHDhuWyyy6r8/otW7bMJptskp/97Gfp2bNnqX1h/ef78Y9/nKOPPjq//vWva7Uff/zx6dy5c4455pha7f379892222XE044IdOnTy+1f+9730vv3r1z3nnn5Z133im1b7fddunfv3+GDx+eF198sdS+wQYb5IQTTljocVlac+qLvurrzp49e4HHZsMNN1ys8QCAbybhIgDwlbz//vs544wzsvbaaydJLrzwwiTJ5MmTM2DAgPTo0SNJcuONN2bmzJmLfK2WLVtm1KhRGTRoUMrKymo9N27cuLz77rt1trniiity2WWXZdCgQTnjjDPSsmXLvPvuu7nqqqvys5/9LCNHjkz37t1z++23p6qqqrTd97///fTo0SOnnXZarfHn69q1a6666qrS46qqqrz11lu55JJLcsABB+Suu+7KWmutlSSpqanJqaeemvvvvz8DBgzI8ccfn5VXXjkfffRR7rnnnpxwwgn5+9//nrPOOmuh+z5u3LgcfPDB2XHHHXPLLbeksrIyp5xySo466qjceuutKS+fd4LJ888/nxNOOCFt2rRZ6GtdeOGFue2223LmmWdmq622yssvv5wzzjgjH330UYYMGZIkdY5HkkydOjU/+clPst122y30tSsqKnLqqadm7Nixadas2UL7fd6bb75Z61jO/xuZr3379rn77rtLjz/55JM8++yzufjii/Pmm2/md7/7Xa3+jz76aClwrKmpyX/+85/84Q9/yMCBA3PZZZdl9913X2j/z2vRokU++OCD7LnnnqWw+rHHHktFRUXmzp2bTTfdtBR0v/baa3n55ZeTJKuttlrpOH722We57rrrkiSNGjVa4H5OnTp1ge0LOy5Lc0593ld93RkzZizw2AAALIpwEQCoN1tvvXWeeOKJPPfcc3UCrlGjRmXrrbfO448/Xqv9xhtvzL777puBAweW2tZaa61sscUWOfDAA/Ovf/0r3bt3T/v27WttV15enubNm2fVVVddYC2NGzeu89waa6yRLl26ZOedd86f/vSnHHvssUmSP/7xj7nzzjszfPjw7LLLLqX+HTt2TPfu3bPOOuvk2muvzaGHHpp11113
geNdd911admyZYYMGVIKOS+55JL06dMnTzzxRHbdddckycUXX5yDDjoom222WQ455JA6rzN9+vTccsstOeqoo0qhUKdOnfLqq6/mqquuyplnnpmVVlqpzvGYP94GG2yQ73//+wusMUnuvffezJgxI3feeWf233//hfb7KsrLy2sd61VXXTVdunTJlClTcvnll+fkk0/OGmusUXp+lVVWqRVsrrbaarnwwgvz8ssv57rrrqsTLn6xPwAAXx/XXAQA6k379u3TvXv33HHHHbXa586dm3vuuafWKa/zzZo1K3PmzKnT3rRp0/zpT3/KoYceulRrXH311dO+ffv85z//KbWNGDEiO++8c61g8fP69++fJ554YqHBYpKMHTs2m266aa3VkxtttFE6duyYp556qtR20UUX5YgjjqizsnO+Vq1a5YknnsiAAQNqta+22mqpqalZ6Eq30aNHZ9SoUTnttNMW+tpJsssuu2TEiBGLfSrukthoo42SJB988MGX9i0vL88GG2xQ6/fyTXXqqafm1FNPre8yAIBvKOEiAFCvvve97+Whhx6qdU27J598Mp988kn23HPPOv133nnnPPDAAzn++OPz97//fYFB49I0ZcqUTJ06tXRK9MSJE/Pee+8tNFhM5l2vcf5pzQvTuHHjNGrUqE57+/bta13Hb1EB5fyx2rdvXyukTOadGrzGGmtk9dVXX+B2Q4cOzc4775zNNttska+/zjrrLLDOr8Pbb7+dJFlzzTUXq/9bb71V+r0AAFA/hIsAQL3aZ599Mnfu3PzlL38ptY0aNSo77rhj2rVrV6f/2Wefnb333jv33XdfDjzwwGy99dbp379/rr/++qV+fbj3338/p5xySlq0aJEf//jHSZIPP/wwyeIHYAvzrW99K//+978zd+7cUtvs2bPz9ttv57PPPlui177xxhvz1FNP5cQTT1zg86+88kqefPLJHHHEEUs0ztJSWVmZp556Ktddd1322GOPLz2206ZNy+9+97u89tprOeigg5ZRlQAALIhrLgIA9apdu3bZaaedcscdd+QnP/lJKioq8uijj9a5Cch8bdq0ye9///t88MEHefzxx/P3v/89zz//fJ599tlcccUVueqqq9K9e/evXMeYMWNqbVdVVZXZs2dnq622yvXXX19aITd/RWJ1dXWt7UePHl3nmoi9e/fOb37zmwWOd+CBB+b+++/Peeedl+OOOy6VlZU599xzU15ensaNi39Fu/7663PBBRfkqKOOSu/evRfY54YbbkjXrl2z5ZZbFh5nSUyePLnWsZ49e3YaN26cPn36LPD03m233bbW4xkzZqRz58658MILF7i69Yv95zv11FML/W0sjz6/H/NX7z744IOlts+H9QAAXyfhIgBQ777//e/n2GOPzZtvvpnnnnsuTZo0WeD1Fj9vrbXWSr9+/dKvX79UV1fnr3/9a0477bScddZZueuuu75yDRtuuGEuvfTS0uNHHnkkF110UU488cR85zvfqTVukrz33nu1tt9oo41y5513lh6feOKJizxle6uttsqFF16Ys88+OzfffHOaN2+eQw45JNtuu+2XnlK9IDU1Nbnoooty3XXX5YQTTsjhhx++wH6VlZV55JFH0r9//688xtKy8sor59Zbby09nn8znQXd4TlJbrvttjRp0iTJvNPSf/azn+VHP/pR9ttvvy/t/3nt27fPxIkTl3wHlgOf/1u7+OKLk6TWStXVVlttWZcEAHxDCRcBgHrXs2fPtGnTJvfdd1+efvrp9OrVKy1atFhg308++SQrrbRSrbby8vLstdde+cc//pEbb7wxNTU1i7xJyYI0bdq01vUN+/fvn/vvvz+nn356Ro0aVQq+VllllWywwQb561//WusmKl/cvnnz5l865n777Zd99tknkydPTocOHdK0adPsvffe6dOnz1eqPZkXMI0cOTK//e1vF3n35+effz6ffPJJ6W7U9aFRo0Zfei3Jz1tnnXVKd39e
d911c/DBB+eyyy7LHnvskc6dOy+y/xetKOHi549fq1at6rQBACwrrrkIANS7Zs2aZc8998x9992Xf/3rXws9nfevf/1rtt5
      "text/plain": [
       "<Figure size 1600x800 with 2 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
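    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "📊 Sequence prediction result statistics:\n",
      "  Total predicted sites: 85\n",
      "  High probability sites (>0.8): 0\n",
      "  Medium probability sites (0.4-0.8): 6\n",
      "  Highest prediction probability: 0.475\n"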
     ]
    }
   ],
   "source": [
    "# Select a sequence for demonstration\n",
    "from Bio import SeqIO\n",
    "\n",
    "# Read the first mRNA sequence for demonstration\n",
    "mrna_sequences = list(SeqIO.parse(mrna_file, \"fasta\"))\n",
    "demo_seq = mrna_sequences[0]  # Select the first sequence\n",
    "\n",
    "print(f\"🧬 Selected demonstration sequence: {demo_seq.id}\")\n",
    "print(f\"Sequence length: {len(demo_seq.seq)} bp\")\n",
    "print(f\"First 100bp of sequence: {str(demo_seq.seq)[:100]}...\")\n",
    "\n",
    "# Use built-in plot_prf_prediction function for prediction and visualization\n",
    "print(f\"\\n🎯 Using plot_prf_prediction for sequence prediction and visualization...\")\n",
    "\n",
    "sequence_results, fig = plot_prf_prediction(\n",
    "    sequence=str(demo_seq.seq),\n",
    "    window_size=3,\n",
    "    short_threshold=0.2,\n",
    "    long_threshold=0.2,\n",
    "    ensemble_weight=0.6,\n",
    "    title=f\"PRF Prediction Results for Sequence {demo_seq.id} (Bar Chart + Heatmap)\",\n",
    "    figsize=(16, 8),\n",
    "    dpi=150\n",
    ")\n",
    "\n",
    "plt.show()\n",
    "\n",
    "print(f\"\\n📊 Sequence prediction result statistics:\")\n",
    "print(f\"  Total predicted sites: {len(sequence_results)}\")\n",
    "print(f\"  High probability sites (>0.8): {(sequence_results['Ensemble_Probability'] > 0.8).sum()}\")\n",
    "print(f\"  Medium probability sites (0.4-0.8): {((sequence_results['Ensemble_Probability'] >= 0.4) & (sequence_results['Ensemble_Probability'] <= 0.8)).sum()}\")\n",
    "print(f\"  Highest prediction probability: {sequence_results['Ensemble_Probability'].max():.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
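    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "🔝 Top 5 predicted sites:\n",
      "  1. Position 96: \n",
      "     - Short probability: 0.288\n",
      "     - Long probability: 0.755\n",
      "     - Ensemble probability: 0.475\n",
      "     - Codon: TAA\n",
      "  2. Position 12: \n",
      "     - Short probability: 0.606\n",
      "     - Long probability: 0.177\n",
      "     - Ensemble probability: 0.434\n",
      "     - Codon: TTG\n",
      "  3. Position 15: \n",
      "     - Short probability: 0.493\n",
      "     - Long probability: 0.329\n",
      "     - Ensemble probability: 0.428\n",
      "     - Codon: GAA\n",
      "  4. Position 18: \n",
      "     - Short probability: 0.369\n",
      "     - Long probability: 0.510\n",
      "     - Ensemble probability: 0.426\n",
      "     - Codon: GTC\n",
      "  5. Position 105: \n",
      "     - Short probability: 0.248\n",
      "     - Long probability: 0.671\n",
      "     - Ensemble probability: 0.418\n",
      "     - Codon: ACT\n",
      "\n",
      "📊 Visualization analysis complete!\n",
      "The chart contains heatmaps and bar charts showing the PRF prediction probability distribution across the entire sequence.\n"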
     ]
    }
   ],
   "source": [
    "# Print top predicted site probabilities\n",
    "if sequence_results['Ensemble_Probability'].max() > 0.3:\n",
    "    top_predictions = sequence_results.nlargest(5, 'Ensemble_Probability')\n",
    "    print(f\"\\n🔝 Top 5 predicted sites:\")\n",
    "    for i, (_, row) in enumerate(top_predictions.iterrows(), 1):\n",
    "        print(f\"  {i}. Position {row['Position']}: \")\n",
    "        print(f\"     - Short probability: {row['Short_Probability']:.3f}\")\n",
    "        print(f\"     - Long probability: {row['Long_Probability']:.3f}\")\n",
    "        print(f\"     - Ensemble probability: {row['Ensemble_Probability']:.3f}\")\n",
    "        print(f\"     - Codon: {row['Codon']}\")\n",
    "else:\n",
    "    print(\"\\n💡 No high-probability PRF sites detected in this sequence\")\n",
    "\n",
    "print(\"\\n📊 Visualization analysis complete!\")\n",
    "print(\"The chart contains heatmaps and bar charts showing the PRF prediction probability distribution across the entire sequence.\")"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## 📖 Complete Function Reference\n",
    "\n",
    "### All Available Functions and Methods\n",
    "\n",
    "#### Core Prediction Functions\n",
    "\n",
    "**1. `predict_prf(sequence=None, data=None, window_size=3, short_threshold=0.1, ensemble_weight=0.4, model_dir=None)`**\n",
    "- **Purpose**: Universal prediction function for both sliding window and region-based analysis\n",
    "- **Input modes**: \n",
    "  - Single/multiple sequences → sliding window prediction\n",
    "  - DataFrame with 'Long_Sequence'/'399bp' column → region prediction\n",
    "- **Key parameters**:\n",
    "  - `ensemble_weight`: Short model weight (0.0-1.0, default: 0.4)\n",
    "  - `window_size`: Scanning step size (default: 3)\n",
    "  - `short_threshold`: Filtering threshold (default: 0.1)\n",
    "\n",
    "**2. `plot_prf_prediction(sequence, window_size=3, short_threshold=0.65, long_threshold=0.8, ensemble_weight=0.4, title=None, save_path=None, figsize=(12,8), dpi=300)`**\n",
    "- **Purpose**: Prediction with built-in visualization (3-subplot layout: FS site heatmap, prediction heatmap, bar chart)\n",
    "- **Returns**: (prediction_results_df, matplotlib_figure)\n",
    "- **Visualization features**: \n",
    "  - Black bars with alpha=0.6\n",
    "  - 'Reds' colormap for heatmaps\n",
    "  - Height ratios [0.1, 0.1, 1] for subplots\n",
    "\n",
    "#### PRFPredictor Class Methods\n",
    "\n",
    "**3. Class initialization: `PRFPredictor(model_dir=None)`**\n",
    "- Loads HistGradientBoosting (short, 33bp) and BiLSTM-CNN (long, 399bp) models\n",
    "- Uses ensemble weighting for final predictions\n",
    "\n",
    "**4. `predictor.predict_sequence(sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)`**\n",
    "- **Purpose**: Sliding window analysis of complete sequences\n",
    "- **Process**: Scans sequence with specified window size, applies both models\n",
    "\n",
    "**5. `predictor.predict_regions(sequences, short_threshold=0.1, ensemble_weight=0.4)`**\n",
    "- **Purpose**: Batch prediction for pre-defined 399bp regions\n",
    "- **Input**: List/Series of 399bp sequences\n",
    "- **Efficient**: Direct region analysis without sliding window\n",
    "\n",
    "**6. `predictor.predict_single_position(fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)`**\n",
    "- **Purpose**: Single position analysis\n",
    "- **Inputs**: 33bp sequence (fs_period) + 399bp sequence (full_seq)\n",
    "- **Returns**: Dictionary with individual and ensemble probabilities\n",
    "\n",
    "**7. `predictor.plot_sequence_prediction(...)`** \n",
    "- **Purpose**: Class method version of plot_prf_prediction()\n",
    "- **Same parameters** as standalone function\n",
    "\n",
    "#### Utility Functions\n",
    "\n",
    "**8. `fscanr(blastx_output, mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10)`**\n",
    "- **Purpose**: Detect PRF sites from BLASTX alignment results\n",
    "- **Input**: DataFrame with BLASTX columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore, qframe, sframe)\n",
    "- **Output**: PRF sites with FS_start, FS_end, FS_type, Strand information\n",
    "\n",
    "**9. `extract_prf_regions(mrna_file, prf_data)`**\n",
    "- **Purpose**: Extract 399bp sequences around detected PRF sites\n",
    "- **Inputs**: FASTA file path + FScanR results DataFrame\n",
    "- **Handles**: Strand orientation (reverse complement for '-' strand)\n",
    "\n",
    "#### Data Access Functions\n",
    "\n",
    "**10. `get_test_data_path(filename)`**\n",
    "- **Purpose**: Get path to built-in test data files\n",
    "- **Available files**: 'blastx_example.xlsx', 'mrna_example.fasta', 'region_example.csv'\n",
    "\n",
    "**11. `list_test_data()`**\n",
    "- **Purpose**: Display all available test data files\n",
    "\n",
    "### Usage Pattern Examples\n",
    "\n",
    "#### Pattern 1: Quick Single Sequence Analysis\n",
    "```python\n",
    "from FScanpy import predict_prf, plot_prf_prediction\n",
    "\n",
    "# Simple prediction\n",
    "results = predict_prf(sequence=\"ATGCGT...\")\n",
    "\n",
    "# With visualization  \n",
    "results, fig = plot_prf_prediction(sequence=\"ATGCGT...\")\n",
    "```\n",
    "\n",
    "#### Pattern 2: Batch Sequence Analysis\n",
    "```python\n",
    "sequences = [\"seq1\", \"seq2\", \"seq3\"]\n",
    "results = predict_prf(sequence=sequences, ensemble_weight=0.5)\n",
    "```\n",
    "\n",
    "#### Pattern 3: BLASTX Pipeline\n",
    "```python\n",
    "from FScanpy.utils import fscanr, extract_prf_regions\n",
    "\n",
    "# Step 1: Detect PRF sites\n",
    "prf_sites = fscanr(blastx_df)\n",
    "\n",
    "# Step 2: Extract sequences\n",
    "prf_sequences = extract_prf_regions(fasta_file, prf_sites)\n",
    "\n",
    "# Step 3: Predict probabilities\n",
    "results = predict_prf(data=prf_sequences)\n",
    "```\n",
    "\n",
    "#### Pattern 4: Custom Analysis with PRFPredictor\n",
    "```python\n",
    "from FScanpy import PRFPredictor\n",
    "\n",
    "predictor = PRFPredictor()\n",
    "\n",
    "# Method chaining for different analysis types\n",
    "seq_results = predictor.predict_sequence(sequence)\n",
    "region_results = predictor.predict_regions(sequences_399bp)\n",
    "single_result = predictor.predict_single_position(seq_33bp, seq_399bp)\n",
    "```\n",
    "\n",
    "### Parameter Optimization Guide\n",
    "\n",
    "**Ensemble Weight Selection:**\n",
    "- `0.2-0.3`: Conservative (high specificity, favor long model)\n",
    "- `0.4-0.6`: Balanced (recommended default)\n",
    "- `0.7-0.8`: Sensitive (high sensitivity, favor short model)\n",
    "\n",
    "**Window Size Selection:**\n",
    "- `1`: High resolution, every position (slow but detailed)\n",
    "- `3`: Standard resolution (balanced speed/detail)  \n",
    "- `6-9`: Low resolution, faster analysis\n",
    "\n",
    "**Threshold Guidelines:**\n",
    "- `short_threshold`: 0.1-0.3 (controls efficiency by filtering low-probability candidates)\n",
    "- Display thresholds: 0.3-0.8 (controls visualization, higher = cleaner plots)\n",
    "- Classification threshold: 0.5 (standard binary classification cutoff)\n",
    "\n",
    "### Output Interpretation\n",
    "\n",
    "**Main Result Columns:**\n",
    "- `Short_Probability`: HistGradientBoosting model prediction (0-1)\n",
    "- `Long_Probability`: BiLSTM-CNN model prediction (0-1)\n",
    "- `Ensemble_Probability`: **Final prediction** (weighted combination)\n",
    "- `Position`: Sequence position (sliding window mode)\n",
    "- `Codon`: Codon at position (sliding window mode)\n",
    "\n",
    "**Ensemble Probability Interpretation:**\n",
    "- `> 0.8`: High confidence PRF site\n",
    "- `0.5-0.8`: Moderate confidence PRF site  \n",
    "- `0.3-0.5`: Low confidence, worth investigating\n",
    "- `< 0.3`: Unlikely to be PRF site\n",
    "\n",
    "### Best Practices\n",
    "\n",
    "1. **For exploration**: Use `window_size=1, ensemble_weight=0.4`\n",
    "2. **For screening**: Use `window_size=3, ensemble_weight=0.4, short_threshold=0.2`\n",
    "3. **For validation**: Use region-based prediction with known sequences\n",
    "4. **For visualization**: Adjust `short_threshold` and `long_threshold` in plotting functions to control display density\n",
    "\n",
    "This demo covers all major FScanpy functionalities. For detailed parameter descriptions and advanced usage, please refer to the complete tutorial documentation.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📝 Analysis Summary\n",
    "\n",
    "### 🎯 Key Findings\n",
    "1. **Data Quality**: Test dataset contains real BLASTX alignment results and validation regions\n",
    "2. **FScanR Performance**: Successfully identified potential PRF sites from BLASTX results\n",
    "3. **Model Performance**: Short and Long models each have advantages in different scenarios\n",
    "4. **Prediction Results**: Ensemble model provides more stable prediction performance\n",
    "5. **Visualization**: Built-in plotting functions generate clear heatmaps and bar charts\n",
    "\n",
    "### 🔧 Best Practices\n",
    "- **Data Preprocessing**: Ensure BLASTX results are in correct format\n",
    "- **Parameter Settings**: Use default ensemble weights (0.4:0.6) for balanced performance\n",
    "- **Result Interpretation**: When using FScanpy for whole sequence prediction, don't use 0.5 as threshold, but compare relative probabilities across positions\n",
    "- **Visualization**: Use plot_prf_prediction function to generate standardized plots\n",
    "\n",
    "### 📚 Usage Recommendations\n",
    "1. **Threshold Selection**: Adjust probability thresholds based on application scenarios\n",
    "2. **Result Validation**: Validate prediction results with biological knowledge\n",
    "3. **Performance Optimization**: Use reasonable sliding window sizes for large-scale data\n",
    "4. **Visualization Parameters**: Adjust figsize and dpi for optimal display"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "tf200",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}