2025-05-29 17:58:48 +08:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# FScanpy \n",
"\n",
"This notebook demonstrates how to use FScanpy with real test data for complete PRF site prediction analysis, including:\n",
"\n",
"## 🎯 Complete Workflow\n",
"1. **Load Test Data** - Use built-in real test data\n",
"2. **FScanR Analysis** - Identify potential PRF sites from BLASTX results\n",
"3. **Sequence Extraction** - Extract sequences around PRF sites\n",
"4. **FScanpy Prediction** - Use machine learning models to predict probabilities\n",
"5. **Results Visualization** - Generate prediction result plots using built-in plotting functions\n",
"6. **Sequence-level Prediction Demo** - Sliding window analysis of complete sequences\n",
"\n",
"## 📊 Data Description\n",
"- **blastx_example.xlsx**: Real BLASTX alignment results\n",
"- **mrna_example.fasta**: Real mRNA sequence data\n",
"- **region_example.csv**: Sample for individual site prediction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 📚 FScanpy Function Usage Guide\n",
"\n",
"### Core Functions Overview\n",
"\n",
"FScanpy provides several main functions for PRF prediction:\n",
"\n",
"#### 1. `predict_prf()` - Universal Prediction Function\n",
"```python\n",
"# Single sequence prediction\n",
"results = predict_prf(sequence=\"ATGCGT...\", window_size=3, ensemble_weight=0.4)\n",
"\n",
"# Multiple sequences prediction\n",
"results = predict_prf(sequence=[\"seq1\", \"seq2\"], window_size=3)\n",
"\n",
"# DataFrame region prediction (the demo in section 4 renames the '399bp' column to 'Long_Sequence' before this call)\n",
"results = predict_prf(data=df_with_399bp_column, ensemble_weight=0.4)\n",
"```\n",
"\n",
"#### 2. `plot_prf_prediction()` - Prediction with Visualization\n",
"```python\n",
"# Basic plotting\n",
"results, fig = plot_prf_prediction(sequence=\"ATGCGT...\")\n",
"\n",
"# Custom parameters\n",
"results, fig = plot_prf_prediction(\n",
" sequence=\"ATGCGT...\",\n",
" window_size=1,\n",
" short_threshold=0.65,\n",
" long_threshold=0.8,\n",
" ensemble_weight=0.4,\n",
" save_path=\"plot.png\"\n",
")\n",
"```\n",
"\n",
"#### 3. `PRFPredictor` Class Methods\n",
"```python\n",
"predictor = PRFPredictor()\n",
"\n",
"# Sliding window prediction\n",
"results = predictor.predict_sequence(sequence, window_size=3, ensemble_weight=0.4)\n",
"\n",
"# Region prediction\n",
"results = predictor.predict_regions(sequences_399bp, ensemble_weight=0.4)\n",
"\n",
"# Single position prediction\n",
"result = predictor.predict_single_position(fs_period_33bp, full_seq_399bp)\n",
"\n",
"# Plot prediction\n",
"results, fig = predictor.plot_sequence_prediction(sequence)\n",
"```\n",
"\n",
"#### 4. Utility Functions\n",
"```python\n",
"from FScanpy.utils import fscanr, extract_prf_regions\n",
"\n",
"# Detect PRF sites from BLASTX\n",
"prf_sites = fscanr(blastx_df, mismatch_cutoff=10, evalue_cutoff=1e-5)\n",
"\n",
"# Extract sequences around PRF sites\n",
"prf_sequences = extract_prf_regions(mrna_file, prf_sites)\n",
"```\n",
"\n",
"### Parameter Guidelines\n",
"\n",
"- **ensemble_weight**: weight of the Short model in the ensemble score; the Long model receives `1 - ensemble_weight`, as the outputs in this notebook show. 0.4 (default, balanced), 0.2-0.3 (conservative), 0.7-0.8 (sensitive)\n",
"- **window_size**: step between scanned positions, in nucleotides: 1 (detailed), 3 (standard), 6-9 (fast)\n",
"- **short_threshold**: 0.1 (default), 0.2-0.3 (stricter filtering)\n",
"- **Display thresholds**: 0.3-0.8 for visualization filtering\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 📦 Environment Setup and Data Loading"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"ename": "ImportError",
"evalue": "cannot import name 'PRFPredictor' from 'FScanpy' (unknown location)",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mImportError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[3], line 6\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mmatplotlib\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpyplot\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mplt\u001b[39;00m\n\u001b[1;32m 5\u001b[0m \u001b[38;5;66;03m# Import FScanpy related modules\u001b[39;00m\n\u001b[0;32m----> 6\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mFScanpy\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m PRFPredictor, predict_prf, plot_prf_prediction\n\u001b[1;32m 7\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mFScanpy\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mdata\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m get_test_data_path, list_test_data\n\u001b[1;32m 8\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mFScanpy\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mutils\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m fscanr, extract_prf_regions\n",
"\u001b[0;31mImportError\u001b[0m: cannot import name 'PRFPredictor' from 'FScanpy' (unknown location)"
]
}
],
"source": [
"# Import necessary libraries\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Import FScanpy related modules\n",
"from FScanpy import PRFPredictor, predict_prf, plot_prf_prediction\n",
"from FScanpy.data import get_test_data_path, list_test_data\n",
"from FScanpy.utils import fscanr, extract_prf_regions\n",
"\n",
"print(\"✅ Environment setup complete!\")\n",
"print(\"📋 Available test data:\")\n",
"list_test_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Load and Explore Test Data\n",
"\n",
"First, load the real test data provided by FScanpy to understand the data structure."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"📁 Data file paths:\n",
" BLASTX data: /mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/data/test_data/blastx_example.xlsx\n",
" mRNA sequences: /mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/data/test_data/mrna_example.fasta\n",
" Validation regions: /mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/data/test_data/region_example.csv\n",
"\n",
"🧬 BLASTX data overview:\n",
" Data shape: (1000, 14)\n",
" Column names: ['DNA_seqid', 'Pep_seqid', 'pident', 'length', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore', 'qframe', 'sframe']\n",
" Unique sequences: 704\n",
"\n",
"📊 BLASTX data examples:\n",
" DNA_seqid Pep_seqid pident length evalue qframe\n",
"0 MSTRG.9998.1 CAMPEP_0196994412 68.27 104 1.000000e-33 2\n",
"1 MSTRG.9996.1 CAMPEP_0197017426 49.16 297 3.000000e-79 2\n",
"2 MSTRG.9994.1 CAMPEP_0197009206 98.31 354 0.000000e+00 2\n",
"3 MSTRG.9993.1 CAMPEP_0168331218 51.67 60 2.000000e-37 2\n",
"4 MSTRG.9993.1 CAMPEP_0168331218 45.45 88 2.000000e-37 3\n"
]
}
],
"source": [
"# Get test data paths\n",
"blastx_file = get_test_data_path('blastx_example.xlsx')\n",
"mrna_file = get_test_data_path('mrna_example.fasta')\n",
"region_file = get_test_data_path('region_example.csv')\n",
"\n",
"print(f\"📁 Data file paths:\")\n",
"print(f\" BLASTX data: {blastx_file}\")\n",
"print(f\" mRNA sequences: {mrna_file}\")\n",
"print(f\" Validation regions: {region_file}\")\n",
"\n",
"# Load BLASTX data\n",
"blastx_data = pd.read_excel(blastx_file)\n",
"print(f\"\\n🧬 BLASTX data overview:\")\n",
"print(f\" Data shape: {blastx_data.shape}\")\n",
"print(f\" Column names: {list(blastx_data.columns)}\")\n",
"print(f\" Unique sequences: {blastx_data['DNA_seqid'].nunique()}\")\n",
"\n",
"# Display first few rows\n",
"print(\"\\n📊 BLASTX data examples:\")\n",
"display_cols = ['DNA_seqid', 'Pep_seqid', 'pident', 'length', 'evalue', 'qframe']\n",
"print(blastx_data[display_cols].head())"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🎯 Validation region data overview:\n",
" Data shape: (3, 8)\n",
" Column names: ['FS_period', '399bp', 'fs_position', 'DNA_seqid', 'label', 'source', 'FS_type', 'dataset']\n",
" Data sources: {'EUPLOTES': 3}\n",
"\n",
"📋 Validation region data examples:\n",
" fs_position DNA_seqid label source FS_type\n",
"0 16.0 MSTRG.18491.1 0 EUPLOTES negative\n",
"1 16.0 MSTRG.4662.1 0 EUPLOTES negative\n",
"2 16.0 MSTRG.14742.1 0 EUPLOTES negative\n",
"\n",
"📈 Label distribution:\n",
"label\n",
"0 3\n",
"Name: count, dtype: int64\n",
"\n",
"🔬 FS type distribution:\n",
"FS_type\n",
"negative 3\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"# Load validation region data\n",
"region_data = pd.read_csv(region_file)\n",
"print(f\"🎯 Validation region data overview:\")\n",
"print(f\" Data shape: {region_data.shape}\")\n",
"print(f\" Column names: {list(region_data.columns)}\")\n",
"print(f\" Data sources: {region_data['source'].value_counts().to_dict()}\")\n",
"\n",
"print(\"\\n📋 Validation region data examples:\")\n",
"display_cols = ['fs_position', 'DNA_seqid', 'label', 'source', 'FS_type']\n",
"print(region_data[display_cols].head())\n",
"\n",
"# Statistical analysis\n",
"print(f\"\\n📈 Label distribution:\")\n",
"print(region_data['label'].value_counts())\n",
"print(f\"\\n🔬 FS type distribution:\")\n",
"print(region_data['FS_type'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. FScanR Analysis - Identify Potential PRF Sites from BLASTX\n",
"\n",
"Use the FScanR algorithm to analyze BLASTX results and identify potential programmed ribosomal frameshift sites."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🔍 Running FScanR analysis...\n",
"Parameter settings: mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10\n",
"\n",
"✅ FScanR analysis complete!\n",
"Number of potential PRF sites detected: 24\n",
"\n",
"📊 FScanR results overview:\n",
" Column names: ['DNA_seqid', 'FS_start', 'FS_end', 'Pep_seqid', 'Pep_FS_start', 'Pep_FS_end', 'FS_type', 'Strand']\n",
" Number of sequences involved: 16\n",
" Strand orientation distribution: {'+': 16, '-': 8}\n",
" FS type distribution: {1: 16, -1: 7, -2: 1}\n",
"\n",
"🎯 FScanR results examples:\n",
" DNA_seqid FS_start FS_end Pep_seqid Pep_FS_start \\\n",
"0 MSTRG.9380.1 3797 3802 CAMPEP_0197017206 1137 \n",
"1 MSTRG.9431.1 4136 4192 CAMPEP_0197016790 657 \n",
"3 MSTRG.9432.1 848 904 CAMPEP_0197016790 753 \n",
"4 MSTRG.9582.1 302 304 CAMPEP_0197003180 214 \n",
"5 MSTRG.961.1 1536 1533 CAMPEP_0197017908 590 \n",
"\n",
" Pep_FS_end FS_type Strand \n",
"0 1138 1 + \n",
"1 675 1 + \n",
"3 2 1 - \n",
"4 214 1 + \n",
"5 19 -1 - \n"
]
}
],
"source": [
"# Run FScanR analysis\n",
"print(\"🔍 Running FScanR analysis...\")\n",
"print(\"Parameter settings: mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10\")\n",
"\n",
"fscanr_results = fscanr(\n",
" blastx_data,\n",
" mismatch_cutoff=10,\n",
" evalue_cutoff=1e-5,\n",
" frameDist_cutoff=10\n",
")\n",
"\n",
"print(f\"\\n✅ FScanR analysis complete!\")\n",
"print(f\"Number of potential PRF sites detected: {len(fscanr_results)}\")\n",
"\n",
"if len(fscanr_results) > 0:\n",
" print(f\"\\n📊 FScanR results overview:\")\n",
" print(f\" Column names: {list(fscanr_results.columns)}\")\n",
" print(f\" Number of sequences involved: {fscanr_results['DNA_seqid'].nunique()}\")\n",
" print(f\" Strand orientation distribution: {fscanr_results['Strand'].value_counts().to_dict()}\")\n",
" print(f\" FS type distribution: {fscanr_results['FS_type'].value_counts().to_dict()}\")\n",
" \n",
" print(\"\\n🎯 FScanR results examples:\")\n",
" print(fscanr_results.head())\n",
"else:\n",
" print(\"⚠️ No PRF sites detected, may need to adjust parameters\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Sequence Extraction - Extract Sequences Around PRF Sites\n",
"\n",
"From the mRNA sequences, extract the 399 bp fragment around each PRF site identified by FScanR."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"📝 Extracting sequences around PRF sites from mRNA sequences...\n",
"\n",
"✅ Sequence extraction complete!\n",
"Number of successfully extracted sequences: 24\n",
"\n",
"📏 Sequence length validation:\n",
" 399bp sequence length distribution: {399: 24}\n",
" Average length: 399.0\n",
"\n",
"🧬 Extracted sequence examples:\n",
"Sequence 1: MSTRG.9380.1\n",
" FS position: 3797-3802\n",
" Strand orientation: +\n",
" FS type: 1\n",
" Sequence fragment: AAGGAGTTTGAAGAAGAACAGGAAAAACAAGAGAAAGAGAGAAAGGAGAA...NNNNNNNNNNNNNNNNNNNN\n",
"\n",
"Sequence 2: MSTRG.9431.1\n",
" FS position: 4136-4192\n",
" Strand orientation: +\n",
" FS type: 1\n",
" Sequence fragment: CAAGTATCTGAGTGGGAGGGAGACACAGGTGTTGATCAAACCCCATTCCC...ATAATGACGGAGGCTTCAGA\n",
"\n",
"Sequence 3: MSTRG.9432.1\n",
" FS position: 848-904\n",
" Strand orientation: -\n",
" FS type: 1\n",
" Sequence fragment: AGAAAGGATGGTACTGAAAATCAACGAAGTACTTTCACATTTTAGAAAGA...GCTGAGAACGATATTGACAA\n",
"\n"
]
}
],
"source": [
"# Extract sequences around PRF sites\n",
"if len(fscanr_results) > 0:\n",
" print(\"📝 Extracting sequences around PRF sites from mRNA sequences...\")\n",
" \n",
" prf_sequences = extract_prf_regions(\n",
" mrna_file=mrna_file,\n",
" prf_data=fscanr_results\n",
" )\n",
" \n",
" print(f\"\\n✅ Sequence extraction complete!\")\n",
" print(f\"Number of successfully extracted sequences: {len(prf_sequences)}\")\n",
" \n",
" if len(prf_sequences) > 0:\n",
" print(f\"\\n📏 Sequence length validation:\")\n",
" seq_lengths = prf_sequences['399bp'].str.len()\n",
" print(f\" 399bp sequence length distribution: {seq_lengths.value_counts().to_dict()}\")\n",
" print(f\" Average length: {seq_lengths.mean():.1f}\")\n",
" \n",
" print(\"\\n🧬 Extracted sequence examples:\")\n",
" for i, row in prf_sequences.head(3).iterrows():\n",
" print(f\"Sequence {i+1}: {row['DNA_seqid']}\")\n",
" print(f\" FS position: {row['FS_start']}-{row['FS_end']}\")\n",
" print(f\" Strand orientation: {row['Strand']}\")\n",
" print(f\" FS type: {row['FS_type']}\")\n",
" print(f\" Sequence fragment: {row['399bp'][:50]}...{row['399bp'][-20:]}\")\n",
" print()\n",
" else:\n",
" print(\"❌ Sequence extraction failed\")\n",
"else:\n",
" print(\"⚠️ Skipping sequence extraction - no FScanR results\")\n",
" prf_sequences = pd.DataFrame()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. FScanpy Prediction - Machine Learning Model Analysis\n",
"\n",
"Use FScanpy's machine learning models to predict PRF probabilities for the extracted sequences."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🤖 FScanpy predictor initialization complete\n",
"\n",
"🎯 Predicting 24 sequences identified by FScanR...\n",
"\n",
"📊 FScanR+FScanpy prediction results:\n",
" DNA_seqid FS_start FS_type Short_Probability Long_Probability \\\n",
"0 MSTRG.9380.1 3797 1 0.239192 0.087024 \n",
"1 MSTRG.9431.1 4136 1 0.326807 0.356356 \n",
"2 MSTRG.9432.1 848 1 0.310908 0.159746 \n",
"3 MSTRG.9582.1 302 1 0.272451 0.223354 \n",
"4 MSTRG.961.1 1536 -1 0.263269 0.046773 \n",
"\n",
" Ensemble_Probability \n",
"0 0.147891 \n",
"1 0.344536 \n",
"2 0.220211 \n",
"3 0.242993 \n",
"4 0.133372 \n"
]
}
],
"source": [
"# Initialize predictor\n",
"predictor = PRFPredictor()\n",
"print(\"🤖 FScanpy predictor initialization complete\")\n",
"\n",
"# Predict FScanR identified sequences\n",
"if len(prf_sequences) > 0:\n",
" print(f\"\\n🎯 Predicting {len(prf_sequences)} sequences identified by FScanR...\")\n",
" \n",
" fscanr_predictions = predictor.predict_regions(\n",
" sequences=prf_sequences['399bp'],\n",
" ensemble_weight=0.4 # Balanced configuration\n",
" )\n",
" \n",
" # Merge results\n",
" fscanr_predictions = pd.concat([\n",
" prf_sequences.reset_index(drop=True),\n",
" fscanr_predictions.reset_index(drop=True)\n",
" ], axis=1)\n",
" \n",
" print(\"\\n📊 FScanR+FScanpy prediction results:\")\n",
" result_cols = ['DNA_seqid', 'FS_start', 'FS_type', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']\n",
" print(fscanr_predictions[result_cols].head())"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"🧪 Predicting 3 validation regions...\n",
"\n",
"📊 Validation region prediction results:\n",
" DNA_seqid label source Short_Probability Long_Probability \\\n",
"0 MSTRG.18491.1 0 EUPLOTES 0.368610 0.144442 \n",
"1 MSTRG.4662.1 0 EUPLOTES 0.229811 0.053352 \n",
"2 MSTRG.14742.1 0 EUPLOTES 0.454152 0.345118 \n",
"\n",
" Ensemble_Probability \n",
"0 0.234109 \n",
"1 0.123936 \n",
"2 0.388732 \n"
]
}
],
"source": [
"# Predict validation region data\n",
"print(f\"\\n🧪 Predicting {len(region_data)} validation regions...\")\n",
"\n",
"validation_predictions = predict_prf(\n",
" data=region_data.rename(columns={'399bp': 'Long_Sequence'}),\n",
" ensemble_weight=0.4\n",
")\n",
"\n",
"print(\"\\n📊 Validation region prediction results:\")\n",
"result_cols = ['DNA_seqid', 'label', 'source', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']\n",
"print(validation_predictions[result_cols].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Sequence-level Prediction and Visualization\n",
"\n",
"Select a specific mRNA sequence and use the built-in plot_prf_prediction function for complete sliding window prediction and visualization."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🧬 Selected demonstration sequence: MSTRG.9127.1\n",
"Sequence length: 256 bp\n",
"First 100bp of sequence: TGGCCTTCTTACTTGGAAGTCCCCAAGGATCATCTTGGCCATCCTTGCTTTCTTCATGGCTAGATTCTACCTCCTCCCATAATTGTGTGAAACAAGTAAC...\n",
"\n",
"🎯 Using plot_prf_prediction for sequence prediction and visualization...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/predictor.py:335: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.\n",
" plt.tight_layout()\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1600x800 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
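{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"📊 Sequence prediction result statistics:\n",
" Total predicted sites: 85\n",
" High probability sites (>0.8): 0\n",
" Medium probability sites (0.4-0.8): 6\n",
" Highest prediction probability: 0.475\n"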
]
}
],
"source": [
"# Select a sequence for demonstration\n",
"from Bio import SeqIO\n",
"\n",
"# Read the first mRNA sequence for demonstration\n",
"mrna_sequences = list(SeqIO.parse(mrna_file, \"fasta\"))\n",
"demo_seq = mrna_sequences[0] # Select the first sequence\n",
"\n",
"print(f\"🧬 Selected demonstration sequence: {demo_seq.id}\")\n",
"print(f\"Sequence length: {len(demo_seq.seq)} bp\")\n",
"print(f\"First 100bp of sequence: {str(demo_seq.seq)[:100]}...\")\n",
"\n",
"# Use built-in plot_prf_prediction function for prediction and visualization\n",
"print(f\"\\n🎯 Using plot_prf_prediction for sequence prediction and visualization...\")\n",
"\n",
"sequence_results, fig = plot_prf_prediction(\n",
" sequence=str(demo_seq.seq),\n",
" window_size=3,\n",
" short_threshold=0.2,\n",
" long_threshold=0.2,\n",
" ensemble_weight=0.6,\n",
 title=f\"PRF Prediction Results for Sequence {demo_seq.id} (Bar Chart + Heatmap)\",\n",
" figsize=(16, 8),\n",
" dpi=150\n",
")\n",
"\n",
"plt.show()\n",
"\n",
"print(\"\\n📊 Sequence prediction result statistics:\")\n",
"print(f\" Total predicted sites: {len(sequence_results)}\")\n",
"print(f\" High probability sites (>0.8): {(sequence_results['Ensemble_Probability'] > 0.8).sum()}\")\n",
"print(f\" Medium probability sites (0.4-0.8): {((sequence_results['Ensemble_Probability'] >= 0.4) & (sequence_results['Ensemble_Probability'] <= 0.8)).sum()}\")\n",
"print(f\" Highest prediction probability: {sequence_results['Ensemble_Probability'].max():.3f}\")"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
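{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"🔝 Top 5 predicted sites:\n",
" 1. Position 96: \n",
" - Short probability: 0.288\n",
" - Long probability: 0.755\n",
" - Ensemble probability: 0.475\n",
" - Codon: TAA\n",
" 2. Position 12: \n",
" - Short probability: 0.606\n",
" - Long probability: 0.177\n",
" - Ensemble probability: 0.434\n",
" - Codon: TTG\n",
" 3. Position 15: \n",
" - Short probability: 0.493\n",
" - Long probability: 0.329\n",
" - Ensemble probability: 0.428\n",
" - Codon: GAA\n",
" 4. Position 18: \n",
" - Short probability: 0.369\n",
" - Long probability: 0.510\n",
" - Ensemble probability: 0.426\n",
" - Codon: GTC\n",
" 5. Position 105: \n",
" - Short probability: 0.248\n",
" - Long probability: 0.671\n",
" - Ensemble probability: 0.418\n",
" - Codon: ACT\n",
"\n",
"📊 Visualization analysis complete!\n",
"The chart contains heatmaps and bar charts showing the PRF prediction probability distribution across the entire sequence.\n"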
]
}
],
"source": [
"# Print top predicted site probabilities\n",
"if sequence_results['Ensemble_Probability'].max() > 0.3:\n",
"    top_predictions = sequence_results.nlargest(5, 'Ensemble_Probability')\n",
"    print(\"\\n🔝 Top 5 predicted sites:\")\n",
"    for i, (_, row) in enumerate(top_predictions.iterrows(), 1):\n",
"        print(f\" {i}. Position {row['Position']}: \")\n",
"        print(f\" - Short probability: {row['Short_Probability']:.3f}\")\n",
"        print(f\" - Long probability: {row['Long_Probability']:.3f}\")\n",
"        print(f\" - Ensemble probability: {row['Ensemble_Probability']:.3f}\")\n",
"        print(f\" - Codon: {row['Codon']}\")\n",
"else:\n",
"    print(\"\\n💡 No sites with an ensemble probability above 0.3 were detected in this sequence\")\n",
"\n",
"print(\"\\n📊 Visualization analysis complete!\")\n",
"print(\"The chart contains heatmaps and bar charts showing the PRF prediction probability distribution across the entire sequence.\")"
]
},
{
"cell_type": "raw",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"## 📖 Complete Function Reference\n",
"\n",
"### All Available Functions and Methods\n",
"\n",
"#### Core Prediction Functions\n",
"\n",
"**1. `predict_prf(sequence=None, data=None, window_size=3, short_threshold=0.1, ensemble_weight=0.4, model_dir=None)`**\n",
"- **Purpose**: Universal prediction function for both sliding window and region-based analysis\n",
"- **Input modes**: \n",
" - Single/multiple sequences → sliding window prediction\n",
" - DataFrame with 'Long_Sequence'/'399bp' column → region prediction\n",
"- **Key parameters**:\n",
" - `ensemble_weight`: Short model weight (0.0-1.0, default: 0.4)\n",
" - `window_size`: Scanning step size (default: 3)\n",
" - `short_threshold`: Filtering threshold (default: 0.1)\n",
"\n",
"**2. `plot_prf_prediction(sequence, window_size=3, short_threshold=0.65, long_threshold=0.8, ensemble_weight=0.4, title=None, save_path=None, figsize=(12,8), dpi=300)`**\n",
"- **Purpose**: Prediction with built-in visualization (3-subplot layout: FS site heatmap, prediction heatmap, bar chart)\n",
"- **Returns**: (prediction_results_df, matplotlib_figure)\n",
"- **Visualization features**: \n",
" - Black bars with alpha=0.6\n",
" - 'Reds' colormap for heatmaps\n",
" - Height ratios [0.1, 0.1, 1] for subplots\n",
"\n",
"#### PRFPredictor Class Methods\n",
"\n",
"**3. Class initialization: `PRFPredictor(model_dir=None)`**\n",
"- Loads HistGradientBoosting (short, 33bp) and BiLSTM-CNN (long, 399bp) models\n",
"- Uses ensemble weighting for final predictions\n",
"\n",
"**4. `predictor.predict_sequence(sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)`**\n",
"- **Purpose**: Sliding window analysis of complete sequences\n",
"- **Process**: Scans sequence with specified window size, applies both models\n",
"\n",
"**5. `predictor.predict_regions(sequences, short_threshold=0.1, ensemble_weight=0.4)`**\n",
"- **Purpose**: Batch prediction for pre-defined 399bp regions\n",
"- **Input**: List/Series of 399bp sequences\n",
"- **Efficient**: Direct region analysis without sliding window\n",
"\n",
"**6. `predictor.predict_single_position(fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)`**\n",
"- **Purpose**: Single position analysis\n",
"- **Inputs**: 33bp sequence (fs_period) + 399bp sequence (full_seq)\n",
"- **Returns**: Dictionary with individual and ensemble probabilities\n",
"\n",
"**7. `predictor.plot_sequence_prediction(...)`** \n",
"- **Purpose**: Class method version of plot_prf_prediction()\n",
"- **Same parameters** as standalone function\n",
"\n",
"#### Utility Functions\n",
"\n",
"**8. `fscanr(blastx_output, mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10)`**\n",
"- **Purpose**: Detect PRF sites from BLASTX alignment results\n",
"- **Input**: DataFrame with BLASTX columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore, qframe, sframe)\n",
"- **Output**: PRF sites with FS_start, FS_end, FS_type, Strand information\n",
"\n",
"**9. `extract_prf_regions(mrna_file, prf_data)`**\n",
"- **Purpose**: Extract 399bp sequences around detected PRF sites\n",
"- **Inputs**: FASTA file path + FScanR results DataFrame\n",
"- **Handles**: Strand orientation (reverse complement for '-' strand)\n",
"\n",
"#### Data Access Functions\n",
"\n",
"**10. `get_test_data_path(filename)`**\n",
"- **Purpose**: Get path to built-in test data files\n",
"- **Available files**: 'blastx_example.xlsx', 'mrna_example.fasta', 'region_example.csv'\n",
"\n",
"**11. `list_test_data()`**\n",
"- **Purpose**: Display all available test data files\n",
"\n",
"### Usage Pattern Examples\n",
"\n",
"#### Pattern 1: Quick Single Sequence Analysis\n",
"```python\n",
"from FScanpy import predict_prf, plot_prf_prediction\n",
"\n",
"# Simple prediction\n",
"results = predict_prf(sequence=\"ATGCGT...\")\n",
"\n",
"# With visualization \n",
"results, fig = plot_prf_prediction(sequence=\"ATGCGT...\")\n",
"```\n",
"\n",
"#### Pattern 2: Batch Sequence Analysis\n",
"```python\n",
"sequences = [\"seq1\", \"seq2\", \"seq3\"]\n",
"results = predict_prf(sequence=sequences, ensemble_weight=0.5)\n",
"```\n",
"\n",
"#### Pattern 3: BLASTX Pipeline\n",
"```python\n",
"from FScanpy.utils import fscanr, extract_prf_regions\n",
"\n",
"# Step 1: Detect PRF sites\n",
"prf_sites = fscanr(blastx_df)\n",
"\n",
"# Step 2: Extract sequences\n",
"prf_sequences = extract_prf_regions(fasta_file, prf_sites)\n",
"\n",
"# Step 3: Predict probabilities\n",
"results = predict_prf(data=prf_sequences)\n",
"```\n",
"\n",
"#### Pattern 4: Custom Analysis with PRFPredictor\n",
"```python\n",
"from FScanpy import PRFPredictor\n",
"\n",
"predictor = PRFPredictor()\n",
"\n",
"# Method chaining for different analysis types\n",
"seq_results = predictor.predict_sequence(sequence)\n",
"region_results = predictor.predict_regions(sequences_399bp)\n",
"single_result = predictor.predict_single_position(seq_33bp, seq_399bp)\n",
"```\n",
"\n",
"### Parameter Optimization Guide\n",
"\n",
"**Ensemble Weight Selection:**\n",
"- `0.2-0.3`: Conservative (high specificity, favor long model)\n",
"- `0.4-0.6`: Balanced (recommended default)\n",
"- `0.7-0.8`: Sensitive (high sensitivity, favor short model)\n",
"\n",
"**Window Size Selection:**\n",
"- `1`: High resolution, every position (slow but detailed)\n",
"- `3`: Standard resolution (balanced speed/detail) \n",
"- `6-9`: Low resolution, faster analysis\n",
"\n",
"**Threshold Guidelines:**\n",
"- `short_threshold`: 0.1-0.3 (controls efficiency by filtering low-probability candidates)\n",
"- Display thresholds: 0.3-0.8 (controls visualization, higher = cleaner plots)\n",
"- Classification threshold: 0.5 (standard binary classification cutoff)\n",
"\n",
"### Output Interpretation\n",
"\n",
"**Main Result Columns:**\n",
"- `Short_Probability`: HistGradientBoosting model prediction (0-1)\n",
"- `Long_Probability`: BiLSTM-CNN model prediction (0-1)\n",
"- `Ensemble_Probability`: **Final prediction** (weighted combination)\n",
"- `Position`: Sequence position (sliding window mode)\n",
"- `Codon`: Codon at position (sliding window mode)\n",
"\n",
"**Ensemble Probability Interpretation:**\n",
"- `> 0.8`: High confidence PRF site\n",
"- `0.5-0.8`: Moderate confidence PRF site \n",
"- `0.3-0.5`: Low confidence, worth investigating\n",
"- `< 0.3`: Unlikely to be PRF site\n",
"\n",
"### Best Practices\n",
"\n",
"1. **For exploration**: Use `window_size=1, ensemble_weight=0.4`\n",
"2. **For screening**: Use `window_size=3, ensemble_weight=0.4, short_threshold=0.2`\n",
"3. **For validation**: Use region-based prediction with known sequences\n",
"4. **For visualization**: Adjust `short_threshold` and `long_threshold` in plotting functions to control display density\n",
"\n",
"This demo covers all major FScanpy functionalities. For detailed parameter descriptions and advanced usage, please refer to the complete tutorial documentation.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 📝 Analysis Summary\n",
"\n",
"### 🎯 Key Findings\n",
"1. **Data Quality**: The test dataset contains real BLASTX alignment results and validation regions\n",
"2. **FScanR Performance**: FScanR identified potential PRF sites from the BLASTX results\n",
"3. **Model Performance**: The Short and Long models each have advantages in different scenarios and can disagree at the same position (e.g. 0.288 vs. 0.755 at position 96 of the demonstration sequence)\n",
"4. **Prediction Results**: The ensemble combines both models into a single score intended to be more stable than either model alone\n",
"5. **Visualization**: Built-in plotting functions generate heatmaps and bar charts of the prediction probabilities\n",
"\n",
"### 🔧 Best Practices\n",
"- **Data Preprocessing**: Ensure BLASTX results are in the correct format\n",
"- **Parameter Settings**: Use the default `ensemble_weight=0.4` (Short:Long = 0.4:0.6) for balanced performance\n",
"- **Result Interpretation**: For whole-sequence prediction, do not apply a fixed 0.5 threshold; compare relative probabilities across positions instead\n",
"- **Visualization**: Use the `plot_prf_prediction` function to generate standardized plots\n",
"\n",
"### 📚 Usage Recommendations\n",
"1. **Threshold Selection**: Adjust probability thresholds to the application scenario\n",
"2. **Result Validation**: Validate predictions against biological knowledge\n",
"3. **Performance Optimization**: Use a larger sliding window size (e.g. 3 or more) for large-scale data\n",
"4. **Visualization Parameters**: Adjust `figsize` and `dpi` for optimal display"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "tf200",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}