FScanpy-package/FScanpy_Demo.ipynb

2025-05-29 17:58:48 +08:00
{
"cells": [
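{
"cell_type": "markdown",
"metadata": {},
"source": [
"# FScanpy\n",
"\n",
"This notebook shows how to run a complete programmed ribosomal frameshifting (PRF) site prediction analysis on the real test data bundled with FScanpy. It covers:\n",
"\n",
"## 🎯 Complete workflow\n",
"1. **Load the test data** - use the bundled real test data\n",
"2. **FScanR analysis** - identify candidate PRF sites from BLASTX results\n",
"3. **Sequence extraction** - extract the sequence around each PRF site\n",
"4. **FScanpy prediction** - predict probabilities with the machine-learning models\n",
"5. **Visualization** - plot the predictions with the built-in plotting function\n",
"6. **Sequence-level prediction demo** - sliding-window analysis of a full sequence\n",
"\n",
"## 📊 Data\n",
"- **blastx_example.xlsx**: real BLASTX alignment results\n",
"- **mrna_example.fasta**: real mRNA sequences\n",
"- **region_example.csv**: sample regions for predicting individual sites"
]
},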
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 📦 Environment setup and data loading"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"✅ Environment ready!\n",
"📋 Available test data:\n"
]
},
{
"data": {
"text/plain": [
"['blastx_example.xlsx', 'mrna_example.fasta', 'region_example.csv']"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Import the required libraries\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Import the FScanpy modules\n",
"from FScanpy import PRFPredictor, predict_prf, plot_prf_prediction\n",
"from FScanpy.data import get_test_data_path, list_test_data\n",
"from FScanpy.utils import fscanr, extract_prf_regions\n",
"\n",
"print(\"✅ Environment ready!\")\n",
"print(\"📋 Available test data:\")\n",
"list_test_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Load and explore the test data\n",
"\n",
"First, load the real test data that ships with FScanpy and inspect its structure."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"📁 Data file paths:\n",
" BLASTX data: /mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/data/test_data/blastx_example.xlsx\n",
" mRNA sequences: /mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/data/test_data/mrna_example.fasta\n",
" Validation regions: /mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/data/test_data/region_example.csv\n",
"\n",
"🧬 BLASTX data overview:\n",
" Shape: (1000, 14)\n",
" Columns: ['DNA_seqid', 'Pep_seqid', 'pident', 'length', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore', 'qframe', 'sframe']\n",
" Unique sequences: 704\n",
"\n",
"📊 BLASTX data sample:\n",
" DNA_seqid Pep_seqid pident length evalue qframe\n",
"0 MSTRG.9998.1 CAMPEP_0196994412 68.27 104 1.000000e-33 2\n",
"1 MSTRG.9996.1 CAMPEP_0197017426 49.16 297 3.000000e-79 2\n",
"2 MSTRG.9994.1 CAMPEP_0197009206 98.31 354 0.000000e+00 2\n",
"3 MSTRG.9993.1 CAMPEP_0168331218 51.67 60 2.000000e-37 2\n",
"4 MSTRG.9993.1 CAMPEP_0168331218 45.45 88 2.000000e-37 3\n"
]
}
],
"source": [
"# Get the test data paths\n",
"blastx_file = get_test_data_path('blastx_example.xlsx')\n",
"mrna_file = get_test_data_path('mrna_example.fasta')\n",
"region_file = get_test_data_path('region_example.csv')\n",
"\n",
"print(\"📁 Data file paths:\")\n",
"print(f\" BLASTX data: {blastx_file}\")\n",
"print(f\" mRNA sequences: {mrna_file}\")\n",
"print(f\" Validation regions: {region_file}\")\n",
"\n",
"# Load the BLASTX data\n",
"blastx_data = pd.read_excel(blastx_file)\n",
"print(\"\\n🧬 BLASTX data overview:\")\n",
"print(f\" Shape: {blastx_data.shape}\")\n",
"print(f\" Columns: {list(blastx_data.columns)}\")\n",
"print(f\" Unique sequences: {blastx_data['DNA_seqid'].nunique()}\")\n",
"\n",
"# Show the first few rows\n",
"print(\"\\n📊 BLASTX data sample:\")\n",
"display_cols = ['DNA_seqid', 'Pep_seqid', 'pident', 'length', 'evalue', 'qframe']\n",
"print(blastx_data[display_cols].head())"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🎯 Validation region data overview:\n",
" Shape: (3, 8)\n",
" Columns: ['FS_period', '399bp', 'fs_position', 'DNA_seqid', 'label', 'source', 'FS_type', 'dataset']\n",
" Sources: {'EUPLOTES': 3}\n",
"\n",
"📋 Validation region data sample:\n",
" fs_position DNA_seqid label source FS_type\n",
"0 16.0 MSTRG.18491.1 0 EUPLOTES negative\n",
"1 16.0 MSTRG.4662.1 0 EUPLOTES negative\n",
"2 16.0 MSTRG.14742.1 0 EUPLOTES negative\n",
"\n",
"📈 Label distribution:\n",
"label\n",
"0 3\n",
"Name: count, dtype: int64\n",
"\n",
"🔬 FS type distribution:\n",
"FS_type\n",
"negative 3\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"# Load the validation region data\n",
"region_data = pd.read_csv(region_file)\n",
"print(\"🎯 Validation region data overview:\")\n",
"print(f\" Shape: {region_data.shape}\")\n",
"print(f\" Columns: {list(region_data.columns)}\")\n",
"print(f\" Sources: {region_data['source'].value_counts().to_dict()}\")\n",
"\n",
"print(\"\\n📋 Validation region data sample:\")\n",
"display_cols = ['fs_position', 'DNA_seqid', 'label', 'source', 'FS_type']\n",
"print(region_data[display_cols].head())\n",
"\n",
"# Summary statistics\n",
"print(\"\\n📈 Label distribution:\")\n",
"print(region_data['label'].value_counts())\n",
"print(\"\\n🔬 FS type distribution:\")\n",
"print(region_data['FS_type'].value_counts())"
]
},
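{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. FScanR analysis - identify candidate PRF sites from BLASTX\n",
"\n",
"Use the FScanR algorithm to analyze the BLASTX results and identify candidate programmed ribosomal frameshifting sites."
]
},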
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🔍 Running FScanR analysis...\n",
"Parameters: mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=100\n",
"\n",
"✅ FScanR analysis complete\n",
"Candidate PRF sites detected: 24\n",
"\n",
"📊 FScanR results overview:\n",
" Columns: ['DNA_seqid', 'FS_start', 'FS_end', 'Pep_seqid', 'Pep_FS_start', 'Pep_FS_end', 'FS_type', 'Strand']\n",
" Sequences involved: 16\n",
" Strand distribution: {'+': 16, '-': 8}\n",
" FS type distribution: {1: 16, -1: 7, -2: 1}\n",
"\n",
"🎯 FScanR results sample:\n",
" DNA_seqid FS_start FS_end Pep_seqid Pep_FS_start \\\n",
"0 MSTRG.9380.1 3797 3802 CAMPEP_0197017206 1137 \n",
"1 MSTRG.9431.1 4136 4192 CAMPEP_0197016790 657 \n",
"3 MSTRG.9432.1 848 904 CAMPEP_0197016790 753 \n",
"4 MSTRG.9582.1 302 304 CAMPEP_0197003180 214 \n",
"5 MSTRG.961.1 1536 1533 CAMPEP_0197017908 590 \n",
"\n",
" Pep_FS_end FS_type Strand \n",
"0 1138 1 + \n",
"1 675 1 + \n",
"3 2 1 - \n",
"4 214 1 + \n",
"5 19 -1 - \n"
]
}
],
"source": [
"# Run the FScanR analysis\n",
"print(\"🔍 Running FScanR analysis...\")\n",
"# The message matches the value actually passed to fscanr below\n",
"print(\"Parameters: mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=100\")\n",
"\n",
"fscanr_results = fscanr(\n",
"    blastx_data,\n",
"    mismatch_cutoff=10,\n",
"    evalue_cutoff=1e-5,\n",
"    frameDist_cutoff=100\n",
")\n",
"\n",
"print(\"\\n✅ FScanR analysis complete\")\n",
"print(f\"Candidate PRF sites detected: {len(fscanr_results)}\")\n",
"\n",
"if len(fscanr_results) > 0:\n",
"    print(\"\\n📊 FScanR results overview:\")\n",
"    print(f\" Columns: {list(fscanr_results.columns)}\")\n",
"    print(f\" Sequences involved: {fscanr_results['DNA_seqid'].nunique()}\")\n",
"    print(f\" Strand distribution: {fscanr_results['Strand'].value_counts().to_dict()}\")\n",
"    print(f\" FS type distribution: {fscanr_results['FS_type'].value_counts().to_dict()}\")\n",
"\n",
"    print(\"\\n🎯 FScanR results sample:\")\n",
"    print(fscanr_results.head())\n",
"else:\n",
"    print(\"⚠️ No PRF sites detected; the parameters may need adjusting\")"
]
},
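{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `fscanr` call above is used as a black box in this notebook. The sketch below is not FScanR's algorithm; it only illustrates, under stated assumptions, the kind of filtering that the `mismatch_cutoff` and `evalue_cutoff` arguments suggest: keep confident hits, then list the query/subject pairs that are hit in more than one reading frame. `candidate_frame_changes` and the toy table are hypothetical, and only the column names (`DNA_seqid`, `Pep_seqid`, `mismatch`, `evalue`, `qframe`) come from the BLASTX table loaded earlier. In the toy table, only the `tx1`/`pepA` pair should be reported, because the second `tx3` hit fails the mismatch cutoff."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"def candidate_frame_changes(hits, mismatch_cutoff=10, evalue_cutoff=1e-5):\n",
"    # Hypothetical helper, not part of FScanpy or FScanR.\n",
"    # Keep hits with few mismatches and a small e-value.\n",
"    confident = hits[(hits['mismatch'] <= mismatch_cutoff) & (hits['evalue'] <= evalue_cutoff)]\n",
"    # Count the distinct query frames seen for each query/subject pair.\n",
"    n_frames = confident.groupby(['DNA_seqid', 'Pep_seqid'])['qframe'].nunique()\n",
"    # A pair that is hit in more than one frame is a frameshift candidate.\n",
"    return n_frames[n_frames > 1].reset_index(name='n_frames')\n",
"\n",
"\n",
"# Toy hits, made up for illustration (not taken from the test data).\n",
"toy_hits = pd.DataFrame({\n",
"    'DNA_seqid': ['tx1', 'tx1', 'tx2', 'tx3', 'tx3'],\n",
"    'Pep_seqid': ['pepA', 'pepA', 'pepB', 'pepC', 'pepC'],\n",
"    'mismatch': [3, 5, 2, 4, 40],\n",
"    'evalue': [1e-30, 1e-20, 1e-50, 1e-25, 1e-10],\n",
"    'qframe': [2, 3, 1, 1, 2],\n",
"})\n",
"print(candidate_frame_changes(toy_hits))"
]
},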
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Sequence extraction - get the sequence around each PRF site\n",
"\n",
"Extract from the mRNA sequences the fragments surrounding the PRF sites identified by FScanR."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"📝 Extracting sequences around PRF sites from the mRNA sequences...\n",
"\n",
"✅ Sequence extraction complete!\n",
"Sequences extracted: 24\n",
"\n",
"📏 Sequence length check:\n",
" 399bp sequence length distribution: {399: 24}\n",
" Mean length: 399.0\n",
"\n",
"🧬 Extracted sequence examples:\n",
"Sequence 1: MSTRG.9380.1\n",
" FS position: 3797-3802\n",
" Strand: +\n",
" FS type: 1\n",
" Sequence fragment: AAGGAGTTTGAAGAAGAACAGGAAAAACAAGAGAAAGAGAGAAAGGAGAA...NNNNNNNNNNNNNNNNNNNN\n",
"\n",
"Sequence 2: MSTRG.9431.1\n",
" FS position: 4136-4192\n",
" Strand: +\n",
" FS type: 1\n",
" Sequence fragment: CAAGTATCTGAGTGGGAGGGAGACACAGGTGTTGATCAAACCCCATTCCC...ATAATGACGGAGGCTTCAGA\n",
"\n",
"Sequence 3: MSTRG.9432.1\n",
" FS position: 848-904\n",
" Strand: -\n",
" FS type: 1\n",
" Sequence fragment: AGAAAGGATGGTACTGAAAATCAACGAAGTACTTTCACATTTTAGAAAGA...GCTGAGAACGATATTGACAA\n",
"\n"
]
}
],
"source": [
"# Extract the sequence around each PRF site\n",
"if len(fscanr_results) > 0:\n",
"    print(\"📝 Extracting sequences around PRF sites from the mRNA sequences...\")\n",
"\n",
"    prf_sequences = extract_prf_regions(\n",
"        mrna_file=mrna_file,\n",
"        prf_data=fscanr_results\n",
"    )\n",
"\n",
"    print(\"\\n✅ Sequence extraction complete!\")\n",
"    print(f\"Sequences extracted: {len(prf_sequences)}\")\n",
"\n",
"    if len(prf_sequences) > 0:\n",
"        print(\"\\n📏 Sequence length check:\")\n",
"        seq_lengths = prf_sequences['399bp'].str.len()\n",
"        print(f\" 399bp sequence length distribution: {seq_lengths.value_counts().to_dict()}\")\n",
"        print(f\" Mean length: {seq_lengths.mean():.1f}\")\n",
"\n",
"        print(\"\\n🧬 Extracted sequence examples:\")\n",
"        for i, row in prf_sequences.head(3).iterrows():\n",
"            print(f\"Sequence {i+1}: {row['DNA_seqid']}\")\n",
"            print(f\" FS position: {row['FS_start']}-{row['FS_end']}\")\n",
"            print(f\" Strand: {row['Strand']}\")\n",
"            print(f\" FS type: {row['FS_type']}\")\n",
"            print(f\" Sequence fragment: {row['399bp'][:50]}...{row['399bp'][-20:]}\")\n",
"            print()\n",
"    else:\n",
"        print(\"❌ Sequence extraction failed\")\n",
"else:\n",
"    print(\"⚠️ Skipping sequence extraction: no FScanR results\")\n",
"    prf_sequences = pd.DataFrame()"
]
},
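{
"cell_type": "markdown",
"metadata": {},
"source": [
"Every extracted `399bp` string above has length 399, and the first example ends in a run of `N`, which suggests fixed-length windows that are padded where the transcript runs out. The sketch below shows one way such a window could be cut. It is an assumption for illustration, not the implementation of `extract_prf_regions`: `fixed_window` is a hypothetical helper, and centering the window on a 0-based position is a guess about the layout. With the 200 nt toy sequence, the result should still be 399 characters long, with `N` padding on both sides."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def fixed_window(seq, center, length=399, pad='N'):\n",
"    # Hypothetical helper: cut `length` bases centered on the 0-based index\n",
"    # `center`, padding with `pad` wherever the window leaves the sequence.\n",
"    # Assumes 0 <= center < len(seq).\n",
"    start = center - length // 2\n",
"    end = start + length\n",
"    left_pad = pad * max(0, -start)\n",
"    right_pad = pad * max(0, end - len(seq))\n",
"    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad\n",
"\n",
"\n",
"toy_seq = 'ACGT' * 50  # 200 nt toy sequence, shorter than the window\n",
"window = fixed_window(toy_seq, center=150)\n",
"print(len(window), window[-20:])"
]
},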
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. FScanpy prediction - machine-learning model analysis\n",
"\n",
"Use FScanpy's machine-learning models to predict the PRF probability of each extracted sequence."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🤖 FScanpy predictor initialized\n",
"\n",
"🎯 Predicting 24 sequences identified by FScanR...\n",
"\n",
"📊 FScanR + FScanpy prediction results:\n",
" DNA_seqid FS_start FS_type Short_Probability Long_Probability \\\n",
"0 MSTRG.9380.1 3797 1 0.239192 0.087024 \n",
"1 MSTRG.9431.1 4136 1 0.326807 0.356356 \n",
"2 MSTRG.9432.1 848 1 0.310908 0.159746 \n",
"3 MSTRG.9582.1 302 1 0.272451 0.223354 \n",
"4 MSTRG.961.1 1536 -1 0.263269 0.046773 \n",
"\n",
" Ensemble_Probability \n",
"0 0.147891 \n",
"1 0.344536 \n",
"2 0.220211 \n",
"3 0.242993 \n",
"4 0.133372 \n"
]
}
],
"source": [
"# Initialize the predictor\n",
"predictor = PRFPredictor()\n",
"print(\"🤖 FScanpy predictor initialized\")\n",
"\n",
"# Predict the sequences identified by FScanR\n",
"if len(prf_sequences) > 0:\n",
"    print(f\"\\n🎯 Predicting {len(prf_sequences)} sequences identified by FScanR...\")\n",
"\n",
"    fscanr_predictions = predictor.predict_regions(\n",
"        sequences=prf_sequences['399bp'],\n",
"        ensemble_weight=0.4  # balanced setting\n",
"    )\n",
"\n",
"    # Merge the predictions with the extracted sequences\n",
"    fscanr_predictions = pd.concat([\n",
"        prf_sequences.reset_index(drop=True),\n",
"        fscanr_predictions.reset_index(drop=True)\n",
"    ], axis=1)\n",
"\n",
"    print(\"\\n📊 FScanR + FScanpy prediction results:\")\n",
"    result_cols = ['DNA_seqid', 'FS_start', 'FS_type', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']\n",
"    print(fscanr_predictions[result_cols].head())"
]
},
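{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Ensemble_Probability` column printed above is consistent with a weighted average of the two models in which `ensemble_weight` is the weight given to the Short model. This reading is inferred from the printed numbers rather than taken from the FScanpy source, so treat it as an assumption; `weighted_ensemble` is a hypothetical helper. The sketch recomputes the first two rows of the table so the two values can be compared side by side."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def weighted_ensemble(short_p, long_p, short_weight):\n",
"    # Assumed form: short_weight * Short + (1 - short_weight) * Long.\n",
"    return short_weight * short_p + (1.0 - short_weight) * long_p\n",
"\n",
"\n",
"# (Short, Long, Ensemble) copied from the first two rows printed above.\n",
"rows = [\n",
"    (0.239192, 0.087024, 0.147891),\n",
"    (0.326807, 0.356356, 0.344536),\n",
"]\n",
"for short_p, long_p, reported in rows:\n",
"    recomputed = weighted_ensemble(short_p, long_p, short_weight=0.4)\n",
"    print(f'recomputed={recomputed:.6f} reported={reported:.6f}')"
]
},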
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"🧪 Predicting 3 validation regions...\n",
"\n",
"📊 Validation region prediction results:\n",
" DNA_seqid label source Short_Probability Long_Probability \\\n",
"0 MSTRG.18491.1 0 EUPLOTES 0.368610 0.144442 \n",
"1 MSTRG.4662.1 0 EUPLOTES 0.229811 0.053352 \n",
"2 MSTRG.14742.1 0 EUPLOTES 0.454152 0.345118 \n",
"\n",
" Ensemble_Probability \n",
"0 0.234109 \n",
"1 0.123936 \n",
"2 0.388732 \n"
]
}
],
"source": [
"# Predict the validation region data\n",
"print(f\"\\n🧪 Predicting {len(region_data)} validation regions...\")\n",
"\n",
"validation_predictions = predict_prf(\n",
"    data=region_data.rename(columns={'399bp': 'Long_Sequence'}),\n",
"    ensemble_weight=0.4\n",
")\n",
"\n",
"print(\"\\n📊 Validation region prediction results:\")\n",
"result_cols = ['DNA_seqid', 'label', 'source', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']\n",
"print(validation_predictions[result_cols].head())"
]
},
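{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Sequence-level prediction and visualization\n",
"\n",
"Pick one mRNA sequence and use the built-in `plot_prf_prediction` function to run a full sliding-window prediction and visualize the result."
]
},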
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🧬 Selected demo sequence: MSTRG.9127.1\n",
"Sequence length: 256 bp\n",
"First 100 bp: TGGCCTTCTTACTTGGAAGTCCCCAAGGATCATCTTGGCCATCCTTGCTTTCTTCATGGCTAGATTCTACCTCCTCCCATAATTGTGTGAAACAAGTAAC...\n",
"\n",
"🎯 Running sequence prediction and visualization with plot_prf_prediction...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/predictor.py:335: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.\n",
" plt.tight_layout()\n",
"/home/guest01/.conda/envs/tf200/lib/python3.9/site-packages/IPython/core/pylabtools.py:152: UserWarning: Glyph 39044 (\\N{CJK UNIFIED IDEOGRAPH-9884}) missing from font(s) Liberation Sans.\n",
" fig.canvas.print_figure(bytes_io, **kw)\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1600x800 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"📊 Sequence prediction summary:\n",
" Predicted sites: 85\n",
" High-probability sites (>0.8): 0\n",
" Medium-probability sites (0.4-0.8): 6\n",
" Highest predicted probability: 0.475\n"
]
}
],
"source": [
"# Pick one sequence for the demo\n",
"from Bio import SeqIO\n",
"\n",
"# Read the mRNA sequences and use the first one as the demo\n",
"mrna_sequences = list(SeqIO.parse(mrna_file, \"fasta\"))\n",
"demo_seq = mrna_sequences[0]  # first sequence\n",
"\n",
"print(f\"🧬 Selected demo sequence: {demo_seq.id}\")\n",
"print(f\"Sequence length: {len(demo_seq.seq)} bp\")\n",
"print(f\"First 100 bp: {str(demo_seq.seq)[:100]}...\")\n",
"\n",
"# Predict and visualize with the built-in plot_prf_prediction function\n",
"print(\"\\n🎯 Running sequence prediction and visualization with plot_prf_prediction...\")\n",
"\n",
"sequence_results, fig = plot_prf_prediction(\n",
"    sequence=str(demo_seq.seq),\n",
"    window_size=3,\n",
"    short_threshold=0.2,\n",
"    long_threshold=0.2,\n",
"    ensemble_weight=0.6,\n",
"    title=f\"PRF prediction for sequence {demo_seq.id} (bar chart + heatmap)\",\n",
"    figsize=(16, 8),\n",
"    dpi=150\n",
")\n",
"\n",
"plt.show()\n",
"\n",
"print(\"\\n📊 Sequence prediction summary:\")\n",
"print(f\" Predicted sites: {len(sequence_results)}\")\n",
"print(f\" High-probability sites (>0.8): {(sequence_results['Ensemble_Probability'] > 0.8).sum()}\")\n",
"print(f\" Medium-probability sites (0.4-0.8): {((sequence_results['Ensemble_Probability'] >= 0.4) & (sequence_results['Ensemble_Probability'] <= 0.8)).sum()}\")\n",
"print(f\" Highest predicted probability: {sequence_results['Ensemble_Probability'].max():.3f}\")"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"🔝 Top 5 predicted sites:\n",
" 1. Position 96: \n",
" - Short probability: 0.288\n",
" - Long probability: 0.755\n",
" - Ensemble probability: 0.475\n",
" - Codon: TAA\n",
" 2. Position 12: \n",
" - Short probability: 0.606\n",
" - Long probability: 0.177\n",
" - Ensemble probability: 0.434\n",
" - Codon: TTG\n",
" 3. Position 15: \n",
" - Short probability: 0.493\n",
" - Long probability: 0.329\n",
" - Ensemble probability: 0.428\n",
" - Codon: GAA\n",
" 4. Position 18: \n",
" - Short probability: 0.369\n",
" - Long probability: 0.510\n",
" - Ensemble probability: 0.426\n",
" - Codon: GTC\n",
" 5. Position 105: \n",
" - Short probability: 0.248\n",
" - Long probability: 0.671\n",
" - Ensemble probability: 0.418\n",
" - Codon: ACT\n",
"\n",
"📊 Visualization analysis complete!\n",
"The figure combines a heatmap and a bar chart that show the PRF prediction probabilities across the whole sequence.\n"
]
}
],
"source": [
"# Print the probabilities of the top predicted sites\n",
"if sequence_results['Ensemble_Probability'].max() > 0.3:\n",
"    top_predictions = sequence_results.nlargest(5, 'Ensemble_Probability')\n",
"    print(\"\\n🔝 Top 5 predicted sites:\")\n",
"    for i, (_, row) in enumerate(top_predictions.iterrows(), 1):\n",
"        print(f\" {i}. Position {row['Position']}: \")\n",
"        print(f\" - Short probability: {row['Short_Probability']:.3f}\")\n",
"        print(f\" - Long probability: {row['Long_Probability']:.3f}\")\n",
"        print(f\" - Ensemble probability: {row['Ensemble_Probability']:.3f}\")\n",
"        print(f\" - Codon: {row['Codon']}\")\n",
"else:\n",
"    print(\"\\n💡 No high-probability PRF sites were detected in this sequence\")\n",
"\n",
"print(\"\\n📊 Visualization analysis complete!\")\n",
"print(\"The figure combines a heatmap and a bar chart that show the PRF prediction probabilities across the whole sequence.\")"
]
},
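{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above ranks positions by `Ensemble_Probability` instead of applying a fixed cutoff. One way to make that comparison explicit is to score each position relative to the rest of the same sequence. The sketch below adds a percentile rank and a z-score with pandas. `add_relative_scores` is a hypothetical helper and the toy values are made up; only the column names `Position` and `Ensemble_Probability` are taken from the results table used above. On real output, the same helper could be applied to `sequence_results`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"def add_relative_scores(results, column='Ensemble_Probability'):\n",
"    # Hypothetical helper: score each position relative to the whole sequence.\n",
"    scored = results.copy()\n",
"    # Percentile rank in (0, 1]; the highest probability gets 1.0.\n",
"    scored['Percentile'] = scored[column].rank(pct=True)\n",
"    # Z-score: distance from the sequence mean in standard deviations.\n",
"    scored['Z_score'] = (scored[column] - scored[column].mean()) / scored[column].std()\n",
"    return scored.sort_values(column, ascending=False)\n",
"\n",
"\n",
"# Toy values, made up for illustration (not real predictions).\n",
"toy_results = pd.DataFrame({\n",
"    'Position': [3, 6, 9, 12, 15, 18],\n",
"    'Ensemble_Probability': [0.10, 0.12, 0.45, 0.11, 0.30, 0.09],\n",
"})\n",
"print(add_relative_scores(toy_results).head(3))"
]
},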
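{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 📝 Summary\n",
"\n",
"### 🎯 Key findings\n",
"1. **Data quality**: the test dataset contains real BLASTX alignment results and validation regions\n",
"2. **FScanR**: 24 candidate PRF sites across 16 sequences were identified from the BLASTX results\n",
"3. **Model behavior**: the Short and Long models each have strengths in different scenarios and can score the same position very differently (position 96: Short 0.288, Long 0.755)\n",
"4. **Predictions**: the ensemble model gives more stable predictions\n",
"5. **Visualization**: the built-in plotting function produces a clear heatmap and bar chart\n",
"\n",
"### 🔧 Best practices\n",
"- **Data preprocessing**: make sure the BLASTX results are in the correct format\n",
"- **Parameters**: use the default ensemble weight (0.4:0.6) for balanced performance\n",
"- **Interpreting results**: when predicting along a whole sequence with FScanpy, do not use 0.5 as a threshold; compare the probabilities of different positions instead\n",
"- **Visualization**: use the `plot_prf_prediction` function to produce standardized figures\n",
"\n",
"### 📚 Recommendations\n",
"1. **Threshold selection**: adjust the probability thresholds to the application\n",
"2. **Result validation**: check the predictions against biological knowledge\n",
"3. **Performance**: for large datasets, choose a reasonable sliding-window size\n",
"4. **Plot parameters**: adjust `figsize` and `dpi` for the best display"
]
}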
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}