2025-05-29 17:58:48 +08:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# FScanpy \n",
"\n",
"This notebook demonstrates how to use FScanpy with real test data for a complete programmed ribosomal frameshifting (PRF) site prediction analysis, including:\n",
"\n",
"## 🎯 Complete Workflow\n",
"1. **Load Test Data** - Use built-in real test data\n",
"2. **FScanR Analysis** - Identify potential PRF sites from BLASTX results\n",
"3. **Sequence Extraction** - Extract sequences around PRF sites\n",
"4. **FScanpy Prediction** - Use machine learning models to predict probabilities\n",
"5. **Results Visualization** - Generate prediction result plots using built-in plotting functions\n",
"6. **Sequence-level Prediction Demo** - Sliding window analysis of complete sequences\n",
"\n",
"## 📊 Data Description\n",
"- **blastx_example.xlsx**: Real BLASTX alignment results\n",
"- **mrna_example.fasta**: Real mRNA sequence data\n",
"- **region_example.csv**: Sample regions for individual site prediction"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"## 📚 FScanpy Function Usage Guide\n",
"\n",
"### Core Functions Overview\n",
"\n",
"FScanpy provides several main functions for PRF prediction:\n",
"\n",
"#### 1. `predict_prf()` - Universal Prediction Function\n",
"```python\n",
"# Single sequence prediction\n",
"results = predict_prf(sequence=\"ATGCGT...\", window_size=3, ensemble_weight=0.4)\n",
"\n",
"# Multiple sequences prediction \n",
"results = predict_prf(sequence=[\"seq1\", \"seq2\"], window_size=3)\n",
"\n",
"# DataFrame region prediction\n",
"results = predict_prf(data=df_with_399bp_column, ensemble_weight=0.4)\n",
"```\n",
"\n",
"#### 2. `plot_prf_prediction()` - Prediction with Visualization\n",
"```python\n",
"# Basic plotting\n",
"results, fig = plot_prf_prediction(sequence=\"ATGCGT...\")\n",
"\n",
"# Custom parameters\n",
"results, fig = plot_prf_prediction(\n",
" sequence=\"ATGCGT...\",\n",
" window_size=1,\n",
" short_threshold=0.65,\n",
" long_threshold=0.8,\n",
" ensemble_weight=0.4,\n",
" save_path=\"plot.png\"\n",
")\n",
"```\n",
"\n",
"#### 3. `PRFPredictor` Class Methods\n",
"```python\n",
"predictor = PRFPredictor()\n",
"\n",
"# Sliding window prediction\n",
"results = predictor.predict_sequence(sequence, window_size=3, ensemble_weight=0.4)\n",
"\n",
"# Region prediction\n",
"results = predictor.predict_regions(sequences_399bp, ensemble_weight=0.4)\n",
"\n",
"# Single position prediction\n",
"result = predictor.predict_single_position(fs_period_33bp, full_seq_399bp)\n",
"\n",
"# Plot prediction\n",
"results, fig = predictor.plot_sequence_prediction(sequence)\n",
"```\n",
"\n",
"#### 4. Utility Functions\n",
"```python\n",
"from FScanpy.utils import fscanr, extract_prf_regions\n",
"\n",
"# Detect PRF sites from BLASTX\n",
"prf_sites = fscanr(blastx_df, mismatch_cutoff=10, evalue_cutoff=1e-5)\n",
"\n",
"# Extract sequences around PRF sites\n",
"prf_sequences = extract_prf_regions(mrna_file, prf_sites)\n",
"```\n",
"\n",
"### Parameter Guidelines\n",
"\n",
"- **ensemble_weight**: 0.4 (default, balanced), 0.2-0.3 (conservative), 0.7-0.8 (sensitive)\n",
"- **window_size**: 1 (detailed), 3 (standard), 6-9 (fast)\n",
"- **short_threshold**: 0.1 (default), 0.2-0.3 (stricter filtering)\n",
"- **Display thresholds**: 0.3-0.8 for visualization filtering\n"
]
},
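{
"cell_type": "markdown",
"metadata": {},
"source": [
"The detail/speed trade-off listed for `window_size` is what one would expect if the value acts as the step between scanned positions. That reading is an assumption, not something confirmed against the FScanpy source, and `scan_positions` below is a hypothetical helper used only to show how the number of predictions would shrink as the step grows.\n",
"\n",
"```python\n",
"def scan_positions(seq_length, window_size):\n",
"    # Assumed behavior: one candidate position every window_size bases\n",
"    return list(range(0, seq_length, window_size))\n",
"\n",
"# Number of candidate positions for a 256 bp sequence at different steps\n",
"for step in (1, 3, 9):\n",
"    print(step, len(scan_positions(256, step)))\n",
"```\n",
"\n",
"Under this assumption a larger `window_size` scans fewer positions, which is faster but can step over a site that falls between two scanned positions."
]
},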
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 📦 Environment Setup and Data Loading"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-08-14 15:54:26.764777: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.\n",
"2025-08-14 15:54:26.765259: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n",
"2025-08-14 15:54:26.818561: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
"To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/attr_value.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/tensor.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/resource_handle.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/tensor_shape.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/types.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/full_type.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/function.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/node_def.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/op_def.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/graph.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/graph_debug_info.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/versions.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/protobuf/config.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at xla/tsl/protobuf/coordination_config.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/cost_graph.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/step_stats.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/allocation_description.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/framework/tensor_description.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/protobuf/cluster.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/google/protobuf/runtime_version.py:98: UserWarning: Protobuf gencode version 5.28.3 is exactly one major version older than the runtime version 6.31.1 at tensorflow/core/protobuf/debug.proto. Please update the gencode to avoid compatibility violations in the next runtime release.\n",
" warnings.warn(\n",
"2025-08-14 15:54:28.305921: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n",
"2025-08-14 15:54:28.307332: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"✅ Environment setup complete!\n",
"📋 Available test data:\n"
]
},
{
"data": {
"text/plain": [
"['blastx_example.xlsx',\n",
" 'full_seq.xlsx',\n",
" 'mrna_example.fasta',\n",
" 'region_example.csv']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Import necessary libraries\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Import FScanpy related modules\n",
"from FScanpy import PRFPredictor, predict_prf, plot_prf_prediction\n",
"from FScanpy.data import get_test_data_path, list_test_data\n",
"from FScanpy.utils import fscanr, extract_prf_regions\n",
"\n",
"print(\"✅ Environment setup complete!\")\n",
"print(\"📋 Available test data:\")\n",
"list_test_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Load and Explore Test Data\n",
"\n",
"First, load the real test data provided by FScanpy to understand the data structure."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"📁 Data file paths:\n",
" BLASTX data: /mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/data/test_data/blastx_example.xlsx\n",
" mRNA sequences: /mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/data/test_data/mrna_example.fasta\n",
" Validation regions: /mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/data/test_data/region_example.csv\n",
"\n",
"🧬 BLASTX data overview:\n",
" Data shape: (1000, 14)\n",
" Column names: ['DNA_seqid', 'Pep_seqid', 'pident', 'length', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore', 'qframe', 'sframe']\n",
" Unique sequences: 704\n",
"\n",
"📊 BLASTX data examples:\n",
" DNA_seqid Pep_seqid pident length evalue qframe\n",
"0 MSTRG.9998.1 CAMPEP_0196994412 68.27 104 1.000000e-33 2\n",
"1 MSTRG.9996.1 CAMPEP_0197017426 49.16 297 3.000000e-79 2\n",
"2 MSTRG.9994.1 CAMPEP_0197009206 98.31 354 0.000000e+00 2\n",
"3 MSTRG.9993.1 CAMPEP_0168331218 51.67 60 2.000000e-37 2\n",
"4 MSTRG.9993.1 CAMPEP_0168331218 45.45 88 2.000000e-37 3\n"
]
}
],
"source": [
"# Get test data paths\n",
"blastx_file = get_test_data_path('blastx_example.xlsx')\n",
"mrna_file = get_test_data_path('mrna_example.fasta')\n",
"region_file = get_test_data_path('region_example.csv')\n",
"\n",
"print(\"📁 Data file paths:\")\n",
"print(f\" BLASTX data: {blastx_file}\")\n",
"print(f\" mRNA sequences: {mrna_file}\")\n",
"print(f\" Validation regions: {region_file}\")\n",
"\n",
"# Load BLASTX data\n",
"blastx_data = pd.read_excel(blastx_file)\n",
"print(\"\\n🧬 BLASTX data overview:\")\n",
"print(f\" Data shape: {blastx_data.shape}\")\n",
"print(f\" Column names: {list(blastx_data.columns)}\")\n",
"print(f\" Unique sequences: {blastx_data['DNA_seqid'].nunique()}\")\n",
"\n",
"# Display first few rows\n",
"print(\"\\n📊 BLASTX data examples:\")\n",
"display_cols = ['DNA_seqid', 'Pep_seqid', 'pident', 'length', 'evalue', 'qframe']\n",
"print(blastx_data[display_cols].head())"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🎯 Validation region data overview:\n",
" Data shape: (3, 8)\n",
" Column names: ['FS_period', '399bp', 'fs_position', 'DNA_seqid', 'label', 'source', 'FS_type', 'dataset']\n",
" Data sources: {'EUPLOTES': 3}\n",
"\n",
"📋 Validation region data examples:\n",
" fs_position DNA_seqid label source FS_type\n",
"0 16.0 MSTRG.18491.1 0 EUPLOTES negative\n",
"1 16.0 MSTRG.4662.1 0 EUPLOTES negative\n",
"2 16.0 MSTRG.14742.1 0 EUPLOTES negative\n",
"\n",
"📈 Label distribution:\n",
"label\n",
"0 3\n",
"Name: count, dtype: int64\n",
"\n",
"🔬 FS type distribution:\n",
"FS_type\n",
"negative 3\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"# Load validation region data\n",
"region_data = pd.read_csv(region_file)\n",
"print(\"🎯 Validation region data overview:\")\n",
"print(f\" Data shape: {region_data.shape}\")\n",
"print(f\" Column names: {list(region_data.columns)}\")\n",
"print(f\" Data sources: {region_data['source'].value_counts().to_dict()}\")\n",
"\n",
"print(\"\\n📋 Validation region data examples:\")\n",
"display_cols = ['fs_position', 'DNA_seqid', 'label', 'source', 'FS_type']\n",
"print(region_data[display_cols].head())\n",
"\n",
"# Statistical analysis\n",
"print(\"\\n📈 Label distribution:\")\n",
"print(region_data['label'].value_counts())\n",
"print(\"\\n🔬 FS type distribution:\")\n",
"print(region_data['FS_type'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. FScanR Analysis - Identify Potential PRF Sites from BLASTX\n",
"\n",
"Use the FScanR algorithm to analyze BLASTX results and identify potential programmed ribosomal frameshift sites."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🔍 Running FScanR analysis...\n",
"Parameter settings: mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10\n",
"\n",
"✅ FScanR analysis complete!\n",
"Number of potential PRF sites detected: 16\n",
"\n",
"📊 FScanR results overview:\n",
" Column names: ['DNA_seqid', 'FS_start', 'FS_end', 'Pep_seqid', 'Pep_FS_start', 'Pep_FS_end', 'FS_type', 'Strand']\n",
" Number of sequences involved: 12\n",
" Strand orientation distribution: {'+': 11, '-': 5}\n",
" FS type distribution: {1: 9, -1: 7}\n",
"\n",
"🎯 FScanR results examples:\n",
" DNA_seqid FS_start FS_end Pep_seqid Pep_FS_start \\\n",
"0 MSTRG.9380.1 3797 3802 CAMPEP_0197017206 1137 \n",
"1 MSTRG.9582.1 302 304 CAMPEP_0197003180 214 \n",
"2 MSTRG.961.1 1536 1533 CAMPEP_0197017908 590 \n",
"3 MSTRG.9622.1 555 560 CAMPEP_0197016962 182 \n",
"4 MSTRG.9648.1 801 803 CAMPEP_0197001104 257 \n",
"\n",
" Pep_FS_end FS_type Strand \n",
"0 1138 1 + \n",
"1 214 1 + \n",
"2 19 -1 - \n",
"3 183 1 + \n",
"4 257 1 + \n"
]
}
],
"source": [
"# Run FScanR analysis\n",
"print(\"🔍 Running FScanR analysis...\")\n",
"print(\"Parameter settings: mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10\")\n",
"\n",
"fscanr_results = fscanr(\n",
" blastx_data,\n",
" mismatch_cutoff=10,\n",
" evalue_cutoff=1e-5,\n",
" frameDist_cutoff=10\n",
")\n",
"\n",
"print(\"\\n✅ FScanR analysis complete!\")\n",
"print(f\"Number of potential PRF sites detected: {len(fscanr_results)}\")\n",
"\n",
"if len(fscanr_results) > 0:\n",
 print(\"\\n📊 FScanR results overview:\")\n",
" print(f\" Column names: {list(fscanr_results.columns)}\")\n",
" print(f\" Number of sequences involved: {fscanr_results['DNA_seqid'].nunique()}\")\n",
" print(f\" Strand orientation distribution: {fscanr_results['Strand'].value_counts().to_dict()}\")\n",
" print(f\" FS type distribution: {fscanr_results['FS_type'].value_counts().to_dict()}\")\n",
" \n",
" print(\"\\n🎯 FScanR results examples:\")\n",
" print(fscanr_results.head())\n",
"else:\n",
" print(\"⚠️ No PRF sites detected, may need to adjust parameters\")"
]
},
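{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the results above, the `-` strand hit on `MSTRG.961.1` has `FS_start` (1536) greater than `FS_end` (1533). When slicing a sequence by hand around such a site, it can help to sort the coordinate pair first and take the reverse complement for `-` strand hits. The sketch below assumes 1-based inclusive coordinates and that `-` denotes the reverse strand; `site_subsequence` is a hypothetical helper, not part of FScanpy.\n",
"\n",
"```python\n",
"COMPLEMENT = str.maketrans('ACGTN', 'TGCAN')\n",
"\n",
"def site_subsequence(seq, fs_start, fs_end, strand):\n",
"    # Assumption: 1-based inclusive coordinates, as in BLAST tabular output\n",
"    lo, hi = sorted((fs_start, fs_end))\n",
"    sub = seq[lo - 1:hi]\n",
"    if strand == '-':\n",
"        # Assumption: minus-strand sites are read on the reverse complement\n",
"        sub = sub.translate(COMPLEMENT)[::-1]\n",
"    return sub\n",
"\n",
"assert site_subsequence('AACCGGTT', 3, 6, '+') == 'CCGG'\n",
"assert site_subsequence('AAACGTTT', 5, 2, '-') == 'CGTT'\n",
"```\n",
"\n",
"`extract_prf_regions` handles strand and coordinates itself in the next step, so this is only relevant when inspecting individual sites manually."
]
},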
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Sequence Extraction - Extract Sequences Around PRF Sites\n",
"\n",
"Extract sequence fragments around PRF sites identified by FScanR from mRNA sequences."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"📝 Extracting sequences around PRF sites from mRNA sequences...\n",
"\n",
"✅ Sequence extraction complete!\n",
"Number of successfully extracted sequences: 16\n",
"\n",
"📏 Sequence length validation:\n",
" 399bp sequence length distribution: {399: 16}\n",
" Average length: 399.0\n",
"\n",
"🧬 Extracted sequence examples:\n",
"Sequence 1: MSTRG.9380.1\n",
" FS position: 3797-3802\n",
" Strand orientation: +\n",
" FS type: 1\n",
" Sequence fragment: AAGGAGTTTGAAGAAGAACAGGAAAAACAAGAGAAAGAGAGAAAGGAGAA...NNNNNNNNNNNNNNNNNNNN\n",
"\n",
"Sequence 2: MSTRG.9582.1\n",
" FS position: 302-304\n",
" Strand orientation: +\n",
" FS type: 1\n",
" Sequence fragment: ATCAAGCTGATTAGAGATGGAGGGGGAGGTGTGTTCAATAATATATCTAC...AGTCAACTTCCAGTCCAACA\n",
"\n",
"Sequence 3: MSTRG.961.1\n",
" FS position: 1536-1533\n",
" Strand orientation: -\n",
" FS type: -1\n",
" Sequence fragment: ATGCTACTTTGGGAGAGAAAATTAACTGGGGAGAACTTGCATATGATTCT...ACAAATATTTCTCTAATTCA\n",
"\n"
]
}
],
"source": [
"# Extract sequences around PRF sites\n",
"if len(fscanr_results) > 0:\n",
"    print(\"📝 Extracting sequences around PRF sites from mRNA sequences...\")\n",
"\n",
"    prf_sequences = extract_prf_regions(\n",
"        mrna_file=mrna_file,\n",
"        prf_data=fscanr_results\n",
"    )\n",
"\n",
"    print(\"\\n✅ Sequence extraction complete!\")\n",
"    print(f\"Number of successfully extracted sequences: {len(prf_sequences)}\")\n",
"\n",
"    if len(prf_sequences) > 0:\n",
"        print(\"\\n📏 Sequence length validation:\")\n",
"        seq_lengths = prf_sequences['399bp'].str.len()\n",
"        print(f\" 399bp sequence length distribution: {seq_lengths.value_counts().to_dict()}\")\n",
"        print(f\" Average length: {seq_lengths.mean():.1f}\")\n",
"\n",
"        print(\"\\n🧬 Extracted sequence examples:\")\n",
"        for i, row in prf_sequences.head(3).iterrows():\n",
"            print(f\"Sequence {i+1}: {row['DNA_seqid']}\")\n",
"            print(f\" FS position: {row['FS_start']}-{row['FS_end']}\")\n",
"            print(f\" Strand orientation: {row['Strand']}\")\n",
"            print(f\" FS type: {row['FS_type']}\")\n",
"            print(f\" Sequence fragment: {row['399bp'][:50]}...{row['399bp'][-20:]}\")\n",
"            print()\n",
"    else:\n",
"        print(\"❌ Sequence extraction failed\")\n",
"else:\n",
"    print(\"⚠️ Skipping sequence extraction - no FScanR results\")\n",
"    prf_sequences = pd.DataFrame()"
]
},
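{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first fragment printed above ends in a run of `N`, which suggests that windows reaching past the end of a transcript are padded up to the full 399 bp. A minimal sketch of that idea, assuming a window centered on the site and `N` padding on either side; `extract_window` is a hypothetical helper, and the centering rule actually used by `extract_prf_regions` may differ.\n",
"\n",
"```python\n",
"def extract_window(seq, center, length=399, pad='N'):\n",
"    # center is a 0-based index; assumes an odd length so the window is symmetric\n",
"    half = length // 2\n",
"    start, end = center - half, center + half + 1\n",
"    left_pad = pad * max(0, -start)\n",
"    right_pad = pad * max(0, end - len(seq))\n",
"    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad\n",
"\n",
"# A site near the 3' end of a 200 bp sequence still yields a 399 bp window\n",
"window = extract_window('ACGT' * 50, center=190)\n",
"assert len(window) == 399\n",
"assert window.endswith('N')\n",
"```\n",
"\n",
"Keeping every window at a fixed length means downstream models always receive inputs of the same size, at the cost of some uninformative `N` positions near transcript ends."
]
},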
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. FScanpy Prediction - Machine Learning Model Analysis\n",
"\n",
"Use FScanpy's machine learning models to predict PRF probabilities for the extracted sequences."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/predictor.py:23: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
" from pkg_resources import resource_filename\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/sklearn/base.py:380: InconsistentVersionWarning: Trying to unpickle estimator _BinMapper from version 1.6.0 when using version 1.6.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
"https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/sklearn/base.py:380: InconsistentVersionWarning: Trying to unpickle estimator HistGradientBoostingClassifier from version 1.6.0 when using version 1.6.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
"https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"🤖 FScanpy predictor initialization complete\n",
"\n",
"🎯 Predicting 16 sequences identified by FScanR...\n",
"\n",
"📊 FScanR+FScanpy prediction results:\n",
" DNA_seqid FS_start FS_type Short_Probability Long_Probability \\\n",
"0 MSTRG.9380.1 3797 1 0.239192 0.087024 \n",
"1 MSTRG.9582.1 302 1 0.272451 0.223354 \n",
"2 MSTRG.961.1 1536 -1 0.263269 0.046773 \n",
"3 MSTRG.9622.1 555 1 0.652591 0.408316 \n",
"4 MSTRG.9648.1 801 1 0.287211 0.308532 \n",
"\n",
" Ensemble_Probability \n",
"0 0.147891 \n",
"1 0.242993 \n",
"2 0.133372 \n",
"3 0.506026 \n",
"4 0.300004 \n"
]
}
],
"source": [
"# Initialize predictor\n",
"predictor = PRFPredictor()\n",
"print(\"🤖 FScanpy predictor initialization complete\")\n",
"\n",
"# Predict FScanR identified sequences\n",
"if len(prf_sequences) > 0:\n",
" print(f\"\\n🎯 Predicting {len(prf_sequences)} sequences identified by FScanR...\")\n",
" \n",
" fscanr_predictions = predictor.predict_regions(\n",
" sequences=prf_sequences['399bp'],\n",
" ensemble_weight=0.4 # Balanced configuration\n",
" )\n",
" \n",
" # Merge results\n",
" fscanr_predictions = pd.concat([\n",
" prf_sequences.reset_index(drop=True),\n",
" fscanr_predictions.reset_index(drop=True)\n",
" ], axis=1)\n",
" \n",
" print(\"\\n📊 FScanR+FScanpy prediction results:\")\n",
" result_cols = ['DNA_seqid', 'FS_start', 'FS_type', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']\n",
" print(fscanr_predictions[result_cols].head())"
]
},
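{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the table above, `Ensemble_Probability` is consistent with a weighted average of the two model scores: `ensemble_weight * Short_Probability + (1 - ensemble_weight) * Long_Probability`. The check below recomputes a few rows by hand. The formula is inferred from the printed values rather than taken from the FScanpy source, and `combine` is a hypothetical helper.\n",
"\n",
"```python\n",
"def combine(short_p, long_p, ensemble_weight=0.4):\n",
"    # Inferred relationship: weighted average of the short and long model scores\n",
"    return ensemble_weight * short_p + (1 - ensemble_weight) * long_p\n",
"\n",
"# (Short_Probability, Long_Probability, Ensemble_Probability) copied from the table above\n",
"rows = [\n",
"    (0.239192, 0.087024, 0.147891),\n",
"    (0.272451, 0.223354, 0.242993),\n",
"    (0.652591, 0.408316, 0.506026),\n",
"]\n",
"for short_p, long_p, reported in rows:\n",
"    # Inputs are rounded to six decimals, so allow a small tolerance\n",
"    assert abs(combine(short_p, long_p) - reported) < 1e-5\n",
"```\n",
"\n",
"Read this way, raising `ensemble_weight` shifts the combined score toward `Short_Probability`, and lowering it shifts the score toward `Long_Probability`."
]
},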
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"🧪 Predicting 3 validation regions...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/sklearn/base.py:380: InconsistentVersionWarning: Trying to unpickle estimator _BinMapper from version 1.6.0 when using version 1.6.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
"https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/sklearn/base.py:380: InconsistentVersionWarning: Trying to unpickle estimator HistGradientBoostingClassifier from version 1.6.0 when using version 1.6.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
"https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"📊 Validation region prediction results:\n",
" DNA_seqid label source Short_Probability Long_Probability \\\n",
"0 MSTRG.18491.1 0 EUPLOTES 0.368610 0.144442 \n",
"1 MSTRG.4662.1 0 EUPLOTES 0.229811 0.053352 \n",
"2 MSTRG.14742.1 0 EUPLOTES 0.454152 0.345118 \n",
"\n",
" Ensemble_Probability \n",
"0 0.234109 \n",
"1 0.123936 \n",
"2 0.388732 \n"
]
}
],
"source": [
"# Predict validation region data\n",
"print(f\"\\n🧪 Predicting {len(region_data)} validation regions...\")\n",
"\n",
"validation_predictions = predict_prf(\n",
" data=region_data.rename(columns={'399bp': 'Long_Sequence'}),\n",
" ensemble_weight=0.4\n",
")\n",
"\n",
"print(\"\\n📊 Validation region prediction results:\")\n",
"result_cols = ['DNA_seqid', 'label', 'source', 'Short_Probability', 'Long_Probability', 'Ensemble_Probability']\n",
"print(validation_predictions[result_cols].head())"
]
},
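{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to read these scores is to threshold `Ensemble_Probability` and compare the resulting calls with `label`. The cutoff of 0.5 used below is an illustrative assumption, not a value recommended by FScanpy, and the three rows are copied from the table above.\n",
"\n",
"```python\n",
"# (DNA_seqid, label, Ensemble_Probability) copied from the table above\n",
"validation = [\n",
"    ('MSTRG.18491.1', 0, 0.234109),\n",
"    ('MSTRG.4662.1', 0, 0.123936),\n",
"    ('MSTRG.14742.1', 0, 0.388732),\n",
"]\n",
"\n",
"cutoff = 0.5  # illustrative assumption\n",
"calls = [(seqid, label, int(p >= cutoff)) for seqid, label, p in validation]\n",
"accuracy = sum(label == call for _, label, call in calls) / len(calls)\n",
"\n",
"# All three negative examples score below the cutoff\n",
"assert accuracy == 1.0\n",
"```\n",
"\n",
"With only three negative examples this is a sanity check rather than an evaluation; a lower cutoff such as 0.3 would already flag `MSTRG.14742.1` as positive."
]
},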
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Sequence-level Prediction and Visualization\n",
"\n",
"Select a specific mRNA sequence and use the built-in `plot_prf_prediction` function for complete sliding window prediction and visualization."
2025-05-29 17:58:48 +08:00
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🧬 Selected demonstration sequence: MSTRG.9127.1\n",
"Sequence length: 256 bp\n",
"First 100bp of sequence: TGGCCTTCTTACTTGGAAGTCCCCAAGGATCATCTTGGCCATCCTTGCTTTCTTCATGGCTAGATTCTACCTCCTCCCATAATTGTGTGAAACAAGTAAC...\n",
"\n",
"🎯 Using plot_prf_prediction for sequence prediction and visualization...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/sklearn/base.py:380: InconsistentVersionWarning: Trying to unpickle estimator _BinMapper from version 1.6.0 when using version 1.6.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
"https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
" warnings.warn(\n",
"/home/guest01/.conda/envs/fs/lib/python3.9/site-packages/sklearn/base.py:380: InconsistentVersionWarning: Trying to unpickle estimator HistGradientBoostingClassifier from version 1.6.0 when using version 1.6.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
"https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
" warnings.warn(\n",
"/mnt/lmpbe/guest01/FScanpy-package-main/FScanpy/predictor.py:347: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.\n",
" plt.tight_layout()\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1600x800 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"📊 Sequence prediction result statistics:\n",
" Total predicted sites: 85\n",
" High probability sites (>0.8): 0\n",
" Medium probability sites (0.4-0.8): 6\n",
" Highest prediction probability: 0.475\n"
]
}
],
"source": [
"# Select a sequence for demonstration\n",
"from Bio import SeqIO\n",
"\n",
"# Read the first mRNA sequence for demonstration\n",
"mrna_sequences = list(SeqIO.parse(mrna_file, \"fasta\"))\n",
"demo_seq = mrna_sequences[0] # Select the first sequence\n",
"\n",
"print(f\"🧬 Selected demonstration sequence: {demo_seq.id}\")\n",
"print(f\"Sequence length: {len(demo_seq.seq)} bp\")\n",
"print(f\"First 100bp of sequence: {str(demo_seq.seq)[:100]}...\")\n",
"\n",
"# Use built-in plot_prf_prediction function for prediction and visualization\n",
"print(f\"\\n🎯 Using plot_prf_prediction for sequence prediction and visualization...\")\n",
"\n",
"sequence_results, fig = plot_prf_prediction(\n",
" sequence=str(demo_seq.seq),\n",
" window_size=3,\n",
" short_threshold=0.2,\n",
" long_threshold=0.2,\n",
" ensemble_weight=0.6,\n",
" title=f\"PRF Prediction Results for Sequence {demo_seq.id} (Bar Chart + Heatmap)\",\n",
" figsize=(16, 8),\n",
" dpi=150\n",
")\n",
"\n",
"plt.show()\n",
"\n",
"print(f\"\\n📊 Sequence prediction result statistics:\")\n",
"print(f\" Total predicted sites: {len(sequence_results)}\")\n",
"print(f\" High probability sites (>0.8): {(sequence_results['Ensemble_Probability'] > 0.8).sum()}\")\n",
"print(f\" Medium probability sites (0.4-0.8): {((sequence_results['Ensemble_Probability'] >= 0.4) & (sequence_results['Ensemble_Probability'] <= 0.8)).sum()}\")\n",
"print(f\" Highest prediction probability: {sequence_results['Ensemble_Probability'].max():.3f}\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"🔝 Top 5 predicted sites:\n",
" 1. Position 96: \n",
" - Short probability: 0.288\n",
" - Long probability: 0.755\n",
" - Ensemble probability: 0.475\n",
" - Codon: TAA\n",
" 2. Position 12: \n",
" - Short probability: 0.606\n",
" - Long probability: 0.177\n",
" - Ensemble probability: 0.434\n",
" - Codon: TTG\n",
" 3. Position 15: \n",
" - Short probability: 0.493\n",
" - Long probability: 0.329\n",
" - Ensemble probability: 0.428\n",
" - Codon: GAA\n",
" 4. Position 18: \n",
" - Short probability: 0.369\n",
" - Long probability: 0.510\n",
" - Ensemble probability: 0.426\n",
" - Codon: GTC\n",
" 5. Position 105: \n",
" - Short probability: 0.248\n",
" - Long probability: 0.671\n",
" - Ensemble probability: 0.418\n",
" - Codon: ACT\n",
"\n",
"📊 Visualization analysis complete!\n",
"The chart contains heatmaps and bar charts showing the PRF prediction probability distribution across the entire sequence.\n"
]
}
],
"source": [
"# Print top predicted site probabilities\n",
"if sequence_results['Ensemble_Probability'].max() > 0.3:\n",
" top_predictions = sequence_results.nlargest(5, 'Ensemble_Probability')\n",
" print(f\"\\n🔝 Top 5 predicted sites:\")\n",
" for i, (_, row) in enumerate(top_predictions.iterrows(), 1):\n",
" print(f\" {i}. Position {row['Position']}: \")\n",
" print(f\" - Short probability: {row['Short_Probability']:.3f}\")\n",
" print(f\" - Long probability: {row['Long_Probability']:.3f}\")\n",
" print(f\" - Ensemble probability: {row['Ensemble_Probability']:.3f}\")\n",
" print(f\" - Codon: {row['Codon']}\")\n",
"else:\n",
" print(\"\\n💡 No high-probability PRF sites detected in this sequence\")\n",
"\n",
"print(\"\\n📊 Visualization analysis complete!\")\n",
"print(\"The chart contains heatmaps and bar charts showing the PRF prediction probability distribution across the entire sequence.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"## 📖 Complete Function Reference\n",
"\n",
"### All Available Functions and Methods\n",
"\n",
"#### Core Prediction Functions\n",
"\n",
"**1. `predict_prf(sequence=None, data=None, window_size=3, short_threshold=0.1, ensemble_weight=0.4, model_dir=None)`**\n",
"- **Purpose**: Universal prediction function for both sliding window and region-based analysis\n",
"- **Input modes**: \n",
" - Single/multiple sequences → sliding window prediction\n",
" - DataFrame with 'Long_Sequence'/'399bp' column → region prediction\n",
"- **Key parameters**:\n",
" - `ensemble_weight`: Short model weight (0.0-1.0, default: 0.4)\n",
" - `window_size`: Scanning step size (default: 3)\n",
" - `short_threshold`: Filtering threshold (default: 0.1)\n",
"\n",
"**2. `plot_prf_prediction(sequence, window_size=3, short_threshold=0.65, long_threshold=0.8, ensemble_weight=0.4, title=None, save_path=None, figsize=(12,8), dpi=300)`**\n",
"- **Purpose**: Prediction with built-in visualization (3-subplot layout: FS site heatmap, prediction heatmap, bar chart)\n",
"- **Returns**: (prediction_results_df, matplotlib_figure)\n",
"- **Visualization features**: \n",
" - Black bars with alpha=0.6\n",
" - 'Reds' colormap for heatmaps\n",
" - Height ratios [0.1, 0.1, 1] for subplots\n",
"\n",
"#### PRFPredictor Class Methods\n",
"\n",
"**3. Class initialization: `PRFPredictor(model_dir=None)`**\n",
"- Loads HistGradientBoosting (short, 33bp) and BiLSTM-CNN (long, 399bp) models\n",
"- Uses ensemble weighting for final predictions\n",
"\n",
"**4. `predictor.predict_sequence(sequence, window_size=3, short_threshold=0.1, ensemble_weight=0.4)`**\n",
"- **Purpose**: Sliding window analysis of complete sequences\n",
"- **Process**: Scans sequence with specified window size, applies both models\n",
"\n",
"**5. `predictor.predict_regions(sequences, short_threshold=0.1, ensemble_weight=0.4)`**\n",
"- **Purpose**: Batch prediction for pre-defined 399bp regions\n",
"- **Input**: List/Series of 399bp sequences\n",
"- **Efficient**: Direct region analysis without sliding window\n",
"\n",
"**6. `predictor.predict_single_position(fs_period, full_seq, short_threshold=0.1, ensemble_weight=0.4)`**\n",
"- **Purpose**: Single position analysis\n",
"- **Inputs**: 33bp sequence (fs_period) + 399bp sequence (full_seq)\n",
"- **Returns**: Dictionary with individual and ensemble probabilities\n",
"\n",
"**7. `predictor.plot_sequence_prediction(...)`** \n",
"- **Purpose**: Class method version of plot_prf_prediction()\n",
"- **Same parameters** as standalone function\n",
"\n",
"#### Utility Functions\n",
"\n",
"**8. `fscanr(blastx_output, mismatch_cutoff=10, evalue_cutoff=1e-5, frameDist_cutoff=10)`**\n",
"- **Purpose**: Detect PRF sites from BLASTX alignment results\n",
"- **Input**: DataFrame with BLASTX columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore, qframe, sframe)\n",
"- **Output**: PRF sites with FS_start, FS_end, FS_type, Strand information\n",
"\n",
"**9. `extract_prf_regions(mrna_file, prf_data)`**\n",
"- **Purpose**: Extract 399bp sequences around detected PRF sites\n",
"- **Inputs**: FASTA file path + FScanR results DataFrame\n",
"- **Handles**: Strand orientation (reverse complement for '-' strand)\n",
"\n",
"#### Data Access Functions\n",
"\n",
"**10. `get_test_data_path(filename)`**\n",
"- **Purpose**: Get path to built-in test data files\n",
"- **Available files**: 'blastx_example.xlsx', 'mrna_example.fasta', 'region_example.csv'\n",
"\n",
"**11. `list_test_data()`**\n",
"- **Purpose**: Display all available test data files\n",
"\n",
"### Usage Pattern Examples\n",
"\n",
"#### Pattern 1: Quick Single Sequence Analysis\n",
"```python\n",
"from FScanpy import predict_prf, plot_prf_prediction\n",
"\n",
"# Simple prediction\n",
"results = predict_prf(sequence=\"ATGCGT...\")\n",
"\n",
"# With visualization \n",
"results, fig = plot_prf_prediction(sequence=\"ATGCGT...\")\n",
"```\n",
"\n",
"#### Pattern 2: Batch Sequence Analysis\n",
"```python\n",
"sequences = [\"seq1\", \"seq2\", \"seq3\"]\n",
"results = predict_prf(sequence=sequences, ensemble_weight=0.5)\n",
"```\n",
"\n",
"#### Pattern 3: BLASTX Pipeline\n",
"```python\n",
"from FScanpy.utils import fscanr, extract_prf_regions\n",
"\n",
"# Step 1: Detect PRF sites\n",
"prf_sites = fscanr(blastx_df)\n",
"\n",
"# Step 2: Extract sequences\n",
"prf_sequences = extract_prf_regions(fasta_file, prf_sites)\n",
"\n",
"# Step 3: Predict probabilities\n",
"results = predict_prf(data=prf_sequences)\n",
"```\n",
"\n",
"#### Pattern 4: Custom Analysis with PRFPredictor\n",
"```python\n",
"from FScanpy import PRFPredictor\n",
"\n",
"predictor = PRFPredictor()\n",
"\n",
"# Method chaining for different analysis types\n",
"seq_results = predictor.predict_sequence(sequence)\n",
"region_results = predictor.predict_regions(sequences_399bp)\n",
"single_result = predictor.predict_single_position(seq_33bp, seq_399bp)\n",
"```\n",
"\n",
"### Parameter Optimization Guide\n",
"\n",
"**Ensemble Weight Selection:**\n",
"- `0.2-0.3`: Conservative (high specificity, favor long model)\n",
"- `0.4-0.6`: Balanced (recommended default)\n",
"- `0.7-0.8`: Sensitive (high sensitivity, favor short model)\n",
"\n",
"**Window Size Selection:**\n",
"- `1`: High resolution, every position (slow but detailed)\n",
"- `3`: Standard resolution (balanced speed/detail) \n",
"- `6-9`: Low resolution, faster analysis\n",
"\n",
"**Threshold Guidelines:**\n",
"- `short_threshold`: 0.1-0.3 (controls efficiency by filtering low-probability candidates)\n",
"- Display thresholds: 0.3-0.8 (controls visualization, higher = cleaner plots)\n",
"- Classification threshold: 0.5 (standard binary classification cutoff)\n",
"\n",
"### Output Interpretation\n",
"\n",
"**Main Result Columns:**\n",
"- `Short_Probability`: HistGradientBoosting model prediction (0-1)\n",
"- `Long_Probability`: BiLSTM-CNN model prediction (0-1)\n",
"- `Ensemble_Probability`: **Final prediction** (weighted combination)\n",
"- `Position`: Sequence position (sliding window mode)\n",
"- `Codon`: Codon at position (sliding window mode)\n",
"\n",
"**Ensemble Probability Interpretation:**\n",
"- `> 0.8`: High confidence PRF site\n",
"- `0.5-0.8`: Moderate confidence PRF site \n",
"- `0.3-0.5`: Low confidence, worth investigating\n",
"- `< 0.3`: Unlikely to be PRF site\n",
"\n",
"### Best Practices\n",
"\n",
"1. **For exploration**: Use `window_size=1, ensemble_weight=0.4`\n",
"2. **For screening**: Use `window_size=3, ensemble_weight=0.4, short_threshold=0.2`\n",
"3. **For validation**: Use region-based prediction with known sequences\n",
"4. **For visualization**: Adjust `short_threshold` and `long_threshold` in plotting functions to control display density\n",
"\n",
"This demo covers all major FScanpy functionalities. For detailed parameter descriptions and advanced usage, please refer to the complete tutorial documentation.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 📝 Analysis Summary\n",
"\n",
"### 🎯 Key Findings\n",
"1. **Data Quality**: Test dataset contains real BLASTX alignment results and validation regions\n",
"2. **FScanR Performance**: Successfully identified potential PRF sites from BLASTX results\n",
"3. **Model Performance**: Short and Long models each have advantages in different scenarios\n",
"4. **Prediction Results**: Ensemble model provides more stable prediction performance\n",
"5. **Visualization**: Built-in plotting functions generate clear heatmaps and bar charts\n",
"\n",
"### 🔧 Best Practices\n",
"- **Data Preprocessing**: Ensure BLASTX results are in the correct format\n",
"- **Parameter Settings**: Use the default `ensemble_weight=0.4` (short:long weighting of 0.4:0.6) for balanced performance\n",
"- **Result Interpretation**: For whole-sequence (sliding-window) prediction, do not apply a fixed 0.5 threshold; instead, compare relative probabilities across positions\n",
"- **Visualization**: Use plot_prf_prediction function to generate standardized plots\n",
"\n",
"### 📚 Usage Recommendations\n",
"1. **Threshold Selection**: Adjust probability thresholds based on application scenarios\n",
"2. **Result Validation**: Validate prediction results with biological knowledge\n",
"3. **Performance Optimization**: Use reasonable sliding window sizes for large-scale data\n",
"4. **Visualization Parameters**: Adjust figsize and dpi for optimal display"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "fs",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.23"
}
},
"nbformat": 4,
"nbformat_minor": 4
}