Sample README file
Course: ADSC 3910 - Applied Data Science Integrated Practice 2
Instructor: Dr. Quan Nguyen
Institution: Thompson Rivers University
Term: Fall 2025
Team Information
| Member | Role | GitHub Handle |
|---|---|---|
| Alice Smith | Vector Store & Data Ingestion | @alicesmith |
| Bob Chen | Query Transformation & Retrieval | @bobchen |
| Carol Patel | Prompt Engineering & Documentation | @carolpatel |
Project Overview
This project implements a Retrieval-Augmented Generation (RAG) pipeline that retrieves course-specific materials from a MongoDB Atlas vector store and uses an OpenAI model to generate context-aware answers.
Key features:
Query transformation for higher-quality retrieval
External prompt files (prompts/) for maintainability
Conda-based environment reproducibility
LangSmith traces for debugging and transparency
Repository Structure
├── src/
│   ├── pipeline.py
│   ├── retriever.py
│   └── utils.py
├── prompts/
│   ├── system_prompt_v1.txt
│   ├── human_prompt_v1.txt
│   └── README.md
├── data/
│   └── (corpus or ingestion scripts)
├── logs/
│   └── trace_2025-10-26.json
├── notebooks/
│   └── 03_rag_pipeline.ipynb
├── environment.yml
├── .env.example
└── README.md
Environment Setup (Conda)
1. Clone Repository
git clone https://github.com/yourusername/rag-pipeline-m3.git
cd rag-pipeline-m3
2. Create Conda Environment
conda env create -f environment.yml
conda activate rag-pipeline
3. Verify Python and Package Versions
python --version # expected: 3.10.14
conda list langchain # check version matches environment.yml
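The same check can be scripted. The sketch below is one possible approach, not part of the repository: it compares installed package versions against the pins from the example environment.yml in this README. The function names and the `EXPECTED_PACKAGES` dictionary are hypothetical.

```python
import sys
from importlib import metadata

# Pins copied from the example environment.yml in this README (assumed current).
EXPECTED_PACKAGES = {
    "langchain": "0.3.2",
    "openai": "1.50.1",
    "pymongo": "4.10.1",
    "python-dotenv": "1.0.1",
}
EXPECTED_PYTHON = (3, 10, 14)


def find_version_mismatches(expected):
    """Return {package: (expected, installed)} for every pin that does not match.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


def python_matches(expected=EXPECTED_PYTHON):
    """True when the running interpreter is exactly the expected version."""
    return tuple(sys.version_info[:3]) == tuple(expected)
```

An empty dictionary from `find_version_mismatches(EXPECTED_PACKAGES)` together with `python_matches()` returning `True` would indicate that the active environment matches the pins.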
4. Set Up Environment Variables
Copy .env.example to .env and fill in your own credentials:
OPENAI_API_KEY=sk-xxxx
MONGODB_URI=your_mongo_connection_uri
Ensure .env is not committed (.gitignore includes it).
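A missing credential is easier to diagnose at startup than inside an API call. The following is a minimal sketch of such a check, assuming the values from .env are already in the process environment (for example after calling `load_dotenv()` from python-dotenv, which is listed in environment.yml). The helper names are hypothetical and do not come from the repository.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = ("OPENAI_API_KEY", "MONGODB_URI")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


def require_env(required=REQUIRED_VARS, env=None):
    """Raise a clear error early instead of failing deep inside an API call."""
    missing = missing_env_vars(required, env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_env()` at the top of the pipeline entry point would stop the run before any network request is made.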
Running the Pipeline
To run from the command line:
python src/pipeline.py --query "What is responsible AI in education?"
Or launch the demo notebook:
jupyter notebook notebooks/03_rag_pipeline.ipynb
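The contents of src/pipeline.py are not shown in this README. As an illustration only, a command-line entry point that accepts the `--query` flag used above could be sketched with `argparse` as follows. The function names are hypothetical.

```python
import argparse


def build_parser():
    """CLI with the --query flag used in the command above."""
    parser = argparse.ArgumentParser(description="Ask the RAG pipeline a question.")
    parser.add_argument("--query", required=True, help="Natural-language question to answer.")
    return parser


def parse_query(argv=None):
    """Return the stripped query string; argv=None reads sys.argv."""
    args = build_parser().parse_args(argv)
    query = args.query.strip()
    if not query:
        raise SystemExit("--query must not be empty")
    return query
```

Passing an explicit list, as in `parse_query(["--query", "What is responsible AI in education?"])`, makes the parser easy to exercise from a notebook or a test.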
Example Query Walkthrough
Query:
"Summarize TRU's AI policy for teaching and learning."
Under the hood:
Query transformed → "AI policy teaching learning summary TRU"
Top 3 documents retrieved via MongoDB Atlas Vector Search
Prompts combined: system_prompt_v1.txt and human_prompt_v1.txt
LLM: gpt-4o-mini
Trace logged to logs/trace_2025-10-26.json
Expected Output (excerpt):
"TRU's AI policy encourages responsible use of AI tools in teaching while maintaining academic integrity…"
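The first two steps of the walkthrough can be sketched in plain Python. This is an illustration under stated assumptions, not the project's code: the stopword-based `transform_query` is a stand-in for the real transformation (which is not shown, and its word order differs from the example above), and the index name `vector_index` and field path `embedding` are placeholders. The stage itself uses the documented fields of the Atlas `$vectorSearch` aggregation stage.

```python
import re

# Hypothetical stopword list; the project's real query transformation is not shown.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "what"}


def transform_query(query):
    """Reduce a question to its content words (one simple transformation strategy)."""
    without_possessive = re.sub(r"['’]s\b", "", query)
    words = re.findall(r"[A-Za-z0-9]+", without_possessive)
    return " ".join(w for w in words if w.lower() not in STOPWORDS)


def build_vector_search_stage(query_vector, k=3, index="vector_index", path="embedding"):
    """Build an Atlas $vectorSearch aggregation stage that returns the top k matches.

    index and path are placeholder names; use the ones defined on your collection.
    """
    return {
        "$vectorSearch": {
            "index": index,
            "path": path,
            "queryVector": list(query_vector),
            "numCandidates": max(k * 10, 50),  # candidates considered before the top k are kept
            "limit": k,
        }
    }
```

With pymongo, the returned stage would be the first element of the list passed to `collection.aggregate(...)`. The query vector has to come from the same embedding model that was used at ingestion time.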
Prompts and Documentation
All prompts are stored and versioned under /prompts.
| File | Purpose | Version |
|---|---|---|
| | Defines assistant behavior and tone | v1.0 |
| | Template for user query + context insertion | v1.0 |
Example loader:
from utils import load_prompt
system_prompt = load_prompt("prompts/system_prompt_v1.txt")
human_prompt = load_prompt("prompts/human_prompt_v1.txt")
See prompts/README.md for variable placeholders like {query} and {context}.
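The implementation of `load_prompt` in utils.py is not shown here. A minimal sketch of what such a loader might look like, together with a hypothetical `fill_human_prompt` helper that substitutes the `{query}` and `{context}` placeholders, is given below.

```python
from pathlib import Path


def load_prompt(path):
    """Read a prompt file as UTF-8 text, stripping surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()


def fill_human_prompt(template, query, context_docs):
    """Insert the query and retrieved documents into the {query}/{context} placeholders."""
    context = "\n\n".join(context_docs)
    return template.format(query=query, context=context)
```

Because this sketch relies on `str.format`, any literal braces in a prompt file would need to be doubled (`{{` and `}}`).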
Reproducibility Notes
| Component | Reproducibility Action |
|---|---|
| Environment | Managed via Conda ( |
| Prompts | Stored as separate files with version labels |
| Data Source | Snapshot captured on 2025-10-15 |
| Retriever | MongoDB Atlas Vector Search (k = 4, index |
| Logs | All runs logged to |
| Randomness | Random seeds fixed to 42 in code |
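For the randomness row, a seed helper along these lines is one common pattern. It is a sketch only: the name `set_seed` is hypothetical, and NumPy is seeded only if it happens to be installed, since it is not listed in environment.yml.

```python
import random

SEED = 42  # value stated in the table above


def set_seed(seed=SEED):
    """Seed Python's random module, and NumPy too when it is installed."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency, not listed in environment.yml
    except ImportError:
        return
    np.random.seed(seed)
```

Seeding covers local sampling such as shuffling or train/test splits. It should not be assumed to make hosted model responses identical between runs.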
Naming Conventions
| Type | Convention | Example |
|---|---|---|
| Code files | lowercase with underscores | |
| Prompts | include role + version | |
| Logs | include timestamp | |
| Notebook | prefix with step or milestone | |
| Git tag | milestone + version | |
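The log convention (timestamp in the file name) can be enforced in code rather than by hand. The sketch below assumes the date-stamped pattern seen in logs/trace_2025-10-26.json. The helper names and the shape of the trace record are hypothetical.

```python
import json
from datetime import date
from pathlib import Path


def trace_path(run_date, log_dir="logs"):
    """Build a log path such as logs/trace_2025-10-26.json from a date."""
    return Path(log_dir) / f"trace_{run_date.isoformat()}.json"


def write_trace(record, run_date, log_dir="logs"):
    """Write one run's trace as JSON, creating the log directory if needed."""
    path = trace_path(run_date, log_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

With a date-only stamp, a second run on the same day overwrites the earlier file. A finer-grained timestamp would avoid that if several runs per day need to be kept.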
Example environment.yml
name: rag-pipeline
channels:
  - defaults
dependencies:
  - python=3.10.14
  - langchain=0.3.2
  - openai=1.50.1
  - pymongo=4.10.1
  - python-dotenv=1.0.1
  - jupyter
  - pip
Reflection
Challenges: Retrieval results varied between runs whenever the vector index was updated.
Next Steps: Add Dockerfile for environment replication and CI test for prompt loading.
References & Acknowledgements
TRU ADSC 3910 Course Materials
LangChain and LangSmith Documentation
MongoDB Atlas Vector Search Guides
OpenAI API Docs
Last Updated: October 2025 | Tag: milestone3-v1.0