🔍 Sample README file

Course: ADSC 3910 – Applied Data Science Integrated Practice 2
Instructor: Dr. Quan Nguyen
Institution: Thompson Rivers University
Term: Fall 2025

👥 Team Information

| Member | Role | GitHub Handle |
| --- | --- | --- |
| Alice Smith | Vector Store & Data Ingestion | @alicesmith |
| Bob Chen | Query Transformation & Retrieval | @bobchen |
| Carol Patel | Prompt Engineering & Documentation | @carolpatel |

🧠 Project Overview

This project implements a Retrieval-Augmented Generation (RAG) pipeline that retrieves course-specific materials from a MongoDB Atlas vector store and uses an OpenAI model to generate context-aware answers.

Key features:

- Query transformation for higher-quality retrieval
- External prompt files (prompts/) for maintainability
- Conda-based environment reproducibility
- LangSmith traces for debugging and transparency

🗂️ Repository Structure

├── src/
│   ├── pipeline.py
│   ├── retriever.py
│   └── utils.py
├── prompts/
│   ├── system_prompt_v1.txt
│   ├── human_prompt_v1.txt
│   └── README.md
├── data/
│   └── (corpus or ingestion scripts)
├── logs/
│   └── trace_2025-10-26.json
├── notebooks/
│   └── 03_rag_pipeline.ipynb
├── environment.yml
├── .env.example
└── README.md

⚙️ Environment Setup (Conda)

1️⃣ Clone Repository

git clone https://github.com/yourusername/rag-pipeline-m3.git
cd rag-pipeline-m3

2️⃣ Create Conda Environment

conda env create -f environment.yml
conda activate rag-pipeline

3️⃣ Verify Python and Package Versions

python --version       # expected: 3.10.14
conda list langchain    # check version matches environment.yml
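The same check can be scripted so that every team member runs an identical comparison. The sketch below is only an illustration: the `PINNED` dictionary copies the pins from the example environment.yml in this README, and `find_mismatches` is a hypothetical helper, not part of the repository.

```python
from importlib import metadata

# Pins copied from the example environment.yml in this README.
PINNED = {
    "langchain": "0.3.2",
    "openai": "1.50.1",
    "pymongo": "4.10.1",
    "python-dotenv": "1.0.1",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, installed)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means the active environment matches the pins; any printed line points at a package to reinstall or a pin to update.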

4️⃣ Set Up Environment Variables

Copy .env.example → .env and fill in your own credentials:

OPENAI_API_KEY=sk-xxxx
MONGODB_URI=your_mongo_connection_uri

Never commit .env; confirm that .gitignore lists it before your first push.
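Loading these variables at startup and failing fast when one is missing avoids confusing errors later in the pipeline. The following is a minimal sketch, assuming the two variable names shown in .env.example; `missing_keys` and `load_settings` are hypothetical helpers, and `load_dotenv` comes from the python-dotenv package listed in environment.yml.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "MONGODB_URI")  # names from .env.example


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or blank in `env`."""
    return [key for key in required if not env.get(key, "").strip()]


def load_settings():
    """Load .env into the process environment and fail fast if a key is missing."""
    from dotenv import load_dotenv  # provided by the python-dotenv package

    load_dotenv()  # loads variables from a .env file if one is found
    missing = missing_keys(os.environ)
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: os.environ[key] for key in REQUIRED_KEYS}
```

Calling `load_settings()` once at the top of the entry point keeps credential handling in one place and out of the rest of the code.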

🚀 Running the Pipeline

To run from the command line:

python src/pipeline.py --query "What is responsible AI in education?"
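For readers wiring up their own entry point, the `--query` flag could be parsed as follows. This is a sketch under stated assumptions, not the contents of src/pipeline.py: `build_parser` and `main` are illustrative names, and the pipeline call is replaced by a placeholder print.

```python
import argparse


def build_parser():
    """Create a parser that accepts the --query flag used in the command above."""
    parser = argparse.ArgumentParser(description="Run the RAG pipeline on one query.")
    parser.add_argument("--query", required=True, help="Question to answer.")
    return parser


def main(argv=None):
    """Parse arguments and hand the query to the pipeline (placeholder here)."""
    args = build_parser().parse_args(argv)
    # Placeholder: a real pipeline would transform, retrieve, and generate here.
    print(f"Received query: {args.query}")
    return args.query
```

In a script, `main()` would typically be called under an `if __name__ == "__main__":` guard so the module can also be imported from the notebook.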

Or launch the demo notebook:

jupyter notebook notebooks/03_rag_pipeline.ipynb

💬 Example Query Walkthrough

Query:

“Summarize TRU’s AI policy for teaching and learning.”

Under the hood:

1. Query transformed → “AI policy teaching learning summary TRU”
2. Top 3 documents retrieved via MongoDB Atlas Vector Search
3. Prompts combined:
   - system_prompt_v1.txt
   - human_prompt_v1.txt
4. LLM: gpt-4o-mini
5. Trace logged to logs/trace_2025-10-26.json
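The first two steps can be sketched in a few lines. Everything here is illustrative: `transform_query` is a toy keyword filter (the project's transformation may well use an LLM, and its output differs from the rewritten query shown above), the stopword list is invented, and the field names `embedding` and `text` are assumptions about the collection schema. Only the `$vectorSearch` stage and its `index`, `path`, `queryVector`, `numCandidates`, and `limit` fields come from MongoDB Atlas Vector Search itself. Because the walkthrough retrieves the top 3 documents while the Reproducibility Notes list k = 4, `k` is left as a required argument.

```python
import re

# Illustrative stopword list; not taken from the project.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "what"}


def transform_query(query):
    """Toy transformation: strip possessives and punctuation, drop stopwords."""
    cleaned = re.sub(r"[’']s\b", "", query)  # "TRU’s" -> "TRU"
    tokens = re.findall(r"[A-Za-z0-9]+", cleaned)
    return " ".join(t for t in tokens if t.lower() not in STOPWORDS)


def build_vector_search_pipeline(query_vector, k, index="vector_index", path="embedding"):
    """Build a MongoDB aggregation pipeline that starts with a $vectorSearch stage."""
    return [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,  # assumed name of the field holding embeddings
                "queryVector": list(query_vector),
                "numCandidates": k * 20,  # candidates examined before the top k are kept
                "limit": k,
            }
        },
        # Keep the (assumed) text field and expose the similarity score.
        {"$project": {"_id": 0, "text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
```

With a pymongo collection in hand, the returned list would be passed to `collection.aggregate(...)`; the query vector must come from the same embedding model that was used at ingestion time.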

Expected Output (excerpt):

“TRU’s AI policy encourages responsible use of AI tools in teaching while maintaining academic integrity…”

🧾 Prompts and Documentation

All prompts are stored and versioned in the prompts/ directory.

| File | Purpose | Version |
| --- | --- | --- |
| `system_prompt_v1.txt` | Defines assistant behavior and tone | v1.0 |
| `human_prompt_v1.txt` | Template for user query + context insertion | v1.0 |

Example loader:

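# Assumes src/ is on the import path (true when running `python src/pipeline.py`)
# and that the working directory is the repository root.
from utils import load_prompt
system_prompt = load_prompt("prompts/system_prompt_v1.txt")
human_prompt = load_prompt("prompts/human_prompt_v1.txt")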

See prompts/README.md for variable placeholders like {query} and {context}.
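One possible shape for such a loader, together with placeholder substitution, is sketched below. This is an assumption about how `load_prompt` might be written, not the code in src/utils.py, and `fill_prompt` is a hypothetical helper name.

```python
from pathlib import Path


def load_prompt(path):
    """Read a prompt template from disk as UTF-8 text, trimming outer whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()


def fill_prompt(template, query, context):
    """Substitute the {query} and {context} placeholders in a template."""
    return template.format(query=query, context=context)
```

Note that `str.format` treats every brace as a placeholder, so a template containing literal braces (for example a JSON snippet) would need them doubled as `{{` and `}}`.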

🔁 Reproducibility Notes

| Component | Reproducibility Action |
| --- | --- |
| Environment | Managed via Conda (`environment.yml`, Python 3.10.14) |
| Prompts | Stored as separate files with version labels |
| Data Source | Snapshot captured on 2025-10-15 |
| Retriever | MongoDB Atlas Vector Search (k = 4, index `vector_index`) |
| Logs | All runs logged to `/logs` with timestamps |
| Randomness | Random seeds fixed to 42 in code |
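The seeding and logging rows could be implemented along these lines. This is a minimal sketch: `seed_everything` and `write_trace` are hypothetical helpers, the `logged_at` field is an invented example of a timestamp, and only the seed value and the trace_YYYY-MM-DD.json naming come from this README.

```python
import json
import random
from datetime import date, datetime, timezone
from pathlib import Path

SEED = 42  # value stated in the table above


def seed_everything(seed=SEED):
    """Seed Python's built-in RNG; seed NumPy or other libraries here if used."""
    random.seed(seed)


def write_trace(record, log_dir="logs", day=None):
    """Write one run record to <log_dir>/trace_YYYY-MM-DD.json and return the path."""
    day = day or date.today()
    path = Path(log_dir) / f"trace_{day.isoformat()}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Add a UTC timestamp alongside the caller's fields.
    payload = {"logged_at": datetime.now(timezone.utc).isoformat(), **record}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

As written, a second run on the same day overwrites the earlier file; appending to a list or adding a time component to the file name would preserve every run.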

🧩 Naming Conventions

| Type | Convention | Example |
| --- | --- | --- |
| Code files | lowercase with underscores | `pipeline.py`, `query_transformer.py` |
| Prompts | include role + version | `system_prompt_v1.txt` |
| Logs | include timestamp | `trace_2025-10-26.json` |
| Notebook | prefix with step or milestone | `03_rag_pipeline.ipynb` |
| Git tag | milestone + version | `milestone3-v1.0` |
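These conventions are easy to check automatically, for example in a pre-commit hook. The patterns below are inferred from the examples in the table and are only one reasonable reading of each rule; `follows_convention` is a hypothetical helper.

```python
import re

# Patterns inferred from the examples in the table above; adjust as needed.
PATTERNS = {
    "code": re.compile(r"[a-z][a-z0-9_]*\.py"),
    "prompt": re.compile(r"[a-z]+_prompt_v\d+\.txt"),
    "log": re.compile(r"trace_\d{4}-\d{2}-\d{2}\.json"),
    "notebook": re.compile(r"\d{2}_[a-z0-9_]+\.ipynb"),
    "tag": re.compile(r"milestone\d+-v\d+\.\d+"),
}


def follows_convention(kind, name):
    """Return True if the whole of `name` matches the pattern registered for `kind`."""
    return PATTERNS[kind].fullmatch(name) is not None
```

For instance, `follows_convention("prompt", "system_prompt_v1.txt")` accepts the file from the table, while a name such as `SystemPrompt.txt` would be rejected.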

⚙️ Example environment.yml

name: rag-pipeline
channels:
  - defaults
dependencies:
  - python=3.10.14
  - langchain=0.3.2
  - openai=1.50.1
  - pymongo=4.10.1
  - python-dotenv=1.0.1
  - jupyter
  - pip

🧭 Reflection

Challenges: Retrieval results varied between runs whenever the vector index was updated.
Next Steps: Add a Dockerfile for environment replication and a CI test for prompt loading.

📚 References & Acknowledgements

- TRU ADSC 3910 Course Materials
- LangChain and LangSmith Documentation
- MongoDB Atlas Vector Search Guides
- OpenAI API Docs

Last Updated: October 2025 | Tag: milestone3-v1.0