Sample README file

Course: ADSC 3910 – Applied Data Science Integrated Practice 2
Instructor: Dr. Quan Nguyen
Institution: Thompson Rivers University
Term: Fall 2025


👥 Team Information

| Member | Role | GitHub Handle |
| --- | --- | --- |
| Alice Smith | Vector Store & Data Ingestion | @alicesmith |
| Bob Chen | Query Transformation & Retrieval | @bobchen |
| Carol Patel | Prompt Engineering & Documentation | @carolpatel |


🧠 Project Overview#

This project implements a Retrieval-Augmented Generation (RAG) pipeline that retrieves course-specific materials from a MongoDB Atlas vector store and uses an OpenAI model to generate context-aware answers.

Key features:

  • Query transformation for higher-quality retrieval

  • External prompt files (prompts/) for maintainability

  • Conda-based environment reproducibility

  • LangSmith traces for debugging and transparency


🗂️ Repository Structure

├── src/
│   ├── pipeline.py
│   ├── retriever.py
│   └── utils.py
├── prompts/
│   ├── system_prompt_v1.txt
│   ├── human_prompt_v1.txt
│   └── README.md
├── data/
│   └── (corpus or ingestion scripts)
├── logs/
│   └── trace_2025-10-26.json
├── notebooks/
│   └── 03_rag_pipeline.ipynb
├── environment.yml
├── .env.example
└── README.md

⚙️ Environment Setup (Conda)

1️⃣ Clone Repository

git clone https://github.com/yourusername/rag-pipeline-m3.git
cd rag-pipeline-m3

2️⃣ Create Conda Environment

conda env create -f environment.yml
conda activate rag-pipeline

3️⃣ Verify Python and Package Versions

python --version       # expected: 3.10.14
conda list langchain    # check version matches environment.yml

4️⃣ Set Up Environment Variables

Copy .env.example to .env and fill in your own credentials:

OPENAI_API_KEY=sk-xxxx
MONGODB_URI=your_mongo_connection_uri

Make sure .env is never committed; .gitignore already excludes it.
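
It can help to fail fast when a credential is missing. The following is a minimal sketch, not the project's actual code: `REQUIRED_VARS` and `missing_env_vars` are hypothetical names, and the sketch assumes the variables have already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`, since python-dotenv is pinned in environment.yml).

```python
import os
from typing import Mapping

# Variable names come from .env.example; the constant name is hypothetical.
REQUIRED_VARS = ("OPENAI_API_KEY", "MONGODB_URI")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_env_vars()` at startup and stopping when the list is non-empty gives a clearer error than a failed API or database call later in the run.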


🚀 Running the Pipeline

To run from the command line:

python src/pipeline.py --query "What is responsible AI in education?"

Or launch the demo notebook:

jupyter notebook notebooks/03_rag_pipeline.ipynb
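
The `--query` flag in the command-line invocation above could be parsed roughly as follows. This is a sketch under assumptions: the real argument handling in src/pipeline.py is not shown in this README, and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the --query flag used in this README."""
    parser = argparse.ArgumentParser(
        description="Answer one question with the RAG pipeline."
    )
    parser.add_argument("--query", required=True, help="Question to answer.")
    return parser


# Parse an explicit argument list; omit the list to read sys.argv instead.
args = build_parser().parse_args(["--query", "What is responsible AI in education?"])
print(args.query)
```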

💬 Example Query Walkthrough

Query:

“Summarize TRU’s AI policy for teaching and learning.”

Under the hood:

  1. Query transformed → “AI policy teaching learning summary TRU”

  2. Top 3 documents retrieved via MongoDB Atlas Vector Search

  3. Prompts combined:

    • system_prompt_v1.txt

    • human_prompt_v1.txt

  4. LLM: gpt-4o-mini

  5. Trace logged to logs/trace_2025-10-26.json

Expected Output (excerpt):

“TRU’s AI policy encourages responsible use of AI tools in teaching while maintaining academic integrity…”
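
The "Under the hood" steps can be sketched as a single function. This is an illustration only, not the code in src/pipeline.py: `answer_query` and the three `fake_*` stand-ins are hypothetical names, the stand-ins replace the real MongoDB Atlas and OpenAI calls so the example runs offline, and trace logging (step 5) is left out.

```python
from typing import Callable, Sequence


def answer_query(
    query: str,
    k: int,
    system_prompt: str,
    human_template: str,
    transform: Callable[[str], str],
    retrieve: Callable[[str, int], Sequence[str]],
    generate: Callable[[str, str], str],
) -> str:
    """Run the walkthrough steps: transform, retrieve, build prompts, generate."""
    search_query = transform(query)          # step 1: rewrite the user query
    documents = retrieve(search_query, k)    # step 2: fetch the top-k documents
    context = "\n\n".join(documents)
    # step 3: fill the human prompt template with the query and the context
    human_prompt = human_template.format(query=query, context=context)
    return generate(system_prompt, human_prompt)  # step 4: call the LLM


# Offline stand-ins (assumptions) so the sketch runs without external services.
def fake_transform(query: str) -> str:
    return "AI policy teaching learning summary TRU"


def fake_retrieve(search_query: str, k: int) -> list[str]:
    return [f"Document {i + 1} matching '{search_query}'" for i in range(k)]


def fake_generate(system_prompt: str, human_prompt: str) -> str:
    return f"[{system_prompt}] {human_prompt}"


answer = answer_query(
    query="Summarize TRU's AI policy for teaching and learning.",
    k=3,
    system_prompt="You are a course assistant.",
    human_template="Context:\n{context}\n\nQuestion: {query}",
    transform=fake_transform,
    retrieve=fake_retrieve,
    generate=fake_generate,
)
print(answer)
```

Passing the transform, retrieve, and generate steps in as callables keeps each stage replaceable and lets the flow be tested without credentials.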


🧾 Prompts and Documentation#

All prompts are stored and versioned under /prompts.

| File | Purpose | Version |
| --- | --- | --- |
| system_prompt_v1.txt | Defines assistant behavior and tone | v1.0 |
| human_prompt_v1.txt | Template for user query + context insertion | v1.0 |

Example loader:

# Load both versioned prompt files; paths are relative to the repository root.
from utils import load_prompt
system_prompt = load_prompt("prompts/system_prompt_v1.txt")
human_prompt = load_prompt("prompts/human_prompt_v1.txt")

See prompts/README.md for variable placeholders like {query} and {context}.
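
For orientation, here is one way a loader and the placeholder substitution might look. The body of `load_prompt` in src/utils.py is not shown in this README, so this version is an assumption, and `fill_human_prompt` is a hypothetical helper. It relies on the `{query}` and `{context}` placeholders being compatible with Python's `str.format`.

```python
from pathlib import Path


def load_prompt(path: str) -> str:
    """Read a prompt file as UTF-8 text and drop trailing whitespace."""
    return Path(path).read_text(encoding="utf-8").rstrip()


def fill_human_prompt(template: str, query: str, context: str) -> str:
    """Substitute the {query} and {context} placeholders in the template."""
    return template.format(query=query, context=context)
```

If a prompt file ever needs literal braces (for example, a JSON sample), they would have to be doubled (`{{` and `}}`) for `str.format` to leave them alone.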


Reproducibility Notes

| Component | Reproducibility Action |
| --- | --- |
| Environment | Managed via Conda (environment.yml, Python 3.10.14) |
| Prompts | Stored as separate files with version labels |
| Data Source | Snapshot captured on 2025-10-15 |
| Retriever | MongoDB Atlas Vector Search (k = 4, index vector_index) |
| Logs | All runs logged to /logs with timestamps |
| Randomness | Random seeds fixed to 42 in code |
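
As an illustration of the fixed-seed entry, the sketch below draws a repeatable sample with Python's standard `random` module. Where the pipeline actually uses randomness is not stated in this README, so `seeded_sample` and its use case are hypothetical.

```python
import random

SEED = 42  # seed value stated in the reproducibility notes


def seeded_sample(items: list[str], n: int, seed: int = SEED) -> list[str]:
    """Draw a reproducible sample using a private, seeded generator."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(items, n)


questions = ["q1", "q2", "q3", "q4", "q5"]
# Two calls with the same seed return the same selection.
print(seeded_sample(questions, 2) == seeded_sample(questions, 2))
```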


🧩 Naming Conventions#

| Type | Convention | Example |
| --- | --- | --- |
| Code files | lowercase with underscores | pipeline.py, query_transformer.py |
| Prompts | include role + version | system_prompt_v1.txt |
| Logs | include timestamp | trace_2025-10-26.json |
| Notebook | prefix with step or milestone | 03_rag_pipeline.ipynb |
| Git tag | milestone + version | milestone3-v1.0 |
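
The log and prompt conventions are easy to generate in code, which keeps names consistent across runs. A minimal sketch, assuming hypothetical helper names `trace_path` and `prompt_filename` that are not part of the repository:

```python
from datetime import date
from pathlib import PurePosixPath


def trace_path(run_date: date, log_dir: str = "logs") -> str:
    """Build a log path such as logs/trace_2025-10-26.json."""
    return str(PurePosixPath(log_dir) / f"trace_{run_date.isoformat()}.json")


def prompt_filename(role: str, version: int) -> str:
    """Build a prompt file name such as system_prompt_v1.txt."""
    return f"{role}_prompt_v{version}.txt"
```

For example, `trace_path(date(2025, 10, 26))` yields the log name shown in the repository structure.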


⚙️ Example environment.yml

name: rag-pipeline
channels:
  - defaults
dependencies:
  - python=3.10.14
  - langchain=0.3.2
  - openai=1.50.1
  - pymongo=4.10.1
  - python-dotenv=1.0.1
  - jupyter
  - pip

🧭 Reflection#

Challenges: Retrieval results varied between runs because the vector index was updated.
Next Steps: Add a Dockerfile for environment replication and a CI test for prompt loading.


📚 References & Acknowledgements

  • TRU ADSC 3910 Course Materials

  • LangChain and LangSmith Documentation

  • MongoDB Atlas Vector Search Guides

  • OpenAI API Docs


Last Updated: October 2025 | Tag: milestone3-v1.0