LLM ENGINEERING

Part 2: Prepping Data for Your First LLM Call (Busy Builders Edition)

Before we hit our first LLM call, we need something useful to feed into it. In this part, we'll scrape a real website, clean the text, and structure it as input data — getting one step closer to real-world AI workflows.

Prasanna Arjunan • Jan 18, 2025 • 8:30 AM SGT

Let’s Build a FAQ Builder

In Part 1, we got our environment up and running — Python, JupyterLab, OpenAI API key, and all dependencies sorted. Now, it’s time to do something useful with it.

In this post, we’ll start building a FAQ Builder. The idea is simple:

Scrape a public help center article from the web
Clean the text so we keep only what matters
Prepare the content so it’s ready for LLM summarization

While we won’t actually call the LLM yet — that’s in Part 3 — this step sets the foundation for it. And along the way, we’ll practice:

Web scraping with BeautifulSoup
Loading API keys securely with dotenv
Structuring our first reusable Python class for this project

This is exactly how most LLM projects begin: raw content → cleaned input → model-ready. Let’s get started.

Start Fresh: Activate Your Environment

Let’s make sure everything runs from a clean slate.

Before launching JupyterLab, activate the environment we created in Part 1:

bash

conda activate llms-busybuilders

You’ll know it’s active when your terminal prompt changes — usually showing the environment name at the front, like this:

bash

(llms-busybuilders) %

venv users: If you’re using venv instead of conda, activate it like this:

macOS/Linux: source venv/bin/activate
Windows: venv\Scripts\activate

Once the environment is activated, you’re ready to launch JupyterLab.

Launch JupyterLab

With your environment active, launch JupyterLab from the same terminal:

bash

jupyter lab

This will open up a new tab in your browser with JupyterLab. If it’s your first time using it, don’t worry — we’ll do a guided walkthrough in a later post.

For now, just create a new Python 3 notebook in your llm-busybuilders folder. We’ll use this to run our first scraper and prep some data.

Tip: Jupyter saves your work automatically. You can also rename your notebook (top-left) to something like faq_scraper.ipynb.

Load your API key from .env

Before we can talk to OpenAI’s models, we need to load the secret API key we saved earlier in your .env file.

We’ll use the python-dotenv package (imported as dotenv) to load the variables from .env into the process environment, so the key never has to be hardcoded in a notebook.

python

from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

if api_key:
    print("API key loaded successfully.")
else:
    print("API key not found. Check your .env file.")

If the printout says your API key isn’t found, double-check that:

Your .env file is in the same folder as your notebook
The key name is OPENAI_API_KEY — no typos, no quotes
You restarted JupyterLab after adding the file (if needed)

Reminder: Never share your actual API key or commit it to Git. That’s why we use .env files.
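A notebook keeps running after a printed warning, so a missing key may only surface later as a confusing API error. Here is a minimal sketch of a stricter check. The helper names require_env and mask_secret, and the DEMO_API_KEY variable, are made up for illustration and are not part of dotenv or the OpenAI package.

```python
import os

def require_env(name):
    """Return an environment variable's value, or raise if it is missing or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set. Check your .env file.")
    return value

def mask_secret(value, visible=4):
    """Show only the last few characters so the full key never lands in notebook output."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

# Demo with a made-up variable name and value (not a real key)
os.environ["DEMO_API_KEY"] = "sk-demo-1234"
print(mask_secret(require_env("DEMO_API_KEY")))  # prints ********1234
```

In your own notebook you would call require_env("OPENAI_API_KEY") after load_dotenv(), so the cell stops right away if the key is missing.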

Set up the OpenAI client

Now that we’ve loaded the API key, let’s initialize the OpenAI client. We’ll do this using the official openai Python package, which we already installed in Part 1.

Here’s a simple setup that gets us ready to call any OpenAI model:

python

from openai import OpenAI

# Create a client with the key loaded from .env
client = OpenAI(api_key=api_key)

# Test call to check everything is working
models = client.models.list()
print("OpenAI client ready. Available models:")
for m in models.data[:5]:
    print("-", m.id)

This will list a few available models from your OpenAI account. If this step errors out, it usually means your API key is invalid or not loading properly.

Tip: Don’t worry if your list looks different — available models depend on your account and API plan.

Scrape a sample FAQ webpage

Let’s say your company has a Help Center with FAQ articles like https://example.com/help/returns. In this section, we’ll scrape the content from a page like that and clean it up for the LLM.

We’ll use requests to fetch the page and BeautifulSoup to parse it. Here's a basic utility function with minimal error handling:

python

import requests
from bs4 import BeautifulSoup

def scrape_faq(url):
    try:
        headers = {"User-Agent": "Mozilla/5.0"}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an error for bad responses

        soup = BeautifulSoup(response.content, "html.parser")

        # Remove unwanted tags
        for tag in soup(["script", "style", "nav", "footer", "form"]):
            tag.decompose()

        # get_text() always returns a string; .string can be None for some titles
        title = soup.title.get_text(strip=True) if soup.title else "No title found"
        body = soup.get_text(separator="\n").strip()

        return {
            "title": title,
            "text": body
        }
    except requests.exceptions.RequestException as e:
        print(f"⚠️ Failed to fetch {url}: {e}")
        return {"title": "Error", "text": ""}

Pro Tip: This version includes basic error handling for common issues like bad URLs, timeouts, or server errors — helpful when working with real websites.

Now let’s try it on a sample page:

python

faq = scrape_faq("https://example.com/help/returns")

print("Title:", faq["title"][:80])
print("Text preview:", faq["text"][:500], "...")

Tip: Use any FAQ page with mostly clean HTML. Try docs.shopify.com, help.zendesk.com, or even your internal pages (if accessible).
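Because scrape_faq returns an empty text on failure instead of raising, it can help to check the result before moving on. Here is a small sketch of such a guard. The name has_usable_text and the 200-character threshold are assumptions picked for illustration, so tune them to your pages.

```python
def has_usable_text(page, min_chars=200):
    """Return True if a scrape result looks like real content, not an error or a near-empty page."""
    text = page.get("text", "").strip()
    return len(text) >= min_chars

# Two hand-made results in the same shape that scrape_faq returns
failed = {"title": "Error", "text": ""}
ok = {"title": "Returns", "text": "You can return any item within 30 days. " * 10}

print(has_usable_text(failed))  # False
print(has_usable_text(ok))      # True
```

A length check will not catch everything (a login wall can be long, too), but it stops the most common failure from quietly flowing into the next step.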

Clean and display the extracted text

Now that we’ve scraped the FAQ page, let’s do some final cleanup before we pass it to a model. Long-form text like this often includes empty lines, navigation leftovers, or formatting glitches.

Let’s normalize it a bit and see what we’re working with:

python

import re

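def clean_text(text):
    # Strip leading/trailing whitespace from every line
    lines = [line.strip() for line in text.split("\n")]
    # Drop lines that are empty after stripping
    non_empty = [line for line in lines if line]
    # Join into one string and collapse any remaining runs of whitespace
    collapsed = " ".join(non_empty)
    return re.sub(r"\s+", " ", collapsed).strip()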

cleaned_text = clean_text(faq["text"])

print("Cleaned text preview:")
print(cleaned_text[:1000])

Busy Builder Note: This version drops empty lines, joins the rest into a single line, and collapses repeated whitespace, which is good enough cleanup for most LLM inputs. If you ever work with super messy HTML or PDFs, you can always layer in more normalization later.

This gives you a clean, readable version of the FAQ content — just the kind of input LLMs work well with.
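To see what the cleanup does without hitting a real website, here is a small worked example on a hand-written string. The function is repeated so the snippet runs on its own, and the sample text is invented for illustration.

```python
import re

def clean_text(text):
    # Strip each line, drop the empty ones, join, then collapse whitespace
    lines = [line.strip() for line in text.split("\n")]
    non_empty = [line for line in lines if line]
    collapsed = " ".join(non_empty)
    return re.sub(r"\s+", " ", collapsed).strip()

# Messy input: blank lines, padded lines, doubled spaces, and a tab
sample = "\n\n  Returns Policy \n\n\nItems can be   returned\twithin 30 days.\n   \n"

print(clean_text(sample))
# prints: Returns Policy Items can be returned within 30 days.
```

Notice that the heading and the sentence end up on the same line. If you would rather keep paragraph breaks, joining with "\n" instead of a space (and skipping the final collapse) is one option.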

Tip: Keep your input focused. Remove headers, navs, and non-content elements before sending anything to a model — it improves quality and reduces token cost.
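To get a feel for token cost before calling a model, a rough estimate is often enough. The sketch below assumes about 4 characters per token, which is a common rule of thumb for English text and not an exact count. The helper names estimate_tokens and truncate_to_budget are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate based on character count (rule of thumb, not exact)."""
    return (len(text) + chars_per_token - 1) // chars_per_token  # ceiling division

def truncate_to_budget(text, max_tokens, chars_per_token=4):
    """Cut text to fit an approximate token budget."""
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Back up to the last space so the text does not end mid-word
    if " " in cut:
        cut = cut.rsplit(" ", 1)[0]
    return cut

# Made-up sample text standing in for a long cleaned FAQ page
sample = "You can return any item within 30 days of delivery. " * 50

print("Estimated tokens:", estimate_tokens(sample))
short = truncate_to_budget(sample, max_tokens=100)
print("After truncation:", estimate_tokens(short))
```

For exact counts you would need the model's real tokenizer (for OpenAI models, the tiktoken library). The estimate here is only meant for quick sanity checks.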

What’s next: our first LLM call

We now have everything we need:

A working dev environment
API key securely loaded
Cleaned FAQ content from a real webpage
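The intro mentioned structuring a reusable class for this project. One hypothetical way to bundle the scraped pieces is a small dataclass like the sketch below. The class name FAQPage, its fields, and its methods are assumptions made for illustration, and the actual series may structure this differently.

```python
from dataclasses import dataclass

@dataclass
class FAQPage:
    """Hypothetical container for one scraped and cleaned help article."""
    url: str
    title: str
    text: str

    def preview(self, max_chars=200):
        """Short preview for eyeballing the content in a notebook."""
        if len(self.text) <= max_chars:
            return self.text
        return self.text[:max_chars].rstrip() + "..."

    def as_prompt_input(self):
        """Combine title and body into one block of text for a later prompt."""
        return f"Title: {self.title}\n\n{self.text}"

# Hand-written values standing in for real scrape_faq and clean_text output
page = FAQPage(
    url="https://example.com/help/returns",
    title="Returns",
    text="Items can be returned within 30 days.",
)
print(page.as_prompt_input())
```

Keeping the URL, title, and cleaned text together makes it easier to handle several articles later without juggling loose variables.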

In Part 3, we’ll send this content to a real LLM (like GPT-4 or GPT-3.5) and ask it to generate a useful summary or build an FAQ-style response.

This will be our first true LLM interaction — and we’ll also introduce:

Prompt engineering basics
Token limits and formatting
Generating output you can use in real workflows

Let’s keep building.

Ready for action: If everything in your notebook worked till now, you’re fully ready to talk to the model in Part 3. Let’s go.