Part 2: Prepping Data for Your First LLM Call (Busy Builders Edition)
Before we hit our first LLM call, we need something useful to feed into it. In this part, we'll scrape a real website, clean the text, and structure it as input data — getting one step closer to real-world AI workflows.
Let’s Build a FAQ Builder
In Part 1, we got our environment up and running — Python, JupyterLab, OpenAI API key, and all dependencies sorted. Now, it’s time to do something useful with it.
In this post, we’ll start building a FAQ Builder. The idea is simple:
- Scrape a public help center article from the web
- Clean the text so we keep only what matters
- Prepare the content so it’s ready for LLM summarization
While we won’t actually call the LLM yet — that’s in Part 3 — this step sets the foundation for it. And along the way, we’ll practice:
- Web scraping with BeautifulSoup
- Loading API keys securely with dotenv
- Structuring our first reusable Python class for this project
This is exactly how most LLM projects begin: raw content → cleaned input → model-ready. Let’s get started.
Start Fresh: Activate Your Environment
Let’s make sure everything runs from a clean slate.
Before launching JupyterLab, activate the environment we created in Part 1:
conda activate llms-busybuilders
You’ll know it’s active when your terminal prompt changes — usually showing the environment name at the front, like this:
(llms-busybuilders) %
venv users: If you’re using venv instead of conda, activate it like this:
- macOS/Linux: source venv/bin/activate
- Windows: venv\Scripts\activate
Once the environment is activated, you’re ready to launch JupyterLab.
Launch JupyterLab
With your environment active, launch JupyterLab from the same terminal:
jupyter lab
This will open up a new tab in your browser with JupyterLab. If it’s your first time using it, don’t worry — we’ll do a guided walkthrough in a later post.
For now, just create a new Python 3 notebook in your
llm-busybuilders folder. We’ll use this to run our
first scraper and prep some data.
Tip: Jupyter saves your work automatically. You
can also rename your notebook (top-left) to something like
faq_scraper.ipynb.
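If you want to confirm from inside the notebook that the kernel is really running in the environment you activated, a quick check along these lines can help. Treat it as a sketch: describe_environment is a hypothetical helper name, and it assumes the CONDA_DEFAULT_ENV and VIRTUAL_ENV variables that conda and venv activation normally set.

```python
import os
import sys


def describe_environment():
    """Return details about the Python interpreter running this notebook."""
    return {
        # Full path to the interpreter behind the current kernel
        "python": sys.executable,
        # Normally set by `conda activate`; None when no conda env is active
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Normally set by a venv activate script; None when no venv is active
        "venv": os.environ.get("VIRTUAL_ENV"),
    }


for name, value in describe_environment().items():
    print(f"{name}: {value}")
```

If the interpreter path does not point inside your environment, the notebook is probably using a different kernel than the one you activated in the terminal.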
Load your API key from .env
Before we can talk to OpenAI’s models, we need to load the
secret API key we saved earlier in your .env file.
We’ll use the dotenv package to load environment
variables into memory securely — without hardcoding them into
our notebooks.
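from dotenv import load_dotenv
import os

# Read variables from the .env file into the process environment
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

# Confirm the key loaded without ever printing the key itself
if api_key:
    print("API key loaded successfully.")
else:
    print("API key not found. Check your .env file.")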
If the printout says your API key isn’t found, double-check that:
- Your .env file is in the same folder as your notebook
- The key name is OPENAI_API_KEY (no typos, no quotes)
- You restarted JupyterLab after adding the file (if needed)
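The checklist above can also be run as code. This is a rough, standard-library-only sketch rather than part of dotenv: check_env_text and check_env_file are hypothetical helpers, they only understand simple NAME=value lines, and they never print the key itself.

```python
from pathlib import Path


def check_env_text(text, key="OPENAI_API_KEY"):
    """Return a list of likely problems in .env-style text (empty list = looks fine)."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not NAME=value
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip() != key:
            continue
        value = value.strip()
        if not value:
            return [f"{key} is present but empty"]
        if value[0] in "'\"":
            return [f"{key} value is wrapped in quotes"]
        return []
    return [f"{key} is not defined"]


def check_env_file(path=".env", key="OPENAI_API_KEY"):
    """Run the same checks against a file, reporting a missing file first."""
    env_path = Path(path)
    if not env_path.is_file():
        return [f"No .env file found at {env_path.resolve()}"]
    return check_env_text(env_path.read_text(encoding="utf-8"), key)


print(check_env_file() or "No obvious problems found.")
```

Run it from the notebook folder. The "No .env file found" message shows the full path it looked at, which usually makes a misplaced file easy to spot.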
Reminder: Never share your actual API key or
commit it to Git. That’s why we use .env files.
Set up the OpenAI client
Now that we’ve loaded the API key, let’s initialize the OpenAI
client. We’ll do this using the official
openai Python package, which we already installed
in Part 1.
Here’s a simple setup that gets us ready to call any OpenAI model:
import openai

openai.api_key = api_key

# Test call to check everything is working
models = openai.models.list()
print("OpenAI client ready. Available models:")
for m in models.data[:5]:
    print("-", m.id)
This will list a few available models from your OpenAI account. If this step errors out, it usually means your API key is invalid or not loading properly.
Tip: Don’t worry if your list looks different — available models depend on your account and API plan.
Scrape a sample FAQ webpage
Let’s say your company has a Help Center with FAQ articles like
https://example.com/help/returns. In this section,
we’ll scrape the content from a page like that and clean it up
for the LLM.
We’ll use requests to fetch the page and
BeautifulSoup to parse it. Here's a basic utility
function with minimal error handling:
import requests
from bs4 import BeautifulSoup

def scrape_faq(url):
    """Fetch a page and return its title and visible text."""
    try:
        headers = {"User-Agent": "Mozilla/5.0"}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an error for bad responses (4xx/5xx)
        soup = BeautifulSoup(response.content, "html.parser")

        # Remove unwanted tags
        for tag in soup(["script", "style", "nav", "footer", "form"]):
            tag.decompose()

        # .string is None when <title> is empty or contains nested tags
        title = soup.title.string if soup.title and soup.title.string else "No title found"
        body = soup.get_text(separator="\n").strip()
        return {
            "title": title,
            "text": body
        }
    except requests.exceptions.RequestException as e:
        print(f"⚠️ Failed to fetch {url}: {e}")
        return {"title": "Error", "text": ""}
Pro Tip: This version includes basic error handling for common issues like bad URLs, timeouts, or server errors — helpful when working with real websites.
Now let’s try it on a sample page:
faq = scrape_faq("https://example.com/help/returns")
print("Title:", faq["title"][:80])
print("Text preview:", faq["text"][:500], "...")
Tip: Use any FAQ page with mostly clean HTML. Try docs.shopify.com, help.zendesk.com, or even your internal pages (if accessible).
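Because scrape_faq returns an empty "text" field when a request fails, it helps to check the result before moving on. Here is a minimal sketch of such a guard. has_usable_content is a hypothetical helper, the 200-character threshold is an arbitrary assumption, and the two dictionaries are stand-ins so the example runs without a network call.

```python
def has_usable_content(faq, min_chars=200):
    """Return True when a scrape result looks worth cleaning and sending on."""
    text = faq.get("text", "").strip()
    # scrape_faq signals failure with an empty "text" field
    if not text:
        return False
    # Very short pages are often error pages or login walls (threshold is a guess)
    return len(text) >= min_chars


# Stand-in results shaped like the dictionaries scrape_faq returns
failed = {"title": "Error", "text": ""}
ok = {"title": "Returns", "text": "You can return most items within 30 days. " * 10}

print(has_usable_content(failed))  # False
print(has_usable_content(ok))      # True
```

A check like this keeps an empty or near-empty page from quietly reaching the model later on.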
Clean and display the extracted text
Now that we’ve scraped the FAQ page, let’s do some final cleanup before we pass it to a model. Long-form text like this often includes empty lines, navigation leftovers, or formatting glitches.
Let’s normalize it a bit and see what we’re working with:
import re

def clean_text(text):
    lines = [line.strip() for line in text.split("\n")]
    non_empty = [line for line in lines if line]
    collapsed = " ".join(non_empty)
    return re.sub(r"\s+", " ", collapsed).strip()

cleaned_text = clean_text(faq["text"])
print("Cleaned text preview:")
print(cleaned_text[:1000])
Busy Builder Note: This version removes empty lines and collapses extra whitespace — a good enough cleanup for most LLM inputs. If you ever work with super messy HTML or PDFs, you can always layer in more normalization later.
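To see what that cleanup does in practice, here is a small self-contained run. The function is repeated so the snippet works on its own, and the raw string is a made-up sample that mimics the kind of output get_text() tends to produce.

```python
import re


def clean_text(text):
    """Drop empty lines and collapse all whitespace runs into single spaces."""
    lines = [line.strip() for line in text.split("\n")]
    non_empty = [line for line in lines if line]
    collapsed = " ".join(non_empty)
    return re.sub(r"\s+", " ", collapsed).strip()


# Made-up sample with blank lines, padding, and a stray tab
raw = "\n\n  Returns   Policy \n\n\nYou have 30 days\tto return an item.\n   \n"

print(repr(clean_text(raw)))
# 'Returns Policy You have 30 days to return an item.'
```

Note the trade-off: everything ends up on one line, so paragraph breaks are lost. That is usually fine for summarization, but worth knowing if you later need to keep section structure.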
This gives you a clean, readable version of the FAQ content — just the kind of input LLMs work well with.
Tip: Keep your input focused. Remove headers, navs, and non-content elements before sending anything to a model — it improves quality and reduces token cost.
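One simple way to keep token cost in check is to cap the input length before sending it. The sketch below is only an approximation: truncate_to_budget is a hypothetical helper, and the 4-characters-per-token figure is a common rule of thumb for English text, not an exact count. Exact numbers need the model's own tokenizer.

```python
def truncate_to_budget(text, max_tokens=3000, chars_per_token=4):
    """Trim text to a rough token budget, cutting at a word boundary."""
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Back up to the last space so we do not cut a word in half
    last_space = clipped.rfind(" ")
    if last_space > 0:
        clipped = clipped[:last_space]
    return clipped


sample = "word " * 50  # 250 characters of stand-in text
short = truncate_to_budget(sample, max_tokens=10)
print(len(sample), len(short))  # 250 39
```

Cutting from the end is the bluntest option. It works best for FAQ pages where the key content comes first.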
What’s next: our first LLM call
We now have everything we need:
- A working dev environment
- API key securely loaded
- Cleaned FAQ content from a real webpage
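The intro mentioned structuring a reusable class for this project. One possible shape, sketched under assumptions, bundles the scraped page and its cleaned text into a single object. FAQPage and its field names are hypothetical, and the values below are stand-ins for what scrape_faq would return.

```python
import re
from dataclasses import dataclass


@dataclass
class FAQPage:
    """One scraped help-center page, kept together with its source URL."""
    url: str
    title: str
    raw_text: str

    @property
    def cleaned_text(self):
        # Same normalization idea as clean_text: collapse all whitespace runs
        return re.sub(r"\s+", " ", self.raw_text).strip()

    def preview(self, length=80):
        """Return the first `length` characters of the cleaned text."""
        return self.cleaned_text[:length]


# Stand-in values; in the notebook these would come from scrape_faq(url)
page = FAQPage(
    url="https://example.com/help/returns",
    title="Returns",
    raw_text="  Returns \n\n You have 30 days\n to return an item. ",
)
print(page.title, "|", page.preview())
```

Keeping the URL next to the text makes it easier to trace a generated answer back to its source page.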
In Part 3, we’ll send this content to a real LLM (like GPT-4 or GPT-3.5) and ask it to generate a useful summary or build an FAQ-style response.
This will be our first true LLM interaction — and we’ll also introduce:
- Prompt engineering basics
- Token limits and formatting
- Generating output you can use in real workflows
Let’s keep building.
Ready for action: If everything in your notebook worked till now, you’re fully ready to talk to the model in Part 3. Let’s go.