TL;DR
- Step-by-step walkthrough to build an OpenAI chatbot with real examples
- Common pitfalls to avoid, saving hours of trial and error
- Works with free tools; no prior experience required
OpenAI’s ecosystem in 2026 is built around Assistants, a first-class abstraction that packages models, tools, instructions, and memory into a single unit. Below is a practical guide that walks you through every step—from creating your first Assistant to wiring it into an end-to-end workflow—complete with code snippets, FAQs, and tips that reflect the current state of the platform.
1. Before You Start: Understand the 2026 Contract
In 2026 the OpenAI API is largely declarative: you describe what you want, not how to achieve it.
| Concept | 2026 Abstraction | What You Provide |
|---|---|---|
| Model | model string ("gpt-5", "o3-mini") | Instruction set & temperature |
| Tools | tools array (code interpreter, function calls, file search, web search) | JSON schema & Python functions |
| Memory | vector_store + thread | File IDs, chunking strategy, retention rules |
| Prompt | instructions | System-level persona, tone, guardrails |
| State | thread | Conversation history & metadata |
Key changes from 2024:
- No more “chat completion” endpoint: everything is an Assistant run against a Thread.
- Persistent threads are opt-in; transient conversations are the default.
- Code interpreter is now a first-class tool with built-in sandboxing (Python ≥3.11, no network).
- File search (vector_store) is vector-only; hybrid BM25 is deprecated.
- Rate limits are per-org, not per-key; burst vs. steady-state is measured in tokens/sec.
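If limits are expressed as a steady tokens/sec rate plus a burst allowance, a client-side token bucket is one way to stay under them. The sketch below is illustrative only: the class name, the rate, and the burst size are assumptions, not values published by OpenAI, and the caller passes in the current time so the behavior is easy to reason about.

```python
class TokenBucket:
    """Client-side throttle: refills at rate_per_s, holds at most burst tokens."""

    def __init__(self, rate_per_s: float, burst: float, now: float = 0.0):
        self.rate_per_s = rate_per_s  # assumed steady-state allowance (tokens/sec)
        self.burst = burst            # assumed burst allowance (bucket capacity)
        self.available = burst        # start with a full bucket
        self.updated = now

    def try_consume(self, tokens: float, now: float) -> bool:
        """Return True and deduct tokens if the request fits, else False."""
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, never beyond the burst capacity.
        self.available = min(self.burst, self.available + elapsed * self.rate_per_s)
        self.updated = now
        if tokens <= self.available:
            self.available -= tokens
            return True
        return False
```

In practice you would pass time.monotonic() as now and wait or queue the request whenever try_consume returns False.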
2. Step-by-Step: Create & Run an Assistant
2.1 Create the Assistant
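import os

from openai import OpenAI

# Read the key from the environment rather than hard-coding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

assistant = client.beta.assistants.create(
    name="CodeReviewer",
    instructions="You are a senior Python engineer. Review PRs for style, safety, and performance.",
    model="gpt-5",
    tools=[
        {"type": "code_interpreter"},
        {"type": "file_search"},
    ],
    # Vector stores attach through tool_resources, matching §3.2.
    tool_resources={"file_search": {"vector_store_ids": ["vs_abc123"]}},
    temperature=0.2,
)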
- model → pick the smartest model you can afford (gpt-5 ≥ o4-mini).
- tools → order matters; code interpreter runs before file search.
- vector_store_ids → attaches a pre-created vector store (see §3).
2.2 Create a Thread
Threads are ephemeral by default:
thread = client.beta.threads.create(
    messages=[
        {
            "role": "user",
            "content": "Review this PR: https://github.com/.../pull/123"
        }
    ]
)
If you need persistence, set metadata={"retention": "30d"} and store the thread_id in your DB.
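Storing the thread_id can be as small as a one-table mapping from your own user identifier to the thread. This is a minimal sketch using SQLite from the standard library; the table name, column names, and helper functions are hypothetical, and a production system would likely use whatever database it already has.

```python
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) user_threads table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_threads ("
        "user_id TEXT PRIMARY KEY, thread_id TEXT NOT NULL)"
    )
    return conn


def save_thread(conn: sqlite3.Connection, user_id: str, thread_id: str) -> None:
    """Remember the thread for a user, replacing any earlier one."""
    conn.execute(
        "INSERT OR REPLACE INTO user_threads (user_id, thread_id) VALUES (?, ?)",
        (user_id, thread_id),
    )
    conn.commit()


def load_thread(conn: sqlite3.Connection, user_id: str) -> Optional[str]:
    """Return the stored thread_id, or None if the user has no thread yet."""
    row = conn.execute(
        "SELECT thread_id FROM user_threads WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None
```

A request handler would call load_thread first and only create a new thread when it returns None.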
2.3 Run the Assistant
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    instructions="Focus on type hints and exception safety."
)
Monitor status:
status = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
if status.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)
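A single status check only succeeds if the run has already finished, so in practice the check is repeated until the run reaches a final state. The helper below is a sketch of that loop with a timeout. It takes any callable that returns the current status string, so it makes no assumption about the client; the set of final states is an assumption beyond "completed", which is the only one the text above names.

```python
import time
from typing import Callable

# Assumed final states; only "completed" is named in this guide.
DONE_STATES = {"completed", "failed", "cancelled", "expired"}


def wait_for_run(
    fetch_status: Callable[[], str],
    timeout_s: float = 60.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status until it returns a final state or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in DONE_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run still '{status}' after {timeout_s} seconds")
        sleep(interval_s)
```

With the objects from this section, the callable could be lambda: client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id).status.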
3. Building a Knowledge Base with Vector Stores
3.1 Upload & Chunk Files
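vector_store = client.beta.vector_stores.create(name="PythonStyleGuide")
for path in ["pep8.md", "mypy.md"]:
    # Open each file in a context manager so the handle is closed after upload.
    with open(path, "rb") as fh:
        client.beta.vector_stores.files.upload(
            vector_store_id=vector_store.id,
            file=fh,
            # Static settings nest under "static"; the overlap value is an assumed example.
            chunking_strategy={
                "type": "static",
                "static": {"max_chunk_size_tokens": 800, "chunk_overlap_tokens": 400},
            },
        )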
- chunking_strategy defaults to 800 tokens; you can set max_chunk_size_tokens up to 4096.
- Supported formats: .txt, .pdf, .md, .docx, .pptx, .csv, .jsonl.
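To get a feel for what a fixed chunk size does to a document, here is a rough local sketch of static chunking. It is not how the service chunks files internally: the function name is hypothetical, the overlap parameter is an assumption, and the example operates on a plain list of tokens rather than a real tokenizer's output.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence, max_chunk_size: int = 800, overlap: int = 0
) -> List[list]:
    """Split tokens into windows of at most max_chunk_size, sharing overlap tokens."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be in [0, max_chunk_size)")
    step = max_chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_chunk_size]))
        if start + max_chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks
```

Running it on a sample file with sizes of 800 and 400 gives a quick sense of how many chunks each setting produces.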
3.2 Attach Vector Store to Assistant
assistant = client.beta.assistants.update(
    assistant_id=assistant.id,
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}}
)
- Caching: OpenAI caches embeddings for 30 days; update vector_store if files change.
- Hybrid search: Still Beta; use ranking_options={"ranker": "default"} for better precision.
4. Adding Custom Tools (Function Calling)
4.1 Define the Schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "fetch_github_pr",
            "description": "Fetch a GitHub PR diff.",
            "parameters": {
                "type": "object",
                "properties": {
                    "owner": {"type": "string"},
                    "repo": {"type": "string"},
                    "pr_number": {"type": "integer"}
                },
                "required": ["owner", "repo", "pr_number"]
            }
        }
    }
]
4.2 Register the Function Handler
import httpx

def fetch_github_pr(owner: str, repo: str, pr_number: int) -> str:
    """Return the raw diff for a GitHub pull request."""
    diff_url = f"https://patch-diff.githubusercontent.com/raw/{owner}/{repo}/pull/{pr_number}.diff"
    response = httpx.get(diff_url, follow_redirects=True, timeout=30.0)
    response.raise_for_status()  # Surface HTTP errors instead of returning an error page as the diff.
    return response.text
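Once there is more than one function, it helps to route tool calls by name instead of hard-coding a single handler. The following is a minimal sketch of such a registry; register_tool, dispatch_tool_call, and the echo_pr stand-in are hypothetical names introduced for illustration, and echo_pr exists only so the example runs without network access.

```python
import json
from typing import Any, Callable, Dict

TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register_tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Decorator: make a function callable by its name from a tool call."""
    TOOL_REGISTRY[func.__name__] = func
    return func


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Parse the JSON arguments, run the matching handler, return a string output."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        result = handler(**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or missing/extra arguments; report it back as tool output.
        return json.dumps({"error": str(exc)})
    return result if isinstance(result, str) else json.dumps(result)


@register_tool
def echo_pr(owner: str, repo: str, pr_number: int) -> str:
    # Offline stand-in for fetch_github_pr with the same parameters.
    return f"{owner}/{repo}#{pr_number}"
```

Returning errors as output rather than raising lets the model see what went wrong and retry with corrected arguments. Note that catching TypeError also hides TypeErrors raised inside a handler, which may or may not be what you want.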
4.3 Attach & Run
import json

assistant = client.beta.assistants.update(
    assistant_id=assistant.id,
    # Vector stores stay attached through tool_resources (§3.2).
    tools=[*tools, {"type": "file_search"}],
)
# Start the run and wait until it finishes or pauses for tool output.
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
# A run that needs a function result pauses with status "requires_action".
while run.status == "requires_action":
    tool_outputs = []
    for tool_call in run.required_action.submit_tool_outputs.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = fetch_github_pr(**args)
        tool_outputs.append({"tool_call_id": tool_call.id, "output": result})
    # Submit all outputs in one call, then poll until the next pause or a final state.
    run = client.beta.threads.runs.submit_tool_outputs_and_poll(
        thread_id=thread.id,
        run_id=run.id,
        tool_outputs=tool_outputs,
    )
5. Streaming & Real-Time UX
5.1 Streaming Messages
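def stream_reply(thread_id: str, assistant_id: str):
    """Yield text fragments of the assistant's reply as they arrive."""
    # Streaming is exposed on runs; opening the stream starts a run on the thread.
    with client.beta.threads.runs.stream(
        thread_id=thread_id,
        assistant_id=assistant_id,
    ) as stream:
        for text in stream.text_deltas:
            yield text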
5.2 Partial Results
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    stream=True,
    truncation_strategy={"type": "auto"}
)
- truncation_strategy defaults to 16k tokens; set max_prompt_tokens to control cost.
- LLM latency in 2026 is ~250–400 ms for gpt-5, ~400–600 ms for o4-mini.
6. Cost Control & Optimization
| Metric | 2026 Rate |
|---|---|
| Input tokens | $0.03 / 1M (cached); $0.12 / 1M (fresh) |
| Output tokens | $0.06 / 1M |
| Code interpreter | $0.08 / 1M tokens + $0.03 / minute compute |
| File search | $0.05 / 1k queries |
| Vector store | $0.10 / GB / month |
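As a worked example of the token rates in the table, the sketch below estimates the token cost of a workload. It covers only the input and output token rows; code interpreter, file search, and storage charges are left out, and the function name is hypothetical.

```python
# Per-million-token rates copied from the table above (USD).
RATE_INPUT_FRESH = 0.12
RATE_INPUT_CACHED = 0.03
RATE_OUTPUT = 0.06


def estimate_token_cost(fresh_input: int, cached_input: int, output: int) -> float:
    """Return the estimated token cost in USD for the given token counts."""
    total = (
        fresh_input * RATE_INPUT_FRESH
        + cached_input * RATE_INPUT_CACHED
        + output * RATE_OUTPUT
    )
    return total / 1_000_000


# 2M fresh input + 1M cached input + 0.5M output tokens:
# 0.24 + 0.03 + 0.03 = 0.30 USD
print(f"${estimate_token_cost(2_000_000, 1_000_000, 500_000):.2f}")
```

At these rates the cached share of input costs a quarter of the fresh rate, which is why the caching advice in this section matters for long, repeated instructions.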
6.1 Cache Prompts
cached_prompt = client.beta.prompts.create(
    input="Review this PR for style and safety.",
    model="gpt-5",
    temperature=0.2
)
- Cache lasts 7 days; use cached_prompt_id instead of instructions.
6.2 Token Budgeting
thread = client.beta.threads.create(
    messages=[...],
    tool_resources={"file_search": {"vector_store_ids": [...]}}
)
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    max_prompt_tokens=12_000,
    max_completion_tokens=4_000
)
- max_prompt_tokens includes messages + tool context.
- Guardrails: Use the moderation tool to flag unsafe content before streaming.
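If you also want to enforce a prompt budget on your side before sending history, one simple approach is to keep the most recent messages that fit. The sketch below illustrates that idea only; it is not the platform's truncation logic, the function names are hypothetical, and counting whitespace-separated words is a crude stand-in for a real tokenizer.

```python
from typing import Dict, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def fit_to_budget(messages: List[Dict[str, str]], max_prompt_tokens: int) -> List[Dict[str, str]]:
    """Keep the newest messages whose combined size fits the budget, in original order."""
    kept: List[Dict[str, str]] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if used + cost > max_prompt_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Stopping at the first message that does not fit keeps the retained history contiguous, so the model never sees a conversation with a gap in the middle.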
7. Security & Compliance
7.1 Sandboxing Code Interpreter
- No network: httpx calls raise RuntimeError.
- No subprocess: os.system, subprocess are blocked.
- Allowed modules: math, random, numpy, pandas, matplotlib, PIL.
- Timeout: 30 seconds per run.
7.2 File Upload Restrictions
- Max file size: 100 MB.
- MIME types: text/*, application/pdf, application/vnd.openxmlformats-officedocument.*.
- DLP: Sensitive PII (SSN, credit cards) triggers auto-redaction unless you opt out via metadata={"redact": false}.
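Checking these limits before uploading avoids a wasted round trip. The sketch below applies the size cap and MIME patterns listed above to a file's size and declared type. The function name is hypothetical, and treating 100 MB as 100,000,000 bytes is an assumption, since the list does not say which definition of megabyte applies.

```python
from fnmatch import fnmatchcase
from typing import List

MAX_FILE_BYTES = 100 * 1000 * 1000  # assumed meaning of "100 MB"
ALLOWED_MIME_PATTERNS = (
    "text/*",
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.*",
)


def check_upload(size_bytes: int, mime_type: str) -> List[str]:
    """Return a list of problems; an empty list means the file passes both checks."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES}")
    if not any(fnmatchcase(mime_type, pattern) for pattern in ALLOWED_MIME_PATTERNS):
        problems.append(f"MIME type {mime_type!r} is not allowed")
    return problems
```

Returning every problem at once, rather than failing on the first, gives users a complete picture of what to fix.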
7.3 Data Residency
- EU: location="eu" flag pins threads & vector stores to Frankfurt.
- US: Default; no flag needed.
- Retention: 30 days maximum for transient threads; 1 year for persistent threads unless overridden.
8. Deployment Patterns
8.1 Serverless Worker (Cloudflare)
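# The producer binding name is an assumed placeholder.
[[queues.producers]]
queue = "assistant-runs"
binding = "ASSISTANT_RUNS"

[[queues.consumers]]
queue = "assistant-runs"
max_batch_size = 10
max_retries = 3

[[r2_buckets]]
binding = "BUCKET"
bucket_name = "assistant-files"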
Worker code:
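export default {
  async queue(batch, env) {
    // env.OPENAI is assumed to expose a preconfigured OpenAI client.
    const { client } = env.OPENAI;
    // A queue batch exposes its items on batch.messages; the payload is in msg.body.
    for (const msg of batch.messages) {
      await client.beta.threads.runs.create(msg.body.threadId, {
        assistant_id: env.ASSISTANT_ID
      });
      msg.ack(); // Acknowledge so the message is not redelivered.
    }
  }
};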
8.2 Kubernetes Sidecar (On-Prem)
containers:
  - name: assistant-proxy
    image: openai/assistant-proxy:v26
    env:
      - name: OPENAI_API_KEY
        valueFrom:
          secretKeyRef:
            name: openai-creds
            key: key
    ports:
      - containerPort: 8080
Proxy handles token budgeting, retry logic, and observability.
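The retry logic mentioned here is commonly implemented as exponential backoff with jitter. The sketch below shows that general technique, not the proxy's actual behavior; the function name, default delays, and retry count are assumptions, and the sleep function is injectable so the logic can be exercised without waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_retries: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Run call(); on failure wait base * 2**attempt plus jitter, up to max_retries retries."""
    attempt = 0
    while True:
        try:
            return call()
        except retry_on:
            if attempt >= max_retries:
                raise  # out of retries: surface the last error
            delay = base_delay_s * (2 ** attempt)
            # Random jitter spreads out retries from many clients.
            sleep(delay + random.uniform(0, delay))
            attempt += 1
```

Narrowing retry_on to transient errors such as timeouts and rate-limit responses keeps the helper from retrying requests that can never succeed.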
9. Common Pitfalls & FAQs
9.1 “Assistant not calling tools”
- Check order: Code interpreter must come before file search in tools.
- Verify schema: The arguments the model returns must match the JSON schema you declared for the tool.
- Debug: client.beta.threads.runs.steps.list(thread_id, run_id) shows tool call attempts.
9.2 “Vector store not returning results”
- Chunk size: 800 tokens is too large for code; try 400.
- Embedding model: text-embedding-3-small is the default; switch to text-embedding-3-large for better recall.
- Metadata filters: Use metadata={"language": "python"} when uploading files.
9.3 “Thread too large”
- Truncate: Set truncation_strategy={"type": "last_messages", "last_messages": 10} to keep only the last 10 messages.
- Archive: Move old threads to cold storage (S3, GCS) via vector_store export.
9.4 “Cost overruns”
- Set org-wide spend limit in the dashboard.
- Use max_completion_tokens to cap output.
- Cache prompts (client.beta.prompts) to avoid regenerating instructions.
9.5 “Moderation false positives”
- Whitelist: Add benign terms to metadata={"whitelist": ["jira", "ticket"]}.
- Threshold: Lower moderation.model="text-moderation-007" threshold in assistants.create.
10. What’s Next (Roadmap Hints)
- Multi-modal tools: Vision & audio tools in beta.
- Agents: Hierarchical assistants that can spawn sub-assistants.
- Fine-tuning: Assistants can now be fine-tuned on domain-specific data via client.fine_tuning (private beta).
- Plug-ins: OAuth2 connectors for Jira, GitHub, Slack (GA in Q3).
- On-prem: Self-hosted Assistants SDK for air-gapped environments.
OpenAI’s Assistants in 2026 abstract away the gritty details of prompt engineering, token counting, and tool orchestration, letting you focus on the intent of your workflow. Start small—one Assistant, one thread, one tool—and iterate. The new primitives are declarative, observable, and cost-capped, which makes it possible to ship production-grade AI helpers without becoming an LLM expert overnight.
