Context
I’m building a two-tier AI consultation system for my freelance website. A free bot (Claude Haiku) qualifies leads in 8 exchanges; a premium bot (Claude Opus) delivers in-depth technical analysis for 200 PLN (~$50). The entire system — from empty repo to working product — was built in one day.
This article isn’t about the system architecture (I wrote about that here). This article is about what you don’t see in technical documentation: the iterative process of refining prompts, an experiment that failed, and the lessons learned.
Phase 1: MVP works, but the bot has identity issues
The first prompt was simple: “You are Artur Mrowicki’s assistant, conduct BANT qualification.” The bot worked — asked questions, collected data. But the first tests revealed something unpleasant:
The bot said “I build projects”, “my architecture”, “I can help you.” Clients were talking to a bot while believing they were talking to me. This isn’t a technical problem — it’s a trust problem. When the client realizes, they feel deceived.
Commit c8c8bee: bot identity — separate AI entity, never impersonate Artur.
I added the first identity rule: “You are a SEPARATE AI entity. NEVER speak in first person as Artur.” Problem solved. But this was just the beginning.
Phase 2: Fix cascade — one test, five bugs
I ran a full conversation test, pretending to be a demanding technical client. Result: the bot collected data, but along the way:
- Accepted email in chat — wrote “I’m registering your email.” No such feature exists. The contact form only appears after the conversation.
- Promised response times — “Artur will get back to you tonight.” The only timeframe I can promise is 48 hours.
- Invented system features — mentioned “tickets”, “request queue”, “async chat with Artur.” None of these exist.
- Mixed languages — “deadline”, “scope”, “track record”, “low-overhead approach.” In a Polish conversation, this sounds unprofessional.
- Invented experience — stated specific technologies that “Artur has worked with” without having that information.
Five problems, five new rules. The prompt grew from 6 to 15 rules.
Commit 003a9c0: harden bot prompts — no fake promises, no email acceptance, no invented features.
Each rule was a response to a specific bug. Each made sense in isolation. The problem? 15 rules is a lot for Haiku — the model started losing context and applied rules inconsistently.
Phase 3: Summary problems
The bot ran conversations correctly, but the JSON summary — generated after completion — had its own issues:
- Hallucinated SLA: the bot wrote “response within 24 hours” though nobody agreed to that in the conversation.
- Missing technologies: the client mentioned Istio and mTLS, but the summary omitted them.
- Polglish in summary: “low-overhead approach”, “direct communication” — even though the chat prompt enforced Polish, the summary prompt didn’t.
- Client email leaked: the client shared their email in chat (despite the rule forbidding it), and the summary extracted it and sent it back to the client. The email should only be visible to me.
Commit ce022df: harden summary prompts — no hallucinated SLA, no polglish, complete tech extraction.
I added four new rules to the summary prompt: language enforcement, no inventing agreements, complete technology extraction, internal _client_contact field. This worked. But it exposed a deeper problem.
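The internal-field idea can be sketched in a few lines (a sketch: only the `_client_contact` field name comes from the commit; the other fields and the `clientView` helper are illustrative):

```typescript
// Shape of the bot's JSON summary. _client_contact is internal-only:
// it must reach me, but never be echoed back to the client.
type LeadSummary = {
  problem: string;
  technologies: string[]; // complete extraction, e.g. Istio, mTLS
  budget?: string;
  _client_contact?: string; // internal field, stripped before display
};

// Hypothetical helper: returns the client-facing view of a summary,
// with the internal contact field removed via rest destructuring.
function clientView(summary: LeadSummary) {
  const { _client_contact, ...visible } = summary;
  return visible;
}
```

The point is structural: the leak happened because one summary object served two audiences, so the fix is to derive the client-facing view instead of trusting the prompt alone.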
Phase 4: The Gemini experiment — and why it failed
Haiku had one fundamental problem: Polish language quality. It wasn’t about grammar — sentences were correct. It was about naturalness. The bot sounded like it was translated from English. It inserted anglicisms despite a 15-item banned word list. Clients notice this.
The idea: what if I swapped Haiku for Gemini 3 Flash? Google trained models on a larger Polish corpus. Maybe Polish quality would be better?
Commit fe01dd1: switch free chat from Claude Haiku to Gemini 3 Flash.
Swapping the model required:
- New SDK (@google/genai)
- Different message format (role: 'model' instead of 'assistant', parts: [{ text }] instead of content)
- Different API (ai.models.generateContent() instead of anthropic.messages.create())
- Thinking configuration (thinkingConfig: { thinkingLevel: 'low' })
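The message-format difference can be sketched as a small adapter (a sketch; `toGeminiContents` is a hypothetical helper name, and the type shapes mirror only the differences listed above):

```typescript
// Anthropic-style message, as used by the original Haiku chat.
type AnthropicMessage = { role: "user" | "assistant"; content: string };

// Gemini-style content: role 'model' instead of 'assistant',
// and parts: [{ text }] instead of a plain content string.
type GeminiContent = { role: "user" | "model"; parts: { text: string }[] };

// Hypothetical adapter: translates the stored conversation history
// into the shape Gemini's generateContent() expects.
function toGeminiContents(history: AnthropicMessage[]): GeminiContent[] {
  return history.map((m) => ({
    role: m.role === "assistant" ? "model" : "user",
    parts: [{ text: m.content }],
  }));
}
```

In the actual call path, the converted history would then be passed to ai.models.generateContent() together with the thinkingConfig shown above.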
The first test was promising. Gemini’s Polish was more natural, flowing, without anglicisms. But new problems appeared.
Iteration 1: Gemini invents Artur’s experience
Gemini told a client: “Artur’s diagnostic approach for bottlenecks focuses on code-free observability using eBPF-based tools.”
Sounds professional. Problem? I have no idea if Artur has ever used eBPF. Gemini made it up based on conversation context, not facts.
Commit 97fac69: strengthen anti-hallucination rules — no invented tools/methods.
I added a rule: “If you don’t know whether Artur worked with a technology, say directly: I don’t have that information.”
Iteration 2: Gemini becomes too passive
After adding the anti-hallucination rule, the bot answered everything with: “I don’t have that information. Artur will answer personally.” Out of 8 exchanges, 6 ended with the same sentence. The bot became useless — the client learned nothing, and I received empty summaries.
Commit 5b8b003: make bot actively dig into client problems instead of repeating “I don’t know”.
I added an active digging instruction: instead of saying “I don’t know”, the bot should ask deeper about the client’s problem. But this created another issue.
Iteration 3: Gemini closes the conversation after its own question
The bot asked a question, the client hadn’t answered yet, and the bot closed the conversation after one exchange. The [COMPLETE] marker appeared too early, the contact form popped up after the second message.
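One way to guard against this outside the prompt can be sketched as follows (a sketch, not the code from the repo; the guard rule and the minimum-depth threshold are my own illustration):

```typescript
type Turn = { role: "user" | "assistant"; text: string };

// Hypothetical guard: the [COMPLETE] marker is honored only if the
// bot is not waiting on its own unanswered question and the
// conversation has reached a minimum depth. Otherwise it is ignored.
function shouldComplete(
  history: Turn[],
  reply: string,
  minExchanges = 3 // illustrative threshold, not from the repo
): boolean {
  if (!reply.includes("[COMPLETE]")) return false;
  const userTurns = history.filter((t) => t.role === "user").length;
  if (userTurns < minExchanges) return false;
  // The turn preceding this bot reply must be the client's answer,
  // not the bot's own previous question.
  const last = history[history.length - 1];
  return last !== undefined && last.role === "user";
}
```

The design choice here is to treat the marker as a proposal from the model that the application layer can veto, rather than as an instruction the prompt alone must get right.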
Iteration 4: Too rigid vs. too loose
After three rounds of fixes, the bot was either too rigid (refused to answer technical questions) or too loose (invented experience). I couldn’t find the sweet spot within the prompt.
Commit fcd0937: balanced bot approach — share industry knowledge, don’t attribute to Artur.
An attempted compromise: the bot can share industry knowledge (“typically one would use X”), but can’t attribute it to Artur. Worked in theory. In practice, Gemini still tended to create narratives where “Artur typically uses” sounded like “Artur does.”
Phase 5: Revert — accepting limitations
After four iterations, the Gemini prompt had more rules than the Haiku prompt. Each fix solved one problem but created another. Classic overcorrection loop:
too loose → add rule → too rigid → loosen rule → too loose (elsewhere)
Decision: full revert to pre-Gemini state.
Commit 00e293e: revert — restore free chat to Claude Haiku, remove Gemini.
I removed @google/genai from dependencies, restored chat-free.ts to Haiku, restored original prompts. 4 commits of work — thrown away.
Why the revert was the right call
Haiku has worse Polish. But:
- Better rule adherence (fewer hallucinations)
- Doesn’t invent experience
- Is predictable
Gemini has better Polish. But:
- Hallucinates experience despite rules forbidding it
- Is “creative” where it should be rigid
- Requires more rules to behave correctly — and more rules = worse adherence
Worse Polish is a smaller problem than invented experience. A client will forgive an anglicism. They won’t forgive a lie about competencies.
Phase 6: Fewer rules, more freedom
After the revert, I looked at the prompt with fresh eyes. 15 rules — many overlapped or were too verbose. Instead of adding more, I trimmed:
Commit a6477b4: refactor — slim down free bot prompts — 15 rules to 10.
Key changes:
- Merged overlapping rules (e.g., “don’t guess” + “don’t invent experience” → one rule about technical freedom with a disclaimer)
- Removed overly specific scripts (verbatim low-budget response, list of 30 banned English words)
- Added a disclaimer: “This conversation aims to accelerate the project process. Technical findings will be verified by Artur.”
That disclaimer changed everything. The bot could now say “typically in such cases one would use X” without fear of the client treating it as a promise. Because the disclaimer clearly stated: this is a preliminary conversation, Artur will verify.
Three new principles instead of fifteen old rules
1. Budget: ask once, then move on. Instead of pushing for a specific number, accept “expert rates” and continue. A client who refuses to share budget after the first ask won’t share it after the third.
2. Urgency: take seriously, but don’t promise. Instead of ignoring “production is on fire” and responding with standard 48h, the bot marks the priority and says: “Artur sees the priority and sometimes responds faster in emergency situations.”
3. Technical freedom with disclaimer. The bot can share knowledge, suggest directions, name tools. But it cannot invent specific projects by Artur.
The prompt shrank by 40%. Haiku followed it better.
Phase 7: Urgent form with Telegram
Last lesson from testing: a bot that recognizes urgency but does nothing about it frustrates the client. “I see it’s urgent. Artur will respond within 48 hours.” — that sounds like mockery when someone’s production is down.
Solution: two completion markers instead of one.
Commit e9af61a: urgent contact form with Telegram notification.
- The bot uses [COMPLETE_URGENT] instead of [COMPLETE] when it recognizes an emergency
- The frontend shows a different form: email + phone (both required), red button
- The form sends a Telegram notification with the full summary
- Message: “Artur usually responds within an hour. If unavailable — standard response within 48h.”
The bot doesn’t promise faster response. It says “usually” — because sometimes I genuinely respond faster to urgent matters. And Telegram means I find out immediately, not 2 hours later when I check email.
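The two-marker routing can be sketched like this (a sketch; `routeCompletion` and the form labels are hypothetical, and the server-side Telegram call is deliberately omitted):

```typescript
type Completion =
  | { done: false }
  | { done: true; urgent: boolean; form: "standard" | "urgent" };

// Routes the bot's final reply: [COMPLETE_URGENT] triggers the
// email+phone urgent form (with a Telegram notification fired
// server-side), [COMPLETE] triggers the standard contact form.
function routeCompletion(reply: string): Completion {
  // Check the urgent marker first so the more specific case wins.
  if (reply.includes("[COMPLETE_URGENT]")) {
    return { done: true, urgent: true, form: "urgent" };
  }
  if (reply.includes("[COMPLETE]")) {
    return { done: true, urgent: false, form: "standard" };
  }
  return { done: false };
}
```

Keeping the urgency decision in the marker, and the notification in application code, means the prompt only has to recognize the emergency, not act on it.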
Process takeaways
1. Overcorrection is a real problem
Every rule in a prompt has a cost. Not just in tokens — in model adherence. 15 rules is too many. The model starts interpreting them selectively, merging them, ignoring them. 10 well-written rules work better than 15 verbose ones.
2. Switching models doesn’t fix prompt problems
Gemini had better Polish but required more rules to control hallucinations. Haiku had worse Polish but was more obedient. Switching models moved the problem from one place to another — it didn’t solve it.
3. A revert is not a failure
4 commits of work went to waste. But the knowledge from them didn’t. I now know that Gemini hallucinates experience, that overcorrection creates loops, that the disclaimer “findings will be verified” is better than a list of 15 prohibitions. I wouldn’t have that knowledge without the experiment.
4. Disclaimer > prohibition
“Don’t invent Artur’s experience” is a prohibition. The model tries to work around it. “Technical findings will be verified by Artur” is context. The model accepts it. Instead of saying “what not to do”, it’s better to say “what context you operate in.”
5. Features last, not first
The urgent form with Telegram was the last commit, not the first. If I’d started with it, I wouldn’t have known the bot needs technical freedom to even recognize urgency in the first place. Fix the fundamentals first (identity, rules, tone), then add features.
Timeline (git log)
1a99878 Initial commit — arturmrowicki.pl
4657ef3 feat: two-tier consultation system
c8c8bee fix: bot identity — never impersonate Artur
7dc3199 fix: no polglish, no guessing, Artur decides
003a9c0 fix: no fake promises, no email acceptance
ce022df fix: harden summary prompts
fe01dd1 feat: switch to Gemini 3 Flash ← experiment
97fac69 fix: anti-hallucination for Gemini ← iteration 1
5b8b003 fix: active digging, not "I don't know" ← iteration 2
fcd0937 fix: balanced approach ← iteration 3
0dcdc7f revert: restore prompts ← revert
00e293e revert: remove Gemini, back to Haiku ← full revert
a6477b4 refactor: 15 rules → 10, more freedom ← breakthrough
e9af61a feat: urgent form + Telegram ← feature
36 commits. 4 of them discarded. But the system that came out of this is better than if I’d gotten it right on the first try — because I know why every decision is the way it is.
Want to see how the bot works? Talk to it — it’s available 24/7.