I Trained a Chatbot to Talk Like Me

I wanted to create a chatbot that talks the way I do. Sometimes you just don't want to bother with sending messages, and you wonder whether it would be possible to replicate your own voice instead.

My first question was: where's the data? I aggregate my chats with Beeper, which is essentially a Matrix bridge over my various chat apps: Instagram, Discord, and WhatsApp. So the plan was to use the Beeper API to download my chats, then train some kind of LLM on top.

yo wanna grab food?
I don't even know where the nearest food place is 😳
Are you in Sydney?
Cuz if so I'm in Randwick 🥺🥺🥺🥺🥺🥺🥺🥺🥺🥺🥺
a different chat
have u seen the new dune movie
Yea it's amazing
Did you see 1890? (It's on Netflix right now)
another
hey barry
Heyy 😍😍😍😍😍❤️‍🔥
How u doin? 🥰🥰🥰🥰😍😍😍😍😘😘😍❤️‍🔥

The starting question was simple, even if the answer turned out not to be: could you actually clone a person's writing voice from casual DM data, without the result just sounding like ChatGPT wearing an emoji costume? I had two years of one-on-one chats exported from Beeper, which felt like a lot of data until I started cleaning it. After dropping URL-only messages, dropping emoji-only messages, capping how much each individual person could contribute so the dataset wasn't dominated by my four most active friends, and then a few rounds of artifact filtering, what I had left could fit in a small spreadsheet. Terrible.

Each row is a sliding window of conversation: some prior turns of context, then one of my replies as the target, with the context alternating between "them" (whoever I was chatting with) and "me". The pipeline drops anything that's too old, anything that's only a URL, anything that's only emoji, and a couple of Beeper export quirks I only noticed after the second training run, like the literal string "None" that prefixes a quote-reply when the parent message has gone missing, or the system events ("Couldn't load message", "started a call") that show up in the export as if they were real messages I'd typed. Once you spot one of those in your training data you start seeing them everywhere, and the dataset for version three had about ninety of them baked in before I caught it, which means I'd accidentally trained the model to occasionally claim that messages couldn't be loaded.
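For the curious, the windowing and filtering step looks roughly like this. This is a reconstruction from the quirks above, not the actual pipeline; the field names and regexes are my assumptions:

```python
import re

# Crude filters, reconstructed from the export quirks described above.
URL_ONLY = re.compile(r"^https?://\S+$")
NO_WORDS = re.compile(r"^[\W\s]+$")  # no word characters at all: emoji-only, bare punctuation
ARTIFACTS = ("Couldn't load message", "started a call")

def is_noise(text: str) -> bool:
    text = text.strip()
    return (
        not text
        or URL_ONLY.match(text) is not None
        or NO_WORDS.match(text) is not None
        or any(a in text for a in ARTIFACTS)
        or text.startswith("None")  # orphaned quote-reply prefix in the export
    )

def windows(messages, context_len=8):
    """Yield (context, target) rows where the target is one of my replies.
    `messages` is a chronological list of dicts with `sender` and `text`."""
    clean = [m for m in messages if not is_noise(m["text"])]
    for i, m in enumerate(clean):
        if m["sender"] != "me":
            continue
        context = clean[max(0, i - context_len):i]
        if context:  # a reply with no surviving context isn't a training row
            yield context, m
```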

The single biggest data decision was what to do with rapid-fire bursts: if I send five short messages within sixty seconds, do those become five training rows or one row with internal newlines? I went with one row, because that’s how I actually text, in short bursts rather than paragraphs, and the model needed to learn the shape. The downside, which I didn’t appreciate until much later, was that this also taught the model to send long bursts unprompted, producing six lines of follow-up to a one-line question.
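The collapse itself is simple. A sketch, assuming each message dict carries a `sender` and a unix timestamp `ts`:

```python
def collapse_bursts(messages, gap_seconds=60):
    """Merge consecutive messages from the same sender, sent within
    gap_seconds of each other, into one message with internal newlines."""
    merged = []
    for m in messages:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["sender"] == m["sender"]
            and m["ts"] - prev["ts"] <= gap_seconds
        ):
            prev["text"] += "\n" + m["text"]
            prev["ts"] = m["ts"]  # a burst chains as long as the gaps stay short
        else:
            merged.append(dict(m))
    return merged
```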

Version one was a reasonable starting point: a LoRA on the Instruct version of Qwen2.5-7B, all the data I had, no rebalancing. The model talked, but it also overfit pretty hard, with specific friends' inside jokes surfacing in unrelated replies as if it had memorized whole conversations. Version two kept the same recipe but balanced the data: capping each person at the same number of rows, dropping the low-volume ones, and filtering out short and emoji-only targets. It was better, but the voice was still off in some way I couldn't quite pin down.
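The v1 recipe was the standard peft setup, something like the sketch below; the rank and target modules are illustrative, not my exact config:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora = LoraConfig(
    r=32,                 # illustrative; this is the capacity v3 later shrank
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```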

For version three I tried two things at once: switching to a multi-turn chat format, where each window became alternating user and assistant turns instead of a flat continuation, and shrinking the LoRA. This was the version I shipped first, and also the version I hated.
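Concretely, each window became a message list rendered with the tokenizer's chat template, roughly like this (a sketch, not the actual code):

```python
def to_chat(context, target):
    """Turn a (context, target) row into alternating user/assistant turns."""
    turns = []
    for m in context:
        role = "assistant" if m["sender"] == "me" else "user"
        turns.append({"role": role, "content": m["text"]})
    turns.append({"role": "assistant", "content": target["text"]})
    return turns

# Rendered for training with the model's own chat template, e.g.
# tokenizer.apply_chat_template(to_chat(ctx, tgt), tokenize=False)
```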

v3
yo wanna grab food?
Sure
What are you cooking
hey barry
What's up / what you need? 😊

That second exchange is the one that gives the game away, because "What you need? 😊" is nothing I would ever type. What was happening underneath was that the model was reverting to assistant defaults whenever it had room to, and the chat format I'd switched to was making that easier instead of harder, because matching the exact shape of an assistant conversation was the strongest signal the base model knew.

The diagnosis took longer than it should have, partly because the failure mode was so emoji-heavy that it looked at first like a stylistic problem rather than a structural one. The Instruct version of the base model has been heavily trained to be a polite assistant, and the multi-turn format I was using during fine-tuning matched the exact shape the base model had seen during all of its assistant training, so my LoRA was teaching the model to be Barry while the base model kept trying to revert to being an assistant. I tested the theory by giving the model a fake warmup before each real prompt, just a few rounds of small talk first, and the outputs got noticeably more Barry-shaped, which suggested the model wasn't really broken so much as being pulled back to assistant defaults whenever a prompt looked like the start of an assistant conversation.
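The warmup was nothing clever, just canned small talk prepended to the message list; the exact lines below are illustrative:

```python
# Canned small talk prepended to every real prompt, so the conversation
# no longer looks like the opening of an assistant chat.
WARMUP = [
    {"role": "user", "content": "yo"},
    {"role": "assistant", "content": "yo"},
    {"role": "user", "content": "hru"},
    {"role": "assistant", "content": "not bad u"},
]

def with_warmup(prompt: str):
    return WARMUP + [{"role": "user", "content": prompt}]
```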

v3, no warmup
yo wanna grab food?
Sure
What are you cooking
v3, with fake warmup
yo
yo
hru
not bad u
good good
yo wanna grab food?
sounds good
where

The fix had to come from the base model itself, so for version four I swapped the Instruct base for the non-Instruct base, which has the same architecture and the same pretraining underneath but without the RLHF assistant training layered on top. Fine-tuning on a base model turns out to be a more legitimate approach for voice cloning anyway, because you're not trying to override existing behavior so much as shape a neutral substrate. While I was at it I tightened the data side too: a couple of low-volume recipients got held out entirely as the eval set, so eval loss was finally measuring whether the voice generalized to people the model had never seen rather than whether it had memorized recent phrases for friends it was training on; the per-recipient cap dropped further to broaden the voice diversity; and the Beeper system-message artifacts got filtered out properly. The change was immediate: the same prompts that had previously produced "what you need? 😊" started producing something that actually sounded like me.
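In code, the v4 swap was small; a sketch, with placeholder recipient names:

```python
from transformers import AutoModelForCausalLM

# v4: same architecture and pretraining, no assistant RLHF layered on top
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")

def split_by_recipient(rows, held_out=("friend_a", "friend_b")):
    """Hold out whole recipients so eval loss measures generalization to
    unseen people, not memorization. The names are placeholders."""
    train = [r for r in rows if r["recipient"] not in held_out]
    evals = [r for r in rows if r["recipient"] in held_out]
    return train, evals
```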

v4 (same prompts as above)
yo wanna grab food?
I don't even know where the nearest food place is 😳
Are you in Sydney?
Cuz if so I'm in Randwick 🥺🥺🥺🥺🥺🥺🥺🥺🥺🥺🥺
have u seen the new dune movie
Nah just trailer
I got an invite for my birthday 🥰
So happy 🥳❤️‍🔥❤️‍🔥❤️‍🔥❤️‍🔥❤️‍🔥
hey barry
Yo boss 🥺🥺
Muuu why u no reply, I was so sad 😢😭😭

That sounds a lot more like me, or close enough that I noticed the difference immediately. The way "muuu" stretches a complaint and the total absence of any "I'm here to help" framing make the voice unmistakable, even though the Randwick detail is wrong (I'm not in Sydney) and the model is making most of its concrete details up. You can read the responses out loud and they sound like me reading my own messages from a couple of years ago, which is approximately what they are.

It also has problems, plenty of them. The most obvious is that the emoji spam is real, and relatively divorced from the specific chat context, because a lot of my training data has me doing exactly that and the LoRA absorbed the pattern instead of generalizing past it. The model also hallucinates context constantly: it claims to be on holiday, to be in Sydney, to be at uni, all of which were true in some training-data window and none of which are true now. Asked about my "python project", it produced something that sounded suspiciously like a real conversation I'd once had about a recursion exercise, which is the model showing me memorized fragments rather than generated text. And every response is a multi-message dump, which makes sense when you remember that each training target was itself a multi-message dump after burst collapse, but it reads as too much from the user's side, so I cap the output to a couple of lines at the API layer to make it feel like an actual DM exchange rather than a wall of unrequested context.
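The cap is as blunt as it sounds; a minimal sketch:

```python
def cap_reply(text: str, max_lines: int = 2) -> str:
    """Truncate a generated multi-message burst to its first few lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return "\n".join(lines[:max_lines])
```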

There was also Runpod, which deserves a paragraph here because it ate more time than the actual training did. The training itself was uneventful and quick once it started, but getting to the point where it could start was the real time sink: three pods deep, I'd already hit a pod with no SSH key injection, then a pod where the host driver was too old for what the libraries wanted, and then a chain of version mismatches where one library tried to import an attribute from another library that had been removed in the version my driver could actually run. If you're tempted to think the modern LLM stack is plug-and-play, it is, right up until your CUDA driver is one minor version behind your image's preferred runtime, at which point you get to chase a different attribute error every time you rerun the script.
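If there's one practical takeaway from that detour, it's to check that the driver and the runtime agree before installing anything on top. A minimal sanity check (my suggestion in hindsight, not part of the original setup):

```python
import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)       # what the wheel expects
print("GPU visible:", torch.cuda.is_available())   # what the driver delivers
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```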

The path forward, if I do this again, is fairly clear in outline: less LoRA capacity to give the model less room to memorize specific phrases verbatim, tighter burst collapse so training targets are less rambly on average, and probably a better eval altogether, because the loss is fine for measuring whether the model is converging but useless for measuring whether it actually sounds like me. The guess-the-real-Barry game I built into the site, where friends play real-versus-generated head-to-heads, is closer to an honest signal, and the obvious metric for the next round is the win rate from those rounds rather than anything that comes out of the training script.

Whether version four deserves to be called a clone is debatable, since it nails the surface tics but can’t talk about anything real about my life without making things up. What it has is a kind of textural fidelity, the kind where if you saw a screenshot of one of its responses out of context you’d assume I wrote it, and whether that counts as voice cloning depends on what you think a voice actually is.
