Author: Johnny Tseng

  • RAG: The Smarter, Cheaper Way to Scale Expertise

    Let’s talk about Retrieval-Augmented Generation (RAG). Whether we realize it or not, we all use RAG daily.

    If I asked you, “What’s the capital of Zimbabwe?” your thought process would probably go like this:

    1. Why do I need to know that?

    2. I’ll just Google it.

    And if you did, you’d find the answer: Harare—which also happens to be the largest city in Zimbabwe.

    This is the beauty of having the world’s information at your fingertips. Instead of memorizing everything, you use your brainpower to process, reason, and make decisions.

    AI should work the same way. With RAG, you store the knowledge somewhere else and retrieve the relevant pieces just before the model generates an answer to your question or prompt.

    Why RAG is More Efficient Than Memorization:

    Traditional AI models rely on storing vast amounts of knowledge in their parameters. The bigger the model, the more computing power, RAM, and cost required to process information—most of which may never even be used.

    If we apply the Pareto principle (80/20 rule) to AI, it’s likely that for most use cases, a model only uses 20% of its training data to handle 80% of real-world tasks. So why force it to memorize everything when it can just retrieve knowledge on demand?

    Instead of training a massive model that tries to “know everything,” RAG keeps models smaller, cheaper, and more adaptable.

    Applying RAG to Sales AI:

    Since I typically write from a sales perspective, imagine a model trained specifically to be great at selling.

    Now, let’s say we want this AI to sell cars. Instead of fine-tuning the model with every single piece of knowledge about every car ever made, we just:

    1. Train it to be a sales expert (negotiation tactics, objection handling, deal closing).

    2. Use RAG to pull in car-specific data (pricing, specs, competitive advantages, ideal customer profile, etc.) only when needed.

    This approach is faster, more cost-effective, and scalable compared to retraining an entire model every time new information becomes available.
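
    If you want to see the mechanics, here is a minimal sketch of the retrieve-then-answer loop in Python. It uses scikit-learn's TF-IDF as a stand-in for a real embedding model and vector database, and the car data and prompt are made-up placeholders for illustration.

    ```python
    # Minimal RAG sketch: retrieve relevant car facts before asking the model.
    # Assumes scikit-learn is installed; documents and question are hypothetical.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "2024 Falcon EV: 310-mile range, $42,000 MSRP, qualifies for federal tax credit.",
        "2024 Falcon Sport: 0-60 in 4.1s, $55,000 MSRP, premium trim includes towing package.",
        "Competitor comparison: Falcon EV charges 20% faster than the Roadrunner EV.",
    ]

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)

    def retrieve(query: str, k: int = 2) -> list[str]:
        """Return the k documents most similar to the query."""
        query_vector = vectorizer.transform([query])
        scores = cosine_similarity(query_vector, doc_vectors)[0]
        top = scores.argsort()[::-1][:k]
        return [documents[i] for i in top]

    question = "What's the range and price of the Falcon EV?"
    context = "\n".join(retrieve(question))

    # The retrieved facts are prepended to the prompt, so the model answers
    # from fresh data instead of whatever it memorized during training.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    print(prompt)  # in practice, this prompt would be sent to your sales model
    ```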

    Takeaway:

    AI models should work like smart humans—focusing on expertise and retrieving information when necessary, rather than memorizing everything.

    That’s why RAG isn’t just an optimization—it’s a fundamental shift in how we think about AI efficiency.

  • Mixture of Experts (MoE) – The GTM Team of AI

    In the world of large language models (LLMs), bigger isn’t always better. While traditional AI models operate like a generalist, attempting to handle every task with a single approach, the Mixture of Experts (MoE) architecture introduces a more efficient, scalable way to process information.

    Here’s a relatable analogy. Think of MoE as a Go-To-Market (GTM) team, where different specialists handle different aspects of a deal, ensuring efficiency, accuracy, and scalability. Just like in business, where no single person can manage everything effectively, AI benefits from a team of experts, each specializing in a particular domain.


    How Mixture of Experts (MoE) Works

    Instead of a single, monolithic AI model doing everything, MoE routes tasks to the best-suited expert models based on the input. Only a subset of the experts is activated per query, meaning the model becomes:

    • More efficient – It doesn’t waste computational power on irrelevant experts.
    • More accurate – Specialized experts perform better than a generalist model.
    • More scalable – New experts can be added without massively inflating costs.
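
    Before mapping this to the GTM team, here is a toy sketch of the routing idea in NumPy: a small gate scores every expert, and only the top-scoring few actually run. The weights are random placeholders, not a real MoE layer.

    ```python
    # Toy Mixture-of-Experts router: a gate scores each expert for the input,
    # and only the top-k experts are actually run (the rest stay idle).
    import numpy as np

    rng = np.random.default_rng(0)
    num_experts, d_model, top_k = 4, 8, 2

    gate_weights = rng.normal(size=(d_model, num_experts))             # the router
    expert_weights = rng.normal(size=(num_experts, d_model, d_model))  # one small layer per expert

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def moe_forward(x):
        scores = softmax(x @ gate_weights)      # how relevant is each expert?
        chosen = np.argsort(scores)[-top_k:]    # activate only the top-k experts
        output = np.zeros_like(x)
        for i in chosen:
            output += scores[i] * (x @ expert_weights[i])  # weighted expert outputs
        return output, chosen

    x = rng.normal(size=d_model)
    y, used = moe_forward(x)
    print("experts activated:", used)  # only 2 of the 4 experts did any work
    ```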

    Now, let’s map this to a GTM team to see how the principle applies in business.


    The GTM Team as an MoE Model

    A GTM team thrives because each function specializes in different parts of the customer journey. Here’s how MoE mirrors the structure of a well-run GTM team:

    1. Sales → The Persuasion & Negotiation Expert

    Sales reps focus on prospecting, engaging leads, and closing deals. In an MoE model, this would be an expert agent trained in persuasive language, negotiation, and sales strategies to handle responses that require engagement and conversion tactics.

    2. Commercial Solutions → The Pricing & Financial Expert

    Commercial teams ensure that deals are structured properly, with accurate pricing and margin considerations. In an MoE model, this would be an expert trained in numerical reasoning and financial modeling, responsible for optimizing pricing strategies and contract structures.

    3. Legal → The Regulatory & Compliance Expert

    The legal team safeguards the company from contractual and regulatory risks. In MoE, this would be an expert fine-tuned in legal language processing, ensuring AI-generated content aligns with compliance requirements and avoids risks.

    4. Post-Sales → The Execution & Support Expert

    Once a deal is closed, the post-sales team ensures smooth onboarding and implementation. In MoE, this could be an expert trained to provide troubleshooting, customer support, and documentation assistance, helping customers adopt and integrate solutions seamlessly.

    5. Customer Success (CSM) → The Retention & Expansion Expert

    CSMs focus on renewals, upsells, and customer satisfaction. In MoE, this would be an expert that specializes in customer sentiment analysis, identifying engagement patterns and proactively recommending optimizations to enhance customer relationships.


    The Key Takeaway: MoE is a GTM Team for AI

    Rather than relying on one massive model trying to do everything, MoE distributes the workload to specialized expert models, ensuring:

    • Better Performance – Each expert is optimized for a specific task.
    • Lower Costs – Not all experts are activated at once (kind of like on-demand resources), reducing compute overhead.
    • Greater Adaptability – New experts can be added without rebuilding the entire system, e.g., bringing in a Sr. Legal expert when the junior-level attorney isn’t getting the job done.

    Just as a GTM team relies on sales, legal, finance, and post-sales experts to close and manage deals effectively, an MoE model leverages different AI experts to generate optimal responses efficiently. The future of AI isn’t about building one all-knowing model—it’s about orchestrating a network of specialized experts, just like the best-run businesses do. This can be seen most recently in DeepSeek’s R1 model and in Mistral’s Mixtral family, both of which are built on MoE.

  • LLM Distillation: Making AI Smarter, Smaller, and Faster


    I want you to do the following:

    🧠 Imagine a tenured PhD professor—they know everything, but their responses take longer because they evaluate more possibilities. Their expertise is unmatched, but their time is expensive.

    🎓 Now, imagine training a bright student to be almost as smart—they can answer 80-90% as well, but much faster and at the cost of a $30/hr tutor instead of a six-figure professor.

    That’s LLM distillation—taking a massive AI model and teaching a smaller, faster version to be just as effective for most tasks but without the overhead.  Curious to know more?

    Let’s get into the weeds!


    We’ve all seen the massive AI models—they’re powerful but expensive and slow. What if you could shrink them while keeping most of their intelligence?

    That’s where LLM distillation comes in.

    How It Works (Without the Jargon)

    Using our PhD Professor vs. Student analogy, here’s how the process plays out:

    📚 Step 1: The Professor Teaches the Student

    The big model (PhD professor) generates thousands of high-quality answers. These responses are captured as training data.

    💡 Think of this as having a professor write down their best explanations, step by step, across thousands of topics.

    📝 Step 2: The Student Learns Through Imitation

    A smaller AI model (the student) is trained to mimic the professor’s responses as closely as possible.

    💡 It’s like giving the student years of top-tier study guides and having them practice until they sound nearly as good as the professor.

    🎯 Step 3: The Student Gets Graded & Optimized

    The smaller model is evaluated on:

    ✔️ Accuracy – How well does it match the big model?

    ✔️ Speed – Can it respond significantly faster?

    ✔️ Efficiency – Can it run on lower-cost hardware?

    💡 At this stage, the student might not have every nuance, but they’re smart enough to handle 80-90% of real-world tasks with speed and efficiency.

    🚀 Step 4: Deployment – The Student is Ready for the Real World

    Once trained and fine-tuned, the distilled model is put to work—whether in customer support, AI-powered sales tools, or real-time assistants.

    💡 Now, instead of waiting for a PhD professor to respond, you get near-instant answers from a well-trained student—at a fraction of the cost.
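
    For the technically curious, here is roughly what “the student imitates the professor” looks like in code. It is a minimal PyTorch sketch with toy models and random data, assuming the classic soft-target (KL-divergence) distillation recipe rather than any specific vendor’s pipeline.

    ```python
    # Distillation sketch: the small "student" model is trained to match the big
    # "teacher" model's output distribution (soft targets), not just hard labels.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))  # "PhD professor"
    student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))    # smaller, faster

    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    temperature = 2.0  # softens the teacher's probabilities so the student learns nuance

    for step in range(100):
        x = torch.randn(16, 32)                 # stand-in for real training prompts
        with torch.no_grad():
            teacher_logits = teacher(x)         # the professor writes the answer key
        student_logits = student(x)             # the student attempts the same questions

        # KL divergence between softened distributions = "how far off is the student?"
        loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ```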

    Why This Matters

    AI at Scale – Smaller models mean AI that can run on your laptop, phone, or edge devices.

    Cost Savings – Distilled AI can be 10x cheaper to run than massive cloud-hosted models.

    Faster AI Assistants – Perfect for real-time applications like chatbots, sales enablement, and AI copilots.

    Custom AI for Industries – Instead of a general-purpose AI, businesses can create specialized AI that’s fine-tuned for their needs.

    Final Thought

    LLM distillation isn’t about making AI dumber—it’s about making it smarter where it counts while being lean, cost-effective, and fast.

    🚀 The future of AI isn’t just about having bigger models—it’s about having the right-sized intelligence for the job.

    Would love to hear your thoughts—where do you see distilled AI making the biggest impact? 👇

    #AI #LLM #MachineLearning #Tech #Innovation #ArtificialIntelligence

  • RAM: More = Good

    If you’ve ever seen the movie Limitless with Bradley Cooper, you may have had the same thought as me. One pill to magically make me smarter, more efficient, and highly productive for a short period of time. I would love that. The things I could do.

    For computers, RAM is that pill.

    RAM stands for Random Access Memory. In short, it’s like short-term memory—but on steroids. Imagine trying to solve a puzzle: the bigger your table, the more pieces you can lay out at once, making it easier to see patterns and solve the problem faster. RAM works the same way for AI models.

    Have you ever noticed when your phone starts slowing down because too many apps are open at once? That’s because your phone’s RAM is maxed out, and it can’t juggle everything efficiently. AI models face the same challenge but on a much larger scale. Every time an AI model processes language, generates an image, or crunches numbers for a business decision, it’s using RAM to do that thinking in real time. More RAM means faster responses, smoother performance, and the ability to handle more complex AI tasks at once.

    This is why AI workloads demand massive amounts of RAM. Unlike traditional applications, AI models don’t just “store” data—they process and manipulate huge datasets dynamically. If RAM is too small, the model will either slow down, crash, or be forced to offload to slower storage, killing efficiency.
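
    To make that concrete, here is a rough back-of-the-envelope calculation of how much memory just the model weights need at different precisions. These are rule-of-thumb numbers only; real usage also depends on context length, batch size, and framework overhead.

    ```python
    # Back-of-envelope RAM estimate for holding a model's weights:
    # roughly (parameter count) x (bytes per parameter).
    def model_ram_gb(params_billions: float, bytes_per_param: float) -> float:
        return params_billions * 1e9 * bytes_per_param / 1024**3

    for params in (7, 70):
        fp16 = model_ram_gb(params, 2)    # 16-bit weights
        int4 = model_ram_gb(params, 0.5)  # 4-bit quantized weights
        print(f"{params}B model: ~{fp16:.0f} GB in fp16, ~{int4:.0f} GB at 4-bit")

    # Prints roughly:
    # 7B model: ~13 GB in fp16, ~3 GB at 4-bit
    # 70B model: ~130 GB in fp16, ~33 GB at 4-bit
    ```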

    So, if you’re selling AI solutions, one of the first things to check is whether the hardware has enough RAM to handle the workload. Otherwise, no matter how powerful the processor is, the system will be bottlenecked—just like a genius trying to work with sticky notes instead of a whiteboard.

  • CPU vs. GPU: Why a Two-Way Street Can’t Keep Up With a Highway




    If you’re in sales, you’ve probably heard the terms CPU and GPU thrown around when talking about AI, cloud computing, or high-performance workloads. But what’s the real difference?

    Let’s make it simple: think of data like cars on a road.

    CPU: The Two-Way Street

    A CPU (Central Processing Unit) is like a two-way street in a small town. It handles a few cars (data tasks) at a time, but each car gets personalized attention—it can change directions, take different turns, and stop at intersections.

    This makes CPUs great for handling complex tasks that require flexibility and decision-making, like:

    ✅ Running software applications

    ✅ Making calculations

    ✅ Managing system operations

    CPUs are designed for efficiency, not bulk traffic. They’re great at doing a few things really well, but not at handling thousands of things at once.

    GPU: The Multi-Lane Highway

    Now, a GPU (Graphics Processing Unit) is like a multi-lane highway designed for rush hour traffic. Instead of processing one or two cars carefully, it’s optimized to move thousands of cars (data bits) in parallel, all in the same direction.

    This makes GPUs perfect for massively parallel tasks, like:

    🚀 Training AI models

    🚀 Processing huge datasets

    🚀 Rendering graphics

    GPUs aren’t as flexible as CPUs, but that’s the point—they sacrifice versatility for raw speed and volume.
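
    If you want to feel the difference, here is a small sketch that runs the same big matrix multiply on a CPU and, if one is present, on a GPU. It assumes PyTorch is installed; the exact timings depend entirely on the hardware.

    ```python
    # The highway analogy in code: the same workload on a CPU vs. a GPU.
    # The GPU path only runs if CUDA hardware is actually available.
    import time
    import torch

    a = torch.randn(4096, 4096)
    b = torch.randn(4096, 4096)

    start = time.time()
    _ = a @ b                                   # CPU: a handful of cores share the work
    print(f"CPU: {time.time() - start:.3f}s")

    if torch.cuda.is_available():
        a_gpu, b_gpu = a.cuda(), b.cuda()
        torch.cuda.synchronize()                # make the timing fair
        start = time.time()
        _ = a_gpu @ b_gpu                       # GPU: thousands of cores work in parallel
        torch.cuda.synchronize()
        print(f"GPU: {time.time() - start:.3f}s")
    # Note: a serious benchmark would also include warm-up runs.
    ```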

    More Traffic = More Heat

    Here’s the catch: the more cars on the road, the more friction, and the more heat.

    • A CPU, processing fewer cars, stays relatively cool.

    • A GPU, handling thousands of data bits at once, generates way more heat and requires serious cooling to keep from overheating.

    That’s why GPUs need bigger heatsinks and cooling systems—they’re pushing way more traffic through at once.

    Which One Does Your Customer Need?

    For sales, this all boils down to what kind of road your customer is driving on:

    Are they processing general-purpose tasks, running software, or making real-time decisions? They need a CPU.

    Are they crunching large amounts of data, training AI models, or rendering graphics? They need a GPU.

    At the end of the day, both have their place—it just depends on the type of “traffic” your customer is managing.

    If you’d like more content like this, let me know. Feedback is appreciated!

  • Salespeople Are Athletes: Use AI to Refine Your Game


    I have always held the belief that the best salespeople are athletes—we’re constantly competing. Whether it’s against an incumbent, other companies, or even internally to climb the leaderboard, we face challenges every day. Just like athletes, we can’t wait until game time to sharpen our skills. Preparation and practice are key to winning.

    In today’s fast-paced selling environment, tools like Gong, Outreach, and ChatGPT are essential for refining your pitch, overcoming objections, and pushing deals forward. If you’re not leveraging these tools, you’re leaving your growth to chance.

    How to Use Advanced Voice Mode

    1. Start with Clear Instructions:

    • Begin by prompting ChatGPT with a clear set of instructions to role-play with you. For example:

    “I would like you to simulate a role-playing scenario where you act as [specific persona].”

    2. Define the Persona:

    • Provide detailed guidance on how ChatGPT should behave. For instance:

    “You are a stern CTO who values their time. You will not tolerate nonsense. If I say anything nonsensical, cut me off immediately and indicate that this conversation needs to stop. Challenge me and don’t exaggerate being nice.”

    3. Share Context:

    • Clearly describe the product or service you’re proposing and provide a concise value statement to help ChatGPT better simulate the interaction.

    4. Push for Variety:

    • Encourage ChatGPT to create different scenarios and responses to keep you on your toes. The more specific and detailed your instructions, the more effective the role-play becomes.

    From here, you can be in control of pushing the platform to keep challenging you and to give you different scenarios. Close early, close often.
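
    If you prefer practicing over text, the same setup can be scripted. Below is a sketch assuming the OpenAI Python SDK; the model name, persona, and product details are placeholders to swap for your own deal.

    ```python
    # Role-play practice scripted as a text chat (sketch, assuming the OpenAI Python SDK).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from your environment

    system_prompt = (
        "Simulate a role-playing scenario where you act as a stern CTO who values their time. "
        "You will not tolerate nonsense. If I say anything nonsensical, cut me off immediately "
        "and say the conversation needs to stop. Challenge me and do not exaggerate being nice. "
        "Context: I am selling an AI-powered sales enablement platform that shortens ramp time "
        "for new reps. Vary your objections each turn to keep me on my toes."
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any capable chat model
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Thanks for taking the time today. Can I start with a quick question about your current onboarding process?"},
        ],
    )
    print(response.choices[0].message.content)  # the CTO's (hopefully tough) reply
    ```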

  • What Is an LLM? The Basics



    Understanding LLMs for Beginners

    When you hear terms like LLM, SLM, or just Model, it can sound a bit complicated, but let’s break it down.

    Model: This refers to a machine learning system designed to perform a specific task, in this case, understanding and generating human language.

    LLM (Large Language Model): A model that is “large” because it has been trained on massive amounts of text data (think millions or billions of sentences) and contains billions of parameters (the “knobs” it adjusts to improve predictions).

    SLM (Small Language Model): A smaller, less complex version of an LLM, designed for tasks that don’t require as much power or storage.

    How They Work

    At their core, these models function by predicting the most probable next word in a sentence based on the context of the words that came before it. This is called language modeling, and it’s how they generate coherent, human-like responses.

    For example:

    If you start with the phrase “The sky is”, the model might predict the next word as “blue”, because that’s the most likely word based on the training data it has seen.
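
    Here is a small sketch of that prediction step using the openly available GPT-2 model via the Hugging Face transformers library (assuming it is installed); it prints the most likely next tokens and their probabilities.

    ```python
    # Peek at next-word prediction with GPT-2 (requires torch and transformers).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The sky is", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # a score for every possible next token

    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, 5)
    for token_id, p in zip(top.indices, top.values):
        print(f"{tokenizer.decode(int(token_id))!r}: {float(p):.1%}")
    # The top candidates typically include ' blue', each with its probability.
    ```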

    Key Vocabulary

    Training:

    This is the process of teaching the model by showing it vast amounts of text data. The model adjusts its parameters to improve its ability to predict the next word or understand relationships between words.

    Think of it like learning a new language: the more examples you study, the better you get.

    Key facts about training:

    • It requires massive computational power (think supercomputers or thousands of GPUs working together).

    • It is extremely expensive and time-intensive, sometimes taking weeks or months to complete.

    Inference:

    Once the model is trained, it’s ready to be used. Inference refers to the process of applying the trained model to make predictions or generate responses.

    For example, when you type a question into ChatGPT, the model is performing inference to give you an answer.

    Key facts about inference:

    • It is less computationally demanding than training, but still requires good hardware for larger models.

    • Most of the cost for businesses using LLMs comes from inference, as it happens every time someone uses the model.
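
    Here is a toy PyTorch sketch that shows the difference: training runs a loop that repeatedly adjusts the parameters, while inference is a single forward pass with no updates. The model and data are placeholders.

    ```python
    # Training vs. inference with a toy model.
    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    # --- Training: expensive, repeated many times, adjusts the model's "knobs" ---
    for step in range(1000):
        x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in training data
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()       # work out how each parameter should change
        optimizer.step()      # actually change it

    # --- Inference: cheap by comparison, happens every time someone asks a question ---
    model.eval()
    with torch.no_grad():     # no parameter updates, just a prediction
        answer = model(torch.randn(1, 10))
    ```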

    What Makes an LLM “Large”?

    The “large” in LLM refers to both:

    1. Data Size: The amount of text it has been trained on. For example, GPT-3 (a famous LLM) was trained on hundreds of gigabytes of text from books, websites, and more.

    2. Parameter Count: Parameters are like the “brains” of the model. More parameters mean the model can handle more complex tasks, but it also requires more memory and power to operate.

    • A small model might have a few million parameters.

    • A large model like GPT-3 has 175 billion parameters.

    Why Does Size Matter?

    Larger Models: Tend to be more accurate and capable of understanding nuanced or complex prompts. However, they’re also slower and more expensive to use.

    Smaller Models: Faster and cheaper, but might struggle with difficult or context-heavy tasks. These are great for lightweight applications like chatbots for customer support.

    Real-World Applications of LLMs

    1. Chatbots: Like customer support bots or personal assistants (think Siri or Alexa).

    2. Translation: Converting text from one language to another.

    3. Content Generation: Writing articles, code, or even stories.

    4. Summarization: Reducing long articles or documents into shorter, concise summaries.

    5. Medical or Legal Analysis: Helping professionals analyze complex documents or data.

    Limitations of LLMs

    It’s important to understand what LLMs can’t do well:

    • They don’t truly “understand” like humans do; they only predict based on patterns in the data they’ve seen.

    • They can sometimes make errors, like generating factually incorrect or nonsensical answers (called hallucinations).

    • They require careful oversight in critical tasks like medicine or law to avoid mistakes.

    This is a basic introduction to help you understand LLMs. The next time you hear about AI and models like ChatGPT, you’ll have a better grasp of how they work and what they’re capable of.

    1. Examples of Use Cases: Use ChatGPT, DeepSeek, Claude, or any other LLM to help you compose emails and notes, make your email sound nicer if you’re in a bad mood, or communicate sternness without too many F-bombs. It’s great.

    2. Simplify Concepts/Learning: If you’re here, you’re most likely trying to further your own education. Learning a new subject can be hard, so use the models to help you understand. The one thing to keep in mind is that they sometimes hallucinate, aka make things up. With that said, double-check with a tool like perplexity.ai to make sure claims are verified by multiple sources.


  • Why did I start this?

    My personal experience has shown me that there are quite a few folks out there like me who work in sales and want to stay relevant. The challenge is that everything is getting more technical, and without a decent foundation it can feel overwhelming.

    The other reason is that AI will be everywhere. It will fill our personal lives as personal assistants. Over time it will replace the traditional search function, because the tedium of scrolling, clicking, reading, and piecing information together is far too time-consuming. I’m also just hoping for good karma. By sharing what I’ve learned and how I’ve learned it, maybe I can help you. If it does help, please drop a note. I’m a middle child raised by a strict Asian father; feedback is my jam.
    The othe reason is, AI will be everywhere. It will fill our personal lives as personal assistants. It will over time replace the traditional search function because the tediousness of having to scroll, click, read and converge information is way too time consuming. I’m also just hoping for good karma. By sharing what I’ve learned and how I’ve learned, maybe it can help you. If it does, then please drop a note. I’m a middle child raised by a strict asian father, feedback is my jam.