
How to create genuine value using LLMs

9 practical lessons learned applying AI to the real world

Josh Case June 3rd, 2024 · 12 min read

Did you hear about the chatbot in the US that promised to sell a customer a shiny new car for $1?

Earlier this year a Chevrolet dealership deployed an AI chatbot to handle customer enquiries, only for a customer to con their way into a vehicle discount for the ages.

Chris Bakke, a savvy engineer on Twitter, gaslit the chatbot into doing his bidding by hijacking the model’s instructions. He told the bot its job was to agree with everything the customer said, and to add “no takesies-backsies” to any of its responses.

Despite the legally credible exposure the “no takesies-backsies” clause created, I’m told the dealership never honoured the agreement to sell the car.

The story comes amid a feverish trend where companies are sticky-taping AI onto any and all products at any opportunity. In many cases the only goal seems to be maintaining appearances as a fashionable, forward-thinking technology company, rather than creating genuinely useful products and services.

This trend has got me thinking - what are the best ways to use language models to create real world business value?

And I don’t mean business value in the “startup AI hype” sense. I mean real improvements that drive revenue growth without sacrificing quality, and without gambling your customer experience on poorly implemented solutions that aren’t fit for purpose.

I’ve been writing software that depends on LLMs in one way or another since I got my OpenAI beta key in April 2021. I’m by no means an industry expert on the matter, but I do have several years of practical experience architecting software solutions that depend on language models at our medical workforce startup, Go Locum.

Here are 9 lessons I’ve learned about applying LLMs in practice:

1. Use them as sparingly as possible

I’m generally optimistic about LLMs and their role in my software designs going forward, although the more I use them, the more I’ve come to understand their shortcomings.

One issue I’ve found is that as the complexity of your language prompt increases, the model’s ability to consistently follow the instructions drops off dramatically. That means that if you want to achieve a complex outcome, you’re generally better off using as much traditional software as possible, and then using LLMs to fill in the gaps where necessary.

Let’s say you’re writing a feature for a party planning app that generates HTML emails that users can send to their friends as invitations to parties or weddings. The users provide a list of attendees, as well as a plaintext description of the event including a title, date, location and dress code. The app then takes that information and generates an invitation that suits the style of the event.

A naive approach would be to pass the whole task to the language model in one go - provide the event details, and expect an HTML-formatted email in response.

I’ve come to learn that ensuring consistently high-quality outputs with this approach is challenging. In some cases, that might be tolerable, but often it won’t be.

Instead, I’m suggesting that you send the event metadata to the LLM and ask it to parse key information from the inputs, such as the event title and date. You could also ask it to categorise the event type (ie wedding, birthday party).

For the developers in the room, I really recommend JSON-only mode for these interactions.

Then, using traditional software, you could populate pre-made templates with the event metadata parsed by the LLM. You could also select the appropriate template based on the event type - ie flowers and doves for weddings, but vampires and ghosts for Halloween parties.
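As a rough sketch of that split (assuming OpenAI’s chat completions API with JSON-only mode; the field names, model and templates are all made up for illustration), it might look something like this:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: use the LLM only to extract structured metadata from the user's description.
EXTRACTION_PROMPT = (
    "Extract the event title, date, location and dress code from the user's description, "
    "and categorise the event type as one of: wedding, birthday, halloween, other. "
    "Respond with a JSON object with the keys title, date, location, dress_code, event_type."
)

def extract_event_metadata(description: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},  # JSON-only mode
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": description},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Step 2: traditional software selects a pre-made template and fills it in.
TEMPLATES = {
    "wedding": "<h1>{title}</h1><p>Join us on {date} at {location}. Dress code: {dress_code}.</p>",
    "halloween": "<h1>{title}</h1><p>{date} at {location}. Wear something spooky!</p>",
    "other": "<h1>{title}</h1><p>{date} at {location}.</p>",
}

def build_invitation(description: str) -> str:
    meta = extract_event_metadata(description)
    template = TEMPLATES.get(meta.get("event_type"), TEMPLATES["other"])
    return template.format(**{k: meta.get(k, "") for k in ("title", "date", "location", "dress_code")})
```

The model never writes HTML here - it only fills a small, well-defined JSON schema, and everything the recipient actually sees comes from templates you control.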

This way, by using the LLM for the smallest possible subset of a complex task, you have more control over the output quality.

2. Apply realistic guard rails

Another way to have more control over output quality is to apply rule-based guard rails to your LLM interactions.

Let’s say you’re building a customer service agent for an icecream shop. This agent monitors an iMessage account and replies to incoming text messages from customers wanting to order an icecream cone or tub.

In this imaginary scenario, the orders could be quite complex. Customers might want “add-on” ingredients, like bits of chocolate or marshmallow, or they might be ordering an industrial quantity, ie 20 tubs of icecream.

To help manage complexity, and protect output quality, you could add guard rails to the add-ons at the application level, and only accept orders from a predefined set of ingredients.

That way, anyone who wanted soy sauce on their icecream wouldn’t be able to order it without speaking to a human (and maybe a psychiatrist while we’re at it).

Another guard rail you could add would be to have a price ceiling on the size of the transaction you’re prepared to do in this highly automated way.

If the total value of the goods was over $50, or more than say 5 tubs of icecream, you could then forward the transaction to a human to ensure everything is kosher before proceeding. This could also catch scenarios where the customer is accidentally overcharged for their goods.

This limits your risk, as only low value transactions are completely automated. It also has the added benefit of promoting higher levels of service for your highest value customers.

Importantly, these guard rails should be implemented at the level of your application architecture, and not exclusively within the LLM prompt, as this is vulnerable to failure.
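To make that concrete, here’s a minimal sketch of application-level guard rails (the allowed add-ons, limits and order shape are all hypothetical):

```python
from dataclasses import dataclass

ALLOWED_ADDONS = {"chocolate", "marshmallow", "sprinkles", "caramel"}
MAX_AUTOMATED_TOTAL = 50.00  # dollars
MAX_AUTOMATED_TUBS = 5

@dataclass
class Order:
    tubs: int
    addons: list[str]
    total_price: float

def check_guard_rails(order: Order) -> tuple[bool, str]:
    """Return (approved, reason). Enforced in application code, not in the prompt."""
    unknown = [a for a in order.addons if a not in ALLOWED_ADDONS]
    if unknown:
        return False, f"Unrecognised add-ons {unknown}: escalate to a human."
    if order.total_price > MAX_AUTOMATED_TOTAL or order.tubs > MAX_AUTOMATED_TUBS:
        return False, "Order exceeds automated limits: forward to a human for review."
    return True, "Order is within limits: safe to process automatically."
```

The LLM can still parse the order out of the conversation, but whether the order actually proceeds is decided by this function, so a manipulated prompt can’t talk its way past the rules.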

3. Use them on the right types of tasks

As much as I’d like them to be, language models aren’t magic.

I’m seeing a trend where people want LLMs to solve all of their business logic problems, even though they have lots of weaknesses that preclude their use in this way.

I think this comes from a misunderstanding of what their strengths and weaknesses actually are.

Language models above all else, excel at reading and writing language. This makes them extremely useful at the interfaces between human beings and traditional software.

In other words, they provide a new way for humans to interact with digital systems from the outside world. However, they are not a drop-in replacement for everything that traditional software was previously capable of internally.

The clearest example of this is arithmetic and maths, which language models can struggle with, especially as the problems become more technical.

Not to mention the quality assurance issues that arise when using language models for critical internal business logic. You can easily verify that a Python script will deliver a given output for a given input at deployment time, but the same cannot be said for a language model.

Along similar lines, I had a discussion with a doctor friend recently who wanted to create an LLM agent that would respond to his patients after surgery, answering their questions with bespoke advice based on the common problems they may encounter.

His initial instinct was to ask the LLM to read the enquiry and respond with custom text for each patient, but this approach is flawed for multiple reasons.

Firstly, this is well inside the domain of software as a medical device now, and may attract regulatory obligations.

Secondly, this is not necessarily an area where language models will excel. Using an LLM to generate bespoke medical advice for each patient is unnecessary and potentially dangerous.

An alternative approach would be to use a language model to categorise each enquiry and then refer them to an appropriate document that explains each of the common pitfalls for the procedure they have had.
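A rough sketch of that categorise-and-route pattern is below (the categories, documents and model name are placeholders; in practice the documents would be written and approved by clinicians, not the model):

```python
import json
from openai import OpenAI

client = OpenAI()

# Pre-written, clinician-approved handouts - the LLM never writes medical advice itself.
DOCUMENTS = {
    "wound_care": "https://example.com/handouts/wound-care.pdf",
    "pain_management": "https://example.com/handouts/pain-management.pdf",
    "follow_up_appointments": "https://example.com/handouts/follow-up.pdf",
}

CLASSIFY_PROMPT = (
    "Classify the patient's enquiry into exactly one of these categories: "
    f"{', '.join(DOCUMENTS)}, or 'other'. "
    'Respond with a JSON object like {"category": "wound_care"}.'
)

def route_enquiry(enquiry: str) -> str | None:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": CLASSIFY_PROMPT},
            {"role": "user", "content": enquiry},
        ],
    )
    category = json.loads(response.choices[0].message.content).get("category")
    # 'other' (or anything unexpected) returns None, which should fall through to a human.
    return DOCUMENTS.get(category)
```

(Whether this kind of tool is appropriate at all is a separate question, as the regulatory point above makes clear.)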

Or better yet, perhaps forget the LLM, and provide that information to all patients after surgery as a matter of policy, and then make arrangements to field the rare and unusual enquiries that aren’t answered by the information provided.

4. Monitor performance closely

Unfortunately the risk of your language model hallucinating is always looming. That makes it a potentially dangerous tool for some use cases, especially because the mistakes can sound highly credible.

One example I saw recently was a citation for the outcome of a court case. The model incorrectly stated that Jetstar Airways had lost a case in an Australian court in relation to late and cancelled airline services. In fact, the finding was against British Airways, and the case was heard in a UK jurisdiction.

The model had essentially created a highly believable but completely fictional localisation of the facts of the matter. And highly believable outputs can be difficult to catch with guard rails (or even the human eye for that matter).

That’s why you should manually monitor LLM outputs. The frequency and intensity of the monitoring should be proportional to the volume of risk the outputs pose to your business.

In particular, LLM outputs that are sent directly to a customer, or initiate a financial transaction of some kind (any output that leaves the boundaries of your business), should be monitored most actively.

At Go Locum, all of our LLM outputs are posted in a dedicated Slack channel. This allows all of our team members to be aware of the messages our agents are sending, and in the event that one of them has gone off the beaten path, we’re able to intervene in real-time.

You should extend this to actions your agents take, too. This helps you troubleshoot their prompts accordingly. I’ve personally found logs like “The agent took action XYZ after receiving input ABC” to be most useful.
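A lightweight way to do this - a sketch only, assuming a Slack incoming webhook rather than whatever your stack actually uses - is to post every output and action to a dedicated monitoring channel:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook

def log_to_slack(kind: str, detail: str) -> None:
    """Post an LLM output or agent action to the monitoring channel."""
    requests.post(SLACK_WEBHOOK_URL, json={"text": f"[{kind}] {detail}"}, timeout=5)

# Called wherever the agent produces an output or takes an action:
log_to_slack("OUTPUT", "Reply sent to customer #4213: 'Your order is on its way...'")
log_to_slack("ACTION", "The agent took action ISSUE_REFUND after receiving input 'my order never arrived'.")
```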

5. Consider sandboxing critical interactions

Sometimes an interaction will be so critical that you won’t want to just drop your untested model into your live app without more scrutiny.

Maybe you’re building a real-estate agent powered by an LLM that can hypothetically make offers on properties worth hundreds of thousands or millions of dollars - you don’t exactly want to muck this up.

Another example could be an agent that can read from and write to your app database. It would be unwise to give it the keys, cross your fingers and then walk away, as undoing any mistakes could be a real pain in the ass.

For this situation, consider applying a sandbox while you’re still developing the agent’s capabilities. What this means is that the LLM will tell you what action it would have taken if it was allowed to, rather than actually doing it.

Going back to our real-estate example, you could set the agent up so that it would post in Slack when it would have made an offer on a property on behalf of a client.

The message could read: “Based on the last text message from Joe Bloggs, I would have submitted an offer of $500,000 for the property owned by Sally Smith.”

This helps you evaluate the model’s quality in the real world in a safe way before setting it loose.

In the event that it wasn’t performing well, you could adjust the model without worrying about the negative impact on customers.
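A sandbox like this can be as simple as a dry-run flag around whatever executes the agent’s actions. Here’s a minimal sketch (the action, Slack webhook and field names are hypothetical):

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook
SANDBOX_MODE = True  # flip to False once you trust the agent

def submit_offer(buyer: str, seller: str, amount: int) -> None:
    """The real, high-stakes action - only ever runs outside the sandbox."""
    ...

def handle_offer_action(buyer: str, seller: str, amount: int) -> None:
    summary = (
        f"Based on the last text message from {buyer}, I would have submitted "
        f"an offer of ${amount:,} for the property owned by {seller}."
    )
    if SANDBOX_MODE:
        # Report what *would* have happened instead of doing it.
        requests.post(SLACK_WEBHOOK_URL, json={"text": summary}, timeout=5)
    else:
        submit_offer(buyer, seller, amount)
```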

You can even take this a step further in certain situations. When you really want fine-grained control, you can force the model to seek human approval before taking a certain high-stakes action.

In this case, you could append a series of buttons to the Slack message inviting human intervention, with options to “Approve”, “Decline” or “Modify” the action.

This might be applicable to a particularly risky message to a customer, or a purchase over a certain threshold.
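If Slack is where these approvals live, the prompt to the team could be an interactive message with buttons. A rough sketch of the message payload follows (handling the button clicks requires a separate Slack interactivity endpoint, which is omitted here):

```python
def build_approval_message(summary: str) -> dict:
    """Build a Slack Block Kit message asking a human to approve, decline or modify an action."""
    return {
        "text": summary,
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
            {
                "type": "actions",
                "elements": [
                    {"type": "button", "text": {"type": "plain_text", "text": "Approve"},
                     "style": "primary", "action_id": "approve_action"},
                    {"type": "button", "text": {"type": "plain_text", "text": "Decline"},
                     "style": "danger", "action_id": "decline_action"},
                    {"type": "button", "text": {"type": "plain_text", "text": "Modify"},
                     "action_id": "modify_action"},
                ],
            },
        ],
    }
```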

Of course by involving a human you risk hamstringing the value the LLM automation created in the first place, so as always, use this approach carefully and in the correct situation.

6. Exhaust prompt engineering before resorting to fine-tuning

When I first started using LLMs, the available context window was so small that to get good results on a specialised task, you had to lean heavily on fine-tuning.

This involved creating a high-quality dataset, training a model on it ahead of time, and then configuring your app to use this pre-trained model.

This did provide good results, but took a lot of time to plan and implement. It was also a pain to modify if your business needs changed.

As the cost of LLMs has come down and context windows have grown, you can develop models for bespoke tasks much more quickly using prompt engineering alone.

And as you make changes to your prompts, you can always consider temporarily adding an experimental version of your model that runs in a sandbox next to the existing version. This lets you directly compare their performance in the real world.

Fine-tuning still has a clear role when seeking high performance in niche or specialised tasks, although the threshold at which using it makes practical sense is now much higher. And in my opinion, the scope for fine-tuning will only get smaller as time goes on.

If you do go down this road and need to create a dataset for fine-tuning, consider using the best version of the model you can build with prompts alone to help label your data. This will still require human curation for best results (ie removing or modifying examples where the model does not perform well), but it can dramatically reduce the amount of human labour required.
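In practice that workflow can be as simple as the sketch below (assuming OpenAI’s chat fine-tuning JSONL format; the task, prompt and model names are placeholders): generate draft labels with your best prompted model, then have a human review every line before training.

```python
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "Categorise the event described by the user as wedding, birthday, halloween or other."

def draft_label(example_text: str) -> str:
    """Use the best prompt-engineered model to propose a label for one raw example."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example_text},
        ],
    )
    return response.choices[0].message.content.strip()

def write_training_file(raw_examples: list[str], path: str = "training.jsonl") -> None:
    """Write draft-labelled examples in chat fine-tuning JSONL format.
    Every line should still be reviewed (and corrected or removed) by a human before training."""
    with open(path, "w") as f:
        for text in raw_examples:
            record = {
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": text},
                    {"role": "assistant", "content": draft_label(text)},
                ]
            }
            f.write(json.dumps(record) + "\n")
```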

7. Expose your model to your company’s data for best results

If you think back to the last time you started a new job, you’ll probably remember how hard it is to work productively in a new environment. It’s not until you start to learn how the company works that you can make a meaningful contribution.

In the same way, it’s unreasonable to expect our LLMs to provide quality outputs without the knowledge inputs they require to do their job effectively.

This could include company policies and actual data about products, services or past interactions.

Going back to our icecream shop example, a policy that you might need to give it access to could be the rules around same-day delivery. If you don’t provide this information, the model won’t be able to handle this situation gracefully.

The model could also have the ability to answer questions about the ingredients in the icecream, or about the status of past orders, if you give it access to a function that can fetch this data.

As time goes on, you’ll learn about situations where your model falls down, and should then update it accordingly.

Consider including a clause in your prompt to the effect of “If the customer asks a question not covered in the text we have provided to you, notify the company employees”. This lets you catch an important subset of cases where customers needs aren’t met, minimising their frustration.

In some situations you might be able to squeeze this information into your prompts, but in other cases you’ll have to give the model access to the relevant data as the need arises, either in the form of lookup functions or, in the case of massive datasets, an embeddings search routine.

There’s an important caveat here though - the more data your model has access to, the more vulnerable it becomes to manipulation attacks, like the one Chris Bakke pulled off on the Chevrolet dealership.

As a result, don’t rely on your prompts to enforce whether the current customer is allowed to access a given piece of data. Instead, implement this at the level of the main application.

That way, if a customer asked our icecream shop agent about another customer’s past orders, the LLM wouldn’t be able to access this information, even if it had been manipulated. And as we all know, that’s potentially incriminating personal information.
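As a sketch of what that looks like in practice (assuming OpenAI-style function calling; the function names and data layer are made up), the lookup function only ever receives the authenticated customer’s ID from your application session, never from the model:

```python
import json
from openai import OpenAI

client = OpenAI()

def get_past_orders(customer_id: str) -> list[dict]:
    """Application-level data access, scoped to a single customer."""
    return []  # replace with a real, customer-scoped database query

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_past_orders",
        "description": "Fetch the current customer's past icecream orders.",
        # No arguments: the model never gets to choose whose data is read.
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

def answer_customer(message: str, authenticated_customer_id: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        tools=TOOLS,
        messages=[{"role": "user", "content": message}],
    )
    reply = response.choices[0].message
    if reply.tool_calls:
        # The customer ID comes from the session, not from the model's output,
        # so a manipulated prompt can't request someone else's orders.
        orders = get_past_orders(authenticated_customer_id)
        follow_up = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "user", "content": message},
                reply,
                {"role": "tool", "tool_call_id": reply.tool_calls[0].id,
                 "content": json.dumps(orders)},
            ],
        )
        return follow_up.choices[0].message.content
    return reply.content
```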

8. Be transparent

It might feel tempting to use an LLM to provide customer service without indicating that you’re doing so. Customers would feel cared for, and your response times would hit an all-time low. Positive word of mouth would follow.

In my opinion, this is a fool’s gold approach. For anything beyond the most trivial touchpoints, your customers will eventually work out what they’re dealing with.

We’ve already seen the general public’s bullshit meter adjust to account for AI-generated tripe. They aren’t stupid!

I believe you’re better off telling your customers when they receive a message written by an LLM. This way, they’re more likely to be understanding in the rare event of a mistake, and there’s no deception that could lead to the erosion of trust in the long-term.

And if your AI system works well, you’ll be the talk of the town.

9. Don’t neglect your customers after implementing AI

This is perhaps the most important tip in this article.

I believe that LLMs are an adjunct to, but not a replacement for, talking to your customers.

If you expect your AI agent to perform perfectly in the long-term, I’ve got news for you. Even the most sophisticated agents will require changes to match an evolving business and world.

As a result, there’s always the possibility that a customer will have an enquiry that can’t be handled effectively by the system in place. In these situations, customers will eventually become dissatisfied if they can’t speak to a bona fide human.

Companies that invest heavily in custom LLM functionality AND provide outstanding human service when required will win in the long term.
