Use LLMs. Replace LLMs.

LLMs are everywhere. They are being shoehorned into every possible process flow, often without a second thought, treated as a magic black box that will solve all our problems. AI engineers increasingly skip what used to be the core of the job: deeply understanding and developing specialized models. As a data scientist with a love for both model building and system design, I am torn. Are LLMs really the silver bullet for every NLP task? Is there still a point in developing models yourself? In this blogpost I aim to reconcile these two seemingly opposing ideas, discuss the tradeoffs of both approaches, and show how a hybrid approach can leverage the strengths of each to build more robust, specialized systems. Finally, I showcase a tool that might help in achieving this goal: InstaNER.

For a long time, training AI models was the exclusive domain of data scientists. Training good models from scratch has always been a costly endeavour: creating a model is a multi-stage process, and expertise is required at each step to end up with an optimal model. Generally speaking, the process can be divided into the following steps:

  1. Data collection: AI models are trained on data, so data collection is the logical first step. However, it is costly: high-quality data is difficult to find. The collected data should have the following properties:
    • Representative: the collected data must reflect the real-world scenarios the model will encounter. For example, a tornado damage prediction model trained on data from regions with lots of tornadoes (e.g., the US Midwest) will not transfer to areas with very different environments (e.g., Europe). A related concept is bias: when one category is overrepresented, the other categories are overshadowed, leading to wrong conclusions.
    • Varied: the dataset should contain diverse scenarios such that the model is able to generalize to new, unseen data. For instance, a model trained to detect fraudulent credit card transactions might fail to recognize new patterns involving cryptocurrency if those were not present in the training data.
  2. Data wrangling: oftentimes the raw data you collected in the previous stage is not yet usable for training. The formatting might be wrong, or the data might contain parts you need to remove. This is where data wrangling comes in: preparing the data and making it available in a form the model can actually be trained on.
    • Cleaning: removing or correcting inaccurate, incomplete, or irrelevant data. For example, handling missing values, removing duplicates, or correcting typos.
    • Transformation: converting data into a suitable format or structure for analysis.
    • Feature Engineering: creating new features from the available data. Often a crucial part that can significantly impact model performance.
  3. Training the model: finally, we're able to train the model. Once again, there are a lot of variables to take into account:
    • Model Architecture: different problems require different architectures. A CNN might be required for image-related tasks, while boosting models might be better suited to tabular data.
    • Hyperparameter Tuning: within a single architecture, a model can still be trained in many different ways. For example, choosing the train-test split and the learning rate.
  4. Evaluating the model: after the model has been trained, it is crucial to evaluate it to understand how it performs. Does it achieve the expected results, or should we iterate on the previous steps to improve performance? To evaluate a model we use various metrics, such as accuracy, F1-score, MSE, etc. Each use-case often has specific metrics that guide the model building process. A minimal code sketch of this traditional workflow follows below.
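To make these steps concrete, here is a minimal sketch of the traditional workflow using scikit-learn. The file name, label column, and hyperparameters are illustrative placeholders, not a recipe:

```python
# Minimal sketch of the traditional model-building loop.
# "data.csv" and its "label" column are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# 1. Data collection: load the raw data you gathered.
df = pd.read_csv("data.csv")

# 2. Data wrangling: clean the data and select features.
df = df.drop_duplicates().dropna()
X, y = df.drop(columns=["label"]), df["label"]

# 3. Training: choose an architecture and hyperparameters.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = GradientBoostingClassifier(learning_rate=0.1, n_estimators=200)
model.fit(X_train, y_train)

# 4. Evaluation: check whether the metrics meet expectations.
print("F1:", f1_score(y_test, model.predict(X_test), average="macro"))
```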

While all of this still holds, the arrival of LLMs shook up the traditional approach: many use-cases for which you used to have to train a separate model can now easily be handled by an LLM with zero-shot prompting. Just think of a task such as sentiment classification, where you decide whether a sentence is positive, neutral, or negative. Thanks to the inherent language capabilities of LLMs there is no need to train these kinds of models anymore: just plug in your LLM, send an API request, parse the response and get on with your day.
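To make that concrete, here is a minimal sketch of such a zero-shot setup with the OpenAI Python client; the model name and prompt wording are illustrative assumptions, not a recommendation:

```python
# Zero-shot sentiment classification with an LLM: no training, just a prompt.
# Model name and prompt wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sentiment(sentence: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the sentiment of the user's sentence. "
                        "Answer with exactly one word: positive, neutral, or negative."},
            {"role": "user", "content": sentence},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(classify_sentiment("The package arrived two weeks late."))  # e.g. "negative"
```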

LLMs are convenient. With the ever-increasing capabilities we've seen over the past two years, writing a simple prompt can already get you a long way, and where that doesn't suffice, you might introduce few-shot prompting, chain-of-thought or other more advanced prompting techniques. Furthermore, with the ongoing commoditization (think of falling API prices and the open-source releases of DeepSeek, for example) it is cheaper than ever to make use of these incredibly powerful models. Recently, the AI agent trend has taken off, and agents are already capable of doing incredible things. However, when designing your system, I urge you to re-evaluate: do you really need an LLM in every part of your processes?

Limitations of LLMs

Like most of us, I remember being absolutely amazed the first time I used ChatGPT. It really felt like magic, and more importantly, it felt like it could answer any question I threw at it. With more use, the limitations became apparent: hallucinations, difficulties with instruction following, inconsistencies, prompt sensitivity... And suddenly it all felt a little less magical. And this is not just me; a common sentiment you will find is people wondering whether "the LLM has gotten dumber". People love to speculate about the reason: did the LLM provider update the weights? Did they change the system prompt? Maybe they are cutting corners and secretly serving a smaller model instead. I've read all of these multiple times online. I believe none of them are true. Humans simply discovered the limitations of LLMs.

Unfortunately, LLMs still have plenty of problems to solve:

  • Hallucinations: this is by far, at this point in time, one of the most common problems (if not the most common) people encounter when using LLMs. The LLM makes up falsehoods and presents them as if they were true. The over-confidence of the LLM makes it even harder to detect where it started to hallucinate. For example, the incredibly simple question "how many letters 'r' are in 'strawberry'" eluded LLMs up until the release of reasoning models such as OpenAI's o1. The tokenization approach of an LLM simply does not allow it to answer this question correctly. Nevertheless, they all proudly and confidently stated that there are two letters 'r' in strawberry. When integrating LLMs into processes, hallucinations can, without the proper safeguards, produce a funny result at best and take down the entire process at worst. These hallucinations can in some cases be very difficult to reproduce, making it near impossible to come up with a way of avoiding them.
  • Inconsistencies in instruction following: this problem can be especially annoying to deal with when integrating an LLM into your process flow. You might ask the exact same LLM the exact same question, with the exact same prompt, and get good responses 99 times out of 100. But that one time, the LLM might decide not to strictly follow the given instructions and return its answer in a somewhat different way, causing the subsequent steps to fail because they expect a certain output format (one common mitigation is sketched after this list). As with hallucinations, these failures can be very difficult to reproduce.
  • Prompt Sensitivity: you might've crafted the most masterful prompt the world has ever seen, pushing the LLM to its absolute limits, and you're able to extract the best answer from the LLM on a consistent basis. Now suppose you need to make one teeny tiny change to the prompt due to a change in requirements. Suddenly, your results are significantly worse and the LLM is making mistakes it wasn't making before, completely unrelated to the change you made.
  • Lack of control: as a user of an LLM (unless you deploy it locally yourself) you're reliant on a third party. Suppose the people from before were right, and OpenAI or any other third-party LLM provider did in fact change their model without notice. The model you rely on might be completely unusable from one day to the next. This lack of control can be disastrous for businesses that have built their workflows around a specific LLM: a sudden change in behaviour or pricing could disrupt operations and lead to financial losses. It would not be the first time tech companies enshittified their product after obtaining a large market share.
  • Evaluation: deciding which LLM is the best and which one to use is not easy; evaluating the true performance of LLMs is surprisingly difficult. There are plenty of benchmarks on which you can evaluate LLMs, and there are even leaderboards showing how the different LLMs perform. You can use these leaderboards to get some sense of how good the different LLMs are and to decide which one you'd like to use. However, these benchmarks and leaderboards often don't reflect how well an LLM performs in practice. For example, despite many recently released LLMs claiming to outperform Claude Sonnet 3.5 on various coding benchmarks, the widespread consensus among developers is that the best LLM for coding still is (and has been for quite a while) the latter. This discrepancy leads many to believe that LLM researchers are benchmark maxing: optimizing models specifically to perform well on benchmarks, sometimes by including (parts of) the benchmark dataset in the training set, leading to inflated and misleading results. Not only does this make it challenging to choose the right LLM for your use-case, it also raises concerns about the trustworthiness of reported performance.
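A common safeguard against the formatting issue described above is to validate every LLM response against a schema and retry on failure; this doesn't make the LLM reliable, it just keeps one malformed answer from taking down the whole pipeline. A minimal sketch using pydantic, where `call_llm` is a hypothetical helper that returns the raw model output as a JSON string:

```python
# Defensive parsing: never assume the LLM followed the output format.
# call_llm(prompt) is a hypothetical helper returning raw JSON text.
from typing import Literal
from pydantic import BaseModel, ValidationError

class SentimentResult(BaseModel):
    label: Literal["positive", "neutral", "negative"]

def classify_with_retries(prompt: str, max_retries: int = 3) -> SentimentResult:
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            return SentimentResult.model_validate_json(raw)
        except ValidationError:
            continue  # malformed output: retry instead of crashing downstream steps
    raise RuntimeError("LLM kept returning malformed output")
```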

The above issues, including the difficulties in evaluation, can all be traced back to a single root cause: explainability. The LLM is a black box trained on a gigantic amount of data that, amazingly, obtained the capabilities it has. And although asking the LLM explicitly to provide its reasoning can help, there is no strict way of knowing what is really going on inside. The chain of thought you can see in reasoning models is simply a text layer on top of their actual thinking, shaped to align with what humans expect. DeepSeek R1-Zero, for example, switches languages in its reasoning. Is this because it is hallucinating? One interpretation is that Chinese is a more information-dense language, able to represent more information in fewer characters. This hypothesis is somewhat supported by a 2017 research paper from Meta (then Facebook AI Research): they trained two agents to negotiate with each other, and the agents developed their own, more compact, information-dense language. The reason this doesn't happen with other reasoning models is that they are trained with some form of RLHF, where they are rewarded for producing reasoning that aligns with human understanding, effectively forcing them to use English to get the optimal reward.

Alternative approach

And so it seems we are at an impasse: on the one hand, creating a model from scratch offers control and reliability but costs a lot of money and time; on the other hand, LLMs offer speed and convenience but still suffer from substantial issues that cannot be ignored and can be dealbreakers for operation-critical applications. What if we could combine the best of both worlds while limiting the downsides to a minimum? The primary challenges of the first approach lie in getting the model operational: once the model has been trained, we can move it to production and use it for inference at low cost. Admittedly, plenty of issues can still pop up during this process, but handling them is nothing new, and frankly, something we're already quite good at! Conversely, the problems with relying only on LLMs occur during usage: we have relatively little control over their behaviour besides prompting and/or fine-tuning, and combined with their unpredictable nature (e.g., hallucinations) this hinders their adoption in our systems. What if we switched these two around? By using a self-trained model for inference we gain control, predictability, and cost-effectiveness during deployment. By using an LLM during the model creation stage, we significantly reduce the cost of creating a model in the first place. This way we can leverage the immense capabilities of LLMs while still having the maturity of traditional MLOps. For this, we use synthetic data.

Synthetic data creation, powered by LLMs, allows us to generate high-quality training data without expensive and time-consuming manual labeling. The LLM is very capable of generating this dataset, one of the side-effects of pre-training on a giant amount of data. Intuitively, this makes sense: if an LLM is able to classify a sentence by sentiment, then it most likely possesses the underlying knowledge to generate sentences with a specific sentiment. We can leverage this capability to create datasets tailored to our use-cases and needs.

Next up, we use this synthetic data to train our narrow AI model. Note that the goal is not to distill the LLM into a smaller model that inherits all the capabilities of the large LLM. Oftentimes we only need a tiny fraction of the capabilities of the LLM to fit the use-case. Where the LLM is considered general AI, meaning it is applicable to a wide variety of tasks, the model we're aiming to create is narrow AI, which should perform one task and one task only. Focusing on narrow AI allows us to create highly optimized models that are more efficient, faster, and easier to deploy and manage than general-purpose LLMs. These smaller models can be fine-tuned for specific tasks and easily retrained when data drift occurs, something that would be far more difficult with LLMs. Therefore, we only need to extract a tiny part of the knowledge of the LLM.

You can look at it like this: using an LLM for everything is like using a flamethrower to light some candles. You might light the candles, or you might set your house on fire. Maybe it is better to stick to simple matches.

Synthetic Data Generation

In itself, synthetic data generation is not a new concept, but LLMs have definitely revolutionized the field. Previously, creating good synthetic data was complex, relying on statistical methods and requiring substantial amounts of data to extrapolate from. Not anymore.

Now, remember the previously specified requirements for high quality data. LLMs offer a new approach to meeting these requirements:

  1. Representative: where before we might have needed an existing pool of data to extrapolate new synthetic data from, with LLMs this data is not strictly required. Simply inputting the prompt "Create me 10 sentences and classify them as positive, negative, or neutral" will already do the trick. In some cases this suffices; however, when applying it to a specific use-case where you encounter sentences in a more niche domain or where the context matters, you might want to provide more information. For example, imagine you get an email where the person signs off with "thanks in advance for your most speedy delivery". While this sentence might seem neutral or even slightly positive in a general context, in a business context it comes across as passive-aggressive and should thus be classified as negative. Traditional methods might struggle to capture such nuanced context. LLMs, however, can be guided through techniques like few-shot prompting.
  2. Varied: there are plenty of ways to ensure that the dataset is varied. Raising the temperature during generation encourages the LLM to produce more diverse and less likely outputs. Using different styles of prompts, or focusing on different parts of the target domain, increases the probability that the LLM generates granular and varied responses. Even using different examples in the few-shot prompt can steer the LLM towards generating varied data. In general, the more varied your prompts for data creation are, the more varied your results will be. Of course, keep in mind that the created data should still be representative. Nevertheless, variety remains a difficulty, and it is still very much possible that your data is not varied enough, especially when your application has to cover lots of different use-cases. Research into more sophisticated methods for ensuring variety in synthetic data generation is ongoing. A minimal generation sketch follows below.
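The sketch below combines a few-shot prompt with a higher temperature to address both points; the model name, prompt, and examples are illustrative assumptions, not the exact approach InstaNER uses:

```python
# Few-shot synthetic data generation for sentiment classification.
# Model name, prompt, and examples are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

FEW_SHOT = """Examples of labelled sentences from a business e-mail domain:
"Thanks in advance for your most speedy delivery." -> negative
"The invoice has been processed as agreed." -> neutral
"Great working with your team on this rollout!" -> positive"""

def generate_batch(n: int = 10) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=1.0,  # higher temperature encourages more varied outputs
        messages=[
            {"role": "system",
             "content": "You generate training data for a sentiment classifier. "
                        "Return only a JSON array of objects with keys 'text' and 'label'."},
            {"role": "user", "content": f"{FEW_SHOT}\n\nGenerate {n} new, diverse sentences."},
        ],
    )
    # In practice you would validate the parsed output before using it for training.
    return json.loads(response.choices[0].message.content)

dataset = generate_batch(10)
```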

For further reading regarding synthetic data generation, you can take a look at the following papers:

  1. Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
  2. Generative AI for Synthetic Data Generation: Methods, Challenges and the Future

InstaNER

Imagine needing a Named Entity Recognition (NER) model for a specialized domain, like legal contracts or medical records. Finding a suitable dataset can be a nightmare. I remember having to label data for multiple days straight simply because the data we were working with was unconventional and written in Dutch and French. I can confidently say that it was not an enjoyable experience, and, in some way, this blogpost and my solution, InstaNER, were born out of that experience, and my spite for having to do it again. So let me introduce InstaNER.

InstaNER is an easy-to-use framework that allows you to train a Named Entity Recognition (NER) model. In NER you're trying to identify all the named entities in a sentence. NER is a common task in natural language processing, enabling document search and extraction of key information. For example, take the sentence "John was walking in Chicago". There are two entities in this sentence, namely John and Chicago (a small example of what such labelled data looks like is shown further below). This doesn't seem like a very difficult task, and to be honest, it isn't. The issue is that collecting a NER dataset has historically been rather expensive, because you need sufficient high-quality samples. Although great datasets are freely available, there are usually a few constraints that strongly limit their usefulness:

  1. Entities are ambiguous: you would think it is quite easy to identify whether something is an entity or not, but consider the following sentence: "John Thomas was walking in Chicago". Should "John Thomas" be considered a single entity with the label "NAME"? What if we want to detect first and last name separately? And if both are tagged as a name, misidentifying John Thomas as two separate entities could lead to wrong conclusions. In that case we would have to find a new dataset that takes the same approach as we do. Would we be so lucky? Or will we have to resort to labeling ourselves? For anything besides the most obvious and common cases, it is usually the latter. Many such cases exist; for example, you might also want to extract the street, house number, etc. of an address. Some datasets make this distinction, others don't. Some datasets won't even regard an address as an entity, while other datasets do.
  2. Low-resource languages: English is the lingua franca and the most used language when creating datasets. However, many applications still require interactions in other languages. A wide array of resources in those languages is not a given, and having such resources available in English is a luxury which many simply do not have. The lack of NER datasets in low-resource languages hinders the development of crucial language-based technologies in those regions. InstaNER contributes to a potential solution to this problem.

Note that the chosen use-case for this tool was NER, but parallels can easily be drawn with other NLP tasks, where the ideas remain the same.
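To give a feel for what such labelled data looks like, NER training examples are typically stored token by token using a scheme such as BIO tagging. The exact serialization InstaNER uses may differ; this is just the common convention:

```python
# One NER training example in BIO format: B- marks the beginning of an entity,
# I- marks a continuation, and O marks tokens outside any entity.
example = {
    "tokens": ["John", "Thomas", "was", "walking", "in", "Chicago"],
    "tags":   ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOCATION"],
}
```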

InstaNER sets up an entire pipeline that goes through the following steps:

  1. We create synthetic data using an LLM for training a NER model. The user provides the entities that they want the model to recognize and can provide example sentences that serve as a starting point for the LLM to base the generated data on. These entities can be simple, such as "PERSON" or "LOCATION", but also more specialized, e.g., "LEGAL_CLAUSE" or "MEDICAL_CONDITION"; anything you can imagine, really! Furthermore, the user provides the language in which the dataset should be created, leveraging the power of pre-training and LLMs to generate data even in low-resource languages. Finally, providing a few examples can greatly improve the representativeness of the generated dataset.
  2. After the synthetic data has been created, you train the NER model. Creating the NER model is done by fine-tuning a BERT model, which is pre-trained to have a strong contextual understanding of language (a condensed sketch of this step follows after this list). A different model variant is used depending on the language provided: some languages already have a BERT model pre-trained on that language (e.g., English - distilbert-base-uncased); if not, we use a multilingual BERT model - bert-base-multilingual-uncased.
  3. Once the model has been trained, we of course evaluate its performance on our test set, a part of the generated dataset on which the model has not been trained. For each type of entity we calculate the accuracy, precision, recall and F1-score.
  4. Finally, a qualitative check can be performed by automatically loading the trained model for inference. The user can interact with the trained model through the CLI, providing sample text and observing the outputs of the model in real time.
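Under the hood, step 2 boils down to standard token-classification fine-tuning with the Hugging Face transformers library. The condensed sketch below uses a single toy example in the BIO format shown earlier; the label set, hyperparameters, and dataset handling are simplified assumptions rather than InstaNER's exact implementation:

```python
# Condensed sketch of fine-tuning BERT for NER with Hugging Face transformers.
# Labels, hyperparameters, and the toy dataset are illustrative only.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, TrainingArguments, Trainer)

labels = ["O", "B-PERSON", "I-PERSON", "B-LOCATION", "I-LOCATION"]
label2id = {label: i for i, label in enumerate(labels)}

# Toy synthetic dataset; in practice this comes from the LLM generation step.
raw = Dataset.from_list([
    {"tokens": ["John", "Thomas", "was", "walking", "in", "Chicago"],
     "tags": ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOCATION"]},
])

model_name = "distilbert-base-uncased"  # or "bert-base-multilingual-uncased" for other languages
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_and_align(example):
    # Tokenize pre-split words and map each sub-word token back to its word's label.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [
        -100 if word_id is None else label2id[example["tags"][word_id]]
        for word_id in enc.word_ids()
    ]
    return enc

tokenized = raw.map(tokenize_and_align, remove_columns=raw.column_names)

model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))
args = TrainingArguments(output_dir="ner-model", learning_rate=2e-5,
                         num_train_epochs=3, per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```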

All the used data, hyperparameters, and fine-grained evaluation results are saved together in a directory. This comprehensive recording of the training parameters allows for full reproducibility of the model training process, letting users easily recreate and tweak their models.

There are two ways to use the tool:

  • Using the CLI to start the main function with arguments. This approach is recommended if you know what you're doing and have trained NER models before.
  • Using the Model Creation Agent, which uses an agentic workflow (yes, this is me shoehorning an agent into my post, ironic, isn't it?) that guides you through the process of generating synthetic data, as well as training and evaluating the model. This makes the tool accessible to users who are not too familiar with training an AI model.

To summarize, using synthetic data allows us to combine the strengths of LLMs and specialized models. There are several advantages:

  • Reduced Data Collection Cost: as previously said, a major and costly bottleneck in the creation of AI models is the collection and labelling of data. LLMs can be used to generate this synthetic data, significantly reducing data collection and preparation costs.
  • Low-resource domains: for many specialized domains or languages, datasets are not readily available or simply don't exist. LLMs and synthetic data can help bridge this gap, enabling models that before might have been very costly or downright impossible to create.
  • Control and predictability: smaller, specialized models provide greater control over their behaviour and outputs. Where an LLM might hallucinate when given new, unexpected data, you can easily update your own model when data drift occurs, ensuring consistent performance and mitigating the risk of unexpected results.
  • Efficiency and deployability: as opposed to an LLM, which is often too costly to run yourself, a small model can easily be fully self-hosted, giving you more control over the model and allowing you to update it based on your own monitoring. These smaller models require less compute and have faster inference times.
  • Privacy: synthetic data can be generated without including sensitive user information. Afterwards, by using your own model, you no longer need to send potentially sensitive data to a third party.

However, there are still some limitations and challenges that need to be addressed:

  • Data quality: as mentioned previously, generating a representative and diverse synthetic dataset is not straightforward. The LLM might fail to capture the nuances and complexities present in your use-case. Validating the synthetic data is therefore still strongly recommended, to check that it is representative and satisfies your use-case.
  • Dependence on the LLM: although I advocated reducing our dependency on LLMs, the LLM is still a core component of synthetic data creation. This means that if the LLM has limitations in generating high-quality data, the quality of the trained models will also be affected.

Finally, I want to take a moment to point out that this is not a this-or-that scenario; it's not as if you can ONLY choose one of the options. For example, you might combine real-world data that you have collected with synthetic data to train your model, or you might use the LLM to help you label and process the data you have collected. Likewise, you might include both a self-trained model and an LLM in your pipeline for extra redundancy. Or you might have a setup where you first use your own model for classification and send cases where the model is unsure to an LLM, which might be able to leverage its huge training corpus to make a better decision.
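To illustrate that last hybrid setup, here is a minimal sketch of confidence-based routing, where the specialized model handles the bulk of the traffic and only low-confidence cases fall back to an LLM; the `small_model` and `ask_llm` helpers are hypothetical stand-ins:

```python
# Hybrid routing: default to the cheap specialized model, escalate only when unsure.
# small_model and ask_llm are hypothetical helpers standing in for real components.
CONFIDENCE_THRESHOLD = 0.8

def classify(text: str) -> str:
    label, confidence = small_model.predict(text)  # specialized model's prediction + score
    if confidence >= CONFIDENCE_THRESHOLD:
        return label
    return ask_llm(text)  # ambiguous case: let the LLM decide
```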

Closing thoughts

I'm very excited by the potential of LLMs, and I believe the incredible things they can do will only increase. However, I hope that this blogpost and tool showcase make you reconsider using an LLM as a silver bullet and shoehorning it into every possible step of your application, leading to unpredictable, unreliable and costly systems. LLMs are an amazing tool and are being integrated all around us, but let's not forget that these models are not perfect and that there are always tradeoffs when using this technology.

Imagine building a chatbot that kicks off complex pipelines relying on an LLM for several key steps. Even with a 99% success rate per step, if the chatbot has to go through five steps, the success rate of the entire pipeline drops to roughly 95%; with ten steps it drops further to about 90%. For a high-throughput system, these small error rates quickly add up to a significant problem.
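The compounding is just repeated multiplication of per-step success probabilities, easy to sanity-check:

```python
# End-to-end success rate of a pipeline of n steps, each succeeding 99% of the time.
per_step = 0.99
for n in (5, 10):
    print(f"{n} steps: {per_step ** n:.1%}")  # 5 steps: 95.1%, 10 steps: 90.4%
```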

Next time you're tempted to use an LLM as a quick fix, consider the tradeoffs. As always in the design of complex software systems, spending a little extra time now might save you a whole lot of time in the future.