Task

Deploy a self-hosted LLM for a real estate agency that generates descriptions of houses and apartments based on property features. The model should generate descriptions in both English and Russian.

Implementation

Base model: Llama 3

Out of the box, it performs well on descriptions in English and noticeably worse in Russian. So I decided that the dataset should consist predominantly of Russian descriptions, with some English ones as well.

I then collected data from several sources, mainly Prian and Turnintolocal. After that, I cleaned and normalized the dataset to make it suitable for training. For fine-tuning I used the Alpaca format, in which each record has the following fields:

instruction: describes the task the model should perform.
input: optional context or input for the task.
output: the answer to the instruction, as it should be generated.
text: the instruction, input and output formatted with the prompt template used by the authors for fine-tuning their models.
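
The fields above can be sketched in Python. This is a minimal illustration, assuming the standard Alpaca prompt template for records that have an input; the `make_record` helper and the sample property features are hypothetical and are not taken from the actual dataset.

```python
# Standard Alpaca prompt template for records with a non-empty input.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def make_record(instruction: str, input_text: str, output: str) -> dict:
    """Build one Alpaca-style record (hypothetical helper, for illustration)."""
    record = {"instruction": instruction, "input": input_text, "output": output}
    # "text" is the three fields rendered through the prompt template.
    record["text"] = ALPACA_TEMPLATE.format(**record)
    return record


# Hypothetical sample: the features and description are made up.
record = make_record(
    instruction="Write a property listing description in Russian.",
    input_text="type: apartment; rooms: 2; area: 54 m2; floor: 3",
    output="Светлая двухкомнатная квартира площадью 54 м2 на третьем этаже.",
)
print(record["text"])
```

During fine-tuning the model is trained on the `text` field, so the same template has to be used when prompting the model at inference time.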

In the end, my dataset looked like this: [image: Dataset]
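
The cleaning and normalization step is not detailed in this write-up. As a rough sketch only, assuming it involved normalizing Unicode and whitespace and dropping empty or duplicate descriptions, it could look like the following; the `normalize_text` and `clean_records` helpers are hypothetical.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_records(records: list[dict]) -> list[dict]:
    """Normalize each 'output' field, then drop empty and duplicate descriptions."""
    seen = set()
    cleaned = []
    for record in records:
        output = normalize_text(record.get("output", ""))
        if not output or output in seen:
            continue  # skip blank descriptions and exact repeats
        seen.add(output)
        cleaned.append({**record, "output": output})
    return cleaned


# Hypothetical raw records, as they might come out of scraping.
raw = [
    {"output": "  Светлая   квартира\nв центре. "},
    {"output": "Светлая квартира в центре."},
    {"output": "   "},
]
cleaned = clean_records(raw)
print(cleaned)
```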

Next comes the training part:

For tooling I used Unsloth. It provides convenient, simple Jupyter notebooks that can be run in Google Colab (as in my case) or on any machine with at least 16 GB of VRAM (for example, one with a T4 GPU).

Evaluation

After training, I saved the model in GGUF format (a common file format for deploying LLMs) and loaded it into Ollama. After fine-tuning, the model generates better descriptions in Russian: [image: Inference]

It now also follows a description structure similar to the one in the dataset, which makes its outputs more accurate and predictable.
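
Once the model is loaded in Ollama, it can be queried over Ollama's local HTTP API. The sketch below assumes Ollama is running on its default port (11434). The model name `realestate-llama3`, the prompt wording, and the helper functions are hypothetical placeholders, not the names used in this project.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, features: str, language: str = "Russian") -> dict:
    """Build the JSON body for a non-streaming generation request."""
    prompt = (
        f"Write a property listing description in {language}.\n\n"
        f"Features: {features}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def generate(payload: dict, url: str = OLLAMA_URL) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# "realestate-llama3" is a hypothetical name for the fine-tuned model.
payload = build_payload("realestate-llama3", "apartment, 2 rooms, 54 m2, floor 3")
# print(generate(payload))  # requires a running Ollama server with the model loaded
```

For best results the prompt should follow the same template that was used during fine-tuning, rather than the simplified wording shown here.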