
NER in Czech Documents with XLM-RoBERTa using Accelerate

11/18/2024 Jesus Santana

Decisions I made during the development of a document processing model that was successfully deployed

Image generated by Dall-E

Although I have over 8 years of experience with ML projects, this was my first NLP project. I initially searched for existing resources and code but found limited material, particularly for NER in Czech-language documents. This inspired me to compile everything I learned during development into one place, hoping to help future newcomers progress more efficiently. As such, this article offers a practical introduction rather than an in-depth theoretical analysis.

Specific numerical results are omitted due to sensitivity considerations.

Data

Example of document with entities: variable symbol of creditor (red), surname (light green), given name (dark green), date of birth (blue). Sensitive information is blacked out.

Task
The main objective was to identify the client(s) associated with each document through one of the following identifiers:

  • variable symbol of creditor (present in about 20% of documents)
  • birth ID (present in about 60% of documents)
  • combination name + surname + birth date (present in about 50% of documents)

Approximately 5% of the documents contained no identifying entities.

Dataset
For development, I used 710 “true” PDF documents, dividing them into three sets: 600 for training, 55 for validation, and 55 for testing.

Labels
I received an Excel file with the entities extracted as plain text, which required manually labeling the document text. Using the BIO tagging format, I followed these steps (a short code sketch follows the list):

  1. Open each document (using the extract_text() function from the pdfminer.high_level module)
  2. Split the text into words (using the SpaCy model “xx_sent_ud_sm” with adjustments, such as preventing splits on hyphens to handle birth number formats, e.g., ‘84-12-10/7869’)
  3. Identify entities within the text
  4. Assign corresponding labels to entities, using the “O” label for all other words
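
A minimal sketch of this labelling pipeline, under simplifying assumptions: the helper names are mine, the spaCy tokenizer adjustments (e.g., keeping hyphenated birth numbers together) are omitted, and the entity strings from the Excel file are assumed to be already split into word sequences.

from pdfminer.high_level import extract_text
import spacy

# Multilingual model mentioned in step 2
nlp = spacy.load("xx_sent_ud_sm")

def pdf_to_words(path):
    """Steps 1 + 2: extract text from a "true" PDF and split it into words."""
    text = extract_text(path)
    return [tok.text for tok in nlp(text) if not tok.is_space]

def bio_labels(words, entities):
    """Steps 3 + 4: assign BIO tags.

    `entities` maps an entity type (e.g., 'rc') to a list of word sequences
    taken from the Excel file; all other words receive the "O" label.
    """
    labels = ["O"] * len(words)
    for ent_type, phrases in entities.items():
        for phrase in phrases:
            n = len(phrase)
            for i in range(len(words) - n + 1):
                if words[i:i + n] == phrase:
                    labels[i] = "B-" + ent_type
                    for j in range(i + 1, i + n):
                        labels[j] = "I-" + ent_type
    return labels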

Alternative Approach
Models like LayoutLM, which also consider bounding boxes for input tokens, might improve quality. However, I avoided this option since, as usual (😮‍💨), I had already spent most of the project time on data preparation (e.g., reformatting Excel files, correcting data errors, labeling). Pursuing bounding box-based models would have required even more time.

While regex and heuristics could theoretically work for simple entities like these, I believe this approach would be ineffective, as it would require overly complex rules to accurately identify the correct ones amidst other potential candidates (lawyer name, case number, other participants in the proceedings, etc.). The model, on the other hand, is capable of learning to distinguish the relevant entities, making the use of heuristics unnecessary.

Model (training)

🤗 Accelerate
Having started in a time when wrappers were less common, I became accustomed to writing my own training loops, which I find easier to debug - an approach that 🤗 Accelerate supports effectively. It proved beneficial in this project: I wasn’t entirely certain of the required data and label formats or shapes, and my data didn’t match the well-organized examples often shown in tutorials, but having full access to intermediate computations during the training loop allowed me to iterate quickly.

Context Length
Most tutorials suggest using each sentence as a single training example. However, in this case, I decided a longer context would be more suitable as documents typically contain references to multiple entities, many of which are irrelevant (e.g. lawyers, other creditors, case numbers). This broader context enables the model to better identify the relevant client. I used 512 tokens from each document as one training example. This is a common maximum limit for models but comfortably accommodates all entities in most of my documents.

Labelling of Subtokens
In the 🤗 token classification tutorial [1], the recommended approach is:

Only labeling the first token of a given word. Assign -100 to other subtokens from the same word.

However, I found that the following method suggested in the token classification tutorial in their NLP course [2] works much better:

Each token gets the same label as the token that started the word it’s inside, since they are part of the same entity. For tokens inside a word but not at the beginning, we replace the B- with I-

The label “-100” is a special label that is ignored by the loss function. Hence, I used their functions with minor changes:

def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels


def tokenize_and_align_labels(examples):
    tokenizer = AutoTokenizer.from_pretrained("../model/xlm-roberta-large")
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True,
        padding="max_length", max_length=512)
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

I also used their postprocess() function:

To simplify its evaluation part, we define this postprocess() function that takes predictions and labels and converts them to lists of strings.

def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()

    true_labels = [[id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_predictions, true_labels

Class Weights
Incorporating class weights into the loss function significantly improved model performance.
While this adjustment may seem straightforward — without it, the model overemphasized the majority “O” class — it’s surprisingly absent from most tutorials. I implemented a custom compute_weights() function to address this imbalance:

def compute_weights(trainset, num_labels):
    # Count occurrences of each label id (including -100) across the training set
    c = Counter()
    for t in trainset:
        c += Counter(t['labels'].tolist())
    # Inverse-frequency weights; the +1 avoids division by zero for unseen labels
    weights = [sum(c.values())/(c[i]+1) for i in range(num_labels)]
    return weights

Training Loop
I defined two additional functions: PyTorch DataLoader() to manage batch processing, and a main() function to set up distributed training objects and execute the training loop.

from accelerate import Accelerator, notebook_launcher
from collections import Counter
from datasets import Dataset
from datetime import datetime
import os
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification
from transformers import XLMRobertaConfig, XLMRobertaForTokenClassification
from seqeval.metrics import classification_report, f1_score

def create_dataloaders(trainset, evalset, batch_size, num_workers):
    train_dataloader = DataLoader(trainset, shuffle=True,
                                  batch_size=batch_size, num_workers=num_workers)
    eval_dataloader = DataLoader(evalset, shuffle=False,
                                 batch_size=batch_size, num_workers=num_workers)
    return train_dataloader, eval_dataloader

def main(batch_size, num_workers, epochs, model_path, dataset_tr, dataset_ev, training_type, model_params, dt):
    accelerator = Accelerator(split_batches=True)
    num_labels = model_params['num_labels']

    # Prepare data #
    train_ds = Dataset.from_dict(
        {"tokens": [d[2][:512] for d in dataset_tr],
         "ner_tags": [d[1][:512] for d in dataset_tr]})
    eval_ds = Dataset.from_dict(
        {"tokens": [d[2][:512] for d in dataset_ev],
         "ner_tags": [d[1][:512] for d in dataset_ev]})
    trainset = train_ds.map(tokenize_and_align_labels, batched=True,
                            remove_columns=["tokens", "ner_tags"])
    evalset = eval_ds.map(tokenize_and_align_labels, batched=True,
                          remove_columns=["tokens", "ner_tags"])
    trainset.set_format("torch")
    evalset.set_format("torch")
    train_dataloader, eval_dataloader = create_dataloaders(trainset, evalset,
                                                           batch_size, num_workers)

    # Type of training #
    if training_type == 'from_scratch':
        config = XLMRobertaConfig.from_pretrained(model_path, **model_params)
        model = XLMRobertaForTokenClassification(config)
    elif training_type == 'transfer_learning':
        model = AutoModelForTokenClassification.from_pretrained(model_path,
            ignore_mismatched_sizes=True, **model_params)
        for param in model.parameters():
            param.requires_grad = False
        for param in model.classifier.parameters():
            param.requires_grad = True
    elif training_type == 'fine_tuning':
        model = AutoModelForTokenClassification.from_pretrained(model_path,
            **model_params)
        for param in model.parameters():
            param.requires_grad = True
        for param in model.classifier.parameters():
            param.requires_grad = True

    # Instantiate the optimizer #
    optimizer = torch.optim.AdamW(params=model.parameters(), lr=2e-5)

    # Instantiate the learning rate scheduler #
    lr_scheduler = ReduceLROnPlateau(optimizer, patience=5)

    # Define loss function #
    weights = compute_weights(trainset, num_labels)
    loss_fct = CrossEntropyLoss(weight=torch.tensor(weights))

    # Prepare objects for distributed training #
    loss_fct, train_dataloader, model, optimizer, eval_dataloader, lr_scheduler = accelerator.prepare(
        loss_fct, train_dataloader, model, optimizer, eval_dataloader, lr_scheduler)

    # Training loop #
    max_f1 = 0      # for early stopping
    best_epoch = 0  # epoch of the best checkpoint so far
    for t in range(epochs):
        # training
        accelerator.print(f"\n\nEpoch {t+1}\n-------------------------------")
        model.train()
        tr_loss = 0
        preds = list()
        labs = list()
        for batch in train_dataloader:
            outputs = model(input_ids=batch['input_ids'],
                            attention_mask=batch['attention_mask'])
            labels = batch["labels"]
            loss = loss_fct(outputs.logits.view(-1, num_labels), labels.view(-1))
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
            tr_loss += loss
            predictions = outputs.logits.argmax(dim=-1)
            predictions_gathered = accelerator.gather(predictions)
            labels_gathered = accelerator.gather(labels)
            true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
            preds.extend(true_predictions)
            labs.extend(true_labels)

        lr_scheduler.step(tr_loss)

        accelerator.print(f"Train loss: {tr_loss/len(train_dataloader):>8f} \n")
        accelerator.print(classification_report(labs, preds))

        # evaluation
        model.eval()
        ev_loss = 0
        preds = list()
        labs = list()
        for batch in eval_dataloader:
            with torch.no_grad():
                outputs = model(input_ids=batch['input_ids'],
                                attention_mask=batch['attention_mask'])
            labels = batch["labels"]
            loss = loss_fct(outputs.logits.view(-1, num_labels), labels.view(-1))

            ev_loss += loss
            predictions = outputs.logits.argmax(dim=-1)
            predictions_gathered = accelerator.gather(predictions)
            labels_gathered = accelerator.gather(labels)
            true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
            preds.extend(true_predictions)
            labs.extend(true_labels)

        accelerator.print(f"Eval loss: {ev_loss/len(eval_dataloader):>8f} \n")
        accelerator.print(classification_report(labs, preds))

        accelerator.print(f"Current Learning Rate: {optimizer.param_groups[0]['lr']}")

        # checkpoint best model
        if f1_score(labs, preds) > max_f1:
            accelerator.wait_for_everyone()
            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save_pretrained(f"../model/xlml_ner/{dt}/",
                is_main_process=accelerator.is_main_process,
                save_function=accelerator.save)
            accelerator.print(f"Model saved during {t+1}. epoch.")
            max_f1 = f1_score(labs, preds)
            best_epoch = t

        # early stopping
        if (t - best_epoch) > 10:
            accelerator.print(f"Early stopping after {t+1}. epoch.")
            break

    accelerator.print("Done!")

With everything prepared, the model is ready for training. I just need to initiate the process:

label_list = [
    "O",
    "B-evcu", "I-evcu",          # variable symbol of creditor
    "B-rc", "I-rc",              # birth ID
    "B-prijmeni", "I-prijmeni",  # surname
    "B-jmeno", "I-jmeno",        # given name
    "B-datum", "I-datum",        # birth date
]
id2label = {a: b for a, b in enumerate(label_list)}
label2id = {b: a for a, b in enumerate(label_list)}

num_workers = 6  # number of GPUs
batch_size = num_workers*2
epochs = 100
model_path = "../model/xlm-roberta-large"
training_type = "fine_tuning"  # from_scratch / transfer_learning / fine_tuning
model_params = {"id2label": id2label, "label2id": label2id, "num_labels": 11}
dt = datetime.now().strftime("%Y%m%d_%H%M%S")
os.mkdir(f"../model/xlml_ner/{dt}")

notebook_launcher(main, args=(batch_size, num_workers, epochs, model_path,
                              dataset_tr, dataset_ev, training_type, model_params, dt),
                  num_processes=num_workers, mixed_precision="fp16", use_port="29502")

I find using notebook_launcher() convenient, as it allows me to run training in the console and easily work with results afterward.

XLM-RoBERTa base vs large vs Small-E-Czech
I experimented with fine-tuning three models. The XLM-RoBERTa base model [3] delivered satisfactory performance, but the server capacity also allowed me to try the XLM-RoBERTa large model [3], which has roughly twice as many parameters.

XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.

The large model showed a slight improvement in results, so I ultimately deployed it. I also tested Small-E-Czech [4], an Electra-small model pre-trained on Czech web data, but its performance was poor.

Fine-tuning vs Transfer learning vs Training from scratch
In addition to fine-tuning (updating all model weights), I tested transfer learning, as it is sometimes suggested that training only the final (classification) layer may suffice. However, the performance difference was significant, favoring full fine-tuning. I also attempted training from scratch by importing only the architecture of the model, initializing the weights randomly, and then training, but as expected, this approach was ineffective.

RoBERTa vs LLM (Claude 3.5 Sonnet)
I briefly explored zero-shot LLMs, though with minimal prompt engineering (so 🥱). The model struggled even with basic requests, such as (I used Czech in the actual prompt):

Find variable symbol of creditor. This number has exactly 9 consecutive digits 0–9 without letters or other special characters. It is usually preceded by one of the following abbreviations: ‘ev.č.’, ‘zn. opr’, ‘VS. O’, ‘evid. č. opr.’. On the contrary, I’m not interested in a transaction number with the abbreviation ‘č.j.’. This number does not appear often in documents, it may happen that you will not be able to find it, then write ‘cannot find’. If you’re not sure, write ‘not sure’.

The model sometimes failed to output the 9-digit format accurately. Post-processing would filter out shorter numbers, but there were still many false-positive 9-digit numbers.

Occasionally the model inferred incorrect birth IDs based solely on birth dates (even with temperature set to 0). On the other hand, it excelled at extracting names, surnames, and birth dates.
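
For reference, here is a minimal sketch of how such a zero-shot call could look with the Anthropic Python SDK; the model name, message structure, and variable names are assumptions rather than the exact setup I used.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_entity(prompt, document_text):
    """Send the entity-extraction instruction together with the document text."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=200,
        temperature=0,  # deterministic decoding, as mentioned above
        messages=[{"role": "user", "content": prompt + "\n\n" + document_text}],
    )
    return response.content[0].text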

Overall, even in my previous experiments, I found that LLMs (at the time of writing) perform better with general tasks but lack accuracy and reliability for specific or unconventional tasks. The performance in identifying the client was fairly similar for both approaches. For internal reasons, the RoBERTa model was deployed.

Post-processing

Notably, implementing post-processing can significantly reduce false positives, enhancing overall performance. Each entity was subject to customized filtering and validation rules (sketched in code after the list):

  • variable symbol of creditor - verify the 9-digit format
  • birth ID - enforce XXXXXX/XXX(X) format and check divisibility by eleven
  • name and surname - apply lemmatization using MorphoDiTa [5]
  • date of birth - enforce DD.MM.YYYY format
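
A minimal sketch of these checks, with illustrative function names; the MorphoDiTa lemmatization step for names and surnames is omitted here.

import re
from datetime import datetime

def valid_variable_symbol(s):
    """Variable symbol of creditor: exactly 9 digits."""
    return bool(re.fullmatch(r"\d{9}", s))

def valid_birth_id(s):
    """Birth ID in XXXXXX/XXX(X) format; 10-digit numbers must be divisible by 11."""
    if not re.fullmatch(r"\d{6}/\d{3,4}", s):
        return False
    digits = s.replace("/", "")
    return len(digits) == 9 or int(digits) % 11 == 0

def valid_birth_date(s):
    """Birth date in DD.MM.YYYY format."""
    try:
        datetime.strptime(s, "%d.%m.%Y")
        return True
    except ValueError:
        return False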

Conclusion

The fine-tuned model was successfully deployed and performs superbly, exceeding expectations given the modest dataset of 710 documents.

While LLMs show promise for general tasks, they lack the accuracy and reliability for specialized tasks. That said, it’s likely that in the near future, even fine-tuning will become unnecessary for all but highly specialized cases as LLMs continue to improve.

Acknowledgments
I would like to thank Martin, Tomáš and Petr for their valuable suggestions for the improvement of this article.

Sources
[1] Hugging Face, Transformers - Token classification
[2] Hugging Face, NLP Course — Token classification
[3] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzman, E. Grave, M. Ott, L. Zettlemoyer and V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale (2019), CoRR abs/1911.02116
[4] M. Kocián, J. Náplava, D. Štancl and V. Kadlec, Siamese BERT-based Model for Web Search Relevance Ranking Evaluated on a New Czech Dataset (2021)
[5] J. Straková, M. Straka and J. Hajič, Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition (2014), In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13–18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

