Datascience in Towards Data Science on Medium,

Translating a Memoir: A Technical Journey

12/11/2024 Jesus Santana

Leveraging GPT-3.5 and Unstructured APIs for translations

This blog post details how I utilised GPT to translate the personal memoir of a family friend, making it accessible to a broader audience. Specifically, I employed GPT-3.5 for translation and Unstructured’s APIs for efficient content extraction and formatting.

The memoir, a heartfelt account by my family friend Carmen Rosa, chronicles her upbringing in Bolivia and her romantic journey in Paris with an Iranian man during the vibrant 1970s. The book was originally written in Spanish, and we aimed to preserve the essence of her narrative while expanding its reach to English-speaking readers through the application of LLM technologies.

Cover image of “Un Destino Sorprendente”, used with permission of author Carmen Rosa Wichtendahl.

Below you can read about the translation process in more detail, or you can access the Colab Notebook.

Translating the document

I followed these steps to translate the book:

  1. Import Book Data: I imported the book from a Docx document using the Unstructured API and divided it into chapters and paragraphs.
  2. Translation Technique: I translated each chapter using GPT-3.5. For each paragraph, I provided the latest three translated sentences (if available) from the same chapter. This approach served two purposes:
     • Style Consistency: Maintaining a consistent style throughout the translation by providing context from previous translations.
     • Token Limit: Limiting the number of tokens processed at once to avoid exceeding the model’s context limit.
  3. Exporting translation as Docx: I used the python-docx library to save the translated content in Docx format.

Technical implementation

1. Libraries

We’ll start with the installation and import of the necessary libraries.

pip install --upgrade openai
pip install python-dotenv
pip install unstructured
pip install python-docx

import openai

# Unstructured
from unstructured.partition.docx import partition_docx
from unstructured.cleaners.core import group_broken_paragraphs

# Data and other libraries
import pandas as pd
import re
from typing import List, Dict
import os
from dotenv import load_dotenv

2. Connecting to OpenAI’s API

The code below sets up the OpenAI API key and creates the client object that the translation functions use later on. You need to save your API key in a .env file.

from openai import OpenAI

# Specify the path to the .env file
dotenv_path = '/content/.env'

_ = load_dotenv(dotenv_path)  # read local .env file

# The translation functions below call this client
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
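
If the key is missing from the .env file, reading it from os.environ fails with a bare KeyError. Below is a minimal sketch of a friendlier check. The helper name require_env is my own assumption for illustration and is not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value or fail with a clear message.

    `require_env` is a hypothetical helper, not part of the original notebook.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add a line like {name}=<your key> to the .env file."
        )
    return value
```

Calling require_env('OPENAI_API_KEY') after load_dotenv would then either return the key or stop early with a readable error.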

3. Loading the book

The code below imports the book in Docx format and divides it into individual paragraphs.

elements = partition_docx(
    filename="/content/libro.docx",
    paragraph_grouper=group_broken_paragraphs
)

The code below prints the element at index 10 of elements.

print(elements[10])

# Returns: Destino sorprendente, es el título que la autora le puso ...
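
Before grouping the elements, it can help to check which categories the partitioning step produced. The sketch below counts them. It uses stand-in objects so it runs without a Docx file; in the notebook you would pass the real elements list instead. The helper count_categories and the sample data are assumptions for illustration.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in objects (hypothetical sample data); in the notebook, use the
# list returned by partition_docx instead.
sample_elements = [
    SimpleNamespace(category="Title", text="Proemio"),
    SimpleNamespace(category="NarrativeText", text="La autobiografía es considerada ..."),
    SimpleNamespace(category="NarrativeText", text="Dentro de las artes literarias, ..."),
    SimpleNamespace(category="ListItem", text="..."),
]


def count_categories(elements) -> Counter:
    """Count how many elements fall into each category."""
    return Counter(element.category for element in elements)


category_counts = count_categories(sample_elements)
print(category_counts.most_common())
```

Categories other than Title and NarrativeText are the ones the grouping step in the next section leaves out.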

4. Group book into titles and chapters

The next step involves creating a list of chapters. Each chapter will be represented as a dictionary containing a title and a list of paragraphs. This structure simplifies the process of translating each chapter and paragraph individually. Here’s an example of this format:

[
{"title": title 1, "content": [paragraph 1, paragraph 2, ..., paragraph n]},
{"title": title 2, "content": [paragraph 1, paragraph 2, ..., paragraph n]},
...
{"title": title n, "content": [paragraph 1, paragraph 2, ..., paragraph n]},
]

To achieve this, we’ll create a function called group_by_chapter. Here are the key steps involved:

  1. Extract Relevant Information: We can get each narrative text and title by calling element.category. Those are the only categories we’re interested in translating at this point.
  2. Identify Narrative Titles: We recognise that some titles should be part of the narrative text. To account for this, we assume that italicised or unstyled titles belong to the narrative paragraph.

def group_by_chapter(elements: List) -> List[Dict]:
    chapters = []
    current_chapter = None

    for element in elements:

        # emphasized_text_tags lists the emphasis tags ('b', 'i') found in the element, or None
        text_style = element.metadata.emphasized_text_tags
        # sorted() makes the comparison with ['b', 'i'] independent of set ordering
        unique_text_style = sorted(set(text_style)) if text_style is not None else None

        # we consider an element a title if it is a title category and the style is bold
        is_title = element.category == "Title" and unique_text_style == ['b']

        # we consider an element a narrative content if it is a narrative text category or
        # if it is a title category that is unstyled, italic, or italic and bold
        is_narrative = element.category == "NarrativeText" or (
            element.category == "Title"
            and unique_text_style in (None, ['i'], ['b', 'i'])
        )

        # for new titles
        if is_title:
            print(f"Adding title {element.text}")

            # Add previous chapter when a new one comes in, unless there is none yet
            if current_chapter is not None:
                chapters.append(current_chapter)

            current_chapter = {"title": element.text, "content": []}

        elif is_narrative:
            # Text that appears before the first title has no chapter to belong to
            if current_chapter is None:
                print(f"Skipping text before the first title: {element.text}")
                continue
            print(f"Adding Narrative {element.text}")
            current_chapter["content"].append(element.text)

        else:
            print(f'### No need to convert. Element type: {element.category}')

    # Add the last chapter, which is never followed by a new title
    if current_chapter is not None:
        chapters.append(current_chapter)

    return chapters


book_chapters = group_by_chapter(elements)

The example below shows one of the resulting chapters:

book_chapters[2] 

# Returns
{'title': 'Proemio',
'content': [
'La autobiografía es considerada ...',
'Dentro de las artes literarias, ...',
'Se encuentra más próxima a los, ...',
]
}
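
A quick way to sanity-check the grouping is to print one summary line per chapter. The sketch below does this for the list-of-dictionaries structure described above. The helper summarise_chapters and the sample data are assumptions for illustration.

```python
from typing import Dict, List


def summarise_chapters(chapters: List[Dict]) -> List[str]:
    """Build one summary line per chapter: index, title, paragraph and word counts."""
    lines = []
    for index, chapter in enumerate(chapters):
        n_paragraphs = len(chapter["content"])
        n_words = sum(len(paragraph.split()) for paragraph in chapter["content"])
        lines.append(
            f"{index}: {chapter['title']} ({n_paragraphs} paragraphs, {n_words} words)"
        )
    return lines


# Hypothetical sample data in the same shape as book_chapters
sample_chapters = [
    {"title": "Proemio", "content": ["uno dos tres", "cuatro cinco"]},
    {"title": "Sin contenido", "content": []},
]

for line in summarise_chapters(sample_chapters):
    print(line)
# 0: Proemio (2 paragraphs, 5 words)
# 1: Sin contenido (0 paragraphs, 0 words)
```

Chapters with zero paragraphs usually point to a heading that was styled in bold but is not a real chapter title.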

5. Book translation

To translate the book, we follow these steps:

  1. Translate Chapter Titles: We translate the title of each chapter.
  2. Translate Paragraphs: We translate each paragraph, providing the model with the latest three translated sentences as context.
  3. Save Translations: We save both the translated titles and content.

The function below automates this process.

def translate_book(book_chapters: List[Dict]) -> List[Dict]:
    translated_book = []
    for chapter in book_chapters:
        print(f"Translating following chapter: {chapter['title']}.")
        translated_title = translate_title(chapter['title'])
        translated_chapter_content = translate_chapter(chapter['content'])
        translated_book.append({
            "title": translated_title,
            "content": translated_chapter_content
        })
    return translated_book

For the title, we ask GPT for a simple translation as follows:

def translate_title(title: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "system",
            "content": f"Translate the following book title into English:\n{title}"
        }]
    )
    return response.choices[0].message.content

To translate a single chapter, we provide the model with the corresponding paragraphs. We instruct the model as follows:

  1. Identify the role: We inform the model that it is a helpful translator for a book.
  2. Provide context: We share the latest three translated sentences from the chapter.
  3. Request translation: We ask the model to translate the next paragraph.

During this process, the function combines all translated paragraphs into a single string.

# Function to translate a chapter using OpenAI API
def translate_chapter(chapter_paragraphs: List[str]) -> str:
    translated_content = ""

    for i, paragraph in enumerate(chapter_paragraphs):

        print(f"Translating paragraph {i + 1} out of {len(chapter_paragraphs)}")

        # Builds the message dynamically based on whether there is previous translated content
        messages = [{
            "role": "system",
            "content": "You are a helpful translator for a book."
        }]

        if translated_content:
            latest_content = get_last_three_sentences(translated_content)
            messages.append(
                {
                    "role": "system",
                    "content": f"This is the latest text from the book that you've translated from Spanish into English:\n{latest_content}"
                }
            )

        # Adds the user message for the current paragraph
        messages.append(
            {
                "role": "user",
                "content": f"Translate the following text from the book into English:\n{paragraph}"
            }
        )

        # Calls the API
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=messages
        )

        # Extracts the translated content and appends it
        paragraph_translation = response.choices[0].message.content
        translated_content += paragraph_translation + '\n\n'

    return translated_content
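
To inspect the prompt without calling the API, the message-building step can be pulled into a small pure function. This is only a sketch of that idea. The name build_messages is an assumption, and the prompt texts are copied from the function above.

```python
from typing import Dict, List, Optional


def build_messages(
    paragraph: str, previous_translation: Optional[str] = None
) -> List[Dict[str, str]]:
    """Build the chat messages for one paragraph (hypothetical helper).

    The context message is only added when earlier translated text exists.
    """
    messages = [{
        "role": "system",
        "content": "You are a helpful translator for a book.",
    }]

    if previous_translation:
        messages.append({
            "role": "system",
            "content": (
                "This is the latest text from the book that you've translated "
                f"from Spanish into English:\n{previous_translation}"
            ),
        })

    messages.append({
        "role": "user",
        "content": f"Translate the following text from the book into English:\n{paragraph}",
    })
    return messages


# The first paragraph of a chapter has no context, so only two messages are sent
print(len(build_messages("Hola.")))                 # 2
print(len(build_messages("Hola.", "Hello there.")))  # 3
```

Keeping this step separate makes it easy to print or unit-test the exact prompt before spending tokens on a full chapter.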

Finally, below we can see the supporting function to get the latest three sentences.

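def get_last_three_sentences(paragraph: str) -> str:
    # Use regex to split the text into sentences
    parts = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph)

    # Drop empty pieces left by the trailing newlines that translate_chapter appends;
    # otherwise the last "sentence" would be a bare newline
    sentences = [part.strip() for part in parts if part.strip()]

    # Get the last three sentences (or fewer if the paragraph has less than 3 sentences)
    last_three = sentences[-3:]

    # Join the sentences into a single string
    return ' '.join(last_three)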
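
To see how the sentence splitting behaves on text shaped like the accumulated translation (paragraphs separated by blank lines, with trailing newlines), here is a small self-contained sketch. The function last_sentences is a hypothetical variant that reuses the same regular expression and discards whitespace-only pieces. The sample text is made up.

```python
import re

# Same pattern as above: split on whitespace that follows '.' or '?',
# except after patterns such as "e.g." or "Mr."
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')


def last_sentences(text: str, n: int = 3) -> str:
    """Return the last n sentences of text, ignoring whitespace-only pieces."""
    parts = [part.strip() for part in SENTENCE_BOUNDARY.split(text)]
    sentences = [part for part in parts if part]
    return " ".join(sentences[-n:])


sample = "First sentence. Second sentence?\n\nThird sentence. Fourth sentence.\n\n"
print(last_sentences(sample))
# Second sentence? Third sentence. Fourth sentence.
```

Note that the pattern does not split after exclamation marks, so a sentence ending in "!" is merged with the one that follows it.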

6. Book export

Finally, we pass the list of chapters to a function that adds each title as a heading and each chapter's content as a paragraph. After each chapter, a page break is added to separate the chapters. The resulting document is then saved locally as a Docx file.

from docx import Document

def create_docx_from_chapters(chapters: List[Dict], output_filename: str) -> None:
    doc = Document()

    for chapter in chapters:
        # Add chapter title as Heading 1
        doc.add_heading(chapter['title'], level=1)

        # Add chapter content as normal text
        doc.add_paragraph(chapter['content'])

        # Add a page break after each chapter
        doc.add_page_break()

    # Save the document
    doc.save(output_filename)
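
Because the translated content of a chapter is one string with paragraphs separated by blank lines, the function above stores each chapter as a single Docx paragraph. If separate Docx paragraphs are preferred, one option is to split the string first. The sketch below assumes that approach. The helpers split_paragraphs and add_chapter_paragraphs are hypothetical, and doc is expected to be a python-docx Document (only its add_paragraph method is used).

```python
from typing import List


def split_paragraphs(content: str) -> List[str]:
    """Split translated chapter text on blank lines, dropping empty pieces."""
    return [block.strip() for block in content.split("\n\n") if block.strip()]


def add_chapter_paragraphs(doc, content: str) -> int:
    """Add each paragraph of a chapter to doc and return how many were added.

    `doc` is assumed to be a python-docx Document.
    """
    paragraphs = split_paragraphs(content)
    for text in paragraphs:
        doc.add_paragraph(text)
    return len(paragraphs)


print(split_paragraphs("First paragraph.\n\nSecond paragraph.\n\n"))
# ['First paragraph.', 'Second paragraph.']
```

Inside the export loop, this would replace the single doc.add_paragraph call for the chapter content.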

Limitations

While using GPT and APIs for translation is fast and efficient, there are key limitations compared to human translation:

  • Pronoun and Reference Errors: GPT misinterpreted pronouns or references in a few cases, potentially attributing actions or statements to the wrong person in the narrative. A human translator can better resolve such ambiguities.
  • Cultural Context: GPT missed subtle cultural references and idioms that a human translator could interpret more accurately. In this case, several slang terms unique to Santa Cruz, Bolivia, were retained in the original language without additional context or explanation.

Combining AI with human review can balance speed and quality, ensuring translations are both accurate and authentic.
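
One simple way to support that review step is to export the source and translated paragraphs side by side. The sketch below is an assumption about how this could look and was not part of the original project. It pairs paragraphs by position and flags a chapter when the counts differ, which can happen if the model adds or merges paragraphs. The helper names and sample data are hypothetical.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["paragraph", "source", "translation", "needs_alignment_check"]


def build_review_rows(original_paragraphs: List[str], translated_content: str) -> List[Dict]:
    """Pair each source paragraph with the translated paragraph at the same position."""
    translated = [p.strip() for p in translated_content.split("\n\n") if p.strip()]
    # If the counts differ, positional pairing is unreliable, so flag every row
    misaligned = len(translated) != len(original_paragraphs)
    rows = []
    for index, source in enumerate(original_paragraphs):
        rows.append({
            "paragraph": index + 1,
            "source": source,
            "translation": translated[index] if index < len(translated) else "",
            "needs_alignment_check": misaligned,
        })
    return rows


def rows_to_csv(rows: List[Dict]) -> str:
    """Serialise the review rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data
rows = build_review_rows(["Hola.", "Adiós."], "Hello.\n\nGoodbye.\n\n")
print(rows_to_csv(rows))
```

A reviewer can then open the CSV in a spreadsheet and correct pronouns, references, and regional slang paragraph by paragraph.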

Conclusion

This project demonstrates an approach to translating a book using a combination of GPT-3.5 and Unstructured APIs. By automating the translation process, we significantly reduced the manual effort required. While the initial translation output may require some human revisions to refine the nuances and ensure the highest quality, this approach serves as a strong foundation for efficient and effective book translation.

If you have any feedback or suggestions on how to improve this process or the quality of the translations, please feel free to share them in the comments below.


Translating a Memoir: A Technical Journey was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


