Datascience in Towards Data Science on Medium,

TSV in Pandas: A How-To Guide

10/03/2024 Jesus Santana

The Correct Way of Loading and Writing TSV Files with Pandas

Photo by Mika Baumeister on Unsplash

Introduction

TSV is a widely used format for storing tabular data, but it can be confusing when working with textual data and the Pandas library. Two factors cause the confusion:

  1. TSV is very similar to CSV (a well-known format for storing data), but it is not the same.
  2. Pandas default settings are not compatible with the TSV format.

In the story, I briefly discuss the source of confusion and present the best way to handle the TSV format using the Pandas library.

TSV format

TSV [1] is a simple file format similar to CSV. However, there are several important differences:

  1. It uses tabs to separate the fields.
  2. It does not allow some characters, that is, line feed (\n), tabs (\t) and carriage (\r) returns within fields.
  3. There are no quotations of the fields nor escapes of special characters [2] (at least for the original format).

Point 2 is problematic when dealing with text fields, as they might contain the forbidden characters. The suggested way to deal with the forbidden characters is to replace them with arbitrary texts, like <NEWLINE>, <TABULAR>, and <RETURN>. Pandas will not do it for us. Instead, Pandas will quote the fields containing forbidden characters

Loading TSV with Pandas

To load a TSV file with the Pandas library, use the read_csv method with the sep argument set to \t. However, this is not enough to properly load a TSV because Pandas strips the quotation character by default. Let’s see the following case of a TSV file:

field
"quotted text"
text ends with a quote"
any text

To read the TSV, we will use the following code:

import pandas as pd


df = pd.read_csv("sample_quotes.tsv", sep="\t")
print(df)

The output:

                     field
0 quotted text
1 text ends with a quote"
2 any text

Can you spot what is wrong? Quotation marks should surround the first field. Pandas remove the quotation marks as it treats them as special characters to escape the whole field. To ignore this behavior, we should specify the quoting argument to csv.QUOTE_NONE, a following:

import csv
import pandas as pd


df = pd.read_csv("sample_quotes.tsv", sep="\t", quoting=csv.QUOTE_NONE)
print(df)

Now the output looks like this:

                     field
0 "quotted text"
1 text ends with a quote"
2 any text

This might look like a minor issue, but it can lead to serious errors in data. Let’s consider the following TSV file:

field
"quotted text
text ends with a quote
any text"

We have three text pieces. Two of them contain dangling quotes. The first has an opening quotation mark, and the last has a closing quotation mark. We can imagine such a case when we have a quoted text that was split into smaller chunks. When we use the default way to load this TSV file, we will get the following:

                                            field
0 quotted text\ntext ends with a quote\nany text

That is a single field instead of three. Pandas interpreted the quotation marks as an escaped field containing new lines and characters.

This is why it is important to use the quoting=csv.QUOTE_NONE explicitly when loading TSV files.

Writing TSV with Pandas

The first intuition to write a TSV with Pandas is to use the to_csv method with a separator set to \t. Unfortunately, this is the wrong approach because, with its default parameters, Pandas will not properly handle the TSV format. There are two main problems:

  1. Pandas does not remove the illegal characters (\n, \t, and \r) from the fields.
  2. If the field contains a new line or quotation marks, Pandas escapes the whole field or the illegal characters.
  3. Pandas saves the header and index column by default.

The second and third issues can be easily solved by setting the relevant parameters quotting=csv.QUOTE_NONE and index=False, header=False.

The first issue is one we have to deal with on our own. There are several ways to do it. One is to apply a function that will modify the text fields by replacing the illegal characters. Here is a sample implementation of such a method:

def tsv_friendly(field: Any) -> Any:
if isinstance(field, str):
for k, v in {"\n": "<BR/>", "\t": "<TAB/>", "\r": "<RETURN/>"}.items():
field = field.replace(k, v)
return field

Then, to save a data frame to a TSV format, do the following:

text = "First line\nSecond Line\nLine with \t tab."
df = pd.DataFrame([["a", text], ["b", None], ["c", 1.0], ["d", 1]])

df.map(tsv_friendly).to_csv("output.tsv", sep="\t", quoting=csv.QUOTE_NONE,
index=False, header=False)

The TSV file will contain the following:

a First line<BR/>Second Line<BR/>Line with <TAB/> tab.
b
c 1.0
d 1

Using the index=False, header=False I also removed the headers and the index column, which are saved by default.

This almost fixes the problem of saving a data frame into a valid TSV file. However, there is still a small catch. This will not work for a data frame with one column and an empty row. Let’s consider the following example:

df = pd.DataFrame([None, 1.0])

df.map(tsv_friendly).to_csv("output_fail.tsv", sep="\t", quoting=csv.QUOTE_NONE,
index=False, header=False)

This will end up with the following error:

_csv.Error: single empty field record must be quoted

Why bother about TSV?

So, why bother using the format if the data cannot always be properly saved as a TSV? There is no additional benefit if you only load the data from Python. However, having data in a valid TSV file opens an opportunity to manipulate the data using bash or another shell command.

Here are some examples of how to use it:

List of records

wc -l file.tsv

It works because, in TSV, each line represents a single record. In CSV, multiple lines can represent the same record, as the format allows newline characters.

Slice a column

cut -f 2 file.tsv

It works because, in TSV, each tab represents a field separator. There are no escaped separators like in CSV. Using a cut to split the CSV by comma might produce invalid results, as some commas might be escaped. Cut splits the rows by \t by default.

Calculate the frequency of values

cut -f 2 file.tsv | sort | uniq -c

It will output something like this:

     96 Cat
48 Coyot
4 Dingo
181 Dog
73 Fox
4 Hyena
1439 None
4 Ocelot
130 Pig
8 Racoon
37 Serval
9 Wolve

Summary

TSV is a handy format for storing tabular data. It forces some restrictions on storing the data (some characters are illegal — \n, \t, and \r). At the same time, the restrictions make it easier to process the data using shell commands like wc, cut, grep, and so on. If you decide to use it with the Pandas library, make sure to remember how to handle the format properly.

References

  1. https://en.wikipedia.org/wiki/Tab-separated_values
  2. https://www.iana.org/assignments/media-types/text/tab-separated-values

TSV in Pandas: A How-To Guide was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.



from Datascience in Towards Data Science on Medium https://ift.tt/WBZPRCz
via IFTTT

También Podría Gustarte