June 11, 2024


How to Read and Write Parquet Files with Python

Apache Parquet files are a popular columnar storage format used by data scientists and anyone working with the Hadoop ecosystem. The format was designed to be very efficient in terms of compression and encoding. Check out the Parquet documentation if you want to know all the details about how Parquet files work.

You can read and write Parquet files with Python using the pyarrow package.

Let’s learn how that works now!

Installing pyarrow

The first step is to make sure you have everything you need. In addition to the Python programming language, you will also need the pyarrow and pandas packages. You will use pandas because it also stores data in columns and works well with Parquet files.

You can use pip to install both of these packages. Open up your terminal and run the following command:

python -m pip install pyarrow pandas

If you use Anaconda, you’ll want to install pyarrow using this command instead:

conda install -c conda-forge pyarrow

Anaconda should already include pandas, but if not, you can use the same command above by replacing pyarrow with pandas.

Now that you have pyarrow and pandas installed, you can use them to read and write Parquet files!

Writing Parquet Files with Python

Writing Parquet files with Python is pretty straightforward. The code to turn a pandas DataFrame into a Parquet file is about ten lines.

Open up your favorite Python IDE or text editor and create a new file. You can name it something like parquet_file_writer.py or use some other descriptive name. Then enter the following code:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def write_parquet(df: pd.DataFrame, filename: str) -> None:
    table = pa.Table.from_pandas(df)
    pq.write_table(table, filename)


if __name__ == "__main__":
    data = {"Languages": ["Python", "Ruby", "C++"],
            "Users": [10000, 5000, 8000],
            "Dynamic": [True, True, False],
            }
    df = pd.DataFrame(data=data, index=list(range(1, 4)))
    write_parquet(df, "languages.parquet")

For this example, you have three imports:

  • One for pandas, so you can create a DataFrame
  • One for pyarrow, to create a special pyarrow.Table object
  • One for pyarrow.parquet to transform the table object into a Parquet file

The write_parquet() function takes in a pandas DataFrame and the file name or path to save the Parquet file to. You transform the DataFrame into a pyarrow Table object, then write that out as a Parquet file to disk with the write_table() function.
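
By the way, if you do not need the intermediate Table object, pandas has a to_parquet() shortcut that delegates to pyarrow (its default Parquet engine). Here is a minimal sketch that should produce an equivalent file:

import pandas as pd

data = {"Languages": ["Python", "Ruby", "C++"],
        "Users": [10000, 5000, 8000],
        "Dynamic": [True, True, False],
        }
df = pd.DataFrame(data=data, index=list(range(1, 4)))

# pandas delegates the actual writing to pyarrow
df.to_parquet("languages.parquet")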

Now you are ready to read that file you just created!

Reading Parquet Files with Python

Reading the Parquet file you created earlier with Python is even easier. You’ll need about half as many lines of code!

You can put the following code into a new file called something like parquet_file_reader.py if you want to:

import pyarrow.parquet as pq

def read_parquet(filename: str) -> None:
    table = pq.read_table(filename)
    df = table.to_pandas()
    print(df)

if __name__ == "__main__":
    read_parquet("languages.parquet")

In this example, you read the Parquet file into a pyarrow Table format and then convert it to a pandas DataFrame using the Table’s to_pandas() method.

When you print out the contents of the DataFrame, you will see the following:

  Languages  Users  Dynamic
1    Python  10000     True
2      Ruby   5000     True
3       C++   8000    False

You can see from the output above that the DataFrame contains all the data you saved.
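
The read side has a pandas shortcut as well: pd.read_parquet() uses pyarrow under the hood by default and hands you back a DataFrame directly. A minimal sketch:

import pandas as pd

# pandas uses pyarrow as its default Parquet engine
df = pd.read_parquet("languages.parquet")
print(df)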

One of the strengths of the Parquet format is that you can read just parts of a file instead of the whole thing. For example, you can read in just some of the columns rather than the whole file!

Here’s an example of how that works:

import pyarrow.parquet as pq

def read_columns(filename: str, columns: list[str]) -> None:
    table = pq.read_table(filename, columns=columns)
    print(table)

if __name__ == "__main__":
    read_columns("languages.parquet", columns=["Languages", "Users"])

To read in just the “Languages” and “Users” columns from the Parquet file, you pass a list containing just those column names to read_table() via its columns parameter.

Here’s the output when you run this code:

pyarrow.Table
Languages: string
Users: int64
----
Languages: [["Python","Ruby","C++"]]
Users: [[10000,5000,8000]]

This outputs the pyarrow Table format, which differs slightly from a pandas DataFrame. It tells you information about the different columns; for example, Languages are strings, and Users are of type int64.
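
Besides selecting columns, read_table() also accepts a filters argument in recent pyarrow versions, so you can skip rows at read time. The sketch below assumes the languages.parquet file from earlier and keeps only the rows where Users is greater than 6000:

import pyarrow.parquet as pq

# Row-level filtering happens at read time; pyarrow can use the
# file's column statistics to skip entire row groups
table = pq.read_table(
    "languages.parquet",
    columns=["Languages", "Users"],
    filters=[("Users", ">", 6000)],
)
print(table)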

If you prefer to work only with pandas DataFrames, the pyarrow package allows that too. If the Parquet file was written from a pandas DataFrame, you can use read_pandas() instead of read_table(); it also reads the pandas-specific metadata, such as the index columns.

Here’s a code example:

import pyarrow.parquet as pq

def read_columns_pandas(filename: str, columns: list[str]) -> None:
    table = pq.read_pandas(filename, columns=columns)
    df = table.to_pandas()
    print(df)

if __name__ == "__main__":
    read_columns_pandas("languages.parquet", columns=["Languages", "Users"])

When you run this example, the output is a DataFrame that contains just the columns you asked for:

  Languages  Users
1    Python  10000
2      Ruby   5000
3       C++   8000

One advantage of using the read_pandas() and to_pandas() methods is that they will maintain any additional index column data in the DataFrame, while the pyarrow Table may not.
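
The flip side is that the index itself gets stored in the file as an extra column (you will see it show up as __index_level_0__ in the metadata section below). If you do not want that, from_pandas() accepts a preserve_index flag. A quick sketch, with an illustrative output filename:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"Languages": ["Python", "Ruby", "C++"]})

# preserve_index=False tells pyarrow not to store the DataFrame's
# index as an extra column in the Parquet file
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "languages_no_index.parquet")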

Reading Parquet File Metadata

You can also get the metadata from a Parquet file using Python. Getting the metadata can be useful when you need to inspect an unfamiliar Parquet file to see what type(s) of data it contains.

Here’s a small code snippet that will read the Parquet file’s metadata and schema:

import pyarrow.parquet as pq

def read_metadata(filename: str) -> None:
    parquet_file = pq.ParquetFile(filename)
    metadata = parquet_file.metadata
    print(metadata)
    print(f"Parquet file: {filename} Schema")
    print(parquet_file.schema)

if __name__ == "__main__":
    read_metadata("languages.parquet")

There are two ways to get the Parquet file’s metadata:

  • Use pq.ParquetFile to read the file and then access the metadata property
  • Use pq.read_metadata(filename) instead (a short sketch follows below)

The benefit of the former method is that you can also access the schema property of the ParquetFile object.
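
For completeness, here is what the second approach looks like. read_metadata() returns the same FileMetaData object without constructing a ParquetFile first:

import pyarrow.parquet as pq

# Reads only the file's footer, which is where Parquet keeps its metadata
metadata = pq.read_metadata("languages.parquet")
print(metadata)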

When you run the read_metadata() example, you will see this output:

<pyarrow._parquet.FileMetaData object at 0x000002312C1355D0>
  created_by: parquet-cpp-arrow version 15.0.2
  num_columns: 4
  num_rows: 3
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 2682
Parquet file: languages.parquet Schema
<pyarrow._parquet.ParquetSchema object at 0x000002312BBFDF00>
required group field_id=-1 schema {
  optional binary field_id=-1 Languages (String);
  optional int64 field_id=-1 Users;
  optional boolean field_id=-1 Dynamic;
  optional int64 field_id=-1 __index_level_0__;
}

Nice! You can read the output above to learn the number of rows and columns of data and the size of the data. The schema tells you what the field types are.
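
The metadata goes deeper than this summary, too. Each row group records per-column statistics such as minimum, maximum, and null counts, which is what makes the filters argument shown earlier efficient. A short sketch of drilling down, again assuming the languages.parquet file:

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("languages.parquet")

# Row group 0, column 1 is the "Users" column in this file
column_chunk = parquet_file.metadata.row_group(0).column(1)
print(column_chunk.path_in_schema)  # the column's name
print(column_chunk.statistics)      # min, max, null_count, and more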

Wrapping Up

Parquet files are becoming more popular in big data and data science-related fields. Python’s pyarrow package makes working with Parquet files easy. You should spend some time experimenting with the code in this tutorial and using it for some of your own Parquet files.

When you want to learn more, check out the Parquet documentation.
