Converting Pandas Dataframes to Pydantic Models

2024-01-17

Overview

In Python, Pandas Dataframes are effectively the standard for tabular data used in data analysis. However, data in a table format can’t be easily represented as objects, especially if not every object corresponds with a single row. This makes the existing object and JSON serialization with Pandas extremely limiting.
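
To make the limitation concrete, here is a small sketch using two rows from the example book data in this post. It relies only on pandas' built-in `to_dict(orient="records")`, which produces one flat dictionary per row, so an author with two books appears as two unrelated records rather than one object holding a list of books.

```python
import pandas as pd

# Two rows taken from the example book data used in this post
df = pd.DataFrame(
    {
        "BookID": [1, 2],
        "Title": [
            "Harry Potter and the Philosopher's Stone",
            "Harry Potter and the Chamber of Secrets",
        ],
        "AuthorName": ["J.K. Rowling", "J.K. Rowling"],
    }
)

# Built-in serialization: one flat dict per row, with the author repeated
records = df.to_dict(orient="records")
print(records)
```

Nothing in these records says that both books belong to the same author object; that grouping has to be built separately.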

This is where Pydantic comes in. Pydantic is a library for data validation and serialization that treats all data as objects. Each object's fields are type-annotated through custom Pydantic Models, which gives the data an explicit object structure.
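
As a brief illustration of what that looks like in practice, the sketch below declares two nested models and validates a plain dictionary against them. It assumes Pydantic v2 (the version that provides `model_validate` and `model_dump`); the `validation_failed` flag is only there for demonstration.

```python
from pydantic import BaseModel, ValidationError


class Book(BaseModel):
    BookID: int
    Title: str


class Author(BaseModel):
    AuthorName: str
    BookList: list[Book]


# Nested dictionaries are validated and converted into typed objects
author = Author.model_validate(
    {
        "AuthorName": "George Orwell",
        "BookList": [{"BookID": "3", "Title": "1984"}],  # "3" is coerced to the int 3
    }
)
print(author.BookList[0].BookID)

# Invalid data raises a ValidationError instead of passing through silently
try:
    Book.model_validate({"BookID": "not a number", "Title": "1984"})
    validation_failed = False
except ValidationError:
    validation_failed = True
```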

This is why I created pandas-to-pydantic, an easy-to-use Python library for converting Pandas Dataframes into Pydantic Models. It lets you turn tabular data into hierarchical data.

Basic Example

Example Book Data

BookID  Title                                     AuthorName       Genre              PublishedYear
1       Harry Potter and the Philosopher’s Stone  J.K. Rowling     Fantasy            1997
2       Harry Potter and the Chamber of Secrets   J.K. Rowling     Fantasy            1998
3       1984                                      George Orwell    Dystopian Fiction  1949
4       Animal Farm                               George Orwell    Political Satire   1945
5       Pride and Prejudice                       Jane Austen      Romance            1813
7       Murder on the Orient Express              Agatha Christie  Mystery            1934
9       Adventures of Huckleberry Finn            Mark Twain       Adventure          1884
10      The Adventures of Tom Sawyer              Mark Twain       Adventure          1876
11      The Hobbit                                J.R.R. Tolkien   Fantasy            1937
12      The Lord of the Rings                     J.R.R. Tolkien   Fantasy            1954

import pandas as pd
from pydantic import BaseModel
from pandas_to_pydantic import dataframe_to_pydantic

# Declare pydantic models
class Book(BaseModel):
    BookID: int
    Title: str
    PublishedYear: int

class Author(BaseModel):
    AuthorName: str
    BookList: list[Book]

class Genre(BaseModel):
    Genre: str
    AuthorList: list[Author]

# Set FILE_PATH to the path of your CSV file
book_data = pd.read_csv(FILE_PATH)

# Convert pandas dataframe to a pydantic root model and access data as a list of dict
dataframe_to_pydantic(
    data=book_data,
    model=Genre,
    id_column_map={"Genre": "Genre", "AuthorList": "AuthorName"},
).model_dump()

Returns (output shortened):

[{'Genre': 'Fantasy',
  'AuthorList': [{'AuthorName': 'J.K. Rowling',
    'BookList': [{'BookID': 1,
      'Title': "Harry Potter and the Philosopher's Stone",
      'PublishedYear': 1997},
     {'BookID': 2,
      'Title': 'Harry Potter and the Chamber of Secrets',
      'PublishedYear': 1998}]},
   {'AuthorName': 'J.R.R. Tolkien',
    'BookList': [{'BookID': 11, 'Title': 'The Hobbit', 'PublishedYear': 1937},
     {'BookID': 12,
      'Title': 'The Lord of the Rings',
      'PublishedYear': 1954}]}]},
 {'Genre': 'Dystopian Fiction',
  'AuthorList': [{'AuthorName': 'George Orwell',
    'BookList': [{'BookID': 3, 'Title': '1984', 'PublishedYear': 1949}]}]},
...]

In this example, we want to convert the Dataframe into a hierarchical data structure, Genre -> Author -> Book.

The dataframe_to_pydantic() function accepts a Pandas Dataframe, a Pydantic Model, and a dictionary that maps field names to unique id columns. These inputs allow you to quickly transform the same Dataframe into multiple different structures.

In the backend, the Pydantic Model is deconstructed into different types of fields. Any model that has a field with a child model will need an associated unique id column. As the Dataframe is processed, it is sliced using the id
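
To give a feel for what slicing a Dataframe by id columns can look like, here is a hand-written sketch that builds the Genre -> Author -> Book structure with nested `groupby` calls. This is not the library's actual implementation; `nest_by_genre` is a hypothetical helper written only for this illustration, and the sample rows are a subset of the book data above.

```python
import pandas as pd
from pydantic import BaseModel


class Book(BaseModel):
    BookID: int
    Title: str
    PublishedYear: int


class Author(BaseModel):
    AuthorName: str
    BookList: list[Book]


class Genre(BaseModel):
    Genre: str
    AuthorList: list[Author]


df = pd.DataFrame(
    {
        "BookID": [1, 2, 11, 3],
        "Title": [
            "Harry Potter and the Philosopher's Stone",
            "Harry Potter and the Chamber of Secrets",
            "The Hobbit",
            "1984",
        ],
        "AuthorName": ["J.K. Rowling", "J.K. Rowling", "J.R.R. Tolkien", "George Orwell"],
        "Genre": ["Fantasy", "Fantasy", "Fantasy", "Dystopian Fiction"],
        "PublishedYear": [1997, 1998, 1937, 1949],
    }
)


def nest_by_genre(frame: pd.DataFrame) -> list[Genre]:
    """Hypothetical helper: slice the frame by each id column in turn."""
    genres = []
    # sort=False keeps groups in order of first appearance
    for genre_name, genre_rows in frame.groupby("Genre", sort=False):
        authors = []
        for author_name, author_rows in genre_rows.groupby("AuthorName", sort=False):
            # Every remaining row in this slice is one book by this author
            books = [
                Book(BookID=int(r.BookID), Title=r.Title, PublishedYear=int(r.PublishedYear))
                for r in author_rows.itertuples(index=False)
            ]
            authors.append(Author(AuthorName=author_name, BookList=books))
        genres.append(Genre(Genre=genre_name, AuthorList=authors))
    return genres


genres = nest_by_genre(df)
print([g.model_dump() for g in genres])
```

The hard-coded column names are what a general-purpose converter has to discover from the model and the id mapping instead.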

Advanced Example

Example Library Data

import pandas as pd
from pydantic import BaseModel
from pandas_to_pydantic import dataframe_to_pydantic

# Declare pydantic models
class LibraryDetail(BaseModel):
    LibraryName: str
    Location: str
    EstablishedYear: int
    BookCollectionSize: int

class Author(BaseModel):
    AuthorID: int
    AuthorName: str
    AuthorBirthdate: str

class Book(BaseModel):
    BookID: int
    Title: str
    Genre: str
    PublishedYear: int

class Library(BaseModel):
    LibraryID: int
    Detail: LibraryDetail
    AuthorList: list[Author]
    BookList: list[Book]

# Input data is a pandas dataframe; set FILE_PATH to the path of your CSV file
data = pd.read_csv(FILE_PATH)

# Convert pandas dataframe to a pydantic root model
library_list_root = dataframe_to_pydantic(
    data,
    Library,
    {
        "Library": "LibraryID",
        "BookList": "BookID",
        "AuthorList": "AuthorID",
    },
)
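
The comment above describes the result as a Pydantic root model. The sketch below does not call pandas-to-pydantic; it builds a Pydantic v2 `RootModel` by hand, using trimmed-down models and a made-up library record, to show how such an object could be inspected. That the returned object behaves like a `RootModel` wrapping a list is an assumption based on that comment and on `.model_dump()` returning a list in the basic example. `LibraryList` is a hypothetical name.

```python
from pydantic import BaseModel, RootModel


class Book(BaseModel):
    BookID: int
    Title: str


class Library(BaseModel):
    LibraryID: int
    BookList: list[Book]


# Hypothetical stand-in for the object returned by dataframe_to_pydantic()
class LibraryList(RootModel[list[Library]]):
    pass


library_list_root = LibraryList.model_validate(
    [{"LibraryID": 1, "BookList": [{"BookID": 3, "Title": "1984"}]}]  # made-up record
)

# .root exposes the underlying list of Library models
first_library = library_list_root.root[0]
print(first_library.BookList[0].Title)

# model_dump() returns plain Python lists and dicts
as_dicts = library_list_root.model_dump()

# model_dump_json() returns the same data as a JSON string
as_json = library_list_root.model_dump_json()
```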

© 2024 Michael Li