Dataset Base Class

We implement various base classes for working with graph datasets. These provide abstractions for loading, caching and accessing collections of graphs.

Available classes

polygraph.datasets.base.AbstractDataset

Bases: ABC

Abstract base class defining the dataset interface.

This class defines the core functionality that all graph datasets must implement. It provides methods for accessing graphs and converting between formats.

__getitem__(idx) abstractmethod

Gets a graph from the dataset by index.

Parameters:
  • idx (Union[int, List[int], slice]) –

    Integer index, list of indices, or slice selecting the graph(s) to retrieve

Returns:
  • Union[Data, List[Data]]

    A single graph as a PyTorch Geometric Data object, or a list of Data objects when several graphs are selected
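
The indexing contract implied by the type annotations can be sketched without the library. ToyDataset below is a hypothetical stand-in, not part of polygraph, and it stores plain strings instead of Data objects. It only illustrates how an integer, a list of integers, and a slice might map to a single item or a list of items.

```python
from typing import Any, List, Union


class ToyDataset:
    """Hypothetical stand-in that mirrors only the indexing contract."""

    def __init__(self, graphs: List[Any]) -> None:
        self._graphs = list(graphs)

    def __len__(self) -> int:
        return len(self._graphs)

    def __getitem__(self, idx: Union[int, List[int], slice]) -> Union[Any, List[Any]]:
        if isinstance(idx, int):
            # A single integer selects exactly one item.
            return self._graphs[idx]
        if isinstance(idx, slice):
            # A slice selects a contiguous (or strided) list of items.
            return self._graphs[idx]
        # Anything else is treated as an explicit list of integer indices.
        return [self._graphs[i] for i in idx]


toy = ToyDataset(["g0", "g1", "g2", "g3"])
single = toy[1]
by_list = toy[[0, 3]]
by_slice = toy[1:3]
```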

__len__() abstractmethod

Gets the total number of graphs in the dataset.

Returns:
  • int

    Number of graphs

to_nx()

Creates a NetworkXView of this dataset that returns NetworkX graphs.

Returns:
  • NetworkXView

    NetworkX view wrapper around this dataset

is_valid(graph)

Checks if a graph is structurally valid in the context of this dataset.

Implementing this method is optional. When it is available, VUN (valid, unique, novel) metrics can use it.

Parameters:
  • graph (Graph) –

    NetworkX graph to validate

Returns:
  • bool

    True if the graph is valid for this dataset, False otherwise

polygraph.datasets.base.GraphDataset

Bases: AbstractDataset

Basic dataset using a GraphStorage object for holding graphs.

This class provides functionality for accessing and sampling from a collection of graphs stored in memory or on disk via a GraphStorage object.

Parameters:
  • data_store (GraphStorage) –

    GraphStorage object containing the dataset

dump_data(path)

Dumps the data store to a file.

This file may be used to load the data store later on. In particular, a link to the file may be used in a URLGraphDataset.

Example

from polygraph.datasets import GraphDataset, GraphStorage
import networkx as nx

# Build an in-memory dataset of 100 random graphs with 64 nodes each
graphs = [nx.erdos_renyi_graph(64, 0.1) for _ in range(100)]
ds = GraphDataset(GraphStorage.from_nx_graphs(graphs))
ds.dump_data("/tmp/my_dataset.pt")

# Reload the dumped file, memory-mapping it instead of reading it fully
ds2 = GraphDataset.load_data("/tmp/my_dataset.pt", memmap=True)
assert len(ds2) == 100
assert ds2.to_nx()[0].number_of_nodes() == 64

Parameters:
  • path (str) –

    Path to dump the data store to, preferably with a .pt extension.

load_data(path, memmap=False) staticmethod

Loads a data store from a file.

Parameters:
  • path (str) –

    Path to load the data store from

  • memmap (bool, default: False ) –

    Whether to memory-map the cached data. Useful for large datasets that do not fit into memory.
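
To illustrate the general idea behind memory-mapping, the sketch below uses Python's standard mmap module on a small binary file. It is not how polygraph stores or loads graphs. The file layout (1000 little-endian 64-bit integers) and the file name are assumptions made purely for the example. The point is that a single value can be read from the mapped file without loading the whole file into memory first.

```python
import mmap
import os
import struct
import tempfile

# Write 1000 little-endian int64 values to a temporary file (hypothetical layout).
values = list(range(1000))
path = os.path.join(tempfile.mkdtemp(), "sizes.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000q", *values))

# Map the file read-only and decode only the 8 bytes of entry 500.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        offset = 500 * 8
        (value_500,) = struct.unpack("<q", mm[offset:offset + 8])
```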

sample_graph_size(n_samples=None)

Draws a random sample of graph sizes from the empirical distribution of this dataset.

This is useful for generative models that are conditioned on graph size, e.g. DiGress.

Parameters:
  • n_samples (Optional[int], default: None ) –

    Number of samples to draw.

Returns:
  • List[int]

    List of graph sizes, drawn from the empirical distribution with replacement.
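
Sampling from an empirical distribution with replacement can be sketched in a few lines. The helper sample_sizes and its seed argument are hypothetical and not part of the library. The sketch also requires an explicit sample count, because the behavior for n_samples=None is not described here.

```python
import random
from typing import List, Optional


def sample_sizes(sizes: List[int], n_samples: int, seed: Optional[int] = None) -> List[int]:
    """Draw n_samples graph sizes uniformly from the observed sizes, with replacement."""
    rng = random.Random(seed)
    # Each observed size is equally likely per draw, so sizes that occur
    # more often in the dataset are sampled proportionally more often.
    return rng.choices(sizes, k=n_samples)


observed_sizes = [12, 12, 20, 64]  # node counts of four hypothetical graphs
drawn = sample_sizes(observed_sizes, 10, seed=0)
```

A size-conditioned generative model could then be asked to produce one graph per entry of drawn.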

summary(precision=2)

Prints a summary of the dataset statistics.

Parameters:
  • precision (int, default: 2 ) –

    Number of decimal places to display

polygraph.datasets.base.URLGraphDataset

Bases: GraphDataset

Dataset that downloads a single split from a URL.

This class handles downloading graph data from a URL and caching it locally.

Parameters:
  • url (str) –

    URL to download the data from

  • memmap (bool, default: False ) –

    Whether to memory-map the cached data. Useful for large datasets that do not fit into memory.

polygraph.datasets.base.SplitGraphDataset

Bases: GraphDataset

Abstract base class for downloading and caching graph data with multiple splits.

This class handles downloading graph data from a URL and caching it locally. Subclasses must implement methods to specify the data source.

Parameters:
  • split (str) –

    Dataset split to load (e.g. 'train', 'test')

  • memmap (bool, default: False ) –

    Whether to memory-map the cached data. Useful for large datasets that do not fit into memory.

url_for_split(split) abstractmethod

Gets the URL to download data for a specific split.

Parameters:
  • split (str) –

    Dataset split (e.g. 'train', 'test')

Returns:
  • str

    URL where the data can be downloaded

hash_for_split(split) abstractmethod

Gets the expected hash for a specific split's data.

This hash is used to validate downloaded data.

Parameters:
  • split (str) –

    Dataset split (e.g. 'train', 'test')

Returns:
  • Optional[str]

    Hash string for validating the split's data
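
How a downloaded file might be checked against such a hash can be sketched as follows. The helper file_matches_hash is hypothetical. The use of SHA-256 and the choice to skip validation when the expected hash is None are both assumptions, since the hash algorithm and the meaning of a None return value are not specified here.

```python
import hashlib
import os
import tempfile
from typing import Optional


def file_matches_hash(path: str, expected: Optional[str]) -> bool:
    """Compare the SHA-256 hex digest of a file with an expected value."""
    if expected is None:
        # Assumption: no expected hash means there is nothing to validate.
        return True
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files are never fully in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


payload = b"graph bytes"
demo_path = os.path.join(tempfile.mkdtemp(), "train.pt")
with open(demo_path, "wb") as f:
    f.write(payload)

expected_hash = hashlib.sha256(payload).hexdigest()
ok = file_matches_hash(demo_path, expected_hash)
bad = file_matches_hash(demo_path, "0" * 64)
```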

polygraph.datasets.base.ProceduralGraphDataset

Bases: GraphDataset

Dataset that generates graphs procedurally.

This class handles caching of procedurally generated graph data. Subclasses must implement the graph generation logic.

Parameters:
  • split (str) –

    Dataset split to generate

  • config_hash (str) –

    Hash identifying the generation configuration

  • memmap (bool, default: False ) –

    Whether to memory-map the cached data
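
One way a subclass might derive a config_hash is to hash a canonical serialization of its generation parameters, so that the same configuration always maps to the same cache entry. The helper make_config_hash, the parameter names, and the use of JSON with SHA-256 are assumptions for illustration. The library may compute this value differently.

```python
import hashlib
import json
from typing import Any, Dict


def make_config_hash(config: Dict[str, Any]) -> str:
    """Return a stable hex digest for a JSON-serializable configuration."""
    # Sorting keys makes the serialization independent of insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical generation parameters for a procedural dataset.
hash_a = make_config_hash({"num_graphs": 128, "num_nodes": 64, "seed": 0})
hash_b = make_config_hash({"seed": 0, "num_nodes": 64, "num_graphs": 128})
hash_c = make_config_hash({"num_graphs": 128, "num_nodes": 64, "seed": 1})
```

Here hash_a and hash_b are equal because only the key order differs, while changing the seed yields a different value in hash_c.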

generate_data() abstractmethod

Generates the graph data for this dataset.

Returns: