Data Loading

Prepare the knowledge graph as a file in a format that the package’s data loader supports. rdf2vecgpu loads graphs with one of two engines, which differ in speed and memory use:

  • cuDF engine: utilizes GPU memory for faster data processing. Suitable for large graphs that fit into GPU memory.

  • rdflib engine: provides the ability to load graph file formats that are not directly supported by cuDF. However, it uses CPU memory and may be slower for large datasets.

As outlined in the Getting Started guide, the engine for loading is selected based on the provided file format. Below, we provide an overview of the supported file formats for each engine.

Supported file formats

The table below lists the file formats each engine supports:

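| File format        | cuDF engine | rdflib engine |
|--------------------|-------------|---------------|
| N-Triples (.nt)    | Yes         | Yes           |
| N-Quads (.nq)      | Yes         | Yes           |
| Turtle (.ttl)      | No          | Yes           |
| RDF/XML (.rdf)     | No          | Yes           |
| JSON-LD (.jsonld)  | No          | Yes           |
| Notation-3 (.n3)   | No          | Yes           |
| TriG (.trig)       | No          | Yes           |
| CSV (.csv)         | Yes         | No            |
| Parquet (.parquet) | Yes         | No            |
| ORC (.orc)         | Yes         | No            |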

For optimal performance, use the cuDF engine with a format it supports, such as Parquet, ORC, N-Triples, or CSV. If your dataset is in a different format, consider converting it to one of these formats so that it loads faster. Parquet files typically perform best because their columnar storage layout is well-suited to GPU processing.
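If the triples are already available in Python (for example after parsing them with another tool), one simple way to produce a cuDF-loadable file is to write them as CSV. The sketch below uses only the standard library. `write_triples_csv` is a hypothetical helper, not part of rdf2vecgpu, and the header names `subject`, `predicate`, `object` are an assumption based on the column names this guide mentions for column mapping.

```python
import csv
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def write_triples_csv(triples: Iterable[Triple], path: str) -> int:
    """Hypothetical helper: write (subject, predicate, object) rows to CSV.

    Returns the number of triples written.
    """
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Assumed header names; adjust them (or use col_map) if yours differ.
        writer.writerow(["subject", "predicate", "object"])
        for s, p, o in triples:
            writer.writerow([s, p, o])
            count += 1
    return count
```

The `csv` module quotes fields that contain commas or quotation marks, so literal objects survive a round trip through a standard CSV reader.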

Code example

Here is a code snippet demonstrating how to load a knowledge graph using the cuDF engine with a Parquet file:

from rdf2vecgpu import GPU_RDF2Vec, RDF2VecConfig

config = RDF2VecConfig(
    walk_strategy="random",
    walk_depth=4,
    walk_number=100,
    embedding_model="skipgram",
    epochs=5,
    multi_gpu=False,
)
gpu_rdf2vec_model = GPU_RDF2Vec(config=config)

# Path to the triple dataset
path = "data/wikidata5m/wikidata5m_kg.parquet"

# Read data and receive edge data
edge_data = gpu_rdf2vec_model.read_data(path)

Alternatively, when using a file format which is not directly supported by cuDF, this is automatically detected and the rdflib engine is used instead:

edge_data = gpu_rdf2vec_model.read_data("data/wikidata5m/wikidata5m_kg.ttl")

This allows you to seamlessly load different file formats without changing the code logic.
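The automatic choice of engine can be pictured as a lookup on the file extension. The following is an illustrative sketch only: `pick_engine` and the two extension sets are hypothetical names rather than rdf2vecgpu internals, and preferring cuDF when both engines support a format is an assumption.

```python
from pathlib import Path

# Extensions as listed under "Supported file formats" in this guide.
CUDF_EXTENSIONS = {".nt", ".nq", ".csv", ".parquet", ".orc"}
RDFLIB_EXTENSIONS = {".nt", ".nq", ".ttl", ".rdf", ".jsonld", ".n3", ".trig"}


def pick_engine(path: str) -> str:
    """Hypothetical helper: return "cudf" or "rdflib" for a file path."""
    suffix = Path(path).suffix.lower()
    if suffix in CUDF_EXTENSIONS:
        # Assumption: the GPU reader is preferred whenever it supports the format.
        return "cudf"
    if suffix in RDFLIB_EXTENSIONS:
        # Otherwise fall back to the CPU-based parser.
        return "rdflib"
    raise ValueError(f"Unsupported file format: {suffix!r}")
```

Under these assumptions, a `.parquet` path maps to the cuDF engine and a `.ttl` path to the rdflib engine, which matches the two `read_data` calls shown above.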

Column mapping and reader keyword arguments

read_data exposes two optional arguments to adapt the reader to non-default schemas:

  • col_map: a mapping from your source column names to the expected subject, predicate, object (and optionally weights) names.

  • read_kwargs: additional keyword arguments forwarded to the underlying cuDF/Dask reader (for example delimiter, columns, compression).

edge_data = gpu_rdf2vec_model.read_data(
    "data/my_graph.csv",
    col_map={"src": "subject", "rel": "predicate", "dst": "object"},
    read_kwargs={"delimiter": "\t"},
)
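Conceptually, `col_map` is a rename applied to the source columns before the triples are used. The sketch below illustrates that idea on plain column names. `apply_col_map` and `REQUIRED` are hypothetical, and the check for missing columns is an assumption about sensible behavior, not a description of what rdf2vecgpu does.

```python
REQUIRED = ("subject", "predicate", "object")


def apply_col_map(columns, col_map=None):
    """Hypothetical helper: rename source columns and verify the result.

    Columns that are not mentioned in col_map keep their original name.
    """
    col_map = col_map or {}
    renamed = [col_map.get(name, name) for name in columns]
    missing = [name for name in REQUIRED if name not in renamed]
    if missing:
        raise ValueError(f"Missing required columns after mapping: {missing}")
    return renamed
```

For the CSV above, `apply_col_map(["src", "rel", "dst"], {"src": "subject", "rel": "predicate", "dst": "object"})` yields the three expected names in order.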

Weighted walks

When walk_weighted=True is set in the configuration, cuGraph’s biased_random_walks is used for walk generation. The input data must contain a weights column (cuGraph’s standard column name). You can use col_map to rename an existing edge-weight column accordingly:

config = RDF2VecConfig(walk_weighted=True, walk_strategy="random")
gpu_rdf2vec_model = GPU_RDF2Vec(config=config)
edge_data = gpu_rdf2vec_model.read_data(
    "data/spatial_graph.parquet",
    col_map={"distance": "weights"},
)
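To build intuition for what a biased walk does, here is a small CPU-only sketch in plain Python. It is not cuGraph's implementation. It assumes that the probability of following an edge is proportional to its weight, and `biased_walk` is a hypothetical name used for illustration.

```python
import random
from collections import defaultdict


def biased_walk(edges, start, depth, seed=None):
    """Walk up to `depth` hops, picking each edge with probability
    proportional to its weight.

    `edges` is an iterable of (subject, predicate, object, weight) tuples.
    Weights must be non-negative, with a positive sum per source node.
    """
    rng = random.Random(seed)
    adjacency = defaultdict(list)
    for s, p, o, w in edges:
        adjacency[s].append((p, o, w))

    walk = [start]
    node = start
    for _ in range(depth):
        options = adjacency.get(node)
        if not options:
            break  # dead end: the node has no outgoing edges
        weights = [w for _, _, w in options]
        p, o, _ = rng.choices(options, weights=weights, k=1)[0]
        walk.extend([p, o])
        node = o
    return walk
```

Under this reading, a larger weight makes an edge more likely to be followed. If the weight column holds distances, as in the example above, you may want to check whether an inverted value better expresses the intended bias.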

Literal handling

Knowledge graphs often contain edges whose object is a literal value (for example numeric attributes). RDF2VecConfig exposes three parameters to handle such edges:

  • literal_predicates: a list of predicate strings that identify literal edges.

  • literal_strategy: "drop" removes literal edges from the graph (the pyRDF2Vec default), while "bin" discretizes the object values into bin tokens so the edge is preserved.

  • literal_n_bins and literal_bin_strategy ("quantile" or "uniform") control the binning behavior when literal_strategy="bin".

config = RDF2VecConfig(
    literal_predicates=["<http://example.org/hasHeight>", "<http://example.org/hasAge>"],
    literal_strategy="bin",
    literal_n_bins=5,
    literal_bin_strategy="quantile",
)
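The difference between the two binning strategies can be illustrated with a few lines of plain Python. This is a sketch of the general technique, not the library's code: the function names, the nearest-rank quantile rule, and the `<predicate>_bin_<i>` token format are all assumptions made for illustration.

```python
import bisect


def bin_edges(values, n_bins, strategy="quantile"):
    """Return the n_bins - 1 inner cut points for the chosen strategy."""
    ordered = sorted(values)
    if strategy == "uniform":
        # Equal-width bins between the minimum and maximum value.
        low, high = ordered[0], ordered[-1]
        width = (high - low) / n_bins
        return [low + width * i for i in range(1, n_bins)]
    if strategy == "quantile":
        # Nearest-rank cut points, so bins hold roughly equal numbers of values.
        return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]
    raise ValueError(f"Unknown strategy: {strategy!r}")


def to_bin_token(value, cuts, predicate):
    """Map a numeric literal to a hypothetical bin token."""
    index = bisect.bisect_right(cuts, value)
    return f"{predicate}_bin_{index}"
```

With skewed data such as `[1, 2, ..., 9, 1000]` and five bins, uniform binning places every value except the outlier in the first bin, whereas quantile binning spreads the values across all five bins. This is a common reason to prefer `"quantile"` for heavy-tailed attributes.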

Considerations for multi-GPU and distributed setups

Depending on the value of the multi_gpu flag in the configuration, read_data returns the data either as a single cuDF dataframe (for single-GPU training) or as a Dask-cuDF dataframe backed by a list of cuDF partitions (for multi-GPU training).

Depending on the framework used to load the graph, you may need to repartition the Dask dataframe to get the best performance in the subsequent steps. The appropriate number of partitions depends on the number of GPUs available and on the number of nodes in the cluster.
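As a starting point, one might derive a partition count from the cluster size. The helper below is a hypothetical rule of thumb, not a recommendation from the library: `target_partitions` and the default of two partitions per GPU are assumptions, and the best value should be found by measurement.

```python
def target_partitions(n_gpus_per_node: int, n_nodes: int = 1,
                      partitions_per_gpu: int = 2) -> int:
    """Heuristic (an assumption): a small multiple of the total GPU count."""
    if min(n_gpus_per_node, n_nodes, partitions_per_gpu) < 1:
        raise ValueError("all arguments must be >= 1")
    return n_gpus_per_node * n_nodes * partitions_per_gpu


# With a Dask dataframe returned by read_data (multi_gpu=True), the standard
# Dask call would then look like this:
#     edge_data = edge_data.repartition(npartitions=target_partitions(4, n_nodes=2))
```

`repartition(npartitions=...)` is the regular Dask dataframe method. Only the way the number is computed here is illustrative.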