Data Loading¶
The knowledge graph data should be prepared in a file that is compatible with the package’s data
load functionality. In order to load the graph, rdf2vecgpu uses two different engines with
different implications:
cuDF engine: utilizes GPU memory for faster data processing. Suitable for large graphs that fit into GPU memory.
rdflib engine: provides the ability to load graph file formats that are not directly supported by cuDF. However, it uses CPU memory and may be slower for large datasets.
As outlined in the Getting Started guide, the engine for loading is selected based on the provided file format. Below, we provide an overview of the supported file formats for each engine.
Supported file formats¶
In the following, you find an overview of the different supported file formats for both engines:
File Format |
cuDF engine |
rdflib engine |
|---|---|---|
N-Triples (.nt) |
Yes |
Yes |
N-Quads (.nq) |
Yes |
Yes |
Turtle (.ttl) |
No |
Yes |
RDF/XML (.rdf) |
No |
Yes |
JSON-LD (.jsonld) |
No |
Yes |
Notation-3 (.n3) |
No |
Yes |
Trig (.trig) |
No |
Yes |
CSV (.csv) |
Yes |
No |
Parquet (.parquet)| Yes |
No |
|
ORC (.orc) |
Yes |
No |
For optimal performance, it is recommended to use the cuDF engine with supported file formats like Parquet, ORC, N-Triples, or CSV. If your dataset is in a different format, consider converting it to one of these formats for better load efficiency. The best performance is typically achieved with Parquet files due to their columnar storage format, which is well-suited for GPU processing.
Code example¶
Here is a code snippet demonstrating how to load a knowledge graph using the cuDF engine with a Parquet file:
from rdf2vecgpu import GPU_RDF2Vec, RDF2VecConfig
config = RDF2VecConfig(
walk_strategy="random",
walk_depth=4,
walk_number=100,
embedding_model="skipgram",
epochs=5,
multi_gpu=False,
)
gpu_rdf2vec_model = GPU_RDF2Vec(config=config)
# Path to the triple dataset
path = "data/wikidata5m/wikidata5m_kg.parquet"
# Read data and receive edge data
edge_data = gpu_rdf2vec_model.read_data(path)
Alternatively, when using a file format which is not directly supported by cuDF, this is automatically detected and the rdflib engine is used instead:
edge_data = gpu_rdf2vec_model.read_data("data/wikidata5m/wikidata5m_kg.ttl")
This allows you to seamlessly load different file formats without changing the code logic.
Column mapping and reader keyword arguments¶
read_data exposes two optional arguments to adapt the reader to non-default schemas:
col_map: a mapping from your source column names to the expectedsubject,predicate,object(and optionallyweights) names.read_kwargs: additional keyword arguments forwarded to the underlying cuDF/Dask reader (for exampledelimiter,columns,compression).
edge_data = gpu_rdf2vec_model.read_data(
"data/my_graph.csv",
col_map={"src": "subject", "rel": "predicate", "dst": "object"},
read_kwargs={"delimiter": "\t"},
)
Weighted walks¶
When walk_weighted=True is set in the configuration, cuGraph’s biased_random_walks is used
for walk generation. The input data must contain a weights column (cuGraph’s standard column
name). You can use col_map to rename an existing edge-weight column accordingly:
config = RDF2VecConfig(walk_weighted=True, walk_strategy="random")
gpu_rdf2vec_model = GPU_RDF2Vec(config=config)
edge_data = gpu_rdf2vec_model.read_data(
"data/spatial_graph.parquet",
col_map={"distance": "weights"},
)
Literal handling¶
Knowledge graphs often contain edges whose object is a literal value (for example numeric
attributes). RDF2VecConfig exposes three parameters to handle such edges:
literal_predicates: a list of predicate strings that identify literal edges.literal_strategy:"drop"removes literal edges from the graph (the pyRDF2Vec default), while"bin"discretizes the object values into bin tokens so the edge is preserved.literal_n_binsandliteral_bin_strategy("quantile"or"uniform") control the binning behavior whenliteral_strategy="bin".
config = RDF2VecConfig(
literal_predicates=["<http://example.org/hasHeight>", "<http://example.org/hasAge>"],
literal_strategy="bin",
literal_n_bins=5,
literal_bin_strategy="quantile",
)
Considerations for multi-GPU and distributed setups¶
Depending on the value of the multi_gpu flag in the configuration, read_data returns the
data either as a single cuDF dataframe (for single-GPU training) or as a Dask-cuDF dataframe
backed by a list of cuDF partitions (for multi-GPU training).
Depending on the framework used for the graph load, a repartition of the Dask dataframe may be necessary to achieve the best performance for the following steps, which are influenced by the number of GPUs available as well as the number of nodes within the cluster.