Retrieval of embeddings
After training RDF2Vec embeddings with the gpuRDF2vec package, you can retrieve the vector
representations of all entities in the knowledge graph. Like the GPU-based training process,
retrieval is optimized for performance: it builds on DLPack to extract the vectors directly
from GPU memory, which lets you handle large-scale knowledge graphs efficiently.
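To illustrate what zero-copy exchange through DLPack means, here is a minimal CPU-only sketch that uses NumPy as a stand-in. It does not show the package's internal code; the array contents and variable names are assumptions chosen for illustration. The same protocol is what lets GPU libraries hand over a buffer without copying it.

```python
import numpy as np

# Stand-in for an embedding matrix: 3 entities, 4 dimensions each
# (illustrative values, not produced by gpuRDF2vec).
source = np.arange(12, dtype=np.float32).reshape(3, 4)

# np.from_dlpack accepts any object that implements __dlpack__ and
# returns an array viewing the same buffer instead of copying it.
view = np.from_dlpack(source)

# Both arrays refer to the same memory.
shares = np.shares_memory(source, view)

# A write through the source is therefore visible through the view.
source[0, 0] = 99.0
```

Because no copy is made, the cost of the handover does not grow with the size of the embedding matrix, which is the property that matters for large graphs.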
The following example trains a model and then retrieves the embeddings:
from rdf2vecgpu import GPU_RDF2Vec, RDF2VecConfig
# Initialize the GPU_RDF2Vec pipeline
config = RDF2VecConfig(
    walk_strategy="random",
    walk_depth=4,
    walk_number=100,
    embedding_model="skipgram",
    epochs=5,
    batch_size=None,
    vector_size=100,
    window_size=5,
    min_count=1,
    learning_rate=0.01,
    negative_samples=5,
    random_state=42,
    reproducible=False,
    multi_gpu=False,
    generate_artifact=False,
    cpu_count=20,
)
gpu_rdf2vec_model = GPU_RDF2Vec(config=config)
# Read the knowledge graph
edge_data = gpu_rdf2vec_model.read_data("data/wikidata5m/wikidata5m_kg.parquet")
# Train the RDF2Vec embeddings
gpu_rdf2vec_model.fit(edge_df=edge_data, walk_vertices=None)
# Retrieve the embeddings for all entities
embeddings = gpu_rdf2vec_model.transform()
The transform method returns a cuDF DataFrame keyed by entity URI, which also contains the
internal integer ID and the embedding vector for each entity. If you set generate_artifact=True
in the configuration, the embeddings are also saved to disk as a Parquet file in the specified
output directory.
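A common next step is to look up the nearest neighbours of an entity. The sketch below assumes the URIs and vectors have already been copied into a Python list and a NumPy array. The example URIs, the vectors, and the helper most_similar are hypothetical and not part of the package.

```python
import numpy as np

# Hypothetical stand-in for the retrieved embeddings: URIs and values are
# illustrative only and do not come from gpuRDF2vec.
entities = [
    "http://example.org/A",
    "http://example.org/B",
    "http://example.org/C",
]
vectors = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
    ],
    dtype=np.float32,
)


def most_similar(query_uri, entities, vectors, top_k=1):
    """Return the top_k entity URIs closest to query_uri by cosine similarity."""
    index = entities.index(query_uri)
    # Normalise each row so that a dot product equals the cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ unit[index]
    # Exclude the query entity itself from the ranking.
    scores[index] = -np.inf
    best = np.argsort(-scores)[:top_k]
    return [entities[i] for i in best]


nearest = most_similar("http://example.org/A", entities, vectors)
```

For very large graphs, it may be preferable to keep this computation on the GPU rather than copying all vectors to the host first.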