Getting Started
To start using rdf2vecgpu, install the package as described in the Installation Guide.
After installation, follow the steps below to generate RDF2Vec embeddings
with GPU acceleration.
The framework follows abstractions similar to those of scikit-learn. The main class
to interact with is GPU_RDF2Vec, which provides methods for reading data,
fitting the model, and transforming the data into embeddings. All hyperparameters are bundled in an
RDF2VecConfig object; see the Configuration reference for the full list of parameters.
The typical workflow is to instantiate GPU_RDF2Vec with an RDF2VecConfig (or equivalent keyword
arguments), read a knowledge graph from a file, and then fit and transform. The fit_transform method
fits the Word2Vec model and returns the embeddings in one step; the two stages can also be run
separately via fit and transform.
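To illustrate the scikit-learn-style split between fit, transform, and fit_transform described above, here is a minimal, self-contained sketch. It does not use rdf2vecgpu: SketchConfig and SketchEmbedder are hypothetical toy classes, and the "training" step only assigns a deterministic vector per entity. The signatures shown are assumptions made for illustration and need not match those of GPU_RDF2Vec.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a config object (not RDF2VecConfig)."""
    vector_size: int = 3


class SketchEmbedder:
    """Hypothetical estimator illustrating the fit/transform split."""

    def __init__(self, config: SketchConfig):
        self.config = config
        self.vectors_ = None  # populated by fit()

    def fit(self, edges):
        # Toy "training": give every entity a deterministic vector.
        entities = sorted({e for s, _, o in edges for e in (s, o)})
        self.vectors_ = {
            ent: [float(i)] * self.config.vector_size
            for i, ent in enumerate(entities)
        }
        return self  # returning self allows chaining, as in scikit-learn

    def transform(self, entities):
        if self.vectors_ is None:
            raise RuntimeError("call fit() before transform()")
        return {ent: self.vectors_[ent] for ent in entities}

    def fit_transform(self, edges, entities):
        # Convenience wrapper: fit, then transform, in one call.
        return self.fit(edges).transform(entities)


edges = [("a", "knows", "b"), ("b", "knows", "c")]
two_step = SketchEmbedder(SketchConfig()).fit(edges).transform(["a", "c"])
one_step = SketchEmbedder(SketchConfig()).fit_transform(edges, ["a", "c"])
```

Both call styles yield the same result in this sketch; the separate calls are mainly useful when a fitted model should be reused across several transform calls.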
Basic usage
from rdf2vecgpu import GPU_RDF2Vec, RDF2VecConfig

# Bundle all hyperparameters in a config object
config = RDF2VecConfig(
    walk_strategy="random",
    walk_depth=4,
    walk_number=100,
    embedding_model="skipgram",
    epochs=5,
    batch_size=None,
    vector_size=100,
    window_size=5,
    min_count=1,
    learning_rate=0.01,
    negative_samples=5,
    random_state=42,
    reproducible=False,
    multi_gpu=False,
    generate_artifact=False,
    cpu_count=20,
)

# Instantiate the pipeline
gpu_rdf2vec_model = GPU_RDF2Vec(config=config)

# Path to the triple dataset
path = "data/wikidata5m/wikidata5m_kg.parquet"

# Read data and receive edge data
edge_data = gpu_rdf2vec_model.read_data(path)

# Fit the Word2Vec model and transform the dataset to an embedding
embeddings = gpu_rdf2vec_model.fit_transform(edge_df=edge_data, walk_vertices=None)

# Write embedding to file format. Return format is a cuDF dataframe
embeddings.to_parquet("data/wikidata5m/wikidata5m_embeddings.parquet", index=False)
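To give some intuition for the walk_strategy="random", walk_depth, and walk_number settings used above, the following plain-Python sketch samples random walks from a tiny set of triples. It is a CPU-only illustration of the general RDF2Vec idea, not the GPU implementation in rdf2vecgpu. The random_walks function is hypothetical, and the reading of walk_depth as "maximum number of hops" is an assumption; the Configuration reference defines the exact semantics.

```python
import random


def random_walks(triples, start, walk_depth, walk_number, seed=42):
    """Sample `walk_number` walks of at most `walk_depth` hops from `start`.

    Hypothetical helper for illustration only; not part of rdf2vecgpu.
    """
    # Adjacency list: subject -> list of (predicate, object) pairs
    adjacency = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((p, o))

    rng = random.Random(seed)
    walks = []
    for _ in range(walk_number):
        walk = [start]
        node = start
        for _hop in range(walk_depth):
            neighbours = adjacency.get(node)
            if not neighbours:
                break  # dead end: the walk stops early
            predicate, node = rng.choice(neighbours)
            walk.extend([predicate, node])
        walks.append(walk)
    return walks


triples = [
    ("alice", "knows", "bob"),
    ("bob", "worksAt", "acme"),
    ("alice", "livesIn", "berlin"),
]
walks = random_walks(triples, start="alice", walk_depth=2, walk_number=3)
```

Each walk is a sequence of alternating entities and predicates. In RDF2Vec, such sequences play the role of sentences for the Word2Vec model.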
As a shorthand, keyword arguments can be passed directly to GPU_RDF2Vec and they will be forwarded
to RDF2VecConfig internally:
gpu_rdf2vec_model = GPU_RDF2Vec(
    walk_strategy="random",
    walk_depth=4,
    walk_number=100,
    embedding_model="skipgram",
    epochs=5,
)
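The keyword-argument shorthand can be pictured with the following toy sketch of the forwarding pattern. SketchConfig and SketchPipeline are hypothetical classes, not the rdf2vecgpu implementation. In particular, rejecting a mix of config and keyword arguments is a design choice of this sketch, and the real library may behave differently.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical config with a few defaults (not RDF2VecConfig)."""
    walk_depth: int = 4
    walk_number: int = 100
    epochs: int = 5


class SketchPipeline:
    """Hypothetical pipeline accepting a config object or keyword arguments."""

    def __init__(self, config=None, **kwargs):
        if config is not None and kwargs:
            raise ValueError("pass either config or keyword arguments, not both")
        # Unknown keywords raise TypeError from the dataclass constructor.
        self.config = config if config is not None else SketchConfig(**kwargs)


explicit = SketchPipeline(config=SketchConfig(walk_depth=2, epochs=10))
shorthand = SketchPipeline(walk_depth=2, epochs=10)
```

In this sketch both objects end up with an equal configuration, and any parameter that is not passed keeps its default value.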
Outlook
Currently, the package supports the overall workflow following the scikit-learn paradigm. In future releases, we will provide more fine-grained interfaces that let users customize the individual steps for their specific use case. These interfaces will also benefit multi-GPU support and distributed training: by allowing users to persist data between steps, they reduce the size of the Dask task graph for very large graphs.