.. _training:

Training RDF2vec with gpuRDF2vec
=================================

The training process of RDF2vec embeddings using the ``gpuRDF2vec`` package involves several steps that happen internally and leverage the GPU acceleration capabilities of the package.
The overall training of the embedding model happens by calling the ``fit`` method of the :class:`~rdf2vecgpu.gpu_rdf2vec.GPU_RDF2Vec` class.
Below, we outline the main steps that are performed during the training process:

1. **Data Reading**: ``read_data`` loads the triples into a cuDF (single-GPU) or Dask-cuDF (multi-GPU) dataframe using :class:`~rdf2vecgpu.reader.kg_file_reader.KGFileReader`.
2. **Walk Extraction**: Based on the selected ``walk_strategy`` (``random`` or ``bfs``), the package generates walks from the knowledge graph. This step is performed entirely on the GPU using cuGraph for efficient graph traversal and walk generation. When ``walk_weighted=True``, cuGraph's ``biased_random_walks`` is used and the input must contain a ``weights`` column.
3. **Data Preparation**: The generated walks are converted into a format suitable for the embedding model using cuDF dataframes, which are handed off to PyTorch tensors via DLPack to avoid CPU bottlenecks.
4. **Embedding Training**: The Word2Vec model is trained on the prepared walks. The package uses an optimized implementation of Word2Vec that leverages GPU acceleration for faster training times and allows scaling across different nodes and GPUs via PyTorch Lightning and Dask.
5. **Model Saving**: After training, the learned embeddings can be saved to disk for later use.

Here is an example code snippet demonstrating how to train RDF2vec embeddings using the ``gpuRDF2vec`` package:

.. code-block:: python

    from rdf2vecgpu import GPU_RDF2Vec, RDF2VecConfig

    # Initialize the GPU_RDF2Vec model with desired parameters
    config = RDF2VecConfig(
        walk_strategy="random",
        walk_depth=4,
        walk_number=100,
        embedding_model="skipgram",
        epochs=5,
        batch_size=None,
        vector_size=100,
        window_size=5,
        min_count=1,
        learning_rate=0.01,
        negative_samples=5,
        random_state=42,
        reproducible=False,
        multi_gpu=False,
        generate_artifact=False,
        cpu_count=20,
    )
    gpu_rdf2vec_model = GPU_RDF2Vec(config=config)

    # Read the knowledge graph
    edge_data = gpu_rdf2vec_model.read_data("data/wikidata5m/wikidata5m_kg.parquet")

    # Fit the model to the knowledge graph data
    gpu_rdf2vec_model.fit(edge_df=edge_data, walk_vertices=None)

Pipeline stages and experiment tracking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Internally, the pipeline is divided into stages wrapped by the tracker's context manager, for example ``data_loading``, ``Literal_Handling``, walk generation, vocabulary construction, and training.
When a tracker backend is configured via ``config.tracker`` (``"mlflow"`` or ``"wandb"``), parameters and metrics for each stage are logged to the selected experiment tracking backend.
See :doc:`tracking` for details on the available backends, required extras, and what is captured at each stage.
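
To illustrate what "stages wrapped by a context manager" can look like, the following is a minimal, standard-library-only sketch. It is not the package's implementation: the ``InMemoryTracker`` class, its ``stage``, ``log_params`` and ``log_metric`` methods, the ``walk_generation`` stage name, and the ``duration_s`` metric are all hypothetical names chosen for this example. A real backend such as MLflow or Weights & Biases would send these values to a tracking server instead of keeping them in dictionaries.

```python
import time
from contextlib import contextmanager


class InMemoryTracker:
    """Hypothetical stand-in for an experiment tracker (not part of gpuRDF2vec)."""

    def __init__(self):
        self.params = {}   # stage name -> logged parameters
        self.metrics = {}  # stage name -> logged metrics

    def log_params(self, stage, params):
        self.params.setdefault(stage, {}).update(params)

    def log_metric(self, stage, name, value):
        self.metrics.setdefault(stage, {})[name] = value

    @contextmanager
    def stage(self, name, **params):
        # Record the stage parameters on entry and the wall-clock time on exit,
        # even if the wrapped block raises.
        self.log_params(name, params)
        start = time.perf_counter()
        try:
            yield self
        finally:
            self.log_metric(name, "duration_s", time.perf_counter() - start)


tracker = InMemoryTracker()

# Toy data stands in for the GPU dataframes used by the real pipeline.
with tracker.stage("data_loading", path="data/wikidata5m/wikidata5m_kg.parquet"):
    edges = [("a", "p", "b"), ("b", "p", "c")]
    tracker.log_metric("data_loading", "n_triples", len(edges))

# "walk_generation" is an assumed stage name used only for this sketch.
with tracker.stage("walk_generation", walk_depth=4, walk_number=100):
    walks = [["a", "p", "b", "p", "c"]]
    tracker.log_metric("walk_generation", "n_walks", len(walks))
```

Wrapping each stage this way keeps timing and parameter logging in one place, so the stage bodies only need to report the metrics that are specific to them.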