.. _gettingstarted:

Getting Started
=================
To use ``rdf2vecgpu``, first install the package as described in :doc:`installation`.
After installation, follow the steps below to generate RDF2Vec embeddings
with GPU acceleration.

The framework is designed around abstractions similar to those of scikit-learn. The main class
to interact with is :class:`~rdf2vecgpu.gpu_rdf2vec.GPU_RDF2Vec`, which provides methods for reading data,
fitting the model, and transforming the data into embeddings. All hyperparameters are bundled in a
:class:`~rdf2vecgpu.config.RDF2VecConfig` object; see :doc:`configuration` for the full parameter reference.

A typical run instantiates ``GPU_RDF2Vec`` with an ``RDF2VecConfig`` (or equivalent keyword
arguments), reads a knowledge graph from a file, and then fits and transforms. The ``fit_transform`` method
fits the Word2Vec model and returns the embeddings in one step; the two stages can also be called
independently via ``fit`` and ``transform``.

Basic usage
~~~~~~~~~~~~
.. code-block:: python

   from rdf2vecgpu import GPU_RDF2Vec, RDF2VecConfig

   # Bundle all hyperparameters in a config object
   config = RDF2VecConfig(
       walk_strategy="random",
       walk_depth=4,
       walk_number=100,
       embedding_model="skipgram",
       epochs=5,
       batch_size=None,
       vector_size=100,
       window_size=5,
       min_count=1,
       learning_rate=0.01,
       negative_samples=5,
       random_state=42,
       reproducible=False,
       multi_gpu=False,
       generate_artifact=False,
       cpu_count=20,
   )

   # Instantiate the pipeline
   gpu_rdf2vec_model = GPU_RDF2Vec(config=config)

   # Path to the triple dataset
   path = "data/wikidata5m/wikidata5m_kg.parquet"

   # Read the triples from file and obtain the edge data
   edge_data = gpu_rdf2vec_model.read_data(path)

   # Fit the Word2Vec model and transform the dataset to an embedding
   embeddings = gpu_rdf2vec_model.fit_transform(edge_df=edge_data, walk_vertices=None)

   # Write the embeddings to Parquet; fit_transform returns a cuDF DataFrame
   embeddings.to_parquet("data/wikidata5m/wikidata5m_embeddings.parquet", index=False)
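
To give an intuition for what ``walk_depth`` and ``walk_number`` control, the following CPU-only toy
sketch generates random walks over a handful of triples using only the standard library. It is not
the implementation used by ``rdf2vecgpu``: the function ``toy_random_walks``, the example triples, and
the reading of ``walk_number`` as "walks per start entity" are assumptions made for illustration.

.. code-block:: python

   import random
   from collections import defaultdict


   def toy_random_walks(triples, walk_depth, walk_number, seed=42):
       """Hypothetical helper: random walks over (subject, predicate, object) triples."""
       rng = random.Random(seed)

       # Adjacency list: subject -> [(predicate, object), ...]
       adjacency = defaultdict(list)
       for subj, pred, obj in triples:
           adjacency[subj].append((pred, obj))

       walks = []
       for start in sorted(adjacency):
           # Assumption: walk_number counts walks per start entity
           for _ in range(walk_number):
               walk = [start]
               node = start
               # Each hop appends one predicate and one entity
               for _ in range(walk_depth):
                   edges = adjacency.get(node)
                   if not edges:
                       break  # dead end: the entity has no outgoing edges
                   pred, node = rng.choice(edges)
                   walk.extend([pred, node])
               walks.append(walk)
       return walks


   triples = [
       ("alice", "knows", "bob"),
       ("bob", "knows", "carol"),
       ("carol", "worksAt", "acme"),
   ]
   walks = toy_random_walks(triples, walk_depth=2, walk_number=3)
   for walk in walks[:3]:
       print(" -> ".join(walk))

Each walk is a token sequence of alternating entities and predicates. Sequences of this kind are what
a Word2Vec model such as skip-gram consumes as "sentences" when learning the embeddings.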

As a shorthand, keyword arguments can be passed directly to ``GPU_RDF2Vec``; they are forwarded
to ``RDF2VecConfig`` internally:

.. code-block:: python

   gpu_rdf2vec_model = GPU_RDF2Vec(
       walk_strategy="random",
       walk_depth=4,
       walk_number=100,
       embedding_model="skipgram",
       epochs=5,
   )
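
The forwarding pattern described above can be sketched with a plain dataclass. The classes
``SketchConfig`` and ``SketchPipeline`` are hypothetical stand-ins, not the package's own code, and
raising an error when both a config and keyword arguments are given is an assumption of this sketch
rather than documented behaviour of ``GPU_RDF2Vec``.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class SketchConfig:
       """Hypothetical config object holding a few hyperparameters."""

       walk_strategy: str = "random"
       walk_depth: int = 4
       walk_number: int = 100
       embedding_model: str = "skipgram"
       epochs: int = 5


   class SketchPipeline:
       """Hypothetical pipeline that accepts either a config or keyword arguments."""

       def __init__(self, config=None, **kwargs):
           if config is not None and kwargs:
               # Assumption of this sketch: mixing both styles is rejected
               raise ValueError("Pass either `config` or keyword arguments, not both.")
           # Unknown keyword arguments raise TypeError in the dataclass constructor
           self.config = config if config is not None else SketchConfig(**kwargs)


   explicit = SketchPipeline(config=SketchConfig(walk_depth=2, epochs=10))
   shorthand = SketchPipeline(walk_depth=2, epochs=10)

   # Both construction styles end up with an equal config object
   print(explicit.config == shorthand.config)

Keeping a single config object as the source of truth means that validation and defaults live in one
place, regardless of which construction style the caller prefers.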

Outlook
~~~~~~~~~~~~
Currently, the package supports the overall workflow following the scikit-learn paradigm.
In future releases, we will provide finer-grained interfaces that allow users to
customize the individual steps for their specific use case. These interfaces should also
benefit **multi-GPU** support and distributed training: allowing users to persist data between
the steps reduces the size of the Dask task graph for very large graphs.