.. _dataload: Data Loading ================ The knowledge graph data should be prepared in a file that is compatible with the package's data load functionality. In order to load the graph, ``rdf2vecgpu`` uses two different engines with different implications: - **cuDF engine**: utilizes GPU memory for faster data processing. Suitable for large graphs that fit into GPU memory. - **rdflib engine**: provides the ability to load graph file formats that are not directly supported by cuDF. However, it uses CPU memory and may be slower for large datasets. As outlined in the :doc:`gettingstarted` guide, the engine for loading is selected based on the provided file format. Below, we provide an overview of the supported file formats for each engine. Supported file formats ~~~~~~~~~~~~~~~~~~~~~~~ In the following, you find an overview of the different supported file formats for both engines: +------------------+-----------------------+-------------------------+ | File Format | cuDF engine | rdflib engine | +==================+=======================+=========================+ | N-Triples (.nt) | Yes | Yes | +------------------+-----------------------+-------------------------+ | N-Quads (.nq) | Yes | Yes | +------------------+-----------------------+-------------------------+ | Turtle (.ttl) | No | Yes | +------------------+-----------------------+-------------------------+ | RDF/XML (.rdf) | No | Yes | +------------------+-----------------------+-------------------------+ | JSON-LD (.jsonld)| No | Yes | +------------------+-----------------------+-------------------------+ | Notation-3 (.n3) | No | Yes | +------------------+-----------------------+-------------------------+ | Trig (.trig) | No | Yes | +------------------+-----------------------+-------------------------+ | CSV (.csv) | Yes | No | +------------------+-----------------------+-------------------------+ | Parquet (.parquet)| Yes | No | +------------------+-----------------------+-------------------------+ | ORC (.orc) | Yes | No | +------------------+-----------------------+-------------------------+ For optimal performance, it is recommended to use the cuDF engine with supported file formats like Parquet, ORC, N-Triples, or CSV. If your dataset is in a different format, consider converting it to one of these formats for better load efficiency. The best performance is typically achieved with Parquet files due to their columnar storage format, which is well-suited for GPU processing. Code example ~~~~~~~~~~~~ Here is a code snippet demonstrating how to load a knowledge graph using the cuDF engine with a Parquet file: .. code-block:: python from rdf2vecgpu import GPU_RDF2Vec, RDF2VecConfig config = RDF2VecConfig( walk_strategy="random", walk_depth=4, walk_number=100, embedding_model="skipgram", epochs=5, multi_gpu=False, ) gpu_rdf2vec_model = GPU_RDF2Vec(config=config) # Path to the triple dataset path = "data/wikidata5m/wikidata5m_kg.parquet" # Read data and receive edge data edge_data = gpu_rdf2vec_model.read_data(path) Alternatively, when using a file format which is not directly supported by cuDF, this is automatically detected and the rdflib engine is used instead: .. code-block:: python edge_data = gpu_rdf2vec_model.read_data("data/wikidata5m/wikidata5m_kg.ttl") This allows you to seamlessly load different file formats without changing the code logic. Column mapping and reader keyword arguments ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``read_data`` exposes two optional arguments to adapt the reader to non-default schemas: - ``col_map``: a mapping from your source column names to the expected ``subject``, ``predicate``, ``object`` (and optionally ``weights``) names. - ``read_kwargs``: additional keyword arguments forwarded to the underlying cuDF/Dask reader (for example ``delimiter``, ``columns``, ``compression``). .. code-block:: python edge_data = gpu_rdf2vec_model.read_data( "data/my_graph.csv", col_map={"src": "subject", "rel": "predicate", "dst": "object"}, read_kwargs={"delimiter": "\t"}, ) Weighted walks ~~~~~~~~~~~~~~~ When ``walk_weighted=True`` is set in the configuration, cuGraph's ``biased_random_walks`` is used for walk generation. The input data must contain a ``weights`` column (cuGraph's standard column name). You can use ``col_map`` to rename an existing edge-weight column accordingly: .. code-block:: python config = RDF2VecConfig(walk_weighted=True, walk_strategy="random") gpu_rdf2vec_model = GPU_RDF2Vec(config=config) edge_data = gpu_rdf2vec_model.read_data( "data/spatial_graph.parquet", col_map={"distance": "weights"}, ) Literal handling ~~~~~~~~~~~~~~~~ Knowledge graphs often contain edges whose object is a literal value (for example numeric attributes). ``RDF2VecConfig`` exposes three parameters to handle such edges: - ``literal_predicates``: a list of predicate strings that identify literal edges. - ``literal_strategy``: ``"drop"`` removes literal edges from the graph (the pyRDF2Vec default), while ``"bin"`` discretizes the object values into bin tokens so the edge is preserved. - ``literal_n_bins`` and ``literal_bin_strategy`` (``"quantile"`` or ``"uniform"``) control the binning behavior when ``literal_strategy="bin"``. .. code-block:: python config = RDF2VecConfig( literal_predicates=["", ""], literal_strategy="bin", literal_n_bins=5, literal_bin_strategy="quantile", ) Considerations for multi-GPU and distributed setups ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Depending on the value of the ``multi_gpu`` flag in the configuration, ``read_data`` returns the data either as a single cuDF dataframe (for single-GPU training) or as a Dask-cuDF dataframe backed by a list of cuDF partitions (for multi-GPU training). Depending on the framework used for the graph load, a repartition of the Dask dataframe may be necessary to achieve the best performance for the following steps, which are influenced by the number of GPUs available as well as the number of nodes within the cluster.