Advanced Tutorial

This tutorial assumes the reader is familiar with the d3m ecosystem in general. If not, please refer to other sections of the documentation first, e.g., the TA1 quick-start guide.

Primitive class

There are a variety of primitive interfaces/classes available. For example, for a primitive doing just attribute extraction without requiring any fitting, TransformerPrimitiveBase from the transformer module can be used.

Each primitive can have its own hyper-parameters. Some example hyper-parameter types one can use to describe a primitive's hyper-parameters are: Constant, UniformBool, UniformInt, Choice, List.

Also, each hyper-parameter should be defined as one or more of the four hyper-parameter semantic types:

https://metadata.datadrivendiscovery.org/types/TuningParameter
https://metadata.datadrivendiscovery.org/types/ControlParameter
https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter
https://metadata.datadrivendiscovery.org/types/MetafeatureParameter
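For instance, a control parameter, which selects a primitive's behaviour rather than being tuned, could be declared as in the following minimal sketch (the parameter name and values are hypothetical):

from d3m.metadata import hyperparams


class ExampleHyperparams(hyperparams.Hyperparams):
    # Hypothetical control parameter: selects behaviour, not tuned by TA2 systems.
    method = hyperparams.Enumeration(
        values=['mean', 'median'],
        default='mean',
        semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'],
    )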

Example

from d3m import container
from d3m.primitive_interfaces import base, transformer
from d3m.metadata import base as metadata_base, hyperparams

__all__ = ('ExampleTransformPrimitive',)

# Container types for inputs/outputs; see the "Input/Output types" section below.
Inputs = container.DataFrame
Outputs = container.DataFrame


class Hyperparams(hyperparams.Hyperparams):
    learning_rate = hyperparams.Uniform(lower=0.0, upper=1.0, default=0.001, semantic_types=[
        'https://metadata.datadrivendiscovery.org/types/TuningParameter',
    ])
    clusters = hyperparams.UniformInt(lower=1, upper=100, default=10, semantic_types=[
        'https://metadata.datadrivendiscovery.org/types/TuningParameter',
    ])


class ExampleTransformPrimitive(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
    """
    The docstring is very important and must be included. It should contain
    relevant information about the hyper-parameters, primitive functionality, etc.
    """

    def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> base.CallResult[Outputs]:
        pass
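Inside produce, declared hyper-parameter values are available through the self.hyperparams mapping, and results are returned wrapped in a CallResult. A minimal sketch of what a produce body might look like (the computation itself is elided):

    def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> base.CallResult[Outputs]:
        # Hyper-parameter values are read from the self.hyperparams mapping.
        learning_rate = self.hyperparams['learning_rate']

        outputs = ...  # Compute outputs from inputs here.

        # Primitive results are always wrapped in a CallResult.
        return base.CallResult(outputs)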

Input/Output types

The acceptable inputs/outputs of a primitive must be pre-defined. D3M supports a variety of standard input/output container types, such as container.Dataset, container.DataFrame, container.ndarray, and container.List.

Note

Even though D3M container types behave mostly like standard types, the D3M container types must be used for inputs/outputs, because they support D3M metadata.

Example

from d3m import container

Inputs  = container.DataFrame
Outputs = container.DataFrame


class ExampleTransformPrimitive(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
    ...
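Because outputs must be container types, computed values are typically wrapped in a D3M container before being returned. A small sketch, assuming a plain pandas DataFrame raw holds the computed values; generate_metadata=True asks d3m to generate basic metadata for the new container automatically:

import pandas

from d3m import container

raw = pandas.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

# Wrap the plain pandas DataFrame in a D3M container with generated metadata.
outputs = container.DataFrame(raw, generate_metadata=True)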

Note

When returning the output DataFrame, its metadata should be updated with the correct semantic and structural types.

Example

# Update metadata for each DataFrame column.
for column_index in range(outputs.shape[1]):
    column_metadata = {}
    column_metadata['structural_type'] = float  # Equivalent to type(1.0).
    column_metadata['name'] = 'column {i}'.format(i=column_index)
    column_metadata['semantic_types'] = ('http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute')
    outputs.metadata = outputs.metadata.update((metadata_base.ALL_ELEMENTS, column_index), column_metadata)
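To verify the update, the metadata for a column can be queried back, for example:

# Query back the metadata of the first column to check the update.
print(outputs.metadata.query((metadata_base.ALL_ELEMENTS, 0)))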

Primitive Metadata

It is crucial to define the primitive's metadata properly. Primitive metadata can be used by TA2 systems to meta-learn about primitives and, in general, to decide which primitive to use when.

Example

import os

from d3m import utils as d3m_utils
from d3m.primitive_interfaces import base, transformer
from d3m.metadata import base as metadata_base, hyperparams

__all__ = ('ExampleTransformPrimitive',)

class ExampleTransformPrimitive(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
    """
    Docstring.
    """

    metadata = metadata_base.PrimitiveMetadata({
        'id': <Unique-ID, generated using UUID>,
        'version': <Primitive-development-version>,
        'name': <Primitive-Name>,
        'python_path': 'd3m.primitives.<>.<>.<>',  # Must match the path in setup.py.
        'source': {
            'name': <Project-maintainer-name>,
            'uris': [<GitHub-link-to-project>],
            'contact': 'mailto:<Author E-Mail>'
        },
        'installation': [{
            'type': metadata_base.PrimitiveInstallationType.PIP,
            'package_uri': 'git+<git-link-to-project>@{git_commit}#egg=<Package_name>'.format(
                git_commit=d3m_utils.current_git_commit(os.path.dirname(__file__)),
            ),
        }],
        'algorithm_types': [
            # Check https://metadata.datadrivendiscovery.org/devel/?definitions#definitions.algorithm_types for all available algorithm types.
            # If an algorithm type is not available, a Merge Request should be made to add it to the core package.
            metadata_base.PrimitiveAlgorithmType.<Choose-the-algorithm-type-that-best-describes-the-primitive>,
        ],
        # Check https://metadata.datadrivendiscovery.org/devel/?definitions#definitions.primitive_family for all available primitive family types.
        # If the primitive family is not available, a Merge Request should be made to add it to the core package.
        'primitive_family': metadata_base.PrimitiveFamily.<Choose-the-primitive-family-that-closely-associates-to-the-primitive>
    })

    ...
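The id field can be generated once, e.g. using Python's standard uuid module, and should then stay fixed for the primitive:

import uuid

# Generate a unique ID once and paste it into the primitive metadata as 'id'.
print(uuid.uuid4())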

Unit tests

Once the primitive is constructed, unit tests should be written to verify that the primitive works as intended.

Sample Setup

import os
import unittest

from d3m import container
from d3m.metadata import base as metadata_base
from common_primitives import dataset_to_dataframe

from example_primitive import ExampleTransformPrimitive


class ExampleTransformTest(unittest.TestCase):
    def test_happy_path(self):
        # Load a dataset.
        # Datasets can be obtained from: https://datasets.datadrivendiscovery.org/d3m/datasets
        base_path = '../datasets/training_datasets/seed_datasets_archive/'
        dataset_doc_path = os.path.abspath(os.path.join(base_path, '38_sick_dataset', 'datasetDoc.json'))
        dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path))

        dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams()
        dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults())
        dataframe = dataframe_primitive.produce(inputs=dataset).value

        # Call example transformer.
        hyperparams_class = ExampleTransformPrimitive.metadata.get_hyperparams()
        primitive = ExampleTransformPrimitive(hyperparams=hyperparams_class.defaults())
        test_out = primitive.produce(inputs=dataframe).value

        # Write assertions to make sure that the output (type, shape, metadata) is what is expected.
        self.assertEqual(...)

        ...


if __name__ == '__main__':
    unittest.main()

It is recommended to do the testing inside the D3M Docker container:

docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
  registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
cd /mnt/d3m/example_primitive
python3 primitive_name_test.py

Primitive annotation

Once the primitive is constructed and unit tests pass, the final step in building a primitive is to generate the primitive annotation, which will be indexed and used by D3M.

docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
  registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
cd /mnt/d3m/example_primitive
pip3 install -e .
python3 -m d3m index describe -i 4 <primitive_name>

Alternatively, a helper script can be used to generate primitive annotations. This can be more convenient when managing multiple primitives. In this case, generating the primitive annotation is done as follows:

docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
  registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
cd /mnt/d3m/example_primitive
pip3 install -e .
python3 generate-primitive-json.py ...

Example pipeline

After building a custom primitive, it has to be used in an example pipeline and run on one of the D3M seed datasets in order to show it integrates with other indexed D3M primitives.

The essential elements of pipelines are:

Dataset Denormalizer -> Dataset Parser -> Data Cleaner (If necessary) -> Feature Extraction -> Classifier/Regressor -> Output

Example code for building a pipeline is shown below:

# D3M dependencies
from d3m import index
from d3m.metadata.base import ArgumentType
from d3m.metadata.pipeline import Pipeline, PrimitiveStep

# Common Primitives
from common_primitives.column_parser import ColumnParserPrimitive
from common_primitives.dataset_to_dataframe import DatasetToDataFramePrimitive
from common_primitives.extract_columns_semantic_types import ExtractColumnsBySemanticTypesPrimitive

# Testing primitive
from quickstart_primitives.sample_primitive1.input_to_output import InputToOutputPrimitive

# Pipeline
pipeline = Pipeline()
pipeline.add_input(name='inputs')

# Step 0: DatasetToDataFrame (Dataset Denormalizer)
step_0 = PrimitiveStep(primitive_description=DatasetToDataFramePrimitive.metadata.query())
step_0.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='inputs.0')
step_0.add_output('produce')
pipeline.add_step(step_0)

# Step 1: Custom primitive
step_1 = PrimitiveStep(primitive=InputToOutputPrimitive)
step_1.add_argument(name='inputs',  argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce')
step_1.add_output('produce')
pipeline.add_step(step_1)

# Step 2: Column Parser (Dataset Parser)
step_2 = PrimitiveStep(primitive_description=ColumnParserPrimitive.metadata.query())
step_2.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.1.produce')
step_2.add_output('produce')
pipeline.add_step(step_2)

# Step 3: Extract Attributes (Feature Extraction)
step_3 = PrimitiveStep(primitive_description=ExtractColumnsBySemanticTypesPrimitive.metadata.query())
step_3.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.2.produce')
step_3.add_output('produce')
step_3.add_hyperparameter(name='semantic_types', argument_type=ArgumentType.VALUE, data=['https://metadata.datadrivendiscovery.org/types/Attribute'])
pipeline.add_step(step_3)

# Step 4: Extract Targets (Feature Extraction)
step_4 = PrimitiveStep(primitive_description=ExtractColumnsBySemanticTypesPrimitive.metadata.query())
step_4.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce')
step_4.add_output('produce')
step_4.add_hyperparameter(name='semantic_types', argument_type=ArgumentType.VALUE, data=['https://metadata.datadrivendiscovery.org/types/TrueTarget'])
pipeline.add_step(step_4)

attributes = 'steps.3.produce'
targets    = 'steps.4.produce'

# Step 5: Imputer (Data Cleaner)
step_5 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_cleaning.imputer.SKlearn'))
step_5.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference=attributes)
step_5.add_output('produce')
pipeline.add_step(step_5)

# Step 6: Classifier
step_6 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.classification.decision_tree.SKlearn'))
step_6.add_argument(name='inputs',  argument_type=ArgumentType.CONTAINER,  data_reference='steps.5.produce')
step_6.add_argument(name='outputs', argument_type=ArgumentType.CONTAINER, data_reference=targets)
step_6.add_output('produce')
pipeline.add_step(step_6)

# Final Output
pipeline.add_output(name='output predictions', data_reference='steps.6.produce')

# print(pipeline.to_json())
with open('./pipeline.json', 'w') as write_file:
    write_file.write(pipeline.to_json(indent=4, sort_keys=False, ensure_ascii=False))

Once the pipeline is constructed and its JSON file is generated, the pipeline is run using the python3 -m d3m runtime command. Successfully running the pipeline validates that the primitive works as intended.

docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
  registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
  /bin/bash -c "cd /mnt/d3m; \
    pip3 install -e .; \
    cd pipelines; \
    python3 -m d3m runtime fit-produce \
            --pipeline pipeline.json \
            --problem /datasets/seed_datasets_current/38_sick/TRAIN/problem_TRAIN/problemDoc.json \
            --input /datasets/seed_datasets_current/38_sick/TRAIN/dataset_TRAIN/datasetDoc.json \
            --test-input /datasets/seed_datasets_current/38_sick/TEST/dataset_TEST/datasetDoc.json \
            --output 38_sick_results.csv \
            --output-run pipeline_run.yml; \
    exit"

Advanced: Primitive with static files

When building primitives that use external/static files, e.g., pre-trained weights, the primitive's metadata must properly define such a dependency. The static file can be hosted anywhere based on your preference, as long as the URL to the file is a direct download link. It must be public so that users of your primitive can access the file. Be sure to keep the URL available; older versions of the primitive could start failing if the URL stops resolving.

Note

Full code of this section can be found in the quickstart repository.

Below is a description of the primitive metadata entry, here named _weights_configs, required for each static file.

_weights_configs = [{
    'type': 'FILE',
    'key': '<Weight File Name>',
    'file_uri': '<URL to directly download the Weight File>',
    'file_digest': '<sha256sum of the Weight File>',
}]
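The file_digest is the SHA256 hex digest of the file. It can be computed with the sha256sum command-line tool or, as in this small sketch, with Python's hashlib (the file name weights.h5 is hypothetical):

import hashlib

# Compute the SHA256 digest of a (hypothetical) weights file for 'file_digest'.
with open('weights.h5', 'rb') as weights_file:
    print(hashlib.sha256(weights_file.read()).hexdigest())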

These _weights_configs should be appended directly to the installation field of the primitive metadata.

import os

from d3m import utils as d3m_utils
from d3m.primitive_interfaces import base, transformer
from d3m.metadata import base as metadata_base, hyperparams

__all__ = ('ExampleTransform',)

class ExampleTransform(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
    """
    Docstring.
    """

    _weights_configs = [{
        'type': 'FILE',
        'key': '<Weight File Name>',
        'file_uri': '<URL to directly download the Weight File>',
        'file_digest': '<sha256sum of the Weight File>',
    }]

    metadata = metadata_base.PrimitiveMetadata({
        # ...
        'installation': [{
            'type': metadata_base.PrimitiveInstallationType.PIP,
            'package_uri': 'git+<git-link-to-project>@{git_commit}#egg=<Package_name>'.format(
                git_commit=d3m_utils.current_git_commit(os.path.dirname(__file__)),
            ),
        }] + _weights_configs,
        # ...
    })

    ...

After the primitive metadata definition, it is important to include code that returns the paths of the static files. An example is given as follows:

def _find_weights_path(self, key_filename):
    if key_filename in self.volumes:
        weight_file_path = self.volumes[key_filename]
    else:
        # Fall back to the current directory, using the <file_digest>/<file> layout.
        weights_config = next(config for config in self._weights_configs if config['key'] == key_filename)
        weight_file_path = os.path.join('.', weights_config['file_digest'], key_filename)

    if not os.path.isfile(weight_file_path):
        raise ValueError(
            "Can't get weights file from volumes by key '{key_filename}' and at path '{path}'.".format(
                key_filename=key_filename,
                path=weight_file_path,
            ),
        )

    return weight_file_path

In this example code, the _find_weights_path method tries to find the static file in volumes based on the weight file key. If it cannot be found there (e.g., the runtime was not provided with static files), it looks in the current directory. The latter fallback is useful during development.
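For self.volumes to be populated, the primitive's constructor must accept the volumes argument and pass it on to the base class. A minimal sketch of how the helper might then be used (the weight file key weights.h5 is hypothetical):

import typing


class ExampleTransform(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
    def __init__(self, *, hyperparams: Hyperparams, random_seed: int = 0,
                 volumes: typing.Dict[str, str] = None) -> None:
        super().__init__(hyperparams=hyperparams, random_seed=random_seed, volumes=volumes)

    def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> base.CallResult[Outputs]:
        # Resolve the static file by the key declared in _weights_configs.
        weights_path = self._find_weights_path('weights.h5')
        ...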

To run a pipeline with such a primitive, you have to download the static files and provide them to the runtime:

docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
  registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
  /bin/bash -c "cd /mnt/d3m; \
    pip3 install -e .; \
    cd pipelines; \
    mkdir /static; \
    python3 -m d3m index download -p d3m.primitives.path.of.Primitive -o /static; \
    python3 -m d3m runtime --volumes /static fit-produce \
            --pipeline feature_pipeline.json \
            --problem /datasets/seed_datasets_current/22_handgeometry/TRAIN/problem_TRAIN/problemDoc.json \
            --input /datasets/seed_datasets_current/22_handgeometry/TRAIN/dataset_TRAIN/datasetDoc.json \
            --test-input /datasets/seed_datasets_current/22_handgeometry/TEST/dataset_TEST/datasetDoc.json \
            --output 22_handgeometry_results.csv \
            --output-run feature_pipeline_run.yml; \
    exit"

The static files will be downloaded and stored locally based on the file_digest of _weights_configs. This way, the same files used by multiple primitives are not duplicated:

mkdir /static
python3 -m d3m index download -p d3m.primitives.path.of.Primitive -o /static

The optional -p argument downloads static files only for a particular primitive, matching on its Python path. The optional -o argument downloads the static files into a common folder; if not provided, they are downloaded into the current directory.

After the download, the file structure is as follows:

/static/
  <file_digest>/
    <file>
  <file_digest>/
    <file>
  ...
  ...