undr.configuration#

User-facing API to manipulate configuration files and trigger actions (download, decompress…).

Overview#

Classes#

Configuration

Represents a dataset configuration (TOML).

DatasetSettings

A dataset entry in a TOML settings file.

DoiSelector

Selector for a DOI download.

IndexStatus

Keeps track of the indexing progress for a dataset.

IndexesStatuses

Maps dataset names to index statuses.

InstallSelector

Selector for a standard installation, maps install modes to actions.

MapMessage

A message generated by MapSelector.

MapProcessFile

Uses a switch to process files and wraps messages into MapMessage.

MapSelector

Applies a user-provided function to each file.

Functions#

configuration_from_path

Reads the configuration (TOML) with the given path.

Attributes#

schema

JSON schema for TOML settings files.

Module Contents#

class undr.configuration.Configuration#

Represents a dataset configuration (TOML).

directory: pathlib.Path#

Local path of the root datasets directory (usually called datasets).

name_to_dataset_settings: dict[str, DatasetSettings]#

Maps dataset names to their parameters.

bibtex(show_display: bool, workers: int, force: bool, bibtex_timeout: float, log_directory: pathlib.Path | None) str#

Downloads index files and BibTeX references for enabled datasets.

Parameters:
  • show_display (bool) – Whether to show progress in the terminal.

  • workers (int) – Number of parallel workers (threads).

  • force (bool) – Whether to re-download resources even if they are already present locally.

  • bibtex_timeout (float) – Timeout for requests to https://dx.doi.org/.

  • log_directory (Optional[pathlib.Path]) – Directory to store log files. Logs are not generated if this is None.

Raises:

task.WorkerException – if a worker raises an error.

Returns:

BibTeX references as a string.

Return type:

str
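As a usage sketch, the helper below calls bibtex with every parameter passed by keyword, matching the signature above. The helper name export_bibtex, the output path, and the chosen values (4 workers, a timeout of 60) are illustrative assumptions. The configuration argument is expected to be an object returned by configuration_from_path.

```python
import pathlib


def export_bibtex(configuration, output_path: pathlib.Path, workers: int = 4) -> str:
    """Writes the BibTeX references of enabled datasets to output_path.

    `configuration` is assumed to be a undr.configuration.Configuration.
    The helper name and the parameter values are illustrative, not part of undr.
    """
    references = configuration.bibtex(
        show_display=False,  # no terminal progress, for instance in a batch job
        workers=workers,
        force=False,  # reuse resources that are already present locally
        bibtex_timeout=60.0,  # assumed value, chosen for illustration
        log_directory=None,  # None disables log files
    )
    output_path.write_text(references, encoding="utf-8")
    return references
```

Because bibtex may raise task.WorkerException, a caller that must not crash would wrap the call in a try block.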

dataset(name: str) undr.path_directory.Directory#

Returns the dataset with the given name.

Parameters:

name (str) – The dataset name.

Raises:

ValueError – if the dataset exists but is disabled.

Returns:

The dataset’s root directory.

Return type:

path_directory.Directory

display(download_tag: undr.display.Tag = display.Tag(label='download', icon='↓'), process_tag: undr.display.Tag = display.Tag(label='process', icon='⚛')) undr.display.Display#

Returns a display that shows download and process progress for enabled datasets.

Parameters:
  • download_tag (display.Tag, optional) – Label and icon for download. Defaults to display.Tag(label="download", icon="↓").

  • process_tag (display.Tag, optional) – Label and icon for process. Defaults to display.Tag(label="process", icon="⚛").

Returns:

Controller for the display thread.

Return type:

display.Display

enabled_datasets_settings() list[DatasetSettings]#

The settings of enabled datasets.

The list always contains at least one item (the function otherwise raises an error).

Raises:

RuntimeError – if all the datasets are disabled or there are no datasets.

Returns:

The settings of the datasets that are enabled, in the same order as the configuration file.

Return type:

list[DatasetSettings]

indexes_statuses(selector: undr.json_index_tasks.Selector) IndexesStatuses#

Builds an indexing report for enabled datasets.

Parameters:

selector (json_index_tasks.Selector) – The selector used to index the dataset.

Returns:

Index status for enabled datasets, in the same order as the configuration file.

Return type:

IndexesStatuses

install(show_display: bool, workers: int, force: bool, log_directory: pathlib.Path | None)#

Downloads index files and data files and decompresses data files.

The action (index only, download, download and decompress) may be different for each dataset and is controlled by undr.install_mode.Mode.

Parameters:
  • show_display (bool) – Whether to show progress in the terminal.

  • workers (int) – Number of parallel workers (threads).

  • force (bool) – Whether to re-download resources even if they are already present locally.

  • log_directory (Optional[pathlib.Path]) – Directory to store log files. Logs are not generated if this is None.

Raises:

task.WorkerException – if a worker raises an error.

iter(recursive: bool = False) Iterable[undr.path.Path]#

Iterates over the files in the dataset.

Parameters:

recursive (bool, optional) – Whether to recursively search child directories. Defaults to False.

Returns:

Iterator over the child paths. If recursive is false, the iterator yields the direct children (files and directories) of the root dataset directory. If recursive is true, the iterator yields all the children (files and directories) of the dataset.

Return type:

Iterable[path.Path]
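The difference between the two traversal modes can be illustrated on a local directory. This is a stand-in written with plain pathlib paths: it only mirrors the semantics described above (direct children versus all descendants) and does not use undr's path types or index files. The function name iter_children is hypothetical.

```python
import pathlib
from typing import Iterator


def iter_children(root: pathlib.Path, recursive: bool = False) -> Iterator[pathlib.Path]:
    """Stand-in for the documented behaviour, using plain pathlib paths."""
    if recursive:
        # All descendants, files and directories alike.
        yield from sorted(root.rglob("*"))
    else:
        # Direct children only (files and directories).
        yield from sorted(root.iterdir())
```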

map(switch: undr.formats.Switch, store: undr.persist.Store | None = None, show_display: bool = True, workers: int = multiprocessing.cpu_count() * 2, log_directory: pathlib.Path | None = None) Iterable[Any]#

Applies a function to each file in a dataset.

Parameters:
  • switch (formats.Switch) – Specifies the action to perform on each file type.

  • store (Optional[persist.Store], optional) – Saves progress, makes it possible to resume interrupted processing. Defaults to None.

  • show_display (bool, optional) – Whether to show progress in the terminal. Defaults to True.

  • workers (int, optional) – Number of parallel workers (threads). Defaults to twice multiprocessing.cpu_count().

  • log_directory (Optional[pathlib.Path], optional) – Directory to store log files. Logs are not generated if this is None. Defaults to None.

Raises:

task.WorkerException – if a worker raises an error.

Returns:

Iterator over the non-error messages generated by the workers.

Return type:

Iterable[Any]

mktree(root: str | os.PathLike, parents: bool = False, exist_ok: bool = False)#

Creates a copy of the datasets’ file hierarchy without the index or data files.

This function can be combined with map() to implement a map-reduce algorithm over entire datasets.
  1. Use mktree to create an empty copy of the file hierarchy.

  2. Use Configuration.map() to create a result file in the new hierarchy for each data file in the original hierarchy (for instance, a file that contains a measure of an algorithm’s performance as a single number).

  3. Collect the results (“reduce”) by reading the result files in the new hierarchy.

This approach has several benefits. Step 2, the most expensive one, runs in parallel and can be interrupted and resumed. Result files are stored in a different directory and can easily be deleted without altering the original data. The new file hierarchy prevents name clashes as long as result files are named after data files, and workers do not need to worry about directory existence since mktree runs first.

Parameters:
  • root (Union[str, os.PathLike]) – Directory where the new file hierarchy is created.

  • parents (bool, optional) – Whether to create the parents of the new directory, if they do not exist. Defaults to False.

  • exist_ok (bool, optional) – Whether to silence exceptions if the root directory already exists. Defaults to False.
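The three-step pattern described above can be sketched with the standard library alone. This is not undr code: mirror_tree stands in for mktree, a thread pool stands in for Configuration.map(), and the "measure" (the size of each file in bytes) is a placeholder for a real per-file computation. All function names and the .result suffix are assumptions made for the example.

```python
import concurrent.futures
import pathlib


def mirror_tree(source: pathlib.Path, root: pathlib.Path) -> None:
    """Step 1: recreate the directories of source under root, without any files."""
    root.mkdir(parents=True, exist_ok=True)
    for directory in sorted(path for path in source.rglob("*") if path.is_dir()):
        (root / directory.relative_to(source)).mkdir(parents=True, exist_ok=True)


def measure(source: pathlib.Path, root: pathlib.Path, data_file: pathlib.Path) -> None:
    """Step 2: write one result file per data file (here, its size in bytes)."""
    result = root / data_file.relative_to(source)
    result = result.with_name(result.name + ".result")
    if result.exists():
        return  # already computed in a previous run, so the step can be resumed
    result.write_text(str(data_file.stat().st_size), encoding="utf-8")


def map_reduce(source: pathlib.Path, root: pathlib.Path, workers: int = 4) -> int:
    mirror_tree(source, root)
    data_files = [path for path in source.rglob("*") if path.is_file()]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # list() consumes the results so that worker exceptions are re-raised here.
        list(executor.map(lambda path: measure(source, root, path), data_files))
    # Step 3: reduce by reading the result files in the new hierarchy.
    return sum(int(path.read_text(encoding="utf-8")) for path in root.rglob("*.result"))
```

Since the directories exist before any worker starts, measure never has to create a parent directory, and naming each result after its data file keeps results from colliding. The sketch assumes that root is located outside of source.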

class undr.configuration.DatasetSettings#

A dataset entry in a TOML settings file.

mode: undr.install_mode.Mode#

The installation mode.

name: str#

The dataset’s name, used to name the local directory.

timeout: float | None#

Request timeout in seconds.

url: str#

The dataset’s base URL.

class undr.configuration.DoiSelector#

Bases: undr.json_index_tasks.Selector

Selector for a DOI download.

action(file: undr.path.File) undr.json_index_tasks.Selector.Action#

Returns the action to apply to the given file.

Called by Index, InstallFilesRecursive and ProcessFilesRecursive. The default implementation returns Selector.Action.PROCESS.

scan_filesystem(directory: undr.path_directory.Directory)#

Whether to scan the filesystem.

Called by Index to decide whether it needs to scan the file system. This function may return False if action() returns one of the following for every file in the directory:

  • Selector.Action.IGNORE

  • Selector.Action.DOI

  • Selector.Action.SKIP

  • Selector.Action.DOWNLOAD_SKIP
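The condition above can be written as a small predicate. The Action enum below is a stand-in that lists only the action names mentioned in this section (the real Selector.Action may define others), and may_skip_scan is a hypothetical helper, not part of undr.

```python
import enum
from typing import Iterable


class Action(enum.Enum):
    """Stand-in listing only the action names mentioned in this section."""

    IGNORE = enum.auto()
    DOI = enum.auto()
    SKIP = enum.auto()
    DOWNLOAD_SKIP = enum.auto()
    PROCESS = enum.auto()


# Actions for which scanning the file system is not required.
NO_SCAN_NEEDED = frozenset(
    {Action.IGNORE, Action.DOI, Action.SKIP, Action.DOWNLOAD_SKIP}
)


def may_skip_scan(actions: Iterable[Action]) -> bool:
    """True if every action allows scan_filesystem to return False."""
    return all(action in NO_SCAN_NEEDED for action in actions)
```

A selector whose action() can return Action.PROCESS for some file in the directory would therefore need the scan.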

class undr.configuration.IndexStatus#

Keeps track of the indexing progress for a dataset.

current_index_files: int#

Number of index files parsed.

The dataset has been indexed if current_index_files and final_index_files are equal.

dataset_settings: DatasetSettings#

User-specified dataset settings.

downloaded_and_processed: bool#

Whether the dataset has been fully downloaded and processed.

final_index_files: int#

Total number of index files.

selector: undr.json_index_tasks.Selector#

Selector to choose actions while indexing.

server: undr.remote.Server#

The remote server for this dataset.

push(message: Any) tuple[bool, IndexStatus | None]#

Updates the status based on the message.

Ignores messages that are not undr.json_index_tasks.IndexLoaded or undr.json_index_tasks.DirectoryScanned.

Returns:

A tuple whose first element indicates whether the dataset has been fully indexed, and whose second element is self if self was updated (None otherwise).

Return type:

tuple[bool, Optional["IndexStatus"]]
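The return convention of push can be illustrated with a simplified stand-in. The Status class keeps only the two counters documented above, and the IndexLoaded message with a children field is an assumption made for this sketch (the fields of the real undr messages are not described in this section). The update rule, one more parsed file plus newly discovered index files, is likewise a guess at plausible behaviour, not undr's implementation.

```python
import dataclasses
from typing import Any, Optional


@dataclasses.dataclass
class IndexLoaded:
    """Stand-in message; the real undr message may carry different fields."""

    children: int  # assumed: number of additional index files discovered


@dataclasses.dataclass
class Status:
    """Simplified stand-in for IndexStatus with only the two counters."""

    current_index_files: int = 0
    final_index_files: int = 1

    def push(self, message: Any) -> tuple[bool, Optional["Status"]]:
        if not isinstance(message, IndexLoaded):
            # Unrelated messages are ignored and leave the status untouched.
            return (False, None)
        self.current_index_files += 1
        self.final_index_files += message.children
        # Indexing is complete once both counters are equal.
        return (self.current_index_files == self.final_index_files, self)
```

A caller can use the second element to decide whether a progress report needs refreshing, and the first element to detect the end of indexing.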

class undr.configuration.IndexesStatuses#

Maps dataset names to index statuses.

name_to_status: dict[str, IndexStatus]#

Inner dict.

push(message: Any) tuple[bool, IndexStatus | None]#

Processes relevant messages.

This function updates the indexing status and returns it if message is an undr.json_index_tasks.IndexLoaded or undr.json_index_tasks.DirectoryScanned object. If the message was the last indexing message for this dataset, the first element of the returned tuple is True.

class undr.configuration.InstallSelector(mode: undr.install_mode.Mode)#

Bases: undr.json_index_tasks.Selector

Selector for a standard installation, maps install modes to actions.

Raises:

ValueError – if mode is not undr.install_mode.Mode.REMOTE, undr.install_mode.Mode.LOCAL, or undr.install_mode.Mode.RAW.

action(file: undr.path.File) undr.json_index_tasks.Selector.Action#

Returns the action to apply to the given file.

Called by Index, InstallFilesRecursive and ProcessFilesRecursive. The default implementation returns Selector.Action.PROCESS.

scan_filesystem(directory: undr.path_directory.Directory)#

Whether to scan the filesystem.

Called by Index to decide whether it needs to scan the file system. This function may return False if action() returns one of the following for every file in the directory:

  • Selector.Action.IGNORE

  • Selector.Action.DOI

  • Selector.Action.SKIP

  • Selector.Action.DOWNLOAD_SKIP

class undr.configuration.MapMessage#

A message generated by MapSelector.

payload: Any#

Payload attached to this message.

The payload may be any object type. The user is responsible for checking message types.

class undr.configuration.MapProcessFile(file: undr.path.File, switch: undr.formats.Switch)#

Bases: undr.json_index_tasks.ProcessFile

Uses a switch to process files and wraps messages into MapMessage.

run(session: requests.Session, manager: undr.task.Manager)#

class undr.configuration.MapSelector(enabled_types: set[Any], store: undr.persist.ReadOnlyStore | None)#

Bases: undr.json_index_tasks.Selector

Applies a user-provided function to each file.

action(file: undr.path.File)#

Returns the action to apply to the given file.

Called by Index, InstallFilesRecursive and ProcessFilesRecursive. The default implementation returns Selector.Action.PROCESS.

scan_filesystem(directory: undr.path_directory.Directory)#

Whether to scan the filesystem.

Called by Index to decide whether it needs to scan the file system. This function may return False if action() returns one of the following for every file in the directory:

  • Selector.Action.IGNORE

  • Selector.Action.DOI

  • Selector.Action.SKIP

  • Selector.Action.DOWNLOAD_SKIP

undr.configuration.configuration_from_path(path: str | os.PathLike) Configuration#

Reads the configuration (TOML) with the given path.

Parameters:

path (Union[str, os.PathLike]) – Configuration file path.

Raises:

RuntimeError – if two datasets have the same name in the configuration.

Returns:

The parsed TOML configuration.

Return type:

Configuration
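The duplicate-name check behind the RuntimeError can be sketched as follows. The function operates on a list of dictionaries such as a TOML parser might return for the dataset entries. The keys mirror the DatasetSettings attributes (name, url, mode), but the exact layout of the settings file and the function name check_unique_names are assumptions, and the URLs in the usage note are placeholders.

```python
from typing import Any


def check_unique_names(datasets: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Maps dataset names to their entries, raising RuntimeError on a duplicate."""
    name_to_entry: dict[str, dict[str, Any]] = {}
    for entry in datasets:
        name = entry["name"]
        if name in name_to_entry:
            raise RuntimeError(f'two datasets share the name "{name}"')
        name_to_entry[name] = entry
    return name_to_entry
```

For instance, passing two entries named "first" (say, with URLs https://example.com/a/ and https://example.com/b/) raises RuntimeError, whereas entries with distinct names yield a dictionary that preserves the order of the input list.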

undr.configuration.schema#

JSON schema for TOML settings files.