undr.configuration
User-facing API to manipulate configuration files and trigger actions (download, decompress…).
Overview#
Classes#
Configuration | Represents a dataset configuration (TOML).
DatasetSettings | A dataset entry in a TOML settings file.
DoiSelector | Selector for a DOI download.
IndexStatus | Keeps track of the indexing progress for a dataset.
IndexesStatuses | Maps dataset names to index statuses.
InstallSelector | Selector for a standard installation, maps install modes to actions.
MapMessage | A message generated by MapSelector.
MapProcessFile | Uses a switch to process files and wraps messages into MapMessage.
MapSelector | Applies a user-provided function to each file.
Functions#
configuration_from_path | Reads the configuration (TOML) with the given path.
Attributes#
schema | JSON schema for TOML settings files.
Module Contents#
- class undr.configuration.Configuration#
Represents a dataset configuration (TOML).
- directory: pathlib.Path#
Local path of the root datasets directory (usually called datasets).
- name_to_dataset_settings: dict[str, DatasetSettings]#
Maps dataset names to their parameters.
- bibtex(show_display: bool, workers: int, force: bool, bibtex_timeout: float, log_directory: pathlib.Path | None) str #
Downloads index files and BibTeX references for enabled datasets.
- Parameters:
show_display (bool) – Whether to show progress in the terminal.
workers (int) – Number of parallel workers (threads).
force (bool) – Whether to re-download resources even if they are already present locally.
bibtex_timeout (float) – Timeout for requests to https://dx.doi.org/.
log_directory (Optional[pathlib.Path]) – Directory to store log files. Logs are not generated if this is None.
- Raises:
task.WorkerException – if a worker raises an error.
- Returns:
BibTeX references as a string.
- Return type:
str
- dataset(name: str) undr.path_directory.Directory #
Returns the dataset with the given name.
- Parameters:
name (str) – The dataset name.
- Raises:
ValueError – if the dataset exists but is disabled.
- Returns:
The dataset’s root directory.
- Return type:
undr.path_directory.Directory
- display(download_tag: undr.display.Tag = display.Tag(label='download', icon='↓'), process_tag: undr.display.Tag = display.Tag(label='process', icon='⚛')) undr.display.Display #
Returns a display that shows download and process progress for enabled datasets.
- Parameters:
download_tag (display.Tag, optional) – Label and icon for download. Defaults to display.Tag(label=”download”, icon=”↓”).
process_tag (display.Tag, optional) – Label and icon for process. Defaults to display.Tag(label=”process”, icon=”⚛”).
- Returns:
Controller for the display thread.
- Return type:
undr.display.Display
- enabled_datasets_settings() list[DatasetSettings] #
The settings of enabled datasets.
The list always contains at least one item (the function otherwise raises an error).
- Raises:
RuntimeError – if all the datasets are disabled or there are no datasets.
- Returns:
The settings of the datasets that are enabled, in the same order as the configuration file.
- Return type:
list[DatasetSettings]
- indexes_statuses(selector: undr.json_index_tasks.Selector) IndexesStatuses #
Builds an indexing report for enabled datasets.
- Parameters:
selector (json_index_tasks.Selector) – The selector used to index the dataset.
- Returns:
Index status for enabled datasets, in the same order as the configuration file.
- Return type:
IndexesStatuses
- install(show_display: bool, workers: int, force: bool, log_directory: pathlib.Path | None)#
Downloads index files and data files and decompresses data files.
The action (index only, download, download and decompress) may be different for each dataset and is controlled by undr.install_mode.Mode.
- Parameters:
show_display (bool) – Whether to show progress in the terminal.
workers (int) – Number of parallel workers (threads).
force (bool) – Whether to re-download resources even if they are already present locally.
log_directory (Optional[pathlib.Path]) – Directory to store log files. Logs are not generated if this is None.
- Raises:
task.WorkerException – if a worker raises an error.
- iter(recursive: bool = False) Iterable[undr.path.Path] #
Iterates the files in the dataset.
- Parameters:
recursive (bool, optional) – Whether to recursively search child directories. Defaults to False.
- Returns:
Iterator over the child paths. If recursive is false, the iterator yields the direct children (files and directories) of the root dataset directory. If recursive is true, the iterator yields all the children (files and directories) of the dataset.
- Return type:
Iterable[undr.path.Path]
- map(switch: undr.formats.Switch, store: undr.persist.Store | None = None, show_display: bool = True, workers: int = multiprocessing.cpu_count() * 2, log_directory: pathlib.Path | None = None) Iterable[Any] #
Applies a function to each file in a dataset.
- Parameters:
switch (formats.Switch) – Specifies the action to perform on each file type.
store (Optional[persist.Store], optional) – Saves progress, makes it possible to resume interrupted processing. Defaults to None.
show_display (bool, optional) – Whether to show progress in the terminal. Defaults to True.
workers (int, optional) – Number of parallel workers (threads). Defaults to twice multiprocessing.cpu_count().
log_directory (Optional[pathlib.Path], optional) – Directory to store log files. Logs are not generated if this is None. Defaults to None.
- Raises:
task.WorkerException – if a worker raises an error.
- Returns:
Iterator over the non-error messages generated by the workers.
- Return type:
Iterable[Any]
- mktree(root: str | os.PathLike, parents: bool = False, exist_ok: bool = False)#
Creates a copy of the datasets’ file hierarchy without the index or data files.
This function can be combined with map() to implement a map-reduce algorithm over entire datasets.
a. Use mktree to create an empty copy of the file hierarchy.
b. Use Configuration.map() to create a result file in the new hierarchy for each data file in the original hierarchy (for instance, a file that contains an algorithm's performance measure as a single number).
c. Collect the results ("reduce") by reading the result files in the new hierarchy.
This approach has several benefits. The most expensive step (b) runs in parallel and can be interrupted and resumed. Result files are stored in a different directory and can easily be deleted without altering the original data. The new file hierarchy prevents name clashes as long as result files are named after data files, and workers do not need to worry about directory existence since mktree runs first.
- Parameters:
root (Union[str, os.PathLike]) – Directory where the new file hierarchy is created.
parents (bool, optional) – Whether to create the parents of the new directory, if they do not exist. Defaults to False.
exist_ok (bool, optional) – Whether to silence exceptions if the root directory already exists. Defaults to False.
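The map-reduce pattern described for mktree can be sketched with the standard library alone. This is an illustration of the idea, not undr's implementation: mirror_tree, measure, result_path and map_reduce are hypothetical helpers, a thread pool stands in for Configuration.map(), and the "measure" is simply each file's size in bytes.

```python
import concurrent.futures
import pathlib
import tempfile


def mirror_tree(source: pathlib.Path, target: pathlib.Path) -> None:
    """Step a: recreate the directories of `source` under `target`, without files."""
    target.mkdir(parents=True, exist_ok=True)
    for path in source.rglob("*"):
        if path.is_dir():
            (target / path.relative_to(source)).mkdir(parents=True, exist_ok=True)


def result_path(
    source: pathlib.Path, target: pathlib.Path, data_file: pathlib.Path
) -> pathlib.Path:
    """Name the result file after the data file to avoid name clashes."""
    relative = data_file.relative_to(source)
    return target / relative.with_name(relative.name + ".result")


def measure(source: pathlib.Path, target: pathlib.Path, data_file: pathlib.Path) -> None:
    """Step b: write one number per data file (here, its size in bytes)."""
    output = result_path(source, target, data_file)
    if output.exists():
        return  # already processed, so an interrupted run can be resumed
    output.write_text(str(data_file.stat().st_size))


def map_reduce(source: pathlib.Path, target: pathlib.Path, workers: int = 4) -> int:
    mirror_tree(source, target)
    data_files = [path for path in source.rglob("*") if path.is_file()]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # list() forces evaluation so that worker exceptions are re-raised here
        list(executor.map(lambda data_file: measure(source, target, data_file), data_files))
    # Step c: reduce by reading the result files in the new hierarchy
    return sum(int(path.read_text()) for path in target.rglob("*.result"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as temporary:
        source = pathlib.Path(temporary) / "datasets"
        (source / "a" / "b").mkdir(parents=True)
        (source / "a" / "x.bin").write_bytes(b"123")
        (source / "a" / "b" / "y.bin").write_bytes(b"12345")
        print(map_reduce(source, pathlib.Path(temporary) / "results"))
```

Because the workers only ever write into the mirrored hierarchy, deleting the results directory resets the computation without touching the original data.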
- class undr.configuration.DatasetSettings#
A dataset entry in a TOML settings file.
- mode: undr.install_mode.Mode#
The installation mode.
- class undr.configuration.DoiSelector#
Bases:
undr.json_index_tasks.Selector
Selector for a DOI download.
- action(file: undr.path.File) undr.json_index_tasks.Selector.Action #
Returns the action to apply to the given file.
Called by Index, InstallFilesRecursive and ProcessFilesRecursive. The default implementation returns Selector.Action.PROCESS.
- scan_filesystem(directory: undr.path_directory.Directory)#
Whether to scan the filesystem.
Called by Index to decide whether it needs to scan the file system. This function may return False if action() returns one of the following for every file in the directory:
Selector.Action.IGNORE
Selector.Action.DOI
Selector.Action.SKIP
Selector.Action.DOWNLOAD_SKIP
- class undr.configuration.IndexStatus#
Keeps track of the indexing progress for a dataset.
- current_index_files: int#
Number of index files parsed.
The dataset has been indexed if current_index_files and final_index_files are equal.
- dataset_settings: DatasetSettings#
User-specified dataset settings.
- selector: undr.json_index_tasks.Selector#
Selector to choose actions while indexing.
- server: undr.remote.Server#
The remote server for this dataset.
- push(message: Any) tuple[bool, IndexStatus | None] #
Updates the status based on the message.
Ignores messages that are not undr.json_index_tasks.IndexLoaded or undr.json_index_tasks.DirectoryScanned.
- class undr.configuration.IndexesStatuses#
Maps dataset names to index statuses.
- name_to_status: dict[str, IndexStatus]#
Inner dict.
- push(message: Any) tuple[bool, IndexStatus | None] #
Processes relevant messages.
This function updates the indexing status and returns it if message is an undr.json_index_tasks.IndexLoaded or undr.json_index_tasks.DirectoryScanned object. If the message was the last indexing message for this dataset, the first element of the returned tuple is True.
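The push protocol shared by IndexStatus and IndexesStatuses can be sketched as follows. This is a guess at the bookkeeping, not undr's code: the message classes are stand-ins for undr.json_index_tasks.IndexLoaded and undr.json_index_tasks.DirectoryScanned, their dataset and children fields are hypothetical, and the counting rule (each loaded index announces its child indexes, each scanned directory completes one index) is an assumption.

```python
import dataclasses
import typing


@dataclasses.dataclass
class IndexLoaded:
    """Stand-in message (hypothetical fields): an index file was parsed."""

    dataset: str
    children: int  # number of child index files announced by this index


@dataclasses.dataclass
class DirectoryScanned:
    """Stand-in message (hypothetical fields): a directory was fully handled."""

    dataset: str


@dataclasses.dataclass
class Status:
    current_index_files: int = 0
    final_index_files: int = 1  # assumption: the root index is always expected

    def push(self, message: typing.Any) -> tuple[bool, typing.Optional["Status"]]:
        if isinstance(message, IndexLoaded):
            self.final_index_files += message.children
            return (False, self)
        if isinstance(message, DirectoryScanned):
            self.current_index_files += 1
            # True only for the last indexing message of this dataset
            return (self.current_index_files == self.final_index_files, self)
        return (False, None)  # unrelated messages are ignored


@dataclasses.dataclass
class Statuses:
    name_to_status: dict[str, Status]

    def push(self, message: typing.Any) -> tuple[bool, typing.Optional[Status]]:
        if isinstance(message, (IndexLoaded, DirectoryScanned)):
            # dispatch to the status of the dataset named in the message
            return self.name_to_status[message.dataset].push(message)
        return (False, None)
```

A caller would typically forward every worker message to push and react only when the second tuple element is not None.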
- class undr.configuration.InstallSelector(mode: undr.install_mode.Mode)#
Bases:
undr.json_index_tasks.Selector
Selector for a standard installation, maps install modes to actions.
- Raises:
ValueError – if mode is not undr.install_mode.Mode.REMOTE, undr.install_mode.Mode.LOCAL, or undr.install_mode.Mode.RAW.
- action(file: undr.path.File) undr.json_index_tasks.Selector.Action #
Returns the action to apply to the given file.
Called by Index, InstallFilesRecursive and ProcessFilesRecursive. The default implementation returns Selector.Action.PROCESS.
- scan_filesystem(directory: undr.path_directory.Directory)#
Whether to scan the filesystem.
Called by Index to decide whether it needs to scan the file system. This function may return False if action() returns one of the following for every file in the directory:
Selector.Action.IGNORE
Selector.Action.DOI
Selector.Action.SKIP
Selector.Action.DOWNLOAD_SKIP
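The idea behind InstallSelector, a selector that is configured once with an install mode and then answers with the same kind of action for every file, might look like the sketch below. Everything here is hypothetical: SketchMode and SketchAction are local enums rather than undr.install_mode.Mode and Selector.Action, and the pairing of modes with the three behaviours named in install() (index only, download, download and decompress) is an assumption.

```python
import enum


class SketchMode(enum.Enum):
    """Hypothetical stand-in for undr.install_mode.Mode."""

    DISABLED = "disabled"
    REMOTE = "remote"
    LOCAL = "local"
    RAW = "raw"


class SketchAction(enum.Enum):
    """Hypothetical actions, named after the behaviours listed for install()."""

    INDEX_ONLY = enum.auto()
    DOWNLOAD = enum.auto()
    DOWNLOAD_AND_DECOMPRESS = enum.auto()


# Assumed pairing of modes and behaviours (not confirmed by the reference)
MODE_TO_ACTION = {
    SketchMode.REMOTE: SketchAction.INDEX_ONLY,
    SketchMode.LOCAL: SketchAction.DOWNLOAD,
    SketchMode.RAW: SketchAction.DOWNLOAD_AND_DECOMPRESS,
}


class ModeSelector:
    def __init__(self, mode: SketchMode):
        # Reject modes that have no installation behaviour
        if mode not in MODE_TO_ACTION:
            raise ValueError(f"unsupported install mode {mode}")
        self.mode = mode

    def action(self, file_name: str) -> SketchAction:
        # The same action applies to every file of the dataset
        return MODE_TO_ACTION[self.mode]
```

Validating the mode in the constructor keeps action() free of error handling, which matters when it is called once per file.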
- class undr.configuration.MapMessage#
A message generated by MapSelector.
- payload: Any#
Payload attached to this message.
The payload may be any object type. The user is responsible for checking message types.
- class undr.configuration.MapProcessFile(file: undr.path.File, switch: undr.formats.Switch)#
Bases:
undr.json_index_tasks.ProcessFile
Uses a switch to process files and wraps messages into MapMessage.
- Parameters:
file (path.File) – The file to process.
switch (formats.Switch) – Switch that maps file types to actions.
- run(session: requests.Session, manager: undr.task.Manager)#
- class undr.configuration.MapSelector(enabled_types: set[Any], store: undr.persist.ReadOnlyStore | None)#
Bases:
undr.json_index_tasks.Selector
Applies a user-provided function to each file.
- Parameters:
enabled_types (set[Any]) – The file types (a class in undr.formats) to process.
store (Optional[persist.ReadOnlyStore]) – A store to check for already processed files.
- action(file: undr.path.File)#
Returns the action to apply to the given file.
Called by Index, InstallFilesRecursive and ProcessFilesRecursive. The default implementation returns Selector.Action.PROCESS.
- scan_filesystem(directory: undr.path_directory.Directory)#
Whether to scan the filesystem.
Called by Index to decide whether it needs to scan the file system. This function may return False if action() returns one of the following for every file in the directory:
Selector.Action.IGNORE
Selector.Action.DOI
Selector.Action.SKIP
Selector.Action.DOWNLOAD_SKIP
- undr.configuration.configuration_from_path(path: str | os.PathLike) Configuration #
Reads the configuration (TOML) with the given path.
- Parameters:
path (Union[str, os.PathLike]) – Configuration file path.
- Raises:
RuntimeError – if two datasets have the same name in the configuration.
- Returns:
The parsed TOML configuration.
- Return type:
Configuration
- undr.configuration.schema#
JSON schema for TOML settings files.