
longling


Overview

The project contains several modules for different purposes:

  • lib serves as the basic toolkit and can be used anywhere without extra dependencies.
  • ML provides many interfaces for quickly building machine learning tools.

Quick scripts

The project provides several CLI scripts to help construct different architectures.

Neural Network

  • glue for mxnet-gluon

CLI

Several general tools are provided, consistently invoked by:

longling $subcommand $parameters1 $parameters2

To see the help information:

longling -- --help
longling $subcommand --help

Take a glance at all available CLI tools.

The CLI tools are built on fire. Refer to its documentation for detailed usage.

Demo

Split dataset

target: split a dataset into train/valid/test

longling train_valid_test $filename1 $filename2 -- --train_ratio 0.8 --valid_ratio 0.1 --test_ratio 0.1

Similar commands:

  • train_test
    longling train_test $filename1 -- --train_ratio 0.8 --test_ratio 0.2
  • train_valid
    longling train_valid $filename1 -- --train_ratio 0.8 --valid_ratio 0.2
  • kfold (cross validation)
    longling kfold $filename1 $filename2 -- --n_splits 5
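These split commands boil down to shuffling the records and cutting them by ratio. A minimal Python sketch of that idea (the function name, seed and signature here are illustrative, not longling's actual API):

```python
import random

def train_valid_test(records, train_ratio=0.8, valid_ratio=0.1, test_ratio=0.1, seed=10):
    """Shuffle records, then cut them into train/valid/test parts by ratio."""
    assert abs(train_ratio + valid_ratio + test_ratio - 1) < 1e-9
    records = list(records)
    random.Random(seed).shuffle(records)  # fixed seed for a reproducible split
    n_train = int(len(records) * train_ratio)
    n_valid = int(len(records) * valid_ratio)
    return (records[:n_train],
            records[n_train:n_train + n_valid],
            records[n_train + n_valid:])

train, valid, test = train_valid_test(range(100))
```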
Display the tree of content

longling toc .

such as

/
├── __init__.py
├── __pycache__/
│   ├── __init__.cpython-36.pyc
│   └── toc.cpython-36.pyc
└── toc.py
Quickly construct a project

longling arch

or you can directly copy the template files

longling arch-cli

Note that you need to check the $VARIABLE placeholders in the template files.

Result Analysis

The CLI tools for result analysis are designed specifically for the JSON result format:

longling max $filename $key1 $key2 $key3
longling amax $key1 $key2 $key3 --src $filename

For a composite key like {'prf': {'avg': {'f1': 0.77}}}, the key should be written as prf:avg:f1. Consequently, no key used in the result file may contain :.
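Resolving such a composite key is just a walk down the nested dict; a small illustrative helper (not part of longling's API):

```python
def get_by_composite_key(result: dict, composite_key: str, sep=":"):
    """Walk a nested dict along an 'a:b:c'-style composite key."""
    node = result
    for key in composite_key.split(sep):
        node = node[key]
    return node

result = {"prf": {"avg": {"f1": 0.77}}}
f1 = get_by_composite_key(result, "prf:avg:f1")  # → 0.77
```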

Module Index

Command Line Interfaces

Use longling -- --help to see all available CLI tools, and use longling $subcommand -- --help to see concrete help information for a certain subcommand, e.g. longling encode -- --help

Format and Encoding

longling.lib.stream.encode(src, ...) Convert a file in source encoding to target encoding
longling.lib.loading.csv2jsonl(src, ...[, ...]) Convert a csv file or io stream into a jsonl file or io stream
longling.lib.loading.jsonl2csv(src, ...[, ...]) Convert a jsonl file or io stream into a csv file or io stream

Download Data

longling.spider.download_data.download_file(url) cli alias: download; download data from the specified url

Architecture

longling.toolbox.toc.toc([root, parent, ...]) Print the directory tree
longling.Architecture.cli.cli([skip_top, ...]) The main function for arch
longling.Architecture.install_file.nni([tar_dir]) cli alias: arch nni and install nni
longling.Architecture.install_file.gitignore(...) cli alias: arch gitignore
longling.Architecture.install_file.pytest(...) cli alias: arch pytest
longling.Architecture.install_file.coverage(...) cli alias: arch coverage
longling.Architecture.install_file.pysetup([...]) cli alias: arch pysetup
longling.Architecture.install_file.sphinx_conf([...]) cli alias: arch sphinx_conf
longling.Architecture.install_file.makefile([...]) cli alias: arch makefile
longling.Architecture.install_file.readthedocs([...]) cli alias: arch readthedocs
longling.Architecture.install_file.travis(...) cli alias: arch travis
longling.Architecture.install_file.dockerfile(atype) cli alias: arch dockerfile
longling.Architecture.install_file.gitlab_ci(...) cli alias: arch gitlab_ci
longling.Architecture.install_file.chart(...) cli alias: arch chart

Model Selection

Validation on Datasets

Split dataset to train, valid and test or apply kfold.

longling.ML.toolkit.dataset.train_valid_test(...)
longling.ML.toolkit.dataset.train_test(...)
longling.ML.toolkit.dataset.kfold(*files[, ...])

Select Best Model

Select best models on specified conditions

longling.ML.toolkit.analyser.cli.select_max(...) cli alias: max
longling.ML.toolkit.analyser.cli.arg_select_max(...) cli alias: amax
longling.ML.toolkit.hyper_search.nni.show_top_k(k) Updated in v1.3.17
longling.ML.toolkit.hyper_search.nni.show(key) Updated in v1.3.17

General Library

Quick Glance

For io:

longling.lib.stream.to_io(stream, ...) Convert an object into an io stream; accepts a path to a file or an io stream.
longling.lib.stream.as_io(src, ...) with wrapper of the to_io function; default mode is "r"
longling.lib.stream.as_out_io(tar, ...) with wrapper of the to_io function; default mode is "w"
longling.lib.loading.loading(src, ...) Read a file line by line with buffering

Iterators

longling.lib.iterator.AsyncLoopIter(src[, ...]) Asynchronous loop iterator, suitable for loading files
longling.lib.iterator.CacheAsyncLoopIter(...) Asynchronous iterator with a cache pool, suitable for files that need preprocessing
longling.lib.iterator.iterwrap(itertype, ...) Iterator decorator for reusing an iterator; converts an iterator-producing function into a reusable one. Uses AsyncLoopIter by default.

Logging

longling.lib.utilog.config_logging([...]) Main logging configuration

For path

longling.lib.path.path_append(path, *addition) Join paths
longling.lib.path.abs_current_dir(filepath) Absolute path of the directory containing the file
longling.lib.path.file_exist(filepath) Check whether the file exists

Syntactic sugar

longling.lib.candylib.as_list(obj) A utility function that converts the argument to a list if it is not already.

Timing and progress

longling.lib.clock.print_time(tips[, logger]) Measure and print the running time of a script, in seconds
longling.lib.clock.Clock(store_dict, ...[, tips]) Timer that tracks two kinds of time: wall_time and process_time
longling.lib.stream.flush_print(*values, ...) Print with flushing

Concurrency

longling.lib.concurrency.concurrent_pool(...) Simple API for starting completely independent concurrent programs

Testing

longling.lib.testing.simulate_stdin(*inputs) Simulate standard input in tests

Structures

longling.lib.structure.AttrDict
longling.lib.structure.nested_update
longling.lib.structure.SortedList

Regex

longling.lib.regex.variable_replace
longling.lib.regex.default_variable_replace

candylib

longling.lib.candylib.as_list(obj) → list [source]

A utility function that converts the argument to a list if it is not already.

Parameters: obj (object) -- argument to be converted to a list
Returns: list_obj -- If obj is a list or tuple, return it. Otherwise, return [obj] as a single-element list.
Return type: list

Examples

>>> as_list(1)
[1]
>>> as_list([1])
[1]
>>> as_list((1, 2))
[1, 2]
longling.lib.candylib.dict2pv(dict_obj: dict, path_to_node: list = None) [source]
>>> dict_obj = {"a": {"b": [1, 2], "c": "d"}, "e": 1}
>>> path, value = dict2pv(dict_obj)
>>> path
[['a', 'b'], ['a', 'c'], ['e']]
>>> value
[[1, 2], 'd', 1]
longling.lib.candylib.list2dict(list_obj, value=None, dict_obj=None) [source]
>>> list_obj = ["a", 2, "c"]
>>> list2dict(list_obj, 10)
{'a': {2: {'c': 10}}}
longling.lib.candylib.get_dict_by_path(dict_obj, path_to_node) [source]
>>> dict_obj = {"a": {"b": {"c": 1}}}
>>> get_dict_by_path(dict_obj, ["a", "b", "c"])
1
longling.lib.candylib.format_byte_sizeof(num, suffix='B') [source]

Examples

>>> format_byte_sizeof(1024)
'1.00KB'
longling.lib.candylib.group_by_n(obj: list, n: int) → list [source]

Examples

>>> list_obj = [1, 2, 3, 4, 5, 6]
>>> group_by_n(list_obj, 3)
[[1, 2, 3], [4, 5, 6]]
longling.lib.candylib.as_ordered_dict(dict_data: (dict, OrderedDict), index: (list, None) = None) [source]

Examples

>>> as_ordered_dict({0: 0, 2: 123, 1: 1})
OrderedDict([(0, 0), (2, 123), (1, 1)])
>>> as_ordered_dict({0: 0, 2: 123, 1: 1}, [2, 0, 1])
OrderedDict([(2, 123), (0, 0), (1, 1)])
>>> as_ordered_dict(OrderedDict([(2, 123), (0, 0), (1, 1)]))
OrderedDict([(2, 123), (0, 0), (1, 1)])

clock

class longling.lib.clock.Clock(store_dict: (dict, None) = None, logger: (logging.Logger, None) = <Logger clock (INFO)>, tips='') [source]

Timer. Tracks two kinds of time: wall_time and process_time

  • wall_time: program running time, including waiting time
  • process_time: program running time, excluding waiting time
Parameters:
  • store_dict (dict or None) -- stores the measured time inside a with closure
  • logger (logging.logger) -- logger
  • tips (str) -- prefix for the printed message

Examples

with Clock():
    a = 1 + 1
clock = Clock()
clock.start()
# some code
clock.end(wall=True)  # defaults to returning the wall_time; to get process_time, set wall=False
end(wall=True) [source]

Stop timing and return the elapsed time

process_time

Get the program running time (excluding waiting time)

start() [source]

Start timing

wall_time

Get the program running time (including waiting time)

longling.lib.clock.print_time(tips: str = '', logger=<Logger clock (INFO)>) [source]

Measure and print the running time of a script, in seconds

Parameters:
  • tips (str) --
  • logger (logging.Logger or logging) --

Examples

>>> with print_time("tips"):
...     a = 1 + 1  # The code you want to test
longling.lib.clock.Timer

Alias of longling.lib.clock.Clock

concurrency

longling.lib.concurrency.concurrent_pool(level: str, pool_size: int = None, ret: list = None) [source]

Simple API for starting completely independent concurrent programs:

  • thread
  • process
  • coroutine

Examples

def pseudo(idx):
    return idx
ret = []
with concurrent_pool("p", ret=ret) as e:  # or concurrent_pool("t", ret=ret)
    for i in range(4):
        e.submit(pseudo, i)
print(ret)

[0, 1, 2, 3]

class longling.lib.concurrency.ThreadPool(max_workers=None, thread_name_prefix='', ret: list = None) [source]
submit(fn, *args, **kwargs) [source]

Submits a callable to be executed with the given arguments.

Schedules the callable to be executed as fn(*args, **kwargs) and returns a Future instance representing the execution of the callable.

Returns: A Future representing the given call.
class longling.lib.concurrency.ProcessPool(processes=None, *args, ret: list = None, **kwargs) [source]
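The result-collecting behaviour of these pools can be approximated with the standard library's concurrent.futures; a rough stand-in, not longling's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() schedules the callable and returns a Future, as described above
    futures = [executor.submit(square, i) for i in range(4)]
# leaving the with-block waits for all submitted tasks to finish
ret = [f.result() for f in futures]
print(ret)  # [0, 1, 4, 9]
```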

formatter

longling.lib.formatter.dict_format(data: dict, digits=6, col: int = None) [source]

Examples

>>> print(dict_format({"a": 123, "b": 3, "c": 4, "d": 5}))  # doctest: +NORMALIZE_WHITESPACE
a: 123      b: 3    c: 4    d: 5
>>> print(dict_format({"a": 123, "b": 3, "c": 4, "d": 5}, col=3))  # doctest: +NORMALIZE_WHITESPACE
a: 123      b: 3    c: 4
d: 5
longling.lib.formatter.pandas_format(data: (dict, list, tuple), columns: list = None, index: (list, str) = None, orient='index', pd_kwargs: dict = None, max_rows=80, max_columns=80, **kwargs) [source]
Parameters:
  • data (dict, list, tuple, pd.DataFrame) --
  • columns (list, default None) -- Column labels to use when orient='index'. Raises a ValueError if used with orient='columns'.
  • index (list of strings) -- Optional display names matching the labels (same order).
  • orient ({'columns', 'index'}, default 'columns') -- The "orientation" of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass 'columns' (default). Otherwise if the keys should be rows, pass 'index'.
  • pd_kwargs (dict) --
  • max_rows ((int, None), default 80) --
  • max_columns ((int, None), default 80) --

Examples

>>> print(pandas_format({"a": {"x": 1, "y": 2}, "b": {"x": 1.0, "y": 3}},  ["x", "y"]))
     x  y
a  1.0  2
b  1.0  3
>>> print(pandas_format([[1.0, 2], [1.0, 3]],  ["x", "y"], index=["a", "b"]))
     x  y
a  1.0  2
b  1.0  3
longling.lib.formatter.table_format(...) Same signature, parameters and examples as pandas_format above.
longling.lib.formatter.series_format(data: dict, digits=6, col: int = None) Same parameters and examples as dict_format above.

iterator

class longling.lib.iterator.BaseIter(src, fargs=None, fkwargs=None, length=None, *args, **kwargs) [source]

Iterator

Notes

  • If src is an iterator instance, its content is exhausted after one round of iteration and cannot be restarted.
  • To make the iterator loop indefinitely, src should be a function that produces an iterator instance, and reset() should be called after each round.
  • If src has no __length__, len() cannot be called on a BaseIter instance before the first round of iteration finishes.

Examples

# content is exhausted after a single pass
with open("demo.txt") as f:
    bi = BaseIter(f)
    for line in bi:
        pass

# can be iterated multiple times
def open_file():
    with open("demo.txt") as f:
        for line in f:
            yield line

bi = BaseIter(open_file)
for _ in range(5):
    for line in bi:
        pass
    bi.reset()

# simplified reusable form
@BaseIter.wrap
def open_file():
    with open("demo.txt") as f:
        for line in f:
            yield line

bi = open_file()
for _ in range(5):
    for line in bi:
        pass
    bi.reset()
class longling.lib.iterator.MemoryIter(src, fargs=None, fkwargs=None, length=None, prefetch=False, *args, **kwargs) [source]

Memory iterator

Loads the full content of the iterator into memory

class longling.lib.iterator.LoopIter(src, fargs=None, fkwargs=None, length=None, *args, **kwargs) [source]

Loop iterator

Automatically calls reset() after each round of iteration

class longling.lib.iterator.AsyncLoopIter(src, fargs=None, fkwargs=None, tank_size=8, timeout=None, level='t') [source]

Asynchronous loop iterator, suitable for loading files

Reading the data and iterating over it happen asynchronously. Data is prefetched after reset()

class longling.lib.iterator.AsyncIter(src, fargs=None, fkwargs=None, tank_size=8, timeout=None, level='t') [source]

Asynchronous loading iterator

Does not reset() automatically

class longling.lib.iterator.CacheAsyncLoopIter(src, cache_file, fargs=None, fkwargs=None, rerun=True, tank_size=8, timeout=None, level='t') [source]

Asynchronous iterator with a cache pool, suitable for files that need preprocessing

Automatically calls reset(). When src is a function with potentially heavy preprocessing (i.e. asynchronously loading the data takes much longer than iterating over it), the preprocessed data is written to the specified cache file

longling.lib.iterator.iterwrap(itertype: str = 'AsyncLoopIter', *args, **kwargs) [source]

Iterator decorator for reusing an iterator; converts an iterator-producing function into a reusable one. Uses AsyncLoopIter by default.

Examples

@iterwrap()
def open_file():
    with open("demo.txt") as f:
        for line in f:
            yield line

data = open_file()
for _ in range(5):
    for line in data:
        pass

Warning

As mentioned in [1], on Windows and macOS, spawn() is the default multiprocessing start method. With spawn(), another interpreter is launched which runs your main script, followed by the internal worker function that receives parameters through pickle serialization. However, decorators, functools, lambdas and local functions do not work well with pickle, as discussed in [2]. Therefore, since version 1.3.36, instead of using multiprocessing, we use multiprocess, which replaces pickle with dill. Nevertheless, users should be aware that level='p' may not work on Windows and macOS if the decorated function does not follow the spawn() behaviour.

Notes

Although fork in multiprocessing is quite easy to use, and iterwrap works well with it, users should still be aware that fork is not safe enough, as mentioned in [3].

We use the default start method when dealing with multiprocessing, i.e., spawn on Windows and macOS, and fork on Linux. An example of changing the default behaviour is multiprocessing.set_start_method('spawn'), which can be found in [3].

References

[1] https://pytorch.org/docs/stable/data.html#platform-specific-behaviors
[2] https://stackoverflow.com/questions/51867402/cant-pickle-function-stringtongrams-at-0x104144f28-its-not-the-same-object
[3] https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

loading

longling.lib.loading.csv2jsonl(src: PATH_IO_TYPE, tar: PATH_IO_TYPE = None, delimiter=', ', **kwargs) [source]

Convert a csv file or io stream into a jsonl file or io stream

Parameters:
  • src (PATH_IO_TYPE) -- the path to the source file or an io stream.
  • tar (PATH_IO_TYPE) -- the path to the target file or an io stream.
  • delimiter (str) -- the delimiter used in csv; commonly used delimiters are "," and " "
  • kwargs (dict) -- options passed to csv.DictWriter

Examples

Assume some records are stored in demo.csv; the following call converts them to demo.jsonl:

csv2jsonl("demo.csv", "demo.jsonl")

longling.lib.loading.jsonl2csv(src: PATH_IO_TYPE, tar: PATH_IO_TYPE = None, delimiter=', ', **kwargs) [source]

Convert a jsonl file or io stream into a csv file or io stream

Parameters:
  • src (PATH_IO_TYPE) -- the path to the source file or an io stream.
  • tar (PATH_IO_TYPE) -- the path to the target file or an io stream.
  • delimiter (str) -- the delimiter used in csv; commonly used delimiters are "," and " "
  • kwargs (dict) -- options passed to csv.DictWriter

Examples

Assume some records are stored in demo.jsonl; the following call converts them to demo.csv:

jsonl2csv("demo.jsonl", "demo.csv")
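These conversions amount to streaming rows between the csv module and json. A self-contained sketch of the csv-to-jsonl direction using only the standard library (the function name is illustrative, not longling's implementation):

```python
import csv
import io
import json

def csv_to_jsonl(src_io, tar_io, delimiter=","):
    """Write each csv row as one JSON object per line."""
    for row in csv.DictReader(src_io, delimiter=delimiter):
        tar_io.write(json.dumps(dict(row)) + "\n")

src = io.StringIO("id,name\n1,Tom\n2,Jerry\n")
tar = io.StringIO()
csv_to_jsonl(src, tar)
rows = [json.loads(line) for line in tar.getvalue().splitlines()]
print(rows)  # [{'id': '1', 'name': 'Tom'}, {'id': '2', 'name': 'Jerry'}]
```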

longling.lib.loading.loading(src: (PATH_IO_TYPE, ...), src_type=None) [source]

Read a file line by line with buffering

Supported sources:

  • jsonl (handled by load_jsonl)
  • csv (handled by load_csv)
  • files in any other format are treated as raw text (handled by load_file)
  • a function will be invoked and its return value used
  • anything else is returned directly
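The dispatch rule above amounts to choosing a loader by file suffix; an illustrative sketch that returns loader names as strings (not longling's implementation):

```python
from pathlib import PurePath

def loader_for(src):
    """Pick a loader name by file suffix, mirroring the dispatch rules above."""
    suffix = PurePath(src).suffix
    if suffix == ".jsonl":
        return "load_jsonl"
    if suffix == ".csv":
        return "load_csv"
    return "load_file"

loader_for("demo.jsonl")  # → 'load_jsonl'
```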
longling.lib.loading.load_jsonl(src: PATH_IO_TYPE) [source]

Read a jsonl file line by line with buffering

Examples

Assume some records are stored in demo.jsonl:

for line in load_jsonl('demo.jsonl'):
    print(line)
longling.lib.loading.load_csv(src: PATH_IO_TYPE, delimiter=', ', **kwargs) [source]

Read dicts from a csv file

Examples

Assume some records are stored in demo.csv:

for line in load_csv('demo.csv'):
    print(line)
longling.lib.loading.load_file(src: PATH_IO_TYPE) [source]

Read raw text from source

Examples

Assume some text is stored in demo.txt:

for line in load_file('demo.txt'):
    print(line, end="")

parser

A customized configuration file format and the corresponding parsing toolkit, designed to make file-based parameter configuration and parsing more convenient and quick.

longling.lib.parser.get_class_var(class_obj, exclude_names: (set, None) = None, get_vars=None) → dict [source]

Update in v1.3.18

Get the variable names and values of all attributes of a class

Examples

>>> class A(object):
...     att1 = 1
...     att2 = 2
>>> get_class_var(A)
{'att1': 1, 'att2': 2}
>>> get_class_var(A, exclude_names={"att1"})
{'att2': 2}
>>> class B(object):
...     att3 = 3
...     att4 = 4
...     @staticmethod
...     def excluded_names():
...         return {"att4"}
>>> get_class_var(B)
{'att3': 3}
Parameters:
  • class_obj -- a class or a class instance. Note the difference between the two.
  • exclude_names -- variable names to exclude. Alternatively, define an excluded_names method on the class to specify them.
  • get_vars --
Returns:

variable names and values of the class attributes

Return type:

class_var

longling.lib.parser.get_parsable_var(class_obj, parse_exclude: set = None, dump_parse_functions=None, get_vars=True) [source]

Get all parameters that can be parsed and their values; dump_parse_functions can be used to convert values that cannot be dumped

longling.lib.parser.load_configuration(fp, file_format='json', load_parse_function=None) [source]

Load a configuration file

Updated in version 1.3.16

Parameters:
  • fp --
  • file_format --
  • load_parse_function --
longling.lib.parser.var2exp(var_str, env_wrap=<function <lambda>>) [source]

Convert a string containing $-marked variables into an expression

Parameters:
  • var_str --
  • env_wrap --

Examples

>>> root = "dir"
>>> dataset = "d1"
>>> eval(var2exp("$root/data/$dataset"))
'dir/data/d1'
longling.lib.parser.path_append(path, *addition, to_str=False) [source]

Join paths

Examples

>>> path_append("../", "../data", "../dataset1/", "train", to_str=True)
'../../data/../dataset1/train'

Parameters:
  • path (str or PurePath) --
  • addition (list(str or PurePath)) --
  • to_str (bool) -- Convert the new path to str
class longling.lib.parser.Configuration(logger=<module 'logging'>, **kwargs) [source]

Base class for customized configurations

Examples

>>> c = Configuration(a=1, b="example", c=[0,2], d={"a1": 3})
>>> c.instance_var
{'a': 1, 'b': 'example', 'c': [0, 2], 'd': {'a1': 3}}
>>> c.default_file_format()
'json'
>>> c.get("a")
1
>>> c.get("e") is None
True
>>> c.get("e", 0)
0
>>> c.update(e=2)
>>> c["e"]
2
class_var

Get all configured parameters

Returns: parameters -- all variables used as parameters
Return type: dict
dump(cfg_path: str, override=True, file_format=None) [source]

Write the configuration parameters to a file

Updated in version 1.3.16

Parameters:
  • cfg_path (str) --
  • override (bool) --
  • file_format (str) --
classmethod excluded_names() [source]

Get the set of non-parameter variables

Returns: exclude names set -- all non-parameter variables
Return type: set
classmethod load(cfg_path, file_format=None, **kwargs) [source]

Load a configuration class from a configuration file

Updated in version 1.3.16

classmethod load_cfg(cfg_path, file_format=None, **kwargs) [source]

Load configuration parameters from a configuration file

parsable_var

Get the parameters that can be set on the command line

Returns: store_vars -- parameters that can be set on the command line
Return type: dict
class longling.lib.parser.ConfigurationParser(class_type, excluded_names: (set, None) = None, commands=None, *args, params_help=None, commands_help=None, override_help=False, **kwargs) [source]

Update in v1.3.18

Configuration parsing class, which can be used to build CLI tools. The class first reads all class attributes of the target configuration class class_type and generates a command line after parsing. Ordinary attribute parameters are read in as "--att_name att_value". An extra flag '--kwargs' is provided to read in optional parameters, with the format

--kwargs key1=value1;key2=value2;...

First, instantiate a parser:

cli_parser = ConfigurationParser(Configuration)

Besides parsing existing configuration classes, the parser can also take functions to generate sub-parsers:

cli_parser = ConfigurationParser($function)

or

cli_parser = ConfigurationParser([$function1, $function2])

Parse arguments in any of the following three modes:

  • command-line mode

    cli_parser()
    
  • string mode

    cli_parser('$parameter1 $parameters ...')
    
  • list mode

    cli_parser(["--a", "int(1)", "--b", "int(2)"])
    

Notes

Strings containing the following keywords are type-converted during parsing:

int, float, dict, list, set, tuple, None
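One plausible way to implement this kind of conversion is to detect the caster keyword and evaluate the inner literal; a hypothetical re-implementation for illustration, not longling's actual parsing code:

```python
import ast
import re

_CASTERS = {"int": int, "float": float, "dict": dict, "list": list,
            "set": set, "tuple": tuple}

def parse_value(raw: str):
    """Parse 'int(1)'-style strings into typed values; other strings pass through."""
    if raw == "None":
        return None
    match = re.match(r"^(int|float|dict|list|set|tuple)\((.*)\)$", raw)
    if match:
        caster, literal = match.groups()
        # literal_eval safely parses the inner literal, then the caster is applied
        return _CASTERS[caster](ast.literal_eval(literal))
    return raw

parse_value("int(1)")  # → 1
```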

Parameters:
  • class_type -- the class itself. Note: the class, not a class instance.
  • excluded_names -- set of variable names in the class that should not be parsed
  • commands -- command functions to be parsed

Examples

>>> class TestC(Configuration):
...     a = 1
...     b = 2
>>> def test_f1(k=1):
...     return k
>>> def test_f2(h=1):
...      return h
>>> def test_f3(m):
...      return m
>>> parser = ConfigurationParser(TestC)
>>> parser("--a 1 --b 2")
{'a': '1', 'b': '2'}
>>> ConfigurationParser.get_cli_cfg(TestC)
{'a': 1, 'b': 2}
>>> parser(["--a", "1", "--b", "int(1)"])
{'a': '1', 'b': 1}
>>> parser(["--a", "1", "--b", "int(1)", "--kwargs", "c=int(3);d=None"])
{'a': '1', 'b': 1, 'c': 3, 'd': None}
>>> parser.add_command(test_f1, test_f2, test_f3)
>>> parser(["test_f1"])
{'a': 1, 'b': 2, 'k': 1, 'subcommand': 'test_f1'}
>>> parser(["test_f2"])
{'a': 1, 'b': 2, 'h': 1, 'subcommand': 'test_f2'}
>>> parser(["test_f3", "3"])
{'a': 1, 'b': 2, 'm': '3', 'subcommand': 'test_f3'}
>>> parser = ConfigurationParser(TestC, commands=[test_f1, test_f2])
>>> parser(["test_f1"])
{'a': 1, 'b': 2, 'k': 1, 'subcommand': 'test_f1'}
>>> class TestCC:
...     c = {"_c": 1, "_d": 0.1}
>>> parser = ConfigurationParser(TestCC)
>>> parser("--c _c=int(3);_d=float(0.3)")
{'c': {'_c': 3, '_d': 0.3}}
>>> class TestCls:
...     def a(self, a=1):
...         return a
...     @staticmethod
...     def b(b=2):
...         return b
...     @classmethod
...     def c(cls, c=3):
...         return c
>>> parser = ConfigurationParser(TestCls, commands=[TestCls.b, TestCls.c])
>>> parser("b")
{'b': 2, 'subcommand': 'b'}
>>> parser("c")
{'c': 3, 'subcommand': 'c'}
add_command(*commands, help_info: (List[Dict], dict, str, None) = None) [source]

Add sub-command parsers in batch

static func_spec(f) [source]

Get the parameter specification of a function

classmethod get_cli_cfg(params_class) → dict [source]

Get the default configuration parameters

static parse(arguments) [source]

Post-process the parsed arguments

class longling.lib.parser.Formatter(formatter: (str, None) = None) [source]

Format strings in a specific format

Examples

>>> formatter = Formatter()
>>> formatter("hello world")
'hello world'
>>> formatter = Formatter("hello {}")
>>> formatter("world")
'hello world'
>>> formatter = Formatter("hello {} v{:.2f}")
>>> formatter("world", 0.2)
'hello world v0.20'
>>> formatter = Formatter("hello {1} v{0:.2f}")
>>> formatter(0.2, "world")
'hello world v0.20'
>>> Formatter.format(0.2, "world", formatter="hello {1} v{0:.3f}")
'hello world v0.200'
class longling.lib.parser.ParserGroup(parsers: dict, prog=None, usage=None, description=None, epilog=None, add_help=True) [source]
>>> class TestC(Configuration):
...     a = 1
...     b = 2
>>> def test_f1(k=1):
...     return k
>>> def test_f2(h=1):
...      return h
>>> class TestC2(Configuration):
...     c = 3
>>> parser1 = ConfigurationParser(TestC, commands=[test_f1])
>>> parser2 = ConfigurationParser(TestC, commands=[test_f2])
>>> pg = ParserGroup({"model1": parser1, "model2": parser2})
>>> pg(["model1", "test_f1"])
{'a': 1, 'b': 2, 'k': 1, 'subcommand': 'test_f1'}
>>> pg("model2 test_f2")
{'a': 1, 'b': 2, 'h': 1, 'subcommand': 'test_f2'}
longling.lib.parser.is_classmethod(method) [source]
Parameters: method --

Examples

>>> class A:
...     def a(self):
...         pass
...     @staticmethod
...     def b():
...         pass
...     @classmethod
...     def c(cls):
...         pass
>>> obj = A()
>>> is_classmethod(obj.a)
False
>>> is_classmethod(obj.b)
False
>>> is_classmethod(obj.c)
True
>>> def fun():
...     pass
>>> is_classmethod(fun)
False

path

longling.lib.path.path_append(path, *addition, to_str=False) [source]

Join paths

Examples

>>> path_append("../", "../data", "../dataset1/", "train", to_str=True)
'../../data/../dataset1/train'

Parameters:
  • path (str or PurePath) --
  • addition (list(str or PurePath)) --
  • to_str (bool) -- Convert the new path to str
longling.lib.path.file_exist(filepath) [source]

Check whether the file exists

longling.lib.path.abs_current_dir(filepath) [source]

Get the absolute path of the directory containing the file

longling.lib.path.type_from_name(filename) [source]

Examples

>>> type_from_name("1.txt")
'.txt'
longling.lib.path.tmpfile(suffix=None, prefix=None, dir=None) [source]

Create a temporary file, which is automatically cleaned up after use (outside the "with" closure).

Examples
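The library's own example is omitted in this extract; the behaviour is analogous to the standard library's tempfile.NamedTemporaryFile, which also removes the file once the with-block is left:

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(suffix=".txt") as tmp:
    tmp.write(b"hello")
    tmp.flush()
    path = tmp.name
    existed_inside = os.path.exists(path)  # True while inside the with-block
# leaving the with-block deletes the file
existed_after = os.path.exists(path)
print(existed_inside, existed_after)  # True False
```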

progress

Progress monitor that tells the user the current running progress, mainly tailored to the epoch/batch pattern in machine learning.

Unlike tqdm, which quickly adapts to a single iterable, progress aims to modularize the different parts of a monitor so they can be assembled freely, e.g. to make the description dynamic, which offers the user more flexibility.

  • MonitorPlayer defines how progress and other runtime values are displayed (better than tqdm, where only n is changed and the description is fixed)
    • define how to display in the __call__ method
  • Subclass ProgressMonitor and instantiate it with the necessary arguments
    • override ProgressMonitor's __call__ to wrap the iterator with IterableMIcing; this step allows flexible definition of the operations before and after iteration
    • a MonitorPlayer instance must be passed in at __init__ time
  • IterableMIcing assembles the iterator and the monitor

A simple example:

class DemoMonitor(ProgressMonitor):
    def __call__(self, iterator):
        return IterableMIcing(
            iterator,
            self.player, self.player.set_length
        )

progress_monitor = DemoMonitor(MonitorPlayer())

for _ in range(5):
    for _ in progress_monitor(range(10000)):
        pass
    print()

cooperate with tqdm

from tqdm import tqdm

class DemoTqdmMonitor(ProgressMonitor):
    def __call__(self, iterator, **kwargs):
        return tqdm(iterator, **kwargs)
class longling.lib.progress.IterableMIcing(iterator: (Iterable, list, tuple, dict), hook_in_iter=<function pass_function>, hook_after_iter=<function pass_function>, length: (int, None) = None) [source]

Wraps an iterator into an iterable class usable by a monitor:
  • adds a counter count, incremented on every iteration step, so the total data length is known from count when iteration ends
  • calls call_in_iter on each __iter__
  • calls call_after_iter when iteration ends

Parameters:
  • iterator -- the data to iterate over
  • hook_in_iter -- callback invoked on each iteration step (e.g. to print progress), receives the current count
  • hook_after_iter -- callback invoked after each full round of iteration (once all data has been traversed), receives the current length
  • length -- total length of the data (number of records)

>>> iterator = IterableMIcing(range(100))
>>> for i in iterator:
...     pass
>>> len(iterator)
100
>>> def iter_fn(num):
...     for i in range(num):
...         yield num
>>> iterator = IterableMIcing(iter_fn(50))
>>> for i in iterator:
...     pass
>>> len(iterator)
50
class longling.lib.progress.MonitorPlayer [source]

Monitor player

class longling.lib.progress.AsyncMonitorPlayer(cache_size=10000) [source]

Asynchronous monitor player

regex

longling.lib.regex.variable_replace(string: str, key_lower: bool = True, quotation: str = '', **variables) [source]

Examples

>>> string = "hello $who"
>>> variable_replace(string, who="world")
'hello world'
>>> string = "hello $WHO"
>>> variable_replace(string, key_lower=False, WHO="world")
'hello world'
>>> string = "hello $WHO"
>>> variable_replace(string, who="longling")
'hello longling'
>>> string = "hello $Wh_o"
>>> variable_replace(string, wh_o="longling")
'hello longling'
longling.lib.regex.default_variable_replace(string: str, default_value: (str, None, dict) = None, key_lower: bool = True, quotation: str = '', **variables) → str [source]

Examples

>>> string = "hello $who, I am $author"
>>> default_variable_replace(string, default_value={"author": "groot"}, who="world")
'hello world, I am groot'
>>> string = "hello $who, I am $author"
>>> default_variable_replace(string, default_value={"author": "groot"})
'hello , I am groot'
>>> string = "hello $who, I am $author"
>>> default_variable_replace(string, default_value='', who="world")
'hello world, I am '
>>> string = "hello $who, I am $author"
>>> default_variable_replace(string, default_value=None, who="world")
'hello world, I am $author'

stream

This module handles stream processing

longling.lib.stream.to_io(stream: (TextIOWrapper, TextIO, BinaryIO, str, PurePath, list, None) = None, mode='r', encoding='utf-8', **kwargs) [source]

Convert an object into an io stream; accepts a path to a file or an io stream.

Examples

to_io("demo.txt")  # equal to open("demo.txt")
to_io(open("demo.txt"))  # equal to open("demo.txt")
a = to_io()  # equal to a = sys.stdin
b = to_io(mode="w")  # equal to b = sys.stdout
longling.lib.stream.as_io(src: (TextIOWrapper, TextIO, BinaryIO, str, PurePath, list, None) = None, mode='r', encoding='utf-8', **kwargs) [source]

with wrapper of the to_io function; default mode is "r"

Examples

with as_io("demo.txt") as f:
    for line in f:
        pass

# equal to
with open("demo.txt") as src:
    with as_io(src) as f:
        for line in f:
            pass

# from several files
with as_io(["demo1.txt", "demo2.txt"]) as f:
    for line in f:
        pass

# from sys.stdin
with as_io() as f:
    for line in f:
        pass
longling.lib.stream.as_out_io(tar: (TextIOWrapper, TextIO, BinaryIO, str, PurePath, list, None) = None, mode='w', encoding='utf-8', **kwargs) [source]

with wrapper of the to_io function; default mode is "w"

Examples

with as_out_io("demo.txt") as wf:
    print("hello world", file=wf)

# equal to
with open("demo.txt", "w") as tar:
    with as_out_io(tar) as wf:
        print("hello world", file=wf)

# to sys.stdout
with as_out_io() as wf:
    print("hello world", file=wf)

# to sys.stderr
with as_out_io(mode="stderr") as wf:
    print("hello world", file=wf)
longling.lib.stream.wf_open(stream_name: (str, PurePath, TextIOWrapper, TextIO, BinaryIO, StreamReaderWriter, FileInput, None) = None, mode='w', encoding='utf-8', **kwargs)[source]

Simple wrapper to codecs for writing.

When stream_name is None, the returned stream depends on mode: "stderr" gives the standard error stream sys.stderr; otherwise, the standard output stream sys.stdout is returned.

When stream_name is not None, a file stream is returned.

Parameters:
  • stream_name (str, PurePath or None) --
  • mode (str) --
  • encoding (str) -- the encoding, utf-8 by default
Returns:

write_stream -- the opened stream

Return type:

StreamReaderWriter

Examples

>>> wf = wf_open(mode="stdout")
>>> print("hello world", file=wf)
hello world
longling.lib.stream.close_io(stream)[source]

Close the stream; sys.stdin, sys.stdout and sys.stderr are ignored.

longling.lib.stream.flush_print(*values, **kwargs)[source]

Print with an immediate flush.

longling.lib.stream.build_dir(path, mode=509, parse_dir=True)[source]

Create a directory: parse the directory path from path and, if the directory does not exist, create it.

Parameters:
  • path (str) --
  • mode (int) --
  • parse_dir (bool) --
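The behavior described above can be sketched with the standard library alone (an illustrative equivalent, not longling's source; note that the default mode=509 is simply 0o775 in decimal):

```python
import os
import tempfile


def build_dir(path, mode=0o775, parse_dir=True):
    """Illustrative sketch: parse the directory part of ``path`` and
    create it (with parents) when it does not exist."""
    dirname = os.path.dirname(path) if parse_dir else path
    if dirname and not os.path.exists(dirname):
        os.makedirs(dirname, mode)
    return path


# ensure data/model/ exists for a config file path, without creating the file
root = tempfile.mkdtemp()
cfg = build_dir(os.path.join(root, "data", "model", "config.json"))
```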
class longling.lib.stream.AddPrinter(fp, values_wrapper=<function AddPrinter.<lambda>>, to_io_params=None, ensure_io=False, **kwargs)[源代码]

A printer that appends content to a file through its add method.

Examples

>>> import sys
>>> printer = AddPrinter(sys.stdout, ensure_io=True)
>>> printer.add("hello world")
hello world
exception longling.lib.stream.StreamError[source]
longling.lib.stream.check_file(filepath, size=None)[source]

Check whether the file exists; when size is given, also check whether the file size matches.

Parameters:
  • filepath (str) --
  • size (int) --
Returns:

whether the file exists (and, when size is given, the size matches)

Return type:

bool
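A stdlib sketch of the described check (illustrative, not longling's implementation):

```python
import os
import tempfile


def check_file(filepath, size=None):
    """The file "exists" when the path is present and, if ``size`` is
    given, the file size in bytes matches."""
    if not os.path.isfile(filepath):
        return False
    return size is None or os.path.getsize(filepath) == size


path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as wf:
    wf.write("abc")
```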

longling.lib.stream.encode(src, src_encoding, tar, tar_encoding)[source]

Convert a file from the source encoding to the target encoding.

Parameters:
  • src --
  • src_encoding --
  • tar --
  • tar_encoding --
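The conversion can be sketched with codecs alone (an illustrative equivalent, not the library's source):

```python
import codecs
import os
import tempfile


def encode(src, src_encoding, tar, tar_encoding):
    """Read ``src`` with the source encoding and rewrite it to ``tar``
    with the target encoding."""
    with codecs.open(src, encoding=src_encoding) as f, \
            codecs.open(tar, "w", encoding=tar_encoding) as wf:
        for line in f:
            wf.write(line)


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "latin.txt")
tar_path = os.path.join(workdir, "utf8.txt")
with codecs.open(src_path, "w", encoding="latin-1") as wf:
    wf.write("café")
encode(src_path, "latin-1", tar_path, "utf-8")
```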
longling.lib.stream.block_std()[source]

Examples

>>> print("hello world")
hello world
>>> with block_std():
...     print("hello world")

structure

class longling.lib.structure.AttrDict(*args, **kwargs)[source]

Example

>>> ad = AttrDict({'first_name': 'Eduardo'}, last_name='Pool', age=24, sports=['Soccer'])
>>> ad
{'first_name': 'Eduardo', 'last_name': 'Pool', 'age': 24, 'sports': ['Soccer']}
>>> ad.first_name
'Eduardo'
>>> ad.age
24
>>> ad.age = 16
>>> ad.age
16
>>> ad["age"] = 20
>>> ad["age"]
20
class longling.lib.structure.SortedList(iterable: Iterable[T_co] = (), key=None)[source]

A list maintaining the element in an ascending order.

A custom key function can be supplied to customize the sort order.

Examples

>>> sl = SortedList()
>>> sl.adds(*[1, 2, 3, 4, 5])
>>> sl
[1, 2, 3, 4, 5]
>>> sl.add(7)
>>> sl
[1, 2, 3, 4, 5, 7]
>>> sl.add(6)
>>> sl
[1, 2, 3, 4, 5, 6, 7]
>>> sl = SortedList([4])
>>> sl.add(3)
>>> sl.add(2)
>>> sl
[2, 3, 4]
>>> list(reversed(sl))
[4, 3, 2]
>>> sl = SortedList([("harry", 1), ("tom", 0)], key=lambda x: x[1])
>>> sl
[('tom', 0), ('harry', 1)]
>>> sl.add(("jack", -1), key=lambda x: x[1])
>>> sl
[('jack', -1), ('tom', 0), ('harry', 1)]
>>> sl.add(("ada", 2))
>>> sl
[('jack', -1), ('tom', 0), ('harry', 1), ('ada', 2)]
longling.lib.structure.nested_update(src: dict, update: dict)[source]

Examples

>>> nested_update({"a": {"x": 1}}, {"a": {"y": 2}})
{'a': {'x': 1, 'y': 2}}
>>> nested_update({"a": {"x": 1}}, {"a": {"x": 2}})
{'a': {'x': 2}}
>>> nested_update({"a": {"x": 1}}, {"b": {"y": 2}})
{'a': {'x': 1}, 'b': {'y': 2}}
>>> nested_update({"a": {"x": 1}}, {"a": 2})
{'a': 2}

testing

longling.lib.testing.simulate_stdin(*inputs)[source]

Simulate standard input in testing.

Parameters: inputs (list of str) --

Examples

>>> with simulate_stdin("12", "", "34"):
...     a = input()
...     b = input()
...     c = input()
>>> a
'12'
>>> b
''
>>> c
'34'

time

longling.lib.time.get_current_timestamp() → str[source]

Examples

> get_current_timestamp()
'20200327172235'
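Judging by the example output, the format is %Y%m%d%H%M%S; an equivalent stdlib sketch (assumed behavior, not longling's source):

```python
import datetime


def get_current_timestamp() -> str:
    # e.g. '20200327172235': year, month, day, hour, minute, second
    return datetime.datetime.now().strftime("%Y%m%d%H%M%S")


ts = get_current_timestamp()
```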

utilog

Logging configuration module

longling.lib.utilog.config_logging(filename=None, log_format=None, level=20, logger=None, console_log_level=None, propagate=False, mode='a', file_format=None, encoding: (str, None) = 'utf-8', enable_colored=False, datefmt=None)[source]

Main logging configuration.

Parameters:
  • filename (str or None) -- log file name; when not None, the log is also stored in this file
  • log_format (str) -- default log format: %(name)s, %(levelname)s %(message)s; when datefmt is specified, it becomes %(name)s: %(asctime)s, %(levelname)s %(message)s
  • level (str or int) -- default log level
  • logger (str or logging.logger) -- the logger: None (use the root logger), a string (create a logger with that name), or a logger instance
  • console_log_level (str, int or None) -- console log level; when not None, console log output is enabled
  • propagate (bool) --
  • mode (str) --
  • file_format (str or None) -- log format for the file handler; when None, log_format is used
  • encoding --
  • enable_colored (bool) --
  • datefmt (str) --
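The setup these parameters describe can be sketched with the stdlib logging module alone; this is an illustrative sketch of the assumed behavior (colored output, file_format and encoding are omitted), not longling's implementation:

```python
import logging


def config_logging(filename=None, log_format="%(name)s, %(levelname)s %(message)s",
                   level=logging.INFO, logger=None, console_log_level=None,
                   propagate=False, mode="a"):
    """Illustrative sketch: configure a logger with optional file and
    console handlers, following the parameter semantics described above."""
    logger = logger if isinstance(logger, logging.Logger) else logging.getLogger(logger)
    logger.setLevel(level)
    logger.propagate = propagate
    formatter = logging.Formatter(log_format)
    if filename is not None:  # store the log in a file
        handler = logging.FileHandler(filename, mode=mode)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    if console_log_level is not None:  # enable console output
        console = logging.StreamHandler()
        console.setLevel(console_log_level)
        console.setFormatter(formatter)
        logger.addHandler(console)
    return logger


demo_logger = config_logging(logger="demo", console_log_level=logging.ERROR)
```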
longling.lib.utilog.default_timestamp() → str

Examples

> default_timestamp()
'20200327172235'

yaml

class longling.lib.yaml_helper.FoldedString[source]
longling.lib.yaml_helper.dump_folded_yaml(yaml_string)[source]

Specially designed for the arch module; should not be used elsewhere.

longling.lib.yaml_helper.ordered_yaml_load(stream, Loader=<class 'yaml.loader.Loader'>, object_pairs_hook=<class 'collections.OrderedDict'>)[source]

Examples

ordered_yaml_load("path_to_file.yaml")
OrderedDict({"a":123})

Spider

A simple spider (web crawler) library

Planned features:

  1. Download the file at a given url, with optional decompression for compressed files
    • [x] progress display
  2. Simple information extraction
    • [x] extract urls
    • [x] extract text
    • [ ] extract images

More complex functionality, such as crawling specific sites into structured data or anti-anti-crawler measures, is kept in separate libraries

Quick Glance

longling.spider.lib.get_html_code(url): get encoded html code from the specified url
longling.spider.download_data.download_file(url): cli alias: download, download data from the specified url
longling.spider.lib.get_html.get_html_code(url)[source]

get encoded html code from the specified url

longling.spider.download_data.download_file(url, save_path=None, override=True, decomp=True, reporthook=None)[source]

cli alias: download, download data from the specified url

Parameters:
  • url --
  • save_path --
  • override --
  • decomp --
  • reporthook --
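The reporthook presumably follows the urllib.request.urlretrieve convention of (block_num, block_size, total_size) — an assumption, since the doc does not specify it. A minimal progress hook in that style:

```python
def percent(block_num, block_size, total_size):
    """Download progress as a percentage, clipped to [0, 100];
    total_size may be -1 when the server sends no Content-Length."""
    if total_size <= 0:
        return 0.0
    return min(100.0, block_num * block_size * 100.0 / total_size)


def reporthook(block_num, block_size, total_size):
    # print progress on a single line, urlretrieve-hook style
    print("\rdownloading: %6.2f%%" % percent(block_num, block_size, total_size), end="")
```

It could then be passed as `download_file(url, reporthook=reporthook)`, assuming the hook signature matches.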

Architecture Tools for Constructing Projects

notice

  • In the sphinx setting, the default in sphinx-quickstart is
> Separate source and build directories (y/n) [n]

Thus, the default directory for the built files is _build. If y is chosen, the directory for the built files is build.

entrance

longling.Architecture.cli.main.cli(skip_top=True, project=None, override=None, tar_dir='./', **kwargs)[source]

The main function for arch

components

longling.Architecture.install_file.template_copy(src: (str, PurePath), tar: (str, PurePath), default_value: (str, dict, None) = '', quotation="'", key_lower=True, **variables)[source]

Generate the tar file based on the template file, replacing the variables. Usually, a variable is specified like $PROJECT in the template file.

Parameters:
  • src (template file) --
  • tar (target location) --
  • default_value (the default value) --
  • quotation (the quotation to wrap the variable value) --
  • variables (the real variable values which are used to replace the variable in template file) --
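The $PROJECT-style replacement step can be sketched with string.Template; render_template below is a hypothetical helper (not part of longling) that mirrors only the substitution with a default value for missing variables:

```python
from string import Template


def render_template(template_text, default_value="", **variables):
    """Hypothetical helper: substitute $VARIABLE placeholders, falling
    back to ``default_value`` for variables that are not provided."""

    class _Defaults(dict):
        def __missing__(self, key):
            return default_value

    return Template(template_text).substitute(_Defaults(variables))


rendered = render_template("project: $PROJECT by $AUTHOR", PROJECT="longling")
```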
longling.Architecture.install_file.gitignore(atype: str = '', tar_dir: (str, PurePath) = './')[source]

cli alias: arch gitignore

Parameters:
  • atype (the gitignore type, currently support docs and python) --
  • tar_dir (target directory) --
longling.Architecture.install_file.pytest(tar_dir: (str, PurePath) = './')[source]

cli alias: arch pytest

Parameters: tar_dir --
longling.Architecture.install_file.coverage(tar_dir: (str, PurePath) = './', **variables)[source]

cli alias: arch coverage

Parameters:
  • tar_dir --
  • variables --

    These variables should be provided:

    • project
longling.Architecture.install_file.pysetup(tar_dir='./', **variables)[source]

cli alias: arch pysetup

Parameters:
  • tar_dir --
  • variables --
longling.Architecture.install_file.sphinx_conf(tar_dir='./', **variables)[source]

cli alias: arch sphinx_conf

Parameters:
  • tar_dir --
  • variables --
longling.Architecture.install_file.makefile(tar_dir='./', **variables)[source]

cli alias: arch makefile

Parameters:
  • tar_dir --
  • variables --
longling.Architecture.install_file.readthedocs(tar_dir='./')[source]

cli alias: arch readthedocs

Parameters: tar_dir --
longling.Architecture.install_file.travis(tar_dir: (str, PurePath) = './')[source]

cli alias: arch travis

Parameters: tar_dir --
longling.Architecture.install_file.nni(tar_dir='./')[source]

cli alias: arch nni and install nni

Parameters: tar_dir --
longling.Architecture.install_file.dockerfile(atype, tar_dir='./', **variables)[source]

cli alias: arch dockerfile

Parameters:
  • atype --
  • tar_dir --
  • variables --
longling.Architecture.install_file.gitlab_ci(private, stages: dict, atype: str = '', tar_dir: (str, PurePath) = './', version_in_path=True)[source]

cli alias: arch gitlab_ci

Parameters:
  • private --
  • stages --
  • atype --
  • tar_dir --
  • version_in_path --
longling.Architecture.install_file.chart(tar_dir: (str, PurePath) = './')[source]

cli alias: arch chart

Parameters: tar_dir (target directory) --
longling.Architecture.utils.legal_input(__promt: str, __legal_input: set = None, __illegal_input: set = None, is_legal=None, __default_value: str = None)[source]

Make sure the input is legal; if the input is illegal, the user will be asked to retype it.

An input is illegal when it is either not in __legal_input or in __illegal_input.

When the user does not type in anything, the __default_value is used if it is set; otherwise the empty input is treated as illegal.

Parameters:
  • __promt (str) -- the tip to be displayed
  • __legal_input (set) -- the input should be in the __legal_input set
  • __illegal_input (set) -- the input should not be in the __illegal_input set
  • is_legal (function) -- the function used to judge whether the input is legal; by default, the inner __is_legal function is used
  • __default_value (str) -- the default value used when the user types in nothing
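The retry loop can be sketched as follows. This is an illustrative sketch of the described behavior, not longling's source; input_fn is a hook added here purely so the loop is testable without a terminal (the real function reads from stdin):

```python
def legal_input(prompt, legal_set=None, illegal_set=None, default_value=None,
                input_fn=input):
    """Illustrative sketch: re-ask until the input is legal; empty
    input falls back to the default value when one is set."""
    while True:
        value = input_fn(prompt)
        if not value:
            if default_value is not None:
                return default_value
            continue  # empty input without a default is treated as illegal
        if illegal_set and value in illegal_set:
            continue
        if legal_set is not None and value not in legal_set:
            continue
        return value


answers = iter(["maybe", "", "y"])  # the first two answers are rejected
choice = legal_input("continue? (y/n) ", legal_set={"y", "n"},
                     input_fn=lambda _: next(answers))
```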

Machine Learning

ML Framework

ML Framework is designed to help quickly construct a practical ML program, letting the user focus on developing the algorithm rather than on additional but important engineering components like logging and cli.

Currently, two supported packages are provided for the popular DL frameworks mxnet and pytorch. The overall structure is almost the same, but the details may differ a little.

Note that ML Framework only provides a template; all components are allowed to be modified.

Overview

The architecture produced by the ML Framework looks like:

ModelName/
├── __init__.py
├── docs/
├── ModelName/
├── README.md
└── Some other components

And the core part is ModelName under the package ModelName, whose architecture is:

ModelName/
├── __init__.py
├── ModelName.py                <-- the main module
└── Module/
    ├── __init__.py
    ├── configuration.py        <-- define the configurable variables
    ├── etl.py                  <-- define how the data will be loaded and preprocessed
    ├── module.py               <-- the wrapper of the network, rarely needs modification
    ├── run.py                  <-- human testing script
    └── sym/                    <-- define the network
        ├── __init__.py
        ├── fit_eval.py         <-- define how the network will be trained and evaluated
        ├── net.py              <-- network architecture
        └── viz.py              <-- (optional) how to visualize the network
Configuration

In configuration, some variables are predefined, such as data_dir (where the data is stored) and model_dir (where model files like parameters and the running log are stored). The following rules are used to automatically construct the needed paths, and they can be modified as the user wants:

model_name = "automatically be consistent with ModelName"

root = "./"
dataset = ""  # optional
timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")  # optional
workspace = ""  # optional

root_data_dir = "$root/data/$dataset" if dataset else "$root/data"
data_dir = "$root_data_dir/data"
root_model_dir = "$root_data_dir/model/$model_name"
model_dir = "$root_model_dir/$workspace" if workspace else root_model_dir
cfg_path = "$model_dir/configuration.json"

The value of any variable containing $ will be automatically evaluated while the program runs. Thus, it is easy to construct flexible variables via cli. For example, someone may want the model_dir to contain timestamp information; he or she can then specify the model_dir in cli like:

--model_dir \$root_model_dir/\$workspace/\$timestamp

Annotation: \ is an escape character in the shell, which lets \$variable reach the program as the literal string $variable; otherwise, the shell would expand the variable as an environment variable such as $HOME.
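A minimal sketch of how such $variable references between configuration values can be resolved, using string.Template (assumed behavior for illustration, not longling's implementation):

```python
from string import Template


def resolve(cfg: dict) -> dict:
    """Repeatedly substitute $variable references between configuration
    values until chained references like $root_data_dir are expanded."""
    resolved = dict(cfg)
    for _ in range(len(resolved)):  # enough passes for any chain of references
        resolved = {
            k: Template(v).safe_substitute(resolved) if isinstance(v, str) else v
            for k, v in resolved.items()
        }
    return resolved


cfg = resolve({
    "root": ".",
    "dataset": "d1",
    "root_data_dir": "$root/data/$dataset",
    "data_dir": "$root_data_dir/data",
})
```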

Also, some general variables which may be frequently used across algorithms are predefined, like optimizer and batch_size:

# training settings
begin_epoch = 0
end_epoch = 100
batch_size = 32
save_epoch = 1

# optimizer settings
optimizer, optimizer_params = get_optimizer_cfg(name="base")
lr_params = {
    "learning_rate": optimizer_params["learning_rate"],
    "step": 100,
    "max_update_steps": get_update_steps(
        update_epoch=10,
        batches_per_epoch=1000,
    ),
}

metrics

Metrics
classification
longling.ML.metrics.classification.classification_report(y_true, y_pred=None, y_score=None, labels=None, metrics=None, sample_weight=None, average_options=None, multiclass_to_multilabel=False, logger=<module 'logging'>, **kwargs)[source]

Currently supports binary and multiclass classification.

Parameters:
  • y_true (list, 1d array-like, or label indicator array / sparse matrix) -- Ground truth (correct) target values.
  • y_pred (list or None, 1d array-like, or label indicator array / sparse matrix) -- Estimated targets as returned by a classifier.
  • y_score (array or None, shape = [n_samples] or [n_samples, n_classes]) -- Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by "decision_function" on some classifiers). For binary y_true, y_score is supposed to be the score of the class with greater label.
  • labels (array, shape = [n_labels]) -- Optional list of label indices to include in the report.
  • metrics (list of str,) -- Support: precision, recall, f1, support, accuracy, auc, aupoc.
  • sample_weight (array-like of shape = [n_samples], optional) -- Sample weights.
  • average_options (str or list) -- default to macro, choices (one or many): "micro", "macro", "samples", "weighted"
  • multiclass_to_multilabel (bool) --
  • logger --

Examples

>>> import numpy as np
>>> # binary classification
>>> y_true = np.array([0, 0, 1, 1, 0])
>>> y_pred = np.array([0, 1, 0, 1, 0])
>>> classification_report(y_true, y_pred)
           precision    recall        f1  support
0           0.666667  0.666667  0.666667        3
1           0.500000  0.500000  0.500000        2
macro_avg   0.583333  0.583333  0.583333        5
accuracy: 0.600000
>>> y_true = np.array([0, 0, 1, 1])
>>> y_score = np.array([0.1, 0.4, 0.35, 0.8])
>>> classification_report(y_true, y_score=y_score)    # doctest: +NORMALIZE_WHITESPACE
macro_auc: 0.750000 macro_aupoc: 0.833333
>>> y_true = np.array([0, 0, 1, 1])
>>> y_pred = [0, 0, 0, 1]
>>> y_score = np.array([0.1, 0.4, 0.35, 0.8])
>>> classification_report(y_true, y_pred, y_score=y_score)    # doctest: +NORMALIZE_WHITESPACE
           precision  recall        f1  support
0           0.666667    1.00  0.800000        2
1           1.000000    0.50  0.666667        2
macro_avg   0.833333    0.75  0.733333        4
accuracy: 0.750000  macro_auc: 0.750000     macro_aupoc: 0.833333
>>> # multiclass classification
>>> y_true = [0, 1, 2, 2, 2]
>>> y_pred = [0, 0, 2, 2, 1]
>>> classification_report(y_true, y_pred)
           precision    recall        f1  support
0                0.5  1.000000  0.666667        1
1                0.0  0.000000  0.000000        1
2                1.0  0.666667  0.800000        3
macro_avg        0.5  0.555556  0.488889        5
accuracy: 0.600000
>>> # multiclass in multilabel
>>> y_true = np.array([0, 0, 1, 1, 2, 1])
>>> y_pred = np.array([2, 1, 0, 2, 1, 0])
>>> y_score = np.array([
...    [0.15, 0.4, 0.45],
...    [0.1, 0.9, 0.0],
...    [0.33333, 0.333333, 0.333333],
...    [0.15, 0.4, 0.45],
...    [0.1, 0.9, 0.0],
...    [0.33333, 0.333333, 0.333333]
... ])
>>> classification_report(
...    y_true, y_pred, y_score,
...    multiclass_to_multilabel=True,
...    metrics=["aupoc"]
... )
              aupoc
0          0.291667
1          0.416667
2          0.166667
macro_avg  0.291667
>>> classification_report(
...     y_true, y_pred, y_score,
...    multiclass_to_multilabel=True,
...    metrics=["auc", "aupoc"]
... )
                auc     aupoc
0          0.250000  0.291667
1          0.055556  0.416667
2          0.100000  0.166667
macro_avg  0.135185  0.291667
macro_auc: 0.194444
>>> y_true = np.array([0, 1, 1, 1, 2, 1])
>>> y_pred = np.array([2, 1, 0, 2, 1, 0])
>>> y_score = np.array([
...    [0.45, 0.4, 0.15],
...    [0.1, 0.9, 0.0],
...    [0.33333, 0.333333, 0.333333],
...    [0.15, 0.4, 0.45],
...    [0.1, 0.9, 0.0],
...    [0.33333, 0.333333, 0.333333]
... ])
>>> classification_report(
...    y_true, y_pred,
...    y_score,
...    multiclass_to_multilabel=True,
... )    # doctest: +NORMALIZE_WHITESPACE
           precision    recall        f1   auc     aupoc  support
0           0.000000  0.000000  0.000000  1.00  1.000000        1
1           0.500000  0.250000  0.333333  0.25  0.583333        4
2           0.000000  0.000000  0.000000  0.10  0.166667        1
macro_avg   0.166667  0.083333  0.111111  0.45  0.583333        6
accuracy: 0.166667  macro_auc: 0.437500
>>> classification_report(
...    y_true, y_pred,
...    y_score,
...    labels=[0, 1],
...    multiclass_to_multilabel=True,
... )    # doctest: +NORMALIZE_WHITESPACE
           precision  recall        f1   auc     aupoc  support
0               0.00   0.000  0.000000  1.00  1.000000        1
1               0.50   0.250  0.333333  0.25  0.583333        4
macro_avg       0.25   0.125  0.166667  0.45  0.583333        5
accuracy: 0.166667  macro_auc: 0.437500
regression
longling.ML.metrics.regression.regression_report(y_true, y_pred, metrics=None, sample_weight=None, multioutput='uniform_average', average_options=None, key_prefix='', key_suffix='', verbose=True)[source]
Parameters:
  • y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) -- Ground truth (correct) target values.
  • y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) -- Estimated target values.
  • metrics (list of str,) -- Support: evar(explained_variance), mse, rmse, mae, r2
  • sample_weight (array-like of shape (n_samples,), optional) -- Sample weights.
  • multioutput (string in ['raw_values', 'uniform_average', 'variance_weighted'], list or array-like of shape (n_outputs)) --

    Defines how multiple output values are aggregated. Disabled when verbose is True. An array-like value defines the weights used to average errors.

    'raw_values':
    Returns a full set of errors in case of multioutput input.
    'uniform_average':
    Errors of all outputs are averaged with uniform weight. Alias: "macro"
    'variance_weighted':
    Only supported for evar and r2. Scores of all outputs are averaged, weighted by the variances of each individual output. Alias: "vw"
  • average_options (str or list) -- default to macro, choices (one or many): "macro", "vw"
  • key_prefix (str) --
  • key_suffix (str) --
  • verbose (bool) --
Returns:

  • evar (explained variance)
  • mse (mean squared error)
  • rmse (root mean squared error)
  • mae (mean absolute error)
  • r2 (r2 score)

Examples

>>> y_true = [[0.5, 1, 1], [-1, 1, 1], [7, -6, 1]]
>>> y_pred = [[0, 2, 1], [-1, 2, 1], [8, -5, 1]]
>>> regression_report(y_true, y_pred)   # doctest: +NORMALIZE_WHITESPACE
                       evar       mse      rmse  mae        r2
0                  0.967742  0.416667  0.645497  0.5  0.965438
1                  1.000000  1.000000  1.000000  1.0  0.908163
2                  1.000000  0.000000  0.000000  0.0  1.000000
uniform_average    0.989247  0.472222  0.548499  0.5  0.957867
variance_weighted  0.983051  0.472222  0.548499  0.5  0.938257
>>> regression_report(y_true, y_pred, verbose=False)   # doctest: +NORMALIZE_WHITESPACE
evar: 0.989247      mse: 0.472222   rmse: 0.548499  mae: 0.500000   r2: 0.957867
>>> regression_report(
...     y_true, y_pred, multioutput="variance_weighted", verbose=False
... )   # doctest: +NORMALIZE_WHITESPACE
evar: 0.983051      mse: 0.472222   rmse: 0.548499  mae: 0.500000   r2: 0.938257
>>> regression_report(y_true, y_pred, multioutput=[0.3, 0.6, 0.1], verbose=False)   # doctest: +NORMALIZE_WHITESPACE
evar: 0.990323      mse: 0.725000   rmse: 0.793649  mae: 0.750000   r2: 0.934529
>>> regression_report(y_true, y_pred, verbose=True)   # doctest: +NORMALIZE_WHITESPACE
                       evar       mse      rmse  mae        r2
0                  0.967742  0.416667  0.645497  0.5  0.965438
1                  1.000000  1.000000  1.000000  1.0  0.908163
2                  1.000000  0.000000  0.000000  0.0  1.000000
uniform_average    0.989247  0.472222  0.548499  0.5  0.957867
variance_weighted  0.983051  0.472222  0.548499  0.5  0.938257
>>> regression_report(
...     y_true, y_pred, verbose=True, average_options=["macro", "vw", [0.3, 0.6, 0.1]]
... )   # doctest: +NORMALIZE_WHITESPACE
                       evar       mse      rmse   mae        r2
0                  0.967742  0.416667  0.645497  0.50  0.965438
1                  1.000000  1.000000  1.000000  1.00  0.908163
2                  1.000000  0.000000  0.000000  0.00  1.000000
uniform_average    0.989247  0.472222  0.548499  0.50  0.957867
variance_weighted  0.983051  0.472222  0.548499  0.50  0.938257
weighted           0.990323  0.725000  0.793649  0.75  0.934529
ranking
longling.ML.metrics.ranking.ranking_report(y_true, y_pred, k: (int, list) = None, continuous=False, coerce='ignore', pad_pred=-100, metrics=None, bottom=False, verbose=True) → longling.ML.metrics.utils.POrderedDict[source]
Parameters:
  • y_true --
  • y_pred --
  • k --
  • continuous --
  • coerce --
  • pad_pred --
  • metrics --
  • bottom --
  • verbose --

Examples

>>> y_true = [[1, 0, 0], [0, 0, 1]]
>>> y_pred = [[0.75, 0.5, 1], [1, 0.2, 0.1]]
>>> ranking_report(y_true, y_pred)  # doctest: +NORMALIZE_WHITESPACE
       ndcg@k  precision@k  recall@k  f1@k  len@k  support@k
1   1.000000     0.000000       0.0   0.0    1.0          2
3   0.565465     0.333333       1.0   0.5    3.0          2
5   0.565465     0.333333       1.0   0.5    3.0          2
10  0.565465     0.333333       1.0   0.5    3.0          2
auc: 0.250000       map: 0.416667   mrr: 0.416667   coverage_error: 2.500000        ranking_loss: 0.750000  len: 3.000000
support: 2
>>> ranking_report(y_true, y_pred, k=[1, 3, 5])  # doctest: +NORMALIZE_WHITESPACE
       ndcg@k  precision@k  recall@k  f1@k  len@k  support@k
1   1.000000     0.000000       0.0   0.0    1.0          2
3   0.565465     0.333333       1.0   0.5    3.0          2
5   0.565465     0.333333       1.0   0.5    3.0          2
auc: 0.250000       map: 0.416667   mrr: 0.416667   coverage_error: 2.500000        ranking_loss: 0.750000  len: 3.000000
support: 2
>>> ranking_report(y_true, y_pred, bottom=True)  # doctest: +NORMALIZE_WHITESPACE
          ndcg@k  precision@k  recall@k  f1@k  len@k  support@k  ndcg@k(B) \
1   1.000000     0.000000       0.0   0.0    1.0          2   1.000000
3   0.565465     0.333333       1.0   0.5    3.0          2   0.806574
5   0.565465     0.333333       1.0   0.5    3.0          2   0.806574
10  0.565465     0.333333       1.0   0.5    3.0          2   0.806574
<BLANKLINE>
    precision@k(B)  recall@k(B)   f1@k(B)  len@k(B)  support@k(B)
1         0.500000         0.25  0.333333       1.0             2
3         0.666667         1.00  0.800000       3.0             2
5         0.666667         1.00  0.800000       3.0             2
10        0.666667         1.00  0.800000       3.0             2
auc: 0.250000       map: 0.416667   mrr: 0.416667   coverage_error: 2.500000        ranking_loss: 0.750000  len: 3.000000
support: 2  map(B): 0.708333        mrr(B): 0.750000
>>> ranking_report(y_true, y_pred, bottom=True, metrics=["auc"])  # doctest: +NORMALIZE_WHITESPACE
auc: 0.250000   len: 3.000000       support: 2
>>> y_true = [[0.9, 0.7, 0.1], [0, 0.5, 1]]
>>> y_pred = [[0.75, 0.5, 1], [1, 0.2, 0.1]]
>>> ranking_report(y_true, y_pred, continuous=True)  # doctest: +NORMALIZE_WHITESPACE
      ndcg@k  len@k  support@k
3   0.675647    3.0          2
5   0.675647    3.0          2
10  0.675647    3.0          2
mrr: 0.750000       len: 3.000000   support: 2
>>> y_true = [[1, 0], [0, 0, 1]]
>>> y_pred = [[0.75, 0.5], [1, 0.2, 0.1]]
>>> ranking_report(y_true, y_pred)  # doctest: +NORMALIZE_WHITESPACE
    ndcg@k  precision@k  recall@k      f1@k  len@k  support@k
1     1.00     0.500000       0.5  0.500000    1.0          2
3     0.75     0.416667       1.0  0.583333    2.5          2
5     0.75     0.416667       1.0  0.583333    2.5          2
10    0.75     0.416667       1.0  0.583333    2.5          2
auc: 0.500000       map: 0.666667   mrr: 0.666667   coverage_error: 2.000000        ranking_loss: 0.500000  len: 2.500000
support: 2
>>> ranking_report(y_true, y_pred, coerce="abandon")  # doctest: +NORMALIZE_WHITESPACE
   ndcg@k  precision@k  recall@k  f1@k  len@k  support@k
1     1.0     0.500000       0.5   0.5    1.0          2
3     0.5     0.333333       1.0   0.5    3.0          1
auc: 0.500000       map: 0.666667   mrr: 0.666667   coverage_error: 2.000000        ranking_loss: 0.500000  len: 2.500000
support: 2
>>> ranking_report(y_true, y_pred, coerce="padding")  # doctest: +NORMALIZE_WHITESPACE
    ndcg@k  precision@k  recall@k      f1@k  len@k  support@k
1     1.00     0.500000       0.5  0.500000    1.0          2
3     0.75     0.416667       1.0  0.583333    2.5          2
5     0.75     0.416667       1.0  0.583333    2.5          2
10    0.75     0.416667       1.0  0.583333    2.5          2
auc: 0.500000       map: 0.666667   mrr: 0.666667   coverage_error: 2.000000        ranking_loss: 0.500000  len: 2.500000
support: 2
>>> ranking_report(y_true, y_pred, bottom=True)  # doctest: +NORMALIZE_WHITESPACE
    ndcg@k  precision@k  recall@k      f1@k  len@k  support@k  ndcg@k(B)  \
1     1.00     0.500000       0.5  0.500000    1.0          2   1.000000
3     0.75     0.416667       1.0  0.583333    2.5          2   0.846713
5     0.75     0.416667       1.0  0.583333    2.5          2   0.846713
10    0.75     0.416667       1.0  0.583333    2.5          2   0.846713
<BLANKLINE>
    precision@k(B)  recall@k(B)   f1@k(B)  len@k(B)  support@k(B)
1         0.500000          0.5  0.500000       1.0             2
3         0.583333          1.0  0.733333       2.5             2
5         0.583333          1.0  0.733333       2.5             2
10        0.583333          1.0  0.733333       2.5             2
auc: 0.500000       map: 0.666667   mrr: 0.666667   coverage_error: 2.000000        ranking_loss: 0.500000  len: 2.500000
support: 2  map(B): 0.791667        mrr(B): 0.750000
>>> ranking_report(y_true, y_pred, bottom=True, coerce="abandon")  # doctest: +NORMALIZE_WHITESPACE
   ndcg@k  precision@k  recall@k  f1@k  len@k  support@k  ndcg@k(B)  \
1     1.0     0.500000       0.5   0.5    1.0          2   1.000000
3     0.5     0.333333       1.0   0.5    3.0          1   0.693426
<BLANKLINE>
   precision@k(B)  recall@k(B)  f1@k(B)  len@k(B)  support@k(B)
1        0.500000          0.5      0.5       1.0             2
3        0.666667          1.0      0.8       3.0             1
auc: 0.500000       map: 0.666667   mrr: 0.666667   coverage_error: 2.000000        ranking_loss: 0.500000
len: 2.500000       support: 2      map(B): 0.791667        mrr(B): 0.750000
>>> ranking_report(y_true, y_pred, bottom=True, coerce="padding")  # doctest: +NORMALIZE_WHITESPACE
    ndcg@k  precision@k  recall@k      f1@k  len@k  support@k  ndcg@k(B)  \
1     1.00     0.500000       0.5  0.500000    1.0          2   1.000000
3     0.75     0.416667       1.0  0.583333    2.5          2   0.846713
5     0.75     0.416667       1.0  0.583333    2.5          2   0.846713
10    0.75     0.416667       1.0  0.583333    2.5          2   0.846713
<BLANKLINE>
    precision@k(B)  recall@k(B)   f1@k(B)  len@k(B)  support@k(B)
1             0.50          0.5  0.500000       1.0             2
3             0.50          1.0  0.650000       3.0             2
5             0.30          1.0  0.452381       5.0             2
10            0.15          1.0  0.257576      10.0             2
auc: 0.500000       map: 0.666667   mrr: 0.666667   coverage_error: 2.000000        ranking_loss: 0.500000  len: 2.500000
support: 2  map(B): 0.791667        mrr(B): 0.750000

toolkit

Toolkit module

Monitor

Monitors for data loading, training and testing (e.g., the progress of epochs and batches).

Data loading could actually be split out separately.

A monitor group is needed to manage these monitors uniformly; two kinds are available, one as a class and one as a dict.

API reference
General Toolkit
analyser
longling.ML.toolkit.analyser.get_max(src: (str, PurePath, list), *keys, with_keys: (str, None) = None, with_all=False, merge=True)[source]

Examples

>>> src = [
... {"Epoch": 0, "macro avg": {"f1": 0.7}, "loss": 0.04, "accuracy": 0.7},
... {"Epoch": 1, "macro avg": {"f1": 0.88}, "loss": 0.03, "accuracy": 0.8},
... {"Epoch": 1, "macro avg": {"f1": 0.7}, "loss": 0.02, "accuracy": 0.66}
... ]
>>> result, _ = get_max(src, "accuracy", merge=False)
>>> result
{'accuracy': 0.8}
>>> _, result_appendix = get_max(src, "accuracy", with_all=True, merge=False)
>>> result_appendix
{'accuracy': {'Epoch': 1, 'macro avg': {'f1': 0.88}, 'loss': 0.03, 'accuracy': 0.8}}
>>> result, result_appendix = get_max(src, "accuracy", "macro avg:f1", with_keys="Epoch", merge=False)
>>> result
{'accuracy': 0.8, 'macro avg:f1': 0.88}
>>> result_appendix
{'accuracy': {'Epoch': 1}, 'macro avg:f1': {'Epoch': 1}}
>>> get_max(src, "accuracy", "macro avg:f1", with_keys="Epoch")
{'accuracy': {'Epoch': 1, 'accuracy': 0.8}, 'macro avg:f1': {'Epoch': 1, 'macro avg:f1': 0.88}}
longling.ML.toolkit.analyser.get_min(src: (str, PurePath, list), *keys, with_keys: (str, None) = None, with_all=False, merge=True)[source]
>>> src = [
... {"Epoch": 0, "macro avg": {"f1": 0.7}, "loss": 0.04, "accuracy": 0.7},
... {"Epoch": 1, "macro avg": {"f1": 0.88}, "loss": 0.03, "accuracy": 0.8},
... {"Epoch": 1, "macro avg": {"f1": 0.7}, "loss": 0.02, "accuracy": 0.66}
... ]
>>> get_min(src, "loss")
{'loss': 0.02}
longling.ML.toolkit.analyser.key_parser(key)[source]

Examples

>>> key_parser("macro avg:f1")
['macro avg', 'f1']
>>> key_parser("accuracy")
'accuracy'
>>> key_parser("iteration:accuracy")
['iteration', 'accuracy']
dataset
class longling.ML.toolkit.dataset.ID2Feature(feature_df: pandas.core.frame.DataFrame, id_field=None, set_index=False)[source]

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"id": [0, 1, 2, 3, 4], "numeric": [1, 2, 3, 4, 5], "text": ["a", "b", "c", "d", "e"]})
>>> i2f = ID2Feature(df, id_field="id", set_index=True)
>>> i2f[2]
numeric    3
text       c
Name: 2, dtype: object
>>> i2f[[2, 3]]["numeric"]
id
2    3
3    4
Name: numeric, dtype: int64
>>> i2f(2)
[3, 'c']
>>> i2f([2, 3])
[[3, 'c'], [4, 'd']]
class longling.ML.toolkit.dataset.ItemSpecificSampler(triplet_df: pandas.core.frame.DataFrame, query_field='item_id', pos_field='pos', neg_field='neg', set_index=False, item_id_range=None, user_id_range=None, random_state=10)[source]

Examples

>>> import pandas as pd
>>> user_num = 3
>>> item_num = 4
>>> rating_matrix = pd.DataFrame({
...     "user_id": [0, 1, 1, 1, 2],
...     "item_id": [1, 3, 0, 2, 1]
... })
>>> triplet_df = ItemSpecificSampler.rating2triplet(rating_matrix)
>>> triplet_df   # doctest: +NORMALIZE_WHITESPACE
            pos neg
item_id
0           [1]  []
1        [0, 2]  []
2           [1]  []
3           [1]  []
>>> triplet_df.index
Int64Index([0, 1, 2, 3], dtype='int64', name='item_id')
>>> sampler = ItemSpecificSampler(triplet_df)
>>> sampler(1)
(0, [0])
>>> sampler = ItemSpecificSampler(triplet_df, user_id_range=user_num)
>>> sampler(0, implicit=True)
(1, [2])
>>> sampler(0, 5, implicit=True)
(2, [2, 0, 0, 0, 0])
>>> sampler(0, 5, implicit=True, pad_value=-1)
(2, [0, 2, -1, -1, -1])
>>> sampler([0, 1, 2], 5, implicit=True, pad_value=-1)
[(2, [0, 2, -1, -1, -1]), (1, [1, -1, -1, -1, -1]), (2, [0, 2, -1, -1, -1])]
>>> rating_matrix = pd.DataFrame({
...     "user_id": [0, 1, 1, 1, 2],
...     "item_id": [1, 3, 0, 2, 1],
...     "score": [1, 0, 1, 1, 0]
... })
>>> triplet_df = ItemSpecificSampler.rating2triplet(rating_matrix=rating_matrix, value_field="score")
>>> triplet_df   # doctest: +NORMALIZE_WHITESPACE
         pos  neg
item_id
0        [1]   []
1        [0]  [2]
2        [1]   []
3         []  [1]
>>> sampler = UserSpecificPairSampler(triplet_df)
>>> sampler([0, 1, 2], 5, pad_value=-1)
[(0, [-1, -1, -1, -1, -1]), (1, [2, -1, -1, -1, -1]), (0, [-1, -1, -1, -1, -1])]
>>> sampler([0, 1, 2], 5, neg=False, pad_value=-1)
[(1, [1, -1, -1, -1, -1]), (1, [0, -1, -1, -1, -1]), (1, [1, -1, -1, -1, -1])]
>>> sampler(rating_matrix["item_id"], 2, neg=rating_matrix["score"],
...     excluded_key=rating_matrix["user_id"], pad_value=-1)
[(1, [2, -1]), (0, [-1, -1]), (0, [-1, -1]), (0, [-1, -1]), (1, [0, -1])]
>>> sampler(rating_matrix["item_id"], 2, neg=rating_matrix["score"],
...     excluded_key=rating_matrix["user_id"], pad_value=-1, return_column=True)
((1, 0, 0, 0, 1), ([2, -1], [-1, -1], [-1, -1], [-1, -1], [0, -1]))
>>> sampler(rating_matrix["item_id"], 2, neg=rating_matrix["score"],
...     excluded_key=rating_matrix["user_id"], pad_value=-1, return_column=True, split_sample_to_column=True)
((1, 0, 0, 0, 1), [(2, -1, -1, -1, 0), (-1, -1, -1, -1, -1)])
class longling.ML.toolkit.dataset.TripletPairSampler(triplet_df: pandas.core.frame.DataFrame, query_field, pos_field='pos', neg_field='neg', set_index=False, query_range: (<class 'int'>, <class 'tuple'>, <class 'list'>) = None, key_range: (<class 'int'>, <class 'tuple'>, <class 'list'>) = None, random_state=10)[source]

Examples

>>> # implicit feedback
>>> import pandas as pd
>>> triplet_df = pd.DataFrame({
...     "query": [0, 1, 2],
...     "pos": [[1], [3, 0, 2], [1]],
...     "neg": [[], [], []]
... })
>>> sampler = TripletPairSampler(triplet_df, "query", set_index=True)
>>> rating_matrix = pd.DataFrame({
...     "query": [0, 1, 1, 1, 2],
...     "key": [1, 3, 0, 2, 1]
... })
>>> triplet_df = TripletPairSampler.rating2triplet(rating_matrix, query_field="query", key_field="key")
>>> triplet_df   # doctest: +NORMALIZE_WHITESPACE
             pos neg
query
0            [1]  []
1      [3, 0, 2]  []
2            [1]  []
>>> sampler = TripletPairSampler(triplet_df, "query")
>>> sampler(0)
(0, [0])
>>> sampler(0, 3)
(0, [0, 0, 0])
>>> sampler(0, 3, padding=False)
(0, [])
>>> sampler = TripletPairSampler(triplet_df, "query", query_range=3, key_range=4)
>>> sampler(0)
(0, [0])
>>> sampler(0, 3)
(0, [0, 0, 0])
>>> sampler(0, 3, padding=False)
(0, [])
>>> sampler(0, 5, padding=False, implicit=True)
(3, [2, 3, 0])
>>> sampler(0, 5, padding=False, implicit=True, excluded_key=[3])
(2, [0, 2])
>>> sampler(0, 5, padding=True, implicit=True, excluded_key=[3])
(2, [2, 0, 0, 0, 0])
>>> sampler(0, 5, implicit=True, pad_value=-1)
(3, [2, 3, 0, -1, -1])
>>> sampler(0, 5, implicit=True, fast_implicit=True, pad_value=-1)
(3, [0, 2, 3, -1, -1])
>>> sampler(0, 5, implicit=True, fast_implicit=True, with_n_implicit=3, pad_value=-1)
(3, [0, 2, 3, -1, -1, -1, -1, -1])
>>> sampler(0, 5, implicit=True, fast_implicit=True, with_n_implicit=3, pad_value=-1, padding_implicit=True)
(3, [0, 2, 3, -1, -1, -1, -1, -1])
>>> rating_matrix = pd.DataFrame({
...     "query": [0, 1, 1, 1, 2],
...     "key": [1, 3, 0, 2, 1],
...     "score": [1, 0, 1, 1, 0]
... })
>>> triplet_df = TripletPairSampler.rating2triplet(
...     rating_matrix,
...     "query", "key",
...     value_field="score"
... )
>>> triplet_df   # doctest: +NORMALIZE_WHITESPACE
            pos  neg
query
0           [1]   []
1        [0, 2]  [3]
2            []  [1]
>>> sampler = TripletPairSampler(triplet_df, "query", query_range=3, key_range=4)
>>> sampler([0, 1, 2], 5, implicit=True, pad_value=-1)
[(3, [2, 3, 0, -1, -1]), (1, [1, -1, -1, -1, -1]), (3, [3, 0, 2, -1, -1])]
>>> sampler([0, 1, 2], 5, pad_value=-1)
[(0, [-1, -1, -1, -1, -1]), (1, [3, -1, -1, -1, -1]), (1, [1, -1, -1, -1, -1])]
>>> sampler([0, 1, 2], 5, neg=False, pad_value=-1)
[(1, [1, -1, -1, -1, -1]), (2, [0, 2, -1, -1, -1]), (0, [-1, -1, -1, -1, -1])]
>>> sampler(rating_matrix["query"], 2, neg=rating_matrix["score"],
...     excluded_key=rating_matrix["key"], pad_value=-1)
[(0, [-1, -1]), (2, [2, 0]), (1, [3, -1]), (1, [3, -1]), (0, [-1, -1])]
>>> sampler(rating_matrix["query"], 2, neg=rating_matrix["score"],
...     excluded_key=rating_matrix["key"], pad_value=-1, return_column=True)
((0, 2, 1, 1, 0), ([-1, -1], [0, 2], [3, -1], [3, -1], [-1, -1]))
>>> sampler(rating_matrix["query"], 2, neg=rating_matrix["score"],
...     excluded_key=rating_matrix["key"], pad_value=-1, return_column=True, split_sample_to_column=True)
((0, 2, 1, 1, 0), [(-1, 0, 3, 3, -1), (-1, 2, -1, -1, -1)])
>>> rating_matrix = pd.DataFrame({
...     "query": [0, 1, 1, 1, 2],
...     "key": [1, 3, 0, 2, 1],
...     "score": [0.8, 0.4, 0.7, 0.5, 0.1]
... })
>>> TripletPairSampler.rating2triplet(
...     rating_matrix,
...     "query", "key",
...     value_field="score",
...     value_threshold=0.5
... )   # doctest: +NORMALIZE_WHITESPACE
          pos  neg
query
0         [1]   []
1      [0, 2]  [3]
2          []  [1]
class longling.ML.toolkit.dataset.UserSpecificPairSampler(triplet_df: pandas.core.frame.DataFrame, query_field='user_id', pos_field='pos', neg_field='neg', set_index=False, user_id_range=None, item_id_range=None, random_state=10)[source]

Examples

>>> import pandas as pd
>>> user_num = 3
>>> item_num = 4
>>> rating_matrix = pd.DataFrame({
...     "user_id": [0, 1, 1, 1, 2],
...     "item_id": [1, 3, 0, 2, 1]
... })
>>> triplet_df = UserSpecificPairSampler.rating2triplet(rating_matrix)
>>> triplet_df   # doctest: +NORMALIZE_WHITESPACE
               pos neg
user_id
0              [1]  []
1        [3, 0, 2]  []
2              [1]  []
>>> sampler = UserSpecificPairSampler(triplet_df)
>>> sampler(1)
(0, [0])
>>> sampler = UserSpecificPairSampler(triplet_df, item_id_range=item_num)
>>> sampler(0, implicit=True)
(1, [3])
>>> sampler(0, 5, implicit=True)
(3, [3, 2, 0, 0, 0])
>>> sampler(0, 5, implicit=True, pad_value=-1)
(3, [3, 2, 0, -1, -1])
>>> sampler([0, 1, 2], 5, implicit=True, pad_value=-1)
[(3, [2, 3, 0, -1, -1]), (1, [1, -1, -1, -1, -1]), (3, [2, 0, 3, -1, -1])]
>>> rating_matrix = pd.DataFrame({
...     "user_id": [0, 1, 1, 1, 2],
...     "item_id": [1, 3, 0, 2, 1],
...     "score": [1, 0, 1, 1, 0]
... })
>>> triplet_df = UserSpecificPairSampler.rating2triplet(rating_matrix=rating_matrix, value_field="score")
>>> triplet_df   # doctest: +NORMALIZE_WHITESPACE
            pos  neg
user_id
0           [1]   []
1        [0, 2]  [3]
2            []  [1]
>>> sampler = UserSpecificPairSampler(triplet_df)
>>> sampler([0, 1, 2], 5, pad_value=-1)
[(0, [-1, -1, -1, -1, -1]), (1, [3, -1, -1, -1, -1]), (1, [1, -1, -1, -1, -1])]
>>> sampler([0, 1, 2], 5, neg=False, pad_value=-1)
[(1, [1, -1, -1, -1, -1]), (2, [0, 2, -1, -1, -1]), (0, [-1, -1, -1, -1, -1])]
>>> sampler(rating_matrix["user_id"], 2, neg=rating_matrix["score"],
...     excluded_key=rating_matrix["item_id"], pad_value=-1)
[(0, [-1, -1]), (2, [2, 0]), (1, [3, -1]), (1, [3, -1]), (0, [-1, -1])]
>>> sampler(rating_matrix["user_id"], 2, neg=rating_matrix["score"],
...     excluded_key=rating_matrix["item_id"], pad_value=-1, return_column=True)
((0, 2, 1, 1, 0), ([-1, -1], [0, 2], [3, -1], [3, -1], [-1, -1]))
>>> sampler(rating_matrix["user_id"], 2, neg=rating_matrix["score"],
...     excluded_key=rating_matrix["item_id"], pad_value=-1, return_column=True, split_sample_to_column=True)
((0, 2, 1, 1, 0), [(-1, 2, 3, 3, -1), (-1, 0, -1, -1, -1)])
longling.ML.toolkit.dataset.train_test(*files, train_size: (<class 'float'>, <class 'int'>) = 0.8, test_size: (<class 'float'>, <class 'int'>, None) = None, ratio=None, random_state=None, shuffle=True, target_names=None, suffix: list = None, prefix='', logger=<Logger dataset (INFO)>, **kwargs)[source]
Parameters:
  • files --
  • train_size (float, int, or None, (default=0.8)) -- Represents the proportion of the dataset to include in the train split.
  • test_size (float, int, or None) -- Represents the proportion of the dataset to include in the test split.
  • random_state (int, RandomState instance or None, optional (default=None)) -- If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
  • shuffle (boolean, optional (default=True)) -- Whether or not to shuffle the data before splitting. If shuffle=False, then stratify must be None.
  • target_names (list of PATH_TYPE) --
  • suffix (list) --
  • kwargs --
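The float/int semantics of train_size can be sketched as follows; this is an illustration of the assumed semantics (a float is a proportion, an int an absolute count), not longling's implementation:

```python
import random

def split_indices(n, train_size=0.8, shuffle=True, random_state=None):
    # Illustrative sketch of a train/test index split; not longling's code.
    # A float train_size is a proportion of n; an int is an absolute count.
    idx = list(range(n))
    if shuffle:
        random.Random(random_state).shuffle(idx)
    cut = int(round(n * train_size)) if isinstance(train_size, float) else train_size
    return idx[:cut], idx[cut:]

train, test = split_indices(10, train_size=0.8, random_state=0)
# len(train) == 8, len(test) == 2
```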
longling.ML.toolkit.dataset.train_valid_test(*files, train_size: (<class 'float'>, <class 'int'>) = 0.8, valid_size: (<class 'float'>, <class 'int'>) = 0.1, test_size: (<class 'float'>, <class 'int'>, None) = None, ratio=None, random_state=None, shuffle=True, target_names=None, suffix: list = None, logger=<Logger dataset (INFO)>, prefix='', **kwargs)[source]
Parameters:
  • files --
  • train_size (float, int, or None, (default=0.8)) -- Represents the proportion of the dataset to include in the train split.
  • valid_size (float, int, or None, (default=0.1)) -- Represents the proportion of the dataset to include in the valid split.
  • test_size (float, int, or None) -- Represents the proportion of the dataset to include in the test split.
  • random_state (int, RandomState instance or None, optional (default=None)) -- If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
  • shuffle (boolean, optional (default=True)) -- Whether or not to shuffle the data before splitting. If shuffle=False, then stratify must be None.
  • target_names --
  • suffix (list) --
  • kwargs --
formatter
class longling.ML.toolkit.formatter.EpisodeEvalFMT(logger=<RootLogger root (WARNING)>, dump_file: (((<class 'str'>, <class 'pathlib.PurePath'>), (<class '_io.TextIOWrapper'>, <class 'typing.TextIO'>, <class 'typing.BinaryIO'>, <class 'codecs.StreamReaderWriter'>, <class 'fileinput.FileInput'>)), None) = False, col: (<class 'int'>, None) = None, **kwargs)[source]

Examples

>>> import numpy as np
>>> from longling.ML.metrics import classification_report
>>> y_true = np.array([0, 0, 1, 1, 2, 1])
>>> y_pred = np.array([2, 1, 0, 1, 1, 0])
>>> y_score = np.array([
...     [0.15, 0.4, 0.45],
...     [0.1, 0.9, 0.0],
...     [0.33333, 0.333333, 0.333333],
...     [0.15, 0.4, 0.45],
...     [0.1, 0.9, 0.0],
...     [0.33333, 0.333333, 0.333333]
... ])
>>> print(EpisodeEvalFMT.format(
...     iteration=30,
...     eval_name_value=classification_report(y_true, y_pred, y_score)
... ))    # doctest: +NORMALIZE_WHITESPACE
Episode [30]
           precision    recall        f1  support
0           0.000000  0.000000  0.000000        2
1           0.333333  0.333333  0.333333        3
2           0.000000  0.000000  0.000000        1
macro_avg   0.111111  0.111111  0.111111        6
accuracy: 0.166667  macro_auc: 0.194444
class longling.ML.toolkit.formatter.EpochEvalFMT(logger=<RootLogger root (WARNING)>, dump_file: (((<class 'str'>, <class 'pathlib.PurePath'>), (<class '_io.TextIOWrapper'>, <class 'typing.TextIO'>, <class 'typing.BinaryIO'>, <class 'codecs.StreamReaderWriter'>, <class 'fileinput.FileInput'>)), None) = False, col: (<class 'int'>, None) = None, **kwargs)[source]

Examples

>>> import numpy as np
>>> from longling.ML.metrics import classification_report
>>> y_true = np.array([0, 0, 1, 1, 2, 1])
>>> y_pred = np.array([2, 1, 0, 1, 1, 0])
>>> y_score = np.array([
...     [0.15, 0.4, 0.45],
...     [0.1, 0.9, 0.0],
...     [0.33333, 0.333333, 0.333333],
...     [0.15, 0.4, 0.45],
...     [0.1, 0.9, 0.0],
...     [0.33333, 0.333333, 0.333333]
... ])
>>> print(EpochEvalFMT.format(
...     iteration=30,
...     eval_name_value=classification_report(y_true, y_pred, y_score)
... ))    # doctest: +NORMALIZE_WHITESPACE
Epoch [30]
           precision    recall        f1  support
0           0.000000  0.000000  0.000000        2
1           0.333333  0.333333  0.333333        3
2           0.000000  0.000000  0.000000        1
macro_avg   0.111111  0.111111  0.111111        6
accuracy: 0.166667  macro_auc: 0.194444
class longling.ML.toolkit.formatter.EvalFMT(logger=<RootLogger root (WARNING)>, dump_file: (((<class 'str'>, <class 'pathlib.PurePath'>), (<class '_io.TextIOWrapper'>, <class 'typing.TextIO'>, <class 'typing.BinaryIO'>, <class 'codecs.StreamReaderWriter'>, <class 'fileinput.FileInput'>)), None) = False, col: (<class 'int'>, None) = None, **kwargs)[source]

Evaluation-metric formatter: quickly formats evaluation metrics in a fixed layout.

Parameters:
  • logger -- defaults to the root logger
  • dump_file -- when not empty, the results are also written to dump_file
  • col (int) -- the number of metrics placed per line
  • kwargs -- extra parameters for compatibility

Examples

>>> import numpy as np
>>> from longling.ML.metrics import classification_report
>>> y_true = np.array([0, 0, 1, 1, 2, 1])
>>> y_pred = np.array([2, 1, 0, 1, 1, 0])
>>> y_score = np.array([
...     [0.15, 0.4, 0.45],
...     [0.1, 0.9, 0.0],
...     [0.33333, 0.333333, 0.333333],
...     [0.15, 0.4, 0.45],
...     [0.1, 0.9, 0.0],
...     [0.33333, 0.333333, 0.333333]
... ])
>>> print(EvalFMT.format(
...     iteration=30,
...     eval_name_value=classification_report(y_true, y_pred, y_score)
... ))    # doctest: +NORMALIZE_WHITESPACE
Iteration [30]
           precision    recall        f1  support
0           0.000000  0.000000  0.000000        2
1           0.333333  0.333333  0.333333        3
2           0.000000  0.000000  0.000000        1
macro_avg   0.111111  0.111111  0.111111        6
accuracy: 0.166667  macro_auc: 0.194444
longling.ML.toolkit.formatter.result_format(data: dict, col=None)[source]
Parameters:
  • data --
  • col --

Examples

>>> print(result_format({"a": 1, "b": 2}))    # doctest: +NORMALIZE_WHITESPACE
a: 1        b: 2
>>> print(result_format({"a": 1, "b": {"1": 0.1, "2": 0.3}, "c": {"1": 0.4, "2": 0.0}}))
     1    2
b  0.1  0.3
c  0.4  0.0
a: 1
monitor
class longling.ML.toolkit.monitor.EMAValue(value_function_names: (<class 'list'>, <class 'dict'>), smoothing_constant=0.1, *args, **kwargs)[source]

Exponential moving average: smoothing to give progressively lower weights to older values.

\[losses[name] = (1 - c) \times previous\_value + c \times loss\_value\]
>>> ema = EMAValue(["l2"])
>>> ema["l2"]
nan
>>> ema("l2", 100)
>>> ema("l2", 1)
>>> ema["l2"]
90.1
>>> list(ema.values())
[90.1]
>>> list(ema.keys())
['l2']
>>> list(ema.items())
[('l2', 90.1)]
>>> ema.reset()
>>> ema["l2"]
nan
>>> ema = EMAValue(["l1", "l2"])
>>> ema["l2"], ema["l1"]
(nan, nan)
>>> ema.updates({"l1": 1, "l2": 10})
>>> ema.updates({"l1": 10, "l2": 100})
>>> ema["l1"]
1.9
>>> ema["l2"]
19.0
>>> ema = EMAValue(["l1"], smoothing_constant=0.0)
>>> ema["l1"]
nan
>>> ema.updates({"l1": 1})
>>> ema.updates({"l1": 10})
>>> ema["l1"]
1.0
>>> ema = EMAValue(["l1"], smoothing_constant=1.0)
>>> ema.updates({"l1": 1})
>>> ema.updates({"l1": 10})
>>> ema["l1"]
10.0
>>> @as_tmt_value
... def mse_loss(a):
...     return a ** 2
>>> ema = EMAValue({"mse": mse_loss})
>>> ema["mse"]
nan
>>> mse_loss(1)
1
>>> ema["mse"]
1
>>> mse_loss(10)
100
>>> ema["mse"]
10.9
>>> ema = EMAValue({"mse": mse_loss})
>>> mse_loss(1)
1
>>> ema["mse"]
1
>>> ema.monitor_off("mse")
>>> ema.func
{}
>>> mse_loss(10)
100
>>> "mse" not in ema
True
>>> ema.monitor_on("mse", mse_loss)
>>> mse_loss(10)
100
>>> ema["mse"]
100
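The update rule from the formula above, losses[name] = (1 - c) × previous_value + c × loss_value, can be sketched as a standalone function (an illustrative sketch, not the class's actual code; the first observation simply initializes the average):

```python
import math

def ema_update(previous, value, c=0.1):
    # One step of the exponential moving average:
    # new = (1 - c) * previous + c * value.
    # A NaN previous value means "not yet initialized".
    if math.isnan(previous):
        return value
    return (1 - c) * previous + c * value

avg = float("nan")
for v in (100, 1):
    avg = ema_update(avg, v)
# avg == 90.1, matching the EMAValue doctest above
```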
get_update_value(name: str, value: (<class 'float'>, <class 'int'>))[source]
Parameters:
  • name (str) -- The name of the value to be updated
  • value (int or float) -- New value to include in the EMA.
class longling.ML.toolkit.monitor.MovingLoss(value_function_names: (<class 'list'>, <class 'dict'>), smoothing_constant=0.1, *args, **kwargs)[source]

Examples

>>> lm = MovingLoss(["l2"])
>>> lm.losses
{'l2': nan}
>>> lm("l2", 100)
>>> lm("l2", 1)
>>> lm["l2"]
90.1
longling.ML.toolkit.monitor.as_tmt_loss(loss_obj, loss2value=<function <lambda>>)[source]
Parameters:
  • loss_obj --
  • loss2value --

Examples

>>> @as_tmt_loss
... def mse(v):
...     return v ** 2
>>> mse(2)
4
longling.ML.toolkit.monitor.as_tmt_value(value_obj, transform=<function <lambda>>)[source]
Parameters:
  • value_obj --
  • transform --

Examples

>>> def loss_f(a):
...     return a
>>> loss_f(10)
10
>>> tmt_loss_f = as_tmt_value(loss_f)
>>> tmt_loss_f(10)
10
>>> @as_tmt_value
... def loss_f2(a):
...     return a
>>> loss_f2(10)
10

MxnetHelper

It contains four modules in total.

helper

Common helper functions for Mxnet, such as getF, which determines whether an argument passed under hybrid programming is a Symbol or an NDArray.

Full documentation

glue

A template module for quickly building new models. The template files are in glue/ModelName.

Full documentation

toolkit

A toolkit module, including helper functions specifically adapted to the mxnet framework.

Including:

  • running device (cpu | gpu): ctx
  • optimizer configuration: optimizer_cfg
  • network visualization: viz

Full documentation

API reference
Helper Functions
General Toolkit
ctx
embedding
optimizer_cfg
select_exp

Here are some frequently used regular expressions for select in collect_params().

viz
exception longling.ML.MxnetHelper.toolkit.viz.VizError[source]
Glue: Gluon Example
Introduction

Glue (Gluon Example) aims to generate an Mxnet-Gluon neural network model template that can be quickly developed into a mature model. The source code is here.

Installation

It is installed automatically along with the longling package. The installation tutorial can be found here.

Tutorial

With glue, it is possible to quickly construct a model. A demo case can be referred to `here <>`_. The model can be divided into several different functionalities:

  • ETL (extract-transform-load): generate the data stream for the model;
  • Symbol: define the network symbol and how to fit and evaluate it.

Also, variables like the working directory, the path to data, and hyper-parameters are treated as configuration.

Generate template files

Run the following commands to use glue:

# Create a full project including docs and requirements
glue --model_name ModelName
# Or, only create a network model template
glue --model_name ModelName --skip_top

The template files will be generated in the current directory. To change the location of the files, use the --directory option to specify it:

glue --model_name ModelName --directory LOCATION

For more help, run glue --help

Guidance to modify the template
Overview

Usually, the project template consists of doc files and model files. Assume the project name is ModelName by default; the directory of model files then has the same name. The directory tree looks like:

ModelName(Project)
    ----docs
    ----ModelName(Model)

And in ModelName(Model), there is one template file named ModelName.py and a directory containing four sub-template files.

The directory tree is like:

ModelName/
├── __init__.py
├── ModelName.py
└── Module/
    ├── __init__.py
    ├── configuration.py
    ├── etl.py
    ├── module.py
    ├── run.py
    └── sym/
        ├── __init__.py
        ├── fit_eval.py
        ├── net.py
        └── viz.py
  • The `configuration.py <>`_ defines all the parameters that should be configured, such as where to store the model parameters and configuration, and the hyper-parameters of the neural network.
  • The `etl.py <>`_ defines the extract-transform-load process, i.e., the definition of data processing.
  • The `module.py <>`_ serves as a high-level wrapper for sym.py, providing well-written interfaces such as model persistence, the batch loop, the epoch loop and data pre-processing for distributed computation.
  • The `sym.py <>`_ is the minimal model that can be directly used to train and evaluate; it also supports visualization. Some higher-level operations are left out of it for the sake of simplicity and modularity; those are defined in module.py.
Data Stream
  • extract: extract the data from the data src

    def extract(data_src):
        # load data from file; the data format looks like:
        # feature, label
        features = []
        labels = []
        with open(data_src) as f:
            for line in f:
                feature, label = line.split()
                features.append(feature)
                labels.append(label)
        return features, labels

  • transform: convert the extracted data into batch data. Pre-processing such as bucketing can be defined here.

    from mxnet import gluon
    def transform(raw_data, params):
      # define the data transformation interface
      # raw_data --> batch_data
    
      batch_size = params.batch_size
    
      return gluon.data.DataLoader(gluon.data.ArrayDataset(raw_data), batch_size)
    
  • etl: combine the extract and transform together.
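The etl step is just the composition of the two stages above. A minimal self-contained sketch with stand-in extract/transform functions (illustrative only: the real transform returns a gluon DataLoader, and params carries batch_size as an attribute rather than a dict key):

```python
def extract(data_src):
    # Stand-in for the file-reading extract above (hypothetical data).
    return [("f1", 0), ("f2", 1)]

def transform(raw_data, params):
    # Stand-in for the DataLoader-building transform above:
    # chunk raw_data into batches of params["batch_size"].
    bs = params["batch_size"]
    return [raw_data[i:i + bs] for i in range(0, len(raw_data), bs)]

def etl(data_src, params):
    # etl simply chains extract and transform into one data stream.
    return transform(extract(data_src), params)

batches = etl("train.txt", {"batch_size": 1})
```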

Model Construction

Usually, there are three levels of components that need to be configured:

  1. bottom: the network symbol and how to fit and evaluate it;
  2. middle: a higher level that defines the batch and epoch loops, as well as the initialization and persistence of model parameters;
  3. top: the API of the model.
Bottom
configuration

Find the configuration.py and define the configuration variables that you need, for example:

  • begin_epoch
  • end_epoch
  • batch_size

Also, the paths can be configured:

import longling.ML.MxnetHelper.glue.parser as parser
from longling.ML.MxnetHelper.glue.parser import var2exp
import pathlib
import datetime

 # directory configuration
class Configuration(parser.Configuration):
    model_name = str(pathlib.Path(__file__).parents[1].name)

    root = pathlib.Path(__file__).parents[2]
    dataset = ""
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    workspace = ""

    root_data_dir = "$root/data/$dataset" if dataset else "$root/data"
    data_dir = "$root_data_dir/data"
    root_model_dir = "$root_data_dir/model/$model_name"
    model_dir = "$root_model_dir/$workspace" if workspace else root_model_dir
    cfg_path = "$model_dir/configuration.json"

    def __init__(self, params_json=None, **kwargs):
        # set dataset
        if kwargs.get("dataset"):
            kwargs["root_data_dir"] = "$root/data/$dataset"
        # set workspace
        if kwargs.get("workspace"):
            kwargs["model_dir"] = "$root_model_dir/$workspace"

        # rebuild relevant directory or file path according to the kwargs
        _dirs = [
            "workspace", "root_data_dir", "data_dir", "root_model_dir",
            "model_dir"
        ]
        for _dir in _dirs:
            exp = var2exp(
                kwargs.get(_dir, getattr(self, _dir)),
                env_wrap=lambda x: "self.%s" % x
            )
            setattr(self, _dir, eval(exp))

How the variable paths work can be seen `here <>`_.
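The path templates above are plain strings with $-style placeholders. A minimal sketch of this kind of substitution using Python's standard string.Template (var2exp itself additionally resolves placeholders against the Configuration instance; the values here are hypothetical):

```python
from string import Template

# Hypothetical values; in the Configuration class these come from
# instance attributes rather than a plain dict.
env = {"root": "/project", "dataset": "math23k", "model_name": "ModelName"}

# Expand the dataset-dependent data directory first, then reuse it.
env["root_data_dir"] = Template("$root/data/$dataset").substitute(env)
root_model_dir = Template("$root_data_dir/model/$model_name").substitute(env)
# root_model_dir == "/project/data/math23k/model/ModelName"
```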

Refer to the prototype for an illustration, and to the `full documents about Configuration <>`_ for details.

build the network symbol and test it

The network symbol file is `sym.py <>`_

The following variables and functions should be rewritten (marked as **todo**):

Two ways can be used to check whether the network works well:

  1. Visualization:
     • function name: net_viz
  2. Numerical: