Parsing a table header

Question

I need to parse the following table:

tmp =
['+--------------+-----------------------------------------+',
 '| Something to |        Some header with subheader       |',
 '|  watch or    +-----------------+-----------------------+',
 '|     idk      |      First      |   another text again  |',
 '|              |                 |  with one more line   |',
 '|              |                 +-----------------------+',
 '|              |                 |  and this | how it be |',
 '+--------------+-----------------+-----------------------+']

This is simply the result of reading the file that contains the table line by line.

It should be turned into this: ['Something to watch or idk', 'Some header with subheader First', 'Some header with subheader another text again with one more line and this', 'Some header with subheader another text again with one more line how it be']

Here is my attempt:

import re

pluses = [i for i, element in enumerate(tmp) if element[0] == '+']
tmp2 = tmp[pluses[0]:pluses[1]+1].copy()
table_str=''.join(tmp[pluses[0]:pluses[1]+1])
col=[[i for i, symbol in enumerate(line) if symbol == '+' or symbol == '|'] for line in tmp2]

tmp3=[]
strt = ''.join(tmp2.copy())
table_list = [l.strip().replace('\n', '') for l in re.split(r'\+[+-]+', strt) if l.strip()]
for row in table_list:
    joined_row = ['' for _ in range(len(row))]
    for lines in [line for line in row.split('||')]:
        line_part = [i.strip() for i in lines.split('|') if i]
        joined_row = [i + j for i, j in zip(joined_row, line_part)]
        tmp3.append(joined_row)

The output is:

tmp3
out[4]:
[['Something to', 'Some header with subheader'],
 ['Something towatch or'],
 ['idk', 'First', 'another text again'],
 ['idk', 'First', 'another text againwith one more line'],
 ['idk'],
 ['', '', 'and this', 'how it be']]

All that is left is to join these pieces correctly (and tidy them up a bit), but I can't figure out how!

There is also another option (an idea): detecting the table borders:

col=[[i for i, symbol in enumerate(line) if symbol == '+' or symbol == '|'] for line in tmp2]
[[0, 15, 57],
 [0, 15, 57],
 [0, 15, 33, 57],
 [0, 15, 33, 57],
 [0, 15, 33, 57],
 [0, 15, 33, 57],
 [0, 15, 33, 45, 57],
 [0, 15, 33, 57]]

And then somehow go "cell by cell", but I can't get that to work.

Answer 1

Thanks to @hoboman and @KarlT

The solution was posted here and here

Solution:

def parse_table(table, header='', root='', table_len=None):
    # store length of original table
    if not table_len:
        table_len = len(table)

    # end of current "column"
    col = table[0].find('+', 1)
    rows = [
        row for row in range(1, len(table))
            if  table[row].startswith('+')
            and table[row][col] == '+'
    ]
    row = rows[0]

    # split "line" contents into columns
    # end of "line" is either `+` or final `|`
    end = col
    num_cols = table[0].count('+')
    if num_cols != table[1].count('|'):
        end = table[1].rfind('|')
    columns = (line[1:end].split('|') for line in table[1:row])

    # rebuild each column appending to header
    content = [
        ' '.join([header] + [line.strip() for line in lines]).strip()
        for lines in zip(*columns)
    ]

    # is there a table below?
    if row + 2 < len(table):
        header = content[-1]
        # if we are not the last table - we are a header
        if len(rows) > 1:
            header = content.pop()
        # if we are the first table in column - we are the root
        if not root:
            root = header
        next_table = [line[:col + 1] for line in table[row:]]
        content.extend(
            parse_table(
                next_table,
                header=header,
                root=root,
                table_len=table_len
            )
        )

    # is there a table to the right?
    if col + 2 < len(table[0]):
        # find start line of next table
        row = next(
            row for row, line in enumerate(table, start=-1)
                if line[col] == '|'
        )
        next_table = [line[col:] for line in table[row:]]
        # new top-level table - reset root
        if len(next_table) == table_len:
            root = ''
        # next table on same level - reset header
        if len(table) == len(next_table):
            header = root
        content.extend(
            parse_table(
                next_table,
                header=header,
                root=root,
                table_len=table_len
            )
        )

    return content

Example:

tmp = [
    '+--------------+-----------------------------------------+',
    '| Something to |        Some header with subheader       |',
    '|  watch or    +-----------------+-----------------------+',
    '|     idk      |      First      |   another text again  |',
    '|              |                 |  with one more line   |',
    '|              |                 +-----------------------+',
    '|              |                 |  and this | how it be |',
    '+--------------+-----------------+-----------------------+'
]
print(parse_table(tmp))
# ['Something to watch or idk', 'Some header with subheader First', 'Some header with subheader another text again with one more line and this', 'Some header with subheader another text again with one more line how it be']

Answer 2

We can only parse reliably by moving along the lines from one junction to the next. Anything else risks misinterpretation. For example, without additional knowledge about the table, we cannot say with confidence that the following example contains two cells rather than one:

+------------------------+
| Count of | in messages |
+------------------------+
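To make the ambiguity concrete, here is a minimal sketch (the variable names are illustrative and not part of the original answer). It counts the cells in two ways: by the junctions on the border line and by a naive split of the content line on `|`. The two counts disagree.

```python
ambiguous = [
    '+------------------------+',
    '| Count of | in messages |',
    '+------------------------+',
]

# Cells according to the border line: each pair of neighbouring '+' delimits one cell.
cells_by_border = ambiguous[0].count('+') - 1

# Cells according to a naive split of the content line on '|'.
cells_by_split = [part.strip() for part in ambiguous[1].strip('|').split('|')]

print(cells_by_border)  # 1
print(cells_by_split)   # ['Count of', 'in messages']
```

The border says there is one cell, the split says two, and nothing in the text itself tells us which reading is correct.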

So the first step should be to manually add the missing pluses that mark where edges intersect:

tmp = \
['+--------------+-----------------------------------------+',
 '| Something to |        Some header with subheader       |',
 '|  watch or    +-----------------+-----------------------+',
 '|     idk      |      First      |   another text again  |',
 '|              |                 |  with one more line   |',
 '|              |                 +-----------+-----------+',
 '|              |                 |  and this | how it be |',
 '+--------------+-----------------+-----------+-----------+']
                        # 2 pluses added here ^^^
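If it is known in advance that every `|` in the table is a border and never part of the cell text, this manual step could probably be automated. The sketch below relies on exactly that assumption; the helper `add_missing_junctions` is hypothetical and not part of the original answer. It replaces a `-` with `+` wherever a `|` touches it from above or below.

```python
def add_missing_junctions(lines):
    """Return a copy of `lines` where '-' becomes '+' wherever a '|' touches it."""
    fixed = [list(line) for line in lines]
    for r, line in enumerate(lines):
        for c, ch in enumerate(line):
            if ch != '-':
                continue
            # Assumption: any '|' directly above or below a '-' is a real border,
            # not a literal character inside the cell text.
            above = r > 0 and c < len(lines[r - 1]) and lines[r - 1][c] == '|'
            below = (r + 1 < len(lines) and c < len(lines[r + 1])
                     and lines[r + 1][c] == '|')
            if above or below:
                fixed[r][c] = '+'
    return [''.join(chars) for chars in fixed]


original = [
    '+--------------+-----------------------------------------+',
    '| Something to |        Some header with subheader       |',
    '|  watch or    +-----------------+-----------------------+',
    '|     idk      |      First      |   another text again  |',
    '|              |                 |  with one more line   |',
    '|              |                 +-----------------------+',
    '|              |                 |  and this | how it be |',
    '+--------------+-----------------+-----------------------+',
]
repaired = add_missing_junctions(original)
print(repaired[5])
print(repaired[7])
```

Note that this resolves the ambiguity described earlier by decree: a literal `|` inside a header text would be turned into a cell border.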

The second constraint is the branching rules. We have to understand the cell hierarchy somehow. For example, we agree that columns may only be split downward, and only within the upper (parent) cell.
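One possible way to check the "downward only" part of this agreement is sketched below. This formalization is an assumption, not something stated in the original answer: once a vertical border starts in some column, it has to run without gaps down to the last line. The function name `splits_only_downward` is hypothetical, and all lines are assumed to have the same length.

```python
def splits_only_downward(lines):
    """Check that every vertical border, once started, reaches the bottom line."""
    last = len(lines) - 1
    for c in range(len(lines[0])):
        # Rows in which column c holds a border character.
        rows = [r for r, line in enumerate(lines) if line[c] in '|+']
        # The border rows must form one unbroken run ending at the last line.
        if rows and rows != list(range(rows[0], last + 1)):
            return False
    return True


good = [
    '+--------------+-----------------------------------------+',
    '| Something to |        Some header with subheader       |',
    '|  watch or    +-----------------+-----------------------+',
    '|     idk      |      First      |   another text again  |',
    '|              |                 |  with one more line   |',
    '|              |                 +-----------+-----------+',
    '|              |                 |  and this | how it be |',
    '+--------------+-----------------+-----------+-----------+',
]

# Two columns merging into one below: this breaks the agreement.
merged = [
    '+-----+-----+',
    '|  a  |  b  |',
    '+-----+-----+',
    '|     c     |',
    '+-----------+',
]

print(splits_only_downward(good))    # True
print(splits_only_downward(merged))  # False
```

The check does not verify that child cells stay within their parent; it only rejects borders that stop before the bottom of the header.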

Now we can collect the branches, starting from the top-left corner:

find the nearest junctions to the right and below
collect the message within the borders found
recursively append the content below, limiting horizontal movement to the width of the current cell
repeat after moving to the junction on the right

def collect_headers(data, top, left, right_limit = None):
    '''Yield joined multi-index ASCII headers top through bottom, left to right.
    data - a list of strings, representing ASCII headers;
    top, left - the upper-left corner to start from;
    right_limit - the rightmost border of the parent cell.
    '''
    if right_limit is None:
        right_limit = len(data[0]) - 1
    if top == len(data) - 1 or left == right_limit:
        return    # reached the lowest or rightmost border  
    # collect the message in the cell
    msg = []
    right = left + 1
    while data[top][right] != '+':
        right += 1
    bottom = top + 1
    while data[bottom][left] != '+':    
        line = data[bottom][left+1:right].strip()
        if line:
            msg.append(line)
        bottom += 1
    msg = ' '.join(msg)
    # combine message with child messages underneath
    leafs = [*collect_headers(data, bottom, left, right)]
    if leafs:
        for leaf_msg in leafs:
            yield msg + ' ' + leaf_msg
    else:
        yield msg
    # move to the right cell, staying within the parent cell
    yield from collect_headers(data, top, right, right_limit)

answer = [*collect_headers(tmp, 0, 0)]

Result:

['Something to watch or idk',
 'Some header with subheader First',
 'Some header with subheader another text again with one more line and this',
 'Some header with subheader another text again with one more line how it be']
