Can't parse a page with BeautifulSoup by processing the tags sequentially

Question

I can't get a page to parse properly. The page is large, and I need to go through all the TD tags in order and pull out the information. Part of the page looks like this:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<table class="short_table">
    <tr>
        <td class="td_no email">
        </td>
        <td class="bla_bla">Assa</td>
    </tr>
    <tr>
        <td class="td_mail">
            <div class="div_email">[email protected]</div>
        </td>
        <td class="bla_bla">Vasya</td>
    </tr>
    <tr>
        <td class="td_mail">
            <div class="div_email">[email protected]</div>
        </td>
        <td class="bla-bla">Ludovick</td>
    </tr>
    <tr>
        <td class="td_no email">
        </td>
        <td class="bla_bla">Ivan</td>
    </tr>
</table>
</body>
</html>

My attempt at the code looks like this:

from bs4 import BeautifulSoup

html_file = open('index.html', encoding='utf-8')
soup = BeautifulSoup(html_file, 'html.parser')
table = soup.findAll('td')
email = soup.find('td', class_='td_mail')
no_email = soup.find('td', class_='td_no email')
name = soup.find('td', class_='bla-bla')

for tags in table:
    if tags == no_email:
        email_field = "None"
    if tags == email:
        email_field = tags.text
    if tags == name:
        name == tags.text

    print(email_field + '\t' + str(name.text))

I would like the result to look like this:

None    Assa
[email protected]    Vasya
[email protected]    Ludovick
None    Ivan

Answer 1

from bs4 import BeautifulSoup

txt = '''<!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Title</title>
    </head>
    <body>
    <table class="short_table">
        <tr>
            <td class="td_no email">
            </td>
            <td class="bla_bla">Assa</td>
        </tr>
        <tr>
            <td class="td_mail">
                <div class="div_email">[email protected]</div>
            </td>
            <td class="bla_bla">Vasya</td>
        </tr>
        <tr>
            <td class="td_mail">
                <div class="div_email">[email protected]</div>
            </td>
            <td class="bla-bla">Ludovick</td>
        </tr>
        <tr>
            <td class="td_no email">
            </td>
            <td class="bla_bla">Ivan</td>
        </tr>
    </table>
    </body>
    </html>'''

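soup = BeautifulSoup(txt, 'html.parser')
# Each <tr> in this table holds exactly two <td> cells: the email cell and the name cell
for row in soup.find_all('tr'):
    email, name = [cell.text.strip() for cell in row.find_all('td')]
    # An empty email cell is printed as the string "None"
    print((email if email else "None") + '\t' + name)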

None    Assa
[email protected]    Vasya
[email protected]    Ludovick
None    Ivan
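
As for why the attempt in the question did not work: soup.find() returns only the first matching tag, so email, no_email and name each point at a single cell; the line name == tags.text is a comparison, not an assignment; and print runs once per td rather than once per row. Going row by row avoids all three problems. If the real page may contain rows with a different number of cells, a slightly more defensive variant could look like the sketch below. It assumes the email always sits in a div with class div_email and the name is the last cell of the row. The function name extract_rows and the example.com address are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
# The address is a placeholder, not the real data.
html = """
<table class="short_table">
    <tr>
        <td class="td_no email">
        </td>
        <td class="bla_bla">Assa</td>
    </tr>
    <tr>
        <td class="td_mail">
            <div class="div_email">vasya@example.com</div>
        </td>
        <td class="bla_bla">Vasya</td>
    </tr>
</table>
"""


def extract_rows(markup):
    """Return a list of (email, name) tuples, one per table row (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 2:
            # Skip rows that do not have both an email cell and a name cell
            continue
        # Search inside the current row only, not the whole document
        email_div = tr.find("div", class_="div_email")
        email = email_div.get_text(strip=True) if email_div else "None"
        # Assumption: the name is always in the last cell of the row
        name = cells[-1].get_text(strip=True)
        rows.append((email, name))
    return rows


rows = extract_rows(html)
for email, name in rows:
    print(email + "\t" + name)
```

Looking the email up by its class instead of by position means a row with an extra cell no longer breaks the unpacking, at the cost of depending on the div_email class name staying the same.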
