How do I strip HTML markup when parsing an XML file?

Question

I'm trying to build an xlsx file from an XML file for further processing, but in the output the news_text block contains a large number of HTML tags and ampersand entities. How can I get rid of all the HTML tags, given that they differ from row to row? The result should be an xlsx file with 4 columns (the first three are fine), with the 4th column holding clean text without any HTML tags.

from bs4 import BeautifulSoup
import requests
import pandas as pd

fd = open('news.xml', 'r', encoding='utf-8')

xml_file = fd.read()
soup = BeautifulSoup(xml_file, features='lxml-xml')
#print(soup)


nid = soup.find_all('field', {'name': 'nid'})
date = soup.find_all('field', {'name': 'publ_date'})
title = soup.find_all('field', {'name': 'news_title'})
text = soup.find_all('field', {'name': 'news_text'})

currencies = []
for i in range(0, len(nid)):
    rows = [nid[i].get_text(),
           date[i].get_text(),
           title[i].get_text(),
           text[i].get_text()]
    currencies.append(rows)

#display(currencies[:4])

news = pd.DataFrame(currencies,
                    columns=['Nid','Date','Title','Text'])

news.to_excel('sgu.xlsx',
              index=False)

Answer 1

To remove HTML tags from the text, the get_text() method from the BeautifulSoup library can help. You can call it on each element of text before adding it to the currencies list.

for i in range(0, len(nid)):
    text_without_html = text[i].get_text()
    rows = [nid[i].get_text(),
           date[i].get_text(),
           title[i].get_text(),
           text_without_html]
    currencies.append(rows)

Or, more compactly:

text = [i.get_text() for i in soup.find_all('field', {'name': 'news_text'})]

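This should remove the tags that the XML parser recognises as elements and leave plain text in the 4th column. Markup stored in the field as escaped text (&lt;p&gt; and the like) is returned by get_text() unchanged. Full code: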

from bs4 import BeautifulSoup
import pandas as pd

with open('news.xml', 'r', encoding='utf-8') as fd:
    xml_file = fd.read()
    soup = BeautifulSoup(xml_file, 'lxml-xml')

nid = [i.get_text() for i in soup.find_all('field', {'name': 'nid'})]
date = [i.get_text() for i in soup.find_all('field', {'name': 'publ_date'})]
title = [i.get_text() for i in soup.find_all('field', {'name': 'news_title'})]
text = [i.get_text() for i in soup.find_all('field', {'name': 'news_text'})]

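# One row per news item, in the same order as the fields appear in the file
currencies = [[nid[i], date[i], title[i], text[i]] for i in range(len(nid))]

# The columns hold text, so dtype=float is not passed here
news = pd.DataFrame(currencies, columns=['Nid', 'Date', 'Title', 'Text'])

news.to_excel('sgu.xlsx', index=False)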
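
If the tags are still there after get_text(), one likely reason is that the HTML sits inside the field as escaped text, so the XML parser hands it back as an ordinary string. A second pass over that string with an HTML parser may help. Below is a minimal sketch using only the standard library. The sample XML, the TextExtractor class and the strip_html helper are made up for illustration, since the real news.xml is not shown in the question. With BeautifulSoup the same second pass would be BeautifulSoup(raw, 'html.parser').get_text().

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: the structure of the real news.xml is an assumption.
SAMPLE_XML = """<rows>
  <row>
    <field name="nid">1</field>
    <field name="news_text">&lt;p&gt;Hello,&amp;nbsp;&lt;b&gt;world&lt;/b&gt;!&lt;/p&gt;</field>
  </row>
</rows>"""


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; for us
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    joined = "".join(parser.parts).replace("\xa0", " ")
    # Collapse runs of whitespace left behind by the removed markup
    return " ".join(joined.split())


root = ET.fromstring(SAMPLE_XML)
# First pass: the XML parser unescapes the field, leaving literal HTML tags
raw = root.find(".//field[@name='news_text']").text
# Second pass: the HTML parser drops the tags and decodes the entities
clean = strip_html(raw)
```

In the full code above, the same idea would mean wrapping each news_text value in strip_html() before it goes into the currencies list.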

Answer 2

I managed to solve the problem as follows: first parse everything into an xlsx file, then remove the HTML tags from that file with this code:

import re
import openpyxl

workbook = openpyxl.load_workbook('file_full.xlsx')
sheet = workbook.active

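# Non-greedy pattern that matches anything that looks like a tag
tag_re = re.compile(r'<.*?>')

for row in sheet.iter_rows():
    for cell in row:
        # Only touch text cells, so empty cells and numbers keep their values
        if isinstance(cell.value, str):
            cell.value = tag_re.sub('', cell.value)

workbook.save('cleaned_file_full.xlsx')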
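
A regular expression removes the tags but leaves entities such as &nbsp; or &amp; in the text. If those also need to go, a per-cell helper along these lines might work. This is a sketch, not a tested replacement: clean_cell and TAG_RE are illustrative names, and it assumes the markup is simple enough for a regex to handle.

```python
import html
import re

# Matches anything that looks like a tag: "<", one or more non-">" chars, ">"
TAG_RE = re.compile(r"<[^>]+>")


def clean_cell(value):
    """Strip tags and decode entities in strings; return other values unchanged."""
    if not isinstance(value, str):
        return value
    # Remove tags first, so an escaped "&lt;" in the text is not mistaken for a tag
    no_tags = TAG_RE.sub("", value)
    # Turn &nbsp;, &amp;, &lt; and similar entities into real characters
    decoded = html.unescape(no_tags)
    # Collapse whitespace, including the non-breaking spaces produced by &nbsp;
    return " ".join(decoded.split())
```

Inside the loop above it would be used as cell.value = clean_cell(cell.value).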
