How do I remove extra characters when parsing?
I just want clean, readable output, without the run time shown.
Code:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

url = 'https://books.toscrape.com/'
headers = {"User-Agent": UserAgent().random}
# Pass headers by keyword: the second positional argument of requests.get is params
full_page = requests.get(url, headers=headers)
soup = BeautifulSoup(full_page.text, 'html.parser')
# Collect every link on the page
titles = soup.find_all('a')
for title in titles:
    print(title.text)
Output:
# Books to Scrape
# Home
# Books
# Travel
# Mystery
# Historical Fiction
# Sequential Art
# Classics
# Philosophy
# Romance
# Womens Fiction
# Fiction
# Childrens
# Religion
# Nonfiction
# Music
# Default
# Science Fiction
# Sports and Games
# Add a comment
# Fantasy
# New Adult
# Young Adult
# Science
# Poetry
# Paranormal
# Art
# Psychology
# Autobiography
# Parenting
# Adult Fiction
# Humor
# Horror
# History
# Food and Drink
# Christian Fiction
# Business
# Biography
# Thriller
# Contemporary
# Spirituality
# Academic
# Self Help
# Historical
# Christian
# Suspense
# Short Stories
# Novels
# Health
# Politics
# Cultural
# Erotica
# Crime
# A Light in the ...
# Tipping the Velvet
# Soumission
# Sharp Objects
# Sapiens: A Brief History ...
# The Requiem Red
# The Dirty Little Secrets ...
# The Coming Woman: A ...
# The Boys in the ...
# The Black Maria
# Starving Hearts (Triangular Trade ...
# Shakespeare's Sonnets
# Set Me Free
# Scott Pilgrim's Precious Little ...
# Rip it Up and ...
# Our Band Could Be ...
# Olio
# Mesaerion: The Best Science ...
# Libertarianism for Beginners
# It's Only the Himalayas
# next
# [Finished in 5.6s]
Answers (1):
Solution author: CrazyElf
Well, for example, like this:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

url = 'https://books.toscrape.com/'
headers = {"User-Agent": UserAgent().random}
# Pass headers by keyword: the second positional argument of requests.get is params
full_page = requests.get(url, headers=headers)
soup = BeautifulSoup(full_page.content, 'html.parser')
titles = soup.find_all('a')
for title in titles:
    # Only the book links carry a title attribute with the full, untruncated name
    if 'title' in title.attrs:
        print(title.attrs['title'])
Output:
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas
I simply looked at what the variable title holds inside the loop and searched it for the attribute I needed. It contains something like this:
<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
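The same inspection can be reproduced on that one tag. This is a minimal sketch that parses the fragment from a string instead of fetching the site; the variable names html and link are only for illustration.

```python
from bs4 import BeautifulSoup

# The sample link from above, parsed from a string (no network request needed).
html = ('<a href="catalogue/a-light-in-the-attic_1000/index.html" '
        'title="A Light in the Attic">A Light in the ...</a>')
link = BeautifulSoup(html, 'html.parser').a

print(link.text)          # the visible, truncated link text
print(link.attrs)         # dictionary of all attributes on the tag
print(link.get('title'))  # the full title, or None if the attribute is absent
```

Using link.get('title') instead of link['title'] avoids a KeyError on links that have no title attribute, such as the category links.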
In fact, BS should have an even simpler way to do this; it just takes a read through the documentation.
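One possible simpler variant, as a sketch rather than the definitive way: find_all accepts an attribute filter such as title=True, which keeps only tags that have that attribute, and select accepts the CSS attribute selector a[title]. The HTML below is a made-up stand-in for the real page, so the snippet runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: one navigation link and one book link.
html = '''
<a href="index.html">Home</a>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
'''
soup = BeautifulSoup(html, 'html.parser')

# title=True matches only <a> tags that have a title attribute.
titles = [a['title'] for a in soup.find_all('a', title=True)]

# The same filter written as a CSS attribute selector.
titles_css = [a['title'] for a in soup.select('a[title]')]

print(titles)
```

Both forms skip the navigation and category links, so the explicit check against title.attrs inside the loop is no longer needed.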