ChaseDream
Title: Web Scraping Internship Summary
Author: 拆不散D大熊小熊
Time: 2020-7-21 13:22
https://www.nasdaq.com/market-activity/stocks/mmm/news-headlines
1. All 3,000 detail pages share the same structure, and in the requirement every list page shares the same structure too. So solve the list page first, then the detail page.
2. This is the list page. The items in the red box are the detail pages you need to click into. You should learn some HTML first: every piece of data on a web page has its own position, and most of it is laid out in a regular pattern.
[attach]251689[/attach] (screenshot: the news-headlines list page, with the detail-page links highlighted in red)
3. This is MMM's stock news list, but if we switch to another ticker's list, the structure does not change; only the data changes. (It's like baking a cake: there are square pans and round pans, and whichever cake I want to bake, I pick the matching pan, rather than making a new pan every time I bake a cake.) So for those 3,000 detail pages, one piece of code is enough; see the sketch right after this list.
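To make the "same structure, different data" point concrete, here is a minimal sketch. The two HTML snippets and the "news" class are made up for illustration, not taken from the Nasdaq page; the point is that the exact same find call works on both pages, because only the data changes, never the structure.

from bs4 import BeautifulSoup

# Two pages with identical structure but different data,
# like two cakes baked in the same pan.
page_mmm = '<ul class="news"><li><a href="/news/mmm-1">MMM headline</a></li></ul>'
page_aapl = '<ul class="news"><li><a href="/news/aapl-1">AAPL headline</a></li></ul>'

for page in (page_mmm, page_aapl):
    soup = BeautifulSoup(page, 'lxml')
    # One selector (tag + class) serves every page of this shape.
    link = soup.find("ul", class_="news").find("a")
    print(link.get("href"), link.get_text())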
How do you get the links to those 3,000 detail pages? They have to come from the list page: since you can already get the HTML that the list page returns, you can parse that HTML and pull out the elements you need.
You already know how to send a request, and you already know how to find the request URL.
So apart from parsing the page content, you now have everything you need to collect the data.
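As a quick illustration of "parse the list HTML, pull out the detail links", here is a minimal sketch. The URL and the markup are stand-ins, not the real Nasdaq page, and it uses the requests library on a static page; the full script further down uses Selenium instead, because the Nasdaq pages render their content with JavaScript.

import requests
from bs4 import BeautifulSoup

# Stand-in URL for illustration only; not the real Nasdaq page.
html = requests.get("https://example.com/news").text
soup = BeautifulSoup(html, "lxml")

detail_links = []
for a in soup.find_all("a"):
    href = a.get("href")
    if href and href.startswith("/"):
        # Relative links need the site prefix before they can be requested.
        detail_links.append("https://example.com" + href)
print(detail_links)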
There is one very important idea when writing code: think in modules first.
Get the list-page HTML ---> extract the detail-page links ---> get the detail-page HTML ---> extract the detail-page data
Take these four steps, break down the first one, then move on to the second, solving them one at a time. The script below follows exactly this decomposition.
from selenium import webdriver
from bs4 import BeautifulSoup
import csv


def go_driver(url):
    """Open the page in Chrome and return its fully rendered HTML."""
    option = webdriver.ChromeOptions()
    # Enable headless mode (no visible browser window) by uncommenting:
    # option.add_argument('--headless')
    driver = webdriver.Chrome(executable_path="C://Webdriver/chromedriver.exe",
                              options=option)
    driver.implicitly_wait(60)
    driver.get(url)
    html = driver.page_source
    driver.quit()
    return html


def get_detail(url, ticker):
    """Parse one detail page and append its fields to the output csv."""
    detail_html = go_driver(url)
    new_html = BeautifulSoup(detail_html, 'lxml')
    title = new_html.find("h1", class_="article-header__headline").get_text()
    publish_date = new_html.find("time", class_="timestamp__date").get_text()
    body = new_html.find("div", class_="body__content").get_text()
    data = [(ticker, title, publish_date, url, body.strip())]
    # writer is the module-level csv writer created in __main__
    writer.writerows(data)


def get_html(ticker):
    """Fetch a ticker's news-headlines list page and crawl every detail link."""
    html = go_driver('https://www.nasdaq.com/market-activity/stocks/'
                     + str(ticker) + '/news-headlines')
    soup_html = BeautifulSoup(html, 'lxml')
    headlines_list = soup_html.find("ul", class_="quote-news-headlines__list")
    items = headlines_list.find_all("li")
    for x in items:
        href = x.find("a").get("href")
        url = "https://www.nasdaq.com" + href
        print(url)
        get_detail(url, ticker)


def read_csv():
    """Read the ticker symbols (second column), skipping the header row."""
    with open('sp500_tickers.csv', 'r') as f:
        reader = csv.reader(f)
        name_list = []
        for row in reader:
            name_list.append(row[1])
            # print(row)
        return name_list[1:]


if __name__ == '__main__':
    # Create the output csv and write the header row
    write_csv = open('movies.csv', 'a+', newline='', encoding="gb18030")
    writer = csv.writer(write_csv)
    writer.writerow(["ticker", "title", "publish_date", "url", "body"])
    # Get the tickers; the [501:] slice resumes from where a previous run stopped
    reader = read_csv()
    for r in reader[501:]:
        print(r)
        get_html(r)
    # Close the output file
    write_csv.close()
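One portability note on go_driver: newer Selenium releases (4.x) deprecate the executable_path argument to webdriver.Chrome in favor of a Service object. On Selenium 4, the driver setup would look roughly like this, with the same assumed chromedriver path as above:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

option = webdriver.ChromeOptions()
# option.add_argument('--headless')

# Selenium 4 style: wrap the chromedriver path in a Service object
# instead of passing executable_path directly.
service = Service("C://Webdriver/chromedriver.exe")
driver = webdriver.Chrome(service=service, options=option)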