
Practical guide: web scraping and extracting data from websites


1 What is web scraping

Web scraping is a technique for extracting data from websites; it converts unstructured data into structured data.

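The purpose of web scraping is to extract data from websites. The extracted data can be saved to a local file on your system or stored in a database in tabular form. A web scraper accesses the World Wide Web (WWW) directly over HTTP or through a web browser. Scraping pages with a crawler or bot is an automated process.

Scraping a web page has two stages: fetching the page and extracting data from it. A web fetcher retrieves the page and is a required component of any scraper. Once the page has been fetched, its data can be extracted: we can search and parse the page, save the extracted data to a table, and then reformat it.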
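The extract-and-store stages described above can be sketched with the standard library alone. This is a minimal illustration, not the article's own method: it swaps BeautifulSoup for Python's built-in html.parser so that it runs without extra packages or network access, and the sample HTML, the LinkCollector class, and the links table are all hypothetical names chosen for the example.

```python
import csv
import io
import sqlite3
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append(("".join(self._text).strip(), self._href))
            self._href = None


# Stand-in for a page that has already been fetched (hypothetical content).
SAMPLE_HTML = '<ul><li><a href="/a">Alpha</a></li><li><a href="/b">Beta</a></li></ul>'

# Extraction: turn unstructured HTML into structured rows.
collector = LinkCollector()
collector.feed(SAMPLE_HTML)
rows = collector.rows

# Storage option 1: a CSV table (written to memory here; a file works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "href"])
writer.writerows(rows)
csv_text = buffer.getvalue()

# Storage option 2: a database table (in-memory SQLite for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (text TEXT, href TEXT)")
conn.executemany("INSERT INTO links VALUES (?, ?)", rows)
stored = conn.execute("SELECT text, href FROM links ORDER BY href").fetchall()
```

In a real scraper the SAMPLE_HTML string would be replaced by the body of a fetched page, and the CSV would be written to a file on disk.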

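2 Data extraction

In this section we look at data extraction. We can use Python's BeautifulSoup library to extract data; we also need Python's Requests library.

Run the following commands to install Requests and BeautifulSoup.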

$ pip3 install requests

$ pip3 install beautifulsoup4

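2.1 The Requests library

The Requests library lets you use HTTP from Python scripts in a readable way. Here we use it to download a web page. Requests supports several types of request; here we use a GET request, which retrieves information from a web server, in this case the HTML content of a given page. Every request has a status code, returned by the server, that tells us the outcome of the request. Some of the status codes are listed below.

301: the resource has moved permanently, for example because the server has switched domain names or the endpoint name has changed; the server redirects the client to the other endpoint.

400: the client sent a bad request.

401: the client is not authenticated.

403: the client is trying to access a forbidden resource.

404: the resource the client is trying to access is not available on the server.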
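The status codes listed above can be looked up by number with the standard library's http.HTTPStatus enumeration. The sketch below is only an illustration: describe_status and is_client_error are hypothetical helper names, and in a Requests script the number would typically come from the status_code attribute of the response object.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found' for a status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not a registered HTTP status.
        return f"{code} Unknown status"
    return f"{code} {status.phrase}"


def is_client_error(code):
    """True for 4xx codes, which signal a problem with the request itself."""
    return 400 <= code < 500


for code in (301, 400, 401, 403, 404):
    print(describe_status(code), "| client error:", is_client_error(code))
```

Keeping the check in a small function makes it easy to decide, before parsing, whether a response is worth processing at all.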

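2.2 The BeautifulSoup library

BeautifulSoup is also a Python library. It provides simple methods for searching, navigating, and modifying a parsed document. It is simply a toolkit for extracting the data you need from a web page.

To use Requests and BeautifulSoup in a script, import both modules with import statements. Let us now look at an example program that parses a web page; here we parse a news page from the Baidu website. Create a script named parse_web_page.py and write the following code in it.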
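As a small illustration of the three kinds of operation mentioned above (searching, navigating, and modifying), the sketch below works on an inline HTML string rather than a live page. The markup and variable names are made up for the example; it assumes the beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = '<div class="story"><h1>Title</h1><p>First</p><p>Second</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Searching: find() returns the first matching tag.
heading = soup.find("h1")

# Navigating: move from a tag to its parent, then search inside that parent.
container = heading.parent
paragraphs = [p.get_text() for p in container.find_all("p")]

# Modifying: replace the text content of a tag in the parsed tree.
heading.string = "New title"

print(container["class"])   # the class attribute is returned as a list
print(paragraphs)
print(soup.h1.get_text())
```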

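import requests
from bs4 import BeautifulSoup

# The URL is truncated in the source ('ws.baidu'); substitute the full address of the news page.
page_result = requests.get('ws.baidu')
# Parse the response body (page_result.text) with Python's built-in HTML parser.
parse_obj = BeautifulSoup(page_result.text, 'html.parser')
print(parse_obj)

Run the script as follows.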

student@ubuntu:~/work$ python3 parse_web_page.py

Output:<!DOCTYPE html>

<html xmlns:fb="www.facebook/2008/fbml"

xmlns:og="/ns#">

<head>

name="apple-itunes-app"/>

Date().getTime(),pt:'java'};</script>

if (typeof uet == 'function') {

uet("bb", "LoadTitle", {wb: 1});

}

</script>

new Date().getTime(); })(IMDbTimer);</script>

new Date().getTime(); })(IMDbTimer);</script>

if (typeof uet == 'function') {

uet("be", "LoadTitle", {wb: 1});

}

</script>

if (typeof uex == 'function') {

uex("ld", "LoadTitle", {wb: 1});

}

</script>

if (typeof uet == 'function') {

uet("bb", "LoadIcons", {wb: 1});

}

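The example above fetches a web page and parses it with BeautifulSoup. It first imports the requests and BeautifulSoup modules, then accesses the URL with a GET request and assigns the result to the page_result variable. Next it creates a BeautifulSoup object, parse_obj, which takes the text of the response returned by requests (page_result.text) as its argument and parses the page with html.parser.

Next we extract data by class and by tag. In your web browser, right-click the content you want to extract and scroll down to the "Inspect" option; clicking it reveals the class name. Specify that class name in the program and run the script. Create a script named extract_from_class.py and write the following code in it.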
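The fetch-then-parse flow described above can be made a little more defensive by checking the response before parsing it. The following is a sketch under stated assumptions, not the article's script: fetch_html and parse_html are hypothetical helper names, the 10-second timeout is an arbitrary choice, and https://example.com/ is a placeholder address.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download url with a GET request and return the response body as text."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # so an error page is never passed on to the parser by mistake.
    response.raise_for_status()
    return response.text


def parse_html(html_text):
    """Build a BeautifulSoup tree using Python's built-in HTML parser."""
    return BeautifulSoup(html_text, "html.parser")


# Example usage (placeholder URL; requires network access):
#     try:
#         soup = parse_html(fetch_html("https://example.com/"))
#         print(soup.title)
#     except requests.RequestException as exc:
#         print(f"Request failed: {exc}")
```

Splitting fetching from parsing also means the parsing step can be tried out on a saved or hand-written HTML string without touching the network.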

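import requests
from bs4 import BeautifulSoup

# The URL is truncated in the source ('ws.baidu'); substitute the full address of the news page.
page_result = requests.get('ws.baidu')
parse_obj = BeautifulSoup(page_result.text, 'html.parser')
# find() returns the first element whose class is 'news-article__content'.
top_news = parse_obj.find(class_='news-article__content')
print(top_news)

Run the script as follows.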

student@ubuntu:~/work$ python3 extract_from_class.py

Output :<div class="news-article__content">

<a href="/name/nm4793987/">Issa Rae</a> and <a

href="/name/nm0000368/">Laura Dern</a> are teaming up to star in a limited

series called "The Dolls" currently in development at <a

href="/company/co0700043/">HBO</a>.<br/><br/>Inspired by true events, the

series recounts the aftermath of Christmas Eve riots in two small Arkansas towns in 1983, riots which erupted over Cabbage Patch Dolls. The series explores class, race, privilege and what it takes to be a "good mother."<br/><br/>Rae will serve series in addition to starring, with Dern also executive producing. <a

href="/name/nm3308450/">Laura Kittrell</a> and <a

href="/name/nm4276354/">Amy Aniobi</a> will also serve as writers and coexecutive

producers. <a href="/name/nm0501536/">Jayme Lemons</a> of Dern’s

<a href="/company/co0641481/">Jaywalker Pictures</a> and <a

href="/name/nm3973260/">Deniese Davis</a> of <a

href="/company/co0363033/">Issa Rae Productions</a> will also executive

produce.<br/><br/>Both Rae and Dern currently star in HBO shows, with Dern

appearing in the acclaimed drama "<a href="/title/tt3920596/">Big Little

Lies</a>" and Rae starring in and having created the hit comedy "<a

href="/title/tt5024912/">Insecure</a>." Dern also recently starred in the

film "<a href="/title/tt4015500/">The Tale</a>,

</div>

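The example above first imports the requests and BeautifulSoup modules, then issues a request for the URL and stores the response, and then creates a BeautifulSoup object, parse_obj. This object takes the text of the response returned by requests (page_result.text) as its argument and parses the page with html.parser. Finally, BeautifulSoup's find() method retrieves the content of the element with the news-article__content class.

Now let us look at an example program that extracts data from a specific tag, in this case the <a> tag. Create a script named extract_from_tag.py and write the following code in it.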

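import requests
from bs4 import BeautifulSoup

# The URL is truncated in the source ('ws.baidu/news'); substitute the full address of the news page.
page_result = requests.get('ws.baidu/news')
parse_obj = BeautifulSoup(page_result.text, 'html.parser')
top_news = parse_obj.find(class_='news-article__content')
# find_all('a') returns every <a> tag nested inside the selected element.
top_news_a_content = top_news.find_all('a')
print(top_news_a_content)

Run the script as follows.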

student@ubuntu:~/work$ python3 extract_from_tag.py

Output:[<a href="/name/nm4793987/">Issa Rae</a>, <a href="/name/nm0000368/">Laura

Dern</a>, <a href="/company/co0700043/">HBO</a>, <a

href="/name/nm3308450/">Laura Kittrell</a>, <a href="/name/nm4276354/">Amy

Aniobi</a>, <a href="/name/nm0501536/">Jayme Lemons</a>, <a

href="/company/co0641481/">Jaywalker Pictures</a>, <a

href="/name/nm3973260/">Deniese Davis</a>, <a

href="/company/co0363033/">Issa Rae Productions</a>, <a

href="/title/tt3920596/">Big Little Lies</a>, <a

href="/title/tt5024912/">Insecure</a>, <a href="/title/tt4015500/">The

Tale</a>]

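The example above extracts data from <a> tags. It uses the find_all() method to extract every <a> tag inside the element with the news-article__content class.

3 Scraping information from a website

In this section we work through an example program that retrieves a list of dance forms from a website; here it lists all classical Indian dances. Create a script named extract_from_wikipedia.py and write the following code in it.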
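Building on the class and tag lookups above, a common next step is to turn the <a> tags into plain (text, URL) pairs. The sketch below is one possible way to do that, under clearly hypothetical assumptions: extract_links is an illustrative helper, the sample HTML is a two-link excerpt modeled on the output shown earlier, and BASE_URL is a placeholder host used only to resolve the relative links. It also guards against find() returning None when the class is not present on the page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://example.com"  # placeholder host, not the article's site

# Short excerpt modeled on the output shown earlier.
SAMPLE_HTML = """
<div class="news-article__content">
  <a href="/name/nm4793987/">Issa Rae</a> and
  <a href="/name/nm0000368/">Laura Dern</a> are teaming up.
</div>
"""


def extract_links(html_text, class_name, base_url):
    """Return (link text, absolute URL) pairs for <a> tags inside class_name."""
    soup = BeautifulSoup(html_text, "html.parser")
    block = soup.find(class_=class_name)
    if block is None:
        # find() returns None when nothing matches; calling find_all() on
        # None would raise AttributeError, so return an empty list instead.
        return []
    return [
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in block.find_all("a", href=True)  # skip <a> tags without href
    ]


links = extract_links(SAMPLE_HTML, "news-article__content", BASE_URL)
missing = extract_links(SAMPLE_HTML, "no-such-class", BASE_URL)
print(links)
print(missing)
```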

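import requests
from bs4 import BeautifulSoup

# The URL is truncated in the source ('/wiki/Portal:History'); prepend the site's scheme and host before running.
page_result = requests.get('/wiki/Portal:History')
parse_obj = BeautifulSoup(page_result.text, 'html.parser')
# Find the element whose class attribute is 'hlist noprint', then collect its <a> tags.
h_obj = parse_obj.find(class_='hlist noprint')
h_obj_a_content = h_obj.find_all('a')
print(h_obj)
print(h_obj_a_content)

Run the script as follows.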

student@ubuntu:~/work$ python3 extract_from_wikipedia.py

The output is as follows.

<dl><dt><a href="/wiki/Portal:Contents/Portals"

title="Portal:Contents/Portals">Portal topics</a></dt>

<dd><a href="/wiki/Portal:Contents/Portals#Human_activities"

title="Portal:Contents/Portals">Activities</a></dd>

<dd><a href="/wiki/Portal:Contents/Portals#Culture_and_the_arts"

title="Portal:Contents/Portals">Culture</a></dd>

<dd><a href="/wiki/Portal:Contents/Portals#Geography_and_places"

title="Portal:Contents/Portals">Geography</a></dd>

<dd><a href="/wiki/Portal:Contents/Portals#Health_and_fitness"

title="Portal:Contents/Portals">Health</a></dd>

<dd><a href="/wiki/Portal:Contents/Portals#History_and_events"

title="Portal:Contents/Portals">History</a></dd>

<dd><a href="/wiki/Portal:Contents/Portals#Mathematics_and_logic"

title="Portal:Contents/Portals">Mathematics</a></dd>

<dd><a href="/wiki/Portal:Contents/Portals#Natural_and_physical_sciences"

title="Portal:Contents/Portals">Nature</a></dd>

<dd><a href="/wiki/Portal:Contents/Portals#People_and_self"

title="Portal:Contents/Portals">People</a></dd>

In the preceding example, we extracted content from Wikipedia. As in the earlier examples, we extracted content both by class and by tag.

....

Published: 2024-09-23 04:29:03

Article link: https://www.17tex.com/xueshu/366482.html
