A Roundup of Several Ways to Download Files with Python
2020-04-10 11:00:55
emengweb

This post collects several ways to download files with Python.

0. A standard requests template

import requests

url = "******"
try:
    r = requests.get(url)
    r.raise_for_status()              # raises requests.HTTPError if the status is not 200
    r.encoding = r.apparent_encoding  # guess the encoding from the content itself
    print(r.text)
except requests.RequestException:
    print("Download failed...")

1. Downloading an image

import requests

url = 'https://www.python.org/static/img/python-logo@2x.png'
myfile = requests.get(url)
with open('PythonImage.png', 'wb') as f:   # a context manager makes sure the file is closed
    f.write(myfile.content)

Or use the wget module:

import wget
url = "https://www.python.org/static/img/python-logo@2x.png"
wget.download(url, 'pythonLogo.png')

2. Downloading a file behind a redirect

import requests

url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest/'
myfile = requests.get(url, allow_redirects=True)  # follow the redirect to the actual PDF
with open('hello.pdf', 'wb') as f:
    f.write(myfile.content)
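requests follows redirects by default, so allow_redirects=True mostly serves as documentation. If you want to see where the request was bounced, r.history lists the intermediate responses and r.url is the final address; a quick sketch:

import requests

url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest/'
r = requests.get(url)
for hop in r.history:                     # one entry per redirect hop
    print(hop.status_code, hop.url)
print('final:', r.status_code, r.url)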

3. Downloading a large file in chunks

import requests

url = 'https://buildmedia.readthedocs.org/media/pdf/python-guide/latest/python-guide.pdf'
r = requests.get(url, stream=True)                 # stream=True avoids loading it all into memory
with open("PythonBook.pdf", "wb") as Pypdf:
    for chunk in r.iter_content(chunk_size=1024):  # 1024 bytes per chunk
        if chunk:
            Pypdf.write(chunk)
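Because the chunks arrive one by one, adding a progress readout is easy. A sketch based on the Content-Length header; not every server sends that header, so the code degrades to a silent download when it is missing:

import requests

url = 'https://buildmedia.readthedocs.org/media/pdf/python-guide/latest/python-guide.pdf'
r = requests.get(url, stream=True)
total = int(r.headers.get('content-length', 0))   # 0 when the server omits the header
done = 0
with open("PythonBook.pdf", "wb") as f:
    for chunk in r.iter_content(chunk_size=1024):
        f.write(chunk)
        done += len(chunk)
        if total:
            print(f"\r{done * 100 // total}%", end="")   # crude percentage meter
print()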

4. Downloading multiple files in parallel

Sequential version:

import requests
from time import time

def url_response(url):
    path, url = url                       # each item is a (filename, url) pair
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        for ch in r:                      # iterating the response yields chunks of the body
            f.write(ch)
        
urls = [("Event1", "https://www.python.org/events/python-events/805/"),
        ("Event2", "https://www.python.org/events/python-events/801/"),
        ("Event3", "https://www.python.org/events/python-events/790/"),
        ("Event4", "https://www.python.org/events/python-events/798/"),
        ("Event5", "https://www.python.org/events/python-events/807/"),
        ("Event6", "https://www.python.org/events/python-events/807/"),
        ("Event7", "https://www.python.org/events/python-events/757/"),
        ("Event8", "https://www.python.org/events/python-user-group/816/")]

start = time()
for x in urls:
    url_response(x)
    
print(f"Time to download: {time() - start}")

# Time to download: 7.306085824966431

The parallel version originally changed just one line, to ThreadPool(9).imap_unordered(url_response, urls), and the measured time fell to about 0.0065 s. That figure is misleading, though: imap_unordered returns a lazy iterator, so the timer stopped before any download had finished (the script can even exit before the downloads complete). Consume the iterator and the wall time becomes roughly that of the slowest single download, still far below the sequential 7.3 s:

import requests
from time import time
from multiprocessing.pool import ThreadPool

def url_response(url):
    path, url = url
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        for ch in r:
            f.write(ch)

urls = [("Event1", "https://www.python.org/events/python-events/805/"),
        ("Event2", "https://www.python.org/events/python-events/801/"),
        ("Event3", "https://www.python.org/events/python-events/790/"),
        ("Event4", "https://www.python.org/events/python-events/798/"),
        ("Event5", "https://www.python.org/events/python-events/807/"),
        ("Event6", "https://www.python.org/events/python-events/807/"),
        ("Event7", "https://www.python.org/events/python-events/757/"),
        ("Event8", "https://www.python.org/events/python-user-group/816/")]

start = time()
for _ in ThreadPool(9).imap_unordered(url_response, urls):
    pass   # consuming the lazy iterator waits for every download to finish
print(f"Time to download: {time() - start}")
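The standard-library concurrent.futures module gives the same pattern a more modern interface; a sketch reusing url_response and urls from above. The with-block waits for every submitted download before the timer stops:

from concurrent.futures import ThreadPoolExecutor
from time import time

start = time()
with ThreadPoolExecutor(max_workers=9) as pool:
    pool.map(url_response, urls)   # shutdown on exit waits for all tasks to finish
print(f"Time to download: {time() - start}")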

5. Fetching an HTML page with urllib

import urllib.request

# urlretrieve(url, filename) saves the response body straight to disk
urllib.request.urlretrieve('https://www.python.org/', 'PythonOrganization.html')
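If you only want the HTML as a string rather than a file on disk, urlopen does the job:

import urllib.request

with urllib.request.urlopen('https://www.python.org/') as resp:
    html = resp.read().decode('utf-8')
print(html[:200])   # first 200 characters of the page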

6. A killer tool for downloading videos with Python

you-get currently supports dozens of sites in China and abroad (YouTube, Twitter, Tencent Video, iQiyi, Youku, bilibili, and so on).

pip install you-get

Give it a try:

you-get https://www.bilibili.com/video/av52694584/?spm_id_from=333.334.b_686f6d655f706f70756c6172697a65.3

youtube-dl is a similar tool.
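Both are command-line programs, so they are easy to drive from a Python script with subprocess; a sketch, assuming you-get is on your PATH (the -o flag chooses the output directory):

import subprocess

url = "https://www.bilibili.com/video/av52694584/"
subprocess.run(["you-get", "-o", "videos", url], check=True)   # check=True raises on failure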

7. A worked example

Batch download: the NOAA-CIRES 20th Century Reanalysis 2 m air temperature data. Clicking each file by hand would wear your finger out; Python can batch the downloads for you.

First open the page and press F12 to inspect the page source:

[Screenshot: the page source, showing the download links as <a> tags inside the div with id="content"]

As the screenshot shows, the download links all sit in <a> tags inside that div; collect them one by one and the batch download can begin.

# -*- coding: utf-8 -*-
import os
import urllib.request
from bs4 import BeautifulSoup

rawurl = 'https://www.esrl.noaa.gov/psd/cgi-bin/db_search/DBListFiles.pl?did=118&tid=40290&vid=2227'
content = urllib.request.urlopen(rawurl).read().decode('ascii')  # fetch the page HTML

soup = BeautifulSoup(content, 'lxml')
url_cand_html = soup.find_all(id='content')  # the div with id='content' holds the links
list_urls = url_cand_html[0].find_all("a")   # the <a> tags carry the file urls
urls = []

for i in list_urls[1:]:                      # the first <a> is skipped (not a data file)
    urls.append(i.get('href'))               # extract the link

os.makedirs("ncfile", exist_ok=True)         # make sure the target directory exists
for i, url in enumerate(urls):
    print("This is file " + str(i + 1) + " downloading! You still have "
          + str(len(urls) - i - 1) + " files waiting for downloading!!")
    file_name = "./ncfile/" + url.split('/')[-1]  # target directory + file name
    urllib.request.urlretrieve(url, file_name)

A parallel version is left for you to try; feel free to leave a comment with your own approach.
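As a starting point, a minimal sketch in the spirit of section 4, reusing the urls list built above; the fetch helper here is purely for illustration:

import urllib.request
from multiprocessing.pool import ThreadPool

def fetch(url):
    file_name = "./ncfile/" + url.split('/')[-1]
    urllib.request.urlretrieve(url, file_name)
    return file_name

with ThreadPool(8) as pool:                        # 8 worker threads; tune to taste
    for name in pool.imap_unordered(fetch, urls):  # consuming the iterator waits for completion
        print(name, "done")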

REFERENCE

Downloading Files Using Python (Simple Examples)
