Using a large pool of proxy IP addresses to verify expired Baidu Cloud share links (a non-deterministic strategy)

    3380626465 · 2016-08-28 16:05:32 +08:00 · 6014 clicks

    A quick note first: in my earlier post about the Baidu Netdisk crawler, I raised the question of how to tell whether a shared resource has expired, and I'm very grateful to the friends who offered suggestions. The code published in this post is an earlier version of what qupan (去转盘网) runs, but it is essentially the same. The key is obtaining a large pool of proxy IP addresses; that is the core of the whole approach. So this post first releases the code for collecting proxy IPs (an open-source project found online that I have modified and tuned, so feel free to use it), and then the code for detecting dead share links, so that everyone can study it. A small sketch of how such a proxy gets used follows the download link below.

    Click here to download; see here for more details.
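
    Before the main code, a minimal sketch of how one address from such a proxy pool might be plugged into urllib2; the pool contents below are placeholders, not real proxies, and would be replaced by addresses collected with the downloaded project:

    import random
    import urllib2

    # placeholder pool; in practice this is filled by the proxy crawler
    PROXY_POOL = ["1.2.3.4:8080", "5.6.7.8:3128"]

    def fetch_via_proxy(url):
        proxy = random.choice(PROXY_POOL)  # rotate: pick a random proxy per request
        opener = urllib2.build_opener(urllib2.ProxyHandler({"http": proxy}))
        opener.addheaders = [("User-Agent", "Mozilla/5.0")]
        return opener.open(url, timeout=8).read()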

    First, a quick recap of earlier posts: the Baidu Netdisk crawler, the Java word-segmentation algorithm, automatic database backup, proxy-server crawling, and inviting friends to register.

    OK, on to the code:

    # coding:utf-8
    """
    @author:haoning
    @create time:2015.8.5
    """
    from __future__ import division  # 精确除法
    from Queue import Queue
    from collections import OrderedDict
    import copy
    import datetime
    import json
    import math
    import os
    import random
    import platform
    import re
    import threading, errno, datetime
    import time
    import urllib2
    import MySQLdb as mdb
    
    
    DB_HOST = '127.0.0.1'
    DB_USER = 'root'
    DB_PASS = 'root'
    
    
    def gethtml(url):
        try:
            print "url",url
            req = urllib2.Request(url)
            response = urllib2.urlopen(req,None,8) # a proxy from the pool should be plugged in here
            html = response.read()
            return html
        except Exception,e:
            print "e",e
    
    if __name__ == '__main__':
    
       while 1:
           #url='http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442'
           url="http://pan.baidu.com/s/1qXQD2Pm"
           html=gethtml(url)
           print html
    

    Result: e HTTP Error 403: Forbidden. In other words, Baidu does block crawlers on this kind of URL. After looking around a lot of pages, I happened to try the following link:

    http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442

    if __name__ == '__main__':
    
       while 1:
           url='http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442'
           #url="http://pan.baidu.com/s/1qXQD2Pm"
           html=gethtml(url)
           print html
    

    Result: <title>百度云 网盘-链接不存在</title> ("Baidu Cloud Netdisk - link does not exist"). You know what that means: any share that returns this title has already expired. So it seems this entry point is not anti-crawler at all. Excellent.
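
    Based on that title, a minimal check that reuses the gethtml() above would only need to look for this one dead-link marker (a sketch, not the full rule set used later):

    def baidu_share_is_dead(url):
        html = gethtml(url)  # gethtml() as defined above
        if html is None:
            return None  # request failed, cannot tell
        # u"\u94FE\u63A5\u4E0D\u5B58\u5728" is the dead-link marker 链接不存在
        return u"\u94FE\u63A5\u4E0D\u5B58\u5728" in html.decode("utf-8", "ignore")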

    Baidu Netdisk shares can actually be reached through two kinds of entry URL:

    One is http://pan.baidu.com/s/1qXQD2Pm, which ends in a short code.

    The other is http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442, keyed by shareid + uk. The former is known to block crawlers; the latter, so far, does not. After verifying this with Python, I translated the code into Java, because qupan (去转盘网) is written in Java. Here it is:

    package com.tray.common.utils;
    
    import static org.junit.Assert.*;
    
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.Properties;
    import java.util.Random;
    import java.util.Set;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    import org.junit.Test;
    
    /**
     * Resource validity check utility
     * 
     * @author hui
     * 
     */
    public class ResourceCheckUtil {
        private static Map<String, String[]> rules;
        static {
            loadRule();
        }
    
        /**
         * Load the rule set from rule.properties
         */
        public static void loadRule() {
            try {
                InputStream in = ResourceCheckUtil.class.getClassLoader()
                        .getResourceAsStream("rule.properties");
                Properties p = new Properties();
                p.load(in);
                Set<Object> keys = p.keySet();
                Iterator<Object> iterator = keys.iterator();
                String key = null;
                String value = null;
                String[] rule = null;
                rules = new HashMap<String, String[]>();
                while (iterator.hasNext()) {
                    key = (String) iterator.next();
                    value = (String) p.get(key);
                    rule = value.split("\\|");
                    rules.put(key, rule);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    
        public static String httpRequest(String url) {
            try {
                URL u = new URL(url);
                Random random = new Random();
                HttpURLConnection connection = (HttpURLConnection) u
                        .openConnection();
                connection.setConnectTimeout(3000); // 3-second timeout
                connection.setReadTimeout(3000); 
                connection.setDoOutput(true);
                connection.setDoInput(true);
                connection.setUseCaches(false);
                connection.setRequestMethod("GET");
                
                String[] user_agents = {
                        "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11",
                        "Opera/9.25 (Windows NT 5.1; U; en)",
                        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
                        "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)",
                        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12",
                        "Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
                        "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
                        "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 "
                };
                int index=random.nextInt(7);
                /*connection.setRequestProperty("Content-Type",
                        "text/html;charset=UTF-8");*/
                connection.setRequestProperty("User-Agent",user_agents[index]);
                /*connection.setRequestProperty("Accept-Encoding","gzip, deflate, sdch");
                connection.setRequestProperty("Accept-Language","zh-CN,zh;q=0.8");
                connection.setRequestProperty("Connection","keep-alive");
                connection.setRequestProperty("Host","pan.baidu.com");
                connection.setRequestProperty("Cookie","");
                connection.setRequestProperty("Upgrade-Insecure-Requests","1");*/
                InputStream in = connection.getInputStream();
    
                BufferedReader br = new BufferedReader(new InputStreamReader(in,
                        "utf-8"));
                StringBuffer sb = new StringBuffer();
                String line = null;
                while ((line = br.readLine()) != null) {
                    sb.append(line);
                }
                return sb.toString();
    
            } catch (MalformedURLException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
    
            return null;
        }
    
         @Test
         public void test7() throws Exception {
             System.out.println(isExistResource("http://pan.baidu.com/s/1jGjBmyq",
             "baidu"));
             System.out.println(isExistResource("http://pan.baidu.com/s/1jGjBmyqa",
             "baidu"));
            
             System.out.println(isExistResource("http://yunpan.cn/cQx6e6xv38jTd","360"));
             System.out.println(isExistResource("http://yunpan.cn/cQx6e6xv38jTdd",
             "360"));
            
             System.out.println(isExistResource("http://share.weiyun.com/ec4f41f0da292adb89a745200b8e8b57","weiyun"));
             System.out.println(isExistResource("http://share.weiyun.com/ec4f41f0da292adb89a745200b8e8b57dd",
             "360"));
            
             System.out.println(isExistResource("http://cloud.letv.com/s/eiGLzuSes","leshi"));
             System.out.println(isExistResource("http://cloud.letv.com/s/eiGLzuSesdd",
             "leshi"));
         }
    
        /**
         * Fetch the text content of the given tag on the page at url
         * 
         * @param url
         * @param tagName
         *            the tag (or selector) to look for
         * @return the tag's text, or "" if the page could not be fetched
         */
        private static String getHtmlContent(String url, String tagName) {
            String html = httpRequest(url);
            if(html==null){
                return "";
            }
            Document doc = Jsoup.parse(html);
            //System.out.println("doc======"+doc);
            Elements tag=null;
            if(tagName.equals("<h3>")){ // Weiyun pages: look at the <h3> tag
                tag=doc.select("h3");
            }
            else if(tagName.equals("class")){ // 360 Yunpan pages: look at div.tip
                tag=doc.select("div[class=tip]");
            }
            else{
                tag= doc.getElementsByTag(tagName);
            }
            //System.out.println("tag======"+tag);
            String content="";
            if(tag!=null&&!tag.isEmpty()){
                content = tag.get(0).text();
            }
            return content;
        }
    
        public static int isExistResource(String url, String ruleName) {
            try {
                String[] rule = rules.get(ruleName);
                String tagName = rule[0];
                String opt = rule[1];
                String flag = rule[2];
                /*System.out.println("ruleName"+ruleName);
                System.out.println("tagName"+tagName);
                System.out.println("opt"+opt);
                System.out.println("flag"+flag);
                System.out.println("url"+url);*/
                String content = getHtmlContent(url, tagName);
                //System.out.println("content="+content);
                if(ruleName.equals("baidu")){
                    if(content.contains("百度云升级")){ // treat the "Baidu Cloud upgrading" page as not existing
                        return 1;
                    }
                }
                String regex = null;
                if ("eq".equals(opt)) {
                    regex = "^" + flag + "$";
                } else if ("bg".equals(opt)) {
                    regex = "^" + flag + ".*$";
                } else if ("ed".equals(opt)) {
                    regex = "^.*" + flag + "$";
                } else if ("like".equals(opt)) {
                    regex = "^.*" + flag + ".*$";
                }else if("contain".equals(opt)){
                    if(content.contains(flag)){
                        return 0;
                    }
                    else{
                        return 1;
                    }
                }
                if(content.matches(regex)){
                    return 1;
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            return 0;
        }
    
        // public static void main(String[] args)throws Exception {
        // final Path p = Paths.get("C:/Users/hui/Desktop/6-14/");
        // final WatchService watchService =
        // FileSystems.getDefault().newWatchService();
        // p.register(watchService, StandardWatchEventKinds.ENTRY_MODIFY);
        // new Thread(new Runnable() {
        //
        // public void run() {
        // while(true){
        // System.out.println("checking....");
        // try {
        // WatchKey watchKey = watchService.take();
        // List<WatchEvent<?>> watchEvents = watchKey.pollEvents();
        //
        // for(WatchEvent<?> event : watchEvents){
        // //TODO take different actions depending on the event type
        // System.out.println("["+p.getFileName()+"/"+event.context()+"] file event: ["+event.kind()+"]");
        // }
        // watchKey.reset();
        //
        // } catch (Exception e) {
        // e.printStackTrace();
        // }
        // }
        // }
        // }).start();
        // }
        
    //    @Test
    //    public void testName() throws Exception {
    //        System.out.println(new String("\u8BF7\u8F93\u5165\u63D0\u53D6\u7801".getBytes("utf-8"), "utf-8"));
    //    }
    
    }
    

    Note that the code was also written to handle 360 Yunpan, Weiyun, and other netdisks. Some of those services have since shut down, as everyone knows, but the code stays; that is how a programmer should think about it: keep it extensible. The code reads a configuration file, which I attach here as well:

    360=class|contain|\u5206\u4EAB\u8005\u5DF2\u53D6\u6D88\u6B64\u5206\u4EAB
    baidu=title|contain|\u94FE\u63A5\u4E0D\u5B58\u5728
    weiyun=<h3>|contain|\u5206\u4EAB\u8D44\u6E90\u5DF2\u7ECF\u5220\u9664
    leshi=title|ed|\u63D0\u53D6\u6587\u4EF6

    Sorry, the flag values are Unicode escapes; please decode them yourself. If you don't know how, just search for a Unicode conversion tool.
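
    If you have a Python shell handy you can skip the converter; this is what the flags decode to (decoded text shown in the comments):

    # coding:utf-8
    # what the escaped flag values in rule.properties decode to
    flags = {
        "baidu":  u"\u94FE\u63A5\u4E0D\u5B58\u5728",                          # 链接不存在 (link does not exist)
        "360":    u"\u5206\u4EAB\u8005\u5DF2\u53D6\u6D88\u6B64\u5206\u4EAB",  # 分享者已取消此分享 (sharer cancelled the share)
        "weiyun": u"\u5206\u4EAB\u8D44\u6E90\u5DF2\u7ECF\u5220\u9664",        # 分享资源已经删除 (shared resource was deleted)
        "leshi":  u"\u63D0\u53D6\u6587\u4EF6",                                # 提取文件 (extract file)
    }
    for name, text in flags.items():
        print name, text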

    That's it: the code qupan (去转盘网) uses to verify whether a share link has expired is now fully public. If you like this post, please bookmark it and follow.
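
    For readers in this Python node who would rather not run the Java, here is a rough Python 2 sketch of the same rule-driven idea, with the rules inlined as a dict and only the contain/ed operators shown; it is a simplified sketch, not the code qupan actually runs:

    # coding:utf-8
    import re
    import urllib2

    # rule name -> (tag to inspect, operator, marker text), mirroring rule.properties
    RULES = {
        "baidu": ("title", "contain", u"链接不存在"),
        "leshi": ("title", "ed", u"提取文件"),
    }

    def check(url, rule_name):
        tag, opt, flag = RULES[rule_name]
        try:
            html = urllib2.urlopen(url, None, 8).read().decode("utf-8", "ignore")
        except Exception, e:
            print "e", e
            return None  # could not fetch, cannot tell
        m = re.search(r"<%s[^>]*>(.*?)</%s>" % (tag, tag), html, re.S)
        content = m.group(1).strip() if m else u""
        if opt == "contain":
            return flag in content          # True -> the marker text is present
        if opt == "ed":
            return content.endswith(flag)   # True -> content ends with the marker
        return None

    if __name__ == "__main__":
        # as in the Java version, what a match means depends on the rule's marker
        print check("http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442", "baidu")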

    I've set up a QQ group for technical discussion; everyone is welcome. Group number: 512245829. If you prefer Weibo, just follow 转盘娱乐.

    2 replies    2016-08-29 13:19:08 +08:00

    andychen20121130    #1    2016-08-29 12:24:56 +08:00
    Finally a Java one. I thought Java couldn't do crawlers.

    3380626465 (OP)    #2    2016-08-29 13:19:08 +08:00
    @andychen20121130 Sure it can, Java works just fine. I'll share some more Java stuff this afternoon, stay tuned.