Scraping Web Page Data with Regular Expressions


This article is excerpted from PHP中文网, author 巴扎黑. It will be removed on request in case of infringement.

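Note: the regular expressions in this article only work in .NET, because they rely on balancing groups such as (?<-Nested>), a .NET-specific regex feature. The workflow is: send an HTTP request, receive the whole HTML page in the response, then extract the data you want from that HTML.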

Part 1: Sending the HttpWebRequest

C# code

using System.IO;
using System.Net;
using System.Text;

// URL of the page to fetch
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("URL");
// Set the browser type (User-Agent); this has to be set before the request is sent
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.0.04506; .NET CLR 3.5.21022; .NET CLR 1.0.3705; .NET CLR 1.1.4322)";
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("UTF-8"));
// HTML of the returned page
String htmlStr = reader.ReadToEnd();
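The snippet above never closes the response or the reader. As a minimal sketch of the same request with deterministic cleanup, the following wraps it in two helpers. The names `PageFetcher`, `ReadAll` and `FetchHtml` are hypothetical and not part of the original article, and the sketch assumes the page is encoded as UTF-8.

```csharp
using System.IO;
using System.Net;
using System.Text;

public static class PageFetcher
{
    // Hypothetical helper: reads an entire stream as text in the given encoding
    // and disposes the reader (and with it the stream) when done.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical helper: downloads a page and returns its HTML.
    // Assumption: the server answers in UTF-8.
    public static string FetchHtml(string url, string userAgent)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // Request properties are set before GetResponse() sends the request.
        request.UserAgent = userAgent;
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            return ReadAll(body, Encoding.UTF8);
        }
    }
}
```

Splitting out `ReadAll` keeps the decoding step testable without a network connection, for example by passing it a `MemoryStream`. On recent .NET versions `WebRequest` is marked obsolete in favor of `HttpClient`, so treat this as a sketch for the API the article uses rather than a recommendation.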

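Part 2: Extracting the useful data from the returned HTML. This approach works whenever you need to locate a piece of HTML by an attribute of its tag, such as ID or class. The method below serves as an example.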

C# code

// Requires: using System.Text; using System.Text.RegularExpressions;

/// <summary>
/// Gets the colors.
/// </summary>
/// <param name="htmlStr">HTML of the whole page</param>
/// <returns>The color names, each followed by a comma, or an empty string if none are found</returns>
public String getColor(String htmlStr)
{
    // Match the element whose class is DetailsC_Sku. To match by ID instead, use:
    // string regstr6 = @"<(?<HtmlTag>[\w]+)[^>]*\s[iI][dD]=(?<Quote>";
    string regstr6 = @"<(?<HtmlTag>[\w]+)[^>]*\s[cC][lL][aA][sS][sS]=(?<Quote>";
    string regstr7 = "[\"']?)DetailsC_Sku(?(Quote)";
    string regstr8 = @"\k<Quote>)";
    string regstr9 = "[\"']?[^>]*>";
    // Balancing group: push on a nested opening tag, pop on its closing tag
    string regstr10 = @"((?<Nested><\k<HtmlTag>[^>]*>)|</\k<HtmlTag>>(?<-Nested>)|.*?)*</\k<HtmlTag>>";
    StringBuilder sb2 = new StringBuilder();
    sb2.Append(regstr6);
    sb2.Append(regstr7);
    sb2.Append(regstr8);
    sb2.Append(regstr9);
    sb2.Append(regstr10);
    string pattern = sb2.ToString();

    // HTML of the first DetailsC_Sku element (the size block, judging by the variable name)
    String sizeHtml = Regex.Match(htmlStr, pattern, RegexOptions.Singleline).ToString();
    if (!String.IsNullOrEmpty(sizeHtml))
    {
        // Remove the first element, then apply the same pattern again to find the next one
        String newhtml = htmlStr.Replace(sizeHtml, "");
        String colorHtml = Regex.Match(newhtml, pattern, RegexOptions.Singleline).ToString();
        if (String.IsNullOrEmpty(colorHtml))
            return "";

        // Find all <a> tags inside colorHtml
        Regex regex2 = new Regex(@"<a.*?>[\s\S]*?<\/a>");
        MatchCollection mc2 = regex2.Matches(colorHtml);
        StringBuilder sbs = new StringBuilder();
        // Loop over the matches and collect the colors
        foreach (Match mm in mc2)
        {
            sbs.Append(RemoveHtml(mm.Value)).Append(",");
        }
        return sbs.ToString();
    }
    return "";
}
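To make the role of the balancing group easier to see, here is a minimal sketch that applies the same pattern to a small fragment containing a nested element with the same tag name. The class `BalancedTagDemo`, the helper `ExtractByClass` and the sample HTML are hypothetical; the pattern itself is the one assembled in getColor, with the class name passed in as a parameter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedTagDemo
{
    // Hypothetical helper: returns the first element whose class attribute is
    // className, including nested elements that use the same tag name.
    public static string ExtractByClass(string html, string className)
    {
        string pattern =
            // Opening tag: capture the tag name and the optional quote character.
            @"<(?<HtmlTag>[\w]+)[^>]*\s[cC][lL][aA][sS][sS]=(?<Quote>[""']?)"
            + Regex.Escape(className)
            + @"(?(Quote)\k<Quote>)[""']?[^>]*>"
            // Body: push on a nested opening tag, pop on its closing tag,
            // otherwise consume text; then match the element's own closing tag.
            + @"((?<Nested><\k<HtmlTag>[^>]*>)|</\k<HtmlTag>>(?<-Nested>)|.*?)*</\k<HtmlTag>>";
        return Regex.Match(html, pattern, RegexOptions.Singleline).Value;
    }

    public static void Main()
    {
        string html = "<p>intro</p><div class=\"DetailsC_Sku\"><div><a>Red</a></div>"
                    + "<a>Blue</a></div><div>tail</div>";
        Console.WriteLine(ExtractByClass(html, "DetailsC_Sku"));
    }
}
```

A plain lazy pattern such as `<div[^>]*>.*?</div>` would stop at the first `</div>`, which here belongs to the inner element. Counting nested opening tags on the `Nested` stack is what lets the match run through to the closing tag of the outer element.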

C# code

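/// <summary>
/// Strips the HTML tags from a string and returns the text inside them.
/// </summary>
/// <param name="src">HTML fragment</param>
/// <returns>The text content with the tags removed</returns>
public string RemoveHtml(string src)
{
    // Any HTML tag
    Regex htmlReg = new Regex(@"<[^>]+>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    // The &nbsp; entity
    Regex htmlSpaceReg = new Regex("\\&nbsp\\;", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    // Two or more whitespace characters, or a space followed by a semicolon
    Regex spaceReg = new Regex("\\s{2,}|\\ \\;", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    // Whole <style> and <script> blocks; Singleline lets "." cross line breaks,
    // so blocks that span several lines are removed as well
    Regex styleReg = new Regex(@"<style(.*?)</style>", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
    Regex scriptReg = new Regex(@"<script(.*?)</script>", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Remove style and script blocks first, so their contents do not survive as text
    src = styleReg.Replace(src, string.Empty);
    src = scriptReg.Replace(src, string.Empty);
    src = htmlReg.Replace(src, string.Empty);
    src = htmlSpaceReg.Replace(src, " ");
    src = spaceReg.Replace(src, " ");
    return src.Trim();
}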
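The last step, pulling the text out of each link, can be tried on its own. The sketch below is a simplified stand-in rather than the article's code: `AnchorTextDemo` and `AnchorTexts` are hypothetical names, the link pattern uses `<a\b[^>]*>` instead of the article's `<a.*?>` so that tags such as `<abbr>` are not picked up, and only the tag stripping and `&nbsp;` replacement from RemoveHtml are reproduced.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorTextDemo
{
    // Hypothetical helper: returns the visible text of every <a> element in a fragment.
    public static List<string> AnchorTexts(string fragment)
    {
        List<string> texts = new List<string>();
        foreach (Match m in Regex.Matches(fragment, @"<a\b[^>]*>[\s\S]*?</a>", RegexOptions.IgnoreCase))
        {
            // Strip the tags, turn &nbsp; into a plain space, then trim.
            string text = Regex.Replace(m.Value, @"<[^>]+>", string.Empty);
            text = Regex.Replace(text, "&nbsp;", " ", RegexOptions.IgnoreCase).Trim();
            texts.Add(text);
        }
        return texts;
    }

    public static void Main()
    {
        string fragment = "<div><a href=\"#\">Red</a> <a href=\"#\">Dark&nbsp;Blue</a></div>";
        // Joining the list avoids the trailing comma that appending "," per item leaves behind.
        Console.WriteLine(string.Join(",", AnchorTexts(fragment)));
    }
}
```

Returning a list and joining it at the call site is a design choice for this sketch. getColor as written returns each color followed by a comma, so callers of that method need to handle the trailing separator.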
