12 lines of simple Python code: a first glimpse into the secrets of web crawlers

Students writing papers often struggle to obtain data and turn to web crawlers;

analysts also rely on crawlers for public opinion monitoring or competitive product analysis.

Today, this article walks readers through 12 lines of simple Python code for a first look at how crawlers work.

Crawler target

This article uses requests + XPath to crawl the short comments on the movie "Black Panther" on Douban Movies.

Running the crawler script, we can witness the result

The crawled results are exactly the same as the content of the original web page.

The tqdm module provides a progress bar for friendly interaction

Tool preparation

Chrome browser (for analyzing HTTP requests and capturing packets)

Install Python 3 and the related modules (requests, lxml, pandas, tqdm; time and random ship with the standard library):

requests: sends HTTP requests simply

lxml: a parsing library that is faster and more powerful than Beautiful Soup

pandas: a powerful data-processing tool

time: sets the interval between crawler requests so the crawler does not hit the site too quickly

random: generates random numbers, used together with time

tqdm: an interactive tool that shows the progress of the program

The basic steps

Network request analysis

Web page content parsing

Data reading and storage

Knowledge points involved

Crawler protocol

HTTP request analysis

Requests

XPath syntax

Python basic syntax

Pandas data processing

Crawler protocol

The crawler protocol is the robots.txt file in the root directory of a website. It tells crawlers which pages they may fetch and which they may not. Crawl-delay tells crawlers the interval the website expects between visits. (Out of consideration for the people who maintain the server on the other side, while still being able to obtain the data, this article sets the crawler access interval to a random 6-9 seconds.)
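As a minimal sketch of the idea above, the standard-library urllib.robotparser module can read robots.txt rules, and time plus random can produce the 6-9 second pause. The robots.txt content and the example.com URLs below are made up for illustration and are not Douban's actual rules; polite_pause is a hypothetical helper name.

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# Ask the parsed rules whether a generic crawler ("*") may fetch a URL.
allowed = parser.can_fetch("*", "https://example.com/subject/1/comments")
blocked = parser.can_fetch("*", "https://example.com/private/page")

# Crawl-delay declared for this user agent, or None if the file has none.
delay = parser.crawl_delay("*")


def polite_pause(low=6.0, high=9.0, sleep=time.sleep):
    """Sleep for a random interval between `low` and `high` seconds."""
    seconds = random.uniform(low, high)
    sleep(seconds)
    return seconds
```

Calling polite_pause() between page requests gives the random 6-9 second interval described above; to check a real site, set_url() and read() on the parser would load its robots.txt over the network instead of the sample string.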

The website's crawler protocol

HTTP request analysis

Use the Chrome browser to open the "Black Panther" short comment page https://movie.douban.com/subject/6390825/comments?sort=new_score&status=P, press F12, and open the Network panel to analyze the network requests. Refresh the page to capture the requests, then use Chrome's filter to narrow them down and find the request we are after.

Request analysis of the Douban comment page

Through the request analysis, we find that the target URL is 'https://movie.douban.com/subject/6390825/comments?start=0&limit=20&sort=new_score&status=P&percent_type=', and that the start parameter increases by 20 with each page. (After trying several pages, we found that pages after page 11 require logging in, and that even when logged in only the first 500 short comments are shown. As a simple demo, this article crawls only the first 11 pages.)
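The paging rule can be sketched as follows: the URL and its query parameters are the ones found in the request analysis, while the constant and function names (BASE_URL, PAGE_SIZE, PAGE_COUNT, build_page_urls) are chosen here for illustration.

```python
BASE_URL = "https://movie.douban.com/subject/6390825/comments"
PAGE_SIZE = 20   # comments per page, matching limit=20
PAGE_COUNT = 11  # pages viewable without logging in


def build_page_urls(page_count=PAGE_COUNT, page_size=PAGE_SIZE):
    """Build one URL per page; `start` grows by `page_size` each page."""
    return [
        f"{BASE_URL}?start={page * page_size}&limit={page_size}"
        "&sort=new_score&status=P&percent_type="
        for page in range(page_count)
    ]


urls = build_page_urls()
```

With 11 pages of 20 comments each, start runs from 0 to 200, which covers the first 220 short comments.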

Requests

Send a GET request through the requests module, use the content attribute to get the raw bytes, and decode them as UTF-8; then check whether the resource was successfully acquired (status code 200) and print the acquisition status.

Request details analysis

(Besides content, there is also the text attribute, which returns a Unicode string; using text directly here produced garbled Chinese.)
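A minimal sketch of this step, not the article's original script: fetch_html is a hypothetical helper that takes any session-like object with a get method (for example requests.Session()), checks for status code 200, and decodes the content bytes as UTF-8.

```python
def fetch_html(session, url, timeout=10):
    """Return the page decoded as UTF-8, or None if the status is not 200."""
    response = session.get(url, timeout=timeout)
    if response.status_code == 200:
        print("Fetched successfully:", url)
        # .content holds raw bytes; decoding explicitly avoids relying on
        # the encoding that .text would guess.
        return response.content.decode("utf-8")
    print("Request failed with status", response.status_code)
    return None


# Usage (performs a real network request, so it is left commented out):
# import requests
# html = fetch_html(requests.Session(), "https://movie.douban.com/subject/6390825/comments?start=0&limit=20&sort=new_score&status=P&percent_type=")
```

Passing the session in as a parameter is a design choice made for this sketch; it keeps the helper easy to try out with a stand-in object before sending real requests.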

XPath parsing

After obtaining the data, the content of the web page needs to be parsed. Commonly used tools include regular expressions, Beautiful Soup, and XPath; among them, XPath is fast and convenient. Here we use XPath to extract the user names, ratings, and comment text of the first 220 short comments. (You can copy an XPath directly from Chrome's developer tools; to learn XPath syntax, see http://)

Data processing
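To illustrate the XPath step, here is a small sketch using lxml on a hand-written HTML fragment. The markup and class names (comment-item, comment-info, rating, short) are assumptions made for this example and may not match Douban's real page; copy the actual paths from Chrome as described above.

```python
from lxml import etree

# Hypothetical markup for illustration; the real page may differ.
SAMPLE_HTML = """
<div id="comments">
  <div class="comment-item">
    <span class="comment-info">
      <a href="#">user_a</a>
      <span class="rating" title="Recommended"></span>
    </span>
    <p><span class="short">Great movie.</span></p>
  </div>
  <div class="comment-item">
    <span class="comment-info">
      <a href="#">user_b</a>
      <span class="rating" title="Fair"></span>
    </span>
    <p><span class="short">Not bad.</span></p>
  </div>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# Each XPath returns a list with one entry per matching node.
names = tree.xpath('//div[@class="comment-item"]//span[@class="comment-info"]/a/text()')
ratings = tree.xpath('//div[@class="comment-item"]//span[@class="rating"]/@title')
comments = tree.xpath('//div[@class="comment-item"]//span[@class="short"]/text()')
```

Note that the three lists only line up if every comment has all three fields; a comment without a rating would shift the ratings list, so a more careful version would loop over each comment-item node and query relative paths inside it.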

After getting the data, we build a dictionary from the lists, then build a DataFrame from the dictionary, and use pandas to write the data out as a CSV file.
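The list-to-dictionary-to-DataFrame step might look like the sketch below. The three lists hold made-up sample values, and the column names and output file name are chosen for illustration only.

```python
import os
import tempfile

import pandas as pd

# Made-up sample values standing in for the parsed results.
names = ["user_a", "user_b"]
ratings = ["Recommended", "Fair"]
comments = ["Great movie.", "Not bad."]

# Lists -> dictionary -> DataFrame, one column per list.
data = {"user": names, "rating": ratings, "comment": comments}
df = pd.DataFrame(data)

# Write to a temporary directory here; any path works.
# utf-8-sig adds a BOM so spreadsheet tools display Chinese text correctly.
csv_path = os.path.join(tempfile.gettempdir(), "black_panther_comments.csv")
df.to_csv(csv_path, index=False, encoding="utf-8-sig")
```

Because pd.DataFrame requires all lists to have the same length, a mismatch between the number of names, ratings, and comments raises a ValueError at this step rather than silently misaligning rows.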

Conclusion and Easter Eggs

In this case, using the requests + XPath approach, we successfully crawled some of the Douban comments on the movie “Black Panther”, laying the foundation for text analysis or other data mining work. As a demo, this article shows only a simple crawling process; for more advanced topics such as request headers, request body information, cookies, simulated logins, and distributed crawlers, please watch for later articles.
