Analyze web pages with Python requests and Beautiful Soup

Browsing the web probably accounts for much of your day. But it’s an awfully manual process, isn’t it? You have to open a browser. Go to a website. Click buttons, move a mouse. It’s a lot of work. Wouldn’t it be nicer to interact with the Internet through code?

You can get data from the Internet using Python with the help of the Python module requests:

import requests

DATA = "https://opensource.com/article/22/5/document-source-code-doxygen-linux"
PAGE = requests.get(DATA)

print(PAGE.text)

In this code sample, you first import the module requests. Then you create two variables. The first, called DATA, holds the URL you want to download. In later versions of this code, you’ll be able to provide a different URL each time you run your application. For now, though, it’s easiest to just “hard code” a test URL for demonstration purposes.

The other variable is PAGE, which you set to the response of the requests.get function when it reads the URL stored in DATA. The requests module and its .get function are pre-programmed to “read” an Internet address (a URL), access the Internet, and download whatever is located at that address.

That’s a lot of steps you don’t have to figure out on your own, and that’s exactly why Python modules exist. Finally, you tell Python to print everything that requests.get has stored in the .text field of the PAGE variable.
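The sample assumes the download always succeeds. As a minimal sketch (not part of the original sample), the request can be wrapped in a small helper that fails loudly on an error response. The function name fetch_page and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_page(url, timeout=10):
    """Download a URL and return its body as text, or raise on failure."""
    # timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling print(fetch_page(DATA)) would then behave like the sample above when the page loads, and raise an exception instead of printing an error page when it doesn’t.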

Beautiful Soup

If you run the sample code above, you get the contents of the example URL dumped indiscriminately into your terminal. It does that because the only thing your code does with the data that requests has gathered is print it. It’s more interesting to parse the text.

Python can “read” text with its most basic functions, but parsing text allows you to search for patterns, specific words, HTML tags, and so on. You could parse the text returned by requests yourself, but using a specialized module is much easier. For HTML and XML, there’s the Beautiful Soup library.

The following code accomplishes the same thing, but it uses Beautiful Soup to parse the downloaded text. Because Beautiful Soup recognizes HTML entities, you can use some of its built-in features to make the output a little easier for the human eye to parse.

For instance, instead of printing raw text at the end of your program, you can run the text through the .prettify function of Beautiful Soup:

from bs4 import BeautifulSoup
import requests

PAGE = requests.get("https://opensource.com/article/22/5/document-source-code-doxygen-linux")
SOUP = BeautifulSoup(PAGE.text, 'html.parser')

# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    # do a thing here
    print(SOUP.prettify())

The output of this version of your program ensures that every opening HTML tag starts on its own line, with indentation to help demonstrate which tag is a parent of another tag. Beautiful Soup is aware of HTML tags in more ways than just how it prints them out.

Instead of printing the whole page, you can single out a specific kind of tag. For instance, try changing the print statement from print(SOUP.prettify()) to this:

  print(SOUP.p)

This prints just a <p> tag. Specifically, it prints just the first <p> tag encountered. To print all <p> tags, you need a loop.
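To see this behavior without downloading anything, here is a small sketch that parses a hard-coded HTML string. The HTML variable is a made-up stand-in for a real page, not content from the example URL.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
HTML = "<html><body><p>First</p><p>Second</p></body></html>"
SOUP = BeautifulSoup(HTML, "html.parser")

first = SOUP.p          # attribute access returns only the first <p> tag
missing = SOUP.table    # a tag that isn't on the page comes back as None

print(first)                     # <p>First</p>
print(first == SOUP.find("p"))   # True: SOUP.p is shorthand for find("p")
print(missing)                   # None
```

Because a missing tag yields None rather than an error, it’s worth checking the result before using it.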

Looping

Create a for loop to cycle over the entire webpage contained in the SOUP variable, using the find_all function of Beautiful Soup. It’s not unreasonable to want to use your loop for other tags besides just the <p> tag, so build it as a custom function, designated by the def keyword (for “define”) in Python.

def loopit():
    for TAG in SOUP.find_all('p'):
        print(TAG)

The temporary variable TAG is arbitrary. You can use any term, such as ITEM or i or whatever you want. Each time the loop runs, TAG contains the next result found by the find_all function. In this code, the <p> tag is being searched for.

A function doesn’t run unless it’s explicitly called. You can call your function at the end of your code:

# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    # do a thing here
    loopit()

Run your code to see all <p> tags and each one’s contents.
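The loopit function as written always searches for <p>. One possible way to reuse it for other tags is to pass the tag name as an argument. This is a sketch: the tag_name parameter, the return value, and the hard-coded HTML string are additions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
HTML = "<h1>Title</h1><p>One</p><p>Two</p>"
SOUP = BeautifulSoup(HTML, "html.parser")


def loopit(tag_name="p"):
    """Print every tag of the given name and return how many were found."""
    found = SOUP.find_all(tag_name)
    for TAG in found:
        print(TAG)
    return len(found)


loopit()       # prints both <p> tags
loopit("h1")   # prints the single <h1> tag
```

Defaulting tag_name to "p" keeps the original behavior when the function is called with no arguments.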

Getting just the content

You can exclude tags from being printed by specifying that you want just the “string” (programming lingo for “words”).

def loopit():
    for TAG in SOUP.find_all('p'):
        print(TAG.string)

Of course, once you have the text of a webpage, you can parse it further with the standard Python string libraries. For instance, you can get a word count using len and split:

def loopit():
    for TAG in SOUP.find_all('p'):
        if TAG.string is not None:
            print(len(TAG.string.split()))

This prints the number of words within each paragraph element, omitting those paragraphs for which .string is None (for example, empty paragraphs or ones that mix text with nested tags). To get a grand total, use a variable and some basic math:
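The None check matters more than it might seem. As a short sketch on a made-up HTML string: a paragraph that contains a link has no single .string, while Beautiful Soup’s get_text method gathers the text from the paragraph and everything nested inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
HTML = "<p>Plain text only</p><p>Text with a <a href='#'>link</a> inside</p>"
SOUP = BeautifulSoup(HTML, "html.parser")

plain, mixed = SOUP.find_all("p")

print(plain.string)                   # Plain text only
print(mixed.string)                   # None, because of the nested <a> tag
print(mixed.get_text())               # Text with a link inside
print(len(mixed.get_text().split()))  # 5
```

If you want every paragraph counted, get_text may be the better fit. The .string approach in this article counts only paragraphs made of plain text.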

def loopit():
    NUM = 0
    for TAG in SOUP.find_all('p'):
        if TAG.string is not None:
            NUM = NUM + len(TAG.string.split())
    print("Grand total is ", NUM)
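The same total can be written more compactly with Python’s built-in sum. This is an alternative sketch, not a replacement for the loop above. The function name count_words and the sample HTML are invented for the example, and the function returns the total instead of printing it so other code can reuse it.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
HTML = "<p>one two</p><p>three <b>four</b></p><p>five</p>"
SOUP = BeautifulSoup(HTML, "html.parser")


def count_words(soup, tag_name="p"):
    """Total the words in every matching tag that has a single string."""
    return sum(
        len(tag.string.split())
        for tag in soup.find_all(tag_name)
        if tag.string is not None
    )


# The middle paragraph is skipped because its .string is None
print("Grand total is", count_words(SOUP))   # Grand total is 3
```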

Python homework

There’s a lot more information you can extract with Beautiful Soup and Python. Here are some ideas on how to improve your application:

  • Accept input so you can specify what URL to download and analyze when you launch your application.
  • Count the number of images (<img> tags) on a page.
  • Count the number of images (<img> tags) within another tag (for instance, only images that appear in the <main> div, or only images following a </p> tag).
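As a starting point for the first two ideas and part of the third, here is one possible sketch. The function names count_images and url_from_args, the container parameter, and the SAMPLE string are all assumptions made for illustration, and this is only one of several ways to approach the exercises.

```python
from bs4 import BeautifulSoup


def url_from_args(argv, default):
    """Return the first command-line argument, or a default URL."""
    return argv[1] if len(argv) > 1 else default


def count_images(html, container=None):
    """Count <img> tags, optionally only inside the first <container> tag."""
    soup = BeautifulSoup(html, "html.parser")
    if container is None:
        scope = soup
    else:
        scope = soup.find(container)
        if scope is None:
            # The requested container isn't on the page
            return 0
    return len(scope.find_all("img"))


# Hypothetical stand-in for a downloaded page
SAMPLE = (
    '<main><img src="a.png"><p>text</p><img src="b.png"></main>'
    '<footer><img src="c.png"></footer>'
)

print(count_images(SAMPLE))           # 3
print(count_images(SAMPLE, "main"))   # 2
```

In a full application, you might pass sys.argv to url_from_args, download that URL with requests, and hand the resulting text to count_images. Counting only images that follow a </p> tag is left as an exercise.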