7 useful tips for using the Linux wget command

Wget is a free utility for downloading files from the web. It retrieves data from the Internet and saves it to a file or displays it in your terminal. This is essentially what web browsers, such as Firefox or Chromium, also do, except that by default they render the information in a graphical window and usually require a user to be actively controlling them. The wget utility is designed to be non-interactive, meaning you can script or schedule wget to download files whether you're at your computer or not.

Download a file with wget

You can download a file with wget by providing a link to a specific URL. If you provide a URL that defaults to index.html, then the index page gets downloaded. By default, the file is downloaded into a file of the same name in your current working directory.

$ wget http://example.com
--2021-09-20 17:23:47-- http://example.com/
Resolving example.com... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946
Connecting to example.com|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1256 (1.2K) [text/html]
Saving to: 'index.html'

You can make wget send the data to standard output (stdout) instead by using the --output-document option with a dash (-) character:

$ wget http://example.com --output-document - | head -n4
<!doctype html>
<html>
<head>
   <title>Example Domain</title>

You can use the --output-document option (-O for short) to name your download whatever you want:

$ wget http://example.com --output-document foo.html

Continue a partial download

If you're downloading a very large file, you might find that you have to interrupt the download. With the --continue option (-c for short), wget can figure out where the download left off and continue the file transfer. That means the next time you download a 4 GB Linux distribution ISO, you never have to go back to the start when something goes wrong.

$ wget --continue https://example.com/linux-distro.iso
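
Network interruptions are common with files this large, so it can also help to let wget retry on its own. A minimal sketch combining --continue with the --tries and --waitretry options (the URL is a placeholder):

$ wget --continue --tries=10 --waitretry=5 https://example.com/linux-distro.iso

Here --tries=10 retries the download up to ten times, and --waitretry=5 waits up to 5 seconds between retries.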

Download a sequence of files

If it's not one large file but several files that you need to download, wget can help you with that. Assuming you know the location and filename pattern of the files you want to download, you can use Bash syntax to specify the start and end points of a range of integers to represent a sequence of filenames:

$ wget http://example.com/file_{1..4}.webp
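
Note that the brace expansion happens in your shell, not in wget: Bash expands file_{1..4}.webp into four separate URLs before wget ever runs. If the filenames on the server are zero-padded (these names are hypothetical), Bash handles that too:

$ wget http://example.com/file_{01..10}.webp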

Mirror an entire website

You can download an entire site, including its directory structure, using the --mirror option. This option is the same as running --recursive --level inf --timestamping --no-remove-listing, which means it's infinitely recursive, so you're getting everything on the domain you specify. Depending on how old the website is, that could mean you're getting a lot more content than you realize.
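
For example, to mirror a site into a local directory named after its host (example.com stands in for the site you want to copy):

$ wget --mirror http://example.com/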

If you're using wget to archive a site, then the options --no-cookies --page-requisites --convert-links are also useful to ensure that every page is fresh and complete, and that the site copy is more or less self-contained.
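
Put together, an archive run might look like this (again, example.com is a placeholder):

$ wget --mirror --no-cookies --page-requisites --convert-links http://example.com/

The --convert-links option rewrites the saved pages after the download finishes, so their links point at your local copy rather than the live site.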

Modify HTTP headers

Protocols used for data exchange have a lot of metadata embedded in the packets computers send to communicate. HTTP headers are components of the initial portion of data. When you browse a website, your browser sends HTTP request headers. Use the --debug option to see what header information wget sends with each request:

$ wget --debug example.com
---request begin---
GET / HTTP/1.1
User-Agent: Wget/1.19.5 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: example.com
Connection: Keep-Alive

---request end---

You can modify your request header with the --header option. For instance, it's sometimes useful to mimic a specific browser, either for testing or to account for poorly coded sites that only work correctly for specific user agents.

To identify as Microsoft Edge running on Windows:

$ wget --debug --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59" http://example.com

You can also masquerade as a specific mobile device:

$ wget --debug \
--header="User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 13_5_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Mobile/15E148 Safari/604.1" \
http://example.com

View response headers

In the same way header information is sent with browser requests, header information is also included in responses. You can see response headers with the --debug option:

$ wget --debug example.com
[...]
---response begin---
HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 188102
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Etag: "3147526947"
Server: ECS (sab/574F)
Vary: Accept-Encoding
X-Cache: HIT
Content-Length: 1256

---response end---
200 OK
Registered socket 3 for persistent reuse.
URI content encoding = 'UTF-8'
Length: 1256 (1.2K) [text/html]
Saving to: 'index.html'
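
If you want the response headers without the full debug trace, the --server-response option (-S for short) prints just the headers the server sends. Combined with --spider, wget checks the page without downloading it:

$ wget --server-response --spider http://example.com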

Responding to a 301 response

A 200 response code means that everything has worked as expected. A 301 response, on the other hand, means that a URL has been moved permanently to a different location. It's a common way for a website admin to relocate content while leaving a "trail" so people visiting the old location can still find it. By default, wget follows redirects, and that's probably what you normally want it to do.

However, you can control what wget does when it encounters a 301 response with the --max-redirect option. You can set it to 0 to follow no redirects:

$ wget --max-redirect 0 http://iana.org
--2021-09-21 11:01:35-- http://iana.org/
Resolving iana.org... 192.0.43.8, 2001:500:88:200::8
Connecting to iana.org|192.0.43.8|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.iana.org/ [following]
0 redirections exceeded.

Alternately, you can set it to some other number to control how many redirects wget follows.
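
For instance, to give up after following at most two redirects:

$ wget --max-redirect 2 http://iana.org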

Expand a shortened URL

The --max-redirect option is useful for checking shortened URLs before actually visiting them. Shortened URLs can be useful for print media, in which users can't easily copy and paste a long URL, or on social networks with character limits (this isn't as much of an issue on a modern and open source social network like Mastodon). However, they can also be a little dangerous because their destination is, by nature, concealed. By setting --max-redirect to 0, wget stops at the first redirect and reports the Location it finds without following it, so you can peek into a shortened URL without loading the full resource:

$ wget --max-redirect 0 "https://bit.ly/2yDyS4T"
--2021-09-21 11:32:04-- https://bit.ly/2yDyS4T
Resolving bit.ly... 67.199.248.10, 67.199.248.11
Connecting to bit.ly|67.199.248.10|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://example.com/ [following]
0 redirections exceeded.

The penultimate line of output, starting with Location, reveals the intended destination.
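
Because wget writes its progress messages to stderr, you can script this check by redirecting stderr into the pipe and filtering for that line. A small sketch using the same bit.ly URL as above:

$ wget --max-redirect 0 "https://bit.ly/2yDyS4T" 2>&1 | grep Location
Location: http://example.com/ [following]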

Use wget

Once you practice thinking about the process of exploring the web as a single command, wget becomes a fast and efficient way to pull the information you need from the Internet without bothering with a graphical interface. To help you build it into your usual workflow, we've created a cheat sheet with common wget uses and syntax, including an overview of using it to query an API. Download the Linux wget cheat sheet here.
