Single site download | Website download user manual (V23.7)

WARNING

1. In addition to the settings on this page, downloads are also affected by the settings in [Configuration Options]. See the Configuration Options documentation for details.

2. When the same item is configured in both places, the setting on this page overrides the one in Configuration Options.

# Wildcards

A wildcard is a matching method that sits between plain-text matching and regular-expression matching: it is easy to understand, yet still offers some flexibility.
This software supports only two wildcard characters, `*` and `?`. Note that they must be typed as English (half-width) characters, not Chinese (full-width) ones.

`*` matches one or more arbitrary characters;
`?` matches any single character.
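
As a rough illustration of how these two wildcards behave, the following Python sketch translates a wildcard pattern into a regular expression. The function names are hypothetical and this is not the software's actual matcher. It also assumes that `*` may match an empty string, which the `*example.com` example under Qualified Domain Name suggests.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern into a compiled regular expression."""
    pieces = []
    for ch in pattern:
        if ch == "*":
            pieces.append(".*")  # assumption: any run of characters, possibly empty
        elif ch == "?":
            pieces.append(".")   # exactly one character
        else:
            pieces.append(re.escape(ch))  # everything else is literal text
    return re.compile("".join(pieces), re.DOTALL)


def wildcard_match(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(text) is not None


print(wildcard_match("/news/*.html", "/news/2023/a.html"))  # True
print(wildcard_match("page?.html", "page1.html"))           # True
print(wildcard_match("page?.html", "page10.html"))          # False
```

A full-width `＊` is treated as literal text in this sketch, which mirrors the requirement to type the wildcards as English characters.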

# Qualified Domain Name

(Required) Only URLs whose domain name matches will be downloaded. You can enter multiple domain patterns separated by "|", which means "or".
Wildcard matching is applied to the host name (domain name), which is the red part in the picture above.

For example:

When the download address is https://www.example.com and the website also has the subdomains https://bj.example.com, https://sh.example.com, https://gz.example.com, and https://sz.example.com (sub-sites for Beijing, Shanghai, Guangzhou, Shenzhen, and so on):

1. To download the main site and all sub-sites, set the value to `*.example.com`, which matches all subdomains. If you also want to match the root domain https://example.com, set it to `*example.com`.

2. To download only the main site and the Beijing site, set the value to `www.example.com|bj.example.com`, using "|" to separate the two domain names.
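
The domain examples above can be checked with a small Python sketch. It uses the standard-library `fnmatch` module as a stand-in for the software's matcher, which is an assumption: `fnmatch` lets `*` match an empty string and also gives `[...]` a special meaning that this manual does not describe. The function name `host_allowed` is hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def host_allowed(url: str, setting: str) -> bool:
    """Check the URL's host name against "|"-separated wildcard patterns."""
    host = (urlsplit(url).hostname or "").lower()
    patterns = [p.strip().lower() for p in setting.split("|") if p.strip()]
    return any(fnmatchcase(host, p) for p in patterns)


print(host_allowed("https://bj.example.com/news", "*.example.com"))   # True
print(host_allowed("https://example.com/", "*.example.com"))          # False
print(host_allowed("https://example.com/", "*example.com"))           # True
print(host_allowed("https://sh.example.com/", "www.example.com|bj.example.com"))  # False
```

Under the same assumption, `*example.com` would also accept an unrelated host such as notexample.com, so `example.com|*.example.com` may be the stricter choice when that matters.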

# Restricted Path

If you need a more precise match, use regular expressions: Configuration Options > Download Scope > Limit Path

(Optional) A link is downloaded only if it matches. The field is blank by default, which means no restriction. Use "|" to separate multiple patterns, which means "or".
Wildcard matching is applied to the path plus query parameters, which is the green part in the picture above.

For example:

The website has the addresses /product/index.html and /contact/index.html.

1. To download only pages under the product directory, enter `/product/*`.

2. To download pages under both the product and contact directories, enter `/product/*|/contact/*`, using "|" to separate the two patterns.
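
To make "path + query parameters" concrete, here is a hedged Python sketch that extracts that part of a URL and tests it against the restricted-path setting. The helper names are hypothetical, and `fnmatch` is only an approximation of the software's wildcard matching.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def path_and_query(url: str) -> str:
    """Return the part of the URL that the path settings are matched against."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


def path_allowed(url: str, restricted: str) -> bool:
    """A blank setting allows everything; otherwise one pattern must match."""
    patterns = [p for p in restricted.split("|") if p]
    if not patterns:
        return True
    target = path_and_query(url)
    return any(fnmatchcase(target, p) for p in patterns)


print(path_and_query("https://www.example.com/product/list.html?page=2"))
# /product/list.html?page=2
print(path_allowed("https://www.example.com/product/index.html", "/product/*"))  # True
print(path_allowed("https://www.example.com/contact/index.html", "/product/*"))  # False
```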

# Exclude Path

(Optional) The opposite of Restricted Path: a link that matches is not downloaded. The field is blank by default, which means no restriction. Use "|" to separate multiple patterns, which means "or".
Wildcard matching is applied to the path plus query parameters, which is the green part in the picture above.

For example:

The website has the addresses /product/index.html and /contact/index.html.

1. To skip the pages under the product directory, enter `/product/*`.

2. To skip all pages under both the product and contact directories, enter `/product/*|/contact/*`, using "|" to separate the two patterns.
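
The manual does not say what happens when a link matches both Restricted Path and Exclude Path. The Python sketch below assumes that an exclude match always wins. That precedence, the function names, and the use of `fnmatch` are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def matches_any(target: str, setting: str) -> bool:
    """True when target matches at least one "|"-separated wildcard pattern."""
    return any(fnmatchcase(target, p) for p in setting.split("|") if p)


def should_download(url: str, restricted: str = "", excluded: str = "") -> bool:
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    if excluded and matches_any(target, excluded):
        return False  # assumption: an exclude match always wins
    if restricted:
        return matches_any(target, restricted)
    return True  # both fields blank: no restriction


url = "https://www.example.com/product/index.html"
print(should_download(url, excluded="/product/*"))                      # False
print(should_download(url, excluded="/contact/*"))                      # True
print(should_download(url, restricted="/product/*", excluded="*.html"))  # False
```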

# Maximum Depth

(Required) The depth of pages to download. Pages deeper than the configured depth are not downloaded.
The download URL you enter has depth 1, links found in that page's HTML have depth 2, links found on depth-2 pages have depth 3, and so on.
Download order: pages are downloaded from the smallest depth to the largest.
If the same URL is reached from different pages at different depths, the minimum depth is used.

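For example:

When the download address is https://www.example.com:

1. The front page has depth 1.

2. Clicking from the front page to the list page: the list page has depth 2.

3. Clicking from the list page to a details page: the details page has depth 3.

4. Clicking from a details page to the next article: the next article has depth 4.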
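
The depth rules above correspond to a breadth-first traversal: visiting pages level by level guarantees that the first time a URL is seen is also its minimum depth. The following Python sketch runs on a small hand-written link graph. The paths and the function name are hypothetical, and the software's real crawler may be organized differently.

```python
from collections import deque


def assign_depths(links: dict, start: str, max_depth: int) -> dict:
    """Breadth-first walk that records the minimum depth of each reachable page."""
    depths = {start: 1}  # the download URL itself has depth 1
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depths[page] >= max_depth:
            continue  # links on this page would exceed the maximum depth
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shallowest one
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


links = {
    "/": ["/list", "/about"],
    "/list": ["/detail", "/about"],
    "/detail": ["/next", "/"],
    "/next": [],
}
print(assign_depths(links, "/", max_depth=3))
# {'/': 1, '/list': 2, '/about': 2, '/detail': 3}
```

With `max_depth=4` the same graph would also include `/next` at depth 4.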

# Maximum number of pages

(Required) The maximum number of pages to download. Once the number of downloaded pages reaches this limit, no further pages are downloaded.
One URL counts as one page.
This is only an upper limit; set it according to your needs.

For example, with the number of download pages set to 5000:

1. If the website has 1000 pages, all 1000 pages are downloaded.

2. If the website has 20000 pages, 5000 pages are downloaded in order of depth from smallest to largest, and the remaining 15000 pages are not downloaded.
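
The arithmetic in this example can be sketched in Python as a sort by depth followed by a cut at the page limit. The page names and depths below are made up for illustration, and the order among pages of equal depth is an assumption (discovery order is kept here).

```python
def select_pages(depths: dict, max_pages: int) -> list:
    """Pick at most max_pages URLs, shallowest first."""
    ordered = sorted(depths, key=lambda url: depths[url])  # stable sort keeps ties in order
    return ordered[:max_pages]


# Hypothetical site with 20000 pages, 100 pages per depth level.
big_site = {f"/page{i}.html": 1 + i // 100 for i in range(20000)}
chosen = select_pages(big_site, 5000)
print(len(chosen), len(big_site) - len(chosen))  # 5000 15000

# Hypothetical site with only 1000 pages: everything is downloaded.
small_site = {f"/page{i}.html": 1 + i // 100 for i in range(1000)}
print(len(select_pages(small_site, 5000)))  # 1000
```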

Tips

The more pages you download, the more memory, CPU, and disk space the computer needs.
The faster the CPU, the faster the processing.
As a rule of thumb, for data on the order of millions of items, configure at least 16 GB of memory and set up sufficient virtual memory as a backup.

# Page structure

The directory structure in which HTML pages are saved.
Consistent with the original site: a page that lives in directory A on the original site is saved in directory A after downloading.
Save to root directory: all pages are saved in the root directory.

# File Structure

All resource files other than HTML pages, such as JS, CSS, images, fonts, and other files.
If you select Custom, you can configure the paths under Configuration Options > System Settings > File Path.

# Change encoding to

After downloading, the encoding is automatically converted to the specified encoding.
Most websites now use UTF-8, and a few use GBK. The software correctly identifies the website encoding in 99.99% of cases, including sites whose pages use different encodings. It also automatically removes or rewrites the encoding declarations in the code, covering both the charset in HTML and the charset in CSS.
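
As a minimal sketch of what such a conversion involves, the Python function below decodes a page from a known source encoding, rewrites the charset in its `<meta>` tag, and re-encodes it. It is illustrative only: the function name is hypothetical, the source encoding is passed in rather than detected, and CSS `@charset` rules are not handled.

```python
import re


def convert_html(raw: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Re-encode an HTML page and update the charset named in its <meta> tag."""
    text = raw.decode(source_encoding)
    text = re.sub(
        r'(<meta[^>]*charset\s*=\s*["\']?)[\w-]+',  # keep the prefix, swap the charset name
        r"\g<1>" + target_encoding,
        text,
        flags=re.IGNORECASE,
    )
    return text.encode(target_encoding)


gbk_page = '<meta charset="gbk"><p>中文</p>'.encode("gbk")
print(convert_html(gbk_page, "gbk").decode("utf-8"))
# <meta charset="utf-8"><p>中文</p>
```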

# Download timeout

The timeout for a single download request. When a request times out, the software retries it once. So if a link is very slow and both the first request and the retry time out, the total waiting time is twice the configured value.
Suggestion: if the website is fast, set a shorter timeout to reduce waiting; if the website is slow or you are downloading large files, set a longer timeout, otherwise those slow pages and large files will fail to download.
The default is 30 seconds. A longer timeout is not always better, and neither is a shorter one.
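
The "twice the setting" worst case follows from one attempt plus one retry. The Python sketch below models that accounting without touching the network. The `request` callable and both function names are hypothetical stand-ins, not part of the software.

```python
def download_with_retry(request, timeout_seconds: float = 30.0, retries: int = 1):
    """Call request(timeout_seconds); on TimeoutError retry up to `retries` times.

    Returns (result, seconds_spent_waiting_on_timeouts).
    """
    waited = 0.0
    for _ in range(1 + retries):
        try:
            return request(timeout_seconds), waited
        except TimeoutError:
            waited += timeout_seconds  # a full timeout elapsed on this attempt
    return None, waited


def always_slow(timeout_seconds: float):
    """Stand-in for a server that never answers in time."""
    raise TimeoutError(f"no response within {timeout_seconds} seconds")


print(download_with_retry(always_slow))  # (None, 60.0)
```

With the default of 30 seconds, a link that never responds costs 60 seconds before it is given up on.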

# Cookie

Generally needed for pages behind enhanced verification or a login. For instructions, see: /news/jiaocheng/cookie-useragent.html

# UserAgent

The server uses this value to identify whether the visitor is on a computer or a mobile phone.
For a custom value, follow the method described at the Cookie link above: after opening the console, find the item named User-Agent and copy its value into the text box.
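
For readers who want to see where these two values end up, the Python sketch below attaches a User-Agent and a Cookie to an HTTP request using the standard `urllib.request` module. The header values are placeholders, the function name is hypothetical, and this is not how the software itself is implemented.

```python
import urllib.request


def build_request(url: str, user_agent: str, cookie: str = "") -> urllib.request.Request:
    """Create a request that carries the copied User-Agent and Cookie values."""
    headers = {"User-Agent": user_agent}
    if cookie:
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers)


request = build_request(
    "https://www.example.com/",
    user_agent="Mozilla/5.0 (placeholder value copied from the browser)",
    cookie="session=placeholder",
)
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
print(request.get_header("Cookie"))
```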
