
WARNING
1. In addition to the settings on this page, downloads are also affected by the [Configuration Options] settings. For details, refer to the Configuration Options documentation.
2. When the same item is configured in both places, the setting on this page overrides the one in Configuration Options.
# Wildcards
Wildcard matching sits between plain-text matching and regular-expression matching: it is easy to understand, yet still offers some flexibility.
This software supports only two wildcard characters, "*" and "?". Note that they must be English (half-width) characters, not Chinese (full-width) characters.
- "*" matches one or more arbitrary characters.
- "?" matches any single character.
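As a rough illustration of these rules, the Python sketch below translates a wildcard pattern into a regular expression. It is not the software's actual implementation: the function names are hypothetical, and it assumes that "*" may also match an empty run of characters, which is what the `*example.com` example in the domain section suggests.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")  # assumption: "*" may also match nothing
        elif ch == "?":
            parts.append(".")   # exactly one arbitrary character
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return re.compile("".join(parts), re.DOTALL)

def wildcard_match(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(text) is not None

print(wildcard_match("*.example.com", "bj.example.com"))
print(wildcard_match("b?.example.com", "bj.example.com"))
```

Because the whole text must match, a pattern such as `/product/*` does not match `/contact/product/index.html` in this sketch.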
# Restricted Domain Name
- (Required) Only links whose domain name matches will be downloaded. You can add several domain name patterns separated by "|", which indicates an "or" relationship.
- Wildcard matching is applied to the host name (domain name), which is the red part in the picture above.
For example, suppose the download address is https://www.example.com and the website also has the subdomains https://bj.example.com, https://sh.example.com, https://gz.example.com, and https://sz.example.com (sub-sites for Beijing, Shanghai, Guangzhou, Shenzhen, etc.).
1. To download the main site and all sub-sites, set the value to: *.example.com
   This matches all of these domain names. If you also want to match the root domain name https://example.com, set the value to: *example.com
2. To download only the main site and the Beijing site, set the value to: www.example.com|bj.example.com
   Use "|" to separate the two domain names.
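The domain check described above can be approximated in Python as follows. This is only a sketch under stated assumptions: `host_allowed` is a hypothetical name, and the standard-library `fnmatchcase` stands in for the software's own wildcard matcher (it treats "*" as zero or more characters and also gives "[" a special meaning, which the software may not).

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

def host_allowed(url: str, setting: str) -> bool:
    """Return True if the URL's host name matches any "|"-separated pattern."""
    host = urlsplit(url).hostname or ""  # hostname is already lowercased
    patterns = [p.strip().lower() for p in setting.split("|") if p.strip()]
    return any(fnmatchcase(host, p) for p in patterns)

for url in ("https://www.example.com/", "https://bj.example.com/news",
            "https://sh.example.com/"):
    print(url, host_allowed(url, "www.example.com|bj.example.com"))
```

Under this sketch, a broad pattern such as `*example.com` would also match an unrelated host like `notexample.com`, so the narrower `*.example.com|example.com` may be the safer choice when the root domain is needed.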
# Restricted Path
If you need more precise matching, use regular expressions instead: Configuration Options > Download Scope > Limit Path
- (Optional) A link is downloaded only if it matches. The default is blank, which means no restriction. Use "|" to separate several patterns, indicating an "or" relationship.
- Wildcard matching is applied to the path plus query parameters, which is the green part in the picture above.
For example, suppose the website has the addresses /product/index.html and /contact/index.html.
1. To download only pages under the product directory, enter: /product/*
2. To download pages under the product and contact directories, enter: /product/*|/contact/*
   Use "|" to separate the two patterns.
# Exclude Path
- (Optional) The opposite of the restricted path: a link that matches will not be downloaded. The default is blank, which means no restriction. Use "|" to separate several patterns, indicating an "or" relationship.
- Wildcard matching is applied to the path plus query parameters, which is the green part in the picture above.
For example, suppose the website has the addresses /product/index.html and /contact/index.html.
1. To skip pages under the product directory, enter: /product/*
2. To skip pages under both the product and contact directories, enter: /product/*|/contact/*
   Use "|" to separate the two patterns.
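The two path settings can be pictured as a single filter, as in the Python sketch below. The function names are hypothetical, `fnmatchcase` again stands in for the software's wildcard matcher, and the rule that an exclude match overrides a restricted-path match is an assumption, since this page does not say how the two settings interact.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

def path_and_query(url: str) -> str:
    """Return the part of the URL that the path settings are matched against."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target

def matches_any(target: str, setting: str) -> bool:
    """Return True if the target matches any "|"-separated wildcard pattern."""
    return any(fnmatchcase(target, p) for p in setting.split("|") if p)

def should_download(url: str, limit: str = "", exclude: str = "") -> bool:
    target = path_and_query(url)
    if limit and not matches_any(target, limit):
        return False  # a restricted path is set and the link does not match it
    if exclude and matches_any(target, exclude):
        return False  # assumption: an exclude match always wins
    return True       # blank settings impose no restriction

print(should_download("https://www.example.com/product/index.html", limit="/product/*"))
print(should_download("https://www.example.com/contact/index.html", limit="/product/*"))
```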
# Maximum Depth
- (Required) The download URL you enter has depth 1, links found in that page's HTML code have depth 2, links found on depth-2 pages have depth 3, and so on.
- This setting is the maximum depth of downloaded pages. Pages deeper than the set depth will not be downloaded.
- Download order: pages are downloaded from the smallest depth to the largest.
- If the same URL is reached from different pages at different depths, the minimum depth is used.
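For example, when the download address is https://www.example.com:
1. The front page has depth 1.
2. Clicking List on the front page opens the list page, which has depth 2.
3. Clicking Details page on the list page opens the details page, which has depth 3.
4. Clicking Next article on the details page opens the next article, which has depth 4.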
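The depth rules above correspond to a breadth-first traversal, which visits pages in order of increasing depth and therefore records each URL at its minimum depth. The Python sketch below shows the idea on a small in-memory link graph; the page names and the `links` dictionary are made up for illustration and are not part of the software.

```python
from collections import deque

def crawl_depths(start: str, links: dict, max_depth: int) -> dict:
    """Return {page: depth} for every page within max_depth, in download order."""
    depths = {start: 1}  # the entered download URL has depth 1
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depths[page] >= max_depth:
            continue  # links on this page would exceed the maximum depth
        for target in links.get(page, []):
            if target not in depths:  # first visit is always the minimum depth
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical site: "about" is linked from both the front page and the list.
links = {
    "home": ["list", "about"],
    "list": ["detail", "about"],
    "detail": ["next"],
}
print(crawl_depths("home", links, max_depth=3))
```

With a maximum depth of 3, the `next` page (depth 4) is left out, and `about` is recorded at depth 2 even though it is also linked from a depth-2 page.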
# Maximum number of pages
- (Required) The maximum number of pages to download. Once the number of downloaded pages reaches this value, no further pages are downloaded.
- One URL counts as one page.
- This is an upper limit, which you can set according to your needs.
For example, suppose the number of pages to download is set to 5000.
1. If the website has 1000 pages, all 1000 pages are downloaded.
2. If the website has 20000 pages, 5000 pages are downloaded in order of depth from smallest to largest, and the remaining 15000 pages are not downloaded.
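The arithmetic in this example can be checked with a few lines of Python. This is a simplified sketch, assuming the pages and their depths are already known; `select_pages` and the generated URLs are hypothetical.

```python
def select_pages(pages_with_depth: list, max_pages: int) -> list:
    """Keep at most max_pages URLs, shallowest pages first."""
    # sorted() is stable, so pages of equal depth keep their discovery order.
    ordered = sorted(pages_with_depth, key=lambda item: item[1])
    return [url for url, _depth in ordered[:max_pages]]

# Hypothetical site with 20000 pages spread over depths 1 to 4.
site = [(f"/page/{i}.html", 1 + i % 4) for i in range(20000)]
kept = select_pages(site, max_pages=5000)
print(len(kept), len(site) - len(kept))
```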
Tips
- The more pages you download, the more memory, CPU, and hard disk space the computer needs.
- The faster the CPU, the faster the processing speed.
- As a general guideline, a download involving millions of items needs at least 16 GB of memory, with sufficient virtual memory configured as a backup.
# Page structure
- Refers to the directory structure in which HTML pages are saved.
- Consistent with the original site: a page located in directory A on the original site is saved in directory A after downloading.
- Save to root directory: all pages are saved in the root directory.
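The difference between the two options can be sketched in Python as a mapping from a page URL to a local file path. The function name is hypothetical, and saving a directory URL as index.html is an assumption rather than documented behavior.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

def local_page_path(url: str, keep_structure: bool) -> str:
    """Map a page URL to the relative path it would be saved under."""
    path = urlsplit(url).path
    if path == "" or path.endswith("/"):
        path += "index.html"  # assumption: directory URLs become index.html
    relative = PurePosixPath(path.lstrip("/"))
    # Consistent with the original site: keep directories; otherwise flatten.
    return str(relative) if keep_structure else relative.name

url = "https://www.example.com/product/index.html"
print(local_page_path(url, keep_structure=True))
print(local_page_path(url, keep_structure=False))
```

In this sketch, /product/index.html and /contact/index.html both flatten to the same name when saved to the root directory. How the software resolves such collisions is not described on this page.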
# File Structure
- Refers to the directory structure for all resource files other than HTML pages, such as js, css, image, font, and other files.
- If you select Custom, you can configure the paths in Configuration Options > System Settings > File Path.
# Change encoding to
- After downloading, files are automatically converted to the specified encoding.
- Most websites now use utf-8 encoding, and a few use gbk. The software correctly identifies the website encoding in 99.99% of cases, including sites whose pages use several different encodings. It also automatically removes or updates the encoding declarations in the code, including the charset in HTML code and the charset in CSS code.
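To make the conversion concrete, the Python sketch below re-encodes a gbk page as utf-8 and updates the charset declared in its meta tag. It is an illustration only: the source encoding is passed in rather than detected, the regular expression covers only simple meta declarations, and `convert_html` is a hypothetical name.

```python
import re

# Matches the charset value inside a <meta ...> tag, e.g. charset="gbk".
META_CHARSET = re.compile(r"""(<meta[^>]*charset\s*=\s*["']?)[\w-]+""", re.IGNORECASE)

def convert_html(raw: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Decode the page, rewrite its declared charset, and re-encode it."""
    text = raw.decode(source_encoding)
    text = META_CHARSET.sub(lambda m: m.group(1) + target_encoding, text)
    return text.encode(target_encoding)

page = '<html><head><meta charset="gbk"></head><body>你好</body></html>'
converted = convert_html(page.encode("gbk"), "gbk")
print(converted.decode("utf-8"))
```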
# Download timeout
- The timeout for a single download request. The software retries once after the first timeout, so if a link is very slow and both the first request and the retry time out, the total waiting time is twice the setting.
- Suggestion: if the website is very fast, set a shorter timeout to reduce waiting. If the website is very slow or you are downloading large files, set a longer timeout, otherwise those slow pages and large files will fail to download.
- The default is 30 seconds. Neither a longer nor a shorter value is always better; choose one that suits the website.
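The "twice the setting" figure follows directly from the single retry, as the Python sketch below shows. The `fetch` callable and the function name are hypothetical stand-ins for the software's downloader; no real network request is made.

```python
def fetch_with_retry(fetch, url: str, timeout: float, retries: int = 1):
    """Call fetch(url, timeout), retrying after a timeout.

    Returns (result, seconds_lost), where seconds_lost is the time spent
    waiting on attempts that timed out.
    """
    seconds_lost = 0.0
    for _attempt in range(retries + 1):  # first request plus one retry
        try:
            return fetch(url, timeout), seconds_lost
        except TimeoutError:
            seconds_lost += timeout      # this attempt waited the full timeout
    return None, seconds_lost            # every attempt timed out

def always_slow(url, timeout):
    """Simulated link that never answers in time."""
    raise TimeoutError(url)

print(fetch_with_retry(always_slow, "https://www.example.com/big.zip", timeout=30))
```

With the default of 30 seconds, a link that never responds therefore costs 60 seconds before it is reported as failed.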
# Cookie
- Generally needed for pages with enhanced verification or pages that require login. For instructions, see: /news/jiaocheng/cookie-useragent.html
# UserAgent
- The server uses this value to identify whether the visitor is using a computer or a mobile phone.
- To set a custom value, follow the method in the Cookie link above. After opening the browser console, find the request header named User-Agent and copy its value into the text box.
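Both the Cookie and UserAgent settings end up as HTTP request headers. The Python sketch below shows what that looks like with the standard library; the function name and the header values are placeholders, and the request is only built, not sent.

```python
from urllib.request import Request

def build_request(url: str, user_agent: str, cookie: str = "") -> Request:
    """Build a request carrying the configured User-Agent and optional Cookie."""
    headers = {"User-Agent": user_agent}
    if cookie:
        headers["Cookie"] = cookie  # only sent when a cookie is configured
    return Request(url, headers=headers)

# Placeholder values; copy the real ones from the browser console.
req = build_request(
    "https://www.example.com/",
    user_agent="Mozilla/5.0 (placeholder)",
    cookie="session=placeholder",
)
print(req.header_items())
```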