# download millions of files from http server



## antolap (Dec 13, 2017)

I have to download millions of files from an HTTP/HTTPS server, making a lot of GET requests like:

```
GET /ALL/out/19742
GET /ALL/out/19755
GET /ALL/out/19758
GET /ALL/out/19762
GET /ALL/out/19769
GET /ALL/out/19773
GET /ALL/out/19775
GET /ALL/out/19776
GET /ALL/out/19778
```

I don't want to overload the server, and I don't want it to crash because of tons of connections.

Is there a way to make multiple GET requests over one connection so that I can download multiple files at once?

I think there must be a better way than opening a single connection for each file, like this:
`wget -O ALL/out/999927 https://xxxxxxx/ALL/out/999927`
`wget -O ALL/out/999928 https://xxxxxxx/ALL/out/999928`

Can you help me?


----------



## ekingston (Dec 13, 2017)

It's called establishing a persistent connection (it's part of the HTTP protocol). I have no idea how to do that with wget, or how to make use of it if you can.

If the files are your typical text files, compression (which is supported by most web servers) is also a good idea. This is another thing that can be negotiated between the client and the server. I also have no idea how to make this happen with wget.

Sorry I can't be of more help.
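
For what it's worth, here's a minimal sketch of both ideas using curl instead of wget (assuming curl is an option here). curl reuses a single connection for successive URLs on the same host, and `--compressed` asks the server for a compressed transfer and unpacks the result:

```
# Fetch several files over one reused connection, requesting compression.
# --create-dirs makes the ALL/out/ directories for the -o targets.
curl --compressed --create-dirs \
     -o ALL/out/19742 https://xxxxxxx/ALL/out/19742 \
     -o ALL/out/19755 https://xxxxxxx/ALL/out/19755 \
     -o ALL/out/19758 https://xxxxxxx/ALL/out/19758
```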


----------



## Snurg (Dec 13, 2017)

Why not make a script that runs a fixed number of wget spawns at a time, until all the files have been pulled?
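
A minimal sketch of that idea, assuming the IDs sit in a file `ids.txt`, one per line (a hypothetical filename):

```
# Build full URLs from the IDs, then run at most 4 wget processes at a time,
# handing each one a batch of 25 URLs so it can reuse its connection.
sed 's|^|https://xxxxxxx/ALL/out/|' ids.txt \
    | xargs -n 25 -P 4 wget --quiet --force-directories --no-host-directories
```

`--force-directories --no-host-directories` recreates the `ALL/out/` layout locally without a hostname directory.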


----------



## aragats (Dec 13, 2017)

antolap said:


> wget -O ALL/out/999927 https://xxxxxxx/ALL/out/999927
> wget -O ALL/out/999928 https://xxxxxxx/ALL/out/999928


If you know that all those files (and only those) are in a single directory, the _recursive_ and _no-parent_ options may work:
`wget -r -np https://xxxxxxx/ALL/out`


----------



## antolap (Dec 13, 2017)

ekingston said:


> it is called establishing a persistent connection (it's part of the http protocol). I have no idea how to do so with wget, or how to use it if you can.



Yes, I'm interested in this.
I can use any program: curl, wget, or anything else.


----------



## ljboiler (Dec 13, 2017)

By default, wget uses persistent connections; no special option is necessary.
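
One way to check that for yourself: fetch two files in a single invocation and look for the reuse message in wget's normal output (the exact wording is an assumption about your wget version):

```
# The log should show a line like "Reusing existing connection to xxxxxxx:443."
# for the second file if the connection was kept alive.
wget https://xxxxxxx/ALL/out/19742 https://xxxxxxx/ALL/out/19755 2>&1 \
    | grep -i 'reusing'
```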


----------



## CraigW (Dec 13, 2017)

Something I (or perhaps a friend *wink*) once used while being gentle to a website.

Might make a starting point...

`wget --user-agent='Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' --limit-rate=750k -x -r -l2 -np -c --random-wait --wait=2 https://XXXXXXX/1/items/usgs_drg_il_37089_a4/*`


----------



## antolap (Dec 13, 2017)

ljboiler said:


> By default, wget uses persistent connections; no special option is necessary.


But if I run `wget file1`, then `wget file2`, then `wget file3`, I suppose a new connection is opened for each file: by the time `wget file2` is executed, `wget file1` has already finished and closed its connection.


----------



## usdmatt (Dec 14, 2017)

Try passing somewhere between 10 and 50 files on the same wget command line. It should reuse the same connection automatically if multiple files are from the same host. (Looking at the man page, you can also specify a file to read URLs from, although I'd be hesitant to just give it a file with 1 million entries.)

The method suggested by CraigW seems good, but I can't quite see how it knows what files are available unless directory indexes are enabled?

When passing multiple files on the command line, it looks like you have to specify `wget full-url1 full-url2 full-url3`. It would be nice if you could do something like `wget --base=http://website/some/path/ file1 file2 file3`, but I can't find anything like that. You may also need to watch the maximum command line length (not sure what that is off the top of my head).
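
A minimal sketch of that batching approach, assuming a file `urls.txt` with one full URL per line (a hypothetical filename). xargs takes care of both the batch size and the command line length limit:

```
# Run wget once per batch of 50 URLs; each run can reuse one connection,
# since every URL points at the same host.
xargs -n 50 wget --force-directories --no-host-directories < urls.txt
```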


----------



## antolap (Dec 14, 2017)

usdmatt said:


> It would be nice if you could do something like `wget --base=http://website/some/path/ file1 file2 file3`, but I can't find anything like that.



It would be very useful.


----------



## usdmatt (Dec 14, 2017)

You could probably do `wget (options) http://url/path/{file1,file2,file3,file4,etc}` and let the shell expand it for you. It still ends up running a long command line, though.
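
For example, with the IDs from the original post, the shell expands this into a single wget invocation with nine full URLs:

```
wget --force-directories --no-host-directories \
    "https://xxxxxxx/ALL/out/"{19742,19755,19758,19762,19769,19773,19775,19776,19778}
```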


----------



## Ponticelli (Dec 14, 2017)

You could use a list of URLs and feed that into wget:



> -i file
> --input-file=file
> Read URLs from a local or external file.  If - is specified as
> file, URLs are read from the standard input.  (Use ./- to read from
> a file literally named -.)
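
A minimal sketch of the whole job built on that option, again assuming the IDs sit in `ids.txt`, one per line (a hypothetical filename). `--wait` and `--limit-rate` keep the load on the server gentle, which was the original concern:

```
# Turn the ID list into full URLs, then let a single wget process fetch
# everything over a reused connection, pausing 1 second between requests.
sed 's|^|https://xxxxxxx/ALL/out/|' ids.txt > urls.txt
wget --force-directories --no-host-directories \
     --wait=1 --limit-rate=500k --input-file=urls.txt
```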


----------

