Download Large Files from SharePoint Online

In my previous post, I explained how to upload large files of up to 10 GB to SharePoint Online, in line with the revised upload limit. An upload feature is incomplete without a matching download, and I mistakenly assumed that my earlier file download code would work for these larger files as well. But just as I had to rewrite the upload code to accommodate the new limit, the download code also needed a complete makeover. I first tried the following approaches, with no success.

OpenBinaryDirect

FileInformation fileInformation = File.OpenBinaryDirect(clientContext, serverRelativeUrl);

Error:: Stream was too long.

In this approach, I was trying to download a file, say of size 10GB, into a single .NET object. However, even for a 64-bit managed application on a 64-bit Windows operating system, a single object can be no larger than 2GB by default.

Refer:: https://msdn.microsoft.com/en-us/library/ms241064(v=vs.100).aspx.

 

OpenBinaryStream


File oFile = web.GetFileByServerRelativeUrl(strServerRelativeURL);
clientContext.Load(oFile);
ClientResult<Stream> stream = oFile.OpenBinaryStream();
clientContext.ExecuteQuery();

Error:: Invalid MIME content-length header encountered on read.

In the Visual Studio Diagnostic Tools window, I could see that the former approach failed after downloading around 1.5GB of data, whereas the OpenBinaryStream download failed after only 800-900MB! This is because MemoryStream uses a byte[] internally. Whenever its internal buffer cannot hold the incoming data, it doubles the buffer's size, so the error is thrown a lot earlier than with the OpenBinaryDirect approach.

Refer:: http://stackoverflow.com/a/15597139.
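
The growth pattern described above can be observed directly. The following is a minimal sketch, not part of the original download code: the class and method names (MemoryStreamGrowth, ObserveCapacities) are made up for illustration. It writes small chunks into a MemoryStream and records every distinct Capacity value it sees.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class MemoryStreamGrowth
{
    // Writes 'totalBytes' into a MemoryStream in pieces of 'chunkSize' and
    // returns every distinct Capacity value observed along the way.
    public static List<int> ObserveCapacities(int totalBytes, int chunkSize)
    {
        var capacities = new List<int>();
        byte[] chunk = new byte[chunkSize];

        using (var ms = new MemoryStream())
        {
            int written = 0;
            while (written < totalBytes)
            {
                int count = Math.Min(chunkSize, totalBytes - written);
                ms.Write(chunk, 0, count);
                written += count;

                // Record the capacity only when the internal buffer was reallocated.
                if (capacities.Count == 0 || capacities[capacities.Count - 1] != ms.Capacity)
                {
                    capacities.Add(ms.Capacity);
                }
            }
        }
        return capacities;
    }
}
```

Printing the result of MemoryStreamGrowth.ObserveCapacities(1 << 20, 4096) shows how few, and how large, the growth steps are. Each step allocates a new array and copies the old contents into it, so the old and the new buffer exist side by side for a moment. That is why the failure arrives well before the data itself reaches 2GB.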

 

Correct Approach

After a lot of digging, I figured out that the only way I could download such a huge file was by using a Remote Procedure Call (RPC). Here’s the complete code for downloading a large file from SharePoint Online.

string fullFilePath = String.Empty;

Uri targetSite = new Uri(ctx.Web.Url);

SharePointOnlineCredentials spCredentials = (SharePointOnlineCredentials)ctx.Credentials;
string authCookieValue = spCredentials.GetAuthenticationCookie(targetSite);

string requestUrl = ctx.Url + "/_vti_bin/_vti_aut/author.dll";
string method = Utility.GetEncodedString("get document:15.0.0.4455");
serviceName = Utility.GetEncodedString(ctx.Web.ServerRelativeUrl);
if(documentName.StartsWith("/"))
{
	documentName = documentName.Substring(1);
}
documentName = Utility.GetEncodedString(documentName);
string oldThemeHtml = "false";
string force = "true";
string getOption = "none";
string docVersion = String.Empty; //directly passed as empty
string timeOut = "0";
string expandWebPartPages = "true";

string rpcCallString = String.Format("method={0}&service%5fname={1}&document%5fname={2}&old%5ftheme%5fhtml={3}&force={4}&get%5foption={5}&doc%5fversion=&timeout={6}&expandWebPartPages={7}",
	method, serviceName, documentName, oldThemeHtml, force, getOption, timeOut, expandWebPartPages);

HttpWebRequest wReq = WebRequest.Create(requestUrl) as HttpWebRequest;
wReq.Method = "POST";
wReq.ContentType = "application/x-vermeer-urlencoded";
wReq.Headers["X-Vermeer-Content-Type"] = "application/x-vermeer-urlencoded";
wReq.UserAgent = "MSFrontPage/15.0";
wReq.UseDefaultCredentials = false;
wReq.Accept = "auth/sicily";
wReq.Headers["MIME-Version"] = "1.0";
wReq.Headers["X-FORMS_BASED_AUTH_ACCEPTED"] = "T";
wReq.Headers["Accept-encoding"] = "gzip, deflate";
wReq.Headers["Cache-Control"] = "no-cache";

wReq.CookieContainer = new CookieContainer();
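// GetAuthenticationCookie returns "SPOIDCRL=<value>". Strip the name prefix with Substring:
// TrimStart with a char array would also eat leading value characters that are in that set.
wReq.CookieContainer.Add(
	new Cookie("SPOIDCRL",
		authCookieValue.StartsWith("SPOIDCRL=") ? authCookieValue.Substring("SPOIDCRL=".Length) : authCookieValue,
		String.Empty,
		targetSite.Authority));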

wReq.KeepAlive = true;

//create unique dir for the download
DirectoryInfo tempFilePath = Directory.CreateDirectory(Path.Combine(tempFileLoc, Guid.NewGuid().ToString()));

using (Stream requestStream = wReq.GetRequestStream())
{
	byte[] rpcHeader = Encoding.UTF8.GetBytes(rpcCallString);

	requestStream.Write(rpcHeader, 0, rpcHeader.Length);
	requestStream.Close();

	fullFilePath = Path.Combine(tempFilePath.FullName, fileName);

	using (Stream strOut = File.OpenWrite(fullFilePath))
	{
		using (var sr = wReq.GetResponse().GetResponseStream())
		{
			byte[] buffer = new byte[16 * 1024];
			int read;
			bool isHtmlRemoved = false;
			while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
			{
				if(!isHtmlRemoved)
				{
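					// Decode only the bytes actually read. ASCII maps one byte to one char,
					// so the string index can be used as a byte offset into the buffer.
					string result = Encoding.ASCII.GetString(buffer, 0, read);
					int startPos = result.IndexOf("</html>", StringComparison.Ordinal);
					if (startPos > -1)
					{
						// skip '</html>' itself (7 characters) plus the line break that follows it
						startPos += 8;

						strOut.Write(buffer, startPos, read - startPos);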

						isHtmlRemoved = true;
					}                                    
				}
				else
				{
					strOut.Write(buffer, 0, read);
				}
			}
		}
	}
}

Evaluation

  • The RPC method used here is “get document”; “15.0.0.4455” is the server extension version number.
  • Service name is the server-relative URL of your site.
  • documentName is the serverRelativeUrl (FileRef) of the file to be downloaded, minus the web's ServerRelativeUrl.
  • For authentication, I am using the CookieContainer of HttpWebRequest.
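
To make the documentName rule concrete, here is a small sketch of how the web-relative path could be derived. The helper name ToWebRelativePath and the site URL "/sites/demo" in the comment are assumptions for illustration only; they do not appear in the code above.

```csharp
using System;

public static class RpcPaths
{
    // Returns the file path relative to the web, for example
    // ("/sites/demo/Doc lib/90 MB.docx", "/sites/demo") gives "Doc lib/90 MB.docx".
    public static string ToWebRelativePath(string fileServerRelativeUrl, string webServerRelativeUrl)
    {
        if (fileServerRelativeUrl == null) throw new ArgumentNullException(nameof(fileServerRelativeUrl));
        if (webServerRelativeUrl == null) throw new ArgumentNullException(nameof(webServerRelativeUrl));

        // Normalise the web URL so that "/" (root web) and "/sites/demo/" both work.
        string webPrefix = webServerRelativeUrl.TrimEnd('/') + "/";

        string relative = fileServerRelativeUrl.StartsWith(webPrefix, StringComparison.OrdinalIgnoreCase)
            ? fileServerRelativeUrl.Substring(webPrefix.Length)
            : fileServerRelativeUrl;

        // Like the StartsWith("/") check in the post: make sure there is no leading slash.
        return relative.TrimStart('/');
    }
}
```

The result still has to be passed through the same encoding step as in the post (Utility.GetEncodedString) before it goes into the RPC call string.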

RPC

In case you’re not familiar with this format: RPC not only returns the actual file, it also prefixes the file content with HTML. So, in order to get the actual file, we need to remove this HTML from the response. This is exactly why I look up the index of ‘</html>’ and set startPos (the starting position for writing the file) accordingly.
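
The stripping step can also be written as a standalone helper that works on raw bytes, so that it still finds the marker when it happens to be split across two reads. This is a sketch under assumptions, not the code I actually ran: the names RpcResponse and CopyAfterMarker are hypothetical, and the marker is assumed to be "</html>" followed by a single "\n" byte, which is what the 8-byte skip in the code above implies.

```csharp
using System;
using System.IO;
using System.Text;

public static class RpcResponse
{
    // Copies everything that follows the first occurrence of 'marker' from
    // 'input' to 'output' and returns the number of payload bytes copied.
    public static long CopyAfterMarker(Stream input, Stream output, byte[] marker)
    {
        if (marker == null || marker.Length == 0)
            throw new ArgumentException("marker must not be empty", nameof(marker));

        // KMP failure table, so a partial match can resume correctly after a mismatch.
        int[] fail = new int[marker.Length];
        for (int i = 1, k = 0; i < marker.Length; i++)
        {
            while (k > 0 && marker[i] != marker[k]) k = fail[k - 1];
            if (marker[i] == marker[k]) k++;
            fail[i] = k;
        }

        byte[] buffer = new byte[16 * 1024];
        int matched = 0;      // number of marker bytes matched so far (survives across reads)
        bool found = false;
        long copied = 0;
        int read;

        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            int offset = 0;
            while (!found && offset < read)
            {
                byte b = buffer[offset++];
                while (matched > 0 && b != marker[matched]) matched = fail[matched - 1];
                if (b == marker[matched]) matched++;
                if (matched == marker.Length) found = true;
            }

            // Once the marker has been seen, everything after it is file content.
            if (found && offset < read)
            {
                output.Write(buffer, offset, read - offset);
                copied += read - offset;
            }
        }
        return copied;
    }
}
```

With the response stream sr and the output file stream strOut from the download code, a call would look like RpcResponse.CopyAfterMarker(sr, strOut, Encoding.ASCII.GetBytes("</html>\n")). If the marker is never found, nothing is written and the method returns 0.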

Following is a sample of the HTML sliced out from the download of a file, 90 MB.docx. As you can see, it just contains the file's meta info.


<html><head><title>vermeer RPC packet</title></head>
<body>


method=get document:15.0.0.4420


message=successfully retrieved document 'Doc lib/90 MB.docx' from 'Doc lib/90 MB.docx'


document=

<ul>

<li>document_name=Doc lib/90 MB.docx

<li>meta_info=

<ul>

<li>display_urn:schemas-microsoft-com:office:office#Editor

<li>SW|Piyush Singh

<li>vti_rtag

<li>SW|rt:A90CEB13-B279-480F-B07C-244670076247@00000000006

<li>vti_etag

<li>SW|"{A90CEB13-B279-480F-B07C-244670076247},6"

<li>vti_parserversion

<li>SR|16.0.0.5312

<li>vti_folderitemcount

<li>IR|0

<li>vti_timecreated

<li>TR|10 Jul 2014 20:55:16 -0000

<li>vti_sourcecontrolcheckincomment

<li>SR|File Restoration on Thursday, June 2, 2016

<li>vti_streamhash

<li>SR|0x02C8D921A84FE3E82F3C2A5866DA589513DA11314C

<li>vti_canmaybeedit

<li>BX|true

<li>vti_author

<li>SR|i:0#.f|membership|piyush@something.onmicrosoft.com

<li>vti_timelastwritten

<li>TR|02 Jun 2016 12:42:39 -0000

<li>vti_level

<li>IR|1

<li>vti_modifiedby

<li>SR|i:0#.f|membership|piyush.singh@something.onmicrosoft.com

<li>display_urn:schemas-microsoft-com:office:office#Author

<li>SW|Piyush Singh

<li>source_item_id_Col

<li>SW|10__1405054516000

<li>vti_foldersubfolderitemcount

<li>IR|0

<li>vti_filesize

<li>IR|94437376

<li>ContentTypeId

<li>SW|0x010100623C9C49E42A00419C619EE6EAF8D8C1

<li>vti_timelastmodified

<li>TR|10 Jul 2014 20:55:16 -0000

<li>vti_nexttolasttimemodified

<li>TR|02 Jun 2016 12:42:42 -0000

<li>vti_candeleteversion

<li>BR|true

<li>vti_sourcecontrolversion

<li>SR|V5.0

<li>vti_sourcecontrolcookie

<li>SR|fp_internal
</ul>

</ul>

</body>
</html>

Reference

Finally, this post from Steve Curran really helped me clear my doubts regarding RPC.

I have tested the above code and it has worked perfectly well for the download of files up to 9.7GB.

 

 

15 thoughts on “Download Large Files from SharePoint Online”

  1. Hi Piyush,

    Thanks for posting this article on downloading large files using RPC.
    I simply pasted this code into a POC and tried to download a 2 GB file from SPO.

    After 1.7 GB, I got the error “Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host”. Do you have any idea in which cases this error can be thrown? I am using a trial tenant.

    Also, how much time does it usually take to download 2 GB of content?
    If I keep the chunk size at 50 MB, will it download in 50 MB chunks?

    I actually want to migrate large files from SPO to SPO in chunks.
    I used File.OpenBinaryDirect, and when I called Seek on the stream it threw the error “Stream does not support seek operation”.

    So I decided to download the large file locally using the above code and migrate the data in chunks from the locally saved file.


    • Hi Dharmesh,

      Happy to help! 🙂

      And thanks for sharing your result with a 2GB file. I am assuming that you were able to successfully download the 9.5GB file as well!


      • Hi Piyush,

        Thanks a lot for your reply.
        Yes, I was able to download the 9.5 GB file as well.
        Yes, I want to download the contents of each file version. Is it possible? If yes, can you please share a piece of code that I can use to download file version content?

        Please find my actual requirement below.
        I want to migrate large files (with or without versions) from SPO to SPO. So I can either

        1) Download the large file to the local file system and upload it from there in chunks
        2) Get the data in chunks using the above code and upload it in chunks

        With the above approach I was able to download a 9.5 GB file. I have not checked anything larger than that. Is there any limitation? Can I download files of up to 14 GB as well, since in SPO I can upload up to 15 GB?

        Also, can I retrieve file contents in chunks using RPC? Do you have any idea on this?


  2. Hi Piyush,

    I am able to retrieve 14.50 GB of content as well using the above approach. I am also able to retrieve the contents of each file version.

    Do you have any idea whether we can retrieve file contents in chunks using RPC?

    I mean: retrieve 2MB of data using the above code and write it to the file system,
    then retrieve the next 2MB from the next position and write it to the file system, and so on, i.e. return a byte[] every time. Is it possible?


    • Glad to know that you were able to download files of up to 15GB. I guess for versions, there is an option to provide docVersion in the above code, which I purposefully kept blank in order to automatically download the latest version of the file.

      As for your other query: we are already downloading the data in parts. GetResponseStream does not download the entire data; it only opens a Stream object (https://stackoverflow.com/a/21281089). You can verify this by putting a breakpoint on that line: even for a large file, it passes very quickly. The real download starts inside the while loop. Once execution is in there, check the path where the file is being written. The file size increases gradually as the chunks are downloaded and appended to it.


    • No. Access tokens work with the SharePoint REST APIs only. For RPCs, just like the SharePoint Web Services, SharePoint claims authentication is required. This means getting the authentication cookie first, using the username and password, and then using it in the actual RPC call.

