Download Large Files from SharePoint Online

In my previous post, I explained how to upload large files of up to 10 GB to SharePoint Online, in line with the newly revised upload limit. An upload operation, as we know, is incomplete without a download. I mistakenly assumed that my earlier file-download code would work for these larger files as well, but just as I had to rewrite the upload code to accommodate the new limit, the download code also needed a complete makeover. I initially tried the following approaches, with no success.

OpenBinaryDirect

FileInformation fileInformation = File.OpenBinaryDirect(clientContext, serverRelativeUrl);

Error: Stream was too long.

In this approach, I was trying to download a file of, say, 10 GB into a single .NET object. However, even for a 64-bit managed application on a 64-bit Windows operating system, you cannot create a single object larger than 2 GB.

Refer: https://msdn.microsoft.com/en-us/library/ms241064(v=vs.100).aspx.

 

OpenBinaryStream


File oFile = web.GetFileByServerRelativeUrl(strServerRelativeURL);
clientContext.Load(oFile);
ClientResult<Stream> stream = oFile.OpenBinaryStream();
clientContext.ExecuteQuery();
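//after ExecuteQuery, stream.Value exposes the file content as a System.IO.Stream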

Error: Invalid MIME content-length header encountered on read.

In the Visual Studio Diagnostic Tools window I could see that the former approach failed after downloading around 1.5 GB of data, whereas the OpenBinaryStream download failed after only 800-900 MB! This is because MemoryStream uses a byte[] internally: whenever the incoming data no longer fits in its internal buffer, it allocates a new buffer of double the size, so the 2 GB object limit is hit a lot earlier than with the OpenBinaryDirect approach.

Refer: http://stackoverflow.com/a/15597139.
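
To see why this blows up so quickly, here is a minimal, purely illustrative sketch (reusing the stream variable from the snippet above; it is not code from my actual solution) of what buffering the whole response in a MemoryStream amounts to:

//illustrative only: buffer the CSOM response stream in memory
using (MemoryStream ms = new MemoryStream())
{
	//each time the internal byte[] fills up, MemoryStream allocates a new array of
	//roughly double the size and copies the old data across, so the requested
	//allocation crosses the 2 GB per-object limit well before the 10 GB file is read
	stream.Value.CopyTo(ms);
}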

 

Correct Approach

After a lot of digging, I figured out that the only way to download such a huge file is via the FrontPage Remote Procedure Call (RPC) protocol, exposed through author.dll. Here is the complete code for downloading a large file from SharePoint Online.
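//Assumed context, not shown in the snippet below: ctx is an authenticated ClientContext,
//documentName holds the file's server-relative URL minus the web's server-relative URL,
//fileName is the name to save the file as, and tempFileLoc is a local folder for downloads.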

string fullFilePath = String.Empty;

Uri targetSite = new Uri(ctx.Web.Url);

SharePointOnlineCredentials spCredentials = (SharePointOnlineCredentials)ctx.Credentials;
string authCookieValue = spCredentials.GetAuthenticationCookie(targetSite);

string requestUrl = ctx.Url + "/_vti_bin/_vti_aut/author.dll";
string method = Utility.GetEncodedString("get document:15.0.0.4455");
string serviceName = Utility.GetEncodedString(ctx.Web.ServerRelativeUrl);
if(documentName.StartsWith("/"))
{
	documentName = documentName.Substring(1);
}
documentName = Utility.GetEncodedString(documentName);
string oldThemeHtml = "false";
string force = "true";
string getOption = "none";
string docVersion = String.Empty; //directly passed as empty
string timeOut = "0";
string expandWebPartPages = "true";

string rpcCallString = String.Format("method={0}&service%5fname={1}&document%5fname={2}&old%5ftheme%5fhtml={3}&force={4}&get%5foption={5}&doc%5fversion=&timeout={6}&expandWebPartPages={7}",
	method, serviceName, documentName, oldThemeHtml, force, getOption, timeOut, expandWebPartPages);
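//note: the %5f sequences are just URL-encoded underscores, so the decoded request body has the form
//method=get document:15.0.0.4455&service_name=...&document_name=...&old_theme_html=false&force=true&get_option=none&doc_version=&timeout=0&expandWebPartPages=true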

HttpWebRequest wReq = WebRequest.Create(requestUrl) as HttpWebRequest;
wReq.Method = "POST";
wReq.ContentType = "application/x-vermeer-urlencoded";
wReq.Headers["X-Vermeer-Content-Type"] = "application/x-vermeer-urlencoded";
wReq.UserAgent = "MSFrontPage/15.0";
wReq.UseDefaultCredentials = false;
wReq.Accept = "auth/sicily";
wReq.Headers["MIME-Version"] = "1.0";
wReq.Headers["X-FORMS_BASED_AUTH_ACCEPTED"] = "T";
wReq.Headers["Accept-encoding"] = "gzip, deflate";
wReq.Headers["Cache-Control"] = "no-cache";

wReq.CookieContainer = new CookieContainer();
wReq.CookieContainer.Add(
	new Cookie("SPOIDCRL",
		authCookieValue.Substring("SPOIDCRL=".Length), //strip only the "SPOIDCRL=" prefix (TrimStart with a char array could eat into the cookie value itself)
		String.Empty,
		targetSite.Authority));

wReq.KeepAlive = true;

//create unique dir for the download
DirectoryInfo tempFilePath = Directory.CreateDirectory(Path.Combine(tempFileLoc, Guid.NewGuid().ToString()));

using (Stream requestStream = wReq.GetRequestStream())
{
	byte[] rpcHeader = Encoding.UTF8.GetBytes(rpcCallString);

	requestStream.Write(rpcHeader, 0, rpcHeader.Length);
	requestStream.Close();

	fullFilePath = Path.Combine(tempFilePath.FullName, fileName);

	using (Stream strOut = File.OpenWrite(fullFilePath))
	{
		using (var sr = wReq.GetResponse().GetResponseStream())
		{
			byte[] buffer = new byte[16 * 1024];
			int read;
			bool isHtmlRemoved = false;
			while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
			{
				if (!isHtmlRemoved)
				{
					string result = Encoding.UTF8.GetString(buffer, 0, read);
					int startPos = result.IndexOf("</html>");
					if (startPos > -1)
					{
						//skip past '</html>' and the newline after it (7 + 1 = 8 bytes; the HTML
						//preamble is plain ASCII, so the character index doubles as the byte offset)
						startPos += 8;
						
						strOut.Write(buffer, startPos, read - startPos);

						isHtmlRemoved = true;
					}                                    
				}
				else
				{
					strOut.Write(buffer, 0, read);
				}
			}
		}
	}
}
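
The Utility.GetEncodedString helper isn't shown above. Here is a minimal sketch of such a helper, assuming that simple URL-encoding of the value is enough for the x-vermeer-urlencoded body (adjust it to whatever encoder you already use):

public static class Utility
{
	//assumption: plain URL-encoding of the parameter value is sufficient here
	public static string GetEncodedString(string value)
	{
		return System.Web.HttpUtility.UrlEncode(value);
	}
}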

Evaluation

  • Here I am using the "get document" method, and "15.0.0.4455" is the server extension version number.
  • serviceName is the server-relative URL of your site.
  • documentName is the server-relative URL (FileRef) of the file to be downloaded, minus the web's server-relative URL (see the sketch after this list).
  • For authentication, I am passing the SPOIDCRL cookie through the CookieContainer of the HttpWebRequest.
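
For example, documentName can be derived from the file's FileRef like this (the file path below is only a sample):

//sample derivation of documentName (the path here is just an example)
string webServerRelativeUrl = ctx.Web.ServerRelativeUrl;              //e.g. "/sites/demo"
string fileRef = "/sites/demo/Doc lib/90 MB.docx";                    //FileRef of the file to download
string documentName = fileRef.Substring(webServerRelativeUrl.Length); //-> "/Doc lib/90 MB.docx"
//the main code above then trims the leading '/' before encoding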

RPC

In case you’re not familiar with this format: the RPC call not only returns the actual file, it also prefixes the file content with HTML. So, in order to get the actual file, we need to strip this HTML from the response, which is exactly why I am finding the index of ‘</html>‘ and using it to set startPos (the starting position for writing the file).

Below is a sample of the HTML sliced out of the download of a file named 90 MB.docx. As you can see, it contains only the file’s meta information.


<html><head><title>vermeer RPC packet</title></head>
<body>


method=get document:15.0.0.4420


message=successfully retrieved document 'Doc lib/90 MB.docx' from 'Doc lib/90 MB.docx'


document=

<ul>

<li>document_name=Doc lib/90 MB.docx

<li>meta_info=

<ul>

<li>display_urn:schemas-microsoft-com:office:office#Editor

<li>SW|Piyush Singh

<li>vti_rtag

<li>SW|rt:A90CEB13-B279-480F-B07C-244670076247@00000000006

<li>vti_etag

<li>SW|"{A90CEB13-B279-480F-B07C-244670076247},6"

<li>vti_parserversion

<li>SR|16.0.0.5312

<li>vti_folderitemcount

<li>IR|0

<li>vti_timecreated

<li>TR|10 Jul 2014 20:55:16 -0000

<li>vti_sourcecontrolcheckincomment

<li>SR|File Restoration on Thursday, June 2, 2016

<li>vti_streamhash

<li>SR|0x02C8D921A84FE3E82F3C2A5866DA589513DA11314C

<li>vti_canmaybeedit

<li>BX|true

<li>vti_author

<li>SR|i:0#.f|membership|piyush@something.onmicrosoft.com

<li>vti_timelastwritten

<li>TR|02 Jun 2016 12:42:39 -0000

<li>vti_level

<li>IR|1

<li>vti_modifiedby

<li>SR|i:0#.f|membership|piyush.singh@something.onmicrosoft.com

<li>display_urn:schemas-microsoft-com:office:office#Author

<li>SW|Piyush Singh

<li>source_item_id_Col

<li>SW|10__1405054516000

<li>vti_foldersubfolderitemcount

<li>IR|0

<li>vti_filesize

<li>IR|94437376

<li>ContentTypeId

<li>SW|0x010100623C9C49E42A00419C619EE6EAF8D8C1

<li>vti_timelastmodified

<li>TR|10 Jul 2014 20:55:16 -0000

<li>vti_nexttolasttimemodified

<li>TR|02 Jun 2016 12:42:42 -0000

<li>vti_candeleteversion

<li>BR|true

<li>vti_sourcecontrolversion

<li>SR|V5.0

<li>vti_sourcecontrolcookie

<li>SR|fp_internal
</ul>

</ul>

</body>
</html>

Reference

Finally, this post from Steve Curran really helped me clear up my doubts regarding RPC.

I have tested the above code, and it has worked perfectly well for downloading files of up to 9.7 GB.

 

 
