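How can I optimize memory usage in this algorithm?

Question:

I am developing a log parser, and I am reading string files larger than 150MB. This is my approach. Is there any way to optimize what happens inside the while statement? The problem is that it consumes a lot of memory. I also tried a StringBuilder and ran into the same memory consumption.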

private void ReadLogInThread()
{
    string lineOfLog = string.Empty;

    try
    {
        StreamReader logFile = new StreamReader(myLog.logFileLocation);
        InformationUnit infoUnit = new InformationUnit();

        infoUnit.LogCompleteSize = myLog.logFileSize;

        while ((lineOfLog = logFile.ReadLine()) != null)
        {
            myLog.transformedLog.Add(lineOfLog); // List<string>
            myLog.logNumberLines++;

            infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
            infoUnit.CurrentLine = lineOfLog;
            infoUnit.CurrentSizeRead += lineOfLog.Length;

            if (onLineRead != null)
                onLineRead(infoUnit);
        }
    }
    catch { throw; }
}

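Thanks in advance!

EXTRA: I am saving each line because, after reading the log, I need to check every stored line for some information. The language is C#.

What language is this? – 2010-01-04 19:47:10

What is the reason for keeping every line? What does a memory profile show as the most expensive object or objects? What memory threshold are you aiming for? – 2010-01-04 19:50:56

How much memory are you using, and how much do you consider reasonable? – Dolphin 2010-01-04 20:12:45

Answer

If your log lines can actually be parsed into a data row representation, you can save memory.

Here is a typical log line I can think of:

Event at: 2019-01-05 0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success

This line takes about 200 bytes in memory (roughly 100 characters at 2 bytes each). By contrast, the following representation takes only about 16 bytes: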

enum LogReason { Operation, Error, Warning }          // 4 bytes (int by default)
enum EventKind : short { DataStoreOperation, DataReadOperation }  // 2 bytes
enum OperationStatus : short { Success, Failed }      // 2 bytes

struct LogRow
{
    public DateTime EventTime;      // 8 bytes
    public LogReason Reason;
    public EventKind Kind;
    public OperationStatus Status;
}
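
As a rough sketch of how a line could be turned into such a row: the code below assumes each line has exactly four `Name: Value` fields separated by `", "` and a timestamp in the form `yyyy-MM-dd H:mm:ss.fff`. Both the field layout and the `LogRowParser` helper are assumptions made for illustration, not something taken from the original log format.

```csharp
using System;
using System.Globalization;

enum LogReason { Operation, Error, Warning }
enum EventKind : short { DataStoreOperation, DataReadOperation }
enum OperationStatus : short { Success, Failed }

struct LogRow
{
    public DateTime EventTime;
    public LogReason Reason;
    public EventKind Kind;
    public OperationStatus Status;
}

// Hypothetical helper: converts one text line into a compact LogRow.
static class LogRowParser
{
    // Assumed timestamp layout; adjust it to the real log format.
    private const string TimeFormat = "yyyy-MM-dd H:mm:ss.fff";

    public static LogRow Parse(string line)
    {
        // Assumed layout: four "Name: Value" fields separated by ", ".
        string[] fields = line.Split(new[] { ", " }, StringSplitOptions.None);
        if (fields.Length != 4)
            throw new FormatException("Expected 4 fields: " + line);

        var row = new LogRow();
        row.EventTime = DateTime.ParseExact(
            ValueOf(fields[0]), TimeFormat, CultureInfo.InvariantCulture);
        row.Reason = (LogReason)Enum.Parse(typeof(LogReason), ValueOf(fields[1]));
        row.Kind = (EventKind)Enum.Parse(typeof(EventKind), ValueOf(fields[2]));
        row.Status = (OperationStatus)Enum.Parse(typeof(OperationStatus), ValueOf(fields[3]));
        return row;
    }

    // Returns the text after the first ": " in a "Name: Value" field.
    private static string ValueOf(string field)
    {
        int separator = field.IndexOf(": ", StringComparison.Ordinal);
        if (separator < 0)
            throw new FormatException("Missing ': ' in field: " + field);
        return field.Substring(separator + 2);
    }
}
```

Storing the result in a `List<LogRow>` instead of a `List<string>` keeps the rows as plain values, so no per-line string object has to stay alive after parsing.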

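Another possible optimization is simply to parse each line into an array of string tokens, which lets you take advantage of string interning. For example, if the word "DataStoreOperation" takes 36 bytes (18 characters at 2 bytes each) and the file contains 1,000,000 entries, replacing each copy with a 4-byte reference to a single interned string saves (18 * 2 - 4) * 1000000 = 32,000,000 bytes.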
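
A minimal sketch of the interning idea, assuming tokens are separated by `", "` (the separator and the `TokenInterning` helper are illustrative assumptions). `string.Intern` returns one shared instance for equal strings, so repeated tokens end up pointing at the same object.

```csharp
using System;

// Hypothetical helper: splits a line and interns every token.
static class TokenInterning
{
    public static string[] SplitAndIntern(string line)
    {
        // Assumed token separator; adjust to the real log format.
        string[] tokens = line.Split(new[] { ", " }, StringSplitOptions.None);
        for (int i = 0; i < tokens.Length; i++)
        {
            // Replace each freshly allocated token with the shared pooled instance.
            tokens[i] = string.Intern(tokens[i]);
        }
        return tokens;
    }
}
```

One trade-off to keep in mind: interned strings stay in the intern pool for the life of the process, so this pays off only for tokens that really do repeat, not for unique values such as timestamps.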

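Answer

Memory usage keeps going up because you are simply adding every line to a List<string>, which keeps growing. If you want to use less memory, one thing you can do is write the data to disk instead of keeping it in scope. Of course, that will be much slower.

Another option is to compress the string data as you store it in the list and decompress it when you read it back, but I don't think that is a good approach.

Side note:

You need to wrap your StreamReader in a using block.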

using (StreamReader logFile = new StreamReader(myLog.logFileLocation))

Answer

Consider this implementation (I speak C/C++; substitute C# as necessary):

Use fseek/ftell to find the size of the file. 

Use malloc to allocate a chunk of memory the size of the file + 1; 
Set that last byte to '\0' to terminate the string. 

Use fread to read the entire file into the memory buffer. 
You now have char * which holds the contents of the file as a 
string. 

Create a vector of const char * to hold pointers to the positions 
in memory where each line can be found. Initialize the first element 
of the vector to the first byte of the memory buffer. 

Find the carriage control characters (probably \r\n) Replace the 
\r by \0 to make the line a string. Increment past the \n. 
This new pointer location is pushed back onto the vector. 

Repeat the above until all of the lines in the file have been NUL 
terminated, and are pointed to by elements in the vector. 

Iterate through the vector as needed to investigate the contents of 
each line, in your business specific way. 

When you are done, close the file, free the memory, and continue 
happily along your way.
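
A loose C# analogue of the steps above, offered only as a sketch under stated assumptions: the file is ASCII, lines end in `\n` or `\r\n`, and the whole file fits in a single `byte[]`. Instead of pointers, it keeps the integer offset where each line starts and decodes a line to a `string` only when it is requested. The `LineIndex` class is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical class: one byte buffer plus the start offset of every line.
sealed class LineIndex
{
    private readonly byte[] _data;
    private readonly List<int> _starts = new List<int>();

    public LineIndex(byte[] data)
    {
        _data = data;
        if (data.Length > 0)
            _starts.Add(0);

        // A new line starts right after each '\n', unless the buffer ends there.
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == (byte)'\n' && i + 1 < data.Length)
                _starts.Add(i + 1);
        }
    }

    public int Count
    {
        get { return _starts.Count; }
    }

    // Decodes one line on demand; the resulting string can be collected after use.
    public string GetLine(int index)
    {
        int start = _starts[index];
        int end = index + 1 < _starts.Count ? _starts[index + 1] : _data.Length;

        // Drop the trailing line terminator ("\n" or "\r\n").
        while (end > start && (_data[end - 1] == (byte)'\n' || _data[end - 1] == (byte)'\r'))
            end--;

        return Encoding.ASCII.GetString(_data, start, end - start);
    }
}
```

Usage would look like `var index = new LineIndex(File.ReadAllBytes(path));` followed by `index.GetLine(i)` for the lines you need. The buffer costs one byte per character plus four bytes per line for the offset, at the price of decoding each line every time it is read.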

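This just won't work in a C# environment. C# has nothing equivalent to C's char *. Most of what you describe can be done in C#, but in the end a byte* (the closest analog to char *) still has to be converted to a String object to be useful, which makes a copy anyway. – Dolphin 2010-01-04 20:33:56

Cool. I have used this technique many times in my environment, with good results. – EvilTeach 2010-01-04 20:37:54

Answer

1) Compress the strings before you store them (see System.IO.Compression and GZipStream). This may kill your program's performance, since you have to decompress in order to read each line.

2) Remove any extra whitespace characters or common words you can do without. That is, if you can still understand what the log says without "the, a, of ...", remove them. Also shorten any common words (for example, change "error" to "err" and "warning" to "wrn"). This will slow down this step of the process, but should not affect performance elsewhere.

Answer

I'm not sure whether it fits your project, but you could store the result in a StringBuilder instead of a list of strings.

For example, on my machine this process takes 250MB of memory after loading (the file is 50MB):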

static void Main(string[] args)
{
    using (StreamReader streamReader = File.OpenText("file.txt"))
    {
        var list = new List<string>();
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            list.Add(line);
        }
    }
}

On the other hand, the process running this code takes only 100MB:

static void Main(string[] args)
{
    var stringBuilder = new StringBuilder();
    using (StreamReader streamReader = File.OpenText("file.txt"))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            stringBuilder.AppendLine(line);
        }
    }
}

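Hey, that's a good one. Let me try this approach and I'll let you know :D Thanks – MRFerocius 2010-01-04 21:05:31

var text = File.ReadAllText("file.txt"); Opening the file with a StreamReader just to rebuild one string containing all the lines doesn't help much – StarPacker 2010-01-04 22:21:41

Answer

Try making your algorithm sequential.

If you don't need random access to the lines by index, using an IEnumerable instead of a list lets you stream the content through memory while keeping the same semantics as a list.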

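IEnumerable<string> ReadLines()
{
    // Open the log here; the reader is disposed when enumeration ends.
    using (StreamReader logFile = new StreamReader(myLog.logFileLocation))
    {
        string lineOfLog;
        while ((lineOfLog = logFile.ReadLine()) != null)
        {
            yield return lineOfLog; // only the current line is kept alive
        }
    }
}
//...
foreach (var line in ReadLines())
{
    ProcessLine(line);
}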
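
On .NET Framework 4 and later, `File.ReadLines` offers the same kind of lazy, line-by-line enumeration out of the box. The sketch below counts matching lines without keeping the file in memory; the `LazyLogScan` class and its `CountMatches` method are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: scans a log lazily, one line at a time.
static class LazyLogScan
{
    // Counts the lines that contain `needle`; lines are read and discarded as they go.
    public static int CountMatches(string path, string needle)
    {
        return File.ReadLines(path).Count(line => line.Contains(needle));
    }
}
```

This only helps when each line can be handled on its own. If later steps need to revisit earlier lines, the data still has to be kept somewhere, ideally in a more compact form than the raw strings.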

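That's a good approach too. I'll give it a try. – MRFerocius 2010-01-04 21:30:05

Answer

What encoding is your original file? If it is ASCII, then the strings alone will take more than twice the size of the file once loaded into your array. A C# char is 2 bytes, and a C# string adds an extra 20 bytes per string on top of the characters.

In your case, since it is a log file, you can probably take advantage of the fact that the messages contain a lot of repetition. You can most likely parse each incoming line into a data structure that reduces the memory overhead. For example, if the log file has a timestamp, you can convert it to a DateTime value, which is 8 bytes. Even a short timestamp such as 1/1/10 adds 12 bytes to the size of the string, and a timestamp with time information is longer. Other tokens in your log stream may be convertible into a code or an enum in a similar way.

Even if you have to keep a value as a string, you can probably cut memory usage if you can break it down into pieces that are used a lot, or remove boilerplate that isn't needed at all. If there are many common strings, you can Intern them and pay for only one string no matter how many copies you have.

Answer

If you must store the raw data, and assuming your logs are mostly ASCII, you can save some memory by storing UTF8 bytes internally. Strings are UTF16 internally, so you are storing an extra byte for every character. By switching to UTF8 you cut memory use in half (not counting the per-object overhead, which is still significant). You can then convert back to normal strings as needed.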

static void Main(string[] args)
{
    List<byte[]> strings = new List<byte[]>();

    using (TextReader tr = new StreamReader(@"C:\test.log"))
    {
        string s = tr.ReadLine();
        while (s != null)
        {
            // Encode straight to UTF8; no intermediate UTF16 byte array is needed.
            strings.Add(Encoding.UTF8.GetBytes(s));
            s = tr.ReadLine();
        }
    }

    // Get strings back
    foreach (var str in strings)
    {
        Console.WriteLine(Encoding.UTF8.GetString(str));
    }
}
