New Book from Google: Building Secure and Reliable Systems
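Google's security team recently released a new book, "Building Secure & Reliable Systems", published by O'Reilly. Readers can buy the print edition or download the e-book for free. Through this kind of knowledge sharing and foundational security work, the team contributes a good deal of experience to the security industry and actively pushes it forward.
Earlier, to give its very large user base more stable and reliable services, Google formed a dedicated team for this work, known as Site Reliability Engineers (SREs). They are practitioners of DevOps whose main responsibilities are building, deploying, monitoring, and maintaining software systems. This book was written by members of that team together with Google's security engineers.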
01
—
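SREs, Security Engineers, and Software Engineers
Unlike software engineers:
Site Reliability Engineers (SREs) and security engineers both tend to fix failures as well as build;
Beyond development, their work also includes operations;
They are often seen as roadblocks to the business rather than enablers;
They are frequently siloed and rarely integrated into product teams.
This time the authors embed security into SRE, which is the DevSecOps methodology that is popular today. If you are interested in DevSecOps, the book is worth a read.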
02
—
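Key Content
For the process of building secure and reliable systems, the book mainly covers:
Design strategies
Practical recommendations for coding, testing, and debugging
Recommendations for preventing, responding to, and recovering from incidents
Cultural best practices for cross-team collaboration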
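A few more words on the security content. Chapter 12 covers secure coding, including defenses against common web vulnerabilities, the use of security frameworks, and compiler sanitizers. It also introduces the YAGNI ("You Aren't Gonna Need It") software principle: implement only the features you need now, and do not implement features you think you might need later.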
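To make the idea of defending against a common web vulnerability concrete, here is a minimal Python sketch of one such defense: parameterized queries against SQL injection. It uses the standard-library `sqlite3` module as a stand-in. The table and the function names (`make_demo_db`, `find_user_unsafe`, `find_user_safe`) are hypothetical, and this is not the book's TrustedSqlString API, only the same underlying idea of never splicing untrusted input into query text.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO users (name, role) VALUES (?, ?)",
        [("alice", "admin"), ("bob", "viewer")],
    )
    return conn


def find_user_unsafe(conn, name):
    # Vulnerable: untrusted input is concatenated into the SQL text.
    query = "SELECT name, role FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Safer: the query text is a constant and the input is bound as a parameter.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "nobody' OR '1'='1"
    print(find_user_unsafe(conn, payload))  # the injected OR clause matches every row
    print(find_user_safe(conn, payload))    # the payload is treated as a literal name
```

The safe variant works because the database driver never parses the user-supplied value as SQL. A framework can go one step further and make the unsafe form impossible to write, which is the direction the chapter's framework discussion points in.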
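Chapter 13 covers fuzzing and unit testing. It surveys the mainstream fuzzing tools, including OSS-Fuzz, AFL, libFuzzer, and Honggfuzz, walks through an example using libFuzzer, and shares approaches to building fuzzing infrastructure and continuous fuzzing, with a focus on OSS-Fuzz and ClusterFuzz. It closes with static code analysis, highlighting clang-tidy, a static analysis framework built on Clang that supports C, C++, and Objective-C, although it appears geared more toward code quality. The overall theme is how to integrate all of this testing and analysis into a CI/CD pipeline so that it runs automatically and continuously, which is also the idea behind the currently popular DevSecOps approach.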
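For readers new to fuzzing, the basic loop can be sketched in a few lines of Python. This is a toy, not libFuzzer or AFL: real engines are coverage-guided, while this sketch only mutates one seed input at random and records the inputs that make the target raise. The target `parse_record`, its planted bug, and the helper names `mutate` and `fuzz` are all hypothetical and exist only for illustration.

```python
import random


def parse_record(data: bytes) -> bytes:
    """Hypothetical target: the first byte declares the payload length."""
    if not data:
        return b""
    length = data[0]
    # Planted bug: the declared length is never checked against len(data).
    last = data[length]  # raises IndexError when length >= len(data)
    return data[1:length] + bytes([last])


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte with a random value."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(target, seed_input: bytes, iterations: int = 1000, rng_seed: int = 0):
    """Run the target on mutated inputs and collect the ones that crash it.

    seed_input must be non-empty, because mutate() picks a byte to overwrite.
    """
    rng = random.Random(rng_seed)  # fixed seed keeps runs reproducible
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed_input, rng)
        try:
            target(candidate)
        except Exception as exc:
            crashes.append((candidate, repr(exc)))
    return crashes


if __name__ == "__main__":
    found = fuzz(parse_record, b"\x03abc")
    print(f"{len(found)} crashing inputs found")
    if found:
        print("example:", found[0])
```

In a continuous setup, a loop like this would run on every build, and crashing inputs would be saved as regression test cases. That is roughly the workflow that services such as OSS-Fuzz and ClusterFuzz automate at scale.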
03
—
Table of Contents
The book runs to 557 pages, which makes it a hefty volume.
Part I. Introductory Material
1. The Intersection of Security and Reliability
On Passwords and Power Drills
Reliability Versus Security: Design Considerations
Confidentiality, Integrity, Availability
Confidentiality
Integrity
Availability
Reliability and Security: Commonalities
Invisibility
Assessment
Simplicity
Evolution
Resilience
From Design to Production
Investigating Systems and Logging
Crisis Response
Recovery
Conclusion
2. Understanding Adversaries
Attacker Motivations
Attacker Profiles
Hobbyists
Vulnerability Researchers
Governments and Law Enforcement
Activists
Criminal Actors
Automation and Artificial Intelligence
Insiders
Attacker Methods
Threat Intelligence
Cyber Kill Chains™
Tactics, Techniques, and Procedures
Risk Assessment Considerations
Conclusion
Part II. Designing Systems
3. Case Study: Safe Proxies
Safe Proxies in Production Environments
Google Tool Proxy
Conclusion
4. Design Tradeoffs
Design Objectives and Requirements
Feature Requirements
Nonfunctional Requirements
Features Versus Emergent Properties
Example: Google Design Document
Balancing Requirements
Example: Payment Processing
Managing Tensions and Aligning Goals
Example: Microservices and the Google Web Application Framework
Aligning Emergent-Property Requirements
Initial Velocity Versus Sustained Velocity
Conclusion
5. Design for Least Privilege
Concepts and Terminology
Least Privilege
Zero Trust Networking
Zero Touch
Classifying Access Based on Risk
Best Practices
Small Functional APIs
Breakglass
Auditing
Testing and Least Privilege
Diagnosing Access Denials
Graceful Failure and Breakglass Mechanisms
Worked Example: Configuration Distribution
POSIX API via OpenSSH
Software Update API
Custom OpenSSH ForceCommand
Custom HTTP Receiver (Sidecar)
Custom HTTP Receiver (In-Process)
Tradeoffs
A Policy Framework for Authentication and Authorization Decisions
Using Advanced Authorization Controls
Investing in a Widely Used Authorization Framework
Avoiding Potential Pitfalls
Advanced Controls
Multi-Party Authorization (MPA)
Three-Factor Authorization (3FA)
Business Justifications
Temporary Access
Proxies
Tradeoffs and Tensions
Increased Security Complexity
Impact on Collaboration and Company Culture
Quality Data and Systems That Impact Security
Impact on User Productivity
Impact on Developer Complexity
Conclusion
6. Design for Understandability
Why Is Understandability Important?
System Invariants
Analyzing Invariants
Mental Models
Designing Understandable Systems
Complexity Versus Understandability
Breaking Down Complexity
Centralized Responsibility for Security and Reliability Requirements
System Architecture
Understandable Interface Specifications
Understandable Identities, Authentication, and Access Control
Security Boundaries
Software Design
Using Application Frameworks for Service-Wide Requirements
Understanding Complex Data Flows
Considering API Usability
Conclusion
7. Design for a Changing Landscape
Types of Security Changes
Designing Your Change
Architecture Decisions to Make Changes Easier
Keep Dependencies Up to Date and Rebuild Frequently
Release Frequently Using Automated Testing
Use Containers
Use Microservices
Different Changes: Different Speeds, Different Timelines
Short-Term Change: Zero-Day Vulnerability
Medium-Term Change: Improvement to Security Posture
Long-Term Change: External Demand
Complications: When Plans Change
Example: Growing Scope—Heartbleed
Conclusion
8. Design for Resilience
Design Principles for Resilience
Defense in Depth
The Trojan Horse
Google App Engine Analysis
Controlling Degradation
Differentiate Costs of Failures
Deploy Response Mechanisms
Automate Responsibly
Controlling the Blast Radius
Role Separation
Location Separation
Time Separation
Failure Domains and Redundancies
Failure Domains
Component Types
Controlling Redundancies
Continuous Validation
Validation Focus Areas
Validation in Practice
Practical Advice: Where to Begin
Conclusion
9. Design for Recovery
What Are We Recovering From?
Random Errors
Accidental Errors
Software Errors
Malicious Actions
Design Principles for Recovery
Design to Go as Quickly as Possible (Guarded by Policy)
Limit Your Dependencies on External Notions of Time
Rollbacks Represent a Tradeoff Between Security and Reliability
Use an Explicit Revocation Mechanism
Know Your Intended State, Down to the Bytes
Design for Testing and Continuous Validation
Emergency Access
Access Controls
Communications
Responder Habits
Unexpected Benefits
Conclusion
10. Mitigating Denial-of-Service Attacks
Strategies for Attack and Defense
Attacker’s Strategy
Defender’s Strategy
Designing for Defense
Defendable Architecture
Defendable Services
Mitigating Attacks
Monitoring and Alerting
Graceful Degradation
A DoS Mitigation System
Strategic Response
Dealing with Self-Inflicted Attacks
User Behavior
Client Retry Behavior
Conclusion
Part III. Implementing Systems
11. Case Study: Designing, Implementing, and Maintaining a Publicly Trusted CA
Background on Publicly Trusted Certificate Authorities
Why Did We Need a Publicly Trusted CA?
The Build or Buy Decision
Design, Implementation, and Maintenance Considerations
Programming Language Choice
Complexity Versus Understandability
Securing Third-Party and Open Source Components
Testing
Resiliency for the CA Key Material
Data Validation
Conclusion
12. Writing Code
Frameworks to Enforce Security and Reliability
Benefits of Using Frameworks
Example: Framework for RPC Backends
Common Security Vulnerabilities
SQL Injection Vulnerabilities: TrustedSqlString
Preventing XSS: SafeHtml
Lessons for Evaluating and Building Frameworks
Simple, Safe, Reliable Libraries for Common Tasks
Rollout Strategy
Simplicity Leads to Secure and Reliable Code
Avoid Multilevel Nesting
Eliminate YAGNI Smells
Repay Technical Debt
Refactoring
Security and Reliability by Default
Choose the Right Tools
Use Strong Types
Sanitize Your Code
Conclusion
13. Testing Code
Unit Testing
Writing Effective Unit Tests
When to Write Unit Tests
How Unit Testing Affects Code
Integration Testing
Writing Effective Integration Tests
Dynamic Program Analysis
Fuzz Testing
How Fuzz Engines Work
Writing Effective Fuzz Drivers
An Example Fuzzer
Continuous Fuzzing
Static Program Analysis
Automated Code Inspection Tools
Integration of Static Analysis in the Developer Workflow
Abstract Interpretation
Formal Methods
Conclusion
14. Deploying Code
Concepts and Terminology
Threat Model
Best Practices
Require Code Reviews
Rely on Automation
Verify Artifacts, Not Just People
Treat Configuration as Code
Securing Against the Threat Model
Advanced Mitigation Strategies
Binary Provenance
Provenance-Based Deployment Policies
Verifiable Builds
Deployment Choke Points
Post-Deployment Verification
Practical Advice
Take It One Step at a Time
Provide Actionable Error Messages
Ensure Unambiguous Provenance
Create Unambiguous Policies
Include a Deployment Breakglass
Securing Against the Threat Model, Revisited
Conclusion
15. Investigating Systems
From Debugging to Investigation
Example: Temporary Files
Debugging Techniques
What to Do When You’re Stuck
Collaborative Debugging: A Way to Teach
How Security Investigations and Debugging Differ
Collect Appropriate and Useful Logs
Design Your Logging to Be Immutable
Take Privacy into Consideration
Determine Which Security Logs to Retain
Budget for Logging
Robust, Secure Debugging Access
Reliability
Security
Conclusion
Part IV. Maintaining Systems
16. Disaster Planning
Defining “Disaster”
Dynamic Disaster Response Strategies
Disaster Risk Analysis
Setting Up an Incident Response Team
Identify Team Members and Roles
Establish a Team Charter
Establish Severity and Priority Models
Define Operating Parameters for Engaging the IR Team
Develop Response Plans
Create Detailed Playbooks
Ensure Access and Update Mechanisms Are in Place
Prestaging Systems and People Before an Incident
Configuring Systems
Training
Processes and Procedures
Testing Systems and Response Plans
Auditing Automated Systems
Conducting Nonintrusive Tabletops
Testing Response in Production Environments
Red Team Testing
Evaluating Responses
Google Examples
Test with Global Impact
DiRT Exercise Testing Emergency Access
Industry-Wide Vulnerabilities
Conclusion
17. Crisis Management
Is It a Crisis or Not?
Triaging the Incident
Compromises Versus Bugs
Taking Command of Your Incident
The First Step: Don’t Panic!
Beginning Your Response
Establishing Your Incident Team
Operational Security
Trading Good OpSec for the Greater Good
The Investigative Process
Keeping Control of the Incident
Parallelizing the Incident
Handovers
Morale
Communications
Misunderstandings
Hedging
Meetings
Keeping the Right People Informed with the Right Levels of Detail
Putting It All Together
Triage
Declaring an Incident
Communications and Operational Security
Beginning the Incident
Handover
Handing Back the Incident
Preparing Communications and Remediation
Closure
Conclusion
18. Recovery and Aftermath
Recovery Logistics
Recovery Timeline
Planning the Recovery
Scoping the Recovery
Recovery Considerations
Recovery Checklists
Initiating the Recovery
Isolating Assets (Quarantine)
System Rebuilds and Software Upgrades
Data Sanitization
Recovery Data
Credential and Secret Rotation
After the Recovery
Postmortems
Examples
Compromised Cloud Instances
Large-Scale Phishing Attack
Targeted Attack Requiring Complex Recovery
Conclusion
Part V. Organization and Culture
19. Case Study: Chrome Security Team
Background and Team Evolution
Security Is a Team Responsibility
Help Users Safely Navigate the Web
Speed Matters
Design for Defense in Depth
Be Transparent and Engage the Community
Conclusion
20. Understanding Roles and Responsibilities
Who Is Responsible for Security and Reliability?
The Roles of Specialists
Understanding Security Expertise
Certifications and Academia
Integrating Security into the Organization
Embedding Security Specialists and Security Teams
Example: Embedding Security at Google
Special Teams: Blue and Red Teams
External Researchers
Conclusion
21. Building a Culture of Security and Reliability
Defining a Healthy Security and Reliability Culture
Culture of Security and Reliability by Default
Culture of Review
Culture of Awareness
Culture of Yes
Culture of Inevitably
Culture of Sustainability
Changing Culture Through Good Practice
Align Project Goals and Participant Incentives
Reduce Fear with Risk-Reduction Mechanisms
Make Safety Nets the Norm
Increase Productivity and Usability
Overcommunicate and Be Transparent
Build Empathy
Convincing Leadership
Understand the Decision-Making Process
Build a Case for Change
Pick Your Battles
Escalations and Problem Resolution
Conclusion
Appendix. A Disaster Risk Assessment Matrix