The Department of Defense (DoD) has significantly increased its investment in the design, evaluation, and deployment of Artificial Intelligence and Machine Learning (AI/ML) capabilities to address national security needs. While there are numerous AI/ML successes in the academic and commercial sectors, many of these systems have also been shown to be brittle and nonrobust. In a complex and ever-changing national security environment, it is vital that the DoD establish a sound and methodical process to evaluate the performance and robustness of AI/ML models before these new capabilities are deployed to the field. Without an effective evaluation process, the DoD may deploy AI/ML models that are assumed to be effective given limited evaluation metrics but actually have poor performance and robustness on operational data. Poor evaluation practices lead to loss of trust in AI/ML systems by model operators and more frequent--often costly--design updates needed to address the evolving security environment. In contrast, an effective evaluation process can drive the design of more resilient capabilities, ag potential limitations of models before they are deployed, and build operator trust in AI/ML systems. This paper reviews the AI/ML development process, highlights common best practices for AI/ML model evaluation, and makes the following recommendations to DoD evaluators to ensure the deployment of robust AI/ML capabilities for national security needs: -Develop testing datasets with sufficient variation and number of samples to effectively measure the expected performance of the AI/ML model on future (unseen) data once deployed, -Maintain separation between data used for design and evaluation (i.e., the test data is not used to design the AI/ML model or train its parameters) in order to ensure an honest and unbiased assessment of the model's capability, -Evaluate performance given small perturbations and corruptions to data inputs to assess the smoothness of the AI/ML model and identify potential vulnerabilities, and -Evaluate performance on samples from data distributions that are shifted from the assumed distribution that was used to design the AI/ML model to assess how the model may perform on operational data that may differ from the training data. By following the recommendations for evaluation presented in this paper, the DoD can fully take advantage of the AI/ML revolution, delivering robust capabilities that maintain operational feasibility over longer periods of time, and increase warfighter confidence in AI/ML systems.